Recently, I have been trying to port the GC3.1 suite (u-di622) from ARCHER2 to MONSOON3. I spent quite a long time setting up the sources and made some modifications to the source code to fix inconsistencies between the GCOM version and MPI.
The copied suite (u-dw345) on MONSOON3 now gets through all the FCM_make tasks successfully. However, it still falls into a deadlock when it reaches the coupled task. The model simply gets stuck with the following output, but without any fatal error messages:
[LINK_DRIVERS] attempting to run with command: aprun -n 1 -d 1 ./atmos.exe : -n 16 -d 1 ./ocean.exe : -n 6 -d 1 ./xios.exe
[DRIVER_TEST_SCRIPT] Drivers successfully linked
I also tried running the suite with only a single process for atmos.exe. In that case, I get the following error:
Stash Sect No 0 Item No 3
Start Address in SI 2407541
Start Address in LOOKUP Table 2350081
You probably need to RECONFIGURE the start dump
Failure in call to INITDUMP
???????????????????????????????????????????????????????????????????
???!!!???!!!???!!!???!!!???!!! ERROR ???!!!???!!!???!!!???!!!???!!!
? Error code: 1
? Error from routine: INITIAL_4A
? Error message: ADDR_CHK : Mis-match in start addresses
? Error from processor: 0
? Error number: 11
???????????????????????????????????????????????????????????????????
At the moment, I’m not sure where the issue is coming from. I have tried many different configurations and combinations, but the job still fails in a similar way.
I would really appreciate any advice or suggestions on this.
It would help if you could commit the changes made to u-dw345 so that others can see them.
However, from your runtime settings it looks like Reconfiguration has been turned on but the astart (i.e. Recon output) file name has not been changed. This effectively causes the Recon to read from and write to the same file, which is probably why the file has been corrupted.
The ainitial file will have to be re-extracted or copied over fresh from ARCHER2, and the astart filename (panel um => namelist => Model Input and Output => Dumping and Meaning) set to $DATAM/$RUNID.astart
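As a rough illustration of the check described above, the following Python sketch expands the two path settings and reports whether they resolve to the same file. It is only a sketch: the function name, the example ainitial path, and the values of DATAM and RUNID are hypothetical, and it does not read the suite's actual configuration.

```python
import os
from string import Template


def same_file_after_expansion(ainitial: str, astart: str, env: dict) -> bool:
    """Return True if both settings expand to the same normalised path."""
    def expand(path: str) -> str:
        # Substitute $VAR references, then normalise the resulting path.
        return os.path.normpath(Template(path).safe_substitute(env))

    return expand(ainitial) == expand(astart)


# Hypothetical values, for illustration only.
env = {"DATAM": "/data/run", "RUNID": "dw345"}

# Recon input and output pointing at the same file: the problematic case.
clash = same_file_after_expansion("$DATAM/$RUNID.astart", "$DATAM/$RUNID.astart", env)

# A separate ainitial path (hypothetical) with astart under $DATAM.
ok = same_file_after_expansion("/data/ancil/dw345.ainitial", "$DATAM/$RUNID.astart", env)

print(clash, ok)
```

If the first case applies, the Recon input and output are the same file and the start dump should be replaced with a fresh copy before rerunning.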
I have committed the suite. It is still under debugging, so some configurations (such as the domain decomposition and the corresponding PBS resource requests) are tentative at the moment.
I am now trying your suggestion regarding astart. However, the collab queues are currently quite busy, so it may take some time before I can see the results. Once I have them, I will get back to you.
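While the decomposition and PBS resource requests are still tentative, one way to cross-check them is to tally the ranks requested per executable in the launch line printed by the driver. The sketch below is a minimal example that assumes each segment of the line is separated by " : ", contains a "-n" option, and ends with the executable name; the function name is hypothetical.

```python
import shlex


def ranks_per_executable(launch_line: str) -> dict:
    """Split an aprun-style MPMD line on ' : ' and read each segment's -n value."""
    counts = {}
    for segment in launch_line.split(" : "):
        tokens = shlex.split(segment)
        # Assumption: the value after -n is the rank count for this segment,
        # and the last token of the segment is the executable.
        n = int(tokens[tokens.index("-n") + 1])
        counts[tokens[-1]] = n
    return counts


line = "aprun -n 1 -d 1 ./atmos.exe : -n 16 -d 1 ./ocean.exe : -n 6 -d 1 ./xios.exe"
counts = ranks_per_executable(line)
total = sum(counts.values())
print(counts, total)
```

The total can then be compared by hand with the number of cores requested in the PBS directives for the coupled task.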
I have reset the astart file to $DATAM/$RUNID.astart and rerun the suite, but unfortunately the output and error messages remain unchanged.
One possible cause I am considering is that I directly copied some precompiled libraries (e.g. OASIS and XIOS) from ARCHER2, which may have introduced inconsistencies in the environment. As a next step, I plan to rebuild and reconfigure all dependent modules directly on MONSOON3 to ensure consistency.