Hi CMS,
A coupled run on ARCHER2, u-dv795, keeps failing well into a cycle, with job.err showing:
MPICH ERROR [Rank 608] [job id 12054896.0] [Wed Dec 31 21:36:36 2025] [nid001363] - Abort(1) (rank 608 in comm 288): application called MPI_Abort(comm=0xC4000002, 1) - process 608
srun: error: nid001363: tasks 1,33,76,93,96: Aborted
srun: launch/slurm: _step_signal: Terminating StepId=12054896.0+0
srun: launch/slurm: _step_signal: Terminating StepId=12054896.0+2
srun: launch/slurm: _step_signal: Terminating StepId=12054896.0+1
slurmstepd: error: *** STEP 12054896.0+2 ON nid001380 CANCELLED AT 2025-12-31T21:36:36 ***
slurmstepd: error: *** STEP 12054896.0+0 ON nid001260 CANCELLED AT 2025-12-31T21:36:36 ***
slurmstepd: error: *** STEP 12054896.0+1 ON nid001363 CANCELLED AT 2025-12-31T21:36:36 ***
Retriggering the task just yields the same error.
The debug.root.02 shows:
(oasis_init_comp) OPEN debug file for root pe, unit : 1025
(oasis_mem_init) Initset conversion flag is T
(oasis_mem_init) 8 MB memory alloc in MB is 8.00
(oasis_mem_init) 8 MB memory dealloc in MB is 0.00
(oasis_mem_init) Memory block size conversion in bytes is 3775.25
(oasis_mem_print) memory use (MB) = 818.1926 74.3546 (oasis_init_comp)
(oasis_mem_print) memory use (MB) = 989.2313 171.0567 (oasis_enddef):start
(oasis_mem_print) memory use (MB) = 990.0450 178.4914 (oasis_enddef):part_setup
(oasis_mem_print) memory use (MB) = 990.0450 178.7291 (oasis_enddef):var_setup
(oasis_mem_print) memory use (MB) = 990.0450 178.7291 (oasis_enddef):write2files
(oasis_coupler_setup) smatread_method = ceg
(oasis_mem_print) memory use (MB) = 1019.8668 207.7300 (oasis_enddef):coupler_setup
(oasis_mem_print) memory use (MB) = 1019.8668 207.9352 (oasis_enddef):advance_init
(oasis_mem_print) memory use (MB) = 1019.8668 207.9352 (oasis_enddef):end
(oasis_abort) ABORT: compid = 0
(oasis_abort) ABORT: called by = mppstop
(oasis_abort) ABORT: message = NEMO initiated abort
(oasis_abort) ABORT: on model = toyoce
(oasis_abort) ABORT: on global rank = 576
(oasis_abort) ABORT: on local rank = 0
(oasis_abort) ABORT: CALLING ABORT FROM OASIS LAYER NOW
This failure appears similar to the "MPICH ERROR - HadGEM3 GC5e" thread, but that issue does not appear to have been solved.
Would you have any advice on this?
Cheers,
James