MPICH/OASIS error in coupled run

James_Weber · 5 January 2026 11:25

Hi CMS,

A coupled run on ARCHER2, u-dv795, keeps failing well into a cycle, with job.err showing:

MPICH ERROR [Rank 608] [job id 12054896.0] [Wed Dec 31 21:36:36 2025] [nid001363] - Abort(1) (rank 608 in comm 288): application called MPI_Abort(comm=0xC4000002, 1) - process 608

srun: error: nid001363: tasks 1,33,76,93,96: Aborted

srun: launch/slurm: _step_signal: Terminating StepId=12054896.0+0

srun: launch/slurm: _step_signal: Terminating StepId=12054896.0+2

srun: launch/slurm: _step_signal: Terminating StepId=12054896.0+1

slurmstepd: error: *** STEP 12054896.0+2 ON nid001380 CANCELLED AT 2025-12-31T21:36:36 ***

slurmstepd: error: *** STEP 12054896.0+0 ON nid001260 CANCELLED AT 2025-12-31T21:36:36 ***

slurmstepd: error: *** STEP 12054896.0+1 ON nid001363 CANCELLED AT 2025-12-31T21:36:36 ***

Retriggering just yields the same error.

The debug.root.02 shows:

(oasis_init_comp) OPEN debug file for root pe, unit : 1025

(oasis_mem_init) Initset conversion flag is T

(oasis_mem_init) 8 MB memory alloc in MB is 8.00

(oasis_mem_init) 8 MB memory dealloc in MB is 0.00

(oasis_mem_init) Memory block size conversion in bytes is 3775.25

(oasis_mem_print) memory use (MB) = 818.1926 74.3546 (oasis_init_comp)

(oasis_mem_print) memory use (MB) = 989.2313 171.0567 (oasis_enddef):start

(oasis_mem_print) memory use (MB) = 990.0450 178.4914 (oasis_enddef):part_setup

(oasis_mem_print) memory use (MB) = 990.0450 178.7291 (oasis_enddef):var_setup

(oasis_mem_print) memory use (MB) = 990.0450 178.7291 (oasis_enddef):write2files

(oasis_coupler_setup) smatread_method = ceg

(oasis_mem_print) memory use (MB) = 1019.8668 207.7300 (oasis_enddef):coupler_setup

(oasis_mem_print) memory use (MB) = 1019.8668 207.9352 (oasis_enddef):advance_init

(oasis_mem_print) memory use (MB) = 1019.8668 207.9352 (oasis_enddef):end

(oasis_abort) ABORT: compid = 0

(oasis_abort) ABORT: called by = mppstop

(oasis_abort) ABORT: message = NEMO initiated abort

(oasis_abort) ABORT: on model = toyoce

(oasis_abort) ABORT: on global rank = 576

(oasis_abort) ABORT: on local rank = 0

(oasis_abort) ABORT: CALLING ABORT FROM OASIS LAYER NOW

which appears similar to the topic "MPICH ERROR - HadGEM3 GC5e", but that issue does not appear to have been solved.

Would you have any advice on this?

Cheers,

James
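When triaging an abort like the one in debug.root.02, it can help to pull the (oasis_abort) fields into one record so that the calling routine and the component are visible at a glance. The following is only a minimal sketch: the line layout is assumed from the excerpt quoted in this post, and the function name parse_oasis_abort is hypothetical, not part of OASIS or any suite tooling.

```python
import re

# Sample lines copied from the debug.root.02 excerpt quoted in this thread.
SAMPLE = """\
(oasis_abort) ABORT: compid = 0
(oasis_abort) ABORT: called by = mppstop
(oasis_abort) ABORT: message = NEMO initiated abort
(oasis_abort) ABORT: on model = toyoce
(oasis_abort) ABORT: on global rank = 576
(oasis_abort) ABORT: on local rank = 0
(oasis_abort) ABORT: CALLING ABORT FROM OASIS LAYER NOW
"""

# "key = value" layout is an assumption based on the excerpt above.
ABORT_RE = re.compile(
    r"^\s*\(oasis_abort\) ABORT:\s*(?P<key>[^=]+?)\s*=\s*(?P<value>.*?)\s*$"
)


def parse_oasis_abort(text):
    """Collect the 'key = value' fields from (oasis_abort) lines into a dict."""
    fields = {}
    for line in text.splitlines():
        match = ABORT_RE.match(line)
        if match:  # lines without '=' (e.g. the final banner) are skipped
            fields[match["key"]] = match["value"]
    return fields


if __name__ == "__main__":
    info = parse_oasis_abort(SAMPLE)
    print(f"abort called by {info['called by']} on model {info['on model']}: "
          f"{info['message']}")
```

In practice the text would be read from the debug.root file in the run directory rather than from the embedded sample.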

mdalvi · 5 January 2026 12:18

Hi James,
In this case, judging by “(oasis_abort) ABORT: message = NEMO initiated abort”, it is the ocean model that has stopped, and ocean.output says:

===>>> : E R R O R ==========

stpctl: the zonal velocity is larger than 20 m/s

kt= 11043 max abs(U):  4.4092E+05, i j k:   118  289   10

This is one example of NEMO blowing up, so you might have to follow the atmos dump perturbing workaround.

Mohit
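The ocean.output check described above can be sketched as a small script that extracts the timestep, the maximum velocity and the grid indices from the stpctl diagnostic. This is an illustrative sketch only: the regular expression is assumed from the single excerpt quoted in this thread, and find_velocity_blowup is a hypothetical helper name, not a NEMO or suite utility.

```python
import re

# Sample text copied from the ocean.output excerpt quoted in this thread.
SAMPLE = """\
 ===>>> : E R R O R ==========

 stpctl: the zonal velocity is larger than 20 m/s

 kt= 11043 max abs(U):  4.4092E+05, i j k:   118  289   10
"""

# Layout of the diagnostic line is an assumption based on the excerpt above.
STPCTL_RE = re.compile(
    r"kt=\s*(?P<kt>\d+)\s+max abs\(U\):\s*(?P<umax>[-+0-9.Ee]+),\s*"
    r"i j k:\s*(?P<i>\d+)\s+(?P<j>\d+)\s+(?P<k>\d+)"
)


def find_velocity_blowup(text):
    """Return a dict describing the first stpctl velocity line, or None."""
    match = STPCTL_RE.search(text)
    if match is None:
        return None
    return {
        "kt": int(match["kt"]),          # timestep at which the check failed
        "umax": float(match["umax"]),    # max abs(U) in m/s
        "i": int(match["i"]),
        "j": int(match["j"]),
        "k": int(match["k"]),
    }


if __name__ == "__main__":
    print(find_velocity_blowup(SAMPLE))
```

The timestep and grid point identify where the velocity diverged, which is useful to note before restarting from a perturbed dump.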

James_Weber · 5 January 2026 14:08

Thanks, Mohit, I will give this a go.

James

James_Weber · 6 January 2026 18:54

Thanks, Mohit, this has worked.

James
