Failing at coupled after 1 month

I have a job, u-da314, that is failing at the coupled task after 1 month of running OK.
It is very similar to a suite I have used successfully multiple times (u-da510); the only difference is that u-da314 starts in October. Could you help find the issue?

Many thanks,
Jeremy

Snippet from
/work/n02/n02/jgrist02/cylc-run/u-da314/log/job/19501001T0000Z/coupled/04/job.err:

[126] exceptions: An non-exception application exit occured.
[126] exceptions: whilst in a serial region
[126] exceptions: Task had pid=58239 on host nid005993
[126] exceptions: Program is "/mnt/lustre/a2fs-work2/work/n02/n02/jgrist02/cylc-run/u-da314/work/19501001T0000Z/coupled/./atmos.exe"
Warning in umPrintMgr: umPrintExceptionHandler : Handler Invoked
MPICH ERROR [Rank 126] [job id 5366844.0] [Mon Jan 29 11:11:19 2024] [nid005993] - Abort(9) (rank 126 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 9) - process 126

???
???!!!???!!!???!!!???!!!???!!! ERROR ???!!!???!!!???!!!???!!!???!!!
? Error code: 24
? Error from routine: WRITHEAD
? Error message: WRITHEAD: Addressing conflict
? Error from processor: 99
? Error number: 69
???

[99] exceptions: An non-exception application exit occured.
[99] exceptions: whilst in a serial region
[99] exceptions: Task had pid=58212 on host nid005993
[99] exceptions: Program is "/mnt/lustre/a2fs-work2/work/n02/n02/jgrist02/cylc-run/u-da314/work/19501001T0000Z/coupled/./atmos.exe"
Warning in umPrintMgr: umPrintExceptionHandler : Handler Invoked
MPICH ERROR [Rank 99] [job id 5366844.0] [Mon Jan 29 11:11:19 2024] [nid005993] - Abort(9) (rank 99 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 9) - process 99

srun: error: nid005993: task 82: Aborted
srun: launch/slurm: _step_signal: Terminating StepId=5366844.0+0
srun: launch/slurm: _step_signal: Terminating StepId=5366844.0+2
srun: launch/slurm: _step_signal: Terminating StepId=5366844.0+1
slurmstepd: error: *** STEP 5366844.0+2 ON nid006097 CANCELLED AT 2024-01-29T11:11:19 ***
slurmstepd: error: *** STEP 5366844.0+1 ON nid006079 CANCELLED AT 2024-01-29T11:11:19 ***
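
In a long job.err the UM error banner is easy to lose among the MPI and srun messages. As a rough sketch (not part of the UM or Cylc tooling), the fields of a banner like the one above could be pulled out with a few lines of Python. The function name parse_um_errors is made up for illustration, and the pattern assumes the banner layout shown in this snippet.

```python
import re

# Matches field lines such as "? Error code: 24" (layout assumed from the log above).
FIELD_RE = re.compile(r"^\?\s*(Error [^:]+):\s*(.*)$")


def parse_um_errors(text):
    """Return one dict per UM ERROR banner found in the given log text."""
    reports = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        # The banner line starts with '?' and contains the word ERROR in capitals.
        if line.startswith("?") and "ERROR" in line:
            current = {}
            reports.append(current)
            continue
        match = FIELD_RE.match(line)
        if match and current is not None:
            current[match.group(1)] = match.group(2).strip()
    return reports


if __name__ == "__main__":
    sample = """???
???!!!???!!!???!!!???!!!???!!! ERROR ???!!!???!!!???!!!???!!!???!!!
? Error code: 24
? Error from routine: WRITHEAD
? Error message: WRITHEAD: Addressing conflict
? Error from processor: 99
? Error number: 69
???
"""
    for report in parse_um_errors(sample):
        print(report)
```

To use it on a real log, read the job.err file into a string and pass that to parse_um_errors.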

Hi Jeremy

This error happens if the suite is not started with a reconfigured UM 10.6 start file.

Rebuild the model after removing:
branches/dev/simonwilson/vn11.6_stochastic_header
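
To check where that branch is referenced before removing it, a search of the suite's configuration files may help. The sketch below is only an illustration: it assumes the suite working copy lives at ~/roses/u-da314 and that the branch is listed in one of the Rose .conf files, both of which may differ on your setup. find_references is a hypothetical helper name.

```python
from pathlib import Path


def find_references(suite_dir, needle):
    """Return (path, line number, line) for every .conf line containing needle."""
    root = Path(suite_dir)
    if not root.is_dir():
        return []
    hits = []
    for path in sorted(root.rglob("*.conf")):
        try:
            text = path.read_text(errors="replace")
        except OSError:
            continue  # skip unreadable files
        for lineno, line in enumerate(text.splitlines(), start=1):
            if needle in line:
                hits.append((str(path), lineno, line.strip()))
    return hits


if __name__ == "__main__":
    # Assumed location of the suite working copy; adjust as needed.
    suite_dir = Path.home() / "roses" / "u-da314"
    for path, lineno, line in find_references(suite_dir, "vn11.6_stochastic_header"):
        print(f"{path}:{lineno}: {line}")
```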

Grenville

Hi Grenville,

Thank you for this.

Jeremy