Hi,
I have a suite u-dm204, running UKESM1-0-LL, which worked fine through a 20-year run. I then tried to extend and restart but the suite has now failed on coupled.
Specifically, I updated the run length to 35 years and then ran ‘rose suite-run --restart’, but I find that the coupled job submits okay and then fails immediately on starting. The suite is running on ARCHER2; my username is aduffeyum.
I can’t see anything obvious in the log files. job.err shows a sequence of errors as follows:
MPICH ERROR [Rank 599] [job id 8540910.0] [Tue Jan 21 11:51:23 2025] [nid003425] - Abort(1) (rank 599 in comm 288): application called MPI_Abort(comm=0xC4000002, 1) - process 599
Any suggestions on where to start with debugging would be much appreciated!
Thanks,
Alistair
Alistair
Look in /home/n02/n02/aduffeyum/cylc-run/u-dm204/work/20550101T0000Z/coupled/ocean.output
===>>> : E R R O R
===========
stpctl: the zonal velocity is larger than 20 m/s
It might be worth trying to run with a shorter ocean time step to get past this (then revert the time step if successful.)
Grenville
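For anyone hitting a similar failure, a minimal sketch of how the error block could be pulled out of a long ocean.output file. The function name and the number of context lines are illustrative choices, not part of the suite or of NEMO; the only thing taken from the thread is the 'E R R O R' banner shown above.

```python
from pathlib import Path


def find_error_blocks(text, context=3):
    """Return each line containing the 'E R R O R' banner, plus the
    `context` lines that follow it, as a list of line lists."""
    lines = text.splitlines()
    blocks = []
    for i, line in enumerate(lines):
        if "E R R O R" in line:
            blocks.append(lines[i:i + 1 + context])
    return blocks


def report(path, context=3):
    """Print every error block found in the file at `path`."""
    text = Path(path).read_text(errors="replace")
    for block in find_error_blocks(text, context):
        print("\n".join(block))
        print("-" * 40)
```

Calling report() with the path to ocean.output in the task's work directory would print each banner together with the message lines beneath it.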
Hi Grenville,
Thanks very much for getting back to me on this.
I tried halving the ocean timestep (from 32 to 64 steps per day), but the suite failed again with the same error in the ocean.output file. Were you thinking of a bigger change to the timestep than this?
Alternatively, is there a way for me to take the suite back a cycle to see if that helps?
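For reference, the change described above can be expressed in seconds. This is plain arithmetic on an 86400-second day; the function name is illustrative, and how the suite actually stores the setting is not shown here.

```python
SECONDS_PER_DAY = 86400


def timestep_seconds(steps_per_day):
    """Length of one ocean time step in seconds for a given number of steps per day."""
    return SECONDS_PER_DAY / steps_per_day


for n in (32, 64):
    # 32 steps/day gives 2700 s; 64 steps/day gives 1350 s
    print(f"{n} steps/day -> {timestep_seconds(n):.0f} s")
```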
Hi Alistair
Changing the ocean time step was a very long shot. It is odd that the model fails so quickly - that might indicate something wrong with the input data. I’m looking again, but you might check that all the required files are present for the restart. An option might be to start a new run (with bit compare options set) with what you think are the correct start files.
Grenville
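A rough sketch of the "check that all the required files are present" step, assuming you already have a list of the restart files the coupled task expects. The thread does not say which files those are, so the list has to be supplied by the user, and the function name is hypothetical.

```python
from pathlib import Path


def check_files(paths):
    """Return (path, problem) pairs for files that are missing or zero-length.

    `paths` is a user-supplied list of expected restart files; this
    function does not know what the suite actually requires.
    """
    problems = []
    for p in map(Path, paths):
        if not p.is_file():
            problems.append((str(p), "missing"))
        elif p.stat().st_size == 0:
            problems.append((str(p), "empty"))
    return problems
```

An empty result only means the listed files exist and are non-empty; it says nothing about whether their contents are valid for the restart date.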