Suite fails on restart after extending length

AlistairDuffey · 21 January 2025 12:16

Hi,

I have a suite u-dm204, running UKESM1-0-LL, which worked fine through a 20-year run. I then tried to extend and restart but the suite has now failed on coupled.

Specifically, I updated the length to 35 years and then ran ‘rose suite-run --restart’, but I find that the coupled job submits okay and then fails immediately on starting. The suite is running on archer2, my username is aduffeyum.

I can’t see anything obvious in the log files. job.err shows a sequence of errors as follows:

MPICH ERROR [Rank 599] [job id 8540910.0] [Tue Jan 21 11:51:23 2025] [nid003425] - Abort(1) (rank 599 in comm 288): application called MPI_Abort(comm=0xC4000002, 1) - process 599

Any suggestions on where to start with debugging would be much appreciated!

Thanks,

Alistair

grenville · 21 January 2025 16:41

Alistair

Look in /home/n02/n02/aduffeyum/cylc-run/u-dm204/work/20550101T0000Z/coupled/ocean.output

===>>> : E R R O R
         ===========

  stpctl: the zonal velocity is larger than 20 m/s

It might be worth trying to run with a shorter ocean time step to get past this (then revert the time step if successful.)
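
For anyone hitting a similar failure, a rough sketch of pulling the error blocks out of ocean.output is below. It assumes the errors are flagged with the same "E R R O R" banner shown above; the function name and the number of context lines are illustrative choices, not part of NEMO or the suite.

```python
def find_error_blocks(text, context=4):
    """Return the non-blank lines starting at each 'E R R O R' banner.

    `context` (an illustrative default) is how many lines, banner
    included, are collected for each match.
    """
    lines = text.splitlines()
    blocks = []
    for i, line in enumerate(lines):
        if "E R R O R" in line:
            window = lines[i:i + context]
            blocks.append([entry.strip() for entry in window if entry.strip()])
    return blocks


# Example using the excerpt quoted in this thread.
sample = (
    "===>>> : E R R O R\n"
    "         ===========\n"
    "\n"
    "  stpctl: the zonal velocity is larger than 20 m/s\n"
)
for block in find_error_blocks(sample):
    print("\n".join(block))
```

To use it on a real run, read the file first, for example with `pathlib.Path(...).read_text()`, and pass the resulting string in.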

Grenville

AlistairDuffey · 22 January 2025 11:51

Hi Grenville,

Thanks very much for getting back to me on this.

I tried halving the ocean timestep (from 32 to 64 steps per day), but the suite failed again with the same error in the ocean.output file. Were you thinking a bigger change to the timestep than this?
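
For reference, the arithmetic behind that change can be sketched as follows: doubling the number of steps per day halves the step length in seconds. The function name is illustrative only, and how the suite actually exposes the ocean time step is not shown here.

```python
SECONDS_PER_DAY = 86400


def timestep_seconds(steps_per_day):
    """Convert a number of ocean steps per day into a step length in seconds."""
    if steps_per_day <= 0 or SECONDS_PER_DAY % steps_per_day != 0:
        raise ValueError("steps_per_day must divide 86400 exactly")
    return SECONDS_PER_DAY // steps_per_day


print(timestep_seconds(32))  # 2700 s, the original setting
print(timestep_seconds(64))  # 1350 s, the halved time step
```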

Alternatively, is there a way for me to take the suite back a cycle to see if that helps?

grenville · 23 January 2025 12:58

Hi Alistair

Changing the ocean time step was a very long shot. It is odd that the model fails so quickly - that might indicate something wrong with the input data. I’m looking again, but you might check that all the required files are present for the restart. An option might be to start a new run (with bit compare options set) with what you think are the correct start files.
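
One possible way to do that file check is sketched below. It assumes you already have a list of the restart files the run should need. The function name is hypothetical, and the list of paths has to come from your own suite, since the required file names are not given in this thread.

```python
from pathlib import Path


def check_restart_files(paths):
    """Return {path: problem} for every file that is missing or empty.

    `paths` is a caller-supplied list of expected restart files
    (hypothetical here; the real list depends on the suite).
    """
    problems = {}
    for path in map(Path, paths):
        if not path.is_file():
            problems[str(path)] = "missing"
        elif path.stat().st_size == 0:
            problems[str(path)] = "empty"
    return problems
```

An empty result only means every listed file exists and is non-empty; it does not show that the files belong to the right cycle or are mutually consistent.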

Grenville
