Cylc8 UKESM suite auto retriggering into an unrecoverable state

Robin_Smith · 27 October 2025 16:04

Hi guys,

I’m seeing an intermittent problem with my cylc8 UKESM1.2 (ARIA PROMOTE) suites on ARCHER2. I think the current state of u-ds260/19560701T0000Z/coupled illustrates the symptoms.

The first attempt at running “coupled” seems to succeed, and the restarts needed to start the next cycle are written. For some reason, however, cylc seems to think the task has failed and automatically retriggers it. The retry doesn’t work, though: the UM xhist file is already pointing to the restart for the next cycle, and the drivers also object to the existence of NEMO restarts for the future cycle, so they delete them. The retriggered coupled exec then just sits there until it times out.

I think that means I’m stuck - I can neither rerun this cycle, nor manually trigger the next one now that the NEMO dumps have been deleted. I’ve had several suites get into this state now, although sometimes the first attempt at coupled is explicitly marked as having timed out too.

Can you see what’s causing this? Have I missed something?

cheers,

robin

RosalynHatcher · 27 October 2025 16:42

Hi Robin,

Looks like a polling issue. If you look in the job.status file you can see it contains JOB_RUNNER_EXIT_POLLED, i.e. cylc tried to poll the job but couldn’t do it properly, and so decided that the job had failed. We have seen this occur before, and you can end up with multiple copies of the same coupled task running. Ideally cylc would say “I couldn’t poll properly so I’ll hang fire and try again in 5 minutes”. I would suggest you switch off retries for the coupled task to prevent this situation from happening.
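
If it helps to check several job.status files quickly, here is a minimal Python sketch of the check described above. It assumes the file is plain text with one KEY=VALUE entry per line, and it matches any key containing JOB_RUNNER_EXIT_POLLED rather than assuming an exact key name. The function names and the sample content are made up for illustration and are not taken from a real job.status file.

```python
def parse_job_status(text):
    """Parse KEY=VALUE lines into a dict, skipping blanks and lines without '='."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        key, _, value = line.partition("=")
        entries[key.strip()] = value.strip()
    return entries


def exit_was_polled(entries):
    """Return True if any key contains the JOB_RUNNER_EXIT_POLLED marker."""
    return any("JOB_RUNNER_EXIT_POLLED" in key for key in entries)


# Illustrative content only (assumed layout, not a real job.status file).
sample = """\
CYLC_JOB_ID=0000000
CYLC_JOB_RUNNER_EXIT_POLLED=2025-10-27T15:00:00Z
"""

entries = parse_job_status(sample)
print(exit_was_polled(entries))
```

To run it against a real file, read the file's contents into a string (for example with pathlib.Path(...).read_text()) and pass that to parse_job_status.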

Assuming you have the correct restarts for the current cycle you will need to replace the xhist with the one from the previous cycle and then retrigger. Hopefully these instructions will help: https://code.metoffice.gov.uk/trac/moci/wiki/tips_CRgeneral#RestartingFailingSuites

Cheers,
Ros
