Cylc8 UKESM suite auto retriggering into an unrecoverable state

Hi guys,

I'm seeing intermittent problematic behaviour with my Cylc 8 UKESM1.2 (ARIA PROMOTE) suites on ARCHER2. I think the current state of u-ds260/19560701T0000Z/coupled illustrates the symptoms.

The first attempt at running “coupled” seems to succeed, and the restarts needed to start the next cycle are written. For some reason, however, cylc seems to think it has failed and automatically retriggers it. This doesn’t work, though, since the UM xhist file is already pointing to the restart for the next cycle, and the drivers object to the existence of NEMO restarts for the future cycle too, so they delete them. The retriggered coupled exec then just sits there until it times out.

I think that means I’m stuck - I can neither rerun this cycle, nor manually trigger the next one now that the NEMO dumps have been deleted. I’ve had several suites get into this state now, although sometimes the first attempt at coupled is explicitly marked as having timed out too.

Can you see what’s causing this? Have I missed something?

cheers,

robin

Hi Robin,

Looks like a polling issue. If you look in the job.status file you can see JOB_RUNNER_EXIT_POLLED, i.e. cylc tried to poll the job but couldn’t do it properly, and so determined that the job had failed. We have seen this occur before, and you can end up with multiple copies of the same coupled task running. Ideally cylc would say “I couldn’t poll properly so I’ll hang fire and try again in 5 minutes”. I would suggest you switch off retries for the coupled task to prevent this situation from happening.
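
If it helps to check several suites for the same symptom, here is a minimal Python sketch that scans a job.status file for that marker. It assumes the file is plain text made of KEY=VALUE lines; the function names are made up for illustration and are not part of cylc.

```python
from pathlib import Path

# Marker string as quoted above; the full key name in the file may carry a prefix.
MARKER = "JOB_RUNNER_EXIT_POLLED"


def parse_job_status(text):
    """Parse KEY=VALUE lines into a dict; lines without '=' are ignored."""
    entries = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep:
            entries[key] = value
    return entries


def poll_marked_exit(text):
    """Return True if any key in the status text contains the marker."""
    return any(MARKER in key for key in parse_job_status(text))


def check_file(path):
    """Convenience wrapper: read a job.status file and test it for the marker."""
    return poll_marked_exit(Path(path).read_text())
```

Calling `check_file` on the job.status path of each suspect job would flag the ones where the failure came from a poll rather than from the model itself.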

Assuming you have the correct restarts for the current cycle you will need to replace the xhist with the one from the previous cycle and then retrigger. Hopefully these instructions will help: https://code.metoffice.gov.uk/trac/moci/wiki/tips_CRgeneral#RestartingFailingSuites
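
Before swapping the xhist file it is worth keeping a copy of the one you are overwriting. The sketch below is only a generic "back up, then replace" helper in Python; the function name and the `.bak` suffix are assumptions for illustration, the paths are whatever applies in your suite, and the linked instructions remain the authority on the actual procedure.

```python
import shutil
from pathlib import Path


def replace_with_backup(current, replacement):
    """Copy `replacement` over `current`, keeping the old file as <name>.bak.

    Both arguments are pathlib.Path objects. Returns the backup path.
    """
    if not replacement.is_file():
        raise FileNotFoundError(f"no replacement file at {replacement}")
    backup = current.with_name(current.name + ".bak")
    if current.exists():
        # Preserve the file being overwritten so the swap can be undone.
        shutil.copy2(current, backup)
    shutil.copy2(replacement, current)
    return backup
```

You would call it with the path of the current xhist and the path of the copy saved from the previous cycle, then retrigger once the swap has been checked.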

Cheers,
Ros
