Suite says that it failed but it's still running

My u-da046 JULES suite is running on JASMIN. The cylc GUI said that it failed in the cycle for years 1860-1870, but there isn’t anything in the job.err, job.out or job-activity.log files that says that it failed. I tried to poll the suite, but it still says that it failed, even though it is still running as job ID 17377026 on par-single on SLURM. I reset the state of the job to running. Was that the right thing to do? Is there any explanation for the cylc GUI saying that it failed?
Patrick

Hi Patrick,

These symptoms usually indicate an intermittent issue with the Slurm squeue command. Cylc uses squeue to query the scheduler to see if the job is still running. If the command fails silently, listing no job but also reporting no error, cylc will think the task has left the queue and so flags it as failed. It’s a known issue.

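The job.status file has the tell-tale lines, where the poll recorded the job as having left the queue at 08:07, more than two hours before the job itself reported success at 10:41: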

CYLC_BATCH_SYS_EXIT_POLLED=2024-01-29T08:07:18Z
CYLC_JOB_EXIT=SUCCEEDED
CYLC_JOB_EXIT_TIME=2024-01-29T10:41:10Z
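
If you want to check for this pattern yourself, here is a rough Python sketch (not part of cylc; the function names are made up for illustration). It assumes job.status is a plain KEY=VALUE file with UTC timestamps in the format shown above, and it flags the case where the polled exit time is earlier than the time the job itself reported exiting.

```python
from datetime import datetime


def parse_job_status(text):
    """Parse KEY=VALUE lines (as seen in job.status) into a dict."""
    status = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue  # skip blank or malformed lines
        key, value = line.split("=", 1)
        status[key] = value
    return status


def _parse_time(stamp):
    # Assumes timestamps look like 2024-01-29T08:07:18Z, as in the thread.
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ")


def polled_exit_before_real_exit(status):
    """Return True if the poll recorded an exit before the job really exited."""
    polled = status.get("CYLC_BATCH_SYS_EXIT_POLLED")
    exited = status.get("CYLC_JOB_EXIT_TIME")
    if polled is None or exited is None:
        return False  # not enough information to decide
    return _parse_time(polled) < _parse_time(exited)


sample = """CYLC_BATCH_SYS_EXIT_POLLED=2024-01-29T08:07:18Z
CYLC_JOB_EXIT=SUCCEEDED
CYLC_JOB_EXIT_TIME=2024-01-29T10:41:10Z
"""

status = parse_job_status(sample)
print(polled_exit_before_real_exit(status))
```

A True result only says that the poll and the job disagree about when the job ended; it does not by itself prove that squeue was the cause.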

Cheers,
Ros.

Hi Ros
Thanks for explaining!
Much appreciated.
Patrick
