I am running a Cylc 8 suite, u-dr783, and I am getting a failure from remote_setup.
Looking in the workflow log I have a bunch of errors like:
2025-08-01T15:45:42Z WARNING - platform: ln02 - Could not connect to ln02.
* ln02 has been added to the list of unreachable hosts
* remote-init will retry if another host is available.
which suggests to me that Cylc is having trouble connecting to ln02. If I run ssh ln02 from puma, I am asked for my archer2 ssh passphrase. If the same thing is happening for Cylc, that would explain the failure.
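One way to test that theory is to try the connection without allowing any prompt. This is a minimal sketch, assuming the scheduler connects non-interactively; `check_hosts` is a hypothetical helper name, and the 10 second timeout is an arbitrary choice. With `BatchMode=yes`, ssh fails instead of asking for a passphrase, so a host that needs one shows up as FAILED.

```shell
# Report whether each host accepts a non-interactive SSH login.
# BatchMode=yes makes ssh fail rather than prompt for a passphrase,
# which approximates how an unattended scheduler would connect.
check_hosts() {
    for host in "$@"; do
        if ssh -o BatchMode=yes -o ConnectTimeout=10 "$host" true \
            </dev/null >/dev/null 2>&1; then
            echo "$host: ok"
        else
            echo "$host: FAILED"
        fi
    done
}

# Example, run from the host where the workflow scheduler runs:
#   check_hosts ln01 ln02 ln03
```

If a host reports FAILED here but an interactive ssh works after typing the passphrase, the key is probably not loaded in an ssh-agent that the scheduler can see.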
Did that and ssh ln01 works as expected. However, it looks like my Cylc job u-dr157/run7 is behaving strangely. u-dr157 thinks it has been running atmos_main since 14:53 on the 1st of August. squeue --me shows no atmos_main running, and re-polling doesn’t change anything.
My other job, u-dr783, had failures from remote_setup. I got u-dr783 to start by triggering remote_setup, which has now run. I then got a submit failure from install_ancil, so I triggered that again, and it failed again. Error is:
Great. And how do I get u-dr157/run7 going again? Should I just trigger the next atmos_main task?
From the log, 19801001T0000Z/atmos_main has run. The .err file has a message at the end:
/work/n02/n02/tetts/cylc-run/u-dr157/run7/bin/save_wallclock.sh: /work/n02/n02/tetts/cylc-run/u-dr157/run7/bin/iteration_bins.py: /usr/bin/python: bad interpreter: No such file or directory
I could fix that by modifying iteration_bins.py but I think that this does not matter… It is in the logs for other atmos_main and they worked.
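For reference, the "bad interpreter" message means the path on the script's first line does not exist on the node that ran it. A minimal sketch for confirming that, where `check_shebang` is a hypothetical helper name and not part of the suite:

```shell
# Print the interpreter named on a script's first line and say whether
# it exists as an executable on this machine.
# For "#!/usr/bin/env python3" this checks /usr/bin/env, not python3 itself.
check_shebang() {
    first_line="$(head -n 1 "$1")"
    case "$first_line" in
        '#!'*) ;;
        *) echo "$1: no shebang line"; return 2 ;;
    esac
    # Drop the leading "#!" plus spaces, then anything after the first word.
    interp="$(printf '%s\n' "$first_line" |
        sed -e 's/^#![[:space:]]*//' -e 's/[[:space:]].*$//')"
    if [ -x "$interp" ]; then
        echo "$1: $interp found"
    else
        echo "$1: $interp missing"
        return 1
    fi
}

# Example, run on the machine where the task executes:
#   check_shebang /work/n02/n02/tetts/cylc-run/u-dr157/run7/bin/iteration_bins.py
```

Login and compute nodes can have different software installed, so the result may depend on where this is run.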
I’ve tried to get u-dr157/run7 started again by getting it to poll. This is some of the workflow output:
2025-08-05T11:02:27Z INFO - Command "poll_tasks" received. ID=6a10c32c-7f38-40c3-8cf3-c53b2e8b6233
poll_tasks(tasks=['19801001T0000Z/atmos_main'])
2025-08-05T11:02:27Z INFO - Command "poll_tasks" actioned. ID=6a10c32c-7f38-40c3-8cf3-c53b2e8b6233
2025-08-05T11:02:28Z WARNING - platform: archer2-nvme - Could not connect to ln01.
* ln01 has been added to the list of unreachable hosts
* jobs-poll will retry if another host is available.
2025-08-05T11:02:28Z WARNING - platform: archer2-nvme - Could not connect to ln02.
* ln02 has been added to the list of unreachable hosts
* jobs-poll will retry if another host is available.
2025-08-05T11:02:29Z WARNING - platform: archer2-nvme - Could not connect to ln03.
* ln03 has been added to the list of unreachable hosts
* jobs-poll will retry if another host is available.
How do I fix this?
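To see at a glance which login nodes the scheduler has given up on, the warnings can be pulled out of the workflow log. This is a rough sketch that assumes the message format shown in the excerpt above; `unreachable_hosts` is a hypothetical helper name, and the log path in the example is an assumption about where Cylc 8 keeps the scheduler log.

```shell
# List each host named in a "Could not connect to <host>." warning, once.
# Reads the named log files, or standard input if none are given.
unreachable_hosts() {
    sed -n 's/.*Could not connect to \([^ ]*\)\.$/\1/p' "$@" | sort -u
}

# Example (path is an assumption, adjust to the actual workflow log):
#   unreachable_hosts ~/cylc-run/u-dr157/run7/log/scheduler/log
```

If every login node for the platform appears in the output, the poll has no host left to retry on, which would match the behaviour described above.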
For u-dr783, I just removed it and started again. That seems to be working… (Well, jobs have started.)
[ OK ] Transfer command succeeded: globus transfer --format unix --jmespath 'task_id' --recursive --fail-on-quota-errors --sync-level checksum --label u-dr157/19801001T0000Z --verify-checksum --notify off 3e90d018-0d05-461a-bbaf-aab605283d21:/work/n02/n02/tetts/cylc-run/u-dr157/run7/share/cycle/19801001T0000Z a2f53b7f-1b4e-4dce-9b7c-349ae760fee0:/gws/nopw/j04/terrafirma/tetts/um_archive/u-dr157/19801001T0000Z
[ OK ] Transfer: Transfer OK. (ReturnCode=0)
2025-08-06T05:45:49Z INFO - succeeded
I’d check on JASMIN that the data is all there, then if it is, change the task status to succeeded.