Remote_setup failing

I am running a Cylc 8 suite (u-dr783) and I am getting a failure from remote_setup.

Looking in the workflow log I have a bunch of errors like:

2025-08-01T15:45:42Z WARNING - platform: ln02 - Could not connect to ln02.
* ln02 has been added to the list of unreachable hosts
* remote-init will retry if another host is available.

which suggests to me that Cylc is having trouble connecting to ln02. If I do ssh ln02 from PUMA I am asked for my ARCHER2 SSH passphrase; if that is happening for Cylc too, that would explain the failure.

What do I do?

Simon

Hi Simon,

This suggests that your ssh-agent on PUMA2 has died.

See here for how to fix it: 11. Appendix B: SSH FAQs — NCAS Unified Model Introduction
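In short, the fix is to start a fresh agent and re-add your key. A rough sketch (the key filename below is an assumption; substitute whichever key you set up for ARCHER2):

```shell
# Restart a fresh ssh-agent and re-add the ARCHER2 key.
# NB: ~/.ssh/id_rsa_archer2 is an assumed key name -- use the key
# you actually created for ARCHER2 access.
eval "$(ssh-agent -s)"             # start the agent, export SSH_AUTH_SOCK etc.
ssh-add ~/.ssh/id_rsa_archer2      # prompts for the passphrase once
ssh-add -l                         # confirm the key is loaded
ssh -oBatchMode=yes ln02 hostname  # should now succeed without a prompt
```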

Cheers,
Ros

Did that and ssh ln01 now works as expected. However, it looks like my cylc job u-dr157/run7 is behaving strangely. u-dr157 thinks it has been running atmos_main since 14:53 on the 1st of August, but squeue --me shows no atmos_main running... And re-polling doesn't change anything.

My other job, u-dr783, had failures from remote_setup. I got u-dr783 to start by triggering remote_setup, which has now run. I then got a submit failure from install_ancil, so I triggered it again and it failed again. The error is:

ssh -oBatchMode=yes -oConnectTimeout=8 -oStrictHostKeyChecking=no ln01 env CYLC_VERSION=8.4.4 CYLC_ENV_NAME=cylc-8.4.4-1 bash --login -c ''"'"'exec "$0" "$@"'"'"'' cylc jobs-submit --utc-mode --remote-mode --clean-env --path=/bin --path=/usr/bin --path=/usr/local/bin --path=/sbin --path=/usr/sbin --path=/usr/local/sbin -- '$HOME/cylc-run/u-dr783/log/job' 19790101T0000Z/install_ancil/01
[jobs-submit ret_code] 1
[jobs-submit out] 2025-08-04T08:27:45Z|19790101T0000Z/install_ancil/01|1|None

If I, interactively, do ssh -oBatchMode=yes -oConnectTimeout=8 -oStrictHostKeyChecking=no ln01 ls

I get an error:

Warning: Permanently added 'ln01,10.252.1.65' (ECDSA) to the list of known hosts.
tetts@ln01: Permission denied (keyboard-interactive).

Simon

The problem with install_ancil submission is this:

2025-08-04T08:28:24Z [STDERR] sbatch: error: AssocMaxCpuMinutesPerJobLimit
2025-08-04T08:28:24Z [STDERR] sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)

n02-TERRAFIRMA has used all its current budget. I’ll sort it out when I get to the office shortly.
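For reference, the limits behind that sbatch error can be inspected with standard Slurm accounting commands (the format fields below are generic sacctmgr options, not ARCHER2-specific; budgets are also visible in SAFE):

```shell
# Sketch: show the Slurm association limits for your user. A job rejected
# with AssocMaxCpuMinutesPerJobLimit has hit one of these TRES-minutes
# limits on its account/association.
sacctmgr show associations where user=$USER \
    format=Cluster,Account,User,GrpTRESMins,MaxTRESMins
```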

Cheers,
Ros

Hi Ros,

Thanks a lot! So many ways in which things can go wrong. I don't think it is me burning through the budget!

Simon

All topped up. Indeed, you’re not the biggest user. :slightly_smiling_face:

Great. And how do I get u-dr157/run7 going again? Should I just trigger the next atmos_main task?

From the log, 19801001T0000Z/atmos_main has run. The .err file has a message at the end:

/work/n02/n02/tetts/cylc-run/u-dr157/run7/bin/save_wallclock.sh: /work/n02/n02/tetts/cylc-run/u-dr157/run7/bin/iteration_bins.py: /usr/bin/python: bad interpreter: No such file or directory

I could fix that by modifying iteration_bins.py, but I don't think it matters... the same message appears in the logs for other atmos_main tasks and they worked.
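For the record, the fix would just be to repoint the script's shebang at an interpreter that exists on the compute nodes. A sketch (assuming python3 resolves via env there):

```shell
# Replace a stale "#!/usr/bin/python" shebang on the first line with an
# env-based python3 lookup. Adjust if the script needs a specific python.
sed -i '1s|^#!.*python.*|#!/usr/bin/env python3|' \
    /work/n02/n02/tetts/cylc-run/u-dr157/run7/bin/iteration_bins.py
# Check the result:
head -1 /work/n02/n02/tetts/cylc-run/u-dr157/run7/bin/iteration_bins.py
```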

Simon

I’ve tried to get u-dr157/run7 started again by getting it to poll. This is some of the workflow output:

2025-08-05T11:02:27Z INFO - Command "poll_tasks" received. ID=6a10c32c-7f38-40c3-8cf3-c53b2e8b6233
poll_tasks(tasks=['19801001T0000Z/atmos_main'])
2025-08-05T11:02:27Z INFO - Command "poll_tasks" actioned. ID=6a10c32c-7f38-40c3-8cf3-c53b2e8b6233
2025-08-05T11:02:28Z WARNING - platform: archer2-nvme - Could not connect to ln01.
* ln01 has been added to the list of unreachable hosts
* jobs-poll will retry if another host is available.
2025-08-05T11:02:28Z WARNING - platform: archer2-nvme - Could not connect to ln02.
* ln02 has been added to the list of unreachable hosts
* jobs-poll will retry if another host is available.
2025-08-05T11:02:29Z WARNING - platform: archer2-nvme - Could not connect to ln03.
* ln03 has been added to the list of unreachable hosts
* jobs-poll will retry if another host is available.

How do I fix this?

For u-dr783 I just removed it and started again. That seems to be working... (well, jobs have started).

Simon

Hi Simon,

I'd probably try stopping and restarting the suite.

Cheers,
Ros.

Do I do that by:

cylc stop --now --now

and then cylc play ?

Simon

Simon

Yes, try that.
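i.e. something like this, using the workflow ID from earlier in the thread:

```shell
# "--now --now" forces an immediate shutdown without waiting for active
# jobs; a plain "cylc play" then restarts the workflow from where it was.
cylc stop --now --now u-dr157/run7
cylc play u-dr157/run7
```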

Grenville

It looks like 19801001T0000Z/pptransfer is hanging. Should I kill the task and trigger it?

Simon

The log says:

[ OK ] Transfer command succeeded: globus transfer --format unix --jmespath 'task_id' --recursive --fail-on-quota-errors --sync-level checksum --label u-dr157/19801001T0000Z --verify-checksum --notify off 3e90d018-0d05-461a-bbaf-aab605283d21:/work/n02/n02/tetts/cylc-run/u-dr157/run7/share/cycle/19801001T0000Z a2f53b7f-1b4e-4dce-9b7c-349ae760fee0:/gws/nopw/j04/terrafirma/tetts/um_archive/u-dr157/19801001T0000Z
[ OK ] Transfer: Transfer OK. (ReturnCode=0)
2025-08-06T05:45:49Z INFO - succeeded

I’d check on JASMIN that the data is all there, then if it is, change the task status to succeeded.
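As a sketch, assuming Cylc 8.3 or later (the logs above show 8.4.4): cylc set with no options marks the task's required outputs, including succeeded, as complete, so the workflow can carry on past it.

```shell
# Mark the stuck task as having completed its required outputs
# (equivalent to setting it succeeded), then check its state.
cylc set u-dr157/run7//19801001T0000Z/pptransfer
cylc show u-dr157/run7//19801001T0000Z/pptransfer
```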

Grenville