Issues with submitting coupled task

Hi,
I’m running UKESM1.1 on ARCHER2.
After the job had run for about 25 model years, one of the coupled tasks failed. Looking through the error messages, this looks like a disk quota issue. I’ve now freed up space on ARCHER2 and on PUMA (so there should be no quota issues) and have tried to retrigger the task.
It now fails due to:
ERROR: file not found: /home/aschurer/cylc-run/u-ck651/log/job/18730701T0000Z/coupled/17/job.err
With no other information given.
What would be the best way to restart this job?
Many thanks,
Andrew

Hi Andrew:
Have you tried logging in to ARCHER2 to see if that log file is there, at the command line, with ls?

Sometimes the Cylc GUI on PUMATEST can’t see the files on ARCHER2.

Also, you might try stopping the suite on PUMATEST, running rose suite-run --restart, and then retriggering the failed apps. I have had to do that a couple of times in the past week for my suites.
Patrick

Hi Andrew,

Looks like your ssh-agent on pumatest has died.

In log/suite/log

   Permission denied (publickey).
        ERROR: command returns 255: ssh -oBatchMode=yes -oConnectTimeout=10 schn02@login.archer2.ac.uk env CYLC_VERSION=7.8.7 CYLC_UTC=True TZ=UTC bash --login -c ''"'"'exec "$0" "$@"'"'"'' cylc jobs-submit --utc-mode --remote-mode -- ''"'"'$HOME/cylc-run/u-ck651/log/job'"'"'' 18730701T0000Z/coupled/17
2022-04-07T14:28:13Z ERROR - [jobs-submit cmd] cylc jobs-submit --utc-mode --host=login.archer2.ac.uk --user=schn02 --remote-mode -- '$HOME/cylc-run/u-ck651/log/job' 18730701T0000Z/coupled/17

When a task fails to submit, no job.err or job.out file is generated, hence the cylc GUI error saying that it can’t find the job.err file.
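
As a rough sketch of how to pull these submission errors out of the suite log from the command line: the function name find_submit_errors is made up for illustration, and the search patterns are only assumptions based on the log lines quoted above.

```shell
# Hypothetical helper: list submission-related errors in a Cylc 7 suite log.
# Usage: find_submit_errors ~/cylc-run/u-ck651/log/suite/log
find_submit_errors() {
    logfile=$1
    if [ ! -r "$logfile" ]; then
        echo "cannot read $logfile" >&2
        return 2
    fi
    # -n prints line numbers; -E enables the alternation between patterns.
    # The patterns are assumptions taken from the messages quoted in this thread.
    grep -nE 'Permission denied|\[jobs-submit cmd\]|command returns [0-9]+' "$logfile"
}
```

grep returns a non-zero status when nothing matches, so the function can also be used in an if test.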

Cheers,
Ros.

Hi Ros,
Thanks, I’ve now logged out and logged back in.
Restarted the ssh-agent
exec ssh-agent $SHELL
And triggered the task again.

I now get a slightly different but similar error:
ERROR: file not found: /home/aschurer/cylc-run/u-ck651/log/job/18730701T0000Z/coupled/NN/job.err
Traceback (most recent call last):
  File "/home/fcm/cylc-7.8.7/bin/cylc-cat-log", line 439, in <module>
    main()
  File "/home/fcm/cylc-7.8.7/bin/cylc-cat-log", line 435, in main
    tmpfile_edit(out, options.geditor)
  File "/home/fcm/cylc-7.8.7/bin/cylc-cat-log", line 265, in tmpfile_edit
    modtime1 = os.stat(tmpfile).st_mtime
TypeError: coercing to Unicode: need string or buffer, int found

Any ideas what the problem is?
Thanks
Andrew

Hi Andrew,

You don’t need to run exec ssh-agent $SHELL; doing so won’t work for cylc.
Your ~/.profile is already set up to start up a new ssh-agent as required.

Please do the following to kill the current agent and start up a clean one:

pumatest$ rm ~/.ssh/environment.pumatest.nerc.ac.uk

Then log out of PUMA and back in again. You should then see a message similar to:

Initialising new SSH agent...

And you should then be able to run ssh-add ~/.ssh/id_rsa_archerum successfully.
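
The state of the agent can also be checked at any time with ssh-add -l. A minimal sketch, relying on the documented exit statuses of ssh-add (0 on success, 1 when the command fails, for example when no keys are loaded, and 2 when the agent cannot be contacted); the function names are made up for illustration:

```shell
# Translate the exit status of `ssh-add -l` into a short message.
# 0 = agent reachable with keys loaded
# 1 = agent reachable but no keys loaded
# 2 = agent cannot be contacted
describe_agent_status() {
    case "$1" in
        0) echo "agent running, key(s) loaded" ;;
        1) echo "agent running, no keys: run ssh-add" ;;
        2) echo "no agent reachable: start a clean one" ;;
        *) echo "unexpected status $1" ;;
    esac
}

# Query the agent of the current shell and report its state.
check_agent() {
    ssh-add -l >/dev/null 2>&1
    describe_agent_status "$?"
}
```

Running check_agent after logging back in should report that keys are loaded once ssh-add has succeeded.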

Regards,
Ros.

Hi Ros,
I have now done this:

Initialising new SSH agent…
-bash-4.1$ ssh-add ~/.ssh/id_rsa_archerum
Enter passphrase for /home/aschurer/.ssh/id_rsa_archerum:
Identity added: /home/aschurer/.ssh/id_rsa_archerum (/home/aschurer/.ssh/id_rsa_archerum)

Also:
-bash-4.1$ ssh schn02@login.archer2.ac.uk
PTY allocation request failed on channel 0
Comand rejected by policy. Not in authorised list
Connection to login.archer2.ac.uk closed.

So it all looks OK to me. But when I tried to retrigger the task, I still got an error, which is now back to:
ERROR: file not found: /home/aschurer/cylc-run/u-ck651/log/job/18730701T0000Z/coupled/23/job.err

Hi Andrew:
I had similar problems recently. After ensuring that your SSH agent is working, you might try stopping the suite on PUMATEST, running rose suite-run --restart, and then retriggering the failed apps. I have had to do that a couple of times in the past week for my suites. Does that help?
Patrick

Hi Patrick,
Thanks for your reply.
I have done what you suggested and the task has been submitted and is now running. So it looks like that has solved the problem.
Thanks for your help,
Andrew

Just to confirm: yes, when you fix a dead ssh-agent you will always need to stop and restart the suite in order for it to pick up the new ssh-agent.
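
For reference, the stop, restart and retrigger sequence might look roughly like the sketch below. It only prints the commands rather than running them. The function name is made up, and two things are assumptions rather than anything confirmed in this thread: that the suite's working copy lives in ~/roses/<suite>, and that the task is addressed as <task>.<cycle> as in Cylc 7.

```shell
# Print (not run) the commands for restarting a suite after fixing the ssh-agent.
# Assumptions: working copy in ~/roses/<suite>; Cylc 7 task IDs of the form
# <task>.<cycle>. Check both against your own setup before running anything.
print_restart_plan() {
    suite=$1
    task=$2
    cycle=$3
    echo "cylc stop $suite"
    echo "cd ~/roses/$suite && rose suite-run --restart"
    echo "cylc trigger $suite $task.$cycle"
}

print_restart_plan u-ck651 coupled 18730701T0000Z
```

Printing the plan first makes it easy to check the suite and task names before anything is stopped.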

I am glad it’s working, Andrew!
Patrick

Hi Ros,
I have run into a similar problem. Two suites, u-cn515 and u-cn440, stopped due to what I think was a disk usage error.
I’ve freed up space, so I don’t think there should be any issues any more, but the jobs now don’t submit:

ERROR: file not found: /home/aschurer/cylc-run/u-cn515/log/job/17820701T0000Z/postproc_atmos/NN/job.out
Traceback (most recent call last):
  File "/home/fcm/cylc-7.8.7/bin/cylc-cat-log", line 439, in <module>
    main()
  File "/home/fcm/cylc-7.8.7/bin/cylc-cat-log", line 435, in main
    tmpfile_edit(out, options.geditor)
  File "/home/fcm/cylc-7.8.7/bin/cylc-cat-log", line 265, in tmpfile_edit
    modtime1 = os.stat(tmpfile).st_mtime
TypeError: coercing to Unicode: need string or buffer, int found

I’ve killed the ssh-agent, stopped the suite, logged out, run ssh-add ~/.ssh/id_rsa_archerum and restarted the job, but I still get a submit-failed error.

What is the best way to solve this?

Thanks
Andrew

Hi Andrew,

When a task fails to submit there will not be a job.err or job.out file, so you’ll see the “file not found” error if you try to look at them through the cylc GUI. With submit-failed errors you need to look in the job-activity.log file (/home/aschurer/cylc-run/u-cn515/log/job/17821001T0000Z/coupled/10/job-activity.log).

The error message is:
(schn02@login.archer2.ac.uk) 2022-05-11T10:35:02Z [STDERR] sbatch: error: Batch job submission failed: Unable to contact slurm controller (connect failure)
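
A small sketch of how to pull such lines out of job-activity.log from the command line. The function name and the RUN_ROOT variable are made up for illustration (RUN_ROOT defaults to ~/cylc-run), and the submit number defaults to NN, which in Cylc 7 should point at the most recent submission.

```shell
# Hypothetical helper: print the STDERR lines recorded for a job submission.
# Arguments: suite name, cycle point, task name, submit number (default NN).
# RUN_ROOT is an illustrative override, not a Cylc setting.
show_submit_stderr() {
    run_root=${RUN_ROOT:-$HOME/cylc-run}
    log="$run_root/$1/log/job/$2/$3/${4:-NN}/job-activity.log"
    if [ ! -r "$log" ]; then
        echo "no job-activity.log at $log" >&2
        return 2
    fi
    # -F treats the pattern as a fixed string, so the brackets need no escaping.
    grep -F '[STDERR]' "$log"
}
```

For example, show_submit_stderr u-cn515 17821001T0000Z coupled 10 would read the file mentioned above.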

ARCHER2 are undertaking an upgrade to Slurm this week (if you didn’t receive the email please make sure you are subscribed to notifications in SAFE).

See also the ARCHER2 status page:
https://www.archer2.ac.uk/support-access/status.html#service-alerts

Cheers,
Ros.

Hi Ros,
Thanks for the help and information. And apologies, I should have noticed that this upgrade might have been the cause of this problem (I did get the email).
Thanks,
Andrew