Yesterday I had several suites running on ARCHER2, which failed because of the Slurm update. However, I'm now unable to restart these: I always get submit-failed on PUMA when I retrigger the task that failed. I've also tried stopping the suite and using rose suite-run --restart, but I still get submit-failed (see e.g. u-cn925 18510101T0000Z).
I’ve checked there’s still a connection from PUMA to ARCHER2, and have also raised this with the ARCHER2 helpdesk but they can’t see a problem on that end.
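For what it's worth, the connection check I did from PUMA was just a non-interactive ssh (a minimal sketch; the login address and options here are generic, not copied from my suite setup):

  # on PUMA: confirm the ssh-agent holds a key and a non-interactive login works
  ssh-add -l
  ssh -o BatchMode=yes radiam24@login.archer2.ac.uk hostname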
How should I get my suites running again?
In the log file /home/radiam24/cylc-run/u-cn925/log/suite/log.20220510T104649+01 there is a slew of "Permission denied (publickey)" errors. These were the cause of the suite being unable to submit jobs once Slurm came back.
It looks like you may have fixed this after 18:00 yesterday, after you put in this query, as I can now see that a few postproc tasks managed to submit and run OK.
Cylc cannot track the status of tasks while the Slurm scheduler is out of action, and will typically mark a task as failed when it can no longer find it. If you have any tasks queued or running when Slurm goes down for further maintenance on Monday, I would suggest manually checking their statuses, to see whether they have actually succeeded, before resubmitting any "failed" tasks when Slurm comes back.
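A minimal sketch of the kind of manual check I mean (the suite, cycle and task names below are placeholders, not taken from your suites):

  # on ARCHER2: is the job still known to Slurm?
  squeue -u $USER

  # what did the job itself record? (NN is the submit number, e.g. 01)
  cat /work/n02/n02/$USER/cylc-run/<suite>/log/job/<cycle>/<task>/NN/job.status

  # on PUMA: ask Cylc to re-poll the task before retriggering anything
  cylc poll <suite> '<task>.<cycle>'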
Yes, I did see that and fix the public key problem on Wednesday evening. However, I am still having problems submitting suites. As you said, some jobs succeeded but weren't tracked by Cylc; e.g. /work/n02/n02/radiam24/cylc-run/u-ck767/log/job/190007*/coupled/01/job.status has status succeeded (but this job was automatically resubmitted, and then failed). I then tried triggering the postproc tasks, but got submit-failed; tried manually setting the coupled task to 'succeeded', but still got submit-failed for the postproc tasks; and tried retriggering the coupled task, but that also got submit-failed.
Is there a solution other than following the instructions here to restart the suite: https://code.metoffice.gov.uk/trac/moci/wiki/tips_CRgeneral#RestartingFailingSuites?
I followed the instructions here (https://code.metoffice.gov.uk/trac/moci/wiki/tips_CRgeneral#RestartingFailingSuites) to restart suite u-ck832 (which had the same issue as my other failing suites). I tried to restart from the coupled task of cycle point 19090201T0000Z, and made sure the UM, NEMO and CICE were all initialised from cycle point 19090201T0000Z. However, the coupled task failed with 'huge initialisation errors' in NEMO, see /work/n02/n02/radiam24/cylc-run/u-ck832/work/19090201T0000Z/coupled/ocean.output
u-ck767 is currently showing "Permission denied" in log/suite/log for your attempted submissions this morning. Have you stopped and restarted this suite since you fixed the ssh-agent problem yesterday? Suites often need to be stopped and restarted to pick up the new ssh-agent.
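For the record, by "stopped and restarted" I mean the usual Rose/Cylc sequence on PUMA, something like the following (a sketch, using u-ck767 as the example and assuming the suite lives in ~/roses/u-ck767):

  # stop the suite cleanly (add --now if nothing is actually running)
  cylc stop u-ck767

  # restart it so it picks up the current ssh-agent
  cd ~/roses/u-ck767
  rose suite-run --restart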
About u-ck767 - I stopped and restarted the suite just now, and the postproc tasks ran. However, the following coupled task has now failed with an error I haven’t seen before:
Traceback (most recent call last):
  File "./link_drivers", line 183, in <module>
    envinsts, launchcmds = _run_drivers(common_envars, mode)
  File "./link_drivers", line 66, in _run_drivers
    '(common_envars,'%s')' % (drivername, mode)
  File "<string>", line 1, in <module>
  File "/mnt/lustre/a2fs-work2/work/n02/n02/radiam24/cylc-run/u-ck767/work/19000801T0000Z/coupled/nemo_driver.py", line 648, in run_driver
    exe_envar = _setup_executable(common_envar)
  File "/mnt/lustre/a2fs-work2/work/n02/n02/radiam24/cylc-run/u-ck767/work/19000801T0000Z/coupled/nemo_driver.py", line 568, in _setup_executable
    controller_mode)
  File "/mnt/lustre/a2fs-work2/work/n02/n02/radiam24/cylc-run/u-ck767/work/19000801T0000Z/coupled/top_controller.py", line 370, in run_controller
    nemo_dump_time)
  File "/mnt/lustre/a2fs-work2/work/n02/n02/radiam24/cylc-run/u-ck767/work/19000801T0000Z/coupled/top_controller.py", line 248, in _setup_top_controller
    % top_dump_time % nemo_dump_time)
TypeError: not enough arguments for format string
[FAIL] run_model <<'__STDIN__'
[FAIL]
[FAIL] '__STDIN__' # return-code=1
2022-05-13T10:16:11Z CRITICAL - failed/EXIT
About u-ck832 - I did stop this suite before following the instructions for a warm start, so I’m not sure what the issue here is.
We're looking at the suites and will take them one at a time, otherwise it gets very confusing. Please don't do anything with them until we get back to you with further advice.
*** Info read in restart :
previous time-step : 680640
*** restart option
nrstdt = 2 : calendar parameters read in restart
===>>> : E R R O R
===========
===>>>> : problem with nit000 for the restart
verify the restart file or rerun with nrstdt = 0 (namelist)
*** Info used values :
date ndastp : 19090130
number of elapsed days since the begining of run : 21270.
=======>> 1/2 time step before the start of the run DATE Y/M/D = 1909/ 1/30 nsec_day: 85050 nsec_week: 85050
======>> time-step = 681601 New day, DATE Y/M/D = 1909/02/01 nday_year = 031
nsec_year = 2593350 nsec_month = 1350 nsec_day = 1350
This indicates that there is a mismatch between the restart file's time step and the start time step (nn_it000) set for this cycle in the namelist_cfg file. The previous time step recorded in the restart is 680640, so this cycle should start from 680641, but for some reason the namelist_cfg has 681601 as the next time step.
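A quick way to see the mismatch on ARCHER2 (a sketch; the cycle directory is the one from your ocean.output path above):

  cd /work/n02/n02/radiam24/cylc-run/u-ck832/work/19090201T0000Z/coupled

  # time step the namelist tells NEMO to start from this cycle
  grep nn_it000 namelist_cfg

  # time step the restart file actually finished on (680640 here)
  grep 'previous time-step' ocean.output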
Please copy the namelist_cfg from the NEMOhist directory into the cycle work directory:
Sorry, my mistake: of course it calculates the next starting point from what is in NEMOhist/namelist_cfg, so I should have said to try copying the namelist_cfg from the previous cycle into the NEMOhist directory:
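Something along these lines (a sketch only: I'm assuming the previous cycle is 19090101T0000Z and that NEMOhist sits under share/data/History_Data, so please check the actual paths in your suite before copying anything):

  cd /work/n02/n02/radiam24/cylc-run/u-ck832

  # keep a copy of the namelist_cfg currently in NEMOhist
  cp share/data/History_Data/NEMOhist/namelist_cfg share/data/History_Data/NEMOhist/namelist_cfg.keep

  # replace it with the namelist_cfg from the previous cycle's work directory
  cp work/19090101T0000Z/coupled/namelist_cfg share/data/History_Data/NEMOhist/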