Submit-failed for all tasks today?

Dear NCAS helpdesk,

Yesterday I had several suites running on ARCHER2, which failed because of the Slurm update. However, I'm now unable to restart them: I always get submit-failed on PUMA when I retrigger the task that failed. I've also tried stopping the suite and using rose suite-run --restart, but I still get submit-failed (see e.g. u-cn925 18510101T0000Z).
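For reference, the commands I've been running on PUMA are roughly as follows (Cylc 7 / Rose syntax; <task> is a placeholder for the failed task name, and ~/roses/u-cn925 is where I keep the suite source):

# retrigger the failed task
cylc trigger u-cn925 <task>.18510101T0000Z

# or stop the suite and restart it
cylc stop u-cn925
cd ~/roses/u-cn925
rose suite-run --restart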
I've checked that there is still a connection from PUMA to ARCHER2, and have also raised this with the ARCHER2 helpdesk, but they can't see a problem on their end.
How should I get my suites running again?

Many thanks.

Best wishes,

Rachel

Hi Rachel,

In the log file /home/radiam24/cylc-run/u-cn925/log/suite/log.20220510T104649+01 there was a slew of permission denied (public key) errors. This is why the suite was unable to submit tasks when Slurm came back.

It looks like you might have fixed this after 18:00 yesterday, after you put in this query, as I can now see that a few postproc tasks managed to submit and run OK.

Cylc cannot track the status of tasks while the Slurm scheduler is out of action, and it will likely mark a task as failed when it can no longer find it. If you have any tasks queued or running when Slurm goes down for further maintenance on Monday, I suggest manually checking their statuses to see whether they actually succeeded before resubmitting any "failed" tasks once Slurm comes back.
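For example, something along the following lines on ARCHER2 will show what each job actually recorded before Slurm went down (paths shown for u-cn925 as an illustration; the exact contents of the job.status files may vary with Cylc version):

# check the exit status recorded for each job of a given cycle
grep -H CYLC_JOB_EXIT /work/n02/n02/radiam24/cylc-run/u-cn925/log/job/18510101T0000Z/*/*/job.status

# or ask Slurm's accounting directly (<JOBID> is a placeholder for the job ID recorded in job.status)
sacct -j <JOBID> --format=JobID,State,Elapsed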

Regards,
Ros.

Hi Ros,

Yes, I did see that and fixed the public key problem on Wednesday evening. However, I'm still having problems submitting suites. As you said, some jobs succeeded but weren't tracked by Cylc: e.g. /work/n02/n02/radiam24/cylc-run/u-ck767/log/job/190007*/coupled/01/job.status has status succeeded (but this job was automatically resubmitted and then failed). I tried triggering the postproc tasks after this, but got submit-failed; I then tried manually setting the coupled task to 'succeeded', but the postproc tasks still got submit-failed; and when I retriggered the coupled task itself, that got submit-failed too.
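Roughly, the sequence I tried was as follows (Cylc 7 commands run from PUMA; the task globs are approximate):

# on ARCHER2: check the status recorded for the coupled job of this cycle
grep -i succeeded /work/n02/n02/radiam24/cylc-run/u-ck767/log/job/190007*/coupled/01/job.status

# on PUMA: manually mark the coupled task as succeeded
cylc reset --state=succeeded u-ck767 'coupled.190007*'

# on PUMA: retrigger the postproc tasks, and later the coupled task itself
cylc trigger u-ck767 'postproc*.190007*'
cylc trigger u-ck767 'coupled.190007*'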
Is there a solution other than restarting the suite by following the instructions here (https://code.metoffice.gov.uk/trac/moci/wiki/tips_CRgeneral#RestartingFailingSuites)?

Best wishes,

Rachel

Hi,

I followed the instructions here (https://code.metoffice.gov.uk/trac/moci/wiki/tips_CRgeneral#RestartingFailingSuites) to restart suite u-ck832 (which had the same issue as my other failing suites). I tried to restart from the coupled task of cycle point 19090201T0000Z, and made sure the UM, NEMO and CICE were all initialised from cycle point 19090201T0000Z. However, the coupled task failed with 'huge initialisation errors' in NEMO; see /work/n02/n02/radiam24/cylc-run/u-ck832/work/19090201T0000Z/coupled/ocean.output

How should I get this suite running again?

Best wishes,

Rachel

Hi Rachel,

u-ck767 is currently showing permission denied in the log/suite/log for your attempted submissions this morning. Have you stopped and restarted this suite since you fixed the ssh-agent problem yesterday? Suites often need to be stopped and restarted to pick up the new ssh-agent.
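i.e. on PUMA, something like the following (assuming the suite source is under ~/roses/u-ck767):

# stop the suite, then restart it so it picks up the new ssh-agent
cylc stop u-ck767
cd ~/roses/u-ck767
rose suite-run --restart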

Regards,
Ros.

Hi,

Thanks so much.

About u-ck767: I stopped and restarted the suite just now, and the postproc tasks ran. However, the following coupled task has now failed with an error I haven't seen before:
Traceback (most recent call last):
  File "./link_drivers", line 183, in <module>
    envinsts, launchcmds = _run_drivers(common_envars, mode)
  File "./link_drivers", line 66, in _run_drivers
    '(common_envars,'%s')' % (drivername, mode)
  File "<string>", line 1, in <module>
  File "/mnt/lustre/a2fs-work2/work/n02/n02/radiam24/cylc-run/u-ck767/work/19000801T0000Z/coupled/nemo_driver.py", line 648, in run_driver
    exe_envar = _setup_executable(common_envar)
  File "/mnt/lustre/a2fs-work2/work/n02/n02/radiam24/cylc-run/u-ck767/work/19000801T0000Z/coupled/nemo_driver.py", line 568, in _setup_executable
    controller_mode)
  File "/mnt/lustre/a2fs-work2/work/n02/n02/radiam24/cylc-run/u-ck767/work/19000801T0000Z/coupled/top_controller.py", line 370, in run_controller
    nemo_dump_time)
  File "/mnt/lustre/a2fs-work2/work/n02/n02/radiam24/cylc-run/u-ck767/work/19000801T0000Z/coupled/top_controller.py", line 248, in _setup_top_controller
    % top_dump_time % nemo_dump_time)
TypeError: not enough arguments for format string
[FAIL] run_model <<'STDIN'
[FAIL]
[FAIL] 'STDIN' # return-code=1
2022-05-13T10:16:11Z CRITICAL - failed/EXIT

About u-ck832: I did stop this suite before following the instructions for a warm start, so I'm not sure what the issue is here.

Best wishes,

Rachel

Hi Rachel,

We're looking at the suites and will take them one at a time, otherwise it gets very confusing. Please don't do anything with them until we get back to you with further advice.

Best Regards,
Ros.

Hi Ros,

Thanks so much for letting me know.

Best wishes,

Rachel

Hi Rachel,

u-ck832:

The error in the ocean.output:

*** Info read in restart : 
    previous time-step                               :  680640
  *** restart option
  nrstdt = 2 : calendar parameters read in restart
 

 ===>>> : E R R O R
         ===========

  ===>>>> : problem with nit000 for the restart
  verify the restart file or rerun with nrstdt = 0 (namelist)
  *** Info used values : 
    date ndastp                                      :  19090130
    number of elapsed days since the begining of run :  21270.
 
 =======>> 1/2 time step before the start of the run DATE Y/M/D =   1909/ 1/30  nsec_day:   85050  nsec_week:   85050
======>> time-step =  681601      New day, DATE Y/M/D = 1909/02/01      nday_year = 031
         nsec_year =  2593350   nsec_month =    1350   nsec_day =  1350

This indicates that there is a mismatch between the restart file's timestep and the start timestep (nn_it000) for this cycle in the namelist_cfg file: the previous timestep was 680640, but for some reason the namelist_cfg has 681601 as the next timestep.
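If you want to see the mismatch for yourself, something like this shows the two values side by side (paths as in your suite):

# nn_it000 in the NEMOhist namelist_cfg (in step with the restart file: 680641)
grep nn_it000 /work/n02/n02/radiam24/cylc-run/u-ck832/share/data/History_Data/NEMOhist/namelist_cfg

# nn_it000 in this cycle's work-directory namelist_cfg (currently 681601)
grep nn_it000 /work/n02/n02/radiam24/cylc-run/u-ck832/work/19090201T0000Z/coupled/namelist_cfg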

Please copy the namelist_cfg from the NEMOhist directory into the cycle work directory:

cp /work/n02/n02/radiam24/cylc-run/u-ck832/share/data/History_Data/NEMOhist/namelist_cfg /work/n02/n02/radiam24/cylc-run/u-ck832/work/19090201T0000Z/coupled

This has:

nn_it000=680641
nn_itend=681600

And try submitting the task again.

Cheers,
Ros.

Hi,

Thanks. I copied the file and retriggered the task, but it failed with a similar error. Did I miss a step? Should I have reloaded the suite?

Best wishes,

Rachel

Hi Rachel,

Sorry, my mistake. Of course it calculates the next starting point from what is in the NEMOhist/namelist_cfg, so I should have said to copy the namelist_cfg from the previous cycle into the NEMOhist directory:

cp /work/n02/n02/radiam24/cylc-run/u-ck832/work/19090101T0000Z/coupled/namelist_cfg /work/n02/n02/radiam24/cylc-run/u-ck832/share/data/History_Data/NEMOhist

No need to reload the suite, just retrigger.
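i.e. from PUMA, after the copy, something like (Cylc 7 syntax):

# retrigger just the failed coupled task; no suite reload needed
cylc trigger u-ck832 coupled.19090201T0000Z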

Cheers,
Ros.

Hi Ros,

Thanks so much for the help. I applied this fix to u-ck832 and all the other failing suites, and they are now running fine.

Best wishes,

Rachel
