Multiple failures from save_wallclock.sh

I’ve had a bunch of failures of the form (I am running 10 jobs at once; all failed):

/work/n02/n02/tetts/cylc-run/opt_dfols46/dn00u/bin/save_wallclock.sh: /work/n02/n02/tetts/cylc-run/opt_dfols46/dn00u/bin/iteration_bins.py: /usr/bin/python: bad interpreter: No such file or directory

So save_wallclock.sh runs at the end of the atmos_main job, so it's no big deal: I trigger the postproc, pptransfer and the next cycle to restart the cycle (rather tedious with 10 jobs, though). However, there is something strange going on: the models show as failed around 18:56, but from the pe files they were running much longer, basically until the end of the cycle. Is there some problem with ARCHER2? I am using the nvme queues.
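For context on the "bad interpreter" error itself: the message says /usr/bin/python does not exist, so any script whose shebang hard-codes that path dies before its first line of code runs. A minimal sketch of the usual fix, shown on a throwaway file rather than the real iteration_bins.py (the paths and interpreter name here are assumptions, not the actual ARCHER2 setup):

```shell
# Demonstration with a throwaway script (the real file in the error was
# .../bin/iteration_bins.py; adjust the path to your workflow).
tmp=$(mktemp)
printf '#!/usr/bin/python\nprint("ok")\n' > "$tmp"   # stale, hard-coded shebang

# Rewrite the shebang to use env, which searches PATH for the interpreter.
# python3 is assumed here; use whichever python the script actually needs.
sed -i '1s|^#!/usr/bin/python$|#!/usr/bin/env python3|' "$tmp"

head -n 1 "$tmp"     # now reads: #!/usr/bin/env python3
chmod +x "$tmp"
"$tmp"               # runs via whichever python3 is on PATH
rm -f "$tmp"
```

On a shared machine the env form simply defers to whatever python is first on PATH at run time, so make sure the right module or environment is loaded when the job runs.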

Simon

P.S. Is there a less tedious way than triggering the three jobs by hand?

Hi Simon,

We’ve had multiple Archer2 issues over the weekend unfortunately.

NVMe ran out of disk space early Sunday morning, which caused job failures. Then there was some kind of Slurm issue which meant that Cylc job polling failed. Unfortunately this caused jobs to be marked as failed even though they may have still been running.

In this case you can get Cylc to update the job states by re-polling. If you have lots of workflows and tasks you can poll them all with:

cylc poll '*//*'

They may still show up coloured red in the GUI or Tui, but polling does set the tasks as succeeded (if indeed they did succeed), and downstream tasks will then run.
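On your P.S.: for tasks that genuinely failed, `cylc trigger` accepts several task IDs in one command, so you don't have to click each one by hand. A sketch, using the workflow name from your path but a placeholder cycle point (substitute the real one):

```shell
# Placeholder cycle point; fill in the actual one for your run.
cylc trigger 'opt_dfols46/dn00u//<cycle>/postproc' \
             'opt_dfols46/dn00u//<cycle>/pptransfer'
```

If the tasks only looked failed because of the polling problem, the re-poll above is enough and no manual triggering is needed.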

Hope this helps,
Annette

Hi Annette,

Thanks for the hint!

Simon
