Jobs on Slurm sit forever in "submitted" before failing

Dear CMS,

My Cylc jobs repeatedly fail on JASMIN Slurm, but not in the obvious way: they seem to start, run one of the subtasks and then sit in status “submitted” doing nothing until the walltime limit eventually makes them fail. squeue gives me no information on these jobs.

Please see screenshot https://www.tobymarthews.com/uploads/1/1/3/1/11315558/screenshot_submitted_jobs.png showing an example using a rose stem test.

Any help on this would be hugely appreciated (!). I had this issue with some unrelated jobs a few months ago (reported in "Job submitted but mysteriously on hold for 6 days"), and the issue seems to have returned.

Best wishes,

Toby

UPDATE 1 day later: as expected, all these jobs failed on JASMIN:

submission timeout: 1/fcm_make_debug/01
submission timeout: 1/fcm_make_mpi/01
submission timeout: 1/fcm_make_mpi_rivers-only/01

submission failed: 1/fcm_make_debug/01
submission failed: 1/fcm_make_mpi/01
submission failed: 1/fcm_make_mpi_rivers-only/01

suite: vn7.2_elevq
host: cylc1.jasmin.ac.uk
port: 43013
owner: tmarthews

UPDATE: I have found out a little more and submitted this as a query to Cylc Support under the title "Jobs show as 'submitted' on Cylc GUI but actually they have failed".

Hi Toby

I was just looking at /home/users/tmarthews/cylc-run/vn7.2_elevq/log/job/1/fcm_make_debug/01 when it changed under me. Is the current job the one that has problems (or have you changed the slurm configuration)?

Grenville

Hi Grenville,

Thank you for looking at this one.

I had a theory that my problem may lie in my .ssh/known_hosts file. I edited that file and then suddenly found myself logged out of JASMIN and I now can’t log back in (!).

I have emailed JASMIN Support to try to correct this (definitely my fault this time, not theirs!), but perhaps a consequence of my action has been to kill those jobs as well.

Sorry about that! Perhaps you could give me an hour or two to try to sort out what I just broke about my profile?

Toby

This email and any attachments are intended solely for the named recipients and are confidential. If you are not the intended recipient, please reply to the email to highlight the error and delete this email from your system; you must not use, disclose, copy, or distribute this email or any of its attachments. UK Centre for Ecology & Hydrology (UKCEH) has taken reasonable precautions to minimise risk of this email or any attachments containing viruses or malware, but the recipient should carry out its own virus and malware checks before opening the attachments. UKCEH does not accept any liability for any losses or damages which the recipient may sustain due to presence of any viruses. Opinions, conclusions or other information in this message and attachments that are not related directly to UKCEH business are solely those of the author and do not represent the views of UKCEH. We process your personal data in accordance with our Privacy Notice, available on the UKCEH website. Privacy notice | UK Centre for Ecology & Hydrology Registered office address; Maclean Building Benson Lane, Crowmarsh Gifford, Wallingford, Oxfordshire, United Kingdom, OX10 8BB Companies Registered Name; UK Centre for Ecology & Hydrology Place of Registration; England Registered Company Number; 11314957

Toby

Having this in your bashrc may be problematic

export PATH=$HOME/.local/cylc/bin:$PATH
export PATH=$HOME/.local/rose/bin:$PATH
export PATH=$HOME/.local/fcm/bin:$PATH
export PATH=$HOME/.local:$PATH

This is likely the cause of the cylc version mismatch. Check that the version isn’t set in the suite.
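
A quick way to check for this, as a minimal sketch: list every PATH entry under $HOME/.local and see which executables the shell actually resolves. The function name local_path_entries is hypothetical, and the sketch assumes any shadowing installs live under $HOME/.local, as in the lines above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every PATH entry that is $HOME/.local or lies
# below it, i.e. entries that could shadow the centrally installed tools.
local_path_entries() {
    local entry
    local IFS=':'
    for entry in $PATH; do
        case "$entry" in
            "$HOME/.local"|"$HOME/.local"/*) printf '%s\n' "$entry" ;;
        esac
    done
}

local_path_entries

# Show which executable the shell would run for each tool.
for tool in cylc rose fcm; do
    command -v "$tool" || echo "$tool: not on PATH"
done
```

If the function prints nothing and the tools resolve to the central install, the PATH lines are probably not the culprit.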

Grenville

Hi Grenville,

Thank you for the response above (on CMS).

I restarted everything from scratch and the same problem persists as you can see from my screenshot https://www.tobymarthews.com/uploads/1/1/3/1/11315558/screenshot.png . Could I ask one more question please?

  • All looks fine from this screenshot taken just now, but these tasks have been ‘submitted’ (=PENDING according to sacct) for 9 hours now and refuse to start ‘running’ even though the server load is quite light (see below). Eventually, after 24 hours or so, they will all simply ‘fail’ because of the walltime limit on par-multi, but I have no idea why this happens to me when apparently it doesn’t happen to other users.
  • I have removed those PATH= specifications from my .bashrc, so I am definitely using the globally installed Cylc version 7.8.12.

QUESTION: I notice that squeue tells me “1 (QOSMaxJobsPerUserLimit)” (see screenshot). According to the Slurm Workload Manager “Resource Limits” page, this is a maximum number of jobs per user: have I been limited to 1 job at a time by any chance?

Apart from that I simply can’t see what is wrong: please help!

Are there any more diagnostic commands that could perhaps give me more information on what is going wrong here?
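
One possible diagnostic, as a rough sketch rather than a definitive recipe: ask squeue for the state and reason of each of your jobs and count the pending ones per reason. The helper name summarise_pending is hypothetical; the squeue options used (-u, -h, -o with %i, %T and %r) are standard Slurm.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read "JOBID STATE REASON" lines on stdin and print
# one "COUNT REASON" line for each reason attached to a PENDING job.
summarise_pending() {
    awk '$2 == "PENDING" {
             reason = $0
             sub(/^[^ ]+ +[^ ]+ +/, "", reason)   # drop the job ID and state columns
             count[reason]++
         }
         END { for (r in count) print count[r], r }' | sort
}

# Only query the scheduler where the Slurm client tools are installed.
if command -v squeue >/dev/null 2>&1; then
    # %i = job ID, %T = job state, %r = reason the job is in that state
    squeue -u "${USER:-$(id -un)}" -h -o "%i %T %r" | summarise_pending
fi
```

For a single job, scontrol show job followed by the job ID prints the full record, including the JobState and Reason fields.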

Very many thanks,

Toby


** JASMIN shared host status at 08:58:22 on 2023-05-12 **

Average load on each VM over the last hour:

Host               Users  Free memory  CPU
sci1.jasmin.ac.uk  17     28.2G        16.0%
sci2.jasmin.ac.uk  19     18.8G        31.0%
sci3.jasmin.ac.uk  28     455.7G       7.0%
sci4.jasmin.ac.uk  17     23.7G        15.0%
sci5.jasmin.ac.uk  11     29.6G        1.0%
sci6.jasmin.ac.uk  32     565.1G       34.0%
sci8.jasmin.ac.uk  0      387.7G       0.0%

[tmarthews@login2 ~]$ jcylc

Hi Toby,

No new jobs are being run on LOTUS at the moment. You should have received an email from JASMIN yesterday (see below). Looking at the par-multi queue, everything is held with the reason QOSMaxJobsPerUserLimit.
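
The same check can be sketched from the command line: if every pending job in a partition reports one and the same reason, a site-wide hold is more likely than a per-user problem. The helper name common_reason is hypothetical, and par-multi is simply the queue discussed in this thread.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read one pending reason per line on stdin.
# Prints the reason and succeeds only if all non-empty lines are identical.
common_reason() {
    awk 'NF { seen[$0]++; last = $0 }
         END {
             n = 0
             for (r in seen) n++
             if (n == 1) { print last; exit 0 }
             exit 1
         }'
}

# Only query the scheduler where the Slurm client tools are installed.
if command -v squeue >/dev/null 2>&1; then
    # -p selects the partition, -t the job state, %r prints the reason only
    if reason=$(squeue -p par-multi -t PENDING -h -o "%r" | common_reason); then
        echo "All pending par-multi jobs share one reason: $reason"
    else
        echo "Pending par-multi jobs have mixed reasons (or none are pending)"
    fi
fi
```

scontrol show reservation lists any active reservations, which may help confirm whether one is blocking the queue.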

Cheers,
Ros.


Dear Jasmin user,
The issue with PFS storage reported earlier is also affecting the batch cluster LOTUS. Batch jobs are timing out and many compute nodes are going into a drain state. Please note that we are going to have to stop new LOTUS jobs running temporarily. We apologise for the disruption caused by the storage issues but are working to understand and resolve them as soon as possible.
JASMIN Team

Hi Rosalyn,

Thank you very much for the reply below: my runs all worked perfectly on JASMIN after the reservation had been lifted.
