Job submitted but mysteriously on hold for 6 days

Hi Patrick,

Thank you for running squeue. The job in question was not listed in your output, however: it’s u-cq915. Running the command myself, I get:

     48731862 long-seri u-cq915. tmarthew PD       0:00      1 (ReqNodeNotAvail, Reserved for maintenance)

Is the “ReqNodeNotAvail” the problem? Does this mean the job will sit there until tomorrow after the maintenance?

Best,
Toby

Hi Toby:

I ran squeue -p long-serial, and I see:

     49227560 long-seri u-ct751. tmarthew PD       0:00      1 (ReqNodeNotAvail, Reserved for maintenance)
     49228185 long-seri u-cv906. tmarthew PD       0:00      1 (ReqNodeNotAvail, Reserved for maintenance)
     49228084 long-seri u-cv905. tmarthew PD       0:00      1 (ReqNodeNotAvail, Reserved for maintenance)
     49229358 long-seri u-cs296. tmarthew PD       0:00      1 (ReqNodeNotAvail, Reserved for maintenance)

There is scheduled JASMIN maintenance tomorrow.
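If you want to check the maintenance window yourself, the standard Slurm commands below should show it (the exact reservation name on JASMIN may differ):

     scontrol show reservation              # list current/upcoming reservations, incl. maintenance windows
     squeue -u tmarthews -p long-serial     # just your own pending/running jobs in that partition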

Patrick

On Mon, Apr 24, 2023 at 11:27 AM Toby Marthews wrote:

Patrick,

Apologies for emailing you rather than JASMIN IT Support, but my previous problem with the JASMIN queues seems to be arising again.

I have a suite, u-cq915. I compiled and submitted it on 18th April and it has been sitting in the ‘submitted’ state ever since (see the screenshot attached from today).

The CPU loads on JASMIN don’t seem to be extreme (see below), so I’m wondering why this job is stuck in the queue like this. The walltime on this job is the maximum of 1 week (168 hrs), and when it reaches that limit tomorrow the job will simply fail without ever having actually run.

Before that happens tomorrow, could I ask whether there are any diagnostics you can run on the queue to help me understand why this is happening? It doesn’t happen to all my jobs (many go straight to ‘running’), but every once in a while a job falls into this ‘submitted’ waiting area and then eventually fails (because the walltime applies to the whole job, this ‘submitted’ waiting time counts against it). Obviously, I lose a week of waiting every time this happens (!).
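Perhaps something like Slurm’s squeue --start or sprio would shed some light here? I’m not sure whether these are enabled on JASMIN, but as I understand it:

     squeue --start -j 48731862      # estimated start time of the pending job, if the scheduler can compute one
     sprio -j 48731862               # scheduling priority components for the job, if the multifactor plugin is in use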

Have I perhaps used up my ‘credits’ on JASMIN by running too many jobs over the last few months?

Anything you can tell me to shed further light on this situation would be much appreciated.

Best regards,

Toby


** JASMIN shared host status at 10:31:47 on 2023-04-24 **

Average load on each VM over the last hour:

===============================================================
Host                  Users   Free memory     CPU
---------------------------------------------------------------
sci1.jasmin.ac.uk       17        25.2G      31.0%
sci2.jasmin.ac.uk       19         9.0G       2.0%
sci3.jasmin.ac.uk       40      1017.8G       9.0%
sci4.jasmin.ac.uk       16        30.0G      29.0%
sci5.jasmin.ac.uk       19        26.6G       2.0%
sci6.jasmin.ac.uk       39       736.6G      24.0%
sci8.jasmin.ac.uk       31       270.7G      73.0%
===============================================================

Dr Toby Marthews

[Suite info from the attached screenshot:]
succeeded: 1/fcm_make/01
suite: u-cq915
host: cylc1.jasmin.ac.uk
port: 43087
owner: tmarthews

Hi Toby
Yes, that’s how I would interpret the squeue message. You can also run scontrol on each job number to get more info. If a job gets kicked out of the queue, you can do a restart and then retrigger the jules app in the cylc GUI.
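For example, using one of the job numbers above (the restart command is from memory and assumes the usual Rose workflow for a u-* suite, so treat it as a sketch):

     scontrol show job 48731862      # full job record: JobState, Reason, StartTime, TimeLimit, etc.
     rose suite-run --restart        # run from the suite directory to bring the suite back, if it has shut down

then retrigger the jules task in the cylc GUI as usual.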
Patrick

Hi Toby:
It looks like you now have 3 jobs in long-serial that have been running for some 4 hours, now that most of the scheduled maintenance has finished. See:

squeue -p long-serial | grep tm
48731862 long-seri u-cq915. tmarthew  R    4:25:55      1 host588
49227560 long-seri u-ct751. tmarthew  R    4:25:37      1 host458
49229358 long-seri u-cs296. tmarthew  R    4:25:37      1 host458
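If you want to keep an eye on them, sacct should also report elapsed time and state from the Slurm accounting database (assuming accounting is enabled on JASMIN, which I believe it is):

     sacct -j 48731862 --format=JobID,JobName,State,Elapsed,Start,NodeList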

Hopefully they are running correctly, since the maintenance is not completely finished yet; for example, the home directories are still read-only.
Patrick