Help with cylc and slurm with JULES on JASMIN

Hi Patrick
I am having a great deal of trouble getting my one-off TAMSAT-ALERT forecast job to run. It was stuck on par-multi for a week* and I have now re-submitted it to the test queue. But the test queue is also full, with about 200 jobs ahead of mine. Jobs are running ~3 at a time, which means that my jobs won’t run any time soon. What I’m wondering now is whether I can run this without submitting to LOTUS. The suite is in

/gws/nopw/j04/tamsat/eblack/cylc/u-ci349

I have tried but failed to figure out how to do this…

The first job, which runs jules to the last day for which we have driving data, does not submit a job. It just runs.

The second job, which runs the ensemble forecast (jules-alert), submits a job to SLURM.

I have looked at the code and I can’t see any difference between the submission scripts for jules-alert and jules.

Any help in getting the jules-alert jobs to run sequentially without submission to SLURM would be much appreciated!

Many thanks in advance,
Emily

*Fatima from JASMIN tried to help me with this, but even bumping up the priority on par-multi made no difference.

Hi Emily
I would be happy to try to help.

Since you are able to run it 3 at a time in the test queue, maybe you don’t need to use the par-multi queue? Maybe you can use the short-serial queue instead? How long are your jobs supposed to take? Less than 4 hours? Less than 48 hours? The par-multi queue is really only needed if you want to do parallel processing, whereas the short-serial queue works for jobs without parallel processing and has a lot more nodes available than the test queue does.
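If it helps to gauge which queue to aim for, something like the following on a JASMIN sci machine will show the time limits, node counts, and availability of the partitions, and how busy they are (sinfo and squeue are standard SLURM commands; the partition names are the JASMIN ones discussed above):

  # show time limit, node count, and availability for each partition
  sinfo -p short-serial,test,par-multi -o "%P %l %D %a"
  # count how many jobs are currently pending in short-serial, for example
  squeue -p short-serial --state=PENDING -h | wc -l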
Patrick

Thanks so much Patrick.

I don’t really know what is actually required in terms of computational resources for the job as Ewan set it up. It is all of Africa at 0.25 degrees, but we only need to run each ensemble member for 150 days.

Thanks again
Emily

Hi Emily:

I just applied for access to the tamsat GWS, since I don’t currently have access to /gws/nopw/j04/tamsat/eblack/cylc/u-ci349

I have also checked out a copy of u-ci349 from MOSRS. I don’t know if it’s the same as the version in /gws/nopw/j04/tamsat/eblack/cylc/u-ci349
My copy of it is in ~pmcguire/roses/u-ci349

Are you restricted to running in the test queue right now? See: SLURM queues - JASMIN help docs

This restriction is for new accounts or new workflows. Is your JASMIN account new?

If you are not restricted, and since this workflow has been pretty well tested already, you might consider changing from the test queue to the par-single queue.

The par-single queue might be more available than either the par-multi queue or the test queue.

Since you are using mpirun as the ROSE_LAUNCHER, you currently have --ntasks set to 8, and you are simulating all of Africa, I suspect, from prior regional-modelling experience, that you really do need one of the parallel queues such as test, par-multi, or par-single (and not short-serial, for example).
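For reference, I would expect the relevant parts of suite.rc to look roughly like the sketch below. I haven’t checked your copy, so the task name and exact layout here are only a guess:

  [runtime]
      [[jules_alert]]            # guessed task name for the ensemble forecast
          [[[environment]]]
              ROSE_LAUNCHER = mpirun
          [[[job]]]
              batch system = slurm
          [[[directives]]]
              --partition=test
              --ntasks=8
              --exclusive=user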

You can make this change as follows (a command-line sketch of the same steps follows the list):

  1. changing directory to the suite directory
  2. editing your suite.rc file with vi or emacs or something, so that the two occurrences of --partition=test are replaced by --partition=par-single
  3. also, you can try editing your suite so that --exclusive=user is commented out in both places in suite.rc; I suspect that leaving this option in can make queueing times skyrocket
  4. rose suite-run --reload
  5. if the cylc GUI doesn’t appear, then do a rose sgc
  6. retrigger any of the failed apps.
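Put together, the whole change could look roughly like this on a JASMIN sci machine. The sed one-liners are just a shortcut for steps 2 and 3; editing suite.rc by hand is equally fine, and the exact spelling of the options in your file may differ slightly:

  # 1. go to the suite directory
  cd /gws/nopw/j04/tamsat/eblack/cylc/u-ci349
  # 2. replace both occurrences of the test partition with par-single
  sed -i 's/--partition=test/--partition=par-single/g' suite.rc
  # 3. comment out the exclusive-node requests in both places
  sed -i 's/^\( *\)--exclusive=user/\1# --exclusive=user/g' suite.rc
  # 4. tell the running suite to pick up the changed suite.rc
  rose suite-run --reload
  # 5. bring the Cylc GUI back up if it doesn't reappear on its own
  rose sgc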

Patrick

Hi again, Emily:

And yes, the JASMIN rules are that for parallel processing, we can’t run the jobs interactively on the sci* VMs. We should use LOTUS/SLURM for the parallel processing.

Since your code currently seems to require parallel processing, then I think you will need to use LOTUS/SLURM.

The JASMIN rules for this are stated in the JASMIN help docs, where it says ‘serial’ jobs only.

Patrick

Hi again2, Emily:

I have been granted access to the tamsat GWS, and I have done a

diff -r /gws/nopw/j04/tamsat/eblack/cylc/u-ci349 ~pmcguire/roses/u-ci349

There are a lot of differences. You might consider checking out a fresh copy of the suite and checking in your changes.

Or I guess you could check in your changes directly in the original suite.
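If you go the checkout-and-commit route, the usual pattern is roughly the following (this assumes your MOSRS credentials are set up on JASMIN; the working-copy path is just the rosie default):

  # take a fresh working copy of the suite from MOSRS
  rosie checkout u-ci349
  cd ~/roses/u-ci349
  # ...re-apply or copy in the changes from the GWS copy here...
  fcm status    # review which files have changed
  fcm commit    # check the changes in to the repository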

But I still recommend trying the par-single option I suggested earlier, together with maybe getting rid of the exclusive node requirement.

Alternatively, you could try par-multi with --ntasks=4 instead of 8, maybe a longer wall-clock time, and again maybe getting rid of the exclusive-node requirement.

With the par-single option, I think it will try to use a single node anyway. With the alternative par-multi option, it might fail, since it might want a node exclusively and might try to put the 4 tasks on separate nodes. But with --ntasks=4 and no exclusive-node requirement, it might actually get through the queue and start running. A rough sketch of the directives for that alternative is below.
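As a sketch, the par-multi alternative would mean SLURM directives in suite.rc along these lines (the wall-clock value is only an example, and the exact section layout in your suite may differ):

  [[[directives]]]
      --partition=par-multi
      --ntasks=4
      --time=24:00:00        # example longer wall-clock limit; adjust as needed
      # --exclusive=user     # dropped, as suggested above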

Patrick