Hi Patrick
I am having a great deal of trouble getting my one-off TAMSAT-ALERT forecast job to run. It was stuck on par-multi for a week* and I have now re-submitted it to the test queue. But the test queue is also full, with about 200 jobs ahead of mine. Jobs are running ~3 at a time, which means that my jobs won’t run any time soon. What I’m wondering now is whether I can run this without submitting to LOTUS. The suite is in
/gws/nopw/j04/tamsat/eblack/cylc/u-ci349
I have tried but failed to figure out how to do this…
The first job, which runs jules up to the last day for which we have driving data, does not submit a job; it just runs.
The second job, which runs the ensemble forecast (jules-alert), submits a job to SLURM.
I have looked at the code and I can’t see any difference between the submission scripts for jules-alert and jules.
Any help in getting the jules-alert jobs running sequentially without submission to slurm would be much appreciated!
Many thanks in advance,
Emily
*Fatima from JASMIN tried to help me with this, but even bumping up the priority on par-multi made no difference.
Hi Emily
I would be happy to try to help.
Since you are able to run it 3 at a time in the test queue, maybe you don’t need to use the par-multi queue? Maybe you can use the short-serial queue instead? How long are your jobs supposed to take? Less than 4 hours? Less than 48 hours? The par-multi queue is really only needed if you want to do parallel processing. And the short-serial queue can work without parallel processing, and it has a lot more nodes available than does the test queue.
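If you’re not sure how long each job takes, you could ask SLURM for the elapsed time of one of your finished jobs (a sketch; substitute a real job ID from one of your earlier runs):
sacct -j <jobid> --format=JobID,JobName,Partition,Elapsed,State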
Patrick
Thanks so much Patrick.
I don’t really know what is actually required in terms of computational resources for the job, as Ewan set it up. It is all of Africa at 0.25 degrees, but we only need to run each ensemble member for 150 days.
Thanks again
Emily
Hi Emily:
I just applied for access to the tamsat GWS, since I don’t currently have access to /gws/nopw/j04/tamsat/eblack/cylc/u-ci349
I have also checked out a copy of u-ci349 from MOSRS. I don’t know if it’s the same as the version in /gws/nopw/j04/tamsat/eblack/cylc/u-ci349
My copy of it is in ~pmcguire/roses/u-ci349
Are you restricted to running in the test queue right now? See: SLURM queues - JASMIN help docs
This restriction is for new accounts or new workflows. Is your JASMIN account new?
If you are not restricted, and since this workflow has been pretty well tested already, you might consider changing from the test queue to the par-single queue. The par-single queue might be more available than either the par-multi queue or the test queue.
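To get a feel for how busy each queue is before switching, you could query SLURM directly (a sketch using standard SLURM commands):
sinfo -p par-single,par-multi,test
squeue -p par-single | wc -l
The first shows node states per partition; the second gives a rough count of jobs currently in the par-single queue.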
I suspect that since you are using mpirun as the ROSE_LAUNCHER, since you currently have --ntasks set to 8, and since you are simulating all of Africa, and from prior regional modelling experience, that you actually do need the parallel queues like test or par-multi or par-single (and not short-serial, for example).
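For reference, the relevant part of the suite probably looks something like this (a hypothetical excerpt; the task name and exact layout in your suite.rc will differ):
[runtime]
    [[jules_alert]]
        [[[environment]]]
            ROSE_LAUNCHER = mpirun
        [[[job]]]
            batch system = slurm
        [[[directives]]]
            --partition=test
            --ntasks=8
            --exclusive=user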
You can make this change by:
- changing directory to the suite directory
- editing your suite.rc file with vi or emacs or something, so that the two occurrences of --partition=test are replaced by --partition=par-single (see the sketch after this list)
- also, you can try editing your suite so that --exclusive=user is commented out in both places in suite.rc. This can cause queueing times to skyrocket if it is not commented out, I suspect.
- running rose suite-run --reload
- if the cylc GUI doesn’t appear, then doing a rose sgc
- retriggering any of the failed apps.
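Here is a minimal sketch of those steps as shell commands (it assumes the directives are written exactly as --partition=test and --exclusive=user; do check the edited suite.rc by eye before reloading):
cd /gws/nopw/j04/tamsat/eblack/cylc/u-ci349
# swap both occurrences of the test partition for par-single
sed -i 's/--partition=test/--partition=par-single/g' suite.rc
# comment out the exclusive-node directive in both places
sed -i 's/^\( *\)--exclusive=user/\1# --exclusive=user/' suite.rc
# reload the running suite so the new settings take effect
rose suite-run --reload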
Patrick
Hi again, Emily:
And yes, the JASMIN rules are that for parallel processing, we can’t run the jobs interactively on the sci* VMs. We should use LOTUS/SLURM for the parallel processing.
Since your code currently seems to require parallel processing, I think you will need to use LOTUS/SLURM.
The JASMIN rules for this are stated here, where it says ‘serial’ jobs only.
Patrick
Hi again2, Emily:
I have been granted access to the tamsat GWS, and I have done a
diff -r /gws/nopw/j04/tamsat/eblack/cylc/u-ci349 ~pmcguire/roses/u-ci349
There are a lot of differences. You might consider checking out a copy of the suite and checking in your changes.
Or I guess you could check in your changes directly in the original suite.
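If you go the checkout route, it would look something like this (a sketch; it assumes you have MOSRS access set up on JASMIN, and rosie puts the copy in ~/roses by default):
rosie checkout u-ci349
cd ~/roses/u-ci349
# merge in your edits from the GWS copy, then:
fcm commit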
But I still recommend trying the par-single option I suggested earlier, together with maybe getting rid of the exclusive-node requirement.
You could also alternatively try par-multi with ntasks=4 instead of 8, and maybe a longer wall clock time, and maybe again getting rid of the exclusive-node requirement.
With the par-single option, I think it will try to use a single node anyway. With the new alternative par-multi option, maybe it will fail, since it might want to use a node exclusively, and since it might try to put the 4 tasks on separate nodes. But with ntasks=4 and no exclusive-node requirement, it might actually get through the queue and start running.
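As a sketch, the par-multi variant of the directives might look like this (the wall clock value is an assumption; pick one that comfortably covers a 150-day ensemble-member run):
[[[directives]]]
    --partition=par-multi
    --ntasks=4
    --time=24:00:00
    # --exclusive=user   (commented out)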
Patrick