I am having a great deal of trouble getting my one-off TAMSAT-ALERT forecast job to run. It was stuck on par-multi for a week*, and I have now re-submitted it to the test queue. But the test queue is also full, with about 200 jobs ahead of mine. Jobs are running ~3 at a time, which means that my jobs won’t run any time soon. What I’m wondering now is whether I can run this without submitting to LOTUS. The suite is in
I have tried but failed to figure out how to do this…
The first job, which runs jules up to the last day for which we have driving data, does not submit a batch job; it just runs. The second job, which runs the ensemble forecast (jules-alert), submits a job to SLURM.
I have looked at the code and I can’t see any difference between the submission scripts for jules-alert and jules.
Any help in getting the jules-alert jobs running sequentially without submission to SLURM would be much appreciated!
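For what it’s worth, in Rose/Cylc 7 suites the thing that decides background-vs-SLURM is usually the per-task [[[job]]] settings in suite.rc, not the task script itself. A sketch of what that might look like (the task names are from this thread; the actual settings in the suite may differ):

```ini
# Hypothetical Cylc 7 suite.rc runtime section -- a sketch, not the
# actual contents of u-ci349.
[runtime]
    [[jules]]
        # No batch system given, so this task runs directly
        # ("background") on the host running the suite.
        script = rose task-run
    [[jules-alert]]
        script = rose task-run
        [[[job]]]
            batch system = slurm
        [[[directives]]]
            --partition = test
            --ntasks = 8
```

If the suite is set up this way, the two submission scripts can look identical even though one task goes to SLURM and the other runs directly, because the difference lives in the [[[job]]] settings rather than in the script.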
Many thanks in advance,
*Fatima from JASMIN tried to help me with this, but even bumping up the priority on par-multi made no difference.
I would be happy to try to help.
Since you are able to run it 3 at a time in the test queue, maybe you don’t need to use the par-multi queue? Maybe you can use the short-serial queue instead? How long are your jobs supposed to take? Less than 4 hours? Less than 48 hours? The par-multi queue is really only needed if you want to do parallel processing. The short-serial queue works without parallel processing, and it has a lot more nodes available than the par-multi queue does.
Thanks so much, Patrick.
I don’t really know what is actually required in terms of computational resources for the job as Ewan set it up. It is all of Africa at 0.25 degrees, but we only need to run each ensemble member for 150 days.
I just applied for access to the tamsat GWS, since I don’t currently have access to
I have also checked out a copy of u-ci349 from MOSRS. I don’t know if it’s the same as the version in
My copy of it is in
Are you restricted to running in the test queue right now? See: SLURM queues - JASMIN help docs
This restriction is for new accounts or new workflows. Is your JASMIN account new?
If you are not restricted, and since this workflow has been pretty well tested already, you might consider changing from the test queue to the par-single queue. The par-single queue might be more available than either the par-multi queue or the test queue.
I suspect that since you are using mpirun as the ROSE_LAUNCHER, since you currently have --ntasks set to 8, and since you are simulating all of Africa, then from prior regional-modelling experience, you actually do need a parallel queue like par-single (and not short-serial, for example).
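For reference, the launcher and task count mentioned above would typically sit in the task’s suite.rc sections something like this (a sketch; only ROSE_LAUNCHER, --ntasks=8, and the partition name come from this thread):

```ini
# Hypothetical placement of the parallel settings -- check the real
# suite.rc for the actual layout.
[[jules-alert]]
    [[[environment]]]
        ROSE_LAUNCHER = mpirun   # parallel launcher, per the thread
    [[[directives]]]
        --partition = test
        --ntasks = 8             # 8 MPI tasks, per the thread
```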
You can make this change by:
- changing directory to the suite directory
- editing your suite.rc file with emacs or something, so that the two occurrences of --partition=test are replaced by --partition=par-single
- also trying to edit your suite so that --exclusive=user is commented out in both places in suite.rc; I suspect this can cause queueing times to skyrocket if it is not commented out
- running rose suite-run --reload
- if the cylc GUI doesn’t appear, then do a
- retriggering any of the failed apps.
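The suite.rc edits in the steps above can also be done non-interactively with sed. A sketch, assuming the file contains exactly the strings --partition=test and --exclusive=user (the real suite may format these differently, so check with grep first; needs GNU sed for -i):

```shell
#!/bin/sh
# Sketch: switch the SLURM partition and comment out the exclusive-node
# directive in suite.rc. In real use, skip the heredoc below and run
# the sed/grep lines inside the actual suite directory.

# Make a demo suite.rc so these commands can be tried anywhere.
cat > suite.rc <<'EOF'
        [[[directives]]]
            --partition=test
            --exclusive=user
        [[[directives]]]
            --partition=test
            --exclusive=user
EOF

cp suite.rc suite.rc.bak                                  # keep a backup
sed -i 's/--partition=test/--partition=par-single/g' suite.rc
sed -i 's/--exclusive=user/# --exclusive=user/' suite.rc  # comment out
grep -c 'partition=par-single' suite.rc                   # prints 2
```

After the edits, a `rose suite-run --reload` picks up the changed suite.rc without stopping the suite.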
Hi again, Emily:
And yes, the JASMIN rules are that for parallel processing, we can’t run the jobs interactively on the sci* VMs. We should use LOTUS/SLURM for the parallel processing.
Since your code currently seems to require parallel processing, I think you will need to use LOTUS/SLURM.
The JASMIN rules for this are stated here:
where it says ‘serial’ jobs only.
Hi again2, Emily:
I have been granted access to the tamsat GWS, and I have done a
diff -r /gws/nopw/j04/tamsat/eblack/cylc/u-ci349 ~pmcguire/roses/u-ci349
There are a lot of differences. You might consider checking out a fresh copy of the suite and checking in your changes.
Or I guess you could check in your changes directly in the original suite.
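To get an overview of which files differ, rather than the full line-by-line output, diff’s -q flag is handy. A sketch with two throwaway directories standing in for the two u-ci349 copies above:

```shell
#!/bin/sh
# Sketch: list only the names of files that differ between two suite
# copies. copy_a and copy_b are throwaway examples; substitute the two
# real u-ci349 paths from the diff command above.
mkdir -p copy_a copy_b
echo 'same'      > copy_a/rose-suite.conf
echo 'same'      > copy_b/rose-suite.conf
echo 'version A' > copy_a/suite.rc
echo 'version B' > copy_b/suite.rc

# diff exits 1 when differences are found, hence the || true.
diff -rq copy_a copy_b || true
```

This prints one "Files ... differ" line per differing file, which makes it much easier to see where two large suite copies have diverged.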
But I still recommend trying the par-single option I suggested earlier, together with maybe getting rid of the exclusive-node requirement.
You could also alternatively try ntasks=4 instead of 8, maybe with a longer wall-clock time, and maybe again getting rid of the exclusive-node requirement.
With the par-single option, I think it will try to use a single node anyway. With the alternative par-multi option, maybe it will fail, since it might want to use a node exclusively, and since it might try to put the 4 tasks on separate nodes. But maybe with ntasks=4 and no exclusive-node requirement, it might actually get through the queue and start running.
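Put together, the alternative described above would look something like this in the suite.rc directives (a sketch; the values are illustrative, and the wall-clock limit in particular is a guess to be tuned):

```ini
# Hypothetical revised SLURM directives for jules-alert -- not taken
# from the actual suite.
[[[directives]]]
    --partition = par-single
    --ntasks = 4               # down from 8
    --time = 24:00:00          # longer wall-clock limit, to be tuned
#   --exclusive = user         # commented out: don't demand a whole node
```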