JULES on SLURM

Hi Patrick,

I am running the standalone version of JULES for the UK, and it has been working well on ‘lsf’. I have now managed to get ‘fcm_make’ working on SLURM after reading many of the help queries and answers between you and other users on the NCAS page: http://cms.ncas.ac.uk/query .

My jobs work slightly differently from the ‘GL’ versions, as we separate only ‘fcm_make’ and ‘jules’. My problem now is that ‘fcm_make’ succeeds and ‘jules’ runs without producing any output! It creates the ‘OUTPUT_DIR’, but nothing is written to that folder. Do you have any idea what could be missing in my configuration? Do I need a ticket to get this question sorted out? If so, please issue me a ticket.

My suite is ‘u-bz926’, in case you want to have a look.

Thanks for any hint!
Best regards,
Semeena

Hi Semeena

I looked at your suite u-bz926.

I note that you’re using the -q flag for directives in your suite.rc file, but that flag doesn’t work on SLURM. It has been replaced with -p or --partition.

Maybe that -q flag gets ignored. Maybe you need to replace it with -p.
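
For illustration, here is a rough sketch of the header that ends up at the top of a SLURM job script once the directives from suite.rc are written into it. The partition name and the time limit are only example values (short-serial and one hour), not necessarily what your suite should use.

```shell
#!/bin/bash
# Sketch of a SLURM job header; the partition and time limit are example values.
# Under LSF the queue was chosen with "#BSUB -q <queue>"; SLURM uses --partition (or -p).
#SBATCH --partition=short-serial
#SBATCH --time=01:00:00

# SLURM sets SLURM_JOB_PARTITION inside a running job, so this shows what was actually used.
partition_msg="partition: ${SLURM_JOB_PARTITION:-unknown (not running under SLURM)}"
echo "$partition_msg"
```

Printing the partition from inside the job is a cheap way to confirm that the directive was honoured rather than silently ignored.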

I note that you have an error message that your run timed out due to wallclock time:

/home/users/semval/cylc-run/u-bz926/log/job/1/jules/01/job.err :

slurmstepd: error: *** JOB 26512316 ON host442 CANCELLED AT 2020-11-27T20:00:45 DUE TO TIME LIMIT ***

2020-11-27T20:00:50Z CRITICAL - failed/EXIT

You have 1 hour of wallclock time. Is that enough for your run? Maybe with SLURM, it is not enough.

But maybe there are other problems.

Patrick

Hi Patrick,

Thank you very much for getting back to me and looking at my suite.

I have made the corresponding changes in my ‘suite.rc’ by replacing the ‘-q’ flag with ‘-p’ and resubmitted the jobs but with no success.

I have also requested more walltime, though my jobs should complete within a few minutes, as I’m testing the model setup with a short 3-month run.

Crossing my fingers and hoping to sort it out soon.

Best regards,

Semeena

Hi Semeena

I looked again at your suite.
You are using both -p and --partition. You only need to use one of those, since they are the same thing.
And maybe it’s getting confused by using both. Probably not, but you should fix it.

Also, you are compiling as a background job. It might be better to compile on the SLURM short-serial or short-serial-4hr queue.
Do you know what processor type is used by the batch node that the SLURM par-multi queue assigned to you? If you’re compiling as a background job on an interactive node, the processor type of the interactive node needs to match that of the batch node.
I don’t know if this is the problem for you, but it’s a possibility to keep in mind.
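
One way to compare the two is to run the same small snippet on the interactive node and inside a batch job, then compare the lines it prints. This is only a sketch: cpu_model is a made-up helper name, and it assumes a Linux node where /proc/cpuinfo is readable.

```shell
# Print the CPU model of whichever node this runs on.
# Falls back to the machine architecture if /proc/cpuinfo has no "model name" field.
cpu_model() {
    model=$(grep -m1 'model name' /proc/cpuinfo 2>/dev/null | cut -d: -f2- | sed 's/^ *//')
    if [ -z "$model" ]; then
        model=$(uname -m)
    fi
    printf '%s\n' "$model"
}

# uname -n gives the node name, so the two outputs can be told apart.
echo "$(uname -n): $(cpu_model)"
```

If the two model strings differ, an executable built with processor-specific optimisation on one node may not behave on the other.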

I looked at your jules executable that you compiled:
ls -ltr ~semval/cylc-run/u-bz926/share/fcm_make/build/bin/jules.exe
It seems to be there. I am not sure whether it is picking up all the libraries it needs when it is run.
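
A quick way to check the libraries is ldd, which lists the shared libraries an executable needs and marks any it cannot resolve as "not found". The sketch below wraps that in a hypothetical helper, check_libs. It should be run in the same environment (same modules loaded) as the job itself, otherwise the answer may differ.

```shell
# Report shared libraries that the dynamic linker cannot resolve for an executable.
# Returns 2 if the file is not executable, 1 if libraries are missing, 0 otherwise.
check_libs() {
    if [ ! -x "$1" ]; then
        echo "not an executable: $1"
        return 2
    fi
    if ldd "$1" 2>/dev/null | grep 'not found'; then
        echo "unresolved libraries for $1"
        return 1
    fi
    echo "no unresolved libraries reported for $1"
}

# Path as quoted in this thread; adjust it if your build lives elsewhere.
exe=~semval/cylc-run/u-bz926/share/fcm_make/build/bin/jules.exe
check_libs "$exe" || true
```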
I tried this:
sbatch ~semval/cylc-run/u-bz926/log/job/1/jules/01/job
This should try to start the run by submitting it to the queue.
To do this, I had to copy your job script and modify the log directories:
sbatch ~pmcguire/test/semval/job
Right now, I have a job submitted like this that has been running for 55 minutes. It hasn’t crashed yet, but there is no output yet either.

I think you might want to change this line in your suite.rc file, so that it doesn’t have --quiet in it:
rose task-run --quiet --path=share/fcm_make/build/bin
Then your log files might make more sense and give you log messages and error messages from when JULES is running.
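
As a small sketch of the edit, this applies the change to the command line quoted above and prints the result. It works on a copy of the line held in a variable, so it does not touch suite.rc itself.

```shell
# The command line as quoted from suite.rc above.
original='rose task-run --quiet --path=share/fcm_make/build/bin'

# Strip the --quiet option together with the space in front of it.
edited=$(printf '%s\n' "$original" | sed 's/ --quiet//')
echo "$edited"
```

The printed line should read: rose task-run --path=share/fcm_make/build/bin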

Patrick

Hi Patrick,

I have made the modifications in my suite.rc now and submitted another run, this time trying to write out one more set of output. In the initial one, I had only one set of ‘daily output’ to be written out. Now, I have specified another set of ‘monthly output’ to be written out as well.

That is exactly the problem I am having: it runs, as in your test case, but does not produce any output. It will simply run until it reaches the wallclock time :(

Semeena

Hi Semeena

I am trying to run that job script of yours with print_step decreased from 4232 to 1.

In that case info is printed to the log file every time step instead of every 4232 time steps.

Maybe you can try the same thing?
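
To see why this matters, here is some back-of-the-envelope arithmetic. The run length (90 days) and time step (3600 seconds) are purely illustrative assumptions, not values read from u-bz926, and the sketch assumes a message appears only at multiples of print_step. The helper name messages is made up.

```shell
# Number of progress messages expected when one is printed every print_step time steps.
# All run settings used below are illustrative assumptions, not values from u-bz926.
messages() {
    # $1 = run length in days, $2 = time step length in seconds, $3 = print_step
    steps=$(( $1 * 86400 / $2 ))
    echo $(( steps / $3 ))
}

messages 90 3600 4232   # prints 0
messages 90 3600 1      # prints 2160
```

Under those assumptions the whole run has fewer than 4232 time steps, so the log would stay empty even if JULES were stepping along normally.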

PCM

Hi Patrick,

Sure, I had it as 4232 earlier because I was producing daily outputs and didn’t want to end up with a huge log file!

Will try with print_step as ‘1’ and keep you updated.
Best regards,

Semeena

You might also consider switching mpirun.lotus to mpirun. I am trying that now.

Patrick

Ok, let me try switching to ‘mpirun’. I am also considering doing a complete make clean, recompiling and redoing the runs. The ‘~semval/cylc-run/u-bz926/share/fcm_make/build/bin/jules.exe’ was created on 27th Nov. I am not sure which library options it would have used.

Cheers,
Semeena

The mpirun.lotus is used in a couple of places in the subdirectories, even if it is not used in the suite.rc file.

PCM

Hi Patrick,

I did notice that it is using ‘mpirun’ in suite.rc. Where else is mpirun.lotus used? Unfortunately, I can’t find it.

Semeena

Hi Semeena

You can search for the occurrences with

grep -r mpirun *
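
A slightly narrower variant is sketched below. It searches from the current directory, which also covers hidden files that * would skip at the top level, and it matches only the literal string mpirun.lotus. The throwaway directory and file names are made up so that it can be tried safely outside the suite.

```shell
# Build a small throwaway directory tree so the search can be tried safely.
# The file names here are made up for illustration.
demo=$(mktemp -d)
mkdir -p "$demo/app/jules"
printf 'mpirun jules.exe\n' > "$demo/suite.rc"
printf 'mpirun.lotus jules.exe\n' > "$demo/app/jules/launcher.conf"

# -r recurses into subdirectories, -n shows line numbers, -F treats the dot literally.
matches=$(cd "$demo" && grep -rnF 'mpirun.lotus' .)
echo "$matches"
rm -rf "$demo"
```

In the suite directory itself, only the grep -rnF 'mpirun.lotus' . line is needed.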

Patrick
