I am running the standalone version of JULES for the UK, and it has been working well on ‘lsf’. I have now managed to get ‘fcm_make’ working on SLURM after reading many of the help queries and answers between you and other users on the NCAS page (http://cms.ncas.ac.uk/query).
My jobs work slightly differently from the ‘GL’ versions, as we separate only ‘fcm_make’ and ‘jules’. My problem now is that ‘fcm_make’ succeeds and ‘jules’ runs, but without producing any output. It creates the ‘OUTPUT_DIR’, but nothing is written to that folder. Do you have any idea what could be missing in my configuration? Do I need a ticket to get this question sorted out? If so, please issue me a ticket.
My suite is ‘u-bz926’, in case you want to have a look.
I note that you’re using the -q flag for directives in your suite.rc file, but that flag doesn’t work on SLURM. It has been replaced with -p or --partition.
Maybe that -q flag gets ignored. Maybe you need to replace it with -p.
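A quick way to check for leftover LSF-style flags is to search the suite.rc for them. This is only a sketch: the function name find_q_directives and the example path are made up for illustration, and it assumes directives are written one per line in the form "-q = value". Note also that sbatch itself reads -q as the short form of --qos, not as a queue name.

```shell
#!/bin/sh
# Print (with line numbers) any directive lines in a suite.rc-style file
# that still use the LSF-style -q flag.
# Hypothetical helper, not part of cylc or rose.
find_q_directives() {
    grep -nE '^[[:space:]]*-q[[:space:]]*=' "$1"
}

# Example usage (the path is an assumption):
# find_q_directives ~/roses/u-bz926/suite.rc
```

If this prints nothing, no "-q = ..." directive lines remain in the file.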
I note that you have an error message that your run timed out due to wallclock time:
Thank you very much for getting back to me and looking at my suite.
I have made the corresponding changes in my ‘suite.rc’ by replacing the ‘-q’ flag with ‘-p’ and resubmitted the jobs but with no success.
I have also requested more walltime, although my jobs should complete within a few minutes, as I’m testing the model setup with a short 3-month run.
I looked again at your suite.
You are using both -p and --partition. You only need to use one of those, since they are the same thing.
And maybe it’s getting confused by using both. Probably not, but you should fix it.
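To see whether the partition is being set more than once, something like the following could be used. It is a rough sketch: count_partition_directives is a made-up helper name, and it assumes each directive sits on its own line as "-p = value" or "--partition = value".

```shell
#!/bin/sh
# Count directive lines that set the SLURM partition, in either the
# short (-p) or long (--partition) form.
# Hypothetical helper, not part of cylc or rose.
count_partition_directives() {
    grep -cE '^[[:space:]]*(-p|--partition)[[:space:]]*=' "$1"
}

# Example usage (the path is an assumption):
# count_partition_directives ~/roses/u-bz926/suite.rc
```

The count covers the whole file, so a suite with several tasks may legitimately give more than 1. What matters is that no single [[[directives]]] section sets the partition twice.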
Also, you are compiling as a background job. It might be better to compile on the SLURM short-serial or short-serial-4hr queue.
Do you know what processor type is used by the batch node that the SLURM par-multi queue assigned to you? If you’re compiling as a background job on an interactive node, the processor type of the interactive node needs to match that of the batch node.
I don’t know if this is the problem for you, but it’s a possibility to be thinking of.
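One way to compare the processor types is to print the CPU model on each node and see whether they agree. The sketch below assumes a Linux node where /proc/cpuinfo exists. cpu_model is a made-up helper name, and the sbatch line in the comments is only an illustration of how the same check might be run on a batch node.

```shell
#!/bin/sh
# Print the first "model name" entry from a cpuinfo-style file.
# Defaults to /proc/cpuinfo, which is present on Linux nodes.
# Hypothetical helper for illustration.
cpu_model() {
    grep -m1 '^model name' "${1:-/proc/cpuinfo}" | cut -d: -f2- | sed 's/^[[:space:]]*//'
}

# On the interactive node where the compile ran:
#   cpu_model
# On a batch node (assumed partition name; output lands in slurm-<jobid>.out):
#   sbatch --partition=par-multi --wrap="grep -m1 'model name' /proc/cpuinfo"
```

If the two model names differ, an executable built with processor-specific optimisation on one node may not behave properly on the other.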
I looked at your jules executable that you compiled:
ls -ltr ~semval/cylc-run/u-bz926/share/fcm_make/build/bin/jules.exe
It seems to be there. I am not sure whether, when it is run, it picks up all the libraries it needs.
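A simple check for missing shared libraries is to run ldd on the executable and look for "not found" entries. This is a sketch rather than a definitive test: unresolved_libs is a made-up helper name, and the result is only meaningful if ldd is run in the same environment (same modules and library paths) that the job itself uses.

```shell
#!/bin/sh
# Read ldd-style output on stdin and print only the lines for
# libraries that the dynamic linker could not resolve.
# Hypothetical helper for illustration.
unresolved_libs() {
    grep 'not found'
}

# Typical use, on the node and in the environment where JULES runs:
# ldd ~semval/cylc-run/u-bz926/share/fcm_make/build/bin/jules.exe | unresolved_libs
```

No output means that every library could be resolved in that environment.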
I tried this:
sbatch ~semval/cylc-run/u-bz926/log/job/1/jules/01/job
This should try to start the run by submitting it to the queue.
To do this, I had to copy your job script and modify the log directories:
sbatch ~pmcguire/test/semval/job
Right now, I have a job like this that has been running for 55 minutes. It hasn’t crashed yet, but there is no output yet either.
I think you might want to change this line in your suite.rc file, so that it doesn’t have --quiet in it:
rose task-run --quiet --path=share/fcm_make/build/bin
Then your log files might make more sense and give you log messages and error messages from when JULES is running.
I have now made the modifications in my suite.rc and submitted another run, this time trying to write out one more set of output. In the initial run, I had only one set of ‘daily output’ to be written out. Now I have specified another set of ‘monthly output’ to be written out as well.
That is exactly the problem I am having: it runs, as in your test case, but does not produce any output. It will simply run until it reaches the wallclock limit :(
OK, let me try switching to ‘mpirun’. I am also considering doing a complete make clean, recompiling, and redoing the runs. The ‘~semval/cylc-run/u-bz926/share/fcm_make/build/bin/jules.exe’ was created on 27th Nov., and I am not sure which library options it would have used.