Jobs not launching, constraint=ivybridge128G

Hi,

I was following Patrick McGuire’s tutorial on running JULES at N flux sites:

Jobs ran on the test queue (--partition=test), but with constraint="ivybridge128G" none of the jobs would launch. I am now testing with constraint="intel", following the information on this page (LOTUS cluster specification - JASMIN help docs).

Patrick has updated his tutorial but I am noting it here in case there is alternative advice (…?) or in case others run into the same issue.

Martin

Martin, it sounds like you have this under control (intel should be good).
The only point I’d add is that if you are doing development work on the code, and want to compare to the KGO, then picking

        --constraint="skylake348G"

would be a good idea. skylake348G is the choice in the trunk.
Similarly, if you want to make runs that are bitwise reproducible, then picking a single named model of processor (such as this one) would be the thing to do.

Dave

Thanks Dave, OK I can test that.

I just tested --constraint="intel", and now fcm_make doesn’t seem to be building the executable. I get a series of errors implying that the necessary modules were not loaded:

ModuleCmd_Load.c(213):ERROR:105: Unable to locate a modulefile for 'intel/19.0.0'
ModuleCmd_Load.c(213):ERROR:105: Unable to locate a modulefile for 'contrib/gnu/gcc/7.3.0'
ModuleCmd_Load.c(213):ERROR:105: Unable to locate a modulefile for 'eb/OpenMPI/intel/3.1.1'
2022-06-20T15:37:32+01:00 CRITICAL - failed/EXIT

I can’t immediately see the issue as this worked last week and the load cmds are there in ./site/suite.rc.CEDA_JASMIN?

Any thoughts would be appreciated.

Thanks,

Martin

It’s possible that there are problems with JASMIN: the system is ‘at risk’ from today until the 24th. If things are only working intermittently then this may be the reason:
https://www.ceda.ac.uk/blog/jasmin-reminder-of-maintenance-work-on-weekend-of-1819-june-and-extended-at-risk-period-20-24-june-1/

As for the recommended setup, I think that you are close to what is on the trunk:

https://code.metoffice.gov.uk/trac/jules/browser/main/trunk/rose-stem/include/jasmin/runtime.rc

and this definitely worked for the last release. Copying the trunk is normally good.

You should be able to load modules from the sci or cylc nodes, for example, as a test. If the system is undergoing major testing then perhaps today is a bad day…

Hi Martin
Yes, I emailed JASMIN support about the modules not getting loaded today on cylc1. They responded that it was due to the maintenance that DaveC was talking about in his response to you here. The maintenance could go on for the rest of this week.
Patrick

Hi Martin:
It looks like module load is working today on cylc1.
Patrick

I had another go. Some of the jobs seem to be running, but a lot are crashing. I’m not sure whether this is still related to the previous issues.

Example error:

slurmstepd: error: *** JOB 4412982 ON host390 CANCELLED AT 2022-06-25T15:24:12 DUE TO NODE FAILURE, SEE SLURMCTLD LOG FOR DETAILS ***
2022-06-25T15:24:14+01:00 CRITICAL - failed/EXIT

Martin

A NODE FAILURE is a hardware problem. Please try again. If the problem persists, please inform the JASMIN helpdesk and seek their advice.
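
To see which tasks were hit, something like the following might help. It is only a sketch: it assumes the log layout quoted in this thread (`<cycle_dir>/<task>/<attempt>/job.err`), and `list_node_failures` is a made-up helper name, not a Cylc or Rose command.

```shell
# Sketch: list the job.err files under one cycle point whose Slurm job
# was cancelled because of a node failure.
# list_node_failures is a hypothetical helper name.
list_node_failures() {
    cycle_dir="$1"    # e.g. ~/cylc-run/u-co635/log/job/1
    # Layout assumed from this thread: <cycle_dir>/<task>/<attempt>/job.err
    grep -l 'DUE TO NODE FAILURE' "$cycle_dir"/*/*/job.err 2>/dev/null
}

# Example: list_node_failures ~/cylc-run/u-co635/log/job/1
```

Each path printed corresponds to one task attempt that can simply be retriggered.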

Grenville

Hi Martin:
As Grenville suggests, yes, do please try again. Furthermore, JASMIN was under maintenance last week, so there might have been a few more node failures than normal. But node failures do happen.

If you want to try again, one way to do it is to re-open the Cylc GUI if it isn’t open already, with rose sgc. Then you can right click on each of the failed JULES tasks for the FLUXNET sites, and retrigger that task from the drop-down menu. There’re probably other ways to do this, but this is the most straightforward way. You can also look at the job.err and job.out and job.activity files (etc.) from this drop-down menu.

If rose sgc doesn’t open the suite, you might need to do something first, like rose suite-run --reload or rose suite-run --restart.
Patrick

Thanks Grenville, I tried the suggestion from Patrick to manually trigger the jobs that failed and that worked, so it doesn’t seem to be an issue anymore. I can still report the error (I saved it) if folks think that is helpful.

However, Patrick, I wasn’t able to trigger the plotting step (I tried a couple of times). Unfortunately, when I tried to open the error message to see what the issue was, nothing was displayed; I’ve attached a screenshot. Perhaps I can see the log via the command line, I’m just not sure where to look…?

Thanks

Hi Martin:
I am glad the jules runs completed for the FLUXNET suite!

If you’re having trouble viewing the error file for the plotting step with the Cylc GUI, I would suggest viewing the error file at the command line. It will be in a subdirectory of
~/cylc-run/u-co635/log/job/1/make_plots/ .
You can use more or vi or emacs or maybe another editor to view the error log files.
More than half the time I do study the stderr and stdout files at the command line instead of in the Cylc GUI. I would recommend looking at the other files in that subdirectory too, not just the job.err file.
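
As a rough sketch of how to get at the newest attempt quickly: this assumes the attempt subdirectories are zero-padded numbers (01, 02, ...), as in the paths quoted in this thread, and `show_latest_err` is a made-up helper name.

```shell
# Sketch: print the stderr of the most recent attempt of one task.
# show_latest_err is a hypothetical helper name; attempt directories are
# assumed to be zero-padded numbers (01, 02, ...).
show_latest_err() {
    task_dir="$1"     # e.g. ~/cylc-run/u-co635/log/job/1/make_plots
    # Zero-padded names sort correctly as plain text.
    latest=$(ls -d "$task_dir"/[0-9]*/ 2>/dev/null | sort | tail -n 1)
    if [ -z "$latest" ]; then
        echo "no attempts found under $task_dir" >&2
        return 1
    fi
    echo "== ${latest}job.err =="
    cat "${latest}job.err"
}

# Example: show_latest_err ~/cylc-run/u-co635/log/job/1/make_plots
```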
Patrick

Hi Martin:
Another thing: I don’t know if you tried to retrigger the plotting step before all the JULES jobs had finished. The suite should be set up so that the plotting step won’t start until all the JULES jobs have finished. But if you manually trigger the plotting step before the JULES jobs are finished, it will probably try to run and then fail somewhere. Instead of manually retriggering the plotting step, you can manually set its state to ‘waiting’ (by right-clicking on the plotting-app task in the GUI), and then it will wait for all its JULES-run dependencies to finish.

Patrick

Hi Martin:
Did you figure out what is wrong with your make_plots app yet? Did you get it to run successfully?
Patrick

Hi Patrick,

I’m still looking into it. There wasn’t anything discernible in the log files, so I’m currently waiting to make sure all the individual sites have completed before testing again. I will post an update, hopefully later…

Thanks,

Martin

Do all the JULES runs for the individual sites show up as a green, completed color in the Cylc GUI? That’s a good sign that they all completed.

Also, if you switch the make_plots to waiting, then it should automatically figure out if all the JULES jobs are finished or not, and it would start running if all of them are indeed finished.

Patrick

Hi again, Martin:
And if you give us read permission to your home directory and subdirectories, I could take a look at your setup and log files. You can do this with:
chmod -R g+rX /home/users/mdekauwe/

If you have anything private or confidential, you might want to change back the read access on those items.
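
A minimal sketch of changing the access back for one item (the helper name and the example path are made up for illustration):

```shell
# Sketch: remove all group permissions again from one private directory.
# revoke_group_access is a hypothetical helper name.
revoke_group_access() {
    target="$1"
    # g-rwx undoes the group read/execute bits granted by g+rX
    # (and any group write bit that was already set).
    chmod -R g-rwx "$target"
}

# Example (hypothetical path): revoke_group_access ~/private_notes
```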
Patrick

Yep done, there is nothing that exciting in there!

Martin

Hi Martin:
Thanks for the permissions change.

I then did this, to see which of your jules jobs did not have the ‘succeeded’ string in their job.out file:
grep -L succeeded ~mdekauwe/cylc-run/u-co635/log/job/1/jules*/*/job.out

This gave a list of files. Then I picked the latest run (run # 02) of one of them, and looked at the job.err file with vi:
vi /home/users/mdekauwe/cylc-run/u-co635/log/job/1/jules_ch_oe2_presc0/02/job.err
This shows:

slurmstepd: error: *** JOB 5173870 ON host424 CANCELLED AT 2022-06-29T12:42:41 DUE TO TIME LIMIT ***
2022-06-29T12:42:42+01:00 CRITICAL - failed/EXIT

This suggests that it ran out of wall clock time.
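
The two steps above (find the job.out files without ‘succeeded’, then look in the matching job.err) could be combined roughly like this. It is a sketch only: `report_unfinished` is a made-up helper name, and it only recognises Slurm messages of the form "DUE TO ..." like the ones quoted in this thread.

```shell
# Sketch: for every JULES attempt whose job.out lacks the string
# 'succeeded', print the reason Slurm gave (if any) in the job.err
# next to it. report_unfinished is a hypothetical helper name.
# Note: like the grep -L above, this looks at every attempt, so an early
# failed attempt is still listed even if a later one succeeded.
report_unfinished() {
    cycle_dir="$1"    # e.g. ~mdekauwe/cylc-run/u-co635/log/job/1
    for out_file in "$cycle_dir"/jules*/*/job.out; do
        [ -f "$out_file" ] || continue
        grep -q succeeded "$out_file" && continue
        err_file="${out_file%job.out}job.err"
        # Pick out e.g. "DUE TO TIME LIMIT" or "DUE TO NODE FAILURE".
        reason=$(grep -o 'DUE TO [A-Z ]*' "$err_file" 2>/dev/null \
                     | head -n 1 | sed 's/ *$//')
        echo "$out_file: ${reason:-no Slurm cancellation message found}"
    done
}

# Example: report_unfinished ~mdekauwe/cylc-run/u-co635/log/job/1
```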

I ran this suite recently without an issue, so maybe it picked some node where there were problems or something. But if you still have problems after retriggering in a 3rd try, you might increase the wall clock limit from 2 hours to say 5 hours and then rerun those sites.

You can do this by editing the file:
/home/users/mdekauwe/roses/u-co635/site/suite.rc.CEDA_JASMIN so that it says 5 hours instead of 2 hours:

    [[JULES_CEDA_JASMIN]]
        inherit = None, JASMIN_LOTUS

        [[[directives]]]
            --time = 5:00:00
            --ntasks = 1

And then after the editing is finished, you can do a rose suite-run --reload at the command line, followed by manually retriggering the JULES runs for failed sites in the Cylc GUI.
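
If you would rather not edit by hand, the same change could be scripted along these lines. This is a sketch under assumptions: that the `[[JULES_CEDA_JASMIN]]` section already contains a `--time = ...` line, and that the section ends at the next `[[...]]` header. `set_jules_time_limit` is a made-up helper name.

```shell
# Sketch: change the --time directive inside the [[JULES_CEDA_JASMIN]]
# section only, keeping a .bak copy of the original file.
# set_jules_time_limit is a hypothetical helper name.
set_jules_time_limit() {
    suite_rc="$1"     # e.g. ~/roses/u-co635/site/suite.rc.CEDA_JASMIN
    new_limit="$2"    # e.g. 5:00:00
    # The address range runs from the section header to the next
    # two-bracket header; [[[directives]]] has three brackets, so it
    # does not end the range.
    sed -i.bak -E \
        "/^[[:space:]]*\[\[JULES_CEDA_JASMIN]]/,/^[[:space:]]*\[\[[^[]/ s/^([[:space:]]*--time[[:space:]]*=[[:space:]]*).*/\1${new_limit}/" \
        "$suite_rc"
}

# Example:
#   set_jules_time_limit ~/roses/u-co635/site/suite.rc.CEDA_JASMIN 5:00:00
```

It is worth comparing the file with its .bak copy (for example with diff) before reloading the suite.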

Let me know if this helps or not,
Patrick

Hi Patrick,

OK had another go. The sites have definitely all finished, but I’m still getting the plot error. I’ve tried retriggering it but that hasn’t worked. I also increased the runtime as suggested. The error message isn’t very insightful:

e.g.

more /home/users/mdekauwe/cylc-run/u-co635/log.20220630T121650Z/job/1/make_plots/02/job.err

2022-07-01T14:27:20+01:00 CRITICAL - failed/EXIT

Hi Martin:
I have done some partial testing of a copy of your copy of the suite.
My copy is in: ~pmcguire/roses/u-co635_mdekauwe and ~pmcguire/cylc-run/u-co635_mdekauwe .
My copy of your copy of make_plots seems to work with your model output data, and it seems to make it further than your quick error.

I would suggest making a new copy of this suite and rerunning from scratch. This way you have a copy of the original configuration and more tracing can be done later to figure out what went wrong:

  1. cp -pr ~/roses/u-co635 ~/roses/u-co635v2
  2. cd ~/roses/u-co635v2
  3. edit your rose-suite.conf file so that it uses u-co635v2 instead of u-co635
  4. rose suite-run
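
Steps 1 to 3 could be sketched as a small helper like this. It assumes that a plain text substitution of the old suite name in rose-suite.conf is all that is needed, which may not hold for every suite. `clone_suite` is a made-up helper name.

```shell
# Sketch of steps 1-3: copy the suite, then replace the old suite name
# in the copy's rose-suite.conf (a .bak copy of that file is kept).
# clone_suite is a hypothetical helper name.
clone_suite() {
    src="$1"          # e.g. ~/roses/u-co635
    dst="$2"          # e.g. ~/roses/u-co635v2
    old_name=$(basename "$src")
    new_name=$(basename "$dst")
    if [ -e "$dst" ]; then
        echo "refusing to overwrite existing $dst" >&2
        return 1
    fi
    cp -pr "$src" "$dst"
    # Show where the old name occurs before replacing it.
    grep -n "$old_name" "$dst/rose-suite.conf"
    sed -i.bak "s/$old_name/$new_name/g" "$dst/rose-suite.conf"
}

# Example: clone_suite ~/roses/u-co635 ~/roses/u-co635v2
# Step 4 is then run by hand: cd ~/roses/u-co635v2 && rose suite-run
```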

The alternative to this is to just do:

  1. cd ~/roses/u-co635
  2. rose suite-run --new
    You can do this if you prefer. It cleans up the whole setup, deleting a lot of things, and you will lose everything that is deleted. It is generally safe to do, but if you later want to figure out what was going wrong, it might be helpful to keep the original setup.

Another thing you might try is running the make_plots job script at the command line:

  1. /home/users/mdekauwe/cylc-run/u-co635/log.20220630T121650Z/job/1/make_plots/02/job

You might look at this script with your editor before running it, and you may want to make simple changes to it. You can stop the script with Ctrl-Z (which suspends it) and then kill it with kill %. It may be a little easier to debug this way than running from the Cylc GUI.
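
When running it by hand, it can help to keep stdout and stderr in separate files. This is a minimal sketch: it assumes the job script is a bash script, and `run_job_by_hand` plus the output file names are made up for illustration.

```shell
# Sketch: run a job script by hand and keep its stdout and stderr in
# separate files for inspection.
# run_job_by_hand, manual.out and manual.err are hypothetical names.
run_job_by_hand() {
    job_script="$1"   # path to the job file, e.g. the make_plots/02/job above
    out_dir="$2"      # a scratch directory of your choice
    mkdir -p "$out_dir"
    # Assumption: the job script is a bash script.
    bash "$job_script" > "$out_dir/manual.out" 2> "$out_dir/manual.err"
    status=$?
    echo "exit status: $status"
    # Show the end of stderr, where the failure usually is.
    tail -n 20 "$out_dir/manual.err"
    return "$status"
}

# Example: run_job_by_hand /path/to/make_plots/02/job ~/make_plots_debug
```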

Patrick