Jobs not launching, constraint=ivybridge128G

Hi,

I was following Patrick McGuire’s tutorial on running JULES at N flux sites:

While the jobs ran on the test queue (--partition=test), none of them would launch with --constraint="ivybridge128G". I am now testing --constraint="intel", following the information on this page (LOTUS cluster specification - JASMIN help docs).

Patrick has updated his tutorial, but I am noting it here in case there is alternative advice (…?), or in case others run into the same issue.
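
As a side note, the feature strings that SLURM will accept for --constraint can be listed directly; the exact names depend on the current LOTUS configuration, so this is only a sketch:

        # List node features (the strings accepted by --constraint) for each node group.
        # Columns: node list, CPUs per node, memory (MB), available features.
        sinfo -o "%30N %8c %10m %40f"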

Martin

Martin, it sounds like you have this under control (intel should be good).
The only point I’d add is that if you are doing development work on the code, and want to compare against the KGO (known good output), then picking

        --constraint="skylake348G"

would be a good idea. skylake348G is the choice in the trunk.
Similarly, if you want runs that are bitwise reproducible, then pinning to a single named processor model (such as this one) is the way to go.
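
For reference, the constraint normally goes in the SLURM [[[directives]]] section of the JASMIN site file (e.g. site/suite.rc.CEDA_JASMIN). A rough sketch (the family and section names here are illustrative, not necessarily the exact ones in your suite):

        [runtime]
            [[JULES]]
                [[[job]]]
                    batch system = slurm
                [[[directives]]]
                    --partition = test
                    --constraint = "skylake348G"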

Dave

Thanks Dave, OK I can test that.

I did just test --constraint="intel", and now fcm_make doesn’t seem to be building the executable; I get a series of errors implying the necessary modules were not loaded:

ModuleCmd_Load.c(213):ERROR:105: Unable to locate a modulefile for 'intel/19.0.0'
ModuleCmd_Load.c(213):ERROR:105: Unable to locate a modulefile for 'contrib/gnu/gcc/7.3.0'
ModuleCmd_Load.c(213):ERROR:105: Unable to locate a modulefile for 'eb/OpenMPI/intel/3.1.1'
2022-06-20T15:37:32+01:00 CRITICAL - failed/EXIT

I can’t immediately see the issue, as this worked last week and the module load commands are there in ./site/suite.rc.CEDA_JASMIN.

Any thoughts would be appreciated.

Thanks,

Martin

It’s possible that there are problems with JASMIN: the system is ‘at risk’ from today until the 24th. If things are only working intermittently then this may be the reason:
https://www.ceda.ac.uk/blog/jasmin-reminder-of-maintenance-work-on-weekend-of-1819-june-and-extended-at-risk-period-20-24-june-1/

As for the recommended setup - I think that you are close to what is on the trunk:

https://code.metoffice.gov.uk/trac/jules/browser/main/trunk/rose-stem/include/jasmin/runtime.rc

and this definitely worked for the last release. Copying the trunk is normally good.

You should be able to load modules from the sci or cylc nodes, for example, as a test. If the system is undergoing major testing then perhaps today is a bad day…
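
For example, something along these lines on a sci or cylc node would confirm whether the modules the build expects are visible (module names taken from the errors above):

        # Check that the module files the fcm_make step loads are visible on this node
        module avail intel/19.0.0
        # Then try loading them in the same order as the suite does
        module load intel/19.0.0
        module load contrib/gnu/gcc/7.3.0
        module load eb/OpenMPI/intel/3.1.1
        module list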

Hi Martin
Yes, I emailed JASMIN support about the modules not getting loaded today on cylc1. They responded that it was due to the maintenance that DaveC was talking about in his response to you here. The maintenance could go on for the rest of this week.
Patrick

Hi Martin:
It looks like module load is working today on cylc1.
Patrick

I had another go. Some of the jobs seem to be running, but a lot are crashing; I’m not sure if this is still related to the previous issues?

Example error:

slurmstepd: error: *** JOB 4412982 ON host390 CANCELLED AT 2022-06-25T15:24:12 DUE TO NODE FAILURE, SEE SLURMCTLD LOG FOR DETAILS ***
2022-06-25T15:24:14+01:00 CRITICAL - failed/EXIT

Martin

A NODE FAILURE is a hardware problem. Please try again. If the problem persists, please inform the JASMIN helpdesk and seek their advice.

Grenville

Hi Martin:
As Grenville suggests, yes, do please try again. Furthermore, JASMIN was under maintenance last week, so there might have been a few more node failures than normal. But node failures do happen.

If you want to try again, one way to do it is to re-open the Cylc GUI (if it isn’t open already) with rose sgc. Then you can right-click on each of the failed JULES tasks for the FLUXNET sites and retrigger that task from the drop-down menu. There are probably other ways to do this, but this is the most straightforward. You can also look at the job.err, job.out and job.activity files (etc.) from this drop-down menu.

If rose sgc doesn’t open the suite, you might need to do something first, like rose suite-run --reload or rose suite-run --restart.
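
If you prefer the command line to the GUI, a rough Cylc 7 equivalent would be something like the following (the task name below is a placeholder, not the real name in the suite):

        # Show the current state of all tasks and pick out the failed ones
        cylc dump u-co635 | grep failed
        # Retrigger one failed task (replace <task_name> with the actual JULES task name)
        cylc trigger u-co635 '<task_name>.1'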
Patrick

Thanks Grenville. I tried Patrick’s suggestion to manually retrigger the jobs that failed, and that worked, so it doesn’t seem to be an issue any more. I can still report the error (I saved it) if folks think that would be helpful.

However, Patrick, I wasn’t able to trigger the plotting step (I tried a couple of times). Unfortunately, when I attempted to open the error message to see what the issue was, it wouldn’t display anything; I’ve attached a screenshot. Perhaps I can see the log via the command line, but I’m not sure where to look…?

Thanks

Hi Martin:
I am glad the JULES runs completed for the FLUXNET suite!

If you’re having trouble viewing the error file for the plotting step with the Cylc GUI, I would suggest viewing it at the command line. It will be in a subdirectory of
~/cylc-run/u-co635/log/job/1/make_plots/ .
You can use more or vi or emacs or maybe another editor to view the error log files.
More than half the time, I study the stderr and stdout files at the command line rather than in the Cylc GUI. I would also recommend looking at the other files in that subdirectory, not just job.err.
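
Concretely, assuming the plotting task is called make_plots as in the path above, something like this shows the logs from the latest attempt (NN is the symlink Cylc keeps pointing at the most recent submit):

        # List everything Cylc wrote for the latest submit of the plotting task
        ls ~/cylc-run/u-co635/log/job/1/make_plots/NN/
        # Page through the error log; job.out and the activity log live alongside it
        less ~/cylc-run/u-co635/log/job/1/make_plots/NN/job.err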
Patrick

Hi Martin:
Another thing: I don’t know if you tried to retrigger the plotting step before all the JULES jobs had finished. The suite should be set up so that the plotting step won’t start until all the JULES jobs have finished, but if you manually trigger it before they are done, it will probably run and then fail somewhere. Instead of manually retriggering the plotting step, you can set its state to ‘waiting’ (by right-clicking on the plotting-app task in the GUI), and it will then wait for all of its JULES-run dependencies to finish.
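
The rough command-line equivalent (Cylc 7 syntax; the task name and cycle point are taken from the log path above, so adjust them if yours differ) would be:

        # Put the plotting task back into the 'waiting' state so that it runs only once
        # its JULES-run dependencies are satisfied
        cylc reset --state=waiting u-co635 'make_plots.1'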

Patrick