Wall time issue - related to JASMIN env?

Hi,

I’m encountering a wall time issue when running my copy of Heather Ashton’s PLUMBER suite (which runs a series of single-site JULES runs at N flux sites). The suite is “u-cr731”.

The error I get is:

slurmstepd: error: *** JOB 22029700 ON host400 CANCELLED AT 2022-11-04T17:22:50 DUE TO TIME LIMIT ***

The thing is, I’ve tested increasing the wall time:

--time = 2:00:00

and also the

execution polling intervals = PT1H

And neither seems to fix the error. Given I’m only testing 2 single-site runs, I can’t see how it can be taking JULES so long to run. If I were running CABLE this would take a few minutes at most, so I’m worried that something else in the JASMIN environment set-up is causing this, rather than it simply being a matter of insufficient run time.
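For context, these settings sit roughly like this in the suite’s Cylc configuration (a sketch from memory rather than a verbatim copy of the suite; the task family name is a placeholder):

[runtime]
    [[JULES]]
        [[[job]]]
            batch system = slurm
            execution polling intervals = PT1H
        [[[directives]]]
            --time = 2:00:00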

If I look at the output directory, I can see that it is writing output and dump files, so things are in motion:

$ ls /work/scratch-nopw/martindekauwe/PLUMBER2/outputs/output_GAL9/AT-Neu/

local_AT-Neu_fluxnet2015_GAL9.AT-Neu.nc
local_AT-Neu_fluxnet2015_GAL9.dump.20020101.0.nc
local_AT-Neu_fluxnet2015_GAL9.dump.spin1.20020101.0.nc
local_AT-Neu_fluxnet2015_GAL9.dump.spin1.20030101.0.nc
local_AT-Neu_fluxnet2015_GAL9.dump.spin1.20040101.0.nc
local_AT-Neu_fluxnet2015_GAL9.dump.spin1.20050101.0.nc
local_AT-Neu_fluxnet2015_GAL9.dump.spin1.20060101.0.nc
local_AT-Neu_fluxnet2015_GAL9.dump.spin1.20070101.0.nc
local_AT-Neu_fluxnet2015_GAL9.dump.spin1.20080101.0.nc
local_AT-Neu_fluxnet2015_GAL9.dump.spin1.20090101.0.nc
local_AT-Neu_fluxnet2015_GAL9.dump.spin1.20100101.0.nc
local_AT-Neu_fluxnet2015_GAL9.dump.spin1.20110101.0.nc
etc

Thanks,

Martin

Thanks, Martin:
I’ll have a look.
Patrick

Hi Martin:
Your stdout log files are super-long.
see:
~mdekauwe/cylc-run/u-cr731/log/job/1/jules_AT-Neu/01/job.out
Maybe you can make ‘print_step’ much larger, so that the files are shorter, and your program may run faster too.
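If I remember right, ‘print_step’ is part of the JULES timesteps namelist, so the change in your Rose app config might look roughly like this (the namelist name and the example value of 48, i.e. daily printing for a 30-minute timestep, are from memory, so please check them against the JULES documentation for your version):

[namelist:jules_time]
print_step=48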
Also, your output is going to scratch-nopw. The nopw can work with this JULES suite because you’re doing single-site studies. But it might be slower than using scratch-pw.
Does that help?
Patrick

Hi,

I’m not sure I was aware of the difference in output destination; I was just amending Heather’s suite. I will test scratch-pw. I will also look at print_step and test that now.

More soon.

Great, Patrick, that works. The change of scratch path fixes things. I will also test changing print_step, as I can see that should also speed things up.

Hi Martin:
I am glad that changing from scratch-nopw to scratch-pw helps.
How much faster is it? Does it finish in less than 20 mins (as opposed to more than 120 mins before)?
BTW, ‘pw’ stands for ‘parallel-write’, and ‘nopw’ stands for ‘no-parallel-write’. The ‘pw’ is higher-performance storage.
Patrick

Hi Patrick,

It is a bit hard for me to say exactly, as I can’t locate a runtime summary on JASMIN; is there one? I was anticipating a report similar to the one I’m used to seeing when running on the Australian supercomputer, but I don’t see one. Perhaps I’m looking in the wrong place…

Roughly, though: it took 20 mins to run 9 site-years and 37 mins to run 11 site-years. These times still seem pretty slow, but as I said, I’m coming from my CABLE experience, so I’m not sure what to expect. I guess I can still coarsen ‘print_step’ further; I’m not clear how coarse I can go here, as ultimately I don’t care what the model dumps to standard out. You can turn this off completely in CABLE but seemingly not in JULES…

Thanks

Hi Martin:
Getting it down to 20 to 37 mins is a big step. Congrats!
You can see how long the program took at:
~mdekauwe/cylc-run/u-cr731/log/job/1/jules_AT-Neu/01/job.status
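If it helps, you can pull the start and end timestamps out of that file with something like the following (assuming the usual Cylc 7 job.status fields):

$ grep -E 'CYLC_JOB_(INIT|EXIT)_TIME' ~mdekauwe/cylc-run/u-cr731/log/job/1/jules_AT-Neu/01/job.status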

I think if you print daily or weekly, instead of more often than hourly, then you should save some of the run time, and the stdout log files will be easier to read too.

BTW, I know you are only running JULES for single sites, but be very careful if you ever try to run gridded JULES with output going to scratch-nopw or to a nopw group workspace. That will cause the disk to hang, so don’t do it. For gridded JULES you should definitely use scratch-pw or a pw group workspace instead: gridded JULES uses parallel NetCDF and needs to write to the disk in parallel, so a pw (parallel-write) disk is required. Writing in parallel to a nopw disk doesn’t end well.
Patrick

OK, thanks. I can see the recorded runtime now; I knew it must be somewhere!

Noted your point about gridded runs; I’m not there yet, but I will do my best to make a note of this.

One thing I’m not sure about now is whether the MPI arguments are being correctly set for the JASMIN environment. I’ve checked, and when running two sites things worked fine, so I just tested running all the sites in Heather’s suite (169). When I do, things crash…

I’m wondering if this relates to the noomp and nonetcdf args, i.e.

$ more /home/users/mdekauwe/roses/u-cr731/app/fcm_make/rose-app.conf
meta=jules-fcm-make/vn7.0

[env]
JULES_BUILD=normal
!!JULES_COMPILER=cray
JULES_FFLAGS_EXTRA=
JULES_LDFLAGS_EXTRA=
!!JULES_MPI=nompi
JULES_NETCDF=nonetcdf
JULES_OMP=noomp
JULES_PLATFORM=jasmin-lotus-intel
!!JULES_REMOTE=local
!!JULES_REMOTE_HOST='xc'
JULES_SOURCE=${JULES_PATH}

If so, then I don’t really follow how it read the NetCDF files and ran for 2 sites.

Yet if I look at the error message, it does imply to me at least an MPI issue…

$ more /home/users/mdekauwe/cylc-run/u-cr731/log.20221108T162144Z/job/1/jules_AU-Whr/01/job.err

(skipping stuff)

--------------------------------------------------------------------------
None of the TCP networks specified to be included for out-of-band communications
could be found:

  Value given: p4p2

Please revise the specification and try again.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
No network interfaces were found for out-of-band communications. We require
at least one available network for out-of-band messaging.
--------------------------------------------------------------------------
[host586.jc.rl.ac.uk:20736] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ess_singleton_module.c at line 532
[host586.jc.rl.ac.uk:20736] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ess_singleton_module.c at line 166
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_init failed
  --> Returned value Unable to start a daemon on the local node (-127) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_mpi_init: ompi_rte_init failed
  --> Returned "Unable to start a daemon on the local node" (-127) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[host586.jc.rl.ac.uk:20736] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
[FAIL] rose-run jules.exe <<'__STDIN__'
[FAIL]
[FAIL] '__STDIN__' # return-code=1
2022-11-08T16:49:13Z CRITICAL - failed/EXIT

Thanks in advance

Hi Martin
You got about 100 out of 106 of the sites to work. Good job so far!
It is very important to get all the sites to work though, I agree.

The setting you have of JULES_PLATFORM=jasmin-lotus-intel refers to code in the file ~pmcguire/jules/jules-vn7.0/etc/fcm-make/platform/jasmin-lotus-intel.cfg (that’s my locally checked-out copy of the file). All the settings in that file override settings like

!!JULES_MPI=nompi
JULES_NETCDF=nonetcdf

in your file
~mdekauwe/roses/u-cr731/app/fcm_make/rose-app.conf.
The actual settings, as listed in ~pmcguire/jules/jules-vn7.0/etc/fcm-make/platform/jasmin-lotus-intel.cfg, are:

$JULES_MPI = mpi
$JULES_NETCDF = netcdf

You might want to figure out a way to override the JULES_MPI setting. One way to do this is to check out a branch of the JULES 7.0 trunk and modify the jasmin-lotus-intel.cfg file to have $JULES_MPI = nompi. Then you can either check in your changes to the branch and use that branch in your suite, or point your suite at the local copy of the branch directly.
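A rough sketch of how that could go (the branch name is just a placeholder, and I’m going from memory on the fcm repository keywords and branch path, so please double-check):

$ fcm branch-create plumber_nompi fcm:jules.x_tr@vn7.0
$ fcm checkout fcm:jules.x_br/dev/<your-mosrs-username>/vn7.0_plumber_nompi jules_nompi
$ cd jules_nompi
$ # edit etc/fcm-make/platform/jasmin-lotus-intel.cfg so that it sets $JULES_MPI = nompi
$ fcm commit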
Patrick

Hi again, Martin:
To amend my previous reply from a few minutes ago: another way to enable the nompi option on JASMIN is to replace JULES_PLATFORM=jasmin-lotus-intel in ~mdekauwe/roses/u-cr731/app/fcm_make/rose-app.conf with JULES_PLATFORM=jasmin-intel-nompi. This refers to the settings in the trunk version of ~pmcguire/jules/jules-vn7.0/etc/fcm-make/platform/jasmin-intel-nompi.cfg (my local copy is shown here).
Does that help?
Patrick

Hi Patrick,

I think there is a mistake here, as it didn’t complete 100 of 106. A number of them had crashed, at least 5, so I killed the job as I figured something was wrong.

If I’m following, I should look at your jasmin-lotus-intel.cfg file and match your arguments. I’m not sure I fully followed the point about needing to override the JULES_MPI setting; do you mean something other than matching your args? Sorry, I’m possibly being slow and not following here.

Hi Martin:
I had grepped for ‘FAIL’ in the 106, and I only found 6 files that matched. So I thought more had succeeded. My mistake.
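For what it’s worth, the check was something along these lines, so it only counted tasks whose error log actually contained FAIL; anything that hung or was killed without writing FAIL wouldn’t have been counted:

$ grep -l FAIL ~mdekauwe/cylc-run/u-cr731/log/job/1/*/01/job.err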

For the PLUMBER suite you don’t need to use MPI, but it is currently using MPI. So one possible way to fix this is to point the suite at the jasmin-intel-nompi.cfg file instead of the jasmin-lotus-intel.cfg file. There might be other changes needed to switch from MPI to no-MPI, but that could be a start.
Patrick

I tried just setting JULES_PLATFORM=jasmin-intel-nompi and that didn’t work. This is what I got:

$ m /home/users/mdekauwe/cylc-run/u-cr731/log.20221109T142205Z/job/1/jules_AT-Neu/01/job.out

(skipping)

[INFO] file_ncdf_open: Opening file /home/users/mdekauwe/data/PLUMBER2/met/AT-Neu_2002-2012_FLUXNET2015_Met.nc for reading
 FATAL ERROR: Attempt to use dummy NetCDF procedure
 To use NetCDF, recompile linking the NetCDF library

And

$ m /home/users/mdekauwe/cylc-run/u-cr731/log.20221109T142205Z/job/1/jules_AT-Neu/01/job.err
[FAIL] rose-run jules.exe <<'__STDIN__'
[FAIL]
[FAIL] '__STDIN__' # return-code=1
2022-11-09T14:26:39Z CRITICAL - failed/EXIT

I can obviously see that it relates to the NetCDF libs, but I’m not clear on the solution. I am using “JULES_NETCDF=nonetcdf”, but this worked fine when I ran just two sites.

Hi Martin:
Have you only tried the 106-site run with MPI one time? Or have you tried more than once?
If you’ve only tried once, maybe you can try again?
Patrick

Hi Martin:
I looked at your file
~mdekauwe/roses/u-cr731/app/fcm_make/rose-app.conf
You have this code:

JULES_PLATFORM=jasmin-intel-nompi
!!JULES_PLATFORM=jasmin-lotus-intel

The !! is not a comment symbol. So I think you are actually using JULES_PLATFORM=jasmin-lotus-intel.
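To be safe, I would keep just a single JULES_PLATFORM line in the [env] section of that file, i.e. something like:

[env]
JULES_PLATFORM=jasmin-intel-nompi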
Patrick

Right, my mistake.

OK, so I removed the line and tried again:

$ m /home/users/mdekauwe/cylc-run/u-cr731/log.20221109T164112Z/job/1/jules_AT-Neu/01/job.err

[FATAL ERROR] file_ascii_read_var_1d: start must be 1 for all dimensions - reading part of a variable from an ASCII file is not supported
Image              PC                Routine            Line        Source
jules.exe          00000000009038EA  Unknown               Unknown  Unknown
jules.exe          00000000004C2138  logging_mod_mp_wr         170  logging_mod.F90
jules.exe          000000000066B697  driver_ascii_mod_        1071  driver_ascii_mod.F90
jules.exe          0000000000654B4F  file_mod_mp_file_         776  file_mod.F90
jules.exe          000000000067DE73  file_gridded_mod_         861  file_gridded_mod.F90
jules.exe          0000000000702667  fill_variables_fr         181  fill_variables_from_file_mod.F90
jules.exe          00000000007A2A4C  init_frac_mod_mp_         106  init_frac_mod.F90
jules.exe          0000000000797162  init_ancillaries_         114  init_ancillaries_mod.F90
jules.exe          000000000042DC3F  init_mod_mp_init_         336  init.F90
jules.exe          000000000040CCE8  MAIN__                    131  jules.F90
jules.exe          000000000040CA92  Unknown               Unknown  Unknown
libc-2.17.so       00007FA881CFB555  __libc_start_main     Unknown  Unknown
jules.exe          000000000040C9A9  Unknown               Unknown  Unknown
[FAIL] rose-run jules.exe <<'__STDIN__'
[FAIL]
[FAIL] '__STDIN__' # return-code=1
2022-11-09T16:49:45Z CRITICAL - failed/EXIT

Hi again, Martin:
I was getting [those same] weird JULES jules_frac ancillary errors when I ran a copy of your suite with your local copy of JULES 7.0. So I just created a branch of a branch of Heather’s MOSRS JULES 6.0 branch, and added the missing jasmin-intel-nompi.cfg file from the JULES 7.0 trunk. The new Rose/Cylc suite that uses this MOSRS branch is checked in as u-cr922.
Patrick

Hi again2, Martin:
It looks like the latest version of Heather’s original suite, u-bx465, uses a Met Office local branch for JULES: JULES_PATH=‘/data/users/haddb/FCM_13.0/vn7.0_FrcLocTime/’. I am guessing that this branch is not the same as the JULES 7.0 trunk. Maybe you’d want to ask her to commit those changes to MOSRS so that you can use that branch instead of the JULES 7.0 trunk you are currently using?
Patrick

Hi Patrick,

a few responses…

  1. I’ve emailed Heather about the branch and enquired whether there is anything relevant there, but my understanding was that this worked fine on JULES vn7. And indeed it did when running two sites, as per my initial test case.

  2. I’m still not fully following why it ran fine for 2 sites and not for more than 2; surely the same MPI issues apply in both scenarios?

  3. I extracted the namelist files for a single site and ran that site locally on my Mac (a rough sketch of what I ran is below). The site I noted above that took 37 mins took about 2 mins on my Mac. This seems like an extremely large time penalty on JASMIN. As I said, I don’t have any JASMIN experience to fall back on, but this really surprises me; we wouldn’t see slowdowns like this on the Australian supercomputer.
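For reference, the local run was nothing fancy; roughly the following, where the directory name is a placeholder for wherever I put the extracted namelists and a locally built jules.exe:

$ cd ~/plumber_local/site_run
$ time ./jules.exe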

Thanks,

Martin