Wall time issue - related to JASMIN env?

Hi Martin:
It’s possible that using the MPI libraries is, for some reason, slowing things down on JASMIN compared to running without them. It’s impressive that it is so fast on your Mac. We should definitely sort out the slow-down on JASMIN. Is JULES as fast as 2 minutes for that site on the Met Office XCS supercomputer?

If you do get in touch with Heather, maybe you can ask her to do a quick diff -r between the different JULES versions (her JULES7.0 branch and the JULES7.0 trunk)? The differences could be one reason why it is failing without MPI on some (or all?) of your sites.

Maybe instead of trying 106 sites, you could try (say) 5 sites, to see if it works better.

I think our first priority should be to get it working without MPI on JASMIN, and our second should be to get it running as fast as 2 minutes/site there. Maybe both will be solved hand-in-hand once the first one is.
Patrick

Hi Martin:
I have been debugging the suite. I think that in order to write NetCDF dump files to disk with the write_dump routine, JULES needs to use MPI. So this idea of trying to get it working without MPI may not get us very far. I am now trying to get it running better with MPI.
Patrick

Hi Martin:
I am running it with MPI now. The first 15 sites have succeeded, with an average JULES execution time of about 5 +/- 2 minutes/site. Maybe you can try running it again? I didn’t change anything from your checked-in version of the u-cr731 suite, except that I now have the original setting of JULES_PLATFORM=jasmin-lotus-intel. I am using the higher-performance scratch-pw disk, as you were already doing.
Patrick

OK I will put back the original option and test 15 sites.

From the log files, it looks like all 15 sites ran fine.

Looking at the runtime, though, I don’t see the short times you quoted:

$ cat ~mdekauwe/cylc-run/u-cr731/log/job/1/jules_AT-Neu/01/job.status
CYLC_BATCH_SYS_NAME=slurm
CYLC_BATCH_SYS_JOB_ID=22873927
CYLC_BATCH_SYS_JOB_SUBMIT_TIME=2022-11-15T10:03:24Z
CYLC_JOB_PID=216775
CYLC_JOB_INIT_TIME=2022-11-15T10:03:48Z
CYLC_JOB_EXIT=SUCCEEDED
CYLC_JOB_EXIT_TIME=2022-11-15T10:33:47Z
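
That is roughly 30 minutes from job init to exit. As a quick sketch of how to pull the elapsed minutes out of a job.status file (assuming GNU date is available, as on the JASMIN login nodes):

$ f=~mdekauwe/cylc-run/u-cr731/log/job/1/jules_AT-Neu/01/job.status
$ t0=$(grep CYLC_JOB_INIT_TIME $f | cut -d= -f2)   # 2022-11-15T10:03:48Z
$ t1=$(grep CYLC_JOB_EXIT_TIME $f | cut -d= -f2)   # 2022-11-15T10:33:47Z
$ echo $(( ($(date -d "$t1" +%s) - $(date -d "$t0" +%s)) / 60 )) minutes
29 minutes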

I’m not sure whether the best thing to do is to build up the number of sites in the test until it goes wrong (or doesn’t)…

Re: the branch, Heather said:

"The branch below is now on the head of the JULES trunk (so vn7.0 at revision 24118, see ticket https://code.metoffice.gov.uk/trac/jules/ticket/1327), so please either use a branch of your own at this revision or the JULES trunk@24118. You may need to apply the latest upgrade macro to get this to work and set l_local_solar_time =.true. in the jules_time namelist

(see doc branch changes here: https://code.metoffice.gov.uk/trac/jules/changeset/23978/doc/branches/dev/johnmedwards/vn7.0_DocFrcLocTime?contextall=1&old=23522&old_path=%2Fdoc%2Ftrunk)

When JULES vn7.1 is released then you can just use the JULES trunk and it should work fine."

NB: I have removed the “l_local_solar_time=.true.” setting in my suite, so this bit isn’t relevant. In summary, it sounds like the branch being tested is fine.

Hi Martin:
That’s wonderful that you got 15 to work! I think when you ran the 170 sites the first time, there might have been some other problems with JASMIN. Maybe trying a second time, or on a different day, was all that was needed to get it right? It would be better if it worked the first time, every time, I agree.

For getting your science done quickly enough, does it really matter that the run time of your JULES jobs is 20 minutes instead of 2? When I first ran the u-cr731 suite yesterday, all 170 of the FLUXNET sites succeeded. The 170 jobs ran on separate cores, and (including submission, queueing, and run time) their end times ranged from 3 minutes after the initial JULES submission (7:25 PM) to 75 minutes after it.

One possibility that could be looked into is that some of the slower tasks (> 30 minutes) ran, for some reason, on JASMIN nodes that were overloaded by other jobs (including ones from the same suite), or that too much output was being written to the disk simultaneously.

You could try modifying the suite’s SLURM directives to get better runtime performance, if runtime performance is your research-limiting factor. If this isn’t being done already, maybe change the suite so that it only submits (say) 15 sites at a time and waits until those are done before submitting the next 15. You could also request a certain amount of memory for each site’s JULES job, or try to limit the number of JULES tasks per node at any one time.
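
As a rough sketch of those last two ideas (assuming the usual Cylc 7 suite.rc syntax for Slurm directives; the task name and values here are hypothetical, so match them to your suite):

[runtime]
    [[jules<site>]]
        [[[directives]]]
            --mem = 4000    # hypothetical: request ~4 GB of memory per site job
            # --exclusive   # a blunt way to limit JULES tasks per node, at the cost of queue time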

I did try to run two other copies of the suite (u-cr731b & u-cr731c) at the same time, and some fraction of the sites failed during those runs; I had to resubmit/retrigger the failed sites. The system was probably getting overloaded. If you’re interested, you could look at the stderr & stdout log files for those suites to see the failure modes.

Below is a listing of the job.status files for the 170 successful jobs of the original u-cr731 suite. The timestamp of each file’s last write shows the range of completion times of the 170 jobs.
Patrick

[pmcguire@cylc1 ~]$ ls -ltr ~pmcguire/cylc-run/u-cr731/log/job/1/jules*/01/job.status
-rw-r--r-- 1 pmcguire users 232 Nov 14 19:25 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_UK-Ham/01/job.status
-rw-r--r-- 1 pmcguire users 232 Nov 14 19:26 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_ID-Pag/01/job.status
-rw-r--r-- 1 pmcguire users 232 Nov 14 19:26 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_IT-Isp/01/job.status
-rw-r--r-- 1 pmcguire users 232 Nov 14 19:26 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_AU-Lit/01/job.status
-rw-r--r-- 1 pmcguire users 232 Nov 14 19:27 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_US-Bkg/01/job.status
-rw-r--r-- 1 pmcguire users 233 Nov 14 19:27 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_CN-Cng/01/job.status
-rw-r--r-- 1 pmcguire users 232 Nov 14 19:28 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_DK-Ris/01/job.status
-rw-r--r-- 1 pmcguire users 232 Nov 14 19:28 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_FR-Lq2/01/job.status
-rw-r--r-- 1 pmcguire users 232 Nov 14 19:28 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_FR-Lq1/01/job.status
-rw-r--r-- 1 pmcguire users 233 Nov 14 19:29 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_DE-Bay/01/job.status
-rw-r--r-- 1 pmcguire users 233 Nov 14 19:29 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_CN-Cha/01/job.status
-rw-r--r-- 1 pmcguire users 232 Nov 14 19:29 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_ZA-Kru/01/job.status
-rw-r--r-- 1 pmcguire users 233 Nov 14 19:29 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_CA-SF1/01/job.status
-rw-r--r-- 1 pmcguire users 233 Nov 14 19:29 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_DK-Fou/01/job.status
-rw-r--r-- 1 pmcguire users 233 Nov 14 19:30 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_CA-SF3/01/job.status
-rw-r--r-- 1 pmcguire users 232 Nov 14 19:31 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_DE-SfN/01/job.status
-rw-r--r-- 1 pmcguire users 232 Nov 14 19:31 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_IT-Amp/01/job.status
-rw-r--r-- 1 pmcguire users 232 Nov 14 19:31 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_NL-Ca1/01/job.status
-rw-r--r-- 1 pmcguire users 233 Nov 14 19:31 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_CA-SF2/01/job.status
-rw-r--r-- 1 pmcguire users 233 Nov 14 19:32 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_AU-TTE/01/job.status
-rw-r--r-- 1 pmcguire users 232 Nov 14 19:33 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_US-SP1/01/job.status
-rw-r--r-- 1 pmcguire users 233 Nov 14 19:34 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_DK-Lva/01/job.status
-rw-r--r-- 1 pmcguire users 232 Nov 14 19:34 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_DE-Wet/01/job.status
-rw-r--r-- 1 pmcguire users 233 Nov 14 19:34 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_CA-NS7/01/job.status
-rw-r--r-- 1 pmcguire users 232 Nov 14 19:34 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_US-Twt/01/job.status
-rw-r--r-- 1 pmcguire users 232 Nov 14 19:34 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_CN-Qia/01/job.status
-rw-r--r-- 1 pmcguire users 232 Nov 14 19:34 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_IT-SR2/01/job.status
-rw-r--r-- 1 pmcguire users 233 Nov 14 19:34 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_CN-Din/01/job.status
-rw-r--r-- 1 pmcguire users 232 Nov 14 19:35 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_US-Ne3/01/job.status
-rw-r--r-- 1 pmcguire users 232 Nov 14 19:35 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_US-Ne2/01/job.status
-rw-r--r-- 1 pmcguire users 232 Nov 14 19:35 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_US-Ne1/01/job.status
-rw-r--r-- 1 pmcguire users 231 Nov 14 19:35 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_US-Goo/01/job.status
-rw-r--r-- 1 pmcguire users 233 Nov 14 19:35 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_IT-Mal/01/job.status
-rw-r--r-- 1 pmcguire users 232 Nov 14 19:36 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_IT-BCi/01/job.status
-rw-r--r-- 1 pmcguire users 232 Nov 14 19:36 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_US-AR2/01/job.status
-rw-r--r-- 1 pmcguire users 232 Nov 14 19:36 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_US-GLE/01/job.status
-rw-r--r-- 1 pmcguire users 232 Nov 14 19:36 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_CN-Dan/01/job.status
-rw-r--r-- 1 pmcguire users 233 Nov 14 19:37 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_US-Bar/01/job.status
-rw-r--r-- 1 pmcguire users 232 Nov 14 19:37 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_AU-Rig/01/job.status
-rw-r--r-- 1 pmcguire users 233 Nov 14 19:38 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_IT-PT1/01/job.status
-rw-r--r-- 1 pmcguire users 233 Nov 14 19:38 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_AU-Ync/01/job.status
-rw-r--r-- 1 pmcguire users 232 Nov 14 19:38 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_IT-Ro2/01/job.status
-rw-r--r-- 1 pmcguire users 233 Nov 14 19:39 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_CA-Qfo/01/job.status
-rw-r--r-- 1 pmcguire users 233 Nov 14 19:39 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_AU-ASM/01/job.status
-rw-r--r-- 1 pmcguire users 233 Nov 14 19:39 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_PT-Mi1/01/job.status
-rw-r--r-- 1 pmcguire users 232 Nov 14 19:39 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_AU-Cpr/01/job.status
-rw-r--r-- 1 pmcguire users 232 Nov 14 19:39 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_US-Prr/01/job.status
-rw-r--r-- 1 pmcguire users 233 Nov 14 19:40 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_US-Cop/01/job.status
-rw-r--r-- 1 pmcguire users 232 Nov 14 19:41 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_RU-Zot/01/job.status
-rw-r--r-- 1 pmcguire users 232 Nov 14 19:41 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_IE-Dri/01/job.status
-rw-r--r-- 1 pmcguire users 232 Nov 14 19:41 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_PT-Esp/01/job.status
-rw-r--r-- 1 pmcguire users 233 Nov 14 19:41 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_PT-Mi2/01/job.status
-rw-r--r-- 1 pmcguire users 232 Nov 14 19:41 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_IT-Cpz/01/job.status
-rw-r--r-- 1 pmcguire users 233 Nov 14 19:41 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_US-SP2/01/job.status
-rw-r--r-- 1 pmcguire users 233 Nov 14 19:42 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_CH-Fru/01/job.status
-rw-r--r-- 1 pmcguire users 232 Nov 14 19:42 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_AU-Whr/01/job.status
-rw-r--r-- 1 pmcguire users 233 Nov 14 19:42 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_PL-wet/01/job.status
-rw-r--r-- 1 pmcguire users 233 Nov 14 19:42 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_AU-Tum/01/job.status
-rw-r--r-- 1 pmcguire users 232 Nov 14 19:42 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_AU-GWW/01/job.status
-rw-r--r-- 1 pmcguire users 233 Nov 14 19:43 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_ES-LMa/01/job.status
-rw-r--r-- 1 pmcguire users 232 Nov 14 19:43 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_AU-Dry/01/job.status
-rw-r--r-- 1 pmcguire users 233 Nov 14 19:43 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_AU-Otw/01/job.status
-rw-r--r-- 1 pmcguire users 232 Nov 14 19:44 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_SD-Dem/01/job.status
-rw-r--r-- 1 pmcguire users 233 Nov 14 19:44 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_CA-NS1/01/job.status
-rw-r--r-- 1 pmcguire users 232 Nov 14 19:44 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_FI-Sod/01/job.status
-rw-r--r-- 1 pmcguire users 232 Nov 14 19:45 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_ES-LgS/01/job.status
-rw-r--r-- 1 pmcguire users 233 Nov 14 19:46 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_SE-Deg/01/job.status
-rw-r--r-- 1 pmcguire users 232 Nov 14 19:47 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_US-Syv/01/job.status
-rw-r--r-- 1 pmcguire users 233 Nov 14 19:47 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_CA-NS5/01/job.status
-rw-r--r-- 1 pmcguire users 232 Nov 14 19:47 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_CA-NS4/01/job.status
-rw-r--r-- 1 pmcguire users 232 Nov 14 19:47 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_IT-MBo/01/job.status
-rw-r--r-- 1 pmcguire users 233 Nov 14 19:47 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_AU-DaS/01/job.status
-rw-r--r-- 1 pmcguire users 232 Nov 14 19:48 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_US-ARM/01/job.status
-rw-r--r-- 1 pmcguire users 232 Nov 14 19:48 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_RU-Che/01/job.status
-rw-r--r-- 1 pmcguire users 233 Nov 14 19:48 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_AT-Neu/01/job.status
-rw-r--r-- 1 pmcguire users 232 Nov 14 19:49 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_US-Me4/01/job.status
-rw-r--r-- 1 pmcguire users 233 Nov 14 19:50 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_AU-Emr/01/job.status
-rw-r--r-- 1 pmcguire users 232 Nov 14 19:50 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_AU-Gin/01/job.status
-rw-r--r-- 1 pmcguire users 232 Nov 14 19:50 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_CA-NS6/01/job.status
-rw-r--r-- 1 pmcguire users 233 Nov 14 19:50 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_CA-NS2/01/job.status
-rw-r--r-- 1 pmcguire users 233 Nov 14 19:50 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_AU-Wrr/01/job.status
-rw-r--r-- 1 pmcguire users 233 Nov 14 19:50 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_UK-Gri/01/job.status
-rw-r--r-- 1 pmcguire users 232 Nov 14 19:50 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_CA-Qcu/01/job.status
-rw-r--r-- 1 pmcguire users 232 Nov 14 19:50 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_CZ-wet/01/job.status
-rw-r--r-- 1 pmcguire users 233 Nov 14 19:51 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_US-Me6/01/job.status
-rw-r--r-- 1 pmcguire users 233 Nov 14 19:51 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_IE-Ca1/01/job.status
-rw-r--r-- 1 pmcguire users 232 Nov 14 19:51 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_US-Ha1/01/job.status
-rw-r--r-- 1 pmcguire users 233 Nov 14 19:51 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_FR-Gri/01/job.status
-rw-r--r-- 1 pmcguire users 232 Nov 14 19:52 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_US-Aud/01/job.status
-rw-r--r-- 1 pmcguire users 233 Nov 14 19:52 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_IT-LMa/01/job.status
-rw-r--r-- 1 pmcguire users 233 Nov 14 19:52 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_DE-Seh/01/job.status
-rw-r--r-- 1 pmcguire users 233 Nov 14 19:53 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_DE-Gri/01/job.status
-rw-r--r-- 1 pmcguire users 233 Nov 14 19:53 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_CN-HaM/01/job.status
-rw-r--r-- 1 pmcguire users 233 Nov 14 19:54 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_IT-CA2/01/job.status
-rw-r--r-- 1 pmcguire users 233 Nov 14 19:54 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_IT-CA1/01/job.status
-rw-r--r-- 1 pmcguire users 232 Nov 14 19:54 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_AR-SLu/01/job.status
-rw-r--r-- 1 pmcguire users 233 Nov 14 19:55 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_US-MOz/01/job.status
-rw-r--r-- 1 pmcguire users 233 Nov 14 19:55 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_ZM-Mon/01/job.status
-rw-r--r-- 1 pmcguire users 233 Nov 14 19:56 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_US-Tw4/01/job.status
-rw-r--r-- 1 pmcguire users 233 Nov 14 19:56 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_ES-ES2/01/job.status
-rw-r--r-- 1 pmcguire users 232 Nov 14 19:56 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_AU-Stp/01/job.status
-rw-r--r-- 1 pmcguire users 233 Nov 14 19:56 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_IT-CA3/01/job.status
-rw-r--r-- 1 pmcguire users 232 Nov 14 19:57 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_US-SRG/01/job.status
-rw-r--r-- 1 pmcguire users 233 Nov 14 19:58 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_FI-Kaa/01/job.status
-rw-r--r-- 1 pmcguire users 233 Nov 14 19:58 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_AU-Cum/01/job.status
-rw-r--r-- 1 pmcguire users 233 Nov 14 20:00 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_AU-DaP/01/job.status
-rw-r--r-- 1 pmcguire users 232 Nov 14 20:00 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_FR-Fon/01/job.status
-rw-r--r-- 1 pmcguire users 233 Nov 14 20:01 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_US-Myb/01/job.status
-rw-r--r-- 1 pmcguire users 233 Nov 14 20:01 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_IT-SRo/01/job.status
-rw-r--r-- 1 pmcguire users 233 Nov 14 20:01 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_DE-Meh/01/job.status
-rw-r--r-- 1 pmcguire users 231 Nov 14 20:02 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_JP-SMF/01/job.status
-rw-r--r-- 1 pmcguire users 233 Nov 14 20:02 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_US-Whs/01/job.status
-rw-r--r-- 1 pmcguire users 232 Nov 14 20:02 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_US-Var/01/job.status
-rw-r--r-- 1 pmcguire users 233 Nov 14 20:03 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_UK-PL3/01/job.status
-rw-r--r-- 1 pmcguire users 232 Nov 14 20:03 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_ES-VDA/01/job.status
-rw-r--r-- 1 pmcguire users 233 Nov 14 20:04 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_BW-Ma1/01/job.status
-rw-r--r-- 1 pmcguire users 232 Nov 14 20:05 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_IT-Non/01/job.status
-rw-r--r-- 1 pmcguire users 233 Nov 14 20:05 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_US-FPe/01/job.status
-rw-r--r-- 1 pmcguire users 231 Nov 14 20:05 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_IT-Ren/01/job.status
-rw-r--r-- 1 pmcguire users 232 Nov 14 20:05 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_FR-Pue/01/job.status
-rw-r--r-- 1 pmcguire users 233 Nov 14 20:06 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_US-KS2/01/job.status
-rw-r--r-- 1 pmcguire users 233 Nov 14 20:06 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_AU-Rob/01/job.status
-rw-r--r-- 1 pmcguire users 232 Nov 14 20:07 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_US-UMB/01/job.status
-rw-r--r-- 1 pmcguire users 232 Nov 14 20:07 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_US-PFa/01/job.status
-rw-r--r-- 1 pmcguire users 233 Nov 14 20:07 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_IT-Col/01/job.status
-rw-r--r-- 1 pmcguire users 233 Nov 14 20:07 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_US-AR1/01/job.status
-rw-r--r-- 1 pmcguire users 232 Nov 14 20:07 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_AU-Ctr/01/job.status
-rw-r--r-- 1 pmcguire users 233 Nov 14 20:08 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_US-SP3/01/job.status
-rw-r--r-- 1 pmcguire users 232 Nov 14 20:08 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_US-Bo1/01/job.status
-rw-r--r-- 1 pmcguire users 232 Nov 14 20:08 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_HU-Bug/01/job.status
-rw-r--r-- 1 pmcguire users 233 Nov 14 20:09 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_DE-Tha/01/job.status
-rw-r--r-- 1 pmcguire users 233 Nov 14 20:10 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_IT-Ro1/01/job.status
-rw-r--r-- 1 pmcguire users 232 Nov 14 20:10 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_US-NR1/01/job.status
-rw-r--r-- 1 pmcguire users 233 Nov 14 20:10 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_FI-Lom/01/job.status
-rw-r--r-- 1 pmcguire users 233 Nov 14 20:10 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_ES-ES1/01/job.status
-rw-r--r-- 1 pmcguire users 233 Nov 14 20:10 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_CH-Oe1/01/job.status
-rw-r--r-- 1 pmcguire users 233 Nov 14 20:11 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_NL-Hor/01/job.status
-rw-r--r-- 1 pmcguire users 233 Nov 14 20:11 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_IT-Noe/01/job.status
-rw-r--r-- 1 pmcguire users 233 Nov 14 20:11 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_AU-Sam/01/job.status
-rw-r--r-- 1 pmcguire users 232 Nov 14 20:12 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_NL-Loo/01/job.status
-rw-r--r-- 1 pmcguire users 233 Nov 14 20:12 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_BR-Sa3/01/job.status
-rw-r--r-- 1 pmcguire users 233 Nov 14 20:12 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_AU-Cow/01/job.status
-rw-r--r-- 1 pmcguire users 233 Nov 14 20:12 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_US-Los/01/job.status
-rw-r--r-- 1 pmcguire users 233 Nov 14 20:12 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_US-SRM/01/job.status
-rw-r--r-- 1 pmcguire users 231 Nov 14 20:13 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_CN-Du2/01/job.status
-rw-r--r-- 1 pmcguire users 233 Nov 14 20:13 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_CH-Cha/01/job.status
-rw-r--r-- 1 pmcguire users 232 Nov 14 20:17 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_DK-ZaH/01/job.status
-rw-r--r-- 1 pmcguire users 233 Nov 14 20:17 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_GF-Guy/01/job.status
-rw-r--r-- 1 pmcguire users 232 Nov 14 20:19 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_US-Blo/01/job.status
-rw-r--r-- 1 pmcguire users 232 Nov 14 20:20 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_DE-Obe/01/job.status
-rw-r--r-- 1 pmcguire users 232 Nov 14 20:20 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_US-Me2/01/job.status
-rw-r--r-- 1 pmcguire users 233 Nov 14 20:21 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_US-Ton/01/job.status
-rw-r--r-- 1 pmcguire users 233 Nov 14 20:21 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_CH-Dav/01/job.status
-rw-r--r-- 1 pmcguire users 233 Nov 14 20:21 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_US-Wkg/01/job.status
-rw-r--r-- 1 pmcguire users 233 Nov 14 20:21 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_DE-Kli/01/job.status
-rw-r--r-- 1 pmcguire users 227 Nov 14 20:22 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_BE-Vie/01/job.status
-rw-r--r-- 1 pmcguire users 233 Nov 14 20:22 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_DE-Geb/01/job.status
-rw-r--r-- 1 pmcguire users 233 Nov 14 20:22 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_US-MMS/01/job.status
-rw-r--r-- 1 pmcguire users 232 Nov 14 20:22 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_FR-LBr/01/job.status
-rw-r--r-- 1 pmcguire users 233 Nov 14 20:23 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_AU-How/01/job.status
-rw-r--r-- 1 pmcguire users 233 Nov 14 20:24 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_IT-Lav/01/job.status
-rw-r--r-- 1 pmcguire users 231 Nov 14 20:25 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_US-Ho1/01/job.status
-rw-r--r-- 1 pmcguire users 233 Nov 14 20:26 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_US-WCr/01/job.status
-rw-r--r-- 1 pmcguire users 232 Nov 14 20:28 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_BE-Lon/01/job.status
-rw-r--r-- 1 pmcguire users 233 Nov 14 20:29 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_DE-Hai/01/job.status
-rw-r--r-- 1 pmcguire users 232 Nov 14 20:34 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_FR-Hes/01/job.status
-rw-r--r-- 1 pmcguire users 232 Nov 14 20:35 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_RU-Fyo/01/job.status
-rw-r--r-- 1 pmcguire users 233 Nov 14 20:35 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_DK-Sor/01/job.status
-rw-r--r-- 1 pmcguire users 232 Nov 14 20:37 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_BE-Bra/01/job.status
-rw-r--r-- 1 pmcguire users 232 Nov 14 20:37 /home/users/pmcguire/cylc-run/u-cr731/log/job/1/jules_FI-Hyy/01/job.status

Hi Patrick,

OK I will test the full suite again.

In terms of run time, yes and no. Partly, this is me simply trying to learn how the system works and where the bottlenecks are, because if I want to translate some of the science advances we’ve embedded in CABLE into JULES, then I need to know it isn’t going to take 20 minutes to run one site. How is that going to scale if I want to run a global application? Ultimately, whether it takes 2 minutes or 1 day isn’t game-changing (so long as it does run), but it does seem strange that I can run something so much more quickly on a laptop. So my curiosity is as much about making sure I don’t have something silly set up in my JASMIN environment that is causing the issue, as distinct from “it always takes this long”. I don’t currently have the knowledge base to determine this, as the last time I was running JULES for science it was on the CEH cluster and things ran seamlessly.

Do you have an example suite where you submitted, say, N at a time and waited until they had finished, so I can see how to amend the current suite?

Thanks again for the assistance,

M

Hi Martin:
It looks like your suite is already set up to limit itself to 50 jobs at a time; see the line “limit = 50” in your suite.rc code. Maybe 25 would be better? Just guessing, especially if you’re sending all the output to the same pw disk. This limit doesn’t apply, I guess, if you are running multiple copies of the suite at the same time. Also, the JULES FLUXNET-tower suite u-al752 sets a limit of 24 JULES jobs, so my guess wasn’t too far off.
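
For reference, here is a minimal sketch of where that setting lives (assuming the usual Cylc 7 suite.rc layout; the value of 25 is just my guess from above):

[scheduling]
    [[queues]]
        [[[default]]]
            limit = 25    # run at most 25 jules tasks at any one time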

So is it all working tolerably well right now?

Running globally on a grid is a different setup from running a set of individual sites. So getting experience with running regionally or globally on a grid is necessary before extrapolating to larger grids or longer runs.
Patrick

Hi Martin:
Were you able to run the complete set yet? Did you try limit=25?

I also wanted to make sure that you are aware that files on scratch-pw and scratch-nopw are automatically deleted after 1 month. So if you produce any important data there, you should copy it to a group workspace (or similar) within a month, at the latest.
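
For example (the paths here are hypothetical; substitute your own output directory and group workspace):

rsync -av /work/scratch-pw/$USER/u-cr731_output/ /gws/nopw/j04/<your_gws>/$USER/u-cr731_output/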

Also, from the official JASMIN email about next week’s planned maintenance, you should be aware of this:

"2. New scratch area and retiring /work/scratch/pw
A new 1 PB scratch volume will be made available at /work/scratch-pw3, which will replace the old volume /work/scratch-pw:

  • /work/scratch-pw will be made read-only from 22nd November, with no further access from 15th December
  • The available scratch volumes from then on will be /work/scratch-pw2 and /work/scratch-pw3, both 1 PB in size
  • Any important data on /work/scratch-pw should be removed from that location before 15th December 2022"

Patrick

Hi Patrick,

It’s a bit hard to tell: the way this suite runs, you end up with 150-odd log files and site outputs, so it isn’t obvious how to quickly summarise whether all sites finished (I set it running and went home). I will pull the files locally and make some plots, but it looks promising. I think you can close the ticket; it must have been a system issue on that day, sorry.

Martin

Hi Martin:
If you want to find out the range of end times of the 170 jules jobs, just look at the spread of timestamps for the log files:

ls -ltr ~mdekauwe/cylc-run/u-cr731/log/job/1/jules*/01/job.status

With a little more Linux, you can find out other things:

grep CYLC_JOB_INIT_TIME ~mdekauwe/cylc-run/u-cr731/log/job/1/jules*/01/job.status
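
And to check quickly whether every site finished (using the job.status format shown earlier in this ticket), you can count the successes:

grep -l "CYLC_JOB_EXIT=SUCCEEDED" ~mdekauwe/cylc-run/u-cr731/log/job/1/jules*/01/job.status | wc -l

or list any job.status files that are missing that line:

grep -L "CYLC_JOB_EXIT=SUCCEEDED" ~mdekauwe/cylc-run/u-cr731/log/job/1/jules*/01/job.status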

Patrick

Hi again, Martin:
Also, I see that you’ve tarred up the output NetCDF data for all 170 sites of this PLUMBER2 suite from Heather Rumbold Ashton, presumably so that you can plot and analyze it on your Mac. That’s reasonable, since your Mac may be more familiar to you.

I should note that the JULES FLUXNET suite u-al752, from Karina Williams and Anna Harper, can easily do the plotting on JASMIN for 70-some sites and all the variables of interest, plotting the site observations overlaid on time-series plots of the JULES model output. You recently got this u-al752 suite working, too. Have you looked at the plots from it?

The u-al752 suite has this extra plotting capability, which you expressed interest in doing manually (in your last comment here), and it does the plotting more or less automatically on JASMIN. I am also much more involved with the u-al752 suite than with the PLUMBER2 suite, so I could help out more there, if you need or want help.
Patrick

Thanks Patrick, noted.

I have a lot of existing code. I’m planning to compare the suites, but given that the PLUMBER2 suite is exactly aligned with my working practice in Oz, it is a closer natural fit. I’ll also note that my understanding is that the next FLUXNET release will use NetCDF files and the pipeline that PLUMBER proposed, so it may have more longevity… but who knows.

Thanks, Martin

I don’t know the differences between the u-al752 JULES FLUXNET suite and the PLUMBER2 JULES FLUXNET suite. How is the PLUMBER2 suite different from the u-al752 suite, such that it is more naturally aligned with your working practices?
Patrick

Hi again, Martin:
Since you have a lot of pre-existing analysis code, you could consider getting more bang for your buck by running it on JASMIN, directly on the output of the PLUMBER2 or u-al752 suites. One way to do this, or at least to get started, is with the JASMIN Jupyter Notebook Service.
Patrick
