JULES on JASMIN - fcm fail

I’ve tried to re-run an old JULES Rose suite using my original suite.rc file, but when running the fcm_make part it complains that it can’t find the contrib/gnu/gcc/7.3.0 module. I did a quick check on JASMIN using “module avail” and noticed that this module no longer exists, so I switched it for gcc/8.2.0, which does exist; however, I’m now getting a lot of Fortran errors:

[FAIL] mpif90 -oo/water_constants_mod.o -c -DSCMA -DBL_DIAG_HACK -DINTEL_FORTRAN -I./include -I/gws/nopw/j04/jules/admin/netcdf/local_nc_par/3.1.1/intel.19.0.0//include -heap-arrays -fp-model precise -traceback /home/users/jmac87/cylc-run/u-ck523/share/fcm_make/preprocess/src/jules/src/params/standalone/water_constants_mod_jls.F90 # rc=127
[FAIL] mpif90: error while loading shared libraries: libiomp5.so: cannot open shared object file: No such file or directory
[FAIL] compile 0.0 ! water_constants_mod.o ← jules/src/params/standalone/water_constants_mod_jls.F90
[FAIL] mpif90 -oo/switches.o -c -DSCMA -DBL_DIAG_HACK -DINTEL_FORTRAN -I./include -I/gws/nopw/j04/jules/admin/netcdf/local_nc_par/3.1.1/intel.19.0.0//include -heap-arrays -fp-model precise -traceback /home/users/jmac87/cylc-run/u-ck523/share/fcm_make/preprocess/src/jules/src/control/shared/switches.F90 # rc=127

My hunch is that this has something to do with the various Intel/compiler modules that I’m loading under the [[JASMIN]] section of my suite.rc file (pasted below). For some reason, these no longer work. I see there is now an intel.20.0.0 – maybe I need to use this? Or perhaps there is something else I need to change in my suite.rc file. Any ideas?
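For reference, the quick check I did was along these lines (what it prints depends on what JASMIN currently provides):

module avail contrib/gnu/gcc
module avail gcc
module avail intel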

Many thanks for your help with this
Jon

[runtime]
    [[root]]
        script = rose task-run --verbose
        [[[events]]]
            mail events = submission failed, submission timeout, failed, timeout, succeeded

[[JASMIN]]
    env-script = """
            eval $(rose task-env)
            export PATH=/apps/jasmin/metomi/bin:$PATH
            module load jaspy
            module load intel/19.0.0
            module load contrib/gnu/gcc/7.3.0
            module load eb/OpenMPI/intel/3.1.1
            module list 2>&1
            export NETCDF_FORTRAN_ROOT=/gws/nopw/j04/jules/admin/netcdf/local_nc_par/3.1.1/intel.19.0.0/
            export NETCDF_ROOT=/gws/nopw/j04/jules/admin/netcdf/local_nc_par/3.1.1/intel.19.0.0/
            export HDF5_LIBDIR=/gws/nopw/j04/jules/admin/netcdf/local_nc_par/3.1.1/intel.19.0.0/lib
            export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HDF5_LIBDIR
            env | grep LD_LIBRARY_PATH
            """

    [[[job]]]
        batch system = slurm
    [[[environment]]]
        NETCDF_FORTRAN_ROOT=/gws/nopw/j04/jules/admin/netcdf/local_nc_par/3.1.1/intel.19.0.0/
        NETCDF_ROOT=/gws/nopw/j04/jules/admin/netcdf/local_nc_par/3.1.1/intel.19.0.0/

[[fcm_make]]
    inherit = None, JASMIN
    [[[job submission]]]
        method = background
    [[[environment]]]
        ROSE_TASK_N_JOBS = 4
    [[[directives]]]
        --constraint="ivybridge128G|skylake348G|broadwell256G"
        #--partition = short-serial
        --time = 00:5:00
        --ntasks = 1

Hi Jonathan:
I see that you are having trouble loading the module contrib/gnu/gcc/7.3.0, which is needed to compile JULES. I have the same trouble, since this module isn’t there, so this is an issue that affects all JULES users on JASMIN. I checked, and at least according to JULES trunk v7.0 (/home/users/pmcguire/jules/jules-vn7.0/rose-stem/include/jasmin/runtime.rc), this is the module that is needed. I also checked, and I don’t see anything in /apps/jasmin/modulefiles/contrib for gnu/gcc or for gnu, if that’s where they’re supposed to be.
I have brought this to the attention of a couple of colleagues in NCAS CMS. I have also emailed JASMIN support about this. We will keep you posted.
Patrick

Thanks Patrick - FYI my particular JULES Rose suite is configured to use JULES v6.0. I assume the need for gcc 7.3.0 is the same for this version (at least it used to compile when I had access to 7.3.0).

Also, just to note: I tried gcc 7.2.0, which does show as available in the current list of modules on JASMIN, and got the same error as with gcc 8.2.0.

cheers

Jon

I’m pretty sure that the compiler install has an issue. For example, if you try ‘module load eb/OpenMPI/intel/4.1.0’,

then you should be able to use mpif90 - but you can’t, because of errors of the type where it can’t find ‘libiomp5.so’. I’ve had a quick look at where this stuff lives, and for example:

‘ls -lht /apps/sw/eb/software/OpenMPI/4.1.0-iccifort-2018.3.222-GCC-7.3.0-2.30/lib64’
shows a dead link. So I think that perhaps JASMIN have been updating compilers, and maybe the libraries being pointed to were deleted?
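A quick way to confirm which entries in that directory are dangling symlinks (a sketch, using GNU find on the same path):

# list any symlinks whose targets no longer exist
find /apps/sw/eb/software/OpenMPI/4.1.0-iccifort-2018.3.222-GCC-7.3.0-2.30/lib64 -xtype l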

This is a bit of a guess - the gcc seems also to have gone, as you note.
Hopefully we can hear something soon, as this may be important for others (or I may be being daft… we’ll see).


Hi Jonathan & David
This is the response I just got from the CEDA JASMIN support email helpdesk:
"The module and the GNU software under contrib/gnu/gcc/7.3.0 were not migrated to the new partition area as we thought that they were redundant. All GNU compilers are now available via the JASPY environment.

Could [the user] try and build JULES using the GNU compiler provided by JASPY?
Is it a parallel version of JULES?

For example GNU Fortran (conda-forge gcc 12.1.0-16) 12.1.0 is available via the default JASPY environment. Earlier GNU versions are available from previous JASPY environments."
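If you want to try that, a quick check of what jaspy provides would be something like this (a sketch; the module name is the default jaspy mentioned in the helpdesk reply):

module load jaspy
gfortran --version        # should report the conda-forge GCC build mentioned above
command -v gfortran mpif90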

Can you try that?
Patrick

It’s the Intel side that’s the bigger problem. When I tried to load that, anything using mpif90 failed.

I’m sorry you’re stuck in the middle of this. It’s a pain.

cc me into your emails with JASMIN if you’d like to

Hi Jonathan:
Is it working properly yet?
Patrick

Hi Patrick,

Yes, I’ve just tried again now that Fatima has migrated contrib/gnu/gcc/7.3.0, and it is working.

cheers

Jon

Excellent, Jonathan!

You mentioned in an email that your SLURM settings in your suite aren’t working now.
Did you try the suggestion I sent in the email?

Patrick

Hi Jonathan:
(responding to a CEDA JASMIN email support ticket):
You do need to make sure that you do this in your JULES suite prior to running the jules.exe:

module load intel/19.0.0
module load contrib/gnu/gcc/7.3.0
module load eb/OpenMPI/intel/3.1.1
export NETCDF_FORTRAN_ROOT=/gws/nopw/j04/jules/admin/netcdf/local_nc_par/3.1.1/intel.19.0.0/
export NETCDF_ROOT=/gws/nopw/j04/jules/admin/netcdf/local_nc_par/3.1.1/intel.19.0.0/
export HDF5_LIBDIR=/gws/nopw/j04/jules/admin/netcdf/local_nc_par/3.1.1/intel.19.0.0/lib
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HDF5_LIBDIR

When I tried to do what the CEDA JASMIN support person did, i.e., run ldd on the jules.exe without first doing the module loads and exports above (either in the suite, preferably, or before running the suite), I got the same error message about missing libraries that they did. When I do the module loads and exports before running ldd, everything looks fine.
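For reference, the check is along these lines (a sketch; the path assumes the usual share/fcm_make/build/bin location under the suite's cylc-run directory):

# with the module loads and exports above already done in the same shell:
ldd ~/cylc-run/u-ck523/share/fcm_make/build/bin/jules.exe | grep "not found"
# no output means every shared library (including libiomp5.so) is being resolved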

Patrick

Hi Jonathan:
What is the suite number that you’re working on? If this is still troubling you, I can look directly at the suite and maybe figure out what is going on.
Patrick

Many thanks Patrick, I’d really appreciate that.

I’ve just committed some changes to the suite to include a small example input dataset, so it should be ready to run. The suite number is u-ck523. You’ll see on running it that it performs one fcm_make job to compile JULES. It then runs the same JULES model multiple times (with different climate inputs), so for the JULES section you’ll see it submit >100 jobs. All of these fail, apparently because they exceed the runtime limit. It never used to do this for this small example.

Cheers

Jon

Hi Jon:
I am trying to run your suite u-ck523 now. The fcm_make task/app succeeded. But the first time that it tried to run the jules tasks/apps on the short-serial queue/partition, the submission failed. The job-activity.log (~pmcguire/cylc-run/u-ck523/log/job/1/jules_0/01/job-activity.log) said:

sbatch: error: Batch job submission failed: Invalid feature specification.

This was easy to trace down, since the job script (~/cylc-run/u-ck523/log/job/1/jules_0/01/job) had:
#SBATCH --constraint=ivybridge128G|skylake348G|broadwell256G

The processor type ivybridge128G is out of date for JASMIN (see: LOTUS cluster specification - JASMIN help docs).

It would be better to use --constraint="intel" for the jules app in your suite.rc file.
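In the suite.rc that’s a one-line change in the jules app’s [[[directives]]] (a sketch, following the directive style already used for fcm_make):

    [[[directives]]]
        --constraint = "intel"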

The 2nd time that I ran the jules app (after making this change and reloading the suite and retriggering the jules family of apps), I got this error in ~pmcguire/cylc-run/u-ck523/log/job/1/jules_130/02/job.err:
/var/spool/slurmd/job21656806/slurm_script: line 93: /home/users/pmcguire/cylc-run/u-ck523/bin/prep_jules_clim_cmip_run.py: Permission denied
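The fix for that is just making the script executable, along these lines (using the path from the error above):

chmod u+x /home/users/pmcguire/cylc-run/u-ck523/bin/prep_jules_clim_cmip_run.py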

After fixing those permissions, I then got this permissions error in ~pmcguire/cylc-run/u-ck523/log/job/1/jules_132/02/job.err:

PermissionError: [Errno 13] Permission denied: '/gws/nopw/j04/rahu/jules_oggm_cmip5/cmip_clim_sw_lw_u2_p_sh_prov_canchis_norte_extra.pkl'

I don’t have access to the rahu GWS, so I can’t do much there. Should I request permission to access the rahu GWS? This processing with the rahu GWS is done before executing jules.exe.
Patrick

I also note that your log file (~jmac87/cylc-run/u-ck523/log/job/1/jules_132/01/job.out) says:

Currently Loaded Modules:
  1) jaspy/3.10/r20220721   5) contrib/gnu/binutils/2.31
  2) intel/cce/19.0.0       6) contrib/gnu/gcc/7.3.0
  3) intel/fce/19.0.0       7) eb/OpenMPI/intel/3.1.1
  4) intel/19.0.0

And my log file (~pmcguire/cylc-run/u-ck523/log/job/1/jules_132/02/job.out) says:

  1) intel/14.0                       8) intel/cce/19.0.0
  2) libPHDF5/intel/14.0/1.8.12       9) intel/fce/19.0.0
  3) libpnetcdf/intel/14.0/1.5.0     10) intel/19.0.0
  4) netcdf/intel/14.0/4.3.2         11) contrib/gnu/binutils/2.31
  5) netcdff/intel/14.0/4.2          12) contrib/gnu/gcc/7.3.0
  6) parallel-netcdf/intel/20141122  13) eb/OpenMPI/intel/3.1.1
  7) jaspy/3.10/r20220721

So you’re not successfully loading the first 6 modules that my version of your suite loads (when run from my account). This should be fixed.
Patrick

Hi Jonathan:
Those extra 6 modules are coming from my .bash_profile, which does a module add parallel-netcdf/intel. I think this is from a previous set-up. I am now trying to run a branch of the u-al752 suite without this setting in my .bash_profile; I expect it will work just as it did before, when I had the setting.
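For anyone checking whether a login script is injecting module loads like this, a grep along these lines is enough (a sketch):

grep -n "module " ~/.bash_profile ~/.bashrc 2>/dev/null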
Patrick

Hi Jonathan:
I hacked my version of your u-ck523 suite (see: ~pmcguire/roses/u-ck523; I also made a copy of your u-ck523 with these mods, and it is checked in as u-cr771). The new version has several changes, including:
– skipping an extra module load jaspy in the script of the [[jules]] section, since there seems to already be a module load jaspy inherited from the [[JASMIN]] section.
– commenting out the script that uses the rahu GWS, since I don’t have access to that.

Maybe this version will be able to run the jules.exe (and maybe actually give a JULES error message).
Patrick

Hi Jonathan:
The hacked version of u-ck523 mentioned in the previous entry here does indeed run jules.exe (see: ~pmcguire/roses/u-ck523; I also made a copy of your u-ck523 with these mods, and it is checked in as u-cr771). It also gives proper JULES error messages when it fails, presumably since I am skipping prep_jules_clim_cmip_run.py $I_GLACIER $I_CMIP_RUN ${REGION_NAME}. I am skipping that because I don’t have access to the rahu GWS. When this Python script is not skipped, the suite will probably run further than it did for you previously.

I am currently doing a test run with everything the same, except not skipping the extra module load jaspy, in ~pmcguire/roses/u-cr771b. It looks like these jules app runs are getting stuck like they did for you, rather than failing immediately like they did for me previously when I also skipped the module load jaspy (that immediate failure being, in my case, due to skipping the use of the rahu GWS data).

You can look at the log files in ~pmcguire/roses/u-cr771b, ~pmcguire/roses/u-cr771, and ~pmcguire/roses/u-ck523.

But I recommend getting rid of the extra module load jaspy. It was overriding the module loads that you made after your first module load jaspy.
Patrick

Hi again Jonathan:
I also note that my log file, in which jules partially runs successfully (~/cylc-run/u-cr771/log/job/1/jules_0/01/job.out), has this mpirun path:

[INFO] Running JULES in parallel MPI mode
[INFO] exec /apps/sw/eb/software/OpenMPI/3.1.1-iccifort-2018.3.222-GCC-7.3.0-2.30/bin/mpirun /home/users/pmcguire/cylc-run/u-cr771/share/fcm_make/build/bin/jules.exe

whereas, your log file (~jmac87/cylc-run/u-ck523/log/job/1/jules_0/01/job.out) has this mpirun path:

[INFO] Running JULES in parallel MPI mode
[INFO] exec /apps/jasmin/jaspy/miniconda_envs/jaspy3.10/m3-4.9.2/envs/jaspy3.10-m3-4.9.2-r20220721/bin/mpirun /home/users/jmac87/cylc-run/u-ck523/share/fcm_make/build/bin/jules.exe

I think the OpenMPI path is preferred over the jaspy path.
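A quick way to check which mpirun a job will pick up, once the env-script module loads have run, is something like:

command -v mpirun
# should report the eb/OpenMPI one, i.e.
#   /apps/sw/eb/software/OpenMPI/3.1.1-iccifort-2018.3.222-GCC-7.3.0-2.30/bin/mpirun
# rather than the jaspy conda one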
Patrick

Hi Patrick and JASMIN Helpdesk agent,

Thanks so much for the help with this. I commented out the additional “module load jaspy” commands in my suite.rc and it’s working!

All the best

Jon