Hi Simon
Thanks for the help and explanations! Much appreciated. Let me know what you find out from JASMIN support.
Patrick
Hi again, Simon:
I just read your reply again. I think you’re referring to the exec program that is used by the rose/cylc suite to run JULES (exec rose mpi-launch -v jules.exe). At first reading, I parsed ‘Intel built exec’ as ‘Intel built executable’. I understand a bit better now.
Patrick
Hi Patrick,
I did mean jules.exe, not “exec”. I think I’ve tracked the issue back to its source.
Both OpenMPI versions are built with easybuild. The respective dirs are:
/apps/sw/eb/software/OpenMPI/3.1.1-GCC-7.3.0-2.30/easybuild
and
/apps/sw/eb/software/OpenMPI/3.1.1-iccifort-2018.3.222-GCC-7.3.0-2.30/easybuild
I grepped the build logs for “-xHost”, which causes the Intel compiler to use the instruction set of the build processor. It appears in the Intel version, but not in the GCC version. So the JASMIN-provided MPI is bespoke for Intel architectures only, because it was built with the “-xHost” switch on an Intel system. No Intel-built OpenMPI commands (mpirun, mpif90, mpicc…) work on AMD machines. They need to rebuild the Intel-compiled software stack without “-xHost”.
Simon.
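The log search described above can be sketched as a small shell helper. This is only an illustration: the function name `files_with_flag` is hypothetical, and the names of the log files inside the easybuild directories are not given in this thread, so the helper simply searches every file under the directory it is given.

```shell
#!/bin/sh
# List files under a directory that mention a given compiler flag.
# Usage: files_with_flag DIR FLAG
# (Hypothetical helper; not part of EasyBuild or of the JASMIN setup.)
files_with_flag() {
    dir=$1
    flag=$2
    # -r: recurse, -l: print matching file names only, -F: fixed string,
    # -e: treat the next argument as the pattern even though it starts with "-".
    grep -rlF -e "$flag" "$dir" 2>/dev/null | sort
}
```

Calling `files_with_flag <dir> -xHost` on each of the two easybuild directories listed above should, if the finding in this post holds, print file names for the iccifort build and nothing for the GCC build.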
Hi Simon:
Thanks for the clarification, and for figuring out the build problem. I presume that you have passed the build problem on to the JASMIN team. Please do let me know when they fix it, ok?
Thanks,
Patrick
Hi Simon:
Any news from the JASMIN team about this?
Patrick
Hi Patrick,
Yes,
This arrived whilst I was on leave:
Hi Simon,
Thank you for the update
I very much appreciate your time and effort to investigate the issue and
identify the root cause of the problem with the Intel MPI on the AMD node.
We did not realise that the MPI application was built in the Intel
compiler environment and is limited to the Intel processor nodes.
Both versions of MPI, eb/OpenMPI/intel/3.1.1 and
eb/OpenMPI/intel/4.1.0, need to be recompiled without the -xHost
flag. I will escalate the issue and update you when I can.
Some background info that might have hindered identifying the issue
earlier on:
Previously, LOTUS compute nodes were of Intel node type with different
Intel processor models. This necessitated defining host groups which
were used to homogeneously specify nodes for MPI parallel jobs in
particular.
When the new node type AMD was introduced to LOTUS and some of the old
Intel node types were gradually removed and retired, the uptake for AMD
was still very low. Users continued to use Intel and simply updated the
constraint flag.
The JASMIN infrastructure team checked any likely compatibility issues
before buying the AMD nodes and was informed that there would be no
compatibility issues. Code compiled on an Intel node will run fine on
AMD hosts unless it was explicitly compiled with MMX instruction sets.
Regards,
Fatima
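Since the problem comes down to which instruction sets a node's processor supports, one way to compare an Intel node with an AMD node is to look at the vendor and flags lines of /proc/cpuinfo on each. The sketch below assumes a Linux host. The function names are hypothetical, and the flag to test for (for example avx512f) is an assumption, because the thread does not say which instructions the -xHost build actually used.

```shell
#!/bin/sh
# Print the CPU vendor string from a cpuinfo-style file (default: /proc/cpuinfo).
cpu_vendor() {
    file=${1:-/proc/cpuinfo}
    # Take the first "vendor_id" line and strip everything up to the value.
    sed -n 's/^vendor_id[[:space:]]*:[[:space:]]*//p' "$file" | head -n 1
}

# Succeed if FLAG appears as a whole word in the first "flags" line.
# Usage: cpu_has_flag FLAG [FILE]
cpu_has_flag() {
    flag=$1
    file=${2:-/proc/cpuinfo}
    sed -n 's/^flags[[:space:]]*:[[:space:]]*//p' "$file" | head -n 1 |
        tr ' ' '\n' | grep -qxF -- "$flag"
}
```

Running `cpu_vendor` and, say, `cpu_has_flag avx512f && echo yes` inside a job on each node type shows whether the build host offers an instruction set that the run host lacks.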
Thanks, Simon!
Hopefully it won’t be too much longer before it is fixed.
Patrick
This is the message that Simon Wilson forwarded to me from the JASMIN Helpdesk on July 11:
A new version of the OpenMPI library was compiled with Intel compiler
version 20.0.0 and without the -xHost flag.
The corresponding (hidden) module file is eb/OpenMPI/intel/4.1.5:
module add eb/OpenMPI/intel/4.1.5
Could you please test JULES against this Intel OpenMPI?
Hi Simon:
Many thanks. I will try it out with AMD as soon as I can.
Patrick
Hi Simon:
I tried it in ~pmcguire/roses/u-al752AMD,
but I get an “error parsing data file mpif90: not found” when I am in the fcm_make phase.
Patrick
Hi Simon:
I tried again with a new copy of the suite. The new copy is ~pmcguire/roses/u-al752AMD3.
But I get the same error message. The complete error message is below. I don’t have permission to
read the file /home/users/cdelcano/openmpi/share/openmpi/mpif90-wrapper-data.txt
I will also inform the JASMIN helpdesk about this.
Patrick
mpif90 -oo/timestep_mod.o -c -DSCMA -DBL_DIAG_HACK -DINTEL_FORTRAN -I./include -I/gws/nopw/j04/jules/admin/netcdf/netcdf.openmpi//include -heap-arrays -fp-model precise -traceback /home/users/pmcguire/cylc-run/u-al752AMD3/share/fcm_make/preprocess/src/jules/src/control/standalone/var/timestep_mod.F90 # rc=243
[FAIL] Cannot open configuration file /home/users/cdelcano/openmpi/share/openmpi/mpif90-wrapper-data.txt
[FAIL] Error parsing data file mpif90: Not found
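The failure above says the mpif90 wrapper could not open its data file under an install prefix. A quick pre-flight check for that condition might look like the sketch below. The function name `wrapper_data_readable` is hypothetical; the `share/openmpi/<wrapper>-wrapper-data.txt` layout is taken from the path in the error message, and the prefix to test is whatever install location the wrapper reports.

```shell
#!/bin/sh
# Succeed if the wrapper data file for WRAPPER is readable under PREFIX.
# Usage: wrapper_data_readable PREFIX [WRAPPER]
# (Hypothetical helper; path layout copied from the error message above.)
wrapper_data_readable() {
    prefix=$1
    wrapper=${2:-mpif90}
    data="$prefix/share/openmpi/$wrapper-wrapper-data.txt"
    if [ -r "$data" ]; then
        echo "OK: $data"
    else
        echo "MISSING or unreadable: $data"
        return 1
    fi
}
```

For the case in this post, `wrapper_data_readable /home/users/cdelcano/openmpi` would be expected to fail for any user without read permission on that directory, which matches the fcm_make error.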
From messages exchanged with the JASMIN Helpdesk:
These are the modules necessary for a background build of JULES on the cylc1 VM with the mpif90 compiler. This allows compiling on Intel nodes (i.e., cylc1) and running on AMD nodes (i.e., in the short-serial-4hr queue/partition).
module load intel/20.0.0
module load contrib/gnu/gcc/7.3.0
module load eb/OpenMPI/intel/4.1.5
The OpenMPI module above has now been properly built without node-type-specific instructions.
A JULES suite that successfully runs with these modules on the AMD nodes (in the short-serial-4hr queue) is in ~pmcguire/roses/u-al752AMD3. These changes have also been checked in to the parent suite u-al752, but that parent suite also needs to be ironed out a bit more, due to other updates.
Patrick McGuire
The u-al752 JULES/FLUXNET suite has been updated and checked in, with the code to run JULES on the AMD nodes of the short-serial-4hr partition. It currently runs at JULES7.3 trunk. The plotting is done in Python and sometimes needs 8 hours, so it runs in the short-serial queue, without an AMD constraint.
The docs for the u-al752 suite have been updated and are here:
https://research.reading.ac.uk/landsurfaceprocesses/software-examples/tutorial-rose-cylc-jules-on-jasmin/
Patrick