The GLOMAP-v8.2-upgraded copy of the NCAS-CMS ported v8.4 GA4 UM-UKCA job
(a copy of Ros’s xoxta job) has been giving a seg-fault on timestep 1 since the service
resumed after the May/June OS upgrade.
I’ve checked the ARCHER2 OS upgrade page in detail, and tried a number of different ways
to explore the cause of the seg-fault, including adding compiler over-ride files with
1) an updated CCE compiler (v15.0.0), and 2) as 1), but with the -g debug flag also added to the compile lines.
I can see that the compile lines are affected on ARCHER2, so these options are propagating
through successfully, both to upgrade the compiler version and to add the -g debug flag.
The job I am testing is a copy of the xplt-o job, which ran fine on 256 cores prior to the OS upgrade,
and was one of the ensemble members we ran within the two HErSEA-Pinatubo GA4 UM-UKCA ensembles we submitted as the UK contribution to the ISA-MIP Pinatubo intercomparison paper (Quaglia et al., 2023, ACP, https://acp.copernicus.org/articles/23/921/2023/acp-23-921-2023.pdf ).
The post-OS-upgrade re-run of the xplt-o job is for a paper in preparation from the ACSIS research to compare UK Agung model simulations to searchlight, lidar and balloon
measurements from the 1963-65 period.
The observations suggest a higher-altitude SO2 emission would give better agreement
with the measurements, and we are therefore keen to be able to re-run with the “obs-calibrated”
emission source parameter settings.
The simulations in Dhomse et al. (2020) were all run with the same best-estimate mid-level
injection height, and it is then straightforward to re-run with the higher injection height.
Basically the Pinatubo base-case is giving a seg-fault on the 1st timestep, with the
128-core (xprb-r) and 256-core (xprb-s) debug-settings-compiled jobs both giving
near-identical behaviour.
Following the info re: the change in behaviour of the Slurm system, which no longer passes on the
ntasks-per-node settings, I also tried re-running both with OpenMP switched off, which
required a different compiler over-ride file. Although the no-OMP equivalent jobs
do get slightly further, both still fail on the 1st timestep; see xprb-p (128-core) and
xprb-q (256-core).
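For reference, here is a minimal Python sketch of the node and task arithmetic behind those Slurm settings, so that the values can be stated explicitly on the launch line instead of being inherited. It is not part of the UM or UMUI scripts. The helper names (slurm_layout, srun_options) are hypothetical, and it assumes 128 cores per ARCHER2 compute node and that the MPI task and OpenMP thread counts are known for each job.

```python
import math

# Assumption: a standard ARCHER2 compute node has 128 cores.
CORES_PER_NODE = 128


def slurm_layout(mpi_tasks, omp_threads=1, cores_per_node=CORES_PER_NODE):
    """Return the node count and per-node task count for a job.

    mpi_tasks   : total number of MPI tasks in the job
    omp_threads : OpenMP threads per MPI task (1 for a no-OMP build)
    """
    if mpi_tasks < 1 or omp_threads < 1:
        raise ValueError("mpi_tasks and omp_threads must be positive")
    if omp_threads > cores_per_node:
        raise ValueError("omp_threads cannot exceed the cores on one node")

    # Each task occupies omp_threads cores, which limits the tasks per node.
    tasks_per_node = min(mpi_tasks, cores_per_node // omp_threads)
    nodes = math.ceil(mpi_tasks / tasks_per_node)
    return {
        "nodes": nodes,
        "ntasks_per_node": tasks_per_node,
        "cpus_per_task": omp_threads,
    }


def srun_options(layout):
    """Format the layout as explicit Slurm command-line options."""
    return [
        f"--nodes={layout['nodes']}",
        f"--ntasks-per-node={layout['ntasks_per_node']}",
        f"--cpus-per-task={layout['cpus_per_task']}",
    ]
```

For example, srun_options(slurm_layout(256)) gives the options for a 256-task job with no OpenMP threading. Whether the 256-core test jobs really use 256 single-threaded tasks is an assumption here, and would need checking against the job set-up.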
I did try to run the gdb4hpc de-bugger, following the instructions at
https://docs.archer2.ac.uk/user-guide/debug/#gdb4hpc
However, with both methods, either “attaching” the de-bugger to the running job or
running the de-bugger directly, I couldn’t get the gdb4hpc system to co-run with
the executable.
I had hoped to run the totalview de-bugger, which worked well at v7.3
(by modifying the qsexecute script to add the totalview syntax around the aprun
command on HECToR phase 3).
Unfortunately, I could not see any info on totalview on the ARCHER2 system, but
I guess it will be possible to install this locally to try it out (with help from ARCHER2 support).
I’m just raising this query here to ask whether it might simply be that I need to add in some
additional flags, either within the compiler over-ride, or in another machine over-ride or hand-edit.
I checked back on Ros Hatcher’s job xoxta, and I could see that it was run in June, but it also
failed at run-time, although it looks like it had not yet reached the run stage.
Note that I had to do the re-configuration step interactively from ARCHER2, via
sbatch umuisubmit_rcf from the umui_runs directory, because the re-configuration
step launched when submitting from the UMUI did not complete.
I also had to increase the number of nodes to 4 for the reconfiguration (for the 256-core job)
to get the reconfiguration to work.
If you look in the /work/n02/n02/gmann/output directory you will see the sequence of test runs,
and the corresponding error messages etc. at the various stages of testing, re: the above
test jobs xprbt, xprbs, xprbr, xprbq and xprbp.
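As a quick way to see which of those output files contain the seg-fault, a minimal Python sketch along the following lines could be used. The function name find_error_lines is hypothetical, and the search strings and the assumption that the output file names start with the job name are guesses that may need adjusting to the actual files.

```python
import re
from pathlib import Path

# Assumption: the failure is reported with one of these common strings.
ERROR_PATTERN = re.compile(r"segmentation fault|SIGSEGV|signal 11", re.IGNORECASE)


def find_error_lines(output_dir, job_prefix="xprb", pattern=ERROR_PATTERN):
    """Return {file name: [(line number, line text), ...]} for matching lines.

    Only regular files whose names start with job_prefix are scanned.
    """
    hits = {}
    for path in sorted(Path(output_dir).glob(f"{job_prefix}*")):
        if not path.is_file():
            continue
        # errors="replace" avoids failing on any non-text bytes in the output.
        with path.open(errors="replace") as handle:
            for lineno, line in enumerate(handle, start=1):
                if pattern.search(line):
                    hits.setdefault(path.name, []).append((lineno, line.rstrip()))
    return hits
```

Calling find_error_lines("/work/n02/n02/gmann/output") would then list, per output file, the line numbers where a matching message appears.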
When checking back to the original xoxta job from Ros, I noticed that the revision number for
the FCM “container” had changed since I originally copied the job.
I have updated the sequence of jobs xprb-s, -r, -q and -p for that updated container revision number, but the earlier job xprbt, and the Agung test jobs xprb-g, xprb-d etc.,
will still have the original container FCM revision number.
Please can someone in the team try running a copy of the reference GA4 UM-UKCA job (xoxta),
referring to the differences in the test jobs xprb-r etc. re: the compiler over-ride file, to see
if the same error is given, and then seek assistance from the ARCHER2 team if needed.
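To make those differences easy to see, a minimal Python sketch such as the one below could compare the compiler over-ride file of a test job against the reference one. The function name diff_override_files is hypothetical, and the two file paths have to be supplied by whoever runs it; only the standard difflib module is used.

```python
import difflib
from pathlib import Path


def diff_override_files(reference_path, test_path):
    """Return a unified diff (list of lines) between two over-ride files.

    An empty list means the two files have identical content.
    """
    ref_lines = Path(reference_path).read_text().splitlines()
    test_lines = Path(test_path).read_text().splitlines()
    return list(
        difflib.unified_diff(
            ref_lines,
            test_lines,
            fromfile=str(reference_path),
            tofile=str(test_path),
            lineterm="",
        )
    )
```

Printing the returned lines with print("\n".join(...)) shows removed lines prefixed with "-" and added lines prefixed with "+", so a changed compiler version or an added -g flag stands out.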
I realise the ARCHER2 system is down for maintenance today, and am having to prioritise
teaching matters for the next 2 weeks (after the suspension of the UCU
marking & assessment boycott).
However, I will be able to reply to messages either here or via email.
Thanks a lot for your help with this,
Best regards,
Dr. Graham Mann
Lecturer in Atmospheric Science
University of Leeds