Previously-working job now seg-faulting on 1st timestep after OS upgrade (ARCHER2 v8.4 GA4 UM-UKCA)

The GLOMAP-v8.2-upgraded copy of the NCAS-CMS-ported v8.4 GA4 UM-UKCA job
(a copy of Ros’s xoxta job) has been giving a seg-fault on timestep 1 since the service
resumed after the May/June OS upgrade.

I’ve checked the ARCHER2 OS-upgrade page in detail, and have tried a number of different
ways to explore the cause of the seg-fault, including adding compiler-override files with:

  1. the updated cce compiler (v15.0.0), and
  2. as 1., but with the -g debug flag also added to the compile lines

I can see that the compile lines are affected on ARCHER2, so these options are propagating
through successfully, both to upgrade the compiler version and to add the -g debug flag.
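
For reference, the effect the second override file is aiming for is sketched below (the %fc / %fflags_opt macro names here are illustrative assumptions of mine, not the verbatim names from the vn8.4 ARCHER2 machine config):

  # compiler-override sketch -- macro names are illustrative:
  %fc          ftn        # Cray Fortran wrapper, with cce/15.0.0 loaded at build time
  %fflags_opt  -O2 -g     # as override (1), plus -g so tracebacks carry symbol info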

The job I am testing is a copy of the xplt-o job, which ran fine on 256 cores prior to the OS upgrade.
It was one of the ensemble members within the two HErSEA-Pinatubo GA4 UM-UKCA ensembles we submitted as the UK contribution to the ISA-MIP Pinatubo intercomparison paper (Quaglia et al., 2023, ACP, https://acp.copernicus.org/articles/23/921/2023/acp-23-921-2023.pdf).

The post-OS-upgrade re-run of the xplt-o job is for a paper in preparation from the ACSIS
research, comparing UK Agung model simulations to searchlight, lidar and balloon
measurements from the 1963-65 period.

The observations suggest a higher-altitude SO2 emission would give better agreement
with the measurements, and we are therefore keen to be able to re-run with the
“obs-calibrated” emission source parameter settings.

The simulations in Dhomse et al. (2020) all used the same best-estimate mid-level
injection height, and it is straightforward to re-run with the higher injection height.

Basically, the Pinatubo base case is giving a seg-fault on the 1st timestep, with the
128-core (xprb-r) and 256-core (xprb-s) jobs compiled with debug settings both giving
near-identical behaviour.

Given the info re: the change in behaviour of the slurm system, which no longer passes on
the ntasks-per-node settings, I also tried re-running both jobs with OpenMP switched off
(this requiring a different compiler-override file). Although the no-OMP equivalent jobs
do get slightly further, both still fail on the 1st timestep; see xprb-p (128-core) and
xprb-q (256-core).
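
For anyone reproducing this, the no-OMP re-runs amount to setting the task geometry and thread count explicitly in the job script, since (as I understand the upgrade notes) these settings are no longer inherited by srun; a minimal sketch for the 256-core case (the executable name is a placeholder):

  #!/bin/bash
  #SBATCH --nodes=2
  #SBATCH --ntasks-per-node=128
  export OMP_NUM_THREADS=1          # the no-OMP equivalent of the jobs above
  # pass the task geometry to srun explicitly rather than relying on inheritance:
  srun --ntasks=256 --ntasks-per-node=128 --hint=nomultithread ./um_main.exe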

I did try to run the gdb4hpc debugger, following the instructions at
https://docs.archer2.ac.uk/user-guide/debug/#gdb4hpc

But trying both methods, either “attaching” the debugger to the running job or
launching the executable directly from the debugger, I couldn’t get the gdb4hpc system
to run alongside the executable.
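
For the record, the two methods follow this pattern from the ARCHER2 docs page linked above (the $um process-set name and the executable path are placeholders of mine):

  module load gdb4hpc
  gdb4hpc
  # method 1: launch the executable under the debugger
  dbg all> launch $um{256} ./um_main.exe
  # method 2: attach to an already-running application instead
  dbg all> attach $um <application-id>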

I had hoped to run the TotalView debugger, which worked well at v7.3
(by modifying the qsexecute script to add the TotalView syntax around the aprun
command on HECToR phase 3).

Unfortunately, I could not see any info on TotalView on the ARCHER2 system, but
I guess it would be possible to install it locally to try this out, with help from ARCHER2 support.

I’m raising this query here to ask whether it might simply be that I need to add some
additional flags, either within the compiler override, or in another machine override or hand-edit.

I checked back on Ros Hatcher’s job xoxta, and could see it was run in June but also
failed, although it looks like it had not yet reached the run stage.

Note that I had to do the reconfiguration step by hand on ARCHER2, via
sbatch umuisubmit_rcf from the umui_runs directory, because the reconfiguration
step launched when submitting from the UMUI did not complete.

I also had to increase the number of nodes to 4 for the reconfiguration of the 256-core job
to get it to work.
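
For anyone retracing that workaround, the manual reconfiguration amounts to the following (the run-directory name is a placeholder; the node count is edited in the script itself):

  cd ~/umui_runs/xprbs-<run-id>    # placeholder for the actual umui_runs entry
  # edit umuisubmit_rcf to request 4 nodes for the 256-core job, then:
  sbatch umuisubmit_rcf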

If you look in the /work/n02/n02/gmann/output directory you will see the sequence of test runs,
and the corresponding error messages etc. at the various stages of testing, for the above
test jobs xprbt, xprbs, xprbr, xprbq and xprbp.

When checking back against the original xoxta job from Ros, I noticed the revision number for
the FCM “container” had changed since I originally copied the job.

I have updated the sequence of jobs xprb-s, -r, -q and -p to that updated container revision number, but the earlier job xprbt, and the Agung test jobs xprb-g, xprb-d etc.,
will still have the original FCM container revision number.

Please can someone in the team try running a copy of the reference GA4 UM-UKCA job (xoxta),
referring to the differences in the test jobs xprb-r etc. re: the compiler-override file, and see
if the same error is given, then seek assistance from the ARCHER2 team if needed.

I realise the ARCHER2 system is down for maintenance today, and am having to prioritise
teaching matters for the next 2 weeks (after the suspension of the UCU
marking & assessment boycott).

However, I will be able to reply to messages either here or via email.

Thanks a lot for your help with this,

Best regards,

Dr. Graham Mann
Lecturer in Atmospheric Science
University of Leeds

Hi Graham,

Yes, I did update the 2 reference jobs following the Archer2 OS upgrade to change the compiler, etc. I thought I did get it to run OK, but I do remember hitting some temporary Archer2 issue at some point, which hopefully is why you see my failed output.

I’m in the middle of doing some UMUI work, so once ARCHER2 is back up next week, I’ll let you know how I get on running xoxta/b.

Cheers,
Ros.

Hi Ros,

OK, that’s great, thanks a lot,

Cheers
Graham

A quick update:
xoxta doesn’t work on Archer2 anymore as one of the input files has gone walkies. :slightly_frowning_face:
I’ve started looking at your job xprbs.

Hi Graham,

Can you please change the permissions on /work/n02/n02/gmann/UMdumps/xmvxha.da20021201_00 so that I can read it?

Cheers,
Ros.

Thanks Ros – I’ve done a "chmod a+r" on those files, so you should be able to access that file now.

Cheers
Graham

gmann@ln01:/work/n02/n02/gmann/UMdumps> ls -lrt
total 316100676
-rw-r--r-- 1 gmann n02 8617328640 Dec 3 2016 xmvxha.da20030601_00
-rw-r--r-- 1 gmann n02 11589529600 Jan 12 2020 xnbeka.da19630301_00
-rw-r--r-- 1 gmann n02 11589529600 Jan 12 2020 xneoka.da19630301_00
-rw-r--r-- 1 gmann n02 10586243072 Jan 19 2022 xlazsa.da19921201_00
-rw-r--r-- 1 gmann n02 10906017792 Jan 22 2022 xpdbwa.da20021201_00
-rw-r--r-- 1 gmann n02 10906017792 Jan 22 2022 xpdbwa.da20030601_00
-rw-r--r-- 1 gmann n02 10906017792 Jan 23 2022 xpdbya.da20000601_00
-rw-r--r-- 1 gmann n02 11141574656 Jan 27 2022 xmbcla.da19981201_00
-rw-r--r-- 1 gmann n02 11141574656 Jan 27 2022 xmbcla.da19971201_00
-rw-r--r-- 1 gmann n02 11404312576 Jan 27 2022 xmotoa.da20091201_00
-rw-r--r-- 1 gmann n02 10634555392 Jan 28 2022 xphrta.da19910611_00
-rw-r--r-- 1 gmann n02 10634555392 Jan 28 2022 xphrua.da19910611_00
-rw-r--r-- 1 gmann n02 10694479872 Jan 30 2022 xphroa.da19910116_00
-rw-r--r-- 1 gmann n02 10694479872 Jan 30 2022 xphrpa.da19910116_00
-rw-r--r-- 1 gmann n02 10694479872 Jan 30 2022 xphrqa.da19910116_00
-rw-r--r-- 1 gmann n02 10694479872 Jan 30 2022 xphrra.da19910116_00
-rw-r--r-- 1 gmann n02 10694479872 Jan 30 2022 xphrsa.da19910116_00
-rw-r--r-- 1 gmann n02 11404312576 Aug 29 2022 xmotoa.da20080901_00
-rw-r--r-- 1 gmann n02 9720463360 Aug 31 2022 xpllca.da20220101_00
-rw-r--r-- 1 gmann n02 9720463360 Aug 31 2022 xpllca.da20220114_00
-rw------- 1 gmann n02 8617328640 Sep 2 2022 xmvxha.da19991201_00
-rw------- 1 gmann n02 8617328640 Sep 2 2022 xmvxha.da20021201_00
-rw------- 1 gmann n02 8617328640 Sep 2 2022 xmvxha.da20051201_00
-rw------- 1 gmann n02 8617328640 Sep 3 2022 xmvxha.da20031201_00
-rw-r--r-- 1 gmann n02 10694479872 Sep 4 2022 xploaa.da19910601_00
-rw-r--r-- 1 gmann n02 10694479872 Sep 16 2022 xpltca.da19911201_00
-rw-r--r-- 1 gmann n02 10694479872 Sep 16 2022 xpltba.da19930601_00
-rw-r--r-- 1 gmann n02 10694479872 Sep 16 2022 xpltaa.da19930601_00
-rw-r--r-- 1 gmann n02 10694479872 Sep 16 2022 xpltda.da19930601_00
-rw-r--r-- 1 gmann n02 10694479872 Sep 16 2022 xpltga.da19930601_00
-rw-r--r-- 1 gmann n02 10694479872 Sep 16 2022 xpltea.da19930601_00
gmann@ln01:/work/n02/n02/gmann/UMdumps> chmod a+r *
gmann@ln01:/work/n02/n02/gmann/UMdumps> ls -lrt
total 316100676
-rw-r--r-- 1 gmann n02 8617328640 Dec 3 2016 xmvxha.da20030601_00
-rw-r--r-- 1 gmann n02 11589529600 Jan 12 2020 xnbeka.da19630301_00
-rw-r--r-- 1 gmann n02 11589529600 Jan 12 2020 xneoka.da19630301_00
-rw-r--r-- 1 gmann n02 10586243072 Jan 19 2022 xlazsa.da19921201_00
-rw-r--r-- 1 gmann n02 10906017792 Jan 22 2022 xpdbwa.da20021201_00
-rw-r--r-- 1 gmann n02 10906017792 Jan 22 2022 xpdbwa.da20030601_00
-rw-r--r-- 1 gmann n02 10906017792 Jan 23 2022 xpdbya.da20000601_00
-rw-r--r-- 1 gmann n02 11141574656 Jan 27 2022 xmbcla.da19981201_00
-rw-r--r-- 1 gmann n02 11141574656 Jan 27 2022 xmbcla.da19971201_00
-rw-r--r-- 1 gmann n02 11404312576 Jan 27 2022 xmotoa.da20091201_00
-rw-r--r-- 1 gmann n02 10634555392 Jan 28 2022 xphrta.da19910611_00
-rw-r--r-- 1 gmann n02 10634555392 Jan 28 2022 xphrua.da19910611_00
-rw-r--r-- 1 gmann n02 10694479872 Jan 30 2022 xphroa.da19910116_00
-rw-r--r-- 1 gmann n02 10694479872 Jan 30 2022 xphrpa.da19910116_00
-rw-r--r-- 1 gmann n02 10694479872 Jan 30 2022 xphrqa.da19910116_00
-rw-r--r-- 1 gmann n02 10694479872 Jan 30 2022 xphrra.da19910116_00
-rw-r--r-- 1 gmann n02 10694479872 Jan 30 2022 xphrsa.da19910116_00
-rw-r--r-- 1 gmann n02 11404312576 Aug 29 2022 xmotoa.da20080901_00
-rw-r--r-- 1 gmann n02 9720463360 Aug 31 2022 xpllca.da20220101_00
-rw-r--r-- 1 gmann n02 9720463360 Aug 31 2022 xpllca.da20220114_00
-rw-r--r-- 1 gmann n02 8617328640 Sep 2 2022 xmvxha.da19991201_00
-rw-r--r-- 1 gmann n02 8617328640 Sep 2 2022 xmvxha.da20021201_00
-rw-r--r-- 1 gmann n02 8617328640 Sep 2 2022 xmvxha.da20051201_00
-rw-r--r-- 1 gmann n02 8617328640 Sep 3 2022 xmvxha.da20031201_00
-rw-r--r-- 1 gmann n02 10694479872 Sep 4 2022 xploaa.da19910601_00
-rw-r--r-- 1 gmann n02 10694479872 Sep 16 2022 xpltca.da19911201_00
-rw-r--r-- 1 gmann n02 10694479872 Sep 16 2022 xpltba.da19930601_00
-rw-r--r-- 1 gmann n02 10694479872 Sep 16 2022 xpltaa.da19930601_00
-rw-r--r-- 1 gmann n02 10694479872 Sep 16 2022 xpltda.da19930601_00
-rw-r--r-- 1 gmann n02 10694479872 Sep 16 2022 xpltga.da19930601_00
-rw-r--r-- 1 gmann n02 10694479872 Sep 16 2022 xpltea.da19930601_00