V8.4 GA4 UM-UKCA ARCHER2 job seg-faulting, but runs OK with debug option on CCE compiler (though obviously very slowly with -g and -O0)

Dear NCAS-CMS helpdesk,

Further to my CMS helpdesk message last week, I am running an extra set of simulations with the interactive
stratospheric aerosol configuration of GA4 UM-UKCA.

The reason we’re running “the old” GA4 UM-UKCA is to complete our submission for the Hunga Tonga MIP,
aligned with a paper drafted by Margot Clyne (Univ Colorado) on the first 100 days post-eruption.

We previously ran the main specified-dynamics runs for this MIP, but the last phase is to repeat the runs
with the model free-running, applying stratospheric heating consistent with the model-simulated
volcanic sulphate aerosol, together with the stratospheric cooling from the model-simulated volcanogenic
water vapour (150 Tg in this Hunga case).

Note that last year we also ran these specified-dynamics runs with UKESM1.1 on MONSOON
(using the v12.1 NCAS release job of UKESM1.1).

Anyway, having submitted both sets of nudged runs, we’re now running the final set of runs
where the model is free-running, without the nudging, and I’ve gone back to the v8.4 GA4 UM-UKCA
simulations we did on ARCHER2 during 2023/24, with the interactive strat-aerosol and water vapour
(the model from Dhomse et al., 2020; as analysed also for the Pinatubo-MIP within Quaglia et al., 2023).

The last time we ran this vn8.4 GA4 UM-UKCA model was August 2024, and the ERA5-nudged Hunga runs
we did then ran fine, e.g. see job IDs xpzgx and xpzgy.

At that time (July 2024) we also re-ran the Agung simulations from Dhomse et al. (2020), for some extra runs
we’re doing to assess the 1966 Awu eruption and associated prolonged volcanic forcing through the 1960s.

Anyway, for these free-running Hunga simulations, the main jobs that ran fine in July/Aug 2024 are
now giving a seg-fault as soon as they start running on ARCHER2 (see UMUI job xqjhx, a copy of xpzgx).

The only modification I made when copying it was to change it to submit via ARCHER2 ln02 rather than ln01
(see the message below, re: ln01 being out of circulation for maintenance).

I realise that’s just the login node used for submission, so it shouldn’t affect the memory usage or availability
when the job runs on the same 16 x 16 PE domain decomposition (256 cores, across 2 nodes).
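(For reference, the batch request implied by that decomposition is along these lines; a sketch only, since the
UMUI-generated script sets the actual directives, and it assumes ARCHER2's 128 cores per node:

#SBATCH --nodes=2
#SBATCH --ntasks=256
#SBATCH --ntasks-per-node=128

i.e. 16 x 16 = 256 MPI tasks, filling 2 nodes at 128 cores each.)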

There is an error message that says “corrupted double-linked list”, but I’m not sure whether that is just a warning
message rather than what actually caused the model to seg-fault.

I recall there were initially some problems with seg-faults when we first picked up the ported GA4 UM-UKCA job
(which initially didn’t have the GLOMAP aerosol, but we then added it). At that time, in 2021, the job was ported
from pumatest to ARCHER2, and Grenville and Ros both helped to get it running. Although the seg-fault was
resolved, I don’t recall exactly what remedied the problem; I don’t think it was changing memory settings,
or at least not within the individual job assignment.

The reason I mention that is that I’m wondering if it might be a similar issue again, where some default memory
setting for ARCHER2 jobs submitted from PUMA2 has changed since Aug 2024, when this was working OK?

I’ve done some exploratory work on this over the weekend, and tried increasing this to 512 cores (on 4 nodes),
but the job still fails in exactly the same place, part-way through timestep 1, just after it starts running.

Note that the previous jobs all ran with a specific GCOM version built with the CCE compiler
(v1.5 of GCOM), but I don’t think that was the specific thing that got around the issue back then.

Anyway, just to note that the GCOM over-ride file is also active in this seg-faulting job, and again, I’m not sure
whether this could be one “hidden difference” that has crept in during the 18 months since these ran OK.

The last thing to note is that I tried running 2 test jobs off the xqjhx job, xqjhy and xqjhz, which do the
exact same run with the 16x16 decomposition, except that -y does not apply the -hflex_mp=intolerant CCE compiler setting.

That didn’t make any difference: xqjhy still seg-faults with that change to the compiler settings.

However, for xqjhz I tried adding in the debugging flag -g, together with optimisation set to zero (-O0),
and quite surprisingly the job then ran fine.

I say quite surprisingly because I ran that to get info on the seg-fault, but the optimisation set to zero
seems to have avoided whatever caused the seg-fault, and the xqjhz job then runs fine.

The 3 over-ride files for these jobs xqjh-x, -y and -z are within my PUMA umui_jobs/overrides/ directory:
/home/n02/n02/gmann/umui_jobs/overrides/

The 1st is the default one I’d used before (but modified to avoid repeated noomp and omp flags).
The 2nd is the variant of that without the -hflex_mp=intolerant flag.

And the 3rd adds the debug flag, with which the xqjhz job surprisingly then runs OK
(a sketch of the kind of flag lines involved follows the listing below).

-rwxr-xr-x. 1 gmann n02 1398 Feb 22 14:32 ARCHER2_UpdVnGCOMtoV15woutNoOMP.ovr
-rwxr-xr-x. 1 gmann n02 1405 Feb 22 14:37 ARCHER2_UpdVnGCOMtoV15NoOMPdbug.ovr
-rwxr-xr-x. 1 gmann n02 1241 Feb 22 14:40 ARCHER2_UpdVnGCOMtoV15tlr8NoOMP.ovr
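
For reference, the kind of flag lines that differ between the three files is sketched below, using the usual
FCM build-override syntax. This is a hedged sketch rather than the literal file contents, and the -O2 shown
for the optimised builds is illustrative:

bld::tool::fflags  -O2 -hflex_mp=intolerant     (1st file: default optimised build)
bld::tool::fflags  -O2                          (2nd file: drops -hflex_mp=intolerant)
bld::tool::fflags  -g -O0                       (3rd file: debug flag, optimisation off; runs OK)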

Please can someone in the team take a look at this and see why the original job
seg-faults (xqjh-x), whereas the debug-flagged job with -O0 runs OK (xqjh-z).

Thanks a lot for your help with this,

Regards,

Dr. Graham Mann
Lecturer in Atmospheric Science
University of Leeds

*********************************************************
UM Executable : /work/n02/n02/gmann/um/xqjhx/bin/xqjhx.exe
*********************************************************
corrupted double-linked list
srun: error: nid006672: task 128: Aborted (core dumped)
srun: launch/slurm: _step_signal: Terminating StepId=12582290.0
slurmstepd: error: *** STEP 12582290.0 ON nid006669 CANCELLED AT 2026-02-22T22:21:09 ***
srun: error: nid006672: tasks 129-255: Terminated
srun: error: nid006669: tasks 0-127: Terminated
srun: Force Terminated StepId=12582290.0
xqjhx: Run failed
*****************************************************************
Ending script : qsatmos
Completion code : 143
Completion time : Sun 22 Feb 2026 10:21:10 PM GMT

From: Rosalyn Hatcher notifications@cms-support.discoursemail.com
Date: Tuesday, 17 February 2026 at 12:03
To: Graham Mann G.W.Mann@leeds.ac.uk
Subject: [NCAS Modelling Support] [Unified Model] Problem submitting v8.4 GA4 UM-UKCA job to ARCHER2 from PUMA2


Hi Graham,
It’s because you have the job set to submit to ln01 and that login node is out of circulation for maintenance. Change the host to one of the other login nodes and try again.
You can add set -x at the beginning of the UMSUBMIT_ARCHER2 script to see where it is falling over.
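(As a sketch, assuming a Bourne-family shell script:

set -x    (placed near the top: echoes each command as it executes)
set +x    (optional, further down: turns tracing off again)

The last command echoed before the failure shows where it falls over.)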
Cheers,
Ros

Hi Graham,

Having to lower the compiler optimisation level on a particular file for the model to run is not uncommon and is something we often do when porting UM versions or with a change of compiler version.

By process of elimination (setting -O0 on groups of files in the compile override file) you can work out which file(s) are causing the problem. Then simply leave the identified files compiled with the lowered optimisation.
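
For example, in the FCM build-override syntax the flags can be scoped to a sub-package, so only the suspect
section is de-optimised while the rest of the model keeps the normal flags. A sketch, with a purely illustrative
package path and flags:

bld::tool::fflags                               -O2 -hflex_mp=intolerant
bld::tool::fflags::UM::atmosphere::convection   -O0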

Cheers,
Ros

I’d just add that since you have a core dump, if you want to see which sections of the code should have their compiler settings changed, you can use gdb.

I don’t have access to your core file, but the commands would be:

cd /work/n02/n02/gmann/um/xqjhx

gdb bin/xqjhx.exe core

bt full

and so on. I thought I’d point this out in case you haven’t tried poking around. Full docs online.
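
A few follow-on gdb commands can also help once bt full has shown the failing frame; these are standard gdb,
with annotations in parentheses and a placeholder variable name:

frame 2         (select frame number 2 from the backtrace)
list            (show the source around the crash point; needs -g in the build)
info locals     (print the local variables in the selected frame)
print my_array  (inspect a suspect variable; my_array is just a placeholder)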

Hi Ros, Dave,

Thanks for sending this info, and ah OK re: it sometimes being a bit of “trial and error” to track this down.

Dave – thanks for this info re: the gdb commands, and that’s great.

I did refer to the info on the ARCHER2 debug tools page (Debugging - ARCHER2 User Documentation), and, as described in the post above, I created a copy of the over-ride file to do the CCE compilation with the debugging flag -g added.

The thing is, when the job was re-submitted, compiled with -g, the model run actually ran OK (it no longer seg-faulted!).

That re-submission also added -O0 (as well as -g), and my understanding was that it was essential to also run with -O0 when running with -g.

That might be my conflation of two things.

It’s simply that whenever I’ve done this kind of thing before, I’ve added -g for debug together with -O0, and I thought that was generally how this was done.

But thinking about it here, maybe -g also works OK when optimisation is at -O2 or -O3?

Please can you clarify: is it correct that you have to run with -O0 as well as -g?
Or are those just 2 different things that people tend to try, and would I be fine adding -g only, to debug the core at the usual optimisation level where the seg-fault is occurring?

Thanks a lot for your help with this.

Cheers
Graham

Do put -g because it includes the symbol table, so your debugger can ‘read’ the code and match it to source files. This will tell you which line you segfault on and you should be able to see which variable is going out of bounds etc.

The -O0, -O1, -O2, … flags set the optimisation level. It sounds like the compiler is optimising too aggressively and triggering a bug. As Ros says, if you can find which files are causing the bug you can compile only those with a lower optimisation (or remove OpenMP, or something similar, if applicable).
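
As a sketch of the combinations with the CCE Fortran wrapper (worth checking against the CCE man page, since
classic Cray Fortran may itself reduce optimisation when -g is given on its own; the file name is a placeholder):

ftn -g -O0 -c mymod.f90    (full debug info, no optimisation: slow, but exact line numbers)
ftn -g -O2 -c mymod.f90    (symbols kept alongside optimisation: faster, but stepping can jump around)

In other words the debug symbols and the optimisation level are separate controls; -O0 isn't strictly required
with -g, it just gives the cleanest view in the debugger.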

If you aren’t able to use the debugger to find the section of the code which has the bug, then Ros’s suggestion to use divide and conquer will also help you narrow it down.