Found cause of the ARCHER2 UMv8.4 GA4 UM-UKCA seg-fault -- now runs fine (JULES routine "snow.F90")

Dear NCAS-CMS helpdesk,

Just to post an update to the post I raised in February re: seg-fault in the GA4 UM-UKCA ARCHER2 job.

I’ve put the link to that post below for info, but in summary the model ran when compiled with the CCE Fortran compiler optimization flag set to -O0, but failed when set to -O1 or -O2.

Ros and Dave (dcase) suggested ways to track down the seg-fault, but at that particular time I didn’t have time to try these out.

In fact the set of runs I was doing only needed to be 100-days, which ran through fine in the standard queue with O0.

But the 1960s stratosphere runs to complete the UK submission to the ISA-MIP HErSEA experiment (Timmreck et al., 2018; Dhomse et al., 2020) required each ensemble member to run for ~48 months.

At the weekend I had a go at running the gdb debugger to track down the source of the seg-fault, and it turned out to be really easy to find the problem.

After compiling with -g and -O1 I was able to then get the model run to crash, and then it was simply a case of running the suggested “gdb bin/${JOBID}.exe core” command to isolate the source of the problem.

The debugger printed some informational/warning messages suggesting it didn’t run perfectly, but I was able to decipher the key line, which told me which routine triggered the seg-fault.

I’ve copied and pasted the lines of std-output printed by the debugger at the end of this post.

As you can see there, the debugger gives the source of the seg-fault as “snow_ ()”.
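
For anyone wanting to pull that key line out of a long debugger log automatically, here is a minimal Python sketch. The function names are made up for illustration, and the stripping of the trailing underscore assumes the usual compiler convention of appending "_" to external Fortran procedure names, which I have not verified for every compiler.

```python
import re

# Matches a gdb backtrace frame such as "#0 0x00000000011c5c7f in snow_ ()".
FRAME_RE = re.compile(r"^#(\d+)\s+(?:0x[0-9a-fA-F]+\s+in\s+)?([A-Za-z_]\w*)\s*\(")


def top_frame_symbol(gdb_output):
    """Return the symbol named in frame #0, or None if no frame #0 is found."""
    for line in gdb_output.splitlines():
        match = FRAME_RE.match(line.strip())
        if match and match.group(1) == "0":
            return match.group(2)
    return None


def fortran_name(symbol):
    """Strip one trailing underscore (assumed external-procedure mangling)."""
    return symbol[:-1] if symbol.endswith("_") else symbol


sample = """Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00000000011c5c7f in snow_ ()"""

symbol = top_frame_symbol(sample)
print(symbol, fortran_name(symbol))  # prints: snow_ snow
```

The point to note is that frame #0 names a procedure (here a routine called snow), not a variable.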

Actually I misunderstood this initially, and thought it was a memory error from a variable “snow”.
And then I added exclusions for all UM routines containing the variable “snow” (23 in total).

It turned out not to be any of those 23 routines, which puzzled me.
I then wondered whether it could be an error originating from a JULES routine
(this was the only explanation I could think of).

I guess the JULES repo code came into the UM trunk relatively recently, and so has been less thoroughly ported from the ARCHER variant and other machines etc. But I hadn’t thought to check that.

Anyway, checking the JULES routines I saw there was actually a source file called “snow.F90”.
I then tried adding that to the exclusion list as well as the 23 UM routines.
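
In hindsight, searching the source tree for a routine named snow, rather than for routines using a variable called snow, would have got there quicker. A rough Python sketch of that search is below. The function name, the suffix list and the throwaway demo files are all made up for illustration, and the pattern is deliberately simple (it does not handle preprocessor tricks or continuation lines).

```python
import os
import re
import tempfile

# Assumed set of Fortran source suffixes; adjust for the tree being searched.
FORTRAN_SUFFIXES = (".f90", ".F90", ".f", ".F")


def find_subroutine(root, name):
    """Return sorted paths of Fortran files under root that define subroutine `name`."""
    # Optional prefixes (e.g. RECURSIVE) are allowed; "END SUBROUTINE" lines are skipped.
    pattern = re.compile(
        r"^[ \t]*(?!end\b)(?:\w+[ \t]+)*subroutine[ \t]+" + re.escape(name) + r"\b",
        re.IGNORECASE | re.MULTILINE,
    )
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            if filename.endswith(FORTRAN_SUFFIXES):
                path = os.path.join(dirpath, filename)
                with open(path, errors="replace") as handle:
                    if pattern.search(handle.read()):
                        hits.append(path)
    return sorted(hits)


# Demo on a throwaway tree: one file defines the routine, one only uses a variable.
with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, "snow.F90"), "w") as handle:
        handle.write("SUBROUTINE snow(land_pts)\nEND SUBROUTINE snow\n")
    with open(os.path.join(root, "uses_snow.F90"), "w") as handle:
        handle.write("SUBROUTINE other()\n  REAL :: snow\nEND SUBROUTINE other\n")
    for path in find_subroutine(root, "snow"):
        print(os.path.basename(path))  # prints: snow.F90
```

Only the file that defines the routine is reported; the file that merely declares a variable called snow is ignored.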

Once I did that, the model ran fine!!

Apologies for the long-hand explainer, but thought I’d say that the gdb worked really well in this case
(and to encourage others to consider using this in similar circumstances in future).

Ros – I’m not sure if the NCAS-CMS team are aware of this, but it looks like that particular routine crashes
with -O1 or -O2 but runs OK with -O0 (i.e. similar to the other routines in the over-ride “exclusion list”).

For info you can see how I’ve done the over-ride at

/home/n02/n02/gmann/umui_jobs/overrides/ARCHER2_UpdVnGCOMtoV15NoOMPdbO1.ovr
/home/n02/n02/gmann/umui_jobs/overrides/ARCHER2_UpdVnGCOMtoV15NoOMPdbO2.ovr

UM job xpksy is the -O0 run, which runs OK with all UM routines compiled with -O0 via
this compiler over-ride file:

/home/n02/n02/gmann/umui_jobs/overrides/ARCHER2_UpdVnGCOMtoV15NoOMPdbug.ovr

Then UM jobs xpksz and xpksx are the runs with the dbO1.ovr and dbO2.ovr over-ride files above
(i.e. with all routines compiled with -O1 and -O2 respectively, except for those in the exclusion list, which now also includes snow.F90).

Quite chuffed to get the model running again, though I’m not sure what the source of the bug in JULES/src/science/snow.F90 is.

Best regards,

Cheers
Graham


[New LWP 191025]
warning: Error reading shared library list entry at 0x3531722540382029
Cannot access memory at address 0x158d00074647b
Cannot access memory at address 0x158d000746473
Failed to read a valid object file image from memory.
Core was generated by `/work/n02/n02/gmann/um/xqksz/bin/xqksz.exe'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00000000011c5c7f in snow_ ()


** Dr. Graham Mann, Lecturer in Atmospheric Science **
** Institute for Climate & Atmospheric Science T: +44 0113 3431660 **
** Room 10.108, School of Earth & Environment F: +44 0113 3435259 **
** University of Leeds, Leeds, LS2 9JT, U.K. E: G.W.Mann@leeds.ac.uk **