Seg fault in ec_um_recon

I’m running a nesting suite, u-ci575, and I can’t track down what’s causing this error:

[0] exceptions: An exception was raised:11 (Segmentation fault)
[0] exceptions: the exception reports the extra information: Address not mapped to object.
[0] exceptions: whilst in a serial region
[0] exceptions: Task had pid=241625 on host nid005216
[0] exceptions: Program is "/work/n02/n02/hburns/cylc-run/u-ci575/share/fcm_make/build-recon/bin/um-recon.exe"
Warning in umPrintMgr: umPrintExceptionHandler : Handler Invoked
[0] exceptions: Data address (si_addr): 0x7fff60e4f2c0; rip: 0x2b406087dfe9
[0] exceptions: [backtrace]: has   5 elements:
[0] exceptions: [backtrace]: (  1) : Address: [0x2b406087dfe9] 
[0] exceptions: [backtrace]: (  1) : ?? (* Cannot Locate *)
[0] exceptions: [backtrace]: (  2) : Address: [0x00409604] 
[0] exceptions: [backtrace]: (  2) : signal_do_backtrace_linux in file /mnt/lustre/a2fs-work2/work/n02/n02/hburns/cylc-run/u-ci575/share/fcm_make/extract/um/src/control/c_code/exceptions/exceptions-platform/exceptions-linux.c line 81
[0] exceptions: [backtrace]: (  3) : Address: [0x00408178] 
[0] exceptions: [backtrace]: (  3) : signal_do_backtrace in file /mnt/lustre/a2fs-work2/work/n02/n02/hburns/cylc-run/u-ci575/share/fcm_make/extract/um/src/control/c_code/exceptions/exceptions.c line 81
[0] exceptions: [backtrace]: (  4) : Address: [0x2b40610b72d0] 
[0] exceptions: [backtrace]: (  4) : ?? (* Cannot Locate *)
[0] exceptions: [backtrace]: (  5) : Address: [0x2b406087dfe9] 
[0] exceptions: [backtrace]: (  5) : ?? (* Cannot Locate *)
[0] exceptions: 
[0] exceptions: To find the source line for an entry in the backtrace;
[0] exceptions: run addr2line --exe=</path/too/executable> <address>
[0] exceptions: where address is given as [0x<address>] above
[0] exceptions: 
srun: error: nid005216: task 0: Exited with exit code 11
srun: launch/slurm: _step_signal: Terminating StepId=954606.0
slurmstepd: error: *** STEP 954606.0 ON nid005216 CANCELLED AT 2022-01-10T18:08:35 ***
srun: error: nid005216: tasks 1-11: Terminated
srun: Force Terminated StepId=954606.0

I think I made all the correct changes to both the suite and the source code for the move from the 4-cab to the 23-cab system.

The suite uses helenburns/vn11.1_eccodes, which is pretty much jeffcole/vn11.1_archer2_fixes with the eccodes flags turned on (pointing to my build of eccodes).

I’m struggling to locate what’s causing the error and haven’t come across this sort of error before, so I could use some pointers in tracking down the issue.

Helen

Please allow us read access to your home and work spaces on ARCHER.

Grenville

Done! Sorry, I had forgotten the default setting wasn’t readable!

Hi Helen

If addr2line doesn’t provide a clue (which it doesn’t), then it’s probably a case of building with debug options and switching on extra output: set RCF_PRINTSTATUS to PrStatus_Diag.

Also, add

module load atp

to the HOST_HPC init-script
and

export ATP_ENABLED=1

to the runtime environment section.

(You may need to relink recon.exe with atp loaded)
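
For completeness, the addr2line check mentioned above can be scripted roughly as follows. This is only a sketch: the two sample lines are copied from the backtrace in the original post, and the -f option (print the function name as well as file and line) is an addition to the bare command the handler suggests. Addresses that fall inside shared libraries (the 0x2b40... entries) generally cannot be resolved against the main executable this way, which may be why they show up as "Cannot Locate".

```shell
# Sample: two backtrace lines copied from the job output in the original post.
backtrace='[0] exceptions: [backtrace]: (  2) : Address: [0x00409604]
[0] exceptions: [backtrace]: (  3) : Address: [0x00408178]'

# Executable named by the exception handler ("Program is ...").
exe=/work/n02/n02/hburns/cylc-run/u-ci575/share/fcm_make/build-recon/bin/um-recon.exe

# Keep only the hex addresses from the "Address:" lines, one per line.
addresses=$(printf '%s\n' "$backtrace" | grep 'Address:' | grep -o '0x[0-9a-f]*')

for addr in $addresses; do
    if [ -x "$exe" ] && command -v addr2line >/dev/null 2>&1; then
        # Resolve the address to function name and file:line.
        addr2line -f -e "$exe" "$addr"
    else
        # Executable or tool not available here: just show the command.
        echo "addr2line -f -e $exe $addr"
    fi
done
```

In practice the sample text would be replaced by the real job.err contents, for example by piping the file through the same two grep commands.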

Grenville

Hi Grenville,

I’ve re-run with the debug options and extra output.

However, I’m not getting any more information in the error message, so I’m still struggling to identify what’s causing the seg fault, and I’m not sure I’m picking up any clues in the extra output either.

Cheers
Helen

Hi Helen

I built with -g and ran to get a core file, but the backtrace can’t identify the routine names. Could you try building eccodes with -g?

Grenville

Hi Grenville

I’ve rebuilt eccodes with -g and got a shared library error, so I rebuilt again with shared libraries off:

cmake  ../eccodes-2.22.0-Source -DCMAKE_INSTALL_PREFIX=/work/n02/n02/hburns/eccodes/2.22.0 -DENABLE_EXTRA_TESTS=ON  -DCMAKE_Fortran_COMPILER=ftn -DCMAKE_Fortran_FLAGS=-g -DBUILD_SHARED_LIBS=OFF -DENABLE_JPG=OFF

Whilst doing this, I had a vague memory of having a nosey at a README for an old build (/work/y07/shared/umshared/lib/cce-10.0.4/eccodes/README.txt) that mentioned a bug in the rpath linking, so I made an equivalent edit, just in case that was what was causing the issue.

I’ve re-run and still get the same error message, with no further clues.
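
A possible variation on the configure line above, as a sketch only: eccodes is largely written in C, so -DCMAKE_Fortran_FLAGS=-g on its own may leave the C routines without debug symbols. The added -DCMAKE_C_FLAGS=-g is a standard CMake variable, but whether it makes the backtrace any more readable here is an assumption that has not been tested in this thread.

```shell
# Source directory as used in the configure line above.
src=../eccodes-2.22.0-Source

# Same options as before, plus -g for the C compiler (the untested addition).
set -- "$src" \
    -DCMAKE_INSTALL_PREFIX=/work/n02/n02/hburns/eccodes/2.22.0 \
    -DENABLE_EXTRA_TESTS=ON \
    -DCMAKE_C_FLAGS=-g \
    -DCMAKE_Fortran_COMPILER=ftn \
    -DCMAKE_Fortran_FLAGS=-g \
    -DBUILD_SHARED_LIBS=OFF \
    -DENABLE_JPG=OFF

# Keep a printable copy of the full command.
configure_cmd="cmake $*"

if [ -d "$src" ] && command -v cmake >/dev/null 2>&1; then
    # Run the configure step from an empty build directory.
    cmake "$@"
else
    # Source tree or cmake not available here: just show the command.
    echo "$configure_cmd"
fi
```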

Hi Helen,

Just out of interest, are you rebuilding eccodes because the central one built with cce-10.0.4 doesn’t work, or because you needed a newer version of eccodes than 2.19.0? I’ve held off rebuilding it centrally until I knew whether it was needed.

Cheers,
Ros.

Hi Ros,

It was because the central one didn’t work and it didn’t look like much effort to build my own. I wasn’t sure whether many people were using it.

Cheers
Helen

Hi Helen,

I don’t think many people do use it. I have no idea if anyone actually used the version I’d compiled for the 4-cab system, so I don’t know for sure whether it actually worked properly in the first place!

For what it’s worth, I’ve just built 2.24.1 with cce/12.0.3. If you want to give it a go, it’s under /work/n02/n02/ros/eccodes/lib/2.24.1.

Cheers,
Ros.

Hi Ros,

In my noseying around when I was setting this up, I did find a few personal copies of eccodes floating around, so I’m guessing a central version would get some use. I’ve re-run using your eccodes build and am still getting the same error.

Cheers
Helen

Hi Helen

Did this work on the 4-cab system?

Grenville

Hi Grenville

Unhelpfully, I only partially tested it on the 4-cab system: I checked it was picking up eccodes etc. and ran the same suite, but configured with UM forcing. So it’s very possible the same error would have occurred there too, but when I tested it on the 4-cab system I had forgotten to generate some extra ancillary files, so I think the test run stopped before it would have generated this error. I never re-ran it, as it looked like the transition to the full system was fast approaching.

Cheers
Helen