I’m running a nesting suite, u-ci575, and I can’t track down what’s causing this error:
 exceptions: An exception was raised:11 (Segmentation fault)
 exceptions: the exception reports the extra information: Address not mapped to object.
 exceptions: whilst in a serial region
 exceptions: Task had pid=241625 on host nid005216
 exceptions: Program is "/work/n02/n02/hburns/cylc-run/u-ci575/share/fcm_make/build-recon/bin/um-recon.exe"
Warning in umPrintMgr: umPrintExceptionHandler : Handler Invoked
 exceptions: Data address (si_addr): 0x7fff60e4f2c0; rip: 0x2b406087dfe9
 exceptions: [backtrace]: has 5 elements:
 exceptions: [backtrace]: ( 1) : Address: [0x2b406087dfe9]
 exceptions: [backtrace]: ( 1) : ?? (* Cannot Locate *)
 exceptions: [backtrace]: ( 2) : Address: [0x00409604]
 exceptions: [backtrace]: ( 2) : signal_do_backtrace_linux in file /mnt/lustre/a2fs-work2/work/n02/n02/hburns/cylc-run/u-ci575/share/fcm_make/extract/um/src/control/c_code/exceptions/exceptions-platform/exceptions-linux.c line 81
 exceptions: [backtrace]: ( 3) : Address: [0x00408178]
 exceptions: [backtrace]: ( 3) : signal_do_backtrace in file /mnt/lustre/a2fs-work2/work/n02/n02/hburns/cylc-run/u-ci575/share/fcm_make/extract/um/src/control/c_code/exceptions/exceptions.c line 81
 exceptions: [backtrace]: ( 4) : Address: [0x2b40610b72d0]
 exceptions: [backtrace]: ( 4) : ?? (* Cannot Locate *)
 exceptions: [backtrace]: ( 5) : Address: [0x2b406087dfe9]
 exceptions: [backtrace]: ( 5) : ?? (* Cannot Locate *)
 exceptions: To find the source line for an entry in the backtrace;
 exceptions: run addr2line --exe=</path/too/executable> <address>
 exceptions: where address is given as [0x<address>] above
srun: error: nid005216: task 0: Exited with exit code 11
srun: launch/slurm: _step_signal: Terminating StepId=954606.0
slurmstepd: error: *** STEP 954606.0 ON nid005216 CANCELLED AT 2022-01-10T18:08:35 ***
srun: error: nid005216: tasks 1-11: Terminated
srun: Force Terminated StepId=954606.0
I think I made all the necessary changes to both the suite and the source code for the move from the 4-cab to the 23-cab system.
The suite uses helenburns/vn11.1_eccodes, which is pretty much jeffcole/vn11.1_archer2_fixes with the eccodes flags turned on (pointing to my build of eccodes).
I’m struggling to locate what’s causing the error and haven’t come across this sort of error before, so I could use some pointers in tracking down the issue.
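The addr2line hint printed at the end of the backtrace can be applied to every address in one go. The following is a minimal sketch, assuming the job's error output has been saved to a file (job.err is a hypothetical name) and that addr2line from GNU binutils is on the path; the two function names are made up for illustration.

```shell
# Print each distinct address from "[backtrace]: ( n) : Address: [0x...]" lines,
# in order of first appearance.
backtrace_addresses() {
    sed -n 's/.*Address: \[\(0x[0-9a-fA-F]*\)].*/\1/p' "$1" | awk '!seen[$0]++'
}

# Resolve every backtrace address in a log against one executable.
# $1 = executable, $2 = log file
resolve_backtrace() {
    backtrace_addresses "$2" | while read -r addr; do
        printf '%s: ' "$addr"
        addr2line --functions --exe="$1" "$addr" | tr '\n' ' '
        printf '\n'
    done
}

# Example (job.err is a hypothetical file name):
# resolve_backtrace /work/n02/n02/hburns/cylc-run/u-ci575/share/fcm_make/build-recon/bin/um-recon.exe job.err
```

Addresses such as 0x2b406087dfe9 lie far from the executable's own range (0x0040...), so they most likely belong to a shared library or other dynamically mapped memory. addr2line run against um-recon.exe alone would then be expected to report ?? for them, which matches the "Cannot Locate" entries in the log.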
Please allow us read access to your home and work spaces on ARCHER.
Done! Sorry, I had forgotten the default setting wasn’t readable!
If addr2line doesn’t provide a clue (which it doesn’t), then it’s probably a case of building with debug options and switching on extra output: set RCF_PRINTSTATUS to PrStatus_Diag.
module load atp
to the HOST_HPC
to the runtime environment section.
(You may need to relink recon.exe with atp loaded)
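As a rough illustration of the runtime settings mentioned above, the sketch below gathers them into one shell function. Several things here are assumptions rather than anything stated in the thread: that RCF_PRINTSTATUS reaches the reconfiguration as an environment variable, that ATP is switched on with ATP_ENABLED=1, and the function name itself, which is purely illustrative. Where these lines belong in a given suite is not shown.

```shell
# Sketch only: the placement of these settings in a suite is an assumption.
enable_recon_debug() {
    # Extra reconfiguration output, as suggested in the reply above.
    export RCF_PRINTSTATUS=PrStatus_Diag

    # Assumption: ATP is activated by this variable (not stated in the thread).
    export ATP_ENABLED=1

    # Load the atp module only where a module command is available.
    if command -v module >/dev/null 2>&1; then
        module load atp || echo "warning: could not load atp" >&2
    fi
}

enable_recon_debug
```

With ATP active, a crashing task is expected to produce a merged stack trace, which is why relinking the executable with atp loaded may be needed.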
I’ve re-run with the debug options and extra output.
However, I’m not getting any more information in the error message, so I’m still struggling to identify what’s causing the seg fault, and I’m not sure I’m picking up any clues in the extra output.
I built with -g and ran to get a core file, but the backtrace can’t identify the routine names. Could you try building eccodes with -g?
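For anyone wanting to repeat the core-file step, here is a minimal sketch. It assumes gdb is available on the node where the core file is inspected, and the function name core_backtrace is made up for illustration; the thread does not say which debugger was actually used.

```shell
# In the job script, before launching the executable: allow core files.
ulimit -c unlimited 2>/dev/null || echo "warning: could not raise core file limit" >&2

# Print a backtrace from a core file in batch mode.
# $1 = executable built with -g, $2 = core file
core_backtrace() {
    cb_exe=$1
    cb_core=$2
    if [ ! -r "$cb_exe" ] || [ ! -r "$cb_core" ]; then
        echo "usage: core_backtrace <executable> <core file>" >&2
        return 2
    fi
    gdb --batch -ex "bt" "$cb_exe" "$cb_core"
}

# Example (the core file name is hypothetical):
# core_backtrace .../build-recon/bin/um-recon.exe core
```

Frames that fall inside a library built without debug information will still show up without routine names, which is the reason for asking for eccodes to be rebuilt with -g.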
I’ve rebuilt eccodes with -g and got a shared library error, so I rebuilt again with shared libraries off:
cmake ../eccodes-2.22.0-Source -DCMAKE_INSTALL_PREFIX=/work/n02/n02/hburns/eccodes/2.22.0 -DENABLE_EXTRA_TESTS=ON -DCMAKE_Fortran_COMPILER=ftn -DCMAKE_Fortran_FLAGS=-g -DBUILD_SHARED_LIBS=OFF -DENABLE_JPG=OFF
Whilst doing this, I had a vague memory of having a nosey at a README for an old build in /work/y07/shared/umshared/lib/cce-10.0.4/eccodes/README.txt, which mentioned something about a bug in the rpath linking. I made an equivalent edit, just in case that was what was causing the issue.
I’ve re-run and still get the same error message, with no further clues.
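One thing worth noting about the cmake command above: -DCMAKE_Fortran_FLAGS=-g only applies to Fortran sources, while the core of ecCodes is a C library, so the C code may still lack the requested flag. The sketch below is the same configure step with -DCMAKE_C_FLAGS=-g added. It is untested on ARCHER2, the function name is illustrative, and whether the default build type already includes debug information is not something this thread establishes.

```shell
# Same configure step as above, with -g also passed to the C compiler.
# Run from an empty build directory next to the unpacked source tree.
configure_eccodes_debug() {
    cmake ../eccodes-2.22.0-Source \
        -DCMAKE_INSTALL_PREFIX=/work/n02/n02/hburns/eccodes/2.22.0 \
        -DENABLE_EXTRA_TESTS=ON \
        -DCMAKE_Fortran_COMPILER=ftn \
        -DCMAKE_Fortran_FLAGS=-g \
        -DCMAKE_C_FLAGS=-g \
        -DBUILD_SHARED_LIBS=OFF \
        -DENABLE_JPG=OFF
}

# Example:
# mkdir build && cd build && configure_eccodes_debug && make && make install
```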
Just out of interest, are you rebuilding eccodes because the central one built with cce-10.0.4 doesn’t work, or because you needed a newer version of eccodes than 2.19.0? I have held off rebuilding it centrally until I knew whether it was needed.
It was because the central one didn’t work and it didn’t look like much effort to build my own. I wasn’t sure if perhaps not many people were using it.
I don’t think many people do use it. I have no idea if anyone actually used the version I’d compiled for the 4-cab system, so I don’t know for sure whether it worked properly in the first place!
For what it’s worth, I’ve just built 2.24.1 with cce/12.0.3. If you want to give it a go, it’s under
In my noseying around when I was setting this up, I did find a few personal copies of eccodes floating around, so I’m guessing a central version would get some use. I’ve re-run using your eccodes build and am still getting the same error.
Did this work on the 4-cab system?
Unhelpfully, I only partially tested it on the 4-cab system: I checked it was picking up eccodes etc. and ran the same suite, but configured with UM forcing. So it’s very possible the same error would have occurred there too. When I tested it on the 4-cab system I had forgotten to generate some extra ancillary files, so I think the test run stopped before it would have reached this error. I never re-ran it, as the transition to the full system looked to be fast approaching.
FYI, in case anyone else encounters the same problem: eccodes was working fine, as were the UM source code and the suite. I’ve nonetheless moved to an upgraded suite and am now pointing my UM source code to Ros’s build, in case anyone uses my suite as a template in the future, as it’s messy pointing to random builds.
The problem was that I was using ERA5 data and not obtaining it correctly: I had used get_start.sh as a template for a cdsapi Python script and was not retrieving the data properly. I’ve now corrected my scripts and it’s all working fine.
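Since the cause turned out to be the input data rather than the build, a quick sanity check on the retrieved file before running the reconfiguration can save some time. The sketch below is one possible check, not the fix that was actually applied: it tests for the four-byte "GRIB" marker that begins GRIB edition 1 and 2 messages, then lists the fields with the ecCodes grib_ls tool if it is on the path. The function names and the file name era5_start.grib are hypothetical.

```shell
# Succeed only if the file is non-empty and starts with the ASCII bytes "GRIB".
looks_like_grib() {
    [ -s "$1" ] && [ "$(dd if="$1" bs=4 count=1 2>/dev/null)" = "GRIB" ]
}

# Report on a downloaded file: reject non-GRIB content, then list its fields.
check_start_data() {
    if ! looks_like_grib "$1"; then
        echo "$1 does not look like a GRIB file" >&2
        return 1
    fi
    if command -v grib_ls >/dev/null 2>&1; then
        # Show which fields, levels, and dates were actually retrieved.
        grib_ls -p shortName,typeOfLevel,level,dataDate,dataTime "$1"
    fi
}

# Example (hypothetical file name):
# check_start_data era5_start.grib
```

The marker test only catches the grossest failures, such as an error page saved in place of the data. A retrieval that returns valid GRIB with the wrong fields or levels would pass it, which is where comparing the grib_ls listing against what the reconfiguration expects comes in.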
Thanks very much for letting us know, Helen. Now that I know it’s OK, I will put a copy of the latest eccodes build under $UMDIR when I get a few spare minutes.