Subsequent thread calling handler will sleep error in glm fcst

Hi CMS,

I’m getting a strange error in the final glm forecast of suite u-da704 which I don’t understand. The same crash has occurred after 3 re-submissions.

[180] exceptions: Subsequent thread (1) calling handler will sleep
[180] exceptions: An exception was raised:11 (Segmentation fault)
[180] exceptions: the exception reports the extra information: Address not mapped to object.
[180] exceptions: whilst in a parallel region, by thread 0
[180] exceptions: Task had pid=91935 on host nid001281
[180] exceptions: Program is “/work/n02/n02/shakka/cylc-run/u-cy223/share/fcm_make_um/build-atmos/bin/um-atmos.exe”
Warning in umPrintMgr: umPrintExceptionHandler : Handler Invoked
[180] exceptions: Data address (si_addr): 0xfffffffc149960e8; rip: 0x00e1bfb6
[180] exceptions: [backtrace]: has 20 elements:
[180] exceptions: [backtrace]: ( 1) : Address: [0x00e1bfb6]
[180] exceptions: [backtrace]: ( 1) : eg_tri_linear$eg_tri_linear_mod__cray$mt$p0001 in file The Cpu Module line 32764
[180] exceptions: [backtrace]: ( 2) : Address: [0x026b4b95]
[180] exceptions: [backtrace]: ( 2) : signal_do_backtrace_linux (* Cannot Locate *)
[180] exceptions: [backtrace]: ( 3) : Address: [0x026b2720]
[180] exceptions: [backtrace]: ( 3) : signal_handler (* Cannot Locate *)
[180] exceptions: [backtrace]: ( 4) : Address: [0x15062e98c8c0]
[180] exceptions: [backtrace]: ( 4) : ?? (* Cannot Locate *)
[180] exceptions: [backtrace]: ( 5) : Address: [0x00e1bfb6]
[180] exceptions: [backtrace]: ( 5) : eg_tri_linear$eg_tri_linear_mod__cray$mt$p0001 in file The Cpu Module line 0
[180] exceptions: [backtrace]: ( 6) : Address: [0x15062f8baf77]
[180] exceptions: [backtrace]: ( 6) : ?? (* Cannot Locate *)
[180] exceptions: [backtrace]: ( 7) : Address: [0x15062f8bd19e]
[180] exceptions: [backtrace]: ( 7) : ?? (* Cannot Locate *)
[180] exceptions: [backtrace]: ( 8) : Address: [0x15062f8bd4e0]
[180] exceptions: [backtrace]: ( 8) : ?? (* Cannot Locate *)
[180] exceptions: [backtrace]: ( 9) : Address: [0x00e1ba44]
[180] exceptions: [backtrace]: ( 9) : eg_tri_linear$eg_tri_linear_mod_ (* Cannot Locate *)
[180] exceptions: [backtrace]: ( 10) : Address: [0x00e13409]
[180] exceptions: [backtrace]: ( 10) : eg_interpolation_eta$eg_interpolation_eta_mod_ (* Cannot Locate *)
[180] exceptions: [backtrace]: ( 11) : Address: [0x00e0ee4e]
[180] exceptions: [backtrace]: ( 11) : eg_interpolation_eta_pmf$eg_interpolation_eta_pmf_mod_ (* Cannot Locate *)
[180] exceptions: [backtrace]: ( 12) : Address: [0x00fb13c1]
[180] exceptions: [backtrace]: ( 12) : departure_point_eta$departure_point_eta_mod_ (* Cannot Locate *)
[180] exceptions: [backtrace]: ( 13) : Address: [0x011392e3]
[180] exceptions: [backtrace]: ( 13) : eg_sl_wind_u$eg_sl_wind_u_mod_ (* Cannot Locate *)
[180] exceptions: [backtrace]: ( 14) : Address: [0x01129a1e]
[180] exceptions: [backtrace]: ( 14) : eg_sl_full_wind$eg_sl_full_wind_mod_ (* Cannot Locate *)
[180] exceptions: [backtrace]: ( 15) : Address: [0x00c7be96]
[180] exceptions: [backtrace]: ( 15) : atm_step_4a$atm_step_4a_mod_ (* Cannot Locate *)
[180] exceptions: [backtrace]: ( 16) : Address: [0x0045fe43]
[180] exceptions: [backtrace]: ( 16) : u_model_4a$u_model_4a_mod_ (* Cannot Locate *)
[180] exceptions: [backtrace]: ( 17) : Address: [0x00410efc]
[180] exceptions: [backtrace]: ( 17) : um_shell$um_shell_mod_ (* Cannot Locate *)
[180] exceptions: [backtrace]: ( 18) : Address: [0x004097d8]
[180] exceptions: [backtrace]: ( 18) : main (* Cannot Locate *)
[180] exceptions: [backtrace]: ( 19) : Address: [0x15062e5b629d]
[180] exceptions: [backtrace]: ( 19) : ?? (* Cannot Locate *)
[180] exceptions: [backtrace]: ( 20) : Address: [0x004095aa]
[180] exceptions: [backtrace]: ( 20) : _start in file /home/abuild/rpmbuild/BUILD/glibc-2.31/csu/…/sysdeps/x86_64/start.S line 122
[180] exceptions:
[180] exceptions: To find the source line for an entry in the backtrace;
[180] exceptions: run addr2line --exe=</path/too/executable> <address>
[180] exceptions: where address is given as [0x<address>] above
[180] exceptions:
srun: error: nid001281: task 180: Exited with exit code 11
srun: launch/slurm: _step_signal: Terminating StepId=4914401.0
slurmstepd: error: *** STEP 4914401.0 ON nid001278 CANCELLED AT 2023-11-21T10:24:33 ***

Have you seen this before? There’s nothing useful in job.out or pe_output. If it was the regional forecast I’d just skip and try re-running the previous cycle with a longer forecast, but if it’s in the glm it seems more fundamental. I haven’t changed anything in the suite and it was running happily until this cycle.

Cheers
Ella

Hi Ella

It looks like some kind of model instability. The err file reports a backtrace showing where the error occurred and the call path that led there. The failure is in eg_tri_linear, with a memory error of some kind (the segmentation fault, "Address not mapped to object").
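
If it helps, the addr2line hint at the end of the err output can be scripted. The following is only a rough sketch, not a tested CMS recipe: it pulls the `Address: [0x...]` entries out of the backtrace and builds a single addr2line command. The executable path is a placeholder, the function names are made up for illustration, and the "below 2**32" filter is just a heuristic I'm assuming for telling addresses inside the executable apart from the 0x1506... shared-library ones. addr2line will only return file and line numbers if the executable was built with debug information.

```python
import re
import shlex

# Matches lines such as:
# [180] exceptions: [backtrace]: ( 9) : Address: [0x00e1ba44]
ADDRESS_LINE = re.compile(
    r"\[backtrace\]:\s*\(\s*(\d+)\)\s*:\s*Address:\s*\[(0x[0-9a-fA-F]+)\]"
)


def backtrace_addresses(log_text):
    """Return (frame_number, address) pairs in the order they appear."""
    return [(int(frame), addr) for frame, addr in ADDRESS_LINE.findall(log_text)]


def addr2line_command(executable, addresses):
    """Build one addr2line call covering all addresses (-f adds function names)."""
    parts = ["addr2line", "-f", "--exe=" + shlex.quote(executable)]
    return " ".join(parts + list(addresses))


# A few lines copied from the backtrace in the err file.
sample = """\
[180] exceptions: [backtrace]: ( 1) : Address: [0x00e1bfb6]
[180] exceptions: [backtrace]: ( 4) : Address: [0x15062e98c8c0]
[180] exceptions: [backtrace]: ( 9) : Address: [0x00e1ba44]
[180] exceptions: [backtrace]: ( 10) : Address: [0x00e13409]
"""

frames = backtrace_addresses(sample)

# Heuristic (an assumption): keep only addresses that look like they sit
# inside the executable rather than in a shared library.
in_executable = [addr for _, addr in frames if int(addr, 16) < 2**32]

# Placeholder path: substitute the um-atmos.exe named in the err file.
command = addr2line_command("/path/to/um-atmos.exe", in_executable)
print(command)
```

Running the printed command on the machine where the executable was built should map each address to a source file and line, provided the build kept the debug symbols.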

Can you re-run with PRINT_STATUS set to extra diagnostic messages? I’m not too hopeful it will help much, but it’s worth a try.

Grenville

Interesting, thanks Grenville. Will have a go with extra diagnostics now… E

Nothing useful in the extra diags output, but I’ve skipped the cycle for now and will come back to it later…