HadGEM3 crash (from NEMO)


I am using ARCHER2 to continue a run (u-bk453, previously run successfully on ARCHER for 250 years); my suite is u-ck857. I am running HadGEM3-GC3.1-LL with a modified calving file: /work/n02/n02/radiam24/lig127k_H11/GC3.1_eORCA1v2.2x_nemo_ancils_CMIP6_H11 points to the calving file /work/n02/n02/radiam24/lig127k_H11/ecalving_v2.2x_H11.nc, which is identical to the one previously used for u-bk453. I also used this calving file to run u-ck855 for 10 years, which worked fine.

HadGEM3 seems to run fine from the restart date 2100-01-01, but then crashes around 2102-11-21.
I diffed u-bk453 and u-ck857, and they look the same apart from STASH requests and ldflags_overrides_suffix=-lstdc++.

I think the error is from NEMO; is there a way to fix this?
The error from /work/n02/n02/radiam24/cylc-run/u-ck857/work/21021101T0000Z/coupled/ocean.output is:

Greenland iceberg calving climatology (kg/s) : 5130975919.5296955
Greenland iceberg calving adjusted value (kg/s) : 213652999.9999997
Antarctica iceberg calving climatology (kg/s) : 36646412.219468936
Antarctica iceberg calving adjusted value (kg/s) : 25618500.
Greenland iceshelf melting climatology (kg/s) : 0.
Greenland iceshelf melting adjusted value (kg/s) : 0.
Antarctica iceshelf melting climatology (kg/s) : -39542110.62454956
Antarctica iceshelf melting adjusted value (kg/s) : -31311500.000000004

stpctl: the elliptic solver DO not converge or explode

it: 33330 iter:2000 r: NaN b: NaN

stpctl: output of last fields

E R R O R

step: indic < 0

dia_wri_state : single instantaneous ocean state
and forcing fields file created
and named :output.abort .nc

NEMO abort from dia_wri_state
E R R O R: Calling mppstop

Thanks so much.

Best wishes,



The problem is with the solver as far as I can see. It might be worth setting ln_ctl=.true. in app/nemo_cice/rose-app.conf and resubmitting - that should at least give a more explicit error message.
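
For reference, this is a one-line namelist setting in the app config. A minimal sketch, assuming ln_ctl sits in the standard NEMO &namctl control namelist (check rose-app.conf for the exact section name in your suite):

[namelist:namctl]
ln_ctl=.true.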


Thanks, I tried that. The output from the failed timestep looks similar to that from previous timesteps.

There is ocean output in files like /work/n02/n02/radiam24/cylc-run/u-ck857/work/21021101T0000Z/coupled/output*_0004.nc; only a few of these (e.g. output*_0004.nc and output*_0005.nc) seem to contain non-zero values.
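
A quick way to check which of these files contain data or NaNs, assuming ncdump is available (e.g. via a cray-netcdf module) - ncdump prints float NaNs as NaNf:

module load cray-netcdf

for f in output*_????.nc; do
  # count the lines of ncdump output that contain NaN values
  echo "$f: $(ncdump "$f" | grep -c NaNf) lines with NaNf"
done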

Do you have any other ideas about where to look/what to try?


Best wishes,



I’d try perturbing the atmosphere start file - that might change the conditions enough to prevent the NaNs being generated by the ocean.

Rename ck857a.da21021101_00 to ck857a.da21021101_00.orig,

then create a Slurm script with the content below (I’d create and submit the file in /work/n02/n02/radiam24/cylc-run/u-ck857/share/data/History_Data/):

#!/bin/bash --login

#SBATCH --job-name=perturb
#SBATCH --nodes=1
#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --time=01:00:00

#SBATCH --account=<your account>
#SBATCH --partition=standard
#SBATCH --qos=standard

module load epcc-job-env
module load cray-python

export HISTORY_DATA=/work/n02/n02/radiam24/cylc-run/u-ck857/share/data/History_Data/
export DUMP=ck857a.da21021101_00
export UMDIR=/work/y07/shared/umshared
export PYTHONPATH=$PYTHONPATH:/work/y07/shared/umshared/lib/python3.8

# run from the dump directory so the relative file names resolve
cd $HISTORY_DATA

# read the renamed original dump and write the perturbed copy under the name the model expects
$UMDIR/scripts/perturb_theta.py $DUMP.orig --output $DUMP

then submit this file with sbatch

This will create a perturbed start file (there will be validation errors from mule, which can be ignored - I have checked that theta is perturbed). Then run the cycle again; a sketch of the commands is below.
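
A sketch of the submission and resubmission (perturb.slurm is a hypothetical name for the script above; the trigger line uses cylc 7 syntax with the task and cycle point taken from the work directory paths, and assumes the suite is still running):

cd /work/n02/n02/radiam24/cylc-run/u-ck857/share/data/History_Data/
sbatch perturb.slurm
# once the perturbed dump is in place, retrigger the failed coupled task
cylc trigger u-ck857 coupled.21021101T0000Z
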
(I’m not sure you need 10-day dumping - with monthly cycling, monthly dumping should be OK)

If this doesn’t work, it might be worth reconfiguring the start file – failing that, you may need to seek more expert NEMO help.


Hi Grenville,

Thanks so much, that worked and the coupled step ran!

The errors from mule mean that postproc_atmos fails and the atmosphere output isn’t transferred from cylc-run/u-ck857/share/data/History_Data/ to the archive, but I can transfer these files manually.

The postproc_atmos error is:

File "/work/y07/shared/umshared/lib/python3.8/mule/__init__.py", line 527, in __init__
    raise ValueError(_msg)
ValueError: Incorrect size for fixed length header; given 0 words but should be 256.
[FAIL] main_pp.py atmos <<'__STDIN__'
[FAIL] '__STDIN__' # return-code=1
2022-03-31T12:29:12Z CRITICAL - failed/EXIT


Hi Rachel,

You have a lot of empty files in History_Data, which are likely the culprits. I suggest removing these (e.g. with the find command after the listing below) and trying again.

-rw-r--r-- 1 radiam24 n02           0 Mar 29 13:41 ck857a.pv2102oct
-rw-r--r-- 1 radiam24 n02           0 Mar 29 13:41 ck857a.pu2102oct
-rw-r--r-- 1 radiam24 n02           0 Mar 29 13:41 ck857a.pt21021021
-rw-r--r-- 1 radiam24 n02           0 Mar 29 13:41 ck857a.pn21021021
-rw-r--r-- 1 radiam24 n02           0 Mar 29 13:41 ck857a.pl2102oct
-rw-r--r-- 1 radiam24 n02           0 Mar 29 13:41 ck857a.pk2102oct
-rw-r--r-- 1 radiam24 n02           0 Mar 29 13:41 ck857a.ph2102oct
-rw-r--r-- 1 radiam24 n02           0 Mar 29 13:41 ck857a.pe2102oct
-rw-r--r-- 1 radiam24 n02           0 Mar 29 13:41 ck857a.pd2102oct
-rw-r--r-- 1 radiam24 n02           0 Mar 29 13:41 ck857a.pa2102oct
-rw-r--r-- 1 radiam24 n02           0 Mar 29 13:41 ck857a.p921021021
-rw-r--r-- 1 radiam24 n02           0 Mar 29 13:41 ck857a.p821021021
-rw-r--r-- 1 radiam24 n02           0 Mar 29 13:41 ck857a.p721021021
-rw-r--r-- 1 radiam24 n02           0 Mar 29 13:41 ck857a.p621021021
-rw-r--r-- 1 radiam24 n02           0 Mar 29 13:41 ck857a.p52102oct
-rw-r--r-- 1 radiam24 n02           0 Mar 29 13:41 ck857a.p42102oct
-rw-r--r-- 1 radiam24 n02           0 Mar 29 13:41 ck857a.p32102oct
-rw-r--r-- 1 radiam24 n02           0 Mar 29 13:41 ck857a.p22102oct
-rw-r--r-- 1 radiam24 n02           0 Mar 29 13:41 ck857a.p12102oct
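
A minimal sketch of the clean-up (dry-run first; the path and file pattern are taken from the listing above):

cd /work/n02/n02/radiam24/cylc-run/u-ck857/share/data/History_Data/
find . -maxdepth 1 -name 'ck857a.p*' -empty -print    # check what would be removed
find . -maxdepth 1 -name 'ck857a.p*' -empty -delete   # then remove the empty files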


Thanks, that worked.

However, the model failed again at 2104-11-01; should I try perturbing the atmosphere start file for this month as well? The error from ocean.output is:

===>>> : E R R O R

stpctl: the zonal velocity is larger than 20 m/s

kt= 56565 max abs(U): 30.89 , i j k: 312 72 41

output of last fields in numwso

===>>> : E R R O R

step: indic < 0

dia_wri_state : single instantaneous ocean state
and forcing fields file created
and named :output.abort .nc

===>>> : E R R O R

NEMO abort from dia_wri_state
E R R O R: Calling mppstop


Best wishes,



It might be worth a try. I don’t have any better ideas.

