HadGEM3 crash (from NEMO)

Hi,

I am using ARCHER2 to continue a run (u-bk453, previously run on ARCHER successfully for 250 years). My suite is u-ck857. I am running HadGEM3-GC3.1-LL, with a modified calving file: /work/n02/n02/radiam24/lig127k_H11/GC3.1_eORCA1v2.2x_nemo_ancils_CMIP6_H11 points to the calving file /work/n02/n02/radiam24/lig127k_H11/ecalving_v2.2x_H11.nc, which is identical to the one previously used for u-bk453. I also used this calving file to run u-ck855 for 10 years, which worked fine.

HadGEM3 seems to run fine from the restart date 2100-01-01, but then crashes around 2102-11-21.
I diffed u-bk453 and u-ck857; apart from the STASH requests and ldflags_overrides_suffix=-lstdc++, the two suites look very similar.

I think the error is from NEMO; is there a way to fix this?
The error from /work/n02/n02/radiam24/cylc-run/u-ck857/work/21021101T0000Z/coupled/ocean.output is:

Greenland iceberg calving climatology (kg/s) : 5130975919.5296955
Greenland iceberg calving adjusted value (kg/s) : 213652999.9999997
Antarctica iceberg calving climatology (kg/s) : 36646412.219468936
Antarctica iceberg calving adjusted value (kg/s) : 25618500.
Greenland iceshelf melting climatology (kg/s) : 0.
Greenland iceshelf melting adjusted value (kg/s) : 0.
Antarctica iceshelf melting climatology (kg/s) : -39542110.62454956
Antarctica iceshelf melting adjusted value (kg/s) : -31311500.000000004

stpctl: the elliptic solver DO not converge or explode
it: 33330 iter:2000 r: NaN b: NaN
stpctl: output of last fields

E R R O R
step: indic < 0

dia_wri_state : single instantaneous ocean state
and forcing fields file created
and named :output.abort .nc

E R R O R
MPPSTOP
NEMO abort from dia_wri_state
E R R O R: Calling mppstop

Thanks so much.

Best wishes,

Rachel

Rachel

The problem is with the solver as far as I can see. It might be worth setting ln_ctl=.true. in app/nemo_cice/rose-app.conf and resubmitting - that should hopefully at least give a more explicit error message.
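In the app config that would look something like this (ln_ctl sits in NEMO's &namctl namelist; check the exact section name your rose-app.conf uses):

[namelist:namctl]
ln_ctl=.true.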

Grenville

Thanks, I tried that. The output from the timestep that failed looks similar to the output from previous timesteps to me.

There is ocean output in files like /work/n02/n02/radiam24/cylc-run/u-ck857/work/21021101T0000Z/coupled/output*_0004.nc; only a few of these (e.g. output*_0004.nc and output*_0005.nc) seem to contain non-zero values.
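(A crude way to see which of the per-processor files actually contain NaNs, in case that helps localise the blow-up: ncdump prints NaN values literally, so something like

cd /work/n02/n02/radiam24/cylc-run/u-ck857/work/21021101T0000Z/coupled
# count lines containing NaN in each per-processor file (needs an ncdump binary on the path)
for f in output*_00*.nc; do
  echo "$f: $(ncdump "$f" | grep -c NaN) lines with NaN"
done

gives a rough count per file.)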

Do you have any other ideas about where to look/what to try?

Thanks.

Best wishes,

Rachel

Rachel

I’d try perturbing the atmosphere start file - that might change the conditions enough to prevent the NaNs being generated by the ocean.

Rename ck857a.da21021101_00 to ck857a.da21021101_00.orig.

Then create a Slurm script with the content below (I'd create and submit the file in /work/n02/n02/radiam24/cylc-run/u-ck857/share/data/History_Data/):

#!/bin/bash --login

#SBATCH --job-name=perturb
#SBATCH --nodes=1
#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --time=01:00:00

#SBATCH --account=<your account>
#SBATCH --partition=standard
#SBATCH --qos=standard

module load epcc-job-env
module load cray-python

export HISTORY_DATA=/work/n02/n02/radiam24/cylc-run/u-ck857/share/data/History_Data/
export DUMP=ck857a.da21021101_00
export UMDIR=/work/y07/shared/umshared
export PYTHONPATH=$PYTHONPATH:/work/y07/shared/umshared/lib/python3.8

cd $HISTORY_DATA

# perturb theta in the renamed dump (.orig) and write the result back under the original dump name
$UMDIR/scripts/perturb_theta.py $DUMP.orig --output $DUMP

then sbatch this file
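In full, assuming you name the batch script perturb.slurm, the sequence would be roughly:

cd /work/n02/n02/radiam24/cylc-run/u-ck857/share/data/History_Data/
mv ck857a.da21021101_00 ck857a.da21021101_00.orig
sbatch perturb.slurm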

This will create a perturbed start file (there will be validation errors from mule, which can be ignored; I have checked that theta is perturbed). Then run the cycle again.
(I’m not sure you need 10-day dumping - with monthly cycling, monthly dumping should be OK)

If this doesn’t work, it might be worth reconfiguring the start file; failing that, you may need to seek more expert NEMO help.

Grenville

Hi Grenville,

Thanks so much, that worked and the coupled step ran!

The errors from mule mean that postproc_atmos fails and the atmosphere output isn’t transferred from cylc-run/u-ck857/share/data/History_Data/ to the archive, but I can transfer these files manually.

The postproc_atmos error is:

File "/work/y07/shared/umshared/lib/python3.8/mule/__init__.py", line 527, in __init__
    raise ValueError(_msg)
ValueError: Incorrect size for fixed length header; given 0 words but should be 256.
[FAIL] main_pp.py atmos <<'__STDIN__'
[FAIL]
[FAIL] '__STDIN__' # return-code=1
2022-03-31T12:29:12Z CRITICAL - failed/EXIT

Rachel

Hi Rachel,

You have a lot of empty files in History_Data which are likely the culprits. I suggest removing these and trying again; a quick way to do it is sketched below the listing.

-rw-r--r-- 1 radiam24 n02           0 Mar 29 13:41 ck857a.pv2102oct
-rw-r--r-- 1 radiam24 n02           0 Mar 29 13:41 ck857a.pu2102oct
-rw-r--r-- 1 radiam24 n02           0 Mar 29 13:41 ck857a.pt21021021
-rw-r--r-- 1 radiam24 n02           0 Mar 29 13:41 ck857a.pn21021021
-rw-r--r-- 1 radiam24 n02           0 Mar 29 13:41 ck857a.pl2102oct
-rw-r--r-- 1 radiam24 n02           0 Mar 29 13:41 ck857a.pk2102oct
-rw-r--r-- 1 radiam24 n02           0 Mar 29 13:41 ck857a.ph2102oct
-rw-r--r-- 1 radiam24 n02           0 Mar 29 13:41 ck857a.pe2102oct
-rw-r--r-- 1 radiam24 n02           0 Mar 29 13:41 ck857a.pd2102oct
-rw-r--r-- 1 radiam24 n02           0 Mar 29 13:41 ck857a.pa2102oct
-rw-r--r-- 1 radiam24 n02           0 Mar 29 13:41 ck857a.p921021021
-rw-r--r-- 1 radiam24 n02           0 Mar 29 13:41 ck857a.p821021021
-rw-r--r-- 1 radiam24 n02           0 Mar 29 13:41 ck857a.p721021021
-rw-r--r-- 1 radiam24 n02           0 Mar 29 13:41 ck857a.p621021021
-rw-r--r-- 1 radiam24 n02           0 Mar 29 13:41 ck857a.p52102oct
-rw-r--r-- 1 radiam24 n02           0 Mar 29 13:41 ck857a.p42102oct
-rw-r--r-- 1 radiam24 n02           0 Mar 29 13:41 ck857a.p32102oct
-rw-r--r-- 1 radiam24 n02           0 Mar 29 13:41 ck857a.p22102oct
-rw-r--r-- 1 radiam24 n02           0 Mar 29 13:41 ck857a.p12102oct
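
Something like the following would do it (run in History_Data; check the -print output first so you're happy with what will be deleted):

cd /work/n02/n02/radiam24/cylc-run/u-ck857/share/data/History_Data/
# list the zero-length fields files, then remove them once the match looks right
find . -maxdepth 1 -name 'ck857a.p*' -size 0 -print
find . -maxdepth 1 -name 'ck857a.p*' -size 0 -delete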

Regards,
Ros.

Thanks, that worked.

However, the model failed again at 2104-11-01; should I try perturbing the atmosphere start file for this month as well? The error from ocean.output is:

===>>> : E R R O R
===========

stpctl: the zonal velocity is larger than 20 m/s

kt= 56565 max abs(U): 30.89 , i j k: 312 72 41

output of last fields in numwso

===>>> : E R R O R
===========

step: indic < 0

dia_wri_state : single instantaneous ocean state
and forcing fields file created
and named :output.abort .nc

===>>> : E R R O R
===========

MPPSTOP
NEMO abort from dia_wri_state
E R R O R: Calling mppstop
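(In case it's useful, I guess the point that blew up could be inspected in the per-processor output.abort files with something along these lines; the cycle directory, file name, and dimension names below are guesses and would need checking against ncdump -h first:

module load nco    # assuming an nco module is available for ncks
cd /work/n02/n02/radiam24/cylc-run/u-ck857/work/21041101T0000Z/coupled
ncdump -h output.abort_0000.nc
# subset near the reported point i=312, j=72 (ncks indices are 0-based)
ncks -d x,311 -d y,71 output.abort_0000.nc blowup_point.nc
)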

Thanks.

Best wishes,

Rachel

Rachel

It might be worth a try. I don’t have any better ideas.
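The recipe is the same as before. Assuming the new failed cycle's start dump follows the same naming (ck857a.da21041101_00), something like:

cd /work/n02/n02/radiam24/cylc-run/u-ck857/share/data/History_Data/
mv ck857a.da21041101_00 ck857a.da21041101_00.orig
# change the "export DUMP=..." line in the batch script to ck857a.da21041101_00, then
sbatch perturb.slurm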

Grenville
