HadGEM3 crash (from NEMO)

Hi,

I am using ARCHER2 to continue a run (u-bk453, previously run on ARCHER successfully for 250 years). My suite is u-ck857. I am running HadGEM3-GC3.1-LL, with a modified calving file: /work/n02/n02/radiam24/lig127k_H11/GC3.1_eORCA1v2.2x_nemo_ancils_CMIP6_H11 points to the calving file /work/n02/n02/radiam24/lig127k_H11/ecalving_v2.2x_H11.nc, which is identical to the one previously used for u-bk453. I also used this calving file to run u-ck855 for 10 years, which worked fine.

HadGEM3 seems to run fine from the restart date 2100-01-01, but then crashes around 2102-11-21.
I diffed u-bk453 and u-ck857; apart from the STASH requests and ldflags_overrides_suffix=-lstdc++, the two suites look very similar.

I think the error is from NEMO; is there a way to fix this?
The error from /work/n02/n02/radiam24/cylc-run/u-ck857/work/21021101T0000Z/coupled/ocean.output is:

Greenland iceberg calving climatology (kg/s) : 5130975919.5296955
Greenland iceberg calving adjusted value (kg/s) : 213652999.9999997
Antarctica iceberg calving climatology (kg/s) : 36646412.219468936
Antarctica iceberg calving adjusted value (kg/s) : 25618500.
Greenland iceshelf melting climatology (kg/s) : 0.
Greenland iceshelf melting adjusted value (kg/s) : 0.
Antarctica iceshelf melting climatology (kg/s) : -39542110.62454956
Antarctica iceshelf melting adjusted value (kg/s) : -31311500.000000004

stpctl: the elliptic solver DO not converge or explode
it: 33330 iter:2000 r: NaN b: NaN
stpctl: output of last fields

E R R O R
step: indic < 0

dia_wri_state : single instantaneous ocean state
and forcing fields file created
and named :output.abort .nc

E R R O R
MPPSTOP
NEMO abort from dia_wri_state
E R R O R: Calling mppstop

Thanks so much.

Best wishes,

Rachel

Rachel

The problem is with the solver as far as I can see. It might be worth setting ln_ctl=.true. in app/nemo_cice/rose-app.conf and resubmitting - that should hopefully at least give a more explicit error message.
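In the app config that would look something like this (ln_ctl sits in NEMO's &namctl namelist; check the exact section name your rose-app.conf uses):

[namelist:namctl]
ln_ctl=.true.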

Grenville

Thanks, I tried that. The output from the timestep that failed looks similar to the output from previous timesteps to me.

There is ocean output in files like /work/n02/n02/radiam24/cylc-run/u-ck857/work/21021101T0000Z/coupled/output*_0004.nc; only a few of these (e.g. output*_0004.nc and output*_0005.nc) seem to contain non-zero values.
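(A crude way to see which of the per-processor files actually contain NaNs, in case that helps localise the blow-up: ncdump prints NaN values literally, so something like

cd /work/n02/n02/radiam24/cylc-run/u-ck857/work/21021101T0000Z/coupled
# count lines containing NaN in each per-processor file (needs an ncdump binary on the path)
for f in output*_00*.nc; do
  echo "$f: $(ncdump "$f" | grep -c NaN) lines with NaN"
done

gives a rough count per file.)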

Do you have any other ideas about where to look/what to try?

Thanks.

Best wishes,

Rachel

Rachel

I’d try perturbing the atmosphere start file - that might change the conditions enough to prevent the NaNs being generated by the ocean.

Rename ck857a.da21021101_00 to ck857a.da21021101_00.orig.

Then create a Slurm script with the content below (I'd create and submit the file in /work/n02/n02/radiam24/cylc-run/u-ck857/share/data/History_Data/):

#!/bin/bash --login

#SBATCH --job-name=perturb
#SBATCH --nodes=1
#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --time=01:00:00

#SBATCH --account=<your account>
#SBATCH --partition=standard
#SBATCH --qos=standard

module load epcc-job-env
module load cray-python

export HISTORY_DATA=/work/n02/n02/radiam24/cylc-run/u-ck857/share/data/History_Data/
export DUMP=ck857a.da21021101_00
export UMDIR=/work/y07/shared/umshared
export PYTHONPATH=$PYTHONPATH:/work/y07/shared/umshared/lib/python3.8

cd $HISTORY_DATA

# perturb theta in the renamed dump (.orig) and write the result back under the original dump name
$UMDIR/scripts/perturb_theta.py $DUMP.orig --output $DUMP

then sbatch this file
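In full, assuming you name the batch script perturb.slurm, the sequence would be roughly:

cd /work/n02/n02/radiam24/cylc-run/u-ck857/share/data/History_Data/
mv ck857a.da21021101_00 ck857a.da21021101_00.orig
sbatch perturb.slurm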

This will create a perturbed start file (there will be validation errors from mule, which can be ignored; I have checked that theta is perturbed). Then run the cycle again.
(I’m not sure you need 10-day dumping - with monthly cycling, monthly dumping should be OK)

If this doesn’t work, it might be worth reconfiguring the start file; failing that, you may need to seek more expert NEMO help.

Grenville

Hi Grenville,

Thanks so much, that worked and the coupled step ran!

The errors from mule mean that postproc_atmos fails and the atmosphere output isn’t transferred from cylc-run/u-ck857/share/data/History_Data/ to the archive, but I can transfer these files manually.

The postproc_atmos error is:

File "/work/y07/shared/umshared/lib/python3.8/mule/__init__.py", line 527, in __init__
    raise ValueError(_msg)
ValueError: Incorrect size for fixed length header; given 0 words but should be 256.
[FAIL] main_pp.py atmos <<'__STDIN__'
[FAIL]
[FAIL] '__STDIN__' # return-code=1
2022-03-31T12:29:12Z CRITICAL - failed/EXIT

Rachel

Hi Rachel,

You have a lot of empty files in History_Data which are likely the culprits. I suggest removing these and trying again; a quick way to do it is sketched below the listing.

-rw-r--r-- 1 radiam24 n02           0 Mar 29 13:41 ck857a.pv2102oct
-rw-r--r-- 1 radiam24 n02           0 Mar 29 13:41 ck857a.pu2102oct
-rw-r--r-- 1 radiam24 n02           0 Mar 29 13:41 ck857a.pt21021021
-rw-r--r-- 1 radiam24 n02           0 Mar 29 13:41 ck857a.pn21021021
-rw-r--r-- 1 radiam24 n02           0 Mar 29 13:41 ck857a.pl2102oct
-rw-r--r-- 1 radiam24 n02           0 Mar 29 13:41 ck857a.pk2102oct
-rw-r--r-- 1 radiam24 n02           0 Mar 29 13:41 ck857a.ph2102oct
-rw-r--r-- 1 radiam24 n02           0 Mar 29 13:41 ck857a.pe2102oct
-rw-r--r-- 1 radiam24 n02           0 Mar 29 13:41 ck857a.pd2102oct
-rw-r--r-- 1 radiam24 n02           0 Mar 29 13:41 ck857a.pa2102oct
-rw-r--r-- 1 radiam24 n02           0 Mar 29 13:41 ck857a.p921021021
-rw-r--r-- 1 radiam24 n02           0 Mar 29 13:41 ck857a.p821021021
-rw-r--r-- 1 radiam24 n02           0 Mar 29 13:41 ck857a.p721021021
-rw-r--r-- 1 radiam24 n02           0 Mar 29 13:41 ck857a.p621021021
-rw-r--r-- 1 radiam24 n02           0 Mar 29 13:41 ck857a.p52102oct
-rw-r--r-- 1 radiam24 n02           0 Mar 29 13:41 ck857a.p42102oct
-rw-r--r-- 1 radiam24 n02           0 Mar 29 13:41 ck857a.p32102oct
-rw-r--r-- 1 radiam24 n02           0 Mar 29 13:41 ck857a.p22102oct
-rw-r--r-- 1 radiam24 n02           0 Mar 29 13:41 ck857a.p12102oct
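
Something like the following would do it (run in History_Data; check the -print output first so you're happy with what will be deleted):

cd /work/n02/n02/radiam24/cylc-run/u-ck857/share/data/History_Data/
# list the zero-length fields files, then remove them once the match looks right
find . -maxdepth 1 -name 'ck857a.p*' -size 0 -print
find . -maxdepth 1 -name 'ck857a.p*' -size 0 -delete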

Regards,
Ros.

Thanks, that worked.

However, the model failed again at 2104-11-01; should I try perturbing the atmosphere start file for this month as well? The error from ocean.output is:

===>>> : E R R O R
===========

stpctl: the zonal velocity is larger than 20 m/s

kt= 56565 max abs(U): 30.89 , i j k: 312 72 41

output of last fields in numwso

===>>> : E R R O R
===========

step: indic < 0

dia_wri_state : single instantaneous ocean state
and forcing fields file created
and named :output.abort .nc

===>>> : E R R O R
===========

MPPSTOP
NEMO abort from dia_wri_state
E R R O R: Calling mppstop
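(In case it's useful, I guess the point that blew up could be inspected in the per-processor output.abort files with something along these lines; the cycle directory, file name, and dimension names below are guesses and would need checking against ncdump -h first:

module load nco    # assuming an nco module is available for ncks
cd /work/n02/n02/radiam24/cylc-run/u-ck857/work/21041101T0000Z/coupled
ncdump -h output.abort_0000.nc
# subset near the reported point i=312, j=72 (ncks indices are 0-based)
ncks -d x,311 -d y,71 output.abort_0000.nc blowup_point.nc
)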

Thanks.

Best wishes,

Rachel

Rachel

It might be worth a try. I don’t have any better ideas.
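The recipe is the same as before. Assuming the new failed cycle's start dump follows the same naming (ck857a.da21041101_00), something like:

cd /work/n02/n02/radiam24/cylc-run/u-ck857/share/data/History_Data/
mv ck857a.da21041101_00 ck857a.da21041101_00.orig
# change the "export DUMP=..." line in the batch script to ck857a.da21041101_00, then
sbatch perturb.slurm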

Grenville
