HadGEM3-GC3.1-LL SSP245 still debugging

penmaher · 20 May 2025 15:23

Hi CMS team,

I am still working on getting my SSP245 HadGEM3-GC3.1-LL suite working. This is a follow up to

I would like to try a different set of restart files if possible please. I have been digging around in a few suites and I noticed one of interest. Ros moved some files from the MO for me a few months back and I was hoping to get a couple more files please. Would it be possible to have a copy of the following restart files please? The path of the data is /nesi/project/uoo03538/andrew.pauling/MASS/u-bg469/2015-restarts/

bg469a.da20150101_00
bg469i.restart.2015-01-01-00000.nc
bg469o_icebergs_20150101_restart.nc
bg469o_20150101_restart.nc
bg469o_20150101_restart_trc.nc

Could these files be placed into the following location please?
/work/n02/n02/penmaher/ssp585_N96O1_ensemble1_dumps/u-bg469/

Thank you in advance!

Penny

penmaher · 20 May 2025 15:28

These files might also be at the following path at the MO:

/data/d01/ukcmip6/ssp585_N96O1_ensemble1_dumps/

RosalynHatcher · 22 May 2025 15:59

Hi Penny,

I will take a look for those files when I’m back in the office and let you know when I have them.

Regards,
Ros.

RosalynHatcher · 27 May 2025 09:15

Hi Penny,

Unfortunately those files are not under /data/d01/ukcmip6/ssp585_N96O1_ensemble1_dumps/ and I have no idea what filesystem the /nesi/project path is on. It’s not a MO path I recognise nor can I find it on the XCE/F or XCS. I’d suggest talking to someone at the Met Office.

Regards,
Ros

penmaher · 27 May 2025 10:10

Thanks for getting back to me Ros. I will track down the files and get back to you.

Thanks,

Penny

penmaher · 27 May 2025 12:47

Hi Ros,

I got in contact with the owner of the suite. The /nesi/project path is local to their institution. Sorry for that. He said he got the files from moose. I am happy to try and retrieve them from moose. Could you tell me what module I need to load for moo please? I have not been able to find it in the online intro courses or within the helpdesk tickets.

Penny

RosalynHatcher · 27 May 2025 14:28

Hi Penny,

MASS cannot be accessed from ARCHER2. It can be accessed from JASMIN, however. Instructions on how to apply for access are here: JASMIN Help Site - How to apply for MASS access

Regards,
Ros

penmaher · 3 June 2025 14:03

Hi CMS team,

I have tested the idea that the restart files might explain why the model run blows up (too fast ocean velocity) within a few timesteps. But a different set of restart files also blew up (too fast vwind).

So I am back looking at my suite changes to debug it.

I started a new suite u-dq189 which I branched from my working piControl run on archer2 (which was originally u-as037 and updated to run on archer2).

I have changed the suite from a piControl to SSP245. I did this by comparing old MO suites (ar766, which is a piControl suite, and bj616, which is an SSP245 suite).
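A comparison like this can be partly automated. The following is only a minimal sketch, assuming the app configs are plain `[section]` / `key=value` text; the section and item names in the sample are placeholders, not real UM namelist items, and the small parser is a simplification rather than the Rose config reader (it skips `=`-prefixed continuation lines instead of joining them).

```python
def read_items(text):
    """Flatten rose-app.conf-style text into {(section, key): value}.

    Simplified on purpose: comments, blank lines and '='-prefixed
    continuation lines are skipped.
    """
    items, section = {}, ""
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "=")):
            continue
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]
        elif "=" in line:
            key, value = line.split("=", 1)
            items[(section, key.strip())] = value.strip()
    return items


def compare(old_text, new_text):
    """Report items removed, added, or changed between two configs."""
    old, new = read_items(old_text), read_items(new_text)
    return {
        "only_in_old": sorted(set(old) - set(new)),
        "only_in_new": sorted(set(new) - set(old)),
        "changed": sorted(k for k in set(old) & set(new) if old[k] != new[k]),
    }


# Placeholder content standing in for two suites' app configs.
old_conf = "[namelist:example]\nitem_a=1\nitem_b=2\n"
new_conf = "[namelist:example]\nitem_a=5\nitem_c=3\n"
result = compare(old_conf, new_conf)
for label, keys in result.items():
    print(label, keys)
```

Items listed under `only_in_old` are the ones worth checking first, since an item dropped by accident while porting changes would show up there.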

I am now trying to get this suite running. I have not included ozone redistribution, as I just want the suite to run at this point (it is easier to debug) rather than the science being correct (i.e. the ozone redistribution).

When I run the suite, I get a very unhelpful job.err exception that reads:

[2] exceptions: An exception was raised:8 (Floating point exception)
[2] exceptions: the exception reports the extra information: Integer divide by zero.
[2] exceptions: whilst in a serial region
[2] exceptions: Task had pid=23868 on host nid004585
[2] exceptions: Program is "/mnt/lustre/a2fs-work2/work/n02/n02/penmaher/cylc-run/u-dq189/work/20150101T0000Z/coupled/./atmos.exe"
Warning in umPrintMgr: umPrintExceptionHandler : Handler Invoked
[2] exceptions: Data address (si_addr): 0x00a0bbfb; rip: 0x00a0bbfb

The above message appears when I set any of the following:

PRINT_STATUS=Normal
PRINT_STATUS=PrStatus_Min
PRINT_STATUS=Operational

When I use

PRINT_STATUS=PrStatus_Diag

I reach the file limit with the error

lib-4211 : UNRECOVERABLE library error
A WRITE operation tried to write a record that was too long.

Do you have any ideas on how I can get some more meaningful information on why my job is failing? The code is running into the exception while in the ocean model for the first time (it finished the ocean.out prematurely and did not create any content within the pe_output).

Thank you in advance,
Penny

mdalvi · 3 June 2025 14:52

Hi,

Looking at further traceback in the job.err the failure seems to be related to one of the nlst* namelists.
In app/um/rose-app.conf under [namelist:nlstcgen] two required items seem to have been inadvertently removed:

secs_per_periodim=86400
steps_per_periodim=${ATMOS_TIMESTEPS_PER_DAY}

The failure from PrStatus_Diag is probably an unrelated bug where a print statement has been added without testing at this level.
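A quick pre-submission check for the two items above could look like the sketch below. It is not part of Rose or the UM: the function names are made up for illustration, the parser is a simplification of the real config format, and the `some_other_item` line in the sample is a placeholder. An item prefixed with `!` (ignored in Rose) is deliberately treated as missing here, on the assumption that ignored items are not passed to the model.

```python
REQUIRED = {
    "namelist:nlstcgen": ["secs_per_periodim", "steps_per_periodim"],
}


def parse_conf(text):
    """Parse '[section]' / 'key=value' text into {section: {key: value}}.

    Simplified: comments, blank lines and '='-prefixed continuation
    lines are skipped rather than interpreted.
    """
    sections, current = {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "=")):
            continue
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]
            sections.setdefault(current, {})
        elif "=" in line and current is not None:
            key, value = line.split("=", 1)
            sections[current][key.strip()] = value.strip()
    return sections


def missing_items(text, required=REQUIRED):
    """Return (section, key) pairs that are required but absent."""
    sections = parse_conf(text)
    return [
        (section, key)
        for section, keys in required.items()
        for key in keys
        if key not in sections.get(section, {})
    ]


# Placeholder config text with both required items removed.
broken = "[namelist:nlstcgen]\nsome_other_item=1\n"
fixed = (
    "[namelist:nlstcgen]\n"
    "secs_per_periodim=86400\n"
    "steps_per_periodim=${ATMOS_TIMESTEPS_PER_DAY}\n"
)
print(missing_items(broken))
print(missing_items(fixed))
```

To run it against a real suite, the text could be read with `pathlib.Path("app/um/rose-app.conf").read_text()` from the suite directory.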

system · 3 July 2025 14:53

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.
