HadGEM3-GC3.1-LL SSP245 still debugging

Hi CMS team,

I am still working on getting my SSP245 HadGEM3-GC3.1-LL suite working. This is a follow up to

I would like to try a different set of restart files, if possible. I have been digging around in a few suites and noticed one of interest. Ros moved some files from the MO for me a few months back, and I was hoping to get a few more. Would it be possible to have a copy of the following restart files, please? The data are at /nesi/project/uoo03538/andrew.pauling/MASS/u-bg469/2015-restarts/

bg469a.da20150101_00
bg469i.restart.2015-01-01-00000.nc
bg469o_icebergs_20150101_restart.nc
bg469o_20150101_restart.nc
bg469o_20150101_restart_trc.nc

Could these files be placed into the following location please?
/work/n02/n02/penmaher/ssp585_N96O1_ensemble1_dumps/u-bg469/

Thank you in advance!

Penny

These files might also be at the following path at the MO:

/data/d01/ukcmip6/ssp585_N96O1_ensemble1_dumps/

Hi Penny,

I will take a look for those files when I’m back in the office and let you know when I have them.

Regards,
Ros.

Hi Penny,

Unfortunately those files are not under /data/d01/ukcmip6/ssp585_N96O1_ensemble1_dumps/ and I have no idea what filesystem the /nesi/project path is on. It’s not a MO path I recognise nor can I find it on the XCE/F or XCS. I’d suggest talking to someone at the Met Office.

Regards,
Ros

Thanks for getting back to me Ros. I will track down the files and get back to you.

Thanks,

Penny

Hi Ros,

I got in contact with the owner of the suite. The /nesi/project path is local to their institution; sorry about that. He said he got the files from moose. I am happy to try to retrieve them from moose myself. Could you tell me which module I need to load for moo, please? I have not been able to find it in the online intro courses or in the helpdesk tickets.

Penny

Hi Penny,

MASS cannot be accessed from ARCHER2. It can, however, be accessed from JASMIN. Instructions on how to apply for access are here: JASMIN Help Site - How to apply for MASS access

Regards,
Ros

Hi CMS team,

I have tested the idea that the restart files might explain why the model run blows up (ocean velocity too fast) within a few timesteps. However, a run with a different set of restart files also blew up (vwind too fast).

So I am back looking at my suite changes to debug it.

I started a new suite, u-dq189, which I branched from my working piControl run on ARCHER2 (originally u-as037, updated to run on ARCHER2).

I have changed the suite from piControl to SSP245. I did this by comparing old MO suites (ar766, which is a piControl suite, and bj616, which is an SSP245 suite).
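
A comparison like this can be sketched as a small script that lists the settings present in one rose-app.conf but absent from another. This is only a rough sketch: the parsing is simplified (continuation lines of multi-line values are skipped), the function names are made up for illustration, and the two file paths are supplied on the command line rather than assumed.

```python
import sys
from pathlib import Path


def setting_names(text):
    """Return the set of '[section]key' names found in rose-app.conf-style text.

    Simplified on purpose: comment lines and indented continuation lines
    are skipped, and values are not compared. Keys marked as ignored
    (leading '!') keep their marker, so a toggled setting shows up too.
    """
    names = set()
    section = ""
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]
        elif line and not line.startswith("#") and not raw[0].isspace() and "=" in line:
            key = line.partition("=")[0].strip()
            names.add(f"[{section}]{key}")
    return names


def compare(reference_text, candidate_text):
    """Return (only in reference, only in candidate) as two sorted lists."""
    ref = setting_names(reference_text)
    cand = setting_names(candidate_text)
    return sorted(ref - cand), sorted(cand - ref)


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage (paths are whatever two app configs you want to compare):
    #   python compare_conf.py reference/rose-app.conf candidate/rose-app.conf
    only_ref, only_cand = compare(
        Path(sys.argv[1]).read_text(), Path(sys.argv[2]).read_text()
    )
    for name in only_ref:
        print("only in reference:", name)
    for name in only_cand:
        print("only in candidate:", name)
```

Settings reported as "only in reference" are the ones worth checking first, since they may have been dropped by accident while porting changes across.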

I am now trying to get this suite running. I have not included ozone redistribution because, at this point, I just want the suite to run (it is easier to debug) rather than have the science be correct.

When I run the suite, I get a very unhelpful job.err exception that reads:

[2] exceptions: An exception was raised:8 (Floating point exception)
[2] exceptions: the exception reports the extra information: Integer divide by zero.
[2] exceptions: whilst in a serial region
[2] exceptions: Task had pid=23868 on host nid004585
[2] exceptions: Program is "/mnt/lustre/a2fs-work2/work/n02/n02/penmaher/cylc-run/u-dq189/work/20150101T0000Z/coupled/./atmos.exe"
Warning in umPrintMgr: umPrintExceptionHandler : Handler Invoked
[2] exceptions: Data address (si_addr): 0x00a0bbfb; rip: 0x00a0bbfb

The above message appears when I set any of the following:

PRINT_STATUS=Normal
PRINT_STATUS=PrStatus_Min
PRINT_STATUS=Operational

When I use

PRINT_STATUS=PrStatus_Diag

I reach the file limit with the error

lib-4211 : UNRECOVERABLE library error
A WRITE operation tried to write a record that was too long.

Do you have any ideas on how I can get more meaningful information on why my job is failing? The code hits the exception when it is in the ocean model for the first time (ocean.out ends prematurely and nothing is written to pe_output).

Thank you in advance,
Penny

Hi,

Looking at further traceback in the job.err the failure seems to be related to one of the nlst* namelists.
In app/um/rose-app.conf under [namelist:nlstcgen] two required items seem to have been inadvertently removed:

secs_per_periodim=86400
steps_per_periodim=${ATMOS_TIMESTEPS_PER_DAY}

The failure with PrStatus_Diag is probably an unrelated bug, where a print statement has been added without being tested at this print level.
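
For anyone wanting to guard against this kind of accidental removal, a check along the following lines might help. It is a minimal sketch, not an official Rose tool: the parser is deliberately simplified, the function names are hypothetical, and the only required items it knows about are the two named above.

```python
from pathlib import Path

# The two items named above as required under [namelist:nlstcgen].
REQUIRED = {"namelist:nlstcgen": ["secs_per_periodim", "steps_per_periodim"]}


def parse_rose_conf(text):
    """Return {section: {key: value}} for rose-app.conf-style text.

    Simplified: sections and keys marked as ignored (leading '!') are
    skipped, and continuation lines of multi-line values are not joined.
    """
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("[") and line.endswith("]"):
            name = line[1:-1]
            current = None if name.startswith("!") else sections.setdefault(name, {})
            continue
        if current is None or raw[0].isspace() or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if not key.startswith("!"):
            current[key.strip()] = value.strip()
    return sections


def missing_items(text, required=REQUIRED):
    """List '[section]key' entries that are required but not active in text."""
    sections = parse_rose_conf(text)
    return [
        f"[{section}]{key}"
        for section, keys in required.items()
        for key in keys
        if key not in sections.get(section, {})
    ]


if __name__ == "__main__":
    # Path as given above, relative to the suite directory.
    conf = Path("app/um/rose-app.conf")
    if conf.exists():
        for item in missing_items(conf.read_text()):
            print("missing:", item)
```

Run from the top of the suite directory, this prints one line per missing item and nothing if both are present and active.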