I am still working on getting my SSP245 HadGEM3-GC3.1-LL suite working. This is a follow up to
I would like to try a different set of restart files, if possible. I have been digging around in a few suites and noticed one of interest. Ros moved some files from the MO for me a few months back, and I was hoping to get a couple more files. Would it be possible to have a copy of the following restart files, please? The path of the data is /nesi/project/uoo03538/andrew.pauling/MASS/u-bg469/2015-restarts/
Unfortunately those files are not under /data/d01/ukcmip6/ssp585_N96O1_ensemble1_dumps/ and I have no idea what filesystem the /nesi/project path is on. It’s not a MO path I recognise nor can I find it on the XCE/F or XCS. I’d suggest talking to someone at the Met Office.
I got in contact with the owner of the suite. The /nesi/project path is local to their institution. Sorry about that. He said he got the files from moose. I am happy to try to retrieve them from moose myself. Could you tell me which module I need to load for moo, please? I have not been able to find it in the online intro courses or in the helpdesk tickets.
I have tested the idea that the restart files might explain why the model run blows up (too fast ocean velocity) within a few timesteps. But a different set of restart files also blew up (too fast vwind).
So I am back looking at my suite changes to debug it.
I started a new suite u-dq189 which I branched from my working piControl run on archer2 (which was originally u-as037 and updated to run on archer2).
I have changed the suite from a piControl to an SSP245 configuration. I did this by comparing old MO suites (ar766, which is a piControl suite, and bj616, which is an SSP245 suite).
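For a comparison like this, between a piControl suite and an SSP245 suite, a small script can list which items differ in a given section of two rose-app.conf files. The following is only a rough sketch, not a Rose tool: the parser is a simplified reading of the Rose config format (it strips leading "!" ignore markers, skips "#" comment lines, and joins "=" continuation lines), and the item names (item_a, item_b, ...) are hypothetical placeholders rather than real UM namelist variables.

```python
from collections import defaultdict


def parse_rose_conf(text):
    """Parse Rose-style config text into {section: {key: value}}.

    Simplifying assumptions: '!' / '!!' ignore markers on sections and
    keys are stripped, '#' comment lines are skipped, and continuation
    lines (which begin with '=' after indentation) extend the last value.
    """
    sections = defaultdict(dict)
    current = ""      # keys before any [section] header land under ""
    last_key = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1].lstrip("!")
            sections[current]  # register the section even if it is empty
            last_key = None
        elif line.startswith("=") and last_key is not None:
            sections[current][last_key] += line[1:]
        elif "=" in line:
            key, value = line.split("=", 1)
            last_key = key.strip().lstrip("!")
            sections[current][last_key] = value.strip()
    return dict(sections)


def diff_section(reference, candidate, section):
    """Return (missing, extra, changed) key lists for one section."""
    ref = reference.get(section, {})
    cand = candidate.get(section, {})
    missing = sorted(set(ref) - set(cand))
    extra = sorted(set(cand) - set(ref))
    changed = sorted(k for k in set(ref) & set(cand) if ref[k] != cand[k])
    return missing, extra, changed


# Hypothetical file contents, standing in for two app/um/rose-app.conf files.
REFERENCE = """\
[namelist:nlstcgen]
item_a=1
item_b=2
item_c=3
"""
CANDIDATE = """\
[namelist:nlstcgen]
item_a=1
item_c=4
!!item_d=5
"""

result = diff_section(
    parse_rose_conf(REFERENCE),
    parse_rose_conf(CANDIDATE),
    "namelist:nlstcgen",
)
print(result)  # (['item_b'], ['item_d'], ['item_c'])
```

In practice the two strings would be read from the rose-app.conf files of the reference suite and the suite being debugged. Items reported as missing are the first place to look when a namelist read fails at start-up.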
I am now trying to get this suite running. I have not included ozone redistribution, as I just want the suite to run at this point (it is easier to debug) rather than the science being correct (i.e. the ozone redistribution).
When I run the suite, I get a very unhelpful job.err exception that reads:
[2] exceptions: An exception was raised:8 (Floating point exception)
[2] exceptions: the exception reports the extra information: Integer divide by zero.
[2] exceptions: whilst in a serial region
[2] exceptions: Task had pid=23868 on host nid004585
[2] exceptions: Program is “/mnt/lustre/a2fs-work2/work/n02/n02/penmaher/cylc-run/u-dq189/work/20150101T0000Z/coupled/./atmos.exe”
Warning in umPrintMgr: umPrintExceptionHandler : Handler Invoked
[2] exceptions: Data address (si_addr): 0x00a0bbfb; rip: 0x00a0bbfb
The above message happens when I set any of the following:
lib-4211 : UNRECOVERABLE library error
A WRITE operation tried to write a record that was too long.
Do you have any ideas on how I can get more meaningful information about why my job is failing? The code hits the exception while in the ocean model for the first time (ocean.out ended prematurely and nothing was written to pe_output).
Looking further through the traceback in job.err, the failure seems to be related to one of the nlst* namelists.
In app/um/rose-app.conf under [namelist:nlstcgen] two required items seem to have been inadvertently removed: