Postproc nemo memory problem with UKESM1

Hello Ros,

Perhaps you can help me with this?

I am trying to run a copy of the UKESM1 pre-industrial run (u-bc964, my version u-da655). The model runs fine for 5 quarters, but on the fifth quarter postproc nemo fails with the following error (in job.err):

Lmod is automatically replacing "cce/15.0.0" with "gcc/11.2.0".

Due to MODULEPATH changes, the following have been reloaded:

  1) cray-mpich/8.1.23

[WARN] file:atmospp.nl: skip missing optional source: namelist:moose_arch
[WARN] file:nemocicepp.nl: skip missing optional source: namelist:moose_arch
[WARN] file:atmospp.nl: skip missing optional source: namelist:script_arch
[WARN] file:nemocicepp.nl: skip missing optional source: namelist:script_arch
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=4795310.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.

It looks to me as if the process hasn’t been allocated sufficient memory to create the annual means, but is OK for the seasonal means. I found that ‘sebsteinig’ (see below) had the same error, and increased the memory accordingly. However, I still get the memory error, even with 20G allocated. I found that I needed to insert the extra memory request after the model build, as otherwise the file is overwritten. I also can’t find anywhere to set the memory in the rosie GUI to incorporate it into the build.

Any ideas?

Thanks,
Martin

Hi Martin,

We’ve also seen it OOM sometimes, even with the memory increased - it’s nemo_rebuild. One of my colleagues has been in touch with ARCHER2, but I’m not sure what state that’s at.

Our advice at the moment is to run it on a compute node.

Regards,
Ros.

Hi Ros,

Thanks. I’ve been grepping around looking for any reference to the queue used for postproc nemo, but have drawn a blank. Could you point me in the right direction as to where to change the queue please, or where it is documented?

Thanks,
Martin

Hi Martin,

You have to look at the inheritance for the task to figure this type of thing out.

So

postproc tasks inherit from POSTPROC (suite.rc)
POSTPROC inherits from POSTPROC_RESOURCE (suite.rc)
POSTPROC_RESOURCE inherits from HPC_SERIAL (site/archer2.rc)
HPC is the family that sets the queue to standard, so you need to change POSTPROC_RESOURCE to inherit from HPC rather than HPC_SERIAL in site/archer2.rc

And remove the --mem option.
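
As a rough sketch only, the change might look something like the fragment below. It assumes Cylc 7 suite.rc syntax and that the extra memory request was added as a Slurm directive under POSTPROC_RESOURCE; the actual contents of site/archer2.rc are not shown in this thread and may well differ, so treat everything inside the section as illustrative.

```ini
# site/archer2.rc (illustrative fragment, NOT the actual file contents)
[runtime]
    [[POSTPROC_RESOURCE]]
        # previously: inherit = HPC_SERIAL
        inherit = HPC
        [[[directives]]]
            # delete any memory request added earlier, for example:
            # --mem = 20G
```

With the family switched, the postproc tasks should pick up whatever queue settings HPC defines, which is why the separate --mem request is no longer wanted.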

Can you please also give us permission to read your puma2 home directory, so it’s easier for us to help in future?
Thanks
Cheers,
Ros.