Out-of-memory kill error

Hello, I’m trying to run a rather large domain (2000x2000 points), and when the run reaches the forecast stage I’m encountering this error:

slurmstepd: error: Detected 1 oom-kill event(s) in StepId=10650430.0. Some of your processes may have been killed by the cgroup out-of-memory handler.
srun: error: nid001161: task 0: Out Of Memory
srun: launch/slurm: _step_signal: Terminating StepId=10650430.0
slurmstepd: error: *** STEP 10650430.0 ON nid001161 CANCELLED AT 2025-08-21T09:34:37 ***
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=10650430.0. Some of your processes may have been killed by the cgroup out-of-memory handler.
[FAIL] um-atmos <<'STDIN
[FAIL]
[FAIL] 'STDIN' # return-code=1
2025-08-21T08:34:40Z CRITICAL - failed/EXIT

Is there a way to increase the memory available for the run? I’ve looked at some similar tickets posted, but I’m not sure if/where I could change the memory settings.

Best,

Michelle Maclennan

Hi Michelle,

The standard compute nodes on ARCHER2 have 256 GB of memory, and jobs have exclusive use of a compute node, so you can’t request more memory.

You have 2 options:

  1. There are some high-memory nodes with 512 GB, so you could try running on them. You access these by specifying the highmem partition and QoS - see
    Running jobs - ARCHER2 User Documentation

  2. Run on the standard nodes but underpopulated, i.e. with fewer tasks per node so that each task has more memory available. How to do this is also detailed on the above web page.
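For reference, the relevant Slurm directives for option 1 might look like the sketch below. The job name, node count, walltime, and account code are placeholders you would replace with your own values:

```shell
#!/bin/bash
#SBATCH --job-name=um-atmos        # placeholder job name
#SBATCH --partition=highmem        # select the 512 GB high-memory nodes
#SBATCH --qos=highmem              # matching QoS for the highmem partition
#SBATCH --nodes=4                  # placeholder node count
#SBATCH --time=03:00:00            # placeholder walltime
#SBATCH --account=n02-xxxx         # placeholder budget code

# Option 2 (underpopulating standard nodes) instead keeps
# partition/qos as standard but reduces tasks per node, e.g.:
#   #SBATCH --partition=standard
#   #SBATCH --qos=standard
#   #SBATCH --ntasks-per-node=64   # half of the 128 cores per node,
#                                  # so each task sees roughly twice
#                                  # the memory
```

This is only an illustration of the directives involved; in a Rose/Cylc suite these settings are normally generated from the suite configuration rather than edited in a hand-written batch script.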

Regards,
Ros.

Hi Ros,

Thank you for the advice! Is this done by changing ARCHER_QUEUE='standard' in the ./roses/suitname/rose-suite.conf file, or is there another file for the Slurm submission script?

Best,

Michelle