Out-of-memory kill error

Hello, I’m trying to run a rather large domain (2000x2000 points), and when the run reaches the forecast stage I’m encountering this error:

slurmstepd: error: Detected 1 oom-kill event(s) in StepId=10650430.0. Some of your processes may have been killed by the cgroup out-of-memory handler.
srun: error: nid001161: task 0: Out Of Memory
srun: launch/slurm: _step_signal: Terminating StepId=10650430.0
slurmstepd: error: *** STEP 10650430.0 ON nid001161 CANCELLED AT 2025-08-21T09:34:37 ***
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=10650430.0. Some of your processes may have been killed by the cgroup out-of-memory handler.
[FAIL] um-atmos <<'STDIN
[FAIL]
[FAIL] 'STDIN' # return-code=1
2025-08-21T08:34:40Z CRITICAL - failed/EXIT

Is there a way to increase the memory available for the run? I’ve looked at some similar tickets posted, but I’m not sure if/where I could change the memory settings.

Best,

Michelle Maclennan

Hi Michelle,

The standard compute nodes on ARCHER2 have 256 GB of memory, and jobs have exclusive use of a compute node, so you can’t request more memory.

You have 2 options:

  1. There are some high-memory nodes with 512 GB, so you could try running on them. You access these by specifying the highmem partition and QoS - see
    Running jobs - ARCHER2 User Documentation

  2. Run on the standard nodes but underpopulated, i.e. with fewer tasks per node so that each task has more memory available. How to do this is also detailed on the above web page.
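For reference, the relevant Slurm directives for option 1 might look like the sketch below. The job name, node count, walltime, and account code are placeholders you would replace with your own values:

```shell
#!/bin/bash
#SBATCH --job-name=um-atmos        # placeholder job name
#SBATCH --partition=highmem        # select the 512 GB high-memory nodes
#SBATCH --qos=highmem              # matching QoS for the highmem partition
#SBATCH --nodes=4                  # placeholder node count
#SBATCH --time=03:00:00            # placeholder walltime
#SBATCH --account=n02-xxxx         # placeholder budget code

# Option 2 (underpopulating standard nodes) instead keeps
# partition/qos as standard but reduces tasks per node, e.g.:
#   #SBATCH --partition=standard
#   #SBATCH --qos=standard
#   #SBATCH --ntasks-per-node=64   # half of the 128 cores per node,
#                                  # so each task sees roughly twice
#                                  # the memory
```

This is only an illustration of the directives involved; in a Rose/Cylc suite these settings are normally generated from the suite configuration rather than edited in a hand-written batch script.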

Regards,
Ros.

Hi Ros,

Thank you for the advice! Is this done by changing ARCHER_QUEUE='standard' in the ./roses/suitname/rose-suite.conf file, or is there another file for the Slurm submission script?

Best,

Michelle