Hi,
My suite u-cz307 (GC3.1, UM 10.7) has problems with the ‘postproc_nemo’ task after updating it to the new ARCHER2 OS.
This is the error in the log file:
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=4356300.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.
The last line of the output log ("[INFO] command: main_pp.py nemo") indicates that the problem occurs when running the main_pp.py nemo script.
The weird thing is that the task sometimes works (about 1 in 5 times), so I have to re-trigger it several times until it finally succeeds (e.g. up to 10 times in this cycle: /home/n02/n02/ssteinig/work/cylc-run/u-cz307/log/job/24100101T0000Z/postproc_nemo).
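In case it is useful for quantifying the failure rate, here is a minimal sketch (not part of the suite) that counts how many submissions in a task's log directory hit the OOM killer. It assumes the usual Cylc layout of one numbered subdirectory per submission, each containing a job.err file; the function name count_oom_attempts is hypothetical.

```python
import re
from pathlib import Path

# Matches the slurmstepd message quoted above.
OOM_PATTERN = re.compile(r"Detected (\d+) oom-kill event")


def count_oom_attempts(task_log_dir):
    """Return (oom_attempts, total_attempts) for one task's log directory.

    Assumption: each submission has its own numbered subdirectory
    (01, 02, ...) containing a job.err file. The "[0-9]*" glob skips
    non-numeric entries such as an "NN" symlink to the latest attempt.
    """
    oom = 0
    total = 0
    for err_file in sorted(Path(task_log_dir).glob("[0-9]*/job.err")):
        total += 1
        text = err_file.read_text(errors="replace")
        if OOM_PATTERN.search(text):
            oom += 1
    return oom, total
```

Calling count_oom_attempts on the postproc_nemo directory for the cycle above would give the number of OOM-killed submissions and the total number of submissions for that cycle.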
This only started happening after I adapted my old suite (u-cu066) to the new ARCHER2 OS (as u-cz307), following this thread: "UM vn10.7 on the ARCHER2". All other tasks work fine.
This is the diff between both suites: https://code.metoffice.gov.uk/trac/roses-u/changeset/HEAD/c/z/3/0/7/trunk?old=HEAD&old_path=c/u/0/6/6/trunk
Do you have any recommendation on where I could start debugging the problem?
Thank you very much for your help.
Best wishes,
Seb