My suite u-cz307 (GC3.1, UM 10.7) has problems with the ‘postproc_nemo’ task after updating it to the new ARCHER2 OS.
This is the error in the log file:
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=4356300.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.
The output log ends at the line ‘[INFO] command: main_pp.py nemo’, which indicates that the failure happens while the main_pp.py nemo script is running.
The weird thing is that the task sometimes succeeds (roughly 1 time in 5), so I have to re-trigger it several times (up to 10 times for this cycle: /home/n02/n02/ssteinig/work/cylc-run/u-cz307/log/job/24100101T0000Z/postproc_nemo) until it finally completes.
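For reference, re-triggering is just the standard Cylc command; a minimal example using my suite and this cycle point (adjust the task name and cycle to match whatever failed for you):

    cylc trigger u-cz307 postproc_nemo.24100101T0000Z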
This only started happening after I adapted my old suite (u-cu066) to the new ARCHER2 OS (as u-cz307), following this thread: UM vn10.7 on the ARCHER2. All other tasks work fine.
I contacted the ARCHER2 Helpdesk as I suspect this is an issue on their end. They acknowledged that there are a lot of I/O problems at the moment and believe the problem is caused by “a bug in the Lustre filesystem which causes intermittent read errors”.
They are currently working on a fix and I will copy any update I receive to this thread.
Yes, we’re waiting for an update on when that fix will be applied, to see whether it actually resolves the problem. In the meantime, running the postproc on a compute node is the workaround.
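For anyone who wants to try the compute-node workaround in the meantime, a rough sketch of the change (this assumes your postproc tasks pick up their SLURM directives from a family such as POSTPROC_RESOURCE in archer2.rc; the family name and exact directives will vary between suites):

    [[POSTPROC_RESOURCE]]
        [[[directives]]]
            # run on the (exclusive) compute nodes instead of the shared serial nodes
            --partition = standard
            --qos = standard
            --nodes = 1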
The ARCHER2 Helpdesk got back to me and said they applied a patch to the Lustre filesystem over the past few days.
I tested this with several suites and the postproc now seems to be running fine again on the serial nodes! One additional thing I had to do was request a bit more memory for the postproc jobs (the default is just under 2 GB). 10 GB works fine for me, but I did not test other values.
In case anybody runs into the same OOM error: I simply added the memory request to the postproc resources in the archer2.rc configuration.
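Roughly like this (a minimal sketch: POSTPROC_RESOURCE is just an assumed name for the family holding the postproc SLURM directives, so use whichever family your archer2.rc defines; 10G is the value I tested):

    [[POSTPROC_RESOURCE]]
        [[[directives]]]
            # request more memory for the SLURM job (the default is just under 2 GB)
            --mem = 10G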