Postproc_nemo failure out of memory

Hi,

My suite u-cz307 (GC3.1, UM 10.7) has problems with the ‘postproc_nemo’ task after updating it to the new ARCHER2 OS.

This is the error in the log file:

slurmstepd: error: Detected 1 oom-kill event(s) in StepId=4356300.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.

The last line of the output log is the [INFO] line for the command main_pp.py nemo, so the problem seems to occur while that script is running.

The weird thing is that the task sometimes works (about 1 in 5 attempts), so I have to re-trigger it several times until it finally succeeds (up to 10 times in this cycle: /home/n02/n02/ssteinig/work/cylc-run/u-cz307/log/job/24100101T0000Z/postproc_nemo).

This only started happening after I adapted my old suite (u-cu066) to the new ARCHER2 OS (u-cz307) following this thread: UM vn10.7 on the ARCHER2. All other tasks work fine.

This is the diff between both suites: https://code.metoffice.gov.uk/trac/roses-u/changeset/HEAD/c/z/3/0/7/trunk?old=HEAD&old_path=c/u/0/6/6/trunk

Do you have any recommendation on where I could start debugging the problem?

Thank you very much for your help.

Best wishes,
Seb

Hi Sebastien,

Did you manage to solve the problem?

We’ve seen OOMs with postproc too since the OS upgrade. At the moment the workaround is to run the task on a compute node rather than the serial node.

Regards,
Ros.

Hi Ros,

Thanks for your reply.

I contacted the ARCHER2 Helpdesk as I think this is an issue on their end. They acknowledged a lot of I/O problems at the moment and think the problem is caused by “a bug in the Lustre filesystem which causes intermittent read errors”.

They are currently working on a fix and I will copy any update I receive to this thread.

Best wishes,
Seb

Hi Seb,

Yes, we’re waiting for an update on when that fix is going to be applied, to see whether it does fix the problem. In the meantime, using a compute node is the workaround.

Cheers,
Ros.

UPDATE: problem solved!

The ARCHER2 Helpdesk got back to me and said they applied a patch to the Lustre filesystem over the past few days.

I tested this with several suites and the postproc now seems to be running fine again on the serial nodes! One additional thing I had to do was request a bit more memory for the postproc jobs (the default is just below 2 GB). 10 GB works fine for me, but I did not test other values.

If anybody runs into the same OOM error: I just added the memory request to the postproc resources in the archer2.rc configuration:

    [[POSTPROC_RESOURCE]]
        inherit = HPC_SERIAL
        pre-script = """
                     module load postproc
                     module list 2>&1
                     ulimit -s unlimited
                     """
        [[[directives]]]
            --mem=10G
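
Since I did not test values other than 10 GB, here is a minimal sketch of how one might check what a finished job actually used before picking a number. It assumes Slurm accounting is available through `sacct` and that `MaxRSS` is reported with K/M/G/T suffixes; the helper names (`parse_mem_kib`, `peak_rss_gib`, `query_sacct`) are my own and not part of postproc or Cylc. Note that `MaxRSS` is sampled periodically, so it may underestimate short memory spikes.

```python
import subprocess

# Suffix multipliers for Slurm memory strings, expressed in KiB.
_UNITS = {"K": 1, "M": 1024, "G": 1024**2, "T": 1024**3}


def parse_mem_kib(value):
    """Convert a Slurm memory string such as '1843200K' or '9.5G' to KiB."""
    value = value.strip()
    if not value:
        return 0.0  # steps without accounting data report an empty field
    suffix = value[-1].upper()
    if suffix in _UNITS:
        return float(value[:-1]) * _UNITS[suffix]
    # Assumption: a value without a suffix is given in bytes.
    return float(value) / 1024


def peak_rss_gib(sacct_text):
    """Return the largest MaxRSS over all job steps, in GiB.

    Expects lines of the form 'JobID|MaxRSS', as produced by
    sacct --format=JobID,MaxRSS --parsable2 --noheader.
    """
    peak_kib = 0.0
    for line in sacct_text.splitlines():
        fields = line.split("|")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        peak_kib = max(peak_kib, parse_mem_kib(fields[1]))
    return peak_kib / 1024**2


def query_sacct(job_id):
    """Run sacct for one job ID and return its raw text output."""
    result = subprocess.run(
        ["sacct", "-j", str(job_id), "--format=JobID,MaxRSS",
         "--parsable2", "--noheader"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout
```

On a login node, `peak_rss_gib(query_sacct(job_id))` with the Slurm job ID of a successful postproc_nemo run should give a rough lower bound for the `--mem` request; I would add some headroom on top of that.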

Best wishes,
Seb
