My suite u-cz307 (GC3.1, UM 10.7) has problems with the ‘postproc_nemo’ task after updating it to the new ARCHER2 OS.
This is the error in the log file:
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=4356300.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.
The output log ends at the line ‘[INFO] command: main_pp.py nemo’, which indicates that the failure happens while the main_pp.py nemo script is running.
The weird thing is that the task sometimes succeeds (roughly 1 time in 5), so I have to re-trigger it several times (up to 10 times for this cycle: /home/n02/n02/ssteinig/work/cylc-run/u-cz307/log/job/24100101T0000Z/postproc_nemo) until it finally completes.
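For reference, re-triggering is just the standard Cylc command; a minimal example using my suite and this cycle point (adjust the task name and cycle to match whatever failed for you):

    cylc trigger u-cz307 postproc_nemo.24100101T0000Z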
This only started happening after I adapted my old suite (u-cu066) to the new ARCHER2 OS (as u-cz307), following this thread: UM vn10.7 on the ARCHER2. All other tasks work fine.
I contacted the ARCHER2 Helpdesk as I suspect this is an issue on their end. They acknowledged that there are a lot of I/O problems at the moment and believe the problem is caused by “a bug in the Lustre filesystem which causes intermittent read errors”.
They are currently working on a fix and I will copy any update I receive to this thread.
Yes, we’re waiting for an update on when that fix will be applied, to see whether it actually resolves the problem. In the meantime, running the postproc on a compute node is the workaround.
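For anyone who wants to try the compute-node workaround in the meantime, a rough sketch of the change (this assumes your postproc tasks pick up their SLURM directives from a family such as POSTPROC_RESOURCE in archer2.rc; the family name and exact directives will vary between suites):

    [[POSTPROC_RESOURCE]]
        [[[directives]]]
            # run on the (exclusive) compute nodes instead of the shared serial nodes
            --partition = standard
            --qos = standard
            --nodes = 1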
The ARCHER2 Helpdesk got back to me and said they applied a patch to the Lustre filesystem over the past few days.
I tested this with several suites and the postproc now seems to be running fine again on the serial nodes! One additional thing I had to do was request a bit more memory for the postproc jobs (the default is just under 2 GB). 10 GB works fine for me, but I did not test other values.
In case anybody runs into the same OOM error: I simply added the memory request to the postproc resources in the archer2.rc configuration.
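Roughly like this (a minimal sketch: POSTPROC_RESOURCE is just an assumed name for the family holding the postproc SLURM directives, so use whichever family your archer2.rc defines; 10G is the value I tested):

    [[POSTPROC_RESOURCE]]
        [[[directives]]]
            # request more memory for the SLURM job (the default is just under 2 GB)
            --mem = 10G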