Oom-kill error

Hello,

My suite ‘u-cm987’ is suffering from an oom-kill error in the fcm_make2_um task.

I checked ‘Monsoon to Archer2 Suite’ and followed the advice there to increase the available memory to 25 GB, but that didn’t fix the problem.

I also checked the ARCHER2 Known Issues page in the ARCHER2 User Documentation, and the top issue today looks related. Copied here for later reference:


OOM due to memory leak in libfabric (Added: 2022-02-23)

There is an underlying memory leak in the version of libfabric on ARCHER2 (that comes as part of the underlying SLES operating system) which can cause jobs to fail with an OOM (Out Of Memory) error. This issue will be addressed in a future upgrade of the ARCHER2 operating system. You can work around this issue by setting the following environment variable in your job submission scripts:

export FI_MR_CACHE_MAX_COUNT=0

This may come with a performance penalty (though in many cases, we have not seen a noticeable performance impact).
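
As a rough illustration of what "in your job submission scripts" could look like, here is a minimal sketch, assuming a bash job script in which the export sits above the line that launches the program. The child shell below is only a stand-in for that launch line and is not taken from any real suite.

```shell
# Set the workaround variable before the parallel program is launched.
export FI_MR_CACHE_MAX_COUNT=0

# Anything started after the export inherits the setting; this child
# shell stands in for the real launch line of a job script.
sh -c 'echo "FI_MR_CACHE_MAX_COUNT is ${FI_MR_CACHE_MAX_COUNT:-unset}"'
```

The point of the sketch is ordering: the variable only reaches processes started after the export, so it would need to come before the launch command rather than after it.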


I’m not sure where to include the suggested ‘export’ command. Does this just go into my archer2.rc file?

I’m also not sure why this suite suffers from the oom-kill error when it is a copy of a working suite ‘u-cm283’ with a minor code update.

Thanks,

Leighton

Hi Leighton,

The fcm_make2_um task failed in its latest attempt with a lock file error:

[FAIL] /work/n02/n02/lre/cylc-run/u-cm987/share/fcm_make_um/fcm-make2.lock: lock exists at the destination

Remove that lock directory on ARCHER2 and try again. Increasing the memory as you have done should fix the oom-kill problem.
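
Removing the lock could be done along these lines. This is a minimal sketch, assuming it is run on ARCHER2 and that no fcm make for this suite is still running. The path is copied from the [FAIL] message above, and `remove_stale_lock` is a hypothetical helper name, not part of FCM or Rose.

```shell
# Sketch only: remove_stale_lock is a hypothetical helper.
remove_stale_lock() {
    lock_dir="$1"
    if [ -d "$lock_dir" ]; then
        # The lock is a directory, so remove it recursively.
        rm -r -- "$lock_dir" && echo "Removed $lock_dir"
    else
        echo "No lock directory at $lock_dir"
    fi
}

# Path copied from the [FAIL] message; run this on ARCHER2.
remove_stale_lock /work/n02/n02/lre/cylc-run/u-cm987/share/fcm_make_um/fcm-make2.lock
```

Once the lock is gone, the fcm_make2_um task can be retriggered.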

The memory leak only affects some compute jobs, not serial jobs.

Cheers,
Ros.

Hi Ros,

I’ll do that and hopefully remember for next time - thanks!

Leighton