OOM and timeout errors for large regional domain ancillary generation

Hi! I have been attempting to generate ancillaries to start running a simulation on ARCHER2. The plan is to run some experiments on a 2.2 km-resolution grid with the domain covering a large area of southern Africa, roughly 50 by 50 degrees. That works out to a lot of grid points (2500x2500), so it’s perhaps not surprising that I’m getting a mixture of out-of-memory (OOM) and timeout errors.

Specifically I’ve been trying to run a regional ancillary suite based on Doug Lowe’s u-cq149 suite which has ANTS configured. (The suite I am currently using is u-cz591.)

I received a lot of OOM errors while running ancillary generation tasks, of the form:

slurmstepd: error: Detected 1 oom-kill event(s) in StepId=#######.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.

This happens both for tasks run with MPI processes (or at least, tasks that inherit from [[HOST_ANTS_MPP]]) and for serial tasks. For most of them I have been able to work around it: for the MPI tasks, by decreasing the number of tasks per node (as suggested in the ARCHER2 FAQs here), and for the serial tasks, by increasing the memory requested in the sbatch submission (from the default up to 8GB, 16GB, and in some cases even 32GB, which I think is the submission limit for serial-node tasks?), as suggested in previous helpdesk threads (1, 2, 3). I also tried increasing ntasks, though I’m not sure whether that helps or what it actually does for a serial job.
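In case it helps to see concretely what I changed: the serial-task overrides I’ve been applying amount to something like the following sbatch directives (the values are illustrative, and in practice these are set through the suite’s site config rather than written by hand, so the Rose/Cylc syntax differs):

```shell
#SBATCH --partition=serial     # ARCHER2 serial/data-analysis nodes
#SBATCH --qos=serial
#SBATCH --ntasks=1             # single serial process
#SBATCH --mem=16G              # raised from the default; tried 8G/16G/32G
#SBATCH --time=02:00:00        # raised from the original 20-minute limit
```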

However, I’m still having issues with a few of the tasks. Firstly, generating the dust ancillary files continues to give OOM errors even though I have now reduced tasks-per-node to 8 (which feels a bit silly at this point). Secondly, the sstice ancillary generation task, a serial job, consistently times out: I increased the memory to 8GB and the time limit to 2 hours (from 20 minutes) and it still hits the wall clock.
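For the MPI dust task, the underpopulation I’ve been trying looks roughly like this (values illustrative; ARCHER2 standard compute nodes have 128 cores and 256GB, so 8 tasks per node should leave each rank around 32GB):

```shell
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8    # reduced from the default to free up memory per rank
#SBATCH --cpus-per-task=16     # spread the ranks across the node
srun --hint=nomultithread --distribution=block:block ${TASK_CMD}
```

(`${TASK_CMD}` here is a placeholder for whatever the suite actually launches.) My worry is that continuing down this road just means ever-fewer tasks per node.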

I’m reasonably convinced the problem really is the large domain size, since a copy of the suite with no changes other than a smaller domain (1000x1000) runs all of the above tasks with no problems (u-cz653).

I am relatively new to HPC job management, so I’m not sure I have gone about troubleshooting these errors in the most sensible or efficient way. I was wondering whether you had any advice on:

  • avoiding OOM errors? (i.e. are there any alternatives to indefinitely reducing the tasks-per-node for MPI tasks or increasing the memory requested for serial tasks?)
  • speeding up tasks that time out? (is there anything else that could make the jobs more efficient, or should I just extend the time limit further?)