OOM and timeout errors for large regional domain ancil generation

Hi! I have been attempting to generate ancillaries to start running a simulation on ARCHER2. The plan is to run some experiments on a 2.2 km-resolution grid with the domain covering a large area of southern Africa, roughly 50 by 50 degrees. This ends up being a lot of grid points (2500x2500)… so it’s possibly not surprising that I’m getting a mixture of out-of-memory (OOM) and timeout errors.

Specifically, I’ve been trying to run a regional ancillary suite based on Doug Lowe’s u-cq149 suite, which has ANTS configured. (The suite I am currently using is u-cz591.)

I received a lot of OOM errors while running the ancil generation tasks, of the form:

slurmstepd: error: Detected 1 oom-kill event(s) in StepId=#######.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.

This happens both for tasks run with MPI processes (or at least, tasks that inherit from [[HOST_ANTS_MPP]]) and for serial tasks. For most of them I have been able to work around it: for the MPI tasks by decreasing the number of tasks per node (as suggested in the ARCHER2 FAQs here), and for the serial tasks by increasing the memory requested in the sbatch submission (from the default values up to 8GB, 16GB, and in some cases even 32GB, which I think is the submission limit for serial-node tasks?), as suggested in previous helpdesk threads (1, 2, 3). I also tried increasing ntasks, though I’m not sure whether that helps or what it actually does.
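For context, the kind of changes I made look roughly like this in the suite’s suite.rc (a sketch only: [[HOST_ANTS_MPP]] is the family from the suite, but the serial task name, the partition/QoS lines and the values shown are illustrative rather than the exact settings in u-cz591):

    [runtime]
        [[HOST_ANTS_MPP]]
            [[[directives]]]
                # fewer MPI ranks per node leaves more memory per rank
                --ntasks-per-node = 16

        [[some_serial_ancil_task]]    # placeholder name for a serial ancil task
            execution time limit = PT1H
            [[[directives]]]
                # serial jobs request memory explicitly
                --partition = serial
                --qos = serial
                --mem = 16G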

However, I’m still having issues with a few of the tasks. Firstly, the dust ancil generation continues to give OOM errors even though I have now reduced the tasks-per-node to 8 (which feels a bit silly at this point). Secondly, I consistently get timeout errors on the sstice ancil generation task, which is a serial job; I increased the memory to 8GB and the time limit to 2 hours (from 20 minutes) and it still times out.

I’m reasonably convinced the large domain size is the cause: a copy of the suite with no changes other than reducing the domain to 1000x1000 (u-cz653) runs the above tasks with no problems.

I am relatively new to HPC job management, and I’m not sure I have gone about troubleshooting these errors in the most efficient way. I was wondering whether you had any advice on:

  • avoiding OOM errors? (i.e. are there any alternatives to indefinitely reducing the tasks-per-node on MPI tasks or increasing the memory requested for serial tasks?)
  • speeding up tasks that time out? (is there anything else that could make the jobs more efficient, or should I just keep extending the time limit?)

Thanks!

Hi Fran,

Sorry, are you still having problems with this? ARCHER2 fixed a bug in the Lustre file system in the last couple of weeks, which has resolved most of the OOM and slow-running-task problems we had been seeing.

Regards,
Ros.

Hi Ros,

I haven’t managed to make much progress on this. In the end I generated all the ancils I wanted using the changes described above, except the ANTS dust aerosol ancil and the sstice CAP ancil. Having spoken to some of the k-scale team at the Met Office, it turns out that for large domains the high-resolution aerosol ancillaries are often simply not generated, because they end up absurdly large (for my 2500x2500 grid each aerosol ancil is >~100GB, and I had reached >400GB by the time I had generated half a dust ancil), which then led to errors from running out of storage space. So, using a new suite now (since I needed to change the domain, and I also think I broke all the stash), I have just turned off the 3D ancils option for the high-resolution model and am hoping it will run without them and downscale something else later on.

Still having trouble with timeouts on the SST ancil generation though. Regardless of the time limit (or number of nodes) I specify when submitting the job, the job.out file tends to stop at exactly the same place, which makes me think that something else is going on… but I don’t know what. Is there a file size limit for job.out, or could something be crashing without outputting an error?

Thanks!

Fran

Fran

Which suite has timeouts on the SST ancil generation?

Grenville

It’s u-da256 currently, but I couldn’t get it to work in u-cz070 either (compare /work/n02/n02/franmorr/cylc-run/u-cz070/log.20230817T142404Z/job/20130906T0000Z/SAfrica_2p2km_ancil_sstice/XX/job.out where XX=03,04,05,06 - they all end in exactly the same spot even though the .../XX/job files specify different time limits and numbers of nodes).

Thanks!
Fran

Hi Fran

The problem (we are 99% sure) is that the source SST data does not include Lake Victoria - that causes the spiral search to go awry and take an age (and it would produce rubbish even if it ever completed).

Can you create a mask that doesn’t include Lake Victoria?

Grenville

There is already a mask without lakes (cylc-run/u-da256/share/data/ancils/SAfrica/2p2km/qrparm.mask_sea_nolakes) - try that as the MASKIN for the sstice task.
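Assuming the sstice app takes MASKIN from the [env] section of its rose-app.conf (worth checking exactly where u-da256 sets it), the change would be something like:

    [env]
    # point the sstice task at the no-lakes mask under the suite's share/data
    # ($ROSE_DATA expands to <suite run directory>/share/data at run time)
    MASKIN=$ROSE_DATA/ancils/SAfrica/2p2km/qrparm.mask_sea_nolakes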

That worked for me - see my copy of your suite (/home/n02/n02/grenvill/roses/u-da256)

Grenville

That’s worked! Thanks so much Grenville!