Help optimising the TerraMaris 2km simulations


I’ve got my 2km TerraMaris simulations set up and pretty much ready to run when Archer2 is fully available (suite u-cc339). However, I’m not sure whether my setup is optimal, particularly regarding I/O and the processor decomposition. I have a lot of STASH output and the simulations are quite large, so I’d like to get the suite running as efficiently as possible before I start.

I’m particularly interested to know whether I can and/or should use I/O processors, and whether I can access profiler tools to judge how much time is being spent on I/O. This is a regional KPP-coupled suite, but my queries here are limited to the atmosphere component.

Could someone from CMS please have a quick check over my suite and make sure it’s looking sensible?

Thanks and best wishes,

Hi Emma,

Decomposition and scaling

It looks like your atmosphere domain is 1500x3200 grid points, and you are running on 64 nodes (a 64x64 MPI decomp with 2 OpenMP threads). This is a similar size to the global model I have been running recently, so I think you could definitely speed up the execution time by running on more nodes, especially since LAMs generally scale better than global models.

The best thing to do would be to try it out, by just re-running the atmos part of the suite several times with different configurations. You could try 96 nodes, or even 128, but you will likely find that the queue time increases considerably as well, so on the 4-cab system you may find it more efficient to run on fewer nodes in terms of overall workflow speed.

Usually we would also say that it’s more efficient to run in longer chunks (e.g. several days rather than 1 day), but it might not be possible to change that in your suite if you need to update bcs etc.


To start with I would try using the switch print_runtime_info, as this is very low overhead and just prints out the time/timestep plus startup and shutdown costs. You can also try lstashdumptimer which should give you some profiling information on the time spent in I/O (I haven’t actually tried this myself but it looks useful). The full UM timers with ltimer can be quite expensive and not always useful if you are just interested in I/O.

I would try these first and see what information that gives you.

You can also probably switch prnt_writers so that “Only rank 0 writes output”. Having all ranks write output is only really useful for debugging, and that might be adding some overhead.
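As a sketch, those switches might look like this in the UM app config. The namelist name ioscntl and the value for prnt_writers are assumptions on my part - check the GUI metadata in your suite for the exact items and values:

```ini
# app/um/rose-app.conf - diagnostic switches (namelist name assumed)
[namelist:ioscntl]
# Low-overhead report of time per timestep plus startup/shutdown costs
print_runtime_info=.true.
# Profiling information on time spent in STASH/dump I/O
lstashdumptimer=.true.
# Assumed value for "Only rank 0 writes output"
prnt_writers=0
```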


There are a few things that you may be able to do to improve the I/O speed. If the output files are large, and I guess at least the dump files will be, you can stripe the output directory. I can tell you how to do this, but it depends on where the model is writing the files.

You can also try using the io_alltoall_readflds method to parallelise the dump reading. It doesn’t work for every configuration, but it is worth a try. If it can’t use this method it will use the default and print a message in the output.

You would likely see a benefit from using the I/O server if you are doing a lot of I/O and running on a lot of nodes. Configuring the I/O server is a bit of a black art, so it’s best to start from some reasonable settings and tweak as needed. You could start from what I am using in my global model, as the domain size is similar. I found it best to have all my I/O server processors on a set of dedicated nodes, but you can also intersperse them throughout the run. It depends on the profile of your run really, and it’s sometimes easiest just to try both and see what works best.


Hi Annette,

That’s brilliant, thank you!

The model writes to:
and I’m not sure what it means to stripe the directory.

For the I/O server, can you let me know the suite name of your global model so I can try to copy the settings across? I noticed a lot of settings under coupled → namelist → IO system settings → IO server: are these the ones I’d be changing?

Best wishes,

Hi Emma,

It is best to do this in a systematic way, trying each potential optimisation one at a time; otherwise you won’t know whether the changes are improving or degrading performance, since multiple changes may cancel each other out.

So first collect your base runtimes, i.e. how long the suite takes to run at the moment with no changes. If you have times for multiple cycles it is useful to take an average, and to look at the min/max times, as there can be a lot of variability in runtimes on Archer2. And I would switch on those timers I mentioned when you are testing the changes.
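For example, if you pull the per-cycle wallclock times out of your logs into a text file (the log format below is invented purely for illustration), a quick awk one-liner gives you the min, mean and max:

```shell
# Collect wallclock times (seconds) per cycle into a file. The format here
# is made up for illustration - adapt the field extraction to your own logs.
cat > runtimes.txt <<'EOF'
cycle 20151201T0000Z wallclock 1820
cycle 20151202T0000Z wallclock 2105
cycle 20151203T0000Z wallclock 1930
EOF

# Summarise min, mean and max over all cycles
awk '{t=$4; sum+=t; n++}
     n==1 || t<min {min=t}
     t>max {max=t}
     END {printf "min=%d mean=%d max=%d n=%d\n", min, sum/n, max, n}' runtimes.txt
```

With the sample data above this prints `min=1820 mean=1951 max=2105 n=3`, which gives you a feel for the spread before you start changing anything.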

To help optimise the I/O you need to know how much data you are writing out each cycle. How much STASH data? How many different streams? And how large are the dumps?


There is a bit about striping on the Archer2 website here: I/O and file systems - ARCHER2 User Documentation

To stripe a directory run the following:
lfs setstripe -c -1 <dir-name>

This means that any new files written to the directory will be striped. It won’t change any files that are already in there. If you want to do that you need to move the files out then copy back in.

But striping will only benefit large files. If a mix of small and large files is being written to the same directory, you may want to try to split them out somehow. Also, as your suite writes output to a new directory every cycle, you will need to stripe the whole cycle directory or build something in to stripe certain directories each cycle. I can help with this, but you need to test whether striping is actually helping.
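One way to build this in would be a small pre-script on the model task that stripes the cycle’s output directory before the model writes anything. This is only a sketch: the directory layout and the environment variable names are assumptions, and lfs only exists on Lustre, so the snippet skips striping where it isn’t available:

```shell
#!/bin/sh
# Sketch: stripe each cycle's output directory before the model runs.
# SUITE_OUTPUT_DIR and CYLC_TASK_CYCLE_POINT are placeholder names here -
# adapt them to wherever your suite actually writes its output.
CYCLE_DIR="${SUITE_OUTPUT_DIR:-./output}/${CYLC_TASK_CYCLE_POINT:-20151203T0000Z}"
mkdir -p "$CYCLE_DIR"

# 'lfs' only exists on Lustre filesystems, so skip striping elsewhere
if command -v lfs >/dev/null 2>&1; then
    lfs setstripe -c -1 "$CYCLE_DIR"   # stripe across all available OSTs
fi
```

New files written into that directory then inherit the striping, as above.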

IO server

There is documentation on the UM IO server here: Met Office (Login)

My suite is u-cf432. Most of the IO server settings are under “IO system settings → IO server” in the GUI. It is best to use the GUI so you can see information about what each setting does. You also need to select a number of IO server processes in the rose-suite.conf file. I have 256, which is 4 IOS nodes and quite a lot - you could start with fewer. I am then using 32 tasks per server, so I have 8 IO servers in total, with each server using half a node. The IO server works by assigning pp streams to each server, so there is no point in having more servers than streams.

The variable ios_offset is set to 6016, which is the number of ranks I’m using for the atmosphere, and the ios_spacing of 1 means that all my IOS processes sit together at the end on their own nodes. This might not be what you want to do, but in my case the IO server seemed to need a lot of memory. The usual approach is to space the IO server processes evenly throughout the atmos processes.
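Pulling those numbers together, the relevant fragments in my setup look roughly like this. The rose-suite.conf variable name and the namelist name are assumptions; your suite may organise these differently, so treat this as a sketch rather than something to copy verbatim:

```ini
# rose-suite.conf (variable name is suite-dependent)
IOS_NPROC=256

# app/um/rose-app.conf - IO server placement (namelist name assumed)
[namelist:ioscntl]
# 256 processes / 32 tasks per server = 8 IO servers
ios_tasks_per_server=32
# Place IOS ranks after the 6016 atmosphere ranks, packed together
ios_offset=6016
ios_spacing=1
```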

Optimising the model and understanding performance is a complicated topic, and you will need to run a few experiments to figure out what is best for your configuration.


Hi Annette,

I think I understand the things to test now, I’ll make a plan and start the experiments when I’m ready. Thanks for your very comprehensive explanations!


Hi Emma,

Glad it all makes sense. I think it might be useful for other users too, so it’s good to have it written down.

Best wishes,


Hi Annette,

I’ve made some progress on this, though I don’t think I’m at the optimal configuration yet. I’m finding that I can reduce the time taken for the timesteps with I/O, but the timesteps with no I/O then take longer, so there’s no actual gain. Is this expected?

Also, two of my output streams are limited-area subdomains which are written at a higher frequency. The packing of these streams was causing an error, so I’m currently writing them unpacked - but have you seen anything like this before?

The error is repeated for 25 of the I/O processors in /work/n02/n02/emmah/cylc-run/u-cc339/log/job/20151203T0000Z/tm2_ra2t_um_fcst1/18/job.err. For one processor it is:

[4146] exceptions: the exception reports the extra information: Integer divide by zero.
[4146] exceptions: whilst in a parallel region, by thread 1
[4146] exceptions: Task had pid=133337 on host nid001191
[4146] exceptions: Program is “/work/n02/n02/emmah/cylc-run/u-cc339/work/20151203T0000Z/tm2_ra2t_um_fcst1/toyatm”
[4146] exceptions: Data address (si_addr): 0x005a20b1; rip: 0x005a20b1
[4146] exceptions: [backtrace]: has 15 elements:
[4146] exceptions: [backtrace]: ( 1) : Address: [0x005a20b1]
[4146] exceptions: [backtrace]: ( 1) : f_shum_wgdos_pack_1d_arg64$f_shum_wgdos_packing_mod_ in file /lus/cls01095/work/n02/n02/emmah/cylc-run/u-cb263/share/fcm_make_um/extract/shumlib/shum_wgdos_packing/src/f_shum_wgdos_packing.f90 line 411
[4146] exceptions: [backtrace]: ( 2) : Address: [0x00bef504]
[4146] exceptions: [backtrace]: ( 2) : signal_do_backtrace_linux in file /lus/cls01095/work/n02/n02/emmah/cylc-run/u-cb263/share/fcm_make_um/extract/um/src/control/c_code/exceptions/exceptions-platform/exceptions-linux.c line 81
[4146] exceptions: [backtrace]: ( 3) : Address: [0x00beeee7]
[4146] exceptions: [backtrace]: ( 3) : signal_handler in file /lus/cls01095/work/n02/n02/emmah/cylc-run/u-cb263/share/fcm_make_um/extract/um/src/control/c_code/exceptions/exceptions.c line 706
[4146] exceptions: [backtrace]: ( 4) : Address: [0x2af531b8a2d0]
[4146] exceptions: [backtrace]: ( 4) : ?? (* Cannot Locate *)
[4146] exceptions: [backtrace]: ( 5) : Address: [0x005a20b1]
[4146] exceptions: [backtrace]: ( 5) : f_shum_wgdos_pack_1d_arg64$f_shum_wgdos_packing_mod_ in file /lus/cls01095/work/n02/n02/emmah/cylc-run/u-cb263/share/fcm_make_um/extract/shumlib/shum_wgdos_packing/src/f_shum_wgdos_packing.f90 line 411
[4146] exceptions: [backtrace]: ( 6) : Address: [0x0059f4b4]
[4146] exceptions: [backtrace]: ( 6) : wgdos_compress_field$wgdos_packing_mod_ in file /lus/cls01095/work/n02/n02/emmah/cylc-run/u-cb263/share/fcm_make_um/preprocess-atmos/src/um/src/control/stash/wgdos_packing.F90 line 97
[4146] exceptions: [backtrace]: ( 7) : Address: [0x0259182f]
[4146] exceptions: [backtrace]: ( 7) : ios_stash_pack_wgdos$ios_stash_wgdos_ in file /lus/cls01095/work/n02/n02/emmah/cylc-run/u-cb263/share/fcm_make_um/preprocess-atmos/src/um/src/io_services/server/stash/ios_stash_wgdos.F90 line 68
[4146] exceptions: [backtrace]: ( 8) : Address: [0x0258a4d2]
[4146] exceptions: [backtrace]: ( 8) : ios_stash_pack$ios_stash_server_ in file /lus/cls01095/work/n02/n02/emmah/cylc-run/u-cb263/share/fcm_make_um/preprocess-atmos/src/um/src/io_services/server/stash/ios_stash_server.F90 line 1505
[4146] exceptions: [backtrace]: ( 9) : Address: [0x02585da6]
[4146] exceptions: [backtrace]: ( 9) : ios_stash_server_process$ios_stash_server_ in file /lus/cls01095/work/n02/n02/emmah/cylc-run/u-cb263/share/fcm_make_um/preprocess-atmos/src/um/src/io_services/server/stash/ios_stash_server.F90 line 804
[4146] exceptions: [backtrace]: ( 10) : Address: [0x025782ea]
[4146] exceptions: [backtrace]: ( 10) : ios_writer$io_server_writer_ in file /lus/cls01095/work/n02/n02/emmah/cylc-run/u-cb263/share/fcm_make_um/preprocess-atmos/src/um/src/io_services/server/io_server_writer.F90 line 161
[4146] exceptions: [backtrace]: ( 11) : Address: [0x02574d4b]
[4146] exceptions: [backtrace]: ( 11) : ios_run$ios_init__cray$mt$p0003 in file /lus/cls01095/work/n02/n02/emmah/cylc-run/u-cb263/share/fcm_make_um/preprocess-atmos/src/um/src/io_services/server/ios_init.F90 line 833

Looking through the code, my guess is that the stride, which is the “row length, for 2D field held as 1D array” and comes from “SIZE(field)” in “ios_stash_pack_wgdos”, is 0 - perhaps it’s trying to write the field for a part of the domain that’s not in the subdomain?

I’m continuing to tweak the various settings to see if I can speed things up a bit too.

Best wishes,

Hi Emma,

I’m not sure I would expect the non-IO timesteps to take longer if you optimised the I/O, but maybe there is something else going on. Certainly in my testing I have seen a very large variability in the runtime on Archer2 - up to about 30%, so you do need to make sure you are looking at times from multiple runs.

I haven’t seen this particular packing error before, but usually packing errors are due to problems with the data fields being written. So you should check that the fields all look sensible in their unpacked form. Can you tell from the logs which field was failing in the packing?

If you are confident the data is OK, then the other thing you could try is a different packing profile - particularly (7) which is for 1-4km LAMs. I haven’t ever used this so I don’t know if it will work either, but might be worth a shot.
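If it helps, the packing profile is normally selected per output stream. A sketch of what that might look like in the UM app - the namelist name nlstcall_pp, the stream key and the item name packing are all assumptions based on typical UM stream set-ups, so check your own suite for the real entries:

```ini
# Assumed per-stream namelist entry - check your suite for the real keys
[namelist:nlstcall_pp(pp7)]
# Packing profile 7, intended for 1-4 km LAMs
packing=7
```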


Hi Annette,

Thanks, I’ll give that a go. Increasing the number of IO nodes from 1 to 2 seems to help with my first problem.