Decomposition and scaling
It looks like your atmosphere domain is 1500x3200 grid points, and you are running on 64 nodes (a 64x64 MPI decomposition with 2 OpenMP threads). This is a similar size to the global model I have been running recently, so I think you could definitely reduce the execution time by running on more nodes, especially since LAMs generally scale better than global models.
The best thing to do would be to try it out, by re-running just the atmos part of the suite several times with different configurations. You could try 96 nodes, or even 128, but you will likely find that the queue time increases considerably as well, so on the 4-cab system it may be more efficient overall to run on fewer nodes in terms of total workflow speed.
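As a back-of-envelope check before submitting test runs, it can help to see how small the per-rank subdomains get as you scale up. The sketch below uses the 1500x3200 domain and 64x64 decomposition mentioned above; the 96x64 and 128x64 decompositions are purely hypothetical examples of what larger node counts might give you (the valid decompositions depend on your machine's ranks-per-node and the model's constraints).

```python
# Approximate per-rank subdomain sizes for candidate decompositions.
# Domain size and the 64x64 decomposition are from the discussion above;
# the larger decompositions are hypothetical illustrations.

ROWS, COLS = 1500, 3200  # atmosphere grid points (north-south, east-west)

def subdomain(nproc_x, nproc_y):
    """Average points per rank in x and y for an nproc_x by nproc_y decomposition."""
    return COLS / nproc_x, ROWS / nproc_y

for label, (px, py) in {
    "64 nodes  (64x64)":  (64, 64),
    "96 nodes  (96x64)":  (96, 64),   # hypothetical
    "128 nodes (128x64)": (128, 64),  # hypothetical
}.items():
    nx, ny = subdomain(px, py)
    print(f"{label}: ~{nx:.0f} x {ny:.1f} points per rank ({px * py} ranks)")
```

If the subdomains get very narrow in one direction, halo-exchange costs start to dominate, which is one reason scaling eventually flattens off.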
Usually we would also say that it’s more efficient to run in longer chunks (e.g. several days rather than 1 day), but it might not be possible to change that in your suite if you need to update boundary conditions etc.
To start with I would try using the switch print_runtime_info, as this has very low overhead and just prints out the time per timestep plus the startup and shutdown costs. You can also try lstashdumptimer, which should give you some profiling information on the time spent in I/O (I haven’t actually tried this myself, but it looks useful). The full UM timers enabled with ltimer can be quite expensive and are not always that useful if you are only interested in I/O.
I would try these first and see what information they give you.
You can also probably switch prnt_writers so that “Only rank 0 writes output”. Having all ranks write output is only really useful for debugging, and it might be adding some overhead.
There are a few things that you may be able to do to improve the I/O speed. If the output files are large, and I guess at least the dump files will be, you can stripe the output directory. I can tell you how to do this, but it depends on where the model is writing the files.
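On a Lustre filesystem, striping a directory is typically done with `lfs setstripe`; files created in that directory then inherit the striping. The sketch below just builds the command rather than running it, since the right stripe count depends on your filesystem (the value 8 and the path are illustrative, not recommendations).

```python
def stripe_command(directory, stripe_count=8):
    """Build an 'lfs setstripe' command to stripe a Lustre directory.

    stripe_count=8 is an illustrative starting point; tune it for your
    filesystem and file sizes. New files created in the directory will
    inherit the directory's striping.
    """
    return ["lfs", "setstripe", "-c", str(stripe_count), directory]

# To actually apply it (only meaningful on a Lustre filesystem):
#   import subprocess
#   subprocess.run(stripe_command("/path/to/model/output_dir"), check=True)
print(stripe_command("/path/to/model/output_dir"))
```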
You can also try using the io_alltoall_readflds method to parallelise the dump reading. It doesn’t work for every configuration, but it is worth a try; if this method can’t be used, the model will fall back to the default and print a message in the output.
You would likely see a benefit from using the I/O server if you are doing a lot of I/O and running on a lot of nodes. Configuring the I/O server is a bit of a black art, so it’s best to start from some reasonable settings and tweak as needed. You could start from what I am using in my global model, as the domain size is similar. I found it best to have all my I/O server processes on a set of dedicated nodes, but you can also intersperse them throughout the run. It really depends on the profile of your run, and it’s sometimes easiest just to try both and see what works best.
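One practical consequence of putting the I/O servers on dedicated nodes is that your node request grows by the I/O nodes while the compute decomposition stays the same. A rough budgeting sketch, where all the numbers are illustrative (64 ranks per node matches a 64x64 decomposition on 64 nodes, and the 2 I/O nodes are just an example):

```python
# Rough node budgeting when I/O servers get their own dedicated nodes.
# All figures are illustrative, not recommendations.

RANKS_PER_NODE = 64  # matches 64x64 ranks spread over 64 nodes

def node_budget(compute_nodes, io_nodes):
    """Total node request and rank counts for a dedicated-I/O-node layout."""
    return {
        "total_nodes": compute_nodes + io_nodes,
        "compute_ranks": compute_nodes * RANKS_PER_NODE,
        "io_ranks": io_nodes * RANKS_PER_NODE,
    }

print(node_budget(compute_nodes=64, io_nodes=2))
```

With interspersed I/O servers you instead give up some compute ranks on each node, so the trade-off is dedicated nodes (and a bigger queue request) versus a slightly smaller compute decomposition; which wins depends on how bursty your output is.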