Suite failing but no obvious error

Hi,

Sorry to bother you, but I’m having trouble with one of my suites (u-do321), which is failing at the coupled stage (more or less straightaway). It failed at the end of last week, but that was due to an error in the ancillary updating (which Grenville identified), which I have now switched off. I no longer get this error, but the suite still fails, with nothing obvious this time in the job.err: lots of warnings but no actual error.

Please can you help?

Charlie

Hello Charlie,
Could you please provide the location of your log files so I can take a look at the job.out and job.err?
Juan

Many apologies, they are at /home/n02/n02/cjrw09/cylc-run/u-do321/log/job/18500101T0000Z/coupled/NN

Charlie

Charlie

The run fails because the files referred to in /work/n02/n02/cjrw09/cylc-run/u-do321/work/18500101T0000Z/coupled/namcouple have different names from those in RMP_DIR=/work/n02/n02/cjrw09/gc31/pliod/coupling_weights_v1.

Where does coupling_weights_v1 come from?

Grenville

Ahh, great, that is at least something I can work with. The file namcouple is something I have changed, according to my instructions from the Met Office. I can’t say I entirely understand what the changes do; something to do with the interpolation of the coupling weights. The directory coupling_weights_v1 was also made by me, and contains all of the output of the OASIS suite which generates the coupling weights. I thought that this directory was consistent with namcouple, but I guess not?

Out of interest, where was this error specified?

Charlie

PS. I am about to start teaching, in Bristol, so will look at this properly when back at my computer.

/home/n02/n02/cjrw09/cylc-run/u-do321/work/18500101T0000Z/coupled/debug.root.01
and
/home/n02/n02/cjrw09/cylc-run/u-do321/work/18500101T0000Z/coupled/debug.root.02

Hi Grenville,

Okay, I have now checked everything, and this is slightly more complicated than I first thought (of course it is).

There are basically 2 versions of namcouple - one created by me and used as input to the model, and another created when the model runs. The former is on PUMA2 at /home/n02/n02/cjrw09/roses/u-do321/app/coupled/file/namcouple. The latter is on ARCHER2 at /work/n02/n02/cjrw09/cylc-run/u-do321/work/18500101T0000Z/coupled/namcouple (where you said).

As you can see, the only difference is at the beginning, where the one I created has:

$RUNTIME

Runtime setting automated via NEMO namelist values

15552000
$END

whereas the one that the suite created has:

$RUNTIME

Runtime setting automated via NEMO namelist values

31104000
$END

Other than that, they are identical.
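As an aside, the two $RUNTIME values differ by exactly a factor of two; a quick arithmetic check (assuming the usual 86400-second day) shows they correspond to 180 days and a 360-day model year respectively:

```python
# Quick check: convert the two $RUNTIME values (in seconds) to days.
SECONDS_PER_DAY = 86400
for seconds in (15552000, 31104000):
    print(seconds, "seconds =", seconds // SECONDS_PER_DAY, "days")
# 15552000 s is 180 days; 31104000 s is 360 days (one 360-day model year).
```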

Looking at either version, the files they appear to be pointing to are as follows:

rmp_tor1_to_atm3_CONSERV_FRACAREA.nc
rmp_uor1_to_aum3_BILINEA.nc
rmp_vor1_to_avm3_BILINEA.nc
rmp_aum3_to_uor1_BILINEA.nc
rmp_avm3_to_vor1_BILINEA.nc
rmp_atm3_to_tor1_CONSERV_DESTAREA.nc
rmp_atm3_to_tor1_nomask_BILINEA.nc

but all of these are present and correct in my coupling weights directory, on ARCHER2 at /work/n02/n02/cjrw09/gc31/pliod/coupling_weights_v1. The only other place this directory is pointed to within the suite is RMP_DIR, which is in /home/n02/n02/cjrw09/roses/u-do321/app/coupled/rose-app.conf and is pointing to the correct place.

In terms of what the namcouple changes do - these were in response to a similar crash a couple of years ago, when I received the following advice:

“Note from R. Hill: It [error with OASIS] won’t be caused directly by the ancils. Looking at your namcouple file you seem to have a number of fields which want to use 2nd order conservative remapping, but none of your rmp files contain second order terms for the gradients, as far as OASIS is concerned. So the UM will be calculating gradients for these fields and trying to pass them to OASIS which doesn’t want them. It looks like your rmp files have been generated by ESMF rather than SCRIP. ESMF doesn’t generate 2nd order conservative weights in a way that’s consistent with OASIS or the UM’s understanding of what OASIS is expecting. So assuming that you’re happy to use 1st order regridding for heatflux, sublimation and emp: if you replace the contents of the existing namcouple with the contents of a new one, then that should set things up to expect 1st order regridding for all fields.”

The same person gave me a new version of namcouple, which shouldn’t be specific to any particular land sea mask, and so I used this one again.

In terms of where coupling_weights_v1 comes from - this is created by my OASIS suite (on MONSOON2, u-bp550@196487), which reads in the mesh_mask file (itself generated by a NEMO suite, based on the new bathymetry I have given it) and uses it to create all of the coupling weights you see in coupling_weights_v1.

So I don’t entirely understand what’s going wrong here, given that the coupling weights are all being created correctly and all seem to exist, as specified in the namcouple (both the version I give the model, and the one it creates)?

Charlie

coupling_weights_v1 has files
rmp_aum3_to_uor1_BILINEAR.nc
rmp_atm3_to_tor1_CONSERVE_DSTAREA.nc
etc

These are spelled differently from the files in namcouple.
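One way to surface such mismatches is a set comparison of the names referenced in namcouple against the names actually on disk. A minimal sketch - the two sets below are just the examples quoted in this thread; in practice you would extract the rmp_*.nc names from namcouple and list RMP_DIR with os.listdir():

```python
# Sketch: compare remapping-file names referenced in namcouple with those
# actually present in the weights directory.  The two sets below are the
# examples quoted in this thread, hard-coded here for illustration.
referenced_in_namcouple = {
    "rmp_aum3_to_uor1_BILINEA.nc",
    "rmp_atm3_to_tor1_CONSERV_DESTAREA.nc",
}
present_in_weights_dir = {
    "rmp_aum3_to_uor1_BILINEAR.nc",
    "rmp_atm3_to_tor1_CONSERVE_DSTAREA.nc",
}

missing = sorted(referenced_in_namcouple - present_in_weights_dir)
for name in missing:
    print("referenced in namcouple but not on disk:", name)
```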

Gosh, sorry, how did I miss that? Apologies, it’s been a long day.

Do you think I should change my version of namcouple (the one I created, which is then presumably picked up by the model and reproduced in that other directory) so that it matches the file names in coupling_weights_v1? Or the other way round, i.e. change the file names in coupling_weights_v1 so that they match the names in namcouple?

Charlie

Hi again,

Okay, so I went for the first option (i.e. changing namcouple to match the filenames) as I thought the files might be used somewhere else. The coupled stage ran for a full 5 minutes this time, so an improvement on before, but then failed with an equally ambiguous error message (or at least, no error message that I can see).

Charlie

Hi Charlie,

It’s still complaining about the remapping files. This time rmp_vor1_to_aum3_BILINEAR.nc. See /home/n02/n02/cjrw09/cylc-run/u-do321/work/18500101T0000Z/coupled/debug.root.02

Cheers,
Ros.

Thanks very much Ros.

I think, although I can’t be certain, that this time the issue was that the file (North-South surface ocean velocity) being pointed to in /home/n02/n02/cjrw09/roses/u-do321/app/coupled/file/namcouple is rmp_vor1_to_aum3_BILINEAR.nc, whereas the appropriate file on ARCHER2 (in /work/n02/n02/cjrw09/gc31/pliod/coupling_weights_v1) is rmp_vor1_to_avm3_BILINEAR.nc. Quite why namcouple has a “u” instead of a “v” is beyond me, because the “u” would be appropriate for the line above, where it is looking for rmp_uor1_to_aum3_BILINEAR.nc and correctly finding it in the above directory.

I have submitted my suite again after changing this in the above namcouple, and will let you know what happens. But this is rather worrying. A bit of context: as I say in my message above, coupling_weights_v1 is created by my OASIS suite (on MONSOON2, dg751@289202, which is a copy of the original u-bp550@196487, owned by miroslawandrejczuk (although it was not them who was helping me)). This suite reads in the mesh_mask file (itself generated by a NEMO suite, based on the new bathymetry I have given it) and uses it to create all of the coupling weights you see in coupling_weights_v1. These are then pointed to by the above namcouple, which needed to be modified according to some advice from the Met Office (see above message).

However…

When I was originally doing this, about 2 years ago, we weren’t able to do this particular stage ourselves, because it needed to be done from inside the Met Office firewall. My Met Office sponsor at the time (sadly no more) therefore did it for us, and sent me the coupling weights. We were not happy about not being able to do this ourselves, so after a bit of wrangling we were finally given the suite. Seb Steinig tested the suite by comparing its output with that produced by our Met Office sponsor, and they were absolutely identical (in terms of the filenames, and the data inside the files, down to several decimal places), apart from 2 things:

  1. Our files areas.nc, grids.nc and masks.nc did not include uor1* and vor1* variables (ocean U and V components).
  2. The latitude orientation in file atmo_mask_fracarea_anc_ns.nc was reversed (also the case for the respective UM-format file).

At the time, we were hoping this wouldn’t matter. But there are clearly other differences, as we are now seeing: e.g. many of the filenames produced by my current version of the suite, in coupling_weights_v1, are spelled very differently from those in namcouple.

So whether the above differences are going to be the next problem, I don’t know!

Charlie

Okay, it has now crashed again almost immediately after starting the coupled task, but this time with a different error (in the job.err file): “Convergence failure in BiCGstab, omg is NaN”.

This, unfortunately for me, I do recognise, and have seen before. Apparently it is a common point for the model to fail if it has ingested or developed NaNs or infinities. Official guidance gives the following URL for more information: https://code.metoffice.gov.uk/trac/um/wiki/KnownUMFailurePoints but based on my notes, this doesn’t help!

Again according to my notes, the common reason is a mismatch between the land-sea mask and one of the ancils, meaning the model is expecting missing data but gets data, or vice versa. But the error doesn’t say which ancil, and I have modified almost all of them (~40 or so). According to my notes, I have 2 options for tracking this down:

  1. Look at the fields exchanged via the coupler, since they get exchanged at timestep 0 - i.e. there will be some things arriving from the ocean before the start of the first timestep (effectively initial conditions) - to check whether anything looks wrong either as it comes out of the ocean (before regridding) or goes into the atmosphere (after regridding). To see these, in namcouple replace any occurrences of “EXPORTED” with “EXPOUT”, which tells the coupler to produce netCDF files containing the coupling fields. Files will appear at /home/d05/<user>/cylc-run/<suite>/work/<cycle>/coupled and will be derived from the field names in namcouple: anything with “toyatm” is the field on the atmosphere side of the coupling exchange, anything with “toyoce” is the field on the ocean side, e.g. model01_O_SSTSST_toyoce_01.nc is the SST coming out of the ocean and ocn_sst_toyatm_01.nc is the SST as it goes into the atmosphere after being regridded. If the fields aren’t obvious from the file names, outgoing/incoming fields can be paired up using the number at the end of the name (e.g. _64 indicates topmelt category 4) or by cross-referencing the contents of the namcouple file.

  2. If everything looks okay here, the error is most likely due to one of the newly-created/modified ancillaries. I need to go through each one very carefully, checking things like: the new land-sea mask (if there is one) matches the actual land-sea mask, data are not upside down, latitudes are not upside down, longitudes are not reversed/displaced, and data are consistent with the other ancillaries.
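The EXPORTED-to-EXPOUT edit in option 1 is a plain textual substitution in the namcouple file. A minimal sketch, where the sample field line is invented purely for illustration:

```python
# Sketch: turn on EXPOUT debugging by substituting EXPORTED -> EXPOUT in
# the text of a namcouple file.  The sample line below is invented for
# illustration only and is not from a real suite.
def enable_expout(namcouple_text: str) -> str:
    return namcouple_text.replace("EXPORTED", "EXPOUT")

sample = "O_SSTSST ocn_sst 1 10800 2 rmp_example.nc EXPORTED\n"
print(enable_expout(sample))
```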

Looking at my output files in /home/n02/n02/cjrw09/cylc-run/u-do321/work/18500101T0000Z/coupled/ I don’t seem to have anything with “toyatm” or “toyoce”, so that’s a non-starter. Looking at my namcouple file, only one file is exported - atmos_restart.nc - so I could try writing this out according to the above instructions?
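Such a check can be sketched with a simple glob over the coupled work directory (path as quoted in this thread); glob just returns an empty list when nothing matches:

```python
# Sketch: look for EXPOUT-style coupling-field files in the coupled work
# directory (path as quoted in this thread).
import glob

workdir = "/home/n02/n02/cjrw09/cylc-run/u-do321/work/18500101T0000Z/coupled"
for tag in ("toyatm", "toyoce"):
    matches = glob.glob(f"{workdir}/*{tag}*")
    print(tag, "->", len(matches), "file(s)")
```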

But before I do that, and before I start going through each and every one of my ancillaries (although I can’t see why they would be inconsistent, given that they have all been produced by the ancillary suite based on the same mask), do you have any other insight as to what this error might be referring to?

Thanks a lot,

Charlie

Charlie

Please switch on Extra diagnostic messages (PRINT_STATUS=PrStatus_Diag) - it may give us a clue.

Grenville

Sorry, yes of course. I have now done that and have rerun the simulation. It again failed at the coupled stage, this time literally straightaway (i.e. not after 5 minutes, as with the earlier namcouple errors), and has given me a completely different error message this time:

lib-4211 : UNRECOVERABLE library error A WRITE operation tried to write a record that was too long. Encountered during a sequential formatted WRITE to an internal file (character variable)

Not the same error as this morning.

Charlie

that’s a different error - turn the PRINT_STATUS down to PrStatus_Oper

Okay, I have now done that and rerun it, and now get the same error as before, i.e. Error message: Convergence failure in BiCGstab, omg is NaN.

Does that help?

Thanks,

Charlie

Sadly not at all.

I have taken a copy of the suite to try to debug it - no success as yet.

Do you have a working version of this suite in some configuration (not necessarily with this set of ancillary files)?

Grenville

Yes, I actually do!

If you look at u-df570@285541, this is my version of the preindustrial control simulation (a copy of u-as037@264242, owned by Ros). This uses all of the standard (i.e. PI) ancillary files, and works absolutely fine (I have tested it by running for a year). The suite I am now working on, the one we are struggling with, is a copy of this. Apart from the different ancillary files, and a load of switch changes (e.g. changing certain scientific switches like ocean viscosity or the build-up of snow, none of which should cause such an instant blowup), they should be identical.

If it helps, I have an “Idiot’s guide” that I wrote a couple of years ago, which I have been following this time and which details all of the changes I made. I can send this to you if you like?

The main problem, which I found 2 years ago, is in debugging all of this and finding the particular ancillary that is causing the problem. As I said in my previous message, it is likely that the land-sea mask in one of them does not match the others, although quite why this would be, I don’t understand, given that all of them were created by the same ancillary suite and so should at least be consistent. But because the model will instantly crash if just one of them is different, it’s not possible to find the culprit by switching out one of them for the original (modern) version and running, then doing the same for another, and so on. It’s not even possible to use a bisection technique (e.g. revert half of them to the modern/unmodified versions, see if it works, then try the other half, see if it works, then halve again, and so on), because any mixture will, by definition, crash instantly. It has to be all or nothing, i.e. all original (which we know works) or all new (which we know doesn’t), with no information as to which one is the problem.

Charlie

Charlie

The orography looks wrong - why use UM_ANCIL_OROG_DIR=$UM_ANCIL_N96EORCA1DIR/orography/globe30/v6

why is there no orography produced by u-do273?

My suite u-dn772 produced an orography called qrparm.orog_from_herold.

Grenville