Suite failing but no obvious error

Ahh. I was afraid this would come back to bite me. This is where it gets complicated (even more than it currently is). Very many apologies for the long message which follows.

So the orography that your version of the ancillary-creation suite (u-dn772) creates, at /work/n02/n02/grenvill/cylc-run/u-dn772/share/data/n96e_orca025_go6/orography/globe30/qrparm.orog_from_herold, is, as you can easily see, not using a modern land sea mask, but rather one that is appropriate for about 50 million years ago (the early Eocene), which is what I was originally working on several years ago. The suite uses as its input /work/n02/n02/grenvill/wilro.2021-03-31/TGRES/herold_orog.new, which, as you can see, uses the same land sea mask but is at very high spatial resolution. This is to account for any sub-grid scale variability in the orography. So somehow the suite is designed to read this in and interpolate it to the standard model grid of 192 × 144.
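
For what it’s worth, the regridding step on its own would look something like the sketch below (this uses iris and area-weighted averaging purely as an illustration - I don’t know exactly what the suite itself does - and the file names are just placeholders):

import iris
from iris.analysis import AreaWeighted

# High resolution orography and an existing field already on the
# 192 x 144 (N96) grid to regrid onto - both file names are placeholders.
high_res = iris.load_cube("herold_orog.new")
target = iris.load_cube("qrparm.orog", "surface_altitude")

# Area-weighted regridding needs cell bounds on both grids.
for cube in (high_res, target):
    for axis in ("longitude", "latitude"):
        if not cube.coord(axis).has_bounds():
            cube.coord(axis).guess_bounds()

# This only gives the mean orography on the coarse grid - the sub-grid
# fields (roughness, peak-to-trough etc.) have to be derived from the
# high resolution data, which is presumably why the suite wants it as input.
mean_orog = high_res.regrid(target, AreaWeighted())
iris.save(mean_orog, "orog_n96_mean.nc")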

The problem is this file was created several years before I started working on this, even before I moved to Bristol. It basically reads in very high resolution (satellite-derived, 30 arc-second) data showing the modern orography, and modifies it to be appropriate for the Eocene. Although I have the original Python, I have no idea where the original data comes from - the person who did this has long since left Bristol, and their home directory (and all others) has been deleted. So I don’t know how to do this step. I have contacted the person, but am still waiting for a reply.

In order to run my version of the ancillary-creation suite (u-do273), rather than using this high resolution orography as input to the suite, I used the standard/original orography file (at /work/y07/shared/umshared/ancil/atmos/n96e/orca1/orography/globe30/v6/qrparm.orog) which, as you can see, is already on the standard model grid. Although this works, and the suite generates the equivalent orography file to yours (at /work/n02/n02/cjrw09/cylc-run/u-do273/share/data/n96e_orca025_go6/orography/globe30/qrparm.orog), all but 2 of the fields are completely blank i.e. full of zeros. The only fields in this file that look okay are “Silhouette orographic roughness” and “Half of (peak to trough)…”. And a couple of the other fields make xconv crash. So clearly something is going wrong with the ancillary-creation suite, and it doesn’t like being given a file that is already at low resolution as its input.
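
For reference, this is roughly how I have been checking which fields are blank (just a sketch - it assumes iris can read the ancillary file, and the fields that crash xconv may well fail to load here too):

import iris

# Path to the orography ancillary my suite produced (from the message above).
ancil = ("/work/n02/n02/cjrw09/cylc-run/u-do273/share/data/"
         "n96e_orca025_go6/orography/globe30/qrparm.orog")

for cube in iris.load(ancil):
    d = cube.data
    all_zero = bool((d == 0).all())
    print(f"{cube.name():50s} min={d.min()} max={d.max()} all_zero={all_zero}")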

Instead, when running my actual suite (u-do321) I pointed the orography to the standard/original orography file (at /work/y07/shared/umshared/ancil/atmos/n96e/orca1/orography/globe30/v6/qrparm.orog). The reason I thought this would be okay is because the orography file is a filled field i.e. it does not contain any land sea mask. So I was hoping that, for testing purposes, I could get away with using this original file, because it shouldn’t conflict with my new land sea mask. In the same way, a lot of my other ancillaries (e.g. the aerosols) are also currently set to the standard/original versions, because they are also filled fields and therefore don’t conflict with my new land sea mask.

Obviously, eventually, when it comes to running this properly, I will need to solve the above problem and create my own orography. Although my new land sea mask is very similar to the standard/original version, it is not the same. But for now, I just wanted to see if the model would run with all of these new ancillaries.

But clearly I was wrong in using the standard/original orography. Is it possible that, even though it is a filled field, the model is crashing because e.g. it is seeing a value for orography where the land sea mask says it is ocean, or vice versa? So it is seeing a high value in the orography (e.g. a mountain) but then finding no land in the new land sea mask? As I said, my new land sea mask is very similar to the standard (which would match the orography), but not exactly the same e.g. Australia is attached to Papua New Guinea.
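
If it helps, this is the sort of check I had in mind for that (a sketch only - the mask file name is a placeholder and it assumes both fields are on the same 192 × 144 grid):

import iris
import numpy as np

# Standard orography (filled field) and my new land sea mask - the mask
# path is a placeholder.
orog = iris.load_cube("/work/y07/shared/umshared/ancil/atmos/n96e/orca1/"
                      "orography/globe30/v6/qrparm.orog", "surface_altitude")
mask = iris.load_cube("qrparm.mask", "land_binary_mask")

land = np.asarray(mask.data) > 0.5       # land points in the new mask
raised = np.asarray(orog.data) > 0.0     # non-zero orography in the old file

print("orography > 0 where the new mask says ocean:", int((raised & ~land).sum()))
print("orography = 0 where the new mask says land:", int((~raised & land).sum()))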

If this is the case, I can’t even try running with your version of the orography, because it will have the same problem i.e. your version is early Eocene, so again it will find e.g. high values for orography where there is no land, or vice versa.

So clearly I have one of 2 problems here. Either the ancillary-creation suite has a problem, which is why it is creating a load of zeros when making the orography. Or it was never designed to read in orography that was already on the model grid, but only works with a much higher resolution version of the orography. In which case, I need to go back and try harder to get hold of the person who originally created this high resolution file for the Eocene, in order to replicate it for my new land sea mask.

Does that make sense? Are you able to see anywhere in my ancillary-creation suite (u-do273) where it specifies that the input orography needs to be at very high resolution?

Thank you,

Charlie

On further reflection, I’m not convinced that orography is the problem. Not sure where to go with this just now.

Charlie

Are the remapping files consistent with the atmosphere mask? I note that the mask in /work/n02/n02/cjrw09/gc31/pliod/coupling_weights_v1 is quite different from what the atmosphere model in u-do321 is using.

Grenville

Hi Grenville,

Okay, I don’t know what I have done this time, but now I can’t even get install-ancil to run, or several of the other tasks e.g. the fcm_make tasks. All I did this morning, before your last message arrived, was try to run the model with my version of the orography, even though it contains zeros. Then I tried running with your version of the orography, even though it uses the wrong mask. Then I tried switching back to the original/standard version of the orography, which worked before but now doesn’t! What on earth have I done this time?!

Either way, to respond to your question, can you clarify exactly which files you are talking about i.e. which file in coupling_weights_v1 and which file the atmosphere model is using? They should be consistent, if I have understood the process correctly.

Charlie

Charlie

Apologies - it turns out I believe I was wrong to suggest a mismatch between the mask in /work/n02/n02/cjrw09/gc31/pliod/coupling_weights_v1/masks.nc and the mask used by the model (in the dump, for example). The things to compare are the mask in masks.nc and the land_fraction (in the dump), modified so that grid cells which are not 100% land are set to ocean. Those masks match.
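
For reference, the comparison I mean is along these lines (a sketch only - the variable name in masks.nc and the way the land fraction is read from the dump are assumptions, so check them against the actual files):

import iris
import netCDF4
import numpy as np

# Atmosphere mask from the OASIS masks file - the variable name here is a
# guess; check what masks.nc actually calls it and which convention it uses.
with netCDF4.Dataset("/work/n02/n02/cjrw09/gc31/pliod/coupling_weights_v1/masks.nc") as nc:
    oasis_mask = np.asarray(nc.variables["atm3.msk"][:]).squeeze()

# Land fraction from the dump, with anything under 100% land counted as ocean.
land_frac = iris.load_cube("path/to/the/dump", "land_area_fraction")  # placeholder path
dump_land = np.asarray(land_frac.data) >= 1.0

# Try both conventions, since OASIS masks are sometimes 1 = masked point.
print("agreeing cells if mask==0 means land:",
      int((dump_land == (oasis_mask == 0)).sum()), "of", dump_land.size)
print("agreeing cells if mask==1 means land:",
      int((dump_land == (oasis_mask == 1)).sum()), "of", dump_land.size)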

So the problem remains.

Grenville

Hi Grenville,

Thanks very much, and sorry for the delay. It was a super busy week, plus I was ill towards the end of it.

Okay, so I think I need to do 2 things:

  1. Go through each and every ancillary file again, to check they are all consistent and contain the same mask (I have already done this once, but clearly need to do it again; see the sketch just after this list).
  2. Try to match up whatever is coming out of the model with the mask in the ancillary files, to check that it is indeed ingesting the right thing.
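
For the 1st task, something like this is what I have in mind (a sketch - the file list and the "land_binary_mask" name are illustrative, and filled fields with no mask will simply be skipped):

import iris
import numpy as np

# Placeholder list of ancillaries to check - in practice, the full set the
# suite installs.
ancils = ["qrparm.mask", "qrparm.orog", "qrparm.soil", "qrparm.veg.frac"]

reference = None
for path in ancils:
    try:
        mask = np.asarray(iris.load_cube(path, "land_binary_mask").data)
    except Exception:
        print(path, ": no land sea mask field (filled field?)")
        continue
    if reference is None:
        reference = mask
        print(path, ": used as reference")
    else:
        print(path, ":", int((mask != reference).sum()), "cells differ from the reference")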

For the 2nd task, is it best to check what is coming out of the reconfiguration, as this is where the ancillaries are read in? If so, where is this stored? Normally I would check the atmosphere start dump, but I can’t seem to see this in my output, so maybe it is not even getting to that stage?

You will be pleased to hear that I am away on annual leave next week, so I will put this down until after Easter. If you could answer my question above that would be really appreciated, otherwise I will get back to you after Easter.

Thanks again,

Charlie

Hi,

Apologies for the delay in getting on with this, a shed load of marking and then Easter got in the way.

Did you see my message above (#23, dated 7 April), about my latest problem i.e. not even getting to the previous crash? Can you possibly advise on this, before I begin searching through each of the ancillaries (see message #25, dated 9 April)?

Thank you,

Charlie

Charlie

The job-activity.log says

(ln02) 2025-04-07T23:10:34Z [STDERR] sbatch: error: AssocMaxCpuMinutesPerJobLimit
(ln02) 2025-04-07T23:10:34Z [STDERR] sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)

because n02-PLIOD has no resources.

Grenville

But I thought we had loads of resources, following my first request? I have only run about 3 tests of one year each (of another suite), plus several failed tests of this suite (which never got past the first timestep), so how can we be out of resources already?

Charlie

Hi,

Okay, I have now returned to this original problem i.e. the ambiguous error message “Convergence failure in BiCGstab, omg is NaN” (see my message dated 31 March). This usually means that the model is ingesting NaNs, which is usually (but not always) indicative of a problem somewhere with the ancillary files. I have tried running with 3 versions of the orography, and get the same error each time, implying that either the problem is elsewhere or all 3 versions of the orography file are the problem.

Looking at my original notes/instructions, they say that the first thing to do with this error is to look at the fields exchanged via the coupler, since they get exchanged at timestep 0, to check if anything looks wrong either as they come out of the ocean (before regridding) or go into the atmosphere (after regridding). To see these, in namcouple (currently at /home/n02/n02/cjrw09/roses/u-do321/app/coupled/file/namcouple) replace any occurrences of “EXPORTED” with “EXPOUT”, which will tell the coupler to produce netCDF files containing the coupling fields. Looking at namcouple, the only instance where this occurs is for one file: atmos_restart.nc, so presumably all of the fields are written out into this.
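
Once those netCDF files appear, my plan was to scan them for NaNs with something like this (a sketch - the file pattern is a guess at what OASIS writes into the work directory):

import glob
import netCDF4
import numpy as np

# Loop over the netCDF files the coupler writes (pattern is a guess) and
# flag any floating point variable containing NaNs.
for path in sorted(glob.glob("*.nc")):
    with netCDF4.Dataset(path) as nc:
        for name, var in nc.variables.items():
            data = var[:]
            if np.issubdtype(data.dtype, np.floating) and np.isnan(data).any():
                print(path, name, "contains NaNs")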

However, when I follow the above instructions and try to run again, this time the reconfiguration fails, giving me another ambiguous error message (at /home/n02/n02/cjrw09/cylc-run/u-do321/log/job/18500101T0000Z/recon/NN/job.err):

lib-4611 : UNRECOVERABLE library error
Missing opening (left) parenthesis in format.
Encountered during a sequential formatted WRITE to an internal file (character variable)
srun: error: nid004835: task 0: Exited with exit code 3
srun: launch/slurm: _step_signal: Terminating StepId=9561207.0
slurmstepd: error: *** STEP 9561207.0 ON nid004835 CANCELLED AT 2025-05-09T15:18:13 ***
srun: error: nid004835: tasks 1-127: Terminated
srun: Force Terminated StepId=9561207.0
[FAIL] um-recon <<'STDIN'
[FAIL]
[FAIL] 'STDIN' # return-code=143
2025-05-09T14:18:14Z CRITICAL - failed/EXIT

What has now gone wrong with this, and why has this happened as a result of my change to namcouple?

Thanks,

Charlie