Suite failing but no obvious error

Ahh. I was afraid this would come back to bite me. This is where it gets complicated (even more than it currently is). Very many apologies for the long message which follows.

So the orography that your version of the ancillary-creation suite (u-dn772) creates, at /work/n02/n02/grenvill/cylc-run/u-dn772/share/data/n96e_orca025_go6/orography/globe30/qrparm.orog_from_herold, is, as you can easily see, not using a modern land sea mask, but rather one that is appropriate for about 50 million years ago (the early Eocene), which is what I was originally working on several years ago. The suite uses as its input /work/n02/n02/grenvill/wilro.2021-03-31/TGRES/herold_orog.new, which, as you can see, uses the same land sea mask but is at very high spatial resolution. This is to account for any sub-grid scale variability in the orography. So somehow the suite is designed to read this in and interpolate it to the standard model grid of 192 × 144.
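
For what it’s worth, the regridding step on its own would look something like the sketch below (this uses iris and area-weighted averaging purely as an illustration - I don’t know exactly what the suite itself does - and the file names are just placeholders):

import iris
from iris.analysis import AreaWeighted

# High resolution orography and an existing field already on the
# 192 x 144 (N96) grid to regrid onto - both file names are placeholders.
high_res = iris.load_cube("herold_orog.new")
target = iris.load_cube("qrparm.orog", "surface_altitude")

# Area-weighted regridding needs cell bounds on both grids.
for cube in (high_res, target):
    for axis in ("longitude", "latitude"):
        if not cube.coord(axis).has_bounds():
            cube.coord(axis).guess_bounds()

# This only gives the mean orography on the coarse grid - the sub-grid
# fields (roughness, peak-to-trough etc.) have to be derived from the
# high resolution data, which is presumably why the suite wants it as input.
mean_orog = high_res.regrid(target, AreaWeighted())
iris.save(mean_orog, "orog_n96_mean.nc")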

The problem is this file was created several years before I started working on this, even before I moved to Bristol. It basically reads in very high resolution (satellite-derived, 30 arc-second) data showing the modern orography, and modifies it to be appropriate for the Eocene. Although I have the original Python, I have no idea where the original data comes from - the person who did this has long since left Bristol, and their home directory (and all others) has been deleted. So I don’t know how to do this step. I have contacted the person, but am still waiting for a reply.

In order to run my version of the ancillary-creation suite (u-do273), rather than using this high resolution orography as input to the suite, I used the standard/original orography file (at /work/y07/shared/umshared/ancil/atmos/n96e/orca1/orography/globe30/v6/qrparm.orog) which, as you can see, is already on the standard model grid. Although this works, and the suite generates the equivalent orography file to yours (at /work/n02/n02/cjrw09/cylc-run/u-do273/share/data/n96e_orca025_go6/orography/globe30/qrparm.orog), all but 2 of the fields are completely blank i.e. full of zeros. The only fields in this file that look okay are “Silhouette orographic roughness” and “Half of (peak to trough)…”. And a couple of the other fields make xconv crash. So clearly something is going wrong with the ancillary-creation suite, and it doesn’t like being given a file that is already at low resolution as its input.
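
For reference, this is roughly how I have been checking which fields are blank (just a sketch - it assumes iris can read the ancillary file, and the fields that crash xconv may well fail to load here too):

import iris

# Path to the orography ancillary my suite produced (from the message above).
ancil = ("/work/n02/n02/cjrw09/cylc-run/u-do273/share/data/"
         "n96e_orca025_go6/orography/globe30/qrparm.orog")

for cube in iris.load(ancil):
    d = cube.data
    all_zero = bool((d == 0).all())
    print(f"{cube.name():50s} min={d.min()} max={d.max()} all_zero={all_zero}")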

Instead, when running my actual suite (u-do321) I pointed the orography to the standard/original orography file (at /work/y07/shared/umshared/ancil/atmos/n96e/orca1/orography/globe30/v6/qrparm.orog). The reason I thought this would be okay is because the orography file is a filled field i.e. it does not contain any land sea mask. So I was hoping that, for testing purposes, I could get away with using this original file, because it shouldn’t conflict with my new land sea mask. In the same way, a lot of my other ancillaries (e.g. the aerosols) are also currently set to the standard/original versions, because they are also filled fields and therefore don’t conflict with my new land sea mask.

Obviously, eventually, when it comes to running this properly, I will need to solve the above problem and create my own orography. Although my new land sea mask is very similar to the standard/original version, it is not the same. But for now, I just wanted to see if the model would run with all of these new ancillaries.

But clearly I was wrong in using the standard/original orography. Is it possible that, even though it is a filled field, the model is crashing because e.g. it is seeing a value for orography where the land sea mask says it is ocean, or vice versa? So it is seeing a high value in the orography (e.g. a mountain) but then finding no land in the new land sea mask? As I said, my new land sea mask is very similar to the standard (which would match the orography), but not exactly the same e.g. Australia is attached to Papua New Guinea.
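
If it helps, this is the sort of check I had in mind for that (a sketch only - the mask file name is a placeholder and it assumes both fields are on the same 192 × 144 grid):

import iris
import numpy as np

# Standard orography (filled field) and my new land sea mask - the mask
# path is a placeholder.
orog = iris.load_cube("/work/y07/shared/umshared/ancil/atmos/n96e/orca1/"
                      "orography/globe30/v6/qrparm.orog", "surface_altitude")
mask = iris.load_cube("qrparm.mask", "land_binary_mask")

land = np.asarray(mask.data) > 0.5       # land points in the new mask
raised = np.asarray(orog.data) > 0.0     # non-zero orography in the old file

print("orography > 0 where the new mask says ocean:", int((raised & ~land).sum()))
print("orography = 0 where the new mask says land:", int((~raised & land).sum()))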

If this is the case, I can’t even try running with your version of the orography, because it will have the same problem i.e. your version is early Eocene, so again it will find e.g. high values for orography where there is no land, or vice versa.

So clearly I have one of 2 problems here. Either the ancillary-creation suite has a problem, which is why it is creating a load of zeros when making the orography. Or it was never designed to read in orography that was already on the model grid, but only works with a much higher resolution version of the orography. In which case, I need to go back and try harder to get hold of the person who originally created this high resolution file for the Eocene, in order to replicate it for my new land sea mask.

Does that make sense? Are you able to see anywhere in my ancillary-creation suite (u-do273) where it specifies that the input orography needs to be at very high resolution?

Thank you,

Charlie

On further reflection, I’m not convinced that orography is the problem. Not sure where to go with this just now.

Charlie

Are the remapping files consistent with the atmosphere mask? I note that the mask in /work/n02/n02/cjrw09/gc31/pliod/coupling_weights_v1 is quite different from what the atmosphere model in u-do321 is using.

Grenville

Hi Grenville,

Okay, I don’t know what I have done this time, but now I can’t even get install-ancil to run, or several of the other tasks e.g. the fcm_make tasks. All I did this morning, before your last message arrived, was try to run the model with my version of the orography, even though it contains zeros. Then I tried running with your version of the orography, even though it uses the wrong mask. Then I tried switching back to the original/standard version of the orography, which worked before but now doesn’t! What on earth have I done this time?!

Either way, to respond to your question, can you clarify exactly which files you are talking about i.e. which file in coupling_weights_v1 and which file the atmosphere model is using? They should be consistent, if I have understood the process correctly.

Charlie

Charlie

Apologies - it turns out I believe I was wrong to suggest a mismatch between the mask in /work/n02/n02/cjrw09/gc31/pliod/coupling_weights_v1/masks.nc and the mask used by the model (in the dump, for example). The things to compare are the mask in masks.nc and the land_fraction (in the dump), modified so that grid cells which are not 100% land are set to ocean. Those masks match.
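
For reference, the comparison I mean is along these lines (a sketch only - the variable name in masks.nc and the way the land fraction is read from the dump are assumptions, so check them against the actual files):

import iris
import netCDF4
import numpy as np

# Atmosphere mask from the OASIS masks file - the variable name here is a
# guess; check what masks.nc actually calls it and which convention it uses.
with netCDF4.Dataset("/work/n02/n02/cjrw09/gc31/pliod/coupling_weights_v1/masks.nc") as nc:
    oasis_mask = np.asarray(nc.variables["atm3.msk"][:]).squeeze()

# Land fraction from the dump, with anything under 100% land counted as ocean.
land_frac = iris.load_cube("path/to/the/dump", "land_area_fraction")  # placeholder path
dump_land = np.asarray(land_frac.data) >= 1.0

# Try both conventions, since OASIS masks are sometimes 1 = masked point.
print("agreeing cells if mask==0 means land:",
      int((dump_land == (oasis_mask == 0)).sum()), "of", dump_land.size)
print("agreeing cells if mask==1 means land:",
      int((dump_land == (oasis_mask == 1)).sum()), "of", dump_land.size)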

So the problem remains.

Grenville

Hi Grenville,

Thanks very much, and sorry for the delay. It was a super busy week, plus I was ill towards the end of it.

Okay, so I think I need to do 2 things:

  1. Go through each and every ancillary file again, to check they are all consistent and contain the same mask (I have already done this once, but clearly need to do it again; see the sketch just after this list).
  2. Try to match up whatever is coming out of the model with the mask in the ancillary files, to check that it is indeed ingesting the right thing.
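
For the 1st task, something like this is what I have in mind (a sketch - the file list and the "land_binary_mask" name are illustrative, and filled fields with no mask will simply be skipped):

import iris
import numpy as np

# Placeholder list of ancillaries to check - in practice, the full set the
# suite installs.
ancils = ["qrparm.mask", "qrparm.orog", "qrparm.soil", "qrparm.veg.frac"]

reference = None
for path in ancils:
    try:
        mask = np.asarray(iris.load_cube(path, "land_binary_mask").data)
    except Exception:
        print(path, ": no land sea mask field (filled field?)")
        continue
    if reference is None:
        reference = mask
        print(path, ": used as reference")
    else:
        print(path, ":", int((mask != reference).sum()), "cells differ from the reference")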

For the 2nd task, is it best to check what is coming out of the reconfiguration, as this is where the ancillaries are read in? If so, where is this stored? Normally I would check the atmosphere start dump, but I can’t seem to see this in my output, so maybe it is not even getting to that stage?

You will be pleased to hear that I am away on annual leave next week, so I will put this down until after Easter. If you could answer my question above that would be really appreciated, otherwise I will get back to you after Easter.

Thanks again,

Charlie

Hi,

Apologies for the delay in getting on with this, a shed load of marking and then Easter got in the way.

Did you see my message above (#23, dated 7 April), about my latest problem i.e. not even getting to the previous crash? Can you possibly advise on this, before I begin searching through each of the ancillaries (see message #25, dated 9 April)?

Thank you,

Charlie

Charlie

The job-activity.log says

(ln02) 2025-04-07T23:10:34Z [STDERR] sbatch: error: AssocMaxCpuMinutesPerJobLimit
(ln02) 2025-04-07T23:10:34Z [STDERR] sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)

because n02-PLIOD has no resources.

Grenville

But I thought we had loads of resources, following my first request? I have only run about 3 tests of one year each (of another suite), plus several failed tests of this suite (which never got past the first timestep), so how can we be out of resources already?

Charlie

Hi,

Okay, I have now returned to this original problem i.e. the ambiguous error message “Convergence failure in BiCGstab, omg is NaN” (see my message dated 31 March). This usually means that the model is ingesting NaNs, which is usually (but not always) indicative of a problem somewhere with the ancillary files. I have tried running with 3 versions of the orography, and get the same error each time, implying that either the problem is elsewhere or all 3 versions of the orography file are the problem.

Looking at my original notes/instructions, they say that the first thing to do with this error is to look at the fields exchanged via the coupler, since they get exchanged at timestep 0, to check if anything looks wrong either as they come out of the ocean (before regridding) or go into the atmosphere (after regridding). To see these, in namcouple (currently at /home/n02/n02/cjrw09/roses/u-do321/app/coupled/file/namcouple) replace any occurrences of “EXPORTED” with “EXPOUT”, which will tell the coupler to produce netCDF files containing the coupling fields. Looking at namcouple, the only instance where this occurs is for one file: atmos_restart.nc, so presumably all of the fields are written out into this.
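
Once those netCDF files appear, my plan was to scan them for NaNs with something like this (a sketch - the file pattern is a guess at what OASIS writes into the work directory):

import glob
import netCDF4
import numpy as np

# Loop over the netCDF files the coupler writes (pattern is a guess) and
# flag any floating point variable containing NaNs.
for path in sorted(glob.glob("*.nc")):
    with netCDF4.Dataset(path) as nc:
        for name, var in nc.variables.items():
            data = var[:]
            if np.issubdtype(data.dtype, np.floating) and np.isnan(data).any():
                print(path, name, "contains NaNs")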

However, when I follow the above instructions and try to run again, this time the reconfiguration fails, giving me another ambiguous error message (at /home/n02/n02/cjrw09/cylc-run/u-do321/log/job/18500101T0000Z/recon/NN/job.err):

lib-4611 : UNRECOVERABLE library error
Missing opening (left) parenthesis in format.
Encountered during a sequential formatted WRITE to an internal file (character variable)
srun: error: nid004835: task 0: Exited with exit code 3
srun: launch/slurm: _step_signal: Terminating StepId=9561207.0
slurmstepd: error: *** STEP 9561207.0 ON nid004835 CANCELLED AT 2025-05-09T15:18:13 ***
srun: error: nid004835: tasks 1-127: Terminated
srun: Force Terminated StepId=9561207.0
[FAIL] um-recon <<'STDIN'
[FAIL]
[FAIL] 'STDIN' # return-code=143
2025-05-09T14:18:14Z CRITICAL - failed/EXIT

What has now gone wrong with this, and why has this happened as a result of my change to namcouple?

Thanks,

Charlie