Suggestions for debugging BICGstab "NaNs in error term"

shakka · 6 October 2022 15:12

Hi CMS,

I’ve been trying to get a simulation going over the Arctic. I’ve tried three or four different domains and every time I get the dreaded “NaNs in error term in BiCGstab after 1 iterations” error in the first timestep of the first forecast.

A few things I’ve tried:

I’ve had a look at the ancils that are being read in, and the only strange features I can see are pole effects in the biog, veg.frac and veg.func files. However I’m not sure if these are from the satellite data used to produce the ancils (which naturally have a ‘pole hole’) or are an artefact from the CAP. Given that none of the other ancils have this problem, I feel it could be the former. Do you think that could have an effect?
Looking at the pe_output of the processor that reports the error (102) I can see that the step immediately preceding the error message is related to ozone - but having looked at the ozone I can’t see any zeros or NaNs and the ozone ancil looks like I would expect.
The only other thing I can see is that the NaNs are coming out in all of the fast physics terms (atmos_physics2), which mostly look like they’re related to moist physics terms.
I’ve also checked if any of the likely variables look weird in the recon, but can’t see anything odd.

So I’m kind of running out of ideas! Can you think of any other avenues or ways of tracking down the source of the error?

Thanks,
Ella

grenville · 6 October 2022 15:32

did you try reducing the time step?

shakka · 6 October 2022 16:22

I did - have tried 300, 150, 60 and now 30 s. All with the same result…

shakka · 10 October 2022 10:24

I’ve found the problem - an anomaly in the U/V winds at the north pole in the reconfigured ERA5 data.

This results in the following:

**  cycleno =  1 **  L2 norms after atmos_physics2 **
Levels    1 to   70 theta_star Two_Norm =                    NaN
Levels    1 to   70 q_star Two_Norm =                    NaN
Levels    1 to   70 qcl_star Two_Norm = 0.1006355737767701E-03
Levels    1 to   70 qcf_star Two_Norm = 0.0000000000000000E+00
Levels    1 to   70 qrain_star Two_Norm = 0.0000000000000000E+00
Levels    1 to   70 qcf2_star Two_Norm = 0.0000000000000000E+00
Levels    1 to   70 qgraup_star Two_Norm = 0.0000000000000000E+00
Levels    1 to   70 r_u_p2 Two_Norm =                    NaN
Levels    1 to   70 r_v_p2 Two_Norm =                    NaN
Levels    1 to   70 r_w_p2 Two_Norm =                    NaN
**  cycleno =  1 **  L2 norms of tracers after atmos_physics2  **

Any ideas where to go from here?

Thanks,
Ella

simon · 10 October 2022 13:03

Hi,

Is the anomaly at the pole of the non rotated grid? What’s the name of the file with the issue?
If there are bad values in the ancillaries they can easily cause the model to fail in the first t/s, or even produce bad
values in other fields in the reconfigured dump as the reconfiguration does some adjustment of variables as part of its
processing.

Simon

shakka · 10 October 2022 13:38

Hi Simon,

Thanks for this. Yes indeed the anomaly is at the rotated pole (0, 6.55). The file is cylc-run/u-cq635/share/cycle/20211201T0000Z/Arctic_RACMO/12km/ERA5_24cyc_72fcst/ics/ERA5_24cyc_72fcst_astart

I have had a look at the ancils but couldn’t see anything obvious wrong with them - can you advise on anything specific to look out for?

Thanks
Ella

simon · 10 October 2022 15:13

I’ve had a look, and the anomaly appears in the input dump to the recon, /work/n02/n02/shakka/cylc-run/u-cq635/share/cycle/20211201T0000Z/ec/um/ec_cb000 which in turn is produced from the global grib dump /work/n02/n02/shakka/Dec_2021/ec_grib_202112010000.t+000 So it looks as if the regridding from the ERA data is the issue. I suspect the recon is expecting the polar data to be handled differently from how the ECMWF provides it. One thing you could try is changing the horizontal regridding from ‘bilinear’ to one of the other two options in glm_um-> namelist-> Recon…->Hozintal…

Simon.

shakka · 11 October 2022 08:27

Hi Simon,

I’ve had a go at changing the interpolation method and now get errors in the ec recon. Having thought about it and re-read the help panel for the horizontal interpolation in glm_um, I am not sure either of the other interpolation methods are appropriate. The nearest neighbour method ignores the namelists and takes all the input fields directly from the source dump (not all the required fields are included in the dump because I’m using ERA5 data so this throws up ‘required field not in input dump’ errors), while the area-weighted method is supposedly not advised when pole rotations are being performed (indeed, this produces the error ‘grids have different poles!’).

So I guess we’re stuck with bilinear! I’ve had a look and it’s a problem in all the ec cb00* files. I think @ruthprice was also having a similar problem in the Antarctic if I recall correctly. Is there anything we can do to the bilinear interpolation scheme itself to address this?

Cheers,
Ella

ruthprice · 11 October 2022 08:56

Hello,

I’ve had various problems with anomalies in the recon in the Arctic and Antarctic, although not from ERA5 data and not over the pole in Antarctica - my domains are on the coast there. Paul Field gave me one fix for avoiding NaNs when I was getting them at the north pole. It’s specific to UKCA ancils but could possibly be adapted. Here’s the change on MOSRS, hope the link works:

https://code.metoffice.gov.uk/trac/roses-u/changeset?reponame=&new=233767%40c%2Fo%2F5%2F9%2F0&old=229759%40c%2Fo%2F5%2F9%2F0

If the link doesn’t work, it’s suite co590, revisions 229759:233767.

Ruth

simon · 11 October 2022 09:55

Hi Ruth,

Many thanks for replying. I’ll look into it, but unfortunately your fix is for ancils rather than regridding of the ERA5 data in the recon which I think is causing this issue. Poles are such a pain!

Ella, how did you obtain the input ERA5 data? Do you know if it’s been through any processing?

Simon.

shakka · 11 October 2022 10:21

Hi Simon, agreed that the poles are a pain! If only the poles were somewhere around the equator, hey…

I got the data using a MARS retrieval script directly from ECMWF. They look fine to me but you might be able to see something I can’t. They’re in: /work/n02/n02/shakka/Dec_2021/

Will Ruth’s fix not work for the recon then, do you think?

Cheers
Ella

simon · 11 October 2022 13:45

Hi Ella,

Unfortunately, Ruth’s fix is the the UKCA ancils, but your job doesn’t appear to use UKCA.

My current working theory is that the horizontal interpolation of the ERA winds is incorrect for the global pole. This could be due to an UM-inconsistent polar value and/or the ERA wind grids being misinterpreted by the reconfiguration. I will have a look at the recon output and code to see if I can work out what’s going on.

What slightly concerning me is that an anomaly like yours hasn’t be spotted before. Are we sure that it’s the source of the NaNs? Has this job been run successfully with a non-ERA dump?

Simon.

shakka · 11 October 2022 14:29

Hi Simon,

Understood re: the UKCA ancils, I thought it might be possible to adapt to the recon.

And on your other question: it’s not been tested with the glm as it’s a totally new domain (hence why it may not have been spotted yet - I know many UM users avoid the poles for this very reason!). I’m not sure if anyone has run a pan-Arctic domain like this before but I’m happy to be corrected. I am also not 100% certain that it’s the source of the NaNs, but I can’t see any other immediately obvious source of error either.

As noted above, the NaNs seem to develop in atmos_physics2, with theta* and q*, i.e. potential temperature and humidity, plus r_u_p2, r_v_p2 and r_w_p2 (I’m not sure on the exact definitions of those but they’re definitely wind-related), so it does seem plausible that the winds could be causing this problem.

I was intrigued to see that it wasn’t a problem over the South Pole - but I have just had a look at the startfiles for the Antarctic simulation in u-co433 (cylc-run/u-co433/share/cycle/20211201T0000Z/Antarctic/12km/ERA5/ics/ERA5_astart) and I notice that -although subtle - there is still a blob in U/V winds over the geographical pole.

My gut feeling is telling me that this is due to the changes in direction convention when winds are passing the geographical poles, and that the recon is trying to smooth this out over several grid points.

I did get an error about the ERA5 and nested domain having different poles, but I dismissed it because I figured that it makes sense given that the LAM is a rotated pole domain while ERA5 has standard lat/lon and pole coordinates.

Ella

Edit: also visible in the humidity and theta fields in the Antarctic recon, throughout all vertical levels

simon · 12 October 2022 14:04

Hi Ella,

I’m going to need to take a copy of your job. To confirm, it’s u-cq635 isn’t it? Also, have you made any changes to it which haven’t been checked in? If so, I’ll need to take a copy directly from your roses directory,

Simon.

shakka · 12 October 2022 14:38

Hi Simon, exactly right. The Arctic suite is u-cq635, and I’ve just committed my local changes for you.
The Antarctic suite (which is having the same problem) is u-co433. The two should be the same apart from the domains.
Thanks for your help on this.
Ella

simon · 14 October 2022 12:38

Hi Ella,

I checked out u-cq635 and am trying to run it. To get it to work I had to turn off grid checking for the veg,func and veg.frac files, and I’m now getting
???????????????????????????????????????????????????????????????????????????????? ???!!!???!!!???!!!???!!!???!!! ERROR ???!!!???!!!???!!!???!!!???!!! ? Error code: 2 ? Error from routine: INBOUNDA ? Error message: CHK_LOOK_BOUNDA : Wrong number of LBC fields ? Error from processor: 31 ? Error number: 21 ????????????????????????????????????????????????????????????????????????????????

errors.
Have you seen this sort of thing before?

shakka · 14 October 2022 13:00

Hi Simon,
I also had to turn off grid checking in the ancils (thought I’d already done that in the committed suite, sorry about that!). I have had this error in my Antarctic suite. I fixed it by setting CRUN_LEN to the same frequency as the ERA5 start files I have (i.e. rg01_rs02_m01_lbc_freq = 43200, CRUN_LEN = 43200, dm_ec_cb_freq = 43200).

Does that work for you?

Cheers
Ella

simon · 14 October 2022 13:53

Hi Ella,

That worked (although CRUN_LEN was left untouched at 12). It now falls over with NaNs. I will now start looking at the regridding.

Simon.

shakka · 14 October 2022 13:59

Thanks for your help Simon

shakka · 18 October 2022 09:02

Hi Simon, any chance you’ve been able to look into this?

Topic		Replies	Views
Sudden "NaNs in error term in BiCGstab" after model running smoothly Unified Model ARCHER2	3	116	30 March 2024
NaNs in error term in BiCGstab for high res domain Unified Model ARCHER2 , Nesting-Suite	3	420	13 December 2023
NaNs BiCGstab error - from transplanting data? Unified Model ARCHER2 , Nesting-Suite	2	18	30 March 2025
Error in glm_um_fcst Unified Model	4	153	14 December 2023
New ancillaries for future continuation run, Halos, NaNs in error term in BiCGstab Unified Model ARCHER2 , Ancillaries	4	357	13 May 2022

Suggestions for debugging BICGstab "NaNs in error term"

Related topics