Suggestions for debugging BICGstab "NaNs in error term"

I’m not familiar with how to use this file. As a workaround until we understand how to use it, could you try changing i_rad_topography to 0 or 1?
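In case it helps, that switch normally lives in the UM radiation namelist. A minimal sketch of the change, assuming it sits in the um app's run_radiation namelist (the exact file, rose-app.conf or an opt conf, depends on your suite):

# app/um/rose-app.conf or the relevant opt conf (location is an assumption)
[namelist:run_radiation]
i_rad_topography=0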


Have tried this, but puma seems to be unresponsive at the moment. Will keep trying to log back in…

Thanks Grenville. Changing the i_rad_topography setting to 0 seems to have helped (I had to remove stash items 5 and 6, the orographic gradients, to get it to work). I have now got past the error related to item 274, but am now getting the same error about item 275 (standard deviation of topographic index).

I can’t see why any of this is necessary if it worked in u-co433. How does u-cq635 differ?

Hi Grenville. It hasn’t worked in co433 because I’m using the ‘ECMWF forecast fields’ option with co433, and that method results in the strange wind ‘blob’ over the pole that Simon found, which is why I have been trying to test the glm > LAM method. I’m also quite mystified as to why this shouldn’t work. Is there an older version of the model on ARCHER2 (e.g. vn11.1) that I could revert to that might not have this problem?

Hi Simon, I’m trying to get your gribfix branch to work in my suite but can’t seem to find the directory. Is this still the right file path?

Cheers
Ella

The full path is /home/simon/branches/vn12.0_gribfix on pumanew. It hasn’t been checked in.

The RAS should be able to make the topographic index ancils. A heads-up if you’re using RA3: yes, I did have a load of issues with it. It is explicitly requested in app/um/opt/rose-app-ra3_pack3.conf.

I had to add the file name in bin/setup_ancil_versions and give the namelist entry the file name in the config file.

You can see the modifications in this suite, which I keep meaning to look through and send back to Paul!

https://code.metoffice.gov.uk/trac/roses-u/log/c/q/9/8/4/
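For anyone hitting the same thing, the kind of config entry involved looks roughly like the sketch below. This is only a hedged illustration: the items namelist keys, the entry name and the ancil path are assumptions and will differ by UM version and suite.

# Hypothetical items namelist entry pointing the UM at a topographic
# index ancil file. Keys, entry name and path are illustrative only.
[namelist:items(topoidx)]
ancilfilename='$UM_ANCIL_DIR/qrparm.topoidx'
domain=1
source=2
stash_req=274,275
update_anc=.false.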


Thanks Simon. So does that mean I can’t use it as a source in the build? Sorry, I get super confused by FCM stuff!

Coming to my rescue again, Helen! Thanks so much - your modifications look like almost exactly what I need to do too.
Cheers!


Hi Helen,

I have been looking through the changes you committed in cq984 to the ancil_lct_postproc_c4 app, but I can’t seem to see the source file you reference (c4source=${ANCIL_MASTER}/vegetation/cover/cci/v3/c4_percent_1d.nc).

As far as I can see there’s an .asc file under /vegetation/cover/cci/v1/ but no NetCDF. Do you know where it is, or did you make a NetCDF file yourself from the .asc?
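(In case it comes to that, I was planning to convert the .asc myself with something roughly like the sketch below, though ANTS may well be the proper tool; the file names and variable name are placeholders, not the real ancil paths.)

# Rough sketch: convert an ESRI ASCII grid (.asc) to NetCDF.
# File names and the variable name are placeholders.
import numpy as np
import xarray as xr

# Read the standard 6-line ESRI ASCII header
header = {}
with open("c4_percent_1d.asc") as f:
    for _ in range(6):
        key, val = f.readline().split()
        header[key.lower()] = float(val)

data = np.loadtxt("c4_percent_1d.asc", skiprows=6)
data = np.where(data == header["nodata_value"], np.nan, data)

# Cell-centre coordinates from the header
nrows, ncols = int(header["nrows"]), int(header["ncols"])
dx = header["cellsize"]
lon = header["xllcorner"] + dx * (np.arange(ncols) + 0.5)
lat = header["yllcorner"] + dx * (np.arange(nrows) + 0.5)

# .asc rows run north to south, so flip to get ascending latitude
da = xr.DataArray(data[::-1, :],
                  coords={"latitude": lat, "longitude": lon},
                  dims=("latitude", "longitude"),
                  name="c4_percent")
da.to_netcdf("c4_percent_1d.nc")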

Cheers
Ella

Hi again. I’ve been trying to recreate the ancils using the RAS from scratch as I’m still having the same problem at the first timestep.

I’ve taken a copy of @douglowe’s archer RAS suite, but the ancil_soils_hydr step keeps timing out and failing (I have tried increasing the time limit to 30, then 40, then 60 and finally 90 minutes, with the same effect). This seems weird. Is there something that I need to change to make this step work? It seems possible that this could be causing the BICGstab error.
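For reference, the time-limit bumps were just edits along these lines (a sketch assuming a Cylc 7 suite.rc; your copy of the suite may set this through a site include file or batch directives instead):

[runtime]
    [[ancil_soils_hydr]]
        [[[job]]]
            execution time limit = PT90M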

FYI the veg frac ancil looks much more sensible now that I’ve made it with ANTS, thanks for the heads-up @cemachelen. I’ve also switched to using RA2M for now to avoid the topographic index issue.

Update for anyone following this: it appears the culprit was a malformed mask file. There were two veeery small bands of incorrect values over the Eurasian part of the domain that were either created by the CAP or xancil, as they weren’t in the netcdf of the mask I made.

This inconsistency ultimately caused the forecast to fail, and it also affected the other ancils. The region of missing data in the LSM is also the cause of the time-out in the ancil_soils_hydr step of the RAS: there is a land point for which the source data has no value, so it tries to find one with a spiral search, which is very slow (huge thanks to Nick S for alerting me to this!).

I’m now re-producing the ancils with a clean mask, which will hopefully fix the issue. Stay tuned!

EDIT

narrator: dear reader, it did NOT fix the issue. Read on to find out why…

I have made some progress removing NaNs from the input files, finding another ANTS bug in the process (I am writing up the solution and will post it once I manage to get this going).
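The NaN check itself is nothing clever; roughly the sketch below, looping over the NetCDF inputs (the directory glob is a placeholder for wherever your input files actually live):

# Rough sketch: report NaN counts in every variable of each NetCDF input file.
# The file glob is a placeholder for the actual ancil/input directory.
import glob
import numpy as np
import xarray as xr

for path in sorted(glob.glob("ancils/*.nc")):
    ds = xr.open_dataset(path)
    for name, var in ds.data_vars.items():
        vals = var.values
        if np.issubdtype(vals.dtype, np.floating):
            n_nan = int(np.isnan(vals).sum())
            if n_nan:
                print(f"{path}: {name} has {n_nan} NaNs")
    ds.close()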

However, I’m still getting the same error on the first timestep. It seems to be happening in atmos_physics1, as I get NaNs in the pe_output file for theta_star, r_thetav and dOLR. Digging around in the dynamics source code suggests to me something in the advection / vertical levels / radiation modules.

pe_output looks like this:


Maximum horizontal wind at timestep  0       Max wind this run
max_wind   level  proc         position        run max_wind level timestep
0.105E+03  70    143   97.9% East    73.7% North  0.105E+03   70     0
Atm Step: Lexpand_ozone F
***  L2 norms before atmos_physics1 ***
Levels    1 to   71 exner Two_Norm = 0.6745924431923690E+01
Levels    1 to   70 wet rho_r_sq_n Two_Norm = 0.2963453988275088E+15
Levels    1 to   70 u Two_Norm = 0.1415843902126721E+03
Levels    1 to   70 v Two_Norm = 0.1001045329944668E+03
Levels    1 to   70 w Two_Norm = 0.0000000000000000E+00
Levels    1 to   70 theta Two_Norm = 0.3391484770277533E+04
Levels    1 to   70 q Two_Norm = 0.6823241429242393E-02
Levels    1 to   70 qcl Two_Norm = 0.0000000000000000E+00
Levels    1 to   70 qcf Two_Norm = 0.0000000000000000E+00
Levels    1 to   70 qrain Two_Norm = 0.0000000000000000E+00
Levels    1 to   70 qgraup Two_Norm = 0.0000000000000000E+00
** L2 norms of increments after atmos_physics1 **
Mixing ratio physics, l_mr_physics = F
Levels    1 to   70 theta_star Two_Norm =                    NaN
Levels    1 to   70 q_star Two_Norm = 0.2751579747626714E-04
Levels    1 to   70 qcl_star Two_Norm = 0.2750702131646228E-04
Levels    1 to   70 qcf_star Two_Norm = 0.4336001018267724E-07
Levels    1 to   70 qrain_star Two_Norm = 0.1754965133539365E-07
Levels    1 to   70 qgraup_star Two_Norm = 0.0000000000000000E+00
Levels    1 to   70 u_inc Two_Norm = 0.1867536857197371E-04
Levels    1 to   70 v_inc Two_Norm = 0.2084776473856972E-04
ls_rain Two_Norm = 0.3337455935215572E-07
ls_snow Two_Norm = 0.0000000000000000E+00
dOLR Two_Norm =                    NaN
====================================================================================
Slow physics source terms from atmos_physics1:
r_u      :         -0.3523021429746739E-03          0.2729656237848158E-03
r_v      :         -0.2201466297953937E-03          0.4516719877189951E-03
r_thetav :         -0.3280097091682001E-01          0.3298219880806270E+00                             NaN          0.1000000000000000E+01
r_m_v    :         -0.1570501223241527E-03          0.5249577208269981E-06         -0.3751767755803022E-06          0.0000000000000000E+00
r_m_cl   :          0.0000000000000000E+00          0.1565126435238083E-03          0.3751838296592790E-06          0.0000000000000000E+00
r_m_cf   :          0.0000000000000000E+00          0.9514744320678199E-05          0.2011070520505183E-10          0.0000000000000000E+00
r_m_graup:          0.0000000000000000E+00          0.0000000000000000E+00          0.0000000000000000E+00          0.0000000000000000E+00
r_m_rain :          0.0000000000000000E+00          0.5529906696351741E-06          0.1143524245902685E-09          0.0000000000000000E+00
====================================================================================

I’m really running out of ideas about how to debug this further, and I’d be so grateful for some help. Any thoughts, @grenville @simon?

Hello and happy new year! Just bumping this in the hope that someone might be able to help. :crossed_fingers:

[At risk of talking into the void here but anyway here goes…]

After spending an age running a variety of tests and checking and double-checking the ancillary metadata, I think I have finally uncovered the reason for the BICGstab failures in the Antarctic.

I wrote some code to check that the custom mask/orog files I made from the n2560 recon were consistent with the missing-data masks of the ancils created by the RAS (u-cs542).

Turns out… they weren’t!
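The check was roughly along the lines of the sketch below, assuming the files can be read with mule; the file names and choice of fields are placeholders rather than the actual suite paths:

# Rough sketch: compare the land/sea mask in a UM ancil with the "mask"
# implied by missing data in another ancil. Paths and field choices are
# placeholders, not the real suite files.
import numpy as np
import mule

# Land/sea mask from the mask ancil (STASH 30 = land/sea mask)
mask_file = mule.AncilFile.from_file("qrparm.mask")
lsm = next(f for f in mask_file.fields if f.lbuser4 == 30).get_data().astype(bool)

# Mask implied by missing data in, e.g., the veg ancil: valid data treated as land
veg_file = mule.AncilFile.from_file("qrparm.veg.func")
fld = veg_file.fields[0]
implied_land = fld.get_data() != fld.bmdi

diff = lsm != implied_land
print(f"{diff.sum()} of {diff.size} points differ")
if diff.any():
    rows, cols = np.where(diff)
    print("mismatches span rows", rows.min(), "to", rows.max(),
          "and columns", cols.min(), "to", cols.max())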

[Figure: mask comparison map. Yellow filled contours = adjusted mask; black line = unadjusted mask outline; green line = 0 m orography contour.]

I created a land/sea mask from the missing data field of the vegfunc ancil (adjusted mask, yellow shaded region on the figure above).

…as you can see the unadjusted mask and orography are not the same as the mask derived from the missing values of ancils created by the RAS. Not ideal.

I also compared the adjusted mask with the mask in the recon start file I was using (cylc-run/u-co447/share/cycle/20160828T0000Z/Antarctic_PolarRES/11km/ics/RA3_astart), and you can see that they are subtly different despite having the same coordinates and inputs in the suite GUI.

Adjusted mask:

[images: adjusted mask plots]

RA3 astart:

[images: RA3_astart mask plots]

Very confusing, but the inconsistency between mask/orog and other ancils will definitely cause the UM to freak out.

Next task is to figure out why this happened… and how to fix it.

Hi Ella

It might feel like no one’s listening, but that’s not so; we just didn’t have much to contribute to this problem. BICGstab errors are very frequently related to ancils (or input data in general), and bad masks will always cause problems.

Does the model run now?

Grenville

Thanks Grenville. Will keep these here in case anyone else finds the thread useful.
E