BiCGstab error at the beginning of simulation

MYoshioka · 9 December 2025 12:09

My UM vn13.9 UKESM1.1 global AMIP suite, u-dv077, completed a one-year simulation without a problem (finally). I then copied it to create u-dv258, removed the aviation emissions that I had added to ukca_em_file in u-dv077, and tried to run the suite. It failed with this error:

NaNs in error term in BiCGstab after 1 iterations

I’m aware that this problem is usually treated by perturbing the dump file created in the last cycle. However, in this case this occurs at the beginning of the run, so there is no such dump file except the initial dump (cx946a.da20150101_00 provided by Luke Abraham), which I used many times without a problem.

So I added back all the aviation emissions, so that the suite is virtually the same as u-dv077. When I tried to run it, I got the same problem. I then tried to run u-dv077 itself again. The result is the same!

I cannot rule out the possibility that I have made a small change to u-dv077 or the branch used there after it was run.

So I wonder if I can get any advice. What can be causing the problem? What could I check? What could I try? Or is this a temporary problem and I should be just waiting?

Thanks,

Masaru

MYoshioka · 10 December 2025 14:32

I forgot to mention that these suites use 1 UM branch (vn13.9_Contrails_Chen) and 2 UKCA branches (um13.9_primSU_Aitken and um13.9_Aviation_3D_emiss). All of these are currently set to the local working copies but they should be identical to the latest revisions committed.

I tried changing the initial dump to dv077a.da20160101_00, which was created in the previous run of u-dv077. The result didn’t change (crashed with the BiCGstab error).

Masaru

MYoshioka · 11 December 2025 13:31

I said that u-dv077, which previously ran for 1 year, doesn’t run now. More precisely, the run that completed successfully was run3. I also said that I tried to run the suite again and it failed; that was run4. I have now tried extending run3, and it has been running for a while.

There is no difference between run3/rose-suite.conf and run4/rose-suite.conf or between run3/app/um/rose-app.conf and run4/app/um/rose-app.conf except for the run length. There is no difference in the code either, as the following doesn’t return anything at all.

diff -r run3/share/fcm_make_um/extract run4/share/fcm_make_um/extract
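The same comparison can also be scripted, so that the configuration files and the extracted source are checked in one go. This is only a minimal sketch in Python: the function name differing_files is made up for illustration, it compares file contents byte for byte, and it ignores empty directories.

```python
import filecmp
import os


def differing_files(dir_a, dir_b):
    """Return sorted messages for files that differ or exist on one side only.

    Hypothetical helper: a rough stand-in for `diff -r`, comparing contents
    rather than timestamps (shallow=False).
    """
    def listing(root):
        # Collect every file path below root, relative to root.
        found = set()
        for base, _dirs, files in os.walk(root):
            for name in files:
                found.add(os.path.relpath(os.path.join(base, name), root))
        return found

    in_a, in_b = listing(dir_a), listing(dir_b)
    report = [f"only in one run: {path}" for path in in_a ^ in_b]
    for path in in_a & in_b:
        same = filecmp.cmp(
            os.path.join(dir_a, path), os.path.join(dir_b, path), shallow=False
        )
        if not same:
            report.append(f"differs: {path}")
    return sorted(report)
```

Calling differing_files("run3", "run4") would then report any file that differs anywhere under the two run directories. An empty list corresponds to diff -r printing nothing.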

The only difference is that run3 starts from 20160201 with the dump the suite produced whereas run4 is starting from the beginning (20150101). I just checked the initial dump cx946a.da20150101_00 but it doesn’t seem to have changed.

-r--r--r--. 1 luke.abraham.mon ukca-cam 8572932096 Jun 25 2023 /projects/ukca-cam/luke.abraham.mon/restarts/AMIP/cx946a.da20150101_00

This is very strange. Please help.

MYoshioka · 12 December 2025 14:38

To summarise:

u-dv077/run3 once ran. Extension run still runs fine.
u-dv077/run4 doesn’t run. BiCGstab error occurs as soon as the run starts.
u-dv258 doesn’t run. BiCGstab error occurs as soon as the run starts.
u-dv258 fails using dv077a.da20160101_00 as ainitial. BiCGstab error occurs.
There is virtually no difference between these runs.

I know, this doesn’t make any sense at all, but it is what is happening to me.

Please somebody help me. Super-duper pretty please

Masaru

MYoshioka · 27 December 2025 15:53

I went one suite back and started from there (u-du915) to do the same as dv077. In that process I noticed that I had changed a namelist setting because the run3 of dv077 had failed after running for 6 months.

[namelist:items(085d9ed4)]
ancilfilename='$ROSE_DATA/etc/ancil/qrparm.veg.frac'
domain=1
!!interval=5
l_ignore_ancil_grid_check=.false.
!!netcdf_varname='unset','unset'
!!period=3
source=4
stash_req=216
update_anc=.false.
!!user_prog_ancil_stash_req=
!!user_prog_rconst=0.0

I had changed update_anc from true to false (and rose edit automatically put ‘!!’ in front of interval=5 and period=3). With this change the suite continued to run after the 6 months, so I thought this was the right change. I committed it, but copies of this suite were failing. That was what was happening.
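Settings like this are easier to spot when all the ancillary items in app/um/rose-app.conf are listed side by side. The following is only a sketch in Python under stated assumptions: the function names parse_ancil_items and summarise are made up for illustration, ignored settings are taken to be those prefixed with '!' or '!!', and continuation lines (starting with '=') are simply skipped.

```python
import re

SECTION = re.compile(r"^\[(!*)(.+?)\]$")
ITEM = re.compile(r"^namelist:items\((\w+)\)$")


def parse_ancil_items(conf_text):
    """Collect the settings of every [namelist:items(...)] section.

    Returns {item_id: {key: (value, ignored)}}, where `ignored` is True
    for settings prefixed with '!' or '!!'.
    """
    items = {}
    current = None
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        section = SECTION.match(line)
        if section:
            item = ITEM.match(section.group(2))
            current = items.setdefault(item.group(1), {}) if item else None
            continue
        if current is None or "=" not in line or line.startswith("="):
            continue  # outside an items section, or a continuation line
        key, value = line.split("=", 1)
        current[key.lstrip("!")] = (value, key.startswith("!"))
    return items


def summarise(items):
    """One line per ancillary item: stash code, source, update flag, file."""
    rows = []
    for item_id in sorted(items):
        settings = items[item_id]
        fields = [
            f"{key}={settings.get(key, ('?', False))[0]}"
            for key in ("stash_req", "source", "update_anc")
        ]
        filename = settings.get("ancilfilename", ("?", False))[0]
        rows.append(f"{item_id} {' '.join(fields)} file={filename}")
    return rows
```

Running summarise(parse_ancil_items(text)) on the contents of run3 and run4 (or of two suites) and comparing the output line by line would show at a glance which items have a different source, update_anc or file.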

So now I know this namelist is causing the error but still cannot fix the problem. In u-dv725 (committed), I changed the veg frac data from
u-by791_m01s00i216_1979-2014_annual_timeseries_land_cover_frac.anc
to
/common/share/monsoon_ancils/atmos/n96e/orca1/vegetation/fractions_igbp/v4/qrparm.veg.frac
and I thought this would fix the problem, but it doesn’t. dv725/run2 fails just like the first time with dv077/run3. atmos_main fails after 6 months with this error:

? Error code: 500
? Error from routine: UP_ANCIL
? Error message: REPLANCA: error finding next lookup entry for field: 4 stashcode 216
? Error from processor: 0
? Error number: 70

I thought this qrparm.veg.frac might actually be time-series data, but it is not. Its time dimension is 1 (nt=1), so it does look like climatological data.

I wondered whether I should set a different value for source. I tested source=2, but this time recon fails with the error “Land surface configuration does not match ancillary”.

Please can I have any advice?
Masaru

mdalvi · 29 December 2025 10:14

Hi Masaru,

I am not sure I can follow this entirely, but in the case of the veg frac (or any other surface) ancillary, you can only use the ‘CMIP6’ ancillaries with a UKESM configuration, as this has a different land-surface setup from other science configurations. In this instance, for Present Day type runs, you can try the ancils in /common/share/monsoon_ancils_cmip6/model_derived/ukesm1.0_historical_r2i1p1f2_u-bc292/n96e/clim_2005-2014/

MYoshioka · 5 January 2026 10:57

Hi Mohit,

Oh, that’s great! I replaced all data in
app/install_ancil/opt/rose-app-monsoon3.conf
and
app/um/opt/rose-app-monsoon.conf
with those in the directory you gave here. I set source=2.
Now my suite runs without any problem.

This has annoyed me for many weeks but the problem was surprisingly simple!!!

Now I just wonder why this kind of information has to be so hard to find.

Anyway, thanks for your help.
Masaru
