BICGstab error 20 years into nudged run

Hello CMS Helpdesk,

I have been running a nudged run in suite u-df691 with a branch that contains code changes I have added for polar stratospheric cloud formation in the UKCA. The simulation runs fine from 1982 (when it is initialized) to 2000, but fails with a BICGstab error in December 2000 (screenshot attached). Any ideas what could be causing this error so far into a model run?

Thank you!

Isabelle

We generally work around these failures by perturbing the atmosphere start file and rerunning the cycle. See https://code.metoffice.gov.uk/trac/moci/wiki/tips_CRgeneral#Restartingifthemodelblowsup for how to use perturb_theta.py.

Grenville

Hi Grenville,

Thank you. I have looked through this and tried to perturb the theta field, but I am not sure how to use the perturb_theta.py script on Monsoon (i.e. how to access it from Monsoon and how to run the Python script).

Thanks!

Best,
Isabelle

Hi Isabelle,
The script is not installed by default on Monsoon, so you will have to extract it from the repository:
$ mkdir ~/test    # or another suitable folder
$ cd ~/test/
$ fcm export fcm:moci.xm_tr/Utilities/lib/perturb_theta.py
$ chmod +x perturb_theta.py
$ cd ~/cylc-run/u-df691/share/data/History_Data/
(assuming the atmos_main task starting 01Dec2000 is failing)
$ mv df691a.da20001201_00 df691a.da20001201_00.orig
$ module load um_tools
$ ~/test/perturb_theta.py df691a.da20001201_00.orig --output ./df691a.da20001201_00

Resubmit the failed task.

If the suite has stopped in the meantime, you can restart from the failed point (as long as no configuration settings or namelists have changed) using:
$ rose suite-run --restart
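
The rename-and-perturb steps above could be wrapped in a small shell function, so that the original dump is put back if the perturbation fails. This is only a sketch: the function name perturb_dump and the PERTURB_CMD override are hypothetical, and it assumes perturb_theta.py was exported to ~/test as shown above and that um_tools is already loaded.

```shell
# Sketch only: perturb a restart dump, keeping the original as <dump>.orig.
# PERTURB_CMD is a hypothetical override; by default it points at the copy
# of perturb_theta.py exported to ~/test in the steps above.
PERTURB_CMD="${PERTURB_CMD:-$HOME/test/perturb_theta.py}"

perturb_dump() {
    local dump="$1"
    local orig="${dump}.orig"

    # Refuse to run if there is nothing to perturb or a backup already exists.
    [ -s "$dump" ] || { echo "missing or empty dump: $dump" >&2; return 1; }
    [ ! -e "$orig" ] || { echo "backup already exists: $orig" >&2; return 1; }

    mv "$dump" "$orig" || return 1

    # Same call as in the steps above: read the .orig copy, write the original name.
    if ! "$PERTURB_CMD" "$orig" --output "$dump" || [ ! -s "$dump" ]; then
        echo "perturbation failed, restoring $dump" >&2
        rm -f "$dump"
        mv "$orig" "$dump"
        return 1
    fi
}

# Usage (after 'module load um_tools'):
#   cd ~/cylc-run/u-df691/share/data/History_Data/
#   perturb_dump df691a.da20001201_00
```

Refusing to overwrite an existing .orig file means that a second perturbation attempt cannot silently replace the untouched original.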

Thank you! I have perturbed the dump file and resubmitted the task. However, the postproc step now seems to be failing (it is stuck in a loop of retrying). The text from the job.err file of the postproc task is copied below. Any idea why this would be happening? Thanks in advance!

Update: the atmos step has failed again for 12/2000 with a BICGstab error.

job.err file :

[WARN] file:atmospp.nl: skip missing optional source: namelist:archer_arch
[WARN] file:pptransfer.nl: skip missing optional source: namelist:archer_arch
[WARN] file:atmospp.nl: skip missing optional source: namelist:script_arch
[WARN] file:pptransfer.nl: skip missing optional source: namelist:pptransfer
[WARN] rose date requires length=5 input date array - adding 0: [1, 1, 1, 0]
[WARN] rose date requires length=5 input date array - adding 0: [1, 1, 1, 0, 0]
[WARN] rose date requires length=5 input date array - adding 0: [2000, 12, 1, 0]
[WARN] rose date requires length=5 input date array - adding 0: [2000, 12, 1, 0, 0]
[WARN] rose date requires length=5 input date array - adding 0: [2000, 12, 1, 0]
[WARN] rose date requires length=5 input date array - adding 0: [2000, 12, 1, 0, 0]
[WARN] rose date requires length=5 input date array - adding 0: [1, 12, 1, 0]
[WARN] rose date requires length=5 input date array - adding 0: [1, 12, 1, 0, 0]
[WARN] rose date requires length=5 input date array - adding 0: [2000, 12, 1, 0]
[WARN] rose date requires length=5 input date array - adding 0: [2000, 12, 1, 0, 0]
[WARN] move_files: Deleted pre-existing file with same name prior to move: /home/d03/isangha/cylc-run/u-df691/work/20001201T0000Z/atmos_main/df691a.ps2000son.arch
Traceback (most recent call last):
  File "/home/d03/isangha/cylc-run/u-df691/share/fcm_make_pp/build/bin/main_pp.py", line 118, in <module>
    main()
  File "/home/d03/isangha/cylc-run/u-df691/share/fcm_make_pp/build/bin/main_pp.py", line 111, in main
    run_postproc()
  File "/home/d03/isangha/cylc-run/u-df691/share/fcm_make_pp/build/bin/main_pp.py", line 82, in run_postproc
    getattr(model, meth)()
  File "/projects/ukca-cam/isangha/cylc-run/u-df691/share/fcm_make_pp/build/bin/timer.py", line 115, in wrapper
    out = function(*args, **kw)
  File "/projects/ukca-cam/isangha/cylc-run/u-df691/share/fcm_make_pp/build/bin/atmos.py", line 519, in do_transform
    for fname in self.diags_to_process(finalcycle):
  File "/projects/ukca-cam/isangha/cylc-run/u-df691/share/fcm_make_pp/build/bin/timer.py", line 115, in wrapper
    out = function(*args, **kw)
  File "/projects/ukca-cam/isangha/cylc-run/u-df691/share/fcm_make_pp/build/bin/atmos.py", line 335, in diags_to_process
    logfile=log_file
  File "/projects/ukca-cam/isangha/cylc-run/u-df691/share/fcm_make_pp/build/bin/timer.py", line 115, in wrapper
    out = function(*args, **kw)
  File "/projects/ukca-cam/isangha/cylc-run/u-df691/share/fcm_make_pp/build/bin/validation.py", line 178, in verify_header
    headers, empty_file = mule_headers(fname)
  File "/projects/ukca-cam/isangha/cylc-run/u-df691/share/fcm_make_pp/build/bin/timer.py", line 115, in wrapper
    out = function(*args, **kw)
  File "/projects/ukca-cam/isangha/cylc-run/u-df691/share/fcm_make_pp/build/bin/validation.py", line 283, in mule_headers
    umfile = mule.UMFile.from_file(filename, remove_empty_lookups=True)
  File "/opt/scitools/environments/production_legacy/2018_10_17/lib/python2.7/site-packages/mule/__init__.py", line 1246, in from_file
    new_umf._read_file(file_or_filepath)
  File "/opt/scitools/environments/production_legacy/2018_10_17/lib/python2.7/site-packages/mule/__init__.py", line 1424, in _read_file
    FixedLengthHeader.from_file(source))
  File "/opt/scitools/environments/production_legacy/2018_10_17/lib/python2.7/site-packages/mule/__init__.py", line 555, in from_file
    return super(FixedLengthHeader, cls).from_file(source, cls._NUM_WORDS)
  File "/opt/scitools/environments/production_legacy/2018_10_17/lib/python2.7/site-packages/mule/__init__.py", line 393, in from_file
    return cls(values)
  File "/opt/scitools/environments/production_legacy/2018_10_17/lib/python2.7/site-packages/mule/__init__.py", line 527, in __init__
    raise ValueError(_msg)
ValueError: Incorrect size for fixed length header; given 0 words but should be 256.
[FAIL] main_pp.py atmos # return-code=1
2024-09-12T08:29:53Z CRITICAL - failed/EXIT

Hi Isabelle

I think you should have perturbed df691a.da20001201_00 - that’s the start file that the 20001201T0000Z cycle will use.

Not sure about postproc - let’s get this fixed first.

Grenville

Hi Grenville,

The dump file I perturbed was the df691a.da20001201_00 file; however, it still results in a BICGstab error when I resubmit the 20001201T0000Z cycle.

Best,
Isabelle

Hi Isabelle,

Just trying to understand the run so far - the postproc task should only launch on successful completion of the atmos_main task (for that month). Was the postproc task launched manually?
From looking at your work and share folders, it looks like at least part of the simulation was re-run recently (without recon). If perturb_theta does not seem to work for the December restart file, it is likely that anomalous values, or whatever is causing the failure, are already ‘baked in’ to that dump, so it might be worth perturbing an earlier month and re-running the simulation from there.

Mohit

Hi Mohit,

I think the issue with the postproc is that I ran the ‘trigger now’ for the whole cycle 20001201T0000Z rather than just the atmos_main task and so the postproc was triggered before the atmos_main finished.
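
The ValueError in the postproc job.err above reports that 0 words were read for the fixed-length header, which is consistent with postproc picking up an output file that was still empty. As a quick check before re-triggering postproc, something like the following could list any zero-byte files under a directory. This is a sketch; the function name list_empty_files is hypothetical, and the example path is the atmos_main work directory named in the job.err.

```shell
# Sketch: print every zero-byte regular file below the given directory.
list_empty_files() {
    find "$1" -type f -size 0 -print
}

# Usage:
#   list_empty_files ~/cylc-run/u-df691/work/20001201T0000Z/atmos_main
```

No output means no empty files were found.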

In terms of perturbing an earlier month, I tried perturbing the theta field in the November 2000 restart file and it results in the same BICGstab error in December 2000. I will perhaps try perturbing a restart file for January 2000 and see whether that allows the simulation to progress past December 2000 – although would it take that long for anomalous values to cause a failure?

Thanks!

Best,
Izzy

If perturbing the values 2-3 months back still causes a failure at the same point, then the most likely cause is an input read at, or just before, the failure point.
With monthly ancillaries, the December data would already have been read (and interpolated) from 16 November, so a problem there would have caused a failure earlier. Another possible cause is the nudging input files, but so far other users have not reported similar problems for this period.

Argh - sorry misread the filename!

Thanks - I will try and perturb September and October and see if the failure still happens.

Just to check that I am re-running these correctly: if I want to perturb the September dump file and try rerunning, I should 1) perturb the September file using perturb_theta.py, 2) update the AINITIAL file to be the perturbed September file, 3) update the model basis time to match the September dump file date and turn build and recon off, and 4) run the suite from September.

Yes…
Note that AINITIAL → Reconf → ASTART, so the atmos model always reads the ASTART file and not AINITIAL. However, in the absence of reconfiguration the suite may have been set up to link the AINITIAL file directly as ASTART, so you need to check.

Ah, I see. The astart is currently $ROSE_DATA/${RUNID}$a.astart, but should I set it to /cylc-run/u-df691/share/data/History_Data/df691a.da20000901_00 (assuming I want to rerun from September) to ensure that the atmos model is reading in the correct dump file?

Thanks!

Yes, for this run, but be aware that the astart file will be overwritten if reconfiguration is turned on.

Sorry for the continued errors …
The run fails at the atmos job when I set ASTART to /cylc-run/u-df691/share/data/History_Data/df691a.da20001001_00 with the error copied below. I have tried using ‘/cylc-run/u-df691/share/data/History_Data/df691a.da20001001_00’ as well as ‘~/cylc-run/u-df691/share/data/History_Data/df691a.da20001001_00’ and both result in the same error.

???
???!!!???!!!???!!!???!!!???!!! ERROR ???!!!???!!!???!!!???!!!???!!!
? Error code: 1
? Error from routine: io:file_open
? Error message: Failed to open file /cylc-run/u-df691/share/data/History_Data/df691a.da20001001_00
? Error from processor: 0
? Error number: 39
???

Hi,
You will have to specify the full path, starting from /projects/ukca-cam/ (i.e. $DATADIR) and followed by cylc-run/suite-id/. That is usually where the share/data folders are installed.
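
One way to find the full path is to let the shell resolve it. The job.err earlier in this thread shows the same suite under both /home/d03/isangha/cylc-run and /projects/ukca-cam/isangha/cylc-run, which suggests ~/cylc-run is a link into the project space. The sketch below (the function name resolve_dump is hypothetical, and it assumes GNU readlink is available) prints the canonical path of a dump and checks that it is readable. Note also that `~` is expanded by the shell, so it is unlikely to work inside a quoted path that the model opens directly.

```shell
# Sketch: print the canonical, link-free path of a dump file and check
# that it is readable. Assumes GNU readlink (for the -f option).
resolve_dump() {
    local path="$1"
    local full
    full=$(readlink -f -- "$path") || return 1
    [ -r "$full" ] || { echo "not readable: $full" >&2; return 1; }
    printf '%s\n' "$full"
}

# Usage:
#   resolve_dump ~/cylc-run/u-df691/share/data/History_Data/df691a.da20001001_00
```

The printed path is the one to use for astart.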

Hi Mohit,

Thank you. I think the astart file can now be found; however, the run is failing with a ‘cylc: unbound variable’ error related to the astart file.

Hi,
My earlier reply only contained an indication of what the full path should be (as I was not sure what your $DATADIR is!).
The following setting should work:
astart='/projects/ukca-cam/isangha/cylc-run/u-df691/share/data/History_Data/df691a.da20001001_00'
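
For reference, an "unbound variable" error is what a shell running with `set -u` reports when a variable that has not been defined is expanded. One plausible reading, and it is only an assumption, is that a variable such as $DATADIR was left in the astart path but was not defined in the job environment. The sketch below (the function name dump_path is hypothetical) builds the path from $DATADIR and fails with a clear message if it is not set; the hard-coded path above avoids the problem altogether.

```shell
# Sketch: build the History_Data path for a suite's dump from $DATADIR.
# ${DATADIR:?...} stops with a clear message if DATADIR is unset or empty,
# rather than producing a bare "unbound variable" error.
# Assumption: DATADIR would be /projects/ukca-cam/isangha for this user.
dump_path() {
    printf '%s\n' "${DATADIR:?DATADIR is not set}/cylc-run/$1/share/data/History_Data/$2"
}

# Usage:
#   dump_path u-df691 df691a.da20001001_00
```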