Bit comparability for re-run cycles

Dear CMS Helpdesk,

I am trying to re-run cycles from u-dd727 to recover corrupted output files in the original run. For example, in cycle 18911001T0000Z the file dd727a.pa18911201.pp has less data than .pa18910 and .pa189111. See /work/n02/n02/ajw1g19/archive/u-dd727/19120401T0000Z/ for the same issue with .pu files.

I copied the suite u-dd727 into u-dh366 and made the following changes:
→ Set the initial dump file to dd727a.da18911001_00
→ Set the model basis time and initial file time (um → recon and ancil control → output dump fixed header → new_date_time) to 1891/10/01
→ Removed all ancillaries used to initialise dump fields in the original run. The only remaining ancillaries were those used to update fields periodically through the model run (sea ice and sea surface temperature).

I was expecting the output from the re-run cycle to be identical to the original. Solar forcing was equivalent between the two; however, there were very large anomalies in temperature and precipitation, of up to 20 degrees.

Is there a setting that I missed in the new suite (u-dh366) to ensure bit comparability between the two runs?

Furthermore, is there an explanation as to why the output files are being corrupted? There were no issues of this kind during the first month of running the model, but it’s happening much more often now with all output files and dumps.

Your advice on these issues will be much appreciated,
Alfred

Hi Alfred

I don’t know for sure if UKESM1.1 bit compares under these circumstances (my guess is that it does), but you should set

BITCOMP_NRUN=true

How are the files corrupt?

Grenville

Hi Grenville,

I’ve set BITCOMP_NRUN=true and I’m running the suite now so hopefully that works.

By corrupt I mean that the file sizes do not match up: .pa files should be 765 KB but in the example above for cycle 1891/10, file .pa189112 is 320 KB.

On JASMIN, when I tried compiling data from .pa files for years 1875-1900 with cf-python, the program crashed at the above file because it couldn’t determine the file type. I’ve since discovered over 40 different output files and 25 different dumps over two suites (u-dd727, u-dh140) which have less data than they should.

The same pattern exists in the archive on ARCHER2 so the issue is not with the data transfer to JASMIN but with the writing of the files by the model.
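The size check described above can be sketched as a short script. This is only an illustration, assuming that files matching one pattern in one directory should all be roughly the same size, so anything well below the median is suspect. The function name `find_short_files` and the 0.9 threshold are hypothetical choices, not part of any suite or postproc tool.

```python
from pathlib import Path
from statistics import median


def find_short_files(directory, pattern, min_fraction=0.9):
    """Return (name, size) for files much smaller than the typical size.

    "Typical" is the median size of all files matching `pattern` in
    `directory`; `min_fraction` is an assumed threshold, not a known limit.
    """
    files = sorted(Path(directory).glob(pattern))
    sizes = {f: f.stat().st_size for f in files}
    if not sizes:
        return []
    typical = median(sizes.values())
    return [(f.name, size) for f, size in sizes.items()
            if size < min_fraction * typical]


if __name__ == "__main__":
    # Example: list suspiciously small .pp files in the current directory.
    for name, size in find_short_files(".", "*.pp"):
        print(f"{name}: {size} bytes")
```

A median is used rather than a hard-coded size so the same check could be pointed at dumps or other streams, but it will miss the problem if most files in the directory are truncated.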

Thanks,
Alfred

Is the problem the one we found earlier - ARCHER2 File copy failures?

Yes, that sounds exactly like what’s happening.

Is there a fix I can implement?

Hi Grenville,

The BITCOMP_NRUN=true test has finished and shows identical results to BITCOMP_NRUN=false. The large anomalies remain between the re-run (u-dh366) and original (u-dd727) cycles.
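One strict way to test whether a re-run reproduces the original is to compare the output files byte for byte. The sketch below is illustrative only. It assumes file names follow a `<runid>.<stream><date>.pp` pattern, so that the leading run ID can be dropped to pair files from the two suites; the function names are hypothetical. Byte-identical files imply identical fields, but files may differ in their bytes while the fields still match, in which case a field-level comparison is needed instead.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large dumps need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def stream_key(path):
    """Drop the leading run ID, e.g. 'dd727a.pa18911201.pp' -> 'pa18911201.pp'.

    The naming pattern is an assumption made for this sketch.
    """
    return Path(path).name.split(".", 1)[-1]


def compare_runs(original_dir, rerun_dir, pattern="*.pp"):
    """Return {key: 'identical' | 'different' | 'missing'} for the original files."""
    rerun = {stream_key(p): p for p in Path(rerun_dir).glob(pattern)}
    report = {}
    for original in sorted(Path(original_dir).glob(pattern)):
        key = stream_key(original)
        if key not in rerun:
            report[key] = "missing"
        elif sha256_of(original) == sha256_of(rerun[key]):
            report[key] = "identical"
        else:
            report[key] = "different"
    return report
```

Calling `compare_runs` with the original and re-run output directories for a cycle gives one status per file, which makes it easy to see whether every stream differs or only some.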

In the rose suite, in the information for the BITCOMP_NRUN switch, it mentions the following:

“To bit compare with a CRUN you also need to turn off reconfiguration
and load our atmosphere start dump as astart (not ainitial).”

Turning off recon I can understand but I’m not sure what it means by “load our atmosphere start dump”. Could you explain?

Thanks,
Alfred

Hi Alfred,

You are currently using a very old version of the ARCHER2 postproc branch dev/rosalynhatcher/postproc_2.3_pptransfer_gridftp_nopw@3202

In panel fcm_make_pp → Configuration please replace the @3202 with @5095 to pick up the version of the branch which has the shutil copy fix in it.

Regards,
Ros.

Hi Ros,

Thank you for that. If I make that switch, what do I need to do to have the suite accept it? Do I just use rose suite-run --reload or do I need to restart from scratch?

Thanks,
Alfred

As another quick question, should I make the same change to the second source that I have for postproc:

dev/rosalynhatcher/postproc_2.3_archer_fixes@3205

Thanks,
Alf

Alfred

The model needs to start using the same files as it did previously, so switch off reconfiguration and set astart to be the start file used in the original run.

Grenville

Hi Grenville,

Does that mean that I have to run the suite from the start again? Is there no way to get bit comparability by running just a single cycle?

Alfred

You can rerun cycles but you need to supply the start file that it ran with - you may have to back up to a point in the run where you have all the start files.

Grenville

Sorry Grenville, I’m getting a bit confused.

Just to be clear:

→ u-dd727 began with dd727a.astart which was reconfigured from by230a.da30900101_00
→ The suite started at 18500101T0000Z and generates dump files every 3 months
→ The suite is still running currently

→ I want to re-run the cycle 18911001T0000Z for which I have the dump dd727a.da18911001_00
→ I have a copy of u-dd727 to use for re-running cycles whilst u-dd727 continues the main production run

How should I set up the start files in the copied suite so that it can run cycle 18911001T0000Z?

Thanks,
Alf

Hi Alfred,

Regarding the question about the revision number for postproc_2.3_archer_fixes, you can just remove the @3205 and let it pick up the latest revision.

If the suite is already running then you will need to reload and then re-insert the extract and build tasks.

puma2$ cd ~/roses/<suiteid>
puma2$ rose suite-run --reload
puma2$ cylc insert --no-check <suiteid> fcm_make_pp.<cycle-point>

The fcm_make_pp should then show up in the cylc GUI - you may need to manually trigger it.
Once that has run, insert the fcm_make2_pp task:

puma2$ cylc insert --no-check <suiteid> fcm_make2_pp.<cycle-point>

Where <cycle-point> is the current active cycle (e.g. 19900101T0000Z)

Regards
Ros

Hi Ros,

Thank you! That appears to have worked and I can see that the python scripts in share/fcm_make_pp/bin/ are showing the changes implemented in revision 5095.

Thank you for your help!

Alfred

In the copied suite, set astart to dd727a.da18911001_00 and set BITCOMP_NRUN to true.

Hi Grenville,

Thank you for clarifying. I’ve done as above but the results still show a mismatch with the original cycle.

Would it be right to assume now that my version of UKESM1.1 is not bit comparable?

Alfred

I need to talk to some UKESM developers before saying definitively - give me a few days

Grenville

Hi Grenville,

I have found the solution!
In addition to the settings mentioned above, l_nrun_as_crun needs to be true in um → Top Level Model Control → Run Control and Time Settings.

Overall, the following settings are needed to ensure bit comparability:

  1. Recon disabled
  2. astart set to the desired dump file
  3. BITCOMP_NRUN = true
  4. l_nrun_as_crun = true
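A quick way to confirm that switches such as these are set in a copied suite is to search its configuration files for them. The sketch below is a rough illustration, not a Rose API. It assumes the configuration files are INI-like text with `[section]` headers and `key=value` lines, and that a leading `!` on a key marks the setting as ignored. The function names are hypothetical, and which file and section each switch lives in is not assumed here.

```python
def find_setting(conf_text, key):
    """Return (section, value, ignored) for each occurrence of `key`.

    `ignored` is True when the key is prefixed with '!' (assumed to mean
    the setting is switched off in the configuration).
    """
    matches = []
    section = ""
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]
            continue
        name, sep, value = line.partition("=")
        if not sep:
            continue  # not a key=value line
        name = name.strip()
        if name.lstrip("!") == key:
            matches.append((section, value.strip(), name.startswith("!")))
    return matches


def is_true(value):
    """Accept both 'true' and Fortran-style '.true.' spellings."""
    return value.strip().strip(".").lower() == "true"
```

For example, reading a configuration file into a string and calling `find_setting(text, "l_nrun_as_crun")` lists every place the switch appears, and `is_true` can then be applied to each value that is not ignored.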

Thank you so much for your time on these issues!

Regards,
Alfred