Hi,
I have run 20 years (1850-1870) of a coupled UKESM job, u-dt663, on Archer2. I had added a few stash items and would like to continue running the model from 1870 onwards but avoid the issues that can occur when restarting a job (i.e. I would like the model run to proceed as if it had never stopped). My understanding is that, as I have added new stash items, I can’t just do a rose-suite run --reload and trigger the next coupled task.
To check if I could get bit comparability following a restart, I extracted the 5 initialisation files (1 atmosphere dump, 2 ocean, 1 iceberg, 1 sea ice) for 1st Jan 1869 from u-dt663, and ran 1 month of a copy of u-dt663, u-dt998, with BITCOMP_NRUN = true, l_nrun_as_crun = true, RECON = false and the astart set to ‘dt663a.da18690101_00’. This follows the approach in ‘Bit comparability for re-run cycles’.
However, u-dt998 does not bit compare with u-dt663 (which is running 6m resubmissions as opposed to u-dt998’s 1 month resubmission) for Jan 1869. In a separate suite I tried setting ancil_reftime to the actual date but this returned the same output as u-dt998.
I am therefore concerned that trying to restart u-dt663 from 1st Jan 1870 by pointing it to the 5 initialisation files could cause an issue.
Is there anything needed to avoid this issue when continuing a run?
Many thanks for your help,
James
Have you tried – I don’t see why this would not work. It might be worth putting the new stash in new files, but that’s probably not needed either.
Grenville
Hi Grenville,
Sorry, I meant ‘rose-suite run --restart’. When I run this and retrigger the coupled task it fails with
[0] ???!!!???!!!???!!!???!!!???!!! ERROR ???!!!???!!!???!!!???!!!???!!!
[0] ? Error code: 5
[0] ? Error from routine: UM_READDUMP
[0] ? Error message: UM_READDUMP Dump does not match STASH list
[0] ? Error from processor: 47
[0] ? Error number: 47
There is nothing in the coupled/pe_output directory.
When doing the rose-suite run --restart, I think it looks in ~/cylc-run/u-dt663/share/data/History_Data and I have put all the initialisation files for 1870 in there.
Thanks,
James
James
OK, so back to bit comparison.
The jobs must run with the same dump frequency. I can’t see which files were being compared - u-dt663 has Dec 1896 data, but u-dt998 has Jan 1869 data. I’d suggest comparing atmosphere start files (I don’t see one for u-dt998).
Grenville
Hi Grenville,
u-dt663 had a 6 month dump frequency while u-dt998 used a 1 month dump frequency so I took a new copy of u-dt998, u-du093, and set a 6 month dump frequency and ran it for 6 months.
However, u-du093 doesn’t bit compare for Jan 1869 to u-dt663 (I did a mule-cumf between dt663a.p51869jan.pp / du093a.p51869jan.pp and dt663a.pm1869jan.pp / du093a.pm1869jan.pp in /work/n02/n02/jweber/archive).
File 1: dt663a.pm1869jan.pp
File 2: du093a.pm1869jan.pp
Files DO NOT compare
- 0 differences in fixed_length_header (with 7 ignored indices)
- 6865 field differences, of which 6259 are in data
Compared 7270/7270 fields, with 405 matches
Maximum RMS diff as % of data in file 1: 5004128.90625 (field 4132)
Maximum RMS diff as % of data in file 2: 13485.289001464844 (field 5108)
///
File 1: dt663a.p51869jan.pp
File 2: du093a.p51869jan.pp
Files DO NOT compare
- 0 differences in fixed_length_header (with 7 ignored indices)
- 5219 field differences, of which 4602 are in data
Compared 5227/5227 fields, with 8 matches
Maximum RMS diff as % of data in file 1: 440508.49609375 (field 779)
Maximum RMS diff as % of data in file 2: 798.71129989624023 (field 2233)
I can’t see an astart in ~/cylc-run/u-du093/share/data but I think that is because I had recon turned off and had set the astart in u-du093/app/um/rose-app.conf to /work/n02/n02/jweber/dump_files/u-dt663/dt663a.da18690101_00.
Best
James
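For anyone tracking several of these comparisons, the mule-cumf summaries quoted above can be reduced to a few numbers with a short script. This is only a rough sketch: the function name `parse_cumf_summary` is made up for illustration, and it assumes the summary text has exactly the wording pasted in this thread (it does not call mule itself).

```python
import re


def parse_cumf_summary(text):
    """Pull the headline counts out of one pasted mule-cumf summary block.

    Assumes the wording shown in this thread; lines that are absent are
    simply left out of the returned dictionary.
    """
    summary = {"reported_mismatch": "Files DO NOT compare" in text}

    m = re.search(r"(\d+) field differences, of which (\d+) are in data", text)
    if m:
        summary["field_differences"] = int(m.group(1))
        summary["data_differences"] = int(m.group(2))
        # Differences that are not in the field data itself
        summary["non_data_differences"] = (
            summary["field_differences"] - summary["data_differences"]
        )

    m = re.search(r"Compared (\d+)/(\d+) fields, with (\d+) matches", text)
    if m:
        summary["compared"] = int(m.group(1))
        summary["total"] = int(m.group(2))
        summary["matches"] = int(m.group(3))

    return summary


# Summary text copied from the pm comparison in this thread
example = """\
File 1: dt663a.pm1869jan.pp
File 2: du093a.pm1869jan.pp
Files DO NOT compare
- 0 differences in fixed_length_header (with 7 ignored indices)
- 6865 field differences, of which 6259 are in data
Compared 7270/7270 fields, with 405 matches
"""

result = parse_cumf_summary(example)
print(result)
```

For the pm file this gives 6865 differing fields plus 405 matching fields, which accounts for all 7270 fields compared; 6259 of the differing fields differ in their data.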
James
Are u-dt663 and u-du093 identical suites or do they differ through stash?
Grenville
Hi Grenville,
I believe they had identical stash. The 1869 data generated by u-dt663 used revision 334546 and u-du093 was done at revision 336138.
The revision comparison below suggests they had the same stash.
https://code.metoffice.gov.uk/trac/roses-u/changeset?old_path=%2Fd%2Ft%2F6%2F6%2F3&old=334546&new_path=%2Fd%2Fu%2F0%2F9%2F3&new=336138
Best,
James
James
Try this - in site/archer2.rc, change
module load cce/15.0.0
to
module load cpe/22.12
rebuild & rerun.
Grenville
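If the same change needs applying to more than one suite copy, the module swap suggested above could be scripted along these lines. This is a sketch only: `swap_module_line` is a hypothetical helper, the sample text is invented (the real contents of site/archer2.rc are not shown in this thread), and it only touches lines that consist of exactly the old module command.

```python
def swap_module_line(rc_text,
                     old="module load cce/15.0.0",
                     new="module load cpe/22.12"):
    """Return (new_text, count) with each line equal to `old` replaced by `new`.

    Leading indentation and the line ending are preserved; lines that merely
    contain `old` alongside other text are left untouched.
    """
    out = []
    count = 0
    for line in rc_text.splitlines(keepends=True):
        if line.strip() == old:
            indent = line[: len(line) - len(line.lstrip())]
            ending = line[len(line.rstrip("\r\n")):]
            out.append(indent + new + ending)
            count += 1
        else:
            out.append(line)
    return "".join(out), count


# Invented sample text, NOT the real site/archer2.rc
sample = "    module load cce/15.0.0\n    echo done\n"
updated, count = swap_module_line(sample)
print(updated, end="")
print(f"{count} line(s) changed")
```

Checking that the count is 1 before writing the file back is a cheap way to confirm the edit landed where expected; the suite still needs rebuilding and rerunning afterwards.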
Thanks, Grenville. Will make that change to u-du093, rebuild and rerun it.
James
Hi Grenville,
Thanks for your help on this. I reran u-du093 but the new output doesn’t bit compare with that from u-dt663.
File 1: du093a.p51869jan.pp
File 2: dt663a.p51869jan.pp
Files DO NOT compare
- 0 differences in fixed_length_header (with 7 ignored indices)
- 5219 field differences, of which 4602 are in data
Compared 5227/5227 fields, with 8 matches
Maximum RMS diff as % of data in file 1: 798.71129989624023 (field 2233)
Maximum RMS diff as % of data in file 2: 440508.49609375 (field 779)
Does the “0 differences in fixed_length_header (with 7 ignored indices)” confirm that the stash names in each file are identical?
Any other things we can try re bit comparability?
Thanks,
James
James
Has this model ever bit compared?
I don’t have any more ideas at the moment.
Grenville
Hi Grenville,
I thought it did bit compare but I will check.
Many thanks for your advice so far,
James