Cycle point for restarting suites?

Hi,

I’m trying to follow the instructions here to restart HadGEM3 suites (http://cms.ncas.ac.uk/wiki/pumatest-transition), but all suites fail at the coupled step.

Could this be because restart files do not exist for this cycle point/the warm start is pointing to the wrong restart files for this cycle point?

  1. I first tried to start from the last completed month in the ARCHER2 /work/n02/n02/radiam24/archive/u-exptid directory, but this failed in the coupled step and I got errors in the model output files - e.g. in ocean.output from NEMO, there was:

===>>> : E R R O R
         ===========

 ===>>>> : problem with nittrc000 for the restart
 verify the restart file or rerun with nn_rsttr = 0 (namelist)

===>>> : E R R O R
         ===========

STOP
Critical errors in NEMO initialisation
huge E-R-R-O-R : immediate stop

and in output from the UM, there were a lot of MPICH errors.

  2. I then tried to restart from the cycle point the model was at when PUMA went down, but got the following error:
    TypeError: not enough arguments for format string
    [FAIL] run_model <<'__STDIN__'
    [FAIL]
    [FAIL] '__STDIN__' # return-code=1
    2022-02-17T13:54:15Z CRITICAL - failed/EXIT

So will I need to restart from the last cycle point I know has a complete set of restart dumps? This is a bit inconvenient since I only have restart dumps from January and December each year.

Best wishes,

Rachel

Hi Rachel,

What’s the suite id? And please post which cycle points you tried to restart from.

You should be able to see from the log files on ARCHER2 how far the last cycle got, to work out which cycle you need to start from.
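For example, something like this (a sketch - substitute your username and suite id) will list the cycle directories in the job log and show how far a suite got:

archer2$ ls /work/n02/n02/<username>/cylc-run/<suite-id>/log/job/ | sort | tail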

Cheers,
Ros.

Hi Ros,

I tried

Suite     1. Cycle point (last archived on ARCHER2)   2. Cycle point (running when PUMA went down)
u-cl809   18500201T0000Z                               18500301T0000Z
u-ck767   18700701T0000Z                               18700801T0000Z
u-ck831   18620301T0000Z                               18620401T0000Z

The first set of cycle points (column 1) all gave the first error in my last post, and the second set (column 2) all gave the second error.

I also noticed the errors are similar to those here: Issues with a warm start using pumatest - #4 by aschurer.

Rachel

Hi Rachel,

The cycle points that were running when PUMA went down were where you needed to restart from.

Trying to back up even further has confused things - I'm not entirely sure what happened. I've only looked at u-cl809 so far. If you look in the job.out file you'll see more informative errors indicating that the UM, NEMO & CICE restarts are now out of sync.

[INFO] The NEMO restart data does not match the current cycle time.
       Cycle time is 18500301
       NEMO restart time is 18500201
[INFO] Remove any NEMO dumps currently ahead of the current cycletime, and pick up the dump at this time

So for u-cl809 you will need to follow the instructions at https://code.metoffice.gov.uk/trac/moci/wiki/tips_CRgeneral#RestartingFailingSuites to sort out the restarts and rerun from 18500201.
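Once the restarts are consistent, the warm start itself should just be along these lines (a sketch - run from the suite directory on pumatest):

rose suite-run -- --warm 18500201T0000Z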

I will look at the other 2 suites and check they are in the same situation.

Regards,
Ros.

Hi Ros,

Thanks so much for clarifying. I tried this (and followed the instructions very carefully, so the timestamps of the required start files did match up), but the suite failed anyway with an initialisation error. (I can't be more specific about the error, sorry, as I didn't make a note of it before trying to start the run again, and I think Rose Bush is still down.)

Since it was only 2 months into the run, I then tried to start the run over again from the beginning, but this also failed (which I should have realised it would, as the cylc-run directory hasn't been copied over to PUMATest).

Before PUMA went down I saved most, but not all, of the data from the History_Data directory for this suite in another directory, and I do have access to the previous .xhist files for this suite. So is it worth trying to restart this suite, as well as my other previously running suites, given they will probably fail with the same initialisation error? Or is there another way of restarting suites I could try?

Best wishes,

Rachel

Hi Rachel,

As you've tried to restart from the beginning again, I won't be able to see what went wrong after following those Met Office instructions. Since it's only 2 months in, it will be easiest just to start the suite again from the beginning.

rose suite-run --new

to give a clean run.

The other suites look to be further into the run, so it's worth going through those instructions on u-ck767 and seeing if the warm start then works OK. If it doesn't, please let us know so we can take a look - don't try anything else with it, otherwise it makes it very difficult to see what's going on.

Rose Bush is on pumatest. The URL just changed slightly, in line with the hostname change: https://pumatest.nerc.ac.uk/rose-bush/

Cheers,
Ros.

P.S.

In your ~/.profile please change the SCRATCH directory to be:

export SCRATCH=/export/puma/data-01/scratch/$USER

Hi Rachel,

In the site/archer2.rc file in the suites, in the [[EXTRACT_RESOURCE]] section, can you please comment out (#) the line:

# script = "rose task-run --verbose --define=fast-dest-root-orig=$SCRATCH --define='args=--archive --ignore-lock'"

So it does not use the scratch disk.
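If the suite is already running, reloading it should pick the change up - something like this from the suite directory on pumatest (a sketch):

rose suite-run --reload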

Thanks.
Cheers,
Ros.

Hi Ros,

Thanks for the help. I've now started u-cl809 from the beginning, and it is running fine. Should I also comment out the script = "rose task-run --verbose --define=fast-dest-root-orig=$SCRATCH --define='args=--archive --ignore-lock'" line for this suite, and reload it?

I am now trying to restart u-ck767 from 18700701T0000Z, and followed the Met Office instructions you shared to get consistent datestamps. I then tried to restart it using:
rose suite-run -- --warm 18700701T0000Z

This suite then failed in the coupled step with an MPICH error in job.err, preceded by the same warning I previously got with u-cl809, which looks like:

???
??? WARNING ???
? Warning code: -10
? Warning from routine: INIT_PP_CRUN
? Warning message: Error: Failed to load fixed header.
? Warning from processor: 0
? Warning number: 34

Could you please let me know what to try next?

Best wishes,

Rachel

Hi Rachel

Try moving /home/n02/n02/radiam24/cylc-run/u-ck767/work/18700801T0000Z away (it should be OK to delete it), then warm start 18700701T0000Z again.
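For example (the destination name here is just an illustration - anywhere outside the suite's work directory will do):

archer2$ mv /home/n02/n02/radiam24/cylc-run/u-ck767/work/18700801T0000Z /home/n02/n02/radiam24/18700801T0000Z.old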

Grenville

Hi,

I tried that but got the same error.

Best wishes,

Rachel

Hi,

Is there anything else I should try? Or should I make a new suite with the same model configuration as I used for u-ck767, and initialise the run from the last set of restart dumps I have for u-ck767?

Best wishes,

Rachel

Hi Rachel,

I think this error is due to an incorrect namelist file. Please try copying in the NEMO namelist from the end of the previous cycle:

archer2$ cp ~/cylc-run/u-ck767/work/18700601T0000Z/coupled/namelist_cfg ~/cylc-run/u-ck767/share/data/History_Data/NEMOhist/namelist_cfg

Then warm start from 18700701T0000Z again.

Regards,
Ros

For info, the error in job.err is a generic MPI error; the more detailed error in ocean.output is:

===>>> : E R R O R
         ===========

  ===>>>> : problem with nittrc000 for the restart
  verify the restart file or rerun with nn_rsttr = 0 (namelist)

Hi Ros,

Thanks, I retriggered and ran the 18700701 coupled task. The run now fails in the postproc_atmos task with the same error as in Issues with a warm start using pumatest - #11 by aschurer.

I checked in the 18700701 History_Data directory and it contains old files of 0 length from my last attempt at running the task (2 days ago):

-rw-r--r-- 1 radiam24 n02 0 Feb 21 13:29 ck767a.p11870jun
-rw-r--r-- 1 radiam24 n02 0 Feb 21 13:29 ck767a.p21870jun
-rw-r--r-- 1 radiam24 n02 0 Feb 21 13:29 ck767a.p31870jun
-rw-r--r-- 1 radiam24 n02 0 Feb 21 13:29 ck767a.p41870jun
-rw-r--r-- 1 radiam24 n02 0 Feb 21 13:29 ck767a.p51870jun
-rw-r--r-- 1 radiam24 n02 0 Feb 21 13:29 ck767a.p618700621
-rw-r--r-- 1 radiam24 n02 0 Feb 21 13:29 ck767a.p718700621
-rw-r--r-- 1 radiam24 n02 0 Feb 21 13:29 ck767a.p818700621
-rw-r--r-- 1 radiam24 n02 0 Feb 21 13:29 ck767a.p918700621
-rw-r--r-- 1 radiam24 n02 0 Feb 21 13:29 ck767a.pa1870jun
-rw-r--r-- 1 radiam24 n02 0 Feb 21 13:29 ck767a.pb18700401
-rw-r--r-- 1 radiam24 n02 0 Feb 21 13:29 ck767a.pc18700401
-rw-r--r-- 1 radiam24 n02 0 Feb 21 13:29 ck767a.pd1870jun
-rw-r--r-- 1 radiam24 n02 0 Feb 21 13:29 ck767a.pe1870jun
-rw-r--r-- 1 radiam24 n02 0 Feb 21 13:29 ck767a.pf18700401
-rw-r--r-- 1 radiam24 n02 0 Feb 21 13:29 ck767a.pg18700401
-rw-r--r-- 1 radiam24 n02 0 Feb 21 13:29 ck767a.ph1870jun
-rw-r--r-- 1 radiam24 n02 0 Feb 21 13:29 ck767a.pi18700401
-rw-r--r-- 1 radiam24 n02 0 Feb 21 13:29 ck767a.pj18700401
-rw-r--r-- 1 radiam24 n02 0 Feb 21 13:29 ck767a.pk1870jun
-rw-r--r-- 1 radiam24 n02 0 Feb 21 13:29 ck767a.pl1870jun
-rw-r--r-- 1 radiam24 n02 0 Feb 21 13:29 ck767a.pn18700621
-rw-r--r-- 1 radiam24 n02 0 Feb 21 13:29 ck767a.pt18700621
-rw-r--r-- 1 radiam24 n02 0 Feb 21 13:29 ck767a.pu1870jun
-rw-r--r-- 1 radiam24 n02 0 Feb 21 13:29 ck767a.pv1870jun

Am I ok to delete the old 0-length files and retrigger the postproc_atmos task? Or do I need to delete a lot more of the files in History_Data and retrigger from the coupled task?

Best wishes,

Rachel

Hi Rachel,

That’s good to hear it started running ok again.

Yes, delete those zero-length files in the History_Data directory and retrigger the postproc_atmos task.
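For example, something along these lines (a sketch - the find pattern assumes only the zero-length ck767a.p* files you listed are affected):

archer2$ find /work/n02/n02/radiam24/cylc-run/u-ck767/share/data/History_Data -maxdepth 1 -name 'ck767a.p*' -size 0 -delete
pumatest$ cylc trigger u-ck767 'postproc_atmos.18700701T0000Z'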

Cheers,
Ros.

Hi Ros,

Thanks, u-ck767 is running fine now.

However, I then tried to run u-ck831 from cycle point 18620301T0000Z after applying all the same fixes, because it previously failed with the same error as u-ck767. (I also used cp ~/cylc-run/u-ck831/work/18620201T0000Z/coupled/namelist_cfg ~/cylc-run/u-ck831/share/data/History_Data/NEMOhist/namelist_cfg; 18620201T0000Z completed successfully on ARCHER2 before PUMA went down.)

u-ck831 failed in the coupled step:

Atm_Step: Timestep 315378 Model time: 1862-03-01 06:00:00
update_dpsidt: updating based on existing values
update_pattern: updating coeffc and coeffs
ERROR detected in routine STWORK
: no. of output fields (=12721) exceeds no. of reserved PP headers for unit 28
STWORK: Error when processing diagnostic section 3, item 227, code 4

???
???!!!???!!!???!!!???!!!???!!! ERROR ???!!!???!!!???!!!???!!!???!!!
? Error code: 4
? Error from routine: STWORK
? Error message: STWORK: Number of fields exceeds reserved headers for unit 28
? Error from processor: 0
? Error number: 39
???

I could increase the number of headers, but cycle 18620301T0000Z ran successfully before PUMA went down, so is the real problem something else?

Rachel

P.S. Could it be that the History_Data directory contains data from previous runs that is confusing the model? In pe_output/ck831.fort6.000:

IO: Open: /work/n02/n02/radiam24/cylc-run/u-ck831/share/data/History_Data/ck831a.pa1862feb on unit 11 (Server = 11)
loadHeader: The file on unit 11 does not have enough content to load the header

???
??? WARNING ???
? Warning code: -10
? Warning from routine: INIT_PP_CRUN
? Warning message: Error: Failed to load fixed header.
? Warning from processor: 0
? Warning number: 13

I checked and /work/n02/n02/radiam24/cylc-run/u-ck831/share/data/History_Data/ck831a.pa1862feb is one of several zero-length files created just now by the coupled task.

Rachel

The error is coming from the atmosphere model - it has nothing to do with restarting the model. I can’t see output for a previous run of cycle 18620301T0000Z, but I can only suppose that the previous run did not get as far as time step 315378.
I’d not worry about the warning - the pa file re-initialises each month and ck831a.pa1862feb.pp is in the archive directory.
postproc will complain about the empty files, though.

Grenville.

Hi Grenville,

Thanks.

Where would I find which stream is attached to unit 28?

I can’t find any reference to unit 28 in /work/n02/n02/radiam24/cylc-run/u-ck831/log/job/18620301T0000Z/coupled/01/job.out.

Best wishes,

Rachel

Hi Rachel,

Sometimes suites are not set up to put the pe0 output into the job.out file.

Have a look at: /work/n02/n02/radiam24/cylc-run/u-ck831/work/18620301T0000Z/coupled/pe_output/ck831.fort6.pe000

FILE_MANAGER: Assigned : /work/n02/n02/radiam24/cylc-run/u-ck831/share/data/History_Data/ck831a.pc18620101
FILE_MANAGER:          : id   : pp2
FILE_MANAGER:          : Unit :  28 (portio)
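For instance, something like this will pull out the assignment (the two lines before the match show the file and the stream id):

archer2$ grep -B2 'Unit :  28' /work/n02/n02/radiam24/cylc-run/u-ck831/work/18620301T0000Z/coupled/pe_output/ck831.fort6.pe000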

Cheers,
Ros.

Hi Ros,

Thanks. I changed reinit_step from 90 to 60 for pp2 (and pp5 and pp9), which fixed that problem. Do I need to change reinit_step for all the other streams as well, to be consistent?
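For reference, settings like reinit_step live in the UM app's rose-app.conf under each stream's namelist section - something like this sketch, though the exact section names may vary with UM version:

[namelist:nlstcall_pp(pp2)]
reinit_step=60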

However, the coupled step still fails, with a new error in job.err that looks a bit like the segmentation fault in http://cms.ncas.ac.uk/ticket/2429?cversion=0&cnum_hist=1

The error I get is:

[0] exceptions: Program is "/mnt/lustre/a2fs-work2/work/n02/n02/radiam24/cylc-run/u-ck831/work/18620301T0000Z/coupled/./atmos.exe"
Warning in umPrintMgr: umPrintExceptionHandler : Handler Invoked
[0] exceptions: Data address (si_addr): 0x00000000; rip: 0x005963ea
[0] exceptions: [backtrace]: has 14 elements:
[0] exceptions: [backtrace]: ( 1) : Address: [0x005963ea]
[0] exceptions: [backtrace]: ( 1) : stwork_ (* Cannot Locate *)
[0] exceptions: [backtrace]: ( 2) : Address: [0x00416df4]
[0] exceptions: [backtrace]: ( 2) : signal_do_backtrace_linux in file /mnt/lustre/a2fs-work2/work/n02/n02/radiam24/cylc-run/u-ck831/share/fcm_make_um/extract/um/src/control/c_code/exceptions/exceptions-platform/exceptions-linux.c line 81
[0] exceptions: [backtrace]: ( 3) : Address: [0x00415b88]
[0] exceptions: [backtrace]: ( 3) : signal_do_backtrace in file /mnt/lustre/a2fs-work2/work/n02/n02/radiam24/cylc-run/u-ck831/share/fcm_make_um/extract/um/src/control/c_code/exceptions/exceptions.c line 81
[0] exceptions: [backtrace]: ( 4) : Address: [0x2b84e0a3c2d0]
[0] exceptions: [backtrace]: ( 4) : ?? (* Cannot Locate *)
[0] exceptions: [backtrace]: ( 5) : Address: [0x005963ea]
[0] exceptions: [backtrace]: ( 5) : stwork_ (* Cannot Locate *)
[0] exceptions: [backtrace]: ( 6) : Address: [0x005926b4]
[0] exceptions: [backtrace]: ( 6) : stash_ (* Cannot Locate *)
[0] exceptions: [backtrace]: ( 7) : Address: [0x0097eeca]
[0] exceptions: [backtrace]: ( 7) : st_diag1_ (* Cannot Locate *)
[0] exceptions: [backtrace]: ( 8) : Address: [0x00ac5187]
[0] exceptions: [backtrace]: ( 8) : atm_step_4a_ (* Cannot Locate *)
[0] exceptions: [backtrace]: ( 9) : Address: [0x00447f4e]
[0] exceptions: [backtrace]: ( 9) : u_model_4a_ (* Cannot Locate *)
[0] exceptions: [backtrace]: ( 10) : Address: [0x0040fc08]
[0] exceptions: [backtrace]: ( 10) : um_shell_ (* Cannot Locate *)
[0] exceptions: [backtrace]: ( 11) : Address: [0x00408698]
[0] exceptions: [backtrace]: ( 11) : main (* Cannot Locate *)
[0] exceptions: [backtrace]: ( 12) : Address: [0x00408698]
[0] exceptions: [backtrace]: ( 12) : main (* Cannot Locate *)
[0] exceptions: [backtrace]: ( 13) : Address: [0x2b84e0c6c34a]
[0] exceptions: [backtrace]: ( 13) : ?? (* Cannot Locate *)
[0] exceptions: [backtrace]: ( 14) : Address: [0x0040849a]
[0] exceptions: [backtrace]: ( 14) : _start in file /home/abuild/rpmbuild/BUILD/glibc-2.26/csu/../sysdeps/x86_64/start.S line 122
[0] exceptions:
[0] exceptions: To find the source line for an entry in the backtrace;
[0] exceptions: run addr2line --exe=</path/too/executable>
[0] exceptions: where address is given as [0x] above
[0] exceptions:
srun: error: nid002159: task 0: Exited with exit code 11
srun: launch/slurm: _step_signal: Terminating StepId=1158426.0+0
srun: launch/slurm: _step_signal: Terminating StepId=1158426.0+1
slurmstepd: error: *** STEP 1158426.0+1 ON nid005786 CANCELLED AT 2022-02-25T12:56:10 ***
slurmstepd: error: *** STEP 1158426.0+0 ON nid002159 CANCELLED AT 2022-02-25T12:56:10 ***
slurmstepd: error: *** STEP 1158426.0+2 ON nid005787 CANCELLED AT 2022-02-25T12:56:10 ***
srun: launch/slurm: _step_signal: Terminating StepId=1158426.0+2
srun: error: nid005787: tasks 0-1: Terminated
srun: error: nid005840: tasks 2-3: Terminated
srun: error: nid005786: task 7: Terminated
srun: error: nid002160: tasks 50-99: Terminated
srun: error: nid002159: tasks 1-49: Terminated
srun: error: nid002161: tasks 100-148: Terminated
srun: error: nid005860: tasks 4-5: Terminated
srun: Force Terminated StepId=1158426.0+2
srun: error: nid005785: tasks 149-197: Terminated
srun: Force Terminated StepId=1158426.0+0
srun: error: nid005786: tasks 0-6,8-127: Terminated
srun: Force Terminated StepId=1158426.0+1
[FAIL] run_model <<'__STDIN__'
[FAIL]
[FAIL] '__STDIN__' # return-code=143
2022-02-25T12:56:11Z CRITICAL - failed/EXIT
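For reference, the lookup the backtrace suggests for the failing stwork_ frame would presumably be something like this (it needs the executable to have been built with debug symbols to return a useful source line):

archer2$ addr2line --exe=/mnt/lustre/a2fs-work2/work/n02/n02/radiam24/cylc-run/u-ck831/work/18620301T0000Z/coupled/atmos.exe 0x005963ea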

Rachel