Suite fails to restart

Hi,

One of my suites (u-cr613) stopped running at cycle time 2250-01-01 over Christmas, and I can’t restart it. I tried both rose suite-run --restart and rose suite-run -- --warm 22500101T0000Z.
I have checked that the atmosphere, ocean and ice restart files exist.
How should I get this running again?

I get the error message:

rose suite-run -- --warm 22500101T0000Z
[INFO] export CYLC_VERSION=7.8.12
[INFO] export ROSE_ORIG_HOST=pumanew.novalocal
[INFO] export ROSE_SITE=
[INFO] export ROSE_VERSION=2019.01.3
[INFO] create: log.20230111T110851Z
[INFO] delete: log
[INFO] symlink: log.20230111T110851Z <= log
[INFO] log.20221129T105834Z.tar.gz <= log.20221129T105834Z
[INFO] delete: log.20221129T105834Z/
[INFO] create: log/suite
[INFO] create: log/rose-conf
[INFO] symlink: rose-conf/20230111T110856-run.conf <= log/rose-suite-run.conf
[INFO] symlink: rose-conf/20230111T110856-run.version <= log/rose-suite-run.version
[INFO] REGISTERED u-cr613 -> /home/radiam24/cylc-run/u-cr613
[INFO] WARNING - deprecated items were automatically upgraded in 'suite definition':
[INFO] WARNING - * (6.11.0) [runtime][RETRIES][retry delays] -> [runtime][RETRIES][job][execution retry delays] - value unchanged
[FAIL] ssh -oBatchMode=yes -n login4.archer2.ac.uk env\ ROSE_VERSION=2019.01.3\ CYLC_VERSION=7.8.12\ bash\ -l\ -c\ '"$0"\ "$@"'\ rose\ suite-run\ -vv\ -n\ u-cr613\ --run=run\ --remote=uuid=a3058fc0-91a9-47a2-80ef-7eaf81fce39d,now-str=20230111T110851Z,root-dir='$DATADIR' # return-code=1, stderr=
[FAIL] [FAIL] 2023-01-11T11:15:01+0000 [Errno 2] No such file or directory: 'log.20221129T105834Z.tar.gz'

Many thanks.

Best wishes,

Rachel

Hi Rachel,

On ARCHER2 try tar’ing up and gzip’ing the log.20221129T105834Z directory to create the missing .tar.gz file.

cd /home/n02/n02/radiam24/cylc-run/u-cr613
tar -cf log.20221129T105834Z.tar log.20221129T105834Z
gzip log.20221129T105834Z.tar
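
If it helps, the two commands above can be folded into one step with a quick check that the archive is readable. This is only a sketch: archive_log_dir is a made-up helper name, not part of Rose or Cylc, and it assumes you run it from the suite's cylc-run directory.

```shell
# Hypothetical helper (not a Rose/Cylc command): rebuild the .tar.gz
# that the failed run expected to find for an old log directory.
archive_log_dir() {
    log_dir=${1%/}    # drop any trailing slash from the directory name
    if [ ! -d "$log_dir" ]; then
        echo "no such directory: $log_dir" >&2
        return 1
    fi
    # -c create, -z gzip, -f output file: same result as tar -cf then gzip.
    tar -czf "$log_dir.tar.gz" "$log_dir" || return 1
    # List the archive contents to confirm it can be read back.
    tar -tzf "$log_dir.tar.gz" > /dev/null
}
```

For this suite the call would be archive_log_dir log.20221129T105834Z. The original log directory is left in place.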

Then try restarting it again. Now that you’ve attempted a --warm start, I suspect a normal --restart won’t work, but it might be worth trying.

PUMA rebooted over Christmas. If this happens again, all you need to do is re-add your id_rsa_archerum key to your ssh-agent and then do a rose suite-restart. This is much safer than a --warm start, which is a last resort.
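
As a rough sketch of that recovery sequence: the helper name below is invented, the key location is whatever you pass in (I am not assuming where your id_rsa_archerum lives), and the --name option is my assumption about how to pass the suite name to rose suite-restart, so check rose suite-restart --help first.

```shell
# Hypothetical helper (not a Rose command): reload the ssh key after a
# PUMA reboot, then restart the suite from where it stopped.
reload_key_and_restart() {
    key=$1      # path to the private key, e.g. your id_rsa_archerum file
    suite=$2    # suite name, e.g. u-cr613
    if [ ! -f "$key" ]; then
        echo "key not found: $key" >&2
        return 1
    fi
    # Needs a running ssh-agent; prompts for the passphrase if the key has one.
    ssh-add "$key" || return 1
    # Assumption: --name selects the suite; otherwise run from the suite directory.
    rose suite-restart --name="$suite"
}
```

Nothing is restarted if the key file is missing or ssh-add fails.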

Regards,
Ros.

Hi Ros,

Thanks, I tried that but the suite failed again. The failure may be because of the UM - there is a job.err file with MPICH errors, but no new output from NEMO or CICE, and no NEMOhist/ocean.output or CICEhist/ice_diag.d files.

What do you suggest?

Best wishes,

Rachel

Hi Rachel,

In ocean.output NEMO has failed with:

===>>> : E R R O R
     ===========

 stpctl: the zonal velocity is larger than 20 m/s

I note that the coupled task for this cycle had already failed 3 times with an MPICH error, on 22 & 23 December, before PUMA rebooted, so I don’t think it has anything to do with the restart.

The stpctl message is a catch-all NEMO error that usually indicates a model instability. I think you will need to talk to a NEMO expert, someone at NOC.
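
For future failures, one way to pull the error banner out of ocean.output is a grep for the spaced-out "E R R O R" text. A minimal sketch, assuming a grep that supports -A (GNU and BSD grep do); show_nemo_errors is a made-up name and the file path is whatever your run directory uses.

```shell
# Hypothetical helper: print each NEMO error banner in an ocean.output
# file, with line numbers and the four lines that follow it.
show_nemo_errors() {
    output_file=$1
    if [ ! -f "$output_file" ]; then
        echo "no such file: $output_file" >&2
        return 1
    fi
    # grep exits non-zero when no banner is found, so the helper does too.
    grep -n -A 4 'E R R O R' "$output_file"
}
```

A non-zero return with no output means the file exists but contains no error banner.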

Cheers,
Ros.

Thanks, you’re right and I’ve fixed the other problem now.

Best wishes,

Rachel
