Suite fails to restart

racheldiamond · 11 January 2023 11:28

Hi,

One of my suites (u-cr613) stopped running at cycle time 2250-01-01 over Christmas, and I can’t restart it. I ran both rose suite-run --restart and rose suite-run -- --warm 22500101T0000Z.
I have checked the atmosphere, ocean and ice restart files exist.
How should I get this running again?

I get the error message:

rose suite-run -- --warm 22500101T0000Z
[INFO] export CYLC_VERSION=7.8.12
[INFO] export ROSE_ORIG_HOST=pumanew.novalocal
[INFO] export ROSE_SITE=
[INFO] export ROSE_VERSION=2019.01.3
[INFO] create: log.20230111T110851Z
[INFO] delete: log
[INFO] symlink: log.20230111T110851Z <= log
[INFO] log.20221129T105834Z.tar.gz <= log.20221129T105834Z
[INFO] delete: log.20221129T105834Z/
[INFO] create: log/suite
[INFO] create: log/rose-conf
[INFO] symlink: rose-conf/20230111T110856-run.conf <= log/rose-suite-run.conf
[INFO] symlink: rose-conf/20230111T110856-run.version <= log/rose-suite-run.version
[INFO] REGISTERED u-cr613 -> /home/radiam24/cylc-run/u-cr613
[INFO] WARNING - deprecated items were automatically upgraded in 'suite definition':
[INFO] WARNING - * (6.11.0) [runtime][RETRIES][retry delays] -> [runtime][RETRIES][job][execution retry delays] - value unchanged
[FAIL] ssh -oBatchMode=yes -n login4.archer2.ac.uk env\ ROSE_VERSION=2019.01.3\ CYLC_VERSION=7.8.12\ bash\ -l\ -c\ '"$0"\ "$@"'\ rose\ suite-run\ -vv\ -n\ u-cr613\ --run=run\ --remote=uuid=a3058fc0-91a9-47a2-80ef-7eaf81fce39d,now-str=20230111T110851Z,root-dir='$DATADIR' # return-code=1, stderr=
[FAIL] [FAIL] 2023-01-11T11:15:01+0000 [Errno 2] No such file or directory: 'log.20221129T105834Z.tar.gz'

Many thanks.

Best wishes,

Rachel

RosalynHatcher · 11 January 2023 12:19

Hi Rachel,

On ARCHER2, try tarring and gzipping the log.20221129T105834Z directory to create the missing .tar.gz file:

cd /home/n02/n02/radiam24/cylc-run/u-cr613
tar -cf log.20221129T105834Z.tar log.20221129T105834Z
gzip log.20221129T105834Z.tar
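
As a minimal sketch, the two commands above can also be combined into a single tar call that compresses as it archives. The function name archive_log_dir is hypothetical (it is not part of Rose or Cylc), and the directory names in the usage comment are taken from this thread:

```shell
# Sketch: create a gzip-compressed archive of a log directory in one step.
# archive_log_dir is a hypothetical helper name, not a Rose or Cylc command.
archive_log_dir() {
    run_dir="$1"   # the suite's cylc-run directory on the remote host
    log_dir="$2"   # name of the log directory to archive
    # -C changes into run_dir first, so the archive stores relative paths
    tar -czf "$run_dir/$log_dir.tar.gz" -C "$run_dir" "$log_dir"
}

# Example usage (paths as quoted in this thread):
#   archive_log_dir /home/n02/n02/radiam24/cylc-run/u-cr613 log.20221129T105834Z
```

The original log directory is left in place, as it is with the separate tar and gzip steps.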

Then try restarting it again. Now that you’ve tried a --warm start, I suspect a normal --restart won’t work, but it might be worth trying.

PUMA rebooted over Christmas. If this happens again, all you need to do is re-add your id_rsa_archerum key to your ssh-agent and then do a rose suite-restart. This is much safer than trying a --warm start, which is a last resort.
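
A rough sketch of that recovery sequence, assuming an ssh-agent is already running in the current shell. The function name restart_after_reboot is hypothetical, and the key location in the usage comment is an assumption (only the key name id_rsa_archerum appears in this thread):

```shell
# Sketch: re-add the ARCHER2 key to ssh-agent, then restart the suite.
# restart_after_reboot is a hypothetical helper, not a Rose or Cylc command.
restart_after_reboot() {
    key="$1"     # path to the private key (assumed location, see usage)
    suite="$2"   # suite name
    # Stop here if the key cannot be added (wrong path, or no agent running)
    ssh-add "$key" || return 1
    # Restart the suite from its last recorded state
    rose suite-restart --name="$suite"
}

# Example usage (the key path is an assumption):
#   restart_after_reboot "$HOME/.ssh/id_rsa_archerum" u-cr613
```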

Regards,
Ros.

racheldiamond · 11 January 2023 13:47

Hi Ros,

Thanks, I tried that but the suite failed again. The failure may be because of the UM - there is a job.err file with MPICH errors, but no new output from NEMO or CICE, and no NEMOhist/ocean.output or CICEhist/ice_diag.d files.

What do you suggest?

Best wishes,

Rachel

RosalynHatcher · 11 January 2023 14:11

Hi Rachel,

In ocean.output NEMO has failed with:

===>>> : E R R O R
     ===========

 stpctl: the zonal velocity is larger than 20 m/s

I note that the coupled task for this cycle had already failed 3 times with an MPICH error, on 22 and 23 December, before PUMA rebooted, so I don’t think it has anything to do with the restart.

This is a catchall NEMO error message usually indicating a model instability. I think you will need to talk to a NEMO expert, someone at NOC.
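
For reference, a minimal sketch of how such an error block can be located in ocean.output with grep. The function name show_nemo_errors is hypothetical, and where ocean.output lives depends on the suite, so the path is passed in as an argument:

```shell
# Sketch: print NEMO error banners and the lines that follow them.
# show_nemo_errors is a hypothetical helper name.
show_nemo_errors() {
    output_file="$1"   # path to ocean.output (location depends on the suite)
    # -n adds line numbers; -A 3 prints the 3 lines after each banner line
    grep -n -A 3 'E R R O R' "$output_file"
}

# Example usage:
#   show_nemo_errors ocean.output
```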

Cheers,
Ros.

racheldiamond · 11 January 2023 15:32

Thanks, you’re right and I’ve fixed the other problem now.

Best wishes,

Rachel
