Issues with a warm start using pumatest

After a few issues with the ssh connection, which were fixed by removing all the ssh known host keys and re-adding them by running ssh to the login nodes, I now have an issue which I am not sure about; see the error message below.
Can you advise what I am doing wrong?
I chose this cycle point because 17631001T0000Z is the last cycle to have been transferred to JASMIN:
/gws/nopw/j04/glosat/production/UKESM/raw/u-ck819/17631001T0000Z
Should I have selected a different cycle?
Thanks
Andrew

-bash-4.1$ rose suite-run -- --warm 17640101T0000Z
[INFO] export CYLC_VERSION=7.8.7
[INFO] export ROSE_ORIG_HOST=pumatest.nerc.ac.uk
[INFO] export ROSE_SITE=
[INFO] export ROSE_VERSION=2019.01.3
[INFO] create: log.20220217T132630Z
[INFO] delete: log
[INFO] symlink: log.20220217T132630Z <= log
[INFO] log.20220217T132607Z.tar.gz <= log.20220217T132607Z
[INFO] delete: log.20220217T132607Z/
[INFO] create: log/suite
[INFO] create: log/rose-conf
[INFO] symlink: rose-conf/20220217T132630-run.conf <= log/rose-suite-run.conf
[INFO] symlink: rose-conf/20220217T132630-run.version <= log/rose-suite-run.version
[INFO] REGISTERED u-ck819 -> /home/aschurer/cylc-run/u-ck819
[FAIL] ssh -oBatchMode=yes -n schn02@login4.archer2.ac.uk env\ ROSE_VERSION=2019.01.3\ CYLC_VERSION=7.8.7\ bash\ -l\ -c\ '"$0"\ "$@"'\ rose\ suite-run\ -vv\ -n\ u-ck819\ --run=run\ --remote=uuid=2301fb7a-1fdd-4b72-98d8-c5df7fb32ee5,now-str=20220217T132630Z,root-dir='$DATADIR' # return-code=1, stderr=
[FAIL] [FAIL] 2022-02-17T13:28:40+0000 [Errno 2] No such file or directory: 'log.20220210T153136Z.tar.gz'

Andrew, have you tried more than once?

Please run:

chmod -R g+rX /home/n02/n02/<your-username>
chmod -R g+rX /work/n02/n02/<your-username>

Grenville
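
If it helps, here is a minimal Python sketch for checking which paths would still be unreadable to the group after running the commands above. The function name `missing_group_access` is just illustrative, and it only inspects permission bits; it does not change anything.

```python
import os
import stat


def missing_group_access(root):
    """Return paths under root that lack group read, or (for directories) group execute."""
    problems = []
    for dirpath, _dirnames, filenames in os.walk(root):
        # Check the directory itself, then the files directly inside it.
        for path in [dirpath] + [os.path.join(dirpath, f) for f in filenames]:
            mode = os.lstat(path).st_mode
            if stat.S_ISLNK(mode):
                continue  # permission bits on symlinks are not meaningful
            if not mode & stat.S_IRGRP:
                problems.append(path)
            elif stat.S_ISDIR(mode) and not mode & stat.S_IXGRP:
                problems.append(path)
    return problems
```

An empty list from `missing_group_access("/work/n02/n02/<your-username>")` would suggest the chmod has taken effect for everything the walk could reach.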

Hi Grenville,
Thanks for the help. It has now submitted.

The first try gave this:
[FAIL] [FAIL] 2022-02-17T13:26:14+0000 Cannot call rmtree on a symbolic link
The second gave the error in the previous message:
[FAIL] [FAIL] 2022-02-17T13:28:40+0000 [Errno 2] No such file or directory: 'log.20220210T153136Z.tar.gz'
And the third time it submitted.

However, the coupled job has now failed with the error:
[FAIL] Unable to find iceberg restart files for this cycle. Must either have one rebuilt file, as many as there are nemo processors (108) or both rebuilt and processor files.
[FAIL] Found 0 iceberg restart files
[FAIL] run_model <<'__STDIN__'
[FAIL]
[FAIL] '__STDIN__' # return-code=144
2022-02-17T14:22:35Z CRITICAL - failed/EXIT

I have changed the permissions on ARCHER2 as advised.
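
For reference, my reading of the rule in that error message can be sketched in Python as below. This is only an illustration of the counting rule as the message states it, not the actual check in the run script; the function name and arguments are hypothetical.

```python
def iceberg_restarts_ok(n_rebuilt, n_processor, n_nemo_procs=108):
    """Apply the rule quoted in the error message to two file counts.

    Acceptable combinations: one rebuilt file, one file per NEMO
    processor, or both together.
    """
    full_processor_set = n_processor == n_nemo_procs
    if n_rebuilt == 1 and (n_processor == 0 or full_processor_set):
        return True
    if n_rebuilt == 0 and full_processor_set:
        return True
    return False
```

With the 0 files reported above, `iceberg_restarts_ok(0, 0)` is False, which matches the failure.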

Hi Andrew,

We still can’t see your /work directory on ARCHER2.

chmod -R g+rX /work/n02/n02/schn02

Thanks.
Cheers,
Ros.

Sorry, not sure what went wrong there - I certainly tried to give you permissions!
I’ve tried again - and think it has worked this time.

Yes, that’s got it.

Hi Andrew

Ocean start files for 176401 are in

/work/n02/n02/schn02/archive/u-ck819/17640401T0000Z

Try copying them back to /work/n02/n02/schn02/cylc-run/u-ck819/share/data/History_Data/NEMOhist.

I’d delete the links to restart files in /home/n02/n02/schn02/cylc-run/u-ck819/work/17640101T0000Z/coupled (they are just pointing to non-existent files), then warm start again.

Grenville
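
A minimal Python sketch for finding (and optionally removing) the dangling links, assuming they sit directly in the coupled work directory. The function names are illustrative, and by default nothing is deleted.

```python
import os


def find_dangling_links(directory):
    """Return symlinks directly inside directory whose targets do not exist."""
    dangling = []
    for entry in os.scandir(directory):
        # os.path.exists follows the link, so it is False for a broken link.
        if entry.is_symlink() and not os.path.exists(entry.path):
            dangling.append(entry.path)
    return sorted(dangling)


def remove_dangling_links(directory, dry_run=True):
    """List dangling links; delete them only when dry_run is False."""
    links = find_dangling_links(directory)
    if not dry_run:
        for link in links:
            os.unlink(link)  # removes the link itself, never a target
    return links
```

Calling `remove_dangling_links(path)` first only lists what would go; pass `dry_run=False` once the list looks right.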

Hi Grenville,
I’m pleased to report that it now seems to be running again.

In addition to the above, I also had to copy the CICE restart file to
/work/n02/n02/schn02/cylc-run/u-ck819/share/data/History_Data/CICEhist/
and the atmospheric dump to
/work/n02/n02/schn02/cylc-run/u-ck819/share/data/History_Data/

Thanks for all your help with this.
Andrew

The first coupled task completed successfully, but I now have failures in some of the postprocessing tasks.
This looks like it is due to not having enough files to calculate the winter means, as the December values are missing.
Will it be possible to start from this cycle if it doesn’t have all the files needed to create the seasonal means?
Would I be better off starting from a later cycle point and pushing the pp files to JASMIN manually?
Can I assume that they are all complete?
Or would it be better just to start from the beginning? It has only run 13 years…
Thanks
Andrew

I’ve just realised that the files I think it is looking for are in the archive folder.
So I will attempt to find the files and copy them into the correct folders and relaunch the jobs.
Hopefully that should solve the problem…

I’ve now managed to get all postprocessing tasks to complete apart from postproc_atmos.
This fails with an error:
ValueError: Incorrect size for fixed length header; given 0 words but should be 256.
[FAIL] main_pp.py atmos <<'__STDIN__'
[FAIL]
[FAIL] '__STDIN__' # return-code=1
2022-02-18T18:42:25Z CRITICAL - failed/EXIT
I can’t work out whether this is due to a missing file and, if so, which one.
Could you suggest any solution to this?
Thanks,
Andrew

Check for zero-length files in the archive directory.
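
Something like the following Python sketch would list them, assuming you pass it the directory to search; the function name is illustrative.

```python
import os


def find_empty_files(root):
    """Return regular files under root (searched recursively) that are 0 bytes long."""
    empty = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks so that only real files are reported.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            if os.path.getsize(path) == 0:
                empty.append(path)
    return sorted(empty)
```

The same function can be pointed at any other directory the postprocessing reads from.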

There are no empty files in the archive directory, but there are in the History_Data directory:

-rw-r--r-- 1 schn02 n02 0 Feb 18 15:27 ck819a.p41763dec
-rw-r--r-- 1 schn02 n02 0 Feb 18 15:27 ck819a.p51763dec
-rw-r--r-- 1 schn02 n02 0 Feb 18 15:27 ck819a.pa1763dec
-rw-r--r-- 1 schn02 n02 0 Feb 18 15:27 ck819a.pd1763dec
-rw-r--r-- 1 schn02 n02 0 Feb 18 15:27 ck819a.pe1763dec
-rw-r--r-- 1 schn02 n02 0 Feb 18 15:27 ck819a.ph1763dec
-rw-r--r-- 1 schn02 n02 0 Feb 18 15:27 ck819a.pk1763dec
-rw-r--r-- 1 schn02 n02 0 Feb 18 15:27 ck819a.pm1763dec
-rw-r--r-- 1 schn02 n02 0 Feb 18 15:27 ck819a.pu1763dec
-rw-r--r-- 1 schn02 n02 0 Feb 18 15:27 ck819a.pv1763dec
-rw-r--r-- 1 schn02 n02 0 Feb 18 15:27 ck819a.p617631221
-rw-r--r-- 1 schn02 n02 0 Feb 18 15:27 ck819a.p717631221
-rw-r--r-- 1 schn02 n02 0 Feb 18 15:27 ck819a.p817631221
-rw-r--r-- 1 schn02 n02 0 Feb 18 15:27 ck819a.p917631221
-rw-r--r-- 1 schn02 n02 0 Feb 18 15:27 ck819a.pb17631001
-rw-r--r-- 1 schn02 n02 0 Feb 18 15:27 ck819a.pc17631001
-rw-r--r-- 1 schn02 n02 0 Feb 18 15:27 ck819a.pf17631001
-rw-r--r-- 1 schn02 n02 0 Feb 18 15:27 ck819a.pg17631001
-rw-r--r-- 1 schn02 n02 0 Feb 18 15:27 ck819a.pi17631001
-rw-r--r-- 1 schn02 n02 0 Feb 18 15:27 ck819a.pj17631001
-rw-r--r-- 1 schn02 n02 0 Feb 18 15:27 ck819a.pt17631221
-rw-r--r-- 1 schn02 n02 0 Feb 18 17:25 ck819.stash

I removed the zero-length files and moved over some pp files that had the same names as some of them, and it now seems to work.
Hopefully all the means are still OK…

I think it’s OK

Grenville