Setting up archer2-jasmin archiving in the nesting suite

shakka · 16 August 2023 09:35

I’m trying to set up archiving to jasmin on my suite u-cy223 which runs the global model + pan-Arctic/pan-Antarctic nests (as you’ve been helping me with here)

I want to be able to set up automatic archiving to jasmin, so I’ve been looking at the CANARI suite u-cn134, because it seems to have a lot of useful functionality.

@RosalynHatcher I know you’ve done a lot of work on this suite, how straightforward would it be to do something similar (i.e. output pp streams, convert and postprocess to netCDF and transfer to jasmin)?

And can you help me set it up to archive only model outputs?

grenville · 16 August 2023 11:57

Ella

It looks like u-cy223 is using the UM netcdf output - will you not carry on with that?

You could create a bespoke app (like get_era_data) that rsync’d just the files required - but that’s effectively just what the built in archiving does. We can try to work out the settings for archiving to shift the right files.

I don’t think trying to incorporate the data management from u-cn134 would be worthwhile. The u-cn134 data management is geared specifically for the CANARI project.

How much data do you anticipate moving to JASMIN?

Grenville

shakka · 16 August 2023 12:06

Thanks Grenville. I only want to transfer the output files - am currently in two minds about whether I should transfer as pp files that take up less space and then stashsplit them on jasmin as part of my processing, or if I should output as netcdf in the first place. Am currently leaning towards the former.

Sounds like it’s easier to just write a bespoke transfer script/app then - would very much appreciate some help on it! you can probably see how (not) far I’ve got already…

grenville · 16 August 2023 12:20

Have you got a suite that has run for a while to create its output over several cycles?

shakka · 16 August 2023 13:13

u-cy223 has successfully completed a week (current run is stalling cos I added in some new stash). Otherwise u-cy175 and 173 both finished a week’s worth of simulation I think…?

grenville · 16 August 2023 13:30

I don’t see any output in /home/n02/n02/shakka/cylc-run/u-cy175/share/cycle/*/Arctic/11km/ga7_24-36/um

there is glm/um output for u-cy175

I don’t see any output for the 173 suite?

shakka · 16 August 2023 13:51

Ah yes, I had to delete it to make space to run the others! Doh. Hopefully will have some output from u-cy223 imminently…

grenville · 17 August 2023 10:21

Ella,

I have been trying to understand u-cy223 more and just wanted to check that this is the way you want the suite to run.

The cycle length (CYCLE_INT_HR) is set to 24hrs and CRUN_LEN is set to 6 hrs, yet this creates 6 runs per cycle:

-rw-r--r-- 1 shakka n02 305086464 Aug 16 10:33 umglaa_pa000
-rw-r--r-- 1 shakka n02 209293312 Aug 16 10:43 umglaa_pa006
-rw-r--r-- 1 shakka n02 211345408 Aug 16 10:48 umglaa_pa012
-rw-r--r-- 1 shakka n02 212590592 Aug 16 10:53 umglaa_pa018
-rw-r--r-- 1 shakka n02 213549056 Aug 16 10:59 umglaa_pa024
-rw-r--r-- 1 shakka n02 214069248 Aug 16 11:02 umglaa_pa030

this results in a doubling of some output.

Looking in /work/n02/n02/shakka/cylc-run/u-cy223/share/cycle/19991231T1200Z/glm/um/umglaa_pa018, I see output for U WIND on PRESSURE LEVELS (for example) for 01/01/2000 at 1200

but in /work/n02/n02/shakka/cylc-run/u-cy223/share/cycle/20000101T1200Z/glm/um/umglaa_pa000 , there is output for the same diagnostic at the same time, but the data is different.

I guess you don’t want a continuous 12 month run?

Grenville
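
To illustrate the doubling described above, here is a minimal Python sketch of the valid times covered by two consecutive cycles. It assumes each cycle runs for 36 hours (suggested by the pa000 to pa030 files) and uses a 6-hourly output step purely for illustration; the constant and function names are hypothetical and not taken from the suite.

```python
from datetime import datetime, timedelta

CYCLE_INTERVAL = timedelta(hours=24)   # CYCLE_INT_HR quoted in the thread
FORECAST_LENGTH = timedelta(hours=36)  # assumption: pa000..pa030 cover 36 hours
OUTPUT_STEP = timedelta(hours=6)       # assumption: illustrative output frequency


def valid_times(cycle_start):
    """Return the set of valid times (t+0 to t+36) covered by one cycle."""
    steps = int(FORECAST_LENGTH / OUTPUT_STEP)
    return {cycle_start + OUTPUT_STEP * i for i in range(steps + 1)}


first = datetime(1999, 12, 31, 12)
second = first + CYCLE_INTERVAL

# Valid times that appear in the output of both cycles.
overlap = sorted(valid_times(first) & valid_times(second))
for t in overlap:
    print(t.isoformat())
```

Under these assumptions the two cycles share the valid times 2000-01-01 12:00, 2000-01-01 18:00 and 2000-01-02 00:00, which is consistent with the same diagnostic appearing for 01/01/2000 at 1200 in both cycle directories.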

shakka · 17 August 2023 11:24

Hi Grenville,

It’s safe to assume I’m probably just doing something wrong here - I am trying to get 36 hour forecasts and discard the first 12 hours as spin-up, then output only t+12 to t+36 (which should be 00Z to 00Z) to file. I’ll then concatenate those together into a continuous timeseries.

There are also a few variables which aren’t outputting (anything on tiles, snow tiles or soil levels - I’m guessing this is a JULES thing?).

On archiving - the archiving step says ‘succeeded’ at every cycle, but I’ve evidently not set something up correctly because the transfer hasn’t happened.

So far, the contents of my app/rose-app.conf are:

[arch]
command-format=rsync %(sources)s %(target)s
!!target-prefix=moose:/devfc/$ROSE_SUITE_NAME/
source-prefix=$ROSE_DATA
target-prefix=shakka@hpxfer1.jasmin.ac.uk:/gws/nopw/j04/polarres/ella/evaluation_run/$ROSE_SUITE_NAME/

[arch:rename/]
rename-format=%(cycle)s_$REGION_NAME_%(name)s
update-check=mtime+size

I’m guessing that I’m missing something here?

Thanks for your help as ever
Ella

shakka · 17 August 2023 14:23

Ok, I’ve tried working through some of the archiving errors and now getting stuck with the renaming part.

I want to rename the files from each region (which have the same names) as <REGION_NAME>

i.e. 20000101T1200Z_Antarctic_m3h_000 etc.

but I can’t figure out a clean way to do this using the rose environment variables that are available.

Currently I’ve got:

[arch]
command-format=rsync %(sources)s %(target)s
source-prefix=$ROSE_DATAC/$REGNAME
source=11km/ga7_24_36/um/*_000
target-prefix=shakka@hpxfer1.jasmin.ac.uk:/gws/nopw/j04/polarres/ella/evaluation_run/$ROSE_SUITE_NAME/

[arch:rename/]
rename-format=%(cycle)s_(?P< tag>)_%(name)s
update-check=mtime+size

Can I define a REGNAME somehow ?

grenville · 17 August 2023 16:42

Ella

Please see my copy of u-cy223 - where archive_files_rsync sends glm files to JASMIN and renames them from umglaa_pa000 to my_data_p000 etc

I needed to set DATADIR in the task environment since the glm does not do so – the LAM tasks do set DATADIR, so a separate task might be the way to go for them (needed anyway since the file renaming will be different I suppose.)

The renaming uses Python regular expressions, which aren’t the easiest.

I’d create a task for the LAMs that simply copies the data first then tackle that renaming later.

Grenville
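
As a rough illustration of the regular-expression renaming, the sketch below mimics in plain Python how a parser pattern with a named group can feed a %-style format string like the rename-format lines earlier in the thread. It does not call rose_arch; the pattern and format shown are hypothetical, chosen only to reproduce the umglaa_pa000 to my_data_p000 rename, and are not copied from Grenville's suite.

```python
import re

# Hypothetical parser and format (assumptions, not the suite's actual settings).
RENAME_PARSER = r"^umglaa_pa(?P<hour>\d{3})$"
RENAME_FORMAT = "my_data_p%(hour)s"


def rename(name, cycle):
    """Return the renamed file name, or the original name if it does not match."""
    match = re.match(RENAME_PARSER, name)
    if match is None:
        return name
    # 'cycle' and 'name' are available alongside any named groups from the parser.
    fields = {"cycle": cycle, "name": name}
    fields.update(match.groupdict())
    return RENAME_FORMAT % fields


for source in ["umglaa_pa000", "umglaa_pa006", "other_file"]:
    print(source, "->", rename(source, "19991231T1200Z"))
```

Testing a pattern this way before putting it in the app configuration makes it easier to see which part of the name each group captures.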

shakka · 20 August 2023 19:52

Right, so I need to set the archiving up to run separately for each region? Because if the files are named the same for both domains I don’t want it to overwrite one of the domains’ files when I transfer them.

How can I handle that?
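
The overwrite concern can be checked on paper before transferring anything. The following Python sketch builds target names with and without a region component and reports any collisions; the file names are made-up examples and the helper names are hypothetical.

```python
from collections import Counter


def target_names(files_by_region, cycle, include_region):
    """Build the archive target names for every file in every region."""
    targets = []
    for region, names in files_by_region.items():
        for name in names:
            if include_region:
                targets.append(f"{cycle}_{region}_{name}")
            else:
                targets.append(f"{cycle}_{name}")
    return targets


def collisions(targets):
    """Return the target names that more than one source file would map to."""
    return sorted(n for n, count in Counter(targets).items() if count > 1)


# Illustrative file lists (assumption): both regions produce identically named files.
files = {
    "Arctic": ["m3h_000", "plev_000"],
    "Antarctic": ["m3h_000", "plev_000"],
}
cycle = "20000101T1200Z"

print(collisions(target_names(files, cycle, include_region=False)))
print(collisions(target_names(files, cycle, include_region=True)))
```

With the region in the name, as in 20000101T1200Z_Antarctic_m3h_000, no two files share a target, so one domain cannot overwrite the other.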

shakka · 30 August 2023 10:24

Hi Grenville, just coming back to this now. I was wondering if it would make sense to do this as two separate archiving tasks - one for each domain. That way, the renaming by domain could be hard-wired in. What do you think to that idea?

And can you help me on this too?:

It’s safe to assume I’m probably just doing something wrong here - I am trying to get 36 hour forecasts and discard the first 12 hours as spin-up, then output only t+12 to t+36 (which should be 00Z to 00Z) to file. I’ll then concatenate those together into a continuous timeseries.

There are also a few variables which aren’t outputting (anything on tiles, snow tiles or soil levels - I’m guessing this is a JULES thing?).
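
For the spin-up question, a small Python sketch of the intended windows may help pin down what "t+12 to t+36" means for each cycle. It assumes cycles start at 12Z every 24 hours, as in the cycle points quoted in this thread; the names are hypothetical.

```python
from datetime import datetime, timedelta

SPIN_UP = timedelta(hours=12)          # first 12 hours discarded
FORECAST_LENGTH = timedelta(hours=36)  # total forecast length per cycle
CYCLE_INTERVAL = timedelta(hours=24)


def kept_window(cycle_start):
    """Return (start, end) of the valid-time window kept from one cycle."""
    return cycle_start + SPIN_UP, cycle_start + FORECAST_LENGTH


first_cycle = datetime(1999, 12, 31, 12)
cycles = [first_cycle + CYCLE_INTERVAL * i for i in range(3)]
windows = [kept_window(c) for c in cycles]

for cycle, (start, end) in zip(cycles, windows):
    print(cycle.isoformat(), "keeps", start.isoformat(), "to", end.isoformat())
```

Each window runs from 00Z to 00Z and ends exactly where the next one starts, so the concatenated series is continuous. The shared 00Z time belongs to two windows, so only one copy should be kept when concatenating (for example by treating each window as half-open).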

shakka · 1 September 2023 12:38

Update: now testing using a separate archiving step for each domain. Copied suite u-cy223 and am now running glm with just one nested domain (Arctic) in u-cz478.

I added an archive_files_Arctic task with the same format as @grenville’s u-cy223 archive_files_rsync and updated the suite.rc to include an [[[environment]]] definition of DATADIR, which I had to hard-wire as

DATADIR=$ROSE_DATAC/Arctic/11km/ga7_24-36/um/

FYI I tried to be clever and use the syntax in /suite-runtime/lams.rc, i.e.

DATADIR= {{regn["name"]}}/{{resln["name"]}}/{{mod["name"]}}

because I thought I could use this to make it work with both domains within one suite… but that didn’t work.

Now, it seems to be at least attempting to transfer the files as expected, but I’m getting an rsync error when trying to connect to either hpxfer1 or xfer1 on jasmin. I am able to connect from the archer2 command line to both without being prompted for a password, which means there shouldn’t be a connection or authentication issue.

The error I’m getting is:

[FAIL] rsync -aLv /work/n02/n02/shakka/cylc-run/u-cz478/work/19991231T1200Z/archive_files_Arctic/tmpBZ0tbh/MetUM_PolarRES_Arctic_11km_19991231T1200Z_ut_day_000 /work/n02/n02/shakka/cylc-run/u-cz478/work/19991231T1200Z/archive_files_Arctic/tmpBZ0tbh/MetUM_PolarRES_Arctic_11km_19991231T1200Z_ut_mlev_000 /work/n02/n02/shakka/cylc-run/u-cz478/work/19991231T1200Z/archive_files_Arctic/tmpBZ0tbh/MetUM_PolarRES_Arctic_11km_19991231T1200Z_ut_m3h_000 /work/n02/n02/shakka/cylc-run/u-cz478/work/19991231T1200Z/archive_files_Arctic/tmpBZ0tbh/MetUM_PolarRES_Arctic_11km_19991231T1200Z_ut_mi1h_000 /work/n02/n02/shakka/cylc-run/u-cz478/work/19991231T1200Z/archive_files_Arctic/tmpBZ0tbh/MetUM_PolarRES_Arctic_11km_19991231T1200Z_ut_plev_000 /work/n02/n02/shakka/cylc-run/u-cz478/work/19991231T1200Z/archive_files_Arctic/tmpBZ0tbh/MetUM_PolarRES_Arctic_11km_19991231T1200Z_ut_i3h_000 /work/n02/n02/shakka/cylc-run/u-cz478/work/19991231T1200Z/archive_files_Arctic/tmpBZ0tbh/MetUM_PolarRES_Arctic_11km_19991231T1200Z_ut_m6h_000 /work/n02/n02/shakka/cylc-run/u-cz478/work/19991231T1200Z/archive_files_Arctic/tmpBZ0tbh/MetUM_PolarRES_Arctic_11km_19991231T1200Z_ut_3ht_000 /work/n02/n02/shakka/cylc-run/u-cz478/work/19991231T1200Z/archive_files_Arctic/tmpBZ0tbh/MetUM_PolarRES_Arctic_11km_19991231T1200Z_ut_i6h_000 /work/n02/n02/shakka/cylc-run/u-cz478/work/19991231T1200Z/archive_files_Arctic/tmpBZ0tbh/MetUM_PolarRES_Arctic_11km_19991231T1200Z_ut_6ht_000 shakka@hpxfer1.jasmin.ac.uk:/gws/nopw/j04/polarres/ella/evaluation_run/Arcticfiles # return-code=255, stderr=
[FAIL] ssh: connect to host hpxfer1.jasmin.ac.uk port 22: Connection timed out
[FAIL] rsync: connection unexpectedly closed (0 bytes received so far) [sender]
[FAIL] rsync error: unexplained error (code 255) at io.c(228) [sender=3.2.3]
[FAIL] ! /gws/nopw/j04/polarres/ella/evaluation_run/Arcticfiles [compress=None, t(init)=2023-09-01T10:59:00Z, dt(tran)=0s, dt(arch)=130s, ret-code=255]
[FAIL] !	MetUM_PolarRES_Arctic_11km_19991231T1200Z_ut_3ht_000 (ut_3ht_000)
[FAIL] !	MetUM_PolarRES_Arctic_11km_19991231T1200Z_ut_6ht_000 (ut_6ht_000)
[FAIL] !	MetUM_PolarRES_Arctic_11km_19991231T1200Z_ut_day_000 (ut_day_000)
[FAIL] !	MetUM_PolarRES_Arctic_11km_19991231T1200Z_ut_i3h_000 (ut_i3h_000)
[FAIL] !	MetUM_PolarRES_Arctic_11km_19991231T1200Z_ut_i6h_000 (ut_i6h_000)
[FAIL] !	MetUM_PolarRES_Arctic_11km_19991231T1200Z_ut_m3h_000 (ut_m3h_000)
[FAIL] !	MetUM_PolarRES_Arctic_11km_19991231T1200Z_ut_m6h_000 (ut_m6h_000)
[FAIL] !	MetUM_PolarRES_Arctic_11km_19991231T1200Z_ut_mi1h_000 (ut_mi1h_000)
[FAIL] !	MetUM_PolarRES_Arctic_11km_19991231T1200Z_ut_mlev_000 (ut_mlev_000)
[FAIL] !	MetUM_PolarRES_Arctic_11km_19991231T1200Z_ut_plev_000 (ut_plev_000)
2023-09-01T11:01:12Z CRITICAL - failed/EXIT

… any ideas?

[as an aside, the renaming should be renaming as MetUM_PolarRES_Arctic_11km_$CYCLE_out_fileid_000 , but seems to be missing the ‘o’ in ‘out’ - not sure why that’s happening but it doesn’t matter too much]

grenville · 5 September 2023 11:26

Hi Ella

Sorry this is slow - holiday intervened.

Try running the archive tasks on the login node - there is no ssh agent running on the compute nodes. So, add method = background (to both archive tasks) eg

    [[archive]]
        inherit = None, HOST_HPC
        [[[job]]]
            execution retry delays = PT15M, PT15M, PT30M, PT60M, PT60M, PT180M, PT360M,PT360M
            method = background

shakka · 5 September 2023 15:24

Outrageous! Hope you enjoyed your time off

I tried this and I got an authentication error, despite being able to ssh into hpxfer1 directly from archer. Do I need to add something somewhere to allow the ssh agent to work in the background?

grenville · 5 September 2023 15:54

In your .bashrc, add

# ssh-agent setup on login nodes
. ~/.ssh/ssh-setup

copy /home/n02/n02/grenvill/ssh-setup to your .ssh directory

Logout & back in. It should say Initialising new SSH agent..., then add your jasmin key.

If you don’t have a ~/.ssh/config, create one and add

Host hpxfer1.jasmin.ac.uk
User <your jasmin user name>
IdentityFile ~/.ssh/<your jasmin key>
ForwardAgent no

check that you can ssh hpxfer1.jasmin.ac.uk without having to type a password/passphrase

then try archiving again

shakka · 6 September 2023 08:42

Hi Grenville,
Tried this, and I get the “Initialising new SSH agent…” on login, but am still getting a permission denied error when I try to ssh into hpxfer1. The security key is exactly as it should be, so I’m a little confused.
Cheers
Ella

grenville · 6 September 2023 08:55

did you add your jasmin key?

ssh-add ~/.ssh/<your jasmin key>

then to check
ssh-add -l

what do you get for

ssh -vvv hpxfer1.jasmin.ac.uk

Grenville

shakka · 6 September 2023 09:58

Nope. Doh! How often will I have to re-add my key? Can I include it in my bashrc or something?

After adding my ssh key, I am now back to ‘connection timed out’ errors…
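
A "connection timed out" error means the TCP connection to port 22 never completed, which is a different failure from "permission denied" (where the connection succeeds and authentication then fails). One way to separate the two is a plain reachability check run from the same node that runs the archive task. This is a minimal Python sketch using only the standard library; the function name is hypothetical.

```python
import socket


def port_reachable(host, port=22, timeout=5.0):
    """Return True if a TCP connection to host:port completes within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and name-resolution failures.
        return False


# Example call (not run here, as it needs network access):
# print(port_reachable("hpxfer1.jasmin.ac.uk"))
```

If this returns False on the node where the task runs but True on a login node, the problem is network reachability from that node rather than the key or the agent.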
