Pptransfer in UM nesting suite

Hi helpdesk,

I’m trying to automate the process of moving model output from ARCHER2 to JASMIN so that my disk quota doesn’t fill up during runs. I looked at the pptransfer page (Configuring PPTransfer), but the .rc files seem to be a bit different in my suite since I’m running the nesting suite. Are there any existing nesting suites with pptransfer set up?

I had a go at copying the relevant settings into my suite to get pptransfer to run, using suite be303 as a reference; my suite is cq995. I get a module error when it tries to run pptransfer:

Lmod has detected the following error: Unable to load module because of error
when evaluating modulefile:
/work/y07/shared/archer2-lmod/libs/core/epcc-cray-hdf5-parallel/1.12.0.3.lua:
[string "-- Patch up cray-hdf5-parallel…"]:51: bad argument #1 to 'find'
(string expected, got nil)
Please check the modulefile and especially if there is a the line number
specified in the above message

This could easily be a result of me mangling the .rc files (I’ve tried to copy settings into suite.rc, site/ncas-cray-ex/suite-adds.rc and suite-graph/lam-fcst.rc), but I can’t figure out what the problem is.

Can you advise on either the best way for me to automate my transfer process, or what the problem is on cq995?

Thanks!
Ruth

Hi Ruth,

The pptransfer app requires data to have already been archived/staged into a single directory on ARCHER2. In a standard UM suite, postproc runs first to do this, and then pptransfer transfers the “archived” data for that cycle over to JASMIN. We have never tried to use pptransfer in a nesting suite and I don’t know how hard it would be to do so - the nesting suite is notoriously complicated. :slightly_frowning_face:

From a quick look at your suite, you would first need to edit the nesting suite archive app, changing the rose-app.conf and all the optional config files in the /opt directory so that, instead of using moose, they copy/move all the required files into a single directory on ARCHER2 named after the cycle (e.g. /work/n02/n02/ros/archive/20170508T0000Z). Then pptransfer should be able to transfer that directory over to JASMIN. I can’t guarantee, however, that this is all that would be required.
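
Roughly speaking, the edited archive step would need to end up doing something along these lines for each cycle (a minimal sketch only; the username, source path and file pattern are illustrative rather than taken from your suite):

    # Sketch: gather one cycle's output into a single directory named after the cycle
    CYCLE=20170508T0000Z
    ARCHIVE_DIR=/work/n02/n02/<username>/archive/${CYCLE}
    mkdir -p "${ARCHIVE_DIR}"
    # copy (or move) whatever files should be transferred for this cycle
    cp /path/to/cycle/output/umnsaa_pa* "${ARCHIVE_DIR}/"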

Unless you’re planning a long simulation, the effort for you to get this working may not be worth it.

The module load error you’re getting is because module load um has been run before module load postproc. You can’t have both of these modules loaded at the same time.
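
Exactly where to change this depends on how the nesting suite builds up its task environments, but the end result for the pptransfer task needs to be something like the following (an illustration only, not a drop-in fix):

    # Illustration: the pptransfer task environment must not have um loaded
    # alongside postproc
    module unload um
    module load postproc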

Hope this helps.

Regards,
Ros.

Thanks for the quick turnaround. That’s all really helpful. I’ll see what I can do with that, but yes, might end up giving up or finding another solution!
Ruth

Hi Ruth,

I’ve just been thinking about this again and wondering if simply replacing the moose commands in the archive app with rsync would work? The archive task would need to run as a background task on a specific ARCHER2 login node (e.g. login2.archer2.ac.uk).
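
For example, each moose archive command could in principle become an rsync along these lines (a sketch only; the key file, JASMIN username, group workspace and target path are placeholders rather than your actual setup):

    # Sketch: push one cycle's archive directory to a JASMIN transfer node
    rsync -av -e "ssh -i ~/.ssh/<jasmin-key>" \
        /work/n02/n02/<username>/archive/20170508T0000Z \
        <jasmin-username>@hpxfer1.jasmin.ac.uk:/gws/nopw/j04/<gws>/<path>/archive/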

Cheers,
Ros.

Thanks Ros! You may be right that actually a simple transfer is all that’s needed. When you say it would run as a background task, do you still mean online as part of the model run or offline?

Ruth

Hi Ruth,

Yes, still online as part of the model workflow. It’s just that, rather than running that task in the ARCHER2 serial queue, it would need to be submitted as a background job on an ARCHER2 login node, as you can’t set up ssh-agent on the serial nodes.

You’d need to add the following to the appropriate *.rc file to tell cylc to run the task in the background on a login node (login3 in this example):

    [[[job]]]
        batch system = background
    [[[remote]]]
        host = login3.archer2.ac.uk

Does that make sense?

Cheers,
Ros.

Hi Ros,

That makes sense yes, but when I add those lines into suite.rc the suite will no longer run. Here’s the error it gives on the command line after running rose suite-run:

[FAIL] ssh -oBatchMode=yes -n rprice@login2.archer2.ac.uk env\ ROSE_VERSION=2019.01.3\ CYLC_VERSION=7.8.7\ bash\ -l\ -c\ '"$0"\ "$@"'\ rose\ suite-run\ -vv\ -n\ u-cq995\ --run=run\ --remote=uuid=49342f11-ace1-4d58-a929-f60b90ee3797,now-str=20221011T153940Z,root-dir='$DATADIR' # return-code=255, stderr=
[FAIL] Host key verification failed.

I also tried login3.archer2.ac.uk but it gave the same error. Is this something I’m doing wrong? This is still on cq995.

Thanks
Ruth

Ah, it works if I use just login.archer2.ac.uk. Will that be alright for running it?

Hi Ruth,

No, the background task has to be submitted to a specific login node (e.g. login3.archer2.ac.uk). If you use login.archer2.ac.uk this will randomly select a login node. When cylc then polls, it will at some point land on a different login node from the one where the task is running, fail to find the task, and incorrectly declare the task failed.

First fix your ~/.ssh/config file so it works for all login nodes by changing:

Host login.archer2.ac.uk
to
Host login*.archer2.ac.uk

Then, from the pumatest command line, ssh into each of the login nodes (login1-4) to check your connection. I suspect at least one of them will return a “Host key verification failed” message containing the line number of a bad key, which will need removing from your ~/.ssh/known_hosts file.
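
For example, a quick loop like this, run from pumatest, covers all four nodes:

    # Check the ssh connection to each ARCHER2 login node in turn
    for n in 1 2 3 4; do
        ssh login${n}.archer2.ac.uk hostname
    done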

Cheers,
Ros.

Hi Ros,

Understood. I’ve done that and changed it back to login3 in the suite.rc file. I didn’t get any “Host key verification failed” messages, but login2 did behave oddly: it gave the error “Connection closed by 193.62.216.43” rather than the usual message about the PTY allocation request. I’m not sure if that has any bearing on this.

The suite is now failing on the archive job due to a Permission denied error on the rsync command. Error message is like this:

[FAIL] rsync -e"ssh -i /home/n02/n02/rprice/.ssh/id_rsa_jasmin" /work/n02/n02/rprice/cylc-run/u-cq995/work/20170508T0000Z/Regn1_km1p0_ra3_p3_casim_ukca_archive/tmpmgGlXz/u-cq995_20170508T0000Z_Regn1_km1p0_ra3_p3_casim_ukca_pa000 eersp@hpxfer1.jasmin.ac.uk:/gws/nopw/j04/asci/rprice/archive/field.pp/ # return-code=255, stderr=
[FAIL]
[FAIL] Access to this system is monitored and restricted to
[FAIL] authorised users. If you do not have authorisation
[FAIL] to use this system, you should not proceed beyond
[FAIL] this point and should disconnect immediately.
[FAIL]
[FAIL] Unauthorised use could lead to prosecution.
[FAIL]
[FAIL] (See also - Science and Technology Facilities Council (STFC) – UKRI)
[FAIL]
[FAIL] eersp@hpxfer1.jasmin.ac.uk: Permission denied (publickey,gssapi-keyex,gssapi-with-mic).
[FAIL] rsync: connection unexpectedly closed (0 bytes received so far) [sender]
[FAIL] rsync error: unexplained error (code 255) at io.c(235) [sender=3.1.3]
[FAIL] ! eersp@hpxfer1.jasmin.ac.uk:/gws/nopw/j04/asci/rprice/archive/field.pp/ [compress=None, t(init)=2022-10-12T15:32:31Z, dt(tran)=0s, dt(arch)=0s, ret-code=255]
[FAIL] ! u-cq995_20170508T0000Z_Regn1_km1p0_ra3_p3_casim_ukca_pa000 (umnsaa_pa000)

I tried doing the equivalent rsync job directly on ARCHER2 to test the permissions and it runs fine there, i.e. this command is successful:

rprice@ln01:~> rsync -e"ssh -i /home/n02/n02/rprice/.ssh/id_rsa_jasmin" /work/n02/n02/rprice/cylc-run/u-cq995/share/cycle/20170508T0000Z/Regn1/km1p0/ra3_p3_casim_ukca/um/umnsaa_pa000 eersp@hpxfer1.jasmin.ac.uk:/gws/nopw/j04/asci/rprice/archive/field.pp/

So I must be doing something wrong in the way I’m telling ARCHER2 to run rsync from pumatest. Do you have any idea what the problem could be?

Thanks
Ruth

Hi Ruth,

Apologies, I neglected to say that you will need to set up ssh-agent on ARCHER2 to allow the rsync to work non-interactively.

On login3:

  • In your ~/.ssh/config file:

    Host xfer?.jasmin.ac.uk hpxfer?.jasmin.ac.uk
    User <jasmin-username>
    IdentityFile ~/.ssh/id_rsa_jasmin
    ForwardAgent no
    
  • In your ~/.bashrc or ~/.bash_profile if you don’t have a .bashrc:

    # ssh-agent setup on login nodes
    . ~/.ssh/ssh-setup
    
  • Copy my ~/ssh-setup script to your ~/.ssh directory (a rough sketch of what a script like this typically does is shown after this list).

  • Log in again to login3 and it should start up a new ssh-agent. Then run ssh-add ~/.ssh/id_rsa_jasmin to add your key.
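
The exact contents of my ssh-setup script aren’t reproduced here, but a script of this kind typically does something like the following (a hypothetical sketch, not a copy of the actual script):

    # Hypothetical sketch of an ssh-agent setup script: reuse a running agent
    # where possible, otherwise start a new one and record its details.
    SSH_ENV="$HOME/.ssh/agent.env"
    # Pick up any previously recorded agent details
    [ -f "$SSH_ENV" ] && . "$SSH_ENV" > /dev/null
    # ssh-add -l exits with status 2 when no agent can be contacted
    ssh-add -l > /dev/null 2>&1
    if [ $? -eq 2 ]; then
        ssh-agent -s > "$SSH_ENV"
        . "$SSH_ENV" > /dev/null
    fi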

Hopefully that should fix the problem.

Regards,
Ros.

Hi Ros,

Thanks! That sorted the permission denied problem. The archive job now appears to work at first glance (i.e. the log files say it completes successfully), but there are a couple of problems:

  • in cq995 I’m testing the transfer of an output file and a start dump, and it only transferred the output file
  • the output file does get transferred but seems to get corrupted somehow - I can’t open it in xconv

This is a potentially stupid question, but do you know where the archive job actually is? As in, is there a script, or something that gets compiled? There’s no bin/ directory or any other scripts in app/archive/, so I’m struggling to find the actual code that the app uses (so that I can try to debug this).

Sorry for the never ending stream of questions about this.

Ruth

Hi Ruth,

The archiving uses the Rose built-in app rose-arch, so there are no scripts in the suite.

Details on Rose Arch can be found here: Rose Arch Documentation
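
For orientation, a rose_arch app config in rose-app.conf usually looks something like the sketch below (the prefixes, target name and source pattern are illustrative, not copied from your suite):

    mode=rose_arch

    [arch]
    command-format=rsync -a %(sources)s %(target)s
    source-prefix=$ROSE_DATAC/
    target-prefix=<jasmin-username>@hpxfer1.jasmin.ac.uk:/gws/nopw/j04/<gws>/<path>/archive/
    update-check=mtime+size

    # one [arch:TARGET] section per archive target
    [arch:field.pp/]
    source=umnsaa_pa*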

If you want me to take a look at your setup, let me know.

Cheers,
Ros

Thank you! Those rose arch docs are really helpful; I can have another stab at things now.
Ruth