Pptransfer retrying on ARCHER2

I just got the first cycle of the pptransfer app of my ARCHER2 suite u-ce196try3_23c to work.
It took me 12 tries to get the suite set up right so that it could do this.

The problem now is that the second cycle of the pptransfer app of this suite keeps failing and retrying; it’s on its 2nd retry now. I triggered the 1st retry before the default retry waiting period of 30 minutes was over. I’m not sure whether that matters, but I don’t want to trigger the 2nd retry before the 30-minute waiting period is over.

It is retrying, and maybe the next retry will work, or maybe not. But I can’t see the reason for the failure in the log files from before the retry. I looked at the log files on PUMA at the Linux command prompt, as well as in the Cylc GUI, since the latter sometimes seems to be more up to date. Can you help?

It looks like it might work on try #4 of cycle #2 (starting in 1980). At least it seems to be getting a little further with each retry. The Cylc GUI says that it is trying again (try #5), but I don’t see this planned retry yet in the ARCHER2 versions of the try #4 log files.

There is more info in the log files on ARCHER2 than there is in the log files on PUMA.

In tries #1, #2, and #3 of cycle #2, there was an error message:
[ERROR] transfer.py: Unknown Error - Return Code=24
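
If that return code is rsync’s exit status passed straight through by transfer.py (just a guess on my part), then per the EXIT VALUES section of rsync(1), code 24 means “partial transfer due to vanished source files”. A small sketch for decoding such a code:

```shell
# Decode an rsync-style exit status. Assumption: transfer.py forwards
# rsync's exit code unchanged; meanings are from rsync(1) EXIT VALUES.
rc=24    # the return code seen in the pptransfer job logs
case $rc in
    0)  msg="success" ;;
    12) msg="error in rsync protocol data stream" ;;
    23) msg="partial transfer due to error" ;;
    24) msg="partial transfer due to vanished source files" ;;
    *)  msg="other failure (see rsync(1) EXIT VALUES)" ;;
esac
echo "rc=$rc: $msg"
```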


Hi Patrick,

Can you change the permissions of your /home and /work on ARCHER2 so we can see them please?

chmod -R g+rX /work/n02/n02/pmcguire
chmod -R g+rX /home/n02/n02/pmcguire

I have the same issue with one of my suites - it’s the rsync getting stuck, and I can’t see why at the moment… My other 2 suites are working absolutely fine, which is slightly bizarre, and it’s not login-node specific either.


Thanks, Ros.
I just changed the permissions.

Hi Patrick,

I remember the problem now…

In the archer2.rc file add the following to the [[PPTRANSFER_RESOURCE]] family:

    host = login3.archer2.ac.uk

Obviously you can pick whichever login node [1-4] you want, or whichever one is up.

Because the task runs in the background and is only visible on the login node on which it’s running, you have to force the use of a specific login node for the polling to work.
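
In context, the archer2.rc fragment would look something like this (section names are the ones used in this thread; the host line sits under the [[[remote]]] item of the family, and any other settings already in the family stay as they are):

```ini
[runtime]
    [[PPTRANSFER_RESOURCE]]
        [[[remote]]]
            host = login3.archer2.ac.uk
```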

I’ll update the documentation before I forget again! :upside_down_face:

Thanks, Ros:
I have tried several times to get it working with the [[[remote]]] suggestion. Maybe my next try will work.
I saw in my ~pmcguire/cylc-run/u-ce196try3_23c/log/rose-suite-run.log file on PUMA that it was trying to spawn two ssh and two rsync sessions, one for login.archer2 and one for login3.archer2.

I think the first one, to login.archer2, was because [[PPTRANSFER_RESOURCE]] inherits, via its great-grandparent [[HPC]], a [[[remote]]] setting that also uses login.archer2. So I commented that out to try, and I am trying now. But this will mean that the other apps won’t know which node to log in from. I will have to figure out what to do about that.

Hi Patrick,

As long as you put it in the [[PPTRANSFER_RESOURCE]] section that will work fine, as anything in there overrides what is in the parent and grandparent sections.
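
To illustrate the precedence (a sketch only: the real inheritance chain in the suite is longer, and each family has other settings too), a [[[remote]]] host set in [[PPTRANSFER_RESOURCE]] simply wins over the one inherited from [[HPC]], so the [[HPC]] one doesn’t need commenting out:

```ini
[runtime]
    [[HPC]]
        [[[remote]]]
            host = login.archer2.ac.uk      # still used by the other tasks
    [[PPTRANSFER_RESOURCE]]
        inherit = HPC
        [[[remote]]]
            host = login3.archer2.ac.uk     # overrides the inherited host
```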

Did you stop and restart the suite, or do a rose suite-run --reload?

I’m just about to pop out but can take a look later. You are welcome to look at my suites u-as037, u-cb151 and u-bs251 which all do it this way.



You’ll need to have set up the known hosts too for which ever login node you chose to use.
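
A quick way to check that is the standard-OpenSSH sketch below; it deliberately uses a throwaway file so it is safe to run as-is, but in practice you would check ~/.ssh/known_hosts itself and append the missing key with ssh-keyscan.

```shell
# Check whether the chosen login node's host key is already recorded.
# Throwaway demo path so the sketch is harmless; for real use, point
# at ~/.ssh/known_hosts and append missing keys with ssh-keyscan.
host=login3.archer2.ac.uk
known_hosts=/tmp/known_hosts_demo
touch "$known_hosts"
if grep -q "^$host" "$known_hosts"; then
    msg="$host already known"
else
    msg="$host not yet known: run  ssh-keyscan $host >> ~/.ssh/known_hosts"
fi
echo "$msg"
```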


Thanks, Ros & Grenville!

I did a --reload whenever I made changes, instead of stopping the suite and doing a --restart.

The pptransfer app just finished its 2nd cycle successfully, thanks to Ros’s tip about the login3.archer2 remote in [[PPTRANSFER_RESOURCE]]. It had previously failed the 2nd cycle in over 10 attempts on my part. This time, when it worked, I had commented out the [[[remote]]] in [[HPC]]. And this time, it looks like there were no double ssh’s + rsync’s for login3.archer2 and login.archer2.

Now it’s on the 3rd cycle of pptransfer. Maybe that one will work too? I have now also uncommented my previously-commented [[[remote]]] in [[HPC]] which uses login.archer2, and I did a --reload. Hopefully, the other apps will now continue to work fine.

I looked at u-as037, and I don’t think that one has a [[[remote]]] in [[PPTRANSFER_RESOURCE]]. But u-cb151 does, and it also has a [[[remote]]] in [[HPC]], and I am doing it like that. Thanks for the confirmation.

It seems to be working better now. It has now made it through a couple more cycles of pptransfer and atmos_main.
Thanks for your help.