Pptransfer failing

Dear CMS Helpdesk:
My new suites u-cl527 and u-cl528 are both failing to connect with rsync from login3 on ARCHER2 to hpxfer2 on JASMIN. The rsync seems to work fine when I try it from the command line on login3.
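For reference, the manual test I ran on login3 was roughly the following, using the same source and destination paths as the suite (the exact flags may have differed):
rsync -av --stats /work/n02/n02/pmcguire/archive/u-cl528/19490601T0000Z/ hpxfer2.jasmin.ac.uk:/gws/nopw/j04/porcelain_rdg/pmcguire/archer2_archive2/u-cl528/19490601T0000Z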
These suites are modified from u-ce196try3_23c, which was previously working ok with pptransfer.
Can you advise?
Patrick

P.S. This is the error message from cycle 10 of pptransfer:
~/cylc-run/u-cl528/log/job/19490601T0000Z/pptransfer/10/job.err

[WARN]  [SUBPROCESS]: Command: rsync -av --stats --rsync-path=mkdir -p /gws/nopw/j04/porcelain_rdg/pmcguire/archer2_archive2/u-cl528/19490601T0000Z && rsync /work/n02/n02/pmcguire/archive/u-cl528/19490601T0000Z/ hpxfer2.jasmin.ac.uk:/gws/nopw/j04/porcelain_rdg/pmcguire/archer2_archive2/u-cl528/19490601T0000Z
[SUBPROCESS]: Error = 255:
/usr/lib/ssh/ssh-askpass: line 21: /usr/lib/ssh/gnome-ssh-askpass: No such file or directory
pmcguire@hpxfer2.jasmin.ac.uk: Permission denied (publickey,gssapi-keyex,gssapi-with-mic).
rsync: connection unexpectedly closed (0 bytes received so far) [sender]
rsync error: unexplained error (code 255) at io.c(235) [sender=3.1.3]

[WARN]  Transfer command failed: rsync -av --stats --rsync-path="mkdir -p /gws/nopw/j04/porcelain_rdg/pmcguire/archer2_archive2/u-cl528/19490601T0000Z && rsync" /work/n02/n02/pmcguire/archive/u-cl528/19490601T0000Z/ hpxfer2.jasmin.ac.uk:/gws/nopw/j04/porcelain_rdg/pmcguire/archer2_archive2/u-cl528/19490601T0000Z
[ERROR]  transfer.py: Unknown Error - Return Code=255
[FAIL]  Command Terminated
[FAIL] Terminating PostProc...
[FAIL] transfer.py # return-code=1
Received signal ERR

I set up the ssh-agent again on login3 and retried the pptransfer, but it still didn't work.

Then I saw with ps -elf that there were 4 different ssh-agents running on login3, so I have just killed all four of them, started a single fresh ssh-agent, and checked that password-less ssh from login3 to hpxfer2 on JASMIN works. I have now started the pptransfer app once again. This is try #11, and each try takes an hour or so. There must be a faster way to ensure that the pptransfer app and the ssh-agent are set up to work properly.
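For the record, the cleanup sequence I used was roughly the following (the key filename is illustrative):
pkill -u pmcguire ssh-agent                           # stop all the stray agents
eval $(ssh-agent -s)                                  # start one fresh agent in this shell
ssh-add ~/.ssh/id_rsa_jasmin                          # key filename illustrative
ssh -o BatchMode=yes hpxfer2.jasmin.ac.uk hostname    # confirm password-less login works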
Patrick

Hi Patrick,

Having multiple ssh-agents running on a node is never a good idea, so it was definitely good to stop them all and start up a clean one.

I notice your ~/.ssh directory on ARCHER2 is too open; it should only be readable by you. This is likely to cause problems at some point.

chmod 700 ~/.ssh
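
While you are at it, it is worth checking the key files themselves; private keys should be readable only by you (filenames below are just examples):

ls -ld ~/.ssh ~/.ssh/*            # directory should show drwx------, private keys -rw-------
chmod 600 ~/.ssh/id_rsa_jasmin    # private key filename illustrative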

Cheers,
Ros.

Thanks, Ros,
I have just changed the permissions on ~/.ssh to 700.

The pptransfer is still running for the first year (1949), so that might be a good sign. I think it failed a bit earlier than this previously.

I do note that ps -elf shows about 7 different python2 processes for each of the two pptransfers, plus one rsync process and one ssh hpxfer2 process for each. Hopefully this number of processes is correct.
Patrick

P.S.: Here is the ps -elf listing:

pmcguire@ln03:~> !ps
ps -elf | grep pmcg
4 S pmcguire  24966      1  0  80   0 - 21298 ep_pol 13:26 ?        00:00:22 /usr/lib/systemd/systemd --user
5 S pmcguire  24969  24966  0  80   0 - 82233 -      13:26 ?        00:00:00 (sd-pam)
1 S pmcguire  25327      1  0  80   0 -  3974 -      13:26 ?        00:00:00 ssh-agent
0 S pmcguire  65642      1  0  80   0 -  2551 sigsus 13:29 ?        00:00:00 timeout --signal=XCPU 36000 /home/n02/n02/pmcguire/cylc-run/u-cl528/log/job/19490601T0000Z/pptransfer/11/job
0 S pmcguire  65643  65642  0  80   0 -  3529 do_wai 13:29 ?        00:00:00 /bin/bash /home/n02/n02/pmcguire/cylc-run/u-cl528/log/job/19490601T0000Z/pptransfer/11/job
0 S pmcguire  67108  65643  0  80   0 - 77607 do_wai 13:29 ?        00:00:02 python2 -m rose.task_run --verbose -O (archer2)
0 S pmcguire  67803      1  0  80   0 -  2551 sigsus 13:29 ?        00:00:00 timeout --signal=XCPU 36000 /home/n02/n02/pmcguire/cylc-run/u-cl527/log/job/19490601T0000Z/pptransfer/11/job
0 S pmcguire  67804  67803  0  80   0 -  3529 do_wai 13:29 ?        00:00:00 /bin/bash /home/n02/n02/pmcguire/cylc-run/u-cl527/log/job/19490601T0000Z/pptransfer/11/job
1 S pmcguire  68066  67108  0  80   0 - 26627 futex_ 13:29 ?        00:00:00 python2 -m rose.task_run --verbose -O (archer2)
1 S pmcguire  68067  67108  0  80   0 - 26627 pipe_w 13:29 ?        00:00:00 python2 -m rose.task_run --verbose -O (archer2)
1 S pmcguire  68068  67108  0  80   0 - 26628 futex_ 13:29 ?        00:00:00 python2 -m rose.task_run --verbose -O (archer2)
1 S pmcguire  68071  67108  0  80   0 - 26633 futex_ 13:29 ?        00:00:00 python2 -m rose.task_run --verbose -O (archer2)
1 S pmcguire  68072  67108  0  80   0 - 26628 futex_ 13:29 ?        00:00:00 python2 -m rose.task_run --verbose -O (archer2)
1 S pmcguire  68074  67108  0  80   0 - 26632 futex_ 13:29 ?        00:00:00 python2 -m rose.task_run --verbose -O (archer2)
0 S pmcguire  68224  67108  0  80   0 - 15058 pipe_w 13:29 ?        00:00:00 python /work/n02/n02/pmcguire/cylc-run/u-cl528/share/fcm_make_pptransfer/build/bin/transfer.py
0 S pmcguire  68285  67804  0  80   0 - 77606 do_wai 13:29 ?        00:00:02 python2 -m rose.task_run --verbose -O (archer2)
1 S pmcguire  68486  68285  0  80   0 - 26646 futex_ 13:29 ?        00:00:00 python2 -m rose.task_run --verbose -O (archer2)
1 S pmcguire  68487  68285  0  80   0 - 26646 futex_ 13:29 ?        00:00:00 python2 -m rose.task_run --verbose -O (archer2)
1 S pmcguire  68488  68285  0  80   0 - 26613 futex_ 13:29 ?        00:00:00 python2 -m rose.task_run --verbose -O (archer2)
1 S pmcguire  68489  68285  0  80   0 - 26647 futex_ 13:29 ?        00:00:00 python2 -m rose.task_run --verbose -O (archer2)
1 S pmcguire  68490  68285  0  80   0 - 26619 pipe_w 13:29 ?        00:00:00 python2 -m rose.task_run --verbose -O (archer2)
1 S pmcguire  68491  68285  0  80   0 - 26616 futex_ 13:29 ?        00:00:00 python2 -m rose.task_run --verbose -O (archer2)
0 S pmcguire  68506  68285  0  80   0 - 15058 pipe_w 13:29 ?        00:00:00 python /work/n02/n02/pmcguire/cylc-run/u-cl527/share/fcm_make_pptransfer/build/bin/transfer.py
4 S root      90914  50838  0  80   0 - 41603 -      14:39 ?        00:00:00 sshd: pmcguire [priv]
5 S pmcguire  92187  90914  0  80   0 - 41603 -      14:39 ?        00:00:00 sshd: pmcguire@pts/185
0 S pmcguire  92253  92187  3  80   0 -  8175 do_wai 14:39 pts/185  00:00:00 -bash
0 R pmcguire  93561  92253 99  80   0 - 11711 -      14:39 pts/185  00:00:00 ps -elf
0 S pmcguire  93562  92253  0  80   0 -  2177 pipe_w 14:39 pts/185  00:00:00 grep pmcg
0 R pmcguire 203659  68506 35  80   0 - 11192 -      13:54 ?        00:16:16 rsync -av --stats --rsync-path=mkdir -p /gws/nopw/j04/porcelain_rdg/pmcguire/archer2_archive2/u-cl527/19490601T0000Z && rsync /work/n02/n02/pmcguire/archive/u-cl527/19490601T0000Z/ hpxfer2.jasmin.ac.uk:/gws/nopw/j04/porcelain_rdg/pmcguire/archer2_archive2/u-cl527/19490601T0000Z
0 S pmcguire 203660 203659 37  80   0 - 13511 poll_s 13:54 ?        00:16:44 ssh hpxfer2.jasmin.ac.uk mkdir -p /gws/nopw/j04/porcelain_rdg/pmcguire/archer2_archive2/u-cl527/19490601T0000Z && rsync --server -vlogDtpre.iLsfxC --stats . /gws/nopw/j04/porcelain_rdg/pmcguire/archer2_archive2/u-cl527/19490601T0000Z
0 R pmcguire 210557  68224 36  80   0 - 11257 -      13:55 ?        00:16:20 rsync -av --stats --rsync-path=mkdir -p /gws/nopw/j04/porcelain_rdg/pmcguire/archer2_archive2/u-cl528/19490601T0000Z && rsync /work/n02/n02/pmcguire/archive/u-cl528/19490601T0000Z/ hpxfer2.jasmin.ac.uk:/gws/nopw/j04/porcelain_rdg/pmcguire/archer2_archive2/u-cl528/19490601T0000Z
0 S pmcguire 210558 210557 37  80   0 - 13307 poll_s 13:55 ?        00:16:46 ssh hpxfer2.jasmin.ac.uk mkdir -p /gws/nopw/j04/porcelain_rdg/pmcguire/archer2_archive2/u-cl528/19490601T0000Z && rsync --server -vlogDtpre.iLsfxC --stats . /gws/nopw/j04/porcelain_rdg/pmcguire/archer2_archive2/u-cl528/19490601T0000Z

Is it possible somehow to add a quick rsync-connection test at the beginning of the pptransfer app, prior to the checksum computation and the real rsync task? That way, connection problems could be diagnosed much more quickly.
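Something along these lines, perhaps, run before the checksums start (just a shell-level sketch; the same check could equally be done from transfer.py, and the host name would come from the suite's transfer settings):
ssh -o BatchMode=yes hpxfer2.jasmin.ac.uk true || { echo "pptransfer pre-check: non-interactive ssh to hpxfer2 failed" >&2; exit 1; }
That way a broken ssh-agent would be reported within seconds rather than after an hour of checksumming.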
Patrick

It looks like the pptransfer app has now succeeded for the 1st cycle for both suites.
Thanks for the help and advice,
Patrick