Dear CMS Helpdesk:
My new suites u-cl527 and u-cl528 are both failing to connect with rsync from login3 on ARCHER2 to hpxfer2 on JASMIN. The rsync seems to work fine when I try it from the command line on login3.
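The sort of manual check I mean is something like this dry-run, with the same paths that appear in the job log below (--dry-run means nothing is actually copied):
rsync -av --dry-run --stats /work/n02/n02/pmcguire/archive/u-cl528/19490601T0000Z/ hpxfer2.jasmin.ac.uk:/gws/nopw/j04/porcelain_rdg/pmcguire/archer2_archive2/u-cl528/19490601T0000Z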
These suites are modified from u-ce196try3_23c, which was previously working ok with pptransfer.
Can you advise?
Patrick
P.S. This is the error message from cycle 10 of pptransfer:
~/cylc-run/u-cl528/log/job/19490601T0000Z/pptransfer/10/job.err
[WARN] [SUBPROCESS]: Command: rsync -av --stats --rsync-path=mkdir -p /gws/nopw/j04/porcelain_rdg/pmcguire/archer2_archive2/u-cl528/19490601T0000Z && rsync /work/n02/n02/pmcguire/archive/u-cl528/19490601T0000Z/ hpxfer2.jasmin.ac.uk:/gws/nopw/j04/porcelain_rdg/pmcguire/archer2_archive2/u-cl528/19490601T0000Z
[SUBPROCESS]: Error = 255:
/usr/lib/ssh/ssh-askpass: line 21: /usr/lib/ssh/gnome-ssh-askpass: No such file or directory
pmcguire@hpxfer2.jasmin.ac.uk: Permission denied (publickey,gssapi-keyex,gssapi-with-mic).
rsync: connection unexpectedly closed (0 bytes received so far) [sender]
rsync error: unexplained error (code 255) at io.c(235) [sender=3.1.3]
[WARN] Transfer command failed: rsync -av --stats --rsync-path="mkdir -p /gws/nopw/j04/porcelain_rdg/pmcguire/archer2_archive2/u-cl528/19490601T0000Z && rsync" /work/n02/n02/pmcguire/archive/u-cl528/19490601T0000Z/ hpxfer2.jasmin.ac.uk:/gws/nopw/j04/porcelain_rdg/pmcguire/archer2_archive2/u-cl528/19490601T0000Z
[ERROR] transfer.py: Unknown Error - Return Code=255
[FAIL] Command Terminated
[FAIL] Terminating PostProc...
[FAIL] transfer.py # return-code=1
Received signal ERR
I set up the ssh-agent again on login3 and tried the pptransfer again, and it didn't work.
Then I saw with ps -elf that there were 4 different ssh-agents running on login3. So I have just killed all four of those ssh-agents, started another ssh-agent, and checked that password-less ssh from login3 to hpxfer2 on JASMIN works. I have now started the pptransfer app once again. This is try #11. Each try takes an hour or so. There must be a faster way to make sure that the pptransfer app and the ssh-agent are set up to work properly.
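For anyone hitting the same thing, the clean-up was roughly as follows (the key filename is only an example; use whichever key is registered with JASMIN):
# stop all of my ssh-agents on this login node
pkill -u $USER ssh-agent
# start a fresh agent in the current shell and load the JASMIN key
eval $(ssh-agent -s)
ssh-add ~/.ssh/id_rsa_jasmin
# confirm that non-interactive (password-less) ssh now works
ssh -o BatchMode=yes hpxfer2.jasmin.ac.uk true && echo OK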
Patrick
Hi Patrick,
Having multiple ssh-agents running on a node is never a good idea, so it was definitely right to stop them all and start up a clean one.
I notice your ~/.ssh directory on ARCHER2 is too open; it should only be readable by you. This is likely to cause problems at some point.
chmod 700 ~/.ssh
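To check the result, and to tighten up any key files at the same time, something like this should do (the key filename is just an example):
ls -ld ~/.ssh                 # should show drwx------ once fixed
chmod 600 ~/.ssh/id_rsa       # example private key name; private keys should be readable only by you
chmod 644 ~/.ssh/id_rsa.pub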
Cheers,
Ros.
Thanks, Ros,
I just changed the permissions to 700 for ~/.ssh.
The pptransfer is still running for the 1st year (1949), so that might be a good sign; I think it conked out a bit earlier previously.
I do note that if I do a ps -elf, there are about 7 different python2 processes for each of the 2 pptransfers, plus one rsync and one ssh hpxfer2 process for each. Hopefully this number of processes is correct.
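One quick way to tally these across both suites, if it's of any use:
ps -elf | grep -c '[p]ython2 -m rose.task_run'    # counts the rose.task_run python2 processes
ps -elf | grep -c '[r]sync -av'                   # counts the client-side rsync processes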
Patrick
P.S.: Here is the ps -elf listing:
pmcguire@ln03:~> !ps
ps -elf | grep pmcg
4 S pmcguire 24966 1 0 80 0 - 21298 ep_pol 13:26 ? 00:00:22 /usr/lib/systemd/systemd --user
5 S pmcguire 24969 24966 0 80 0 - 82233 - 13:26 ? 00:00:00 (sd-pam)
1 S pmcguire 25327 1 0 80 0 - 3974 - 13:26 ? 00:00:00 ssh-agent
0 S pmcguire 65642 1 0 80 0 - 2551 sigsus 13:29 ? 00:00:00 timeout --signal=XCPU 36000 /home/n02/n02/pmcguire/cylc-run/u-cl528/log/job/19490601T0000Z/pptransfer/11/job
0 S pmcguire 65643 65642 0 80 0 - 3529 do_wai 13:29 ? 00:00:00 /bin/bash /home/n02/n02/pmcguire/cylc-run/u-cl528/log/job/19490601T0000Z/pptransfer/11/job
0 S pmcguire 67108 65643 0 80 0 - 77607 do_wai 13:29 ? 00:00:02 python2 -m rose.task_run --verbose -O (archer2)
0 S pmcguire 67803 1 0 80 0 - 2551 sigsus 13:29 ? 00:00:00 timeout --signal=XCPU 36000 /home/n02/n02/pmcguire/cylc-run/u-cl527/log/job/19490601T0000Z/pptransfer/11/job
0 S pmcguire 67804 67803 0 80 0 - 3529 do_wai 13:29 ? 00:00:00 /bin/bash /home/n02/n02/pmcguire/cylc-run/u-cl527/log/job/19490601T0000Z/pptransfer/11/job
1 S pmcguire 68066 67108 0 80 0 - 26627 futex_ 13:29 ? 00:00:00 python2 -m rose.task_run --verbose -O (archer2)
1 S pmcguire 68067 67108 0 80 0 - 26627 pipe_w 13:29 ? 00:00:00 python2 -m rose.task_run --verbose -O (archer2)
1 S pmcguire 68068 67108 0 80 0 - 26628 futex_ 13:29 ? 00:00:00 python2 -m rose.task_run --verbose -O (archer2)
1 S pmcguire 68071 67108 0 80 0 - 26633 futex_ 13:29 ? 00:00:00 python2 -m rose.task_run --verbose -O (archer2)
1 S pmcguire 68072 67108 0 80 0 - 26628 futex_ 13:29 ? 00:00:00 python2 -m rose.task_run --verbose -O (archer2)
1 S pmcguire 68074 67108 0 80 0 - 26632 futex_ 13:29 ? 00:00:00 python2 -m rose.task_run --verbose -O (archer2)
0 S pmcguire 68224 67108 0 80 0 - 15058 pipe_w 13:29 ? 00:00:00 python /work/n02/n02/pmcguire/cylc-run/u-cl528/share/fcm_make_pptransfer/build/bin/transfer.py
0 S pmcguire 68285 67804 0 80 0 - 77606 do_wai 13:29 ? 00:00:02 python2 -m rose.task_run --verbose -O (archer2)
1 S pmcguire 68486 68285 0 80 0 - 26646 futex_ 13:29 ? 00:00:00 python2 -m rose.task_run --verbose -O (archer2)
1 S pmcguire 68487 68285 0 80 0 - 26646 futex_ 13:29 ? 00:00:00 python2 -m rose.task_run --verbose -O (archer2)
1 S pmcguire 68488 68285 0 80 0 - 26613 futex_ 13:29 ? 00:00:00 python2 -m rose.task_run --verbose -O (archer2)
1 S pmcguire 68489 68285 0 80 0 - 26647 futex_ 13:29 ? 00:00:00 python2 -m rose.task_run --verbose -O (archer2)
1 S pmcguire 68490 68285 0 80 0 - 26619 pipe_w 13:29 ? 00:00:00 python2 -m rose.task_run --verbose -O (archer2)
1 S pmcguire 68491 68285 0 80 0 - 26616 futex_ 13:29 ? 00:00:00 python2 -m rose.task_run --verbose -O (archer2)
0 S pmcguire 68506 68285 0 80 0 - 15058 pipe_w 13:29 ? 00:00:00 python /work/n02/n02/pmcguire/cylc-run/u-cl527/share/fcm_make_pptransfer/build/bin/transfer.py
4 S root 90914 50838 0 80 0 - 41603 - 14:39 ? 00:00:00 sshd: pmcguire [priv]
5 S pmcguire 92187 90914 0 80 0 - 41603 - 14:39 ? 00:00:00 sshd: pmcguire@pts/185
0 S pmcguire 92253 92187 3 80 0 - 8175 do_wai 14:39 pts/185 00:00:00 -bash
0 R pmcguire 93561 92253 99 80 0 - 11711 - 14:39 pts/185 00:00:00 ps -elf
0 S pmcguire 93562 92253 0 80 0 - 2177 pipe_w 14:39 pts/185 00:00:00 grep pmcg
0 R pmcguire 203659 68506 35 80 0 - 11192 - 13:54 ? 00:16:16 rsync -av --stats --rsync-path=mkdir -p /gws/nopw/j04/porcelain_rdg/pmcguire/archer2_archive2/u-cl527/19490601T0000Z && rsync /work/n02/n02/pmcguire/archive/u-cl527/19490601T0000Z/ hpxfer2.jasmin.ac.uk:/gws/nopw/j04/porcelain_rdg/pmcguire/archer2_archive2/u-cl527/19490601T0000Z
0 S pmcguire 203660 203659 37 80 0 - 13511 poll_s 13:54 ? 00:16:44 ssh hpxfer2.jasmin.ac.uk mkdir -p /gws/nopw/j04/porcelain_rdg/pmcguire/archer2_archive2/u-cl527/19490601T0000Z && rsync --server -vlogDtpre.iLsfxC --stats . /gws/nopw/j04/porcelain_rdg/pmcguire/archer2_archive2/u-cl527/19490601T0000Z
0 R pmcguire 210557 68224 36 80 0 - 11257 - 13:55 ? 00:16:20 rsync -av --stats --rsync-path=mkdir -p /gws/nopw/j04/porcelain_rdg/pmcguire/archer2_archive2/u-cl528/19490601T0000Z && rsync /work/n02/n02/pmcguire/archive/u-cl528/19490601T0000Z/ hpxfer2.jasmin.ac.uk:/gws/nopw/j04/porcelain_rdg/pmcguire/archer2_archive2/u-cl528/19490601T0000Z
0 S pmcguire 210558 210557 37 80 0 - 13307 poll_s 13:55 ? 00:16:46 ssh hpxfer2.jasmin.ac.uk mkdir -p /gws/nopw/j04/porcelain_rdg/pmcguire/archer2_archive2/u-cl528/19490601T0000Z && rsync --server -vlogDtpre.iLsfxC --stats . /gws/nopw/j04/porcelain_rdg/pmcguire/archer2_archive2/u-cl528/19490601T0000Z
Is it possible somehow to add a quick rsync-connection test at the beginning of the pptransfer app, prior to the checksum computation and the real rsync task? That way, connection problems could be diagnosed much more quickly.
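Something along these lines at the start of the app would be enough to catch an authentication failure within seconds rather than after an hour (the host and timeout here are just examples):
# fail fast if password-less ssh to the transfer host is broken
ssh -o BatchMode=yes -o ConnectTimeout=10 hpxfer2.jasmin.ac.uk true \
  || { echo "ERROR: cannot reach hpxfer2.jasmin.ac.uk non-interactively" >&2; exit 1; }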
Patrick
It looks like the pptransfer app has now succeeded for the 1st cycle for both suites.
Thanks for the help and advice,
Patrick