Suites get stuck on pptransfer?

Hi,

I am running suites similar to u-as037, which were all running fine until 15/01/22. However, now these suites (e.g. u-ck768) often seem to get stuck on the pptransfer step: on PUMA, this step will be in state ‘running’ for over a day, but on JASMIN, none of the files have been transferred to the expected place, and on ARCHER2, command squeue -u my-username shows no jobs running. When I manually stop and then re-trigger pptransfer, I get the same outcome: PUMA still submits the job and says it is ‘running’, but it does not show up on ARCHER2.
I have checked that the connection between PUMA and ARCHER2 is fine (I get the expected error message ‘Command rejected by policy. Not in authorised list’). I have also checked that I can ssh from ARCHER2 to both JASMIN hpxfer servers without being asked for a password, so that should also be fine.
What should I try next?

Best wishes,

Rachel

Hi Rachel,

The pptransfer tasks currently run in the background on the ARCHER2 login nodes, as we are unable to use the serial nodes for this at the moment. This is why you’re not seeing them show up with squeue. If you log in to login3.archer2.ac.uk, where your transfers are being submitted, and run ps -flu radiam24 | grep rsync, you will see the rsync commands being run for 3 of your suites.

ARCHER2-23cab> ps -flu radiam24 |grep rsync
0 S radiam24  17509 204904  0  80   0 - 11193 -      16:08 ?        00:00:00 rsync -av --stats --rsync-path=mkdir -p /gws/nopw/j04/pmip4_vol1/users/rachel/ARCHER2_archive/u-ck767/18550801T0000Z && rsync /work/n02/n02/radiam24/archive/u-ck767/18550801T0000Z/ hpxfer2.jasmin.ac.uk:/gws/nopw/j04/pmip4_vol1/users/rachel/ARCHER2_archive/u-ck767/18550801T0000Z
0 S radiam24  17510  17509  0  80   0 - 13218 -      16:08 ?        00:00:00 ssh hpxfer2.jasmin.ac.uk mkdir -p /gws/nopw/j04/pmip4_vol1/users/rachel/ARCHER2_archive/u-ck767/18550801T0000Z && rsync --server -vlogDtpre.iLsfxC --stats . /gws/nopw/j04/pmip4_vol1/users/rachel/ARCHER2_archive/u-ck767/18550801T0000Z
0 S radiam24 182268 119455  0  80   0 - 11192 -      Jan17 ?        00:05:18 rsync -av --stats --rsync-path=mkdir -p /gws/nopw/j04/pmip4_vol1/users/rachel/ARCHER2_archive/u-ck768/18530601T0000Z && rsync /work/n02/n02/radiam24/archive/u-ck768/18530601T0000Z/ hpxfer1.jasmin.ac.uk:/gws/nopw/j04/pmip4_vol1/users/rachel/ARCHER2_archive/u-ck768/18530601T0000Z
0 S radiam24 182269 182268  0  80   0 - 13498 -      Jan17 ?        00:01:39 ssh hpxfer1.jasmin.ac.uk mkdir -p /gws/nopw/j04/pmip4_vol1/users/rachel/ARCHER2_archive/u-ck768/18530601T0000Z && rsync --server -vlogDtpre.iLsfxC --stats . /gws/nopw/j04/pmip4_vol1/users/rachel/ARCHER2_archive/u-ck768/18530601T0000Z
0 S radiam24 210367   7915  0  80   0 - 11193 -      16:22 ?        00:00:00 rsync -av --stats --rsync-path=mkdir -p /gws/nopw/j04/pmip4_vol1/users/rachel/ARCHER2_archive/u-ck765/18551001T0000Z && rsync /work/n02/n02/radiam24/archive/u-ck765/18551001T0000Z/ hpxfer2.jasmin.ac.uk:/gws/nopw/j04/pmip4_vol1/users/rachel/ARCHER2_archive/u-ck765/18551001T0000Z
0 S radiam24 210368 210367  0  80   0 - 13323 -      16:22 ?        00:00:00 ssh hpxfer2.jasmin.ac.uk mkdir -p /gws/nopw/j04/pmip4_vol1/users/rachel/ARCHER2_archive/u-ck765/18551001T0000Z && rsync --server -vlogDtpre.iLsfxC --stats . /gws/nopw/j04/pmip4_vol1/users/rachel/ARCHER2_archive/u-ck765/18551001T0000Z

All the pptransfer tasks I’ve looked at for u-ck768 appear to have worked, with checksums verified, indicating that the data has landed on JASMIN at /gws/nopw/j04/pmip4_vol1/users/rachel/ARCHER2_archive/u-ck768.

Cheers,
Ros.

Hi Rachel,

I’ve had another look this morning and see that they are all stuck. I don’t know what’s happened as there is no output to go on. I can only suggest killing those 3 tasks in the cylc GUI and then logging into login3.archer2.ac.uk to double check the rsync commands have been stopped.

You could then check that connections are fine by running one of the rsync commands manually on the command line and see if it works:

E.g.
rsync -av --stats --rsync-path="mkdir -p /gws/nopw/j04/pmip4_vol1/users/rachel/ARCHER2_archive/u-ck765/18551001T0000Z && rsync" /work/n02/n02/radiam24/archive/u-ck765/18551001T0000Z/ hpxfer2.jasmin.ac.uk:/gws/nopw/j04/pmip4_vol1/users/rachel/ARCHER2_archive/u-ck765/18551001T0000Z
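
When testing a transfer by hand like this, it can help to make a stalled rsync fail instead of sitting idle. The sketch below wraps rsync with its --timeout option, which makes rsync give up when no data has moved for the given number of seconds. The function name rsync_with_timeout and the 300-second default are illustrative assumptions, not part of the pptransfer task, and whether a timeout would actually trigger for a given stall is not guaranteed.

```shell
# Hypothetical helper (not part of pptransfer): copy $1 to $2 with rsync,
# aborting if no data is transferred for $3 seconds (default 300).
# rsync exits with code 30 when the I/O timeout is hit.
rsync_with_timeout() {
    rsync -av --stats --timeout="${3:-300}" "$1" "$2"
}
```

Called with the source directory and hpxfer destination from the command above, a stuck transfer should then return a non-zero exit status that can be inspected with echo $?.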

If they still don’t work when you resubmit, I would suggest switching to running the transfers on a different login node to see if that helps.

Regards,
Ros.

Hi Ros,

Thanks so much. I’ve tried everything you suggested (except for switching the transfers to a different login node, as at the moment I have pptransfer tasks stuck on both login3 and login4). Killing the tasks in the cylc GUI does stop them on login3.archer2.ac.uk. If I resubmit any pptransfer tasks, they still just get stuck.

I think the problem is the rsync command:

When I run the following -
rsync -av --stats --rsync-path="mkdir -p /gws/nopw/j04/pmip4_vol1/users/rachel/ARCHER2_archive/u-ck765/18551001T0000Z && rsync" /work/n02/n02/radiam24/archive/u-ck765/18551001T0000Z/ hpxfer2.jasmin.ac.uk:/gws/nopw/j04/pmip4_vol1/users/rachel/ARCHER2_archive/u-ck765/18551001T0000Z

On JASMIN, directory 18551001T0000Z is created containing the checksum files. I get the following output on ARCHER2:

sending incremental file list
checksums
cice_ck765i_1d_18551001-18551101.nc

However, after this, no further filenames are printed on ARCHER2 and no more files are transferred to JASMIN (I waited for 20 mins, and I think the whole pptransfer task usually takes around 10 mins).

On login3 on ARCHER2, when I run ps -flu radiam24 | grep rsync after waiting on rsync for 20 mins, I get the following output:

radiam24@ln03:~> ps -flu radiam24 |grep rsync
0 S radiam24 35401 129415 0 80 0 - 11190 poll_s 11:34 pts/17 00:00:00 rsync -av --stats --rsync-path=mkdir -p /gws/nopw/j04/pmip4_vol1/users/rachel/ARCHER2_archive/u-ck765/18551001T0000Z && rsync /work/n02/n02/radiam24/archive/u-ck765/18551001T0000Z/ hpxfer2.jasmin.ac.uk:/gws/nopw/j04/pmip4_vol1/users/rachel/ARCHER2_archive/u-ck765/18551001T0000Z
0 S radiam24 35402 35401 0 80 0 - 13462 poll_s 11:34 pts/17 00:00:00 ssh hpxfer2.jasmin.ac.uk mkdir -p /gws/nopw/j04/pmip4_vol1/users/rachel/ARCHER2_archive/u-ck765/18551001T0000Z && rsync --server -vlogDtpre.iLsfxC --stats . /gws/nopw/j04/pmip4_vol1/users/rachel/ARCHER2_archive/u-ck765/18551001T0000Z
0 S radiam24 45047 151072 0 80 0 - 2177 pipe_w 11:35 pts/194 00:00:00 grep rsync

Do you have any more suggestions? The only thing I may have changed since pptransfer was working normally last week was, for a few suites, increasing the runtime and restarting the suite, and, for a few other suites, changing the transfer server from hpxfer1 to hpxfer2, so I’m not sure where this problem could have come from.

Best wishes,

Rachel

Hi Rachel,

Just for clarity can you run:

rsync -av --stats /work/n02/n02/radiam24/archive/u-ck765/18551001T0000Z/ hpxfer2.jasmin.ac.uk:/gws/nopw/j04/pmip4_vol1/users/rachel/ARCHER2_archive/u-ck765/18551001T0000Z

This command is working fine for me transferring your data but to a different GWS.

We have several long-running suites which are using the full rsync with the mkdir, so the command itself is formulated OK. I’m going to request access to the PMIP GWS to see if I can see anything obvious. At the moment I don’t have a feel for where the issue lies: at the ARCHER2 end or the JASMIN end.

Regards,
Ros.

Hi Ros,

I had a look and I think gws pmip4_vol1 ran out of space, so I cleared space and now the pptransfer tasks are running fine. Sorry for bothering you about this.

Best wishes,

Rachel
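
Since the cause turned out to be a full workspace, a quick free-space check on the destination is worth doing before digging into rsync itself. The following is a minimal sketch using POSIX df; the function names avail_kb and has_space are made up for illustration, and it assumes it is run on a machine where the target path is mounted.

```shell
# Print the space still available (in KiB) on the filesystem holding path $1.
# -P selects the portable one-line-per-filesystem format, -k reports KiB;
# the fourth column of the data line is the "Available" figure.
avail_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Succeed only if at least $2 KiB are free under path $1.
has_space() {
    [ "$(avail_kb "$1")" -ge "$2" ]
}
```

For example, has_space /gws/nopw/j04/pmip4_vol1 1048576 || echo "less than 1 GiB free" would flag a nearly full volume. A GWS may also be subject to limits that df does not report, so treat this as a first check only.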
