I am archiving GC31 data from ARCHER2 to JASMIN with ‘pptransfer’. The transfer takes about 50 minutes and the checksums compare successfully at the end. However, about 10 minutes into the transfer the cylc GUI reports the task as ‘failed’, even though I don’t get any error message and the rsync process continues in the background and finishes successfully.
This is the pptransfer job.status of the most recent cycle (suite u-cn502@18520101T0000Z) that is still listed as ‘failed’ in the cylc GUI:
It seems to me that ‘CYLC_BATCH_SYS_EXIT_POLLED’ is what reports the ‘failed’ state while the task is still running. If I manually set the pptransfer task to ‘succeeded’, the cycling of the next jobs continues fine.
I can see why it’s thinking it’s failed but I don’t understand how it got into the situation.
The problem is that it was polling the wrong login node. According to the job.out file the job was running on ln03, but I think cylc polled ln04, so it wouldn’t find the task running and would flag it as failed. What I don’t understand is that you have correctly set [[PPTRANSFER]] in the suite.rc file to run on a specific login node (ln04), which should stop that situation from happening. Indeed, pptransfer for the first 2 cycles correctly ran on ln04.
I can only suggest waiting and seeing what happens with the next cycle.
Thanks for your reply. Unfortunately, the next cycle (18530101T0000Z) also failed, even though it ran on ln04 according to the job.out.
But there seems to be something wrong in general with my setup. I checked another suite (u-cn503) and the logs also indicate that the pptransfer node is different to what I specify in the suite.rc file. The GUI in the screenshot below for cycles 1852 and 1853 matches my suite.rc definition (I tried both ln04 and ln03), but the job.out shows a different login node.
Host archer2 login*.archer2.ac.uk
Hostname = login.archer2.ac.uk
User = ssteinig
IdentityFile = ~/.ssh/id_rsa_archerum
ForwardX11 = no
ForwardX11Trusted = no
Having Hostname = login.archer2.ac.uk in that block means that when you do ssh login4.archer2.ac.uk, ssh actually connects to login.archer2.ac.uk, which then picks any login node at random. So the node the job runs on need not be the one named in the suite.rc.
Remove the Hostname line and change the Host line to ‘Host login*.archer2.ac.uk’.
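For reference, here is a sketch of what the ~/.ssh/config block might look like after that change. It assumes the other options are kept exactly as posted above; the comment lines are added for illustration only.

```
# Matches login1.archer2.ac.uk ... login4.archer2.ac.uk by the name given on
# the command line. With no Hostname override, ssh connects to exactly the
# node that was requested.
Host login*.archer2.ac.uk
User = ssteinig
IdentityFile = ~/.ssh/id_rsa_archerum
ForwardX11 = no
ForwardX11Trusted = no
```

Note that the short alias ‘archer2’ is no longer matched by this block, so it would need its own Host entry if you still use it. If your OpenSSH version supports the -G option, running ssh -G login4.archer2.ac.uk prints the configuration ssh would use without connecting; the hostname line in that output should then read login4.archer2.ac.uk rather than login.archer2.ac.uk.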