Pptransfer works fine but reports 'failed'

Hi,

I am archiving GC31 data from ARCHER2 to JASMIN with ‘pptransfer’. The transfer takes about 50 minutes and the checksums compare successfully at the end. However, about 10 minutes into the transfer the cylc GUI reports the task as ‘failed’, even though I get no error message and the rsync process carries on in the background and finishes successfully.

This is the pptransfer job.status of the most recent cycle (suite u-cn502@18520101T0000Z) that is still listed as ‘failed’ in the cylc GUI:

CYLC_BATCH_SYS_NAME=background
CYLC_BATCH_SYS_JOB_ID=234433
CYLC_BATCH_SYS_JOB_SUBMIT_TIME=2022-04-25T08:06:10Z
CYLC_JOB_PID=234433
CYLC_JOB_INIT_TIME=2022-04-25T08:06:12Z
CYLC_BATCH_SYS_EXIT_POLLED=2022-04-25T08:16:20Z
CYLC_JOB_EXIT=SUCCEEDED
CYLC_JOB_EXIT_TIME=2022-04-25T08:54:01Z

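It seems to me that the ‘CYLC_BATCH_SYS_EXIT_POLLED’ entry, logged at 08:16:20Z (about 10 minutes after the job started), is what marks the task as ‘failed’ while it is still running; the job itself only exits at 08:54:01Z, with status SUCCEEDED. If I manually set the pptransfer task to ‘succeeded’, the following cycles continue fine.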

Do you have any idea what might cause this issue?

Many thanks,
Seb

Hi Sebastien,

I can see why cylc thinks the task has failed, but I don’t understand how it got into that situation.

The problem is that cylc was polling the wrong login node. According to the job.out file the job was running on ln03, but I think cylc polled ln04; it wouldn’t find the task running there and so would flag it as failed. What I don’t understand is that you have correctly set [[PPTRANSFER]] in the suite.rc file to run on a specific login node (ln04), which should stop that situation from happening. Indeed, pptransfer ran correctly on ln04 for the first 2 cycles.

I can only suggest waiting and seeing what happens with the next cycle.

Cheers,
Ros.

Hi Ros,

Thanks for your reply. Unfortunately, the next cycle (18530101T0000Z) also failed, even though it ran on ln04 according to the job.out.

But there seems to be something wrong in general with my setup. I checked another suite (u-cn503) and the logs also indicate that the pptransfer node is different to what I specify in the suite.rc file. The GUI in the screenshot below for cycles 1852 and 1853 matches my suite.rc definition (I tried both ln04 and ln03), but the job.out shows a different login node.

job.out:

Suite    : u-cn503
Task Job : 18520101T0000Z/pptransfer/01 (try 1)
User@Host: ssteinig@ln03
Suite    : u-cn503
Task Job : 18530101T0000Z/pptransfer/01 (try 1)
User@Host: ssteinig@ln04

So somehow my [[PPTRANSFER]] setting in the suite.rc is not being used for the actual pptransfer job. Could it be that I am overriding the suite.rc somehow?

Many thanks,
Seb

Hi Sebastian,

I think I’ve found the problem:

In your ~/.ssh/config file you have:

Host archer2 login*.archer2.ac.uk
    Hostname = login.archer2.ac.uk
    User = ssteinig
    IdentityFile = ~/.ssh/id_rsa_archerum
    ForwardX11 = no
    ForwardX11Trusted = no

Because of the Hostname = login.archer2.ac.uk line, running ssh login4.archer2.ac.uk actually connects to login.archer2.ac.uk, which then picks a login node at random.

Remove the Hostname line and change the Host line to Host login*.archer2.ac.uk
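
With both changes applied, the entry would look roughly like this (a sketch, assuming the remaining options stay exactly as they are):

```
# Matches only the numbered login nodes; with no Hostname override,
# ssh connects to whichever host name was actually requested.
Host login*.archer2.ac.uk
    User = ssteinig
    IdentityFile = ~/.ssh/id_rsa_archerum
    ForwardX11 = no
    ForwardX11Trusted = no
```

To check the result, ssh -G login4.archer2.ac.uk prints the options ssh would use for that host without connecting; the hostname line in its output should now read login4.archer2.ac.uk rather than login.archer2.ac.uk.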

Regards,
Ros.

Hi Ros,

Thank you so much for spotting this - it indeed solved the problem!

The pptransfer polling works fine now.

Really appreciate your help.

Best,
Seb
