Hi,
Sorry to bother you, but I am having trouble with my pptransfer. I get the following error message in my job.err:
[WARN] Transfer command failed: globus transfer --format unix --jmespath 'task_id' --recursive --fail-on-quota-errors --sync-level checksum --label u-df570/18500101T0000Z --verify-checksum --notify off 3e90d018-0d05-461a-bbaf-aab605283d21:/work/n02/n02/cjrw09/cylc-run/u-df570/share/cycle/18500101T0000Z/u-df570/18500101T0000Z a2f53b7f-1b4e-4dce-9b7c-349ae760fee0:/gws/nopw/j04/pmip4_vol2/users/cwilliams2011/archer2.d/output.d/pi.d/u-df570/18500101T0000Z
[ERROR] transfer.py: Transfer Error: Checksum validation failed (ReturnCode=4)
[FAIL] Command Terminated
[FAIL] Terminating PostProc…
[FAIL] transfer.py <<'STDIN'
[FAIL]
[FAIL] 'STDIN' # return-code=1
2026-01-27T03:12:29Z CRITICAL - failed/EXIT
Just above this, the log tells me to run globus login, but when I do, it says command not found.
What have I done wrong? My suite is u-df570.
Thanks a lot,
Charlie
Hi Charlie,
It’s a JASMIN maintenance day today, so all services are at risk. I wonder if that is the source of your issue?
I have had pptransfer failures with my runs as well, although with different error messages. I would recommend retrying the transfer tomorrow.
Annette
Hi Charlie,
Once JASMIN is back from maintenance, if you still get the same issue, run the globus login command as instructed. Before you can run globus commands on the ARCHER2 command line, though, you need to load the globus-cli module:
module load globus-cli/3.35.2
Regards,
Ros.
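For reference, here is a minimal sketch of how one might confirm that the globus command is available and that a login is still valid. The function name ensure_globus_login is made up for illustration. globus whoami is a standard globus-cli subcommand that fails when there is no active login.

```shell
#!/usr/bin/env bash
# Sketch only: ensure_globus_login is a hypothetical helper, not part of
# globus-cli or pptransfer.

ensure_globus_login() {
    # The globus command only exists once the globus-cli module is loaded.
    if ! command -v globus >/dev/null 2>&1; then
        echo "globus not found: run 'module load globus-cli/3.35.2' first" >&2
        return 1
    fi
    # 'globus whoami' prints the logged-in identity, or fails if there is none.
    if globus whoami >/dev/null 2>&1; then
        echo "logged in"
    else
        echo "not logged in: run 'globus login'" >&2
        return 2
    fi
}
```

After loading the module, calling ensure_globus_login prints "logged in" if the existing login is still valid, and otherwise says which step is missing.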
Thanks both, I will try again tomorrow.
Bizarrely, I checked a couple of hours ago and although pptransfer is still failing, or rather retrying, all of my output has correctly gone to JASMIN.
Hi Charlie,
Globus transfers are asynchronous: once submitted, a task will by default keep retrying for 24 hours or so. pptransfer has its own timeout, usually 3 hours, so if Globus is down for a while and the transfer takes 5 hours, pptransfer will show as failed even though the actual Globus task may have succeeded. I’d recommend not having retries on the pptransfer task; if it fails, check manually in the Globus web app whether the transfer really failed, and then either set the task to succeeded or retrigger it. Most of the time Globus will complete within pptransfer’s 3-hour limit, so pptransfer can detect its successful completion, but obviously during JASMIN maintenance transfers can and will take longer.
Cheers,
Ros
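As an alternative to the web app, the state of a transfer can also be queried from the command line once globus-cli is loaded and you are logged in. The following is a sketch only. The helper names are hypothetical, and it assumes you already have the Globus task ID (for example from the output of globus task list). The status values are those reported by the Globus Transfer service.

```shell
#!/usr/bin/env bash
# Sketch only: globus_task_status and interpret_status are hypothetical helpers.

globus_task_status() {
    # $1 = Globus task ID. Prints the bare status string, using the same
    # --format unix / --jmespath options that pptransfer passes to globus.
    globus task show --format unix --jmespath 'status' "$1"
}

interpret_status() {
    # $1 = status string reported by Globus for the task.
    case "$1" in
        SUCCEEDED) echo "transfer complete: set pptransfer to succeeded" ;;
        FAILED)    echo "transfer failed: retrigger pptransfer" ;;
        ACTIVE)    echo "transfer still running: check again later" ;;
        INACTIVE)  echo "transfer suspended: check your Globus login" ;;
        *)         echo "unrecognised status: $1" ;;
    esac
}
```

With a real task ID in TASK_ID, interpret_status "$(globus_task_status "$TASK_ID")" would print the suggested next step.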
Hi Ros,
Sorry for the delay. Okay, having checked my run this morning, pptransfer has indeed failed. So I did what you suggested i.e.
module load globus-cli/3.35.2
globus login
and entered the authorisation code provided, and it now says I am logged in. So should I now retrigger pptransfer? Or do I not need to, given that all of my data transferred correctly yesterday, despite the JASMIN maintenance? I have just checked, and everything is there as expected.
I confess I am slightly confused here, as to what is actually doing the transfer? Is it Globus, or pptransfer? Or does pptransfer specify the location of where the data should be transferred to, whereas Globus does the actual transferring? If so, I don’t understand how Globus can make the transfer if pptransfer has failed? You say that Globus will keep trying for 24 hours whereas pptransfer times out after usually 3 hours or so. But if it is not necessary for pptransfer to have succeeded in order for Globus to do the transfer, why do we run pptransfer at all? Sorry, these are probably daft questions, but I would like to understand exactly what is going on here.
Either way, how often should I do the above Globus module commands? Just if pptransfer fails, or more regularly? And how do I turn off automatic retries with pptransfer? Maybe in ~/roses/u-df570/app/postproc/rose-app.conf ?
Charlie
Hi Charlie,
pptransfer uses the globus-cli to submit the data transfer request to Globus.
Globus is an asynchronous service so the data transfer may not happen immediately.
pptransfer waits to hear back from the Globus service that the transfer has succeeded. As you know, with Cylc each task has a timeout on it. For pptransfer that is usually 3 hours.
Globus will usually complete within this time, but it may not. If JASMIN is down, for instance, Globus will be unable to complete in time, so the Cylc pptransfer task will indicate failure BUT the Globus task will continue trying. You then simply need to check on the Globus task on the app.globus.com website and, if it was successful, manually set the pptransfer task to succeeded.
Globus authentication needs renewing every month.
Looking at your suite, its setup does not make it easy to turn off the retries just for pptransfer.
You will need to edit the file site/archer2.rc and in the [[POSTPROC_RESOURCE]] section remove the line:
execution retry delays = PT10M, PT1H, PT3H, P1D
NOTE, however, that this will also turn off the automatic retries for the postproc tasks.
Cheers,
Ros
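The two manual follow-up actions can also be done from the command line. This is a sketch that assumes the suite runs under Cylc 7 (suggested by the .rc file, but an assumption; Cylc 8 uses different commands) and that the task is named pptransfer in the suite graph. The helper names and the DRY_RUN variable are invented for illustration.

```shell
#!/usr/bin/env bash
# Sketch only. Assumes Cylc 7 command syntax and a task called "pptransfer".
# Set DRY_RUN=1 to print the command instead of running it.

mark_pptransfer_succeeded() {
    # $1 = suite name, $2 = cycle point
    ${DRY_RUN:+echo} cylc reset --state=succeeded "$1" "pptransfer.$2"
}

retrigger_pptransfer() {
    # $1 = suite name, $2 = cycle point
    ${DRY_RUN:+echo} cylc trigger "$1" "pptransfer.$2"
}

# Dry run with the suite and cycle from this thread: prints the command only.
DRY_RUN=1 mark_pptransfer_succeeded u-df570 18500101T0000Z
```

Use mark_pptransfer_succeeded once Globus reports that the transfer completed, and retrigger_pptransfer if it genuinely failed.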
Thanks very much Ros, I now completely understand. I think I will leave the automatic retries on i.e. as is, just to avoid any further complications. It just means that a little bit of babysitting will be required when I start to run proper simulations (today, in fact), because presumably if (for whatever reason) JASMIN is down, pptransfer will fail (even if subsequently Globus transfers the data) which will hold up the next cycle. So I will need to manually reset it to “Succeeded” in order to move on to my next cycle, in my case year.
Charlie