Globus transfer failures

I’m getting a bunch of pptransfer failures. Looking at the Globus errors, they are all of the same form – see below.

Is this an ARCHER2 or JASMIN failure? Asking so I can work out which help desk to complain to…

Simon

Error (transfer)

Endpoint: Archer2 file systems (3e90d018-0d05-461a-bbaf-aab605283d21)
Server: 193.62.216.42:443
File: /work/n02/n02/tetts/cylc-run/u-dr157/run7/share/cycle/19891001T0000Z/dr157a.pd1989nov.pp
Command: RETR /work/n02/n02/tetts/cylc-run/u-dr157/run7/share/cycle/19891001T0000Z/dr157a.pd1989nov.pp
Message: Data channel authentication failed

Details: 500-Command failed. : globus_xio: The GSI XIO driver failed to establish a secure connection. The failure occured during a handshake read.\r\n500-globus_xio: Operation was canceled\r\n500-globus_xio: Operation timed out\r\n500 End.\r\n

Hi Simon,

If you are still getting this error, please contact the ARCHER2 helpdesk and send them the error above, which tells them which server is having problems.

Regards,
Ros.

Hi Ros,

I had a look, with xconv, at the files that had made it to JASMIN. This suggested they had got corrupted. So, after checking that the data was still on ARCHER2, I removed the data from JASMIN, retriggered the pptransfer, and this time it went through. As I don’t really understand why it worked, I don’t think this is a solution… unless I am the only person having Globus transfer problems…

Simon

I got the below from the JASMIN help desk (with edits to remove people’s names). Is a solution to change pptransfer to give Globus a short deadline – say an hour – and then have pptransfer retry (if it failed) after an hour or so? That way it might get another node…

Simon

Hi
This report outlines the current performance issues you might be experiencing with Globus transfers and explains the underlying causes.

Root Cause of Performance Problems

The performance challenges you’re observing with Globus transfers aren’t due to Globus itself. Instead, they stem from intermittent issues with the ability of our Globus transfer nodes to read and write to the QuoByte filesystem.

We operate a pool of five transfer nodes, which are automatically assigned to your transfers by a load balancer. If your transfer happens to be picked up by a node currently experiencing QuoByte problems, it’s likely that the transfer will undergo numerous retries before it eventually succeeds.

We are actively working to identify these problematic nodes as they occur and temporarily remove them from the pool for rebooting. However, this process is currently manual.

Data Integrity and Transfer Retries

To ensure data integrity, a Globus transfer typically involves a checksum validation unless specifically disabled by you or your workflow. If this integrity check fails, the transfer automatically retries. This mechanism accounts for both the slow overall performance and the intermittent nature of the issue: if your transfer lands on a healthy node, it will proceed quickly as expected.

Similarly, unless a transfer is explicitly canceled by a user, it is designed to continue retrying until it succeeds (up to a very high limit that is rarely encountered). Many of the “errors” you might see are actually just informative messages indicating that the task is being retried, rather than a definitive failure. While this ensures eventual completion, it can manifest as a slow overall transfer speed.

If you require a transfer to “bail out” earlier than this automatic retry limit, you can set an earlier deadline using the transfer command.
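As a hedged sketch of that suggestion: the Globus CLI’s `globus transfer` command accepts a `--deadline` option (a timestamp after which the task gives up rather than retrying indefinitely) and a `--verify-checksum` flag for the integrity check discussed above. The one-hour window, the endpoint IDs, and the paths below are illustrative placeholders, not values from this thread:

```python
# Sketch (illustrative, not the pptransfer implementation): give a Globus
# transfer a one-hour deadline so a task stuck on a bad node fails fast,
# letting the calling workflow retry and hopefully land on a healthy node.
from datetime import datetime, timedelta, timezone


def deadline_in(hours: int) -> str:
    """Return a UTC deadline string in a format the Globus CLI accepts."""
    t = datetime.now(timezone.utc) + timedelta(hours=hours)
    return t.strftime("%Y-%m-%d %H:%M:%S")


def transfer_command(src: str, dst: str, hours: int = 1) -> list[str]:
    # src/dst are "ENDPOINT_ID:/path" strings; both are placeholders here.
    return [
        "globus", "transfer",
        "--deadline", deadline_in(hours),  # bail out after this time
        "--verify-checksum",               # keep the integrity check on
        src, dst,
    ]


if __name__ == "__main__":
    # Hypothetical endpoint IDs and paths, for illustration only.
    cmd = transfer_command(
        "SRC_ENDPOINT_ID:/work/example/file.pp",
        "DST_ENDPOINT_ID:/gws/example/file.pp",
    )
    print(" ".join(cmd))
```

The command list could be passed to `subprocess.run` from a retry wrapper; on failure the wrapper would simply resubmit, which is roughly the behaviour being proposed for pptransfer above.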

Please let us know if you have any further questions or require additional assistance.


Hi Simon,

Yes, Matt has been in touch with me and that is one workaround we have been looking at.

Regards,
Ros.


Ros,

Over the last few days all transfers have been going through… so maybe nothing needs to be done!

Simon

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.