pptransfer/Globus

Hi,

Globus transfers have been running very slowly since about 11ish today. I have 10 simulations running (in scratch, I think), though the data on ARCHER2 is being successfully archived to /mnt/lustre/a2fs-work2/work/n02/shared/tetts/opt_dfols26/XXXXX/output/opt_dfols26/XXXXX/CYCLE_POINT, where XXXXX is the job name. Normally the transfers from ARCHER2 to JASMIN take 1-2 minutes, moving about 180 MB. But the four jobs in at the moment are all getting timeout/endpoint errors. Is something up with Globus at the moment?

Globus IDs for the slow jobs are below.

933ec3df-bee6-11f0-b35f-0e092d85c59b worked, taking 45 seconds at an effective speed of ~4 MB/s. It ran at 10:09.

The four slow jobs are:

c7839bc4-beed-11f0-937e-027493648695

8e72676d-beed-11f0-b9c2-027493648695

8e72a865-beed-11f0-ab5e-0e092d85c59b

8e7635be-beed-11f0-8133-027493648695

Simon

I note that pptransfer has a three-hour time limit, and when it fails (because it runs out of time) Globus continues to try to transfer the data. I have just manually stopped some of those jobs.
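In case it's useful to anyone else tidying up: the Globus CLI (assuming it's installed and you are logged in) can list and cancel stuck transfers by task ID. A sketch, using the task commands as I understand them:

```shell
# List active transfer tasks to find the stuck ones
globus task list --filter-status ACTIVE

# Inspect one task (shows bytes transferred, faults, endpoints)
# Task ID below is one of the slow jobs from this thread
globus task show 8e72676d-beed-11f0-b9c2-027493648695

# Cancel it once you're sure it's dead
globus task cancel 8e72676d-beed-11f0-b9c2-027493648695
```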

Is there a reason why pptransfer has such a long time limit?

Is Globus robust to having two transfer jobs for the same data running at the same time? Could this explain my zero-size files? A second pptransfer submits a Globus job; another one is already running, so pptransfer exits with status succeeded. Housekeeping then runs, cleaning out data… or in my case, once the simulation has run, my post-processing runs and removes the data.

If so, would I be better off not having pptransfer try a second time when it fails? Then I can intervene manually.

And what would happen if I reduced pptransfer’s time limit to, say, 30 minutes? Almost all the time my transfers take 2-3 minutes, so if Globus is going slow I’d rather fail quickly…

Simon

Hi Simon,

I’m getting similar problems today, so maybe it’s a Globus issue!

Hannah

Hi Simon,

If you are seeing persistent issues with either the ARCHER2 or JASMIN endpoint, you need to contact ARCHER2 or JASMIN support, as we don’t have access to diagnostic information at either end.

Is there a reason why pptransfer has such a long time limit?

You can set the time limit to whatever you want. 3 hours is just the standard UM pptransfer time limit, which we found allows for intermittent issues with Globus. If it’s too short and Globus is still trying, then the pptransfer task will fail and you will have to manually check on the Globus process and set the pptransfer task to succeeded. If your transfer time is generally only a couple of minutes, then by all means reduce the time limit.
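For reference, the time limit is just the standard Cylc setting on the task. A hypothetical sketch — the exact file (often a site .rc file or flow.cylc) and task name depend on your particular suite:

```ini
# Shorten the Cylc execution time limit for the pptransfer task
# (ISO 8601 duration; the UM default discussed here is PT3H)
[runtime]
    [[pptransfer]]
        execution time limit = PT30M
```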

Is globus robust to having two transfer jobs for the same data running at the same time?

Globus will not allow the same data transfer to be running twice. If the pptransfer task times out and then retries, it will immediately fail, as Globus will say the request cannot be submitted because there is already one in progress. Thus if you set the pptransfer time limit too short, i.e. not allowing for slightly slower Globus transfer times, then you will likely end up needing to intervene manually more often.

There isn’t a one-size-fits-all answer; you will need to experiment to find what works best for your particular workflow.

Cheers,
Ros

Hi Ros,

Thanks a lot for that — very helpful. Is there a way of telling whether the problems are at the JASMIN or ARCHER2 end? I had a lot of timeouts and the job runs on ARCHER2, so I’d guess JASMIN. When I report this, should I give them the Globus task IDs?

When pptransfer initiates a Globus transfer, it has an internal limit of, I think, a few days. Would it be sensible to modify pptransfer to reduce that to something comparable to the job time limit? That way, if the job times out, the Globus transfer would time out soon after.
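For what it's worth, I believe the Globus transfer API accepts a deadline at submission time, after which it aborts an unfinished transfer — so something like this might already be possible from the CLI side. A sketch only: the collection UUIDs and paths below are placeholders, and I'm assuming the current CLI flags:

```shell
# Submit a transfer that Globus will abort at the given time
# SRC_COLLECTION_ID, DST_COLLECTION_ID and both paths are placeholders
globus transfer --recursive \
    --deadline "2025-01-01 12:00:00" \
    SRC_COLLECTION_ID:/work/n02/shared/example/output \
    DST_COLLECTION_ID:/gws/nopw/j04/example/output
```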

I restarted my workflow and the transfers all flowed through at their normal rate. I note that the rate is about 1 MB/s, which seems a bit slow to me. But as each transfer moves only about 180 MB, that is enough for me!

Simon

Hi Simon,

I would suspect the JASMIN end, given the GWS issues. If you expand the endpoint error message, it should give you the endpoint IP address.

Yes, if you report it, do supply the Globus task IDs.

Cheers,
Ros
