ARCHER2 Rename issue

dgaleareading · 6 October 2021 11:45

Hi,

I have been running the UM using the old ARCHER2 login name (login.archer2.ac.uk), but I think today’s change to login-4c.archer2.ac.uk may be causing an issue. I have attached a screenshot of the cylc-gui for my run showing a supposedly running pptransfer task; however, this task has already finished on ARCHER2. When I re-poll the task, its status is not updated. Also, when trying to view a job status or error file, I get the error shown in the screenshot. I have now changed the login address in both site/archer2.rc and the ssh config file. However, when trying to run suite-run reload, I am told that the suite still has running tasks. How do you think I should proceed?

Regards,
Daniel

RosalynHatcher · 6 October 2021 12:10

Hi Daniel,

I would suggest manually checking the job.status file for the pptransfer task on ARCHER2. If the status is succeeded, then change the status of pptransfer to succeeded in the cylc GUI. The next tasks should then start up and use login-4c from then on. Are you sure it was rose suite-run --reload that you ran? That shouldn’t mind if tasks are already running. If it doesn’t work, I would suggest stopping and then restarting the suite with rose suite-run --restart.
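
A minimal sketch of that manual check, to be run on ARCHER2. It assumes the job.status file holds KEY=VALUE lines, with the final state recorded as CYLC_JOB_EXIT=...; the function name job_exit_state and the NO_EXIT_RECORDED and MISSING labels are made up for illustration.

```shell
# Hypothetical helper: print the exit state recorded in a Cylc job.status file.
# Assumes KEY=VALUE lines, with the final state stored as CYLC_JOB_EXIT=<STATE>.
job_exit_state() {
    status_file="$1"
    if [ ! -f "$status_file" ]; then
        echo "MISSING"
        return 1
    fi
    # Take the last CYLC_JOB_EXIT= line; CYLC_JOB_EXIT_TIME= does not match.
    state=$(sed -n 's/^CYLC_JOB_EXIT=//p' "$status_file" | tail -n 1)
    # No exit line means the job has not (yet) written its final state.
    echo "${state:-NO_EXIT_RECORDED}"
}
```

Usage would be along the lines of `job_exit_state ~/cylc-run/<suite-id>/log/job/<cycle>/pptransfer/NN/job.status`, where the suite id, cycle and submit number are placeholders to fill in. If it prints SUCCEEDED, the task state can be reset in the GUI as described above.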

Have you also updated your ~/.ssh/config file on PUMA to change login.archer2.ac.uk to login-4c.archer2.ac.uk?
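
For reference, a rough sketch of making that change from the command line rather than in an editor. The function name update_archer2_login is hypothetical, and it assumes the old hostname appears literally in the file; it keeps a .bak copy of the previous version.

```shell
# Hypothetical helper: swap the old ARCHER2 login name for login-4c in an
# ssh config file, keeping the previous version alongside it as <file>.bak.
update_archer2_login() {
    config_file="${1:-$HOME/.ssh/config}"
    cp "$config_file" "$config_file.bak" || return 1
    # Rewrite from the backup so the original file keeps its permissions.
    sed 's/login\.archer2\.ac\.uk/login-4c.archer2.ac.uk/g' \
        "$config_file.bak" > "$config_file"
}
```

Running `update_archer2_login` with no argument edits ~/.ssh/config. Running it a second time leaves the config unchanged, but it does overwrite the .bak copy.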

Cheers,
Ros.

dgaleareading · 6 October 2021 13:04

Hi,

Stopping the suite and running rose suite-run --restart seems to have done the trick. Thanks.

Regards,
Daniel

dgaleareading · 7 October 2021 09:37

Hi,

It seems that my pptransfer task is stuck on “retrying” due to the following error:

connection failed
connection failed
connection failed
connection failed
connection failed
connection failed
connection failed
WARNING: MESSAGE SEND FAILED
CPU time limit exceeded
Received signal XCPU
cylc (scheduler - 2021-10-07T03:23:37Z): CRITICAL Task job script received signal XCPU at 2021-10-07T03:23:37Z
cylc (scheduler - 2021-10-07T03:23:37Z): CRITICAL failed at 2021-10-07T03:23:37Z
connection failed
connection failed
connection failed
connection failed
connection failed
connection failed
connection failed
WARNING: MESSAGE SEND FAILED

Could this still be linked to the ARCHER2 name change? I have tried ssh’ing to JASMIN from ARCHER2 via ssh hpxfer1.jasmin.ac.uk and that worked fine.

Regards,
Daniel

RosalynHatcher · 7 October 2021 10:50

Hi Daniel,

Yes, I’ve just tweaked the cylc configuration file again on PUMA so cylc should now try polling rather than trying to communicate back from ARCHER2 which it can’t do. Please try retriggering the task. If it still does the same I would suggest stopping and restarting the suite.

Cheers,
Ros.

dgaleareading · 7 October 2021 16:23

Hi,

I’ve stopped and restarted the suite, but the same error still crops up.

Regards,
Daniel

RosalynHatcher · 8 October 2021 07:45

Hi Daniel,

So that has fixed the communication-back problem. The suite is now polling and the “connection failed” errors have gone. The problem now is that the task is just not completing in the allocated time. I was watching the checksumming last night: it took about 2 hours, and the task then only managed to transfer about 80 GB to JASMIN in the next 2 hours. I don’t know if this is down to ARCHER2 load on the login nodes, filesystem issues, or the connection to JASMIN. I’m going to take a look at another run to see if that’s seeing any slowdown.

Cheers,
Ros.

RosalynHatcher · 8 October 2021 09:03

Hi Daniel,

I’ve just been looking at your suite further. The 19910401T0000Z cycle has still only transferred 76 GB, which is the same as when I looked late last night, so it looks like the rsync has got stuck for some reason. I’m currently running a 300 GB transfer and it has already done 100 GB, so I don’t think it’s the connection to JASMIN.

Can you please try killing the 19910401T0000Z/pptransfer task that is currently running? Then, on JASMIN, move the /gws/nopw/j04/hiresgw/dg/archer_transfers/u-cg647/19910401T0000Z directory out of the way, retrigger the pptransfer task, and see if a fresh run of the task clears it.

Cheers,
Ros.

dgaleareading · 8 October 2021 09:22

Hi,

I now think that the cause is a lack of space in my GWS on JASMIN. I’ve removed some of the dumps from the earlier cycles and the rsync task on ARCHER2 has started going again, with new files appearing on JASMIN.
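
In case it helps anyone else, a small sketch of checking free space on the destination before retriggering a transfer. The function names free_kb and has_space_for are made up, and it assumes that df reports the space actually available to the workspace, which may not hold if quotas are tracked separately.

```shell
# Hypothetical helpers: check free space on the filesystem holding a path.
# Assumes df's "Available" column reflects what the workspace can really use.
free_kb() {
    # -P forces one-line-per-filesystem POSIX output; -k reports 1024-byte blocks.
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

has_space_for() {
    path="$1"
    needed_kb="$2"
    avail=$(free_kb "$path")
    # Treat an empty result (e.g. the path does not exist) as an error.
    [ -n "$avail" ] || return 2
    [ "$avail" -ge "$needed_kb" ]
}
```

For example, `has_space_for /gws/nopw/j04/hiresgw 104857600 || echo "less than 100 GB free"` would warn before starting a transfer of roughly that size.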

Regards,
Daniel

system · 10 October 2021 09:22

This topic was automatically closed 2 days after the last reply. New replies are no longer allowed.
