ARCHER2 Rename issue

Hi,

I have been running the UM using the old ARCHER2 name (login.archer2.ac.uk), but I think today's change to login-4c.archer2.ac.uk may be causing an issue. I have attached a screenshot of the cylc GUI for my run showing a supposedly running pptransfer task; however, this task has actually finished on ARCHER2. When I re-poll the tasks, its status is not updated, and when I try to get a job status or error file I get the error shown. I have now changed the login address in both site/archer2.rc and in the ssh config file. However, when trying to run a suite-run reload, I get a message saying the suite still has running tasks. How do you think I should proceed?

Regards.
Daniel

Hi Daniel,

I would suggest manually checking the job.status file for the pptransfer task on ARCHER2. If the status is succeeded, then in the cylc GUI change the status of pptransfer to succeeded; the next tasks should then start up and use login-4c from then on. Are you sure it was rose suite-run --reload that you ran? That shouldn't complain about tasks already running. If that doesn't work, I would suggest stopping and then restarting the suite with rose suite-run --restart.
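
For reference, the sort of commands I mean are roughly as follows (I'm assuming the suite id is u-cg647 as in your transfer directory; <CYCLE> stands for the stuck cycle point and NN for the latest submit number, e.g. 01):

    # On ARCHER2: check the job.status file written by the pptransfer job;
    # it should show something like CYLC_JOB_EXIT=SUCCEEDED if the job finished
    cat ~/cylc-run/u-cg647/log/job/<CYCLE>/pptransfer/NN/job.status

    # On PUMA: mark the task as succeeded
    # (the command-line equivalent of resetting it in the cylc GUI)
    cylc reset --state=succeeded u-cg647 pptransfer.<CYCLE>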

Have you also updated your ~/.ssh/config file on PUMA to change login.archer2.ac.uk to login-4c.archer2.ac.uk?
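
i.e. the relevant entry should end up looking something like this (the username and key path below are placeholders, not your actual settings):

    # Placeholder values; substitute your own ARCHER2 username and key path
    Host login-4c.archer2.ac.uk
        User your_archer2_username
        IdentityFile ~/.ssh/id_rsa_archer2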

Cheers,
Ros.

Hi,

Stopping the suite and running rose suite-run --restart seems to have done the trick. Thanks.

Regards,
Daniel

Hi,

It seems that my pptransfer task is stuck on “retrying” due to the following error:

    connection failed
    connection failed
    connection failed
    connection failed
    connection failed
    connection failed
    connection failed
    WARNING: MESSAGE SEND FAILED
    CPU time limit exceeded
    Received signal XCPU
    cylc (scheduler - 2021-10-07T03:23:37Z): CRITICAL Task job script received signal XCPU at 2021-10-07T03:23:37Z
    cylc (scheduler - 2021-10-07T03:23:37Z): CRITICAL failed at 2021-10-07T03:23:37Z
    connection failed
    connection failed
    connection failed
    connection failed
    connection failed
    connection failed
    connection failed
    WARNING: MESSAGE SEND FAILED

Could this still be linked to the ARCHER2 name change? I have tried ssh’ing to JASMIN from ARCHER2 via ssh hpxfer1.jasmin.ac.uk and that worked fine.

Regards,
Daniel

Hi Daniel,

Yes, I’ve just tweaked the cylc configuration file on PUMA again so cylc should now poll rather than trying to communicate back from ARCHER2, which it can't do. Please try retriggering the task. If it still does the same, I would suggest stopping and restarting the suite.
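
For anyone else hitting this, the change is along these lines in the Cylc 7 global config on PUMA (wherever the site/user global.rc lives); the section below is indicative rather than copied from the live file:

    [hosts]
        [[login-4c.archer2.ac.uk]]
            task communication method = poll
            submission polling intervals = PT1M
            execution polling intervals = PT1M, PT10M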

Cheers,
Ros.

Hi,

I’ve stopped and restarted the suite, but the same error still crops up.

Regards,
Daniel

Hi Daniel,

So that has fixed the communication-back problem: the suite is now polling and we've lost the connection failed errors. The problem now is that the transfer is just not completing in the allocated time. I was watching the checksumming last night; it took about 2 hours, and then it only managed to transfer about 80 GB to JASMIN in the next 2 hours. I don't know if this is ARCHER2 load on the login nodes, filesystem issues, or the connection to JASMIN. I'm going to take a look at another run to see whether that's seeing any slowdown.

Cheers,
Ros.

Hi Daniel,

I've just been looking at your suite further. The 19910401T0000Z cycle has still only transferred 76 GB, which is the same as when I looked at it late last night, so it looks like the rsync has got stuck for some reason. I'm currently running a 300 GB transfer and it's already done 100 GB, so I don't think it's the connection to JASMIN.

Can you please try killing the 19910401T0000Z pptransfer task that is currently running? Then, on JASMIN, move the /gws/nopw/j04/hiresgw/dg/archer_transfers/u-cg647/19910401T0000Z directory out of the way, retrigger the pptransfer task, and see whether a fresh run of the task clears it.
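
In terms of commands, that's roughly the following (please double-check the suite id and paths against your own setup before running anything):

    # On PUMA: kill the running transfer task
    cylc kill u-cg647 pptransfer.19910401T0000Z

    # On JASMIN: move the partial transfer out of the way
    mv /gws/nopw/j04/hiresgw/dg/archer_transfers/u-cg647/19910401T0000Z \
       /gws/nopw/j04/hiresgw/dg/archer_transfers/u-cg647/19910401T0000Z.partial

    # On PUMA: retrigger the task for a fresh attempt
    cylc trigger u-cg647 pptransfer.19910401T0000Z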

Cheers,
Ros.

Hi,

I now think the problem is a lack of space in my gws on JASMIN. I've removed some of the dumps from the earlier cycles and the rsync on ARCHER2 has started going again, with some new files appearing on JASMIN.
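
In case it helps anyone else, a quick way to see how much the transfer area is using and which directories are the largest is something like:

    # On JASMIN: total usage of my transfer area
    du -sh /gws/nopw/j04/hiresgw/dg

    # Largest directories under it, smallest to largest
    du -sh /gws/nopw/j04/hiresgw/dg/* | sort -h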

Regards,
Daniel
