I have been running the UM using the old ARCHER2 login address (login.archer2.ac.uk), but I think today’s change to login-4c.archer2.ac.uk may be causing an issue. I have attached a screenshot of the cylc GUI for my run showing a pptransfer task that is supposedly still running, although it has actually finished on ARCHER2. When I re-poll the task, its status is not updated. Also, when trying to view a job status or error file, I get the error shown. I have now changed the login address in both site/archer2.rc and my ssh config file. However, when I try to run suite-run reload, I am told that the suite still has running tasks. How do you think I should proceed?
I would suggest manually checking the job.status file for the pptransfer task on ARCHER2. If it shows that the task succeeded, change the status of pptransfer to succeeded in the cylc GUI. The next tasks should then start up and use login-4c from then on. Are you sure it was rose suite-run --reload that you ran? That shouldn’t mind if tasks are already running. If that doesn’t work, I would suggest stopping the suite and then restarting it with rose suite-run --restart.
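A minimal sketch of that manual check, assuming the cylc 7 convention that job.status is a plain-text file of KEY=VALUE lines and that a successfully finished job records CYLC_JOB_EXIT=SUCCEEDED. The function names are hypothetical, and the path to the file (under the suite's job log directory on ARCHER2) has to be supplied by you.

```python
from pathlib import Path


def read_job_status(path):
    """Parse a job.status file (assumed KEY=VALUE per line) into a dict."""
    status = {}
    for line in Path(path).read_text().splitlines():
        key, sep, value = line.partition("=")
        if sep:  # ignore any line that has no '='
            status[key.strip()] = value.strip()
    return status


def job_succeeded(path):
    """True only if the file records CYLC_JOB_EXIT=SUCCEEDED (assumed key)."""
    return read_job_status(path).get("CYLC_JOB_EXIT") == "SUCCEEDED"
```

If job_succeeded returns True for the pptransfer job.status file, it should be safe to reset the task to succeeded in the GUI. If the key is missing, the job either has not finished or did not exit cleanly, so it is worth reading the job.err file first.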
Have you also updated your ~/.ssh/config file on PUMA to change the login.archer2.ac.uk to login-4c.archer2.ac.uk?
Yes, I’ve just tweaked the cylc configuration file again on PUMA, so cylc should now poll for task status rather than relying on tasks communicating back from ARCHER2, which they can’t do. Please try retriggering the task. If it still does the same, I would suggest stopping and restarting the suite.
So that has fixed the problem with communicating back. The suite is now polling and we’ve lost the “connection failed” errors. The problem now is that the transfer is just not completing in the allocated time. I was watching it last night: the checksumming took about 2 hours, and it then only managed to transfer about 80 GB to JASMIN in the next 2 hours. I don’t know if this is down to ARCHER2 load on the login nodes, filesystem issues or the connection to JASMIN. I’m going to take a look at another run to see if that’s seeing any slowdown.
I’ve just been looking at your suite further. The 19910401T0000Z cycle has still only transferred 76 GB, which is the same as when I looked at it late last night, so it looks like the rsync has got stuck for some reason. I’m currently running a 300 GB transfer and it has already done 100 GB, so I don’t think it’s the connection to JASMIN.
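The "same size as last night" check can be sketched as follows: sum the bytes under the destination directory at two different times and flag the transfer as stalled if it has not grown. This is only an illustration; the function names and the growth threshold are hypothetical, and a large directory tree may take a while to walk.

```python
import os


def dir_size_bytes(root):
    """Total size in bytes of all files found below root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                total += os.path.getsize(os.path.join(dirpath, name))
            except OSError:
                # File disappeared or is unreadable, e.g. an rsync
                # temporary file that was renamed while we were walking.
                pass
    return total


def has_stalled(previous_bytes, current_bytes, min_growth_bytes=1):
    """True if the directory grew by less than min_growth_bytes between samples."""
    return current_bytes - previous_bytes < min_growth_bytes
```

Calling dir_size_bytes on the cycle's transfer directory twice, some minutes apart, and passing both values to has_stalled gives a quick indication of whether rsync is still making progress.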
Can you please try killing the 19910401T0000Z/pptransfer task that is currently running? Then, on JASMIN, move the /gws/nopw/j04/hiresgw/dg/archer_transfers/u-cg647/19910401T0000Z directory out of the way, retrigger the pptransfer task and see if a fresh run of the task clears it.
I now think the problem is a lack of space in my GWS on JASMIN. I’ve removed some of the dumps from the earlier cycles and the rsync task on ARCHER2 has started going again, with some new files appearing on JASMIN.
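A quick way to rule this out before retriggering a transfer is to check the free space reported for the destination. The sketch below uses Python's shutil.disk_usage; the function name has_room is hypothetical, and it assumes the filesystem-level figure is meaningful for the workspace. If the GWS is limited by a quota that the filesystem does not report, this check will be optimistic.

```python
import shutil


def has_room(path, needed_bytes):
    """True if the filesystem holding path reports at least needed_bytes free."""
    return shutil.disk_usage(path).free >= needed_bytes
```

For example, has_room(destination_dir, 300 * 10**9) would test for roughly 300 GB of headroom, where destination_dir is whatever directory the transfer writes to.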