We’ve finally managed to get the UM going over the geographical poles (I can’t claim any credit - that goes to Nick Savage), and I am now getting around to fixing some problems that have been lower down the list of priorities for a while.
Firstly: every other cycle, the housekeeping task submit-fails, which means the simulation requires a lot of babysitting because I have to manually re-trigger the housekeeping task every time. This is clearly quite impractical for running very long simulations. Can you help me fix this please?
I suspect that housekeeping may be running on a particular login node but that cylc is polling a different login node, but that would not account for a submit failure (I can’t see anything in the logs to indicate a submit failure.)
You could try adding a particular host (one of login[1-4].archer2.ac.uk), e.g.
[[[remote]]]
host = login2.archer2.ac.uk
to the [[housekeep_cycle]] section – that should ensure cylc polls the login node where housekeeping is running (this will fail if login2 goes down, but that shouldn’t happen often.)
You should also make sure you can ssh seamlessly to all four login nodes individually.
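One way to check all four in one go is a small script along these lines. It is only a sketch: the function names are made up for illustration, the host names just follow the login[1-4].archer2.ac.uk pattern above, and it assumes a standard OpenSSH client. BatchMode=yes makes ssh fail instead of prompting for a password or a host-key confirmation, which is roughly what cylc needs when it polls.

```python
import subprocess

# Host names follow the login[1-4].archer2.ac.uk pattern mentioned above.
LOGIN_NODES = [f"login{n}.archer2.ac.uk" for n in range(1, 5)]


def ssh_check_command(host):
    """Build an ssh command that fails rather than prompting for input.

    (Hypothetical helper name; the ssh options are standard OpenSSH ones.)
    """
    return ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=10", host, "true"]


def check_hosts(hosts, run=subprocess.run):
    """Return {host: True/False} for whether a non-interactive ssh succeeded."""
    results = {}
    for host in hosts:
        completed = run(ssh_check_command(host), capture_output=True, text=True)
        results[host] = completed.returncode == 0
    return results


# Example use (not run here, since it needs network access and your ssh keys):
#     for host, ok in check_hosts(LOGIN_NODES).items():
#         print(host, "ok" if ok else "FAILED")
```

Any node reported as failed is worth trying by hand with a plain ssh to see what it prompts for.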
Thanks Grenville, doesn’t look like that’s worked unfortunately. It’s not actually a submit-failed error, it just gets stuck on submitted, and I get ‘poll-failed’ in the suite control window. I’m actually now getting a lot of submit-retrying steps in other jobs too (glm fcst, LBC creation, etc), which I imagine is related. I assume this is probably related to my known_hosts and ssh keys, as I’m also unable to get X forwarding on archer itself.
I was also wondering if it will be possible to amend the housekeeping task to automatically delete previous cycles after transferring the data to e.g. jasmin (I imagine I would have to do this via pptransfer?) - any thoughts?
Well, now I’ve read the docs I think your suites are not configured quite right (see the rose_prune page in the Rose 2.0.0 documentation). We’d normally submit the housekeeping task to PUMA. Please take a look at /home/grenville/roses/u-cn134, where housekeeping is set up (correctly, I believe).
Have had a look at the suite.rc and housekeeping configs and I’m still a little confused about where the relevant differences are - where exactly is it configured to submit to puma?
Thanks Grenville. I tried that but when I try to reload the suite it complains:
[FAIL] ERROR: Cannot upgrade deprecated item “[runtime][HOUSEKEEP][job][method] → [runtime][HOUSEKEEP][job][batch system] - value unchanged” because the upgraded item already exists
I changed the [[[job]]] line back to the execution retry delays line and it seems to have reloaded okay. Will re-run to see if the housekeeping works without babysitting this time!
From that error, I suspect you may have more than one [[HOUSEKEEP]] section defined in the suite and it is complaining about duplicate entries. Check the cylc-run/suiteid/suite.rc.processed to see if this is the problem.
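If the processed file is long, a rough script like the one below can list section headings that occur more than once. It is a sketch, not a cylc tool: the function name and the sample text are invented for illustration, and it only looks at bracket depth, so treat the output as a pointer for where to look. Note that cylc merges repeated sections, so a repeat is not wrong in itself; it just shows where two definitions may be clashing.

```python
import re
from collections import defaultdict

# Matches a section heading such as [runtime], [[HOUSEKEEP]] or [[[remote]]].
HEADING = re.compile(r"^\s*(\[+)([^\[\]]+)(\]+)\s*(#.*)?$")


def find_repeated_sections(text):
    """Return {section path: [line numbers]} for headings seen more than once."""
    path = []
    seen = defaultdict(list)
    for lineno, line in enumerate(text.splitlines(), start=1):
        match = HEADING.match(line)
        # Skip non-headings and headings with unbalanced brackets.
        if not match or len(match.group(1)) != len(match.group(3)):
            continue
        depth = len(match.group(1))
        # Keep the parent sections, then add this heading at its depth.
        path = path[:depth - 1] + [match.group(2).strip()]
        seen["/".join(path)].append(lineno)
    return {p: nums for p, nums in seen.items() if len(nums) > 1}


# Made-up sample; "hpc.example" is a placeholder, not a real host.
sample = """\
[runtime]
    [[HOUSEKEEP]]
        [[[remote]]]
            host = localhost
    [[HOUSEKEEP]]
        [[[remote]]]
            host = hpc.example
"""

for section, lines in find_repeated_sections(sample).items():
    print(section, lines)
```

To run it on a real suite, read the text from cylc-run/suiteid/suite.rc.processed instead of using the sample string.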
Thanks for the suggestion. It does indeed look like there are duplicate entries in the suite.rc.processed file, but strangely I can’t see the second entry in the suite.rc itself. Might it be taking the config from somewhere else?
Ella
edit: it’s defined in the site/suite-adds.rc file as
[[[remote]]]
host ={{HPC_HOST}}
instead of host = localhost as in suite.rc
Update: I removed the duplicate entry from sites/ncas-cray-ex/suite-adds.rc and that has done the trick. Housekeeping now submits and runs without any poll-fail problems.
The offending line was
[[[remote]]]
host ={{HPC_HOST}}