Housekeeping task poll-fails

Hi CMS, it’s been a while!

We’ve finally managed to get the UM going over the geographical poles (I can’t claim any credit - that goes to Nick Savage), and I’m now getting around to fixing some problems that have been lower down the list of priorities for a while.

Firstly: every other cycle, the housekeeping task submit-fails, which means the simulation requires a lot of babysitting because I have to manually re-trigger the housekeeping task every time. This is clearly quite impractical for running very long simulations. Can you help me fix this please?

Cheers,
Ella

Please tell us which suite.

Thanks Grenville. All of my suites show this behaviour, and I know from others that I’m not the only one! But, for example: u-cy153

Hi Ella

I suspect that housekeeping may be running on a particular login node but that cylc is polling a different login node, but that would not account for a submit failure (I can’t see anything in the logs to indicate a submit failure.)

You could try adding a particular host (one of login[1-4].archer2.ac.uk), e.g.

        [[[remote]]]
            host = login2.archer2.ac.uk

to the [[housekeep_cycle]] section – that should ensure cylc polls the login node where housekeeping is running (this will fail if login2 goes down, but that shouldn’t happen often.)

You should also make sure you can ssh seamlessly to all four login nodes individually.

Grenville
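
For anyone wanting to test the ssh point above, here is a minimal sketch of a check over the four login nodes. It assumes your ssh config already supplies the ARCHER2 username and key; `check_login_nodes` and the `SSH_CMD` override are hypothetical names used for illustration only, not part of cylc or rose.

```shell
# Sketch: check non-interactive ssh access to each ARCHER2 login node.
# SSH_CMD is a hypothetical override hook (defaults to plain ssh) so the
# loop can be dry-run without touching the network.
check_login_nodes() {
    local ssh_cmd="${SSH_CMD:-ssh}"
    local failed=0
    local n host
    for n in 1 2 3 4; do
        host="login${n}.archer2.ac.uk"
        # BatchMode=yes makes ssh fail instead of prompting for a
        # password, passphrase or host-key confirmation.
        if $ssh_cmd -o BatchMode=yes -o ConnectTimeout=10 "$host" true </dev/null; then
            echo "OK   $host"
        else
            echo "FAIL $host"
            failed=1
        fi
    done
    return "$failed"
}
```

Run `check_login_nodes` from the machine that hosts the suite. Any FAIL line suggests that node would prompt or refuse when contacted non-interactively, which is worth ruling out before looking at the suite configuration.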

You mean to suite.rc ?

oops, yes in suite.rc

Thanks Grenville, doesn’t look like that’s worked unfortunately. It’s not actually a submit-failed error, it just gets stuck on submitted, and I get ‘poll-failed’ in the suite control window. I’m actually now getting a lot of submit-retrying steps in other jobs too (glm fcst, LBC creation, etc), which I imagine is related. I assume this is probably related to my known_hosts and ssh keys, as I’m also unable to get X forwarding on archer itself.

I was also wondering if it will be possible to amend the housekeeping task to automatically delete previous cycles after transferring the data to e.g. jasmin (I imagine I would have to do this via pptransfer?) - any thoughts?

Cheers,
Ella

Ella

Well, now I’ve read the docs I think your suites are not configured quite right (see the rose_prune page of the Rose 2.0.0 documentation) - we’d normally submit the housekeeping task to PUMA. Please take a look at /home/grenville/roses/u-cn134, where housekeeping is set up (correctly, I believe).

Grenville

Thanks for this Grenville (and apologies for age-long reply; have been at a conference). I’ll take a look at this today. E

Hi Grenville,

Have had a look at the suite.rc and housekeeping configs and I’m still a little confused about where the relevant differences are - where exactly is it configured to submit to puma?

Cheers
Ella

Ella
u-cy153 has:

    [[HOUSEKEEP]]
        inherit = None, HOST_HPC
        [[[job]]]
            execution retry delays = PT15M, PT15M, PT30M, PT60M, PT60M, PT180M, PT360M, PT360M, PT360M

        [[[remote]]]
            host = login2.archer2.ac.uk

where HOUSEKEEP says run on the HPC

u-cn134 has:

    [[HOUSEKEEP_RESOURCE]]
        [[[job]]]
            batch system = background
        [[[remote]]]
            host = localhost

which says run on local host (= PUMA). Copy the contents of u-cn134 [[HOUSEKEEP_RESOURCE]] into u-cy153 [[HOUSEKEEP]].

Grenville

Thanks Grenville. I tried that but when I try to reload the suite it complains:

[FAIL] ERROR: Cannot upgrade deprecated item “[runtime][HOUSEKEEP][job][method] → [runtime][HOUSEKEEP][job][batch system] - value unchanged” because the upgraded item already exists

I changed the [[[job]]] line back to the execution retry delays line and it seems to have reloaded okay. Will re-run to see if the housekeeping works without babysitting this time!

Thanks

Hi Ella,

From that error, I suspect you may have more than one [[HOUSEKEEP]] section defined in the suite and it is complaining about duplicate entries. Check the cylc-run/suiteid/suite.rc.processed to see if this is the problem.

Regards,
Ros.
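
A quick way to do the check suggested above, as a sketch: list every line of the processed file that opens the section in question. `find_section_defs` is a hypothetical helper, not a cylc or rose command, and it only matches headers that sit alone on a line (optionally followed by a comment).

```shell
# Sketch: print each definition of a [[NAME]] section with its line number.
# find_section_defs is a hypothetical helper, not part of cylc or rose.
find_section_defs() {
    local file="$1" name="$2"
    # Match exactly two opening brackets after optional indentation, so
    # sub-sections such as [[[remote]]] are not reported.
    grep -nE "^[[:space:]]*\[\[${name}]][[:space:]]*(#.*)?$" "$file"
}
```

For example, assuming the default run directory location, `find_section_defs ~/cylc-run/u-cy153/suite.rc.processed HOUSEKEEP`. More than one line of output means the section is defined more than once.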

Hi Ros,

Thanks for the suggestion. Looks indeed like there are duplicate entries from the suite.rc.processed file, but strangely I can’t see the second entry in the suite.rc itself. Might it be taking the config from somewhere else?

Ella

edit: it’s defined in the site/suite-adds.rc file as

    [[[remote]]]
        host = {{HPC_HOST}}

instead of host = localhost as in suite.rc. I’m guessing this is likely the issue?

Update: I removed the duplicate entry from sites/ncas-cray-ex/suite-adds.rc and that has done the trick. Housekeeping now submits and runs without any poll-fail problems.

The offending line was

    [[[remote]]]
        host = {{HPC_HOST}}
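
In case it helps anyone else track down a second definition that is not visible in suite.rc, here is a rough sketch of a search across every .rc file in the suite source directory. `find_section_in_suite` is a hypothetical helper name, and the `--include` option assumes GNU or BSD grep.

```shell
# Sketch: search every .rc file under a suite directory for a section header.
# find_section_in_suite is a hypothetical helper, not a cylc or rose command.
find_section_in_suite() {
    local suite_dir="$1" name="$2"
    # -r recurses, -n prints line numbers, --include limits the search
    # to *.rc files such as suite.rc and sites/*/suite-adds.rc.
    grep -rnE --include='*.rc' "^[[:space:]]*\[\[${name}]]" "$suite_dir"
}
```

For example, `find_section_in_suite ~/roses/u-cy153 HOUSEKEEP` (assuming the suite source lives under ~/roses) prints the file and line number of each match, so a definition pulled in from a site include file shows up alongside the one in suite.rc.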
