Intermittent Host Select failure on PUMA2

I am getting some intermittent problems with submitting jobs from puma2 to archer2. Sometimes I get ‘submit-retrying’ and then ‘submit-failed’ errors. The job activity log shows the following warning:

[jobs-submit cmd] (remote host select)
[jobs-submit ret_code] 1
[jobs-submit err]
rose host-select archer2: host selection failed:
COMMAND FAILED (124): rose host-select archer2

This does seem to resolve itself after a while - but it’s not clear to me how it does this. Is there anything I can do to solve the problem when it occurs, or do I just have to wait?

I regularly also get this error - so watching your ticket with interest!

Did you ever get to the bottom of this, @douglowe?

I didn’t, sorry.

Might be something we need to raise with ARCHER2 support?

We’ve never seen this!

Ah, that’s a shame. I’m getting it a lot lately!

…will email archer now.

ARCHER are probably just going to pass this back to us!

I will take a look this week.

Annette

Ha, indeed! They just emailed to say it’s in your ballpark. I will try to get them to investigate at their end, but if you could look into it too, Annette, that would be great.

FYI: seems to be a comms issue between puma2 and archer2:

The “rose host-select archer2: host selection failed” error suggests that there is an issue between PUMA2 and ARCHER2. When this error occurs, no job is in fact submitted to the ARCHER2 Slurm queue.

It’s been assigned to the systems team to investigate further:

Just to let you know, I’ve assigned your query to the ARCHER2 systems team, who should be able to investigate further by analysing the PUMA2-to-ARCHER2 comms logs.

Will let you know if I get any further updates.
Cheers,
Ella

Hi Doug and Ella,

I do have one idea about what might be going on…

We recently had a couple of users with intermittent submit errors, with the error appearing as a malformed job script.

(I am including the details here because I think this was solved over email, so it might be useful for other people.)

The problem was caused by a line in the users’ .bash_profile, which just needed to be protected so that it was only run when launching an interactive session, i.e.:

[[ $- != *i* ]] && return # Stop here if not running interactively
. /home/n02/n02/grenvill/mosrs-setup-gpg-agent

In your cases, I notice you both have the conda initialization block in your .bashrc, and I wonder if that is causing the issue. I tested this in my account and it does slow rose host-select down a bit, so I wonder if it is just timing out sometimes.
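
For reference, one way to see the effect is to time the host selection directly on puma2 (this assumes archer2 is defined as a host group in your Rose configuration, as in the error message above):

time rose host-select archer2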

Can you try adding this line to your .bashrc above the Conda initialization:

[[ $- != *i* ]] && return # Stop here if not running interactively

You will need to make sure there is nothing after that line which you need for batch jobs; anything that is needed can be moved above the guard or put in your .bash_profile.
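
As a rough illustration (the export line is just a placeholder, and the conda block is whichever one conda init generated for you), the top of your .bashrc might then look something like this:

# Anything that batch jobs rely on must come before the guard,
# e.g. PATH additions or environment variables (placeholder example):
export PATH="$HOME/bin:$PATH"

# Stop here if not running interactively
[[ $- != *i* ]] && return

# >>> conda initialize >>>
# (your conda-generated block stays here, below the guard, so it is
#  only run for interactive logins and is skipped by batch and ssh jobs)
# <<< conda initialize <<<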

Please do let me know if that doesn’t work and I will investigate further.

Best wishes,

Annette

Thanks Annette, I have implemented that change and will keep a close eye on things!

Currently I have a JASMIN data storage issue to sort out before any suites can run, but I’m hoping this will be resolved this afternoon.

Cheers
Ella

Thanks for the suggestion Annette - I’ve applied this change to my .bashrc file too, and will let you know how it goes.

For longer-term work: is there a way to adapt this check so that we only return at this point for SSH connections? I ask because I use conda environments in some batch jobs for other projects, so it would be useful to still have conda activate available for batch jobs.

Hi Doug,

I’m not sure how to do that, but it may be possible with some googling.

I do have some other ideas:

  • You could try increasing the timeout limit for rose host-select. The default is 10 seconds, so maybe try 20? On puma2, edit the file ~/.metomi/rose.conf and add the following lines:
[rose-host-select]
timeout=20.0
  • Alternatively, move your conda initialization to another file (e.g. ~/conda_init), then explicitly source this in your batch jobs as needed (. ~/conda_init) - see the sketch below.
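
As a rough sketch of that second option (the conda path and environment name here are placeholders; the real content is whatever conda init wrote into your .bashrc), ~/conda_init would hold a copy of the conda block:

# ~/conda_init - simplified copy of the conda-generated initialization (illustrative)
__conda_setup="$("$HOME/miniconda3/bin/conda" shell.bash hook 2>/dev/null)"
if [ -n "$__conda_setup" ]; then
    eval "$__conda_setup"
else
    export PATH="$HOME/miniconda3/bin:$PATH"
fi
unset __conda_setup

Then any batch job that needs conda sources it explicitly:

. ~/conda_init
conda activate my_env   # my_env is a placeholder environment name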

Hope this helps,

Annette