Intermittent Host Select failure on PUMA2

I am getting some intermittent problems with submitting jobs from puma2 to archer2. Sometimes I get ‘submit-retrying’ and then ‘submit-failed’ errors. The job activity log shows the following warning:

[jobs-submit cmd] (remote host select)
[jobs-submit ret_code] 1
[jobs-submit err]
rose host-select archer2: host selection failed:
COMMAND FAILED (124): rose host-select archer2

This does seem to resolve itself after a while - but it’s not clear to me how it does this. Is there anything I can do to solve the problem when it occurs, or do I just have to wait?

I regularly also get this error - so watching your ticket with interest!

Did you ever get to the bottom of this, @douglowe?

I didn’t, sorry.

Might be something we need to raise with ARCHER2 support?

We’ve never seen this!

Ah, that’s a shame. I’m getting it a lot lately!

…will email archer now.

ARCHER are probably just going to pass this back to us!

I will take a look this week.

Annette

Ha, indeed! They just emailed to say it’s in your ballpark. I will try to get them to investigate at their end, but if you could look into it too, Annette, that would be great.

FYI: seems to be a comms issue between puma2 and archer2:

The “rose host-select archer2: host selection failed” error suggests that there is an issue between PUMA2 and ARCHER2. When this error occurs, no job is in fact submitted to the ARCHER2 Slurm queue.

It’s been assigned to the systems team to investigate further:

Just to let you know, I’ve assigned your query to the ARCHER2 systems team, who should be able to investigate further by analysing the PUMA2-to-ARCHER2 comms logs.

Will let you know if I get any further updates.
Cheers,
Ella

Hi Doug and Ella,

I do have one idea about what might be going on…

We recently had a couple of users with intermittent submit errors, with the error appearing as a malformed job script.

(I am including the details here because I think this was solved over email, so it might be useful for other people.)

The problem was caused by a line in the users’ .bash_profile, which just needed to be protected so that it was only run when launching an interactive session, i.e.:

[[ $- != *i* ]] && return # Stop here if not running interactively
. /home/n02/n02/grenvill/mosrs-setup-gpg-agent

In your cases, I notice you both have the conda initialization block in your .bashrc, and I wonder if that is causing the issue. I tested this in my account and it does slow rose host-select down a bit, so I wonder if it is just timing out sometimes.
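
For reference, one way to see the effect is to time the host selection directly on puma2 (this assumes archer2 is defined as a host group in your Rose configuration, as in the error message above):

time rose host-select archer2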

Can you try adding this line to your .bashrc above the Conda initialization:

[[ $- != *i* ]] && return # Stop here if not running interactively

You will need to make sure there is nothing after that line which you need for batch jobs; anything that is needed can be moved above the guard or put in your .bash_profile.
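
As a rough illustration (the export line is just a placeholder, and the conda block is whichever one conda init generated for you), the top of your .bashrc might then look something like this:

# Anything that batch jobs rely on must come before the guard,
# e.g. PATH additions or environment variables (placeholder example):
export PATH="$HOME/bin:$PATH"

# Stop here if not running interactively
[[ $- != *i* ]] && return

# >>> conda initialize >>>
# (your conda-generated block stays here, below the guard, so it is
#  only run for interactive logins and is skipped by batch and ssh jobs)
# <<< conda initialize <<<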

Please do let me know if that doesn’t work and I will investigate further.

Best wishes,

Annette

Thanks Annette, I have implemented that change and will keep a close eye on things!

Currently I have a JASMIN data storage issue to sort out before any suites can run, but I’m hoping this will be resolved this afternoon.

Cheers
Ella

Thanks for the suggestion Annette - I’ve applied this change to my .bashrc file too, and will let you know how it goes.

For longer-term work: is there a way to adapt this check so that we only return at this point for SSH connections? I ask because I use conda environments in some batch jobs for other projects, so it would be useful to still have conda activate available for batch jobs.

Hi Doug,

I’m not sure how to do that, but it may be possible with some googling.

I do have some other ideas:

  • You could try increasing the timeout limit for rose host-select. The default is 10 seconds, so maybe try 20? On puma2, edit the file ~/.metomi/rose.conf and add the following lines:
[rose-host-select]
timeout=20.0
  • Alternatively, move your conda initialization to another file (e.g. ~/conda_init), then explicitly source this in your batch jobs as needed (. ~/conda_init) - see the sketch below.
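
As a rough sketch of that second option (the conda path and environment name here are placeholders; the real content is whatever conda init wrote into your .bashrc), ~/conda_init would hold a copy of the conda block:

# ~/conda_init - simplified copy of the conda-generated initialization (illustrative)
__conda_setup="$("$HOME/miniconda3/bin/conda" shell.bash hook 2>/dev/null)"
if [ -n "$__conda_setup" ]; then
    eval "$__conda_setup"
else
    export PATH="$HOME/miniconda3/bin:$PATH"
fi
unset __conda_setup

Then any batch job that needs conda sources it explicitly:

. ~/conda_init
conda activate my_env   # my_env is a placeholder environment name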

Hope this helps,

Annette