I am getting some intermittent problems with submitting jobs from puma2 to archer2. Sometimes I get ‘submit-retrying’, then ‘submit-failed’, errors. The job activity log shows the following warning:
This does seem to resolve itself after a while - but it’s not clear to me how it does this. Is there anything I can do to solve the problem when it occurs, or do I just have to wait?
Ha, indeed! They just emailed to say it’s your ballpark. I will try to get them to investigate their end, but if you could look into it too, Annette, that would be great.
FYI: seems to be a comms issue between puma2 and archer2:
The “rose host-select archer2: host selection failed” error suggests that there is an issue between PUMA2 and ARCHER2. When this error occurs, no job is in fact submitted to the ARCHER2 Slurm queue.
It’s been assigned to the systems team to investigate further:
Just to let you know, I’ve assigned your query to the ARCHER2 systems team, who should be able to investigate further by analysing the PUMA2-to-ARCHER2 comms logs.
Will let you know if I get any further updates.
Cheers,
Ella
We recently had a couple of users with intermittent submit errors, with the error appearing as a malformed job script.
(I am including the details here because I think this was solved over email so it might be useful for other people).
The problem was caused by a line in the users’ .bash_profile, which we just needed to protect so that it was only called when launching an interactive session, i.e.:
[[ $- != *i* ]] && return # Stop here if not running interactively
`. /home/n02/n02/grenvill/mosrs-setup-gpg-agent`
In your cases, I notice you both have the conda initialization blurb in your .bashrc and I wonder if that is causing the issue. I tested this in my account and it does slow rose host-select down a bit, so I wonder if it is just timing out sometimes.
Can you try adding this line to your .bashrc above the Conda initialization:
[[ $- != *i* ]] && return # Stop here if not running interactively
You will need to make sure you don’t have anything after that line that you need for batch jobs, but you can move any such settings above it or put them in your .bash_profile.
Please do let me know if that doesn’t work and I will investigate further.
Thanks for the suggestion Annette - I’ve applied this change to my .bashrc file too, and will let you know how it goes.
For long term work - is there a way to adapt this check, so that we only break at this point for ssh connections? I ask because I do use conda environments for some batch jobs for other projects, so it would be useful to have conda activate for batch jobs still.
I’m not sure how to do that, but it may be possible with some googling.
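One possible way to narrow the check, offered as an untested sketch rather than a confirmed fix: return early only when the shell is non-interactive, was started over SSH, and is not running inside a Slurm job. It assumes that sshd sets SSH_CONNECTION for incoming connections and that Slurm sets SLURM_JOB_ID inside jobs. The function name is_remote_command_shell is made up for illustration, and whether your batch jobs read .bashrc at all depends on how the job script starts its shell.

```shell
# Hypothetical helper (not part of Rose, Cylc or conda).
# Succeeds only for a non-interactive shell that was started over SSH
# and is NOT running inside a Slurm job.
is_remote_command_shell() {
    # Interactive shells have "i" in $-; keep the full setup for them.
    case $- in
        *i*) return 1 ;;
    esac
    # Assumption: sshd sets SSH_CONNECTION for sessions arriving over SSH.
    [ -n "${SSH_CONNECTION:-}" ] || return 1
    # Assumption: Slurm sets SLURM_JOB_ID inside batch jobs; let those continue.
    [ -z "${SLURM_JOB_ID:-}" ]
}

# Intended use in ~/.bashrc, above the conda initialization block:
#   is_remote_command_shell && return
```

The SLURM_JOB_ID test is there because a job submitted from an SSH session can inherit SSH_CONNECTION from the submitting environment, so checking SSH_CONNECTION alone might still skip conda inside batch jobs.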
I do have some other ideas:
You could try increasing the timeout limit for rose host-select. The default is 10 seconds, so maybe try 20? On puma2 edit the file ~/.metomi/rose.conf and add the following lines:
[rose-host-select]
timeout=20.0
Alternatively, move your conda initialization to another file (e.g. ~/conda_init), then explicitly source this in your batch jobs as needed (. ~/conda_init).
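As a rough sketch of that second option, a batch script could source the file through a small guard so that a missing file produces a clear message. The helper name load_conda_init and the environment name in the usage comment are placeholders, and it assumes the conda block has been moved unchanged from .bashrc into ~/conda_init.

```shell
# Hypothetical helper for the top of a batch script.
# Sources the conda initialization file if it is readable, else reports an error.
load_conda_init() {
    # Default location follows the ~/conda_init suggestion; a different path can be passed.
    init_file="${1:-$HOME/conda_init}"
    if [ -r "$init_file" ]; then
        . "$init_file"
    else
        echo "conda init file not found: $init_file" >&2
        return 1
    fi
}

# Intended use in a batch script (environment name is a placeholder):
#   load_conda_init && conda activate my_env
```

Keeping the conda block out of .bashrc means that the non-interactive shells started by rose host-select no longer pay its startup cost, while batch jobs that need conda opt in explicitly.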