UKESM1.1 failing

Hi,
I have copied u-cj514/trunk@264964 to give u-da865 and made the recommended changes from the UKESM1.1-AMIP Release Notes. The model crashes soon after starting with the error below. Are there other things I need to change to make it work?
Simon

???
???!!!???!!!???!!!???!!!???!!! ERROR ???!!!???!!!???!!!???!!!???!!!
? Error code: 2
? Error from routine: EG_BICGSTAB_MIXED_PREC
? Error message: Convergence failure in BiCGstab after restart
? See the following URL for more information:
? https://code.metoffice.gov.uk/trac/um/wiki/KnownUMFailurePoints
? Error from processor: 377
? Error number: 61

Hi Simon,

There are a couple of things that need changing in that suite. I’m currently talking to the Met Office to get that suite (and others) updated.

You’ll need to change the suite to pick up the start dump from Simon’s directory for now.

/work/n02/n02/simon/u-by791/by791a.da19790101_00

In file site/archer2.rc you’ll also need to change:

ROSE_LAUNCHER_PREOPTS = {{ATM_SLURM_FLAGS}}
to
ROSE_LAUNCHER_PREOPTS = {{ATM_SLURM_FLAGS}} --cpus-per-task={{MAIN_OMPTHR_ATM}}
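Roughly where that line sits in site/archer2.rc – the section names and nesting below are a sketch based on similar UM suites, so check against your copy rather than taking them verbatim:

```
[[ATMOS_RESOURCE]]
    [[[environment]]]
        ROSE_LAUNCHER_PREOPTS = {{ATM_SLURM_FLAGS}} --cpus-per-task={{MAIN_OMPTHR_ATM}}
```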

Cheers,
Ros.


Thanks Ros,
I’ll do that and report back! The failure message doesn’t look like a file-not-found error – is it falling back to some unreasonable default?
Simon

And it seems to be working – the model is running and has run for a while!
Simon

Glad to hear it’s running now. It wasn’t a file-not-found issue; the start dump was dodgy.


Though the model was running, it ran out of time after 3 hours & 50 mins, which I think should be enough for a 3-month simulation. Looking in the output, the model has run 311 timesteps, and the pa/pm/pb files are empty according to xconv.

job.err has messages of the form:
srun: Warning: can’t honor --ntasks-per-node set to 64 which doesn’t match the requested tasks 504 with the number of requested nodes 8. Ignoring --ntasks-per-node.
[0] exceptions: feenableexcept() mask [0x00000000] enabled. (mask [0x00000000] requested)
WARNING: Requested total thread count and/or thread affinity may result in
oversubscription of available CPU resources! Performance may be degraded.
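For what it’s worth, the arithmetic behind both warnings can be sketched as follows (the 504 tasks, 8 nodes and --ntasks-per-node=64 are from job.err; the OpenMP thread count of 2 is an assumption for illustration):

```python
# Sketch of the srun arithmetic behind the two warnings above.
# Task and node numbers come from job.err; the OpenMP thread
# count is an assumed value for illustration.
tasks, nodes, ntasks_per_node = 504, 8, 64

# Warning 1: srun ignores --ntasks-per-node because it is
# inconsistent with the explicit task and node counts.
print(nodes * ntasks_per_node)  # 512, which != 504

# Warning 2: without --cpus-per-task, each MPI task is allocated a
# single CPU, so any task running extra OpenMP threads oversubscribes
# its allocation.
omp_threads = 2           # assumption
default_cpus_per_task = 1
oversubscribed = omp_threads > default_cpus_per_task
print(oversubscribed)     # True
```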

Any advice much appreciated.
Simon

Hi Simon

In site/archer2.rc, add the --cpus-per-task clause like this:

{% if MAIN_OMPTHR_ATM > 1 %}
 {% set ATM_SLURM_FLAGS= "--hint=nomultithread --distribution=block:block --cpus-per-task={{MAIN_OMPTHR_ATM}}" %}
{% else %}
 {% set ATM_SLURM_FLAGS = "--cpu-bind=cores" %}
{% endif %}

Grenville

Hi Grenville,
that failed. The job hung around for ages and then looks to have failed with the following error in job.err:
srun: error: Invalid numeric value “{{MAIN_OMPTHR_ATM}}” for --cpus-per-task.
[FAIL] um-atmos <<‘STDIN’
[FAIL]
[FAIL] ‘STDIN’ # return-code=1
2023-11-07T01:09:20Z CRITICAL - failed/EXIT

I don’t understand rosie so can’t tell if I made a small error…
Simon

Simon

Mea culpa - remove the --cpus-per-task={{MAIN_OMPTHR_ATM}} from ATM_SLURM_FLAGS and add it to
ROSE_LAUNCHER_PREOPTS = {{ATM_SLURM_FLAGS}} in the ATMOS_RESOURCE section.

My jinja knowledge is weaker than I’d thought.
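For the record, I think the underlying Jinja gotcha is that {{…}} inside a quoted string in a {% set %} statement is kept as literal text rather than expanded – hence srun seeing the raw string “{{MAIN_OMPTHR_ATM}}”. If the flag ever did need to live in ATM_SLURM_FLAGS, a sketch using Jinja’s ~ string-concatenation operator would be (untested here; the ROSE_LAUNCHER_PREOPTS route is the safer fix):

```
{% if MAIN_OMPTHR_ATM > 1 %}
  {% set ATM_SLURM_FLAGS = "--hint=nomultithread --distribution=block:block --cpus-per-task=" ~ MAIN_OMPTHR_ATM %}
{% else %}
  {% set ATM_SLURM_FLAGS = "--cpu-bind=cores" %}
{% endif %}
```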

Grenville

So it becomes:
ROSE_LAUNCHER_PREOPTS = {{ATM_SLURM_FLAGS}} --cpus-per-task={{MAIN_OMPTHR_ATM}}
??

Yes – like you have for the reconfiguration.


And that has worked. Guess I must have been the first person to use this configuration in a while…
As I want to build on this, should I push the whole thing back into fcm?
Simon

OK – I don’t understand fcm: when I copy my modified job, the modifications do not propagate…
I have done fcm ci on the job.

Hi Simon,

The suite has been copied from the wrong version, u-da865@r270554 – you checked in your changes at r271439.

On the command line do rosie copy u-da865 – this will copy from the head of the suite.

Cheers,
Ros

Thanks a lot Ros.
Though as usual I had to re-enter my Met Office code password…
Simon

Hi Simon,

Yes, the MOSRS password is only cached for 12 hours before it requires you to re-cache it.

Cheers,
Ros.

That is so annoying. Why can’t they use ssh keys etc…

I’ve moved to case u-db167, which is a version with no PP output, only netCDF output. The case runs and produces output, but at the end I get an error:
/work/n02/n02/tetts/cylc-run/u-db167/bin/save_wallclock.sh: /work/n02/n02/tetts/cylc-run/u-db167/bin/iteration_bins.py: /usr/bin/python: bad interpreter: No such file or directory

I think this is because /work/n02/n02/tetts/cylc-run/u-db167/bin/iteration_bins.py uses a full path to python as opposed to #!/usr/bin/env python.

tetts@ln01:/work/n02/n02/tetts/cylc-run/u-db167> type -p python
/work/y07/shared/utils/core/python/miniconda2/bin/python
My python is somewhere else.
Anyhow, can the iteration_bins.py script be modified to use #!/usr/bin/env python?
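In case it helps, a one-liner sketch of the change (the file here is a throwaway demo, not the real suite copy):

```shell
# Demo file standing in for iteration_bins.py (path is illustrative).
cat > /tmp/iteration_bins_demo.py <<'EOF'
#!/usr/bin/python
print("hello")
EOF

# Rewrite the first line only if it is a hard-coded python shebang.
sed -i '1s|^#!.*python.*|#!/usr/bin/env python|' /tmp/iteration_bins_demo.py
head -n 1 /tmp/iteration_bins_demo.py
```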

ta
Simon

Hi Simon

Do you need to save the wallclock times? If not, just remove
post-script = save_wallclock.sh {{EXPT_RESUB}}

Grenville

I’ve modified my copy of iteration_bins.py
Simon