UKESM1.1 failing

Hi,
I have copied u-cj514/trunk@264964 to give u-da865 and made the recommended changes from the UKESM1.1-AMIP Release Notes. The model crashes soon after starting with the error below. Are there other things I need to change to make it work?
Simon

???
???!!!???!!!???!!!???!!!???!!! ERROR ???!!!???!!!???!!!???!!!???!!!
? Error code: 2
? Error from routine: EG_BICGSTAB_MIXED_PREC
? Error message: Convergence failure in BiCGstab after restart
? See the following URL for more information:
? https://code.metoffice.gov.uk/trac/um/wiki/KnownUMFailurePoints
? Error from processor: 377
? Error number: 61

Hi Simon,

There are a couple of things that need changing in that suite. I’m currently talking to the Met Office to get that suite (and others) updated.

You’ll need to change the suite to pick up the start dump from Simon’s directory for now.

/work/n02/n02/simon/u-by791/by791a.da19790101_00

In file site/archer2.rc you’ll also need to change:

ROSE_LAUNCHER_PREOPTS = {{ATM_SLURM_FLAGS}}
to
ROSE_LAUNCHER_PREOPTS = {{ATM_SLURM_FLAGS}} --cpus-per-task={{MAIN_OMPTHR_ATM}}
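Roughly where that line sits in site/archer2.rc – the section names and nesting below are a sketch based on similar UM suites, so check against your copy rather than taking them verbatim:

```
[[ATMOS_RESOURCE]]
    [[[environment]]]
        ROSE_LAUNCHER_PREOPTS = {{ATM_SLURM_FLAGS}} --cpus-per-task={{MAIN_OMPTHR_ATM}}
```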

Cheers,
Ros.


Thanks Ros,
I’ll do that and report back! The failure message doesn’t look like a file-not-found error – is it falling back to some unreasonable default?
Simon

And it seems to be working – the model is running and has run for a while!
Simon

Glad to hear it’s running now. It wasn’t a file-not-found issue; the start dump was dodgy.


Though the model was running, it ran out of time after 3 hours & 50 mins, which I think should be enough for a 3-month simulation. Looking in the output, the model has run 311 timesteps, and the pa/pm/pb files are empty according to xconv.

job.err has messages of the form:
srun: Warning: can’t honor --ntasks-per-node set to 64 which doesn’t match the requested tasks 504 with the number of requested nodes 8. Ignoring --ntasks-per-node.
[0] exceptions: feenableexcept() mask [0x00000000] enabled. (mask [0x00000000] requested)
WARNING: Requested total thread count and/or thread affinity may result in
oversubscription of available CPU resources! Performance may be degraded.
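For what it’s worth, the arithmetic behind both warnings can be sketched as follows (the 504 tasks, 8 nodes and --ntasks-per-node=64 are from job.err; the OpenMP thread count of 2 is an assumption for illustration):

```python
# Sketch of the srun arithmetic behind the two warnings above.
# Task and node numbers come from job.err; the OpenMP thread
# count is an assumed value for illustration.
tasks, nodes, ntasks_per_node = 504, 8, 64

# Warning 1: srun ignores --ntasks-per-node because it is
# inconsistent with the explicit task and node counts.
print(nodes * ntasks_per_node)  # 512, which != 504

# Warning 2: without --cpus-per-task, each MPI task is allocated a
# single CPU, so any task running extra OpenMP threads oversubscribes
# its allocation.
omp_threads = 2           # assumption
default_cpus_per_task = 1
oversubscribed = omp_threads > default_cpus_per_task
print(oversubscribed)     # True
```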

Any advice much appreciated.
Simon

Hi Simon

In site/archer2.rc, add the --cpus-per-task clause like this:

{% if MAIN_OMPTHR_ATM > 1 %}
 {% set ATM_SLURM_FLAGS= "--hint=nomultithread --distribution=block:block --cpus-per-task={{MAIN_OMPTHR_ATM}}" %}
{% else %}
 {% set ATM_SLURM_FLAGS = "--cpu-bind=cores" %}
{% endif %}

Grenville

Hi Grenville,
that failed. The job hung around for ages and then looks to have failed with the following error in job.err:
srun: error: Invalid numeric value “{{MAIN_OMPTHR_ATM}}” for --cpus-per-task.
[FAIL] um-atmos <<‘STDIN’
[FAIL]
[FAIL] ‘STDIN’ # return-code=1
2023-11-07T01:09:20Z CRITICAL - failed/EXIT

I don’t understand rosie so can’t tell if I made a small error…
Simon

Simon

Mea culpa - remove the --cpus-per-task={{MAIN_OMPTHR_ATM}} from ATM_SLURM_FLAGS and add it to
ROSE_LAUNCHER_PREOPTS = {{ATM_SLURM_FLAGS}} in the ATMOS_RESOURCE section.

My jinja knowledge is weaker than I’d thought.
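For the record, I think the underlying Jinja gotcha is that {{…}} inside a quoted string in a {% set %} statement is kept as literal text rather than expanded – hence srun seeing the raw string “{{MAIN_OMPTHR_ATM}}”. If the flag ever did need to live in ATM_SLURM_FLAGS, a sketch using Jinja’s ~ string-concatenation operator would be (untested here; the ROSE_LAUNCHER_PREOPTS route is the safer fix):

```
{% if MAIN_OMPTHR_ATM > 1 %}
  {% set ATM_SLURM_FLAGS = "--hint=nomultithread --distribution=block:block --cpus-per-task=" ~ MAIN_OMPTHR_ATM %}
{% else %}
  {% set ATM_SLURM_FLAGS = "--cpu-bind=cores" %}
{% endif %}
```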

Grenville

So it becomes:
ROSE_LAUNCHER_PREOPTS = {{ATM_SLURM_FLAGS}} --cpus-per-task={{MAIN_OMPTHR_ATM}}
??

Yes – like you have for the reconfiguration.


And that has worked. Guess I must have been the first person to use this configuration in a while…
As I want to build on this, should I push the whole thing back into fcm?
Simon

OK – I don’t understand fcm: when I copy my modified job, the modifications do not propagate…
I have done fcm ci on the job.

Hi Simon,

The suite has been copied from the wrong version, u-da865@r270554 – you checked in your changes at r271439.

On the command line do rosie copy u-da865 – this will copy from the head of the suite.

Cheers,
Ros

Thanks a lot Ros.
Though as usual I had to re-enter my Met Office code password…
Simon

Hi Simon,

Yes, the MOSRS password is only cached for 12 hours before it requires you to re-cache it.

Cheers,
Ros.

That is so annoying. Why can’t they use ssh keys etc…

I’ve moved to case u-db167, which is a version with no PP output, only netCDF output. The case runs and produces output, but at the end I get an error:
/work/n02/n02/tetts/cylc-run/u-db167/bin/save_wallclock.sh: /work/n02/n02/tetts/cylc-run/u-db167/bin/iteration_bins.py: /usr/bin/python: bad interpreter: No such file or directory

I think this is because /work/n02/n02/tetts/cylc-run/u-db167/bin/iteration_bins.py uses a full path to python as opposed to #!/usr/bin/env python.

tetts@ln01:/work/n02/n02/tetts/cylc-run/u-db167> type -p python
/work/y07/shared/utils/core/python/miniconda2/bin/python
My python is somewhere else.
Anyhow, can the iteration_bins.py script be modified to use #!/usr/bin/env python?
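In case it helps, a one-liner sketch of the change (the file here is a throwaway demo, not the real suite copy):

```shell
# Demo file standing in for iteration_bins.py (path is illustrative).
cat > /tmp/iteration_bins_demo.py <<'EOF'
#!/usr/bin/python
print("hello")
EOF

# Rewrite the first line only if it is a hard-coded python shebang.
sed -i '1s|^#!.*python.*|#!/usr/bin/env python|' /tmp/iteration_bins_demo.py
head -n 1 /tmp/iteration_bins_demo.py
```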

ta
Simon

Hi Simon

Do you need to save the wallclock times? If not, just remove
post-script = save_wallclock.sh {{EXPT_RESUB}}

Grenville

I’ve modified my copy of iteration_bins.py
Simon