UKESM1.1 failing

Hi,
I have copied u-cj514/trunk@264964 to create u-da865 and made the changes recommended in the UKESM1.1-AMIP Release Notes. The model crashes soon after starting with the error below. Are there other things I need to change to make it work?
Simon

???
???!!!???!!!???!!!???!!!???!!! ERROR ???!!!???!!!???!!!???!!!???!!!
? Error code: 2
? Error from routine: EG_BICGSTAB_MIXED_PREC
? Error message: Convergence failure in BiCGstab after restart
? See the following URL for more information:
? https://code.metoffice.gov.uk/trac/um/wiki/KnownUMFailurePoints
? Error from processor: 377
? Error number: 61

Hi Simon,

There are a couple of things that need changing in that suite. I'm currently talking to the MO to get that suite (and others) updated.

You'll need to change the suite to pick up the start dump from Simon's directory for now.

/work/n02/n02/simon/u-by791/by791a.da19790101_00

In file site/archer2.rc you'll also need to change:

ROSE_LAUNCHER_PREOPTS = {{ATM_SLURM_FLAGS}}
to
ROSE_LAUNCHER_PREOPTS = {{ATM_SLURM_FLAGS}} --cpus-per-task={{MAIN_OMPTHR_ATM}}

Cheers,
Ros.

Thanks Ros,
I'll do that and report back! The failure message doesn't look like a file-not-found error; is it falling back to some unreasonable default?
Simon

And it seems to be working: the model is running and ran for a while!
Simon

Glad to hear it's running now. It wasn't a file-not-found issue; the start dump was dodgy.

Though the model was running, it ran out of time after 3 hours 50 minutes, which I think should be enough for a 3-month simulation. Looking at the output, the model has run 311 timesteps and the pa/pm/pb files are empty according to xconv.

job.err has messages of the form:
srun: Warning: can't honor --ntasks-per-node set to 64 which doesn't match the requested tasks 504 with the number of requested nodes 8. Ignoring --ntasks-per-node.
[0] exceptions: feenableexcept() mask [0x00000000] enabled. (mask [0x00000000] requested)
WARNING: Requested total thread count and/or thread affinity may result in
oversubscription of available CPU resources! Performance may be degraded.

Any advice much appreciated.
Simon
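
For reference, the arithmetic behind the two srun warnings can be sketched as below. The task, node and tasks-per-node counts come from the warning itself; the thread count of 2 is only an assumed value for MAIN_OMPTHR_ATM (it is not stated in this thread), and 128 is the core count of a standard ARCHER2 compute node.

```shell
#!/usr/bin/env bash
# Values taken from the srun warning in job.err.
ntasks=504
nodes=8
ntasks_per_node=64

# ASSUMPTION: illustrative value for MAIN_OMPTHR_ATM; not stated in the thread.
omp_threads=2
# Standard ARCHER2 compute nodes have 128 cores.
cores_per_node=128

# Tasks implied by nodes x tasks-per-node; this differs from ntasks,
# which is what the first warning complains about.
implied_tasks=$(( nodes * ntasks_per_node ))

# Cores needed if every MPI task runs omp_threads OpenMP threads.
cores_needed=$(( ntasks * omp_threads ))
cores_available=$(( nodes * cores_per_node ))

echo "implied tasks:   ${implied_tasks} (requested: ${ntasks})"
echo "cores needed:    ${cores_needed}"
echo "cores available: ${cores_available}"
```

Under these assumptions the node request does not multiply out to the task count (hence the ignored --ntasks-per-node), and each task needs more than one core, which srun only knows about if it is passed --cpus-per-task.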

Hi Simon

In site/archer2.rc, add the --cpus-per-task clause like this:

{% if MAIN_OMPTHR_ATM > 1 %}
 {% set ATM_SLURM_FLAGS= "--hint=nomultithread --distribution=block:block --cpus-per-task={{MAIN_OMPTHR_ATM}}" %}
{% else %}
 {% set ATM_SLURM_FLAGS = "--cpu-bind=cores" %}
{% endif %}

Grenville

Hi Grenville,
That failed. The job hung around for ages and then appears to have failed with the following error in job.err:
srun: error: Invalid numeric value "{{MAIN_OMPTHR_ATM}}" for --cpus-per-task.
[FAIL] um-atmos <<'STDIN'
[FAIL]
[FAIL] 'STDIN' # return-code=1
2023-11-07T01:09:20Z CRITICAL - failed/EXIT

I don't understand rosie, so I can't tell if I made a small error...
Simon

Simon

Mea culpa - remove the --cpus-per-task={{MAIN_OMPTHR_ATM}} from ATM_SLURM_FLAGS and add it to
ROSE_LAUNCHER_PREOPTS = {{ATM_SLURM_FLAGS}} in the ATMOS_RESOURCE section.

My jinja knowledge is weaker than I'd thought.

Grenville

So it becomes:
ROSE_LAUNCHER_PREOPTS = {{ATM_SLURM_FLAGS}} --cpus-per-task={{MAIN_OMPTHR_ATM}}
??

Yes (like you have for the reconfiguration).

And that has worked. I guess I must have been the first person to use this configuration in a while...
As I want to build on this, should I push the whole thing back into fcm?
Simon

OK. I don't understand fcm, so when I copy my modified job the modifications do not propagate...
I have done fcm ci on the job.

Hi Simon,

The suite has been copied from the wrong version, u-da865@r270554; you checked in your changes at r271439.

On the command line, run rosie copy u-da865; this will copy from the head of the suite.

Cheers,
Ros

Thanks a lot Ros.
Though as usual I had to kick off my Met Office code password...
Simon

Hi Simon,

Yes, the MOSRS password is only cached for 12 hours, after which you need to re-cache it.

Cheers,
Ros.

That is so annoying. Why can't they use ssh keys etc...

I've moved to case u-db167, which is a version with no PP output, only netCDF output. The case runs and produces output, but at the end I get an error:
/work/n02/n02/tetts/cylc-run/u-db167/bin/save_wallclock.sh: /work/n02/n02/tetts/cylc-run/u-db167/bin/iteration_bins.py: /usr/bin/python: bad interpreter: No such file or directory

I think this is because /work/n02/n02/tetts/cylc-run/u-db167/bin/iteration_bins.py uses a full path to python as opposed to #!/usr/bin/env python.

tetts@ln01:/work/n02/n02/tetts/cylc-run/u-db167> type -p python
/work/y07/shared/utils/core/python/miniconda2/bin/python
My python is somewhere else.
Anyhow, can the iteration_bins.py script be modified to use #!/usr/bin/env python?

ta
Simon
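
A minimal sketch of the shebang change described above, run against a throwaway stand-in file rather than the real iteration_bins.py (whose contents are not shown in this thread). The stand-in's body is hypothetical; only the first line matters here.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for iteration_bins.py; the real script's body is not shown here.
script=$(mktemp)
printf '%s\n' '#!/usr/bin/python' 'print("hello")' > "$script"

# Rewrite only line 1, and only if it is exactly the hard-coded interpreter path.
# Writing to a new file and moving it avoids relying on a particular sed -i syntax.
sed '1s|^#!/usr/bin/python$|#!/usr/bin/env python|' "$script" > "${script}.new"
mv "${script}.new" "$script"

first_line=$(head -n 1 "$script")
echo "$first_line"
```

With the env form, the interpreter is whichever python comes first on PATH, so the script no longer depends on /usr/bin/python existing on the node.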

Hi Simon

Do you need to save the wallclock times? If not, just remove
post-script = save_wallclock.sh {{EXPT_RESUB}}

Grenville

I've modified my copy of iteration_bins.py
Simon