Hi,
I have made a copy of u-cj514/trunk@264964, giving u-da865, and made the changes recommended in the UKESM1.1-AMIP Release Notes. The model crashes soon after starting with the error below. Are there other things I need to change to make it work?
Simon
???
???!!!???!!!???!!!???!!!???!!! ERROR ???!!!???!!!???!!!???!!!???!!!
? Error code: 2
? Error from routine: EG_BICGSTAB_MIXED_PREC
? Error message: Convergence failure in BiCGstab after restart
? See the following URL for more information:
? https://code.metoffice.gov.uk/trac/um/wiki/KnownUMFailurePoints
? Error from processor: 377
? Error number: 61
Hi Simon,
There are a couple of things that need changing in that suite. I'm currently talking to the MO to get that suite (and others) updated.
You'll need to change the suite to pick up the start dump from Simon's directory for now.
/work/n02/n02/simon/u-by791/by791a.da19790101_00
In file site/archer2.rc
you'll also need to change:
ROSE_LAUNCHER_PREOPTS = {{ATM_SLURM_FLAGS}}
to
ROSE_LAUNCHER_PREOPTS = {{ATM_SLURM_FLAGS}} --cpus-per-task={{MAIN_OMPTHR_ATM}}
Cheers,
Ros.
Thanks Ros,
I'll do that and report back! The failure message doesn't look like a file-not-found error - is it falling back to some unreasonable default?
Simon
And it seems to be working - the model is running and ran for a while!
Simon
Glad to hear it's running now. It wasn't a file-not-found issue; the start dump was dodgy.
Though the model was running, it ran out of time after 3 hours and 50 minutes, which I think should be enough for a 3-month simulation. Looking in the output, the model has run 311 timesteps and the pa/pm/pb files are empty according to xconv.
job.err has messages of the form:
srun: Warning: can't honor --ntasks-per-node set to 64 which doesn't match the requested tasks 504 with the number of requested nodes 8. Ignoring --ntasks-per-node.
[0] exceptions: feenableexcept() mask [0x00000000] enabled. (mask [0x00000000] requested)
WARNING: Requested total thread count and/or thread affinity may result in
oversubscription of available CPU resources! Performance may be degraded.
Any advice much appreciated.
Simon
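A note on the first srun warning above: it appears to be plain arithmetic, since 8 nodes at 64 tasks per node do not add up to the 504 tasks requested. A minimal check using only the numbers quoted in the message (the variable names are just for illustration):

```shell
# Values copied from the srun warning in job.err
tasks=504
nodes=8
tasks_per_node=64

# Number of tasks that 8 nodes would hold at 64 tasks each
capacity=$((nodes * tasks_per_node))
# Task slots that would be left unused
spare=$((capacity - tasks))

echo "capacity=${capacity} requested=${tasks} spare=${spare}"
```

The oversubscription warning further down is a separate matter and is the one addressed by the --cpus-per-task change.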
Hi Simon
In site/archer2.rc, add the --cpus-per-task clause like this:
{% if MAIN_OMPTHR_ATM > 1 %}
{% set ATM_SLURM_FLAGS= "--hint=nomultithread --distribution=block:block --cpus-per-task={{MAIN_OMPTHR_ATM}}" %}
{% else %}
{% set ATM_SLURM_FLAGS = "--cpu-bind=cores" %}
{% endif %}
Grenville
Hi Grenville,
That failed. The job hung around for ages and then looks to have failed with the following error in job.err:
srun: error: Invalid numeric value "{{MAIN_OMPTHR_ATM}}" for --cpus-per-task.
[FAIL] um-atmos <<'STDIN'
[FAIL]
[FAIL] 'STDIN' # return-code=1
2023-11-07T01:09:20Z CRITICAL - failed/EXIT
I don't understand rosie so can't tell if I made a small error...
Simon
Simon
Mea culpa - remove the --cpus-per-task={{MAIN_OMPTHR_ATM}}
from ATM_SLURM_FLAGS
and add it to
ROSE_LAUNCHER_PREOPTS = {{ATM_SLURM_FLAGS}}
in the ATMOS_RESOURCE section.
My jinja knowledge is weaker than I'd thought.
Grenville
So becomes:
ROSE_LAUNCHER_PREOPTS = {{ATM_SLURM_FLAGS}} --cpus-per-task={{MAIN_OMPTHR_ATM}}
??
yes - (like you have for the reconfiguration)
And that has worked. Guess I must have been the first person to use this configuration in a while...
As I want to build on this, should I push the whole thing back into fcm?
Simon
OK - I don't understand fcm, so when I copy my modified job the modifications do not propagate...
I have done fcm ci on the job.
Hi Simon,
The suite has been copied from the wrong version u-da865@r270554 - you checked in your changes at r271439.
On the command line do rosie copy u-da865
this will copy from the head of the suite.
Cheers,
Ros
Thanks a lot Ros.
Though as usual I had to kick off my Met Office code password...
Simon
Hi Simon,
Yes, the MOSRS password is only cached for 12 hours before you have to re-cache it.
Cheers,
Ros.
That is so annoying. Why can't they use ssh keys etc...
I've moved to case u-db167, which is a version with netCDF output and no PP output. The case runs and produces output, but at the end I get an error:
/work/n02/n02/tetts/cylc-run/u-db167/bin/save_wallclock.sh: /work/n02/n02/tetts/cylc-run/u-db167/bin/iteration_bins.py: /usr/bin/python: bad interpreter: No such file or directory
I think this is because /work/n02/n02/tetts/cylc-run/u-db167/bin/iteration_bins.py uses a full path to python as opposed to #!/usr/bin/env python.
tetts@ln01:/work/n02/n02/tetts/cylc-run/u-db167> type -p python
/work/y07/shared/utils/core/python/miniconda2/bin/python
My python is somewhere else.
Anyhow, can the iteration_bins.py script be modified to use #!/usr/bin/env python?
ta
Simon
Hi Simon
Do you need to save the wallclock times? If not, just remove
post-script = save_wallclock.sh {{EXPT_RESUB}}
Grenville
I've modified my copy of iteration_bins.py
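For anyone hitting the same "bad interpreter" error, one way to make that kind of change is sketched below. The thread does not say how the script was actually edited, so this is only an illustration: it works on a throwaway stand-in file rather than the real bin/iteration_bins.py, and the file contents are made up.

```shell
# Hypothetical stand-in for bin/iteration_bins.py so the sketch is self-contained
script="$(mktemp)"
printf '%s\n' '#!/usr/bin/python' 'print("hello")' > "$script"

# Rewrite line 1 only, and only if it is exactly the hard-coded interpreter path
tmp="$(mktemp)"
sed '1s|^#!/usr/bin/python$|#!/usr/bin/env python|' "$script" > "$tmp"

# Copy the contents back so the original file keeps its permissions
cat "$tmp" > "$script"
rm -f "$tmp"

# Show the new first line
head -n 1 "$script"
```

With #!/usr/bin/env python the script runs whichever python comes first on PATH, so whether it then works depends on that interpreter being suitable for the script.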
Simon