Python errors in UKESM1.2

Hi CMS team,

My suites u-dr800 and u-dr928 both fail with some Python errors that didn't previously occur. Have there been any Python updates on ARCHER2?

The tasks that fail are:

unicicles_gris; unicicles_ais; postproc_atmos

For the unicicles tasks, the errors are:

srun: error: nid005710: tasks 0-21,23-47,49-127: Exited with exit code 127
srun: launch/slurm: _step_signal: Terminating StepId=11209862.0
/work/n02/n02/adittus/cylc-run/u-dr928/run2/share/fcm_make_unicicles/unicicles/bin/unicicles: error while loading shared libraries: libpython3.9.so.1.0: cannot open shared object file: No such file or directory
/work/n02/n02/adittus/cylc-run/u-dr928/run2/share/fcm_make_unicicles/unicicles/bin/unicicles: error while loading shared libraries: libpython3.9.so.1.0: cannot open shared object file: No such file or directory
slurmstepd: error: *** STEP 11209862.0 ON nid005710 CANCELLED AT 2025-10-16T18:01:37 ***
srun: error: nid005710: tasks 22,48: Exited with exit code 127
[FAIL] $IEXECDIR/unicicles_wrapper # return-code=127
2025-10-16T17:01:38Z CRITICAL - failed/ERR
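(For anyone hitting this later: you can see which shared libraries an executable fails to resolve by running ldd against it on ARCHER2. A sketch, using the binary path from the error above; substitute your own path via BIN:)

```shell
# Report any shared libraries the unicicles binary cannot resolve.
# BIN defaults to the path from the error message above (an assumption
# that it still exists; override BIN for your own suite).
BIN=${BIN:-/work/n02/n02/adittus/cylc-run/u-dr928/run2/share/fcm_make_unicicles/unicicles/bin/unicicles}
ldd "$BIN" | grep "not found" || echo "all libraries resolved"
```

If libpython3.9.so.1.0 shows up as "not found", the runtime environment is missing the cray-python module the binary was built against.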

Note this happened before, this was a fix from Grenville:

in site/archer2-unicicles.rc, add --export=all in [[UNI_EXEC_RESOURCE]], thus:

ROSE_LAUNCHER_PREOPTS = --hint=nomultithread --distribution=block:block --cpu-bind=cores --nodes=1 --ntasks=128 --export=all

This currently is included in the suite.
Then Grenville wrote:

same for [[CAP_RESOURCE]] in site/archer2-unicicles.rc:

ROSE_LAUNCHER_PREOPTS = --hint=nomultithread --distribution=block:block --cpu-bind=cores --nodes=1 --ntasks=1 --export=all

Fixed elsewhere - nothing needed.

Could it be that this now needs an update elsewhere? I saw another ticket about issues with CAP (Issue with CAP9.1 after archer2 update).

For postproc_atmos, the error I get is:

Traceback (most recent call last):
  File "/work/n02/n02/adittus/cylc-run/u-dr928/run2/share/fcm_make_pp/build/bin/main_pp.py", line 119, in <module>
    main()
  File "/work/n02/n02/adittus/cylc-run/u-dr928/run2/share/fcm_make_pp/build/bin/main_pp.py", line 112, in main
    run_postproc()
  File "/work/n02/n02/adittus/cylc-run/u-dr928/run2/share/fcm_make_pp/build/bin/main_pp.py", line 83, in run_postproc
    getattr(model, meth)()
  File "/mnt/lustre/a2fs-work2/work/n02/n02/adittus/cylc-run/u-dr928/run2/share/fcm_make_pp/build/bin/timer.py", line 115, in wrapper
    out = function(*args, **kw)
  File "/mnt/lustre/a2fs-work2/work/n02/n02/adittus/cylc-run/u-dr928/run2/share/fcm_make_pp/build/bin/atmos.py", line 491, in do_meaning
    icode = self.update_meanfile(meanfile, setend)
  File "/mnt/lustre/a2fs-work2/work/n02/n02/adittus/cylc-run/u-dr928/run2/share/fcm_make_pp/build/bin/timer.py", line 115, in wrapper
    out = function(*args, **kw)
  File "/mnt/lustre/a2fs-work2/work/n02/n02/adittus/cylc-run/u-dr928/run2/share/fcm_make_pp/build/bin/atmos.py", line 445, in update_meanfile
    rcode = climatemean.create_mean(meanfile,
  File "/mnt/lustre/a2fs-work2/work/n02/n02/adittus/cylc-run/u-dr928/run2/share/fcm_make_pp/build/bin/timer.py", line 115, in wrapper
    out = function(*args, **kw)
  File "/mnt/lustre/a2fs-work2/work/n02/n02/adittus/cylc-run/u-dr928/run2/share/fcm_make_pp/build/bin/climatemean.py", line 256, in create_mean
    icode, output = target_app(meanfile, **kwargs)
  File "/mnt/lustre/a2fs-work2/work/n02/n02/adittus/cylc-run/u-dr928/run2/share/fcm_make_pp/build/bin/timer.py", line 115, in wrapper
    out = function(*args, **kw)
  File "/mnt/lustre/a2fs-work2/work/n02/n02/adittus/cylc-run/u-dr928/run2/share/fcm_make_pp/build/bin/atmos_transform.py", line 275, in create_um_mean
    load_mule = [mule.load_umfile(f) for f in meanfile.component_files]
  File "/mnt/lustre/a2fs-work2/work/n02/n02/adittus/cylc-run/u-dr928/run2/share/fcm_make_pp/build/bin/atmos_transform.py", line 275, in <listcomp>
    load_mule = [mule.load_umfile(f) for f in meanfile.component_files]
  File "/work/y07/shared/umshared/lib/python3.9/mule/__init__.py", line 1845, in load_umfile
    result = _load_umfile(file_path, open_file)
  File "/work/y07/shared/umshared/lib/python3.9/mule/__init__.py", line 1836, in _load_umfile
    raise ValueError(msg)
ValueError: Unknown dataset_type 0, supported types are dict_keys([1, 2, 3, 4, 5])
[FAIL] main_pp.py atmos # return-code=1
2025-10-17T21:20:43Z CRITICAL - failed/ERR

Thanks for any help!

Cheers,
Andrea

For the unicicles ones I modified the "module load cray-python" line in my site/archer2-unicicles.rc to include the version number, i.e. "module load cray-python/3.9.13.1", but that won't help the postproc.
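For reference, in case it helps others, the change amounts to something like this (a sketch only; the name of the section holding the module load in site/archer2-unicicles.rc may differ in your suite):

```ini
# site/archer2-unicicles.rc -- sketch; [[UNICICLES_RESOURCE]] is an
# assumed section name, use whichever family runs the unicicles tasks
[[UNICICLES_RESOURCE]]
    pre-script = """
        module load cray-python/3.9.13.1
    """
```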

Thanks Robin, re-running the unicicles tasks now!

Hi Andrea,

If you look in the job.out for the postproc_atmos task you can see what it was trying to do when it failed:

[INFO]  Creating meanfile dr800a.py20711201 with components:
        /work/n02/n02/adittus/cylc-run/u-dr800/run2/share/data/History_Data/dr800a.ps2071djf
        /work/n02/n02/adittus/cylc-run/u-dr800/run2/share/data/History_Data/dr800a.ps2071jja
        /work/n02/n02/adittus/cylc-run/u-dr800/run2/share/data/History_Data/dr800a.ps2071mam
        /work/n02/n02/adittus/cylc-run/u-dr800/run2/share/data/History_Data/dr800a.ps2071son

The mule error ValueError: Unknown dataset_type 0 suggests that it can’t read a file, and indeed if we look at those seasonal means, the djf file hasn’t been written properly.

archer2 History_Data$ ls -l dr800a.ps* 
-rw-r--r-- 1 adittus n02  563986432 Oct 16 17:28 dr800a.ps2071djf
-rw-r--r-- 1 adittus n02 5139582976 Oct 16 21:55 dr800a.ps2071jja
-rw-r--r-- 1 adittus n02 5139464192 Oct 16 19:39 dr800a.ps2071mam
-rw-r--r-- 1 adittus n02 5139722240 Oct 17 00:10 dr800a.ps2071son
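Incidentally, you can confirm this sort of truncation from the file header without invoking the whole postproc app. The dataset_type that mule complains about is word 5 of the UM fixed-length header, which (per UMDP F03, and assuming the usual 64-bit big-endian layout of UM fieldsfiles) can be read directly; a minimal sketch, not using mule itself:

```python
import struct

def um_dataset_type(path):
    """Return word 5 (dataset_type) of a UM file's fixed-length header.

    UM fieldsfiles store the fixed-length header as 64-bit big-endian
    integers; valid dataset_type values are 1-5, so a 0 here usually
    means the file was truncated or never finalised.
    """
    with open(path, "rb") as f:
        header = f.read(5 * 8)  # first five 64-bit words
    if len(header) < 5 * 8:
        return None  # file is shorter than the header itself
    return struct.unpack(">5q", header)[4]
```

Running this over the dr800a.ps2071* files should give a value in 1-5 for the healthy seasonal means and 0 (or None) for the broken djf file.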

Annette

Thank you Annette. Now this task is complaining because it can’t make the annual mean without the djf file, and I suspect it was a previous task that generated the djf file. Is there a way I can tell the suite to skip the seasonal means and annual means for this cycle?

If I set the task to succeeded manually, will it fail to do any other important tasks?

Andrea

If you don't want seasonal and annual means at all (I thought that's what you wanted), try switching them off in postproc → Atmosphere → File transformation

(set create_means to false), reload the suite, retrigger the task.
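If you'd rather edit the file than use the GUI, the switch lives somewhere in the postproc app's configuration; a hedged way to locate it (the roses working-copy path is an assumption, substitute your own):

```shell
# Find where the create_means switch is set in the suite's app configs
grep -rn "create_means" ~/roses/u-dr800/app/
```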

Setting the task to succeeded may not work - hard to know if it has completed everything else it’s supposed to have done.

Grenville

Thanks Grenville, have done that now.
Now I get:

[WARN] Iris Module is not available
[ERROR] Validity time mismatch in file /work/n02/n02/adittus/cylc-run/u-dr800/run2/share/data/History_Data/dr800a.pm2071nov to be archived
→ Expected [2071, 11, 1, 0, 20, 0] and got [2071, 12, 1]
[FAIL] Command Terminated

Any ideas how to fix this? If I re-run the coupled task, is that more likely to fix things or make them worse?

Thanks,
Andrea

Hi CMS,

Thanks for the replies on this thread. After talking with Grenville and Robin, it seems the easiest approach for u-dr800 would be to restart it from the January of the failure, as the problems in postproc can't easily be resolved. As the suite has run a long way ahead, I have copied the suite and want to re-run just the year in question (new suite ID u-dt871). I'd then take the re-generated files (corrupted in u-dr800), rename and move them to u-dr800, and just re-run the postproc task.

The trouble I’m having now is that the files in u-dt871 do not bit-compare with u-dr800 pre-failure, so this plan failed.

Is there anything I’m missing?

NRUN_BITCOMP=TRUE
RECON=False

BASIS=2071,1,1,0,0,0

Updated values for:

atmos_dr800c_P1Y_20700101-20710101_icecouple.nc - DONE
bisicles_dr800c_20710101_bathymetry-isf.nc - DONE
bisicles_dr800c_20710101_restart-AIS.hdf5 - DONE
bisicles_dr800c_20710101_restart-GrIS.hdf5 - DONE
bisicles_dr800c_P1Y_20700101-20710101_calving-AIS.hdf5 - DONE
bisicles_dr800c_P1Y_20700101-20710101_calving-GrIS.hdf5 - DONE
bisicles_dr800c_P1Y_20700101-20710101_calving.nc - DONE

dr800a.da20710101_00 - DONE (this is an environment variable in suite, so I renamed and moved it to /work/n02/n02/adittus/cylc-run/u-dt871/run1/share/data/dt871.astart)

dr800i.restart.2071-01-01-00000.nc - DONE
dr800o_20710101_restart.nc -DONE
dr800o_20710101_restart_trc.nc - DONE
dr800o_icebergs_20710101_restart.nc - DONE
glint_dr800c_20710101_restart-AIS.nc - DONE
glint_dr800c_20710101_restart-GrIS.nc - DONE
nemo_dr800c_P1Y_20700101-20710101_icecouple.nc - DONE

Any pointers would be greatly appreciated!

Thanks,
Andrea

I did recompile the executables.

Andrea

In rose-app.conf, u-dt871 has stphseed=0, which is odd (it should be 2) - it looks like the bit-compare override has not been applied.

in flow.cylc, in [[COUPLED]], set

ROSE_APP_OPT_CONF_KEYS = {{RM_MEDUSA_OPT_KEY}} {{UNICICLES_OPT_KEY}} {{BITCOMP_NRUN_OPT}}
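For context, ROSE_APP_OPT_CONF_KEYS is an environment variable, so the line would normally sit in the task family's [[[environment]]] section; a sketch of the surrounding flow.cylc structure (all other content elided, and the exact nesting in this suite is an assumption):

```ini
# flow.cylc -- sketch only
[runtime]
    [[COUPLED]]
        [[[environment]]]
            ROSE_APP_OPT_CONF_KEYS = {{RM_MEDUSA_OPT_KEY}} {{UNICICLES_OPT_KEY}} {{BITCOMP_NRUN_OPT}}
```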

I think that will work (set stphseed at least)

Grenville

Thanks Grenville, I will try this.
Right now all my tasks have a submit-failed error with no error message. Is there anything going on that could be causing this, or is it likely to be on my end? (Note the suite ran before, so it's unlikely to be my configuration.)

Thanks,
Andrea

On puma, can you

ssh ln0[1,2,3,4]

(i.e. ssh to each of the four login nodes in turn)?

I can for all but ln01, which gives me a "Connection closed by …" error message.

You’re out of space on /work
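To see which suite is eating the space, a sketch (the cylc-run path is the one used earlier in this thread; on ARCHER2 `lfs quota` also reports Lustre quotas, but plain du works anywhere):

```shell
# Summarise how much each suite under cylc-run is using on /work,
# largest last
du -sh /work/n02/n02/adittus/cylc-run/* 2>/dev/null | sort -h
```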