Postprocessing error after archer2 software upgrade

Hi CMS,

I was running a UKESMv11.1 suite without problems before the ARCHER2 logins were closed for the software upgrade.

I have restarted my suite using ‘rose suite-run --restart’. The model is running, but I am getting an error in the postprocessing. The suite ID is ‘u-cw803’.

The postprocessing error message is:


Lmod has detected the following error: The following module(s) are unknown:
“gcc/10.2.0” “cray-hdf5/1.12.0.3”

Please check the spelling or version number. Also try “module spider …”
It is also possible your cache file is out-of-date; it may help to try:
$ module --ignore-cache load “gcc/10.2.0” “cray-hdf5/1.12.0.3”

Also make sure that all modulefiles written in TCL start with the string
#%Module

Executing this command requires loading “gcc/10.2.0” which failed while
processing the following module(s):

Module fullname   Module Filename
---------------   ---------------
postproc/2022.03  /work/y07/shared/umshared/modulefiles/postproc/2022.03.lua

Executing this command requires loading "cray-hdf5/1.12.0.3" which failed while
processing the following module(s):

Module fullname   Module Filename
---------------   ---------------
postproc/2022.03  /work/y07/shared/umshared/modulefiles/postproc/2022.03.lua

Is this related to the ARCHER2 software upgrade? How can I fix it?

Regards, Alok

Hi CMS,

I think I missed the cms-helpdesk page banner message “ARCHER2 Major Software Upgrade: Following the ARCHER2 return to service NCAS-CMS will rebuild and reinstall software that we maintain. UM suites will not run until we have completed this work, please wait for an announcement from us before attempting to run UM suites.” and/or previous emails. I thought everything had been installed after the ARCHER2 email dated 14-06-2023. If that is the case, then I think I will be able to run the suite without any issues once NCAS-CMS has rebuilt and reinstalled the software.

Regards, Alok

Hi CMS,

I have followed the instructions in “Updating a UM suite after the ARCHER2 O/S upgrade” and restarted the suite, but I am still getting the same post-processing error.

Additionally, I got another error in the atmos_main. The job.err file has the following information:


slurmstepd: error: *** STEP 3752655.0 ON nid003946 CANCELLED AT 2023-06-22T16:44:33 DUE TO TIME LIMIT ***

slurmstepd: error: *** JOB 3752655 ON nid003946 CANCELLED AT 2023-06-22T16:44:33 DUE TO TIME LIMIT ***

srun: Job step aborted: Waiting up to 32 seconds for job step to finish.

srun: error: nid003946: tasks 0-63: Terminated

srun: launch/slurm: _step_signal: Terminating StepId=3752655.0


The model is not finishing within the requested time, which was sufficient before the ARCHER2 O/S upgrade.

Is there any way to complete this run (u-cw803)?

Regards, Alok

Hi Alok,

Please rose suite-run --reload your suite and retrigger the failed atmos_main task.

The --cpus-per-task settings that you added to the site/archer2.rc file have not been picked up by the suite.

Regards,
Ros.

Hi Ros,

Thanks for this. I have reloaded the suite and retriggered the failed atmos_main and postprocessing (of the previous month) tasks.

I am still getting the same error in the postprocessing.

Regards, Alok

Alok

in site/archer2.rc
change
module load postproc/2022.03
to
module load postproc

then rose suite-run --reload and retrigger
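
For illustration only, the edit could be scripted roughly as below. This is a minimal sketch, assuming the line in site/archer2.rc reads exactly "module load postproc/2022.03"; the function name unpin_postproc is hypothetical and not part of any Rose or ARCHER2 tooling.

```shell
# Sketch: swap the version-pinned postproc module for the unversioned default.
# unpin_postproc is a hypothetical helper name; pass it the path to your rc file.
unpin_postproc() {
    rc_file="$1"
    # Keep a .bak copy of the original file, then drop the "/2022.03" suffix.
    sed -i.bak 's|module load postproc/2022\.03|module load postproc|' "$rc_file"
    # Show the resulting line(s) so the change can be checked by eye.
    grep -n 'module load postproc' "$rc_file"
}

# Example call, run from the suite directory (path assumed):
#   unpin_postproc site/archer2.rc
```

The function only edits the file; the reload and retrigger steps still have to be done afterwards.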

Grenville
