Previously working job now seg-faulting on 1st timestep after OS upgrade (ARCHER2 v8.4 GA4 UM-UKCA)

I’m getting an error message when I submit the GA4 UM-UKCA job from PUMA2 to ARCHER2.

I’ve done all the set-up steps for PUMA2, following the instructions at the NCAS-CMS page:

https://ncas-cms.github.io/um-training/getting-setup-selfstudy.html#terminal

But I’m guessing there may also need to be an adjustment to the “machine bindings” file or one of the hand-edits; I’m not sure how to configure the GA4 UM-UKCA UMUI job.

When submitting from the UMUI, it’s giving the following error message:

ERROR: can't use non-numeric string as operand of "!" while attempting to access account gmann on host  login.archer2.ac.uk. Note that repeated failures may result in  expiry of password due to security procedures on some  machines. Check user id, hostname and password  for your account on the host machine.

Please can you help me get back up and running with the GA4 UM-UKCA job on ARCHER2 from PUMA2?

Thanks
Graham

Hi Graham

I’m still working on this - I did have the job running at one point, but broke it while trying to speed it up. It’s OOMing now.

The submit button does not work from PUMA2 - users are (or will be) asked to run the UMSUBMIT_ARCHER2 script in the job directory.
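
i.e., roughly (job id here is just a placeholder):

puma2$ cd ~/umui_jobs/<jobid>
puma2$ ./UMSUBMIT_ARCHER2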

There are a couple of incorrect puma paths in xprbs that are fixed in my job xpskb (which is OOMing).

Grenville

Graham

xpskb is running - I had to change to 1 OMP thread. Sorry we don’t have the resources to chase down a threading problem in a umui job - it seems to be running fast enough.

See also https://cms.ncas.ac.uk/puma2/umui

Grenville

Hi Grenville,

Thanks a lot for this.

Ah, OK – I updated the host name in the xpsq-u job
(to “ln01” instead of “login.archer2.ac.uk”).

I then copied xpsq-u to xpsq-v, also adding to xpsq-v the changes from your test job xpsk-b:

→ updated revision number for container-file (22831 → 22852)
→ updated revision number for vn8.4_ncas branch (22838 → 22852)
→ change to 1 OMP thread and PE-decomp 16x16 rather than 2 OMP threads and 8x24
→ Reconfig to 4x6 instead of 8x28

Submitting via the command-line UMSUBMIT_ARCHER2 script, the job then compiled OK.

The reconfiguration compile & run both also completed successfully.
And the actual run-job for the model executable is in the queue as a 2-node job.
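
(Sanity-checking the arithmetic: the 16x16 decomposition with 1 OMP thread is 256 MPI tasks, which fills two 128-core ARCHER2 nodes – hence the 2-node job.)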

Can I check then – you mentioned an OOM problem (Out Of Memory, right?).
So I guess this will fail once it is queued to run.

From the wording there, it sounds like this is something you know how to fix?
(Or a “known/expected” interim error?)

Please can you jot down any additional steps needed.
If straightforward, I could potentially try a further-amended job, e.g. with an edit to the settings/PE configuration?

Graham

The recon and model ran successfully for me. I meant that we don’t have the resource to fix problems that arise from running on 2 threads.

Grenville

Hi Grenville,

Thanks for this.

OK, great that the job ran OK for you.

For me, the xpsq-v job was queued to run this evening, but failed straight away,
giving the “OOM” error you mentioned (see the info below from the .leave file).

I was referring to your job xpsk-b for the changes to implement, and that included
running on only 1 OMP thread.

Please can you clarify, which job id should I refer to for the job that ran OK for you?

Thanks
Graham

*****************************************************************
*****************************************************************
     Job started at : Wed 06 Dec 2023 08:38:51 PM GMT
*****************************************************************
*****************************************************************
     Run started from UMUI
cp -u with changes from Grenville`s xpsq-u
This job is using UM directory /work/y07/shared/umshared,
-------------------------------
Processing STASHC file for ROSE
-------------------------------
Backup of STASHC file created! 
/work/n02/n02/gmann/tmp/tmp.nid004581.140194/xpsqv.stashc_preROSE
-------------------------------
Processing STASHC file complete
-------------------------------
***************************************************************
   Starting script :   qsatmos
   Starting time   :   Wed 06 Dec 2023 08:38:52 PM GMT
***************************************************************


/work/n02/n02/gmann/um/xpsqv/bin/qsatmos: Executing model run

*********************************************************
UM Executable : /work/n02/n02/gmann/um/xpsqv/bin/xpsqv.exe
*********************************************************


slurmstepd: error: Detected 2 oom-kill event(s) in StepId=5017099.0. Some of your processes may have been killed by the cgroup out-of-memory handler.
srun: error: nid004582: tasks 155,175: Out Of Memory
srun: launch/slurm: _step_signal: Terminating StepId=5017099.0
slurmstepd: error: *** STEP 5017099.0 ON nid004581 CANCELLED AT 2023-12-06T20:39:07 ***
srun: error: nid004581: tasks 0-127: Terminated
srun: Force Terminated StepId=5017099.0
xpsqv: Run failed
*****************************************************************
   Ending script   :   qsatmos
   Completion code :   1
   Completion time :   Wed 06 Dec 2023 08:39:08 PM GMT
*****************************************************************


/work/n02/n02/gmann/um/xpsqv/bin/qsmaster: Failed in qsatmos in job xpsqv
***************************************************************
   Starting script :   qsfinal
   Starting time   :   Wed 06 Dec 2023 08:39:08 PM GMT
***************************************************************

Checking requirement for atmosphere resubmit...
/work/n02/n02/gmann/um/xpsqv/bin/qsresubmit: Error: no resubmit details found
*****************************************************************
   Ending script   :   qsfinal
   Completion code :   0
   Completion time :   Wed 06 Dec 2023 08:39:08 PM GMT
*****************************************************************

/work/n02/n02/gmann/um/xpsqv/bin/qsmaster: Failed in qsfinal in job xpsqv

Graham

My job xpskb ran for 413 timesteps (I’d asked for 20 mins wallclock, so it ran out of time).
see /work/n02/n02/grenvill/um/xpskb/pe_output/xpskb.fort6.pe0

Grenville

Hi Grenville,

Thanks for this.

I cross-checked my xpsq-v job against your xpsk-b, and there were only very minor diffs.
I updated it to a 20-min job, and my model run xpsq-x is then identical to your job xpsk-b.

But the xpsq-x job is giving the same “Out of Memory” error message as the xpsq-v.

Please can you take a look and see if something is somehow not set up quite right in my environment?

Thanks
Graham

PS I’ve copied below the standard output from the .leave file (for the model run);
the .leave file is here:

/home/n02/n02/gmann/output/xpsqx000.xpsqx.d23341.t125842.leave

***************************************************************
   Starting script :   qsatmos
   Starting time   :   Thu 07 Dec 2023 01:54:41 PM GMT
***************************************************************


/work/n02/n02/gmann/um/xpsqx/bin/qsatmos: Executing model run

*********************************************************
UM Executable : /work/n02/n02/gmann/um/xpsqx/bin/xpsqx.exe
*********************************************************


slurmstepd: error: Detected 1 oom-kill event(s) in StepId=5033810.0. Some of your processes may have been killed by the cgroup out-of-memory handler.
srun: error: nid001752: task 86: Out Of Memory
srun: launch/slurm: _step_signal: Terminating StepId=5033810.0
slurmstepd: error: *** STEP 5033810.0 ON nid001752 CANCELLED AT 2023-12-07T13:54:52 ***
srun: error: nid001753: tasks 128-255: Terminated
srun: Force Terminated StepId=5033810.0
xpsqx: Run failed
*****************************************************************
   Ending script   :   qsatmos
   Completion code :   1
   Completion time :   Thu 07 Dec 2023 01:54:53 PM GMT
*****************************************************************


/work/n02/n02/gmann/um/xpsqx/bin/qsmaster: Failed in qsatmos in job xpsqx
***************************************************************
   Starting script :   qsfinal
   Starting time   :   Thu 07 Dec 2023 01:54:53 PM GMT
***************************************************************

Checking requirement for atmosphere resubmit...
/work/n02/n02/gmann/um/xpsqx/bin/qsresubmit: Error: no resubmit details found
*****************************************************************
   Ending script   :   qsfinal
   Completion code :   0
   Completion time :   Thu 07 Dec 2023 01:54:53 PM GMT
*****************************************************************


/work/n02/n02/gmann/um/xpsqx/bin/qsmaster: Failed in qsfinal in job xpsqx
 <<<< Information about How Many Lines of Output follow >>>>
 9  lines in main OUTPUT file.
 0 lines of O/P from pe0.
 <<<<         Lines of Output Information ends          >>>>


 ==============================================================================
 =================================== OUTPUT ===================================
 ==============================================================================

 UMUI Namelist output in /work/n02/n02/gmann/um/xpsqx/xpsqx.umui.nl
 DATAW/DATAM file listing in /work/n02/n02/gmann/um/xpsqx/xpsqx.list
 STASH output should be in /work/n02/n02/gmann/um/xpsqx/xpsqx.stash


 ==============================================================================
 =============================== UM RUN OUTPUT ================================
 ==============================================================================

qsatmos: %MODEL% output follows:-

qsatmos: Stack requested for UM job:  GB
srun --cpus-per-task=1 --hint=nomultithread --distribution=block:block /work/n02/n02/gmann/um/xpsqx/bin/xpsqx.exe
0+1 records in
0+1 records out

Hi Graham,

Don’t know if you made any changes to your job since your last message, but I’ve just copied your xpsqv job (as xpcnd) and it’s run OK to timestep 420 (I ran it in the short queue, so that’s just as far as it could get in the 20 mins allowed). The only changes I made were to use my userid, budget, run in the short queue, and recon on 1 node.

Regards,
Ros.

Hi Ros,
Thanks for this.
OK, that’s strange then, but that’s good news I guess.
Dunno if it could just be because this was the 1st job I’ve run from PUMA2 (to ARCHER2).

I’ll try submitting it again – did you change to the short queue via a hand-edit?
Or does it make that queue decision based on the requested wall-clock time limit?

Also – can you think of any reason for the OOM (memory-limit) error?
(e.g. an old environment file left over from pumanew or similar?)

Thanks
Graham

I tried submitting the job again last night, copied to a clean new job xpsq-y, also changing
the reconfig hand-edit to “N” there, to match exactly the settings in your job xpcn-d.

But that just gave exactly the same situation: compiling and reconfiguring OK, but
then the same “Out Of Memory” error when the 2-node (256-core) parallel job is queued to run.

-rw-r--r-- 1 gmann n02 161828555 Dec  4 11:21 xpskb.fort6.pe0
-rw-r--r-- 1 gmann n02  14596396 Dec  4 14:37 xpsqv000.xpsqv.d23338.t134138.comp.leave
-rw-r--r-- 1 gmann n02    268896 Dec  5 03:56 xpsqv000.xpsqv.d23338.t134138.rcf.leave
-rw-r--r-- 1 gmann n02      5766 Dec  6 20:39 xpsqv000.xpsqv.d23338.t134138.leave
-rw-r--r-- 1 gmann n02  14596474 Dec  7 12:48 xpsqw000.xpsqw.d23341.t115430.comp.leave
-rw-r--r-- 1 gmann n02    269106 Dec  7 12:49 xpsqw000.xpsqw.d23341.t115430.rcf.leave
-rw-r--r-- 1 gmann n02  14596478 Dec  7 13:48 xpsqx000.xpsqx.d23341.t125842.comp.leave
-rw-r--r-- 1 gmann n02    268955 Dec  7 13:53 xpsqx000.xpsqx.d23341.t125842.rcf.leave
-rw-r--r-- 1 gmann n02      5822 Dec  7 13:54 xpsqx000.xpsqx.d23341.t125842.leave
-rw-r--r-- 1 gmann n02  14596546 Dec  9 01:06 xpsqy000.xpsqy.d23343.t001918.comp.leave
-rw-r--r-- 1 gmann n02    268930 Dec  9 01:13 xpsqy000.xpsqy.d23343.t001918.rcf.leave
-rw-r--r-- 1 gmann n02      5855 Dec  9 01:15 xpsqy000.xpsqy.d23343.t001918.leave
gmann@ln02:~/output> tail xpsqy000.xpsqy.d23343.t001918.rcf.leave
0+1 records in
0+1 records out
257825 bytes (258 kB, 252 KiB) copied, 0.000506815 s, 509 MB/s
*****************************************************************
****************************************************************
     Job ended at :  Sat 09 Dec 2023 01:13:30 AM GMT
****************************************************************
*****************************************************************
 
Submitted batch job 5043383
gmann@ln02:~/output> cat xpsqy000.xpsqy.d23343.t001918.leave

Currently Loaded Modules:
  1) craype-x86-rome
  2) libfabric/1.12.1.2.2.0.0
  3) craype-network-ofi
  4) perftools-base/22.12.0
  5) xpmem/2.5.2-2.4_3.30__gd0f7936.shasta
  6) craype/2.7.19
  7) cray-dsmml/0.2.2
  8) cray-mpich/8.1.23
  9) cray-libsci/22.12.1.1
 10) PrgEnv-cray/8.3.3
 11) bolt/0.8
 12) epcc-setup-env
 13) load-epcc-module
 14) cce/15.0.0
 15) cray-hdf5-parallel/1.12.2.1
 16) cray-netcdf-hdf5parallel/4.9.0.1
 17) um/2023.06

 

*****************************************************************
     Version 8.4 template, Unified Model ,  Non-Operational
     Created by UMUI version 8.4                       
*****************************************************************
Host is nid003708
PATH used = /opt/cray/pe/netcdf-hdf5parallel/4.9.0.1/bin:/opt/cray/pe/hdf5-parallel/1.12.2.1/bin:/opt/cray/pe/hdf5/1.12.2.1/bin:/opt/cray/pe/cce/15.0.0/binutils/x86_64/x86_64-pc-linux-gnu/bin:/opt/cray/pe/cce/15.0.0/binutils/cross/x86_64-aarch64/aarch64-linux-gnu/../bin:/opt/cray/pe/cce/15.0.0/utils/x86_64/bin:/opt/cray/pe/cce/15.0.0/bin:/opt/cray/pe/cce/15.0.0/cce-clang/x86_64/bin:/work/y07/shared/utils/core/bolt/0.8/bin:/work/y07/shared/utils/core/bin:/opt/cray/pe/mpich/8.1.23/ofi/crayclang/10.0/bin:/opt/cray/pe/mpich/8.1.23/bin:/opt/cray/pe/craype/2.7.19/bin:/opt/cray/pe/perftools/22.12.0/bin:/opt/cray/pe/papi/6.0.0.17/bin:/opt/cray/libfabric/1.12.1.2.2.0.0/bin:/usr/local/bin:/usr/bin:/bin:/usr/lib/mit/bin:/opt/cray/pe/bin:/work/y07/shared/umshared/vn8.4/cce/utils:/work/y07/shared/umshared/bin:/work/y07/shared/umshared/vn8.4/bin:/work/n02/n02/gmann/um/xpsqy/bin:/work/y07/shared/umshared/vn8.4/cce/scripts:/work/y07/shared/umshared/vn8.4/cce/exec
*****************************************************************
*****************************************************************
     Job started at : Sat 09 Dec 2023 01:14:44 AM GMT
*****************************************************************
*****************************************************************
     Run started from UMUI
cp -x but switch-off increase-nodes-Reconfig hand-edit (same as Ros`s xpcnd)
This job is using UM directory /work/y07/shared/umshared,
-------------------------------
Processing STASHC file for ROSE
-------------------------------
Backup of STASHC file created! 
/work/n02/n02/gmann/tmp/tmp.nid003708.174571/xpsqy.stashc_preROSE
-------------------------------
Processing STASHC file complete
-------------------------------
***************************************************************
   Starting script :   qsatmos
   Starting time   :   Sat 09 Dec 2023 01:14:44 AM GMT
***************************************************************


/work/n02/n02/gmann/um/xpsqy/bin/qsatmos: Executing model run

*********************************************************
UM Executable : /work/n02/n02/gmann/um/xpsqy/bin/xpsqy.exe
*********************************************************


slurmstepd: error: Detected 1 oom-kill event(s) in StepId=5043383.0. Some of your processes may have been killed by the cgroup out-of-memory handler.
srun: error: nid003711: task 189: Out Of Memory
srun: launch/slurm: _step_signal: Terminating StepId=5043383.0
slurmstepd: error: *** STEP 5043383.0 ON nid003708 CANCELLED AT 2023-12-09T01:15:03 ***
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=5043383.0. Some of your processes may have been killed by the cgroup out-of-memory handler.
xpsqy: Run failed
*****************************************************************
   Ending script   :   qsatmos
   Completion code :   1
   Completion time :   Sat 09 Dec 2023 01:15:04 AM GMT
*****************************************************************


/work/n02/n02/gmann/um/xpsqy/bin/qsmaster: Failed in qsatmos in job xpsqy
***************************************************************
   Starting script :   qsfinal
   Starting time   :   Sat 09 Dec 2023 01:15:04 AM GMT
***************************************************************

Checking requirement for atmosphere resubmit...
/work/n02/n02/gmann/um/xpsqy/bin/qsresubmit: Error: no resubmit details found
*****************************************************************
   Ending script   :   qsfinal
   Completion code :   0
   Completion time :   Sat 09 Dec 2023 01:15:04 AM GMT
*****************************************************************


/work/n02/n02/gmann/um/xpsqy/bin/qsmaster: Failed in qsfinal in job xpsqy
 <<<< Information about How Many Lines of Output follow >>>>
 9  lines in main OUTPUT file.
 0 lines of O/P from pe0.
 <<<<         Lines of Output Information ends          >>>>


 ============================================================================== 
 =================================== OUTPUT =================================== 
 ============================================================================== 

 UMUI Namelist output in /work/n02/n02/gmann/um/xpsqy/xpsqy.umui.nl
 DATAW/DATAM file listing in /work/n02/n02/gmann/um/xpsqy/xpsqy.list
 STASH output should be in /work/n02/n02/gmann/um/xpsqy/xpsqy.stash

 
 ============================================================================== 
 =============================== UM RUN OUTPUT ================================ 
 ============================================================================== 

qsatmos: %MODEL% output follows:-

qsatmos: Stack requested for UM job:  GB
srun --cpus-per-task=1 --hint=nomultithread --distribution=block:block /work/n02/n02/gmann/um/xpsqy/bin/xpsqy.exe
0+1 records in
0+1 records out
436 bytes copied, 7.745e-05 s, 5.6 MB/s
*****************************************************************
****************************************************************
     Job ended at :  Sat 09 Dec 2023 01:15:04 AM GMT
****************************************************************
*****************************************************************
 

I did also try logging out of PUMA2 and doing a manual submit of the umuisubmit_run script on ARCHER2 (initially with “srun umuisubmit_run”, and then with “ksh ./umuisubmit_run”, from the umui_runs directory).

But that manual submit of the umuisubmit_run script didn’t work
(this used to work fine on ARCHER-1, via a “qsub umuisubmit_run” from the umui_runs dir).

drwxr-xr-x 2 gmann n02 4096 Dec  4 13:26 xpsqu-338132642/
drwxr-xr-x 2 gmann n02 4096 Dec  4 13:41 xpsqv-338134132/
drwxr-xr-x 2 gmann n02 4096 Dec  7 11:54 xpsqw-341115424/
drwxr-xr-x 2 gmann n02 4096 Dec  7 12:58 xpsqx-341125836/
drwxr-xr-x 2 gmann n02 4096 Dec  9 00:19 xpsqy-343001913/
gmann@ln02:~/umui_runs> cd xpsqy-343001913/
gmann@ln02:~/umui_runs/xpsqy-343001913> ls -lrt
total 388
-rw-r--r-- 1 gmann n02     43 Dec  9 00:19 USR_PATHS_OVRDS
-rw-r--r-- 1 gmann n02      0 Dec  9 00:19 USR_MACH_OVRDS
-rw-r--r-- 1 gmann n02    143 Dec  9 00:19 USR_FILE_OVRDS
-rwxr-xr-x 1 gmann n02   4193 Dec  9 00:19 UMSUBMIT_ARCHER2*
-rwxr-xr-x 1 gmann n02   6373 Dec  9 00:19 UMSUBMIT*
-rw-r--r-- 1 gmann n02    175 Dec  9 00:19 UAFLDS_A
-rw-r--r-- 1 gmann n02    156 Dec  9 00:19 UAFILES_A
-rw-r--r-- 1 gmann n02  53093 Dec  9 00:19 STASHC
-rw-r--r-- 1 gmann n02   2689 Dec  9 00:19 SIZES
-rw-r--r-- 1 gmann n02   8028 Dec  9 00:19 SHARED
-rw-r--r-- 1 gmann n02   8651 Dec  9 00:19 SCRIPT
-rw-r--r-- 1 gmann n02   7723 Dec  9 00:19 RECONA
-rw-r--r-- 1 gmann n02 108830 Dec  9 00:19 PRESM_A
-rw-r--r-- 1 gmann n02    120 Dec  9 00:19 PPCNTL
-rwxr-xr-x 1 gmann n02   2983 Dec  9 00:19 MAIN_SCR*
-rw-r--r-- 1 gmann n02     77 Dec  9 00:19 IOSCNTL
-rw-r--r-- 1 gmann n02     99 Dec  9 00:19 INITHIS
-rw-r--r-- 1 gmann n02   4556 Dec  9 00:19 INITFILEENV
-rw-r--r-- 1 gmann n02   5479 Dec  9 00:19 FCM_UMSCRIPTS_CFG
-rw-r--r-- 1 gmann n02   7265 Dec  9 00:19 FCM_UMRECON_CFG
-rw-r--r-- 1 gmann n02   8006 Dec  9 00:19 FCM_UMATMOS_CFG
-rw-r--r-- 1 gmann n02   3427 Dec  9 00:19 FCM_BLD_COMMAND
-rw-r--r-- 1 gmann n02   5107 Dec  9 00:19 EXT_SCRIPT_LOG
-rwxr-xr-x 1 gmann n02   4194 Dec  9 00:19 EXTR_SCR*
-rwxr-xr-x 1 gmann n02    289 Dec  9 00:19 COMP_SWITCHES*
-rw-r--r-- 1 gmann n02    718 Dec  9 00:19 CNTLGEN
-rw-r--r-- 1 gmann n02  26574 Dec  9 00:19 CNTLATM
-rw-r--r-- 1 gmann n02   7560 Dec  9 00:19 CNTLALL
-rwxr-xr-x 1 gmann n02  13198 Dec  9 00:19 SUBMIT*
-rwxr-xr-x 1 gmann n02  10110 Dec  9 00:19 umuisubmit_run*
-rwxr-xr-x 1 gmann n02  10092 Dec  9 00:19 umuisubmit_rcf*
-rwxr-xr-x 1 gmann n02   4042 Dec  9 00:19 umuisubmit_compile*
-rwxr-xr-x 1 gmann n02    497 Dec  9 00:19 stage_2_submit*
-rwxr-xr-x 1 gmann n02    494 Dec  9 00:19 stage_1_submit*
gmann@ln02:~/umui_runs/xpsqy-343001913> vi umuisubmit_run
gmann@ln02:~/umui_runs/xpsqy-343001913> srun umuisubmit_run
srun: error: Job rejected: Please specify a partition name.
srun: error: Unable to allocate resources: Unspecified error
gmann@ln02:~/umui_runs/xpsqy-343001913> vi stage_2_submit
gmann@ln02:~/umui_runs/xpsqy-343001913> vi umuisubmit_run
gmann@ln02:~/umui_runs/xpsqy-343001913> ksh ./umuisubmit_run

Currently Loaded Modules:
  1) craype-x86-rome                         7) cray-dsmml/0.2.2       13) load-epcc-module
  2) libfabric/1.12.1.2.2.0.0                8) cray-mpich/8.1.23      14) cce/15.0.0
  3) craype-network-ofi                      9) cray-libsci/22.12.1.1  15) cray-hdf5-parallel/1.12.2.1
  4) perftools-base/22.12.0                 10) PrgEnv-cray/8.3.3      16) cray-netcdf-hdf5parallel/4.9.0.1
  5) xpmem/2.5.2-2.4_3.30__gd0f7936.shasta  11) bolt/0.8               17) um/2023.06
  6) craype/2.7.19                          12) epcc-setup-env

 

*****************************************************************
     Version 8.4 template, Unified Model ,  Non-Operational
     Created by UMUI version 8.4                       
*****************************************************************
Host is ln02
PATH used = /opt/cray/pe/netcdf-hdf5parallel/4.9.0.1/bin:/opt/cray/pe/hdf5-parallel/1.12.2.1/bin:/opt/cray/pe/hdf5/1.12.2.1/bin:/opt/cray/pe/cce/15.0.0/binutils/x86_64/x86_64-pc-linux-gnu/bin:/opt/cray/pe/cce/15.0.0/binutils/cross/x86_64-aarch64/aarch64-linux-gnu/../bin:/opt/cray/pe/cce/15.0.0/utils/x86_64/bin:/opt/cray/pe/cce/15.0.0/bin:/opt/cray/pe/cce/15.0.0/cce-clang/x86_64/bin:/work/y07/shared/utils/core/bolt/0.8/bin:/work/y07/shared/utils/core/bin:/opt/cray/pe/mpich/8.1.23/ofi/crayclang/10.0/bin:/opt/cray/pe/mpich/8.1.23/bin:/opt/cray/pe/craype/2.7.19/bin:/opt/cray/pe/perftools/22.12.0/bin:/opt/cray/pe/papi/6.0.0.17/bin:/opt/cray/libfabric/1.12.1.2.2.0.0/bin:/home/n02/n02/gmann/bin:/usr/local/bin:/usr/bin:/bin:/usr/lib/mit/bin:/opt/cray/pe/bin:/work/y07/shared/umshared/software/bin:/work/y07/shared/umshared/bin:/work/y07/shared/utils/core/python/miniconda2/bin:/work/y07/shared/umshared/vn8.4/cce/utils:/work/y07/shared/umshared/bin:/work/y07/shared/umshared/vn8.4/bin:/work/n02/n02/gmann/um/xpsqy/bin:/work/y07/shared/umshared/vn8.4/cce/scripts:/work/y07/shared/umshared/vn8.4/cce/exec
*****************************************************************
*****************************************************************
     Job started at : Sat  9 Dec 01:34:14 GMT 2023
*****************************************************************
*****************************************************************
     Run started from UMUI
cp -x but switch-off increase-nodes-Reconfig hand-edit (same as Ros`s xpcnd)
This job is using UM directory /work/y07/shared/umshared,
-------------------------------
Processing STASHC file for ROSE
-------------------------------
Backup of STASHC file created! 
/work/n02/n02/gmann/tmp/tmp.ln02.235323/xpsqy.stashc_preROSE
-------------------------------
Processing STASHC file complete
-------------------------------
***************************************************************
   Starting script :   qsatmos
   Starting time   :   Sat  9 Dec 01:34:14 GMT 2023
***************************************************************


/work/n02/n02/gmann/um/xpsqy/bin/qsatmos: Executing model run

*********************************************************
UM Executable : /work/n02/n02/gmann/um/xpsqy/bin/xpsqy.exe
*********************************************************


srun: error: Job rejected: Please specify a partition name.
srun: error: Unable to allocate resources: Unspecified error
xpsqy: Run failed
*****************************************************************
   Ending script   :   qsatmos
   Completion code :   1
   Completion time :   Sat  9 Dec 01:34:16 GMT 2023
*****************************************************************


/work/n02/n02/gmann/um/xpsqy/bin/qsmaster: Failed in qsatmos in job xpsqy
***************************************************************
   Starting script :   qsfinal
   Starting time   :   Sat  9 Dec 01:34:16 GMT 2023
***************************************************************

Checking requirement for atmosphere resubmit...
/work/n02/n02/gmann/um/xpsqy/bin/qsresubmit: Error: no resubmit details found
*****************************************************************
   Ending script   :   qsfinal
   Completion code :   0
   Completion time :   Sat  9 Dec 01:34:16 GMT 2023
*****************************************************************


/work/n02/n02/gmann/um/xpsqy/bin/qsmaster: Failed in qsfinal in job xpsqy
 <<<< Information about How Many Lines of Output follow >>>>
 9  lines in main OUTPUT file.
 0 lines of O/P from pe0.
 <<<<         Lines of Output Information ends          >>>>


 ============================================================================== 
 =================================== OUTPUT =================================== 
 ============================================================================== 

 UMUI Namelist output in /work/n02/n02/gmann/um/xpsqy/xpsqy.umui.nl
 DATAW/DATAM file listing in /work/n02/n02/gmann/um/xpsqy/xpsqy.list
 STASH output should be in /work/n02/n02/gmann/um/xpsqy/xpsqy.stash

 
 ============================================================================== 
 =============================== UM RUN OUTPUT ================================ 
 ============================================================================== 

qsatmos: %MODEL% output follows:-

qsatmos: Stack requested for UM job:  GB
srun --cpus-per-task=1 --hint=nomultithread --distribution=block:block /work/n02/n02/gmann/um/xpsqy/bin/xpsqy.exe
0+1 records in
0+1 records out
436 bytes copied, 7.1128e-05 s, 6.1 MB/s
*****************************************************************
****************************************************************
     Job ended at :  Sat  9 Dec 01:34:16 GMT 2023
****************************************************************
*****************************************************************

Hi Ros and Grenville,

Please can either of you confirm the exact way you did the “manual submit”.

The main way I was trying was to follow exactly the instructions on the PUMA2 UMUI web page here: https://cms.ncas.ac.uk/puma2/umui

i.e. I just did ./UMSUBMIT_ARCHER2 from my '~/umui_jobs/${JOBID}/' directory

Not sure if something could be different in my Linux environment set-up, but I followed the steps on the NCAS-CMS webpage for the transition: https://cms.ncas.ac.uk/puma2/

Thanks in advance for your help figuring out what’s different here in my set-up.

Best regards,
Graham

Hi Graham,

Yes just ./UMSUBMIT_ARCHER2 from inside the ~/umui_jobs/JOBID directory.

I’ve also tried running with your ARCHER2 .bash_profile and your model executable and it still works fine for me, so at this precise moment I don’t know what else to suggest.

I will have another look later today.

Regards,
Ros.

Hi Ros,

That’s really strange then – so you can set the job running OK from the xpsqy model executable,
but for me the model run fails straight away with an “out of memory” error message.

I tried again, submitting the xpsq-y job (after deleting the um_extracts and /work/n02/n02/gmann/um/xpsqy directory for a clean extract & run), but this just gave exactly the same behaviour – it compiled OK, reconfigured OK, but then crashed immediately with an out-of-memory error.

I’ve taken a screenshot and uploaded the PNG below to show the error message I’m getting.

Please can you check anything you can think of here that might be causing this problem?

I can’t see why this would make any difference, but should I try setting up my environment again for PUMA2 submission to ARCHER2?

Or I could also try submitting directly from the model executable in the same way you have.

Can you send me instructions for how I can try that, to submit from the model executable
(do I copy the job to another UMUI job, and set this to run from the other executable?)

Thanks
Graham

Hi Graham,

Your PUMA2 environment won’t affect the running of an executable on ARCHER2.

I didn’t do anything special to run from your executable. I just replaced my executable with yours and resubmitted the model step with:

archer2$ cd ~/umui_runs/xpcnd-<blah>
archer2$ sbatch umuisubmit_run

I’ve also copied your .bash_profile and .bashrc on ARCHER2 and it still runs ok. The only small error I see in the .bash_profile is that you should now have

. /work/y07/shared/umshared/bin/rose-um-env-puma2

rather than

. /work/y07/shared/umshared/bin/rose-um-env

I’ll retry overnight, just in case I made a mistake.

Regards,
Ros.

You could also try running the model on more nodes and see if that resolves it…
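
For example (just a sketch – I haven’t checked exactly how umuisubmit_run lays out its batch headers, so the directives below are standard SLURM ones rather than copied from the script): spreading the same 256 tasks over 4 nodes instead of 2 roughly doubles the memory available to each task, e.g.

#SBATCH --nodes=4
#SBATCH --ntasks-per-node=64

with the srun line left as it is.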

Hi Ros,

Thanks for this.

I tried doing sbatch umuisubmit_run on ARCHER2 from the ~/umui_runs/xpsqy- directory.

And that seems to work fine with that method (submitting from the umui_runs directory).

So that seems to be the solution (for me at least): ./UMSUBMIT_ARCHER2 on PUMA2 only completes the compile and reconfiguration steps, and the final stage – submitting the model run itself – has to be done from ARCHER2, via sbatch umuisubmit_run from the umui_runs directory.

(But the original submission from PUMA2 gives an immediate Out Of Memory error.)

So this is OK – I’m happy to proceed, knowing I’ll just need to do it that way, via the two-stage method (summarised below).
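
For my own notes, the two-stage workflow is then roughly (directory names abbreviated):

puma2$ cd ~/umui_jobs/xpsqy
puma2$ ./UMSUBMIT_ARCHER2         # extract/compile + reconfiguration stages
# ...then, once those have completed:
archer2$ cd ~/umui_runs/xpsqy-<timestamp>
archer2$ sbatch umuisubmit_run    # submits the model run itself

(I suspect the earlier srun/ksh attempts failed because, unlike sbatch, they don’t pick up the partition/QoS headers inside umuisubmit_run – hence the “Please specify a partition name” errors – but that’s just my guess.)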

Thanks so much for your help with this,

Best regards,
Graham

-rw-r--r-- 1 gmann n02 279274 Dec 19 00:21 xpsqy.fort6.pe115
-rw-r--r-- 1 gmann n02 279273 Dec 19 00:21 xpsqy.fort6.pe114
-rw-r--r-- 1 gmann n02 279225 Dec 19 00:21 xpsqy.fort6.pe113
-rw-r--r-- 1 gmann n02 279226 Dec 19 00:21 xpsqy.fort6.pe112
-rw-r--r-- 1 gmann n02 279275 Dec 19 00:21 xpsqy.fort6.pe111
-rw-r--r-- 1 gmann n02 279267 Dec 19 00:21 xpsqy.fort6.pe110
-rw-r--r-- 1 gmann n02 279205 Dec 19 00:21 xpsqy.fort6.pe11
-rw-r--r-- 1 gmann n02 279168 Dec 19 00:21 xpsqy.fort6.pe109
-rw-r--r-- 1 gmann n02 278986 Dec 19 00:21 xpsqy.fort6.pe108
-rw-r--r-- 1 gmann n02 279226 Dec 19 00:21 xpsqy.fort6.pe107
-rw-r--r-- 1 gmann n02 279235 Dec 19 00:21 xpsqy.fort6.pe106
-rw-r--r-- 1 gmann n02 279276 Dec 19 00:21 xpsqy.fort6.pe105
-rw-r--r-- 1 gmann n02 279265 Dec 19 00:21 xpsqy.fort6.pe104
-rw-r--r-- 1 gmann n02 279225 Dec 19 00:21 xpsqy.fort6.pe103
-rw-r--r-- 1 gmann n02 279272 Dec 19 00:21 xpsqy.fort6.pe102
-rw-r--r-- 1 gmann n02 279173 Dec 19 00:21 xpsqy.fort6.pe101
-rw-r--r-- 1 gmann n02 279126 Dec 19 00:21 xpsqy.fort6.pe100
-rw-r--r-- 1 gmann n02 279203 Dec 19 00:21 xpsqy.fort6.pe10
-rw-r--r-- 1 gmann n02 278990 Dec 19 00:21 xpsqy.fort6.pe1
-rw-r--r-- 1 gmann n02 308656 Dec 19 00:21 xpsqy.fort6.pe0
gmann@ln03:/work/n02/n02/gmann/um/xpsqy/pe_output> grep 'Atm_Step' xpsqy.fort6.pe0
Atm_Step: Timestep        1   Model time:   1990-12-01 00:20:00
 Atm_Step: L_USE_CARIOLLE =  F
 Atm_Step: Cariolle scheme not called
Atm_Step: Timestep        2   Model time:   1990-12-01 00:40:00
Atm_Step: Timestep        3   Model time:   1990-12-01 01:00:00
Atm_Step: Timestep        4   Model time:   1990-12-01 01:20:00
Atm_Step: Timestep        5   Model time:   1990-12-01 01:40:00
Atm_Step: Timestep        6   Model time:   1990-12-01 02:00:00
Atm_Step: Timestep        7   Model time:   1990-12-01 02:20:00
Atm_Step: Timestep        8   Model time:   1990-12-01 02:40:00
Atm_Step: Timestep        9   Model time:   1990-12-01 03:00:00
Atm_Step: Timestep       10   Model time:   1990-12-01 03:20:00
Atm_Step: Timestep       11   Model time:   1990-12-01 03:40:00
Atm_Step: Timestep       12   Model time:   1990-12-01 04:00:00
Atm_Step: Timestep       13   Model time:   1990-12-01 04:20:00
Atm_Step: Timestep       14   Model time:   1990-12-01 04:40:00
Atm_Step: Timestep       15   Model time:   1990-12-01 05:00:00
Atm_Step: Timestep       16   Model time:   1990-12-01 05:20:00
Atm_Step: Timestep       17   Model time:   1990-12-01 05:40:00
Atm_Step: Timestep       18   Model time:   1990-12-01 06:00:00
Atm_Step: Timestep       19   Model time:   1990-12-01 06:20:00
Atm_Step: Timestep       20   Model time:   1990-12-01 06:40:00
Atm_Step: Timestep       21   Model time:   1990-12-01 07:00:00
Atm_Step: Timestep       22   Model time:   1990-12-01 07:20:00
Atm_Step: Timestep       23   Model time:   1990-12-01 07:40:00
Atm_Step: Timestep       24   Model time:   1990-12-01 08:00:00
Atm_Step: Timestep       25   Model time:   1990-12-01 08:20:00
Atm_Step: Timestep       26   Model time:   1990-12-01 08:40:00