I’m not sure how to configure the GA4 UM-UKCA UMUI job, but I’m guessing there may also need to be an adjustment to the “machine bindings” file in one of the hand-edits.
When submitting from the UMUI, it’s giving the following error message:
ERROR: can't use non-numeric string as operand of "!" while attempting to access account gmann on host login.archer2.ac.uk. Note that repeated failures may result in expiry of password due to security procedures on some machines. Check user id, hostname and password for your account on the host machine.
Please can you help me get back up and running with the GA4 UM-UKCA job on ARCHER2 from PUMA2?
xpskb is running - I had to change to 1 OMP thread. Sorry, we don’t have the resources to chase down a threading problem in a UMUI job - it seems to be running fast enough.
Ah, OK – I updated the xpsq-u job for the host-name
(to “ln01” instead of “login.archer2.ac.uk”).
I then copied xpsq-u to xpsq-v, also adding to xpsq-v the changes from your test job xpsk-b:
→ updated revision number for container-file (22831 → 22852)
→ updated revision number for vn8.4_ncas branch (22838 → 22852)
→ change to 1 OMP thread and PE-decomp 16x16 rather than 2 OMP threads and 8x24
→ Reconfig to 4x6 instead of 8x28
Submitting via the command-line UMSUBMIT_ARCHER2 script, the job then compiled OK.
The reconfiguration compile & run both also completed successfully.
And the actual run-job for the model executable is in the queue as a 2-node job.
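(As a sanity check on the sizes - just arithmetic, assuming ARCHER2’s standard 128-core compute nodes rather than anything read from the job: a 16x16 decomposition with 1 OMP thread is 256 cores, i.e. exactly 2 nodes.)

```shell
# Cores implied by the PE decomposition; 128 cores/node is the
# standard ARCHER2 compute node (stated assumption, not read from
# the job files).
ew=16; ns=16; omp=1
cores=$((ew * ns * omp))
nodes=$(( (cores + 127) / 128 ))   # round up to whole nodes
echo "$cores cores -> $nodes nodes"
```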
Can I check then: you mentioned an OOM problem (Out Of Memory, right?).
And I guess this will fail once it gets queued to run.
From the wording there, it sounds like this is something you know how to fix?
(or a “known/expected” interim error or so?)
Please can you jot down any additional next steps.
If straightforward, I could potentially try a further-amended job, e.g. with an edit to the settings/PE-config?
For me, the xpsq-v job was queued to run this evening, but failed straight away,
and is giving the “OOM” error you mentioned (see the info below from the .leave file).
I was referring to your job xpsk-b for the changes to implement, and that also included
running on only 1 OMP thread.
Please can you clarify which job ID I should refer to for the job that ran OK for you?
Thanks
Graham
*****************************************************************
*****************************************************************
Job started at : Wed 06 Dec 2023 08:38:51 PM GMT
*****************************************************************
*****************************************************************
Run started from UMUI
cp -u with changes from Grenville`s xpsq-u
This job is using UM directory /work/y07/shared/umshared,
-------------------------------
Processing STASHC file for ROSE
-------------------------------
Backup of STASHC file created!
/work/n02/n02/gmann/tmp/tmp.nid004581.140194/xpsqv.stashc_preROSE
-------------------------------
Processing STASHC file complete
-------------------------------
***************************************************************
Starting script : qsatmos
Starting time : Wed 06 Dec 2023 08:38:52 PM GMT
***************************************************************
/work/n02/n02/gmann/um/xpsqv/bin/qsatmos: Executing model run
*********************************************************
UM Executable : /work/n02/n02/gmann/um/xpsqv/bin/xpsqv.exe
*********************************************************
slurmstepd: error: Detected 2 oom-kill event(s) in StepId=5017099.0. Some of your processes may have been killed by the cgroup out-of-memory handler.
srun: error: nid004582: tasks 155,175: Out Of Memory
srun: launch/slurm: _step_signal: Terminating StepId=5017099.0
slurmstepd: error: *** STEP 5017099.0 ON nid004581 CANCELLED AT 2023-12-06T20:39:07 ***
srun: error: nid004581: tasks 0-127: Terminated
srun: Force Terminated StepId=5017099.0
xpsqv: Run failed
*****************************************************************
Ending script : qsatmos
Completion code : 1
Completion time : Wed 06 Dec 2023 08:39:08 PM GMT
*****************************************************************
/work/n02/n02/gmann/um/xpsqv/bin/qsmaster: Failed in qsatmos in job xpsqv
***************************************************************
Starting script : qsfinal
Starting time : Wed 06 Dec 2023 08:39:08 PM GMT
***************************************************************
Checking requirement for atmosphere resubmit...
/work/n02/n02/gmann/um/xpsqv/bin/qsresubmit: Error: no resubmit details found
*****************************************************************
Ending script : qsfinal
Completion code : 0
Completion time : Wed 06 Dec 2023 08:39:08 PM GMT
*****************************************************************
/work/n02/n02/gmann/um/xpsqv/bin/qsmaster: Failed in qsfinal in job xpsqv
my job xpskb ran for 413 timesteps (I’d asked for 20 mins wallclock, so it ran out of time)
see /work/n02/n02/grenvill/um/xpskb/pe_output/xpskb.fort6.pe0
I cross-checked my xpsq-v job against your xpsk-b, and there were only very minor diffs.
I updated it to a 20-min job, and my model run xpsq-x is then identical to your job xpsk-b.
But the xpsq-x job is giving the same “Out of Memory” error message as xpsq-v.
Please can you take a look and see if something is somehow not set up quite right in my environment?
Thanks
Graham
PS I’ve copied below the standard output from the .leave file (the model run)
and the .leave file is here
***************************************************************
Starting script : qsatmos
Starting time : Thu 07 Dec 2023 01:54:41 PM GMT
***************************************************************
/work/n02/n02/gmann/um/xpsqx/bin/qsatmos: Executing model run
*********************************************************
UM Executable : /work/n02/n02/gmann/um/xpsqx/bin/xpsqx.exe
*********************************************************
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=5033810.0. Some of your processes may have been killed by the cgroup out-of-memory handler.
srun: error: nid001752: task 86: Out Of Memory
srun: launch/slurm: _step_signal: Terminating StepId=5033810.0
slurmstepd: error: *** STEP 5033810.0 ON nid001752 CANCELLED AT 2023-12-07T13:54:52 ***
srun: error: nid001753: tasks 128-255: Terminated
srun: Force Terminated StepId=5033810.0
xpsqx: Run failed
*****************************************************************
Ending script : qsatmos
Completion code : 1
Completion time : Thu 07 Dec 2023 01:54:53 PM GMT
*****************************************************************
/work/n02/n02/gmann/um/xpsqx/bin/qsmaster: Failed in qsatmos in job xpsqx
***************************************************************
Starting script : qsfinal
Starting time : Thu 07 Dec 2023 01:54:53 PM GMT
***************************************************************
Checking requirement for atmosphere resubmit...
/work/n02/n02/gmann/um/xpsqx/bin/qsresubmit: Error: no resubmit details found
*****************************************************************
Ending script : qsfinal
Completion code : 0
Completion time : Thu 07 Dec 2023 01:54:53 PM GMT
*****************************************************************
/work/n02/n02/gmann/um/xpsqx/bin/qsmaster: Failed in qsfinal in job xpsqx
<<<< Information about How Many Lines of Output follow >>>>
9 lines in main OUTPUT file.
0 lines of O/P from pe0.
<<<< Lines of Output Information ends >>>>
==============================================================================
=================================== OUTPUT ===================================
==============================================================================
UMUI Namelist output in /work/n02/n02/gmann/um/xpsqx/xpsqx.umui.nl
DATAW/DATAM file listing in /work/n02/n02/gmann/um/xpsqx/xpsqx.list
STASH output should be in /work/n02/n02/gmann/um/xpsqx/xpsqx.stash
==============================================================================
=============================== UM RUN OUTPUT ================================
==============================================================================
qsatmos: %MODEL% output follows:-
qsatmos: Stack requested for UM job: GB
srun --cpus-per-task=1 --hint=nomultithread --distribution=block:block /work/n02/n02/gmann/um/xpsqx/bin/xpsqx.exe
0+1 records in
0+1 records out
Don’t know if you made any changes to your job since your last message, but I’ve just copied your xpsqv job (as xpcnd) and it’s run OK to timestep 420 (I ran it in the short queue, so this is just as far as it could get in the 20 mins allowed). The only changes I made were to use my userid, budget, run in the short queue, and recon on 1 node.
Hi Ros,
Thanks for this.
OK, that’s strange then, but that’s good news I guess.
I don’t know - maybe it could just be because this was the first job I’ve run on PUMA2 (to ARCHER2).
I’ll try submitting it again – did you change to the short queue via a hand-edit?
Or does it make that queue decision based on the requested wall-clock time limit?
Also – can you think of any reason for the OOM (memory-limit) error message?
(e.g. an old environment file from pumanew or similar)?
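(For reference, and hedged since I haven’t checked the hand-edit itself: SLURM doesn’t infer the queue from the wall-clock limit - the #SBATCH headers in umuisubmit_run have to name the QoS explicitly. A minimal sketch of that decision, assuming ARCHER2’s “short” QoS tops out at 20 minutes and “standard” is the normal QoS name:)

```shell
# Sketch only: pick a QoS from the requested wallclock (minutes).
# Assumes ARCHER2's "short" (<= 20 min) and "standard" QoS names;
# SLURM itself won't choose this for you - the job script must.
mins=20
if [ "$mins" -le 20 ]; then qos=short; else qos=standard; fi
printf '#SBATCH --qos=%s\n#SBATCH --time=00:%02d:00\n' "$qos" "$mins"
```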
I tried submitting the job again last night, copied to a clean new job xpsq-y, there also changing
the reconfig hand-edit to “N” to match exactly the settings in your job xpcn-d.
But that gave exactly the same situation: compiling and reconfiguring OK, but
then the same “Out Of Memory” error when the 2-node (256-core) parallel job is queued to run.
-rw-r--r-- 1 gmann n02 161828555 Dec 4 11:21 xpskb.fort6.pe0
-rw-r--r-- 1 gmann n02 14596396 Dec 4 14:37 xpsqv000.xpsqv.d23338.t134138.comp.leave
-rw-r--r-- 1 gmann n02 268896 Dec 5 03:56 xpsqv000.xpsqv.d23338.t134138.rcf.leave
-rw-r--r-- 1 gmann n02 5766 Dec 6 20:39 xpsqv000.xpsqv.d23338.t134138.leave
-rw-r--r-- 1 gmann n02 14596474 Dec 7 12:48 xpsqw000.xpsqw.d23341.t115430.comp.leave
-rw-r--r-- 1 gmann n02 269106 Dec 7 12:49 xpsqw000.xpsqw.d23341.t115430.rcf.leave
-rw-r--r-- 1 gmann n02 14596478 Dec 7 13:48 xpsqx000.xpsqx.d23341.t125842.comp.leave
-rw-r--r-- 1 gmann n02 268955 Dec 7 13:53 xpsqx000.xpsqx.d23341.t125842.rcf.leave
-rw-r--r-- 1 gmann n02 5822 Dec 7 13:54 xpsqx000.xpsqx.d23341.t125842.leave
-rw-r--r-- 1 gmann n02 14596546 Dec 9 01:06 xpsqy000.xpsqy.d23343.t001918.comp.leave
-rw-r--r-- 1 gmann n02 268930 Dec 9 01:13 xpsqy000.xpsqy.d23343.t001918.rcf.leave
-rw-r--r-- 1 gmann n02 5855 Dec 9 01:15 xpsqy000.xpsqy.d23343.t001918.leave
gmann@ln02:~/output> tail xpsqy000.xpsqy.d23343.t001918.rcf.leave
0+1 records in
0+1 records out
257825 bytes (258 kB, 252 KiB) copied, 0.000506815 s, 509 MB/s
*****************************************************************
****************************************************************
Job ended at : Sat 09 Dec 2023 01:13:30 AM GMT
****************************************************************
*****************************************************************
Submitted batch job 5043383
gmann@ln02:~/output> cat xpsqy000.xpsqy.d23343.t001918.leave
Currently Loaded Modules:
1) craype-x86-rome
2) libfabric/1.12.1.2.2.0.0
3) craype-network-ofi
4) perftools-base/22.12.0
5) xpmem/2.5.2-2.4_3.30__gd0f7936.shasta
6) craype/2.7.19
7) cray-dsmml/0.2.2
8) cray-mpich/8.1.23
9) cray-libsci/22.12.1.1
10) PrgEnv-cray/8.3.3
11) bolt/0.8
12) epcc-setup-env
13) load-epcc-module
14) cce/15.0.0
15) cray-hdf5-parallel/1.12.2.1
16) cray-netcdf-hdf5parallel/4.9.0.1
17) um/2023.06
*****************************************************************
Version 8.4 template, Unified Model , Non-Operational
Created by UMUI version 8.4
*****************************************************************
Host is nid003708
PATH used = /opt/cray/pe/netcdf-hdf5parallel/4.9.0.1/bin:/opt/cray/pe/hdf5-parallel/1.12.2.1/bin:/opt/cray/pe/hdf5/1.12.2.1/bin:/opt/cray/pe/cce/15.0.0/binutils/x86_64/x86_64-pc-linux-gnu/bin:/opt/cray/pe/cce/15.0.0/binutils/cross/x86_64-aarch64/aarch64-linux-gnu/../bin:/opt/cray/pe/cce/15.0.0/utils/x86_64/bin:/opt/cray/pe/cce/15.0.0/bin:/opt/cray/pe/cce/15.0.0/cce-clang/x86_64/bin:/work/y07/shared/utils/core/bolt/0.8/bin:/work/y07/shared/utils/core/bin:/opt/cray/pe/mpich/8.1.23/ofi/crayclang/10.0/bin:/opt/cray/pe/mpich/8.1.23/bin:/opt/cray/pe/craype/2.7.19/bin:/opt/cray/pe/perftools/22.12.0/bin:/opt/cray/pe/papi/6.0.0.17/bin:/opt/cray/libfabric/1.12.1.2.2.0.0/bin:/usr/local/bin:/usr/bin:/bin:/usr/lib/mit/bin:/opt/cray/pe/bin:/work/y07/shared/umshared/vn8.4/cce/utils:/work/y07/shared/umshared/bin:/work/y07/shared/umshared/vn8.4/bin:/work/n02/n02/gmann/um/xpsqy/bin:/work/y07/shared/umshared/vn8.4/cce/scripts:/work/y07/shared/umshared/vn8.4/cce/exec
*****************************************************************
*****************************************************************
Job started at : Sat 09 Dec 2023 01:14:44 AM GMT
*****************************************************************
*****************************************************************
Run started from UMUI
cp -x but switch-off increase-nodes-Reconfig hand-edit (same as Ros`s xpcnd)
This job is using UM directory /work/y07/shared/umshared,
-------------------------------
Processing STASHC file for ROSE
-------------------------------
Backup of STASHC file created!
/work/n02/n02/gmann/tmp/tmp.nid003708.174571/xpsqy.stashc_preROSE
-------------------------------
Processing STASHC file complete
-------------------------------
***************************************************************
Starting script : qsatmos
Starting time : Sat 09 Dec 2023 01:14:44 AM GMT
***************************************************************
/work/n02/n02/gmann/um/xpsqy/bin/qsatmos: Executing model run
*********************************************************
UM Executable : /work/n02/n02/gmann/um/xpsqy/bin/xpsqy.exe
*********************************************************
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=5043383.0. Some of your processes may have been killed by the cgroup out-of-memory handler.
srun: error: nid003711: task 189: Out Of Memory
srun: launch/slurm: _step_signal: Terminating StepId=5043383.0
slurmstepd: error: *** STEP 5043383.0 ON nid003708 CANCELLED AT 2023-12-09T01:15:03 ***
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=5043383.0. Some of your processes may have been killed by the cgroup out-of-memory handler.
xpsqy: Run failed
*****************************************************************
Ending script : qsatmos
Completion code : 1
Completion time : Sat 09 Dec 2023 01:15:04 AM GMT
*****************************************************************
/work/n02/n02/gmann/um/xpsqy/bin/qsmaster: Failed in qsatmos in job xpsqy
***************************************************************
Starting script : qsfinal
Starting time : Sat 09 Dec 2023 01:15:04 AM GMT
***************************************************************
Checking requirement for atmosphere resubmit...
/work/n02/n02/gmann/um/xpsqy/bin/qsresubmit: Error: no resubmit details found
*****************************************************************
Ending script : qsfinal
Completion code : 0
Completion time : Sat 09 Dec 2023 01:15:04 AM GMT
*****************************************************************
/work/n02/n02/gmann/um/xpsqy/bin/qsmaster: Failed in qsfinal in job xpsqy
<<<< Information about How Many Lines of Output follow >>>>
9 lines in main OUTPUT file.
0 lines of O/P from pe0.
<<<< Lines of Output Information ends >>>>
==============================================================================
=================================== OUTPUT ===================================
==============================================================================
UMUI Namelist output in /work/n02/n02/gmann/um/xpsqy/xpsqy.umui.nl
DATAW/DATAM file listing in /work/n02/n02/gmann/um/xpsqy/xpsqy.list
STASH output should be in /work/n02/n02/gmann/um/xpsqy/xpsqy.stash
==============================================================================
=============================== UM RUN OUTPUT ================================
==============================================================================
qsatmos: %MODEL% output follows:-
qsatmos: Stack requested for UM job: GB
srun --cpus-per-task=1 --hint=nomultithread --distribution=block:block /work/n02/n02/gmann/um/xpsqy/bin/xpsqy.exe
0+1 records in
0+1 records out
436 bytes copied, 7.745e-05 s, 5.6 MB/s
*****************************************************************
****************************************************************
Job ended at : Sat 09 Dec 2023 01:15:04 AM GMT
****************************************************************
*****************************************************************
I did also try logging out of PUMA2 and doing a manual submit on ARCHER2 of the umuisubmit_run script (initially a “srun umuisubmit_run”, and then “ksh ./umuisubmit_run”, on ARCHER2 from the umui_runs directory).
But that manual submit of the umuisubmit_run script didn’t work
(this used to work fine on ARCHER-1, via a “qsub umuisubmit_run” from the umui_runs dir).
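(For anyone hitting the same thing: the SLURM analogue of PBS’s qsub is sbatch, which reads the #SBATCH headers inside the script; a bare srun on a login node instead requests resources directly and ignores those headers, so it has no partition to use. A sketch with placeholder paths:)

```shell
# ARCHER-1 (PBS):   qsub umuisubmit_run
# ARCHER2 (SLURM):  sbatch umuisubmit_run   # reads the #SBATCH headers
cd ~/umui_runs/<JOBID>-<timestamp>          # placeholder path
sbatch umuisubmit_run
# By contrast, "srun umuisubmit_run" from a login node tries to
# allocate resources immediately, ignoring the batch headers, so
# SLURM rejects it unless a partition is given on the command line.
```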
drwxr-xr-x 2 gmann n02 4096 Dec 4 13:26 xpsqu-338132642/
drwxr-xr-x 2 gmann n02 4096 Dec 4 13:41 xpsqv-338134132/
drwxr-xr-x 2 gmann n02 4096 Dec 7 11:54 xpsqw-341115424/
drwxr-xr-x 2 gmann n02 4096 Dec 7 12:58 xpsqx-341125836/
drwxr-xr-x 2 gmann n02 4096 Dec 9 00:19 xpsqy-343001913/
gmann@ln02:~/umui_runs> cd xpsqy-343001913/
gmann@ln02:~/umui_runs/xpsqy-343001913> ls -lrt
total 388
-rw-r--r-- 1 gmann n02 43 Dec 9 00:19 USR_PATHS_OVRDS
-rw-r--r-- 1 gmann n02 0 Dec 9 00:19 USR_MACH_OVRDS
-rw-r--r-- 1 gmann n02 143 Dec 9 00:19 USR_FILE_OVRDS
-rwxr-xr-x 1 gmann n02 4193 Dec 9 00:19 UMSUBMIT_ARCHER2*
-rwxr-xr-x 1 gmann n02 6373 Dec 9 00:19 UMSUBMIT*
-rw-r--r-- 1 gmann n02 175 Dec 9 00:19 UAFLDS_A
-rw-r--r-- 1 gmann n02 156 Dec 9 00:19 UAFILES_A
-rw-r--r-- 1 gmann n02 53093 Dec 9 00:19 STASHC
-rw-r--r-- 1 gmann n02 2689 Dec 9 00:19 SIZES
-rw-r--r-- 1 gmann n02 8028 Dec 9 00:19 SHARED
-rw-r--r-- 1 gmann n02 8651 Dec 9 00:19 SCRIPT
-rw-r--r-- 1 gmann n02 7723 Dec 9 00:19 RECONA
-rw-r--r-- 1 gmann n02 108830 Dec 9 00:19 PRESM_A
-rw-r--r-- 1 gmann n02 120 Dec 9 00:19 PPCNTL
-rwxr-xr-x 1 gmann n02 2983 Dec 9 00:19 MAIN_SCR*
-rw-r--r-- 1 gmann n02 77 Dec 9 00:19 IOSCNTL
-rw-r--r-- 1 gmann n02 99 Dec 9 00:19 INITHIS
-rw-r--r-- 1 gmann n02 4556 Dec 9 00:19 INITFILEENV
-rw-r--r-- 1 gmann n02 5479 Dec 9 00:19 FCM_UMSCRIPTS_CFG
-rw-r--r-- 1 gmann n02 7265 Dec 9 00:19 FCM_UMRECON_CFG
-rw-r--r-- 1 gmann n02 8006 Dec 9 00:19 FCM_UMATMOS_CFG
-rw-r--r-- 1 gmann n02 3427 Dec 9 00:19 FCM_BLD_COMMAND
-rw-r--r-- 1 gmann n02 5107 Dec 9 00:19 EXT_SCRIPT_LOG
-rwxr-xr-x 1 gmann n02 4194 Dec 9 00:19 EXTR_SCR*
-rwxr-xr-x 1 gmann n02 289 Dec 9 00:19 COMP_SWITCHES*
-rw-r--r-- 1 gmann n02 718 Dec 9 00:19 CNTLGEN
-rw-r--r-- 1 gmann n02 26574 Dec 9 00:19 CNTLATM
-rw-r--r-- 1 gmann n02 7560 Dec 9 00:19 CNTLALL
-rwxr-xr-x 1 gmann n02 13198 Dec 9 00:19 SUBMIT*
-rwxr-xr-x 1 gmann n02 10110 Dec 9 00:19 umuisubmit_run*
-rwxr-xr-x 1 gmann n02 10092 Dec 9 00:19 umuisubmit_rcf*
-rwxr-xr-x 1 gmann n02 4042 Dec 9 00:19 umuisubmit_compile*
-rwxr-xr-x 1 gmann n02 497 Dec 9 00:19 stage_2_submit*
-rwxr-xr-x 1 gmann n02 494 Dec 9 00:19 stage_1_submit*
gmann@ln02:~/umui_runs/xpsqy-343001913> vi umuisubmit_run
gmann@ln02:~/umui_runs/xpsqy-343001913> srun umuisubmit_run
srun: error: Job rejected: Please specify a partition name.
srun: error: Unable to allocate resources: Unspecified error
gmann@ln02:~/umui_runs/xpsqy-343001913> vi stage_2_submit
gmann@ln02:~/umui_runs/xpsqy-343001913> vi umuisubmit_run
gmann@ln02:~/umui_runs/xpsqy-343001913> ksh ./umuisubmit_run
Currently Loaded Modules:
1) craype-x86-rome 7) cray-dsmml/0.2.2 13) load-epcc-module
2) libfabric/1.12.1.2.2.0.0 8) cray-mpich/8.1.23 14) cce/15.0.0
3) craype-network-ofi 9) cray-libsci/22.12.1.1 15) cray-hdf5-parallel/1.12.2.1
4) perftools-base/22.12.0 10) PrgEnv-cray/8.3.3 16) cray-netcdf-hdf5parallel/4.9.0.1
5) xpmem/2.5.2-2.4_3.30__gd0f7936.shasta 11) bolt/0.8 17) um/2023.06
6) craype/2.7.19 12) epcc-setup-env
*****************************************************************
Version 8.4 template, Unified Model , Non-Operational
Created by UMUI version 8.4
*****************************************************************
Host is ln02
PATH used = /opt/cray/pe/netcdf-hdf5parallel/4.9.0.1/bin:/opt/cray/pe/hdf5-parallel/1.12.2.1/bin:/opt/cray/pe/hdf5/1.12.2.1/bin:/opt/cray/pe/cce/15.0.0/binutils/x86_64/x86_64-pc-linux-gnu/bin:/opt/cray/pe/cce/15.0.0/binutils/cross/x86_64-aarch64/aarch64-linux-gnu/../bin:/opt/cray/pe/cce/15.0.0/utils/x86_64/bin:/opt/cray/pe/cce/15.0.0/bin:/opt/cray/pe/cce/15.0.0/cce-clang/x86_64/bin:/work/y07/shared/utils/core/bolt/0.8/bin:/work/y07/shared/utils/core/bin:/opt/cray/pe/mpich/8.1.23/ofi/crayclang/10.0/bin:/opt/cray/pe/mpich/8.1.23/bin:/opt/cray/pe/craype/2.7.19/bin:/opt/cray/pe/perftools/22.12.0/bin:/opt/cray/pe/papi/6.0.0.17/bin:/opt/cray/libfabric/1.12.1.2.2.0.0/bin:/home/n02/n02/gmann/bin:/usr/local/bin:/usr/bin:/bin:/usr/lib/mit/bin:/opt/cray/pe/bin:/work/y07/shared/umshared/software/bin:/work/y07/shared/umshared/bin:/work/y07/shared/utils/core/python/miniconda2/bin:/work/y07/shared/umshared/vn8.4/cce/utils:/work/y07/shared/umshared/bin:/work/y07/shared/umshared/vn8.4/bin:/work/n02/n02/gmann/um/xpsqy/bin:/work/y07/shared/umshared/vn8.4/cce/scripts:/work/y07/shared/umshared/vn8.4/cce/exec
*****************************************************************
*****************************************************************
Job started at : Sat 9 Dec 01:34:14 GMT 2023
*****************************************************************
*****************************************************************
Run started from UMUI
cp -x but switch-off increase-nodes-Reconfig hand-edit (same as Ros`s xpcnd)
This job is using UM directory /work/y07/shared/umshared,
-------------------------------
Processing STASHC file for ROSE
-------------------------------
Backup of STASHC file created!
/work/n02/n02/gmann/tmp/tmp.ln02.235323/xpsqy.stashc_preROSE
-------------------------------
Processing STASHC file complete
-------------------------------
***************************************************************
Starting script : qsatmos
Starting time : Sat 9 Dec 01:34:14 GMT 2023
***************************************************************
/work/n02/n02/gmann/um/xpsqy/bin/qsatmos: Executing model run
*********************************************************
UM Executable : /work/n02/n02/gmann/um/xpsqy/bin/xpsqy.exe
*********************************************************
srun: error: Job rejected: Please specify a partition name.
srun: error: Unable to allocate resources: Unspecified error
xpsqy: Run failed
*****************************************************************
Ending script : qsatmos
Completion code : 1
Completion time : Sat 9 Dec 01:34:16 GMT 2023
*****************************************************************
/work/n02/n02/gmann/um/xpsqy/bin/qsmaster: Failed in qsatmos in job xpsqy
***************************************************************
Starting script : qsfinal
Starting time : Sat 9 Dec 01:34:16 GMT 2023
***************************************************************
Checking requirement for atmosphere resubmit...
/work/n02/n02/gmann/um/xpsqy/bin/qsresubmit: Error: no resubmit details found
*****************************************************************
Ending script : qsfinal
Completion code : 0
Completion time : Sat 9 Dec 01:34:16 GMT 2023
*****************************************************************
/work/n02/n02/gmann/um/xpsqy/bin/qsmaster: Failed in qsfinal in job xpsqy
<<<< Information about How Many Lines of Output follow >>>>
9 lines in main OUTPUT file.
0 lines of O/P from pe0.
<<<< Lines of Output Information ends >>>>
==============================================================================
=================================== OUTPUT ===================================
==============================================================================
UMUI Namelist output in /work/n02/n02/gmann/um/xpsqy/xpsqy.umui.nl
DATAW/DATAM file listing in /work/n02/n02/gmann/um/xpsqy/xpsqy.list
STASH output should be in /work/n02/n02/gmann/um/xpsqy/xpsqy.stash
==============================================================================
=============================== UM RUN OUTPUT ================================
==============================================================================
qsatmos: %MODEL% output follows:-
qsatmos: Stack requested for UM job: GB
srun --cpus-per-task=1 --hint=nomultithread --distribution=block:block /work/n02/n02/gmann/um/xpsqy/bin/xpsqy.exe
0+1 records in
0+1 records out
436 bytes copied, 7.1128e-05 s, 6.1 MB/s
*****************************************************************
****************************************************************
Job ended at : Sat 9 Dec 01:34:16 GMT 2023
****************************************************************
*****************************************************************
i.e. I just did ./UMSUBMIT_ARCHER2 from my ~/umui_jobs/${JOBID}/ directory
Not sure if something could be different in my Linux environment set-up, but I followed the steps on the NCAS-CMS webpage for the transition: https://cms.ncas.ac.uk/puma2/
Thanks in advance for your help figuring out what’s different here in my set-up.
Yes just ./UMSUBMIT_ARCHER2 from inside the ~/umui_jobs/JOBID directory.
I’ve also tried running with your ARCHER2 .bash_profile and your model executable and it still works fine for me, so at this precise moment I don’t know what else to suggest.
That’s really strange then - so you can set the job running OK from the xpsqy model executable,
but for me the model run fails straight away with an “out of memory” error message.
I tried again, submitting the xpsq-y job (after deleting the um_extracts and /work/n02/n02/gmann/um/xpsqy directory for a clean extract & run), but this gave exactly the same behaviour – it compiled OK and reconfigured OK, but then crashed immediately with the out-of-memory error.
I’ve done a screen-shot and uploaded the PNG below to demonstrate the error message I’m getting.
Please can you check anything you can think of here that might be causing this problem.
I can’t see why this would make any difference, but should I try setting up my environment again for PUMA2 submission to ARCHER2?
Or I could also try submitting directly from the model executable in the same way you have.
Can you send me instructions for how to try that, submitting from the model executable?
(do I copy the job to another UMUI job, and set it to run from the other executable?)
Your PUMA2 environment won’t affect the running of an executable on ARCHER2.
I didn’t do anything special to run from your executable. I just replaced my executable with yours and resubmitted the model step with:
archer2$ cd ~/umui_runs/xpcnd-<blah>
archer2$ sbatch umuisubmit_run
I’ve also copied your .bash_profile and .bashrc on ARCHER2 and it still runs ok. The only small error I see in the .bash_profile is that you should now have
. /work/y07/shared/umshared/bin/rose-um-env-puma2
rather than
. /work/y07/shared/umshared/bin/rose-um-env
I’ll retry overnight, just in case I made a mistake.
I tried doing the sbatch umuisubmit_run on ARCHER2 from the ~/umui_runs/xpsqy- directory.
And that seems to work fine (submitting from the umui_runs directory).
So that seems to be the solution (for me at least): the ./UMSUBMIT_ARCHER2 on PUMA2 only completes the compile and reconfiguration steps, and the final stage, submitting the model executable, needs to be done from ARCHER2 via sbatch umuisubmit_run from the umui_runs directory.
(The original submission from PUMA2 gives an immediate Out Of Memory error.)
So this is OK - I’m happy to proceed now I know I’ll need to do it that way, via the two-stage method.
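To summarise the two-stage method for the record (paths illustrative; <JOBID> and the timestamp suffix vary per run):

```shell
# Stage 1 - on PUMA2: extract, compile and reconfigure
cd ~/umui_jobs/<JOBID>
./UMSUBMIT_ARCHER2

# Stage 2 - on ARCHER2: submit the model run itself
cd ~/umui_runs/<JOBID>-<timestamp>
sbatch umuisubmit_run
```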
Thanks so much for your help with this,
Cheers,
Graham
-rw-r--r-- 1 gmann n02 279274 Dec 19 00:21 xpsqy.fort6.pe115
-rw-r--r-- 1 gmann n02 279273 Dec 19 00:21 xpsqy.fort6.pe114
-rw-r--r-- 1 gmann n02 279225 Dec 19 00:21 xpsqy.fort6.pe113
-rw-r--r-- 1 gmann n02 279226 Dec 19 00:21 xpsqy.fort6.pe112
-rw-r--r-- 1 gmann n02 279275 Dec 19 00:21 xpsqy.fort6.pe111
-rw-r--r-- 1 gmann n02 279267 Dec 19 00:21 xpsqy.fort6.pe110
-rw-r--r-- 1 gmann n02 279205 Dec 19 00:21 xpsqy.fort6.pe11
-rw-r--r-- 1 gmann n02 279168 Dec 19 00:21 xpsqy.fort6.pe109
-rw-r--r-- 1 gmann n02 278986 Dec 19 00:21 xpsqy.fort6.pe108
-rw-r--r-- 1 gmann n02 279226 Dec 19 00:21 xpsqy.fort6.pe107
-rw-r--r-- 1 gmann n02 279235 Dec 19 00:21 xpsqy.fort6.pe106
-rw-r--r-- 1 gmann n02 279276 Dec 19 00:21 xpsqy.fort6.pe105
-rw-r--r-- 1 gmann n02 279265 Dec 19 00:21 xpsqy.fort6.pe104
-rw-r--r-- 1 gmann n02 279225 Dec 19 00:21 xpsqy.fort6.pe103
-rw-r--r-- 1 gmann n02 279272 Dec 19 00:21 xpsqy.fort6.pe102
-rw-r--r-- 1 gmann n02 279173 Dec 19 00:21 xpsqy.fort6.pe101
-rw-r--r-- 1 gmann n02 279126 Dec 19 00:21 xpsqy.fort6.pe100
-rw-r--r-- 1 gmann n02 279203 Dec 19 00:21 xpsqy.fort6.pe10
-rw-r--r-- 1 gmann n02 278990 Dec 19 00:21 xpsqy.fort6.pe1
-rw-r--r-- 1 gmann n02 308656 Dec 19 00:21 xpsqy.fort6.pe0
gmann@ln03:/work/n02/n02/gmann/um/xpsqy/pe_output> grep 'Atm_Step' xpsqy.fort6.pe0
Atm_Step: Timestep 1 Model time: 1990-12-01 00:20:00
Atm_Step: L_USE_CARIOLLE = F
Atm_Step: Cariolle scheme not called
Atm_Step: Timestep 2 Model time: 1990-12-01 00:40:00
Atm_Step: Timestep 3 Model time: 1990-12-01 01:00:00
Atm_Step: Timestep 4 Model time: 1990-12-01 01:20:00
Atm_Step: Timestep 5 Model time: 1990-12-01 01:40:00
Atm_Step: Timestep 6 Model time: 1990-12-01 02:00:00
Atm_Step: Timestep 7 Model time: 1990-12-01 02:20:00
Atm_Step: Timestep 8 Model time: 1990-12-01 02:40:00
Atm_Step: Timestep 9 Model time: 1990-12-01 03:00:00
Atm_Step: Timestep 10 Model time: 1990-12-01 03:20:00
Atm_Step: Timestep 11 Model time: 1990-12-01 03:40:00
Atm_Step: Timestep 12 Model time: 1990-12-01 04:00:00
Atm_Step: Timestep 13 Model time: 1990-12-01 04:20:00
Atm_Step: Timestep 14 Model time: 1990-12-01 04:40:00
Atm_Step: Timestep 15 Model time: 1990-12-01 05:00:00
Atm_Step: Timestep 16 Model time: 1990-12-01 05:20:00
Atm_Step: Timestep 17 Model time: 1990-12-01 05:40:00
Atm_Step: Timestep 18 Model time: 1990-12-01 06:00:00
Atm_Step: Timestep 19 Model time: 1990-12-01 06:20:00
Atm_Step: Timestep 20 Model time: 1990-12-01 06:40:00
Atm_Step: Timestep 21 Model time: 1990-12-01 07:00:00
Atm_Step: Timestep 22 Model time: 1990-12-01 07:20:00
Atm_Step: Timestep 23 Model time: 1990-12-01 07:40:00
Atm_Step: Timestep 24 Model time: 1990-12-01 08:00:00
Atm_Step: Timestep 25 Model time: 1990-12-01 08:20:00
Atm_Step: Timestep 26 Model time: 1990-12-01 08:40:00