Cancelled due to time limit ***

I am getting the following error when trying to run u-cy520
(it fails at the coupled task).

  • A previous post suggests going into site/archer2.rc, but there seem to be a number of possible lines with ‘execution time limit’?

Many thanks,
Jeremy

???
??? WARNING ???
? Warning code: -1
? Warning from routine: ASAD_CINIT
? Warning message: NRSTEPS IS OUT OF RANGE, RESETTING
? Warning from processor: 0
? Warning number: 68
???

slurmstepd: error: *** STEP 4097245.0+0 ON nid002216 CANCELLED AT 2023-08-01T00:30:49 DUE TO TIME LIMIT ***
slurmstepd: error: *** JOB 4097245 ON nid002216 CANCELLED AT 2023-08-01T00:30:49 DUE TO TIME LIMIT ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 4097245.0+2 ON nid004049 CANCELLED AT 2023-08-01T00:30:49 DUE TO TIME LIMIT ***
slurmstepd: error: *** STEP 4097245.0+1 ON nid004034 CANCELLED AT 2023-08-01T00:30:49 DUE TO TIME LIMIT ***
srun: error: nid003341: tasks 704-711,713-746,748-767: Terminated
srun: launch/slurm: _step_signal: Terminating StepId=4097245.0+0
2023-07-31T23:30:50Z CRITICAL - failed/EXIT

Hi Jeremy,

In the rose edit GUI, go to suite conf → run initialisation and cycling, and increase the wallclock time.

Regards
Ros.

Thanks Ros,

I changed the wallclock time to 4 hr 30 min. I am now getting a failure at UMbuild > fcm_make2_um.

Many thanks for your help,

activity log:
[jobs-submit ret_code] 0
[jobs-submit out] 2023-08-01T09:18:28Z|19501001T0000Z/fcm_make2_um/01|0|4102177
(jgrist02@login.archer2.ac.uk) 2023-08-01T09:18:28Z [STDOUT] Submitted batch job 4102177
[jobs-poll ret_code] 0
[jobs-poll out] 2023-08-01T09:23:30Z|19501001T0000Z/fcm_make2_um/01|{"batch_sys_name": "slurm", "batch_sys_job_id": "4102177", "batch_sys_exit_polled": 0, "time_submit_exit": "2023-08-01T09:18:28Z", "time_run": "2023-08-01T09:18:42Z"}
[jobs-poll ret_code] 0
[jobs-poll out] 2023-08-01T09:28:33Z|19501001T0000Z/fcm_make2_um/01|{"batch_sys_name": "slurm", "batch_sys_job_id": "4102177", "batch_sys_exit_polled": 0, "time_submit_exit": "2023-08-01T09:18:28Z", "time_run": "2023-08-01T09:18:42Z"}
[jobs-poll ret_code] 0
[jobs-poll out] 2023-08-01T09:33:33Z|19501001T0000Z/fcm_make2_um/01|{"batch_sys_name": "slurm", "batch_sys_job_id": "4102177", "batch_sys_exit_polled": 0, "time_submit_exit": "2023-08-01T09:18:28Z", "time_run": "2023-08-01T09:18:42Z"}
[jobs-poll ret_code] 0
[jobs-poll out] 2023-08-01T09:38:35Z|19501001T0000Z/fcm_make2_um/01|{"batch_sys_name": "slurm", "batch_sys_job_id": "4102177", "batch_sys_exit_polled": 0, "time_submit_exit": "2023-08-01T09:18:28Z", "time_run": "2023-08-01T09:18:42Z"}
[jobs-poll ret_code] 0
[jobs-poll out] 2023-08-01T09:43:35Z|19501001T0000Z/fcm_make2_um/01|{"batch_sys_name": "slurm", "batch_sys_job_id": "4102177", "batch_sys_exit_polled": 0, "time_submit_exit": "2023-08-01T09:18:28Z", "time_run": "2023-08-01T09:18:42Z"}
[jobs-poll ret_code] 0
[jobs-poll out] 2023-08-01T09:48:36Z|19501001T0000Z/fcm_make2_um/01|{"batch_sys_name": "slurm", "batch_sys_job_id": "4102177", "batch_sys_exit_polled": 0, "time_submit_exit": "2023-08-01T09:18:28Z", "time_run": "2023-08-01T09:18:42Z"}
[jobs-poll ret_code] 0
[jobs-poll out] 2023-08-01T09:53:36Z|19501001T0000Z/fcm_make2_um/01|{"batch_sys_name": "slurm", "batch_sys_job_id": "4102177", "run_status": 1, "run_signal": "EXIT", "time_submit_exit": "2023-08-01T09:18:28Z", "time_run": "2023-08-01T09:18:42Z", "time_run_exit": "2023-08-01T09:49:32Z"}
[(('job-logs-retrieve', 'failed'), 1) ret_code] 0

Hi Jeremy,

If you look in the job.err file for fcm_make2_um, you’ll see that the compile has now failed due to the time limit.

In the site/archer2.rc file, in the [[HPC_SERIAL]] section, increase the execution time limit and try again.

Just to note for the future: when you increased the time limit for the coupled task, you didn’t need to rerun the compile tasks. Just do a reload and retrigger the coupled task.

Cheers,
Ros.
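
To confirm that a task was killed by the scheduler rather than by a model error, the Slurm cancellation lines in job.err can be picked out programmatically. The following is a minimal Python sketch, assuming the message format shown in the log earlier in this thread; the name `find_time_limit_kills` is hypothetical and not part of Rose, Cylc, or Slurm.

```python
import re

# Matches Slurm lines such as:
# slurmstepd: error: *** JOB 4097245 ON nid002216 CANCELLED AT 2023-08-01T00:30:49 DUE TO TIME LIMIT ***
TIME_LIMIT_RE = re.compile(
    r"\*\*\* (JOB|STEP) (\S+) ON (\S+) CANCELLED AT (\S+) DUE TO TIME LIMIT \*\*\*"
)


def find_time_limit_kills(text):
    """Return (kind, id, node, timestamp) tuples, one per time-limit cancellation."""
    return [match.groups() for match in TIME_LIMIT_RE.finditer(text)]


# Sample lines copied from the job.err output quoted in this thread.
sample = """\
slurmstepd: error: *** STEP 4097245.0+0 ON nid002216 CANCELLED AT 2023-08-01T00:30:49 DUE TO TIME LIMIT ***
slurmstepd: error: *** JOB 4097245 ON nid002216 CANCELLED AT 2023-08-01T00:30:49 DUE TO TIME LIMIT ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
"""

kills = find_time_limit_kills(sample)
for kind, ident, node, when in kills:
    print(f"{kind} {ident} on {node} hit the time limit at {when}")
```

In practice the text would be read from the task's job.err file instead of the inline sample. An empty result suggests the failure has some other cause.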

Hi Ros,

Thanks for your help,
I increased the time from 30M to 60M but now get the following error:

Jeremy

[FAIL] /work/n02/n02/jgrist02/cylc-run/u-cy520/share/fcm_make_um/fcm-make2.lock: lock exists at the destination

[FAIL] fcm make -C /work/n02/n02/jgrist02/cylc-run/u-cy520/share/fcm_make_um -n 2 -j 6 # return-code=1

2023-08-01T13:37:10Z CRITICAL - failed/EXIT

Delete /work/n02/n02/jgrist02/cylc-run/u-cy520/share/fcm_make_um/fcm-make2.lock (it is a directory) and retrigger.

Grenville
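
The stale lock can simply be removed from the shell; for anyone scripting the clean-up, here is a minimal Python sketch of the same step. The helper name `remove_stale_lock` is hypothetical, and the sketch assumes no fcm make process is still running against that directory. The demonstration uses a temporary directory rather than the real suite path.

```python
import shutil
import tempfile
from pathlib import Path


def remove_stale_lock(lock_path):
    """Delete an fcm-make lock directory if it exists; return True if removed."""
    lock = Path(lock_path)
    if lock.is_dir():
        # The lock is a directory, so it has to be removed recursively.
        shutil.rmtree(lock)
        return True
    return False


# Demonstration against a throwaway directory that mimics the suite layout.
with tempfile.TemporaryDirectory() as share:
    lock = Path(share) / "fcm_make_um" / "fcm-make2.lock"
    lock.mkdir(parents=True)
    removed_first = remove_stale_lock(lock)   # lock present, so it is removed
    removed_again = remove_stale_lock(lock)   # nothing left to remove
    print(removed_first, removed_again)
```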

Hi Grenville,

It now gets through to ‘Coupled’, where it sits ‘running’ for some time before failing.

It seems to be a time-out error again.

I have the wallclock time set to 4 h 30 min and the execution time limit in the [[HPC_SERIAL]] section of site/archer2.rc set to PT60M.

Should I increase further? Many thanks for your help, Jeremy

Snippet below from /work/n02/n02/jgrist02/cylc-run/u-cy520/log/job/19501001T0000Z/coupled/01/job.err

??? WARNING ???
? Warning code: -1
? Warning from routine: ASAD_CINIT
? Warning message: NRSTEPS IS OUT OF RANGE, RESETTING
? Warning from processor: 0
? Warning number: 68
???

slurmstepd: error: *** STEP 4104686.0+0 ON nid005643 CANCELLED AT 2023-08-01T21:56:57 DUE TO TIME LIMIT ***
slurmstepd: error: *** JOB 4104686 ON nid005643 CANCELLED AT 2023-08-01T21:56:57 DUE TO TIME LIMIT ***
slurmstepd: error: *** STEP 4104686.0+1 ON nid005689 CANCELLED AT 2023-08-01T21:56:57 DUE TO TIME LIMIT ***
slurmstepd: error: *** STEP 4104686.0+2 ON nid005721 CANCELLED AT 2023-08-01T21:56:57 DUE TO TIME LIMIT ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: nid005672: tasks 192-200,202-228,230-255: Terminated
srun: launch/slurm: _step_signal: Terminating StepId=4104686.0+0
srun: launch/slurm: _step_signal: Terminating StepId=4104686.0+1

Hi Jeremy

ROSE_LAUNCHER_PREOPTS_NEMO is not quite right - it needs the --hint=nomultithread --distribution=block:block entries (as in ROSE_LAUNCHER_PREOPTS_XIOS)

See “Updating a UM suite after the ARCHER2 O/S upgrade”, Coupled Suites, item 3.

ROSE_LAUNCHER_PREOPTS_NEMO="--het-group=1 --nodes=9 --ntasks=1152 --tasks-per-node=128 --cpus-per-task=1 --cpu-bind=cores --export=all,OMP_NUM_THREADS=1,HYPERTHREADS=1"
ROSE_LAUNCHER_PREOPTS_XIOS="--het-group=2 --nodes=5 --ntasks=20 --tasks-per-node=4 --cpus-per-task=1 --cpu-bind=cores  --hint=nomultithread --distribution=block:block --export=all,OMP_NUM_THREADS=1,HYPERTHREADS=1"

Grenville
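
The difference between the two launcher settings is easy to miss by eye. As a minimal Python sketch (the helper `option_names` is hypothetical), the option names in each string quoted above can be compared as sets, ignoring the values:

```python
def option_names(preopts):
    """Return the set of option names (the text before '=') in an srun option string."""
    return {token.split("=", 1)[0] for token in preopts.split()}


# Values copied from the ROSE_LAUNCHER_PREOPTS_* settings quoted in this post.
nemo = (
    "--het-group=1 --nodes=9 --ntasks=1152 --tasks-per-node=128 "
    "--cpus-per-task=1 --cpu-bind=cores "
    "--export=all,OMP_NUM_THREADS=1,HYPERTHREADS=1"
)
xios = (
    "--het-group=2 --nodes=5 --ntasks=20 --tasks-per-node=4 "
    "--cpus-per-task=1 --cpu-bind=cores "
    "--hint=nomultithread --distribution=block:block "
    "--export=all,OMP_NUM_THREADS=1,HYPERTHREADS=1"
)

# Options present for XIOS but absent for NEMO.
missing = sorted(option_names(xios) - option_names(nemo))
print(missing)
```

This only compares option names, so differing values such as --nodes or --ntasks are deliberately not reported.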

Hi -

Thank you for your help,

I have corrected the archer2.rc file as suggested, but still get this error.
/work/n02/n02/jgrist02/cylc-run/u-cy520/log/job/19501001T0000Z/coupled/01

???
??? WARNING ???
? Warning code: -1
? Warning from routine: ASAD_CINIT
? Warning message: NRSTEPS IS OUT OF RANGE, RESETTING
? Warning from processor: 0
? Warning number: 68
???

slurmstepd: error: *** STEP 4120259.0+0 ON nid002201 CANCELLED AT 2023-08-03T19:15:20 DUE TO TIME LIMIT ***
slurmstepd: error: *** JOB 4120259 ON nid002201 CANCELLED AT 2023-08-03T19:15:20 DUE TO TIME LIMIT ***
slurmstepd: error: *** STEP 4120259.0+2 ON nid004612 CANCELLED AT 2023-08-03T19:15:20 DUE TO TIME LIMIT ***
slurmstepd: error: *** STEP 4120259.0+1 ON nid004603 CANCELLED AT 2023-08-03T19:15:20 DUE TO TIME LIMIT ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: nid004615: task 15: Terminated
srun: launch/slurm: _step_signal: Terminating StepId=4120259.0+0
srun: launch/slurm: _step_signal: Terminating StepId=4120259.0+1

As an alternative, I have tried running
u-cy010 (GC5 N216 ORCA025),
as that has been updated and tested. However, this fails immediately at fcm_make_ocean and fcm_make_um.

jgrist02@ln01:/work/n02/n02/jgrist02/cylc-run/u-cy621/log/job/19780901T0000Z/fcm_make2_pp/01> vi job.err

Use of uninitialized value in concatenation (.) or string at /mnt/lustre/a2fs-work1/work/y07/shared/umshared/software/fcm-2019.09.0/bin/…/lib/FCM/Util.pm line 281.

Jeremy

Jeremy

Our CANARI suites have
--het-group=2 --nodes=7 --ntasks=28 --tasks-per-node=4 --cpus-per-task=32 --hint=nomultithread --distribution=block:block --export=all,OMP_NUM_THREADS=1,HYPERTHREADS=1 ./xios.exe

your suite has
--het-group=2 --nodes=5 --ntasks=20 --tasks-per-node=4 --cpus-per-task=1 --cpu-bind=cores --hint=nomultithread --distribution=block:block --export=all,OMP_NUM_THREADS=1,HYPERTHREADS=1 ./xios.exe

Try setting --cpus-per-task=32 in ROSE_LAUNCHER_PREOPTS_XIOS

I note too that we use CONFIG_MODULE_NAME = GC3-PrgEnv/v3, but I’m not sure if this would account for a slowdown.

Grenville
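
The two XIOS layouts can be compared by how many cores each asks for on a node. This Python sketch is only arithmetic on the numbers quoted above. The figure of 128 physical cores per ARCHER2 compute node is general knowledge about the machine, not something stated in this thread, and the helper name is hypothetical. The sketch does not by itself establish that the layout is the cause of the slowdown.

```python
# Assumption: a standard ARCHER2 compute node has 128 physical cores
# (not stated in the thread).
CORES_PER_NODE = 128


def cores_requested_per_node(tasks_per_node, cpus_per_task):
    """Cores reserved on one node for the given srun task layout."""
    return tasks_per_node * cpus_per_task


# CANARI suites: --tasks-per-node=4 --cpus-per-task=32
canari = cores_requested_per_node(tasks_per_node=4, cpus_per_task=32)
# u-cy520 as posted: --tasks-per-node=4 --cpus-per-task=1
suite = cores_requested_per_node(tasks_per_node=4, cpus_per_task=1)

print(f"CANARI XIOS:  {canari} of {CORES_PER_NODE} cores per node")
print(f"u-cy520 XIOS: {suite} of {CORES_PER_NODE} cores per node")
```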

Many thanks,

I made the change:
ROSE_LAUNCHER_PREOPTS_XIOS = --het-group=2 --nodes={{XIOS_NODES}} --ntasks={{XIOS_TASKS}} --tasks-per-node=4 --cpus-per-task=32 --cpu-bind=cores --hint=nomultithread --distribution=block:block --export=all,OMP_NUM_THREADS=1,HYPERTHREADS=1
{% endif %}

but I still get a similar error:
/work/n02/n02/jgrist02/cylc-run/u-cy520/log/job/19501001T0000Z/coupled/01/job.err

??? WARNING ???
? Warning code: -1
? Warning from routine: ASAD_CINIT
? Warning message: NRSTEPS IS OUT OF RANGE, RESETTING
? Warning from processor: 0
? Warning number: 68
???

slurmstepd: error: *** STEP 4129185.0+0 ON nid003176 CANCELLED AT 2023-08-04T18:40:16 DUE TO TIME LIMIT ***
slurmstepd: error: *** JOB 4129185 ON nid003176 CANCELLED AT 2023-08-04T18:40:16 DUE TO TIME LIMIT ***
slurmstepd: error: *** STEP 4129185.0+1 ON nid003192 CANCELLED AT 2023-08-04T18:40:16 DUE TO TIME LIMIT ***
slurmstepd: error: *** STEP 4129185.0+2 ON nid003240 CANCELLED AT 2023-08-04T18:40:16 DUE TO TIME LIMIT ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: nid003243: tasks 14-15: Terminated
srun: launch/slurm: _step_signal: Terminating StepId=4129185.0+0

I will try setting CONFIG_MODULE_NAME = GC3-PrgEnv/v3 in suite conf / machine options to see if that helps.

Jeremy

Hi,
I tried setting CONFIG_MODULE_NAME = GC3-PrgEnv/v3 in suite conf / machine options,
but I am still getting a similar time limit error:

from /work/n02/n02/jgrist02/cylc-run/u-cy520/log/job/19501001T0000Z/coupled/01/job.err
? Warning code: -1
? Warning from routine: ASAD_CINIT
? Warning message: NRSTEPS IS OUT OF RANGE, RESETTING
? Warning from processor: 0
? Warning number: 68
???

slurmstepd: error: *** STEP 4149716.0+0 ON nid001257 CANCELLED AT 2023-08-06T17:47:00 DUE TO TIME LIMIT ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** JOB 4149716 ON nid001257 CANCELLED AT 2023-08-06T17:47:00 DUE TO TIME LIMIT ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 4149716.0+1 ON nid001284 CANCELLED AT 2023-08-06T17:47:00 DUE TO TIME LIMIT ***
slurmstepd: error: *** STEP 4149716.0+2 ON nid001299 CANCELLED AT 2023-08-06T17:47:00 DUE TO TIME LIMIT ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: nid001307: task 15: Terminated
srun: launch/slurm: _step_signal: Terminating StepId=4149716.0+0

thanks,
Jeremy

Jeremy

See /home/grenville/u-cy520/site/archer2.rc, where ROSE_LAUNCHER_PREOPTS_XIOS has no --cpu-bind=cores and ROSE_LAUNCHER_PREOPTS_NEMO includes OMP_PLACES=cores.

My copy of your suite with these changes ran at normal speed.

Grenville

Thanks Grenville,
I made these changes and now get a slightly different failure:
/work/n02/n02/jgrist02/cylc-run/u-cy520/log/job/19501001T0000Z/coupled/01/job.err

snippet from file:

???
??? WARNING ???
? Warning code: -1
? Warning from routine: ASAD_CINIT
? Warning message: NRSTEPS IS OUT OF RANGE, RESETTING
? Warning from processor: 0
? Warning number: 68
???

terminate called after throwing an instance of ‘xios::CException’

???
???!!!???!!!???!!!???!!!???!!! ERROR ???!!!???!!!???!!!???!!!???!!!
???
???!!!???!!!???!!!???!!!???!!! ERROR ???!!!???!!!???!!!???!!!???!!!
? Error code: 24
???
???!!!???!!!???!!!???!!!???!!! ERROR ???!!!???!!!???!!!???!!!???!!!
? Error code: 24
? Error from routine: WRITHEAD
? Error message: WRITHEAD: Addressing conflict

? Error code: 24
? Error from routine: WRITHEAD
? Error message: WRITHEAD: Addressing conflict
? Error from processor: 353
? Error from processor: 967
? Error number: 68
???

? Error from routine: WRITHEAD
? Error message: WRITHEAD: Addressing conflict
? Error from processor: 80
? Error number: 68
???

? Error number: 68
???

[967] exceptions: An non-exception application exit occured.
[967] exceptions: whilst in a serial region
[967] exceptions: Task had pid=30363 on host nid004000
[967] exceptions: Program is “/mnt/lustre/a2fs-work2/work/n02/n02/jgrist02/cylc-run/u-cy520/work/19501001T0000Z/coupled/./atmos.exe”
???
???!!!???!!!???!!!???!!!???!!! ERROR ???!!!???!!!???!!!???!!!???!!!
? Error code: 24
? Error from routine: WRITHEAD
? Error message: WRITHEAD: Addressing conflict
? Error from processor: 478
? Error number: 68
???

[80] exceptions: An non-exception application exit occured.
[80] exceptions: whilst in a serial region
[80] exceptions: Task had pid=4016 on host nid001680
[80] exceptions: Program is “/mnt/lustre/a2fs-work2/work/n02/n02/jgrist02/cylc-run/u-cy520/work/19501001T0000Z/coupled/./atmos.exe”

Hi Jeremy

You are finding errors that we have fixed in the CANARI suites. What are you trying to run? Maybe we have a suite already configured to do what you need.

Grenville

Ultimately, I just want to run an ensemble of 6-month-long simulations (the experimental part is changes to the ocean initial conditions).
As long as it is ORCA025-N216, I would hope it would be OK.

If you have something like that which runs, that would be great.

Jeremy

Hi Jeremy

u-cn134 is the latest CANARI suite; however, it is set up to reconfigure and run from a UM 10.6 start file. I believe u-cy520 is set up the same way, but is reconfiguring a UM 11.6 start file. That is the cause of the error you see. The workaround is to not reconfigure /work/n02/n02/jgrist02/cylc-run/reinhard_restarts/cw475a.da19501001_00.

Switch off reconfiguration and set astart = /work/n02/n02/jgrist02/cylc-run/reinhard_restarts/cw475a.da19501001_00

We will investigate why the reconfiguration mishandles 11.6 files.

Grenville

Hi Grenville,

I tried those 2 things:
1) suite conf > Domain Decomposition > Build and run > ‘Run Reconfiguration’ switched to false
2) um > namelist > Model Input and Output > Dumping and Meaning > ‘astart’ set to /work/n02/n02/jgrist02/cylc-run/reinhard_restarts/cw475a.da19501001_00

However, I now get a quicker failure at fcm_make2_um:
/work/n02/n02/jgrist02/cylc-run/u-cy520/log/job/19501001T0000Z/fcm_make2_um/01/job.err

[FAIL] Reading from filehandle failed at /mnt/lustre/a2fs-work1/work/y07/shared/umshared/software/fcm-2019.09.0/bin/…/lib/FCM/Util.pm line 257.
[FAIL] compile+ 0.0 ! HALO_EXCHANGE_MPI_MOD.mod <- um/src/control/mpp/halo_exchange_mpi_mod.F90
[FAIL] Reading from filehandle failed at /mnt/lustre/a2fs-work1/work/y07/shared/umshared/software/fcm-2019.09.0/bin/…/lib/FCM/Util.pm line 257.
[FAIL] compile+ 0.0 ! HALO_EXCHANGE_DDT_MOD.mod <- um/src/control/mpp/halo_exchange_ddt_mod.F90
[FAIL] compile ---- ! halo_exchange.o <- um/src/control/mpp/halo_exchange.F90
[FAIL] ! HALO_EXCHANGE.mod : depends on failed target: halo_exchange.o
[FAIL] ! HALO_EXCHANGE_DDT_MOD.mod: update task failed
[FAIL] ! HALO_EXCHANGE_MPI_MOD.mod: update task failed
[FAIL] ! halo_exchange.o : depends on failed target: HALO_EXCHANGE_DDT_MOD.mod
[FAIL] ! halo_exchange.o : depends on failed target: HALO_EXCHANGE_MPI_MOD.mod

[FAIL] fcm make -C /work/n02/n02/jgrist02/cylc-run/u-cy520/share/fcm_make_um -n 2 -j 6 # return-code=2
2023-08-11T10:29:11Z CRITICAL - failed/EXIT

Jeremy

This error is not related to anything you have done. You can just retrigger, but there is no need to keep rebuilding the model: switch off the UM, Drivers, and Ocean builds.

Grenville

Many thanks.

I have tried that, and also tried running it from the start again, with what seems like a similar error.
/work/n02/n02/jgrist02/cylc-run/u-cy520/log/job/19501001T0000Z/coupled/01/job.err

???
??? WARNING ???
? Warning code: -1
? Warning from routine: ASAD_CINIT
? Warning message: NRSTEPS IS OUT OF RANGE, RESETTING
? Warning from processor: 0
? Warning number: 68
???

terminate called after throwing an instance of ‘xios::CException’
srun: error: nid006753: task 18: Aborted (core dumped)
srun: launch/slurm: _step_signal: Terminating StepId=4237388.0+0
srun: launch/slurm: _step_signal: Terminating StepId=4237388.0+2
srun: launch/slurm: _step_signal: Terminating StepId=4237388.0+1
slurmstepd: error: *** STEP 4237388.0+2 ON nid006680 CANCELLED AT 2023-08-14T13:06:15 ***
slurmstepd: error: *** STEP 4237388.0+0 ON nid002421 CANCELLED AT 2023-08-14T13:06:15 ***
slurmstepd: error: *** STEP 4237388.0+1 ON nid005759 CANCELLED AT 2023-08-14T13:06:15 ***
srun: error: nid006685: task 15: Terminated
srun: error: nid006680: task 1: Terminated
srun: error: nid006753: task 17: Terminated
srun: error: nid006680: tasks 2-3: Terminated
srun: error: nid006683: tasks 10-11: Terminated