Hello CMS
Dr Andrew Orr (@anmcr) and I have been doing high-resolution production runs (1999 to present day) over the Alps and the Himalayas for the past four months or so. I have been using five suites repeatedly to submit jobs. ARCHER2 was quite efficient last year just before Christmas, with each CRUN (high-res forecast) taking about 15-20 minutes to run; even under a high ARCHER2 load, my jobs never timed out. In the past three weeks or so, however, my jobs have been timing out - a CRUN sometimes takes 3 h and then eventually fails on the time limit. Do you think this is down to slowness in creating output files, or just overall slowness on ARCHER2? Please advise, as we are already quite tight on time and on our available CUs. My suites are u-dj760, u-dj762, u-dk430, u-dm288.
Best Wishes
Sid
Sid
Which tasks in particular are running slowly?
Grenville
Alps_km1p5_RAL3_um_fcst_000 up to 005 and
Him_km1p5_RAL3_um_fcst_000 up to 005 (so 12 CRUNs across the two domains per day)
Hi Grenville. Just looked at the status of the runs after the ARCHER2 crash:
This is what is expected! Any reason why ARCHER2 sometimes takes 3 h to finish the same job, with the same number of allotted nodes?
Best Wishes
Sid
Sid
I can only guess at reasons for the variability in performance - I assume it is contention in the system, either on the network or in the I/O system or both. If you supply the ARCHER2 helpdesk with the Slurm job IDs of quick and slow runs, they may be able to provide more information.
(Please note: screenshots of the cylc GUI are not very helpful - the suite ID doesn't appear, for example; it's more helpful to supply the address of a job.out or job.err file.)
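The Slurm job IDs Grenville asks for can be pulled from Slurm's accounting database with sacct. A hedged sketch (the job IDs below are placeholders, and this assumes the jobs fall within ARCHER2's accounting retention window):

```shell
# Compare a quick and a slow CRUN by elapsed wallclock time.
# Replace the job IDs with your own from job.out / cylc logs.
sacct -j 1234567,1234999 \
      --format=JobID,JobName%30,Elapsed,State,NodeList

# List all of your jobs from the last week with submit and run times,
# which helps pin down when the slowdown started.
sacct -u $USER --starttime=now-7days \
      --format=JobID,JobName%30,Submit,Elapsed,State
```

Sending the helpdesk the output of the first command for one fast and one slow instance of the same task gives them everything they need to look up the runs.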
Grenville
I’ve been noticing this too, and have had to add loads of extra wallclock time to simple recon tasks. If it helps, my jobs look as if ARCHER2 is having I/O issues, which makes sense given the recent crashes; longer CRUNs that are not dumping restarts show a much smaller performance drop.
Sid, Helen
We have been running suites that use the NVMe file system (see Data management and transfer - ARCHER2 User Documentation) and have found that it gives consistent performance. The (minor) drawback is that NVMe is a scratch file system with an imposed deletion policy for unused files. The simplest way to use NVMe is to add the following two lines at the top of the rose-suite.conf file:
root-dir{share}=ln*=/mnt/lustre/a2fs-nvme/work/n02/n02/$USER
root-dir{work}=ln*=/mnt/lustre/a2fs-nvme/work/n02/n02/$USER
This puts work and share on NVMe.
I would not recommend doing this on a suite that is already running - maybe worth a try for a new suite?
Grenville
Hi Grenville. Many thanks. Shall I add these two lines right at the top?
root-dir{share}=ln*=/mnt/lustre/a2fs-nvme/work/n02/n02/aurocumulus
root-dir{work}=ln*=/mnt/lustre/a2fs-nvme/work/n02/n02/aurocumulus
[jinja2:suite.rc]
!!!rg02_centre=50.7,-3.5
Is this all right? I am adding this to suite u-dj614, which is not running but which I intend to run. Best Wishes
Sid
Sid
Yes, that looks OK. I hope this helps. Remember that files will be deleted from NVMe if not accessed for 28 days - that includes executables etc.
Grenville