Hello CMS
Dr Andrew Orr (@anmcr) and I have been doing high-resolution production runs (1999 to present day) over the Alps and the Himalayas for the past four months or so. I have been using five suites repeatedly to submit jobs. ARCHER2 was quite efficient last year just before Christmas, with each CRUN (high-res forecast) taking about 15-20 minutes to run; even under a high ARCHER2 load, my jobs never timed out. In the past three weeks or so, however, my jobs have been timing out - a CRUN sometimes takes 3 h and then eventually fails on the time limit. Do you think this is down to slowness in creating output files, or just overall slowness on ARCHER2? Please advise, as we are already quite tight on time and on our available CUs. My suites are u-dj760, u-dj762, u-dk430, u-dm288.
Best Wishes
Sid
Sid
Which tasks in particular are running slowly?
Grenville
Alps_km1p5_RAL3_um_fcst_000 up to 005 and
Him_km1p5_RAL3_um_fcst_000 up to 005 (so 12 CRUNs across the two domains per day)
Hi Grenville. Just looked at the status of the runs after the ARCHER2 crash:
This is what is expected! Any reason why ARCHER2 sometimes takes 3 h to finish the same job, with the same number of allotted nodes?
Best Wishes
Sid
Sid
I can only guess at reasons for the variability in performance - I assume it is contention in the system, either on the network or in the I/O system or both. If you supply the ARCHER2 helpdesk with the Slurm job IDs of quick and slow runs, they may be able to provide more information.
(Please note: screenshots of the cylc GUI are not very helpful - the suite ID doesn't appear, for example; it's more helpful to supply the address of a job.out or job.err file.)
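The Slurm job IDs Grenville asks for can be pulled from Slurm's accounting database with sacct. A hedged sketch (the job IDs below are placeholders, and this assumes the jobs fall within ARCHER2's accounting retention window):

```shell
# Compare a quick and a slow CRUN by elapsed wallclock time.
# Replace the job IDs with your own from job.out / cylc logs.
sacct -j 1234567,1234999 \
      --format=JobID,JobName%30,Elapsed,State,NodeList

# List all of your jobs from the last week with submit and run times,
# which helps pin down when the slowdown started.
sacct -u $USER --starttime=now-7days \
      --format=JobID,JobName%30,Submit,Elapsed,State
```

Sending the helpdesk the output of the first command for one fast and one slow instance of the same task gives them everything they need to look up the runs.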
Grenville
I’ve been noticing this too, and have had to add loads of extra wallclock time to simple recon tasks. If it helps, my jobs look as if ARCHER2 is having I/O issues, which makes sense given the recent crashes; longer CRUNs that are not dumping restarts show a much smaller performance drop.
Sid, Helen
We have been running suites that use the NVMe file system (see Data management and transfer - ARCHER2 User Documentation) and have found that it gives consistent performance. The (minor) drawback is that NVMe is a scratch file system with an imposed deletion policy for unused files. The simplest way to use NVMe is to add the following two lines at the top of the rose-suite.conf file:
root-dir{share}=ln*=/mnt/lustre/a2fs-nvme/work/n02/n02/$USER
root-dir{work}=ln*=/mnt/lustre/a2fs-nvme/work/n02/n02/$USER
This puts work and share on NVMe.
I would not recommend doing this on a suite that is already running - maybe worth a try for a new suite?
Grenville
Hi Grenville. Many thanks. Shall I add these two lines right at the top?
root-dir{share}=ln*=/mnt/lustre/a2fs-nvme/work/n02/n02/aurocumulus
root-dir{work}=ln*=/mnt/lustre/a2fs-nvme/work/n02/n02/aurocumulus
[jinja2:suite.rc]
!!!rg02_centre=50.7,-3.5
Is this all right? I am adding this to suite u-dj614, which is not running but which I intend to run. Best Wishes
Sid
Sid
Yes, that looks OK. I hope this helps. Remember that files will be deleted from NVMe if not accessed for 28 days - that includes executables etc.
Grenville