Dear Ros,
I was trying to make u-ch427 work, but most of your answers are about u-bo026, so I went back to working on that. I only need one of them to work.
First of all, the --opt-conf-key option works with n216e and n512e, but not with n96e.
(u-bo026-n96-ens3)
```
-bash-4.1$ rose suite-run --opt-conf-key=n96e --new
[FAIL] Bad optional configuration key(s): n96e
```
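As far as I understand it, a key is only valid if a matching optional config file exists in the suite's `opt/` directory, so this is worth checking first:

```
# rose looks for opt/rose-suite-<key>.conf; if there is no rose-suite-n96e.conf
# here, the n96e key is rejected with "Bad optional configuration key(s)"
ls ~/roses/u-bo026-n96-ens3/opt/
```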
(after running without the flag)

```
Error message: Too many processors in the North-South direction ( 56) to support the extended halo size ( 5). Try running with 28 processors.
? Error from processor: 0
```
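If I am reading that error right, each processor's strip of rows has to be at least as deep as the extended halo. A quick check, assuming N96 has the 145 north-south rows given in the resolution table further down:

```
awk 'BEGIN { rows=145; halo=5
             printf "56 procs -> %.2f rows each (need >= %d)\n", rows/56, halo
             printf "28 procs -> %.2f rows each (need >= %d)\n", rows/28, halo }'
# 56 procs -> 2.59 rows each (fails); 28 procs -> 5.18 rows each (OK)
```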
I have now changed to the processor counts from u-ch427; let's see if I get away with it.
u-ch427-ens9 is working fine with N96 and N216; I'm still testing N512.
Now, replying directly to your comments:
I will check with the Discourse guys, but I suspect the --- that you are putting in your email as a separator is being translated as the start of an email footer, as that is what is traditionally used to separate the footer from the email body, which isn't wanted in the helpdesk posts.
Thanks, I’ll try to pay attention to it.
So onto your questions:
- If a shorter run length is good then change the run length to a few hours or days, whatever you need. Just because a job you copy is set to 1 month doesn’t mean it has to stay that way.
I honestly don't know the answer to this one; I thought you were asking about something else. I changed the run length to one day in u-ch427 (N512) to see if that works. One thing I always change is to run just one cycle, so I believe the total run length might not be relevant, but it's just a guess.
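For the record, this kind of change usually boils down to one setting in the suite's `rose-suite.conf`; the variable name below is my assumption and varies from suite to suite:

```
# Illustrative only: the run-length variable differs between suites
RUNLEN='P1D'    # ISO 8601 duration: one day, instead of e.g. 'P1M' for one month
```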
- The number of nodes used is determined by the decomposition (EWxNS), number of OMP threads and hyperthreads selected. Look at the job script that's generated in `~/cylc-run/<suiteid>/log/job/<task>/NN/job` and you'll see in the Slurm header how many nodes have been requested, along with the requested wallclock, which queue you've selected to run in and the budget account you are running under.
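For reference, the lines to look for in that header are the `#SBATCH` directives near the top of the job file; the values below are purely illustrative, not taken from any of these suites:

```
#SBATCH --nodes=14               # number of nodes requested
#SBATCH --time=01:00:00          # requested wallclock
#SBATCH --partition=standard     # queue/partition selected
#SBATCH --qos=standard           # quality of service
#SBATCH --account=n02-xxxxx      # budget account (placeholder)
```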
I could find the numbers, but I still don't know how they are derived. What's the math involved here?
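My best guess at the arithmetic is the sketch below; it assumes ARCHER2's standard 128-core nodes, and the decomposition values are made up for illustration. Please correct me if this is wrong:

```
EW=32; NS=28          # atmosphere decomposition: EW x NS MPI tasks
OMP=2                 # OpenMP threads per MPI task
CORES_PER_NODE=128    # ARCHER2 standard node
TASKS=$(( EW * NS ))
CORES=$(( TASKS * OMP ))
NODES=$(( (CORES + CORES_PER_NODE - 1) / CORES_PER_NODE ))   # round up
echo "$TASKS tasks x $OMP threads = $CORES cores -> $NODES nodes"
# -> 896 tasks x 2 threads = 1792 cores -> 14 nodes
```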
- Job violates accounting/QOS policy (job submit limit, user's size and/or time limits). I answered this back in submit-trying - #6. This indicates that you are trying to run on more nodes than the queue allows, that the time limit you've specified is too long for the queue, or that you've exceeded the maximum number of jobs you can have queueing at any one time. The details for all the queues are available on the ARCHER2 website: Running jobs - ARCHER2 User Documentation
I know the details from the ARCHER2 pages, but I didn't know how to change the number of nodes the job asks for, because I didn't know how that number is calculated. At least now I know where to find those numbers (log/job).
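In case it helps anyone else, these are the Slurm commands I believe show the limits in play (standard Slurm, so they should work on ARCHER2):

```
# Partition name, maximum walltime and node count, as Slurm reports them
sinfo -o "%P %l %D"
# How many of my jobs are already queued (the QOS also caps this)
squeue -u $USER -h | wc -l
```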
- Incorrect E-W resolution for Land/Sea Mask Ancillary (u-bo026-n216-ens3, u-bo026-n512-ens3 - recon)

Are you running this suite (u-bo026) with the required `--opt-conf-key`? For N216 you should be using `rose suite-run --opt-conf-key=n216e`, switching to `n512e` for the N512 resolution. This suite has optional overrides to make it easier to get the correct decomposition appropriate for the model resolution. Look in `~/roses/u-bo026-n216-ens3/opt` to see what settings this overrides.
This directory, `~/roses/<suite>/opt`, is really good; it was what I was looking for to get some guidance on how to tune the numbers. However, the configuration for u-bo026 is the same as for u-ch427, and I'm assuming this is the configuration for the ARCHER2 4-cabinet system, not the full system. Maybe that's why I'm still having problems adjusting the wallclock time and processors per node.
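For anyone following along, this is roughly how I have been inspecting those overrides; the grep pattern is just a guess at likely variable names:

```
# Optional config keys map to files named rose-suite-<key>.conf
ls ~/roses/u-bo026-n216-ens3/opt/
# Peek at what the n216e key overrides (decomposition, threads, wallclock, ...)
grep -iE 'proc|node|omp|wall|clock' ~/roses/u-bo026-n216-ens3/opt/rose-suite-n216e.conf
```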
- Model resolutions for the models you are running are as follows:

| Resolution | Grid Points | Grid Cell Size (degrees) | Spacing at Mid-latitudes |
| --- | --- | --- | --- |
| N96 | 192 x 145 | 1.88 x 1.25 | ~135 km |
| N216 | 432 x 325 | 0.83 x 0.56 | ~60 km |
| N512 | 1024 x 769 | 0.35 x 0.23 | ~25 km |
So N512 has approximately 5x the number of grid points compared to N216. So when I guesstimated the time for your N512 setup, I multiplied the ~1 hour it took for the N216 on 92 nodes by 5 for the resolution change and then by 2.5 for just under half the number of nodes, giving just shy of 13 hours. That was just a really crude estimation.
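(A quick sanity check of the 5x figure against the table above, for my own benefit:)

```
echo $(( 1024*769 ))   # N512: 787456 grid points
echo $((  432*325 ))   # N216: 140400 grid points
# 787456 / 140400 ≈ 5.6, and ~1 h x 5 x 2.5 ≈ 12.5 h, i.e. just shy of 13 hours
```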
How does that relate to the number of processors? Does it matter at all?
Hope that answers your questions.
We are moving forward. Thank you so much! 
Kind regards, Luciana.