Submit-retrying

Hi.

When I tried to run the n512 suite (u-bo026-n512-ens3) with the new archer2.rc file, I’m getting submit-retrying. I was only running four suites at that time: u-bo026-n96-ens3 / u-bo026-n216-ens3 / u-bo026-n512-ens3 / u-ch427.

The limits to new queues were supposed to be approx the double of the old system, so I’m confused why I got this message.

Kind regards, Luciana.

Hi Luciana,

I can’t see a submit-retrying task in u-bo026-n512-ens3. All the tasks only have 1 try that I can see in the cylc-run dir on pumatest.

Which task is it that has the problem? What is the actual error message? Information on each of the ARCHER2 queues, including node limits, time limits and number of jobs, is on their website: Running jobs - ARCHER2 User Documentation, which may help you diagnose the problem.

Please also change the permissions on your /work directory so we can see it:

chmod -R g+rX /work/n02/n02/lrpedro

Regards,
Ros.

Dear Ros.

I’ve changed the permissions in Archer2. Now I’m also getting this message when I log into Archer2. Can I just ignore it?

Hi Luciana,

What error message are you getting when you login to ARCHER2?

Regards,
Ros.

Dear Ros.

The suites are running, but I still don’t have a result. I do have another weird problem.

When I tried to run suites u-bo026-n96-ens3/u-bo026-n216-ens3/u-bo026-n512-ens3 with --new, I get:

[FAIL] rsync -a --exclude=.* --timeout=1800 --rsh=ssh\ -oBatchMode=yes --exclude=963fb933-4dd2-493d-8336-a0a5fcbb10e6 --exclude=log/963fb933-4dd2-493d-8336-a0a5fcbb10e6 --exclude=share/963fb933-4dd2-493d-8336-a0a5fcbb10e6 --exclude=share/cycle/963fb933-4dd2-493d-8336-a0a5fcbb10e6 --exclude=work/963fb933-4dd2-493d-8336-a0a5fcbb10e6 --exclude=/.* --exclude=/cylc-suite.db --exclude=/log --exclude=/log.* --exclude=/state --exclude=/share --exclude=/work ./ login.archer2.ac.uk:cylc-run/u-bo026-n216-ens3 # return-code=12, stderr=
[FAIL] rsync: mkdir "/home1/home/n02/n02/lrpedro/cylc-run/u-bo026-n216-ens3" failed: File exists (17)
[FAIL] rsync error: error in file IO (code 11) at main.c(664) [Receiver=3.1.3]
[FAIL] rsync: connection unexpectedly closed (9 bytes received so far) [sender]
[FAIL] rsync error: error in rsync protocol data stream (code 12) at io.c(600) [sender=3.0.6]

Without --new, they were submitted.

BTW, I’m still having the error about cray-netcdf when I log into Archer2.

Kind regards, Luciana.

Hi Luciana,

We’ve seen the rsync error too on occasion - it is intermittent so usually trying again works - we don’t know the cause.

u-bo026-n96-ens3 is still submit-retrying. When this happens you need to look for the error message in the log/job/<cycle>/<task>/NN/job-activity.log file. Look in home/luciana/cylc-run/u-bo026-n96-ens3/log/job/19880901T0000Z/atmos_main/08/job-activity.log and you’ll see the error message:

(login.archer2.ac.uk) 2021-12-01T08:13:13Z [STDERR] sbatch: error: QOSMaxNodePerUserLimit
(login.archer2.ac.uk) 2021-12-01T08:13:13Z [STDERR] sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)

This indicates that you are trying to run on more nodes than the queue (short queue in this instance) allows. See the ARCHER2 queue documentation I posted above in reply submit-retrying - #2
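A minimal sketch, in Python, of pulling the scheduler messages out of a job-activity.log so the reason for a submit-retry is easy to spot. The function name `submission_errors` is hypothetical, and the sample text is just the two log lines quoted above; it assumes error lines are tagged with `[STDERR]` as they are there.

```python
import re

# Matches lines such as:
# (login.archer2.ac.uk) 2021-12-01T08:13:13Z [STDERR] sbatch: error: QOSMaxNodePerUserLimit
STDERR_LINE = re.compile(r"\[STDERR\]\s*(?P<message>.*)$")


def submission_errors(log_text):
    """Return the message part of every [STDERR] line in a job-activity.log."""
    messages = []
    for line in log_text.splitlines():
        match = STDERR_LINE.search(line)
        if match:
            messages.append(match.group("message").strip())
    return messages


# The two lines quoted in this thread, used here as sample input.
sample = """\
(login.archer2.ac.uk) 2021-12-01T08:13:13Z [STDERR] sbatch: error: QOSMaxNodePerUserLimit
(login.archer2.ac.uk) 2021-12-01T08:13:13Z [STDERR] sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)
"""

for message in submission_errors(sample):
    print(message)
```

In practice the text would be read from the job-activity.log of the failing task and try number rather than from a string.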

Your cray-netcdf error on login is because you are loading modules in your ~/.bash_profile that don’t exist on the 23-cab. Remove the module load line.

Regards,
Ros.

Dear Ros.

After several attempts (and your support!), now I have different errors and some questions about the old ones.

Dear Ros.

Some updates after the weekend. I finally got one suite to work, just one, but the others, with similar configuration, are giving me CANCELLED DUE TO TIME LIMIT, even after changing MAIN_CLOCK to 24H! Should I contact Archer2 Support, or is there something else you can suggest to help me?

Kind regards, Luciana.

Hi Luciana,

Please point me to one of the suites that is failing due to 24hr wallclock - I’ve had a quick look at some of your suites and I can’t find any.

I can see that u-ch427-216-ens3 ran out of time with 59min wallclock and u-ch427-ens6 & u-ch427-ens3 failed with only a 20min wallclock.

Regards,
Ros.

Dear Ros.

u-ch427-216-ens6 worked with 20min. That’s why I would expect the others to succeed within 20min.

u-ch427-512 failed with 24h.

Can you please address the other questions too, so I learn that once and for all? Thank you.

Kind regards,

Luciana.

Hi Luciana,

You need to consider the difference in resolution and the number of nodes each job runs on when estimating the expected runtime. The 2 suites you mentioned in your last reply are very different:

u-ch427-216-ens6 is N216 and completed in 50 mins on 92 nodes (time limit specified was 59 mins)

u-ch427-512 has a time limit of 12 hours specified, not 24 hours. This suite is a much higher resolution, N512, and is also running on far fewer nodes (38 nodes in total) than the lower resolution one, so I estimate it would take in the region of 13 hours to complete.

Do you need to be running for as long as 1month in order to get the information you require or would a shorter run length suffice?

Can you please address the other questions too, so I learn that once and for all? Thank you.

I think I’ve answered all your questions within this topic as far as I can see and also in your other 2 open queries, so I’m not sure which unanswered questions you are referring to. Do let me know.

Regards,
Ros.

Dear Ros.

I don’t know why, but I checked the NCAS webpage and my message was truncated. I’m copying here the whole message from my previous email. It’s connected to the answer you just gave me, about the resolution and options to better use Archer2 capacity. In my case, the shorter the run-length the better. Can you also tell me where you are getting the information about the number of nodes? As I mentioned before, I just copied the suite and I’m only interested in the total time for different resolutions using distinct XIOS options. The suite with 9 ensembles and n216 also worked within the 59min, so it would be nice to understand how you estimate the time too.

date: 3 Dec 2021, 15:34
mailed-by: gmail.com

Dear Ros.

After several attempts (and your support!), now I have different errors and some questions about the old ones.

Again, the message is truncated! So now I’m using https://cms-helpdesk.ncas.ac.uk/ to copy it. It seems the three - - - is causing the message to truncate, but here it becomes a full-fledged line.


Dear Ros.

After several attempts (and your support!), now I have different errors and some questions about the old ones.


Job violates accounting/QOS policy (job submit limit, user’s size and/or time limits).

This is always the message. It doesn’t tell me what’s wrong. I don’t know how to translate job submit limit, user’s size and/or time limits into variables that I’m using and might be out of limit. What I’ve done was to check the new limits and they are pretty much the double of the old ones, so I have no idea why it’s not working now. Is there a reference to understand how to translate those errors to a suite?


Another thing I don’t know is the error I’m getting now.

Incorrect E-W resolution for Land/Sea Mask Ancillary (u-bo026-n216-ens3, u-bo026-n512-ens3 - recon )

Too many processors in the North-South direction (56) to support the extended halo size (5). Try running with 28 processors.

What’s the resolution for the models? I’m running n96, n216 and n512 on Archer2 with suite u-ch427. The only reference I have is for u-bo026 in http://cms.ncas.ac.uk/wiki/Archer2. And even in this case, those numbers aren’t the ones that work, but at least it’s a reference. I need at least boundaries to play with if I have to just guess those numbers. Other messages are like this last one, but I spent the whole day today testing the suggestions until I got to Incorrect E-W resolution…


u-ch427-512

I’m getting CANCELLED DUE TO TIME LIMIT, even after changing MAIN_CLOCK to 6H.


Kind regards,

Luciana.

Hi Luciana,

I will check with the Discourse guys but I suspect the --- that you are putting in your email as a separator is being translated as start of email footer as that is what is traditionally used to separate the footer from the email body which isn’t wanted in the helpdesk posts.

So onto your questions:

  1. If a shorter run length is good then change the run length to a few hours or days, whatever you need. Just because a job you copy is set to 1 month doesn’t mean it has to stay that way.

  2. The number of nodes used is determined by the decomposition (EWxNS), number of OMP threads and hyperthreads selected. Look at the job script that’s generated in ~/cylc-run/<suiteid>/log/job/<task>/NN/job and you’ll see in the Slurm header how many nodes have been requested, along with the requested wallclock, which queue you’ve selected to run in and the budget account you are running under.

  3. Job violates accounting/QOS policy (job submit limit, user’s size and/or time limits).

    I answered this back in submit-retrying - #6. This indicates that you are trying to run on more nodes than the queue allows, that the time limit you’ve specified is too long for the queue, or that you’ve exceeded the max number of jobs you can have queueing at any one time. The details for all the queues are available on the ARCHER2 website: Running jobs - ARCHER2 User Documentation

  4. Incorrect E-W resolution for Land/Sea Mask Ancillary (u-bo026-n216-ens3, u-bo026-n512-ens3 - recon )

    Are you running this suite (u-bo026) with the required --opt-conf-key? For N216 you should be using rose suite-run --opt-conf-key=n216e, switching to n512e for the N512 resolution. This suite has optional overrides to make it easier to get the correct decomposition appropriate for the model resolution. Look in ~/roses/u-bo026-n216-ens3/opt to see what settings this overrides.

  5. Model resolutions for the models you are running are as follows:

    | Resolution | Grid Points | Grid Cell size (degrees) | Spacing at mid-latitudes |
    | --- | --- | --- | --- |
    | N96 | 192 x 145 | 1.88 x 1.25 | ~135km |
    | N216 | 432 x 325 | 0.83 x 0.56 | ~60km |
    | N512 | 1024 x 769 | 0.35 x 0.23 | ~25km |

    So N512 has approximately 5x the number of grid points of N216. So when I guesstimated the time for your N512 setup I multiplied the ~1 hour it took for the N216 on 92 nodes by 5 for the resolution change and then by 2.5 for just under half the number of nodes, giving just shy of 13 hours. That was just a really crude estimation.
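The crude estimate above can be written out as a short Python sketch. The function names are hypothetical, and the scaling rule (runtime proportional to grid points and inversely proportional to nodes) is only the rough assumption used in the estimate, not a performance model. It ignores timestep changes, I/O and parallel efficiency.

```python
# Grid sizes from the table above (east-west x north-south points).
GRID_POINTS = {"N96": (192, 145), "N216": (432, 325), "N512": (1024, 769)}


def grid_ratio(target, reference):
    """Ratio of total grid points between two resolutions."""
    tx, ty = GRID_POINTS[target]
    rx, ry = GRID_POINTS[reference]
    return (tx * ty) / (rx * ry)


def scaled_runtime_hours(ref_hours, ref_nodes, target_nodes, points_ratio):
    """Crude estimate: runtime scales with grid points and inversely with nodes."""
    return ref_hours * points_ratio * (ref_nodes / target_nodes)


ratio = grid_ratio("N512", "N216")   # about 5.6, rounded to 5 in the estimate above
node_factor = 92 / 38                # about 2.4, rounded to 2.5 in the estimate above

# Rounded factors, as in the reply: 1 hour x 5 x 2.5
print("rounded estimate (hours):", 1.0 * 5 * 2.5)
# Same rule with the unrounded factors
print("unrounded estimate (hours):", round(scaled_runtime_hours(1.0, 92, 38, ratio), 1))
```

Both figures land in the 12 to 14 hour range, which is why a 12 hour time limit on 38 nodes is marginal for a one month N512 run under this assumption.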

Hope that answers your questions.

Regards,
Ros.

Just following up on the truncation - yes 2 or more -- indicates an automated email signature and will always be truncated. Don’t use them in the body of your emails and all should then be well.

Dear Ros.

I was trying to make the u-ch427 work, but most of your answers are about u-bo026, so I resumed working on that. I only need one of them to work.

First of all, the option --opt-conf-key=n96e works with n216 and n512, but not with n96.

(u-bo026-n96-ens3)

-bash-4.1$ rose suite-run --opt-conf-key=n96e --new
[FAIL] Bad optional configuration key(s): n96e

(after run without the flag)
Error message: Too many processors in the North-South direction ( 56) to support the extended halo size ( 5). Try running with 28 processors.
? Error from processor: 0

I changed now to the number of processors of u-ch427. Let’s see if I get away with it.

u-ch427-ens9 is working fine with n96 and n216; I’m still testing n512.

Now, replying directly to your comments:


I will check with the Discourse guys but I suspect the --- that you are putting in your email as a separator is being translated as start of email footer as that is what is traditionally used to separate the footer from the email body which isn’t wanted in the helpdesk posts.

Thanks, I’ll try to pay attention to it.

So onto your questions:

  1. If a shorter run length is good then change the run length to a few hours or days, whatever you need. Just because a job you copy is set to 1 month doesn’t mean it has to stay that way.

I honestly don’t know the answer to this question. I thought you were asking about something else. I changed to one day in u-ch427 (n512) to see if that works. One thing I always change is to run just one cycle, so I believe the total amount might not be relevant. But it’s just a guess.

  2. The number of nodes used is determined by the decomposition (EWxNS), number of OMP threads and hyperthreads selected. Look at the job script that’s generated in ~/cylc-run/<suiteid>/log/job/<task>/NN/job and you’ll see in the Slurm header how many nodes have been requested, along with the requested wallclock, which queue you’ve selected to run in and the budget account you are running under.

I could find the numbers, but I still don’t know how they are generated. What’s the math involved here?

  3. Job violates accounting/QOS policy (job submit limit, user’s size and/or time limits). I answered this back in submit-retrying - #6. This indicates that you are trying to run on more nodes than the queue allows, that the time limit you’ve specified is too long for the queue, or that you’ve exceeded the max number of jobs you can have queueing at any one time. The details for all the queues are available on the ARCHER2 website: Running jobs - ARCHER2 User Documentation

I know the details from Archer, but I don’t know how to change the number of nodes that the queue allows because I don’t know how this number is calculated. At least now I know where to find those numbers (log-job).

  4. Incorrect E-W resolution for Land/Sea Mask Ancillary (u-bo026-n216-ens3, u-bo026-n512-ens3 - recon )

    Are you running this suite (u-bo026) with the required --opt-conf-key? For N216 you should be using rose suite-run --opt-conf-key=n216e, switching to n512e for the N512 resolution. This suite has optional overrides to make it easier to get the correct decomposition appropriate for the model resolution. Look in ~/roses/u-bo026-n216-ens3/opt to see what settings this overrides.

This directory, ~/roses/suite/opt, is really good. It was what I was looking for to have some guidance in how to tune the numbers. However, the configuration for u-bo026 is the same for u-ch427, and I’m assuming this is the configuration for the Archer2 4-cabinet, and not the full system. Maybe that’s why I’m still having problems adjusting wallclock time and processors per node.

  5. Model resolutions for the models you are running are as follows:

    | Resolution | Grid Points | Grid Cell size (degrees) | Spacing at mid-latitudes |
    | --- | --- | --- | --- |
    | N96 | 192 x 145 | 1.88 x 1.25 | ~135km |
    | N216 | 432 x 325 | 0.83 x 0.56 | ~60km |
    | N512 | 1024 x 769 | 0.35 x 0.23 | ~25km |

    So N512 has approximately 5x the number of grid points of N216. So when I guesstimated the time for your N512 setup I multiplied the ~1 hour it took for the N216 on 92 nodes by 5 for the resolution change and then by 2.5 for just under half the number of nodes, giving just shy of 13 hours. That was just a really crude estimation.

How does that relate to the number of processors? Does it matter at all?

Hope that answers your questions.

We are moving forward. Thank you so much! :slight_smile:

Kind regards, Luciana.

Hi Luciana,

The same processor decompositions that you had working on the 4cab for the various resolutions will work on the 23cab.

There isn’t an optional override for N96 - you’ll see there is no corresponding file in the opt directory.

The Slurm header in the job script is generated in the site/archer2.rc file. You’ll see the math for the --nodes calculation in there.
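To read the requested resources back out of a generated job script, something like the following Python sketch could be used. It assumes the header is written as `#SBATCH --key=value` lines; the function name `slurm_requests` and every value in the sample header (node count, time, queue, account) are hypothetical and for illustration only.

```python
import re

# One '#SBATCH --key=value' directive per line is assumed.
SBATCH_OPTION = re.compile(r"^#SBATCH\s+--(?P<key>[\w-]+)=(?P<value>\S+)")


def slurm_requests(job_script_text):
    """Collect '#SBATCH --key=value' directives from a job script into a dict."""
    requests = {}
    for line in job_script_text.splitlines():
        match = SBATCH_OPTION.match(line.strip())
        if match:
            requests[match.group("key")] = match.group("value")
    return requests


# Hypothetical header; real values come from the suite's generated job file.
sample_job = """\
#!/bin/bash -l
#SBATCH --job-name=atmos_main
#SBATCH --nodes=38
#SBATCH --time=12:00:00
#SBATCH --partition=standard
#SBATCH --qos=standard
#SBATCH --account=n02-example
"""

for key, value in slurm_requests(sample_job).items():
    print(f"{key:>10}: {value}")
```

Because the job file is written when the task is submitted, this can be checked while the job is still queued or running.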

You say you only need one of these suites to work, so I would suggest concentrating on the one that you had working on the 4 cabinet system and that you’d configured to get the information you are trying to collect. I believe that’s u-bo026.

Regards,
Ros.

Dear Ros.

After several tests, I have some suites running. As you suggested, I used the configuration for the 4-cabinet, but I’m getting more stable results with u-ch427, so I plan on sticking with this suite. I just need some help further adjusting the n512. The current configuration is

MAIN_ATM_PROCX=40
MAIN_ATM_PROCY=32
MAIN_OMPTHR_ATM=3
MAIN_CLOCK='PT120M'

I would like to be able to change the processors to have faster results (n96 and n216 are both running within 5min). When I tried randomly, I got stuck with errors like Incorrect E-W resolution. Any suggestions?

About the number of nodes, I tried to chase the maths, but I got stuck into functions and ifs. I’ll copy here what I got.

--nodes={{((ATMOS_NEXEC|int*NODE_ATM|int)+XIOS_NODES|int)}}

ATMOS_NEXEC = ATMOS_NENS (ok!)

NODE_ATM
{% set NODE_ATM = node(TASKS_ATM, MAIN_OMPTHR_ATM, MAIN_HYPTHR_ATM, APPN) %} (function where?!)

XIOS_NODES
{% set XIOS_NODES = (XIOS_NPROC/XPPN)|round(0,'ceil')|int %}
{% set XPPN = XIOS_PPN if XIOS_PPN is defined else PPN %} (ifs?!)
{% set PPN = 128 %}

XIOS_NPROC=16

XPPN=XIOS_PPN=6

XIOS_NODES = 16/6 = 3?
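The XIOS part of that arithmetic can be checked with a small Python sketch that mirrors the quoted Jinja2 lines. The function names are hypothetical. NODE_ATM is passed in as a plain number because the `node(...)` function that produces it is not shown in this thread, and the values used for it below are placeholders.

```python
import math


def xios_nodes(xios_nproc, xios_ppn=None, ppn=128):
    """Mirror of the quoted template: XPPN falls back to PPN when XIOS_PPN is unset."""
    xppn = xios_ppn if xios_ppn is not None else ppn
    # round(0, 'ceil') in Jinja2 rounds up to the next whole number.
    return math.ceil(xios_nproc / xppn)


def total_nodes(atmos_nexec, node_atm, n_xios_nodes):
    """--nodes = ATMOS_NEXEC * NODE_ATM + XIOS_NODES, as in the quoted template line."""
    return atmos_nexec * node_atm + n_xios_nodes


print(xios_nodes(16, xios_ppn=6))   # 16/6 is about 2.67, rounded up to 3
print(xios_nodes(16))               # XIOS_PPN undefined, so PPN = 128 is used
# Placeholder NODE_ATM of 10 with 3 ensemble members, purely illustrative.
print(total_nodes(3, 10, xios_nodes(16, xios_ppn=6)))
```

So with XIOS_NPROC=16 and XIOS_PPN=6 the template does give 3 XIOS nodes.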

The limit for the standard queue in Archer2 is 1024, so I have plenty of room to play with more processors and I don’t think this will be an issue, but it depends on the suggestions you might give me for the n512 and, eventually, for the n1280 too.

Kind regards, Luciana.

Hi Luciana,

Sorting out best decompositions is trial and error. Just looking at N512 suites we’ve tested on Archer2, we’ve used something like 32 EW x 32 NS x 2 OMP with 2 XIOS nodes. You can try running it on more but at some point chucking more processors at it will not make it go any faster.
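As a rough guide to how much of the machine a decomposition asks for, here is a Python sketch under stated assumptions: each OpenMP thread gets its own core, hyperthreading is off, and a node has 128 cores (the PPN value in the template quoted earlier). The function name `atmos_nodes` is hypothetical, and the suite's own node function in site/archer2.rc may calculate this differently.

```python
import math

CORES_PER_NODE = 128  # matches PPN = 128 in the template quoted earlier


def atmos_nodes(procx, procy, omp_threads, cores_per_node=CORES_PER_NODE):
    """Nodes needed if every OpenMP thread gets its own core (an assumption)."""
    mpi_tasks = procx * procy          # EW x NS decomposition
    cores = mpi_tasks * omp_threads    # one core per thread
    return math.ceil(cores / cores_per_node)


decompositions = {
    "32 EW x 32 NS x 2 OMP": (32, 32, 2),
    "40 EW x 32 NS x 3 OMP": (40, 32, 3),
}
for label, (ew, ns, omp) in decompositions.items():
    print(label, "->", atmos_nodes(ew, ns, omp), "nodes per ensemble member")
```

The per-member figure still has to be multiplied by the number of ensemble members, with the XIOS nodes added on top, to get the total request.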

{% set NODE_ATM = node(TASKS_ATM, MAIN_OMPTHR_ATM, MAIN_HYPTHR_ATM, APPN) %} (function where?!)

Look a few lines further up the file and you’ll find the node function.

{% set XPPN = XIOS_PPN if XIOS_PPN is defined else PPN %} (ifs?!)

This is just a standard if defined else statement written on one line.

You can confirm the values of the calculated variables (e.g. XIOS_NODES) by looking in the suite.rc.processed file.

Regards,
Ros.

Dear Ros.

Thank you for your support. I have some extra questions:

  • I’m trying to run the n1280 using the u-ch427 and rose-suite-n1280e.conf:

MAIN_ATM_PROCX=94

MAIN_ATM_PROCY=64
MAIN_OMPTHR_ATM=2

EXPT_BASIS=‘19790301T0000Z’

However, I’m getting an error in install_cold:

The following have been reloaded with a version change:

  1. cce/11.0.4 => cce/12.0.3

[FAIL] file:/work/n02/n02/lrpedro/cylc-run/u-ch427-n1280-template/share/data/etc/um_ancils_gl=source=/work/n02/n02/annette/HRCM/ancil/data/ancil_versions/n1280e_orca12/GA7.1_AMIP/v6/ancils: bad or missing value
2021-12-14T21:36:42Z CRITICAL - failed/EXIT

It’s my first time trying to run n1280, so it might have other things to configure that I’m not aware of.

  • Is there a command-line equivalent to gcylc Trigger (run now)? Would it be restart?

  • Is there another place to get the information about the total number of nodes other than the log/job? Maybe a command in Archer2? It would be good to have this information (it’s sent to slurm!) as soon as the job starts, and not after it finishes. And am I right to believe that the only number that is relevant in this case is the one in the job atmos_main? I checked all the other jobs and they appear to request just one node, with the only exception of recon, which requests six nodes.
    Kind regards, Luciana.