Submit-retrying

Dear Ros.

After several attempts (and your support!), now I have different errors and some questions about the old ones.

Dear Ros.

Some updates after the weekend. I finally got one suite to work, just one, but the other ones, with a similar configuration, are giving me CANCELLED DUE TO TIME LIMIT, even after changing MAIN_CLOCK to 24H! Should I contact Archer2 Support, or is there something else you can suggest to help me?

Kind regards, Luciana.

Hi Luciana,

Please point me to one of the suites that is failing due to 24hr wallclock - I’ve had a quick look at some of your suites and I can’t find any.

I can see that u-ch427-216-ens3 ran out of time with 59min wallclock and u-ch427-ens6 & u-ch427-ens3 failed with only a 20min wallclock.

Regards,
Ros.

Dear Ros.

u-ch427-216-ens6 worked with 20min. That’s why I would expect the others to succeed within 20min.

u-ch427-512 failed with 24h.

Can you please address the other questions too, so I learn this once and for all? Thank you.

Kind regards,

Luciana.

Hi Luciana,

You need to consider the difference in model resolution and in the number of nodes the job is running on when estimating the expected runtime. The 2 suites you mentioned in your last reply are very different:

u-ch427-216-ens6 is N216 and completed in 50 mins on 92 nodes (time limit specified was 59 mins)

u-ch427-512 has a time limit of 12 hours specified, not 24 hours. This suite is a much higher resolution, N512, and is also running on far fewer nodes (38 nodes in total) than the lower resolution one, so I estimate it would take in the region of ~13 hours to complete.

Do you need to be running for as long as 1 month in order to get the information you require, or would a shorter run length suffice?

Can you please address the other questions too, so I learn this once and for all? Thank you.

I think I’ve answered all your questions within this topic as far as I can see and also in your other 2 open queries, so I’m not sure which unanswered questions you are referring to. Do let me know.

Regards,
Ros.

Dear Ros.

I don’t know why, but I checked the NCAS webpage and my message was truncated. I’m copying here the whole message from my previous email. It’s connected to the answer you just gave me, about the resolution and the options to make better use of Archer2 capacity. In my case, the shorter the run length the better. Can you also tell me where you are getting the information about the number of nodes? As I mentioned before, I just copied the suite and I’m only interested in the total time for different resolutions using distinct XIOS options. The suite with 9 ensembles and n216 also worked within the 59min, so it would be nice to understand how you estimate the time too.


Dear Ros.

After several attempts (and your support!), now I have different errors and some questions about the old ones.

Again, the message is truncated! So now I’m using https://cms-helpdesk.ncas.ac.uk/ to copy it. It seems the three dashes (- - -) are causing the message to truncate, but here they become a full-fledged horizontal line.


Dear Ros.

After several attempts (and your support!), now I have different errors and some questions about the old ones.


Job violates accounting/QOS policy (job submit limit, user’s size and/or time limits).

This is always the message. It doesn’t tell me what’s wrong. I don’t know how to translate job submit limit, user’s size and/or time limits into the variables that I’m using and that might be out of limit. What I did was check the new limits, and they are pretty much double the old ones, so I have no idea why it’s not working now. Is there a reference to understand how to translate those errors to a suite?


Another thing I don’t know is the error I’m getting now.

Incorrect E-W resolution for Land/Sea Mask Ancillary (u-bo026-n216-ens3, u-bo026-n512-ens3 - recon )

Too many processors in the North-South direction (56) to support the extended halo size (5). Try running with 28 processors.

What’s the resolution for the models? I’m running n96, n216 and n512 on Archer2, suite u-ch427. The only reference I have is for u-bo026 in http://cms.ncas.ac.uk/wiki/Archer2, and even in this case those numbers aren’t the ones that work, but at least it’s a reference. I need at least some boundaries to play with if I have to just guess those numbers. Other messages are like this last one, but I spent the whole day today testing the suggestions until I got to Incorrect E-W resolution…


u-ch427-512

I’m getting CANCELLED DUE TO TIME LIMIT, even after changing MAIN_CLOCK to 6H.


Kind regards,

Luciana.

Hi Luciana,

I will check with the Discourse guys, but I suspect the --- that you are putting in your email as a separator is being interpreted as the start of an email footer, as that is what is traditionally used to separate the footer from the email body, which isn’t wanted in the helpdesk posts.

So onto your questions:

  1. If a shorter run length is good then change the run length to a few hours or days, whatever you need. Just because a job you copy is set to 1 month doesn’t mean it has to stay that way.

  2. The number of nodes used is determined by the decomposition (EWxNS), number of OMP threads and hyperthreads selected. Look at the job script that’s generated in ~/cylc-run/<suiteid>/log/job/<task>/NN/job and you’ll see in the Slurm header how many nodes have been requested, along with the requested wallclock, which queue you’ve selected to run in and the budget account you are running under.

  3. Job violates accounting/QOS policy (job submit limit, user’s size and/or time limits).

    I answered this back in submit-retrying - #6. This indicates that you are trying to run on more nodes than the queue allows, that the time limit you’ve specified is too long for the queue, or that you’ve exceeded the max number of jobs you can have queueing at any one time. The details for all the queues are available on the ARCHER2 website: Running jobs - ARCHER2 User Documentation

  4. Incorrect E-W resolution for Land/Sea Mask Ancillary (u-bo026-n216-ens3, u-bo026-n512-ens3 - recon )

    Are you running this suite (u-bo026) with the required --opt-conf-key? For N216 you should be using rose suite-run --opt-conf-key=n216e, switching to n512e for the N512 resolution. This suite has optional overrides to make it easier to get the correct decomposition appropriate for the model resolution. Look in ~/roses/u-bo026-n216-ens3/opt to see what settings this overrides.

  5. Model resolutions for the models you are running are as follows:

    | Resolution | Grid Points | Grid Cell size (degrees) | Spacing at mid-latitudes |
    | --- | --- | --- | --- |
    | N96 | 192 x 145 | 1.88 x 1.25 | ~135km |
    | N216 | 432 x 325 | 0.83 x 0.56 | ~60km |
    | N512 | 1024 x 769 | 0.35 x 0.23 | ~25km |

    So N512 has approximately 5x the number of grid points compared to N216. So when I guesstimated the time for your N512 setup, I multiplied the ~1 hour it took for the N216 on 92 nodes by 5 for the resolution change and then by 2.5 for just under half the number of nodes, giving just shy of 13 hours. That was just a really crude estimation.
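For illustration, that back-of-envelope sum written out (this is nothing the suite itself computes, just the rough arithmetic, reproducible with bc at the command line):

echo "(1024*769)/(432*325)" | bc -l           # grid-point ratio N512:N216, ~5.6 (the ~5 above)
echo "92/38" | bc -l                          # node ratio 92 vs 38 nodes, ~2.4 (the ~2.5 above)
echo "1*(1024*769)/(432*325)*92/38" | bc -l   # ~1 hour x 5.6 x 2.4, i.e. roughly 13.5 hours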

Hope that answers your questions.

Regards,
Ros.

Just following up on the truncation - yes 2 or more -- indicates an automated email signature and will always be truncated. Don’t use them in the body of your emails and all should then be well.

Dear Ros.

I was trying to make the u-ch427 work, but most of your answers are about u-bo026, so I resumed working on that. I only need one of them to work.

First of all, the --opt-conf-key option works with n216 (n216e) and n512 (n512e), but not with n96 (n96e).

(u-bo026-n96-ens3)

-bash-4.1$ rose suite-run --opt-conf-key=n96e --new
[FAIL] Bad optional configuration key(s): n96e

(after running without the flag)
Error message: Too many processors in the North-South direction ( 56) to support the extended halo size ( 5). Try running with 28 processors.
? Error from processor: 0

I changed now to the number of processors of u-ch427. Let’s see if I get away with it.

u-ch427-ens9 is working fine with n96 and n216; I’m still testing n512.

Now, replying directly to your comments:


I will check with the Discourse guys, but I suspect the --- that you are putting in your email as a separator is being interpreted as the start of an email footer, as that is what is traditionally used to separate the footer from the email body, which isn’t wanted in the helpdesk posts.

Thanks, I’ll try to pay attention to it.

So onto your questions:

  1. If a shorter run length is good then change the run length to a few hours or days, whatever you need. Just because a job you copy is set to 1 month doesn’t mean it has to stay that way.

I honestly don’t know the answer to this question. I thought you were asking about something else. I changed it to one day in u-ch427 (n512) to see if that works. One thing I always change is to run just one cycle, so I believe the total run length might not be relevant. But it’s just a guess.

  2. The number of nodes used is determined by the decomposition (EWxNS), number of OMP threads and hyperthreads selected. Look at the job script that’s generated in ~/cylc-run/<suiteid>/log/job/<task>/NN/job and you’ll see in the Slurm header how many nodes have been requested, along with the requested wallclock, which queue you’ve selected to run in and the budget account you are running under.

I could find the numbers, but I still don’t know how they are generated. What’s the math involved here?

  3. Job violates accounting/QOS policy (job submit limit, user’s size and/or time limits). I answered this back in submit-retrying - #6. This indicates that you are trying to run on more nodes than the queue allows, that the time limit you’ve specified is too long for the queue, or that you’ve exceeded the max number of jobs you can have queueing at any one time. The details for all the queues are available on the ARCHER2 website: Running jobs - ARCHER2 User Documentation

I know the details from the Archer2 documentation, but I don’t know how to stay within the number of nodes that the queue allows, because I don’t know how that number is calculated. At least now I know where to find those numbers (log/job).

  4. Incorrect E-W resolution for Land/Sea Mask Ancillary (u-bo026-n216-ens3, u-bo026-n512-ens3 - recon )

    Are you running this suite (u-bo026) with the required --opt-conf-key? For N216 you should be using rose suite-run --opt-conf-key=n216e, switching to n512e for the N512 resolution. This suite has optional overrides to make it easier to get the correct decomposition appropriate for the model resolution. Look in ~/roses/u-bo026-n216-ens3/opt to see what settings this overrides.

This directory, ~/roses/suite/opt, is really good. It was what I was looking for, to have some guidance on how to tune the numbers. However, the configuration for u-bo026 is the same as for u-ch427, and I’m assuming this is the configuration for the Archer2 4-cabinet system, not the full system. Maybe that’s why I’m still having problems adjusting the wallclock time and processors per node.

  5. Model resolutions for the models you are running are as follows:

    | Resolution | Grid Points | Grid Cell size (degrees) | Spacing at mid-latitudes |
    | --- | --- | --- | --- |
    | N96 | 192 x 145 | 1.88 x 1.25 | ~135km |
    | N216 | 432 x 325 | 0.83 x 0.56 | ~60km |
    | N512 | 1024 x 769 | 0.35 x 0.23 | ~25km |

    So N512 has approximately 5x the number of grid points compared to N216. So when I guesstimated the time for your N512 setup, I multiplied the ~1 hour it took for the N216 on 92 nodes by 5 for the resolution change and then by 2.5 for just under half the number of nodes, giving just shy of 13 hours. That was just a really crude estimation.

How does that relate to the number of processors? Does it matter at all?

Hope that answers your questions.

We are moving forward. Thank you so much! :slight_smile:

Kind regards, Luciana.

Hi Luciana,

The same processor decompositions that you had working on the 4cab for the various resolutions will work on the 23cab.

There isn’t an optional override for N96 - you’ll see there is no corresponding file in the opt directory.
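For example, listing the suite’s opt directory shows which keys are available (the file names in the comment below are only illustrative - an optional key <key> corresponds to a file opt/rose-suite-<key>.conf):

ls ~/roses/u-bo026-n216-ens3/opt/
# e.g. rose-suite-n216e.conf  rose-suite-n512e.conf  ->  usable as --opt-conf-key=n216e / n512e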

The Slurm header in the job script is generated in the site/archer2.rc file. You’ll see the math for the --nodes calculation in there.
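If you just want to see what that works out to for a given task, printing the generated header is enough (path placeholders as in my earlier reply):

grep '^#SBATCH' ~/cylc-run/<suiteid>/log/job/<task>/NN/job
# shows the requested --nodes and --time along with the partition/QOS and account for that task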

You say you only need one of these suites to work, so I would suggest concentrating on the one that you had working on the 4-cabinet system and that you’d configured to get the information you are trying to collect - I believe that’s u-bo026.

Regards,
Ros.

Dear Ros.

After several tests, I have some suites running. As you suggested, I used the configuration for the 4-cabinet, but I’m getting more stable results with u-ch427, so I plan on sticking with this suite. I just need some help further adjusting the n512. The current configuration is

MAIN_ATM_PROCX=40
MAIN_ATM_PROCY=32
MAIN_OMPTHR_ATM=3
MAIN_CLOCK='PT120M'

I would like to be able to change the processors to get faster results (n96 and n216 are both running within 5 min). When I tried changing them at random, I got stuck with errors like Incorrect E-W resolution. Any suggestions?

About the number of nodes, I tried to chase the maths, but I got stuck on the functions and ifs. I’ll copy here what I got.

--nodes={{((ATMOS_NEXEC|int*NODE_ATM|int)+XIOS_NODES|int)}}

ATMOS_NEXEC = ATMOS_NENS (ok!)

NODE_ATM
{% set NODE_ATM = node(TASKS_ATM, MAIN_OMPTHR_ATM, MAIN_HYPTHR_ATM, APPN) %} (function where?!)

XIOS_NODES
{% set XIOS_NODES = (XIOS_NPROC/XPPN)|round(0,'ceil')|int %}
{% set XPPN = XIOS_PPN if XIOS_PPN is defined else PPN %} (ifs?!)
{% set PPN = 128 %}

XIOS_NPROC=16

XPPN=XIOS_PPN=6

XIOS_NODES = ceil(16/6) = 3?

The limit for the standard queue in Archer2 is 1024 nodes, so I have plenty of room to play with more processors and I don’t think this will be an issue, but it depends on the suggestions you might give me for the n512 and, eventually, for the n1280 too.

Kind regards, Luciana.

Hi Luciana,

Sorting out the best decomposition is trial and error. Just looking at N512 suites we’ve tested on Archer2, we’ve used something like 32 EW x 32 NS x 2 OMP with 2 XIOS nodes. You can try running on more, but at some point chucking more processors at it will not make it go any faster.
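In terms of the variables you quoted earlier, that decomposition would look something like this (a sketch only - the XIOS node count is controlled separately through the XIOS_NPROC and XIOS_PPN settings rather than these):

MAIN_ATM_PROCX=32
MAIN_ATM_PROCY=32
MAIN_OMPTHR_ATM=2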

{% set NODE_ATM = node(TASKS_ATM, MAIN_OMPTHR_ATM, MAIN_HYPTHR_ATM, APPN) %} (function where?!)

Look a few lines further up the file and you’ll find the node function.
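In spirit it boils down to something like the following (a rough sketch only, assuming hyperthreading is off, i.e. MAIN_HYPTHR_ATM=1 - the node function in site/archer2.rc is the definitive version):

# using your current N512 numbers: 40 x 32 decomposition, 3 OMP threads, 128 cores per node
PROCX=40; PROCY=32; OMP=3; PPN=128
TASKS=$(( PROCX * PROCY ))                 # 1280 MPI tasks
CORES=$(( TASKS * OMP ))                   # 3840 cores needed
NODE_ATM=$(( (CORES + PPN - 1) / PPN ))    # ceiling division -> 30 nodes
echo $NODE_ATM

That per-member figure is then multiplied by the number of atmosphere executables (ATMOS_NEXEC), and the XIOS nodes (your ceil(16/6) = 3) are added on top, as in the --nodes expression you quoted.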

{% set XPPN = XIOS_PPN if XIOS_PPN is defined else PPN %} (ifs?!)

This is just a standard if defined else statement written on one line.

You can confirm the values of the calculated variables (e.g. XIOS_NODES) by looking in the suite.rc.processed file.

Regards,
Ros.

Dear Ros.

Thank you for your support. I have some extra questions:

  • I’m trying to run the n1280 using the u-ch427 and rose-suite-n1280e.conf:

MAIN_ATM_PROCX=94

MAIN_ATM_PROCY=64
MAIN_OMPTHR_ATM=2

EXPT_BASIS='19790301T0000Z'

However, I’m getting an error in install_cold:

The following have been reloaded with a version change:

  1. cce/11.0.4 => cce/12.0.3

[FAIL] file:/work/n02/n02/lrpedro/cylc-run/u-ch427-n1280-template/share/data/etc/um_ancils_gl=source=/work/n02/n02/annette/HRCM/ancil/data/ancil_versions/n1280e_orca12/GA7.1_AMIP/v6/ancils: bad or missing value
2021-12-14T21:36:42Z CRITICAL - failed/EXIT

It’s my first time trying to run n1280, so it might have other things to configure that I’m not aware of.

  • Is there a command-line equivalent to gcylc Trigger (run now)? Would it be restart?

  • Is there another place to get the information about the total number of nodes other than the log/job directory? Maybe a command on Archer2? It would be good to have this information (it’s sent to Slurm!) as soon as the job starts, and not after it finishes. And am I right to believe that the only number that is relevant in this case is the one for the atmos_main job? I checked all the other jobs and they appear to request just one node, with the only exception of recon, which requests six nodes.

Kind regards, Luciana.

Hi Luciana,

  • For the install_cold app you’ll need to change the path of
    /work/n02/n02/annette/HRCM/ancil/data/ancil_versions/n1280e_orca12/GA7.1_AMIP/v6/ancils which can now be found under /work/y07/shared/umshared/HRCM/....

  • The cylc command for triggering a task from the command line is:
    cylc trigger REG TASK
    See cylc trigger --help for full details.

  • The job script that is in the log/job directory is the script that is submitted to Slurm and is available as soon as the task is submitted.

    The easiest way, however, to see how many nodes a task is running on once it’s been submitted is to query the Slurm queue on ARCHER2 using squeue -u <username>.

    atmos_main is the task that runs the model and is the only task affected by the change in atmos processor decomposition.

Regards,
Ros.

Dear Ros.

Still in the suite u-ch427-n1280-template, I’m now getting this error message.

? Error message: Failed to open file /work/n02/n02/grenvill/cylc-run/u-ce930/share/data/History_Data/ce930a.da19790301_00

About the number of nodes, I was exchanging some messages with Archer2 Support and they told me about:

sacct -j <jobid> --format="JobID,NNodes,Elapsed"

I’m still testing the options, because squeue always starts by showing 1 node, since the first task requests only one node.

Thank you for the other hints.

Kind regards, Luciana.

Hi Luciana,

Grenville is just copying /work/n02/n02/grenvill/cylc-run/u-ce930/share/data/History_Data/ce930a.da19790301_00 over from the 4-cab, but it will be a while before it’s there as the transfer is being very slow. We’ll let you know.

Each task in the suite is submitted as a separate batch job with a separate Slurm JobID, so you need to do squeue on the atmos_main job id - you won’t be able to get any information out of squeue or sacct regarding the number of nodes for the atmos_main task until it has been submitted. If you want this before the task is submitted, you’ll have to look at the suite.rc.processed file yourself to see what number of nodes it has calculated.
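For example, once atmos_main has been submitted you can ask Slurm for the node count directly (the job id is the one shown against atmos_main in squeue; the format strings here are just an illustration):

squeue -j <jobid> -o "%.12i %.20j %.8T %.6D %.10M"        # %D = number of nodes
sacct -j <jobid> --format=JobID,JobName,NNodes,Elapsed    # during and after the run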

Cheers,
Ros.

That’s great. Thank you very much! :slight_smile:

Luciana

The n1280 version of u-ch427 doesn’t need to run the reconfiguration - please switch it off. I copied the start file /work/n02/n02/grenvill/cylc-run/u-bo026-ens-inc1280/share/data/History_Data/u-b026a.da19790301_00, which the suite is configured to use.

There are one or two more file paths that need changing for n1280:
change /work/n02/n02/annette/HRCM/cmip6spectralmonthly to /work/y07/shared/umshared/HRCM/cmip6spectralmonthly
and change /work/n02/n02/annette/HRCM/easy_aerosol/final/1949-2015/n1280e to /work/y07/shared/umshared/HRCM/easy_aerosol/final/1949-2015/n1280e
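If it helps, a quick way to spot any other hard-wired paths still pointing at the old location is something like this (adjust the path to wherever your working copy of the suite lives):

grep -rn "n02/n02/annette/HRCM" ~/roses/u-ch427-n1280-template/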

Hopefully that’s all, but check my copy of the suite if I’ve forgotten any.

Grenville