Suite stalled at 'submit' stage

I am trying to run suite u-cs713 under a new Monsoon project code (step).

The suite is a copy of one I’d run previously under the ‘solar’ project.

I lost a few files that I’d previously accessed from my /projects/solar directories (because I didn’t copy them across in time before they were deleted at the end of the project).

One of these was the ‘meta’ data, and Ros has copied this across for me to /home/d04/rhatcher/meta/vn10.3_nudge_meta

I’ve also had to pull some start files from moose again (they are in /projects/step/lgray/startdumps/an074_20080901/)

I’ve changed the project code from solar to step in the suite.

As far as I’m aware this is all I’ve changed. However, the suite stalls at the ‘submit’ stage.

If I look in /cylc-run/u-cs713/log/job/20080901T0000Z/install_ancil/01/job-activity.log I see this message:

[jobs-submit cmd] cylc jobs-submit --utc-mode --host=xcs-c --remote-mode -- '$HOME/cylc-run/u-cs713/log/job' 20080901T0000Z/install_ancil/01

[jobs-submit ret_code] 32

[jobs-submit out] 2022-12-21T12:05:30Z|20080901T0000Z/install_ancil/01|32|None

(xcs-c) 2022-12-21T12:05:30Z [STDERR] qsub: error: [PBSZeroProject] 'acsis' has no share allocation in 'collaboration' trustzone

[(('event-mail', 'submission retry'), 1) ret_code] 0

I don’t understand why the ‘ACSIS’ project is referred to here - maybe it’s a red herring - as far as I’m aware this suite has nothing to do with the ACSIS project. I’ve searched in my suite for any mention of the ACSIS project but can’t find anything.

Can you help spot what is wrong please?
thank you!
Lesley

I think that you are a member of the acsis group, and your suite is running in /projects/acsis (although it is linked back to your home directory, so you can see it there). Perhaps you want it to be somewhere else? I think that if you put:

root-dir=*=/projects/<project-name>/$USER

at the top of rose-suite.conf and retry, the run directory might be created in the other project space.

Hi Dave
In my cylc-run/u-cs713 directory I have the following:
app log.20221221T134110Z.tar.gz rose-suite.conf_monsoon site
bin log.20221221T135047Z rose-suite.conf_nci_raijin suite.rc
cylc-suite.db meta rose-suite.info suite.rc.processed
log rose-suite.conf_meto_cray share work

So there is no rose-suite.conf file.

I tried editing rose-suite.conf_monsoon and added root-dir=*=/projects/step/$USER as you suggested.

But when I type rose suite-run from within the cylc-run/u-cs713 directory I get the following message:
[FAIL] /projects/acsis/lgray/cylc-run/u-cs713: rose-suite.conf not found.

What am I doing wrong here?

I probably am a member of the acsis group, but have always run under ‘solar’ in the past, and now want to run under ‘step’. There are various places in the suite set-up to specify the project name, but I thought I’d picked up on all of these (and if I do a search for ‘acsis’ I get a null return).

thanks
Lesley

You want to make changes to your suite in roses, then run it in such a way that it creates the cylc-run correctly.

So first observe the status of cylc-run:
ls -lht ~lgray/cylc-run/u-cs713
lrwxrwxrwx 1 lgray mo_users 38 Dec 21 13:41 /home/d03/lgray/cylc-run/u-cs713 -> /projects/acsis/lgray/cylc-run/u-cs713

i.e. it is linked to acsis. Now change ~lgray/roses/u-cs713/rose-suite.conf (as above),

then run the suite (from roses/u-cs713) with --new. That should delete the cylc-run directory and make a new one. If it doesn’t work cleanly by itself, you can probably delete cylc-run/u-cs713 by hand and run again.
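
To confirm which project space the run directory resolves to, before and after re-running, something like the following may help. It is only a sketch: project_of_path is a hypothetical helper name, and it assumes the /projects/<project>/<user>/cylc-run/<suite> layout seen in the listing above.

```shell
# Hypothetical helper: print the project component of a resolved path.
# Assumes the layout /projects/<project>/<user>/...; fails otherwise.
project_of_path() {
    case "$1" in
        /projects/*/*)
            rest=${1#/projects/}          # strip the leading /projects/
            printf '%s\n' "${rest%%/*}"   # keep only the first component
            ;;
        *)
            return 1
            ;;
    esac
}
```

On the login node this could be called as project_of_path "$(readlink -f ~/cylc-run/u-cs713)", which should print the project the symlink currently points into.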

Hi Dave
Thanks for the clarification. I did a rose suite-clean from within the cylc-run directory, then went into the roses directory and made the change you suggested. So the first few lines of my rose-suite.conf look like this:

[jinja2:suite.rc]
root-dir=*=/projects/step/$USER
ANCIL_OPT_KEYS=''
ATM_OPENMP_WAIT_POLICY='PASSIVE'
ATM_PROCX=16
ATM_PROCY=14

When I tried running the suite I get the following:

[INFO] export CYLC_VERSION=7.8.12
[INFO] export ROSE_ORIG_HOST=xcslc0
[INFO] export ROSE_SITE=
[INFO] export ROSE_VERSION=2019.01.7
[INFO] symlink: /projects/acsis/lgray/cylc-run/u-cs713 <= /home/d03/lgray/cylc-run/u-cs713
[INFO] create: log.20221221T150615Z
[INFO] delete: log
[INFO] symlink: log.20221221T150615Z <= log
[INFO] log.20221221T150336Z.tar.gz <= log.20221221T150336Z
[INFO] delete: log.20221221T150336Z/
[INFO] create: log/suite
[INFO] create: log/rose-conf
[INFO] symlink: rose-conf/20221221T150615-run.conf <= log/rose-suite-run.conf
[INFO] symlink: rose-conf/20221221T150615-run.version <= log/rose-suite-run.version
[INFO] delete: suite.rc
[INFO] install: suite.rc
[INFO] REGISTERED u-cs713 -> /home/d03/lgray/cylc-run/u-cs713
[FAIL] cylc validate -o /working/d03/lgray/jtmp/tmp.vABWDTvxZe/tmpRx8FKL --strict u-cs713 # return-code=1, stderr=
[FAIL] Jinja2Error:
[FAIL] File "", line 51, in template
[FAIL] TemplateSyntaxError: expected token 'end of statement block', got '-'
[FAIL] Context lines:
[FAIL] {% set UM_OPT_KEYS='aeroclim' %}
[FAIL] {% set USE_DEFAULT_ACCOUNT=true %}
[FAIL] {% set USE_MOOPROJECT=true %}
[FAIL] {% set root-dir=*=/projects/step/lgray %} ← Jinja2Error

I tried some variants e.g. ROOT-DIR in capitals, just in case that mattered, and e.g. set-root=‘/projects/step/$USER’ but neither of these worked.

thanks
Lesley

Ahh, I meant at the very top of the file. As in the first line.

It’s not supposed to be within any jinja or anything else

as in…

try this
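
One way to check the placement before re-running is sketched below. It assumes that any root-dir= line appearing before the first [section] header counts as being at the top level of the file; root_dir_is_top_level is a made-up helper name, not a Rose command.

```shell
# Hypothetical helper: succeed only if a root-dir= line appears
# before the first [section] header of the given rose-suite.conf.
root_dir_is_top_level() {
    awk '
        /^\[/        { exit }             # first section header: stop looking
        /^root-dir=/ { found = 1; exit }  # setting seen before any section
        END          { exit !found }      # status 0 if found, 1 otherwise
    ' "$1"
}
```

For example, root_dir_is_top_level ~/roses/u-cs713/rose-suite.conf && echo ok would print ok only when the setting sits above [jinja2:suite.rc] and every other section.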

Hi Dave
Yes, that worked, and now I get

ls -lht ~lgray/cylc-run/u-cs713
lrwxrwxrwx 1 lgray mo_users 37 Dec 21 15:48 /home/d03/lgray/cylc-run/u-cs713 -> /projects/step/lgray/cylc-run/u-cs713

so it’s pointing my home directory to the step project now.

However, it’s still failing with the same error message:
lgray@xcslc0:~/cylc-run/u-cs713/log/job/20080901T0000Z/install_ancil/NN> more job-activity.log

[jobs-submit cmd] cylc jobs-submit --utc-mode --host=xcs-c --remote-mode -- '$HOME/cylc-run/u-cs713/log/job' 20080901T0000Z/install_ancil/04
[jobs-submit ret_code] 32
[jobs-submit out] 2022-12-21T15:52:34Z|20080901T0000Z/install_ancil/04|32|None
(xcs-c) 2022-12-21T15:52:34Z [STDERR] qsub: error: [PBSZeroProject] 'acsis' has no share allocation in 'collaboration' trustzone

So it’s still picking up ‘acsis’ from somewhere. Any ideas? (I’ve tried inserting your extra line of code at the top of the rose-suite.conf_monsoon file in the roses directory, but this didn’t help).

thanks
Lesley

Ok. I don’t think that you are giving the PBS scheduler the instruction to charge an account for the resources. I believe that on Monsoon you use -P (for project), rather than the general -A (for account). I never actually use this system, and can’t see the details for some reason.

But I think that your job will submit if you add this variable. It looks as though you will need to change the bit about -P HPC_GROUPS[0] to the group you want to charge. If you can see an option for this in the logic of the suite, then change it there. Otherwise I’d just overwrite it.

you want your apps to have:
[[[directives]]]
-P=[name of charging group]

in the suite.rc

When you look at the job script which is generated, it should have this argument. You can also test a submission of a job on its own if this is easier than restarting the suite.
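
A quick way to inspect the generated job script is sketched here. It assumes the directive is written out as a line of the form "#PBS -P <project>", which has not been verified on this system; job_project is a hypothetical helper name.

```shell
# Hypothetical helper: print the project named on the first "#PBS -P" line
# of a job script, or fail if the script carries no such directive.
job_project() {
    awk '
        $1 == "#PBS" && $2 == "-P" { print $3; found = 1; exit }
        END { exit !found }
    ' "$1"
}
```

Assuming the job script sits next to job-activity.log under the name job, this could be run as job_project ~/cylc-run/u-cs713/log/job/20080901T0000Z/install_ancil/NN/job, and the output should be the group you intend to charge.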

Hi Dave

Thank you - that’s worked!

If possible (after Christmas!), it would be good to find the ‘master switch’ that specifies ‘acsis’ as my default charging code and change it to ‘step’. Otherwise I’ll have to make these rather laborious changes by hand every time I set up a new suite. Maybe next time I’m down in Reading I could sit down with someone from CMS and try to do this.

But - for now - I am back in business again. Thank you!! And hope you have a very Happy Christmas.

Lesley

Ok. Great.

I guess the HPC_GROUPS thing is probably running the Unix groups command and taking the first one in the list. But that’s a guess. Feel free to raise a ticket in the new year and I can look at it properly.
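
To see what that guess would imply for a given account, the group list can be inspected directly. This is only an illustration of the guess: how HPC_GROUPS is actually populated in the suite has not been confirmed, and first_group is a hypothetical helper name.

```shell
# List this account's groups in the order the system reports them.
id -Gn

# Hypothetical helper: first name in a space-separated group list.
first_group() {
    printf '%s\n' "$1" | awk '{ print $1 }'
}

# The group that a "take the first one" rule would pick for this account.
first_group "$(id -Gn)"
```

If the name printed last is acsis, that would be consistent with the qsub error above, though it would not prove where the suite gets the value from.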

Hopefully the computer works hard whilst you’re enjoying a break.