Problem with project codes in Monsoon suite

lesleygray · 5 January 2023 14:04

Dear CMS Support Team,

Happy New Year.

I’m just picking up on a recent thread, because I am still having trouble getting my suite to run. Somehow, my suite is still defaulting to using project-acsis for some tasks, instead of using my new project-step. Dave helped earlier (see previous thread) to identify some changes that we inserted directly, and this enabled the suite to progress a little further. However, the suite is now failing further on and not writing out certain files (including pp output files), so I think we need to get to the source of the problem.

My suite is u-ct078, e.g.
/home/d03/lgray/cylc-run/u-ct078/log/job/20080901T0000Z/postproc_r001i1p00000/01/job-activity.log

I have had some help from my Met Office contacts (Martin Andrews/Jeff Knight), who have helped identify the problem (see below). I’ve also e-mailed the Monsoon Team, in case the issue is at their end; the message I sent is below:

Dear Monsoon Support Team
I have a problem with running suites under my new project-step.

I was previously a member of project-solar and project-acsis, both of which have recently finished.

When I try running my suite (u-ct078) under project-step, it fails because it is somehow picking up a reference to the acsis project - see the message below from Martin Andrews, who has been trying to help me with this.

I would very much appreciate help with finding where this default to acsis is set, so that it can be over-ridden.

thank you very much!

Lesley

-------- Forwarded Message --------

Subject:	RE: missing pp files
Date:	Thu, 5 Jan 2023 12:58:59 +0000

Hi Lesley,

Jeff and I have pored over the suite files, the output files, and compared with u-bs746 config (which looks to be a previous project-solar version).

The only concrete piece of information seems to lie within the job-activity.log file relating to the failed postproc task. It looks as though project acsis is being picked up from somewhere as part of the queue submission and this is stopping the postproc task from either running or writing log output. Below is the path to the file, and the error contained within.

/home/d03/lgray/cylc-run/u-ct078/log/job/20080901T0000Z/postproc_r001i1p00000/01/job-activity.log
(xcs-c) 2023-01-04T13:23:23Z [STDERR] qsub: error: [PBSZeroProject] ‘acsis’ has no share allocation in ‘collaboration’ trustzone

We searched everywhere in the suites to see where acsis might be picked up, but found no trail leading to it.

grenville · 5 January 2023 15:17

Hi Lesley

In /home/d03/lgray/cylc-run/u-ct078/log/job/20080901T0000Z/postproc_r001i1p00000/01/job, it says:

#PBS -N postproc_r001i1p00000.20080901T0000Z.u-ct078
#PBS -o cylc-run/u-ct078/log/job/20080901T0000Z/postproc_r001i1p00000/01/job.out
#PBS -e cylc-run/u-ct078/log/job/20080901T0000Z/postproc_r001i1p00000/01/job.err
#PBS -W umask=0022
#PBS -l ncpus=1
#PBS -q shared
#PBS -l walltime=03:00:00

I think it needs an additional directive

#PBS -P step

Please could you add that line in the job file and then
qsub job

that should be quick & we’ll know if I’m barking up the wrong tree soon enough. If that does work, we can modify the suite.
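For anyone repeating this test, the manual edit can be scripted. The following is a minimal sketch, not part of cylc or PBS: add_pbs_project is a hypothetical helper that inserts "#PBS -P <project>" after the last existing #PBS line, and the demo file reuses a few of the directives quoted above. It does nothing if a -P directive is already present or if the file has no #PBS lines at all.

```shell
#!/bin/bash
# Hypothetical helper: add "#PBS -P <project>" after the last #PBS line of a
# job script, unless the script already carries a -P directive.
add_pbs_project() {
    local job_file=$1 project=$2 tmp
    if grep -q '^#PBS -P ' "$job_file"; then
        return 0   # already has a project directive; leave it alone
    fi
    tmp=$(mktemp)
    awk -v proj="$project" '
        { lines[NR] = $0 }
        /^#PBS / { last = NR }            # remember the last directive line
        END {
            for (i = 1; i <= NR; i++) {
                print lines[i]
                if (i == last) print "#PBS -P " proj
            }
        }
    ' "$job_file" > "$tmp" && cat "$tmp" > "$job_file"   # keep file permissions
    rm -f "$tmp"
}

# Demonstration on a throwaway file, not the real job script.
demo_job=$(mktemp)
cat > "$demo_job" <<'EOF'
#!/bin/bash
#PBS -l ncpus=1
#PBS -q shared
#PBS -l walltime=03:00:00
echo "job body"
EOF
add_pbs_project "$demo_job" step
cat "$demo_job"
```

On the real system the equivalent would be to run add_pbs_project job step in the job directory and then qsub job, as described above.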

Grenville

dcase · 5 January 2023 15:25

Lesley,

I’m glad that you got going - I think that you have the same problem as before, in that there’s no -P argument for postproc. I guess that you could do what you did before and change the suite?
If you wanted to go to the root, I think that you would look at site/MONSooN.rc (or meto_cray if that is the one you use) and change this. For Monsoon there is an HPC family from which most things are inherited, and the -P is protected with {% if not USE_DEFAULT_ACCOUNT %}. As a guess I would just write something here, such as -P=step, and not bother with this extra variable, and hopefully that’ll do it.
As I said on the other ticket, I guess that acsis appears because that is a group that you are a member of. You are also a member of step (groups lgray lists both) - if the suite isn’t using this, then maybe PBS does?

These are just from eyeballing the suite - if you want to keep all the variables and the control for other users, then perhaps it’ll take more than guesswork. If you just want to get it going for yourself then perhaps just change the -P directive in the appropriate file.

lesleygray · 5 January 2023 15:52

Hi Grenville

I did that, and got this:

lgray@xcslc0:~/cylc-run/u-ct078/log/job/20080901T0000Z/postproc_r001i1p00000/01> vi job
lgray@xcslc0:~/cylc-run/u-ct078/log/job/20080901T0000Z/postproc_r001i1p00000/01> qsub job
4850653.xcs00

Was this what you were hoping to see?

Lesley

lesleygray · 5 January 2023 16:13

Hi
Shall I try going into roses/u-ct078/site/MONSooN.rc and inserting -P step, as Dave suggests? I would normally clear out the cylc-run directories and re-submit the suite, to see if this works, but I don’t want to do this if you are still looking at the cylc-run files (so I’ll hold off doing this until I hear from you).
thanks
Lesley

grenville · 5 January 2023 16:41

Hi Lesley

Getting 4850653.xcs00 indicates that the job was at least submitted (it appeared to work OK).

We may have been fooled by what was going on in this suite. If you are happy to keep going then, in MONSooN.rc, change

      {% if not USE_DEFAULT_ACCOUNT %}
        -P={{ACCOUNT_USR}}
      {% endif %}

to

      #    {% if not USE_DEFAULT_ACCOUNT %}
        -P=step
      #    {% endif %}

This should set the PBS directive in all jobs that inherit HPC.

Then run rose suite-run --reload and retrigger the failed tasks (maybe just one to start).

I hope there are no more side effects.

Grenville

lesleygray · 5 January 2023 17:17

Hi Grenville
Yes, of course - I’ll try that and let you know how it got on in the morning.
thanks so much for all your help
Lesley

lesleygray · 6 January 2023 09:39

Hi Grenville
I made the amendment you suggested in MONSooN.rc. I cleaned out cylc-run and submitted the suite from scratch. The suite has moved forward, insofar as I now get some pp files appearing for the first 2 months of the first ensemble member, but it doesn’t continue with the rest of the months (it restarts at 2-monthly intervals) and it doesn’t produce anything for the second ensemble member (I reduced the suite to only run 2 ensemble members - usually I’d do 50).

If I do a grep -r acsis . in the log/job directory then I’m still seeing the same error message, so somehow it is still trying to use acsis.
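A slightly more targeted version of that search can show which tasks are affected. The sketch below is an illustration only: scan_job_logs is a hypothetical helper, the directory layout is assumed from the paths quoted in this thread, the task names in the demo tree are made up, and the --include option assumes GNU grep. It lists the job-activity.log files that contain the PBSZeroProject error, then the generated job scripts that carry no #PBS -P directive.

```shell
#!/bin/bash
# Hypothetical helper: report which submissions hit the project error and
# which generated job scripts lack a "#PBS -P" directive.
scan_job_logs() {
    local log_root=$1
    echo "== job-activity.log files mentioning PBSZeroProject =="
    grep -rl --include='job-activity.log' 'PBSZeroProject' "$log_root"
    echo "== job scripts with no #PBS -P directive =="
    # grep -L prints the names of files that do NOT contain the pattern
    find "$log_root" -type f -name job -exec grep -L '^#PBS -P ' {} +
}

# Demonstration on a throwaway tree; postproc_demo and atmos_demo are made up.
demo_root=$(mktemp -d)
cycle_dir="$demo_root/log/job/20080901T0000Z"
mkdir -p "$cycle_dir/postproc_demo/01" "$cycle_dir/atmos_demo/01"
printf '#PBS -q shared\n' > "$cycle_dir/postproc_demo/01/job"
printf '#PBS -q shared\n#PBS -P step\n' > "$cycle_dir/atmos_demo/01/job"
printf "[STDERR] qsub: error: [PBSZeroProject] 'acsis' has no share allocation\n" \
    > "$cycle_dir/postproc_demo/01/job-activity.log"
scan_job_logs "$demo_root/log/job"
```

Any task listed in the second section is being submitted without an explicit project, so PBS is left to choose one.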

Is there some way that I can be removed from the acsis project, so it defaults to step? (The acsis project has presumably been de-activated, but not completely removed from the system.)

I also received this message from the Monsoon support team last night, which suggests the same fix as yours (I think), although I didn’t understand the bit about $HOME/.profile because I don’t have a .profile (and for some reason I also can’t access the help page he points to).

thanks for your help
Lesley

Hi Lesley,
There is a section on setting which project to charge suite time and where to write data in the Monsoon User Guide, see https://code.metoffice.gov.uk/doc/monsoon2/rose.html#rose-basics

I think setting a default value for $DATADIR in your $HOME/.profile will change any defaults, but you may want to explicitly put a “-P = step” line into your suite.rc file.

I hope that helps.
Kind Regards,
Roger
Roger Milton | Monsoon Technical Lead | Tel: +44 (0)330 135 2241

grenville · 6 January 2023 10:47

Hi Lesley

I am unsure why the inheritance didn’t work as expected. Please also add the -P directive in MONSooN.rc here:

  {% if RUN_PP %}
    [[POSTPROC_RESOURCE]]
        inherit = HPC_SERIAL, RETRIES
        pre-script = module load moose-client-wrapper python/v2.7.9
        retry delays = 2*PT10S, 2*PT5M, 2*PT30M, 2*PT1H, PT3H, 3*PT6H
        {% if mooproject is defined %}
        script = {{TASK_RUN_COMMAND}} --define="[namelist:suitegen]mooproject={{MOOPROJECT}}"
        {% endif %}
        [[[directives]]]
            -l walltime=03:00:00
            -P=step
    {% endif %}

I’ll try harder to understand what happened.

Grenville

lesleygray · 6 January 2023 12:42

Hi Grenville

Thanks - I’ve tried this, but the suite doesn’t seem to like it.

I’ve done a ‘diff’ with the MONSooN.rc from a previous successful suite (u-bs746) to make sure I’ve not changed anything else by mistake, but I only see the 2 changes you’ve suggested.

Lesley

lesleygray · 6 January 2023 12:48

Here’s a better screenshot of the error message:

lesleygray · 6 January 2023 12:51

And with the log showing:

grenville · 6 January 2023 13:00

Hi Lesley

missing a “=”

-P=step

Grenville

dcase · 6 January 2023 13:02

In your MONSooN.rc you have the correct -P=step in the HPC family, but then you’ve put it again without the = in ATMOS_RESOURCE.

You can inherit from HPC, or just add the = (as Grenville has just said).
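This kind of slip is easy to check for before reloading the suite. The following is a rough sketch under stated assumptions: find_bad_project_directives is a hypothetical helper, and the demo file is a made-up fragment that only borrows the HPC and ATMOS_RESOURCE family names from this thread. It flags lines that start with -P followed by a space and a value, while leaving -P=step and -P = step alone.

```shell
#!/bin/bash
# Hypothetical helper: print (with line numbers) any "-P value" line that is
# missing the "=" between the directive name and its value.
find_bad_project_directives() {
    grep -nE '^[[:space:]]*-P[[:space:]]+[^=[:space:]]' "$1"
}

# Demonstration on a made-up fragment, not the real site file.
demo_rc=$(mktemp)
cat > "$demo_rc" <<'EOF'
    [[HPC]]
        [[[directives]]]
            -P=step
    [[ATMOS_RESOURCE]]
        [[[directives]]]
            -P step
EOF
find_bad_project_directives "$demo_rc"
```

Run against the real file, an empty result would suggest that every -P line at least has the name = value form.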

Also, if you want your Unix groups changed so that you aren’t in acsis, the Monsoon Unix admins can do this for you, I’m sure, but you’d have to ask them specifically.

lesleygray · 6 January 2023 13:16

Ooopps, thank you - corrected, and now running. I’ll let you know how it gets on …
thanks
Lesley

lesleygray · 6 January 2023 15:59

Hi Grenville

Sorry if I have made another mistake, but the model is not building properly - in /cylc-run/u-ct078/share/fcm_make_um I get the following error message:

[FAIL] ftn -oo/rcf_create_dump_mod.o -c -I./include -s default64 -e m -J ./include -I/projects/um1/gcom/gcom5.3/meto_xc40_cray_mpp/build/include -I/projects/um1/grib_api/cce-8.3.4/1.13.0/include -O2 -Ovector1 -hfp0 -hflex_mp=strict -h omp /home/d03/lgray/cylc-run/u-ct078/share/fcm_make_um/preprocess-recon/src/um/src/utility/qxreconf/rcf_create_dump_mod.F90 # rc=1
[FAIL] ftn-2136 crayftn: ERROR in command line
[FAIL] Unable to obtain a Cray Compiling Environment License.
[FAIL] compile 35.8 ! rcf_create_dump_mod.o ← um/src/utility/qxreconf/rcf_create_dump_mod.F90
[info] 6 worker processes destroyed
[info] compile targets: modified=453, unchanged=0, failed=1, total-time=208.1s
[info] compile+ targets: modified=400, unchanged=0, failed=0, total-time=0.7s
[info] install targets: modified=2, unchanged=0, failed=0, total-time=0.0s
[info] TOTAL targets: modified=855, unchanged=0, failed=2, elapsed-time=75.2s
[FAIL] ! RCF_CREATE_DUMP_MOD.mod: depends on failed target: rcf_create_dump_mod.o
[FAIL] ! rcf_create_dump_mod.o: update task failed
[FAIL] make 2 build-recon # 82.6s
[FAIL] make 2 # 1010.7s

I’m not sure if I should just wait and hope that it sorts itself out … sometimes I see red markers by the build processes (in the cylc GUI) but it still proceeds just fine after a while. I’m not sufficiently expert to know whether this is a showstopper or not …

grenville · 6 January 2023 16:58

Hi Lesley

This problem is a Monsoon issue:

[FAIL] Unable to obtain a Cray Compiling Environment License.

There is a finite number of licences - this should go away if you retry.
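Within the suite, retrying usually just means retriggering the failed build task. For a command run by hand, the same idea can be sketched as a small retry loop. This is an illustration only: retry and flaky_build are hypothetical names, and flaky_build merely simulates a command that fails twice before succeeding.

```shell
#!/bin/bash
# Hypothetical helper: run a command up to max_attempts times, sleeping
# delay seconds between failed attempts.
retry() {
    local max_attempts=$1 delay=$2
    shift 2
    local attempt=1
    until "$@"; do
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempts: $*" >&2
            return 1
        fi
        echo "attempt $attempt failed; retrying in ${delay}s" >&2
        sleep "$delay"
        attempt=$((attempt + 1))
    done
}

# Stand-in for a build step that fails while no licence is free.
count_file=$(mktemp)
echo 0 > "$count_file"
flaky_build() {
    local n=$(( $(cat "$count_file") + 1 ))
    echo "$n" > "$count_file"
    [ "$n" -ge 3 ]   # fails on the first two calls, succeeds on the third
}

retry 5 0 flaky_build && echo "succeeded after $(cat "$count_file") attempts"
```

A real delay of a few minutes would be more sensible than 0, since a licence only frees up when another compilation finishes.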

Grenville

lesleygray · 8 January 2023 21:28

Dear Grenville, Dave, Ros

Thank you so much for all your help - the suite has successfully run, after a few more hitches that I was able to fix.

I now have output from 30 ensemble-members to analyse. I just need to remind myself what science I was trying to do!

Thanks again - I hope not to have to bother you again any time soon …
Lesley
