Updating Suite u-cp976

Hi Ayesha
Excellent! Thank you!
Patrick

Hi Patrick,

I’ve tried running u-cp976 with the pre-built jules.exe, and I get this error when the spinup fails:


[WARN] file:jules_deposition.nml: skip missing optional source: namelist:jules_deposition_species(:)
[WARN] file:ancillaries.nml: skip missing optional source: namelist:jules_flake
[WARN] file:ancillaries.nml: skip missing optional source: namelist:urban_properties
[WARN] file:imogen.nml: skip missing optional source: namelist:imogen_run_list
[WARN] file:cable_surface_types.nml: skip missing optional source: namelist:cable_surface_types
[FAIL] $RUN_JULES # return-code=139
2022-10-24T11:52:20Z CRITICAL - failed/EXIT

Does the RUN_JULES error mean that there is a problem with jules.exe, or is it something else?

Thanks,

Ayesha

Hi Ayesha
I am not sure whether the problem is with the JULES pre-built executable or with something else. (A return code of 139 usually means the executable crashed, e.g. with a segmentation fault.) Have you tried running the jules.exe pre-build from the command line? If it runs OK, or at least gives less cryptic errors, then the problem could be with your suite.

Are you trying to run the pre-build on an INTEL node of JASMIN or an AMD node? Did you compile it on the same type of node?
Patrick

Hi Patrick,
How do I run jules.exe from the command line?
I’m on Monsoon, so I’m not sure what node I’m on, sorry. Do you know how I can find out? I used this to make jules.exe:
export JULES_PLATFORM=meto-xc40-cce
export JULES_BUILD=fast
export JULES_NETCDF=netcdf

Thanks,
Ayesha

Hi Ayesha
To run the jules executable from the command line, first locate the jules.exe pre-build that you compiled. Then change to that directory, and type:
./jules.exe

That should give some error message since it isn’t pointing to the JULES namelists, but at least it will show that it is properly compiled. If that doesn’t work, it’s possible that on Monsoon, you need to load some modules before running jules.exe from the command line.

If it does start running properly, then you can also look at your job.out and job.err log files for more detailed error messages from when you run JULES with Rose/Cylc.

I had forgotten that you are working on Monsoon. So disregard my comments about running on the same processor-type as you compile on. Those comments were for the JASMIN platform.
Patrick

Hi Patrick,

Yeah, it looks like jules.exe is the issue. I tried loading the modules that I loaded before to build it, but it still didn’t work:

ahussain@xcslc0:~/MODELS/vn6.1_jules/build/bin> ls
jules.exe rose-jules-run
ahussain@xcslc0:~/MODELS/vn6.1_jules/build/bin> ./jules.exe
[Tue Oct 25 08:51:13 2022] [unknown] Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(506):
MPID_Init(192)…: channel initialization failed
MPID_Init(569)…: PMI2 init failed: 1
ahussain@xcslc0:~/MODELS/vn6.1_jules/build/bin> module swap PrgEnv-cray PrgEnv-cray/5.2.82
ahussain@xcslc0:~/MODELS/vn6.1_jules/build/bin> module load cray-netcdf-hdf5parallel/4.3.2
ahussain@xcslc0:~/MODELS/vn6.1_jules/build/bin> module load cray-snplauncher/7.0.4
ahussain@xcslc0:~/MODELS/vn6.1_jules/build/bin> ./jules.exe
[Tue Oct 25 08:52:19 2022] [unknown] Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(506):
MPID_Init(192)…: channel initialization failed
MPID_Init(569)…: PMI2 init failed: 1

Do you know which modules I should load to get jules.exe to work?

Thanks,

Ayesha

Hi Ayesha:
Since your pre-build is MPI-capable, to test whether it is working you might have to run it with something like mpirun jules.exe, to check whether it is at all close to being compiled properly. There might be more settings to try in order to get mpirun to work properly.
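A quick sketch of what that test could look like, assuming a single-task launch works on your system (the -n 1 form, and the aprun alternative for Cray XC40 machines like Monsoon, are untested assumptions here):

```shell
cd ~/MODELS/vn6.1_jules/build/bin
mpirun -n 1 ./jules.exe    # generic MPI launcher; with no namelists, a JULES I/O error is expected
# on Cray XC40 systems the native launcher is often used instead:
# aprun -n 1 ./jules.exe
```

If either launcher gets as far as a JULES complaint about missing namelists, the executable itself is probably compiled properly.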

It could be much easier for both of us, and for others in the future, to take the build capability of u-al752 (if you already have that suite working on Monsoon) and put it in your new suites. This is the third time I have suggested this.
Patrick

Hi Patrick,

Sorry about all of this, I really appreciate you helping me! The main reason why I want to do this myself is that eventually I’ll have development versions of JULES to run through the suite, which I’ll have compiled elsewhere on my system while making the code changes.

I’ll try the u-al752 method now.

Thanks,

Ayesha

Hi Ayesha:
No need to be sorry. I am glad to try to help. And I am glad that you’re now trying to integrate the build process in the suite.

You can use the suite to do the builds of the development versions of JULES, instead of building them elsewhere. You can also have Boolean flags called BUILD and RUN in your rose-suite.conf file, so that the suite knows whether you want to build/make the JULES executable and whether you want to run JULES.

Then, you can run the suite during code development with BUILD=true and RUN=false. Next, when you want to run JULES after a successful build, you can save a bit of time by running the suite with BUILD=false and RUN=true. Then later, if you or anyone else wants to run the suite in full, they can just set BUILD=true and RUN=true.

The suite u-al752 doesn’t have such Boolean flags for BUILD and RUN, but the unrelated suite u-bb316 does have the Boolean flag for BUILD, if you’d like to look at how BUILD is used in the suite.rc file for that suite.
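A minimal sketch of how such flags might be wired up, assuming Jinja2 templating in suite.rc (the task names fcm_make and jules are illustrative here, not copied from u-bb316):

```
# in rose-suite.conf:
#   BUILD=true
#   RUN=true
# in suite.rc, [scheduling] could then branch on the flags:
[scheduling]
    [[dependencies]]
        graph = """
{%- if BUILD and RUN %}
            fcm_make => jules
{%- elif BUILD %}
            fcm_make
{%- elif RUN %}
            jules
{%- endif %}
        """
```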
Patrick

Hi Ayesha:
Also, to test out new JULES Fortran code with a suite like u-al752, all one really needs to do is point the suite to the new Fortran code, by changing the rose-suite.conf file from:

JULES_FCM='fcm:jules.x_tr'
JULES_REVISION='21512'
RUNID='JULES_vn6.2'

to something like:

#this would point to your new branch that you have checked in to MOSRS:
JULES_FCM='fcm:jules.x_br/dev/patrickmcguire/vn6.2_branch'
#this would be the appropriate revision number for this hypothetical branch:
JULES_REVISION='24113' 
RUNID='JULES_vn6.2_branch'

If you want to just practice building the JULES code with the u-al752 suite, you can start the suite running, and then put a hold on the JULES app, so that it doesn’t run after the fcm_make task builds JULES.
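With Cylc 7 (as used by these suites), one way to do that hold from the command line is roughly as follows (the task glob jules* is an assumption about the task names in u-al752):

```shell
rose suite-run                      # start the suite
cylc hold u-al752 'jules*'          # hold the jules tasks before fcm_make finishes
# ...fcm_make builds JULES; the held jules tasks will not run...
cylc release u-al752 'jules*'       # release them later if you do want them to run
```

The same hold can also be applied by right-clicking the tasks in the Cylc GUI.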

Patrick

Hi Patrick,

Thank you, that makes sense! I copied over the fcm_make folder from u-al752 (so the rose-app.conf and fcm-make.cfg).

I also changed rose-suite.conf so it is now:
MAKE_PLOTS=True
SKIP_SPINUP=True
L_TOP=True
L_IRRIG_DMD=False
L_TRIFFID_N=True
LOCATION='MONSOON'
#JULES_EXE='/home/d03/ahussain/MODELS/vn6.2_jules/build/bin/jules.exe'
BUILD=True
JULES_FCM='fcm:jules.xm_tr'
JULES_REVISION='21512'

I saw that there’s information on fcm_make in site/suite.rc.MONSOON, which I point to in the rose-suite.conf file. I’m not sure whether I’m already using this or not, though.

I added this to suite.rc in the [runtime] area:

[[fcm_make]]
inherit = None, FCM_MAKE_{{ LOCATION }}

And I changed the [scheduling] section to this:

[[dependencies]]
    graph = """
{%- if BUILD %}
    JULES_REVISION = {{ JULES_REVISION }}
    JULES_FCM = {{ JULES_FCM }}
    {%- if JULES_REVISION == '' %}
        AT_JULES_REVISION = ''
    {%- else %}
        AT_JULES_REVISION= @{{ JULES_REVISION }}
    {%- endif %}
    {%- if not SKIP_SPINUP %}
        fcm_make => SPINUP_RUNS
        SPINUP_RUNS:finish-all & SPINUP_RUNS:succeed-all => MAIN_RUNS
    {%- else %}
        fcm_make => MAIN_RUNS
    {%- endif %}
{%- else %}
    {%- if not SKIP_SPINUP %}
        SPINUP_RUNS
        SPINUP_RUNS:finish-all & SPINUP_RUNS:succeed-all => MAIN_RUNS
    {%- else %}
        MAIN_RUNS
    {%- endif %}
{%- endif %}
{%- if MAKE_PLOTS %}
    MAIN_RUNS:finish-all & MAIN_RUNS:succeed-all => make_plots
{%- endif %}
    """

and changed the [[jules]] part to this:
[[jules]]
inherit = None, JULES_{{ LOCATION }}
{%- if BUILD %}
script = "rose task-run --path= --path=share/fcm_make/build/bin"
{%- else %}
script = "rose task-run --path= --path=$(dirname {{JULES_EXE}})"
{%- endif %}

I kept everything else the same (sorry if this is confusing to read, I wasn’t sure how best to show you this).

I then tried running the suite and I got this:
ahussain@xcslc0:~/roses/u-cp976> rose suite-run --new
[INFO] export CYLC_VERSION=7.8.12
[INFO] export ROSE_ORIG_HOST=xcslc0
[INFO] export ROSE_SITE=
[INFO] export ROSE_VERSION=2019.01.7
[INFO] delete: localhost:cylc-run/u-cp976
[INFO] symlink: /projects/intimp/ahussain/cylc-run/u-cp976 <= /home/d03/ahussain/cylc-run/u-cp976
[INFO] create: log.20221026T145820Z
[INFO] delete: log
[INFO] symlink: log.20221026T145820Z <= log
[INFO] log.20221026T144451Z.tar.gz <= log.20221026T144451Z
[INFO] delete: log.20221026T144451Z/
[INFO] create: log/suite
[INFO] create: log/rose-conf
[INFO] symlink: rose-conf/20221026T145820-run.conf <= log/rose-suite-run.conf
[INFO] symlink: rose-conf/20221026T145820-run.version <= log/rose-suite-run.version
[INFO] install: site/suite.rc.MONSOON
[INFO] source: https://code.metoffice.gov.uk/svn/roses-u/a/l/7/5/2/trunk/site/suite.rc.MONSOON@207939
[INFO] REGISTERED u-cp976 → /home/d03/ahussain/cylc-run/u-cp976
[FAIL] cylc validate -o /working/d03/ahussain/jtmp/tmp.qDsXNSyrhA/tmpXpcWD7 --strict u-cp976 # return-code=1, stderr=
[FAIL] WARNING - deprecated items were automatically upgraded in ‘suite definition’:
[FAIL] WARNING - * (6.4.0) [runtime][make_plots][command scripting] → [runtime][make_plots][script] - value unchanged
[FAIL] WARNING - naked dummy tasks detected (no entry under [runtime]):
[FAIL] + 21512
[FAIL] + AT_JULES_REVISION
[FAIL] + JULES_FCM
[FAIL] + JULES_REVISION
[FAIL] + xm_tr
[FAIL] + fcm
[FAIL] ERROR: strict validation fails naked dummy tasks

Do you know why it is failing? I don’t have much confidence in my changes, so please let me know what I forgot to add.

Thanks,

Ayesha

Hi Ayesha:
That sounds like good progress! I suspect you might have a typo or something in your rose-suite.conf file or in your suite.rc file. This is based upon your error messages that say:

[FAIL] + 21512
[FAIL] + AT_JULES_REVISION

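My guess, from those naked-dummy-task names (21512, AT_JULES_REVISION, fcm, xm_tr match the JULES_REVISION and JULES_FCM assignments in your [scheduling] section): everything inside the graph = """…""" string is parsed by Cylc as task names, so variable assignments cannot live there. In Jinja2 they would sit outside the graph, sketched roughly as (MAIN_RUNS here stands in for your real dependencies):

```
{# at the top of suite.rc, before [scheduling] #}
{% if JULES_REVISION == '' %}
    {% set AT_JULES_REVISION = '' %}
{% else %}
    {% set AT_JULES_REVISION = '@' ~ JULES_REVISION %}
{% endif %}

[scheduling]
    [[dependencies]]
        graph = """
            fcm_make => MAIN_RUNS
        """
```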
I don’t have access to Monsoon currently. It might be a couple of weeks before I will have Monsoon access again. Can you check your changes to your suite into the MOSRS archive? Then I could look at them. If you could do your work on JASMIN instead of Monsoon, I could help you much more.
Patrick

Hi Patrick,

I’ve committed my changes: https://code.metoffice.gov.uk/trac/roses-u/browser/c/p/9/7/6/trunk/

I’ve also tried running this suite on JASMIN and I’m getting the same FAIL messages as I do on Monsoon. So I think it’s successfully ported but I’ll keep looking at it.

Thanks,

Ayesha

Hi Ayesha:
I updated your u-cp976 suite. The new update is u-cr679.

It was missing a site/suite.rc.MONSOON or site/suite.rc.CEDA_JASMIN; I copied the one from u-al752 for site/suite.rc.CEDA_JASMIN.

Also, the [scheduling] section was a bit complicated for me to quickly understand, so I replaced the [scheduling] section with what was in u-al752. I had to put the JULES string in lower case (jules) in the [scheduling] section, in order to match the [[jules]] section of [runtime]. Otherwise, it complained when I did rose suite-run.

For JASMIN, we also need to use JULES_FCM='fcm:jules.x_tr' instead of JULES_FCM='fcm:jules.xm_tr'.
You can see this change mentioned in Step #10 of:

It looks like it’s building now on JASMIN. The fcm_make app is running. And there are some jules apps and the make_plots app that are waiting to run. It still might need further fixing, but maybe one step has been taken forward.

Patrick

Hi Ayesha:
How is it working now?
Patrick

Hi, it ran perfectly when build=true and skip_spinup=false, but it didn’t work when build=false. I think it was something to do with scheduling, so I’m trying it again now (it has said ‘submitted’ for a while, so I think there’s a queue this morning). Thank you so much for helping out!!! I really appreciate it.

Hi Ayesha:
I am glad it works with build=true. Are you running it first with build=true, before trying to modify the suite and then running it with build=false?

Also, once fcm_make successfully runs and the jules app perhaps fails somewhere, you can just make changes to the jules app, run rose suite-run --reload (and maybe rose sgc if there is no GUI open), and then retrigger the jules app without retriggering the successful fcm_make app.
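In Cylc 7 command form, that workflow is roughly as follows (the task ID jules.1 is only an example; use the name and cycle point of your actual failed task):

```shell
# after editing the jules app under ~/roses/u-cp976:
rose suite-run --reload            # push the changed configuration to the running suite
rose sgc                           # reopen the Cylc GUI if it is not open
cylc trigger u-cp976 'jules.1'     # or retrigger the failed task from the GUI
```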

Patrick
