Fcm_make_pp job issue on Monsoon using Cylc8

I’m trying to run a GAL9 job using vn13.7 on Monsoon and I’m running into issues with the fcm_make_pp job.

[FAIL] mirror.target = : incorrect value in declaration
[FAIL] config-file=/scratch/d00/rwaters/cylc-run/u-dm191/run1/work/19880901T0000Z/fcm_make_pp/fcm-make.cfg:4
[FAIL] config-file= - file:///home/d04/fcm/srv/svn/moci.xm/main/trunk/Postprocessing/fcm_make/postproc.cfg@4419:12
[FAIL] config-file= -  - file:///home/d04/fcm/srv/svn/moci.xm/main/trunk/Postprocessing/fcm_make/inc/remote.cfg@4419:6

[FAIL] fcm make -f /scratch/d00/rwaters/cylc-run/u-dm191/run1/work/19880901T0000Z/fcm_make_pp/fcm-make.cfg -C /home/d00/rwaters/cylc-run/u-dm191/run1/share/fcm_make_pp -j 4 # return-code=9
2025-01-07T10:07:05Z CRITICAL - failed/ERR

Here is the ./app/fcm_make_pp/rose-app.conf

meta=archive_and_meaning/fcm_make/postproc_2.4

[env]
config_base=fcm:moci.xm_tr
config_rev=@postproc_2.4
extract=extract
install=build
install_host=remote.cfg
model_config=um-atmos.cfg
pp_rev=postproc_2.4
pp_sources=

and the ./app/fcm_make_pp/file/fcm-make.cfg

$config_base{?}=fcm:moci.xm_tr
$config_rev{?}=

include = $config_base/Postprocessing/fcm_make/postproc.cfg$config_rev
extract.location{diff}[pp] = $pp_sources

For reference the suite is u-dm191.

Any help would be appreciated.

Thanks in advance.

Hi Rob,

It looks like the run copy (~/cylc-run/) of u-dm191 has been removed, so I am unable to check the runtime settings. However, I suspect the issue is at this line in Postprocessing/fcm_make/inc/remote.cfg:

mirror.target = ${ROSE_TASK_MIRROR_TARGET}

and cylc-8 might not be exporting this value (this will have to be verified in the actual job script).
I am not sure what the value of this variable is in cylc-8, but I will check in cylc-7 suites and see whether exporting it explicitly in the [environment] settings helps.

Just to add: I doubt that the original developer of the GALx suites has access to Monsoon, so even though the option is available, the suites are not regularly tested on other systems and rely on users to feed back any portability changes.

Hi Mohit,

Thanks for your help.

I’ve just re-run the suite, so the cylc-run contents should be available now.

Where exactly is the runtime setting I’m looking for?

I didn’t realise that the GAL9 team didn’t have access to Monsoon. I’ve been providing feedback to Paul Earnshaw as I’ve had to make a few other changes to get it working on Monsoon.

Thanks again,

Rob

Hi Rob,

The runtime settings (and the environment inherited by the job) can be seen in ~/cylc-run/suite-id/runX/log/job/date-time/task-name/job and job.out. In this case:

~rwaters/cylc-run/u-dm191/run1/log/job/19880901T0000Z/fcm_make_pp/NN/job.out

and there is no ROSE_TASK_MIRROR_TARGET setting exported.
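
To make this check repeatable, a small shell helper can look for the "[INFO] export NAME=..." lines that appear in the job.out logs quoted in this thread. This is only a sketch, assuming a POSIX shell and grep; the function name exported_in_job_out is hypothetical and not part of Rose or Cylc.

```shell
# Hypothetical helper: succeed if FILE has a "[INFO] export NAME=" line.
exported_in_job_out() {
    file=$1
    name=$2
    # Anchor at line start so NAME is not matched inside a longer variable
    # name (e.g. MIRROR_TARGET inside ROSE_TASK_MIRROR_TARGET).
    grep -q "^\[INFO] export ${name}=" "$file"
}

# Example call (path pattern taken from the thread; NN is the latest attempt):
# exported_in_job_out \
#     ~/cylc-run/u-dm191/run1/log/job/19880901T0000Z/fcm_make_pp/NN/job.out \
#     ROSE_TASK_MIRROR_TARGET && echo "exported" || echo "not exported"
```

Running the same check against a job.out from a suite where the task works gives a quick side-by-side comparison.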
Surprisingly, this setting is exported in the cylc-8 suites I have on our internal HPC, so I am not sure what is going on here. Note that the Monsoon set-up is unique in that the ‘launch’ and ‘execute’ systems are the same (whereas, for example, we have Puma for launch and ARCHER2 for execution), so I wonder if additional settings are required here.

In site/monsoon.cylc, under the EXTRACT_RESOURCE block can you try replacing the

platform = xcsc

with (my Monsoon cylc-7 definition)

[[[remote]]]
host = $ROSE_ORIG_HOST

Thanks

Ah, interesting. I removed the use of $ROSE_ORIG_HOST because I was getting an undefined platform error.

When I use your suggestion I first get the following warning:

WARNING - deprecated settings found (please replace with [runtime][EXTRACT_RESOURCE]platform):
    [runtime][EXTRACT_RESOURCE][remote]host = $ROSE_ORIG_HOST

Then cylc fails to play the workflow

    message: A mixture of Cylc 7 (host) and Cylc 8 (platform) logic should not be used. In this case for the task "19880901T0000Z/fcm_make_pp" the following are not compatible:
    workflow: u-dm191/run1
    host: xcslc0
    port: 43132
    owner: rwaters

If I put $ROSE_ORIG_HOST back as the platform (what it was originally), the job fails to submit with the following message:

rwaters@xcslc0:~/cylc-run/u-dm191/run2/log/job/19880901T0000Z/fcm_make_pp/01> cat job-activity.log
[jobs-submit cmd] (platform not defined)
[jobs-submit ret_code] 1
[jobs-submit err] No matching platform "xcslc0" found
[(('event-mail', 'submission failed'), 1) ret_code] 0

Because the only available platforms are:

rwaters@xcslc0:~/cylc-run/u-dm191/run2/log/job/19880901T0000Z/fcm_make_pp/01> cylc config --platforms
[platforms]
    [[localhost]]
        install target = localhost
        ssh command = ssh -oBatchMode=yes -oConnectTimeout=8 -oStrictHostKeyChecking=no
        copyable environment variables = FCM_VERSION
        submission polling intervals = PT30M
        execution polling intervals = PT30M
        execution time limit polling intervals = PT5M, PT10M
        clean job submission environment = True
    [[xcsc]]
        install target = localhost
        ssh command = ssh -oBatchMode=yes -oConnectTimeout=8 -oStrictHostKeyChecking=no
        copyable environment variables = FCM_VERSION
        submission polling intervals = PT30M
        execution polling intervals = PT30M
        execution time limit polling intervals = PT5M, PT10M
        clean job submission environment = False
        hosts = localhost
        job runner = pbs
        err tailer = qcat -f -e %(job_id)s
        out tailer = qcat -f -o %(job_id)s
        err viewer = qcat -e %(job_id)s
        out viewer = qcat -o %(job_id)s
        job name length maximum = 236
        [[[meta]]]
            description = HPC PBS job

Yes, I suspected the [[[remote]]] section would be incompatible with cylc-8.
In the cylc-8 suite I am running, EXTRACT_RESOURCE has no platform = setting, and in the job.out I see:


[INFO] export config_base=fcm:moci.xm_tr
[INFO] export config_rev=@postproc_2.4
[INFO] export extract=extract
[INFO] export install=build
[INFO] export install_host=remote.cfg
[INFO] export model_config=um-nemocice.cfg
[INFO] export nemo_tools=fcm:nemo.xm/utils/tools_r4.0-HEAD@16016
[INFO] export pp_rev=postproc_2.4
[INFO] export pp_sources=branches/dev/ericaneininger/postproc_2.4_restrict_atmos_process_methods
[INFO] export verify_config=verify.cfg
[INFO] source: $HOME/cylc-run/suite-id/app/fcm_make_pp/file/fcm-make.cfg
[INFO] install: fcm-make.cfg
[INFO] source: $HOME/cylc-run/suite-id/app/fcm_make_pp/file/fcm-make.cfg
[INFO] export ROSE_TASK_MIRROR_TARGET=(hostname):cylc-run/suite-id/share/fcm_make_pp
[INFO] export MIRROR_TARGET=(hostname):cylc-run/suite-id/share/fcm_make_pp
[init] make # 2025-01-06T14:57:22

Is it possible to share the suite ID for your working cylc8 run?

Is ROSE_TASK_MIRROR_TARGET specified anywhere in a .cylc file? I can’t find any reference to it within my suites.

Hi Rob,
The suite is u-dm037, but as mentioned this is not run on Monsoon; I was only trying to relate the site/x settings to what appears in the job.out. It is also a different (UKESM) configuration that has more complicated pp tasks.

ROSE_TASK_MIRROR_TARGET appears to be something added by Cylc in the background, which is why I am trying to find out what settings are needed for it to appear.

Hi Mohit,

Is it perhaps configured for the platforms?

Is there any mention of it when cylc config is run?

Cheers,

Rob

[[fcm_make_pp]]
    inherit = RUN_MAIN, EXTRACT_RESOURCE
    [[[environment]]]
        ROSE_TASK_MIRROR_TARGET = xcs-c:cylc-run/u-dm191/share/fcm_make_pp

Hard-coding the environment variable as above did make fcm_make_pp run, but it now fails at fcm_make2_pp (so frustrating!)

Does anyone have any ideas about the following?

job.error:

[FAIL] no configuration specified or found

[FAIL] fcm make -C /home/d00/rwaters/cylc-run/u-dm191/run1/share/fcm_make_pp -n 2 -j 1 # return-code=2
2025-01-08T11:39:57Z CRITICAL - failed/ERR

job.out:

Workflow : u-dm191/run1
Job : 19880901T0000Z/fcm_make2_pp/01 (try 1)
User@Host: rwaters@shared100

2025-01-08T11:39:54Z INFO - started
[INFO] Configuration: /home/d00/rwaters/cylc-run/u-dm191/run1/app/fcm_make_pp/
[INFO]     file: rose-app.conf
[INFO]     optional key: (monsoon)
[INFO] export PATH=/opt/cray/netcdf/4.3.2/bin:/home/d00/rwaters/cylc-run/u-dm191/run1/share/bin:/home/d00/rwaters/cylc-run/u-dm191/run1/bin:/opt/ukmo/subversion/1.8.19/bin:/opt/python/gnu/2.7.9/bin/:/opt/ukmo/mass/moose-monsoon-client-latest/bin:/opt/cray/mpt/7.0.4/gni/bin:/opt/cray/atp/1.7.5/bin:/opt/cray/rca/1.0.0-2.0502.60530.1.62.ari/bin:/opt/cray/pmi/5.0.5-1.0000.10300.134.8.ari/bin:/opt/cray/craype/2.2.1/bin:/opt/cray/cce/8.3.4/cray-binutils/x86_64-unknown-linux-gnu/bin:/opt/cray/cce/8.3.4/craylibs/x86-64/bin:/opt/cray/cce/8.3.4/cftn/bin:/opt/cray/cce/8.3.4/CC/bin:/opt/cray/llm/default/bin:/opt/cray/llm/default/etc:/opt/cray/xpmem/0.1-2.0502.64982.7.29.ari/bin:/opt/cray/ugni/6.0-1.0502.10863.8.29.ari/bin:/opt/cray/udreg/2.3.2-1.0502.10518.2.17.ari/bin:/opt/cray/lustre-cray_ari_s/2.5_3.0.101_0.46.1_1.0502.8871.45.1-1.0502.21728.75.4/sbin:/opt/cray/lustre-cray_ari_s/2.5_3.0.101_0.46.1_1.0502.8871.45.1-1.0502.21728.75.4/bin:/opt/cray/alps/5.2.5-2.0502.9955.44.1.ari/sbin:/opt/cray/alps/5.2.5-2.0502.9955.44.1.ari/bin:/opt/cray/sdb/1.1-1.0502.63652.4.25.ari/bin:/opt/cray/nodestat/2.2-1.0502.60539.1.31.ari/bin:/opt/modules/3.2.10.3/bin:/usr/local/bin:/usr/bin:/bin:/usr/bin/X11:/usr/X11R6/bin:/usr/games:/usr/lib/mit/bin:/usr/lib/mit/sbin:/opt/pbs/bin:/usr/lib/qt3/bin:/opt/ukmo/supported/bin/:/opt/cray/bin:/opt/ukmo/supported/bin/:/opt/ukmo/supported/bin/:/home/d04/fcm/bin:/opt/ukmo/supported/bin/:/opt/ukmo/supported/bin/:/home/d04/fcm/bin:/opt/ukmo/supported/bin/:/opt/ukmo/supported/bin/:/home/d04/fcm/bin
[INFO] export config_base=fcm:moci.xm_tr
[INFO] export config_rev=@postproc_2.4
[INFO] export extract=extract
[INFO] export install=build
[INFO] export install_host=remote.cfg
[INFO] export model_config=um-atmos.cfg
[INFO] export pp_rev=postproc_2.4
[INFO] export pp_sources=branches/dev/ericaneininger/postproc_2.4_restrict_atmos_process_methods
[INFO] source: /home/d00/rwaters/cylc-run/u-dm191/run1/app/fcm_make_pp/file/fcm-make.cfg
[INFO] install: fcm-make.cfg
[INFO]     source: /home/d00/rwaters/cylc-run/u-dm191/run1/app/fcm_make_pp/file/fcm-make.cfg
[init] make 2              # 2025-01-08T11:39:57Z
[info] FCM 2021.05.0 (/common/fcm/fcm-2021.05.0)
[init] make 2 config-parse # 2025-01-08T11:39:57Z
[FAIL] make 2 config-parse # 0.0s
[FAIL] make 2              # 0.0s
============================= PBS epilogue =============================

file %r not found...
End of Job Report
Run at 2025-01-08 11:39:58 for job 4455987.xcs00
Submitted              : 2025-01-08 11:39:40
Queued                 : 2025-01-08 11:39:40
Started                : 2025-01-08 11:39:52
Completed              : 2025-01-08 11:39:58
Queued Time            : 0:00:12 (12 seconds)
Elapsed Time           : 0:00:06 (6 seconds, 2% of limit)
Walltime Limit         : 0:05:00 (300 seconds)
Node Time Limit        : 0:00:09 (9 seconds)
Node Time              : 0:00:00 (0 seconds, 2% of limit)
Job Name               : fcm_make2_pp.19880901T0000Z.u-dm191-run1
Queue                  : shared
Owner                  : rwaters
Group                  : mo_users
Project                : ukca-cam
Subproject             :
Funding                :
Trustzone              : collaboration
STDOUT                 : /home/d00/rwaters/cylc-run/u-dm191/run1/log/job/19880901T0000Z/fcm_make2_pp/01/job.out
STDERR                 : /home/d00/rwaters/cylc-run/u-dm191/run1/log/job/19880901T0000Z/fcm_make2_pp/01/job.err
Job Directory          : /scratch/jtmp/pbs.4455987.xcs00.x8z
Job Arch               :
CPU Core Type          : broadwell
Total Nodes            : 1
Total Tasks            : 1
Parent Node            : shared100
Parent Node Memory     :
Parent Node CPU Time   :
Compute Nodes          :
Electrical Groups      :
Run Version            : 1

                  CPU  Wallclock  Wallclock RSS Memory     Memory     Memory
Node ID         Count       Used  Requested       Used       Used  Requested   CPU Time
========== ========== ========== ========== ========== ========== ========== ==========
633                 2          0         0%        0.0      49.4M       2.3%        0.0
========== ========== ========== ========== ========== ========== ========== ==========

For more information see documentation

flow.cylc section:

    [[fcm_make2_pp]]
        inherit = RUN_MAIN, PPBUILD_RESOURCE
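
One detail that may be worth checking, although nothing in this thread confirms it is the cause: the hard-coded ROSE_TASK_MIRROR_TARGET points at cylc-run/u-dm191/share/fcm_make_pp, whereas fcm_make2_pp was run with -C /home/d00/rwaters/cylc-run/u-dm191/run1/share/fcm_make_pp, so the mirror step and the second make step may be looking at different directories. A sketch of building the target from the workflow ID instead of hard-coding it, following the (hostname):cylc-run/suite-id/share/fcm_make_pp pattern from the working job.out. The helper name mirror_target is hypothetical, and using CYLC_WORKFLOW_ID (set by Cylc 8 in the job environment, here u-dm191/run1) this way is an assumption.

```shell
# Hypothetical helper: build "<host>:cylc-run/<workflow-id>/share/<dir>".
mirror_target() {
    host=$1
    workflow_id=$2
    share_subdir=$3
    printf '%s:cylc-run/%s/share/%s\n' "$host" "$workflow_id" "$share_subdir"
}

# In a Cylc 8 job, CYLC_WORKFLOW_ID would supply the second argument:
# mirror_target xcs-c "$CYLC_WORKFLOW_ID" fcm_make_pp
mirror_target xcs-c u-dm191/run1 fcm_make_pp
```

With the values from this thread, the final line prints xcs-c:cylc-run/u-dm191/run1/share/fcm_make_pp, which includes the run1 component that the hard-coded value lacks.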