Modification required to run a Monsoon suite on ARCHER2

Hi CMS,

I am trying to modify a suite (which was running on Monsoon) to run on ARCHER2. I am carrying forward the Iodine development work by Ewa (she did all the development on Monsoon, and the rose suite for her job is u-cn823). I have taken a copy of u-cn823; the new suite is u-cp798. How do I modify this suite to run on ARCHER2? I think I have to change the ‘Host Machine’ to ARCHER2 and update some (maybe many!) directories.

Can you please point me to any page that describes how to modify a suite to run on ARCHER2? I think some instructions for converting a Monsoon suite to run on ARCHER2 were available at http://cms.ncas.ac.uk/wiki/Archer2, but it seems this page has been moved.

Regards, Alok

Please see Porting.

Hi Grenville,

Thanks for pointing me to this page. I have followed the instructions and modified the suite (u-cp798), which is a copy of a Monsoon suite. I am unable to submit the job on ARCHER2 and am getting the following error message:

[INFO] REGISTERED u-cp798 -> /home/akpandeyjnu/cylc-run/u-cp798
[FAIL] cylc validate -o /tmp/tmph4sFgn --strict u-cp798 # return-code=1, stderr=
[FAIL] Jinja2Error:
[FAIL] File "/home/akpandeyjnu/cylc-run/u-cp798/site/archer2.rc", line 159, in top-level template code
[FAIL] execution time limit = {{CLOCK}}
[FAIL] UndefinedError: 'AINITIAL' is undefined

Please help me.

Regards, Alok

Dear CMS,
I think the problem is with the start dump file. I modified the archer2.rc file in /home/akpandeyjnu/cylc-run/u-cp798 to point to the start dump ‘/work/n02/n02/alok/ukca_start/ce185a.da20091201_00’, but after running ‘rose suite-run’ the archer2.rc file is automatically restored to the previous version and I still get the same error. Is it defined somewhere in Rose?
Could you also please point me to the directories for CMIP6_AEROCHEM_EMS, CMIP6_BIOG_EMS, CHEM_INIT_FILE and SST_SICE_ANCIL on ARCHER2?

Regards, Alok

Hi Alok,

Any changes you make to the suite must be in the ~/roses/u-cp798 directory. Change the site/archer2.rc file in there. The files under ~/cylc-run are then generated from these.

You should find all the equivalent ancil paths under the central UMDIR on ARCHER2: /work/y07/shared/umshared

e.g. /projects/ancils/cmip6 on Monsoon is /work/y07/shared/umshared/cmip6 on ARCHER2
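
As a rough illustration of the path mapping above, the sketch below rewrites a Monsoon ancil path to its ARCHER2 location. It assumes that everything under /projects/ancils maps one-to-one onto /work/y07/shared/umshared, which generalises from the single cmip6 example, so each translated path should still be checked on ARCHER2. The function name monsoon_to_archer2 is made up for illustration.

```shell
# Hypothetical helper: translate a Monsoon ancil path to ARCHER2.
# Assumption: /projects/ancils/<x> lives at /work/y07/shared/umshared/<x>,
# generalised from the cmip6 example; verify each result with ls on ARCHER2.
monsoon_to_archer2() {
    case $1 in
        /projects/ancils|/projects/ancils/*)
            # Strip the Monsoon prefix and prepend the ARCHER2 UMDIR.
            printf '%s\n' "/work/y07/shared/umshared${1#/projects/ancils}" ;;
        *)
            # Paths outside the ancil tree are returned unchanged.
            printf '%s\n' "$1" ;;
    esac
}

monsoon_to_archer2 /projects/ancils/cmip6   # prints /work/y07/shared/umshared/cmip6
```

Any translated paths then go into ~/roses/u-cp798/site/archer2.rc, not the copy under ~/cylc-run.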

Regards,
Ros.

Hi Ros,

Thank you for pointing me to the ancil path. I have modified the archer2.rc file under /home/akpandeyjnu/roses/u-cp798/site and am now able to submit the job from pumatest, but fcm_make_um fails and the job.err file has the following message:

[FAIL] config-file=/home/akpandeyjnu/cylc-run/u-cp798/work/20091201T0000Z/fcm_make_um/fcm-make.cfg:3
[FAIL] config-file= - svn://pumatest/um.xm_svn/main/branches/dev/simonwilson/vn11.0_archer2_compile/fcm-make/ncas-ex-cce/um-atmos-safe.cfg
[FAIL] svn://pumatest/um.xm_svn/main/branches/dev/simonwilson/vn11.0_archer2_compile/fcm-make/ncas-ex-cce/um-atmos-safe.cfg: cannot load config file
[FAIL] svn://pumatest/um.xm_svn/main/branches/dev/simonwilson/vn11.0_archer2_compile/fcm-make/ncas-ex-cce/um-atmos-safe.cfg: not found
[FAIL] svn: warning: W170000: URL 'svn://pumatest/um.xm_svn/main/branches/dev/simonwilson/vn11.0_archer2_compile/fcm-make/ncas-ex-cce/um-atmos-safe.cfg' non-existent in revision 112491
[FAIL]
[FAIL] svn: E200009: Could not display info for all targets because some targets don't exist

[FAIL] fcm make -f /home/akpandeyjnu/cylc-run/u-cp798/work/20091201T0000Z/fcm_make_um/fcm-make.cfg -C /home/akpandeyjnu/cylc-run/u-cp798/share/fcm_make_um -j 4 mirror.target=login.archer2.ac.uk:cylc-run/u-cp798/share/fcm_make_um mirror.prop{config-file.name}=2 # return-code=1
2022-09-16T15:19:32Z CRITICAL - failed/EXIT

The job-activity.log has the following information:
[jobs-submit ret_code] 0
[jobs-submit out] 2022-09-16T15:19:30Z|20091201T0000Z/fcm_make_um/01|0|20124
2022-09-16T15:19:30Z [STDOUT] 20124
[(('event-mail', 'failed'), 1) ret_code] 0

Please could you suggest to me the possible cause?

Regards, Alok

Hi Alok,

The error is because it can’t find the compile configs for ARCHER2. We do not have UM vn11.0 installed on ARCHER2. You will need to upgrade to a newer version of the UM.

Regards,
Ros

Hi Ros,

Thanks for this. I have modified the app/fcm_make_um/rose-app.conf file using the one from the AMIP job. The fcm_make_um, fcm_make2_um and install_ancil tasks now succeed, but I am getting an error in recon. The error message is:
The following have been reloaded with a version change:

  1. cce/11.0.4 => cce/12.0.3
    [WARN] file:STASHC: skip missing optional source: namelist:exclude_package(:)
    [WARN] file:RECONA: skip missing optional source: namelist:trans(:)
    [FAIL] file:STASHmaster=source=fcm:um.xm_br/dev/ewabednarz/vn11.0_DEST_plus_Iodine/rose-meta/um-atmos/vn11.0/etc/stash/STASHmaster@HEAD: bad or missing value
    2022-09-21T13:51:37Z CRITICAL - failed/EXIT

Please note that the chemistry scheme was developed at vn11.0 by my colleagues on Monsoon, and I am trying to run the same job on ARCHER2.

Regards, Alok

Alok

There is a procedure for upgrading a suite - you appear to not have followed it. Please look at Upgrading — Rose Documentation 2.0.0 documentation.

Also to fix the problem with the STASHmaster:

  1. Add the following to the rose-suite.conf file:
[file:app/um/file/STASHmaster]
source=fcm:um.xm_br/dev/ewabednarz/vn11.0_DEST_plus_Iodine/rose-meta/um-atmos/vn11.0/etc/stash/STASHmaster@HEAD
  2. Remove the following lines from the app/um/rose-app.conf file:
[file:STASHmaster]
source=fcm:um.xm_br/dev/ewabednarz/vn11.0_DEST_plus_Iodine/rose-meta/um-atmos/vn11.0/etc/stash/STASHmaster@HEAD

And obviously you will need to upgrade that vn11.0 branch first.

Grenville

Hi Grenville,

Thanks for pointing me to the upgrade documentation.
I tried to upgrade the current suite (u-cp798) following the instructions, but fcm_make_um failed.
I thought I was missing some steps, so I created a new copy of the Monsoon Iodine job (u-cn823), upgraded the suite to 13.0 and followed the Porting documentation. The new suite is u-cr052.
I am getting the same error message in both suites (u-cp798 and u-cr052). The job.err file has the following information:
[FAIL] um/src/atmosphere/UKCA/photolib/calcjs_mod.F90: merge results in conflict
[FAIL] merge output: /home/akpandeyjnu/cylc-run/u-cr052/share/fcm_make_um/.fcm-make/extract/merge/um/src/atmosphere/UKCA/photolib/calcjs_mod.F90.diff
[FAIL] source from location 0: (none)
[FAIL] source from location 1: svn://pumatest/um.xm_svn/main/branches/dev/lukeabraham/vn11.0_ukca_linox_tweaks/src/atmosphere/UKCA/photolib/calcjs_mod.F90@51068
[FAIL] !!! source from location 3: svn://pumatest/um.xm_svn/main/branches/dev/ewabednarz/vn11.0_DEST_plus_Iodine/src/atmosphere/UKCA/photolib/calcjs_mod.F90@97107
[FAIL] fcm make -f /home/akpandeyjnu/cylc-run/u-cr052/work/20091201T0000Z/fcm_make_um/fcm-make.cfg -C /home/akpandeyjnu/cylc-run/u-cr052/share/fcm_make_um -j 4 mirror.target=login.archer2.ac.uk:cylc-run/u-cr052/share/fcm_make_um mirror.prop{config-file.name}=2 # return-code=2
2022-09-29T14:24:49Z CRITICAL - failed/EXIT

I believe it is a little tricky to run because it is a copy of the Monsoon suite at UM vn11.0, which is not available on ARCHER2.

u-cr001
Meanwhile, I have also taken a copy of u-bd366 (an ARCHER TS2000 nudged suite; GA7.1 StratTrop, UKCA) and upgraded it following the Rose upgrading documentation. I then followed the Porting instructions to submit it on ARCHER2, but unfortunately it is failing in fcm_make2_um with the following error message:
The following have been reloaded with a version change:

  1. cce/11.0.4 => cce/12.0.3
    [FAIL] UKCA_NTP_MOD.mod: same target from [ukca/src/control/core/interface/ukca_ntp_mod.F90, um/src/atmosphere/UKCA/ukca_ntp_mod.F90]
    [FAIL] required by: ukca_um_interf_mod.o
    [FAIL] required by: UKCA_UM_INTERF_MOD.mod
    [FAIL] required by: ukca_scavenging_diags_mod.o
    [FAIL] required by: UKCA_SCAVENGING_DIAGS_MOD.mod
    [FAIL] required by: tracer_restore_mod.o
    [FAIL] required by: TRACER_RESTORE_MOD.mod
    [FAIL] required by: ni_conv_ctl_mod.o
    [FAIL] required by: NI_CONV_CTL_MOD.mod
    [FAIL] required by: atmos_physics2_mod.o
    [FAIL] required by: ATMOS_PHYSICS2_MOD.mod
    [FAIL] required by: atm_step_4a.o
    [FAIL] required by: u_model_4a.o
    [FAIL] required by: um_shell.o
    [FAIL] required by: um-atmos.exe
    [FAIL] UKCA_MODE_SETUP.mod: same target from [ukca/src/science/core/aerosols/glomap/ukca_mode_setup.F90,
    …………………………
    …………………….
    [ukca/src/science/glomap_clim/glomap_clim_calc_drydiam_mod.F90, um/src/atmosphere/GLOMAP_CLIM/glomap_clim_calc_drydiam_mod.F90]
    [FAIL] required by: glomap_clim_cndc_mod.o
    [FAIL] required by: GLOMAP_CLIM_CNDC_MOD.mod
    [FAIL] required by: allocate_ukca_cdnc_mod.o
    [FAIL] required by: ALLOCATE_UKCA_CDNC_MOD.mod
    [FAIL] required by: atm_step_4a.o
    [FAIL] required by: u_model_4a.o
    [FAIL] required by: um_shell.o
    [FAIL] required by: um-atmos.exe
    [FAIL] UKCA_RADAER_PREPARE_MOD.mod: same target from [ukca/src/science/radaer/ukca_radaer_prepare.F90, um/src/atmosphere/UKCA/ukca_radaer_prepare.F90]

[FAIL] fcm make -C /work/n02/n02/alok/cylc-run/u-cr001/share/fcm_make_um -n 2 -j 128 # return-code=2
2022-09-29T14:56:01Z CRITICAL - failed/EXIT

Suite ‘u-cr001’ is a copy of the release job, so it should be easy to run on ARCHER2. Am I missing something? Can you please suggest how to fix these errors?

Regards, Alok

Hi Alok,

Regarding u-cr001: That is a copy of a UM11.2 suite so all you need to do is port it to ARCHER2 using the Porting instructions. You don’t need to upgrade the suite as UM11.2 is installed on ARCHER2.

You can’t run a suite at one UM version (e.g. vn13.0) with code branches from a different UM version (e.g. vn11.2)
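
A quick way to spot this kind of mismatch is to compare the version prefix of each branch name with the suite's UM version. The sketch below is only illustrative: it assumes branches follow the vnX.Y_<name> naming convention seen in this thread (a name does not guarantee the branch's real parent version), and branch_matches_version is a made-up helper name.

```shell
# Hypothetical check: does a branch name carry the suite's UM version?
# Assumption: branches follow the vnX.Y_<name> convention seen in this thread.
branch_matches_version() {
    branch=${1##*/}          # keep only the last path component
    case $branch in
        "$2"_*) return 0 ;;  # name starts with e.g. "vn13.0_"
        *)      return 1 ;;
    esac
}

# Branches from the failing u-cr052 extract, checked against a vn13.0 suite.
for b in branches/dev/lukeabraham/vn11.0_ukca_linox_tweaks \
         branches/dev/ewabednarz/vn11.0_DEST_plus_Iodine; do
    if branch_matches_version "$b" vn13.0; then
        echo "ok: $b"
    else
        echo "MISMATCH: $b"
    fi
done
```

Both branches are reported as MISMATCH here, which is consistent with the merge failure seen after upgrading the suite to 13.0.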

Regards,
Ros.

Hi Alok,

Who are you working with to take forward the Iodine development work previously started by Ewa? Is it the same group that Ewa was working with? - the Lancaster UKCA group?? If so why can’t you get access to Monsoon to continue the work on there rather than have to port to ARCHER2?

Cheers
Ros.

Hi Ros,

Many thanks for your help with this!

I think I should explain: the developments were actually initially done on ARCHER1, and were only moved to Monsoon once ARCHER1 was shutting down so that the work could continue, and were then never moved back to ARCHER2 (as I was changing institutions/projects).

Now we have a NERC grant that Alok is working on that includes a lot of ARCHER2 time for production runs (after some further developments to the scheme), since Monsoon is not really meant to be used for production runs.

Do you reckon it would be possible to get UM vn11.0 installed on ARCHER2? Is that something that CMS could potentially assist with?

Many thanks,

Ewa

Hi Ewa, Alok,

Thanks for the explanation, Ewa. One of the reasons we only installed UM11.1 onwards on ARCHER2 is that a significant change went into the UM, which means the setup/configuration of UM11.0 is different from versions 11.1+. Installing UM11.0 would therefore take more work, and unfortunately we just don’t have the resources to port and support all UM versions. Upgrading your suite from UM11.0 to UM11.1 shouldn’t be too much trouble; we and other users have successfully upgraded suites.

I would suggest trying to upgrade the suite one version to UM11.1 which we do support on ARCHER2. Whilst we encourage people to not get too far behind in UM versions, trying to upgrade to UM13.0 is just way too big a jump to make in one step and certainly won’t work with your UM11.0 branch without a lot of work. Moving it to UM11.1 hopefully won’t be very difficult.

Alok, I’d suggest you try upgrading the suite to UM11.1 following the instructions Grenville linked to above (i.e. running rose app-upgrade -a vn11.1 in the app/fcm_make_um and app/um directories). Then port the suite to ARCHER2 using the porting instructions.
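
The upgrade step could be scripted along the following lines. This is a sketch rather than a tested recipe: upgrade_apps and the DRY_RUN switch are made-up names, and by default the function only prints the commands it would run so they can be checked first.

```shell
# Hypothetical wrapper around the upgrade step described above.
# With DRY_RUN=1 (the default) it only prints the commands it would run.
upgrade_apps() {
    suite_dir=$1
    version=$2
    for app in fcm_make_um um; do
        if [ "${DRY_RUN:-1}" = 1 ]; then
            echo "cd $suite_dir/app/$app && rose app-upgrade -a $version"
        else
            # Run in a subshell so the caller's working directory is untouched.
            (cd "$suite_dir/app/$app" && rose app-upgrade -a "$version") || return 1
        fi
    done
}

upgrade_apps "$HOME/roses/u-cp798" vn11.1
```

Setting DRY_RUN=0 would actually run rose app-upgrade in each app directory; that requires Rose to be available, as it is on pumatest.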

Remove Luke’s branch: branches/dev/lukeabraham/vn11.0_ukca_linox_tweaks as that went into UM11.1 code release.

Replace Mohit’s branch with the vn11.1 equivalent: branches/dev/mohitdalvi/vn11.1_ukca_fix_o3_ste

You will then need to upgrade Ewa’s branch to vn11.1. To do this create a new branch:

  • fcm bc DEST_plus_Iodine fcm:um.x-tr@vn11.1
  • fcm co fcm:um.x-br/dev/<your-mosrs-usrname>/vn11.1_DEST_plus_Iodine
  • cd vn11.1_DEST_plus_Iodine
  • fcm merge fcm:um.x-br/dev/ewabednarz/vn11.0_DEST_plus_Iodine

You will get some merge conflicts, where FCM can’t automatically merge lines. You will need to manually resolve these using the command fcm conflicts.
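
To see which files still need attention after the merge, the status output of the working copy can be filtered for conflict flags. A minimal sketch, assuming the usual Subversion status layout where a "C" in the first column marks a text conflict (FCM wraps Subversion, so fcm status should give the same layout); list_conflicts is a made-up helper, and it does not handle tree conflicts or paths containing spaces.

```shell
# Hypothetical helper: list files still marked as conflicted.
# Reads `fcm status` (or `svn status`) output on stdin; a "C" in the first
# column marks a text conflict. Paths containing spaces are not handled.
list_conflicts() {
    awk 'substr($0, 1, 1) == "C" { print $NF }'
}

# Usage inside the vn11.1_DEST_plus_Iodine working copy (not run here):
#   fcm status | list_conflicts
```

Once fcm conflicts has been run for each listed file, the list should come back empty.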

Regards,
Ros.

Hi Ros,

Thanks for the responses. I am working with the Lancaster UKCA group and we decided to switch to ARCHER2 for all Iodine development work. I believe Ewa has explained why we need to use ARCHER2. Many thanks, Ewa!

Thanks for explaining the upgrade and porting process so clearly. I initially thought that upgrading to the most recent version would be the most helpful, but that is not the case. I am going to take a new copy of Ewa’s job, upgrade it to UM11.1 and then port the suite to ARCHER2.

Meanwhile, I tried to run a copy of u-bd366, following the porting instructions. The new job is u-cr175, and it has failed in recon with the following error:
The following have been reloaded with a version change:

  1. cce/11.0.4 => cce/12.0.3
    [WARN] file:STASHC: skip missing optional source: namelist:exclude_package
    [WARN] file:ATMOSCNTL: skip missing optional source: namelist:jules_urban2t_param
    [WARN] file:RECONA: skip missing optional source: namelist:trans
    [WARN] file:IDEALISE: skip missing optional source: namelist:idealised
    [WARN] file:RECONA: skip missing optional source: namelist:ideal_free_tracer
    [WARN] file:IOSCNTL: skip missing optional source: namelist:lustre_control
    [WARN] file:IOSCNTL: skip missing optional source: namelist:lustre_control_custom_files
    [WARN] file:RECONA: skip missing optional source: namelist:recon_idealised
    [WARN] file:SHARED: skip missing optional source: namelist:jules_urban_switches
    [FAIL] namelist:items(4a4f86c3)=ancilfilename: SST_SICE_ANCIL: unbound variable
    [FAIL] namelist:run_ukca=ukca_em_files: CMIP6_AEROCHEM_EMS: unbound variable
    [FAIL] namelist:run_nudging=ndg_datapath: NUDGE_DATA: unbound variable
    [FAIL] source: namelist:items
    [FAIL] source: namelist:run_ukca
    [FAIL] source: namelist:run_nudging
    2022-10-03T15:04:44Z CRITICAL - failed/EXIT

u-bd366 is a released nudged suite and it should run after porting to ARCHER2. How can I fix this error?

Regards, Alok

Hi Alok,

If you look in the old archer.rc file you’ll see it sets these variables. These are non-standard variables so not present in our provided template *.rc files so you need to add these yourself in the appropriate places in the archer2.rc file.
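
One way to find what still needs copying across is to check which of the unbound names appear in the old file but not in the new one. The sketch below is illustrative only: missing_vars is a made-up helper that does a whole-word text search, so it cannot tell a real setting from a mention in a comment.

```shell
# Hypothetical helper: print each name that appears in the old rc file
# but not in the new one. Plain whole-word text search, nothing smarter.
missing_vars() {
    old_rc=$1
    new_rc=$2
    shift 2
    for var in "$@"; do
        if grep -q -w -- "$var" "$old_rc" && ! grep -q -w -- "$var" "$new_rc"; then
            echo "$var"
        fi
    done
}

# Usage from the suite directory, with the names from the recon error (not run here):
#   missing_vars site/archer.rc site/archer2.rc \
#       SST_SICE_ANCIL CMIP6_AEROCHEM_EMS NUDGE_DATA
```

Each name it prints then needs a matching setting in archer2.rc, with the path adjusted for ARCHER2.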

Cheers,
Ros.

Hi Ros,

Thanks for this. I have modified the archer2.rc file, using archer.rc as a reference. The suite is still failing in recon, but with a different error. The job.err file has the following error message:
???!!!???!!!???!!!???!!!???!!! ERROR ???!!!???!!!???!!!???!!!???!!!
? Error code: 1
? Error from routine: io:file_open
? Error message: Failed to open file /work/n02/n02/ukca/initial/N96eL85/au917a.da20080901_00
? Error from processor: 0
? Error number: 3

I have not found the directory/file ‘/work/n02/n02/ukca/initial/N96eL85/au917a.da20080901_00’ on Archer2.

Is this error associated with the unavailability of the file or something else?

Regards, Alok

The ukca directory is under /work/y07/shared/umshared/

Cheers,
Ros.

Hi Ros,

Thanks for pointing me to the right directory. Now u-cr175 is failing in atmos_main with the following error message.

???!!!???!!!???!!!???!!!???!!! ERROR ???!!!???!!!???!!!???!!!???!!!
? Error code: 1
? Error from routine: EM_FOPEN
? Error message: NetCDF error
? Error from processor: 0
? Error number: 47

[0] exceptions: An non-exception application exit occured.
[0] exceptions: whilst in a serial region
[0] exceptions: Task had pid=118520 on host nid001061
[0] exceptions: Program is "/work/n02/n02/alok/cylc-run/u-cr175/share/fcm_make_um/build-atmos/bin/um-atmos.exe"
Warning in umPrintMgr: umPrintExceptionHandler : Handler Invoked
srun: error: nid001061: task 0: Exited with exit code 255
srun: launch/slurm: _step_signal: Terminating StepId=2412731.0
slurmstepd: error: *** STEP 2412731.0 ON nid001061 CANCELLED AT 2022-10-06T15:01:42 ***
srun: error: nid001062: tasks 120-239: Terminated
srun: error: nid001063: tasks 240-359: Terminated
srun: error: nid001061: tasks 1-119: Terminated
srun: Force Terminated StepId=2412731.0
[FAIL] um-atmos <<'__STDIN__'
[FAIL]
[FAIL] '__STDIN__' # return-code=143
2022-10-06T14:01:42Z CRITICAL - failed/EXIT

Can you please tell me what this error means and how to fix it?

Regards, Alok

Hi Alok,

It means it can’t find/open a file. Have you looked in the job.out and/or pe_output files to see if it tells you which file it’s got a problem with?
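
A recursive search of the task's log files is usually the quickest way to do this. The sketch below is illustrative: find_open_errors is a made-up helper, and the search patterns are guesses based on the error text seen in this thread, so they may need extending.

```shell
# Hypothetical helper: search task logs for file-open errors.
# The patterns are guesses based on error text seen in this thread.
find_open_errors() {
    grep -r -n -i -E 'EM_FOPEN|NetCDF|Failed to open|No such file' "$@"
}

# Usage (replace the placeholders with the real job.out and pe_output
# locations for the failed atmos_main task; not run here):
#   find_open_errors path/to/job.out path/to/pe_output
```

Each match is printed with its file name and line number, so the lines just before the error can be read to find the file name involved.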

Regards,
Ros.