Vn13.3 AMIP suite on Archer2

Hi,

I’m trying to convert a suite, u-dk142, a copy of a Monsoon vn13.3 AMIP suite to work on Archer2 but I’m getting 2 errors I haven’t seen before:

In fcm_make_um:

[FAIL] ukca: name-spaces declared but not used

[FAIL] fcm make -f /home4/home/n02-puma/jweber/cylc-run/u-dk142/work/19800411T0000Z/fcm_make_um/fcm-make.cfg -C /home/n02/n02/jweber/cylc-run/u-dk142/share/fcm_make_um -j 4 mirror.target=ln04:cylc-run/u-dk142/share/fcm_make_um mirror.prop{config-file.name}=2 # return-code=2
2024-10-16T15:08:35Z CRITICAL - failed/EXIT

in fcm_make_pp

[FAIL] pp/Postprocessing/platforms/transfer.py: merge results in conflict
[FAIL] merge output: /home/n02/n02/jweber/cylc-run/u-dk142/share/fcm_make_pp/.fcm-make/extract/merge/pp/Postprocessing/platforms/transfer.py.diff
[FAIL] source from location 0: svn://puma2.archer2.ac.uk/moci.xm/main/trunk/Postprocessing/platforms/transfer.py@4419
[FAIL] source from location 1: svn://puma2.archer2.ac.uk/moci.xm/main/branches/dev/annetteosprey/postproc_2.3_archer2/Postprocessing/platforms/transfer.py@3910
[FAIL] !!! source from location 2: svn://puma2.archer2.ac.uk/moci.xm/main/branches/dev/rosalynhatcher/postproc_2.3_pptransfer_gridftp_nopw/Postprocessing/platforms/transfer.py@4422

[FAIL] fcm make -f /home4/home/n02-puma/jweber/cylc-run/u-dk142/work/19800411T0000Z/fcm_make_pp/fcm-make.cfg -C /home/n02/n02/jweber/cylc-run/u-dk142/share/fcm_make_pp -j 4 mirror.target=ln04:cylc-run/u-dk142/share/fcm_make_pp mirror.prop{config-file.name}=2 # return-code=2
2024-10-16T15:08:33Z CRITICAL - failed/EXIT

Do you have any advice on solving these?

Thanks for your help,

James

Hi James,

I do not think the changes in fcm_make_pp and fcm_make_um (see snapshot) will work.
For pp- there is no branch with that name, so you might have to find out the appropriate one by looking at recent ARCHER suites.
For um- the config_revision is essential, otherwise the build will try to use code from head of the trunk and cause a mismatch with the setup.

Mohit

Thanks, Mohit.

I added the config_revision back in but the fcm_make_um error has persisted.

For the fcm_make_pp error, I tried with Ros’ postproc_2.4_archer2_jasmin_rewrite@5092 and also with pp_sources blank and, on both occasions, it generated the same error as above.

I haven’t found another vn13.3 Archer2 suite - I was hoping to use vn13.3 to make use of Dan Grosvenor’s updated hygroscopicity fix.

The name-spaces error in fcm_make_um is particularly confusing.

Can you think of anything else to try?

Cheers,

James

James

You have
ROSE_APP_OPT_CONF_KEYS = archer2
which is picking up a UM12.1 config branch. Probably just delete this line from archer2.rc

Take a look at my u-dj927.

Grenville

should have said - that’s in this section
[[UMBUILD]]
[[[environment]]]
CONFIG = ncas-ex-cce
OPENMP= true
OPTIM = safe
PREBUILD =
ROSE_APP_OPT_CONF_KEYS = archer2

check for other optional overrides

Hi James,

The changes that Mohit and Grenville suggested may have got your suite working.

If your suite is nudged, another option could be to take a look at my suite u-db533 - this is a UM13.1 nudged UKESM1.1 AMIP suite on ARCHER2. It should be possible to upgrade this to UM13.3 and then apply the differences from your Monsoon2 suite to get a working equivalent.

Best wishes,
Luke

Thanks, Grenville. I removed the ROSE_APP_OPT_CONF_KEYS = archer2 line in the UMBUILD section, which in my AMIP suite is closest to your coupled suite u-dj927, but then I got the same error messages.

I then removed it from the ATMOS_RESOURCE section as well and got the same error messages. Finally I removed it from the EXTRACT_RESOURCE section, which left the [[Environment]] subsection blank and the run failed as before.

The suite.rc file also has:
ROSE_APP_OPT_CONF_KEYS = {{CONFIG_OPT}} {{BITCOMP_NRUN_OPT}}

Should that be removed/altered too?

Thanks, Luke. I’m not planning to run nudged but could try upgrading the suite. Is there any documentation on how to do that?

Thanks,

James

Hi James,

The merge conflicts you are getting for fcm_make_pp are because you are still using postproc_2.4 trunk with postproc_2.3 branches which won’t work.

Either remove the override file app/fcm_make_pp/opt/rose-app-archer2.conf or delete its contents.

Then follow the instructions here for postproc_2.4: https://cms.ncas.ac.uk/unified-model/postproc/ to upgrade postproc for ARCHER2.

Similarly delete the contents of the app/fcm_make_um/opt/rose-app-archer2.conf as it is still picking up the UM12.1 config branch due to a higher level setting.
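
One way to confirm that nothing else is still switching on the archer2 optional configuration is to search the suite for it. This is only a minimal sketch: the function name find_opt_overrides is made up for illustration, and the suite location in the usage comment is an assumption.

```shell
#!/usr/bin/env bash
# List places where a Rose optional configuration may still be switched on.
# Usage: find_opt_overrides <suite-dir>
find_opt_overrides() {
    local suite_dir="$1"

    # Every line in the suite's .rc files that sets the optional-config keys.
    grep -rn --include='*.rc' 'ROSE_APP_OPT_CONF_KEYS' "$suite_dir"

    # Every optional configuration file that ships with the suite's apps.
    find "$suite_dir/app" -path '*/opt/rose-app-*.conf' -type f 2>/dev/null
}

# Example (suite path is an assumption):
# find_opt_overrides ~/roses/u-dk142
```

Any rose-app-archer2.conf that is listed and still has contents, or any remaining ROSE_APP_OPT_CONF_KEYS line that names archer2, is a candidate for the override that is pulling in the older branches.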

Regards,
Ros.

Thanks, Ros. I’ve worked through your advice and that on https://cms.ncas.ac.uk/unified-model/postproc/.

I think something may have gone wrong with the

cp ~um1/jdma/jdma.rc ~/roses/<suiteid>
cp -r ~um1/jdma/metadata/ncas_extras ~/roses/<suite-id>/meta

section as I get the below error when I try to run

[FAIL] cylc validate -o /tmp/tmp4FDt5S --strict u-dk142 # return-code=1, stderr=
[FAIL] Jinja2Error:
[FAIL] File “/home/n02/n02/jweber/cylc-run/u-dk142/jdma.rc”, line 1, in top-level template code
[FAIL] {% if RUN and POSTPROC and PPTRANSFER and JDMA %}
[FAIL] UndefinedError: ‘RUN’ is undefined

I haven’t done the pp transfer tasks yet as I need to work out how and where to put the data on Jasmin. Are they required for the jdma?

Cheers,

James

Hi James,

The tasks are named slightly differently in your suite.

In jdma.rc change the first line to be:

{% if TASK_RUN and TASK_POSTPROC and TASK_PPTRANSFER and JDMA %}
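
The same one-line change could be made from the shell rather than in an editor. This is a hedged sketch: fix_jdma_guard is a hypothetical helper name, and it assumes the guard really is the first line of the file, as the traceback suggests.

```shell
#!/usr/bin/env bash
# Replace the first line of a jdma.rc with the guard that uses the TASK_* names.
# Usage: fix_jdma_guard <path-to-jdma.rc>
fix_jdma_guard() {
    local rc_file="$1"
    local new_guard='{% if TASK_RUN and TASK_POSTPROC and TASK_PPTRANSFER and JDMA %}'
    local tmp_file
    tmp_file=$(mktemp) || return 1

    # Write the new guard, then everything from line 2 onwards.
    { printf '%s\n' "$new_guard"; tail -n +2 "$rc_file"; } > "$tmp_file" || return 1

    # Copy the result back in place so the original file keeps its permissions.
    cat "$tmp_file" > "$rc_file" && rm -f "$tmp_file"
}

# Example (the <suite-id> placeholder is left as in the original instructions):
# fix_jdma_guard ~/roses/<suite-id>/jdma.rc
```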

Then try running again.

We can sort out the settings for pptransfer and jdma (if using) later once you’ve got the model running.

Cheers,
Ros.

Thanks, Ros. u-dk142 now runs but doesn’t seem to archive anything on Archer2 (I have a directory on work /work/n02/n02/jweber/archive) - can this be done separately to the pptransfer to Jasmin?

I will also look to get the Jasmin transfer working.

Cheers,

James

Hi James,

The postproc step hasn’t run successfully yet and this is the step that puts data in the /work/n02/n02/jweber/archive directory.

If you look in the job.err file for the postproc task you will see that it ran out of memory.
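
A quick way to pick such messages out of a long job.err is to search for memory-related wording. This is a rough sketch only: find_memory_errors is a made-up name, the patterns are guesses at typical Slurm and kernel messages rather than a definitive list, and the cycle and try number in the example path are hypothetical.

```shell
#!/usr/bin/env bash
# Print lines in a job.err file that suggest the task was killed for memory.
# The patterns are assumptions about typical wording, not an exhaustive list.
# Usage: find_memory_errors <path-to-job.err>
find_memory_errors() {
    local job_err="$1"
    grep -n -i -E 'oom[-_ ]?kill|out of memory|memory limit' "$job_err"
}

# Example (cycle point and try number are hypothetical):
# find_memory_errors ~/cylc-run/u-dk142/log/job/19800101T0000Z/postproc/01/job.err
```

No output only means that none of these particular phrases appear, so it is still worth reading the end of the file by eye.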

In site/archer2.rc in section [[POSTPROC_RESOURCE]] try increasing the memory requested for this task by adding

[[[directives]]]
    --mem = 25Gb

Reload the suite and retrigger the postproc task.

Once we’ve got postproc working, if you’re transferring the data to JASMIN we should then configure the suite so it automatically deletes the data from ARCHER2 once it has been successfully transferred.

Let me know once you’ve got postproc working ok and I’ll tell you what to do next.

Regards,
Ros.

Thanks, Ros. u-dk142 now sends output to the archive directory. I will follow the steps for setting up the transfer to Jasmin and let you know if I have any problems.

Cheers,

James

Hi Ros,

u-dk142 is stuck on submit-retrying for pptransfer with the job.err message

Lmod is automatically replacing “cce/15.0.0” with “gcc/11.2.0”.

Due to MODULEPATH changes, the following have been reloaded:

  1. cray-mpich/8.1.23

[WARN] file:atmospp.nl: skip missing optional source: namelist:moose_arch
[WARN] file:atmospp.nl: skip missing optional source: namelist:script_arch
[WARN] [SUBPROCESS]: Command: globus-url-copy -vb -cd -r -cc 4 -sync -cred /work/n02/n02/jweber/cred.jasmin /work/n02/n02/jweber/cylc-run/u-dk142/share/cycle/19800101T0000Z/u-dk142/19800101T0000Z/ gsiftp://gridftp1.jasmin.ac.uku-dk142/19800101T0000Z/
[SUBPROCESS]: Error = 1:

error: Unable to check destination url for sync: gsiftp://gridftp1.jasmin.ac.uku-dk142/19800101T0000Z/
globus_xio: Unable to connect to gridftp1.jasmin.ac.uku-dk142:2811
globus_xio: globus_libc_getaddrinfo failed.
globus_common: Name or service not known

[WARN] Transfer command failed: globus-url-copy -vb -cd -r -cc 4 -sync -cred /work/n02/n02/jweber/cred.jasmin /work/n02/n02/jweber/cylc-run/u-dk142/share/cycle/19800101T0000Z/u-dk142/19800101T0000Z/ gsiftp://gridftp1.jasmin.ac.uku-dk142/19800101T0000Z/
[ERROR] transfer.py: Unknown Error - Return Code=1
[FAIL] Command Terminated
[FAIL] Terminating PostProc…
[FAIL] transfer.py <<‘STDIN
[FAIL]
[FAIL] ‘STDIN’ # return-code=1
2024-10-24T05:06:55Z CRITICAL - failed/EXIT

Could this be linked to the upcoming retirement of the xfer-sp server on Jasmin?

Cheers,

James

James

Please confirm that you have followed the instructions in Configuring PPTransfer.

Grenville

Hi Grenville,

I thought I had made all the changes but realised the transfer_dir had not been defined. When I added that, u-dk142 threw an error (below) in atmos_main (despite completing it successfully before), so I am trying to work out why that has happened.

? Error code: 58
? Error from routine: UKCA_CHEMISTRY_CTL
? Error message: ERROR: Number of chemical active species /= jpctr
? Error from processor: 464

Once I have solved that, I will return to the pptransfer challenge and let you know how I get on.

Thanks,

James
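
In the earlier pptransfer log the destination was gsiftp://gridftp1.jasmin.ac.uku-dk142/19800101T0000Z/, with the host name running straight into the suite name, which is consistent with transfer_dir being empty. A small check like the one below could catch that before a transfer is attempted. It is a sketch only, and check_gsiftp_url is a hypothetical helper, not part of postproc.

```shell
#!/usr/bin/env bash
# Succeed only if a gsiftp URL is the expected host followed by an absolute path.
# Usage: check_gsiftp_url <url> <expected-host>
check_gsiftp_url() {
    local url="$1" host="$2"
    case "$url" in
        # Host, then "/", then at least one more character of path.
        "gsiftp://$host/"?*) return 0 ;;
        *) echo "suspicious destination: $url" >&2; return 1 ;;
    esac
}

# Example: this is the malformed URL from the log and is reported as suspicious.
# check_gsiftp_url 'gsiftp://gridftp1.jasmin.ac.uku-dk142/19800101T0000Z/' gridftp1.jasmin.ac.uk
```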

Hi Grenville,

I’ve switched to using u-dk384, which is a copy of u-dk142@301081 from before I made the stash changes that caused the above atmos_main error. u-dk384 has all the modifications listed in the Configuring PPTransfer link, including the transfer dir, and completes atmos_main.

I also checked the credential is valid.

However, it is still failing on pptransfer with

Lmod is automatically replacing “cce/15.0.0” with “gcc/11.2.0”.

Due to MODULEPATH changes, the following have been reloaded:

  1. cray-mpich/8.1.23

[WARN] file:atmospp.nl: skip missing optional source: namelist:moose_arch
[WARN] file:atmospp.nl: skip missing optional source: namelist:script_arch
[WARN] [SUBPROCESS]: Command: globus-url-copy -vb -cd -r -cc 4 -sync -cred /work/n02/n02/jweber/cred.jasmin /work/n02/n02/jweber/cylc-run/u-dk384/share/cycle/19800101T0000Z/u-dk384/19800101T0000Z/ gsiftp://gridftp1.jasmin.ac.uk/gws/nopw/j04/sheffield/jweber/archive/u-dk384/19800101T0000Z/
[SUBPROCESS]: Error = 1:

error: Unable to check destination url for sync: gsiftp://gridftp1.jasmin.ac.uk/gws/nopw/j04/sheffield/jweber/archive/u-dk384/19800101T0000Z/
globus_ftp_client: the server responded with an error
530 530-Login incorrect. : globus_gss_assist: Gridmap lookup failure: Could not map /DC=uk/DC=ac/DC=jasmin/O=STFC RAL/CN=jmw240
530-
530 End.

[WARN] Transfer command failed: globus-url-copy -vb -cd -r -cc 4 -sync -cred /work/n02/n02/jweber/cred.jasmin /work/n02/n02/jweber/cylc-run/u-dk384/share/cycle/19800101T0000Z/u-dk384/19800101T0000Z/ gsiftp://gridftp1.jasmin.ac.uk/gws/nopw/j04/sheffield/jweber/archive/u-dk384/19800101T0000Z/
[ERROR] transfer.py: Unknown Error - Return Code=1
[FAIL] Command Terminated
[FAIL] Terminating PostProc…
[FAIL] transfer.py <<‘STDIN
[FAIL]
[FAIL] ‘STDIN’ # return-code=1
2024-10-25T10:29:20Z CRITICAL - failed/EXIT

Thanks,

James

James

There has been a similar error reported - https://cms-helpdesk.ncas.ac.uk/t/gridftp-transfer-from-archer2-to-jasmin/702. The solution seemed to be related to not having access to hpxfer. Can you ssh to hpxfer1.jasmin.ac.uk from ARCHER?

Does a simple list work from the Archer2 command line:

globus-url-copy -vb -cred /work/n02/n02/jweber/cred.jasmin -list gsiftp://gridftp1.jasmin.ac.uk/home/users/jmw240

Grenville

Hi Grenville,

When I run

jweber@ln01:~> globus-url-copy -vb -cred /work/n02/n02/jweber/cred.jasmin -list gsiftp://gridftp1.jasmin.ac.uk/home/users/jmw240

The output is:

gsiftp://gridftp1.jasmin.ac.uk/home/users/jmw240
jmw240

When I ran ssh -YAX jmw240@hpxfer1.jasmin.ac.uk on archer2, the response was:

ssh: connect to host hpxfer1.jasmin.ac.uk port 22: Connection refused

I’ve just seen that my user role for the hpxfer server has expired. When I try to apply to extend it, I am asked for “The IP address from which you will be accessing the high-performance transfer machines. If you are not sure what to put here, please contact your local network administrator.”

Would this be my reading IP address or an Archer2 address?

Thanks,

James

Hi James,
On that page it should give a dummy IP address you can enter if you are going from ARCHER2, as they already know about it; otherwise an ARCHER2 IP address is fine.

Updated: The dummy address is: 130.246.130.166

Cheers,
Ros