UKESM1.1 AMIP run -- u-dp730

Hi Simon,

In postproc -> post processing select Archer for the archive_command
and then
transfer_dir under postproc -> postprocessing -> JASMIN Transfer is where you want the data to go on JASMIN.

Cheers,
Ros.

Thanks Ros,
did all that and reran the workflow (cylc vip). Now getting a failure from fcm_make_um:
[FAIL] ln04:cylc-run/u-dr157/run2/share/fcm_make_um: cannot create mirror target
make extract has worked…

Simon

Hi Simon,

Please retrigger the fcm_make_um task. It tried to use ln04 - see Grenville’s previous comment above.

Cheers,
Ros

Hi Ros,
Thanks and it worked this time.

Is there a way of having fcm_make_um try different login nodes if the first one it tried did not work? I guess I could add some cylc retry options, but that would retry (repeatedly) if there was a failure due to messing up the build… Presumably the mirror step is copying stuff from puma to archer2.

Simon

Hi Simon,

Cylc automatically tries each of the 4 login nodes until it finds one it can connect to. Unfortunately, at the moment login node 4 is alive and contactable but only allows access to those users who are testing the new OS upgrade on it.

Yes you can add retries to a task but it can’t diagnose when and when not to retry. If the task fails it will retry whatever the reason.

Cheers,
Ros

Ok – annoying! I guess cylc needs a “does the node exist and am I allowed to log on” test :slight_smile:
Simon

Hi Ros,

And pptransfer failed :frowning:
Looks like I need to authenticate… But I thought I did that yesterday when I set up globus and the keys are good for 30 days… See /home/n02/n02/tetts/cylc-run/u-dr157/run2/log/job/19790101T0000Z/pptransfer/01/job.err [You should have read permission for all my files..]

[SUBPROCESS]: Error = 4:

  • The resource you are trying to access requires you to re-authenticate.*
    message: Missing required data_access consent

Please use “globus session update” to re-authenticate with specific identities.

Do I need to make sure the new globus-cli module is loaded (module load globus-cli/3.35.2)? I don’t see any obvious way of doing that for my suite. Or pay attention to the short-lived credentials stuff…

Simon

I tried running the command interactively (after loading the 3.35.2 module) and got asked to authenticate using:
globus session consent 'urn:globus:auth:scope:transfer.api.globus.org:all[*https://auth.globus.org/scopes/a2f53b7f-1b4e-4dce-9b7c-349ae760fee0/data_access *https://auth.globus.org/scopes/3e90d018-0d05-461a-bbaf-aab605283d21/data_access]'
Did so, reran, and it worked! However, looking at my globus logs, there are a bunch of errors like:

Error (transfer)
Endpoint: Archer2 file systems (3e90d018-0d05-461a-bbaf-aab605283d21)
Server: 193.62.216.43:443
File: /work/n02/n02/tetts/cylc-run/u-dr157/run2/share/cycle/19790101T0000Z/dr157a.pk1979jan.pp
Command: RETR /work/n02/n02/tetts/cylc-run/u-dr157/run2/share/cycle/19790101T0000Z/dr157a.pk1979jan.pp
Message: Data channel authentication failed
---
Details: 500-Command failed. : globus_xio: The GSI XIO driver failed to establish a secure connection. The failure occured during a handshake read.\r\n500-globus_xio: Operation was canceled\r\n500-globus_xio: Operation timed out\r\n500 End.\r\n

and (at the time of writing) only 9 files out of 10 have made it to JASMIN. I assume globus will keep trying until the files are transferred. The bandwidth looks very poor though – 3.21 Mb/sec. Should I worry about that?

pptransfer is now complaining that “A transfer with identical paths has not yet completed”, which suggests it would otherwise be OK to run. Will see once my globus transfer actually completes.
Simon

And I confirm that works – thus I have a working suite :slight_smile:
Simon

And my next question – How do I make the workflow use scratch space? In cylc7 the advice was to add:
root-dir{share}=ln*=/mnt/lustre/a2fs-nvme/work/n02/n02/$USER
root-dir{work}=ln*=/mnt/lustre/a2fs-nvme/work/n02/n02/$USER

Is this still OK for cylc 8 or should I do something else?
And does prebuild work in the same way on cylc8?

And another question about cylc8. I will make some changes to my workflow, including extending the run from 3 months to a few years. Is there a way of getting the model to continue rather than start afresh?

Simon

Hi Simon,

  1. To use NVMe for a cylc8 suite, edit site/archer2.rc and, in the [[HPC]] section, change platform from archer2 to archer2-nvme. You will need to start the run from the beginning; you can’t switch a run part way through.

  2. Yes, prebuilds work in the same way. This is FCM rather than Cylc.

  3. To extend a run, change the run length, then restart it with cylc play <suiteid>. You may or may not have to manually trigger the first task in the next cycle, e.g. cylc trigger u-dr157 //<cycle>/atmos_main.

Cheers,
Ros.

Hi Ros,
thanks very much.
Simon

And here is my best guess at the changes needed.

Changes from u-dp730 (UKESM1.1 AMIP) @ 13.8 :

  1. Set owner, account, site to ARCHER2 and Q to Standard.
  2. Turn off testing: suite conf → Tasks → Run Development Tests – set False.
  3. If you have not set up globus then you need to do so… On archer2 you need an updated module for globus, so do “module load globus-cli/3.35.2” first. Depending on when you read this, the default globus module might have been changed. If so, do “module load globus-cli”. See Configuring PPTransfer using Globus for instructions, though step #5 “Repeat for JASMIN endpoint” is not necessary. If you want to do it you need to do a bunch of stuff to install globus… For #6 I recommend deleting the .globus file/dir/link first.
  4. Set up the archiving:
    In postproc → post processing select Archer for the archive command
    and then
    transfer_dir under postproc → postprocessing → JASMIN Transfer is where you want the data to go on JASMIN.
  5. Run your model! Pptransfer will probably fail. Look in the logs for the command that was run and then, on archer2, having loaded the appropriate globus module, run that command yourself. If you get an error you will be asked to authenticate. Do that! Then rerun the pptransfer command.
  6. If you want to use scratch space (28 day lifetime) then edit site/archer2.rc. In the [[HPC]] section change platform from archer2 to archer2-nvme. Also change archer2-bg to archer2-nvme-bg. You will need to start the run from the beginning; you can’t switch a run part way through.
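
As a rough sketch of item 6, the edit in site/archer2.rc might look like the fragment below. Only the platform names come from this thread; the rest of the [[HPC]] section is suite-specific and is not shown, and exactly where the archer2-bg platform is set is an assumption to check against your own copy of the file.

```cylc
[[HPC]]
    # was: platform = archer2
    platform = archer2-nvme

# Wherever the background platform is set in the same file (location assumed):
# was:     platform = archer2-bg
# becomes: platform = archer2-nvme-bg
```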

Just to clarify for anyone reading this: step #5 “Repeat for JASMIN endpoint” is necessary. This step is run on ARCHER2, as indicated by the command prompt, NOT on JASMIN. It enables authentication to JASMIN and is a necessary part of the setup.

Cheers,
Ros

And to reduce the likelihood of failure by having the extract retry a few times, is it sensible to change [[EXTRACT_RESOURCE]] in site/archer2.rc to:

[[EXTRACT_RESOURCE]]
inherit = LINUX
submission retry delays = PT1M, PT5M, PT15M # retry 3 times in case of failure – this is new.

As best as I can tell EXTRACT_RESOURCE is inherited by:
fcm_make_um, fcm_make_pp & fcm_make_pp_archive_host

Simon

I don’t think this is working. I think I should be using execution retry delays, not submission retry delays!

Hi Simon,

Yes.

submission retry delays is to handle temporary issues with job submission.

execution retry delays is to handle job execution failures.

We don’t usually specify execution retry delays for FCM extraction as 99% of failures need the user to do something to fix the issue rather than it being a temporary system issue. Of course, it’s entirely up to you if you do or not.
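
If you do decide to add them, the EXTRACT_RESOURCE section from earlier in the thread would look something like the sketch below. The delay values are simply the ones Simon proposed, not a recommendation, and the placement assumes Cylc 8 syntax, with the setting directly under the task family rather than in a [[[job]]] subsection.

```cylc
[[EXTRACT_RESOURCE]]
    inherit = LINUX
    # Re-run the job up to 3 times if it fails while executing,
    # waiting 1, 5 and 15 minutes before successive attempts.
    execution retry delays = PT1M, PT5M, PT15M
```

Bear in mind that this retries after any execution failure, including a genuinely broken build.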

Regards,
Ros.

Hi Ros,
thanks – I am getting enough failures from the failed attempts to use ln04 that doing this makes sense for now. The alternative would be to modify whatever is trying to use ln04 so it doesn’t. That would actually fix the root cause. So how would I do that?
Simon

You can’t easily, as it is set centrally. Annette has temporarily removed the ln04 node from the list until it is released back to general user access following the OS upgrade. So you won’t get ln04 failures now.

Cheers,
Ros.

:slight_smile: I will remove that change from my suite then.
S