Re-running UKESM suites archiving issue


Hello CMS Team,

I have discovered a slightly complicated issue, which may be a combination of some CYLC8 changes and a recently discovered UKESM1.2 bug that requires model simulations to be re-run.

I am re-running u-dr800 and u-dr928 after a bug was discovered.
In CYLC8, this creates a new run number, e.g. run3 in my case. So far, so good.

When re-running these suites, all the JDMA tasks succeed, so I assumed that the model output was either being put to tape under the “run3” label or overwriting the old data. Either way, I am getting “succeeded” JDMA tasks.

Now, checking JDMA itself, I find that all the JDMA transfers putting my new model output to tape have failed (examples below; there are many more of these).

49814 adittus stabilisation u-dr800/20460401 elastictape 2026-01-23 21:15 FAILED
49815 adittus stabilisation u-dr800/20460701 elastictape 2026-01-23 21:21 FAILED
49816 adittus stabilisation u-dr800/20461001 elastictape 2026-01-23 21:26 FAILED
49831 adittus stabilisation u-dr928/21010101 elastictape 2026-01-24 01:57 FAILED
49832 adittus stabilisation u-dr928/21010401 elastictape 2026-01-24 02:02 FAILED
49833 adittus stabilisation u-dr928/21010701 elastictape 2026-01-24 02:07 FAILED
49834 adittus stabilisation u-dr928/21011001 elastictape 2026-01-24 02:12 FAILED
49898 adittus stabilisation u-dr928/21020101 elastictape 2026-01-25 00:07 FAILED

So these transfers have been failing, despite the model/suite happily telling me that all is well.
Luckily, the data still seems to be available on the transfer cache.

Is there a way to fix this in the short term, i.e. archive the data sitting on the JASMIN transfer cache manually under a “run3” label please?

And in the longer term, is there a way that we can fix this issue when re-running a suite under a new “runX” directory, please? This labelling appears to be used only on ARCHER2 and is not propagated downstream onto JASMIN / tape. Perhaps I have missed something, or forgotten to do or upgrade something. If that’s the case, please let me know!

Many thanks!
Andrea

Hi Andrea,

pptransfer and jdma, like all the UM workflow tasks, run the same code under cylc 7 as under cylc 8 (i.e. they behave in the same way), so they know nothing about the cylc run number (this wasn’t available in cylc 7). Equally, it wouldn’t be appropriate for the task to assume that you want to differentiate output from each run directory.

NLDS, which we will be moving to, has more flexibility in how data is tagged, so we will be adding an option for the user to specify which tags to attach to their data. These could include the cylc run id.

The JDMA task submits the MIGRATE/PUT request to JDMA, and the success/failure of the task only reflects whether the PUT request was successfully submitted (as per the CANARI runs). As has always been the case, the user has to manually check that the JDMA request succeeds. It is impossible for the cylc task to determine this because of the asynchronous nature of the PUTs. Sometimes it takes days for a batch to be put to tape, and if the JDMA task had to wait for this, model workflows would simply stall. The same will be true of NLDS when we move to that: the workflow will issue the request but won’t be able to tell you when the request has been actioned or whether it succeeded, just that it has been successfully submitted to the NLDS server.
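As a rough sketch of that manual check: if you save a listing like the one you posted to a file, something along these lines would pull out the FAILED rows. It assumes the rows keep the layout shown in this thread (batch id first, label fourth, status last). The function names are made up for illustration and are not part of JDMA.

```shell
#!/bin/sh
# Sketch: flag FAILED rows in a saved JDMA listing.
# Assumption: each row is whitespace-separated, with the id in field 1,
# the batch label in field 4 and the status in the last field, as in the
# rows quoted in this thread.

# failed_batches (hypothetical helper): reads listing rows on stdin and
# prints "<id> <label>" for every row whose last field is FAILED.
failed_batches() {
    awk '$NF == "FAILED" { print $1, $4 }'
}

# count_failed_by_suite (hypothetical helper): reads the same rows and
# prints "<suite> <count>" for FAILED rows, where the suite is the part
# of the label before the first "/".
count_failed_by_suite() {
    awk '$NF == "FAILED" { split($4, p, "/"); n[p[1]]++ }
         END { for (s in n) print s, n[s] }' | sort
}
```

For example, `failed_batches < listing.txt` lists the ids and labels that need attention, and `count_failed_by_suite < listing.txt` gives a per-suite tally. Here `listing.txt` is just a placeholder name for wherever you saved the output.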

I assume the FAILUREs are due to the same data already being on ET? I don’t have access to the stabilisation GWS to be able to see.

The JDMA system prevents overwriting data or putting the same data twice. You would need to delete the data from tape first and then re-run. This is what we did with CANARI when we had to re-run segments of an ensemble member.

Yes, you can manually put the data to tape with whatever label you like by running:

jdma -w stabilisation -l <your-preferred-label> migrate /path/to/your/data/on/jasmin

However, since JDMA won’t allow you to put the same source directory/files twice, you will either have to move the data to another directory on XFC or delete the dodgy batches from tape first.
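If you go the "move it to another directory" route, a minimal sketch might look like the following. It only uses the migrate command given above. The directory layout, the "-run3" naming for both the new directory and the label, and the helper names are assumptions for illustration, so adjust them to match what is actually on your XFC volume. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch only: move a cycle directory under a run-tagged parent so the
# source path is new to JDMA, then migrate it with a run-tagged label.
# The paths and the "<suite>-<run tag>" naming are assumptions, not a
# JDMA or workflow convention.

WORKSPACE=stabilisation      # group workspace named in this thread
DRY_RUN=${DRY_RUN:-1}        # default: print commands, change nothing

# run_cmd (hypothetical helper): echo the command, and execute it only
# when DRY_RUN=0.
run_cmd() {
    echo "$@"
    if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

# migrate_cycle (hypothetical helper).
# Arguments: suite id, cycle point, run tag, root directory on XFC.
migrate_cycle() {
    suite=$1; cycle=$2; run_tag=$3; root=$4
    src="$root/$suite/$cycle"
    dest="$root/$suite-$run_tag/$cycle"
    # Stop at the first failing step so nothing is migrated from a
    # half-moved directory.
    run_cmd mkdir -p "$root/$suite-$run_tag" &&
    run_cmd mv "$src" "$dest" &&
    run_cmd jdma -w "$WORKSPACE" -l "$suite-$run_tag/$cycle" migrate "$dest"
}
```

Calling, say, `migrate_cycle u-dr800 20460401 run3 /path/to/your/data/on/jasmin` prints the three commands for that cycle. Once they look right, re-run with `DRY_RUN=0` to execute them.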

Sorry this has caused confusion.

Cheers,
Ros