Adding fcm_make_pp/postproc step to Nesting Suite

Hello

I’ve previously had a custom app set up for archiving to JASMIN that used the gridftp server, because I was using a version of the nesting suite that didn’t have an fcm_make_pp or postproc app (see e.g. the ncas_archive app in suite u-di808, which was copied ages ago from suite u-cy045, which I believe belongs to Helen Burns). I also have a version of the nesting suite adapted from the GAL9 branch, which I ported to ARCHER2 with a bit of trial and error; it now seems to run, at least (u-dk846).

With the closure of the gridftp server, however, I don’t think I can adapt my existing app to work in the same way with the new globus-cli system. So I thought I would try to copy in versions of the fcm_make_pp and postproc apps and add them to my suite graphs in place of the ncas_archive app, but it hasn’t worked very well. My attempts can be found in u-dl827. Currently the fcm_make_pp step is failing with errors about host selection on ARCHER2, which is confusing since I thought host selection happened at job submission, and that step seems to be working. (That said, in the past few weeks I’ve had continual problems with host selection failing and timing out at job submission, forcing me to retrigger jobs manually, which is annoying. Not sure if that’s a me thing or ARCHER-wide…)

I know it’s a bit of a Frankenstein of a suite, but do you have any tips for adding the fcm_make_pp and postproc steps to a version of the nesting suite? Or is there anything I’ve missed in trying to make this work?

Thanks!

Fran

Hi Fran,

Probably the easiest approach is to replace the gridftp command in the ncas_archive app with a simple script that issues the globus commands.

  • In the [[NCAS_ARCHIVE]] family, add a pre-script to load the globus-cli module. I think that’s the correct place.
  [[NCAS_ARCHIVE]]
    inherit = HOST_HPC
    pre-script = """
                 module load globus-cli
                 module list
                 """
     ....
  • In ncas_archive/rose-app.conf, set the app command to run the script:

    [command]
    default = transfer.sh

  • Create a file ncas_archive/bin/transfer.sh with execute permissions containing for example:

#!/bin/bash
  
SRC_COLL='3e90d018-0d05-461a-bbaf-aab605283d21'
DEST_COLL='a2f53b7f-1b4e-4dce-9b7c-349ae760fee0'
LABEL='FranTest'

echo "globus transfer --format unix --jmespath 'task_id' --recursive --fail-on-quota-errors --sync-level checksum --verify-checksum --label ${LABEL} ${SRC_COLL}:${DATA_DIRECTORY} ${DEST_COLL}:${ROOT_PATH}/${FINAL_DIRECTORY}"

id=$(globus transfer --format unix --jmespath 'task_id' --recursive --fail-on-quota-errors --sync-level checksum --verify-checksum --label ${LABEL} ${SRC_COLL}:${DATA_DIRECTORY} ${DEST_COLL}:${ROOT_PATH}/${FINAL_DIRECTORY})

echo "Waiting on 'globus transfer' task: $id"

globus task wait -H "$id"
if [ $? -eq 0 ]; then
  echo "Task $id completed successfully";
else
  echo "Task $id failed";
  exit 1
fi

# Can then do a further check on the status if you want with `status=$(globus task show --jq 'status' --format unix "$id")`
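If you do want that further status check, it can be wrapped like this (a sketch; it assumes `globus task show --jq 'status' --format unix` returns a plain status string such as SUCCEEDED or FAILED, and the `check_task_status` helper is mine, not part of globus-cli):

```shell
#!/bin/bash
# Sketch: interpret the status string returned by
#   status=$(globus task show --jq 'status' --format unix "$id")
# Globus Transfer task statuses include ACTIVE, INACTIVE, SUCCEEDED, FAILED.

check_task_status() {
  local status="$1"
  case "$status" in
    SUCCEEDED) echo "Transfer completed successfully"; return 0 ;;
    FAILED)    echo "Transfer failed"; return 1 ;;
    *)         echo "Transfer not finished yet: $status"; return 2 ;;
  esac
}

# Example with a hard-coded status rather than a live globus call:
check_task_status "SUCCEEDED"
```

This is only worth doing if you want to distinguish a finished-but-failed task from one that is still running; for a simple pass/fail, the `globus task wait` exit code above is enough.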

Caveat: The above script doesn’t quite work, in that it doesn’t capture the transfer task id which is essential for the wait command. At the moment I can’t quite see what I’ve done wrong.

Hope that helps.

Regards,
Ros.

Just edited to fix the script name to transfer.sh not .py. :grimacing:

Ignore my caveat above - the script does work and capture the task id, I was doing something stupid with my test data directory. :frowning_face:

Hi Ros

Thank you so much for this!

I have tried to make these changes and while it seems like it should work, I’m having some trouble with the globus cli login - the step now throws the following error…

MissingLoginError: Missing login for Globus Transfer.
Please run:

  globus login

Usage: globus task wait [OPTIONS] TASK_ID

Error: Missing argument 'TASK_ID'.
[FAIL] transfer.sh <<'__STDIN__'
[FAIL] 
[FAIL] '__STDIN__' # return-code=1
2024-12-19T15:25:55Z CRITICAL - failed/EXIT

(I assume the second set of errors is because globus hasn’t run the first command, so $id doesn’t exist.)
However, I have already run globus login on the login nodes (when I run it again, it tells me I’m already logged in). So I tried adding it to the [[NCAS_ARCHIVE]] pre-script, but it timed out after an hour because it requires interacting with the web interface. Then I tried adding it to the transfer.sh file, which also didn’t work. (Both gave Login timed out. Please try again.)

Do you know how I can make sure I’m logged in on the compute nodes?

Best,
Fran

PS I’m also conscious that it’s probably approaching Christmas annual leave times so thank you for helping me out so late in the day!

Hi Fran,

I think the problem is that /work/n02/n02/franmorr/.globus is a directory rather than a symlink to your /home/n02/n02/franmorr/.globus directory.

Please move or remove the /work/n02/n02/franmorr/.globus directory and then redo the symlink in step 6:

ARCHER2> cd /work/n02/n02/<archer2_username>
ARCHER2> ln -s ~/.globus .globus
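The failure mode (a real directory where a symlink should be) can be checked for with something like this (a sketch; `check_globus_link` is an illustrative helper, and the demo uses a throwaway temporary directory rather than the real ARCHER2 paths):

```shell
#!/bin/bash
# Sketch: check whether a .globus path is a symlink (good), a real
# directory (the failure mode described above), or missing entirely.

check_globus_link() {
  local path="$1"
  if [ -L "$path" ]; then
    echo "OK: $path -> $(readlink "$path")"
    return 0
  elif [ -d "$path" ]; then
    echo "PROBLEM: $path is a real directory; move it aside and re-make the symlink"
    return 1
  else
    echo "MISSING: $path does not exist"
    return 2
  fi
}

# Demo against a throwaway directory structure, not the real paths:
demo=$(mktemp -d)
mkdir "$demo/home_globus"
ln -s "$demo/home_globus" "$demo/work_globus"
check_globus_link "$demo/work_globus"
rm -rf "$demo"
```

On ARCHER2 you would point it at /work/n02/n02/&lt;archer2_username&gt;/.globus instead of the demo path.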

That should fix it.
Cheers,
Ros.

That’s worked! Thanks so much Ros! Merry Christmas :slight_smile:

Fran

Glad that fixed it. Merry Christmas.
