PP to Jasmin: [FAIL] Bad optional configuration key(s): archer2

Hello,

I’m running u-cu408, a copy of u-bc613/archer2, and trying to archive data to Jasmin. I’ve set up the transfer requirements following Configuring PPTransfer and I can see that data has been written on Archer2 to /work/n02/n02/jweber/archive/u-cu408.

However, the fcm_make2_pptransfer has failed with
Lmod is automatically replacing “cce/11.0.4” with “gcc/10.2.0”.

Due to MODULEPATH changes, the following have been reloaded:

  1. cray-mpich/8.1.4

[FAIL] Bad optional configuration key(s): archer2
2023-02-17T02:40:18Z CRITICAL - failed/EXIT

I looked at this similar ticket (Postproc failing) but it appears the solutions proposed there are already in my suite.

Could you say what I’m doing wrong?

Many thanks for your help,

James

Hi James,

That suite has not got the correct branches in postproc so in its current state pptransfer won’t work. I guess it was ported over before gridftp was available.

I’ll take a look at fixing it up later this afternoon.

Cheers,
Ros.

Thank you, Ros, that’s great.

James

Hi Ros, is there another suite I can copy to get the right branches?

Cheers,

James

Hi James,

There are a couple of other changes needed as well as a branch change. I’m just testing it out now.

Cheers,
Ros.

Hi James,

Try adding the branch fcm:moci.xm_br/dev/rosalynhatcher/postproc_2.3_pptransfer_gridftp_nopw to the fcm_make_pp → Configuration → pp_sources.

Then in the suite.rc file remove the line:

{{ 'fcm_make_pptransfer => fcm_make2_pptransfer' + (' => pptransfer' if RUN else '') if PPTRANSFER else '' }}

I’m not 100% sure that’s going to completely do it. I’ve got a problem with my suite but I think it’s my dodgy environment.

Let me know how you get on.

Cheers,
Ros.

Hi Ros, thanks for your help on this.

I made the changes to u-cu408 and set off a 3-month run from Oct 2026. fcm_make2_pptransfer succeeded and so did the coupled job. postproduction_atmos succeeded and I can see the data in /work/n02/n02/jweber/archive/u-cu408/20261001T0000Z.

However postproc_cice failed with: “Directory does not exist /work/n02/n02/jweber/cylc-run/u-cu408/share/data/History_Data/CICEhist/archive_ready”.

Postproc_nemo failed with “slurmstepd: error: Detected 1 oom-kill event(s) in StepId=3133450.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.”

I set both to succeeded just to see if pp_transfer would work but this failed too with:
“[SUBPROCESS]: Command: globus-url-copy -vb -cd -r -cc 4 -verify-checksum -sync -sync-level 2 -cred /work/n02/n02/jweber/cred.jasmin /work/n02/n02/jweber/archive/u-cu408/20261001T0000Z/ gsiftp://gridftp1.jasmin.ac.uk/gws/nopw/j04/sheffield/jweber/mass_extracts/from_archer2/u-cu408/20261001T0000Z/
[SUBPROCESS]: Error = 2:
No such file or directory
[WARN] Transfer command failed: globus-url-copy -vb -cd -r -cc 4 -verify-checksum -sync -sync-level 2 -cred /work/n02/n02/jweber/cred.jasmin /work/n02/n02/jweber/archive/u-cu408/20261001T0000Z/ gsiftp://gridftp1.jasmin.ac.uk/gws/nopw/j04/sheffield/jweber/mass_extracts/from_archer2/u-cu408/20261001T0000Z/
[ERROR] transfer.py: System Error: Remove archive directory does not exist (ReturnCode=2)”

Could the postproc_nemo error be fixed by increasing the memory in archer2.rc file? Not sure about the other two.

Cheers,

James

Hi James,

I’ll take a look. That’s the same error I got. Things got quite complicated with all the postproc versions and different suites.

You should NOT have any fcm_make_pptransfer tasks. They are not needed on ARCHER2 and can cause confusion. They are a relic from ARCHER. Make sure you have removed the line I specified above re fcm_make_pptransfer from the suite.rc file.
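
A quick way to confirm the line has gone is to search suite.rc for any remaining mention of fcm_make_pptransfer. This is only a minimal sketch: the function name is made up for illustration, and the path in the usage comment assumes the usual ~/roses layout.

```shell
# Hypothetical helper: list any lines in a suite.rc that still mention
# fcm_make_pptransfer. Returns 0 if the file is clean, 1 if matches remain.
check_no_make_pptransfer() {
    if grep -n 'fcm_make_pptransfer' "$1"; then
        echo "fcm_make_pptransfer is still referenced in $1" >&2
        return 1
    fi
    return 0
}

# Usage (path is an assumption):
#   check_no_make_pptransfer ~/roses/u-cu408/suite.rc
```

Any matching lines are printed with their line numbers, so they are easy to find in an editor.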

I’ll get back to you shortly.
Cheers,
Ros.

Thanks, Ros. Sorry, that was an error in my last message - the run only had fcm_make_pp and fcm_make2_pp tasks (there are no fcm_make_pptransfer tasks). I removed the line in suite.rc as you suggested.

Cheers,

James

Hi James,

I’ve finally got pptransfer working in that suite.

I’ve committed my changes to the u-bc613/archer2 branch if you want to view/compare with your suite.

u-bc613/archer2 pptransfer changes

In summary:

  1. Panel fcm_make_pp → Configuration
    config_base: fcm:moci.xm_br/pkg/rosalynhatcher/postproc_2.3_archer2_jasmin_pkg
    config_rev:
    pp_rev: postproc_2.3
    pp_sources: fcm:moci.xm_br/pkg/rosalynhatcher/postproc_2.3_archer2_jasmin_pkg

  2. site/archer2.rc
    Remove 2 lines containing: module use /work/n02/n02/grenvill/modulefiles
    Remove from [[PPTRANSFER_RESOURCE]]

    [[[job]]]
        batch system = background

That now works successfully for me.

Cheers,
Ros.

Hi Ros,

Postproc_cice failed again with: " [FAIL] check_directory: Exiting - Directory does not exist: /work/n02/n02/jweber/cylc-run/u-cu408/share/data/History_Data/CICEhist/archive_ready

[FAIL] Terminating PostProc…

[FAIL] main_pp.py cice <<‘STDIN

Just as we have to specify an archive file on Archer2 for the data which are to be transferred, does a specific CICE archive file need to be specified?

pptransfer failed with: "/mnt/lustre/a2fs-work1/work/y07/shared/umshared/software/cylc-7.8.12/lib/cylc/job.sh: line 86: echo: write error: Disk quota exceeded

2023-02-22T08:19:36Z CRITICAL - failed/ERR"

I’ve really cut back on stuff stored on Archer2 but the 600 GB disk quota can make it hard to do much. Is it possible to have this increased?

Could the Postproc_cice failure be affecting pptransfer as well?

Cheers,

James

Hi James,

I’ve increased your ARCHER2 quota. Please try again and see if that fixes your issues.
Regarding the CICEhist directory - you don’t need to change anything there - that archive_ready directory is automatically created. It worked fine for me.

Cheers,
Ros.

Hi Ros,

I’m afraid postproc_cice has failed again with

check_directory: Exiting - Directory does not exist: /work/n02/n02/jweber/cylc-run/u-cu408/share/data/History_Data/CICEhist/archive_ready

I can confirm that no archive_ready directory has been created.

James

Hi James,

I suspect the problem might be caused by switching off CICE create_means but leaving the archive_means switch set to True. Try turning off archive of CICE means and see if that fixes it.
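
To see how the two switches are currently set, something like the following prints every create_means and archive_means line from the postproc app config, together with the section it sits in. It is only a sketch: the function name is invented, and the path in the usage comment assumes the standard ~/roses/<suite>/app/postproc layout.

```shell
# Hypothetical helper: print each create_means / archive_means setting in a
# rose-app.conf, prefixed by the [section] it belongs to. A leading "!" (a
# setting Rose is ignoring) is matched too.
show_means_switches() {
    awk '
        /^\[/ { section = $0 }    # remember the most recent [section] header
        /^!*(create_means|archive_means)[ \t]*=/ { print section ": " $0 }
    ' "$1"
}

# Usage (path is an assumption):
#   show_means_switches ~/roses/u-cu408/app/postproc/rose-app.conf
```

If the CICE section shows create_means off while archive_means is still on, that is the combination described above.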

Cheers,
Ros.

Hi Ros,

Thank you, that has solved the postproc_cice issue.

However, pptransfer failed with

[WARN] [SUBPROCESS]: Command: globus-url-copy -vb -cd -r -cc 4 -verify-checksum -sync -cred /work/n02/n02/jweber/cred.jasmin /work/n02/n02/jweber/archive/u-cu408/20270701T0000Z/ gsiftp://gridftp1.jasmin.ac.uk/gws/nopw/j04/sheffield/jweber/mass_extracts/from_archer2/u-cu408/20270701T0000Z/

[SUBPROCESS]: Error = 1:

error: Unable to list destination directory for sync: gsiftp://gridftp1.jasmin.ac.uk/gws/nopw/j04/sheffield/jweber/mass_extracts/from_archer2/u-cu408/20270701T0000Z/

globus_ftp_client: the server responded with an error

500 500-Command failed. : an authentication operation failed

500-globus_gsi_callback_module: Could not verify credential

500-globus_gsi_callback_module: Error with signing policy

500-globus_gsi_callback_module: Error in OLD GAA code: The subject of the certificate “/DC=uk/DC=ac/DC=jasmin/O=STFC RAL/CN=JASMIN” does not match the signing policies defined in /home/users/jmw240/.globus/certificates/7ed47087.signing_policy

500 End.

[WARN] Transfer command failed: globus-url-copy -vb -cd -r -cc 4 -verify-checksum -sync -cred /work/n02/n02/jweber/cred.jasmin /work/n02/n02/jweber/archive/u-cu408/20270701T0000Z/ gsiftp://gridftp1.jasmin.ac.uk/gws/nopw/j04/sheffield/jweber/mass_extracts/from_archer2/u-cu408/20270701T0000Z/

[ERROR] transfer.py: Unknown Error - Return Code=1

I can see that some data has been transferred to Jasmin and wondered if this could be a checksum issue. Your suite had checksum set to false whereas it was true for me. I ran u-cu408 for a further 3 months with checksum now false in postproc/rose-app.conf but this time postproc_nemo failed again with the error:
"Detected 1 oom-kill event(s) in StepId=3133450.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.”

Does this require a change to the suite.rc or archer2.rc file?

When I set postproc_nemo to succeeded, pptransfer then failed with:

[WARN] [SUBPROCESS]: Command: globus-url-copy -vb -cd -r -cc 4 -verify-checksum -sync -cred /work/n02/n02/jweber/cred.jasmin /work/n02/n02/jweber/archive/u-cu408/20280401T0000Z/ gsiftp://gridftp1.jasmin.ac.uk/gws/nopw/j04/sheffield/jweber/mass_extracts/from_archer2/u-cu408/20280401T0000Z/
[SUBPROCESS]: Error = 1:

error: globus_ftp_client: the server responded with an error
500 500-Command failed. : callback failed.
500-globus_gsi_callback_module: Could not verify credential
500-globus_gsi_callback_module: Error with signing policy
500-globus_gsi_callback_module: Error in OLD GAA code: The subject of the certificate “/DC=uk/DC=ac/DC=jasmin/O=STFC RAL/CN=JASMIN” does not match the signing policies defined in /home/users/jmw240/.globus/certificates/7ed47087.signing_policy
500 End.

Source: file:///work/n02/n02/jweber/archive/u-cu408/20280401T0000Z/
Dest: gsiftp://gridftp1.jasmin.ac.uk/gws/nopw/j04/sheffield/jweber/mass_extracts/from_archer2/u-cu408/20280401T0000Z/
cu408a.pa2028jun.pp

Source: file:///work/n02/n02/jweber/archive/u-cu408/20280401T0000Z/
Dest: gsiftp://gridftp1.jasmin.ac.uk/gws/nopw/j04/sheffield/jweber/mass_extracts/from_archer2/u-cu408/20280401T0000Z/
cu408a.pi20280101.pp

Source: file:///work/n02/n02/jweber/archive/u-cu408/20280401T0000Z/
Dest: gsiftp://gridftp1.jasmin.ac.uk/gws/nopw/j04/sheffield/jweber/mass_extracts/from_archer2/u-cu408/20280401T0000Z/
cu408a.ps2028djf.pp

Source: file:///work/n02/n02/jweber/archive/u-cu408/20280401T0000Z/
Dest: gsiftp://gridftp1.jasmin.ac.uk/gws/nopw/j04/sheffield/jweber/mass_extracts/from_archer2/u-cu408/20280401T0000Z/
cu408a.pj20260701.pp

Bit confused by this. Might be two separate issues?

James

Hi James,

  • Yes, the verify_chksums must be switched off. That is the separate checksum which we had to use on old ARCHER. You’ll see in the globus-url-copy command above that we’re using the gridftp inbuilt checksum verification option now.

  • The postproc_nemo OOM - yes you will need to increase the amount of memory requested. You’ll need to experiment with how much memory you need to request.

    In archer2.rc add to the [[POSTPROC_RESOURCE]] section

        [[[directives]]]
            --mem=25Gb

  • PPTransfer - It looks like something went wrong part way through the transfer. If you look at try 02, it failed straight away, unable to list the JASMIN directory. I’d suggest running that globus-url-copy command on the ARCHER2 command line to see if it works. (You’ll need to module load postproc first.) If you get the same error, check the GWS quota and that you can still get to that directory on JASMIN.
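
For the memory experiment above, Slurm's accounting can give a starting point by showing what a finished job requested against what it actually used. A minimal sketch, assuming sacct is available on the login node; the function name is made up, and the job ID is the number from the "StepId=" part of the OOM message.

```shell
# Hypothetical helper: show requested memory (ReqMem) and peak resident
# memory (MaxRSS) for each step of a finished Slurm job.
job_mem_report() {
    # $1: Slurm job ID, e.g. 3133450
    sacct -j "$1" --format=JobID,JobName,ReqMem,MaxRSS,State
}

# Usage:
#   job_mem_report 3133450
```

For a job killed by the out-of-memory handler, MaxRSS only shows how far it got before being stopped, so it is best read as a lower bound on what to request.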

Cheers,
Ros.

Thanks, Ros.

I checked and /gws/nopw/j04/sheffield/jweber/mass_extracts/from_archer2/u-cu408/20280401T0000Z/ does exist on Jasmin. This is odd because, as you say, some of the files did make it over to Jasmin. I checked the GWS quota and we have several TB spare.

After doing module load postproc I ran the following on the command line:
globus-url-copy -vb -cd -r -cc 4 -verify-checksum -sync -cred /work/n02/n02/jweber/cred.jasmin /work/n02/n02/jweber/archive/u-cu408/20280401T0000Z/ gsiftp://gridftp1.jasmin.ac.uk/gws/nopw/j04/sheffield/jweber/mass_extracts/from_archer2/u-cu408/20280401T0000Z/

This failed with:

error: Unable to list destination directory for sync: gsiftp://gridftp1.jasmin.ac.uk/gws/nopw/j04/sheffield/jweber/mass_extracts/from_archer2/u-cu408/20280401T0000Z/

globus_ftp_client: the server responded with an error

500 500-Command failed. : an authentication operation failed

500-globus_gsi_callback_module: Could not verify credential

500-globus_gsi_callback_module: Error with signing policy

500-globus_gsi_callback_module: Error in OLD GAA code: The subject of the certificate “/DC=uk/DC=ac/DC=jasmin/O=STFC RAL/CN=JASMIN” does not match the signing policies defined in /home/users/jmw240/.globus/certificates/7ed47087.signing_policy

500 End.

When I have done scp transfers between archer2 and Jasmin I have put a : between uk and /gws but there isn’t one here.

I looked in /home/users/jmw240/.globus/certificates/7ed47087.signing_policy and found:

Note that this root signing policy has been EXTENDED compared to the

IGTF accredited version: this is OK because relying parties (in this

case the NGS) are supposed to define their own signing policies.

access_id_CA X509 ‘/C=UK/O=eScienceRoot/OU=Authority/CN=UK e-Science Root’
pos_rights globus CA:sign
cond_subjects globus ‘“/C=UK/O=eScienceCA/OU=Authority/CN=UK e-Science CA” “/C=UK/O=eScienceSLCSHierarchy/OU=Authority/CN=SLCS Top Level CA” “/C=UK/O=eScienceCA/OU=Authority/CN=UK e-Science CA 2A” “/C=UK/O=eScienceCA/OU=Authority/CN=UK e-Science CA 2B” “/DC=uk/DC=ac/DC=ceda/O=STFC RAL/CN=Centre for Environmental Data Analysis”’

Could this signing_policy need editing?
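
One way to see the mismatch directly, without editing anything, is to search the policy file for the subject named in the error. A rough sketch: the function name is invented, and it does a literal text search only, so it would not recognise a wildcard entry if the policy contained one.

```shell
# Hypothetical helper: succeed if the given subject DN appears literally on a
# cond_subjects line of a signing_policy file.
subject_in_policy() {
    # $1: signing_policy file, $2: certificate subject DN
    awk -v dn="$2" '
        /cond_subjects/ && index($0, dn) { found = 1 }
        END { exit !found }
    ' "$1"
}

# Usage, with the file and subject taken from the error message:
#   subject_in_policy ~/.globus/certificates/7ed47087.signing_policy \
#       '/DC=uk/DC=ac/DC=jasmin/O=STFC RAL/CN=JASMIN' || echo "subject not listed"
```

Against the contents quoted above, this would report the JASMIN subject as not listed, since the only DC=uk entry is the CEDA one.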

Cheers,

James

Hi James,

Can you try running a simple command to list your JASMIN home directory like:

globus-url-copy -cred /work/n02/n02/jweber/cred.jasmin -vb -list gsiftp://gridftp1.jasmin.ac.uk/home/users/JASMIN_USERNAME/

If that does work I’d then try listing the GWS directory:
gsiftp://gridftp1.jasmin.ac.uk/gws/nopw/j04/sheffield/jweber/mass_extracts/from_archer2/u-cu408/20280401T0000Z/

Check your credential is still valid:
openssl x509 -in /work/n02/n02/jweber/cred.jasmin -noout -startdate -enddate

Failing all that I’d probably try regenerating the credential and trying again:
$UMDIR/bin/onlineca-get-cert-wget.sh -U https://slcs.jasmin.ac.uk/certificate/ -l JASMIN_USERNAME -o /work/n02/n02/jweber/cred.jasmin

The certs and signing_policy are generated automatically and can’t/shouldn’t be edited.
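
The checks above could be strung together roughly as follows. This is a sketch rather than a tested tool: the function name and return codes are invented for illustration, it uses only the commands already quoted, and it assumes module load postproc has been done so that globus-url-copy is on the PATH.

```shell
# Hypothetical wrapper around the checks above. Arguments:
#   $1 credential file, $2 JASMIN username, $3 absolute GWS directory path
# Returns 1, 2 or 3 to show which step failed, 0 if all succeeded.
check_jasmin_gridftp() {
    cred=$1
    user=$2
    gws_dir=$3
    server=gsiftp://gridftp1.jasmin.ac.uk

    # 1. Print the credential's validity window, to compare with today's date.
    openssl x509 -in "$cred" -noout -startdate -enddate || return 1

    # 2. List the JASMIN home directory.
    globus-url-copy -cred "$cred" -vb -list "$server/home/users/$user/" || return 2

    # 3. List the group workspace directory the transfer writes to.
    globus-url-copy -cred "$cred" -vb -list "$server$gws_dir" || return 3
}

# Usage:
#   check_jasmin_gridftp /work/n02/n02/jweber/cred.jasmin JASMIN_USERNAME \
#       /gws/nopw/j04/sheffield/jweber/mass_extracts/from_archer2/u-cu408/20280401T0000Z/
```

If it stops at step 2 or 3 with a credential that is still in date, regenerating the credential as described above is the next thing to try.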

Cheers,
Ros.

Hi Ros,

Thanks for your help with this.

When I run the following on Archer2
globus-url-copy -cred /work/n02/n02/jweber/cred.jasmin -vb -list gsiftp://gridftp1.jasmin.ac.uk/home/users/jmw240/

I get the error:

error: globus_ftp_client: the server responded with an error

500 500-Command failed. : an authentication operation failed

500-globus_gsi_callback_module: Could not verify credential

500-globus_gsi_callback_module: Error with signing policy

500-globus_gsi_callback_module: Error in OLD GAA code: The subject of the certificate “/DC=uk/DC=ac/DC=jasmin/O=STFC RAL/CN=JASMIN” does not match the signing policies defined in /home/users/jmw240/.globus/certificates/7ed47087.signing_policy

500 End.

However, when I omit the final /, i.e. globus-url-copy -cred /work/n02/n02/jweber/cred.jasmin -vb -list gsiftp://gridftp1.jasmin.ac.uk/home/users/jmw240

I get the response:
gsiftp://gridftp1.jasmin.ac.uk/home/users/jmw240
jmw240

Is this the correct response?

The same is true when I try:
globus-url-copy -cred /work/n02/n02/jweber/cred.jasmin -vb -list gsiftp://gridftp1.jasmin.ac.uk/gws/nopw/j04/sheffield/jweber/mass_extracts/from_archer2/u-cu408

and

globus-url-copy -cred /work/n02/n02/jweber/cred.jasmin -vb -list gsiftp://gridftp1.jasmin.ac.uk/gws/nopw/j04/sheffield/jweber/mass_extracts/from_archer2/u-cu408/

The former seems to work and the latter fails.

The credential was still valid but I have regenerated it anyway although I don’t think this affected things.

Cheers,

James

Hi James,

It works both ways for me. So something weird is going on for you.

ARCHER2-23cab> globus-url-copy -cred /work/n02/n02/ros/cred.jasmin -vb -list gsiftp://gridftp1.jasmin.ac.uk/home/users/rshatcher
gsiftp://gridftp1.jasmin.ac.uk/home/users/rshatcher
    rshatcher 

ARCHER2-23cab> globus-url-copy -cred /work/n02/n02/ros/cred.jasmin -vb -list gsiftp://gridftp1.jasmin.ac.uk/home/users/rshatcher/
gsiftp://gridftp1.jasmin.ac.uk/home/users/rshatcher/
    .Xauthority 
    .bash_history 
    .bash_logout 
    .bash_profile 
    .bash_profile.160920 
    .bash_profile.260417
    ....

And you get the same with the new credfile? - wasn’t quite sure from your message whether this did or did not change the responses.

If it definitely didn’t make any difference, I’ll have a chat with someone at JASMIN.
Cheers,
Ros.