Vn13.3 AMIP suite on Archer2

Thanks, Ros, I’ve applied for access to hpxfer with that IP address.

James

Hi Ros,

I was granted access to the hpxfer server on Friday evening. From the jasmin login node, I can run the command “ssh -Y hpxfer2.jasmin.ac.uk” successfully but “ssh -Y hpxfer1.jasmin.ac.uk” does not work.

However, if I run “ssh -Y hpxfer2.jasmin.ac.uk” from Archer2, I get the following error:

jweber@hpxfer2.jasmin.ac.uk: Permission denied (publickey,gssapi-keyex,gssapi-with-mic)

I set off a rerun of u-dk384 last night but received the error below on pptransfer.

Lmod is automatically replacing “cce/15.0.0” with “gcc/11.2.0”.

Due to MODULEPATH changes, the following have been reloaded:

  1. cray-mpich/8.1.23

[WARN] file:atmospp.nl: skip missing optional source: namelist:moose_arch
[WARN] file:atmospp.nl: skip missing optional source: namelist:script_arch
[WARN] [SUBPROCESS]: Command: globus-url-copy -vb -cd -r -cc 4 -sync -cred /work/n02/n02/jweber/cred.jasmin /work/n02/n02/jweber/cylc-run/u-dk384/share/cycle/19800101T0000Z/u-dk384/19800101T0000Z/ gsiftp://gridftp1.jasmin.ac.uk/gws/nopw/j04/sheffield/jweber/archive/u-dk384/19800101T0000Z/
[SUBPROCESS]: Error = 1:

error: Unable to check destination url for sync: gsiftp://gridftp1.jasmin.ac.uk/gws/nopw/j04/sheffield/jweber/archive/u-dk384/19800101T0000Z/
globus_ftp_client: the server responded with an error
530 530-Login incorrect. : globus_gss_assist: Gridmap lookup failure: Could not map /DC=uk/DC=ac/DC=jasmin/O=STFC RAL/CN=jmw240
530-
530 End.

[WARN] Transfer command failed: globus-url-copy -vb -cd -r -cc 4 -sync -cred /work/n02/n02/jweber/cred.jasmin /work/n02/n02/jweber/cylc-run/u-dk384/share/cycle/19800101T0000Z/u-dk384/19800101T0000Z/ gsiftp://gridftp1.jasmin.ac.uk/gws/nopw/j04/sheffield/jweber/archive/u-dk384/19800101T0000Z/
[ERROR] transfer.py: Unknown Error - Return Code=1
[FAIL] Command Terminated
[FAIL] Terminating PostProc…
[FAIL] transfer.py <<'__STDIN__'
[FAIL]
[FAIL] '__STDIN__' # return-code=1
2024-10-28T05:08:16Z CRITICAL - failed/EXIT

Is there something else I need to do to set up the link between Archer2 and Jasmin’s hpxfer?

Thanks,

James

I added a “/” to the end of the Jasmin path and the archive appears to have worked. I’m not quite sure why, but it looks like everything is sorted. Thanks for your help.

James
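For anyone making the same change: the thread does not establish why the trailing “/” helped. As far as I know, globus-url-copy treats a URL ending in “/” as a directory, so the following is only a minimal sketch of the edit itself. The helper name ensure_trailing_slash and the archive_root value are hypothetical, modelled on the destination shown in the log above.

```python
def ensure_trailing_slash(path):
    """Return path with exactly one trailing '/' so it reads as a directory."""
    return path.rstrip("/") + "/"


# Hypothetical archive root, modelled on the destination in the log above.
archive_root = "/gws/nopw/j04/sheffield/jweber/archive"
print(ensure_trailing_slash(archive_root))
```

Applying the helper to a path that already ends in “/” leaves it unchanged, so it is safe to run on whatever value is currently configured.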

Hi,

Just a quick follow-up question. I’ve had a couple of pptransfer failures, with the last one returning the error message below.

[WARN] file:atmospp.nl: skip missing optional source: namelist:moose_arch
[WARN] file:atmospp.nl: skip missing optional source: namelist:script_arch
Traceback (most recent call last):
  File "/mnt/lustre/a2fs-work1/work/y07/shared/umshared/metomi/cylc-7.8.12/bin/cylc-message", line 140, in <module>
    main()
  File "/mnt/lustre/a2fs-work1/work/y07/shared/umshared/metomi/cylc-7.8.12/bin/cylc-message", line 136, in main
    return record_messages(suite, task_job, messages)
  File "/mnt/lustre/a2fs-work1/work/y07/shared/umshared/metomi/cylc-7.8.12/lib/cylc/task_message.py", line 72, in record_messages
    handle.write('%s %s - %s\n' % (event_time, severity, message))
IOError: [Errno 122] Disk quota exceeded

I can’t work out if this is a disk quota error on Jasmin or Archer2. Could you advise?

Thanks,

James

James

The n02 /work space filled up on ARCHER2. Please try again now.

Grenville
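The traceback above is raised by a local write inside cylc’s task_message.py, which is consistent with the quota being on the ARCHER2 side rather than on Jasmin. As a rough sketch of how to tell the two common “no space” conditions apart, the snippet below maps an OSError to a short label and tries a small write in a given directory. The function names are hypothetical, and a tiny probe write may succeed even when a larger one would hit the quota, so treat the result as a hint only.

```python
import errno
import tempfile


def classify_write_error(exc):
    """Map an OSError from a failed write to a short, readable label."""
    if exc.errno == errno.EDQUOT:   # "Disk quota exceeded" (Errno 122 on Linux)
        return "quota exceeded"
    if exc.errno == errno.ENOSPC:   # "No space left on device"
        return "filesystem full"
    return "other"


def probe_write(directory):
    """Try a small write in directory and report how it failed, if at all."""
    try:
        with tempfile.NamedTemporaryFile(dir=directory) as handle:
            handle.write(b"quota probe\n")
            handle.flush()
    except OSError as exc:
        return classify_write_error(exc)
    return "ok"
```

Running probe_write on the suite’s cylc-run directory on ARCHER2 would be one way to check whether writes there are currently accepted.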

Thanks, Grenville. It seems to be working now.

James

Hi,

u-dk655, a copy of u-dk384, has failed on atmos_main with the error below, which I haven’t seen before:

MPICH ERROR [Rank 432] [job id 7996193.0] [Thu Nov 7 22:00:24 2024] [nid006161] - Abort(405395855) (rank 432 in comm 352): Fatal error in PMPI_Send: Other MPI error, error stack:
PMPI_Send(163)…: MPI_Send(buf=0xdea75c8, count=86, dtype=USER, dest=479, tag=8, comm=0xc400000b) failed
PMPI_Send(143)…:
MPIR_Wait_impl(41)…:
MPID_Progress_wait(193)…:
MPIDI_Progress_test(89)…:
MPIDI_OFI_handle_cq_error(1062): OFI poll failed (ofi_events.c:1064:MPIDI_OFI_handle_cq_error:Input/output error - transport retry counter exceeded)

srun: error: nid006132: tasks 1-55: Exited with exit code 143
srun: launch/slurm: _step_signal: Terminating StepId=7996193.0
slurmstepd: error: *** STEP 7996193.0 ON nid006132 CANCELLED AT 2024-11-07T22:00:25 ***
srun: error: nid006132: tasks 56-62: Exited with exit code 143
srun: error: nid006134: tasks 63-76: Exited with exit code 143
srun: error: nid006134: tasks 77-125: Terminated
srun: error: nid006163: tasks 441-503: Terminated
srun: error: nid006156: tasks 189-251: Terminated
srun: error: nid006160: tasks 315-377: Terminated
srun: error: nid006161: tasks 378-440: Terminated
srun: error: nid006159: tasks 252-314: Terminated
srun: error: nid006151: tasks 126-188: Terminated
srun: error: nid006132: task 0: Terminated
srun: Force Terminated StepId=7996193.0
[FAIL] um-atmos <<'__STDIN__'
[FAIL]
[FAIL] '__STDIN__' # return-code=143
2024-11-07T22:00:44Z CRITICAL - failed/EXIT

Given the recent problems with Archer2 storage, is this a problem with my suite or a wider issue?

Thanks,

James

James

We see these errors from time to time - a retrigger of the failed task should be enough.

Grenville
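If it helps to separate this kind of failure from a genuine model crash, here is a minimal sketch that scans the job.err text for the message seen above. The signature list is an assumption based only on this one thread, and the function name is hypothetical; other transient failures will print different messages.

```python
# Substrings treated as signs of a transient interconnect failure.
# Assumption: taken from the single error reported in this thread.
TRANSIENT_SIGNATURES = (
    "transport retry counter exceeded",
)


def looks_transient(job_err_text):
    """Return True if the job error text contains a known transient signature."""
    return any(signature in job_err_text for signature in TRANSIENT_SIGNATURES)


sample = (
    "MPIDI_OFI_handle_cq_error(1062): OFI poll failed "
    "(ofi_events.c:1064:MPIDI_OFI_handle_cq_error:Input/output error "
    "- transport retry counter exceeded)"
)
print(looks_transient(sample))
```

A match only suggests that a retrigger is worth trying first; a failure that repeats at the same point after retriggering is more likely to be a problem with the suite.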

Thanks, Grenville, atmos_main completed after retriggering.

James

Hi, on a separate issue, it looks like the bottleneck for this run is the postproc task rather than atmos_main. I’ve cut down the output diagnostics and increased the memory for POSTPROC_RESOURCE, but is there anything else I can do to speed it up?

Thanks,

James

Hi,

pptransfer has failed with the error message below, even after several retrigger attempts:

[WARN] file:atmospp.nl: skip missing optional source: namelist:moose_arch
[WARN] file:atmospp.nl: skip missing optional source: namelist:script_arch
[WARN] [SUBPROCESS]: Command: globus-url-copy -vb -cd -r -cc 4 -sync -cred /work/n02/n02/jweber/cred.jasmin /work/n02/n02/jweber/cylc-run/u-dk655/share/cycle/19871001T0000Z/u-dk655/19871001T0000Z/ gsiftp://gridftp1.jasmin.ac.uk/gws/nopw/j04/sheffield/jweber/archive/u-dk655/19871001T0000Z/
[SUBPROCESS]: Error = 1:

error: Unable to list destination directory for sync: gsiftp://gridftp1.jasmin.ac.uk/gws/nopw/j04/sheffield/jweber/archive/u-dk655/19871001T0000Z/
globus_ftp_client: the server responded with an error
500 500-Command failed. : an authentication operation failed
500-globus_gsi_callback_module: Could not verify credential
500-globus_gsi_callback_module: Error with signing policy
500-globus_gsi_callback_module: Error in OLD GAA code: The subject of the certificate “/DC=uk/DC=ac/DC=jasmin/O=STFC RAL/CN=JASMIN” does not match the signing policies defined in /home/users/jmw240/.globus/certificates/7ed47087.signing_policy
500 End.

[WARN] Transfer command failed: globus-url-copy -vb -cd -r -cc 4 -sync -cred /work/n02/n02/jweber/cred.jasmin /work/n02/n02/jweber/cylc-run/u-dk655/share/cycle/19871001T0000Z/u-dk655/19871001T0000Z/ gsiftp://gridftp1.jasmin.ac.uk/gws/nopw/j04/sheffield/jweber/archive/u-dk655/19871001T0000Z/
[ERROR] transfer.py: Unknown Error - Return Code=1
[FAIL] Command Terminated
[FAIL] Terminating PostProc…
[FAIL] transfer.py <<'__STDIN__'
[FAIL]
[FAIL] '__STDIN__' # return-code=1
2024-11-18T14:32:09Z CRITICAL - failed/EXIT

My credential was still in date but I have renewed it so it runs to 18th Dec anyway.

Jasmin has been having issues but I think they are now resolved. Does this look like a Jasmin or Archer2 issue?

Cheers,

James

Hi James,

There have been some storage issues at CEDA since the 14th, especially affecting /gws/nopw/. They seem to have fixed this today but have advised that the service is still ‘at risk’.

Centre for Environmental Data Analysis - Status

Thanks, Mohit. I’ll wait until Jasmin is back to normal operation and try again.

James

Hi,

Jasmin seems to be working now but I’m getting a different error with pptransfer:

Lmod is automatically replacing “cce/15.0.0” with “gcc/11.2.0”.

Due to MODULEPATH changes, the following have been reloaded:

  1. cray-mpich/8.1.23

[WARN] file:atmospp.nl: skip missing optional source: namelist:moose_arch
[WARN] file:atmospp.nl: skip missing optional source: namelist:script_arch
Traceback (most recent call last):
  File "/work/n02/n02/jweber/cylc-run/u-dk655/share/fcm_make_pp/build/bin/transfer.py", line 451, in <module>
    main()
  File "/work/n02/n02/jweber/cylc-run/u-dk655/share/fcm_make_pp/build/bin/transfer.py", line 429, in main
    transfer = Transfer()
  File "/work/n02/n02/jweber/cylc-run/u-dk655/share/fcm_make_pp/build/bin/transfer.py", line 48, in __init__
    self._globus_cli = nl_transfer.globus_cli
AttributeError: 'ReadNamelist' object has no attribute 'globus_cli'
[FAIL] transfer.py <<'__STDIN__'
[FAIL]
[FAIL] '__STDIN__' # return-code=1
2024-11-20T09:43:16Z CRITICAL - failed/EXIT

Any suggestions?

Cheers,

James

Hi James,

I made some changes recently to the postproc_2.4_archer2_jasmin_rewrite branch in preparation for moving everyone to using Globus for data transfer.

Please set the revision number of the branch to 5092 to pick up the version compatible with your current setup.

Regards,
Ros

OK, thanks, Ros. Will this require a full stop and restart of the suite, with fcm_make_pp etc. run again?

James

Sorry, does this change only need to be made in fcm_make_pp, or do I need to change the line below in postproc/rose-app.conf too?

meta=/home/n02/n02/ros/meta/postproc_2.4_archer2_jasmin_rewrite/rose-meta/archive_and_meaning/postproc/pp24_t588

Thanks,

James

Hi James,

Just in the fcm_make_pp app.
And yes, you will need to re-run the fcm_make_pp and fcm_make2_pp tasks, either by stopping the suite and starting it again, or by reloading the suite and retriggering those tasks.

Cheers,
Ros.
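For the reload-and-retrigger route, the commands might look roughly like the output of the sketch below. It only builds and prints the command lines; it does not run them. Assumptions: the suite uses Rose with Cylc 7 (as the cylc-7.8.12 paths in the earlier traceback suggest), “rose suite-run --reload” is run from the suite’s working copy directory, and the cycle point passed in is the one at which the fcm_make tasks ran. The function name and the example cycle point are hypothetical.

```python
def reload_and_retrigger_commands(suite, cycle_point,
                                  tasks=("fcm_make_pp", "fcm_make2_pp")):
    """Build command lines to reload a running suite and retrigger build tasks."""
    # Reinstall the edited configuration into the running suite.
    commands = [["rose", "suite-run", "--reload"]]
    # Cylc 7 task IDs have the form NAME.CYCLE_POINT.
    for task in tasks:
        commands.append(["cylc", "trigger", suite, "%s.%s" % (task, cycle_point)])
    return commands


# Assumed cycle point: replace with the one shown for the build tasks in the suite.
for command in reload_and_retrigger_commands("u-dk655", "19800101T0000Z"):
    print(" ".join(command))
```

The second trigger should wait until fcm_make_pp has succeeded, since fcm_make2_pp depends on it; pptransfer can then be retriggered once both have finished.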

Thanks, Ros. I made what I thought was the correct change but u-dk655 has failed again on pptransfer.

Could you check my suite to see if I have implemented the revision correctly?

Cheers,

James

Hi James,

As soon as ARCHER2/PUMA2 come back I’ll take a look.

Regards,
Ros.