Thanks, Ros, I’ve applied for access to hpxfer with that IP address.
James
Hi Ros,
I was granted access to the hpxfer server on Friday evening. From the jasmin login node, I can run the command “ssh -Y hpxfer2.jasmin.ac.uk” successfully but “ssh -Y hpxfer1.jasmin.ac.uk” does not work.
However, if I run “ssh -Y hpxfer2.jasmin.ac.uk” on Archer2, I get the following error.
jweber@hpxfer2.jasmin.ac.uk: Permission denied (publickey,gssapi-keyex,gssapi-with-mic)
I set off a rerun of u-dk384 last night but received the error below on pptransfer.
Lmod is automatically replacing “cce/15.0.0” with “gcc/11.2.0”.
Due to MODULEPATH changes, the following have been reloaded:
[WARN] file:atmospp.nl: skip missing optional source: namelist:moose_arch
[WARN] file:atmospp.nl: skip missing optional source: namelist:script_arch
[WARN] [SUBPROCESS]: Command: globus-url-copy -vb -cd -r -cc 4 -sync -cred /work/n02/n02/jweber/cred.jasmin /work/n02/n02/jweber/cylc-run/u-dk384/share/cycle/19800101T0000Z/u-dk384/19800101T0000Z/ gsiftp://gridftp1.jasmin.ac.uk/gws/nopw/j04/sheffield/jweber/archive/u-dk384/19800101T0000Z/
[SUBPROCESS]: Error = 1:
error: Unable to check destination url for sync: gsiftp://gridftp1.jasmin.ac.uk/gws/nopw/j04/sheffield/jweber/archive/u-dk384/19800101T0000Z/
globus_ftp_client: the server responded with an error
530 530-Login incorrect. : globus_gss_assist: Gridmap lookup failure: Could not map /DC=uk/DC=ac/DC=jasmin/O=STFC RAL/CN=jmw240
530-
530 End.
[WARN] Transfer command failed: globus-url-copy -vb -cd -r -cc 4 -sync -cred /work/n02/n02/jweber/cred.jasmin /work/n02/n02/jweber/cylc-run/u-dk384/share/cycle/19800101T0000Z/u-dk384/19800101T0000Z/ gsiftp://gridftp1.jasmin.ac.uk/gws/nopw/j04/sheffield/jweber/archive/u-dk384/19800101T0000Z/
[ERROR] transfer.py: Unknown Error - Return Code=1
[FAIL] Command Terminated
[FAIL] Terminating PostProc…
[FAIL] transfer.py <<‘STDIN’
[FAIL]
[FAIL] ‘STDIN’ # return-code=1
2024-10-28T05:08:16Z CRITICAL - failed/EXIT
Is there something else I need to do to set up the link between Archer2 and Jasmin’s hpxfer?
Thanks,
James
I added a “/” to the end of the Jasmin path and the archive appears to have worked. I’m not quite sure why, but it looks like everything is sorted. Thanks for your help.
James
Hi,
Just a quick follow up question. I’ve had a couple of pptransfer failures with the last one returning the error message below.
[WARN] file:atmospp.nl: skip missing optional source: namelist:moose_arch
[WARN] file:atmospp.nl: skip missing optional source: namelist:script_arch
Traceback (most recent call last):
  File "/mnt/lustre/a2fs-work1/work/y07/shared/umshared/metomi/cylc-7.8.12/bin/cylc-message", line 140, in <module>
    main()
  File "/mnt/lustre/a2fs-work1/work/y07/shared/umshared/metomi/cylc-7.8.12/bin/cylc-message", line 136, in main
    return record_messages(suite, task_job, messages)
  File "/mnt/lustre/a2fs-work1/work/y07/shared/umshared/metomi/cylc-7.8.12/lib/cylc/task_message.py", line 72, in record_messages
    handle.write('%s %s - %s\n' % (event_time, severity, message))
IOError: [Errno 122] Disk quota exceeded
I can’t work out if this is a disk quota error on Jasmin or Archer2. Could you advise?
Thanks,
James
James
The n02 /work space filled up on ARCHER2. Please try again now.
Grenville
Thanks, Grenville. It seems to be working now.
James
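A note on reading that traceback: `[Errno 122]` is the Linux value of `EDQUOT`, and the write that failed was made by `cylc-message`, which runs from the ARCHER2 work filesystem. That points to the ARCHER2 side, which matches Grenville's reply. Below is a minimal Python sketch for telling a quota failure apart from other I/O errors. The helper name `is_quota_error` is hypothetical and not part of Cylc or postproc.

```python
import errno
import os


def is_quota_error(exc: OSError) -> bool:
    """Return True if the OS error is 'Disk quota exceeded' (EDQUOT)."""
    return exc.errno == errno.EDQUOT


# Build the same kind of error that the traceback reports.
quota_err = OSError(errno.EDQUOT, os.strerror(errno.EDQUOT))
# A full disk (ENOSPC) is a different condition from an exceeded quota.
full_disk_err = OSError(errno.ENOSPC, os.strerror(errno.ENOSPC))

print(is_quota_error(quota_err))
print(is_quota_error(full_disk_err))
```

The check compares against `errno.EDQUOT` rather than the literal 122 because the numeric value differs between operating systems.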
Hi,
u-dk655, a copy of u-dk384, has failed on atmos_main with the error below, which I haven’t seen before:
MPICH ERROR [Rank 432] [job id 7996193.0] [Thu Nov 7 22:00:24 2024] [nid006161] - Abort(405395855) (rank 432 in comm 352): Fatal error in PMPI_Send: Other MPI error, error stack:
PMPI_Send(163)…: MPI_Send(buf=0xdea75c8, count=86, dtype=USER, dest=479, tag=8, comm=0xc400000b) failed
PMPI_Send(143)…:
MPIR_Wait_impl(41)…:
MPID_Progress_wait(193)…:
MPIDI_Progress_test(89)…:
MPIDI_OFI_handle_cq_error(1062): OFI poll failed (ofi_events.c:1064:MPIDI_OFI_handle_cq_error:Input/output error - transport retry counter exceeded)
srun: error: nid006132: tasks 1-55: Exited with exit code 143
srun: launch/slurm: _step_signal: Terminating StepId=7996193.0
slurmstepd: error: *** STEP 7996193.0 ON nid006132 CANCELLED AT 2024-11-07T22:00:25 ***
srun: error: nid006132: tasks 56-62: Exited with exit code 143
srun: error: nid006134: tasks 63-76: Exited with exit code 143
srun: error: nid006134: tasks 77-125: Terminated
srun: error: nid006163: tasks 441-503: Terminated
srun: error: nid006156: tasks 189-251: Terminated
srun: error: nid006160: tasks 315-377: Terminated
srun: error: nid006161: tasks 378-440: Terminated
srun: error: nid006159: tasks 252-314: Terminated
srun: error: nid006151: tasks 126-188: Terminated
srun: error: nid006132: task 0: Terminated
srun: Force Terminated StepId=7996193.0
[FAIL] um-atmos <<‘STDIN’
[FAIL]
[FAIL] ‘STDIN’ # return-code=143
2024-11-07T22:00:44Z CRITICAL - failed/EXIT
Given problems with Archer2 storage, is this a problem with my suite or a wider issue?
Thanks,
James
James
We see these errors from time to time - a retrigger of the failed task should be enough.
Grenville
Thanks, Grenville, atmos_main completed after retriggering.
James
Hi, separate issue but it looks like the bottleneck for this run is the postproc task rather than atmos_main. I’ve cut down the output diagnostics and increased the memory for the POSTPROC_RESOURCE but is there anything else I can do to speed it up?
Thanks,
James
Hi,
pptransfer has failed with the error message below, even after several retrigger attempts:
[WARN] file:atmospp.nl: skip missing optional source: namelist:moose_arch
[WARN] file:atmospp.nl: skip missing optional source: namelist:script_arch
[WARN] [SUBPROCESS]: Command: globus-url-copy -vb -cd -r -cc 4 -sync -cred /work/n02/n02/jweber/cred.jasmin /work/n02/n02/jweber/cylc-run/u-dk655/share/cycle/19871001T0000Z/u-dk655/19871001T0000Z/ gsiftp://gridftp1.jasmin.ac.uk/gws/nopw/j04/sheffield/jweber/archive/u-dk655/19871001T0000Z/
[SUBPROCESS]: Error = 1:
error: Unable to list destination directory for sync: gsiftp://gridftp1.jasmin.ac.uk/gws/nopw/j04/sheffield/jweber/archive/u-dk655/19871001T0000Z/
globus_ftp_client: the server responded with an error
500 500-Command failed. : an authentication operation failed
500-globus_gsi_callback_module: Could not verify credential
500-globus_gsi_callback_module: Error with signing policy
500-globus_gsi_callback_module: Error in OLD GAA code: The subject of the certificate “/DC=uk/DC=ac/DC=jasmin/O=STFC RAL/CN=JASMIN” does not match the signing policies defined in /home/users/jmw240/.globus/certificates/7ed47087.signing_policy
500 End.
[WARN] Transfer command failed: globus-url-copy -vb -cd -r -cc 4 -sync -cred /work/n02/n02/jweber/cred.jasmin /work/n02/n02/jweber/cylc-run/u-dk655/share/cycle/19871001T0000Z/u-dk655/19871001T0000Z/ gsiftp://gridftp1.jasmin.ac.uk/gws/nopw/j04/sheffield/jweber/archive/u-dk655/19871001T0000Z/
[ERROR] transfer.py: Unknown Error - Return Code=1
[FAIL] Command Terminated
[FAIL] Terminating PostProc…
[FAIL] transfer.py <<‘STDIN’
[FAIL]
[FAIL] ‘STDIN’ # return-code=1
2024-11-18T14:32:09Z CRITICAL - failed/EXIT
My credential was still in date but I have renewed it so it runs to 18th Dec anyway.
Jasmin has been having issues but I think they are now resolved. Does this look like a Jasmin or Archer2 issue?
Cheers,
James
Hi James,
There have been some storage issues at CEDA since the 14th, especially affecting /gws/nopw/. They seem to have fixed this today but have advised that the service is still ‘at risk’.
Thanks, Mohit. I’ll wait until Jasmin is back to normal operation and try again.
James
Hi,
Jasmin seems to be working now but I’m getting a different error with pptransfer:
Lmod is automatically replacing “cce/15.0.0” with “gcc/11.2.0”.
Due to MODULEPATH changes, the following have been reloaded:
[WARN] file:atmospp.nl: skip missing optional source: namelist:moose_arch
[WARN] file:atmospp.nl: skip missing optional source: namelist:script_arch
Traceback (most recent call last):
  File "/work/n02/n02/jweber/cylc-run/u-dk655/share/fcm_make_pp/build/bin/transfer.py", line 451, in <module>
    main()
  File "/work/n02/n02/jweber/cylc-run/u-dk655/share/fcm_make_pp/build/bin/transfer.py", line 429, in main
    transfer = Transfer()
  File "/work/n02/n02/jweber/cylc-run/u-dk655/share/fcm_make_pp/build/bin/transfer.py", line 48, in __init__
    self._globus_cli = nl_transfer.globus_cli
AttributeError: 'ReadNamelist' object has no attribute 'globus_cli'
[FAIL] transfer.py <<‘STDIN’
[FAIL]
[FAIL] ‘STDIN’ # return-code=1
2024-11-20T09:43:16Z CRITICAL - failed/EXIT
Any suggestions?
Cheers,
James
Hi James,
I made some changes recently to the postproc_2.4_archer2_jasmin_rewrite branch in preparation for moving everyone to using Globus for data transfer.
Please set the revision number of the branch to 5092 to pick up the version compatible with your current setup.
Regards,
Ros
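A note for anyone hitting the same AttributeError: it generally means the code read an attribute (here `globus_cli`) that the namelist object built from the app configuration does not have. That would fit a newer branch revision running against an older app setup, as Ros describes. The sketch below reproduces the pattern in plain Python. The `ReadNamelist` class in it is a hypothetical stand-in, not the real postproc class, and `example_item` is an invented item name.

```python
class ReadNamelist:
    """Hypothetical stand-in: exposes each namelist item as an attribute."""

    def __init__(self, items):
        for key, value in items.items():
            setattr(self, key, value)


# A namelist written for the older code: it has no 'globus_cli' item.
old_namelist = ReadNamelist({"example_item": True})

# Newer code that reads the item directly fails with AttributeError.
try:
    old_namelist.globus_cli
except AttributeError as exc:
    message = str(exc)
    print(message)

# Reading with a default tolerates the missing item instead of failing.
globus_cli = getattr(old_namelist, "globus_cli", False)
print(globus_cli)
```

In the thread, the mismatch is resolved the other way round: by pinning the branch to a revision that matches the existing app configuration.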
Ok, thanks, Ros. Will this require a full stop and restart of the suite, with fcm_make_pp etc. run again?
James
Sorry, does this change only need to be made in fcm_make_pp, or do I need to change the line below in postproc/rose-app.conf too?
meta=/home/n02/n02/ros/meta/postproc_2.4_archer2_jasmin_rewrite/rose-meta/archive_and_meaning/postproc/pp24_t588
Thanks,
James
Hi James,
Just in the fcm_make_pp app.
And yes, you will need to re-run the fcm_make_pp & fcm_make2_pp tasks, either by stopping the suite and starting again or by reloading the suite and retriggering those tasks.
Cheers,
Ros.
Thanks, Ros. I made what I thought was the correct change but u-dk655 has failed again on pptransfer.
Could you check my suite to see if I have implemented the revision correctly?
Cheers,
James
Hi James,
As soon as ARCHER2/PUMA2 come back I’ll take a look.
Regards,
Ros.