Permission denied (publickey) failure

James_Weber · 25 February 2022 10:40

Hi,

I’m trying to run u-cl073 but it fails with:

[FAIL] ssh -oBatchMode=yes -n login.archer2.ac.uk env\ ROSE_VERSION=2019.01.3\ CYLC_VERSION=7.8.7\ bash\ -l\ -c\ '"$0"\ "$@"'\ rose\ suite-run\ -vv\ -n\ u-cl073\ --run=run\ --remote=uuid=326b7a89-9948-411f-87cb-bab4894997e1,now-str=20220225T103207Z,root-dir='$DATADIR' # return-code=255, stderr=

[FAIL] Permission denied (publickey).

Following the advice in ticket (https://cms-helpdesk.ncas.ac.uk/t/host-key-verification-failed/466/4) I removed all the archer entries in the known hosts file and then re-added them by sshing in using:

ssh -i ~/.ssh/id_rsa_archerum jweber@loginx.archer2.ac.uk

For x=1,2,3,4 and blank

When I did this, the command line response was:

-bash-4.1$ ssh -i ~/.ssh/id_rsa_archerum jweber@login4.archer2.ac.uk
The authenticity of host 'login4.archer2.ac.uk (193.62.216.45)' can't be established.
RSA key fingerprint is 1c:0f:77:c8:b0:b0:c9:8d:4a:90:cf:31:e2:a6:76:ae.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'login4.archer2.ac.uk' (RSA) to the list of known hosts.
Enter passphrase for key '/home/jmw240/.ssh/id_rsa_archerum':
PTY allocation request failed on channel 0
Comand rejected by policy. Not in authorised list
Connection to login4.archer2.ac.uk closed.

I think this is as it should be?

However, the error message above remains when I try to run the suite. Is there something else I should be doing?

Many thanks for your help.

James

RosalynHatcher · 25 February 2022 10:56

Hi James,

You need to make sure the ssh-key is attached to your ssh-agent. It should not prompt for your passphrase when you ssh on the command line. Nor should you have to supply the option -i ~/.ssh/id_rsa_archerum.

Please try running ssh-add ~/.ssh/id_rsa_archerum and, once you’ve added the key, run ssh login.archer2.ac.uk to check that you are not prompted for any input and that you get the expected “PTY allocation request failed” response.

Regards,
Ros.
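
A minimal sketch of checking the agent before adding the key. It relies on the documented exit status of ssh-add -l: 0 when keys are listed, 1 when the agent holds no identities, 2 when no agent can be contacted. The helper name agent_status is illustrative only and is not part of any tool.

```shell
#!/bin/sh
# Report the state of the ssh-agent from the exit status of `ssh-add -l`.
# agent_status is an illustrative helper name, not part of OpenSSH.
agent_status() {
    ssh-add -l >/dev/null 2>&1
    case $? in
        0) echo "agent has keys loaded" ;;
        1) echo "agent running, no keys loaded" ;;
        *) echo "no agent reachable" ;;   # exit 2, or ssh-add itself missing
    esac
}

agent_status
# If no keys are loaded, add the key named in this thread and retry the login:
#   ssh-add ~/.ssh/id_rsa_archerum
#   ssh login.archer2.ac.uk
```

If the last line of output is "no agent reachable", an agent has to be started in the current shell before ssh-add can succeed.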

James_Weber · 25 February 2022 11:52

Hi Ros,

Many thanks, I added my id_rsa_archerum and when I then ran ssh login.archer2.ac.uk I didn’t have to put in anything else.

However, my suite has failed on atmos_main with a large number of backtrace errors which I haven’t seen before and I’m not sure how to resolve them. The equivalent of this suite runs fine on Monsoon so I suspect it is an issue with how I have converted it to run on Archer2.

For example:
[380] exceptions: [backtrace]: ( 14) : _start in file /home/abuild/rpmbuild/BUILD/glibc-2.26/csu/…/sysdeps/x86_64/start.S line 122

Would you be able to advise?

Thanks,

James

grenville · 25 February 2022 16:57

James

It’s failing in ukca_main, but oddly, the backtrace doesn’t say where. Could you try running with PRINT_STATUS=PrStatus_Diag?

Grenville

James_Weber · 28 February 2022 15:35

Hi Grenville,

I’ve rerun with PrStatus_Diag but haven’t made much progress in understanding the error I’m afraid.

Best,

James

grenville · 2 March 2022 09:20

James

Not much help there - I’ll need to dig. This may take some time.

Grenville

dcase · 2 March 2022 09:39

One immediate thing that I’ve seen is that you’ve asked for 8 nodes and 64 tasks per node, but then ntasks is 504 (8*64=512). This may not be the issue, but worth changing.
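
The mismatch can be checked with a few lines of shell arithmetic. This is only a sketch using the numbers quoted above; the variable names are illustrative and are not the suite's actual settings.

```shell
#!/bin/sh
# Values quoted in the post above; the variable names are illustrative only.
nodes=8
tasks_per_node=64
ntasks=504

capacity=$((nodes * tasks_per_node))   # task slots in the requested allocation
idle=$((capacity - ntasks))            # slots requested but left without a task

echo "capacity=$capacity ntasks=$ntasks idle=$idle"
```

With these values the request leaves 8 task slots unused.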

grenville · 2 March 2022 11:48

James

Please allow me read access to

/work/n02/n02/jweber/GHGs/trgas_rcp_historical_2010.dat

(and any other files that I need to read)

Grenville

James_Weber · 2 March 2022 12:32

Hi Grenville,

Thanks for looking at this. I have run chmod 777 on all the folders I think you need to look at.

Re David’s point about cores, I didn’t change this from the Monsoon setup. Do you think this could be an issue?

James

grenville · 2 March 2022 17:08

James
Well, I’ve mangled your start file!

A combination of

[file:$ROSE_DATA/${RUNID}a.astart]
mode=symlink+
source=$AINITIAL

and

archer2 xios_test$ ls -lrt /work/n02/n02/jweber/dump_files/cc298a.da20100101_00
-rwxrwxrwx 1 jweber n02 19944304640 Mar  2 16:07 /work/n02/n02/jweber/dump_files/cc298a.da20100101_00

has resulted in me mangling /work/n02/n02/jweber/dump_files/cc298a.da20100101_00

I’ve ended up in this sorry state, having run a reconfiguration.

lrwxrwxrwx 1 grenvill n02   52 Mar  2 09:54 cl073a.ainitial -> /work/n02/n02/jweber/dump_files/cc298a.da20100101_00
drwxr-sr-x 4 grenvill n02 4096 Mar  2 09:59 etc
lrwxrwxrwx 1 grenvill n02   52 Mar  2 15:39 cl073a.astart -> /work/n02/n02/jweber/dump_files/cc298a.da20100101_00

Can you put a new start file back?

(It’s never a good idea to allow the world write access to your files)

Grenville
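
A small sketch of the mechanism described above, on the assumption that the reconfiguration wrote its output to the astart path: a write to a symlink lands in the file it points to. The demo runs in a temporary directory with illustrative file names and a single user, so it shows only the write-through. On ARCHER2 it was the world-write permission on the target that allowed another user's job to do the same.

```shell
#!/bin/sh
# Throwaway demo in a temporary directory; all file names are illustrative.
demo=$(mktemp -d)
echo "original dump" > "$demo/dump_file"
chmod 666 "$demo/dump_file"                   # writable by anyone
ln -s "$demo/dump_file" "$demo/astart"        # stands in for the astart symlink

echo "reconfigured output" > "$demo/astart"   # the write follows the link
cat "$demo/dump_file"                         # the target now holds the new content
```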

James_Weber · 2 March 2022 17:53

Hi Grenville,

I’ve copied a new start file into /work/n02/n02/jweber/dump_files and called it cc298a.da20100101_00_cp.

Yes, sorry, I’ve corrected the permissions now, so I have read/write/execute permission and others have read/execute only.

Thanks,

James
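
As a sketch, the permissions described here (read/write/execute for the owner, read/execute for everyone else) correspond to mode 755. The example below applies them to a throwaway file in a temporary directory; the file name is illustrative.

```shell
#!/bin/sh
# Throwaway demo file; the name is illustrative.
demo=$(mktemp -d)
touch "$demo/dump_file"
chmod 777 "$demo/dump_file"            # the world-writable state
chmod u=rwx,go=rx "$demo/dump_file"    # equivalent to chmod 755

# Show only the permission string from the long listing.
perms=$(ls -l "$demo/dump_file" | cut -c1-10)
echo "$perms"
```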

grenville · 4 March 2022 14:54

Hi James

It’s stopping in ukca_main1-ukca_main1.F90 here:

all_ntp(i)%data_3d(:,:,:) = all_tracers(:,:,:,n_no2)

The problem goes away if instead of passing all_ntp as an argument to ukca_main1, you use the module (all_ntp is available in the module). So I did:

SUBROUTINE ukca_main1(timestep_number, current_time,                           &
                      all_tracers, all_ntpx,                                    &
                      error_code, previous_time,                               &
                      error_message, error_routine)
...
TYPE(ntp_type), INTENT(IN OUT) :: all_ntpx(:)

...
USE ukca_ntp_mod

and that ran OK (I didn’t check the results).

Not sure why this has happened - maybe the compiler is confused.

Grenville

grenville · 4 March 2022 15:40

I’m assuming that all_ntp is the one in the module of course!

James_Weber · 8 March 2022 11:05

Hi Grenville,

Thank you for looking into this. I can’t think of a reason why this would be causing a problem (aside from compiler issues) as this branch works in Monsoon.

I’m a bit confused regarding the changes you have made. I can find the SUBROUTINE code block in ukca_main1-ukca_main1.F90 and I assume the lines starting with TYPE and USE in your response are additions. Are these additions added directly after the SUBROUTINE block? If you have a diff of your changes to the branch I could tell from that.

Many thanks,

James

grenville · 8 March 2022 11:23

James

Please see

/home/grenville/branches/vn12.0_CS2_SOA_improvements_w_ST_DMS_v3_ARCHER2/src/atmosphere/UKCA/ukca_main1-ukca_main1.F90

I’ve not checked if the MO have come across this on their new machine.

Grenville

James_Weber · 8 March 2022 15:00

Hi Grenville,

Thank you, I’ve copied over your ukca_main1-ukca_main1.F90 changes (except the WRITE statements) to my branch. However, when I run u-cl073 I’m afraid I now get a different error. This looks a bit like one of the “known failure point” errors but I’m not certain. Have you seen this before?

[1] ???!!!???!!!???!!!???!!!???!!! ERROR ???!!!???!!!???!!!???!!!???!!!
[1] ? Error code: 2
[1] ? Error from routine: GLUE_CONV_6A
[1] ? Error message: Deep conv went to model top at point 20 in seg 2 on call 1
[1] ? Error from processor: 356
[1] ? Error number: 86

Thanks,

James

grenville · 8 March 2022 15:26

That was the error I got when I mangled the start file - please check that you haven’t done the same.
Output from my run is /home/n02/n02/grenvill/cylc-run/cl073.

Grenville

James_Weber · 9 March 2022 09:47

Hi Grenville,

I reran with a clean dump file (cc298a.da20100101_00_v3 copied over from Jasmin). I checked a few fields using xconv after the run and they look ok. I’m a bit confused: is the corruption of the dump file a separate issue from the one you solved with the modifications to ukca_main1-ukca_main1.F90? If so, are there additional changes I need to make to my suite or branch?

Thanks,

James

grenville · 9 March 2022 11:21

James

I ran from a reconfigured /work/n02/n02/jweber/dump_files/cc298a.da20100101_00_cp.

I don’t believe the GLUE_CONV_6A error is related to the all_ntp error.

Grenville

James_Weber · 9 March 2022 12:42

Hi Grenville,

Sorry, I think I’m misunderstanding something. I also get the GLUE_CONV_6A error when I run with cc298a.da20100101_00_cp. Do I need to do something to the dump file in advance of running? Otherwise, I’m not certain what I’m doing wrong as I think I have the same branch changes and suite setup as you.

Thanks,

James
