Permission denied (publickey) failure

Hi,

I’m trying to run u-cl073 but it fails with:

[FAIL] ssh -oBatchMode=yes -n login.archer2.ac.uk env\ ROSE_VERSION=2019.01.3\ CYLC_VERSION=7.8.7\ bash\ -l\ -c\ '"$0"\ "$@"'\ rose\ suite-run\ -vv\ -n\ u-cl073\ --run=run\ --remote=uuid=326b7a89-9948-411f-87cb-bab4894997e1,now-str=20220225T103207Z,root-dir='$DATADIR' # return-code=255, stderr=

[FAIL] Permission denied (publickey).

Following the advice in ticket (https://cms-helpdesk.ncas.ac.uk/t/host-key-verification-failed/466/4) I removed all the archer entries in the known hosts file and then re-added them by sshing in using:

ssh -i ~/.ssh/id_rsa_archerum jweber@loginx.archer2.ac.uk

for x = 1, 2, 3, 4 and blank (no number).

When I did this, the command line response was:

-bash-4.1$ ssh -i ~/.ssh/id_rsa_archerum jweber@login4.archer2.ac.uk
The authenticity of host 'login4.archer2.ac.uk (193.62.216.45)' can't be established.
RSA key fingerprint is 1c:0f:77:c8:b0:b0:c9:8d:4a:90:cf:31:e2:a6:76:ae.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'login4.archer2.ac.uk' (RSA) to the list of known hosts.
Enter passphrase for key '/home/jmw240/.ssh/id_rsa_archerum':
PTY allocation request failed on channel 0
Comand rejected by policy. Not in authorised list
Connection to login4.archer2.ac.uk closed.

I think this is as it should be?

However, the error message above remains when I try to run the suite. Is there something else I should be doing?

Many thanks for your help.

James

Hi James,

You need to make sure the ssh-key is attached to your ssh-agent. It should not prompt for your passphrase when you ssh on the command line. Nor should you have to supply the option -i ~/.ssh/id_rsa_archerum.

Please try running ssh-add ~/.ssh/id_rsa_archerum and, once you’ve added the key, run ssh login.archer2.ac.uk to check that you are not prompted for any input and that you get the expected PTY allocation.... response.
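
A minimal sketch of how to check the agent before retrying, assuming OpenSSH's documented exit statuses for ssh-add -l (0 = keys listed, 1 = no keys, 2 = no agent reachable). The helper name describe_agent_status is hypothetical and only used for illustration:

```shell
#!/usr/bin/env bash
# Sketch: report whether an ssh-agent is reachable and holds any keys.
# describe_agent_status is a hypothetical helper, not part of OpenSSH.

describe_agent_status() {
  # Map the exit status of `ssh-add -l` to a short message.
  case "$1" in
    0) echo "agent has at least one key loaded" ;;
    1) echo "agent is running but has no keys: run ssh-add" ;;
    2) echo "cannot contact an ssh-agent: start one first" ;;
    *) echo "unexpected status $1 (is ssh-add installed?)" ;;
  esac
}

# Query the agent quietly and keep only the exit status.
rc=0
ssh-add -l >/dev/null 2>&1 || rc=$?
describe_agent_status "$rc"

# If no key is loaded, the two commands from this reply are:
#   ssh-add ~/.ssh/id_rsa_archerum
#   ssh login.archer2.ac.uk
```

If the first message is printed and the ssh command still asks for a passphrase, the key held by the agent is probably not the one the login node expects.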

Regards,
Ros.

Hi Ros,

Many thanks, I added my id_rsa_archerum and when I then ran ssh login.archer2.ac.uk I didn’t have to put in anything else.

However, my suite has failed on atmos_main with a large number of backtrace errors which I haven’t seen before and I’m not sure how to resolve them. The equivalent of this suite runs fine on Monsoon so I suspect it is an issue with how I have converted it to run on Archer2.

For example:
[380] exceptions: [backtrace]: ( 14) : _start in file /home/abuild/rpmbuild/BUILD/glibc-2.26/csu/…/sysdeps/x86_64/start.S line 122

Would you be able to advise?

Thanks,

James

James

It’s failing in ukca_main, but oddly, the backtrace doesn’t say where. Could you try running with PRINT_STATUS=PrStatus_Diag?

Grenville

Hi Grenville,

I’ve rerun with PrStatus_Diag but haven’t made much progress in understanding the error, I’m afraid.

Best,

James

James

Not much help there - I’ll need to dig. This may take some time.

Grenville

One immediate thing that I’ve seen is that you’ve asked for 8 nodes and 64 tasks per node, but then ntasks is 504 (8*64=512). This may not be the issue, but it is worth changing.
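
The mismatch can be checked with a few lines of shell arithmetic. This is only a sketch using the three numbers quoted above; the variable names are illustrative and are not the suite's actual settings:

```shell
#!/usr/bin/env bash
# Sketch: compare the requested task slots with the tasks actually launched.
nodes=8            # nodes requested
tasks_per_node=64  # tasks per node requested
ntasks=504         # tasks the job launches

capacity=$((nodes * tasks_per_node))   # total slots requested
unused=$((capacity - ntasks))          # slots left idle
# Smallest number of nodes that can hold ntasks (ceiling division).
nodes_needed=$(( (ntasks + tasks_per_node - 1) / tasks_per_node ))

echo "capacity=$capacity ntasks=$ntasks unused=$unused nodes_needed=$nodes_needed"
```

With these values 8 slots are left unused, and 8 nodes are still needed to hold 504 tasks at 64 per node.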

James

Please allow me read access to

/work/n02/n02/jweber/GHGs/trgas_rcp_historical_2010.dat

(and any other files that I need to read)

Grenville

Hi Grenville,

Thanks for looking at this. I have run chmod 777 on all the folders I think you need to look at.

Re David’s point about cores, I didn’t change this from the Monsoon setup - do you think this could be an issue?

James

James
Well, I’ve mangled your start file!

A combination of

[file:$ROSE_DATA/${RUNID}a.astart]
mode=symlink+
source=$AINITIAL

and

archer2 xios_test$ ls -lrt /work/n02/n02/jweber/dump_files/cc298a.da20100101_00
-rwxrwxrwx 1 jweber n02 19944304640 Mar  2 16:07 /work/n02/n02/jweber/dump_files/cc298a.da20100101_00

has resulted in me mangling /work/n02/n02/jweber/dump_files/cc298a.da20100101_00

I’ve ended up in this sorry state, having run a reconfiguration.

lrwxrwxrwx 1 grenvill n02   52 Mar  2 09:54 cl073a.ainitial -> /work/n02/n02/jweber/dump_files/cc298a.da20100101_00
drwxr-sr-x 4 grenvill n02 4096 Mar  2 09:59 etc
lrwxrwxrwx 1 grenvill n02   52 Mar  2 15:39 cl073a.astart -> /work/n02/n02/jweber/dump_files/cc298a.da20100101_00

Can you put a new start file back?

(It’s never a good idea to allow the world write access to your files)
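
A minimal sketch of the mechanism, using throwaway files in a temporary directory (all names here are hypothetical): a write to a symlink goes through to the file it points at, so a world-writable target can be overwritten by anyone whose run links to it.

```shell
#!/usr/bin/env bash
# Sketch: writing "to" a symlink overwrites the file it points at.
workdir=$(mktemp -d)

# Stand-in for the original start dump, made world-writable.
printf 'original dump\n' > "$workdir/start_file"
chmod 777 "$workdir/start_file"

# Stand-in for the symlink installed into another user's run directory.
ln -s "$workdir/start_file" "$workdir/run.astart"

# A program that opens the link for writing follows it to the target.
printf 'reconfigured dump\n' > "$workdir/run.astart"

target_contents=$(cat "$workdir/start_file")
echo "target now contains: $target_contents"

rm -rf "$workdir"
```

Without write permission for other users on the target, the same write would fail with a permission error instead of replacing the dump.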

Grenville

Hi Grenville,

I’ve copied a new start file into /work/n02/n02/jweber/dump_files and called it cc298a.da20100101_00_cp.

Yes, sorry, I’ve corrected the permissions now, so I have read/write/execute permission and others have read/execute only.
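
For reference, a sketch of that kind of permission change on a throwaway file, assuming GNU coreutils (stat -c) as found on Linux systems. The file name is hypothetical, and whether group write access should also be removed is a choice, not something settled in this thread:

```shell
#!/usr/bin/env bash
# Sketch: remove write access for group and others, keep read/execute.
demo_dir=$(mktemp -d)
touch "$demo_dir/dump"

chmod 777 "$demo_dir/dump"                 # world-writable, as before
before=$(stat -c '%a' "$demo_dir/dump")    # octal mode (GNU stat)

chmod go-w "$demo_dir/dump"                # drop write for group and others
after=$(stat -c '%a' "$demo_dir/dump")

echo "before=$before after=$after"

rm -rf "$demo_dir"
```

On a directory tree, chmod -R go-w applies the same change recursively.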

Thanks,

James

Hi James

It’s stopping in ukca_main1-ukca_main1.F90 here:

all_ntp(i)%data_3d(:,:,:) = all_tracers(:,:,:,n_no2)

The problem goes away if instead of passing all_ntp as an argument to ukca_main1, you use the module (all_ntp is available in the module). So I did:

SUBROUTINE ukca_main1(timestep_number, current_time,                           &
                      all_tracers, all_ntpx,                                    &
                      error_code, previous_time,                               &
                      error_message, error_routine)
...
TYPE(ntp_type), INTENT(IN OUT) :: all_ntpx(:)

...
USE ukca_ntp_mod

and that ran OK (I didn’t check the results).

Not sure why this has happened - maybe the compiler is confused.

Grenville

I’m assuming that all_ntp is the one in the module of course!

Hi Grenville,

Thank you for looking into this. I can’t think of a reason why this would be causing a problem (aside from compiler issues) as this branch works on Monsoon.

I’m a bit confused regarding the changes you have made. I can find the SUBROUTINE code block in ukca_main1-ukca_main1.F90 and I assume the lines starting with TYPE and USE in your response are additions. Are these additions added directly after the SUBROUTINE block? If you have a diff of your changes to the branch I could tell from that.

Many thanks,

James

James

Please see

/home/grenville/branches/vn12.0_CS2_SOA_improvements_w_ST_DMS_v3_ARCHER2/src/atmosphere/UKCA/ukca_main1-ukca_main1.F90

I’ve not checked if the MO have come across this on their new machine.

Grenville

Hi Grenville,

Thank you, I’ve copied over your ukca_main1-ukca_main1.F90 changes (except the WRITE statements) to my branch. However, when I run u-cl073 I’m afraid I now get a different error. This looks a bit like one of the “known failure point” errors but I’m not certain. Have you seen this before?

[1] ???!!!???!!!???!!!???!!!???!!! ERROR ???!!!???!!!???!!!???!!!???!!!
[1] ? Error code: 2
[1] ? Error from routine: GLUE_CONV_6A
[1] ? Error message: Deep conv went to model top at point 20 in seg 2 on call 1
[1] ? Error from processor: 356
[1] ? Error number: 86

Thanks,

James

That was the error I got when I mangled the start file - please check that you haven’t done the same.
Output from my run is /home/n02/n02/grenvill/cylc-run/cl073.

Grenville

Hi Grenville,

I reran with a clean dump file (cc298a.da20100101_00_v3, copied over from JASMIN). I checked a few fields using xconv after the run and they look OK. I’m a bit confused: is the corruption of the dump file a separate issue from the one you solved with the modifications to ukca_main1-ukca_main1.F90? If so, are there additional changes I need to make to my suite or branch?

Thanks,

James

James

I ran from a reconfigured /work/n02/n02/jweber/dump_files/cc298a.da20100101_00_cp.

I don’t believe the GLUE_CONV_6A error is related to the all_ntp error.

Grenville

Hi Grenville,

Sorry, I think I’m misunderstanding something. I also get the GLUE_CONV_6A error when I run with cc298a.da20100101_00_cp. Do I need to do something to the dump file in advance of running? Otherwise, I’m not certain what I’m doing wrong as I think I have the same branch changes and suite setup as you.

Thanks,

James