Trying out JULES Fluxnet u-al752

Hi, I’ve been following this tutorial for getting started with JULES and running u-al752: “Tutorial for setting up Rose/Cylc in order to run JULES on CEDA JASMIN” (Land Surface Processes Group),
but I can’t get the suite to run (step 15 of the tutorial).

I get this message when I try to run the suite:

[FAIL] file:bin/parallelise.py=source=fcm:jules.x_br/pkg/karinawilliams/r6715_python_packages/share/parallelise.py@19283: bad or missing value

From looking on the helpdesk, I think I’m getting the same error as in the thread “Running JULES FLUXNET suite u-al752”. There, they suggested rose suite-run --new, which seemed to work for Richa, but it isn’t helping me.

Can someone please help? Also happy to provide more information!

Thanks, Ayesha

(Also not sure if this will help):

Previously I tried to run the suite and got this error:
[FAIL] ssh -oBatchMode=yes -oConnectTimeout=10 -n postproc env\ ROSE_VERSION=2019.01.3\ CYLC_VERSION=7.9.6\ bash\ -l\ -c\ '"$0"\ "$@"'\ rose\ suite-run\ -vv\ -n\ u-al752\ --run=run\ --remote=uuid=9c3372d2-540e-41db-9b06-77a22aa4a4d9,now-str=20220624T084519Z # return-code=255, stderr=

[FAIL] ssh: Could not resolve hostname postproc: Name or service not known

(I’m running this through sci1 and then cylc1)
I then found this post on the helpdesk https://cms-helpdesk.ncas.ac.uk/t/could-not-resolve-hostname-jasmin/427/2, so I changed my ~/.ssh/config file to

Host *
ServerAliveInterval 30

Host jlogin1
Hostname login1.jasmin.ac.uk
User ash221
IdentityFile ~/.ssh/id_rsa_jasmin
ForwardAgent yes
ControlMaster auto
ControlPath /tmp/ssh-socket-%r@%h-%p
ControlPersist yes

Host xfer?
Hostname %h.jasmin.ac.uk
User ash221
ForwardAgent yes

Host sci? cylc1
HostName %h.jasmin.ac.uk
User ash221

Host sci* cylc*
User ash221
IdentityFile ~/.ssh/id_rsa_jasmin
ForwardAgent yes
ProxyCommand ssh -Y jlogin1 -W %h:%p
ControlMaster auto
ControlPath /tmp/ssh-socket-%r@%h-%p
ControlPersist yes

After that, the first error message (about resolving hostnames) went away, but now I get the new error message (about the bad or missing value).

Thanks, Ayesha

Hi Ayesha:
The ‘file:…’ ‘bad or missing value’ error that you’re getting suggests that your access to the Met Office Science Repository Service (MOSRS) is not configured properly.

You should have been prompted for your MOSRS password when you logged in to cylc1. If you weren’t prompted for your MOSRS password when you logged in, then you should make sure your MOSRS configuration is set up correctly by following the appropriate steps in the tutorial.

Sometimes, the MOSRS password caching times out or something, and the easiest thing to do is log out of cylc1, and then log back in and you will be prompted for your MOSRS password again. An alternative is to type mosrs-cache-password at the cylc1 command-line prompt.

Once you have your MOSRS password properly cached, then you can test this by typing this command at the cylc1 command-line prompt:
fcm export fcm:jules.x_br/pkg/karinawilliams/r6715_python_packages/share/parallelise.py@19283
This command will copy the parallelise.py file (revision 19283) from MOSRS into your working directory, where you can view or edit it if you’d like.
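If it helps, that test can be wrapped in a small shell function that reports the outcome explicitly. This is only a sketch: the name check_mosrs_access is made up for illustration, it assumes fcm is on your PATH (as on cylc1), and it treats any failed export as an access problem, although a failure could also have another cause such as a network issue. The export is done in a temporary directory so that an existing copy of parallelise.py does not get in the way.

```shell
# Hypothetical helper: try the export from this thread and report the result.
# Assumption: fcm is available, and a failed export most likely means the
# MOSRS password is not cached (other causes are possible).
check_mosrs_access() {
    url='fcm:jules.x_br/pkg/karinawilliams/r6715_python_packages/share/parallelise.py@19283'
    # Run in a subshell so the caller's working directory is left unchanged.
    if ( cd "$(mktemp -d)" && fcm export "$url" ) >/dev/null; then
        echo "MOSRS access OK"
    else
        echo "MOSRS access failed"
        return 1
    fi
}
```

Typing check_mosrs_access at the prompt after defining it prints one of the two messages; any error text from fcm itself still appears on the terminal.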

Furthermore, since you’re getting an error about the postproc host: if you run cd ~/roses/u-al752; grep -r postproc * you can see that this involves the file ~/roses/u-al752/site/suite.rc.MONSOON. This suggests that in your ~/roses/u-al752/rose-suite.conf configuration file, LOCATION is still set for the MONSOON supercomputer, whereas it should be set to CEDA_JASMIN. If you haven’t made that change yet, there are probably other steps in the tutorial that you haven’t reached yet either.
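A quick way to see what LOCATION is currently set to is to print the matching line from rose-suite.conf. The sketch below is illustrative only: the function name suite_location is made up, and it assumes the setting sits on its own line beginning with LOCATION= (the exact quoting in your file may differ).

```shell
# Hypothetical helper: print the LOCATION line from a Rose suite config file.
# Assumption: the setting is written on one line starting with LOCATION=
suite_location() {
    conf="${1:-$HOME/roses/u-al752/rose-suite.conf}"
    # grep exits non-zero if no LOCATION line is found.
    grep -E '^[[:space:]]*LOCATION[[:space:]]*=' "$conf"
}
```

With no argument it reads ~/roses/u-al752/rose-suite.conf; if the line it prints still mentions MONSOON, the tutorial step that changes it has not been applied.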
Patrick

Thank you Patrick!

When I type ‘mosrs-cache-password’ I get this:

Met Office Science Repository Service password:
Subversion password cached
Traceback (most recent call last):
  File "/usr/lib64/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib64/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/apps/jasmin/metomi/rose-2019.01.3/lib/python/rosie/ws_client_cli.py", line 25, in <module>
    from rosie.ws_client import (
  File "/apps/jasmin/metomi/rose-2019.01.3/lib/python/rosie/ws_client.py", line 36, in <module>
    from rosie.ws_client_auth import RosieWSClientAuthManager
  File "/apps/jasmin/metomi/rose-2019.01.3/lib/python/rosie/ws_client_auth.py", line 38, in <module>
    import gtk
  File "/usr/lib64/python2.7/site-packages/gtk-2.0/gtk/__init__.py", line 64, in <module>
    _init()
  File "/usr/lib64/python2.7/site-packages/gtk-2.0/gtk/__init__.py", line 52, in _init
    _gtk.init_check()
RuntimeError: could not open display
Error: Unable to access Rosie with given password
Run "mosrs-cache-password" to try caching your password again

I know it’s the right password (I can log in elsewhere with it, and without caching the password I normally can access rose).

Thanks for the tip about copying the parallelise file to my working directory. I then changed the rose-suite.conf file so I was using that copy instead.

When I tried running the suite again, the same issue appeared as before but now for the fluxnet_evaluation file. So in the end I copied:
fluxnet_evaluation.py
jules.py
make_time_coord.py
parallelise.py
and now the suite appears to be running! I’m not sure if this is the best workaround - but hopefully it’ll work now!

Another quick question: xmessage doesn’t work for me roughly 70% of the time, and then I can’t access the GUI. I’ve tried logging out and back in again, and sometimes it works and sometimes it doesn’t. I’m using a Mac, and I’ve quit, restarted and played around with XQuartz, but nothing has helped.

So, for example, now I’m running u-al752 but I have no way of checking the progress. I type ‘rose suite-scan’ and I can see it’s running, but ‘rose sgc’ does nothing and when I type ‘rose bush start’ it says:
Traceback (most recent call last):
  File "/usr/lib64/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib64/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/apps/jasmin/metomi/rose-2019.01.3/lib/python/rose/bush.py", line 22, in <module>
    import cherrypy
ImportError: No module named cherrypy

Could you please help with this as well?

Thanks,

Ayesha

Hi Ayesha
If you’re getting those errors when you type mosrs-cache-password on cylc1, maybe you should instead log out and log back in, and it should then automatically ask for your MOSRS password.

If you still get those same errors after logging out and logging back in, and after entering your MOSRS password, then maybe it’s an Xwindows issue. Are you logging in to cylc1 from login1 or from login2? Are you using ssh -AX everywhere or ssh -AY everywhere? You might try ssh -AY to the lower-security login2, and then ssh -AY to cylc1. This is not a permanent solution, since it’s better to use ssh -AX on login1.

I have never used rose bush start. I get the same error as you do for this. Maybe you need to do a rose suite-run --restart?

Does xclock work for you from cylc1?

Using fcm export for each of the files from the command line is a makeshift solution. I am glad it helps in the short term, but it is not the thing to do in the long term.

Patrick

Hi Patrick,

I log into cylc1 from login1 and then sci1. I use -AX for all of them. When I log in I type in my password each time and that seems to be fine, but I can’t use mosrs-cache-password.

Thank you for the tip about login2. rose sgc works now, but my fcm_make failed. When I look at job.err I get this:

[FAIL] config-file=/work/scratch-pw/ash221/cylc-run/u-al752/work/1/fcm_make/fcm-make.cfg:2
[FAIL] config-file= - https://code.metoffice.gov.uk/svn/jules/main/trunk/etc/fcm-make/make.cfg@21512
[FAIL] https://code.metoffice.gov.uk/svn/jules/main/trunk/etc/fcm-make/make.cfg@21512: cannot load config file
[FAIL] https://code.metoffice.gov.uk/svn/jules/main/trunk/etc/fcm-make/make.cfg@21512: not found
[FAIL] svn: E170013: Unable to connect to a repository at URL 'https://code.metoffice.gov.uk/svn/jules/main/trunk/etc/fcm-make/make.cfg'
[FAIL] svn: E215004: No more credentials or we tried too many times.
[FAIL] Authentication failed

[FAIL] fcm make -f /work/scratch-pw/ash221/cylc-run/u-al752/work/1/fcm_make/fcm-make.cfg -C /home/users/ash221/cylc-run/u-al752/share/fcm_make -j 4 # return-code=1
2022-06-27T11:38:11+01:00 CRITICAL - failed/EXIT

Could you please help me?

Thanks,

Ayesha

Hi Ayesha
If you have to type your ssh passphrase each time you ssh, then you don’t have your ssh set up properly. You should be able to ssh without typing in your passphrase each time.
This needs to be fixed.

For your fcm_make error, it looks like your MOSRS password caching is not set up properly. You should set this up so that you don’t need to type the mosrs-cache-password command at the command prompt after you log in.
Patrick

Hi Ayesha
And you said that you’re logging in to sci1. This will not work for MOSRS or for Rose/cylc. You will need to run the suite from cylc1.
Patrick

Hi Ayesha:
Are things working OK now?
Patrick

Hi Patrick,

Sorry for the delay!

I was logging into sci1 then cylc1, so I was running the suite from cylc1. But I’ll go straight to cylc1 in the future.

Also, sorry, to clarify: I’m not typing in my ssh passphrase every time I ssh; I’m typing in my MOSRS password each time I ssh into cylc1.

mosrs-cache-password isn’t working for me at all. I type it in and I get this error:

[ash221@cylc1 ~]$ mosrs-cache-password
Met Office Science Repository Service password:
Subversion password cached
Traceback (most recent call last):
  File "/usr/lib64/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib64/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/apps/jasmin/metomi/rose-2019.01.3/lib/python/rosie/ws_client_cli.py", line 25, in <module>
    from rosie.ws_client import (
  File "/apps/jasmin/metomi/rose-2019.01.3/lib/python/rosie/ws_client.py", line 36, in <module>
    from rosie.ws_client_auth import RosieWSClientAuthManager
  File "/apps/jasmin/metomi/rose-2019.01.3/lib/python/rosie/ws_client_auth.py", line 38, in <module>
    import gtk
  File "/usr/lib64/python2.7/site-packages/gtk-2.0/gtk/__init__.py", line 64, in <module>
    _init()
  File "/usr/lib64/python2.7/site-packages/gtk-2.0/gtk/__init__.py", line 52, in _init
    _gtk.init_check()
RuntimeError: could not open display
Error: Unable to access Rosie with given password
Run "mosrs-cache-password" to try caching your password again

This error appears whether I’m on ssh -AY login2 or ssh -AX login1.

So for the fcm_make error, is this all because I can’t cache my MOSRS password?

Also, I now can’t open the GUI on login2 either (and I could beforehand). Is this an issue with my computer or with cylc1? It means I can’t check on the suite/what the errors are.

Thanks,

Ayesha

Hi Ayesha,

Make sure you are using ssh -AY or ssh -AX in all your ssh commands, otherwise your display won’t be set.
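One way to check this after each hop is to look at the DISPLAY environment variable, which ssh sets when X11 forwarding is active. The function below is a minimal sketch (the name check_display is made up for illustration). Note that a set DISPLAY is necessary but not sufficient: X programs can still fail to open the display even when the variable has a value.

```shell
# Hypothetical helper: report whether DISPLAY is set in the current shell.
# A set DISPLAY only shows that forwarding was requested; it does not prove
# that X programs can actually open the display.
check_display() {
    if [ -n "${DISPLAY:-}" ]; then
        echo "DISPLAY is set to $DISPLAY"
    else
        echo "DISPLAY is not set"
        return 1
    fi
}
```

If this prints "DISPLAY is not set" on cylc1, one of the ssh commands along the way was probably run without -X or -Y.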

Yes, the fcm_make error is because your MOSRS password isn’t cached.

Regards
Ros

Hi Ros,

Thank you. I am using ssh -AY and ssh -AX and I still can’t cache my MOSRS password.

Is there anything you can please recommend? I keep getting the same error when I try ‘mosrs-cache-password’.

Thanks,

Ayesha

Hi, I’ve been looking on the helpdesk and it looks like I get the same error as here: https://cms-helpdesk.ncas.ac.uk/t/login-issue-on-jasmin/267?u=ash221

I’ve tried restarting Xquartz but it doesn’t fix the issue. Is there anything else I can do?

Thanks,

Ayesha

Hi Ayesha:
When you’re on cylc1, can you see the clock when you type xclock?

If restarting Xquartz doesn’t help, have you tried restarting your Mac?
Patrick

Hi Patrick,

When I type xclock I get this:
[ash221@cylc1 ~]$ xclock
Error: Can’t open display: localhost:22.0

Restarting my Mac also doesn’t work (I’ve actually just got a new Mac and I’m getting the same issue as I did on my old one).

Is there any way we can have a Teams meeting or something equivalent? I’m really not sure what else to try to fix it.

Thanks,

Ayesha

Hi Ayesha:
Sometimes, especially when people try X Windows over ssh from overseas, ssh -AX doesn’t work with login1. In this case, ssh -AY with login2 might work. Can you try these steps from your Mac:

  1. ssh -AY login2.jasmin.ac.uk
  2. ssh -AY cylc1
  3. xclock

In steps 1 & 2, using ssh -AY both times is important. If you have your ssh key set up properly, you shouldn’t need to type anything else. If you still get an Error: Can’t open display in step 3, please let me know.

Which version of XQuartz are you using? Which version of macOS are you using?
Patrick

Hi,

In step 3 I get this:
Error: Can’t open display: localhost:22.0

I have macOS Monterey Version 12.4
I have XQuartz Version 2.8.24.

XQuartz works on my computer (e.g. xclock works) but doesn’t work once I ssh in to jasmin.

Thanks,

Ayesha

Hi Ayesha:
I use macOS Catalina 10.15, and Xquartz 2.7.11.

What happens when you try to run xclock on sci1 or sci2 instead of cylc1?

I have never tried this, but maybe this graphical linux desktop is a possibility?:

Do you think that this might work for you?
Patrick

Hi Patrick, I’ve been working on it and it works now! I’m not really sure what I did, but I can now cache my password and the GUI opens.

Thanks,

Ayesha

Hi Ayesha:
Yay! Congratulations! I am glad Xwindows works for you now, and that the Cylc GUI opens on cylc1. Does the suite u-al752 run better now, without failing when it tries to open the files that are downloaded from MOSRS?
Patrick