I am working with JULES on JASMIN (suite u-an231). I started a run last week and it was working fine, but it stopped working at spinup18. I didn’t change anything while it was running, so I don’t understand what went wrong. I tried restarting the failed jobs using cylc reset --state=waiting, but that didn’t work. I also tried stopping the suite and restarting the run from where it failed (with first run=false), but that didn’t work either.
The error message I get:
/bin/sh: rose-jules-run: command not found
[FAIL] rose-jules-run <<'STDIN'
[FAIL]
[FAIL] 'STDIN' # return-code=127
2022-04-05T08:01:35Z CRITICAL - failed/EXIT
Thanks, Elise:
I can try to help.
What is your JASMIN username?
Have you already done chmod -R g+rX on your home directory, so I can see your files?
If not, can you do that? If there is anything private or confidential, you might consider moving it to a directory that wouldn’t then be readable by everyone in the group.
Patrick
Thanks, Elise:
I am looking now.
I am looking at ~elisedhn/cylc-run/u-an231/log/job/1/, and I see a bunch of tasks for different locations in there, with the suffix of _spin_01. Which location was it? I thought you had already done 18 spin-up cycles?
Patrick
Hi Elise:
I think you need to run the fcm_make app. In your last run, you had the BUILD turned off. And I don’t see anything right now in your ~elisedhn/cylc-run/u-an231/share/fcm_make/build/bin/ directory.
If you look at the file ~elisedhn/cylc-run/u-an231/log/job/1/amacayacu_standard_spin_01/01/job, it says that it is using the command rose task-run --path=share/fcm_make/build/bin, so that path is where I looked for scripts or binaries, but I didn’t see any.
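A minimal sketch of that check, in Python: return code 127 is the shell's "command not found" status, so the question is whether an executable with the expected name exists in the directory passed via --path. The helper name find_executable is hypothetical, and the demo uses a temporary directory standing in for share/fcm_make/build/bin. It assumes, as this thread suggests, that rose-jules-run is expected to appear there once the build has run.

```python
import os
import tempfile
from pathlib import Path


def find_executable(name, search_dirs):
    """Return the first executable file called `name` in search_dirs, or None."""
    for directory in search_dirs:
        candidate = Path(directory) / name
        if candidate.is_file() and os.access(candidate, os.X_OK):
            return candidate
    return None


# Demo in a temporary directory (a stand-in for the real suite directory).
with tempfile.TemporaryDirectory() as tmp:
    bin_dir = Path(tmp) / "share" / "fcm_make" / "build" / "bin"
    bin_dir.mkdir(parents=True)

    # Empty bin directory: the situation described above, before any build.
    before_build = find_executable("rose-jules-run", [bin_dir])

    # Simulate what a build step would leave behind: an executable script.
    script = bin_dir / "rose-jules-run"
    script.write_text("#!/bin/sh\nexit 0\n")
    script.chmod(0o755)

    after_build = find_executable("rose-jules-run", [bin_dir])
    after_build_name = after_build.name if after_build else None

print("before build:", before_build)
print("after build:", after_build_name)
```

Against a real suite, the same function could be pointed at Path.home() / "cylc-run" / "u-an231" / "share" / "fcm_make" / "build" / "bin"; a result of None would be consistent with the "command not found" failure.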
I am trying to run your suite now as ~pmcguire/roses/u-an231_elisedhn2 from the cylc1 VM. It is waiting to BUILD in the SLURM short-serial queue. I guess it might take a while right now to get through the queue.
Patrick
Hi, none of the locations worked (there are 7 locations and 5 types of runs for each). Yes, at first I started the run with 20 spin-ups, and it failed on the 18th. I tried to restart each failed job individually, but that didn’t work. So I stopped the suite and tried restarting the whole suite with 3 spin-ups (to get to 20 in total), starting from the outputs of the 17th spin-up.
Hi Elise
When I ran your suite as ~pmcguire/roses/u-an231_elisedhn2 with BUILD enabled, the fcm_make app worked fine, and spin-up cycle 1 was able to complete for all but one of the locations. I just retriggered the task that failed during the original submission, congo_dynamical_spin_01. Maybe the suite will then progress further. You can see the log files in ~pmcguire/cylc-run/u-an231_elisedhn2.
Patrick
About the new problem, your suite.rc file defines the SPIN_INITFILE as containing the c20c string: SPIN_INITFILE = {{ site }}/{{ site }}_{{ perturb|lower }}_c20c.dump.{{ C20C_RUNTIME_START }}0101.0.nc
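To see what filename that template expands to, here is a rough Python sketch that mimics the Jinja2 expression with an f-string (the lower filter simply lower-cases its input). The function name spin_initfile is hypothetical, and the start year 1850 is an illustrative assumption, not a value taken from the suite. The site and perturbation names are borrowed from the task name amacayacu_standard_spin_01 mentioned earlier.

```python
def spin_initfile(site, perturb, c20c_runtime_start):
    """Mimic the suite.rc template for SPIN_INITFILE.

    {{ perturb|lower }} in Jinja2 corresponds to perturb.lower() here.
    """
    return (
        f"{site}/{site}_{perturb.lower()}_c20c.dump."
        f"{c20c_runtime_start}0101.0.nc"
    )


# Illustrative values only; 1850 is an assumed start year.
example = spin_initfile("amacayacu", "STANDARD", 1850)
print(example)
```

Whatever the actual values are, the expanded name always contains the c20c string, so a dump file written under a different naming pattern would not be found under this name.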
Hi Patrick,
Yes, the second problem was to do with how the restart was set up, and I worked around it by renaming some files.
Thanks for your help,
Elise