Since yesterday I have been experiencing some issues on JASMIN while running a suite (u-cx502) on the test queue.
This suite was working fine before yesterday. The main problem is that it can no longer find the library libnetcdff.so.6, even though I have explicitly given the path where it is located (/gws/nopw/j04/jules/admin/netcdf/netcdf_par/3.1.1/intel.19.0.0/).
/home/users/caroduro/cylc-run/u-cx502/share/fcm_make/build/bin/jules.exe: error while loading shared libraries: libnetcdff.so.6: cannot open shared object file: No such file or directory
Hi Carolina:
JASMIN has had network issues this week. I think there are still network problems. Maybe that’s the cause of your problem. I am not sure.
Anyway, if it’s not network problems, then maybe it’s because you have your suite set up to use pre-builds. Using pre-builds is not recommended. The suite should be changed to compile from scratch, on the fly.
The following may not be the issue, but you may already be aware that there is a new set of libraries, set up so that you are not constrained to use the same processor type in LOTUS to both compile and run JULES. This allows using AMD processors, or any type of processor. I sent an email about this to the JULES & JULES-USERS listserv 2-3 months ago. The new setup is also in the u-al752 suite.
I tried running your suite. It has a lot of paths that I don’t have access to. So I didn’t get very far. Even the path for the pre-built executable is one I don’t have access to. And I don’t have access to your roses & cylc-run directories, so I can’t see what you’re doing.
Patrick
Thank you so much for all this info.
I tested the suite both with a pre-compiled executable and with on-the-fly compilation, and both worked fine.
Something happened on JASMIN yesterday morning, and I could not access JASMIN as I normally do. A couple of hours later I could log in, but the suite started to fail. Perhaps it is just a network problem, as you said.
I reviewed the library changes and those in the u-al752 suite, but one thing is not very clear to me. In that suite the Intel compiler is at version 20.0.0, but the netcdf libraries it calls are under version intel.19.0.0. Shouldn't both be compiled with the same Intel version? (I might be completely wrong about this.)
Hi Carolina
Have you tried again since Wednesday?
Is the network still causing problems?
I was told to use version 20.0.0 for the intel compiler and the netcdf libraries at version intel.19.0.0. It works for me. Doesn’t it work for you?
Did you run ldd on the jules.exe file to see whether any libraries are missing? You need to do the module loads and set the environment variables before running ldd.
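As a rough sketch of that check, the snippet below reduces ldd output to the names of the libraries reported as "not found". The helper name filter_missing_libs is made up for illustration; the path in the usage comment is the one from the error message earlier in the thread. It assumes the module loads and exports have already been done in the same shell.

```shell
# Read ldd output on stdin and print the names of libraries reported as "not found".
# filter_missing_libs is a hypothetical helper name, not part of JULES or JASMIN tooling.
filter_missing_libs() {
    awk '/not found/ { print $1 }'
}

# Example use, after the module loads and exports:
#   ldd /home/users/caroduro/cylc-run/u-cx502/share/fcm_make/build/bin/jules.exe | filter_missing_libs
```

If nothing is printed, ldd resolved every shared library in that environment.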
If you’re still having problems next week, then if you can set your permissions with chmod -R g+rX /home/users/username, then I might be able to look at your cylc-run and roses directories to see what the problem is. You might want to move anything that is private or confidential to a private directory first.
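In case the capital X is unfamiliar, here is a small sketch in a throwaway scratch directory (not your real home directory) showing what g+rX does: the group gets read permission everywhere, but execute permission only on directories and on files that are already executable. The directory and file names are made up for the demonstration.

```shell
# Demonstrate g+rX in a scratch directory created just for this test.
demo=$(mktemp -d)
mkdir "$demo/sub"
touch "$demo/sub/notes.txt"

# Start from owner-only permissions so the effect of g+rX is visible.
chmod 700 "$demo" "$demo/sub"
chmod 600 "$demo/sub/notes.txt"

# Same flags as suggested above, applied to the scratch directory.
chmod -R g+rX "$demo"

# Inspect the group permission bits of the directory and the plain file.
ls -ld "$demo/sub" "$demo/sub/notes.txt"
```

The directory gains group read and execute (so it can be listed and entered), while the plain file gains group read only.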
Patrick
Hi Carolina:
Have you tried first doing the module loads and the exports on cylc1 from the env-script of [[JASMIN_LOTUS]] in the suite u-al752? If I do those module loads and exports first on cylc1 and then run ldd on your executable, the "not found" messages for the shared object libraries that you have seen go away.
Patrick
Hi again Carolina:
One more thing (see my previous post as well): in the log file /home/users/caroduro/cylc-run/u-cx502/log/job/20240101T0000+0100/JULES_000007/01/job-activity.log
I see: "2023-10-17T01:58:00+01:00 [STDERR] sbatch: error: Batch job submission failed: Requested node configuration is not available"
When I look at: /home/users/caroduro/cylc-run/u-cx502/log/job/20240101T0000+0100/JULES_000007/01/job
I see:
#SBATCH --partition=test
#SBATCH --constraint=amd
As far as I know, the test partition/queue doesn’t have any amd nodes, so the constraint to amd nodes won’t work. Maybe you can try without the amd constraint, and/or in the short-serial or short-serial-4hr partitions? The short-serial-4hr partition only has amd nodes, as far as I know.
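A minimal sketch of how this could be checked from the command line, assuming Slurm's sinfo is available where you run it. The two function names are hypothetical helpers for this example; the first pulls the partition and constraint out of a job script, and the second lists the node names and feature tags that a partition offers, so the two can be compared.

```shell
# Print the partition and constraint requested by a Slurm job script.
# show_sbatch_request is a hypothetical helper name used only for this sketch.
show_sbatch_request() {
    grep -E '^#SBATCH --(partition|constraint)=' "$1"
}

# List node names and their feature tags for a partition; needs Slurm's sinfo on the PATH.
partition_features() {
    sinfo --partition="$1" --format="%N %f"
}

# Example use:
#   show_sbatch_request /home/users/caroduro/cylc-run/u-cx502/log/job/20240101T0000+0100/JULES_000007/01/job
#   partition_features test
```

If the constraint requested in the job script does not appear among the features listed for that partition, the "Requested node configuration is not available" error would be expected.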
I suppose you know that you can modify your suite, and then do a rose suite-run --reload, and then just right click and retrigger the failed JULES apps? You don’t necessarily need to retrigger the fcm_make app, so this can save some time. But you probably already knew that.
Patrick
Thanks for your help and info. I did have time to test the suite before the JASMIN power maintenance. I will get back to you if I have more questions.
Many thanks for your patience.