Jules.exe libnetcdff.so.6 can't open shared object file

Since yesterday I have been experiencing some issues on JASMIN while running a suite (u-cx502) on the test queue.

This suite was working fine before yesterday. The main problem is that it is now unable to find the library libnetcdff.so.6, even though I have explicitly given the path where it is located (/gws/nopw/j04/jules/admin/netcdf/netcdf_par/3.1.1/intel.19.0.0/).

/home/users/caroduro/cylc-run/u-cx502/share/fcm_make/build/bin/jules.exe: error while loading shared libraries: libnetcdff.so.6: cannot open shared object file: No such file or directory

Any ideas why it is not working anymore?

Hi Carolina:
JASMIN has had network issues this week. I think there are still network problems. Maybe that’s the cause of your problem. I am not sure.

Anyway, if it’s not network problems, then maybe it’s because you have your suite set up to use pre-builds. Using pre-builds is not recommended. The suite should be changed to compile on the fly, from scratch.

The following is maybe not the issue, but I guess you’re also aware that there’s a new set of libraries set up so that you’re not necessarily constrained to use the same processor type in LOTUS to both compile & run jules? This allows using AMD processors, or any type of processor. I sent an email about this to the JULES & JULES-USERS listserv 2-3 months ago. The new set up is also in the u-al752 suite.

I tried running your suite. It has a lot of paths that I don’t have access to. So I didn’t get very far. Even the path for the pre-built executable is one I don’t have access to. And I don’t have access to your roses & cylc-run directories, so I can’t see what you’re doing.
Patrick

Hi Patrick,

Thank you so much for all this info.
I was testing the suite both with a pre-compiled executable and with on-the-fly compilation, and both worked fine.
Something happened on JASMIN yesterday morning and I could not access JASMIN as I normally do. A couple of hours later I could log in, but the suite started to fail. Perhaps it is just a network problem, as you said.

I reviewed the library changes and those in the u-al752 suite, but there is something that is not very clear to me. In that suite the Intel compiler is at version 20.0.0, but the netcdf libraries it calls are under intel.19.0.0. Shouldn't both be compiled with the same Intel version? (I might be completely wrong about this.)

I am going to copy the executable to a public folder where you can access it: https://gws-access.jasmin.ac.uk/public/uknetzero/CDR/jules.exe

Thanks again for all your help.

Hi Carolina
Have you tried again since Wednesday?
Is the network still causing problems?

I was told to use version 20.0.0 for the intel compiler and the netcdf libraries at version intel.19.0.0. It works for me. Doesn’t it work for you?

Did you run ldd on the jules.exe file to see if any libraries are missing? You need to do the module loads and set the environment variables before running ldd.
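
Here is a minimal sketch of that ldd check, assuming a bash-like shell. The helper name list_missing_libs is hypothetical; it only filters ldd output for entries marked "not found". The module loads and exports are not reproduced here, since they come from the suite's own env-script.

```shell
# Hypothetical helper: print the names of shared libraries that ldd
# reports as "not found". It reads ldd output on standard input.
list_missing_libs() {
    awk '/not found/ { print $1 }'
}

# Intended use, after the module loads and exports have been done
# in the same shell (the path is the one from the error message):
#   ldd /home/users/caroduro/cylc-run/u-cx502/share/fcm_make/build/bin/jules.exe | list_missing_libs
```

If the function prints nothing, ldd resolved every library in the current environment; any name it prints is a library the loader still cannot find.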

If you’re still having problems next week, then if you can set your permissions with chmod -R g+rX /home/users/username, then I might be able to look at your cylc-run and roses directories to see what the problem is. You might want to move anything that is private or confidential to a private directory first.
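
For reference, a small sketch of what g+rX does, run on a throwaway directory rather than on a real home directory. The function name demo_group_read is hypothetical, and stat -c assumes GNU coreutils. The capital X adds group execute (search) permission on directories, and on files only if they are already executable by someone.

```shell
# Hypothetical demo: apply "chmod -R g+rX" to a temporary directory tree
# and print the resulting modes of a subdirectory and a plain file.
demo_group_read() {
    demo_dir=$(mktemp -d)
    mkdir "$demo_dir/sub"
    touch "$demo_dir/sub/notes.txt"
    # Start from owner-only permissions.
    chmod 700 "$demo_dir" "$demo_dir/sub"
    chmod 600 "$demo_dir/sub/notes.txt"
    # Group gets read everywhere; execute only on directories here.
    chmod -R g+rX "$demo_dir"
    stat -c '%A' "$demo_dir/sub" "$demo_dir/sub/notes.txt"
    rm -rf "$demo_dir"
}

demo_group_read
# prints:
# drwxr-x---
# -rw-r-----
```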
Patrick

Hi Patrick,
I tried to run the suite last Friday and now it has a submission problem.
I have already changed the permissions on the suite and the cylc directories.

Hi Carolina:
Have you tried first doing the module loads and the exports on cylc1 from the env-script of [[JASMIN_LOTUS]] in the suite u-al752? If I do those module loads and exports first on cylc1 and then run ldd on your executable, the "not found" entries for the shared object libraries that you have seen go away.
Patrick

Hi again Carolina:
One more thing (see my previous post as well): in the log file
/home/users/caroduro/cylc-run/u-cx502/log/job/20240101T0000+0100/JULES_000007/01/job-activity.log
I see:
"2023-10-17T01:58:00+01:00 [STDERR] sbatch: error: Batch job submission failed: Requested node configuration is not available"

When I look at:
/home/users/caroduro/cylc-run/u-cx502/log/job/20240101T0000+0100/JULES_000007/01/job
I see:

#SBATCH --partition=test
#SBATCH --constraint=amd

As far as I know, the test partition/queue doesn’t have any amd nodes, so the constraint to amd nodes won’t work. Maybe you can try without the amd constraint, and/or in the short-serial or short-serial-4hr partitions? The short-serial-4hr partition only has amd nodes, as far as I know.
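
A quick way to spot this kind of mismatch is to pull the partition and constraint lines out of the generated job file. This is only a sketch; the helper name show_partition_and_constraint is hypothetical, and the sinfo format string in the comment is my assumption about how to list node features on JASMIN's Slurm setup.

```shell
# Hypothetical helper: print the #SBATCH partition and constraint lines
# of a job script so they can be compared side by side.
show_partition_and_constraint() {
    grep -E '^#SBATCH --(partition|constraint)=' "$1"
}

# Intended use on the job file mentioned above:
#   show_partition_and_constraint /home/users/caroduro/cylc-run/u-cx502/log/job/20240101T0000+0100/JULES_000007/01/job
#
# To see which features the nodes in a partition advertise, something like
# the following may help (assumes sinfo is available on the login host):
#   sinfo -p test -o "%P %N %f"
```

If the constraint names a feature that none of the nodes in the partition advertise, sbatch rejects the job with the "Requested node configuration is not available" error shown in the log.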

I suppose you know that you can modify your suite, and then do a rose suite-run --reload, and then just right click and retrigger the failed JULES apps? You don’t necessarily need to retrigger the fcm_make app, so this can save some time. But you probably already knew that.
Patrick

Hi Patrick,

Thanks for your help and info. I did have time to test the suite before the JASMIN power maintenance. I will get back to you if I have more questions.
Many thanks for your patience.

cheers
Carolina

Hi

It seemed to be a JASMIN network problem. After JASMIN was off in October, it worked fine again.
Thanks Patrick for your help.
cheers
