NetCDF error when running JULES on LOTUS2

Hi there,

I am running JULES on LOTUS2 using MPI. All submitted jobs ran smoothly for the first 2-3 hours, but although I requested 48 hours of run time, they were often shut down in the middle of the run.

The error message:

{MPI Task 0} [FATAL ERROR] file_ncdf_write_var_1d: /work/scratch-pw3/byxu/output/ssp126/noO3/ukesm/period2/glob2.others.2043.nc: Error writing variable 'lai' (NetCDF error - NetCDF: HDF error)
Image              PC                Routine            Line        Source          
jules.exe          0000000000657CA4  Unknown               Unknown  Unknown
jules.exe          0000000000A1BA3A  Unknown               Unknown  Unknown
jules.exe          0000000000A0D9FE  Unknown               Unknown  Unknown
jules.exe          00000000009F5333  Unknown               Unknown  Unknown
jules.exe          0000000000A42C78  Unknown               Unknown  Unknown
jules.exe          00000000009EC4C1  Unknown               Unknown  Unknown
jules.exe          0000000000A4C6FD  Unknown               Unknown  Unknown
jules.exe          000000000040F8AA  Unknown               Unknown  Unknown
jules.exe          000000000040F5CD  Unknown               Unknown  Unknown
libc.so.6          00007F5B8AA295D0  Unknown               Unknown  Unknown
libc.so.6          00007F5B8AA29680  __libc_start_main     Unknown  Unknown
jules.exe          000000000040F4E5  Unknown               Unknown  Unknown
Abort(1) on node 0 (rank 0 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0

I still have enough storage on /work/scratch-pw3/, the same namelists ran fine on the old LOTUS (now retired), and the model was built against the new environment.

Has anyone had the same problem? If so, could you please give me some advice on how to solve it?

Thanks so much.

Best,
Beiyao

Beiyao,

Would you mind rerunning with more memory?
The sbatch line would be something like:

--mem=<size>[units]

so in the prescript/job submission script you could try:
#SBATCH --mem=32G

for example. Pick something appropriate for your run.
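Just to illustrate where that line sits, here is a minimal sketch of a submission script; the account, partition, task count and launcher line are placeholders, so adjust them to whatever you normally use on LOTUS2:

#!/bin/bash
# account/partition below are placeholders; use whatever you normally submit with
#SBATCH --account=jules
#SBATCH --partition=standard
#SBATCH --time=48:00:00
#SBATCH --ntasks=18
#SBATCH --mem=32G
#SBATCH --output=jules.%j.log
#SBATCH --error=jules.%j.err

# launch however you normally run the MPI job, e.g. srun or mpirun
srun ./jules.exe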

Hopefully that helps?

Hi Dave,

Thanks for your reply.

I requested 54G of memory, but it still fails with the same error. I used to request only 30G of memory when running on the old LOTUS, and that caused no problems.

Which suite is this? I assume you’re running from a cylc suite.

Also, possibly not the best fix, but you could check which timestep you get to and then change the cycling; making the cycling more frequent may help with memory issues.
Another thing is to make sure you're running on a single node. You can set --ntasks-per-node equal to the number of tasks, for example.

I am running JULES using namelists rather than a cylc suite. I tried adding the following requests to the script:

#SBATCH --nodes=1
#SBATCH --ntasks=18
#SBATCH --ntasks-per-node=18

It still fails with the same error. I wonder whether it is because LOTUS/JASMIN is unstable at the moment?

I don’t know exactly which job you’re running, so let me know the path to the directory if you’d like.

You have had jobs fail with out-of-memory errors (sacct will tell you this sort of information). If the job you're talking about is still getting this, then maybe try giving more memory per CPU:

e.g. --mem-per-cpu=8G
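To check what a finished job actually used compared with what it asked for, something like the following works (swap in your own job ID):

sacct -j <jobid> --format=JobID,State,Elapsed,ReqMem,MaxRSS

A State of OUT_OF_MEMORY, or a MaxRSS close to ReqMem, usually points at the job being killed for memory.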

LOTUS/JASMIN might have problems, but the way the scheduler allocates resources might also just have changed, so it's worth trying different options. You are one of the first JULES users on the new system, I think, so perhaps you are hitting these problems first.


Thanks so much! After adding --mem-per-cpu to the bash script, it works!

My script now looks like this:

#SBATCH --chdir=.
#SBATCH --account=jules
#SBATCH --partition=highres
#SBATCH --qos=highres
#SBATCH --time=48:00:00
#SBATCH --output=jules.%j.log
#SBATCH --error=jules.%j.err
#SBATCH --nodes=1               
#SBATCH --ntasks=30            
#SBATCH --ntasks-per-node=30 
#SBATCH --mem-per-cpu=20G
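For anyone adapting this: --mem-per-cpu is per allocated CPU, so (assuming the default of one CPU per task) the request above works out as

30 tasks x 20G per CPU = 600G on the single node

whereas --mem requests a fixed amount per node regardless of the task count.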

I hope this can assist anyone who encounters the same issue.

Nice to hear, Beiyao.

And thanks for sharing that result. It's a new system, so some suites will need adjusting; hopefully yours keeps running smoothly.