I am running JULES on LOTUS2 using MPI. All submitted jobs run smoothly for the first 2-3 hours, but although I requested 48 hours of run time, they are often shut down partway through.
I still have enough storage on /work/scratch-pw3/. The namelists ran well on the old LOTUS, which is now retired, and the model was built in the new environment.
Has anyone had the same problem? If so, could you please advise on how to solve it?
Which suite is this? I assume you’re running from a cylc suite.
Also, possibly not the best fix, but you could check which timestep you get to and then change the cycling. If you make the cycling more frequent, it may help with memory issues.
Another thing is to make sure that you’re running on a single node. You can set --ntasks-per-node equal to the total number of tasks.
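To make the single-node suggestion concrete, here is a rough sketch of a Slurm batch header. The task count of 16 is a made-up placeholder, not something taken from your job, and in a cylc suite these options would normally go in the task's directives section rather than in a hand-written script.

```shell
#!/bin/bash
# Hypothetical header: 16 is a placeholder task count, adjust to your own job.
#SBATCH --nodes=1               # keep every MPI task on one node
#SBATCH --ntasks=16             # total number of MPI tasks
#SBATCH --ntasks-per-node=16    # same value as --ntasks, so nothing spills onto a second node
#SBATCH --time=48:00:00         # the 48 hours requested in the original post
```

Whether 16 tasks actually fit on one node depends on the partition, so treat the numbers as an illustration only.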
I don’t know exactly which job you’re running, so let me know the path to the directory if you’d like.
You have had jobs fail with out-of-memory errors (sacct will tell you this kind of information). If the job you’re talking about is still hitting this, then maybe try giving it more memory per CPU:
e.g. --mem-per-cpu=8G
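If it helps, this is roughly how you could check a job yourself with sacct. It is only a sketch: the job ID is passed as the first argument, and filter_oom is a helper name I made up to keep just the rows whose state is OUT_OF_MEMORY.

```shell
#!/bin/bash
# Job ID to inspect, passed as the first argument (empty if none is given).
JOBID="${1:-}"
# Accounting fields useful for diagnosing an early exit; State is the third column.
FIELDS="JobID,JobName,State,ExitCode,Elapsed,ReqMem,MaxRSS"

# Hypothetical helper: keep only rows whose State column reports an out-of-memory kill.
filter_oom() {
    awk -F'|' '$3 ~ /OUT_OF_MEMORY/'
}

# Only call sacct where it exists and a job ID was supplied.
if command -v sacct >/dev/null 2>&1 && [ -n "$JOBID" ]; then
    # --parsable2 prints pipe-delimited rows; --noheader drops the title line.
    sacct -j "$JOBID" --format="$FIELDS" --parsable2 --noheader | filter_oom
fi
```

Comparing MaxRSS against ReqMem for the failed steps should give a rough idea of how much extra memory to ask for with --mem-per-cpu.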
LOTUS/JASMIN might have problems. But they might also just have changed the way the scheduler allocates resources, so it’s worth trying different options. You are one of the first JULES users on the new system, I think, so perhaps you are hitting these problems for the first time?