I am currently running JULES on JASMIN using the batch system.
When I submit only one model job, the model runs without any error. But when I submit more than two model jobs at once, one or more of them fail with the error below:
"
Program received signal SIGBUS: Access to an undefined portion of a memory object.
Backtrace for this error:
Could not print backtrace: /proc/self/exe, errno: 116
/var/spool/slurmd/job15941950/slurm_script: line 67: 1969851 Bus error (core dumped) $JULES_ROOT/build/bin/jules.exe $NAMELIST
"
I need assistance in resolving this issue as I need to run multiple models simultaneously. Thank you.
Hi Assumpta
Please let me know which commands you are using to run JULES.
David
Hi David
Thank you for your response.
I am using this command to run JULES - ‘$JULES_ROOT/build/bin/jules.exe $NAMELIST’
Please see below the content of my batch script:
#!/bin/bash
#SBATCH --partition=standard
#SBATCH --qos=long
#SBATCH --account=tesnbsclim
#SBATCH -o %j.out
#SBATCH -e %j.err
#SBATCH --ntasks=1
#SBATCH --time=4-23:59:59
#SBATCH --mem=120G
#SBATCH --mail-user=assumpta.onyeagoziri@uct.ac.za
#SBATCH --mail-type=ALL
echo "Date & time: "date
module load jaspy
export RSUITE=$HOME/cylc-src/u-dy129
export tmp1=grep -ir "JULES_SOURCE" $RSUITE/app/fcm_make/rose-app.conf
export JULES_ROOT=echo ${tmp1##*=}
unset tmp1
export tmp1=grep -ir "output_dir" $RSUITE/app/jules/rose-app.conf
export OUTPUT_DIR=echo ${tmp1##*=} | sed "s/'//g"
unset tmp1
export tmp1=grep -ir "run_id" $RSUITE/app/jules/rose-app.conf
export RUN_ID=echo ${tmp1##*=} |sed "s/'//g"
unset tmp1
echo ‘Rose suite is:’ $RSUITE;echo ’ which uses this version of JULES:’ $JULES_ROOT;echo ’ and will save output to these files: ls’ $OUTPUT_DIR’/‘$RUN_ID’’
geany &
nedit~/out_${RSUITE##/}.txt &
export NAMELIST=$HOME/cylc-src/nlists_${RSUITE##*/}; mkdir -p $NAMELIST; cd $NAMELIST; rose app-run -i -C $RSUITE/app/jules;
cd ~
cd $JULES_ROOT;fcm make -j 2 -f etc/fcm-make/make.cfg --new; echo -e “\a”
$JULES_ROOT/build/bin/jules.exe $NAMELIST
echo "All done at: "date
Thanks
Assumpta
Hi Assumpta
The script seems to have become garbled in the copying and pasting. Can you upload the file or point me towards a copy on JASMIN?
David
Thanks David. Please see the file here on jasmin:
/home/users/assumpta/tesnbs.sh
Hi Assumpta
I’m investigating and have ruled out a couple of possibilities.
Am I right in thinking that the log files under /home/users/assumpta/batch_sripts/tesnbsruns come from jobs that have failed in this way and that they were run from the batch scripts in the same directory? Do you have log files from a successful run somewhere for comparison?
David
Hi Assumpta
I haven’t been able to see where things are going wrong from the log files. I think you will need to run JULES with the debug options turned on so that you can see where the error is when it crashes. To turn the debug options on you will need to edit /home/users/assumpta/MODELS/vn7.7_cmfz/etc/fcm-make/platform/jasmin-gcc-nompi_new.cfg and change the line
$JULES_BUILD = normal
to
$JULES_BUILD = debug
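If you would rather script the change than edit the file by hand, here is a minimal sketch. It assumes the line in the platform file looks exactly like the one quoted above; `set_jules_build` is a hypothetical helper name, not part of JULES or fcm, and the demonstration runs on a scratch file so the real configuration is not touched.

```shell
#!/bin/bash
# Hypothetical helper (not part of JULES or fcm): rewrite the value on the
# "$JULES_BUILD = ..." line of an fcm-make platform file.
set_jules_build() {
    local cfg="$1" mode="$2" tmp
    tmp="$(mktemp)" || return 1
    # Match a line of the form "$JULES_BUILD = <word>" and swap in the new mode.
    sed -E 's/^(\$JULES_BUILD[[:space:]]*=[[:space:]]*)[A-Za-z]+[[:space:]]*$/\1'"$mode"'/' \
        "$cfg" > "$tmp" || { rm -f "$tmp"; return 1; }
    # Write back through the original file so its permissions are kept.
    cat "$tmp" > "$cfg"
    rm -f "$tmp"
}

# Demonstration on a scratch file rather than the real platform file.
demo_cfg="$(mktemp)"
printf '%s\n' '# scratch copy for illustration' '$JULES_BUILD = normal' > "$demo_cfg"
set_jules_build "$demo_cfg" debug
grep 'JULES_BUILD' "$demo_cfg"
```

To apply it for real you would pass the path to jasmin-gcc-nompi_new.cfg given above as the first argument. JULES then needs rebuilding for the change to take effect; the batch script posted earlier already runs fcm make with --new before launching the model, which I would expect to pick it up.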
Good luck
David
Hi David
Thank you for your message.
Yes, the log files in my /home/users/assumpta/batch_scripts/tesnbsruns come from both successful and failed jobs.
For an example of a successful run, see 20795926.err and 20795926.out in the same directory.
Thank you.
Hi Assumpta
Despite the successful runs, it appears that you are still getting failures. Are you going to try turning on the debug options as I suggested?
David
Hi David,
Thanks for your response.
I have tried your suggestion by replacing ‘normal’ with ‘debug’ in the jasmin-gcc-nompi_new.cfg file and now I see a different type of error - ‘SIGFPE: Floating-point exception - erroneous arithmetic operation’, which is different from the SIGBUS error I was battling with previously.
Please find the error file here: /home/users/assumpta/batch_sripts/tesnbsruns/22851420.err
It is quite surprising that the model runs fine without throwing any of these errors when I only run one or two models at a time. But when I run more than two models, then the errors start occurring. Thanks for your assistance in this matter.
Assumpta.
Hi Assumpta
The different error would be due to the debug options turning on more trapping of floating point errors.
From the backtrace at the end of 22851420.err, it appears that the error is occurring in the calculation of litter_flux at lines 94-95 of the preprocessed source file /home/users/assumpta/MODELS/vn7.7_cmfz/preprocess/src/jules/src/control/shared/calc_litter_flux_mod.F90.
Since the error appears to occur at random, I suspect that it’s due to one of the variables in the calculation not being initialised, or to one of these variables being ultimately calculated from a variable that wasn’t initialised. Most of the time whatever was in the uninitialised variable will lead to a valid calculation, although the result might be nonsense. Occasionally it will lead to an invalid calculation (like 1.0/0.0).
It might just be a coincidence that you haven’t yet seen an error when running only one or two models. After all, the more models you run, the more likely it is that one of them will suffer a random error.
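To put rough numbers on that, here is a small sketch. It assumes each job fails independently with the same probability p, so the chance that at least one of n jobs fails is 1 - (1 - p)^n. The value p = 0.1 is purely illustrative, not something measured from your runs, and `chance_any_failure` is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper: probability that at least one of n independent jobs
# fails, given a per-job failure probability p (assumed, not measured).
chance_any_failure() {
    local p="$1" n="$2"
    awk -v p="$p" -v n="$n" 'BEGIN { printf "%.3f\n", 1 - (1 - p)^n }'
}

# Illustrative per-job failure probability of 0.1 for 1, 3 and 5 jobs.
for n in 1 3 5; do
    printf '%s jobs: %s\n' "$n" "$(chance_any_failure 0.1 "$n")"
done
```

With these assumed numbers the chance of seeing at least one failure rises from 0.100 for a single job to 0.271 for three jobs and 0.410 for five.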
Suggestions for what to do next:
- Gather more information on where the bug is. I notice that you have two other jobs still running. If these fail, it will be useful to know where they fail. You could also try rerunning the job that led to 22851420.err exactly as before to see whether it fails in the same place.
- Look through the code and see whether you can spot a variable that is being used uninitialised. The problem might be in the aforementioned calculation in calc_litter_flux_mod.F90 or it might be in one of the other files listed in the backtrace.
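For the first suggestion, a quick way to see which jobs failed and with which signal is to scan the .err files for the "Program received signal" line that appears in the errors you posted. The following is only a sketch: `summarise_err_files` is a hypothetical helper, the job numbers in the demonstration are made up, and it assumes one .err file per job named after the job ID, as in your batch script.

```shell
#!/bin/bash
# Hypothetical helper: for each <jobid>.err file in a directory, print the
# job ID and the first signal reported in it, or "none" if there is no such line.
summarise_err_files() {
    local dir="$1" f sig
    for f in "$dir"/*.err; do
        [ -e "$f" ] || continue   # skip if the glob matched nothing
        sig="$(grep -m1 -o 'Program received signal SIG[A-Z]*' "$f" | awk '{print $NF}')"
        printf '%s\t%s\n' "$(basename "$f" .err)" "${sig:-none}"
    done
}

# Demonstration with made-up job IDs in a scratch directory.
demo_dir="$(mktemp -d)"
printf '%s\n' 'Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.' \
    > "$demo_dir/1001.err"
printf '%s\n' 'normal log output' > "$demo_dir/1002.err"
summarise_err_files "$demo_dir"
```

Pointing the function at the directory that holds your log files would give one line per job, which makes it easier to see whether the failures share a signal before you open the individual backtraces.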
Good luck
David