JASMIN suites not completing model run for 2004

Copies of suite u-bx723 are running on JASMIN to completion as far as 2014.
However, when phenology/triffid is enabled in these copies of u-bx723, the suite runs until 2004, never completes that year, and eventually times out.
This occurs for suites u-cd640, u-cf835 and u-ch817. Retriggering does not solve the problem, and I don’t know how to fix it.
The error message in the job.err is as follows:

forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
jules.exe 00000000007D4AE3 Unknown Unknown Unknown
libpthread-2.17.s 00007FC0D40B7630 Unknown Unknown Unknown
jules.exe 0000000000657987 qsat_mod_mp_qsat_ 118 qsat_mod.F90
jules.exe 000000000074CAF5 screen_tq_mod_mp_ 406 screen_tq_jls.F90
jules.exe 000000000066C6B4 sf_impl2_mod_mp_s 1829 sf_impl2_jls.F90
jules.exe 00000000005A401C surf_couple_impli 443 surf_couple_implicit_mod.F90
jules.exe 000000000041296C control_ 621 control.F90
jules.exe 000000000040CC68 MAIN__ 129 jules.F90
jules.exe 000000000040CB92 Unknown Unknown Unknown
libc-2.17.so 00007FC0D3AF8555 __libc_start_main Unknown Unknown
jules.exe 000000000040CAA9 Unknown Unknown Unknown
slurmstepd: error: *** JOB 5037177 ON host330 CANCELLED AT 2021-10-01T11:15:28 ***
2021-10-01T11:15:30+01:00 CRITICAL - failed/EXIT

Hi Noel:
The basic advice is to find the time (day, hour and minute) in the year 2004 at which the suite fails. You can restart the suite from the start dump of 2004, and make sure that print_step is equal to 1. You can also make sure that one of your output streams writes various output variables at maybe an hourly or 3-hourly sampling time. After you find out when the suite fails (by looking at your log files to see what day, hour and minute it fails), you can rerun with the end time and date of your output stream set to just before the failure time. That way, your output file will close before the failure, and you can look at the output file to see which variables are messed up, and try to figure out what you need to change in order to get it working right.
This might take some iteration and some time, but it’s the best advice I have on this.
Patrick

Hi, thanks for response.
I only have dump files for every 10 years so I assume when I re-start the suite, it will re-start from the last dump file which is JULES-GL7.0.vn5.3.CRUNCEPv7SLURM.S2.dump.20000101.0.nc

In the /app/jules/rose-app.conf file, I can change print_step=240 to print_step=1.

In jules_time, the timestep_len = 3600 which I assume is hourly and applies to all 3 output profiles so I don’t need to do anything else about the sampling time.

The GUI does not show when I do a “cylc gscan” as the suite stopped a while ago.
Once I do a “rose suite-run --reload”, how can I re-trigger the suite from the last dump file without the GUI?

Hi Noel:
To restart a JULES suite from a different start dump file, I usually first manually change the name of the start dump file in the namelists, and also change the start date for the main run. You will probably also need to manually change the start date for the output_profile files.

You might need to use "rose suite-run --restart" instead of "rose suite-run --reload" if it stopped a while ago. Then retrigger the JULES app in the GUI.

What is the output_period in each of your jules_output_profile namelists? That’s different from the timestep_len in jules_time.

If it is failing in 2004, you might first want to do a run that starts on 20000101 and ends on 20040101, with print_step=240 and the default output_period’s. And then do another run that starts on 20040101 with print_step=1 and output_period=3600 for your new jules_output_profile. That way, your hourly output profiles and log files with print_step=1 won’t be super long, and also, once you find the crash point in 2004, you won’t need to rerun from 20000101 again when you run with the end time just before the crash point.
Patrick

Hi, to answer the question about output_period:
The default settings in the suite are as follows:

[namelist:jules_output]
dump_period=10
nprofiles=3
output_dir='$OUTPUT_FOLDER'
run_id='$ID_STEM'

[namelist:jules_output_profile(1)]
file_period=-2
nvars=76
output_main_run=.true.
output_period=-2
output_spinup=.true.
output_type=76*'M'
profile_name='Annual'

[namelist:jules_output_profile(2)]
file_period=-2
nvars=76
output_main_run=.true.
output_period=-1
output_spinup=.true.
output_type=76*'M'
profile_name='Monthly'

[namelist:jules_output_profile(3)]
file_period=-2
nvars=28
output_main_run=.true.
output_period=-1
output_spinup=.true.
output_type=28*'M'
profile_name='ilamb'

[namelist:jules_time]
l_360=.false.
l_leap=.false.
main_run_end=$MAIN_TASK_END
main_run_start=$MAIN_TASK_START
print_step=240
timestep_len=3600

Hi Noel:
Can you look up in the JULES documentation what an output_period of -1 or -2 means?
Patrick

output_period=-2 is for an annual period
output_period=-1 is for a monthly period
Both must be a multiple of the timestep_len=3600

Hi Noel:
Very good!
So if you want to change the output so that it has an hourly output_stream, then you should say that output_period=3600.

Maybe you want to keep the current output_streams and add an additional hourly output_stream for your debugging purposes. But make sure that you change the end time of all of your output_streams so that they end just before the model crashes.
Patrick
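
The output_period values discussed above can be summarised in a small Python helper. This is only a sketch: the function name is hypothetical, the meanings of -1 and -2 are taken from this thread, and the rule that a positive period must be a whole multiple of timestep_len is an assumption that should be checked against the JULES documentation for your version.

```python
def describe_output_period(output_period, timestep_len=3600):
    """Describe an output_period value as used in this thread.

    Assumption: -1 means monthly and -2 means annual (as stated above);
    any other value is a period in seconds that must be a positive
    multiple of timestep_len.
    """
    special = {-1: "monthly", -2: "annual"}
    if output_period in special:
        return special[output_period]
    if output_period <= 0:
        raise ValueError(f"unsupported output_period: {output_period}")
    if output_period % timestep_len != 0:
        raise ValueError(
            f"output_period={output_period} is not a multiple of "
            f"timestep_len={timestep_len}"
        )
    steps = output_period // timestep_len
    return f"every {output_period} s ({steps} timestep(s))"


# The two defaults in the suite, plus hourly and 3-hourly debugging periods.
for period in (-2, -1, 3600, 10800):
    print(period, "->", describe_output_period(period))
```

With timestep_len=3600, an hourly debugging profile corresponds to output_period=3600 (one timestep) and a 3-hourly one to output_period=10800 (three timesteps).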

I ran the suite from the 2004 dump up until the hour before where the suite crashes.
RUNEND=2004,1,31,4,0,0
RUNSTART=2004,1,1,0,0,0
which produced the following dump file
JULES-GL7.0.vn5.3.CRUNCEPv7SLURM.S2.dump.20040131.14400.nc

From the original advice “The basic advice is to find the time (day and hour and minute) in the year 2004 that the suite fails.”
Should I now re-start from dump.20040131.14400.nc and try to find the minute?

Also, what advantage does print_step=1 have?
Where can I read that information?
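
As a cross-check on the run above, the dump file name can be decoded in Python. This is a sketch under an assumption: that the two numeric fields before ".nc" are the date (YYYYMMDD) and the number of seconds into that day, which is consistent with RUNEND=2004,1,31,4,0,0 producing dump.20040131.14400.nc. The function names are hypothetical.

```python
import re
from datetime import datetime, timedelta

# Assumed pattern: ...dump.<YYYYMMDD>.<seconds into the day>.nc
DUMP_RE = re.compile(r"\.dump\.(\d{8})\.(\d+)\.nc$")


def dump_time(filename):
    """Return the model time encoded in a dump file name."""
    match = DUMP_RE.search(filename)
    if match is None:
        raise ValueError(f"not a recognised dump file name: {filename}")
    day = datetime.strptime(match.group(1), "%Y%m%d")
    return day + timedelta(seconds=int(match.group(2)))


def steps_between(start, end, timestep_len=3600):
    """Number of whole timesteps between two times.

    Uses the real calendar, so it would differ from a run with
    l_leap=.false. if the span crossed a 29 February.
    """
    return int((end - start).total_seconds()) // timestep_len


name = "JULES-GL7.0.vn5.3.CRUNCEPv7SLURM.S2.dump.20040131.14400.nc"
end = dump_time(name)
start = datetime(2004, 1, 1)
print(end)                        # 2004-01-31 04:00:00
print(steps_between(start, end))  # 724
```

So the run covered 724 hourly timesteps, and with print_step=1 the log should contain one timestep line for each of them up to the point just before the crash.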

Hi Noel
In the run you did from 20040101_0h to 20040131_4h, did you also produce a daily or perhaps hourly output stream for that time period? If not, then I suggest you do so. If you did, then you can look at all the output variables in your output stream to see which ones are going bad just before it fails.

You might need to include at least the same output variables that were in your monthly output stream. If you have some idea about which other important variables might be starting to become extreme, then you can include those in the daily or hourly output stream as well.

I don’t think you need to worry about the precise minute of failure right now. Just check to see if there are any weird values in the various variables in the last time steps before failure.

Setting print_step=1 just prints the timestep information in the log file for every timestep; it is used to figure out which timestep your run is failing at.

The weekend is starting for me now, so I won’t be back at the NCAS CMS Helpdesk until sometime after the weekend is over.
Patrick

Yes I have the hourly output, but I can’t see anything strange so far.

Hi Noel:
Which hourly variables are you looking at? Have you compared the maps of these variables at the last time steps before failure to those at earlier times in the run? Do you see any grid cells with NaNs in them for any of the variables? Do you see any extreme surface temperatures or soil temperatures? Is there any ice or snow on the surface anywhere? Are there any grid cells where the soil moisture goes to zero in any of the layers? You can look at the variables with ncview or with Python. I tend to look at the variables first with ncview.
Patrick
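
For the Python route, the checks suggested above (NaNs and extreme values at each time step) could be sketched roughly as follows. This is an illustration only: the function name, the bounds, and the synthetic surface-temperature values are all made up, and real data would first need to be read from the hourly output file (for example with the netCDF4 or xarray packages) and flattened to one list of grid-cell values per time step.

```python
import math


def flag_suspect_steps(series, lower, upper):
    """Flag time steps containing NaNs or values outside [lower, upper].

    series: list of time steps, each a list of per-grid-cell floats.
    Returns a list of (time_index, n_nan, min_value, max_value).
    """
    flagged = []
    for t, cells in enumerate(series):
        finite = [v for v in cells if not math.isnan(v)]
        n_nan = len(cells) - len(finite)
        vmin = min(finite) if finite else math.nan
        vmax = max(finite) if finite else math.nan
        out_of_range = bool(finite) and (vmin < lower or vmax > upper)
        if n_nan or out_of_range:
            flagged.append((t, n_nan, vmin, vmax))
    return flagged


# Synthetic surface temperatures in K: three hourly steps, three grid cells.
tstar = [
    [271.3, 285.0, 290.2],
    [271.0, 284.6, 289.9],
    [270.8, float("nan"), 415.7],  # last step before the "crash"
]

# Illustrative bounds only; choose limits appropriate to each variable.
for t, n_nan, vmin, vmax in flag_suspect_steps(tstar, lower=180.0, upper=340.0):
    print(f"step {t}: {n_nan} NaN(s), min={vmin}, max={vmax}")
```

Running the same check on each variable in turn, with bounds suited to that variable (for example a lower bound of zero for soil moisture), gives a quick list of time steps worth inspecting more closely in ncview.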