Segmentation fault if CRUN_LEN < run length

Dear CMS helpdesk,

I’m running a UM vn12.0 nesting suite on Monsoon (u-cl358), although at the moment I’m just running the global model, without any nested regions. My starting point was suite u-by395, as recommended in the Nesting Suites guide.

The suite currently fails if CRUN_LEN < length of the run, i.e., if the run is split into smaller jobs (24 hours long in my case). In particular, the first job succeeds, and then I get a segmentation fault at the beginning of the second job. The error occurs if the output domain specified in the STASH domain profiles is a subset of the model area; the suite runs fine if the output area is the full model area (i.e., global). To my understanding, this suggests the issue is somewhere in the STASH configuration, but I can’t find what’s wrong with it.

For the time being I’m setting CRUN_LEN equal to the total length of the run, but this is not really ideal: the glm_fcst_um job fails during submission if the wallclock time is larger than roughly 12,000 to 14,000 seconds (I haven’t found the exact threshold). With the wallclock time capped below 12,000 seconds, the 3-day forecasts I’m currently running run out of time before completing (and ideally I should move soon to 5-day forecasts). Increasing the number of processors, originally set at (18, 32), does not speed up the run very much while substantially increasing the queuing time, so that is not a viable workaround either.

Do you have any idea of how to solve the segmentation fault issue?

Many thanks for your attention and best wishes
Ambrogio

Hi Ambrogio

I can’t find the segmentation error you refer to in the ticket in the model output.

Grenville

Hi Grenville,

Yes, sorry, that’s because I tried running the suite with CRUN_LEN = run_length to get some more output data. This didn’t really work either because, as you probably saw, the run exceeded the wallclock time (see my original post for details) and did not complete.

I’ll now set CRUN_LEN back to 24 hours and re-run the suite. As soon as the second job fails with the segmentation fault, I will get back to you here.

Many thanks again
Ambrogio

Hi @grenville,

The run has just failed with the segmentation fault error I described previously. Please let me know if I should upload any log files here, or if there’s anything else I can do.

Best wishes
Ambrogio

Hi Ambrogio

The second cycle fails while reading the start file. If you look in the start file with xconv, you will see that most fields have 1536x1152 grid points in the horizontal; however, there are several with 1536x256. These appear in the file because the fields are accumulations or time averages: diagnostics of this kind are held in the start file when a meaning period spans a dump time. In this case I’m not sure why the fields are in the file, since they are supposed to be output on a whole number of hours, and I don’t know exactly what the model needs in order to do the accumulation or mean. I also don’t know precisely which items are causing the problem; it may be all of them. I shall play with this, but you might want to switch off STASH output on the reduced area for the diagnostics that use T1HRACC, T1HRAVG, and TACC3HR, and let the full fields do the sums, then extract the reduced domain as a post-processing step. Actually, post-processing to the reduced domain might be easier all round: it will save on data written.
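As a quick sanity check, you can scan the start file for fields whose horizontal grid differs from the model grid. Below is a minimal, illustrative Python sketch: in practice you would read the real lookup headers with a tool such as mule or inspect the file in xconv, and the field records here (STASH code, columns, rows) are made up to show the idea.

```python
# Illustrative sketch: flag fields in a start dump whose horizontal grid
# differs from the full model grid (1536x1152 in this suite). Real header
# values would come from mule/xconv; these records are hypothetical.

MODEL_GRID = (1536, 1152)  # (columns, rows) of the full global grid

def mismatched_fields(records, grid=MODEL_GRID):
    """Return the records whose (columns, rows) differ from the model grid."""
    return [r for r in records if (r[1], r[2]) != grid]

# Hypothetical field records: (stash_code, columns, rows)
fields = [
    (2, 1536, 1152),     # full-grid field
    (3, 1536, 1152),     # full-grid field
    (4201, 1536, 256),   # an accumulation held on the reduced output domain
]

for stash, cols, rows in mismatched_fields(fields):
    print(f"STASH {stash}: {cols}x{rows} (expected {MODEL_GRID[0]}x{MODEL_GRID[1]})")
```

Any field this flags is a candidate for the accumulation/mean diagnostics described above.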

Grenville

Hi Grenville,

Thanks a lot.

I’ve now re-run the suite with the accumulated and averaged STASH fields switched to the global domain, and the run completed successfully. I’m happy to keep it like this for the time being, but if you do find a general solution, please keep me posted.
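For reference, the post-processing extraction you suggested could look something like the sketch below. It assumes the output field is available as a plain 2D array indexed [row][column] (in practice it would come via a tool like iris or mule), and the index bounds are illustrative only.

```python
# Minimal sketch of extracting a reduced domain from a full global field
# as a post-processing step. The field here is a plain 2D list indexed
# [row][column]; real code would slice an iris cube or mule field instead,
# and the index ranges below are made-up examples.

def extract_subdomain(field, row_range, col_range):
    """Slice a 2D field down to the rows/columns of the output domain."""
    r0, r1 = row_range
    c0, c1 = col_range
    return [row[c0:c1] for row in field[r0:r1]]

# A toy 6x8 "global" field; take rows 1-3 and columns 2-5 as the subdomain.
global_field = [[r * 10 + c for c in range(8)] for r in range(6)]
subdomain = extract_subdomain(global_field, (1, 4), (2, 6))
```

Writing only this subdomain to disk is what saves on data volume compared with storing the full-grid diagnostics.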

Going back to the other issue in my original post: do you have any idea why the glm job failed to submit when the wallclock time was increased above roughly 12,000 to 14,000 seconds? I don’t have log files for that now, but I could change it back and re-run if that’s helpful. Also, how do I find out how many cores per node there are on the Monsoon Cray XC40? (I’m just trying to optimise the number of cores I’m using.) Please let me know if I should raise another ticket to discuss this.

Many thanks again!
Ambrogio

Hi Ambrogio

There is a time limit of 4 hours (14,400 seconds) for all jobs on Monsoon, which matches the threshold you were hitting.

Grenville

Hi Grenville

OK, that’s good to know, thanks!