Atmos_main failing after 25 days in a one-month run

Hi,

I am running the suite u-ds446 on Monsoon3. The model runs successfully for the first 20 days or so, but after 25 days of a one-month run it stops producing data. The job Atmos_main then fails without any visible error message.

Could you please help me investigate why this is happening?

Best regards,

Tanu

Please allow us read access to your space on Monsoon.

Grenville

Hi Grenville

Sorry for the delay. My space is open now.

Thanks

Tanu

Tanu

It’s a little difficult to know where to look when there are 45 runs. I think the problem is stated in run41 in the job.err file – have you been examining the error messages?

PBS: job killed: walltime 5442 exceeded limit 5400

It looks like the job needs more time to complete. Increase the wallclock time.
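As a rough illustration of how to read that message, the sketch below parses the PBS line quoted above and reports the limit and the overrun. The helper name `parse_walltime_kill` and the regular expression are assumptions made for this example; they are not part of PBS or of the suite.

```python
import re

# Hypothetical helper: pull the used and allowed walltime (in seconds)
# out of a PBS kill message of the form seen in job.err.
PATTERN = re.compile(r"walltime (\d+) exceeded limit (\d+)")


def parse_walltime_kill(line):
    """Return (used, limit) in seconds, or None if the line does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


line = "PBS: job killed: walltime 5442 exceeded limit 5400"
used, limit = parse_walltime_kill(line)
# 5400 s is a 90-minute limit; the job was killed 42 s past it.
print(f"limit {limit // 60} min, used {used} s, overrun {used - limit} s")
```

The message only shows that the limit was reached; it does not by itself distinguish a slow run from one that has stopped making progress.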

Grenville

I tried running it with the wallclock increased to 2 hours 30 min, but the walltime error occurs there as well. The model stops producing data after 45 min.

Which run has the increased wallclock?

I ran it on Monsoon2.

I will increase it to 2 hours 30 min and run it again.

Hi Grenville

I ran the suite u-ds446 with the wallclock increased to 2 hours 30 min (run46). It failed again after 25 days.

Hi Tanu,

A few steps to try:

  • Turn off debug-level logging, as it affects performance: app/um/rose-app.conf → [env]PrintStatus=PrStatus_Min

  • The processor setup seems to have been copied from (my?) GAL9 configuration and may not be optimal for UKESM: rose-suite.conf → MAIN_ATM_PROCX=32, MAIN_ATM_PROCY=18, MAIN_OMPTHR_ATM=2 (further optimisations are being tested for UKESM1.1 with Rank Reordering).

  • The maximum wall-clock time available is 3 hours, so increasing to that might help the run complete the last 5 days.

If the runs continue to hang at 25 days, this might be related to ancillary data being read or to a specific process occurring around that day.
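To make the decomposition values quoted above concrete, the sketch below works out how many MPI tasks and cores they imply. It assumes one core per OpenMP thread and ignores any extra processes (for example I/O servers); the function name is illustrative only, and the actual resource request depends on how the suite maps these variables onto the machine.

```python
def atmos_resources(procx, procy, omp_threads):
    """Return (mpi_tasks, cores), assuming one core per OpenMP thread."""
    mpi_tasks = procx * procy          # east-west x north-south decomposition
    cores = mpi_tasks * omp_threads    # assumption: no hyperthreading
    return mpi_tasks, cores


# Values quoted in the rose-suite.conf bullet above.
tasks, cores = atmos_resources(procx=32, procy=18, omp_threads=2)
print(f"{tasks} MPI tasks, {cores} cores")
```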

Mohit

Also, worth trying to move the input files from /home/ to /data/?

I think access from compute nodes to /home might itself be slow (and not even possible on ARCHER2).

Hi Mohit,

I will do it and let you know if it works or not.

Thanks
Tanu

Hi Mohit,

It did not work. Atmos_main is still failing after 25 days.

Tanu

Atm_Step: Timestep 1800 Model time: 2019-01-26 00:00:00
Attempt to open file: /common/share/monsoon_ancils/atmos/GC5/n96e/easyaerosol/cmip6_stratos/climatology_1850-2014/v1//volc_aer_extinction_sw.nc returned status= 0 nfid= 65536
Attempt to open file: /common/share/monsoon_ancils/atmos/GC5/n96e/easyaerosol/cmip6_stratos/climatology_1850-2014/v1//volc_aer_absorption_sw.nc returned status= 0 nfid= 65536
Attempt to open file: /common/share/monsoon_ancils/atmos/GC5/n96e/easyaerosol/cmip6_stratos/climatology_1850-2014/v1//volc_aer_asymmetry_sw.nc returned status= 0 nfid= 65536
Attempt to open file: /common/share/monsoon_ancils/atmos/GC5/n96e/easyaerosol/cmip6_stratos/climatology_1850-2014/v1//volc_aer_extinction_lw.nc returned status= 0 nfid= 65536
Attempt to open file: /common/share/monsoon_ancils/atmos/GC5/n96e/easyaerosol/cmip6_stratos/climatology_1850-2014/v1//volc_aer_absorption_lw.nc returned status= 0 nfid= 65536
Attempt to open file: /common/share/monsoon_ancils/atmos/GC5/n96e/easyaerosol/cmip6_stratos/climatology_1850-2014/v1//volc_aer_asymmetry_lw.nc returned status= 0 nfid= 65536
update_pattern: updating coeffc and coeffs
Tot dry mass 0.51291E+19
Tot mass 0.51413E+19
Tot energy 0.13074E+25
tot dry energy 0.13075E+25
gr( rho cal) 0.38159E+24
KE( rho cal) 0.92906E+21
KEu(rho cal) 0.71085E+21
KEv(rho cal) 0.21821E+21
KEw(rho cal) 0.11890E+16
cvT( rho cal) 0.92500E+24
lq ( rho cal) 0.30327E+23
lqcf( rho cal) 0.84982E+20
lqcl( rho cal) 0.64562E+20
Final dry mass of atmosphere = 0.51291E+19 KG
Initial dry mass of atmosphere= 0.51291E+19 KG
Correction factor for rho_dry = 0.10000E+01
Final moisture = 0.12183E+17 KG
Initial moisture = 0.12263E+17 KG
change in moisture = -0.80694E+14 KG
Moisture added E-P in period = -0.80508E+14 KG
Error in moisture = -0.18589E+12 KG
Error as % of change = 0.23037E+00
q ( rho cal) 0.12126E+17
qcf( rho cal) 0.33979E+14
qcl( rho cal) 0.22773E+14
FINAL TOTAL ENERGY = 0.13074E+25 J/
INITIAL TOTAL ENERGY = 0.13072E+25 J/
CHG IN TOTAL ENERGY O. P. = 0.14228E+21 J/
FLUXES INTO ATM OVER PERIOD = -0.38066E+22 J/
ERROR IN ENERGY BUDGET = -0.39489E+22 J/

Attempt to open file: /projects/ukca-admin/analyses/era5/era5_1deg-model-levs_N48L137_2019012600_all.nc returned status= 0 nfid= 65536
Attempt to open file: /projects/ukca-admin/analyses/era5/era5_1deg-model-levs_N48L137_2019012606_all.nc returned status= 0 nfid= 65536
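One way to see how far the model got before stalling is to pull the last `Atm_Step` line out of the output, as in the minimal sketch below. The helper name and regular expression are assumptions for this example, and the start date of 2019-01-01 is also an assumption, since it is not stated in the thread.

```python
import re
from datetime import datetime

# Matches lines such as:
#   Atm_Step: Timestep 1800 Model time: 2019-01-26 00:00:00
STEP = re.compile(
    r"Atm_Step: Timestep\s+(\d+)\s+Model time:\s+"
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})"
)


def last_atm_step(log_text):
    """Return (timestep, model_time) for the last Atm_Step line, or None."""
    matches = STEP.findall(log_text)
    if not matches:
        return None
    step, stamp = matches[-1]
    return int(step), datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")


sample = "Atm_Step: Timestep 1800 Model time: 2019-01-26 00:00:00\n"
step, when = last_atm_step(sample)

# Assumption: the run started at 2019-01-01 00:00:00.
start = datetime(2019, 1, 1)
days = (when - start).days
print(f"last step {step} at {when}, {days} days in, {step / days:.0f} steps/day")
```

Under that start-date assumption, timestep 1800 corresponds to 25 completed model days at 72 steps per day, which matches the point where the run stops producing data.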

You could try changing to 10-day cycling (remember to set the dump frequency to 10 days.)

Grenville