North/South halos too small for advection

Dear CMS,

My vn13.5 suite u-dj442 failed after running for two hours. atmos_main got an error like this:

[1] ? Error code: 15
[1] ? Error from routine: LOCATE_HDPS
[1] ? Error message: North/South halos too small for advection.
[1] ? See the following URL for more information:
[1] ? https://code.metoffice.gov.uk/trac/um/wiki/KnownUMFailurePoints

I’m aware of this topic, whose owner avoided this error by changing n_conv_calls for just the failing cycle.

But could you please tell me how I can change n_conv_calls “for just the failing cycle”? If you have any other suggestions, please let me know.

Thanks,
Masaru

Hi Masaru,

This process needs to be done manually (edit settings → run → wait for the month to complete → stop run → revert settings → resubmit for the next segment).
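For the manual route, a rough sketch of the sequence, assuming n_conv_calls lives under [namelist:run_convection] in app/um/rose-app.conf (file and namelist locations vary between suites, so treat this as illustrative only):

# 1. edit the convection setting for the failing cycle only
cd ~/roses/u-dj442
vi app/um/rose-app.conf        # e.g. increase n_conv_calls under [namelist:run_convection]
# 2. restart the suite so the failing cycle runs with the new value
rose suite-run --restart
# 3. once that cycle has completed, stop the suite, revert the edit and restart again
cylc stop u-dj442
vi app/um/rose-app.conf        # revert n_conv_calls to its original value
rose suite-run --restart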
The preferred method nowadays is to perturb the theta field as a way to change the model evolution; see tips_CRgeneral – MOCI (metoffice.gov.uk) for an explanation, and this comment for using the method on Monsoon:

Hi Mohit,

Oh, I misunderstood the word ‘cycle’; I thought it might mean just one time step or something.

But I tried the method you recommended, and it looks like it worked. It’s pretty easy and good to know.

Thanks for your help as always :grinning:
Masaru

Well, the above was not actually true. The run failed in the same way. Last time it failed at time step 6565 (2005-12-01 04:20:00), and this time at time step 6573 (2005-12-01 07:00:00), so not exactly the same.

Because the run failed during 2005-12-01 and the last dump file was dj442a.da20051201_00, I thought this was the one to perturb, so I ran

/home/d03/myosh/cylc-run/perturb_theta.py dj442a.da20051201_00.orig --output dj442a.da20051201_00

This is the list of dump files right after that step; note that only dj442a.da20051201_00 was created today.

-rw-r--r-- 1 myosh ukca-leeds 2676244480 Sep 20 15:18 dj442a.da20050902_00
-rw-r--r-- 1 myosh ukca-leeds 2676244480 Sep 20 15:30 dj442a.da20050912_00
-rw-r--r-- 1 myosh ukca-leeds 2676244480 Sep 20 15:43 dj442a.da20050922_00
-rw-r--r-- 1 myosh ukca-leeds 2676244480 Sep 20 15:54 dj442a.da20051001_00
-rw-r--r-- 1 myosh ukca-leeds 2676244480 Sep 20 15:55 dj442a.da20051002_00
-rw-r--r-- 1 myosh ukca-leeds 2676244480 Sep 20 16:07 dj442a.da20051012_00
-rw-r--r-- 1 myosh ukca-leeds 2676244480 Sep 20 16:19 dj442a.da20051022_00
-rw-r--r-- 1 myosh ukca-leeds 2676244480 Sep 20 16:31 dj442a.da20051101_00
-rw-r--r-- 1 myosh ukca-leeds 2676244480 Sep 20 16:43 dj442a.da20051111_00
-rw-r--r-- 1 myosh ukca-leeds 2676244480 Sep 20 16:55 dj442a.da20051121_00
-rw-r--r-- 1 myosh ukca-leeds 2672050176 Sep 23 10:45 dj442a.da20051201_00
-rw-r--r-- 1 myosh ukca-leeds 2676244480 Sep 20 17:07 dj442a.da20051201_00.orig

However, after atmos_main was rerun and crashed again, all dump files had been updated, as you can see below (notice that all but xxxx_00.orig were created today). I expected the run to start from 20051201, leaving the existing dumps untouched. Does this mean the run actually started from the beginning, from the start dump instead of the perturbed dump?

-rw-r--r-- 1 myosh ukca-leeds 2676244480 Sep 23 11:38 dj442a.da20050902_00
-rw-r--r-- 1 myosh ukca-leeds 2676244480 Sep 23 11:50 dj442a.da20050912_00
-rw-r--r-- 1 myosh ukca-leeds 2676244480 Sep 23 12:02 dj442a.da20050922_00
-rw-r--r-- 1 myosh ukca-leeds 2676244480 Sep 23 12:12 dj442a.da20051001_00
-rw-r--r-- 1 myosh ukca-leeds 2676244480 Sep 23 12:14 dj442a.da20051002_00
-rw-r--r-- 1 myosh ukca-leeds 2676244480 Sep 23 12:26 dj442a.da20051012_00
-rw-r--r-- 1 myosh ukca-leeds 2676244480 Sep 23 12:37 dj442a.da20051022_00
-rw-r--r-- 1 myosh ukca-leeds 2676244480 Sep 23 12:49 dj442a.da20051101_00
-rw-r--r-- 1 myosh ukca-leeds 2676244480 Sep 23 13:01 dj442a.da20051111_00
-rw-r--r-- 1 myosh ukca-leeds 2676244480 Sep 23 13:13 dj442a.da20051121_00
-rw-r--r-- 1 myosh ukca-leeds 2676244480 Sep 23 13:26 dj442a.da20051201_00
-rw-r--r-- 1 myosh ukca-leeds 2676244480 Sep 20 17:07 dj442a.da20051201_00.orig

Maybe I should have restarted the run in a particular way so that it starts from where it previously stopped? I ran rose suite-run --restart, but this only brought the cylc window back showing the previous crash, so I just did ‘Trigger (run now)’ on atmos_main from that window.

Masaru

If you set the cycling to 3 months (RESUB=P3M), the first cycle will finish on 30th Nov. You could then ‘Hold’ the atmos_main task, perturb the 20051201_00 dump and then let the model continue.
However, given that the model is blowing up fairly early in the run, I suspect there is an issue with the setup or with code changes that is making it unstable, and you might see the problem frequently.
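Roughly, the sequence would look something like this (Cylc 7 syntax assumed, since the suite is run with rose suite-run; the cycle point, dump location and script path below are illustrative):

# in rose-suite.conf, set the cycling period (the variable may be RESUB or EXPT_RESUB depending on the suite)
RESUB='P3M'
# when the first cycle finishes, hold the next atmos_main before it starts
cylc hold u-dj442 'atmos_main.20051201T0000Z'
# perturb the dump it will start from (the History_Data location depends on the suite's DATAM setting)
cd ~/cylc-run/u-dj442/share/data/History_Data
cp dj442a.da20051201_00 dj442a.da20051201_00.orig
perturb_theta.py dj442a.da20051201_00.orig --output dj442a.da20051201_00   # the script you used above
# release the held task so the model continues from the perturbed dump
cylc release u-dj442 'atmos_main.20051201T0000Z'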

Hi Mohit,
Maybe you are right…

I set EXPT_RESUB=P3M and the first 3 months of simulation completed (with an error in postproc). Then I let atmos_main crash (I actually forgot to hold it) in the second cycle starting from 20051201. I perturbed dj442a.da20051201_00 and reran atmos_main.

It fails with different error messages now.

? Error code: 100
? Error from routine: set_thermodynamic
? Error message: A total of 23 points had negative mass in set_thermodynamic. This indicates the pressure fields are inconsistent between different levels and the model is about to fail.

What would your advice be? Should I start over from the beginning?

Masaru

Hi Masaru,

As I hinted earlier, there is likely to be something in the setup/code that is making the configuration unstable.
The best way would be to start from a similar configuration that works and add the ‘science’ changes incrementally.
Going from vn12 to vn13.5 just through app-upgrade is a big jump, unless you have tested the model at the intervening versions… Note also that rose app-upgrade mostly applies the default values for new items (logicals to false, integers and reals to 0) rather than the science options, which mainly have to be set by hand.
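For what it’s worth, a sketch of the incremental route with rose app-upgrade (the metadata path and the intermediate versions are assumptions; check which upgrade macros are actually available on your system, and test the model at each step):

cd ~/roses/u-dj442/app/um
# list the upgrade versions the UM rose metadata knows about (metadata path is a placeholder)
rose app-upgrade --meta-path=/path/to/um/rose-meta
# step through intermediate versions rather than jumping straight from vn12.0 to vn13.5
rose app-upgrade --meta-path=/path/to/um/rose-meta vn13.0
rose app-upgrade --meta-path=/path/to/um/rose-meta vn13.5
# then compare against a known-good vn13.5 app and set the science options by hand
diff rose-app.conf /path/to/working/vn13.5/app/um/rose-app.conf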

Hi Mohit,

Yes, I totally agree with you.
I upgraded the vn12.0 suite to vn13.5 based on advice from people around me, but I was quite skeptical that it would work. To be honest, I half expected that something like this would happen, so this only proved my gut feeling right (unfortunately).

So I think I should follow the ‘traditional’ way and start from a release job, or from a working suite that is relatively close to my goal. Would you agree?

I’ve heard that Monsoon2 will be replaced with a new computer on which only UM vn13.5 or later can be run. If that is true, is there a global atmosphere-only nudged suite that you would recommend? It needs to run on both Monsoon2 and the new computer. I need to add capabilities to read 3D aviation emissions and to simulate contrails, contrail cirrus, and their radiative effects.

Masaru

Hi Masaru,

I mainly port UKCA/UKESM configurations to Monsoon as required, so you might have to ask around to see if any GA8/9 suites have been ported.
The GAL9+StratTrop configuration at vn13.5 is u-df519. This should work on Monsoon after changing SITE and adding account/project names at the top level.
It is already set up to use the Gregorian calendar, so the nudging can be activated easily. However, it is a Y2000 fixed run, so the emissions and other ancillaries (and the radiation clmcfg namelist) will need replacing to turn it into a transient one.
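A minimal sketch of the kind of top-level edit meant here (the variable names and values are illustrative; check u-df519’s rose-suite.conf for the options it actually defines):

# rose-suite.conf at the top level of the suite
SITE='monsoon'
# accounting details for job submission on Monsoon (placeholder values)
ACCOUNT='myproject'
HPC_PROJECT='myproject'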

Thank you for your advice, Mohit.
I forgot to tell you that I need to evaluate the effective radiative forcing from aviation emissions. You told me before that not all suites are suitable for calculating ERFs, right? Is u-df519 suitable for this purpose?

I am not sure what configuration you were looking for at that time, but I may have implied that not all UKCA jobs are scientifically validated, as some are for test purposes only. This is also the case with u-df519.
For assessed GA configurations you will have to look at the GADocumentation/GAL9 documentation. It looks like the vn13.5 job is u-df462.
I am not sure whether someone has already ported it for their own use, but since the Monsoon filesystems are mirrored to the internal HPCs, it should be worth trying to run it on Monsoon.