Hi CMS team,
I have a bit of backstory to cover so please bear with me as I explain.
Last year I worked on getting an SSP245 HadGEM3‐GC3.1‐LL suite working with ozone redistribution on archer. That was technical hurdle 1. Thanks to lots of help from the CMS team I got the suite working. The second technical hurdle was setting up the suite to run a pacemaker experiment where the SSTs are nudged in a region. I have worked with a MO colleague to set the suite up to run this style of experiment and now have it working well enough that I want to test it for longer than a few days.
Last year when I was setting up the suite, I had a family of restarts I could use as HadGEM3‐GC3.1‐LL contributed 5 ensemble members to SSP245 for CMIP6. In testing, I was initially using the same restarts as the SSP245 suites (which were based on different realisations from the historical runs). When I tested these restarts, only the restarts from one of the suites worked. The restarts from the other four failed very early in the simulation. I never got to the bottom of why they failed despite quite a bit of investigation. So I carried on using the one set of restarts that did work while adapting the suite into a nudging suite.
Now I am at the point where I want to do longer runs, but I need a set of 5 restarts to run the style of experiment I want. The one set of restarts that works is not enough on its own, as I can’t get the model to run with the restarts from the other four suites (and I want to run a 5 member ensemble). I have tried a different collection of restarts, now using a Jan 2020 branching date on the SSP245 runs themselves (not the restarts from the historical runs), and I have run into the same problem getting the suite to run. I have asked a number of people at the MO and the response has been very similar: this is strange and likely something to do with how the suite is set up on archer2.
In the ocean.out I have the following error:
===>>> : E R R O R
stpctl: the zonal velocity is larger than 20 m/s
kt= 2 max abs(U): 1.6544E+16, i j k: 298 321 1
output of last fields in numwso
===>>> : E R R O R
step: indic < 0
dia_wri_state : single instantaneous ocean state and forcing fields file created and named :output.abort.nc
===>>> : E R R O R
MPPSTOP
NEMO abort from dia_wri_state
In the past, I tried to track the error in the code using debug statements but I did not find the source of the problem. I also tried perturbing the restarts but this did not help.
Do you have any ideas I could test or strategies to work out how to proceed? The suite ID is u-dt414.
Regards,
Penny
Hi Penny,
One thing I noticed in u-dt414 is that you have set the UM restart file in um/rose-app.conf as:
astart='/work/n02/n02/penmaher/ssp245_N96O1_restarts/ssp245_N96O1_restarts_202001/bj616a.da20200101_00'
The astart variable is used as the input to the UM but also as the output from the reconfiguration. When reconfiguration is run, it reads the UM restart from ainitial and writes a new restart to astart.
You have reconfiguration on in your suite and ainitial set to:
ainitial='$UMDIR/ukcmip6/Restarts/u-aq853/aq853a.da25940101_00'
So I think what is happening is that the recon task is overwriting the 2020 restart file - this seems to be backed up by the timestamp of the file and the logs from the recon task.
You need to either switch reconfiguration off or, if you want to reconfigure, set the UM start dump as ainitial.
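To illustrate the clash described above, here is a minimal Python sketch (not part of the suite, Rose, or the UM). It only checks whether the path given as astart is the same as a dump you want to keep untouched. The function name `recon_would_overwrite` and the path in `HYPOTHETICAL_OUTPUT` are made up for illustration, and the check is a plain path comparison that does not expand environment variables or follow symlinks.

```python
import os.path
from typing import Iterable

def recon_would_overwrite(astart: str, protected_dumps: Iterable[str]) -> bool:
    """Return True if astart points at a dump that should not be overwritten.

    With reconfiguration on, the recon task writes its output to astart,
    so astart must not be one of the restart files you want to preserve.
    """
    target = os.path.normpath(astart)
    return any(target == os.path.normpath(p) for p in protected_dumps)

# The 2020 restart from the thread.
RESTART = (
    "/work/n02/n02/penmaher/ssp245_N96O1_restarts/"
    "ssp245_N96O1_restarts_202001/bj616a.da20200101_00"
)

# Hypothetical run-specific output location, used only for this example.
HYPOTHETICAL_OUTPUT = "/hypothetical/run/share/data/run.astart"

# astart set to the restart itself: recon would write over it.
print(recon_would_overwrite(RESTART, [RESTART]))              # True
# astart set to a separate output file: the restart is left alone.
print(recon_would_overwrite(HYPOTHETICAL_OUTPUT, [RESTART]))  # False
```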
Annette
Hi Annette,
I would be extremely pleased if that was the problem. I will test it this morning. Can I double check first? I have not spent much time thinking about the recon step (and I will dig around in UMDP302 now). In this suite I am using ozone redistribution and nudging with input files that need linking. Would either of these steps need recon on? Short answer is that I don’t know if I want to run with recon on or off.
In the last part of your comment, do you mean the following:
ainitial='/work/n02/n02/penmaher/ssp245_N96O1_restarts/ssp245_N96O1_restarts_202001/bj616a.da20200101_00'
As this is the restart I want to use in the run.
Penny
I should also add, the suite I am using as the base is u-bj616. In this suite they use:
ainitial='/data/d01/ukcmip6/N96O1_ensemble1_dumps/aq853a.da25940101_00'
astart='/data/d01/ukcmip6/ssp585_N96O1_ensemble1_dumps/bg466a.da20150101_00'
Where bg466a is a historical ensemble member and the piControl spin up.
But I see now in this suite recon is turned off.
Hi Annette,
I would like to leave recon on. I have updated the recon restarts but I still get the same velocity error.
I have updated the restarts as follows:
astart='${ROSE_DATA}/${RUNID}.astart'
ainitial='/work/n02/n02/penmaher/ssp245_N96O1_restarts/ssp245_N96O1_restarts_202001/bj616a.da20200101_00'
I also tested with it hard coded a second time in case the above syntax would not work.
astart='/work/n02/n02/penmaher/ssp245_N96O1_restarts/ssp245_N96O1_restarts_202001/bj616a.da20200101_00'
I also did a rose suite-clean on the dir to ensure all old links were removed (and I had not done one in a while).
Do you see anything else odd about my setup?
Penny
Hi Penny,
These settings should work with reconfiguration:
astart='${ROSE_DATA}/${RUNID}.astart'
ainitial='/work/n02/n02/penmaher/ssp245_N96O1_restarts/ssp245_N96O1_restarts_202001/bj616a.da20200101_00'
However you need to make sure you have a clean copy of bj616a.da20200101_00, because it was previously being overwritten by the reconfiguration with data from the u-aq853 restart.
Looking at the file, the header shows it was created today. You can see the fixed length header with uminfo bj616a.da20200101_00 | less. These entries give the creation timestamp:
header ( 35) = 2026
header ( 36) = 5
header ( 37) = 15
header ( 38) = 10
header ( 39) = 45
header ( 40) = 34
The file format is documented in UMDP F03.
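As a rough sketch of how to read those entries, the Python snippet below turns header values 35 to 40 into a single timestamp. It assumes the text looks like the lines quoted above and that the six entries are year, month, day, hour, minute, second in that order. The function name `creation_time` and the regular expression are illustrative only and are not part of uminfo.

```python
import re
from datetime import datetime

# Header entries copied from the output quoted above.
SAMPLE = """\
header ( 35) = 2026
header ( 36) = 5
header ( 37) = 15
header ( 38) = 10
header ( 39) = 45
header ( 40) = 34
"""

# Matches lines of the form "header ( 35) = 2026" (assumed layout).
HEADER_LINE = re.compile(r"header\s*\(\s*(\d+)\)\s*=\s*(-?\d+)")

def creation_time(header_text: str) -> datetime:
    """Build a datetime from fixed-length header entries 35 to 40."""
    values = {int(i): int(v) for i, v in HEADER_LINE.findall(header_text)}
    year, month, day, hour, minute, second = (values[i] for i in range(35, 41))
    return datetime(year, month, day, hour, minute, second)

print(creation_time(SAMPLE))  # 2026-05-15 10:45:34
```

A creation time much later than when the dump was originally produced suggests the file has been rewritten since, which is what happened here.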
Annette
Annette you are awesome! That did fix it indeed. I am extremely grateful for your help!