Unusual errors when performing EC config

Hi NCAS CMS,

I’m trying to perform a 10-day nested suite starting on 20190304. I have downloaded all the data for hourly ERA5 boundary conditions (/work/n02/n02/jostal/ERA5_analysis).

When performing the configuration for EC, the configuration fails at two hourly timesteps. I have no idea why it fails at these two timesteps in particular.

I have tried re-running the same reconfig multiple times. Rebuilt the model and started from scratch. I’ve tried re-downloading the ERA5 data again. Even tried changing the files to grib1 format using cdo, but then it couldn’t recognise multiple levels?!

The model run is u-de027. The following error appears,

???
???!!!???!!!???!!!???!!!???!!! ERROR ???!!!???!!!???!!!???!!!???!!!
? Error code: 41
? Error from routine: RCF_GRIB_READ_DATA
? Error message: Unknown GRIB version whilst extracting length of message.
? Error from processor: 10
? Error number: 210
???

for instance in /work/n02/n02/jostal/cylc-run/u-de027/log/job/20190304T0000Z/ec_um_recon_181/02/job.err .

Any idea how I can change the grib version? Just strange how it’s not common across all hourly timesteps.

Kind regards,
Josh

Still no luck after re-building and re-running the model today. Crashes at ec_um_recon_095 and ec_um_recon_181.

Josh

Built a new UM job (u-de532) on Friday (18th Mar '24). Same um-recon issue occurs for hourly timesteps 95 and 181. Re-downloading ERA5 data for these timestamps. Still none the wiser? :man_shrugging:t5:

Hi Josh,

Just to say we’re not ignoring your query. At the moment we’re not sure what to advise.

Regards,
Ros.

:sweat_smile: Thanks. Yeah I’m a tad confused by it all.

I did think about trying a completely different 10-day period (maybe something up with extracted ERA5 data), but in the future I’m planning to perform nested simulations for various start dates. So if this error was to appear again, we’d know what to do.

Is there anyone I can contact at the Met Office? Stuart?

Kind regards,
Josh

Hi Josh

The reconfig thinks that the grib version for ec_grib_201903111200.t+000 (for ec_um_recon_180) is 2, but for ec_grib_201903111300.t+000 (for ec_um_recon_181) it finds 6.

This seems odd since ec_grib_201903111300.t+000 can be read OK with grib_api utilities and appears to be a perfectly good grib file. More investigation needed.

Grenville
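
For cross-checking which ERA5 file a given reconfiguration task reads, here is a small Python sketch. It assumes the tasks are numbered by hours since the cycle start (20190304T0000Z), which is consistent with ec_um_recon_180 reading ec_grib_201903111200.t+000 and ec_um_recon_181 reading ec_grib_201903111300.t+000 as quoted above. The function name `era5_file_for_task` is made up for illustration and is not part of the suite.

```python
from datetime import datetime, timedelta

# Cycle start of the suite, taken from the job log path (20190304T0000Z).
START = datetime(2019, 3, 4, 0)


def era5_file_for_task(n, start=START):
    """Guess the ERA5 file name for ec_um_recon_<n>.

    ASSUMPTION: task n corresponds to start + n hours. This matches the
    two file/task pairs quoted in the thread but is not taken from the suite.
    """
    valid = start + timedelta(hours=n)
    return f"ec_grib_{valid:%Y%m%d%H%M}.t+000"


for task in (95, 180, 181):
    print(f"ec_um_recon_{task:03d} -> {era5_file_for_task(task)}")
```

If the numbering assumption is wrong for your suite, the task's job.out or job.err is the authoritative place to check which file was opened.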

Hi Grenville,

Yep, it’s bizarre. And ec_grib_201903111300.t+000 (for ec_um_recon_095) comes out as version 93?

Let me know if I can do anything? Clueless on what to do next.

Kind regards,
Josh

Maybe just sidestep the check in rcf_grib_read_data_mod.F90.
grib_dump says:

grenvill@ln02:/work/y07/shared/umshared/lib/cce-15.0.0/eccodes/2.24.1/bin> ./grib_dump -OtaH /work/n02/n02/jostal/ERA5_analysis//ec_grib_201903111300.t+000 | more
***** FILE: /work/n02/n02/jostal/ERA5_analysis//ec_grib_201903111300.t+000 
#==============   MESSAGE 1 ( length=81302 )               ==============
1-4       ascii (str) identifier = GRIB ( 0x47 0x52 0x49 0x42 )
5-6       unsigned (int) reserved = MISSING ( 0xFF 0xFF )
7         codetable (int) discipline = 0 ( 0x00 ) [Meteorological products (grib2/tables/5/0.0.table) ]
**8         unsigned (int) editionNumber = 2 ( 0x02 ) [ls.edition]**
9-16      section_length (int) totalLength = 81302 ( 0x00 0x00 0x00 0x00 0x00 0x01 0x3D 0x96 )

so there appears to be something dodgy with the reconfig code.

Grenville
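
The same check can be done without ecCodes by reading the indicator section directly. The sketch below is only an illustration in Python, not the reconfiguration code: it relies on the standard GRIB layout (the edition number sits in octet 8; the total length is in octets 5-7 for edition 1 and octets 9-16 for edition 2). The function name `read_grib_header` is hypothetical.

```python
def read_grib_header(buf, offset=0):
    """Return (edition, total_length) of the GRIB message starting at offset."""
    if buf[offset:offset + 4] != b"GRIB":
        raise ValueError("no GRIB identifier at this offset")
    if len(buf) < offset + 8:
        raise ValueError("buffer too short to hold the edition octet")
    edition = buf[offset + 7]  # octet 8 in both edition 1 and edition 2
    if edition == 1:
        # Edition 1: octets 5-7 hold the total length (24-bit, big-endian).
        length = int.from_bytes(buf[offset + 4:offset + 7], "big")
    elif edition == 2:
        if len(buf) < offset + 16:
            raise ValueError("buffer too short to hold the edition 2 length")
        # Edition 2: octets 9-16 hold the total length (64-bit, big-endian).
        length = int.from_bytes(buf[offset + 8:offset + 16], "big")
    else:
        raise ValueError(f"unknown GRIB edition {edition}")
    return edition, length


# First 16 octets of MESSAGE 1, copied from the grib_dump output above.
header = bytes([0x47, 0x52, 0x49, 0x42,              # "GRIB"
                0xFF, 0xFF,                          # reserved
                0x00,                                # discipline
                0x02,                                # editionNumber
                0x00, 0x00, 0x00, 0x00,
                0x00, 0x01, 0x3D, 0x96])             # totalLength
print(read_grib_header(header))  # (2, 81302)
```

For a real file, passing the first 16 bytes (`open(path, "rb").read(16)`) gives the header of the first message only; later messages need their own offsets.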

Hi Grenville,

I’ll change the code and rebuild etc. this afternoon. Could I ask for an extra 400 CU on account n02-NEX006247? I’ve been trying to use it all up before the end of March, but would like to spend Tues-Thurs trying to fix this issue.

I’ll have plenty of CU time in the new “HPC” year.

Kind regards,
Josh

Josh

Hang on with the grib problem - my half-baked suggestion didn’t work.

I added CUs.

Grenville

So did you try a new branch where you skip rcf_grib_read_data_mod.F90?

Josh

Oh. I realise you can’t just state the GRIB version, as it can’t be assumed that it is either 1 or 2.

Josh - please tell me another reconfig that failed, i.e. ec_um_recon_???

Grenville

Only two fail. ec_um_recon_095 and ec_um_recon_181.

Josh

We know why the reconfig is failing (it is a problem with the reconfig code and not with the grib files) - how to fix it is still unclear - we are working on that.

Grenville

Josh

I’ve added a fix for this - please include fcm:um.xm/branches/dev/grenvillelister/vn12.0_grib-fix in the UM build and rebuild.
Please carefully check the results.

Grenville

Hi Grenville,

Thanks for creating the new branch.

Sorry to be a pain, but could you possibly explain what the new branch does differently? I can see that if skip == 1, then pos_in_file = 6. Will this work given that GRIB versions 6 and 95 are being reported for ec_um_recon_095 and ec_um_recon_181?

Currently updating job (and will double-check EC config output).

Kind regards,
Josh.

Josh
The reconfig searches for the bit pattern that spells GRIB in ASCII, since that signifies the beginning of a field. But that bit pattern can also appear elsewhere in the data, and does (with quite low probability); you found 2 cases. We know those occurrences don’t signify the start of a field because the next bits of data read are not sensible: the grib version and/or data size are wrong for a field. The reconfig doesn’t account for that as written. My hack says: if you find a dodgy grib version, assume you’ve not found the beginning of a field, skip it, and carry on searching.

It’s not foolproof - if there is a truly corrupt field it might struggle.

Hope that helps.

Grenville
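
The idea described above can be sketched in a few lines of Python. This is not the Fortran in the branch, only an illustration of the technique: find every occurrence of the bytes "GRIB", and accept one as the start of a field only if what follows looks sensible. The function names are made up for the example. The edition check mirrors the hack described above; the length and trailing "7777" checks are extra sanity tests added here as an assumption, not something the branch is known to do.

```python
MAGIC = b"GRIB"


def candidate_offsets(buf):
    """Yield every offset where the bytes 'GRIB' occur, genuine or not."""
    pos = buf.find(MAGIC)
    while pos != -1:
        yield pos
        pos = buf.find(MAGIC, pos + 1)


def plausible_header(buf, pos):
    """Return (edition, length) if a believable message starts at pos, else None."""
    if pos + 8 > len(buf):
        return None
    edition = buf[pos + 7]
    if edition == 1:
        length = int.from_bytes(buf[pos + 4:pos + 7], "big")
    elif edition == 2:
        length = int.from_bytes(buf[pos + 8:pos + 16], "big")
    else:
        return None  # dodgy edition: not the beginning of a field
    end = pos + length
    # Extra checks (assumption, see lead-in): sane length and '7777' end marker.
    if length < 12 or end > len(buf) or buf[end - 4:end] != b"7777":
        return None
    return edition, length


def find_messages(buf):
    """Return (offset, edition, length) for each accepted message start."""
    found = []
    for pos in candidate_offsets(buf):
        header = plausible_header(buf, pos)
        if header is not None:  # otherwise skip it and carry on searching
            found.append((pos, *header))
    return found


def make_grib2(payload):
    """Build a toy edition 2 message: 16-octet indicator + payload + '7777'."""
    total = 16 + len(payload) + 4
    return MAGIC + b"\xff\xff\x00\x02" + total.to_bytes(8, "big") + payload + b"7777"


# Toy data: the first payload happens to contain 'GRIB' followed by an edition of 6.
spurious = b"\x00" * 10 + MAGIC + b"\x01\x02\x03\x06" + b"\x00" * 10
buf = make_grib2(spurious) + make_grib2(b"\x00" * 20)

print(list(candidate_offsets(buf)))  # three candidates, one of them spurious
print(find_messages(buf))            # only the two genuine message starts
```

As noted above, this kind of check is not foolproof: a genuinely corrupt field would also be skipped silently, so the reconfiguration output still needs checking.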

Hi Grenville,

Thanks. Multiple ec_um_recon are currently running. Fingers crossed for 095 and 181.

Kind regards,
Josh

Hi all,

Model is successfully running. Those two ec reconfigs worked and it looks like sensible output.

Thanks once again.

Josh