netCDF error on ARCHER2-23C

Hi Support Team,

I am trying to get an experiment (u-cg007) that I was previously running on the ARCHER2 4-cabinet system to run on the 23-cabinet system.

The experiment runs OK for the initial 24-hour cycle, but fails early in the second cycle with the following error:

****************** NetCDF_File Error Report ***************************
Problem with unit 14 filename is /work/n02/n02/phill/cylc-run/u-cg007/share/data/history/ch_large_std_302pt5_
98L_fix_rh/atmos_ch_large_std_302pt5_98L_fix_rh_Table5_10000102_00.nc

???
???!!!???!!!???!!!???!!!???!!! ERROR ???!!!???!!!???!!!???!!!???!!!
? Error code: 60
? Error from routine: NC_PUT_VAR_REAL_1D
? Error message: NetCDF: Numeric conversion not representable : NF90_PUT_VAR
? Error from processor: 0
? Error number: 57
???

Any suggestions for how to fix this would be greatly appreciated!

Thanks,

Peter

Peter,
I can’t see your files (you would have to chmod -R g+rX /home/n02/n02/
and chmod -R g+rX /work/n02/n02/)

but if you want a suggestion then I would see whether the crash is in the code in your branch, and if it is I would look at the type of the data being written. It’s possible that, because you have different compilers on the full ARCHER2, they are making different decisions as to the data type and so not matching what netcdf is expecting.

You can also compile with debug options if you can’t find where it’s crashing.
Hope that’s a sensible suggestion,

Dave

The chmod commands should have had your username in them, i.e. /work/n02/n02/phill

Hi Dave,

Thanks for your help.

I’ve changed permissions on ARCHER2, so you should be able to view the folders now.

I attempted to compile with debug options, but got a compilation failure instead. I only used the “safe” compilation on the 4C system, so I can’t say whether this is a new error on the 23C system. Unfortunately I didn’t copy the error before re-running a clean run with the safe compilation option, but (from memory and my Google search history) it was in pio_byteswap.c and said that the “span” variable “must have explicitly specified data sharing attributes”. I don’t think this is related to the netCDF error anyway?

Do you have any further suggestions, other than adding some print commands to check which variable is causing the error?

Thanks,

Peter

Some things can be compiler dependent - there are certainly problems with shumlib in CCE11 (which go away with CCE12) which are the compiler’s fault. But as you say - your error is not this.

I suggested the debug options to help find which line the code crashes on - but do you already know this? Are you the author of the subroutine? If so, you can put a link to the source and line here. Adding print statements to check that every variable holds valid data, of the correct type and within the correct bounds, is a simple but hopefully sensible thing to do. If you can find which variable is at fault (the data itself, plus the count etc. that are all passed to the put-var routine) then I expect you will be close.

It’s possible that changing compiler versions will solve things, given that it worked on the 4-cabinet system, but if this is your code it would be better to debug it in case there’s a real issue.

Let us know what you see
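
As a rough illustration of the kind of check described above, here is a minimal Python sketch (not UM code, and not the routine that failed). It scans a list of values for anything that would be unsafe to store as a 32-bit real: NaN, infinity, or a magnitude beyond the largest finite single-precision value. The function name find_suspect_values and the sample field are hypothetical, and the assumption that out-of-range values are what trigger "Numeric conversion not representable" is only a plausible reading of the message.

```python
import math

# Largest finite IEEE 754 single-precision (32-bit) value.
FLOAT32_MAX = 3.4028234663852886e38


def find_suspect_values(values):
    """Return (index, value, reason) for entries unsafe to store as 32-bit reals."""
    suspects = []
    for i, v in enumerate(values):
        if math.isnan(v):
            suspects.append((i, v, "NaN"))
        elif math.isinf(v):
            suspects.append((i, v, "infinite"))
        elif abs(v) > FLOAT32_MAX:
            suspects.append((i, v, "outside 32-bit range"))
    return suspects


# Hypothetical sample standing in for a diagnostic field (not real model output).
field = [3.2, -1.5, 1.0e45, float("nan")]
for index, value, reason in find_suspect_values(field):
    print(f"element {index}: {value!r} ({reason})")
```

The equivalent check in the model itself would be a few print statements just before the put-var call, reporting the minimum, maximum and any non-finite values of the array being written.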

The code crashes when trying to write diagnostics to a netCDF file - in particular it gets an error from NF90_PUT_VAR, which I think is netCDF library code, which is called by nc_put_var_real_1d. I guess this is a controlled exit rather than a crash? This is not code I’ve edited.

After doing some further digging, I found that the error occurs when writing the 10-metre u-wind diagnostic.

The experiment ran OK on the 4 cabinet system. It was set to run in 24 hour cycles and was crashing when writing hourly-mean diagnostics for the first hour of the second cycle. The model evolution over the first cycle looked reasonable, so I’ve changed the model to use a 48 hour cycle and it still crashes when writing the same diagnostic for the first hour of the second cycle, though this is 24 hours later. I think this means the error is probably related to the cycling?

The crash isn’t occurring in code I wrote, so I’d be happy to experiment with using a different compiler. How do you change the compiler that is used?

Thanks,

Peter

Hi Peter

I suspect the model is creating rubbish - please try writing out 64-bit data rather than 32-bit (look for ncvar_prec in the Rose GUI and set it to 2)

Grenville
