Lib-4171 : UNRECOVERABLE library error

Morning CMS team,

I am in the process of testing my suite u-dt414. Up until now I have only run it for a few months at a time. I have now attempted a 5-year test, and the model fails after completing 10 months. The suite is an SSP245 HadGEM3.1 run at N96 with SST nudging in the tropical Pacific.

I find it interesting that the model runs fine if there is a syntax error in a write statement. So either it is entering an area of the code for the first time, or this error is a red herring.

The error I get in job.err is the same as in ticket 518: UNRECOVERABLE library error which is:

 
lib-4171 : UNRECOVERABLE library error
An output list item is incompatible with its data edit-descriptor.

Encountered during a sequential formatted WRITE to an internal file (character variable)
srun: error: nid002534: task 194: Exited with exit code 75

I have made a lot of changes in the suite, but none of the code changes I used in debugging are still pointed to by the suite (I have previously used local code for both the UM and NEMO, but these branches are either commented out or removed).

Any idea how I could get more diagnostic information from the suite output (i.e. which subroutine killed the run), or ideas on how to trap the error?

Penny

Hi Penny,

Yes, it does look like this is an untested error statement, or the actual message content going beyond what is expected. Unfortunately, there is no traceback available to see which routine this is coming from.
In the first instance, you can try increasing the verbosity, in case part of the message comes through or earlier prints indicate where the model has got to.
In app/um/rose-app.conf:

  • [env]PRINT_STATUS=PrStatus_Diag (Note: for debug only, revert back for actual run)
  • [namelist:prnt_control]print_writers=1 (so output from all PEs is available).

Resubmit the failed coupled task.
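
Once the task has been rerun with the extra verbosity, it can help to see at a glance where each PE stopped. The following is a minimal Python sketch, not part of the UM or Rose tooling: it reports the last non-blank line of every file matching a glob pattern and counts how many files end on the same line. The function names `last_lines` and `summarise` are made up for illustration, and the glob pattern is something you would need to set to wherever your suite writes its per-PE output.

```python
import glob
from collections import Counter


def last_lines(pattern):
    """Map each file matching `pattern` to its last non-blank line."""
    result = {}
    for path in sorted(glob.glob(pattern)):
        last = ""
        # errors="replace" keeps reading even if the log contains odd bytes.
        with open(path, errors="replace") as handle:
            for line in handle:
                if line.strip():
                    last = line.rstrip("\n")
        result[path] = last
    return result


def summarise(pattern):
    """Count how many files end on each distinct last line."""
    return Counter(last_lines(pattern).values())
```

For example, calling `summarise("path/to/pe_output/*")` (path assumed) and printing the result of `.most_common()` lists the most frequent final lines first, so any PE that got further than the rest stands out.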

Mohit

Thanks for the suggestion. The expanded messaging in job.err tells me the following:

lib-4211 : UNRECOVERABLE library error
A WRITE operation tried to write a record that was too long.

Encountered during a sequential formatted WRITE to an internal file (character variable)

So that is helpful in that it tells me why the write statement failed.

In the .pe output, each processor gets to the same place and prints the following (or similar to it):

Communicators....

Function     Comm ID  Processors     My Rank
--------     -------  ----------     -------
Global   -2080374783         198          62
Model    -1006632959         192          57
IO          67108864           6          -1
Leader      67108864           2          -1

The block of code that prints this table is in src/io_services/server/ios_init.F90 of the UM code. That block finishes as expected (i.e. the model does not crash while printing the table).

Six of the processors (e.g. 59) get slightly further than this and print:

Communicators…

Function     Comm ID  Processors     My Rank

Global   -2080374783         198          59
Model    -1006632958           6           4
IO       -2080374779           6           4
Leader      67108864           2          -1

Info: PE      59 is an I/O server
IOS_queue: Logging to: ioserver_log.00059

The IOS_queue part occurs within src/io_services/server/ios_queue_mod.F90

Any ideas on what I should do apart from randomly searching in the code following this last line of output?

Many thanks,

Penny

Hi Penny,

That is not helpful, as the run has now failed immediately on starting (albeit with a similar error), whereas the original task ran for something like 16 days!
This might point to a bug in another print message that is activated only at the Diag level, or to something in the ARCHER compiler causing strings to be interpreted differently (#).

Can you see if reverting to PrStatus_Normal, but setting [namelist:prnt_control]print_force_flush=.true. leads to the original outcome?
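
If it does come down to searching the code, the search can at least be narrowed to WRITE statements whose unit is a named variable, since the error is raised on a WRITE to an internal file (a character variable). Below is a rough Python sketch of such a search. It is only a heuristic: it cannot tell a character variable from an integer unit-number variable, it ignores continuation lines, and the function name `find_named_unit_writes` and the file suffixes are assumptions for illustration.

```python
import os
import re

# Matches "WRITE(name," or "WRITE(UNIT=name," where the unit is an
# identifier rather than a literal number or "*". The identifier may be a
# character variable (an internal file) or an integer unit-number variable,
# so every hit still needs checking by eye.
WRITE_TO_NAME = re.compile(
    r"\bwrite\s*\(\s*(?:unit\s*=\s*)?([a-z_]\w*)\s*[,)]", re.IGNORECASE
)


def find_named_unit_writes(root, suffixes=(".F90", ".f90")):
    """Yield (path, line_number, unit_name, text) for candidate WRITEs."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if not name.endswith(suffixes):
                continue
            path = os.path.join(dirpath, name)
            with open(path, errors="replace") as handle:
                for number, line in enumerate(handle, start=1):
                    # Skip full-line Fortran comments.
                    if line.lstrip().startswith("!"):
                        continue
                    match = WRITE_TO_NAME.search(line)
                    if match:
                        yield path, number, match.group(1), line.strip()
```

Pointing it at a single directory of the extracted source, such as the one holding the files named earlier in this thread, keeps the list of hits short enough to read through.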

Mohit

#: In the past we have needed to reformat some text input files specifically for ARCHER.

I have rerun with the suggested changes. Does the output help?

There are a lot of messages about negative Q at every time step, which is suspicious. It is possible that one of the printed q/component values becomes unphysical and no longer fits in the error message.

It does indicate that the model is heading towards a ‘blow-up’, so see if you can try the perturb theta method to modify the latest dump and push the run further.

Mohit