Lib-4171 : UNRECOVERABLE library error

penmaher · 4 June 2026 09:08

Morning CMS team,

I am in the process of testing my suite u-dt414. Until now I have only run it for a few months at a time. I have attempted to run a 5 year test and the model is failing after completing 10 months. The suite is an SSP245 HadGEM3.1 run at N96 with SST nudging in the tropical Pacific.

I find it interesting that the model ran fine for 10 months if there is a syntax error in a write statement. So either it is entering an area of the code for the first time, or this error is a red herring.

The error I get in job.err is the same as in ticket 518: UNRECOVERABLE library error which is:

 
lib-4171 : UNRECOVERABLE library error
An output list item is incompatible with its data edit-descriptor.

Encountered during a sequential formatted WRITE to an internal file (character variable)
srun: error: nid002534: task 194: Exited with exit code 75

I have made a lot of changes in the suite, but none of the code changes I used in debugging are still referenced by the suite (I have previously used local code for both the UM and NEMO, but these branches are either commented out or removed).

Any idea how I could get more diagnostic information from the suite output (i.e. which subroutine killed the run), or ideas on how to trap the error?

Penny

mdalvi · 4 June 2026 11:37

Hi Penny,

Yes, it does look like this is an untested error statement, or the actual message content going beyond what is expected. Unfortunately, there is no traceback available to see which routine this is coming from.
In the first instance you can try increasing the verbosity, in case part of the message comes through or earlier prints indicate where the model has got to.
In app/um/rose-app.conf, set:

[env]PRINT_STATUS=PrStatus_Diag (Note: for debug only, revert back for actual run)
[namelist:prnt_control]print_writers=1 (so output from all PEs is available).

Resubmit the failed coupled task.

Mohit

penmaher · 4 June 2026 12:18

Thanks for the suggestion. The expanded messaging in job.err tells me the following:

lib-4211 : UNRECOVERABLE library error
A WRITE operation tried to write a record that was too long.

Encountered during a sequential formatted WRITE to an internal file (character variable)

So that is helpful in that it tells me why the write statement failed.

In the .pe output, each processor gets to the same place and prints the following (or similar to it):

Communicators....

Function     Comm ID  Processors     My Rank
--------     -------  ----------     -------
Global   -2080374783         198          62
Model    -1006632959         192          57
IO          67108864           6          -1
Leader      67108864           2          -1

The block of code that prints this table is in src/io_services/server/ios_init.F90 of the UM code. That block finished as expected (i.e. the model does not crash while printing the table).

Six of the processors (e.g. 59) get slightly further than this and print:

Communicators…

Function     Comm ID  Processors     My Rank

Global   -2080374783         198          59
Model    -1006632958           6           4
IO       -2080374779           6           4
Leader      67108864           2          -1

Info: PE      59 is an I/O server
IOS_queue: Logging to: ioserver_log.00059

The IOS_queue part occurs within src/io_services/server/ios_queue_mod.F90
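
For anyone wanting to compare how far each PE got without opening every file, here is a minimal Python sketch that groups the per-PE output files by their last non-blank line. The function name and the `*.pe*` glob are assumptions made for illustration, not UM conventions; adjust the pattern to match the actual file names in the task's work directory.

```python
from collections import defaultdict
from pathlib import Path


def last_lines_by_pe(output_dir, pattern="*.pe*"):
    """Group output files by their last non-blank line.

    `pattern` is a hypothetical glob for the per-PE output files;
    change it to whatever the files are really called.
    """
    groups = defaultdict(list)
    for path in sorted(Path(output_dir).glob(pattern)):
        text = path.read_text(errors="replace")
        non_blank = [ln.strip() for ln in text.splitlines() if ln.strip()]
        last = non_blank[-1] if non_blank else "<empty file>"
        groups[last].append(path.name)
    return dict(groups)


if __name__ == "__main__":
    import sys

    directory = sys.argv[1] if len(sys.argv) > 1 else "."
    # Rarest endings first: the PEs that stopped somewhere unusual
    # are normally the ones worth reading in full.
    summary = last_lines_by_pe(directory)
    for last, files in sorted(summary.items(), key=lambda kv: len(kv[1])):
        print(f"{len(files):5d} file(s) end with: {last}")
```

Run against the output above, this would be expected to separate the I/O server PEs (ending on the IOS_queue line) from the rest, which narrows down which files to read closely.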

Any ideas on what I should do apart from randomly searching in the code following this last line of output?

Many thanks,

Penny

mdalvi · 4 June 2026 12:41

Hi Penny,

That is not helpful, as the run has actually failed immediately on starting (albeit with the same error), while the original task ran for something like 16 days!
This might point to a bug in another print message that is activated only at the Diag level, or something on the ARCHER compiler causing strings to be interpreted differently(#).

Can you see if reverting to PrStatus_Normal, but setting [namelist:prnt_control]print_force_flush=.true. leads to the original outcome?

Mohit

#: In the past we have needed to reformat some text input files specifically for ARCHER.

penmaher · 4 June 2026 13:47

I have rerun with the suggested changes. Does it help?

mdalvi · 4 June 2026 15:25

There are a lot of messages about negative Q at every time step which is suspicious. It is possible one of the printed q/component values becomes unphysical and cannot fit in the error message.
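
To see whether such warnings become more frequent in the run-up to the failure, the output can be tallied per timestep. The following is only a sketch: the function name is made up, and both regular expressions are placeholders that must be replaced with the wording that actually appears in the model output.

```python
import re


def count_matches_per_block(lines, message_re, block_re):
    """Count lines matching `message_re` inside each block.

    A block starts at any line matching `block_re` (for example a
    timestep header). Both patterns are supplied by the caller because
    the exact wording in the model output is not assumed here.
    Matching is case-insensitive.
    """
    msg = re.compile(message_re, re.IGNORECASE)
    blk = re.compile(block_re, re.IGNORECASE)
    counts = []  # one [header line, count] pair per block
    for line in lines:
        if blk.search(line):
            counts.append([line.strip(), 0])
        elif counts and msg.search(line):
            counts[-1][1] += 1
    return [(header, n) for header, n in counts]


if __name__ == "__main__":
    import sys

    # Hypothetical patterns: edit them to match the real messages.
    with open(sys.argv[1], errors="replace") as fh:
        result = count_matches_per_block(fh, r"negative q", r"timestep\s+\d+")
    for header, n in result:
        print(f"{n:6d}  {header}")
```

A count that climbs steadily towards the last timestep would be consistent with the model heading for a blow-up, though it does not by itself identify the cause.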

It does indicate the model is heading towards a ‘blow-up’, so see if you can try the perturb theta method: modify the latest dump and push the run further.

Mohit

penmaher · 5 June 2026 07:44

Nice pick up. Yes, negative q in the convection scheme does sound like a problem. I will perturb theta and rerun.

penmaher · 5 June 2026 08:37

I did not update my restart dump frequency, so I will need to start the run again. Could I check with you, is it dumpfreqim within [namelist:nlstcgen] of the app/um/rose-app.conf or something else that sets the dump output frequency?

Penny

mdalvi · 5 June 2026 09:49

dumpfreqim is the output dump frequency.
It is not clear what the issue is here: the available restart dump for 20161001 (in share/data/History_Data/) is the one that would be perturbed, and the coupled task re-submitted, without re-running the suite.

Mohit

penmaher · 5 June 2026 09:59

I was looking in the wrong spot for the restarts. I was looking in the work directory and in the archived data on JASMIN. Thanks!

penmaher · 5 June 2026 12:08

I am extremely pleased to say that nudging theta worked. Many (many, many, many…) thanks.
