BUFFIN: Read Failed

racheldiamond · 20 May 2022 10:02

Dear NCAS helpdesk,

My suite u-ck767 keeps failing near the end of the coupled task for 19071101T0000Z with error in cylc-run/u-ck767/log/job/19071101T0000Z/coupled/04/job.err:

BUFFIN: Read Failed: Cannot allocate memory
[1]
[1] ???
[1] ???!!!???!!!???!!!???!!!???!!! ERROR ???!!!???!!!???!!!???!!!???!!!
[1] ? Error code: 22
[1] ? Error from routine: io:buffin
[1] ? Error message: Error in buffin errorCode= 0.00 len=26112/28160
[1] ? Error from processor: 0
[1] ? Error number: 0
[1] ???
[1]

I’ve checked SAFE and do have enough memory on the /work/ and /home/ filesystems and my other suites are running fine, so I don’t understand what the problem is. Can you help?

Many thanks.

Best wishes,

Rachel

grenville · 23 May 2022 16:23

Rachel

Please switch on Extra diagnostic messages & run again - perhaps that will tell us which file is causing the problem.

(memory and disk are different things in this context)
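
To illustrate the distinction: the "Cannot allocate memory" text in job.err appears to be the operating system's standard message for the ENOMEM error code, which is about RAM, whereas a full disk is normally reported under a different code (ENOSPC). A minimal Python sketch, not specific to the UM, that prints the message for each code (the helper name `describe` is just for illustration):

```python
import errno
import os

# ENOMEM: the process could not obtain memory (RAM).
# ENOSPC: the filesystem has no space left.
CODES = {
    "ENOMEM": errno.ENOMEM,
    "ENOSPC": errno.ENOSPC,
}

def describe(codes=CODES):
    """Return {errno name: operating-system message} for each code."""
    return {name: os.strerror(num) for name, num in codes.items()}

if __name__ == "__main__":
    for name, msg in describe().items():
        print(f"{name}: {msg}")
```

The exact wording of each message depends on the platform, so compare against what the script prints on the machine where the job ran.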

Grenville

racheldiamond · 24 May 2022 09:31

Hi,

Switching on extra diagnostic messages gives an error (I think because the diagnostic messages are too long) in coupled task /work/n02/n02/radiam24/cylc-run/u-ck767/log/job/19071101T0000Z/coupled/05/job.err

lib-4211 : UNRECOVERABLE library error
A WRITE operation tried to write a record that was too long.

Encountered during a sequential formatted WRITE to an internal file (character variable)

I can also try re-running with a different diagnostic setting (e.g. normal or operational)? Or is there an easy way to fix this?

Thanks.

Best wishes,

Rachel

grenville · 24 May 2022 10:11

Yes, try operational

racheldiamond · 24 May 2022 12:49

Hi, I did.

The last lines of /home/n02/n02/radiam24/cylc-run/u-ck767/work/19071101*/coupled/pe_output/*.pe000 are:

TEMP CORRECTION OVER A DAY = -0.40118E-02 K
TEMPERATURE CORRECTION RATE = -0.46432E-07 K/S
FLUX CORRECTION (ATM) = -0.33520E+00 W/M2
IOS_init_md: Wating for buffer 13
Info: Stall getting protocol queue slot at 2244.684 of 0.000
IOS_init_md: Wating for buffer 11
Info: Stall getting protocol queue slot at 2244.691 of 0.000
IOS_init_md: Wating for buffer 3
Info: Stall getting protocol queue slot at 2244.706 of 0.000
IOS_init_md: Wating for buffer 2
Info: Stall getting protocol queue slot at 2244.712 of 0.000
IOS_init_md: Wating for buffer 39
Info: Stall getting protocol queue slot at 2244.733 of 0.000
IOS_init_md: Wating for buffer 3
Info: Stall getting protocol queue slot at 2244.777 of 0.000
IOS_init_md: Wating for buffer 35
Info: Stall getting protocol queue slot at 2244.797 of 0.000

The last lines of /home/n02/n02/radiam24/cylc-run/u-ck767/work/19071101*/coupled/pe_output/*.pe011:

[1] UMPRINTOPENSTREAM: Opening unit 6 on file pe_output/ck767.fort6.pe011
[1]
[1] ???
[1] ???!!!???!!!???!!!???!!!???!!! ERROR ???!!!???!!!???!!!???!!!???!!!
[1] ? Error code: 22
[1] ? Error from routine: io:buffin
[1] ? Error message: Error in buffin errorCode= 0.00 len=26112/28160
[1] ? Error from processor: 0
[1] ? Error number: 0
[1] ???
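
Since the error banner showed up in the pe011 file rather than pe000, it may be quicker to scan every file in pe_output than to open them one by one. A rough Python sketch, assuming the files match `*.pe*` and that the banner always contains the "Error from routine:" and "Error message:" lines seen above (the function name `find_um_errors` is made up for this example):

```python
import glob
import os
import re

ROUTINE_RE = re.compile(r"Error from routine:\s*(\S+)")
MESSAGE_RE = re.compile(r"Error message:\s*(.*)")

def find_um_errors(pe_output_dir):
    """Return {filename: [(routine, message), ...]} for files holding an error banner."""
    hits = {}
    for path in sorted(glob.glob(os.path.join(pe_output_dir, "*.pe*"))):
        routine = None
        found = []
        with open(path, errors="replace") as fh:
            for line in fh:
                m = ROUTINE_RE.search(line)
                if m:
                    # Remember the routine until its message line arrives.
                    routine = m.group(1)
                    continue
                m = MESSAGE_RE.search(line)
                if m and routine is not None:
                    found.append((routine, m.group(1).strip()))
                    routine = None
        if found:
            hits[os.path.basename(path)] = found
    return hits

if __name__ == "__main__":
    import sys
    for name, errors in find_um_errors(sys.argv[1]).items():
        for routine, message in errors:
            print(f"{name}: {routine}: {message}")
```

Run it with the pe_output directory as the only argument; files without a banner are simply left out of the result.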

Does this help?

Best wishes,

Rachel

grenville · 25 May 2022 12:49

Hi Rachel

Do you have another suite that uses the same forcing data (ancil files etc) and offline emissions that has run past 1907 12 1?

I’m guessing that there may be a problem with the emission file - the model has stopped at the time it should have read emission data.

The UM is poor at telling us which file it’s having problems with – you may need to put in a few print statements to get it to write out file names.

Grenville

racheldiamond · 25 May 2022 12:56

Hi,

Thanks - u-cl809 uses the same forcing data/ancils/restart files etc and finished running at 19650101. For u-ck767 - I think whichever file was causing the problem isn’t one with data I need, because when I set coupled task 19071101T0000Z to succeeded, the following postproc_atmos task ran fine and archived the files I need successfully. However, the model then failed at 19071201T0000Z coupled task, I think because cylc-run/u-ck767/share/data/History_Data/*xhist still points to CHECKPOINT_DUMP_IM = ‘/work/n02/n02/radiam24/cylc-run/u-ck767/share/data/History_Data/ck767a.da19071121_00’.
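
For reference, a small Python sketch of how the dump recorded in the history file could be checked. It only reads the file and changes nothing. It assumes the entry is a namelist-style line of the form CHECKPOINT_DUMP_IM = '...' with straight quotes; the function name `checkpoint_dump` is hypothetical:

```python
import re

# Assumed layout: CHECKPOINT_DUMP_IM = '/path/to/dump' (single or double quotes).
DUMP_RE = re.compile(r"CHECKPOINT_DUMP_IM\s*=\s*['\"]([^'\"]+)['\"]")

def checkpoint_dump(xhist_path):
    """Return the dump path recorded in the history file, or None if not found."""
    with open(xhist_path, errors="replace") as fh:
        match = DUMP_RE.search(fh.read())
    return match.group(1) if match else None

if __name__ == "__main__":
    import sys
    print(checkpoint_dump(sys.argv[1]))
```

Pass the path of the .xhist file as the only argument; a result of None would suggest the assumed line format does not match the file.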

The NEMO, CICE and UM restarts for 19071201 are available in the usual places. Is there a way to modify the .xhist file so that it points to da19071201_00 and I can keep running the model? Or should I just stop the model and restart it from the NEMO, CICE and UM restarts for 19071201?

Many thanks.

Rachel
