BUFFIN: Read Failed

Dear NCAS helpdesk,

My suite u-ck767 keeps failing near the end of the coupled task for 19071101T0000Z, with the following error in cylc-run/u-ck767/log/job/19071101T0000Z/coupled/04/job.err:

BUFFIN: Read Failed: Cannot allocate memory
[1]
[1] ???
[1] ???!!!???!!!???!!!???!!!???!!! ERROR ???!!!???!!!???!!!???!!!???!!!
[1] ? Error code: 22
[1] ? Error from routine: io:buffin
[1] ? Error message: Error in buffin errorCode= 0.00 len=26112/28160
[1] ? Error from processor: 0
[1] ? Error number: 0
[1] ???
[1]

I’ve checked SAFE and do have enough memory on the /work/ and /home/ filesystems and my other suites are running fine, so I don’t understand what the problem is. Can you help?

Many thanks.

Best wishes,

Rachel

Rachel

Please switch on Extra diagnostic messages & run again - perhaps that will tell us which file is causing the problem.

(memory and disk are different things in this context)
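
To illustrate the distinction: "Cannot allocate memory" is the standard system message for an out-of-memory (ENOMEM) condition, which concerns RAM on the node, not disk quota. The following is only a minimal sketch, assuming a Linux node with Python available; the function names are made up for illustration, and it reports on whichever node it runs on, which is not necessarily the compute node the job ran on.

```python
import os
import shutil


def disk_free_bytes(path):
    """Free space on the filesystem holding `path` (disk, the thing a quota limits)."""
    return shutil.disk_usage(path).free


def ram_total_bytes():
    """Total physical RAM on this node (Linux; uses POSIX sysconf values)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")


if __name__ == "__main__":
    # The two numbers are unrelated: plenty of free disk says nothing about RAM.
    print("disk free (bytes):", disk_free_bytes("."))
    print("RAM total (bytes):", ram_total_bytes())
```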

Grenville

Hi,

Switching on extra diagnostic messages gives an error (I think because the diagnostic messages are too long) in coupled task /work/n02/n02/radiam24/cylc-run/u-ck767/log/job/19071101T0000Z/coupled/05/job.err

lib-4211 : UNRECOVERABLE library error
A WRITE operation tried to write a record that was too long.

Encountered during a sequential formatted WRITE to an internal file (character variable)

Should I try re-running with a different diagnostic setting (e.g. normal or operational), or is there an easy way to fix this?

Thanks.

Best wishes,

Rachel

Yes, try operational

Hi, I did.

The last lines of /home/n02/n02/radiam24/cylc-run/u-ck767/work/19071101*/coupled/pe_output/*.pe000 are:

TEMP CORRECTION OVER A DAY = -0.40118E-02 K
TEMPERATURE CORRECTION RATE = -0.46432E-07 K/S
FLUX CORRECTION (ATM) = -0.33520E+00 W/M2
IOS_init_md: Wating for buffer 13
Info: Stall getting protocol queue slot at 2244.684 of 0.000
IOS_init_md: Wating for buffer 11
Info: Stall getting protocol queue slot at 2244.691 of 0.000
IOS_init_md: Wating for buffer 3
Info: Stall getting protocol queue slot at 2244.706 of 0.000
IOS_init_md: Wating for buffer 2
Info: Stall getting protocol queue slot at 2244.712 of 0.000
IOS_init_md: Wating for buffer 39
Info: Stall getting protocol queue slot at 2244.733 of 0.000
IOS_init_md: Wating for buffer 3
Info: Stall getting protocol queue slot at 2244.777 of 0.000
IOS_init_md: Wating for buffer 35
Info: Stall getting protocol queue slot at 2244.797 of 0.000

The last lines of /home/n02/n02/radiam24/cylc-run/u-ck767/work/19071101*/coupled/pe_output/*.pe011:

[1] UMPRINTOPENSTREAM: Opening unit 6 on file pe_output/ck767.fort6.pe011
[1]
[1] ???
[1] ???!!!???!!!???!!!???!!!???!!! ERROR ???!!!???!!!???!!!???!!!???!!!
[1] ? Error code: 22
[1] ? Error from routine: io:buffin
[1] ? Error message: Error in buffin errorCode= 0.00 len=26112/28160
[1] ? Error from processor: 0
[1] ? Error number: 0
[1] ???
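
Checking the tail of each pe_output file by hand, as above, could also be scripted. This is a rough sketch only: it assumes the per-PE output files are plain text in a single directory, and the function name `find_error_files` and the default marker string are illustrative choices, not part of the UM or Cylc.

```python
import sys
from pathlib import Path


def find_error_files(pe_dir, marker="Error from routine", tail=5):
    """Return {file name: last `tail` lines} for each file in pe_dir containing marker."""
    hits = {}
    for path in sorted(Path(pe_dir).iterdir()):
        if not path.is_file():
            continue
        # errors="replace" keeps going if a log contains odd bytes
        lines = path.read_text(errors="replace").splitlines()
        if any(marker in line for line in lines):
            hits[path.name] = lines[-tail:]
    return hits


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python scan_pe.py <path to pe_output directory>
    for name, last_lines in find_error_files(sys.argv[1]).items():
        print(name)
        for line in last_lines:
            print("   ", line)
```

This only shows which PE logs contain the error banner; it does not reveal which input file the model was reading at the time.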

Does this help?

Best wishes,

Rachel

Hi Rachel

Do you have another suite that uses the same forcing data (ancil files etc) and offline emissions that has run past 1907 12 1?

I’m guessing that there may be a problem with the emission file - the model has stopped at the time it should have read emission data.

The UM is poor at telling us which file it’s having problems with – you may need to put in a few print statements to get it to write out file names.

Grenville

Hi,

Thanks - u-cl809 uses the same forcing data/ancils/restart files etc and finished running at 19650101. For u-ck767 - I think whichever file was causing the problem isn’t one with data I need, because when I set coupled task 19071101T0000Z to succeeded, the following postproc_atmos task ran fine and archived the files I need successfully. However, the model then failed at 19071201T0000Z coupled task, I think because cylc-run/u-ck767/share/data/History_Data/*xhist still points to CHECKPOINT_DUMP_IM = ‘/work/n02/n02/radiam24/cylc-run/u-ck767/share/data/History_Data/ck767a.da19071121_00’.
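
For reference, the check of what the .xhist file points to could be scripted roughly as below. This is a sketch under assumptions: it assumes the file is plain text with a quoted `CHECKPOINT_DUMP_IM = '...'` entry as quoted above, and that dump names end in `.daYYYYMMDD_HH`. The helper names are hypothetical, and the code only reads the value; it does not edit the file.

```python
import re


def checkpoint_dump(xhist_text):
    """Return the path assigned to CHECKPOINT_DUMP_IM, or None if absent."""
    match = re.search(
        r"CHECKPOINT_DUMP_IM\s*=\s*['\"]([^'\"]+)['\"]",
        xhist_text,
        re.IGNORECASE,
    )
    return match.group(1) if match else None


def dump_date(path):
    """Pull the YYYYMMDD stamp out of a dump name such as ck767a.da19071121_00."""
    match = re.search(r"\.da(\d{8})_\d{2}$", path)
    return match.group(1) if match else None
```

Comparing `dump_date(checkpoint_dump(text))` with the cycle point that is about to run (here 19071201) would show whether the history file is behind the cycle.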

The NEMO, CICE and UM restarts for 19071201 are available in the usual places. Is there a way to modify the .xhist file so it points to da19071201_00, so that I can then keep running the model? Or should I just stop the model and restart it from the NEMO, CICE and UM restarts for 19071201?

Many thanks.

Rachel
