My suite u-ck767 keeps failing near the end of the coupled task for 19071101T0000Z, with an error in cylc-run/u-ck767/log/job/19071101T0000Z/coupled/04/job.err.
I’ve checked SAFE and I do have enough space on the /work/ and /home/ filesystems, and my other suites are running fine, so I don’t understand what the problem is. Can you help?
Switching on extra diagnostic messages gives an error (I think because the diagnostic messages are too long) in the coupled task, in /work/n02/n02/radiam24/cylc-run/u-ck767/log/job/19071101T0000Z/coupled/05/job.err:
lib-4211 : UNRECOVERABLE library error
A WRITE operation tried to write a record that was too long.
Encountered during a sequential formatted WRITE to an internal file (character variable)
Should I try re-running with a different diagnostic setting (e.g. normal or operational), or is there an easy way to fix this?
The last lines of /home/n02/n02/radiam24/cylc-run/u-ck767/work/19071101*/coupled/pe_output/*.pe000 are:
TEMP CORRECTION OVER A DAY = -0.40118E-02 K
TEMPERATURE CORRECTION RATE = -0.46432E-07 K/S
FLUX CORRECTION (ATM) = -0.33520E+00 W/M2
IOS_init_md: Wating for buffer 13
Info: Stall getting protocol queue slot at 2244.684 of 0.000
IOS_init_md: Wating for buffer 11
Info: Stall getting protocol queue slot at 2244.691 of 0.000
IOS_init_md: Wating for buffer 3
Info: Stall getting protocol queue slot at 2244.706 of 0.000
IOS_init_md: Wating for buffer 2
Info: Stall getting protocol queue slot at 2244.712 of 0.000
IOS_init_md: Wating for buffer 39
Info: Stall getting protocol queue slot at 2244.733 of 0.000
IOS_init_md: Wating for buffer 3
Info: Stall getting protocol queue slot at 2244.777 of 0.000
IOS_init_md: Wating for buffer 35
Info: Stall getting protocol queue slot at 2244.797 of 0.000
The last lines of /home/n02/n02/radiam24/cylc-run/u-ck767/work/19071101*/coupled/pe_output/*.pe011:
[1] UMPRINTOPENSTREAM: Opening unit 6 on file pe_output/ck767.fort6.pe011
[1]
[1] ???
[1] ???!!!???!!!???!!!???!!!???!!! ERROR ???!!!???!!!???!!!???!!!???!!!
[1] ? Error code: 22
[1] ? Error from routine: io:buffin
[1] ? Error message: Error in buffin errorCode= 0.00 len=26112/28160
[1] ? Error from processor: 0
[1] ? Error number: 0
[1] ???
Thanks - u-cl809 uses the same forcing data/ancils/restart files etc. and finished running at 19650101. For u-ck767, I think whichever file was causing the problem isn’t one with data I need, because when I set the 19071101T0000Z coupled task to succeeded, the following postproc_atmos task ran fine and archived the files I need successfully. However, the model then failed at the 19071201T0000Z coupled task, I think because cylc-run/u-ck767/share/data/History_Data/*xhist still points to CHECKPOINT_DUMP_IM = '/work/n02/n02/radiam24/cylc-run/u-ck767/share/data/History_Data/ck767a.da19071121_00'.
The NEMO, CICE and UM restarts for 19071201 are available in the usual places. Is there a way to modify the .xhist file so it points to da19071201_00, so that I can then keep running the model? Or should I just stop the model and restart it from the NEMO, CICE and UM restarts for 19071201?
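To make the question concrete, the kind of edit I have in mind is roughly the Python sketch below. It assumes the .xhist file is a plain-text namelist containing exactly one CHECKPOINT_DUMP_IM line; the function name repoint_checkpoint_dump is made up for illustration, and I don't know whether other entries in the file would also need to change.

```python
import re


def repoint_checkpoint_dump(xhist_text: str, old_stamp: str, new_stamp: str) -> str:
    """Return xhist_text with the dump timestamp on the CHECKPOINT_DUMP_IM line replaced.

    Hypothetical helper: assumes the dump file name ends in '.da<stamp>' and
    that exactly one CHECKPOINT_DUMP_IM line refers to old_stamp.
    """
    pattern = re.compile(
        r"^(\s*CHECKPOINT_DUMP_IM\s*=.*\.da)"  # everything up to and including '.da'
        + re.escape(old_stamp)                  # the timestamp to replace
        + r"(?=['\",\s]|$)",                    # stamp must end the file name
        re.MULTILINE | re.IGNORECASE,
    )
    # A function replacement avoids the new stamp's digits being read as a group reference.
    new_text, count = pattern.subn(lambda m: m.group(1) + new_stamp, xhist_text)
    if count != 1:
        raise ValueError(f"expected exactly one matching line, found {count}")
    return new_text


if __name__ == "__main__":
    # Sample line copied from the .xhist entry quoted above.
    sample = (
        " CHECKPOINT_DUMP_IM = '/work/n02/n02/radiam24/cylc-run/u-ck767/"
        "share/data/History_Data/ck767a.da19071121_00',\n"
    )
    print(repoint_checkpoint_dump(sample, "19071121_00", "19071201_00"))
```

I would copy the original .xhist file to a backup before writing anything back, and only apply this if editing the file by hand is considered safe for this model setup.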