Time-out error in 'coupled' task for UKESM suite on ARCHER2

Hi, my suite (u-da600) on ARCHER2 is failing in the ‘coupled’ task with the error message (from job.err) below. The time-out has now happened several times, each time after the coupled task had been running for just under an hour, even though I gave the suite longer wallclock limits (e.g. 10 hours). Is there another reason my suite might be getting cut off? The same set-up runs fine for short test runs with cycles a few days long.

???
???!!!???!!!???!!!???!!!???!!! ERROR ???!!!???!!!???!!!???!!!???!!!
? Error code: 99
? Error from routine: IOS_QUERYBUFFER
? Error message: Time out waiting for protocol buffer 343 addressing pe 440
? Error from processor: 0
? Error number: 55
???
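
For anyone triaging similar failures, here is a minimal sketch of pulling the fields out of an error banner like the one above. It assumes the `? Key: value` layout seen in this job.err excerpt; the function name `parse_um_error` and the `sample` string are illustrative only and are not part of the UM or Rose tooling.

```python
import re

# Matches lines such as "? Error code: 99".
# The layout is an assumption based on the job.err excerpt in this thread.
FIELD_RE = re.compile(r"^\?\s*(Error[^:]*):\s*(.*)$")


def parse_um_error(text):
    """Return a dict of the 'Error ...' fields found in a job.err excerpt."""
    fields = {}
    for line in text.splitlines():
        match = FIELD_RE.match(line.strip())
        if match:
            fields[match.group(1).strip()] = match.group(2).strip()
    return fields


# Sample copied from the error message quoted in this thread.
sample = """???
???!!!???!!!???!!!???!!!???!!! ERROR ???!!!???!!!???!!!???!!!???!!!
? Error code: 99
? Error from routine: IOS_QUERYBUFFER
? Error message: Time out waiting for protocol buffer 343 addressing pe 440
? Error from processor: 0
? Error number: 55
???
"""

info = parse_um_error(sample)
print(info["Error from routine"])
print(info["Error message"])
```

The routine name is the useful part here: an `IOS_` prefix points at the IO server code rather than at the wallclock limit.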

Hi CMS, I’m still having trouble with this error, which has been occurring in several of my suites since the update to puma2. The pe_output file in work/coupled shows that the suites run to the end of the first cycle and then fail at the first time step of cycle 2. This only happens for longer runs, though (e.g. 1-month cycles).

Any ideas for things I could check?

Thanks again,

Alistair

Please remind us of your ARCHER2 username.

Ah sorry, it’s ‘aduffeyum’.

Alistair

I don’t see any log output for u-da600. The IOS_QUERYBUFFER error is coming from the IO servers - do you have another example that has a full output log?

Grenville

Hi Grenville,

Thanks for this! And sorry for the slow response.

I’ve now reproduced the suite set-up that generates the error in u-da729, which has the full output log. Again, this run failed after about an hour in ‘coupled’ on the first cycle (20350101T0000Z) with the same error message.

Hi Alistair

Does u-da729 have a standard set of STASH output (or a set that has run successfully elsewhere)?

Grenville

Hi Grenville,

Yes, at least I think so. The STASH settings haven’t been changed relative to suites that have worked for me before. Ultimately, the STASH settings come from u-be537, the UKESM1.0 ScenarioMIP SSP2-4.5 run, which I have ported over to ARCHER2.

Alistair

There is no ARCHER2 output for u-be537, but it is set up to not use any IO servers, whereas u-da729 is configured to use 6.

I’d try running with 0 IO servers to check.

Grenville
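
The comparison above (one suite with no IO servers, the other with 6) can be sketched as a small check over two suite configuration files. This is only an illustration: it assumes the setting is stored as a plain `KEY=VALUE` line, the key name `IOS_NPROC` is an assumption that should be checked against your own suite, and the two configuration strings are made-up stand-ins rather than the contents of u-be537 or u-da729.

```python
def read_setting(conf_text, key):
    """Return the value of `key` from KEY=VALUE lines, or None if absent."""
    for line in conf_text.splitlines():
        line = line.strip()
        # Skip blanks, comments, section headers, and lines starting with "!"
        # (assumed here to be disabled entries).
        if not line or line.startswith(("#", "[", "!")):
            continue
        name, sep, value = line.partition("=")
        if sep and name.strip() == key:
            return value.strip()
    return None


# Hypothetical excerpts standing in for the two suites' configuration files.
working_conf = "[jinja2:suite.rc]\nIOS_NPROC=0\n"
failing_conf = "[jinja2:suite.rc]\nIOS_NPROC=6\n"

KEY = "IOS_NPROC"  # assumed key name; confirm against your own suite
working = read_setting(working_conf, KEY)
failing = read_setting(failing_conf, KEY)

if working != failing:
    print(f"{KEY} differs: working={working}, failing={failing}")
```

In practice the two strings would be read from the configuration files of the suite that works and the one that fails.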

Thanks, that solved it!