U-ct529 failure after activating IOS

Hello CMS,
I took a reference run of the suite u-ct528 and modified it to run with IOS. I used the rose edit GUI to activate the IOS and then clicked “run”, hoping to repeat the run.
It failed and seemed to try twice…

/work/n02/n02/mricha/cylc-run/u-ct529/log.20230206T120914Z/job/20151129T0000Z/TM_4km_RA3_p3_um_fcst_000/02

Have you any recommendation on how I can track down the cause of the failure?

I wonder if I should have used the command line and “rose suite-run --new”, or is that the default when clicking the UI run button?

many thanks,
Mark

I did some grepping and looked at PE 15, which is an IO server.
/work/n02/n02/mricha/cylc-run/u-ct529/work/20151129T0000Z/TM_4km_RA3_p3_um_fcst_000/pe_output/umnsa.fort6.pe015
is the output.
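A minimal sketch of that kind of search over the per-PE output files, for anyone repeating it without grep. The function name and the default file-name pattern are my own choices; only the `pe_output` directory and the `umnsa.fort6.peNNN` file naming come from the paths above, and the search string is whatever you want to look for.

```python
from pathlib import Path


def search_pe_output(pe_dir, needle, name_glob="*.pe*"):
    """Return (file name, line number, line) for each line containing needle.

    The match is case-insensitive. name_glob is an assumed pattern that
    happens to match names such as umnsa.fort6.pe015.
    """
    hits = []
    for path in sorted(Path(pe_dir).glob(name_glob)):
        if not path.is_file():
            continue
        # errors="replace" avoids a crash on stray non-UTF-8 bytes in a log.
        with path.open(errors="replace") as handle:
            for lineno, line in enumerate(handle, start=1):
                if needle.lower() in line.lower():
                    hits.append((path.name, lineno, line.rstrip()))
    return hits


# Example call (the directory is the pe_output directory of the failed task):
# for name, lineno, line in search_pe_output("pe_output", "unit 11"):
#     print(f"{name}:{lineno}: {line}")
```

Sorting the file names keeps the hits grouped by PE, which makes it easier to see whether only the IO server PE reports the problem.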

Perhaps it’s closing the file then trying to read it again? I don’t really know about the IO servers in the UM, but perhaps something associated with the file:
/work/n02/n02/mricha/cylc-run/u-ct529/share/cycle/20151129T0000Z/TM/4km/RA3_p3/um//umnsaa_pverd000
is to blame?

Mark

I think the stash setup is not right – you have output to Usage profile 60_DIAGS that goes to file_id pp0 (looks OK) but also output to usage profile model_variables that is associated with file_id model_variables, which does not exist (same for other usage profiles) - I’d fix those & try again.

Grenville

Hi Grenville,

I will have to ask the original owner to help with that. Is this a risk for the non-IOS version? Is it a fundamental error that will bite them later?

Might they be hijacking other stash codes to get the full amount of info from their simulations?

I’ll have to refresh on working with STASH profiles. This is a side effect of inheriting someone else’s rose suite.

Mark


Actually, I am having second thoughts – does this exact configuration without IO servers run OK?

Grenville

Hello Grenville,

The original suite owner informs me those are for netCDF output. See attached PNG.

um → namelist → model input and output → netcdf output streams → model_variables.

So it looks like the STASH usage profile ought to point to that ncdf file. Similar for 3 other streams.

Are the usage profiles not indicating ncdf files?
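The check being discussed can be sketched as a lookup of each usage profile's file_id against both the PP output streams and the netCDF output streams. This is only an illustration on plain dictionaries and sets: it does not read a rose app, the function name is made up, and `orphan_profile` / `no_such_stream` are hypothetical entries added to show what a genuinely dangling file_id would look like.

```python
def classify_usage_profiles(usage_to_file_id, pp_stream_ids, netcdf_stream_ids):
    """Say, for each usage profile, which kind of stream its file_id names."""
    report = {}
    for use_name, file_id in usage_to_file_id.items():
        if file_id in pp_stream_ids:
            report[use_name] = "pp"
        elif file_id in netcdf_stream_ids:
            report[use_name] = "netcdf"
        else:
            # The file_id is not defined as any output stream.
            report[use_name] = "missing"
    return report


# 60_DIAGS/pp0 and model_variables are taken from this thread;
# orphan_profile is a hypothetical dangling entry.
usage = {
    "60_DIAGS": "pp0",
    "model_variables": "model_variables",
    "orphan_profile": "no_such_stream",
}
report = classify_usage_profiles(usage, {"pp0"}, {"model_variables"})
for use_name, kind in report.items():
    print(f"{use_name}: {kind}")
```

A profile that looks broken when only the PP streams are searched can turn out to be fine once the netCDF streams are included, which is what seems to have happened here.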

Mark


Hi Grenville,
Yes, the suite ran without IOS and put out decent DrHook logs too.
The failure occurs only when I turn on IOS.
I set the IOS diagnostics to the maximum, “5”, so those log files have a lot in them, but it is all a bit cryptic and I wonder what other flags I can set to give more information.

Am I hitting some sort of limit in the IOS? (Or did the job just run out of walltime? Where can I check that?)

Mark

arghh - my mistake, back to the drawing board.

A bit more investigation. I set the IO server to use only one server, so this case needs 121 MPI tasks x 2 OMP threads, i.e. 242 cores. The Jinja2 in suite-adds tries to spread out the job (I need to learn Jinja now too). I tweaked it to cope with under-populated nodes. Ideally I would like 60 atmos tasks per node and to use the remaining cores for IOS.
In this case it decided to use 3 nodes with 40 atmos MPI tasks per node and IOS on one of the nodes. It seems to have selected MPI task 063 for the IOS.
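The core counting above can be written out as a short worked example. The 120 atmos tasks, 1 IO server task and 2 OMP threads come from the figures in this post; the 128 cores per node is an assumption about the machine, and the function and its argument names are purely illustrative (this is not what the suite's Jinja2 does).

```python
import math


def layout(atmos_tasks, ios_tasks, omp_threads, cores_per_node, atmos_per_node):
    """Work out totals for a given number of atmos tasks per node."""
    total_tasks = atmos_tasks + ios_tasks
    total_cores = total_tasks * omp_threads
    nodes = math.ceil(atmos_tasks / atmos_per_node)
    # Cores left on a node once its atmos tasks (and their threads) are placed.
    spare_cores_per_node = cores_per_node - atmos_per_node * omp_threads
    return {
        "total_tasks": total_tasks,
        "total_cores": total_cores,
        "nodes": nodes,
        "spare_cores_per_node": spare_cores_per_node,
        # True if all IO server tasks fit in the spare cores of a single node.
        "ios_fits_on_one_node": ios_tasks * omp_threads <= spare_cores_per_node,
    }


# cores_per_node=128 is an assumed value, not something stated in the thread.
actual = layout(120, 1, 2, cores_per_node=128, atmos_per_node=40)
ideal = layout(120, 1, 2, cores_per_node=128, atmos_per_node=60)
print(actual)
print(ideal)
```

Under that assumption the 40-per-node layout uses 3 nodes and leaves 48 cores spare on each, while the 60-per-node layout uses 2 nodes and leaves 8 spare per node, which is still enough for one 2-thread IO server task.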

It fails. The error message is the same: portio is trying to use “unit 11”, but it seems to have been closed…
I turned up the diagnostics, and in the pe_output for 063 I see a lot more IOS information. This seems relevant (filtered with grep): the IOS_Action_Close has occurred for unit 11, but there is then a call to StashWritePPData on unit 11.

1322:[0] Info: Queue: Added action IOS_Action_Sync trns_no: 65 to queue, now 56 items
1325:[0] Info: Queue: Added action IOS_Action_StashInitPPLookup trns_no: 66 to queue, now 57 items
1544:[0] Info: Queue: Added action IOS_Action_StashWritePPData trns_no: 67 to queue, now 1 items
1587:[0] Info: Queue: Added action IOS_Action_StashWritePPData trns_no: 68 to queue, now 2 items
1623:[0] Info: Queue: Added action IOS_Action_StashWritePPData trns_no: 69 to queue, now 3 items
1659:[0] Info: Queue: Added action IOS_Action_StashWritePPData trns_no: 70 to queue, now 4 items
1694:[0] Info: Queue: Added action IOS_Action_StashWritePPData trns_no: 71 to queue, now 5 items
1730:[0] Info: Queue: Added action IOS_Action_StashWritePPData trns_no: 72 to queue, now 6 items
1764:[0] Info: Queue: Added action IOS_Action_StashWritePPData trns_no: 73 to queue, now 7 items
1800:[0] Info: Queue: Added action IOS_Action_StashWritePPData trns_no: 74 to queue, now 8 items
1813:[0] Info: Queue: Added action IOS_Action_StashWritePPData trns_no: 75 to queue, now 8 items
1816:[0] Info: Queue: Added action IOS_Action_StashWritePPData trns_no: 76 to queue, now 9 items
1889:[0] Info: Queue: Added action IOS_Action_Process trns_no: 77 to queue, now 10 items
1894:[0] Info: Queue: Added action IOS_Action_StashWritePPData trns_no: 78 to queue, now 11 items
1901:[0] Info: Queue: Added action IOS_Action_StashWritePPLookup trns_no: 79 to queue, now 12 items
1904:[0] Info: Queue: Added action IOS_Action_Close trns_no: 80 to queue, now 13 items
4054:[0] Info: Queue: Added action IOS_Action_Process trns_no: 81 to queue, now 1 items
4061:[0] Info: Queue: Added action IOS_Action_StashWritePPData trns_no: 82 to queue, now 1 items

This is possibly a better illustration, where I grep for “transaction”. Note that trns_no 80 is to CLOSE unit 11 and trns_no 82 is to write PP data to unit 11… oops. I am not sure how to debug further.

1891:[0] Info: Listener: Received a transaction: IOS_Action_StashWritePPData
1897:[0] Info: Listener: Received a transaction: IOS_Action_StashWritePPLookup
1903:[0] Info: Listener: Received a transaction: IOS_Action_Close
2072:[1] Info: Writer: transaction: 68 is completed in 1.022
2074:[1] Info: Writer: transaction: 69 is for unit 12 :IOS_Action_StashWritePPData
2316:[1] Info: Writer: transaction: 69 is completed in 0.534
2318:[1] Info: Writer: transaction: 70 is for unit 12 :IOS_Action_StashWritePPData
2560:[1] Info: Writer: transaction: 70 is completed in 0.688
2562:[1] Info: Writer: transaction: 71 is for unit 12 :IOS_Action_StashWritePPData
2804:[1] Info: Writer: transaction: 71 is completed in 0.184
2806:[1] Info: Writer: transaction: 72 is for unit 12 :IOS_Action_StashWritePPData
3048:[1] Info: Writer: transaction: 72 is completed in 0.185
3050:[1] Info: Writer: transaction: 73 is for unit 12 :IOS_Action_StashWritePPData
3292:[1] Info: Writer: transaction: 73 is completed in 0.652
3294:[1] Info: Writer: transaction: 74 is for unit 12 :IOS_Action_StashWritePPData
3536:[1] Info: Writer: transaction: 74 is completed in 0.198
3538:[1] Info: Writer: transaction: 75 is for unit 12 :IOS_Action_StashWritePPData
3780:[1] Info: Writer: transaction: 75 is completed in 0.787
3782:[1] Info: Writer: transaction: 76 is for unit 12 :IOS_Action_StashWritePPData
4024:[1] Info: Writer: transaction: 76 is completed in 0.233
4026:[1] Info: Writer: transaction: 77 is for PE 63 :IOS_Action_Process
4028:[1] Info: Writer: transaction: 77 is completed in 0.007
4030:[1] Info: Writer: transaction: 78 is for unit 11 :IOS_Action_StashWritePPData
4038:[1] Info: Writer: transaction: 78 is completed in 0.023
4040:[1] Info: Writer: transaction: 79 is for unit 11 :IOS_Action_StashWritePPLookup
4043:[1] Info: Writer: transaction: 79 is completed in 0.004
4045:[1] Info: Writer: transaction: 80 is for unit 11 :IOS_Action_Close
4050:[1] Info: Writer: transaction: 80 is completed in 0.114
4052:[0] Info: Listener: Received a transaction: IOS_Action_Process
4056:[1] Info: Writer: transaction: 81 is for PE 63 :IOS_Action_Process
4058:[1] Info: Writer: transaction: 81 is completed in 0.000
4060:[0] Info: Listener: Received a transaction: IOS_Action_StashWritePPData
4063:[1] Info: Writer: transaction: 82 is for unit 11 :IOS_Action_StashWritePPData
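Spotting this pattern by eye in a long log is tedious, so here is a rough sketch that flags any Writer transaction on a unit after that unit has been closed. The regular expression is modelled only on the Writer lines shown above. The excerpt does not show what a re-open looks like, so treating any action whose name contains "Open" as a re-open is an assumption, and the function name is my own.

```python
import re

# Matches e.g. "... Writer: transaction: 80 is for unit 11 :IOS_Action_Close"
WRITER_RE = re.compile(r"transaction:\s*(\d+)\s+is for unit\s+(\d+)\s*:(\w+)")


def find_use_after_close(lines):
    """Return (transaction, unit, action) for actions on an already closed unit."""
    closed_units = set()
    problems = []
    for line in lines:
        match = WRITER_RE.search(line)
        if match is None:
            continue  # "completed", Listener, Queue and "is for PE" lines
        trns, unit, action = int(match.group(1)), int(match.group(2)), match.group(3)
        if action == "IOS_Action_Close":
            closed_units.add(unit)
        elif "Open" in action:
            # Assumption: an open action makes the unit usable again.
            closed_units.discard(unit)
        elif unit in closed_units:
            problems.append((trns, unit, action))
    return problems


sample = [
    "4030:[1] Info: Writer: transaction: 78 is for unit 11 :IOS_Action_StashWritePPData",
    "4040:[1] Info: Writer: transaction: 79 is for unit 11 :IOS_Action_StashWritePPLookup",
    "4045:[1] Info: Writer: transaction: 80 is for unit 11 :IOS_Action_Close",
    "4056:[1] Info: Writer: transaction: 81 is for PE 63 :IOS_Action_Process",
    "4063:[1] Info: Writer: transaction: 82 is for unit 11 :IOS_Action_StashWritePPData",
]
problems = find_use_after_close(sample)
print(problems)
```

On the five sample lines copied from the excerpt it reports transaction 82 on unit 11, the same write-after-close seen in the grep output.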

So the problem I am describing above is down to the fact that this suite has 4 output streams that are NetCDF and my experiment is to turn on parallel IO (the IO server). In the small print for NetCDF it is implied that IOS is incompatible with NetCDF so I am on a hiding to nowhere. I am now converting those streams to Field Files and encounter new problems with the STWORK exceeding reserved header size (i.e. over the default 4096).

So I think it is best to close this ticket as SOLVED: do not use NetCDF and IOS together. If I cannot solve the new problem I might post a new thread.

Thanks for the assistance,
Mark
