Hi Annette,
I’ve made some progress on this, though I don’t think I’m at the optimal configuration yet. I’m finding that I can reduce the time taken for the timesteps with i/o, but that the timesteps with no i/o take longer so that there’s no actual gain. Is this expected?
Also, two of my output streams are limited-area subdomains, which are written at a higher frequency. Packing these streams was causing an error, so I’m currently writing them unpacked - but have you seen anything like this before?
The error is repeated for 25 of the i/o processors in /work/n02/n02/emmah/cylc-run/u-cc339/log/job/20151203T0000Z/tm2_ra2t_um_fcst1/18/job.err.
For one processor it reads:
"""
[4146] exceptions: the exception reports the extra information: Integer divide by zero.
[4146] exceptions: whilst in a parallel region, by thread 1
[4146] exceptions: Task had pid=133337 on host nid001191
[4146] exceptions: Program is "/work/n02/n02/emmah/cylc-run/u-cc339/work/20151203T0000Z/tm2_ra2t_um_fcst1/toyatm"
[4146] exceptions: Data address (si_addr): 0x005a20b1; rip: 0x005a20b1
[4146] exceptions: [backtrace]: has 15 elements:
[4146] exceptions: [backtrace]: ( 1) : Address: [0x005a20b1]
[4146] exceptions: [backtrace]: ( 1) : f_shum_wgdos_pack_1d_arg64$f_shum_wgdos_packing_mod_ in file /lus/cls01095/work/n02/n02/emmah/cylc-run/u-cb263/share/fcm_make_um/extract/shumlib/shum_wgdos_packing/src/f_shum_wgdos_packing.f90 line 411
[4146] exceptions: [backtrace]: ( 2) : Address: [0x00bef504]
[4146] exceptions: [backtrace]: ( 2) : signal_do_backtrace_linux in file /lus/cls01095/work/n02/n02/emmah/cylc-run/u-cb263/share/fcm_make_um/extract/um/src/control/c_code/exceptions/exceptions-platform/exceptions-linux.c line 81
[4146] exceptions: [backtrace]: ( 3) : Address: [0x00beeee7]
[4146] exceptions: [backtrace]: ( 3) : signal_handler in file /lus/cls01095/work/n02/n02/emmah/cylc-run/u-cb263/share/fcm_make_um/extract/um/src/control/c_code/exceptions/exceptions.c line 706
[4146] exceptions: [backtrace]: ( 4) : Address: [0x2af531b8a2d0]
[4146] exceptions: [backtrace]: ( 4) : ?? (* Cannot Locate *)
[4146] exceptions: [backtrace]: ( 5) : Address: [0x005a20b1]
[4146] exceptions: [backtrace]: ( 5) : f_shum_wgdos_pack_1d_arg64$f_shum_wgdos_packing_mod_ in file /lus/cls01095/work/n02/n02/emmah/cylc-run/u-cb263/share/fcm_make_um/extract/shumlib/shum_wgdos_packing/src/f_shum_wgdos_packing.f90 line 411
[4146] exceptions: [backtrace]: ( 6) : Address: [0x0059f4b4]
[4146] exceptions: [backtrace]: ( 6) : wgdos_compress_field$wgdos_packing_mod_ in file /lus/cls01095/work/n02/n02/emmah/cylc-run/u-cb263/share/fcm_make_um/preprocess-atmos/src/um/src/control/stash/wgdos_packing.F90 line 97
[4146] exceptions: [backtrace]: ( 7) : Address: [0x0259182f]
[4146] exceptions: [backtrace]: ( 7) : ios_stash_pack_wgdos$ios_stash_wgdos_ in file /lus/cls01095/work/n02/n02/emmah/cylc-run/u-cb263/share/fcm_make_um/preprocess-atmos/src/um/src/io_services/server/stash/ios_stash_wgdos.F90 line 68
[4146] exceptions: [backtrace]: ( 8) : Address: [0x0258a4d2]
[4146] exceptions: [backtrace]: ( 8) : ios_stash_pack$ios_stash_server_ in file /lus/cls01095/work/n02/n02/emmah/cylc-run/u-cb263/share/fcm_make_um/preprocess-atmos/src/um/src/io_services/server/stash/ios_stash_server.F90 line 1505
[4146] exceptions: [backtrace]: ( 9) : Address: [0x02585da6]
[4146] exceptions: [backtrace]: ( 9) : ios_stash_server_process$ios_stash_server_ in file /lus/cls01095/work/n02/n02/emmah/cylc-run/u-cb263/share/fcm_make_um/preprocess-atmos/src/um/src/io_services/server/stash/ios_stash_server.F90 line 804
[4146] exceptions: [backtrace]: ( 10) : Address: [0x025782ea]
[4146] exceptions: [backtrace]: ( 10) : ios_writer$io_server_writer_ in file /lus/cls01095/work/n02/n02/emmah/cylc-run/u-cb263/share/fcm_make_um/preprocess-atmos/src/um/src/io_services/server/io_server_writer.F90 line 161
[4146] exceptions: [backtrace]: ( 11) : Address: [0x02574d4b]
[4146] exceptions: [backtrace]: ( 11) : ios_run$ios_init__cray$mt$p0003 in file /lus/cls01095/work/n02/n02/emmah/cylc-run/u-cb263/share/fcm_make_um/preprocess-atmos/src/um/src/io_services/server/ios_init.F90 line 833
"""
Looking through the code, my guess is that the stride is 0. The stride is described as the “row length, for 2D field held as 1D array” and comes from “SIZE(field)” in “ios_stash_pack_wgdos”, so a value of 0 would mean the field on that i/o processor has no elements - perhaps it’s trying to write the field for a part of the domain that’s not in the subdomain?
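To illustrate the guess above, here is a minimal Python sketch. It is not the shumlib or UM code: the function names are made up, and it assumes that the division at line 411 of f_shum_wgdos_packing.f90 uses the stride as the divisor, which I haven’t confirmed. It only shows how an empty field (stride 0) would give an integer divide by zero, and what a guard for that case might look like.

```python
def rows_in_field(n_points, stride):
    """Number of rows in a 2D field held as a 1D array (hypothetical helper).

    Assumption: the packing code derives the row count by dividing the
    number of points by the stride (row length).
    """
    # With stride == 0 this raises ZeroDivisionError, analogous to the
    # "Integer divide by zero" exception reported in job.err.
    return n_points // stride


def rows_in_field_guarded(n_points, stride):
    """Same calculation, but an empty field is treated as having no rows."""
    if stride == 0:
        # Nothing to pack for this part of the domain.
        return 0
    return n_points // stride
```

If this is what is happening, skipping the packing call for zero-sized fields on the i/o server side might be enough, but that is speculation until the stride value is actually printed out.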
I’m continuing to tweak the various settings to see if I can speed things up a bit too.
Best wishes,
Emma