Hi,
I have some coupled runs (e.g. u-dh430) which run fine on the coupled task but then go into submit-retrying at each postproc stage, with error messages like:
ERROR: file not found: /home/d04/jamwe/cylc-run/u-dh430/log/job/20660101T0000Z/postproc_atmos/05/job.err
ERROR: command terminated by signal 1: ssh -oBatchMode=yes -oConnectTimeout=8 -oStrictHostKeyChecking=no -n xcs-c env CYLC_VERSION=7.8.14 bash --login -c ''"'"'exec "$0" "$@"'"'"'' cylc cat-log '--remote-arg='"'"'$HOME/cylc-run/u-dh430/log/job/20660101T0000Z/postproc_atmos/05/job.err'"'"'' --remote-arg=tail '--remote-arg='"'"'tail -n +1 -F %(filename)s'"'"'' u-dh430
Once I retrigger the tasks, they run fine, but the next coupled task won’t start until the previous submission’s timesteps finish.
This seems similar to an old but persistent issue I have with AMIP runs (http://cms.ncas.ac.uk/ticket/3505#comment:4 - not sure if the link works anymore), which was solved by changing host = $(rose host-select xcs-c) to host = localhost in the HPC section of monsoon.rc. However, for coupled runs I can’t see an equivalent change.
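For reference, the AMIP change would look roughly like the fragment below. This is only a sketch: the [runtime], [[HPC]] and [[[remote]]] nesting is an assumption based on the usual Cylc 7 suite.rc layout, and the actual monsoon.rc may be organised differently.

```
[runtime]
    [[HPC]]
        [[[remote]]]
            # Before (assumed location): host = $(rose host-select xcs-c)
            host = localhost
```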
Have you seen this before?
Thanks for your help,
James