Hi,
I’ve been running some JULES suites on JASMIN. Over the last couple of days, the suites have started failing one by one in the main run, each with the following error:
ORTE has lost communication with a remote daemon.
HNP daemon : [[26927,0],0] on node host379
Remote daemon: [[26927,0],2] on node host384
This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--------------------------------------------------------------------------
[FAIL] rose-jules-run <<'__STDIN__'
[FAIL]
[FAIL] '__STDIN__' # return-code=205
2022-09-01T10:49:48Z CRITICAL - failed/EXIT
The first suite to fail was u-cq007. The last one still working, u-cq006, failed for the first time yesterday, so none of my suites run anymore. I haven’t changed any settings apart from the initial-condition files. Could you help me figure out where the issue stems from? Many thanks in advance!
Best,
Markus