ORTE has lost communication with a remote daemon

Hi,

I’ve been running some JULES suites on JASMIN. Over the last couple of days, the suites have started to fail one by one in the main run, each with the following error:

ORTE has lost communication with a remote daemon.

  HNP daemon   : [[26927,0],0] on node host379
  Remote daemon: [[26927,0],2] on node host384

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--------------------------------------------------------------------------
[FAIL] rose-jules-run <<'__STDIN__'
[FAIL] 
[FAIL] '__STDIN__' # return-code=205
2022-09-01T10:49:48Z CRITICAL - failed/EXIT

The last suite still working, u-cq006, failed for the first time yesterday (the first to fail was u-cq007), so none of the suites work anymore. I hadn’t changed any settings apart from the files for the initial conditions. Could you help me figure out where the issue stems from? Many thanks in advance!

Best,
Markus


This looks like a JASMIN problem - please report it to the JASMIN help desk.
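
When reporting, it will probably help to name the nodes involved. As a rough sketch only (the function name extract_daemon_nodes is hypothetical, and it assumes the log contains daemon lines in exactly the format quoted above), a few lines of Python could pull the node names out of a job log:

```python
import re

# Matches the daemon lines in the ORTE message quoted above, e.g.
#   "  Remote daemon: [[26927,0],2] on node host384"
# Assumption: the line format is exactly as shown in the pasted log.
DAEMON_LINE = re.compile(
    r"^\s*(HNP daemon|Remote daemon)\s*:\s*(\[\[\d+,\d+\],\d+\])\s+on node\s+(\S+)",
    re.MULTILINE,
)


def extract_daemon_nodes(log_text):
    """Return a dict mapping daemon role to (daemon id, node name)."""
    return {
        match.group(1): (match.group(2), match.group(3))
        for match in DAEMON_LINE.finditer(log_text)
    }


# Sample text copied from the error message in this thread.
sample = """ORTE has lost communication with a remote daemon.

  HNP daemon   : [[26927,0],0] on node host379
  Remote daemon: [[26927,0],2] on node host384
"""

nodes = extract_daemon_nodes(sample)
for role, (daemon_id, node) in nodes.items():
    print(f"{role}: {daemon_id} on node {node}")
```

Running this over the job.err of each failed suite would show whether the same nodes keep turning up, which may be worth mentioning in the report.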

Grenville
