Hi,
I am five years into a run and the most recent atmos_main task failed with this error:
MPICH ERROR [Rank 0] [job id 1382455.0] [Mon Apr 4 22:01:09 2022] [nid001360] - Abort(2685455) (rank 0 in comm 0): Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIDI_CRAY_post_init(2087)…:
MPIDI_CRAY_ofi_check_nic_symmetry(2705)…:
MPIR_CRAY_Allreduce(557)…:
MPIR_Allreduce_impl(298)…:
MPIR_Allreduce_intra_auto(210)…:
MPIR_Allreduce_intra_recursive_doubling(232):
(unknown)(): Other MPI error
aborting job:
Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIDI_CRAY_post_init(2087)…:
I wondered whether this was due to my connection to ARCHER, so I retriggered the failed task. It ran successfully, and the run has now moved on to the next timestep. However, I have now noticed new warnings in job.err, such as the following:
???
??? WARNING ???
? Warning code: -1
? Warning from routine: UKCA_TROPOPAUSE::UKCA_CALC_TROPOPAUSE
? Warning message: Difficulty diagnosing pv tropopause, Reverting to default tropopause pressure
? Warning from processor: 0
? Warning number: 78
???
I am concerned that something is now wrong with the run, and I am unsure what the best thing to do is. Should I restart the run, or is it possible to go back to a point before the task that failed? (The ainitial file now appears to have moved past this point.)
Thanks,
Hannah