Atmos_main failure

Hi,

I am five years into a run and the most recent atmos_main task failed with this error:
MPICH ERROR [Rank 0] [job id 1382455.0] [Mon Apr 4 22:01:09 2022] [nid001360] - Abort(2685455) (rank 0 in comm 0): Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIDI_CRAY_post_init(2087)…:
MPIDI_CRAY_ofi_check_nic_symmetry(2705)…:
MPIR_CRAY_Allreduce(557)…:
MPIR_Allreduce_impl(298)…:
MPIR_Allreduce_intra_auto(210)…:
MPIR_Allreduce_intra_recursive_doubling(232):
(unknown)(): Other MPI error

aborting job:
Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIDI_CRAY_post_init(2087)…:

I wondered if this was due to my connection to ARCHER2, so I retriggered the failed task. It ran successfully and the run has now moved on to the next timestep. However, I have now noticed new warnings in job.err, such as:

???
??? WARNING ???
? Warning code: -1
? Warning from routine: UKCA_TROPOPAUSE::UKCA_CALC_TROPOPAUSE
? Warning message: Difficulty diagnosing pv tropopause, Reverting to default tropopause pressure
? Warning from processor: 0
? Warning number: 78
???

I am concerned that something is now wrong with the run, and I am unsure what the best thing to do is: should I restart, or is it possible to go back to the point in the run before the task that failed? (The ainitial file now appears to have moved past this point.)

Thanks,
Hannah

Someone else may be able to give you a definitive answer on the warning, but I would note that the network on ARCHER2 is not completely robust, and restarting after an MPI failure like that is what I would have done.

Thanks @dcase
Just to update for the help desk: the newer atmos_main tasks have run successfully without the tropopause warning, but I am still concerned that restarting has messed up the run. I need to run this for another 10 years. Do you have any thoughts on whether it is OK to proceed, or whether redoing the run is more sensible?

Thanks,
Hannah