Atmos_main failure

hannahbryant · 5 April 2022 09:49

Hi,

I am five years into a run and the most recent atmos_main task failed with this error:
MPICH ERROR [Rank 0] [job id 1382455.0] [Mon Apr 4 22:01:09 2022] [nid001360] - Abort(2685455) (rank 0 in comm 0): Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIDI_CRAY_post_init(2087)…:
MPIDI_CRAY_ofi_check_nic_symmetry(2705)…:
MPIR_CRAY_Allreduce(557)…:
MPIR_Allreduce_impl(298)…:
MPIR_Allreduce_intra_auto(210)…:
MPIR_Allreduce_intra_recursive_doubling(232):
(unknown)(): Other MPI error

aborting job:
Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIDI_CRAY_post_init(2087)…:

I wondered whether this was due to my connection to ARCHER2, so I retriggered the failed task; it ran successfully and the run has now moved on to the next timestep. However, I have since noticed new warnings in job.err, such as:

???
??? WARNING ???
? Warning code: -1
? Warning from routine: UKCA_TROPOPAUSE::UKCA_CALC_TROPOPAUSE
? Warning message: Difficulty diagnosing pv tropopause, Reverting to default tropopause pressure
? Warning from processor: 0
? Warning number: 78
???
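To check whether a warning like this recurs in later tasks, the warnings in job.err can be tallied by routine. The following is a minimal Python sketch, assuming the warning blocks keep the layout quoted above (one field per line, each starting with "?"). The function name `count_warnings_by_routine` and the sample text are illustrative only and are not part of the UM or the suite tooling.

```python
import re
from collections import Counter

# Matches lines such as:
#   ? Warning from routine: UKCA_TROPOPAUSE::UKCA_CALC_TROPOPAUSE
# Assumption: the field label and leading "?" appear exactly as in the post.
ROUTINE_RE = re.compile(r"^\?\s*Warning from routine:\s*(\S+)", re.MULTILINE)


def count_warnings_by_routine(text: str) -> Counter:
    """Return how many warning blocks each routine produced in `text`."""
    return Counter(ROUTINE_RE.findall(text))


# Sample copied from the warning block quoted in this thread.
SAMPLE_JOB_ERR = """\
???
??? WARNING ???
? Warning code: -1
? Warning from routine: UKCA_TROPOPAUSE::UKCA_CALC_TROPOPAUSE
? Warning message: Difficulty diagnosing pv tropopause, Reverting to default tropopause pressure
? Warning from processor: 0
? Warning number: 78
???
"""

if __name__ == "__main__":
    for routine, count in count_warnings_by_routine(SAMPLE_JOB_ERR).most_common():
        print(f"{count:6d}  {routine}")
```

In practice the text would come from the task's own job.err, for example via `pathlib.Path("job.err").read_text(errors="replace")`, and the counts could be compared between the retriggered task and the tasks that followed it.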

I am concerned that something is now wrong with the run, and I am not sure what the best thing to do is: restart, or, if it is possible, go back in the run to before the task that failed? (The ainitial file now appears to have moved past this point.)

Thanks,
Hannah

dcase · 5 April 2022 10:07

Someone else may be able to give you a definitive answer on the warning, but I would note that the network on ARCHER2 is not completely robust, and restarting after an MPI failure like that is what I would have done.
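If it helps when reporting this kind of failure, here is a rough Python sketch that checks whether a job.err contains the MPI start-up failure quoted above and pulls out the job id and node name from the MPICH header line. It assumes the header has the same bracketed layout as in the original post; the function name `summarise_mpi_init_failure` is hypothetical, and a match is only a hint that the job died in MPI initialisation, not a diagnosis of the cause.

```python
import re
from typing import Optional

# Assumption: the header looks like the one quoted in this thread, e.g.
# MPICH ERROR [Rank 0] [job id 1382455.0] [Mon Apr 4 22:01:09 2022] [nid001360] - ...
MPICH_HEADER_RE = re.compile(
    r"MPICH ERROR \[Rank (\d+)\] \[job id ([\d.]+)\] \[([^\]]+)\] \[(nid\d+)\]"
)
INIT_FAILURE_MARKER = "Fatal error in PMPI_Init_thread"


def summarise_mpi_init_failure(text: str) -> Optional[dict]:
    """Return header details if `text` shows an MPI init failure, else None."""
    if INIT_FAILURE_MARKER not in text:
        return None
    match = MPICH_HEADER_RE.search(text)
    if match is None:
        # The failure marker is present but the header layout differs.
        return {}
    return {
        "rank": int(match.group(1)),
        "job_id": match.group(2),
        "time": match.group(3),
        "node": match.group(4),
    }


# Sample copied from the first line of the error in the original post.
SAMPLE_ERROR = (
    "MPICH ERROR [Rank 0] [job id 1382455.0] [Mon Apr 4 22:01:09 2022] "
    "[nid001360] - Abort(2685455) (rank 0 in comm 0): "
    "Fatal error in PMPI_Init_thread: Other MPI error, error stack:"
)

if __name__ == "__main__":
    print(summarise_mpi_init_failure(SAMPLE_ERROR))
```

The node name and job id are the details most likely to be useful if the same failure keeps recurring and needs to be raised with the service desk.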

hannahbryant · 7 April 2022 08:36

Thanks @dcase
Just to update for the help desk: the newer atmos_main tasks have run successfully without the tropopause warning, but I am still concerned that restarting has messed up the run. I need to run this for another 10 years. Do you have any thoughts on whether it is OK to proceed, or whether redoing it is more sensible?

Thanks,
Hannah
