
Reinit++: Evaluating the Performance of Global-Restart Recovery Methods for MPI Fault Tolerance

Scaling supercomputers comes with an increase in failure rates due to the increasing number of hardware components. In standard practice, applications are made resilient through checkpointing data and restarting execution after a failure occurs to resume from the latest checkpoint. However, re-deplo...


Bibliographic Details
Main authors: Georgakoudis, Giorgis; Guo, Luanzheng; Laguna, Ignacio
Format: Online Article Text
Language: English
Published: 2020
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7295366/
http://dx.doi.org/10.1007/978-3-030-50743-5_27
_version_ 1783546637352697856
author Georgakoudis, Giorgis
Guo, Luanzheng
Laguna, Ignacio
author_facet Georgakoudis, Giorgis
Guo, Luanzheng
Laguna, Ignacio
author_sort Georgakoudis, Giorgis
collection PubMed
description Scaling supercomputers comes with an increase in failure rates due to the increasing number of hardware components. In standard practice, applications are made resilient through checkpointing data and restarting execution after a failure occurs to resume from the latest checkpoint. However, re-deploying an application incurs overhead by tearing down and re-instating execution, and possibly limiting checkpointing retrieval from slow permanent storage. In this paper we present Reinit++, a new design and implementation of the Reinit approach for global-restart recovery, which avoids application re-deployment. We extensively evaluate Reinit++ contrasted with the leading MPI fault-tolerance approach of ULFM, implementing global-restart recovery, and the typical practice of restarting an application to derive new insight on performance. Experimentation with three different HPC proxy applications made resilient to withstand process and node failures shows that Reinit++ recovers much faster than restarting, up to 6×, or ULFM, up to 3×, and that it scales excellently as the number of MPI processes grows.
format Online
Article
Text
id pubmed-7295366
institution National Center for Biotechnology Information
language English
publishDate 2020
record_format MEDLINE/PubMed
spelling pubmed-72953662020-06-16 Reinit++: Evaluating the Performance of Global-Restart Recovery Methods for MPI Fault Tolerance Georgakoudis, Giorgis Guo, Luanzheng Laguna, Ignacio High Performance Computing Article Scaling supercomputers comes with an increase in failure rates due to the increasing number of hardware components. In standard practice, applications are made resilient through checkpointing data and restarting execution after a failure occurs to resume from the latest checkpoint. However, re-deploying an application incurs overhead by tearing down and re-instating execution, and possibly limiting checkpointing retrieval from slow permanent storage. In this paper we present Reinit++, a new design and implementation of the Reinit approach for global-restart recovery, which avoids application re-deployment. We extensively evaluate Reinit++ contrasted with the leading MPI fault-tolerance approach of ULFM, implementing global-restart recovery, and the typical practice of restarting an application to derive new insight on performance. Experimentation with three different HPC proxy applications made resilient to withstand process and node failures shows that Reinit++ recovers much faster than restarting, up to 6×, or ULFM, up to 3×, and that it scales excellently as the number of MPI processes grows. 2020-05-22 /pmc/articles/PMC7295366/ http://dx.doi.org/10.1007/978-3-030-50743-5_27 Text en © Springer Nature Switzerland AG 2020 This article is made available via the PMC Open Access Subset for unrestricted research re-use and secondary analysis in any form or by any means with acknowledgement of the original source. These permissions are granted for the duration of the World Health Organization (WHO) declaration of COVID-19 as a global pandemic.
spellingShingle Article
Georgakoudis, Giorgis
Guo, Luanzheng
Laguna, Ignacio
Reinit++: Evaluating the Performance of Global-Restart Recovery Methods for MPI Fault Tolerance
title Reinit++: Evaluating the Performance of Global-Restart Recovery Methods for MPI Fault Tolerance
title_full Reinit++: Evaluating the Performance of Global-Restart Recovery Methods for MPI Fault Tolerance
title_fullStr Reinit++: Evaluating the Performance of Global-Restart Recovery Methods for MPI Fault Tolerance
title_full_unstemmed Reinit++: Evaluating the Performance of Global-Restart Recovery Methods for MPI Fault Tolerance
title_short Reinit++: Evaluating the Performance of Global-Restart Recovery Methods for MPI Fault Tolerance
title_sort reinit++: evaluating the performance of global-restart recovery methods for mpi fault tolerance
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7295366/
http://dx.doi.org/10.1007/978-3-030-50743-5_27
work_keys_str_mv AT georgakoudisgiorgis reinitformulaseetextevaluatingtheperformanceofglobalrestartrecoverymethodsformpifaulttolerance
AT guoluanzheng reinitformulaseetextevaluatingtheperformanceofglobalrestartrecoverymethodsformpifaulttolerance
AT lagunaignacio reinitformulaseetextevaluatingtheperformanceofglobalrestartrecoverymethodsformpifaulttolerance