Reinit++: Evaluating the Performance of Global-Restart Recovery Methods for MPI Fault Tolerance

Scaling supercomputers comes with an increase in failure rates due to the increasing number of hardware components. In standard practice, applications are made resilient through checkpointing data and restarting execution after a failure occurs to resume from the latest checkpoint. However, re-deplo...

Bibliographic Details
Main Authors: Georgakoudis, Giorgis; Guo, Luanzheng; Laguna, Ignacio
Format: Online Article Text
Language: English
Published: 2020
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7295366/
http://dx.doi.org/10.1007/978-3-030-50743-5_27