Reinit++: Evaluating the Performance of Global-Restart Recovery Methods for MPI Fault Tolerance
Scaling supercomputers comes with an increase in failure rates due to the increasing number of hardware components. In standard practice, applications are made resilient through checkpointing data and restarting execution after a failure occurs to resume from the latest checkpoint. However, re-deplo...
Main authors: | |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | 2020 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7295366/ http://dx.doi.org/10.1007/978-3-030-50743-5_27 |