
Straggler-Aware Distributed Learning: Communication–Computation Latency Trade-Off

When gradient descent (GD) is scaled to many parallel workers for large-scale machine learning applications, its per-iteration computation time is limited by straggling workers. Straggling workers can be tolerated by assigning redundant computations and/or coding across data and computations, but in most existing schemes, each non-straggling worker transmits one message per iteration to the parameter server (PS) after completing all its computations. Imposing such a limitation results in two drawbacks: over-computation due to inaccurate prediction of the straggling behavior, and under-utilization due to discarding partial computations carried out by stragglers. To overcome these drawbacks, we consider multi-message communication (MMC) by allowing multiple computations to be conveyed from each worker per iteration, and propose novel straggler avoidance techniques for both coded computation and coded communication with MMC. We analyze how the proposed designs can be employed efficiently to seek a balance between the computation and communication latency. Furthermore, we identify the advantages and disadvantages of these designs in different settings through extensive model-based simulations and a real implementation on Amazon EC2 servers, and demonstrate that the proposed schemes with MMC can help improve upon existing straggler avoidance schemes.
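
As a rough illustration of the multi-message idea in the abstract (not taken from the paper), the following minimal Python sketch compares the time for a parameter server to collect enough partial gradients when each worker sends a single message after finishing all of its tasks versus when it reports each partial computation as soon as it is done. The worker count, task counts, and straggling model below are assumptions chosen for illustration; the sketch ignores per-message communication cost and does not reproduce the coded computation/communication schemes the authors actually propose.

# Toy model of multi-message communication (MMC) for distributed gradient descent.
# Assumptions (not from the paper): 10 workers, 4 partial gradients each, the PS
# needs 30 partial results per iteration, 20% of workers straggle and run 5x slower,
# and per-task compute times are exponential. Communication cost is ignored.

import random

NUM_WORKERS = 10
TASKS_PER_WORKER = 4
TASKS_NEEDED = 30          # partial results required to form the GD update

def task_finish_times(rng):
    """Cumulative finish times of one worker's assigned tasks."""
    slowdown = 5.0 if rng.random() < 0.2 else 1.0   # straggler runs 5x slower
    t, times = 0.0, []
    for _ in range(TASKS_PER_WORKER):
        t += slowdown * rng.expovariate(1.0)
        times.append(t)
    return times

def iteration_time(mmc, seed):
    """Time until the PS has TASKS_NEEDED partial results in one iteration."""
    rng = random.Random(seed)
    arrivals = []
    for _ in range(NUM_WORKERS):
        times = task_finish_times(rng)
        if mmc:
            arrivals.extend(times)                           # each result sent when ready
        else:
            arrivals.extend([times[-1]] * TASKS_PER_WORKER)  # one message after all tasks
    arrivals.sort()
    return arrivals[TASKS_NEEDED - 1]

if __name__ == "__main__":
    runs = 1000
    single = sum(iteration_time(False, s) for s in range(runs)) / runs
    multi = sum(iteration_time(True, s) for s in range(runs)) / runs
    print(f"avg per-iteration time, one message per worker: {single:.2f}")
    print(f"avg per-iteration time, multi-message (MMC):    {multi:.2f}")

Because MMC lets the PS count partial results completed by stragglers, the multi-message variant typically reaches the required number of results sooner; the paper's contribution is to combine this with coding so that the added messages and redundant computations stay balanced against the communication latency they introduce.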


Bibliographic Details
Main Authors: Ozfatura, Emre; Ulukus, Sennur; Gündüz, Deniz
Format: Online Article (Text)
Language: English
Published: Entropy (Basel), MDPI, 13 May 2020
Subjects: Article
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7517046/
https://www.ncbi.nlm.nih.gov/pubmed/33286316
http://dx.doi.org/10.3390/e22050544
License: © 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).