HyPar-Flow: Exploiting MPI and Keras for Scalable Hybrid-Parallel DNN Training with TensorFlow
To reduce the training time of large-scale Deep Neural Networks (DNNs), Deep Learning (DL) scientists have started to explore parallelization strategies like data-parallelism, model-parallelism, and hybrid-parallelism. While data-parallelism has been extensively studied and developed, several problems exist in realizing model-parallelism and hybrid-parallelism efficiently. Four major problems we focus on are: 1) defining a notion of a distributed model across processes, 2) implementing forward/back-propagation across process boundaries that requires explicit communication, 3) obtaining parallel speedup on an inherently sequential task, and 4) achieving scalability without losing out on a model's accuracy. To address these problems, we create HyPar-Flow, a model-size and model-type agnostic, scalable, practical, and user-transparent system for hybrid-parallel training by exploiting MPI, Keras, and TensorFlow. HyPar-Flow provides a single API that can be used to perform data, model, and hybrid parallel training of any Keras model at scale. We create an internal distributed representation of the user-provided Keras model, utilize TF's Eager execution features for distributed forward/back-propagation across processes, exploit pipelining to improve performance, and leverage efficient MPI primitives for scalable communication. Between model partitions, we use send and recv to exchange layer-data/partial-errors, while allreduce is used to accumulate/average gradients across model replicas. Beyond the design and implementation of HyPar-Flow, we also provide comprehensive correctness and performance results on three state-of-the-art HPC systems including TACC Frontera (#5 on Top500.org). For ResNet-1001, an ultra-deep model, HyPar-Flow provides: 1) up to 1.6× speedup over Horovod-based data-parallel training, 2) 110× speedup over single-node on 128 Stampede2 nodes, and 3) 481× speedup over single-node on 512 Frontera nodes.
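The abstract's communication design (send/recv to exchange layer data and partial errors between model partitions, allreduce to average gradients across model replicas) can be illustrated with plain MPI. The following is a minimal, hypothetical mpi4py sketch of that pattern only; it is not HyPar-Flow's API or implementation, and the partition count, rank layout, and tensor shapes are assumptions.

```python
# Hypothetical sketch (not HyPar-Flow code): the communication pattern the
# abstract describes, using mpi4py and NumPy stand-ins for layer tensors.
# Example launch for 2 replicas x 2 partitions:  mpirun -np 4 python sketch.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

NUM_PARTITIONS = 2                       # assumed model split; size must be a multiple
assert size % NUM_PARTITIONS == 0
replica = rank // NUM_PARTITIONS         # which model replica this rank belongs to
partition = rank % NUM_PARTITIONS        # which partition of the model this rank holds
last = NUM_PARTITIONS - 1

# Ranks holding the same partition across replicas form the data-parallel group
# over which that partition's gradients are averaged.
grad_comm = comm.Split(partition, replica)

shape = (32, 64)                         # assumed activation/error tensor shape
acts = np.random.rand(*shape)            # stand-in for this partition's layer outputs

# Forward pass: receive activations from the previous partition, send to the next.
if partition > 0:
    prev_acts = np.empty(shape)
    comm.Recv(prev_acts, source=rank - 1, tag=0)
if partition < last:
    comm.Send(acts, dest=rank + 1, tag=0)

# Backward pass: receive partial errors from the next partition, send to the previous.
err = np.random.rand(*shape)             # stand-in for dL/d(activations)
if partition < last:
    next_err = np.empty(shape)
    comm.Recv(next_err, source=rank + 1, tag=1)
if partition > 0:
    comm.Send(err, dest=rank - 1, tag=1)

# Gradient accumulation/averaging across model replicas via allreduce.
grads = np.random.rand(1000)             # stand-in for this partition's gradients
avg = np.empty_like(grads)
grad_comm.Allreduce(grads, avg, op=MPI.SUM)
avg /= grad_comm.Get_size()              # sum -> average across replicas

print(f"rank {rank}: replica {replica}, partition {partition} completed one step")
```

In HyPar-Flow itself, per the abstract, this exchange is driven by an internal distributed representation of the Keras model and TF Eager execution, with pipelining between partitions, rather than by hand-written MPI calls.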
| Main Authors: | Awan, Ammar Ahmad; Jain, Arpan; Anthony, Quentin; Subramoni, Hari; Panda, Dhabaleswar K. |
|---|---|
| Format: | Online Article Text |
| Language: | English |
| Published: | 2020 (© Springer Nature Switzerland AG 2020; available in PMC 2020-05-22) |
| Subjects: | High Performance Computing; Article |
| Collection: | PubMed (id pubmed-7295349; record format MEDLINE/PubMed; National Center for Biotechnology Information) |
| Rights: | Made available via the PMC Open Access Subset for unrestricted research re-use and secondary analysis in any form or by any means with acknowledgement of the original source, for the duration of the WHO declaration of COVID-19 as a global pandemic. |
| Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7295349/ http://dx.doi.org/10.1007/978-3-030-50743-5_5 |