Deploying and scaling distributed parallel deep neural networks on the Tianhe-3 prototype system
Due to the increase in computing power, it is possible to improve the feature extraction and data fitting capabilities of DNNs by increasing their depth and model complexity. However, big data and complex models greatly increase the training overhead of DNNs, so accelerating the training process becomes a key task. Tianhe-3 is designed to reach an exascale (E-class) peak speed, and this enormous computing power offers a promising opportunity for DNN training. We implement and extend LeNet, AlexNet, VGG, and ResNet model training on single MT-2000+ and FT-2000+ compute nodes, as well as on extended multi-node clusters, and propose a Dynamic Allreduce communication optimization strategy that improves the gradient synchronization process, based on the ARM architecture features of the Tianhe-3 prototype. This provides experimental data and a theoretical basis for further improving the performance of the Tianhe-3 prototype in large-scale distributed training of neural networks.
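The abstract's core mechanism, synchronizing locally computed gradients across compute nodes with an Allreduce collective, can be illustrated with a short sketch. This is a minimal data-parallel example using mpi4py, not the paper's implementation; the gradient tensor and its size are placeholders.

```python
# Minimal sketch of data-parallel gradient synchronization via MPI
# Allreduce. The gradient is a placeholder, not a real model's output.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
size = comm.Get_size()

# Each worker computes gradients on its own mini-batch shard; a random
# vector stands in for a real model's flattened gradient here.
local_grad = np.random.rand(1 << 20).astype(np.float32)

# Sum the gradients across all ranks, then average, so every worker
# applies the same update that single-node SGD over the full batch would.
global_grad = np.empty_like(local_grad)
comm.Allreduce(local_grad, global_grad, op=MPI.SUM)
global_grad /= size
```

Run under an MPI launcher, e.g. `mpirun -np 4 python sync.py`; every rank ends up holding the same averaged gradient.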
Main Authors: | Wei, Jia; Zhang, Xingjun; Ji, Zeyu; Li, Jingbo; Wei, Zheng |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Nature Publishing Group UK, 2021 |
Subjects: | Article |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8511035/ https://www.ncbi.nlm.nih.gov/pubmed/34642373 http://dx.doi.org/10.1038/s41598-021-98794-z |
_version_ | 1784582700198264832 |
---|---|
author | Wei, Jia; Zhang, Xingjun; Ji, Zeyu; Li, Jingbo; Wei, Zheng |
author_facet | Wei, Jia; Zhang, Xingjun; Ji, Zeyu; Li, Jingbo; Wei, Zheng |
author_sort | Wei, Jia |
collection | PubMed |
description | Due to the increase in computing power, it is possible to improve the feature extraction and data fitting capabilities of DNNs by increasing their depth and model complexity. However, big data and complex models greatly increase the training overhead of DNNs, so accelerating the training process becomes a key task. Tianhe-3 is designed to reach an exascale (E-class) peak speed, and this enormous computing power offers a promising opportunity for DNN training. We implement and extend LeNet, AlexNet, VGG, and ResNet model training on single MT-2000+ and FT-2000+ compute nodes, as well as on extended multi-node clusters, and propose a Dynamic Allreduce communication optimization strategy that improves the gradient synchronization process, based on the ARM architecture features of the Tianhe-3 prototype. This provides experimental data and a theoretical basis for further improving the performance of the Tianhe-3 prototype in large-scale distributed training of neural networks. |
format | Online Article Text |
id | pubmed-8511035 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2021 |
publisher | Nature Publishing Group UK |
record_format | MEDLINE/PubMed |
spelling | pubmed-8511035 2021-10-14 Deploying and scaling distributed parallel deep neural networks on the Tianhe-3 prototype system Wei, Jia; Zhang, Xingjun; Ji, Zeyu; Li, Jingbo; Wei, Zheng Sci Rep Article Due to the increase in computing power, it is possible to improve the feature extraction and data fitting capabilities of DNNs by increasing their depth and model complexity. However, big data and complex models greatly increase the training overhead of DNNs, so accelerating the training process becomes a key task. Tianhe-3 is designed to reach an exascale (E-class) peak speed, and this enormous computing power offers a promising opportunity for DNN training. We implement and extend LeNet, AlexNet, VGG, and ResNet model training on single MT-2000+ and FT-2000+ compute nodes, as well as on extended multi-node clusters, and propose a Dynamic Allreduce communication optimization strategy that improves the gradient synchronization process, based on the ARM architecture features of the Tianhe-3 prototype. This provides experimental data and a theoretical basis for further improving the performance of the Tianhe-3 prototype in large-scale distributed training of neural networks. Nature Publishing Group UK 2021-10-12 /pmc/articles/PMC8511035/ /pubmed/34642373 http://dx.doi.org/10.1038/s41598-021-98794-z Text en © The Author(s) 2021 https://creativecommons.org/licenses/by/4.0/ Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/. |
spellingShingle | Article Wei, Jia; Zhang, Xingjun; Ji, Zeyu; Li, Jingbo; Wei, Zheng Deploying and scaling distributed parallel deep neural networks on the Tianhe-3 prototype system |
title | Deploying and scaling distributed parallel deep neural networks on the Tianhe-3 prototype system |
title_full | Deploying and scaling distributed parallel deep neural networks on the Tianhe-3 prototype system |
title_fullStr | Deploying and scaling distributed parallel deep neural networks on the Tianhe-3 prototype system |
title_full_unstemmed | Deploying and scaling distributed parallel deep neural networks on the Tianhe-3 prototype system |
title_short | Deploying and scaling distributed parallel deep neural networks on the Tianhe-3 prototype system |
title_sort | deploying and scaling distributed parallel deep neural networks on the tianhe-3 prototype system |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8511035/ https://www.ncbi.nlm.nih.gov/pubmed/34642373 http://dx.doi.org/10.1038/s41598-021-98794-z |
work_keys_str_mv | AT weijia deployingandscalingdistributedparalleldeepneuralnetworksonthetianhe3prototypesystem AT zhangxingjun deployingandscalingdistributedparalleldeepneuralnetworksonthetianhe3prototypesystem AT jizeyu deployingandscalingdistributedparalleldeepneuralnetworksonthetianhe3prototypesystem AT lijingbo deployingandscalingdistributedparalleldeepneuralnetworksonthetianhe3prototypesystem AT weizheng deployingandscalingdistributedparalleldeepneuralnetworksonthetianhe3prototypesystem |
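The record names a "Dynamic Allreduce" communication optimization but does not describe it; a plausible shape, offered purely as an assumption, is dispatching between collective schemes based on gradient payload size. Everything below (the threshold, the scheme choice, the helper name `dynamic_allreduce`) is hypothetical illustration, not the paper's algorithm.

```python
# Hypothetical sketch only: the record names a "Dynamic Allreduce" strategy
# but does not specify it. The threshold, scheme choice, and helper name
# below are assumptions for illustration, not the paper's algorithm.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD

def dynamic_allreduce(local_grad: np.ndarray,
                      threshold_bytes: int = 1 << 16) -> np.ndarray:
    """Average gradients across ranks, switching collectives by payload size."""
    size = comm.Get_size()
    out = np.empty_like(local_grad)
    if local_grad.nbytes <= threshold_bytes:
        # Small payloads: a single latency-bound Allreduce call is cheapest.
        comm.Allreduce(local_grad, out, op=MPI.SUM)
    else:
        # Large payloads: bandwidth-friendly reduce-scatter + allgather
        # (assumes the element count divides evenly by the rank count).
        chunk = np.empty(local_grad.size // size, dtype=local_grad.dtype)
        comm.Reduce_scatter_block(local_grad, chunk, op=MPI.SUM)
        comm.Allgather(chunk, out)
    return out / size
```

The size-based dispatch reflects a common MPI tuning pattern (small messages are latency-bound, large ones bandwidth-bound); whether the paper's strategy works this way on the MT-2000+/FT-2000+ interconnect cannot be determined from this record.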