
Deploying and scaling distributed parallel deep neural networks on the Tianhe-3 prototype system

With the increase in computing power, it is possible to improve the feature extraction and data fitting capabilities of deep neural networks (DNNs) by increasing their depth and model complexity. However, big data and complex models greatly increase the training overhead of DNNs, so accelerating the training…

Full description

Bibliographic Details
Main Authors: Wei, Jia, Zhang, Xingjun, Ji, Zeyu, Li, Jingbo, Wei, Zheng
Format: Online Article Text
Language: English
Published: Nature Publishing Group UK 2021
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8511035/
https://www.ncbi.nlm.nih.gov/pubmed/34642373
http://dx.doi.org/10.1038/s41598-021-98794-z
_version_ 1784582700198264832
author Wei, Jia
Zhang, Xingjun
Ji, Zeyu
Li, Jingbo
Wei, Zheng
author_facet Wei, Jia
Zhang, Xingjun
Ji, Zeyu
Li, Jingbo
Wei, Zheng
author_sort Wei, Jia
collection PubMed
description With the increase in computing power, it is possible to improve the feature extraction and data fitting capabilities of deep neural networks (DNNs) by increasing their depth and model complexity. However, big data and complex models greatly increase the training overhead of DNNs, so accelerating the training process becomes a key task. The peak speed of Tianhe-3 is designed to reach exascale (E-class), and this huge computing power provides a potential opportunity for DNN training. We implement and extend LeNet, AlexNet, VGG, and ResNet model training on single MT-2000+ and FT-2000+ compute nodes as well as on extended multi-node clusters, and propose a Dynamic Allreduce communication optimization strategy that improves the gradient synchronization process based on the ARM architecture features of the Tianhe-3 prototype, providing experimental data and a theoretical basis for further enhancing and improving the performance of the Tianhe-3 prototype in large-scale distributed training of neural networks.
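
The Dynamic Allreduce strategy described above targets the gradient synchronization step of data-parallel training. As a point of reference only, and not the authors' optimized method, the following is a minimal sketch of the baseline allreduce-based gradient averaging pattern it improves on, assuming mpi4py, NumPy, and a hypothetical synchronize_gradients helper over a flat gradient buffer:

    # Illustrative sketch only: generic data-parallel gradient averaging with
    # MPI Allreduce. This is NOT the paper's Dynamic Allreduce strategy; it shows
    # the baseline gradient-synchronization pattern that such strategies optimize.
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD

    def synchronize_gradients(local_grads):
        # Sum the gradient buffer across all ranks, then average by world size.
        averaged = np.empty_like(local_grads)
        comm.Allreduce(local_grads, averaged, op=MPI.SUM)
        averaged /= comm.Get_size()
        return averaged

    if __name__ == "__main__":
        rank = comm.Get_rank()
        # Each rank pretends it computed gradients on its own data shard.
        local_grads = np.full(4, float(rank), dtype=np.float64)
        print(rank, synchronize_gradients(local_grads))

Run with, for example, mpirun -np 4 python sync_sketch.py; in a real data-parallel job each rank would hold a shard of the training data and call this step after every backward pass.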
format Online
Article
Text
id pubmed-8511035
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Nature Publishing Group UK
record_format MEDLINE/PubMed
spelling pubmed-8511035 2021-10-14 Deploying and scaling distributed parallel deep neural networks on the Tianhe-3 prototype system Wei, Jia Zhang, Xingjun Ji, Zeyu Li, Jingbo Wei, Zheng Sci Rep Article With the increase in computing power, it is possible to improve the feature extraction and data fitting capabilities of deep neural networks (DNNs) by increasing their depth and model complexity. However, big data and complex models greatly increase the training overhead of DNNs, so accelerating the training process becomes a key task. The peak speed of Tianhe-3 is designed to reach exascale (E-class), and this huge computing power provides a potential opportunity for DNN training. We implement and extend LeNet, AlexNet, VGG, and ResNet model training on single MT-2000+ and FT-2000+ compute nodes as well as on extended multi-node clusters, and propose a Dynamic Allreduce communication optimization strategy that improves the gradient synchronization process based on the ARM architecture features of the Tianhe-3 prototype, providing experimental data and a theoretical basis for further enhancing and improving the performance of the Tianhe-3 prototype in large-scale distributed training of neural networks. Nature Publishing Group UK 2021-10-12 /pmc/articles/PMC8511035/ /pubmed/34642373 http://dx.doi.org/10.1038/s41598-021-98794-z Text en © The Author(s) 2021 https://creativecommons.org/licenses/by/4.0/ Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
spellingShingle Article
Wei, Jia
Zhang, Xingjun
Ji, Zeyu
Li, Jingbo
Wei, Zheng
Deploying and scaling distributed parallel deep neural networks on the Tianhe-3 prototype system
title Deploying and scaling distributed parallel deep neural networks on the Tianhe-3 prototype system
title_full Deploying and scaling distributed parallel deep neural networks on the Tianhe-3 prototype system
title_fullStr Deploying and scaling distributed parallel deep neural networks on the Tianhe-3 prototype system
title_full_unstemmed Deploying and scaling distributed parallel deep neural networks on the Tianhe-3 prototype system
title_short Deploying and scaling distributed parallel deep neural networks on the Tianhe-3 prototype system
title_sort deploying and scaling distributed parallel deep neural networks on the tianhe-3 prototype system
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8511035/
https://www.ncbi.nlm.nih.gov/pubmed/34642373
http://dx.doi.org/10.1038/s41598-021-98794-z
work_keys_str_mv AT weijia deployingandscalingdistributedparalleldeepneuralnetworksonthetianhe3prototypesystem
AT zhangxingjun deployingandscalingdistributedparalleldeepneuralnetworksonthetianhe3prototypesystem
AT jizeyu deployingandscalingdistributedparalleldeepneuralnetworksonthetianhe3prototypesystem
AT lijingbo deployingandscalingdistributedparalleldeepneuralnetworksonthetianhe3prototypesystem
AT weizheng deployingandscalingdistributedparalleldeepneuralnetworksonthetianhe3prototypesystem