Deploying and scaling distributed parallel deep neural networks on the Tianhe-3 prototype system
Due to the increase in computing power, it is possible to improve the feature extraction and data fitting capabilities of DNNs by increasing their depth and model complexity. However, big data and complex models greatly increase the training overhead of DNNs, so accelerating the training process becomes a key task. Tianhe-3 is designed to reach an exascale (E-class) peak speed, and this enormous computing power offers a promising opportunity for DNN training. We implement and extend LeNet, AlexNet, VGG, and ResNet model training on single MT-2000+ and FT-2000+ compute nodes, as well as on extended multi-node clusters, and propose a Dynamic Allreduce communication optimization strategy that improves the gradient synchronization process, based on the ARM architecture features of the Tianhe-3 prototype. This provides experimental data and a theoretical basis for further improving the performance of the Tianhe-3 prototype in large-scale distributed training of neural networks.
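The abstract's core mechanism, synchronizing locally computed gradients across compute nodes with an Allreduce collective, can be illustrated with a short sketch. This is a minimal data-parallel example using mpi4py, not the paper's implementation; the gradient tensor and its size are placeholders.

```python
# Minimal sketch of data-parallel gradient synchronization via MPI
# Allreduce. The gradient is a placeholder, not a real model's output.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
size = comm.Get_size()

# Each worker computes gradients on its own mini-batch shard; a random
# vector stands in for a real model's flattened gradient here.
local_grad = np.random.rand(1 << 20).astype(np.float32)

# Sum the gradients across all ranks, then average, so every worker
# applies the same update that single-node SGD over the full batch would.
global_grad = np.empty_like(local_grad)
comm.Allreduce(local_grad, global_grad, op=MPI.SUM)
global_grad /= size
```

Run under an MPI launcher, e.g. `mpirun -np 4 python sync.py`; every rank ends up holding the same averaged gradient.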
Main Authors: | Wei, Jia; Zhang, Xingjun; Ji, Zeyu; Li, Jingbo; Wei, Zheng |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Nature Publishing Group UK, 2021 |
Subjects: | Article |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8511035/ https://www.ncbi.nlm.nih.gov/pubmed/34642373 http://dx.doi.org/10.1038/s41598-021-98794-z |
_version_ | 1784582700198264832 |
---|---|
author | Wei, Jia; Zhang, Xingjun; Ji, Zeyu; Li, Jingbo; Wei, Zheng |
author_facet | Wei, Jia; Zhang, Xingjun; Ji, Zeyu; Li, Jingbo; Wei, Zheng |
author_sort | Wei, Jia |
collection | PubMed |
description | Due to the increase in computing power, it is possible to improve the feature extraction and data fitting capabilities of DNNs by increasing their depth and model complexity. However, big data and complex models greatly increase the training overhead of DNNs, so accelerating the training process becomes a key task. Tianhe-3 is designed to reach an exascale (E-class) peak speed, and this enormous computing power offers a promising opportunity for DNN training. We implement and extend LeNet, AlexNet, VGG, and ResNet model training on single MT-2000+ and FT-2000+ compute nodes, as well as on extended multi-node clusters, and propose a Dynamic Allreduce communication optimization strategy that improves the gradient synchronization process, based on the ARM architecture features of the Tianhe-3 prototype. This provides experimental data and a theoretical basis for further improving the performance of the Tianhe-3 prototype in large-scale distributed training of neural networks. |
format | Online Article Text |
id | pubmed-8511035 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2021 |
publisher | Nature Publishing Group UK |
record_format | MEDLINE/PubMed |
spelling | pubmed-8511035 2021-10-14 Deploying and scaling distributed parallel deep neural networks on the Tianhe-3 prototype system Wei, Jia; Zhang, Xingjun; Ji, Zeyu; Li, Jingbo; Wei, Zheng Sci Rep Article Due to the increase in computing power, it is possible to improve the feature extraction and data fitting capabilities of DNNs by increasing their depth and model complexity. However, big data and complex models greatly increase the training overhead of DNNs, so accelerating the training process becomes a key task. Tianhe-3 is designed to reach an exascale (E-class) peak speed, and this enormous computing power offers a promising opportunity for DNN training. We implement and extend LeNet, AlexNet, VGG, and ResNet model training on single MT-2000+ and FT-2000+ compute nodes, as well as on extended multi-node clusters, and propose a Dynamic Allreduce communication optimization strategy that improves the gradient synchronization process, based on the ARM architecture features of the Tianhe-3 prototype. This provides experimental data and a theoretical basis for further improving the performance of the Tianhe-3 prototype in large-scale distributed training of neural networks. Nature Publishing Group UK 2021-10-12 /pmc/articles/PMC8511035/ /pubmed/34642373 http://dx.doi.org/10.1038/s41598-021-98794-z Text en © The Author(s) 2021 https://creativecommons.org/licenses/by/4.0/ Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/. |
spellingShingle | Article Wei, Jia; Zhang, Xingjun; Ji, Zeyu; Li, Jingbo; Wei, Zheng Deploying and scaling distributed parallel deep neural networks on the Tianhe-3 prototype system |
title | Deploying and scaling distributed parallel deep neural networks on the Tianhe-3 prototype system |
title_full | Deploying and scaling distributed parallel deep neural networks on the Tianhe-3 prototype system |
title_fullStr | Deploying and scaling distributed parallel deep neural networks on the Tianhe-3 prototype system |
title_full_unstemmed | Deploying and scaling distributed parallel deep neural networks on the Tianhe-3 prototype system |
title_short | Deploying and scaling distributed parallel deep neural networks on the Tianhe-3 prototype system |
title_sort | deploying and scaling distributed parallel deep neural networks on the tianhe-3 prototype system |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8511035/ https://www.ncbi.nlm.nih.gov/pubmed/34642373 http://dx.doi.org/10.1038/s41598-021-98794-z |
work_keys_str_mv | AT weijia deployingandscalingdistributedparalleldeepneuralnetworksonthetianhe3prototypesystem AT zhangxingjun deployingandscalingdistributedparalleldeepneuralnetworksonthetianhe3prototypesystem AT jizeyu deployingandscalingdistributedparalleldeepneuralnetworksonthetianhe3prototypesystem AT lijingbo deployingandscalingdistributedparalleldeepneuralnetworksonthetianhe3prototypesystem AT weizheng deployingandscalingdistributedparalleldeepneuralnetworksonthetianhe3prototypesystem |
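The record names a "Dynamic Allreduce" communication optimization but does not describe it; a plausible shape, offered purely as an assumption, is dispatching between collective schemes based on gradient payload size. Everything below (the threshold, the scheme choice, the helper name `dynamic_allreduce`) is hypothetical illustration, not the paper's algorithm.

```python
# Hypothetical sketch only: the record names a "Dynamic Allreduce" strategy
# but does not specify it. The threshold, scheme choice, and helper name
# below are assumptions for illustration, not the paper's algorithm.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD

def dynamic_allreduce(local_grad: np.ndarray,
                      threshold_bytes: int = 1 << 16) -> np.ndarray:
    """Average gradients across ranks, switching collectives by payload size."""
    size = comm.Get_size()
    out = np.empty_like(local_grad)
    if local_grad.nbytes <= threshold_bytes:
        # Small payloads: a single latency-bound Allreduce call is cheapest.
        comm.Allreduce(local_grad, out, op=MPI.SUM)
    else:
        # Large payloads: bandwidth-friendly reduce-scatter + allgather
        # (assumes the element count divides evenly by the rank count).
        chunk = np.empty(local_grad.size // size, dtype=local_grad.dtype)
        comm.Reduce_scatter_block(local_grad, chunk, op=MPI.SUM)
        comm.Allgather(chunk, out)
    return out / size
```

The size-based dispatch reflects a common MPI tuning pattern (small messages are latency-bound, large ones bandwidth-bound); whether the paper's strategy works this way on the MT-2000+/FT-2000+ interconnect cannot be determined from this record.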