
Scalable deep text comprehension for Cancer surveillance on high-performance computing

BACKGROUND: Deep Learning (DL) has advanced state-of-the-art capabilities in bioinformatics applications, resulting in a trend toward increasingly sophisticated and computationally demanding models trained on ever-larger data sets. This vastly increased computational demand challenges the feasibility of conducting cutting-edge research. One solution is to distribute the computational workload across multiple computing cluster nodes using data parallelism algorithms. In this study, we used a High-Performance Computing environment and implemented the Downpour Stochastic Gradient Descent algorithm for data parallelism to train a Convolutional Neural Network (CNN) for the natural language processing task of information extraction from a massive dataset of cancer pathology reports. We evaluated the scalability of data-parallel training on the Titan supercomputer at the Oak Ridge Leadership Computing Facility, varying the number of worker nodes and comparing the effects of different training batch sizes and optimizer functions. RESULTS: We found that Adadelta consistently converged to the lowest validation loss, though it required over twice as many training epochs as the fastest-converging optimizer, RMSProp. The Adam optimizer consistently achieved a close second-place minimum validation loss significantly faster; with batch sizes of 16 and 32, the network converged in only 4.5 training epochs. CONCLUSIONS: We demonstrated that the networked training process is scalable across multiple compute nodes communicating via the Message Passing Interface (MPI), while achieving higher classification accuracy than a traditional machine learning algorithm.


Bibliographic Details
Main Authors: Qiu, John X., Yoon, Hong-Jun, Srivastava, Kshitij, Watson, Thomas P., Blair Christian, J., Ramanathan, Arvind, Wu, Xiao C., Fearn, Paul A., Tourassi, Georgia D.
Format: Online Article Text
Language: English
Published: BioMed Central 2018
Subjects: Research
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6302459/
https://www.ncbi.nlm.nih.gov/pubmed/30577743
http://dx.doi.org/10.1186/s12859-018-2511-9
author Qiu, John X.
Yoon, Hong-Jun
Srivastava, Kshitij
Watson, Thomas P.
Blair Christian, J.
Ramanathan, Arvind
Wu, Xiao C.
Fearn, Paul A.
Tourassi, Georgia D.
author_facet Qiu, John X.
Yoon, Hong-Jun
Srivastava, Kshitij
Watson, Thomas P.
Blair Christian, J.
Ramanathan, Arvind
Wu, Xiao C.
Fearn, Paul A.
Tourassi, Georgia D.
author_sort Qiu, John X.
collection PubMed
description BACKGROUND: Deep Learning (DL) has advanced state-of-the-art capabilities in bioinformatics applications, resulting in a trend toward increasingly sophisticated and computationally demanding models trained on ever-larger data sets. This vastly increased computational demand challenges the feasibility of conducting cutting-edge research. One solution is to distribute the computational workload across multiple computing cluster nodes using data parallelism algorithms. In this study, we used a High-Performance Computing environment and implemented the Downpour Stochastic Gradient Descent algorithm for data parallelism to train a Convolutional Neural Network (CNN) for the natural language processing task of information extraction from a massive dataset of cancer pathology reports. We evaluated the scalability of data-parallel training on the Titan supercomputer at the Oak Ridge Leadership Computing Facility, varying the number of worker nodes and comparing the effects of different training batch sizes and optimizer functions. RESULTS: We found that Adadelta consistently converged to the lowest validation loss, though it required over twice as many training epochs as the fastest-converging optimizer, RMSProp. The Adam optimizer consistently achieved a close second-place minimum validation loss significantly faster; with batch sizes of 16 and 32, the network converged in only 4.5 training epochs. CONCLUSIONS: We demonstrated that the networked training process is scalable across multiple compute nodes communicating via the Message Passing Interface (MPI), while achieving higher classification accuracy than a traditional machine learning algorithm.
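As a rough illustration of the Downpour Stochastic Gradient Descent approach described in the abstract, the sketch below shows asynchronous data-parallel training over MPI using mpi4py, with rank 0 acting as a parameter server and the remaining ranks as workers. This is not the authors' implementation: the toy linear model, synthetic data, batch size, learning rate, and step count are illustrative placeholders standing in for the paper's CNN, pathology-report corpus, and hyperparameters.

# Minimal Downpour-SGD sketch (hypothetical; run with e.g. `mpirun -n 4 python downpour_sketch.py`)
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
n_workers = comm.Get_size() - 1
DIM, STEPS, LR = 16, 100, 0.01  # illustrative sizes, not the paper's settings

if rank == 0:
    # Parameter server: apply each incoming gradient, return fresh parameters.
    w = np.zeros(DIM)
    finished = 0
    while finished < n_workers:
        status = MPI.Status()
        msg = comm.recv(source=MPI.ANY_SOURCE, status=status)
        if msg is None:                  # sentinel: a worker has finished
            finished += 1
            continue
        w -= LR * msg                    # asynchronous update; workers are never synchronized
        comm.send(w, dest=status.Get_source())
else:
    # Worker: compute a mini-batch gradient on its local shard, push it, pull parameters.
    rng = np.random.default_rng(rank)    # per-rank seed stands in for a per-worker data shard
    w = np.zeros(DIM)
    for _ in range(STEPS):
        X = rng.normal(size=(32, DIM))           # synthetic local mini-batch
        y = X @ np.ones(DIM)                     # synthetic regression target
        grad = 2.0 * X.T @ (X @ w - y) / len(X)  # least-squares gradient
        comm.send(grad, dest=0)                  # push gradient to the server
        w = comm.recv(source=0)                  # pull updated parameters
    comm.send(None, dest=0)                      # signal completion

The defining trade-off of Downpour SGD is that workers push gradients and pull parameters asynchronously, so a slow worker's gradient may be applied to parameters that have already moved on; this staleness costs some statistical efficiency but lets throughput scale with the number of worker nodes.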
format Online
Article
Text
id pubmed-6302459
institution National Center for Biotechnology Information
language English
publishDate 2018
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-6302459 2018-12-31 Scalable deep text comprehension for Cancer surveillance on high-performance computing Qiu, John X. Yoon, Hong-Jun Srivastava, Kshitij Watson, Thomas P. Blair Christian, J. Ramanathan, Arvind Wu, Xiao C. Fearn, Paul A. Tourassi, Georgia D. BMC Bioinformatics Research BACKGROUND: Deep Learning (DL) has advanced state-of-the-art capabilities in bioinformatics applications, resulting in a trend toward increasingly sophisticated and computationally demanding models trained on ever-larger data sets. This vastly increased computational demand challenges the feasibility of conducting cutting-edge research. One solution is to distribute the computational workload across multiple computing cluster nodes using data parallelism algorithms. In this study, we used a High-Performance Computing environment and implemented the Downpour Stochastic Gradient Descent algorithm for data parallelism to train a Convolutional Neural Network (CNN) for the natural language processing task of information extraction from a massive dataset of cancer pathology reports. We evaluated the scalability of data-parallel training on the Titan supercomputer at the Oak Ridge Leadership Computing Facility, varying the number of worker nodes and comparing the effects of different training batch sizes and optimizer functions. RESULTS: We found that Adadelta consistently converged to the lowest validation loss, though it required over twice as many training epochs as the fastest-converging optimizer, RMSProp. The Adam optimizer consistently achieved a close second-place minimum validation loss significantly faster; with batch sizes of 16 and 32, the network converged in only 4.5 training epochs. CONCLUSIONS: We demonstrated that the networked training process is scalable across multiple compute nodes communicating via the Message Passing Interface (MPI), while achieving higher classification accuracy than a traditional machine learning algorithm. BioMed Central 2018-12-21 /pmc/articles/PMC6302459/ /pubmed/30577743 http://dx.doi.org/10.1186/s12859-018-2511-9 Text en © The Author(s). 2018 Open Access: This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle Research
Qiu, John X.
Yoon, Hong-Jun
Srivastava, Kshitij
Watson, Thomas P.
Blair Christian, J.
Ramanathan, Arvind
Wu, Xiao C.
Fearn, Paul A.
Tourassi, Georgia D.
Scalable deep text comprehension for Cancer surveillance on high-performance computing
title Scalable deep text comprehension for Cancer surveillance on high-performance computing
title_full Scalable deep text comprehension for Cancer surveillance on high-performance computing
title_fullStr Scalable deep text comprehension for Cancer surveillance on high-performance computing
title_full_unstemmed Scalable deep text comprehension for Cancer surveillance on high-performance computing
title_short Scalable deep text comprehension for Cancer surveillance on high-performance computing
title_sort scalable deep text comprehension for cancer surveillance on high-performance computing
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6302459/
https://www.ncbi.nlm.nih.gov/pubmed/30577743
http://dx.doi.org/10.1186/s12859-018-2511-9
work_keys_str_mv AT qiujohnx scalabledeeptextcomprehensionforcancersurveillanceonhighperformancecomputing
AT yoonhongjun scalabledeeptextcomprehensionforcancersurveillanceonhighperformancecomputing
AT srivastavakshitij scalabledeeptextcomprehensionforcancersurveillanceonhighperformancecomputing
AT watsonthomasp scalabledeeptextcomprehensionforcancersurveillanceonhighperformancecomputing
AT blairchristianj scalabledeeptextcomprehensionforcancersurveillanceonhighperformancecomputing
AT ramanathanarvind scalabledeeptextcomprehensionforcancersurveillanceonhighperformancecomputing
AT wuxiaoc scalabledeeptextcomprehensionforcancersurveillanceonhighperformancecomputing
AT fearnpaula scalabledeeptextcomprehensionforcancersurveillanceonhighperformancecomputing
AT tourassigeorgiad scalabledeeptextcomprehensionforcancersurveillanceonhighperformancecomputing