
SparkText: Biomedical Text Mining on Big Data Framework

BACKGROUND: Many new biomedical research articles are published every day, accumulating rich information, such as genetic variants, genes, diseases, and treatments. Rapid yet accurate text mining on large-scale scientific literature can discover novel knowledge to better understand human diseases an...


Bibliographic Details
Main Authors: Ye, Zhan; Tafti, Ahmad P.; He, Karen Y.; Wang, Kai; He, Max M.
Format: Online Article Text
Language: English
Published: Public Library of Science 2016
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5042555/
https://www.ncbi.nlm.nih.gov/pubmed/27685652
http://dx.doi.org/10.1371/journal.pone.0162721
author Ye, Zhan
Tafti, Ahmad P.
He, Karen Y.
Wang, Kai
He, Max M.
author_facet Ye, Zhan
Tafti, Ahmad P.
He, Karen Y.
Wang, Kai
He, Max M.
author_sort Ye, Zhan
collection PubMed
description BACKGROUND: Many new biomedical research articles are published every day, accumulating rich information, such as genetic variants, genes, diseases, and treatments. Rapid yet accurate text mining on large-scale scientific literature can discover novel knowledge to better understand human diseases and to improve the quality of disease diagnosis, prevention, and treatment. RESULTS: In this study, we designed and developed an efficient text mining framework called SparkText on a Big Data infrastructure, which is composed of Apache Spark data streaming and machine learning methods, combined with a Cassandra NoSQL database. To demonstrate its performance for classifying cancer types, we extracted information (e.g., breast, prostate, and lung cancers) from tens of thousands of articles downloaded from PubMed, and then employed Naïve Bayes, Support Vector Machine (SVM), and Logistic Regression to build prediction models to mine the articles. The accuracy of predicting a cancer type by SVM using the 29,437 full-text articles was 93.81%. While competing text-mining tools took more than 11 hours, SparkText mined the dataset in approximately 6 minutes. CONCLUSIONS: This study demonstrates the potential for mining large-scale scientific articles on a Big Data infrastructure, with real-time update from new articles published daily. SparkText can be extended to other areas of biomedical research.
format Online
Article
Text
id pubmed-5042555
institution National Center for Biotechnology Information
language English
publishDate 2016
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-50425552016-10-27 SparkText: Biomedical Text Mining on Big Data Framework Ye, Zhan Tafti, Ahmad P. He, Karen Y. Wang, Kai He, Max M. PLoS One Research Article BACKGROUND: Many new biomedical research articles are published every day, accumulating rich information, such as genetic variants, genes, diseases, and treatments. Rapid yet accurate text mining on large-scale scientific literature can discover novel knowledge to better understand human diseases and to improve the quality of disease diagnosis, prevention, and treatment. RESULTS: In this study, we designed and developed an efficient text mining framework called SparkText on a Big Data infrastructure, which is composed of Apache Spark data streaming and machine learning methods, combined with a Cassandra NoSQL database. To demonstrate its performance for classifying cancer types, we extracted information (e.g., breast, prostate, and lung cancers) from tens of thousands of articles downloaded from PubMed, and then employed Naïve Bayes, Support Vector Machine (SVM), and Logistic Regression to build prediction models to mine the articles. The accuracy of predicting a cancer type by SVM using the 29,437 full-text articles was 93.81%. While competing text-mining tools took more than 11 hours, SparkText mined the dataset in approximately 6 minutes. CONCLUSIONS: This study demonstrates the potential for mining large-scale scientific articles on a Big Data infrastructure, with real-time update from new articles published daily. SparkText can be extended to other areas of biomedical research. 
Public Library of Science 2016-09-29 /pmc/articles/PMC5042555/ /pubmed/27685652 http://dx.doi.org/10.1371/journal.pone.0162721 Text en © 2016 Ye et al http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/) , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
spellingShingle Research Article
Ye, Zhan
Tafti, Ahmad P.
He, Karen Y.
Wang, Kai
He, Max M.
SparkText: Biomedical Text Mining on Big Data Framework
title SparkText: Biomedical Text Mining on Big Data Framework
title_full SparkText: Biomedical Text Mining on Big Data Framework
title_fullStr SparkText: Biomedical Text Mining on Big Data Framework
title_full_unstemmed SparkText: Biomedical Text Mining on Big Data Framework
title_short SparkText: Biomedical Text Mining on Big Data Framework
title_sort sparktext: biomedical text mining on big data framework
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5042555/
https://www.ncbi.nlm.nih.gov/pubmed/27685652
http://dx.doi.org/10.1371/journal.pone.0162721