
A novel procedure on next generation sequencing data analysis using text mining algorithm

BACKGROUND: Next-generation sequencing (NGS) technologies have provided researchers with vast possibilities in various biological and biomedical research areas. Efficient data mining strategies are in high demand for large-scale comparative and evolutionary studies to be performed on the large amount...


Bibliographic Details
Main Authors: Zhao, Weizhong; Chen, James J.; Perkins, Roger; Wang, Yuping; Liu, Zhichao; Hong, Huixiao; Tong, Weida; Zou, Wen
Format: Online Article Text
Language: English
Published: BioMed Central 2016
Subjects: Research Article
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4866036/
https://www.ncbi.nlm.nih.gov/pubmed/27177941
http://dx.doi.org/10.1186/s12859-016-1075-9
_version_ 1782431876825219072
author Zhao, Weizhong
Chen, James J.
Perkins, Roger
Wang, Yuping
Liu, Zhichao
Hong, Huixiao
Tong, Weida
Zou, Wen
collection PubMed
description BACKGROUND: Next-generation sequencing (NGS) technologies have provided researchers with vast possibilities in various biological and biomedical research areas. Efficient data mining strategies are in high demand for large-scale comparative and evolutionary studies to be performed on the large amounts of data derived from NGS projects. Topic modeling is an active research field in machine learning and has been mainly used as an analytical tool to structure large textual corpora for data mining. METHODS: We report a novel procedure to analyse NGS data using topic modeling. It consists of four major steps: NGS data retrieval, preprocessing, topic modeling, and data mining using Latent Dirichlet Allocation (LDA) topic outputs. The NGS data set of Salmonella enterica strains was used as a case study to show the workflow of this procedure. Perplexity as a function of the number of topics and the convergence efficiency of Gibbs sampling were calculated and discussed in order to obtain the best result from the proposed procedure. RESULTS: The topics output by the LDA algorithm could be treated as features of Salmonella strains to accurately describe the genetic diversity of the fliC gene in various serotypes. A two-way hierarchical clustering and data matrix analysis of the LDA-derived matrices successfully classified Salmonella serotypes based on the NGS data. CONCLUSION: The implementation of topic modeling in NGS data analysis provides a new way to elucidate genetic information from NGS data and to identify gene-phenotype relationships and biomarkers, especially in the era of biological and medical big data.
ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-016-1075-9) contains supplementary material, which is available to authorized users.
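The workflow named in the abstract (preprocessing, LDA topic modeling with Gibbs sampling, and a perplexity check) can be illustrated with a minimal sketch. This is not the authors' implementation. The k-mer tokenization, the hyperparameters alpha and beta, the function names, and the toy sequences are all assumptions made for illustration, since the record does not say how sequences were turned into "words".

```python
import math
import random


def kmers(seq, k=3):
    """Split a sequence into overlapping k-mers, used here as 'words' (an assumption)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def lda_gibbs(docs, n_topics, vocab_size, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling; docs is a list of lists of word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * n_topics for _ in docs]                # n(d, t)
    topic_word = [[0] * vocab_size for _ in range(n_topics)]  # n(t, w)
    topic_total = [0] * n_topics                              # n(t)
    assign = []
    # Random initial topic for every token.
    for d, doc in enumerate(docs):
        zs = [rng.randrange(n_topics) for _ in doc]
        for w, z in zip(doc, zs):
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(zs)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assign[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Full conditional p(z = t | all other assignments), up to a constant.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    # Smoothed point estimates from the final sample.
    theta = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in row]
        for row, doc in zip(doc_topic, docs)
    ]
    phi = [
        [(c + beta) / (topic_total[t] + vocab_size * beta) for c in topic_word[t]]
        for t in range(n_topics)
    ]
    return theta, phi


def perplexity(docs, theta, phi):
    """exp(-mean log-likelihood per token) on the given documents; lower is better."""
    log_lik, n_tokens = 0.0, 0
    for d, doc in enumerate(docs):
        for w in doc:
            p = sum(theta[d][t] * phi[t][w] for t in range(len(phi)))
            log_lik += math.log(p)
            n_tokens += 1
    return math.exp(-log_lik / n_tokens)


# Toy sequences invented for illustration; they are not data from the article.
toy = {
    "strain_A": "ATGGCACAAGTCATTAATACAAACAGC",
    "strain_B": "ATGGCACAAGTCATTAATACTAACAGC",
    "strain_C": "TTGACCGGTTCAGGTACCGGTTCAGGA",
}
tokens = {name: kmers(seq) for name, seq in toy.items()}
vocab = sorted({w for ws in tokens.values() for w in ws})
word_id = {w: i for i, w in enumerate(vocab)}
docs = [[word_id[w] for w in ws] for ws in tokens.values()]

theta, phi = lda_gibbs(docs, n_topics=2, vocab_size=len(vocab))
for name, row in zip(toy, theta):
    print(name, [round(p, 3) for p in row])
print("perplexity:", round(perplexity(docs, theta, phi), 2))
```

Each row of theta is a per-strain topic mixture that could serve as a feature vector for clustering, in the spirit of the LDA-derived matrices the abstract mentions. Repeating the fit for several values of n_topics and comparing perplexity is one common way to choose the number of topics. In practice perplexity is usually computed on held-out data, whereas this sketch evaluates it on the training documents.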
format Online
Article
Text
id pubmed-4866036
institution National Center for Biotechnology Information
language English
publishDate 2016
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-4866036 2016-05-23 A novel procedure on next generation sequencing data analysis using text mining algorithm. Zhao, Weizhong; Chen, James J.; Perkins, Roger; Wang, Yuping; Liu, Zhichao; Hong, Huixiao; Tong, Weida; Zou, Wen. BMC Bioinformatics. Research Article. BioMed Central 2016-05-13 /pmc/articles/PMC4866036/ /pubmed/27177941 http://dx.doi.org/10.1186/s12859-016-1075-9 Text en © Zhao et al. 2016 Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title A novel procedure on next generation sequencing data analysis using text mining algorithm
topic Research Article