
Topic modeling for cluster analysis of large biological and medical datasets

BACKGROUND: The big data moniker is nowhere better deserved than to describe the ever-increasing prodigiousness and complexity of biological and medical datasets. New methods are needed to generate and test hypotheses, foster biological interpretation, and build validated predictors. Although multiv...


Bibliographic Details
Main Authors:	Zhao, Weizhong; Zou, Wen; Chen, James J
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2014
Subjects:	Proceedings
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4251039/ https://www.ncbi.nlm.nih.gov/pubmed/25350106 http://dx.doi.org/10.1186/1471-2105-15-S11-S11

_version_	1782346992406495232
author	Zhao, Weizhong; Zou, Wen; Chen, James J
author_facet	Zhao, Weizhong; Zou, Wen; Chen, James J
author_sort	Zhao, Weizhong
collection	PubMed
description	BACKGROUND: The big data moniker is nowhere better deserved than to describe the ever-increasing prodigiousness and complexity of biological and medical datasets. New methods are needed to generate and test hypotheses, foster biological interpretation, and build validated predictors. Although multivariate techniques such as cluster analysis may allow researchers to identify groups, or clusters, of related variables, the accuracy and effectiveness of traditional clustering methods diminish for large and hyper dimensional datasets. Topic modeling is an active research field in machine learning and has been mainly used as an analytical tool to structure large textual corpora for data mining. Its ability to reduce high dimensionality to a small number of latent variables makes it suitable as a means for clustering or overcoming clustering difficulties in large biological and medical datasets. RESULTS: In this study, three topic model-derived clustering methods, highest probable topic assignment, feature selection and feature extraction, are proposed and tested on the cluster analysis of three large datasets: Salmonella pulsed-field gel electrophoresis (PFGE) dataset, lung cancer dataset, and breast cancer dataset, which represent various types of large biological or medical datasets. All three methods are shown to improve the efficacy/effectiveness of clustering results on the three datasets in comparison to traditional methods. A preferable cluster analysis method emerged for each of the three datasets on the basis of replicating known biological truths. CONCLUSION: Topic modeling could be advantageously applied to the large datasets of biological or medical research. The three proposed topic model-derived clustering methods, highest probable topic assignment, feature selection and feature extraction, yield clustering improvements for the three different data types. Clusters more efficaciously represent truthful groupings and subgroupings in the data than traditional methods, suggesting that topic model-based methods could provide an analytic advancement in the analysis of large biological or medical datasets.
format	Online Article Text
id	pubmed-4251039
institution	National Center for Biotechnology Information
language	English
publishDate	2014
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-4251039 2014-12-02 Topic modeling for cluster analysis of large biological and medical datasets Zhao, Weizhong Zou, Wen Chen, James J BMC Bioinformatics Proceedings BioMed Central 2014-10-21 /pmc/articles/PMC4251039/ /pubmed/25350106 http://dx.doi.org/10.1186/1471-2105-15-S11-S11 Text en Copyright © 2014 Zhao et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/4.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle	Proceedings Zhao, Weizhong Zou, Wen Chen, James J Topic modeling for cluster analysis of large biological and medical datasets
title	Topic modeling for cluster analysis of large biological and medical datasets
title_sort	topic modeling for cluster analysis of large biological and medical datasets
topic	Proceedings
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4251039/ https://www.ncbi.nlm.nih.gov/pubmed/25350106 http://dx.doi.org/10.1186/1471-2105-15-S11-S11

Similar Items