
A Novel Information-Theoretic Approach for Variable Clustering and Predictive Modeling Using Dirichlet Process Mixtures

In the era of big data, there is increasing interest in clustering variables to minimize data redundancy and maximize variable relevancy. Existing clustering methods, however, depend on nontrivial assumptions about the data structure. Note that nonlinear interdependence amon...
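The full abstract in this record proposes mutual information as the measure of nonlinear dependence among variables. The following is a minimal sketch of why that measure helps, not the authors' implementation: it assumes a simple histogram (plug-in) estimator, and the function name `mutual_information`, the bin count, and the synthetic data are all hypothetical choices made for illustration.

```python
import numpy as np


def mutual_information(x, y, bins=16):
    """Histogram (plug-in) estimate of mutual information, in nats.

    Assumption: this simple estimator is used only for illustration;
    the record does not say which estimator the paper uses.
    """
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()              # joint probabilities
    px = pxy.sum(axis=1, keepdims=True)    # marginal of x, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)    # marginal of y, shape (1, bins)
    nz = pxy > 0                           # skip empty cells to avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))


rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 5000)
y_dep = x**2 + 0.05 * rng.normal(size=x.size)   # nonlinear dependence on x
y_ind = rng.uniform(-1.0, 1.0, 5000)            # independent of x

corr_dep = np.corrcoef(x, y_dep)[0, 1]   # linear correlation misses the link
mi_dep = mutual_information(x, y_dep)    # mutual information detects it
mi_ind = mutual_information(x, y_ind)    # close to zero for independent data

print(f"Pearson r(x, x^2 + noise) = {corr_dep:.3f}")
print(f"MI(x, x^2 + noise)        = {mi_dep:.3f} nats")
print(f"MI(x, independent)        = {mi_ind:.3f} nats")
```

In this toy setting the Pearson correlation is near zero even though the second variable is almost a deterministic function of the first, while the mutual-information estimate is clearly larger than for the independent pair. A pairwise matrix of such estimates is the kind of input a variable-clustering step could work from.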


Bibliographic Details
Main Authors: Chen, Yun, Yang, Hui
Format: Online Article Text
Language: English
Published: Nature Publishing Group 2016
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5155267/
https://www.ncbi.nlm.nih.gov/pubmed/27966581
http://dx.doi.org/10.1038/srep38913
_version_ 1782474973207592960
author Chen, Yun
Yang, Hui
author_facet Chen, Yun
Yang, Hui
author_sort Chen, Yun
collection PubMed
description In the era of big data, there is increasing interest in clustering variables to minimize data redundancy and maximize variable relevancy. Existing clustering methods, however, depend on nontrivial assumptions about the data structure. Note that nonlinear interdependence among variables poses significant challenges to the traditional framework of predictive modeling. In the present work, we reformulate the problem of variable clustering from an information-theoretic perspective that does not require assumptions about the data structure to identify nonlinear interdependence among variables. Specifically, we propose the use of mutual information to characterize and measure nonlinear correlation structures among variables. Further, we develop Dirichlet process (DP) models to cluster variables based on the mutual-information measures among variables. Finally, orthonormalized variables in each cluster are integrated with a group elastic-net model to improve the performance of predictive modeling. Both simulation and real-world case studies showed that the proposed methodology not only effectively reveals the nonlinear interdependence structures among variables but also outperforms traditional variable clustering algorithms such as hierarchical clustering.
format Online
Article
Text
id pubmed-5155267
institution National Center for Biotechnology Information
language English
publishDate 2016
publisher Nature Publishing Group
record_format MEDLINE/PubMed
spelling pubmed-51552672016-12-28 A Novel Information-Theoretic Approach for Variable Clustering and Predictive Modeling Using Dirichlet Process Mixtures Chen, Yun Yang, Hui Sci Rep Article In the era of big data, there is increasing interest in clustering variables to minimize data redundancy and maximize variable relevancy. Existing clustering methods, however, depend on nontrivial assumptions about the data structure. Note that nonlinear interdependence among variables poses significant challenges to the traditional framework of predictive modeling. In the present work, we reformulate the problem of variable clustering from an information-theoretic perspective that does not require assumptions about the data structure to identify nonlinear interdependence among variables. Specifically, we propose the use of mutual information to characterize and measure nonlinear correlation structures among variables. Further, we develop Dirichlet process (DP) models to cluster variables based on the mutual-information measures among variables. Finally, orthonormalized variables in each cluster are integrated with a group elastic-net model to improve the performance of predictive modeling. Both simulation and real-world case studies showed that the proposed methodology not only effectively reveals the nonlinear interdependence structures among variables but also outperforms traditional variable clustering algorithms such as hierarchical clustering. Nature Publishing Group 2016-12-14 /pmc/articles/PMC5155267/ /pubmed/27966581 http://dx.doi.org/10.1038/srep38913 Text en Copyright © 2016, The Author(s) http://creativecommons.org/licenses/by/4.0/ This work is licensed under a Creative Commons Attribution 4.0 International License.
The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in the credit line; if the material is not included under the Creative Commons license, users will need to obtain permission from the license holder to reproduce the material. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/
spellingShingle Article
Chen, Yun
Yang, Hui
A Novel Information-Theoretic Approach for Variable Clustering and Predictive Modeling Using Dirichlet Process Mixtures
title A Novel Information-Theoretic Approach for Variable Clustering and Predictive Modeling Using Dirichlet Process Mixtures
title_full A Novel Information-Theoretic Approach for Variable Clustering and Predictive Modeling Using Dirichlet Process Mixtures
title_fullStr A Novel Information-Theoretic Approach for Variable Clustering and Predictive Modeling Using Dirichlet Process Mixtures
title_full_unstemmed A Novel Information-Theoretic Approach for Variable Clustering and Predictive Modeling Using Dirichlet Process Mixtures
title_short A Novel Information-Theoretic Approach for Variable Clustering and Predictive Modeling Using Dirichlet Process Mixtures
title_sort novel information-theoretic approach for variable clustering and predictive modeling using dirichlet process mixtures
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5155267/
https://www.ncbi.nlm.nih.gov/pubmed/27966581
http://dx.doi.org/10.1038/srep38913
work_keys_str_mv AT chenyun anovelinformationtheoreticapproachforvariableclusteringandpredictivemodelingusingdirichletprocessmixtures
AT yanghui anovelinformationtheoreticapproachforvariableclusteringandpredictivemodelingusingdirichletprocessmixtures
AT chenyun novelinformationtheoreticapproachforvariableclusteringandpredictivemodelingusingdirichletprocessmixtures
AT yanghui novelinformationtheoreticapproachforvariableclusteringandpredictivemodelingusingdirichletprocessmixtures