
Learning Predictive Interactions Using Information Gain and Bayesian Network Scoring

BACKGROUND: The problems of correlation and classification are long-standing in the fields of statistics and machine learning, and techniques have been developed to address these problems. We are now in the era of high-dimensional data, which is data that can concern billions of variables. These dat...


Bibliographic Details
Main authors:	Jiang, Xia; Jao, Jeremy; Neapolitan, Richard
Format:	Online Article Text
Language:	English
Published:	Public Library of Science 2015
Subjects:	Research Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4666609/ https://www.ncbi.nlm.nih.gov/pubmed/26624895 http://dx.doi.org/10.1371/journal.pone.0143247

author	Jiang, Xia; Jao, Jeremy; Neapolitan, Richard
collection	PubMed
description	BACKGROUND: The problems of correlation and classification are long-standing in the fields of statistics and machine learning, and techniques have been developed to address these problems. We are now in the era of high-dimensional data, that is, data that can involve billions of variables. These data present new challenges. In particular, it is difficult to discover predictive variables when each variable has little marginal effect. An example concerns Genome-wide Association Studies (GWAS) datasets, which involve millions of single nucleotide polymorphisms (SNPs), where some of the SNPs interact epistatically to affect disease status. To identify these interacting SNPs, researchers developed techniques that addressed this specific problem. However, the problem is more general, and so these techniques are applicable to other problems concerning interactions. A difficulty with many of these techniques is that they do not distinguish whether a learned interaction is actually an interaction or whether it involves several variables with strong marginal effects. METHODOLOGY/FINDINGS: We address this problem using information gain and Bayesian network scoring. First, we identify candidate interactions by determining whether variables together provide more information than they do separately. Then we use Bayesian network scoring to see if a candidate interaction really is a likely model. Our strategy is called MBS-IGain. Using 100 simulated datasets and a real GWAS Alzheimer’s dataset, we investigated the performance of MBS-IGain. CONCLUSIONS/SIGNIFICANCE: When analyzing the simulated datasets, MBS-IGain substantially outperformed nine previous methods at locating interacting predictors and at identifying interactions exactly. When analyzing the real Alzheimer’s dataset, we obtained new results and results that substantiated previous findings. We conclude that MBS-IGain is highly effective at finding interactions in high-dimensional datasets. This result is significant because we have increasingly abundant high-dimensional data in many domains, and to learn causes and perform prediction/classification using these data, we often must first identify interactions.
format	Online Article Text
id	pubmed-4666609
institution	National Center for Biotechnology Information
language	English
publishDate	2015
publisher	Public Library of Science
record_format	MEDLINE/PubMed
spelling	pubmed-4666609 2015-12-10 Learning Predictive Interactions Using Information Gain and Bayesian Network Scoring Jiang, Xia; Jao, Jeremy; Neapolitan, Richard PLoS One Research Article Public Library of Science 2015-12-01 /pmc/articles/PMC4666609/ /pubmed/26624895 http://dx.doi.org/10.1371/journal.pone.0143247 Text en © 2015 Jiang et al http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.
title	Learning Predictive Interactions Using Information Gain and Bayesian Network Scoring
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4666609/ https://www.ncbi.nlm.nih.gov/pubmed/26624895 http://dx.doi.org/10.1371/journal.pone.0143247
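
The abstract's first step, checking whether variables together provide more information than they do separately, can be illustrated with a small sketch. This is not the authors' MBS-IGain implementation and may not match the exact quantity the paper defines; it is a minimal illustration that assumes discrete variables and uses hypothetical helper names (`entropy`, `info_gain`, `interaction_gain`). It compares the joint information gain of two predictors about a target with the sum of their individual gains.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of discrete labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def info_gain(features, target):
    """IG(target; features) = H(target) - H(target | features).

    `features` holds one hashable value (e.g. a tuple) per sample.
    """
    n = len(target)
    groups = {}
    for f, t in zip(features, target):
        groups.setdefault(f, []).append(t)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(target) - conditional


def interaction_gain(a, b, target):
    """Joint gain of (a, b) minus the two marginal gains (illustrative only)."""
    joint = info_gain(list(zip(a, b)), target)
    return joint - info_gain(list(a), target) - info_gain(list(b), target)


# Toy data (made up for illustration): an XOR-style target has no marginal
# effects, so all of its information comes from the pair taken together.
a = [0, 0, 1, 1]
b = [0, 1, 0, 1]
xor_target = [0, 1, 1, 0]

# A target that simply copies `a` has a strong marginal effect and no
# additional information from combining `a` with `b`.
marginal_target = [0, 0, 1, 1]

print(interaction_gain(a, b, xor_target))
print(interaction_gain(a, b, marginal_target))
```

A large positive value suggests that the pair carries information beyond the individual variables, while a value near zero is what one would expect when the signal comes from marginal effects alone. The second step described in the abstract, Bayesian network scoring of the candidate model, is not covered by this sketch.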
