Exploiting SNP Correlations within Random Forest for Genome-Wide Association Studies

The primary goal of genome-wide association studies (GWAS) is to discover variants that could lead, in isolation or in combination, to a particular trait or disease. Standard approaches to GWAS, however, are usually based on univariate hypothesis tests and therefore can account neither for correlati...

Bibliographic Details
Main authors: Botta, Vincent; Louppe, Gilles; Geurts, Pierre; Wehenkel, Louis
Format: Online Article Text
Language: English
Published: Public Library of Science 2014
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3973686/
https://www.ncbi.nlm.nih.gov/pubmed/24695491
http://dx.doi.org/10.1371/journal.pone.0093379
Collection: PubMed
Description: The primary goal of genome-wide association studies (GWAS) is to discover variants that could lead, in isolation or in combination, to a particular trait or disease. Standard approaches to GWAS, however, are usually based on univariate hypothesis tests and therefore can account neither for correlations due to linkage disequilibrium nor for combinations of several markers. To discover and leverage such potential multivariate interactions, we propose in this work an extension of the Random Forest algorithm tailored for structured GWAS data. In terms of risk prediction, we show empirically on several GWAS datasets that the proposed T-Trees method significantly outperforms both the original Random Forest algorithm and standard linear models, thereby suggesting the actual existence of multivariate non-linear effects due to the combinations of several SNPs. We also demonstrate that variable importances as derived from our method can help identify relevant loci. Finally, we highlight the strong impact that quality control procedures may have, both in terms of predictive power and loci identification. Variable importance results and T-Trees source code are all available at www.montefiore.ulg.ac.be/~botta/ttrees/ and github.com/0asa/TTree-source respectively.
Record ID: pubmed-3973686
Institution: National Center for Biotechnology Information
Record format: MEDLINE/PubMed
Journal: PLoS One
Article type: Research Article
Publication date: 2014-04-02
License: © 2014 Botta et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.