
An application of Random Forests to a genome-wide association dataset: Methodological considerations & new findings


Bibliographic Details
Main authors:	Goldstein, Benjamin A; Hubbard, Alan E; Cutler, Adele; Barcellos, Lisa F
Format:	Text
Language:	English
Published:	BioMed Central 2010
Subjects:	Research article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2896336/ https://www.ncbi.nlm.nih.gov/pubmed/20546594 http://dx.doi.org/10.1186/1471-2156-11-49

author	Goldstein, Benjamin A; Hubbard, Alan E; Cutler, Adele; Barcellos, Lisa F
collection	PubMed
description	BACKGROUND: As computational power improves, the application of more advanced machine learning techniques to the analysis of large genome-wide association (GWA) datasets becomes possible. While most traditional statistical methods can only elucidate main effects of genetic variants on risk for disease, certain machine learning approaches are particularly suited to discovering higher-order and non-linear effects. One such approach is the Random Forests (RF) algorithm. The use of RF for SNP discovery related to human disease has grown in recent years; however, most work has focused on small datasets or simulation studies, which are limited. RESULTS: Using a multiple sclerosis (MS) case-control dataset comprising 300 K SNP genotypes across the genome, we outline an approach and some considerations for optimally tuning the RF algorithm based on the empirical dataset. Importantly, results show that typical default parameter values are not appropriate for large GWA datasets. Furthermore, gains can be made by sub-sampling the data, pruning based on linkage disequilibrium (LD), and removing strong effects from RF analyses. The new RF results are compared to findings from the original MS GWA study and demonstrate overlap. In addition, RF analysis identifies four interesting new candidate MS genes (MPHOSPH9, CTNNA3, PHACTR2 and IL7) that warrant further follow-up in independent studies. CONCLUSIONS: This study presents one of the first illustrations of successfully analyzing GWA data with a machine learning algorithm. It is shown that RF is computationally feasible for GWA data and that the results obtained make biological sense based on previous studies. More importantly, new genes were identified as potentially being associated with MS, suggesting new avenues of investigation for this complex disease.
format	Text
id	pubmed-2896336
institution	National Center for Biotechnology Information
language	English
publishDate	2010
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-2896336 2010-07-03 BMC Genet Research article BioMed Central 2010-06-14 /pmc/articles/PMC2896336/ /pubmed/20546594 http://dx.doi.org/10.1186/1471-2156-11-49 Text en Copyright ©2010 Goldstein et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title	An application of Random Forests to a genome-wide association dataset: Methodological considerations & new findings
topic	Research article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2896336/ https://www.ncbi.nlm.nih.gov/pubmed/20546594 http://dx.doi.org/10.1186/1471-2156-11-49
