
Screening large-scale association study data: exploiting interactions using random forests

BACKGROUND: Genome-wide association studies for complex diseases will produce genotypes on hundreds of thousands of single nucleotide polymorphisms (SNPs). A logical first approach to dealing with massive numbers of SNPs is to use some test to screen the SNPs, retaining only those that meet some cri...


Bibliographic Details
Main Authors:	Lunetta, Kathryn L; Hayward, L Brooke; Segal, Jonathan; Van Eerdewegh, Paul
Format:	Text
Language:	English
Published:	BioMed Central 2004
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC545646/ https://www.ncbi.nlm.nih.gov/pubmed/15588316 http://dx.doi.org/10.1186/1471-2156-5-32

author	Lunetta, Kathryn L; Hayward, L Brooke; Segal, Jonathan; Van Eerdewegh, Paul
author_facet	Lunetta, Kathryn L; Hayward, L Brooke; Segal, Jonathan; Van Eerdewegh, Paul
author_sort	Lunetta, Kathryn L
collection	PubMed
description	BACKGROUND: Genome-wide association studies for complex diseases will produce genotypes on hundreds of thousands of single nucleotide polymorphisms (SNPs). A logical first approach to dealing with massive numbers of SNPs is to use some test to screen the SNPs, retaining only those that meet some criterion for further study. For example, SNPs can be ranked by p-value, and those with the lowest p-values retained. When SNPs have large interaction effects but small marginal effects in a population, they are unlikely to be retained when univariate tests are used for screening. However, model-based screens that pre-specify interactions are impractical for data sets with thousands of SNPs. Random forest analysis is an alternative method that produces a single measure of importance for each predictor variable that takes into account interactions among variables without requiring model specification. Interactions increase the importance for the individual interacting variables, making them more likely to be given high importance relative to other variables. We test the performance of random forests as a screening procedure to identify small numbers of risk-associated SNPs from among large numbers of unassociated SNPs using complex disease models with up to 32 loci, incorporating both genetic heterogeneity and multi-locus interaction.
RESULTS: Keeping other factors constant, if risk SNPs interact, the random forest importance measure significantly outperforms the Fisher Exact test as a screening tool. As the number of interacting SNPs increases, the improvement in performance of random forest analysis relative to Fisher Exact test for screening also increases. Random forests perform similarly to the univariate Fisher Exact test as a screening tool when SNPs in the analysis do not interact.
CONCLUSIONS: In the context of large-scale genetic association studies where unknown interactions exist among true risk-associated SNPs or SNPs and environmental covariates, screening SNPs using random forest analyses can significantly reduce the number of SNPs that need to be retained for further study compared to standard univariate screening methods.
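The univariate screen described in the abstract (rank SNPs by p-value, keep the lowest) can be sketched as follows. This is a minimal illustration and not the authors' code: the function names, the SNP names, and the toy data are hypothetical, genotypes are simplified to binary carrier status, and the Fisher exact test is written out by hand with the standard library. The toy data make disease status depend only on the interaction of two SNPs, so each has no marginal effect and the univariate screen drops both.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))


def screen_by_p_value(genotypes, is_case, keep):
    """Rank SNPs by Fisher exact p-value; return the `keep` best and all p-values."""
    p_values = {}
    for name, carriers in genotypes.items():
        rows = list(zip(carriers, is_case))
        a = sum(1 for g, y in rows if g and y)          # carriers, cases
        b = sum(1 for g, y in rows if g and not y)      # carriers, controls
        c = sum(1 for g, y in rows if not g and y)      # non-carriers, cases
        d = sum(1 for g, y in rows if not g and not y)  # non-carriers, controls
        p_values[name] = fisher_exact_two_sided(a, b, c, d)
    ranked = sorted(p_values, key=p_values.get)
    return ranked[:keep], p_values


# Hypothetical toy data: status is a pure interaction (XOR) of snp_a and snp_b,
# while snp_m tracks status directly and so has a strong marginal effect.
pairs = [(a, b) for a in (0, 1) for b in (0, 1)] * 4
is_case = [a ^ b for a, b in pairs]
genotypes = {
    "snp_a": [a for a, _ in pairs],
    "snp_b": [b for _, b in pairs],
    "snp_m": list(is_case),
}
top, p_values = screen_by_p_value(genotypes, is_case, keep=1)
```

Here `top` contains only `snp_m`; `snp_a` and `snp_b` jointly determine disease status yet rank last, which is the situation in which the abstract reports that random forest importance performs better as a screen.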
format	Text
id	pubmed-545646
institution	National Center for Biotechnology Information
language	English
publishDate	2004
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-545646 2005-01-27 Screening large-scale association study data: exploiting interactions using random forests Lunetta, Kathryn L; Hayward, L Brooke; Segal, Jonathan; Van Eerdewegh, Paul BMC Genet Research Article BioMed Central 2004-12-10 /pmc/articles/PMC545646/ /pubmed/15588316 http://dx.doi.org/10.1186/1471-2156-5-32 Text en Copyright © 2004 Lunetta et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title	Screening large-scale association study data: exploiting interactions using random forests
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC545646/ https://www.ncbi.nlm.nih.gov/pubmed/15588316 http://dx.doi.org/10.1186/1471-2156-5-32
