
Performance of random forest when SNPs are in linkage disequilibrium

BACKGROUND: Single nucleotide polymorphisms (SNPs) may be correlated due to linkage disequilibrium (LD). Association studies look for both direct and indirect associations with disease loci. In a Random Forest (RF) analysis, correlation between a true risk SNP and SNPs in LD may lead to diminished v...


Bibliographic Details
Main Authors:	Meng, Yan A; Yu, Yi; Cupples, L Adrienne; Farrer, Lindsay A; Lunetta, Kathryn L
Format:	Text
Language:	English
Published:	BioMed Central 2009
Subjects:	Methodology Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2666661/ https://www.ncbi.nlm.nih.gov/pubmed/19265542 http://dx.doi.org/10.1186/1471-2105-10-78

_version_	1782166062659272704
author	Meng, Yan A; Yu, Yi; Cupples, L Adrienne; Farrer, Lindsay A; Lunetta, Kathryn L
author_sort	Meng, Yan A
collection	PubMed
description	BACKGROUND: Single nucleotide polymorphisms (SNPs) may be correlated due to linkage disequilibrium (LD). Association studies look for both direct and indirect associations with disease loci. In a Random Forest (RF) analysis, correlation between a true risk SNP and SNPs in LD may lead to diminished variable importance for the true risk SNP. One approach to address this problem is to select SNPs in linkage equilibrium (LE) for analysis. Here, we explore alternative methods for dealing with SNPs in LD: change the tree-building algorithm by building each tree in an RF only with SNPs in LE, modify the importance measure (IM), and use haplotypes instead of SNPs to build a RF.
RESULTS: We evaluated the performance of our alternative methods by simulation of a spectrum of complex genetics models. When a haplotype rather than an individual SNP is the risk factor, we find that the original Random Forest method performed on SNPs provides good performance. When individual, genotyped SNPs are the risk factors, we find that the stronger the genetic effect, the stronger the effect LD has on the performance of the original RF. A revised importance measure used with the original RF is relatively robust to LD among SNPs; this revised importance measure used with the revised RF is sometimes inflated. Overall, we find that the revised importance measure used with the original RF is the best choice when the genetic model and the number of SNPs in LD with risk SNPs are unknown. For the haplotype-based method, under a multiplicative heterogeneity model, we observed a decrease in the performance of RF with increasing LD among the SNPs in the haplotype.
CONCLUSION: Our results suggest that by strategically revising the Random Forest method tree-building or importance measure calculation, power can increase when LD exists between SNPs. We conclude that the revised Random Forest method performed on SNPs offers an advantage of not requiring genotype phase, making it a viable tool for use in the context of thousands of SNPs, such as candidate gene studies and follow-up of top candidates from genome wide association studies.
format	Text
id	pubmed-2666661
institution	National Center for Biotechnology Information
language	English
publishDate	2009
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-2666661 2009-04-08 Performance of random forest when SNPs are in linkage disequilibrium Meng, Yan A; Yu, Yi; Cupples, L Adrienne; Farrer, Lindsay A; Lunetta, Kathryn L BMC Bioinformatics Methodology Article BioMed Central 2009-03-05 /pmc/articles/PMC2666661/ /pubmed/19265542 http://dx.doi.org/10.1186/1471-2105-10-78 Text en Copyright © 2009 Meng et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title	Performance of random forest when SNPs are in linkage disequilibrium
topic	Methodology Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2666661/ https://www.ncbi.nlm.nih.gov/pubmed/19265542 http://dx.doi.org/10.1186/1471-2105-10-78
