
Conditional Random Fields for Fast, Large-Scale Genome-Wide Association Studies

Understanding the role of genetic variation in human diseases remains an important problem to be solved in genomics. An important component of such variation consists of variations at single sites in DNA, or single nucleotide polymorphisms (SNPs). Typically, the problem of associating particular SNPs...


Bibliographic Details
Main authors: Huang, Jim C., Meek, Christopher, Kadie, Carl, Heckerman, David
Format: Online Article Text
Language: English
Published: Public Library of Science 2011
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3134455/
https://www.ncbi.nlm.nih.gov/pubmed/21765897
http://dx.doi.org/10.1371/journal.pone.0021591
author Huang, Jim C.
Meek, Christopher
Kadie, Carl
Heckerman, David
collection PubMed
description Understanding the role of genetic variation in human diseases remains an important problem to be solved in genomics. An important component of such variation consists of variations at single sites in DNA, or single nucleotide polymorphisms (SNPs). Typically, the problem of associating particular SNPs with phenotypes has been confounded by hidden factors such as the presence of population structure, family structure or cryptic relatedness in the sample of individuals being analyzed. Such confounding factors lead to a large number of spurious associations and missed associations. Various statistical methods have been proposed to account for such confounding factors, such as linear mixed-effect models (LMMs) or methods that adjust data based on a principal components analysis (PCA), but these methods either suffer from low power or cease to be tractable for larger numbers of individuals in the sample. Here we present a statistical model for conducting genome-wide association studies (GWAS) that accounts for such confounding factors. The runtime of our method scales quadratically with the number of individuals being studied, with only a modest loss in statistical power as compared to LMM-based and PCA-based methods when testing on synthetic data that was generated from a generalized LMM. Applying our method to both real and synthetic human genotype/phenotype data, we demonstrate the ability of our model to correct for confounding factors while requiring significantly less runtime relative to LMMs. We have implemented methods for fitting these models, which are available at http://www.microsoft.com/science.
format Online
Article
Text
id pubmed-3134455
institution National Center for Biotechnology Information
language English
publishDate 2011
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-3134455 2011-07-15 Conditional Random Fields for Fast, Large-Scale Genome-Wide Association Studies Huang, Jim C. Meek, Christopher Kadie, Carl Heckerman, David PLoS One Research Article Understanding the role of genetic variation in human diseases remains an important problem to be solved in genomics. An important component of such variation consists of variations at single sites in DNA, or single nucleotide polymorphisms (SNPs). Typically, the problem of associating particular SNPs with phenotypes has been confounded by hidden factors such as the presence of population structure, family structure or cryptic relatedness in the sample of individuals being analyzed. Such confounding factors lead to a large number of spurious associations and missed associations. Various statistical methods have been proposed to account for such confounding factors, such as linear mixed-effect models (LMMs) or methods that adjust data based on a principal components analysis (PCA), but these methods either suffer from low power or cease to be tractable for larger numbers of individuals in the sample. Here we present a statistical model for conducting genome-wide association studies (GWAS) that accounts for such confounding factors. The runtime of our method scales quadratically with the number of individuals being studied, with only a modest loss in statistical power as compared to LMM-based and PCA-based methods when testing on synthetic data that was generated from a generalized LMM. Applying our method to both real and synthetic human genotype/phenotype data, we demonstrate the ability of our model to correct for confounding factors while requiring significantly less runtime relative to LMMs. We have implemented methods for fitting these models, which are available at http://www.microsoft.com/science. Public Library of Science 2011-07-12 /pmc/articles/PMC3134455/ /pubmed/21765897 http://dx.doi.org/10.1371/journal.pone.0021591 Text en Huang et al.
http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.
title Conditional Random Fields for Fast, Large-Scale Genome-Wide Association Studies
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3134455/
https://www.ncbi.nlm.nih.gov/pubmed/21765897
http://dx.doi.org/10.1371/journal.pone.0021591