
A model-based approach to selection of tag SNPs

BACKGROUND: Single Nucleotide Polymorphisms (SNPs) are the most common type of polymorphisms found in the human genome. Effective genetic association studies require the identification of sets of tag SNPs that capture as much haplotype information as possible. Tag SNP selection is analogous to the p...


Bibliographic Details
Main Authors: Nicolas, Pierre; Sun, Fengzhu; Li, Lei M
Format: Text
Language: English
Published: BioMed Central 2006
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1525207/
https://www.ncbi.nlm.nih.gov/pubmed/16776821
http://dx.doi.org/10.1186/1471-2105-7-303
author Nicolas, Pierre
Sun, Fengzhu
Li, Lei M
collection PubMed
description BACKGROUND: Single Nucleotide Polymorphisms (SNPs) are the most common type of polymorphism found in the human genome. Effective genetic association studies require the identification of sets of tag SNPs that capture as much haplotype information as possible. Tag SNP selection is analogous to the problem of data compression in information theory. According to Shannon's framework, the optimal tag set maximizes the entropy of the tag SNPs subject to constraints on the number of SNPs. This approach requires an appropriate probabilistic model. Compared to simple measures of Linkage Disequilibrium (LD), a good model of haplotype sequences can more accurately account for LD structure. It also provides machinery for predicting tagged SNPs, and thereby for assessing the performance of tag sets through their ability to predict larger SNP sets. RESULTS: Here, we compute the description code-lengths of SNP data for an array of models, and we develop tag SNP selection methods based on these models and the strategy of entropy maximization. Using data sets from the HapMap and ENCODE projects, we show that the hidden Markov model introduced by Li and Stephens outperforms the other models in several aspects: description code-length of SNP data, information content of tag sets, and prediction of tagged SNPs. This is the first use of this model in the context of tag SNP selection. CONCLUSION: Our study provides strong evidence that the tag sets selected by our best method, based on the Li and Stephens model, outperform those chosen by several existing methods. The results also suggest that information content evaluated with a good model is a more sensitive measure of the quality of a tag set than the correct prediction rate of tagged SNPs. In addition, we show that haplotype phase uncertainty has an almost negligible impact on the ability of good tag sets to predict tagged SNPs. This justifies the selection of tag SNPs on the basis of haplotype informativeness, although genotyping studies do not directly assess haplotypes. Software implementing our approach is available.
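The entropy-maximization idea in the abstract can be illustrated with a small sketch. This is not the authors' method: it replaces the Li and Stephens hidden Markov model with plain empirical haplotype frequencies, and it uses a greedy forward search, which is an assumption made here for brevity. The function names `joint_entropy` and `greedy_tag_snps` and the toy data are hypothetical.

```python
from collections import Counter
from math import log2


def joint_entropy(haplotypes, sites):
    """Empirical Shannon entropy (in bits) of the haplotypes restricted to `sites`."""
    counts = Counter(tuple(h[i] for i in sites) for h in haplotypes)
    n = len(haplotypes)
    return -sum(c / n * log2(c / n) for c in counts.values())


def greedy_tag_snps(haplotypes, k):
    """Greedily add the SNP that most increases joint entropy until k tags are chosen.

    Assumption: empirical frequencies stand in for a probabilistic haplotype model,
    and greedy search stands in for whatever optimization the paper uses.
    """
    n_sites = len(haplotypes[0])
    tags = []
    for _ in range(min(k, n_sites)):
        best = max(
            (s for s in range(n_sites) if s not in tags),
            key=lambda s: joint_entropy(haplotypes, tags + [s]),
        )
        tags.append(best)
    return tags


# Made-up phased haplotypes (rows) over 4 biallelic SNPs (columns).
# SNPs 0 and 1 are in perfect LD, as are SNPs 2 and 3.
haplotypes = [
    (0, 0, 0, 0),
    (0, 0, 1, 1),
    (1, 1, 0, 0),
    (1, 1, 1, 1),
]
tags = greedy_tag_snps(haplotypes, 2)
print(tags, joint_entropy(haplotypes, tags))
```

In this toy example the two selected tags carry the same entropy as all four SNPs together, because each tag stands in for a SNP in perfect LD with it. With real data, a model-based entropy would replace the empirical counts, which become unreliable as the number of sites grows.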
format Text
id pubmed-1525207
institution National Center for Biotechnology Information
language English
publishDate 2006
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-1525207 2006-08-07 A model-based approach to selection of tag SNPs Nicolas, Pierre Sun, Fengzhu Li, Lei M BMC Bioinformatics Methodology Article BioMed Central 2006-06-15 /pmc/articles/PMC1525207/ /pubmed/16776821 http://dx.doi.org/10.1186/1471-2105-7-303 Text en Copyright © 2006 Nicolas et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title A model-based approach to selection of tag SNPs
topic Methodology Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1525207/
https://www.ncbi.nlm.nih.gov/pubmed/16776821
http://dx.doi.org/10.1186/1471-2105-7-303