A model-based approach to selection of tag SNPs
BACKGROUND: Single Nucleotide Polymorphisms (SNPs) are the most common type of polymorphisms found in the human genome. Effective genetic association studies require the identification of sets of tag SNPs that capture as much haplotype information as possible. Tag SNP selection is analogous to the p...
Main authors: | Nicolas, Pierre; Sun, Fengzhu; Li, Lei M |
---|---|
Format: | Text |
Language: | English |
Published: | BioMed Central 2006 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1525207/ https://www.ncbi.nlm.nih.gov/pubmed/16776821 http://dx.doi.org/10.1186/1471-2105-7-303 |
_version_ | 1782128887886512128 |
---|---|
author | Nicolas, Pierre Sun, Fengzhu Li, Lei M |
author_facet | Nicolas, Pierre Sun, Fengzhu Li, Lei M |
author_sort | Nicolas, Pierre |
collection | PubMed |
description | BACKGROUND: Single Nucleotide Polymorphisms (SNPs) are the most common type of polymorphisms found in the human genome. Effective genetic association studies require the identification of sets of tag SNPs that capture as much haplotype information as possible. Tag SNP selection is analogous to the problem of data compression in information theory. According to Shannon's framework, the optimal tag set maximizes the entropy of the tag SNPs subject to constraints on the number of SNPs. This approach requires an appropriate probabilistic model. Compared to simple measures of Linkage Disequilibrium (LD), a good model of haplotype sequences can more accurately account for LD structure. It also provides a machinery for the prediction of tagged SNPs and thereby to assess the performances of tag sets through their ability to predict larger SNP sets. RESULTS: Here, we compute the description code-lengths of SNP data for an array of models and we develop tag SNP selection methods based on these models and the strategy of entropy maximization. Using data sets from the HapMap and ENCODE projects, we show that the hidden Markov model introduced by Li and Stephens outperforms the other models in several aspects: description code-length of SNP data, information content of tag sets, and prediction of tagged SNPs. This is the first use of this model in the context of tag SNP selection. CONCLUSION: Our study provides strong evidence that the tag sets selected by our best method, based on Li and Stephens model, outperform those chosen by several existing methods. The results also suggest that information content evaluated with a good model is more sensitive for assessing the quality of a tagging set than the correct prediction rate of tagged SNPs. Besides, we show that haplotype phase uncertainty has an almost negligible impact on the ability of good tag sets to predict tagged SNPs. This justifies the selection of tag SNPs on the basis of haplotype informativeness, although genotyping studies do not directly assess haplotypes. A software that implements our approach is available. |
format | Text |
id | pubmed-1525207 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2006 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-1525207 2006-08-07 A model-based approach to selection of tag SNPs Nicolas, Pierre Sun, Fengzhu Li, Lei M BMC Bioinformatics Methodology Article BACKGROUND: Single Nucleotide Polymorphisms (SNPs) are the most common type of polymorphisms found in the human genome. Effective genetic association studies require the identification of sets of tag SNPs that capture as much haplotype information as possible. Tag SNP selection is analogous to the problem of data compression in information theory. According to Shannon's framework, the optimal tag set maximizes the entropy of the tag SNPs subject to constraints on the number of SNPs. This approach requires an appropriate probabilistic model. Compared to simple measures of Linkage Disequilibrium (LD), a good model of haplotype sequences can more accurately account for LD structure. It also provides a machinery for the prediction of tagged SNPs and thereby to assess the performances of tag sets through their ability to predict larger SNP sets. RESULTS: Here, we compute the description code-lengths of SNP data for an array of models and we develop tag SNP selection methods based on these models and the strategy of entropy maximization. Using data sets from the HapMap and ENCODE projects, we show that the hidden Markov model introduced by Li and Stephens outperforms the other models in several aspects: description code-length of SNP data, information content of tag sets, and prediction of tagged SNPs. This is the first use of this model in the context of tag SNP selection. CONCLUSION: Our study provides strong evidence that the tag sets selected by our best method, based on Li and Stephens model, outperform those chosen by several existing methods. The results also suggest that information content evaluated with a good model is more sensitive for assessing the quality of a tagging set than the correct prediction rate of tagged SNPs. Besides, we show that haplotype phase uncertainty has an almost negligible impact on the ability of good tag sets to predict tagged SNPs. This justifies the selection of tag SNPs on the basis of haplotype informativeness, although genotyping studies do not directly assess haplotypes. A software that implements our approach is available. BioMed Central 2006-06-15 /pmc/articles/PMC1525207/ /pubmed/16776821 http://dx.doi.org/10.1186/1471-2105-7-303 Text en Copyright © 2006 Nicolas et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Methodology Article Nicolas, Pierre Sun, Fengzhu Li, Lei M A model-based approach to selection of tag SNPs |
title | A model-based approach to selection of tag SNPs |
topic | Methodology Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1525207/ https://www.ncbi.nlm.nih.gov/pubmed/16776821 http://dx.doi.org/10.1186/1471-2105-7-303 |
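
The description field states that the optimal tag set maximizes the entropy of the tag SNPs subject to a limit on their number. A minimal sketch of that idea follows, under loud assumptions: it uses plain empirical haplotype frequencies and a greedy search, not the Li and Stephens hidden Markov model or the authors' software, and the names `joint_entropy` and `greedy_tag_snps` and the toy data are hypothetical illustrations only.

```python
from collections import Counter
from math import log2


def joint_entropy(haplotypes, columns):
    """Empirical Shannon entropy (in bits) of haplotypes restricted to `columns`."""
    counts = Counter(tuple(h[c] for c in columns) for h in haplotypes)
    n = len(haplotypes)
    return -sum((k / n) * log2(k / n) for k in counts.values())


def greedy_tag_snps(haplotypes, n_tags):
    """Greedily add the SNP that most increases the joint entropy of the tag set.

    Assumption: a greedy heuristic standing in for exact entropy maximization.
    """
    n_snps = len(haplotypes[0])
    selected = []
    for _ in range(min(n_tags, n_snps)):
        remaining = [c for c in range(n_snps) if c not in selected]
        best = max(remaining, key=lambda c: joint_entropy(haplotypes, selected + [c]))
        selected.append(best)
    return sorted(selected)


# Toy phased haplotypes (rows) over 4 SNPs (columns), alleles coded 0/1.
# SNP 1 duplicates SNP 0 and SNP 3 is monomorphic, so neither adds information.
haplotypes = [
    (0, 0, 0, 0),
    (0, 0, 1, 0),
    (1, 1, 0, 0),
    (1, 1, 1, 0),
]
tags = greedy_tag_snps(haplotypes, 2)
print(tags, joint_entropy(haplotypes, tags))
```

On this toy input the two selected SNPs distinguish all four haplotypes, which is why a redundant or monomorphic SNP is never preferred. A model-based method as described in the record would replace the empirical counts with probabilities from a fitted haplotype model.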