
An EM algorithm based on an internal list for estimating haplotype distributions of rare variants from pooled genotype data

BACKGROUND: Pooling is a cost-effective way to collect data for genetic association studies, particularly for rare genetic variants. It is of interest to estimate the haplotype frequencies, which contain more information than single-locus statistics. By viewing the pooled genotype data as incomplete...


Bibliographic Details
Main Authors: Kuk, Anthony YC, Li, Xiang, Xu, Jinfeng
Format: Online Article Text
Language: English
Published: BioMed Central 2013
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3847674/
https://www.ncbi.nlm.nih.gov/pubmed/24034507
http://dx.doi.org/10.1186/1471-2156-14-82
author Kuk, Anthony YC
Li, Xiang
Xu, Jinfeng
collection PubMed
description BACKGROUND: Pooling is a cost-effective way to collect data for genetic association studies, particularly for rare genetic variants. It is of interest to estimate the haplotype frequencies, which contain more information than single-locus statistics. When the pooled genotype data are viewed as incomplete data, the expectation-maximization (EM) algorithm is the natural algorithm to use, but it is computationally intensive. A recent proposal to reduce the computational burden is to use database information to form a list of frequently occurring haplotypes, and to restrict the EM algorithm to haplotypes from this list. There is, however, the danger of using an incorrect list, and in some applications there may not be enough database information to form a list externally. RESULTS: We investigate the possibility of creating an internal list from the data at hand. One way to form such a list is to collapse the observed total minor allele frequencies to “zero” or “at least one”, which is shown to have the desirable effect of amplifying the haplotype frequencies. To improve coverage, we propose ways to add and remove haplotypes from the list, and a benchmarking method to determine the frequency threshold for removing haplotypes. Simulation results show that the EM estimates based on a suitably augmented and trimmed collapsed data list (ATCDL) perform satisfactorily. In two scenarios involving 25 and 32 loci respectively, the EM-ATCDL estimates outperform the EM estimates based on other lists as well as the collapsed data maximum likelihood estimates. CONCLUSIONS: The proposed augmented and trimmed CD list is a useful list on which to base the EM algorithm when estimating the haplotype distributions of rare variants. It can handle more markers and larger pool sizes than existing methods, and the resulting EM-ATCDL estimates are more efficient than the EM estimates based on other lists.
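The list-restricted EM idea in the description can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`collapse`, `compatible_configs`, `multinomial_weight`, `em_pooled`), the brute-force enumeration of pool configurations, and the hand-picked candidate list are all assumptions made for illustration, and the augmenting, trimming, and benchmarking steps of ATCDL are not shown. Each pool is represented by its per-locus total minor allele counts, and each haplotype by a 0/1 tuple.

```python
from collections import Counter
from itertools import combinations_with_replacement
from math import factorial, prod


def collapse(pool_counts):
    """Collapse per-locus minor allele totals to 0 ("zero") or 1 ("at least one")."""
    return tuple(min(c, 1) for c in pool_counts)


def compatible_configs(pool_counts, hap_list, n_haps):
    """Enumerate multisets of n_haps listed haplotypes whose per-locus sums
    match the observed pool totals (brute force; small inputs only)."""
    target = tuple(pool_counts)
    configs = []
    for combo in combinations_with_replacement(range(len(hap_list)), n_haps):
        sums = tuple(sum(hap_list[h][j] for h in combo) for j in range(len(target)))
        if sums == target:
            configs.append(Counter(combo))
    return configs


def multinomial_weight(config, freqs):
    """Multinomial probability of one haplotype configuration given freqs."""
    coef = factorial(sum(config.values()))
    for count in config.values():
        coef //= factorial(count)
    return coef * prod(freqs[h] ** count for h, count in config.items())


def em_pooled(pools, hap_list, n_haps, n_iter=200, tol=1e-10):
    """EM estimate of haplotype frequencies, restricted to hap_list.

    pools    : list of per-locus minor allele totals, one tuple per pool
    hap_list : candidate haplotypes as 0/1 tuples (hypothetical list)
    n_haps   : haplotypes per pool (2 x number of diploid individuals)
    """
    m = len(hap_list)
    freqs = [1.0 / m] * m  # uniform start
    configs_per_pool = [compatible_configs(p, hap_list, n_haps) for p in pools]
    for _ in range(n_iter):
        expected = [0.0] * m
        for configs in configs_per_pool:
            weights = [multinomial_weight(c, freqs) for c in configs]
            total = sum(weights)
            if total == 0:
                continue  # pool cannot be explained by the current list
            # E-step: expected haplotype counts under the posterior over configs
            for config, w in zip(configs, weights):
                for h, count in config.items():
                    expected[h] += count * w / total
        norm = sum(expected)
        if norm == 0:
            raise ValueError("no pool is compatible with the haplotype list")
        # M-step: normalise expected counts into frequencies
        new_freqs = [e / norm for e in expected]
        converged = max(abs(a - b) for a, b in zip(new_freqs, freqs)) < tol
        freqs = new_freqs
        if converged:
            break
    return freqs


# Toy example: 2 loci, pools of one diploid individual (2 haplotypes each).
toy_list = [(0, 0), (1, 0), (0, 1)]
toy_pools = [(0, 0), (0, 0), (1, 0), (0, 1)]
print(em_pooled(toy_pools, toy_list, n_haps=2))  # [0.75, 0.125, 0.125]
```

In the toy example every pool has exactly one compatible configuration, so the estimate equals the direct haplotype proportions. The enumeration grows combinatorially with the list length and pool size, which is one way to see why restricting the EM algorithm to a short list matters.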
format Online
Article
Text
id pubmed-3847674
institution National Center for Biotechnology Information
language English
publishDate 2013
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-3847674 2013-12-05 BMC Genet Methodology Article BioMed Central 2013-09-13 /pmc/articles/PMC3847674/ /pubmed/24034507 http://dx.doi.org/10.1186/1471-2156-14-82 Text en Copyright © 2013 Kuk et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title An EM algorithm based on an internal list for estimating haplotype distributions of rare variants from pooled genotype data
topic Methodology Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3847674/
https://www.ncbi.nlm.nih.gov/pubmed/24034507
http://dx.doi.org/10.1186/1471-2156-14-82