Haplotype frequency estimation error analysis in the presence of missing genotype data

BACKGROUND: Increasingly, researchers are turning to haplotype analysis as a tool in population studies, the investigation of linkage disequilibrium, and candidate gene analysis. When the phase of the data is unknown, computational methods, in particular those employing the Expectation-Max...

Bibliographic Details
Main authors:	Kelly, Enda D; Sievers, Fabian; McManus, Ross
Format:	Text
Language:	English
Published:	BioMed Central 2004
Subjects:	Research Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC544188/ https://www.ncbi.nlm.nih.gov/pubmed/15574202 http://dx.doi.org/10.1186/1471-2105-5-188

collection	PubMed
description	BACKGROUND: Increasingly, researchers are turning to haplotype analysis as a tool in population studies, the investigation of linkage disequilibrium, and candidate gene analysis. When the phase of the data is unknown, computational methods, in particular those employing the Expectation-Maximisation (EM) algorithm, are frequently used for estimating the phase and frequency of the underlying haplotypes. These methods have proved very successful, predicting the phase-known frequencies from data for which the phase is unknown with a high degree of accuracy. Recently there has been much speculation as to the effect of unknown or missing allelic data – a common phenomenon even with modern automated DNA analysis techniques – on the performance of EM-based methods. To this end, an EM-based program, modified to accommodate missing data, has been developed, incorporating non-parametric bootstrapping for the calculation of accurate confidence intervals. RESULTS: Here we present the results of the analyses of various data sets in which randomly selected known alleles have been relabelled as missing. Remarkably, we find that the absence of up to 30% of the data in both biallelic and multiallelic data sets with moderate to strong levels of linkage disequilibrium can be tolerated. Additionally, the frequencies of haplotypes which predominate in the complete data analysis remain essentially the same after the addition of the random noise caused by missing data. CONCLUSIONS: These findings have important implications for the area of data gathering. It may be concluded that small levels of dropout in the data do not perceptibly affect the overall accuracy of haplotype analysis, and that, given recent findings on the effect of inaccurate data, ambiguous data points are best treated as unknown.
id	pubmed-544188
institution	National Center for Biotechnology Information
record_format	MEDLINE/PubMed
spelling	pubmed-544188 2005-01-13 BMC Bioinformatics Research Article BioMed Central 2004-12-01 /pmc/articles/PMC544188/ /pubmed/15574202 http://dx.doi.org/10.1186/1471-2105-5-188 Text en Copyright © 2004 Kelly et al; licensee BioMed Central Ltd.
