
Estimating Haplotype Frequency and Coverage of Databases

A variety of forensic, population, and disease studies are based on haploid DNA (e.g. mitochondrial DNA or Y-chromosome data). For any set of genetic markers, databases of conventional size will normally contain only a fraction of all haplotypes. For several applications, reliable estimates of haplot...


Bibliographic Details
Main authors: Egeland, Thore; Salas, Antonio
Format: Text
Language: English
Published: Public Library of Science 2008
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2602601/
https://www.ncbi.nlm.nih.gov/pubmed/19098988
http://dx.doi.org/10.1371/journal.pone.0003988
author Egeland, Thore
Salas, Antonio
collection PubMed
description A variety of forensic, population, and disease studies are based on haploid DNA (e.g. mitochondrial DNA or Y-chromosome data). For any set of genetic markers, databases of conventional size will normally contain only a fraction of all haplotypes. For several applications, reliable estimates of haplotype frequencies, the total number of haplotypes, and coverage of the database (the probability that the next random haplotype is contained in the database) will be useful. We propose different approaches to the problem based on classical methods as well as new applications of Principal Component Analysis (PCA). We also discuss previous proposals based on saturation curves. Several conclusions can be inferred from simulated and real data. First, classical estimates of the fraction of unseen haplotypes can be seriously biased. Second, there is no obvious way to decide on required sample size based on traditional approaches. Methods based on testing of hypotheses or length of confidence intervals may appear artificial since no single test or parameter stands out as particularly relevant. Rather, the coverage may be more relevant since it indicates the percentage of different haplotypes that are contained in a database; if the coverage is low, there is a considerable chance that the next haplotype to be observed does not appear in the database, and this indicates that the database needs to be expanded. Finally, freeware and example data sets accompany the methods discussed in this paper: http://folk.uio.no/thoree/nhap/.
format Text
id pubmed-2602601
institution National Center for Biotechnology Information
language English
publishDate 2008
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-2602601 2008-12-22 Estimating Haplotype Frequency and Coverage of Databases. Egeland, Thore; Salas, Antonio. PLoS One. Research Article. Public Library of Science 2008-12-22 /pmc/articles/PMC2602601/ /pubmed/19098988 http://dx.doi.org/10.1371/journal.pone.0003988 Text en Egeland et al.
http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.
title Estimating Haplotype Frequency and Coverage of Databases
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2602601/
https://www.ncbi.nlm.nih.gov/pubmed/19098988
http://dx.doi.org/10.1371/journal.pone.0003988
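
The description defines coverage as the probability that the next random haplotype is already contained in the database. As a rough illustration of what a classical estimate of that quantity can look like, the sketch below uses the Good-Turing estimator, 1 minus the share of haplotypes seen exactly once. This is an assumption made for illustration: the record does not say which classical estimators the paper uses, and the description itself warns that classical estimates of the unseen fraction can be seriously biased. The function name and the haplotype labels are hypothetical.

```python
from collections import Counter


def good_turing_coverage(haplotypes):
    """Estimate database coverage as 1 - (singletons / sample size).

    Illustrative assumption: Good-Turing is one classical coverage
    estimator; it is not confirmed to be the method used in the paper.
    """
    n = len(haplotypes)
    if n == 0:
        raise ValueError("sample must contain at least one haplotype")
    counts = Counter(haplotypes)
    # Haplotypes observed exactly once stand in for the unseen mass.
    singletons = sum(1 for c in counts.values() if c == 1)
    return 1.0 - singletons / n


# Hypothetical database of eight observations over five haplotypes.
sample = ["H1", "H1", "H2", "H3", "H3", "H3", "H4", "H5"]
coverage = good_turing_coverage(sample)
print(f"estimated coverage: {coverage:.3f}")
```

Here three of the eight observations are singletons, so the estimate is 1 - 3/8. A low value would suggest, in the sense of the description, that the next haplotype observed has a considerable chance of being absent from the database.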