
Estimating the information value of polymorphic sites using pooled sequences


Bibliographic Details
Main author:	Malde, Ketil
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2014
Subjects:	Research
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4239578/ https://www.ncbi.nlm.nih.gov/pubmed/25571927 http://dx.doi.org/10.1186/1471-2164-15-S6-S20

author	Malde, Ketil
collection	PubMed
description	BACKGROUND: High-throughput sequencing is a cost-effective method for identifying genetic variation, and it is currently in use on a large scale across the field of biology, including ecology and population genetics. Correctly identifying variable sites and allele frequencies from sequencing data remains challenging, in large part due to artifacts and biases inherent in the sequencing process. Selecting variants that are diagnostic is commonly done using diversity statistics like F(ST), but these measures are not ideal for the task. RESULTS: Here, we develop a method that directly calculates the expected amount of information gained from observing each variant site. We then develop and implement a conservative estimator that takes into account uncertainty introduced by sampling bias and sequencing error. This estimator is applied to simulated and real sequencing data, and we discuss how it performs compared to the commonly used existing methods for identifying diagnostic polymorphisms. CONCLUSION: The expected information content gives an easy-to-interpret measure for the usefulness of variant sites. The results show that we achieve a clear separation between true variants and noise, allowing us to select candidate sites with a high degree of confidence.
format	Online Article Text
id	pubmed-4239578
institution	National Center for Biotechnology Information
language	English
publishDate	2014
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-4239578 2014-11-25 Estimating the information value of polymorphic sites using pooled sequences. Malde, Ketil. BMC Genomics. Research. BioMed Central 2014-10-17. /pmc/articles/PMC4239578/ /pubmed/25571927 http://dx.doi.org/10.1186/1471-2164-15-S6-S20 Text en. Copyright © 2014 Malde; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/4.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	Estimating the information value of polymorphic sites using pooled sequences
topic	Research
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4239578/ https://www.ncbi.nlm.nih.gov/pubmed/25571927 http://dx.doi.org/10.1186/1471-2164-15-S6-S20
work_keys_str_mv	AT maldeketil estimatingtheinformationvalueofpolymorphicsitesusingpooledsequences
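
The description field in this record says the method scores each variant site by the expected amount of information gained from observing it. As a rough illustration of that idea only, the sketch below computes the mutual information, in bits, between a population label and the allele observed at a biallelic site. It is not the article's estimator: the function names, the two-population setup, and the equal prior are assumptions made here, and the sketch ignores the sampling bias and sequencing error that the article's conservative estimator is said to account for.

```python
import math


def binary_entropy(p: float) -> float:
    """Shannon entropy, in bits, of a biallelic site with allele frequency p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def expected_information(p_a: float, p_b: float, prior_a: float = 0.5) -> float:
    """Mutual information (bits) between population label and observed allele.

    p_a, p_b: frequency of the same allele in hypothetical populations A and B.
    prior_a:  assumed prior probability that a sample comes from population A.
    """
    # Allele frequency in the mixture of both populations.
    p_mix = prior_a * p_a + (1.0 - prior_a) * p_b
    # Average uncertainty about the allele once the population is known.
    conditional = (prior_a * binary_entropy(p_a)
                   + (1.0 - prior_a) * binary_entropy(p_b))
    # I(population; allele) = H(allele) - H(allele | population)
    return binary_entropy(p_mix) - conditional
```

Under these assumptions, a site fixed for different alleles in the two populations yields one full bit per observation, while a site with identical frequencies in both yields none.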
