SNPest: a probabilistic graphical model for estimating genotypes

Bibliographic Details
Main Authors: Lindgreen, Stinus; Krogh, Anders; Pedersen, Jakob Skou
Format: Online Article Text
Language: English
Published: BioMed Central 2014
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4203901/
https://www.ncbi.nlm.nih.gov/pubmed/25294605
http://dx.doi.org/10.1186/1756-0500-7-698
author Lindgreen, Stinus
Krogh, Anders
Pedersen, Jakob Skou
collection PubMed
description BACKGROUND: As the use of next-generation sequencing technologies is becoming more widespread, the need for robust software to help with the analysis is growing as well. A key challenge when analyzing sequencing data is the prediction of genotypes from the reads, i.e. correct inference of the underlying DNA sequences that gave rise to the sequenced fragments. For diploid organisms, the genotyper should be able to predict both alleles in the individual. Variations between the individual and the population can then be analyzed by looking for SNPs (single nucleotide polymorphisms) in order to investigate diseases or phenotypic features. To perform robust and high confidence genotyping and SNP calling, methods are needed that take the technology specific limitations into account and can model different sources of error. As an example, ancient DNA poses special challenges as the data is often shallow and subject to errors induced by post mortem damage. FINDINGS: We present a novel approach to the genotyping problem where a probabilistic framework describing the process from sampling to sequencing is implemented as a graphical model. This makes it possible to model technology specific errors and other sources of variation that can affect the result. The inferred genotype is given a posterior probability to signify the confidence in the result. SNPest has already been used to genotype large scale projects such as the first ancient human genome published in 2010. CONCLUSIONS: We compare the performance of SNPest to a number of other widely used genotypers on both real and simulated data, covering both haploid and diploid genomes. We investigate the effects of read depth, of removing adapters before mapping and genotyping, of using different mapping tools, and of using the correct model in the genotyping process. We show that the performance of SNPest is comparable to existing methods, and we also illustrate cases where SNPest has an advantage over other methods, e.g. when dealing with simulated ancient DNA. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/1756-0500-7-698) contains supplementary material, which is available to authorized users.
format Online
Article
Text
id pubmed-4203901
institution National Center for Biotechnology Information
language English
publishDate 2014
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-4203901 2014-10-22 SNPest: a probabilistic graphical model for estimating genotypes Lindgreen, Stinus; Krogh, Anders; Pedersen, Jakob Skou BMC Res Notes Technical Note BioMed Central 2014-10-07 /pmc/articles/PMC4203901/ /pubmed/25294605 http://dx.doi.org/10.1186/1756-0500-7-698 Text en © Lindgreen et al.; licensee BioMed Central Ltd. 2014 This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title SNPest: a probabilistic graphical model for estimating genotypes
topic Technical Note
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4203901/
https://www.ncbi.nlm.nih.gov/pubmed/25294605
http://dx.doi.org/10.1186/1756-0500-7-698