
The allele distribution in next-generation sequencing data sets is accurately described as the result of a stochastic branching process

With the availability of next-generation sequencing (NGS) technology, it is expected that sequence variants may be called on a genomic scale. Here, we demonstrate that a deeper understanding of the distribution of the variant call frequencies at heterozygous loci in NGS data sets is a prerequisite f...


Bibliographic Details
Main authors: Heinrich, Verena, Stange, Jens, Dickhaus, Thorsten, Imkeller, Peter, Krüger, Ulrike, Bauer, Sebastian, Mundlos, Stefan, Robinson, Peter N., Hecht, Jochen, Krawitz, Peter M.
Format: Online Article Text
Language: English
Published: Oxford University Press 2012
Subjects: Computational Biology
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3315291/
https://www.ncbi.nlm.nih.gov/pubmed/22127862
http://dx.doi.org/10.1093/nar/gkr1073
collection PubMed
description With the availability of next-generation sequencing (NGS) technology, it is expected that sequence variants may be called on a genomic scale. Here, we demonstrate that a deeper understanding of the distribution of the variant call frequencies at heterozygous loci in NGS data sets is a prerequisite for sensitive variant detection. We model the crucial steps in an NGS protocol as a stochastic branching process and derive a mathematical framework for the expected distribution of alleles at heterozygous loci before measurement, that is, before sequencing. We confirm our theoretical results by analyzing technical replicates of human exome data and demonstrate that the variance of allele frequencies at heterozygous loci is higher than expected under a simple binomial distribution. Because of this high variance, mutation callers relying on binomially distributed priors are less sensitive for heterozygous variants that deviate strongly from the expected mean frequency. Our results also indicate that error rates can be reduced to a greater degree by technical replicates than by increasing sequencing depth.
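The central claim of the abstract, that amplification modeled as a branching process makes allele frequencies at heterozygous loci more variable than a binomial model predicts, can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the authors' framework: it assumes each molecule is duplicated independently with a fixed probability per cycle, and the function name and every parameter value (starting copies, cycle count, efficiency, read depth) are hypothetical choices made for illustration.

```python
import numpy as np


def simulate_allele_fractions(n_loci=20000, start_copies=5, cycles=12,
                              efficiency=0.8, depth=100, seed=0):
    """Simulate observed alt-allele fractions at heterozygous loci.

    Assumed model (illustrative, not taken from the paper):
    both alleles start with `start_copies` molecules; in every cycle each
    molecule is duplicated with probability `efficiency` (a simple
    Galton-Watson branching process); finally `depth` reads are drawn
    binomially from the amplified pool.
    """
    rng = np.random.default_rng(seed)
    ref = np.full(n_loci, start_copies, dtype=np.int64)
    alt = np.full(n_loci, start_copies, dtype=np.int64)

    for _ in range(cycles):
        # Each existing molecule yields one extra copy with prob. `efficiency`.
        ref += rng.binomial(ref, efficiency)
        alt += rng.binomial(alt, efficiency)

    # Allele fraction in the amplified pool, before sequencing.
    pool_fraction = alt / (ref + alt)

    # Sequencing step: sample `depth` reads from the pool at each locus.
    alt_reads = rng.binomial(depth, pool_fraction)
    return alt_reads / depth


if __name__ == "__main__":
    depth = 100
    fractions = simulate_allele_fractions(depth=depth)
    binomial_variance = 0.5 * 0.5 / depth  # variance if reads were Binomial(depth, 0.5)
    print("observed variance:", fractions.var())
    print("binomial variance:", binomial_variance)
```

Under these assumptions the mean fraction stays near 0.5, while the observed variance exceeds the pure binomial value, because the random early duplication events shift the pool fraction away from 0.5 before any read is sampled. Increasing `depth` shrinks only the sampling term, which is consistent with the abstract's remark that replicates help more than additional depth.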
format Online
Article
Text
id pubmed-3315291
institution National Center for Biotechnology Information
language English
publishDate 2012
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-3315291 2012-03-30 Nucleic Acids Res Computational Biology Oxford University Press 2012-03 2011-11-29 /pmc/articles/PMC3315291/ /pubmed/22127862 http://dx.doi.org/10.1093/nar/gkr1073 Text en © The Author(s) 2011. Published by Oxford University Press.
http://creativecommons.org/licenses/by-nc/3.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
title The allele distribution in next-generation sequencing data sets is accurately described as the result of a stochastic branching process
topic Computational Biology