SiNPle: Fast and Sensitive Variant Calling for Deep Sequencing Data

Bibliographic Details
Main authors: Ferretti, Luca, Tennakoon, Chandana, Silesian, Adrian, Freimanis, Graham, Ribeca, Paolo
Format: Online Article Text
Language: English
Published: MDPI 2019
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6722845/
https://www.ncbi.nlm.nih.gov/pubmed/31349684
http://dx.doi.org/10.3390/genes10080561
author Ferretti, Luca
Tennakoon, Chandana
Silesian, Adrian
Freimanis, Graham
Ribeca, Paolo
collection PubMed
description Current high-throughput sequencing technologies can generate sequence data and provide information on the genetic composition of samples at very high coverage. Deep sequencing approaches enable the detection of rare variants in heterogeneous samples, such as viral quasi-species, but also have the undesired effect of amplifying sequencing errors and artefacts. Distinguishing real variants from such noise is not straightforward. Variant callers that can handle pooled samples can be in trouble at extremely high read depths, while at lower depths sensitivity is often sacrificed to specificity. In this paper, we propose SiNPle (Simplified Inference of Novel Polymorphisms from Large coveragE), a fast and effective software for variant calling. SiNPle is based on a simplified Bayesian approach to compute the posterior probability that a variant is not generated by sequencing errors or PCR artefacts. The Bayesian model takes into consideration individual base qualities as well as their distribution, the baseline error rates during both the sequencing and the PCR stage, the prior distribution of variant frequencies and their strandedness. Our approach leads to an approximate but extremely fast computation of posterior probabilities even for very high coverage data, since the expression for the posterior distribution is a simple analytical formula in terms of summary statistics for the variants appearing at each site in the genome. These statistics can be used to filter out putative SNPs and indels according to the required level of sensitivity. We tested SiNPle on several simulated and real-life viral datasets to show that it is faster and more sensitive than existing methods. The source code for SiNPle is freely available to download and compile, or as a Conda/Bioconda package.
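The description above outlines a Bayesian comparison between "true variant" and "sequencing error" explanations for the non-reference reads at a site. The following is a minimal illustrative sketch of that general idea only; it is not SiNPle's analytical formula. The function names, the uniform frequency prior on a grid, the `prior_variant` value, and the use of the mean Phred-derived error rate are all assumptions made for this example, and strandedness and PCR errors are not modelled.

```python
import math


def phred_to_error(q):
    """Convert a Phred base quality to an error probability."""
    return 10.0 ** (-q / 10.0)


def log_binom_pmf(k, n, p):
    """Log of the binomial probability of k successes in n trials."""
    p = min(max(p, 1e-12), 1.0 - 1e-12)  # keep the logarithms finite
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p))


def variant_posterior(depth, alt_quals, prior_variant=1e-3, grid=200):
    """Toy posterior probability that the alternative allele is real.

    depth         -- total number of reads covering the site
    alt_quals     -- Phred qualities of the reads showing the alternative allele
    prior_variant -- assumed prior probability that the site is variable
    grid          -- number of frequency values used to average the likelihood
    """
    k = len(alt_quals)
    if k > depth:
        raise ValueError("more alternative reads than total depth")
    if k == 0:
        return 0.0
    # Assumed error model: mean base error, split evenly over 3 wrong bases.
    err = sum(phred_to_error(q) for q in alt_quals) / k / 3.0

    # Error-only hypothesis: every alternative read is a sequencing error.
    log_l0 = log_binom_pmf(k, depth, err)

    # Variant hypothesis: average the likelihood over a uniform frequency grid.
    logs = []
    for j in range(grid):
        f = (j + 0.5) / grid
        logs.append(log_binom_pmf(k, depth, f + (1.0 - f) * err))
    m = max(logs)
    log_l1 = m + math.log(sum(math.exp(v - m) for v in logs)) - math.log(grid)

    # Posterior via a numerically stable logistic of the log odds.
    diff = (math.log1p(-prior_variant) + log_l0) - (math.log(prior_variant) + log_l1)
    if diff > 700.0:  # exp would overflow; the posterior is effectively zero
        return 0.0
    return 1.0 / (1.0 + math.exp(diff))
```

Under these assumptions, a site at depth 10,000 with 500 high-quality alternative reads receives a posterior close to 1, whereas a single alternative read at the same depth is attributed to error.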
format Online
Article
Text
id pubmed-6722845
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-6722845 2019-09-10 SiNPle: Fast and Sensitive Variant Calling for Deep Sequencing Data Ferretti, Luca Tennakoon, Chandana Silesian, Adrian Freimanis, Graham Ribeca, Paolo Genes (Basel) Article MDPI 2019-07-25 /pmc/articles/PMC6722845/ /pubmed/31349684 http://dx.doi.org/10.3390/genes10080561 Text en © 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
title SiNPle: Fast and Sensitive Variant Calling for Deep Sequencing Data
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6722845/
https://www.ncbi.nlm.nih.gov/pubmed/31349684
http://dx.doi.org/10.3390/genes10080561