ViVaMBC: estimating viral sequence variation in complex populations from illumina deep-sequencing data using model-based clustering

BACKGROUND: Deep-sequencing allows for an in-depth characterization of sequence variation in complex populations. However, technology-associated errors may impede a powerful assessment of low-frequency mutations. Fortunately, base calls are complemented with quality scores which are derived from a q...

Bibliographic Details
Main authors: Verbist, Bie, Clement, Lieven, Reumers, Joke, Thys, Kim, Vapirev, Alexander, Talloen, Willem, Wetzels, Yves, Meys, Joris, Aerssens, Jeroen, Bijnens, Luc, Thas, Olivier
Format: Online Article Text
Language: English
Published: BioMed Central 2015
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4369097/
https://www.ncbi.nlm.nih.gov/pubmed/25887734
http://dx.doi.org/10.1186/s12859-015-0458-7
_version_ 1782362728451538944
author Verbist, Bie
Clement, Lieven
Reumers, Joke
Thys, Kim
Vapirev, Alexander
Talloen, Willem
Wetzels, Yves
Meys, Joris
Aerssens, Jeroen
Bijnens, Luc
Thas, Olivier
author_facet Verbist, Bie
Clement, Lieven
Reumers, Joke
Thys, Kim
Vapirev, Alexander
Talloen, Willem
Wetzels, Yves
Meys, Joris
Aerssens, Jeroen
Bijnens, Luc
Thas, Olivier
author_sort Verbist, Bie
collection PubMed
description BACKGROUND: Deep-sequencing allows for an in-depth characterization of sequence variation in complex populations. However, technology-associated errors may impede a powerful assessment of low-frequency mutations. Fortunately, base calls are complemented with quality scores which are derived from a quadruplet of intensities, one channel for each nucleotide type for Illumina sequencing. The highest intensity of the four channels determines the base that is called. Mismatch bases can often be corrected by the second-best base, i.e. the base with the second-highest intensity in the quadruplet. A virus variant model-based clustering method, ViVaMBC, is presented that explores quality scores and second-best base calls for identifying and quantifying viral variants. ViVaMBC is optimized to call variants at the codon level (nucleotide triplets), which enables immediate biological interpretation of the variants with respect to their antiviral drug responses. RESULTS: Using mixtures of HCV plasmids we show that our method accurately estimates frequencies down to 0.5%. The estimates are unbiased when average coverages of 25,000 are reached. A comparison with the SNP-callers V-Phaser2, ShoRAH, and LoFreq shows that ViVaMBC has a superb sensitivity and specificity for variants with frequencies above 0.4%. Unlike the competitors, ViVaMBC reports a higher number of false-positive findings with frequencies below 0.4%, which might partially originate from picking up artificial variants introduced by errors in the sample and library preparation step. CONCLUSIONS: ViVaMBC is the first method to call viral variants directly at the codon level. The strength of the approach lies in modeling the error probabilities based on the quality scores. Although the use of second-best base calls appeared very promising in our data exploration phase, their utility was limited. 
They provided a slight increase in sensitivity, which however does not warrant the additional computational cost of running the offline base caller. Apparently a lot of information is already contained in the quality scores, enabling the model-based clustering procedure to adjust for the majority of the sequencing errors. Overall the sensitivity of ViVaMBC is such that technical constraints like PCR errors start to form the bottleneck for low-frequency variant detection. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-015-0458-7) contains supplementary material, which is available to authorized users.
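The abstract above says the error probabilities are modeled from the quality scores and that variants are called per codon. As a rough illustration only, and not the ViVaMBC model itself, the sketch below uses the standard Phred definition (error probability = 10^(-Q/10)) and combines the three bases of a codon under an independence assumption that is ours, not the record's. The function names are hypothetical.

```python
from typing import Sequence


def phred_to_error_prob(q: float) -> float:
    """Standard Phred definition: probability that a base call is wrong."""
    return 10 ** (-q / 10)


def codon_error_prob(quals: Sequence[float]) -> float:
    """Probability that at least one base of a codon is miscalled.

    Assumption (for illustration): the three base-call errors are
    independent. This is not taken from the ViVaMBC paper.
    """
    if len(quals) != 3:
        raise ValueError("a codon has exactly three base qualities")
    p_all_correct = 1.0
    for q in quals:
        p_all_correct *= 1.0 - phred_to_error_prob(q)
    return 1.0 - p_all_correct


if __name__ == "__main__":
    # Three bases at Q30 each: per-base error 0.001.
    print(codon_error_prob([30, 30, 30]))
```

Under this simplification, a codon read at Q30 on all three positions has an error probability just under 0.3%, which is of the same order as the 0.4% to 0.5% frequencies quoted in the abstract and suggests why quality-aware modeling matters at that level.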
format Online
Article
Text
id pubmed-4369097
institution National Center for Biotechnology Information
language English
publishDate 2015
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-43690972015-03-22 ViVaMBC: estimating viral sequence variation in complex populations from illumina deep-sequencing data using model-based clustering Verbist, Bie Clement, Lieven Reumers, Joke Thys, Kim Vapirev, Alexander Talloen, Willem Wetzels, Yves Meys, Joris Aerssens, Jeroen Bijnens, Luc Thas, Olivier BMC Bioinformatics Methodology Article BACKGROUND: Deep-sequencing allows for an in-depth characterization of sequence variation in complex populations. However, technology associated errors may impede a powerful assessment of low-frequency mutations. Fortunately, base calls are complemented with quality scores which are derived from a quadruplet of intensities, one channel for each nucleotide type for Illumina sequencing. The highest intensity of the four channels determines the base that is called. Mismatch bases can often be corrected by the second best base, i.e. the base with the second highest intensity in the quadruplet. A virus variant model-based clustering method, ViVaMBC, is presented that explores quality scores and second best base calls for identifying and quantifying viral variants. ViVaMBC is optimized to call variants at the codon level (nucleotide triplets) which enables immediate biological interpretation of the variants with respect to their antiviral drug responses. RESULTS: Using mixtures of HCV plasmids we show that our method accurately estimates frequencies down to 0.5%. The estimates are unbiased when average coverages of 25,000 are reached. A comparison with the SNP-callers V-Phaser2, ShoRAH, and LoFreq shows that ViVaMBC has a superb sensitivity and specificity for variants with frequencies above 0.4%. Unlike the competitors, ViVaMBC reports a higher number of false-positive findings with frequencies below 0.4% which might partially originate from picking up artificial variants introduced by errors in the sample and library preparation step. 
CONCLUSIONS: ViVaMBC is the first method to call viral variants directly at the codon level. The strength of the approach lies in modeling the error probabilities based on the quality scores. Although the use of second best base calls appeared very promising in our data exploration phase, their utility was limited. They provided a slight increase in sensitivity, which however does not warrant the additional computational cost of running the offline base caller. Apparently a lot of information is already contained in the quality scores enabling the model based clustering procedure to adjust the majority of the sequencing errors. Overall the sensitivity of ViVaMBC is such that technical constraints like PCR errors start to form the bottleneck for low frequency variant detection. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-015-0458-7) contains supplementary material, which is available to authorized users. BioMed Central 2015-02-22 /pmc/articles/PMC4369097/ /pubmed/25887734 http://dx.doi.org/10.1186/s12859-015-0458-7 Text en © Verbist et al.; licensee BioMed Central. 2015 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.
spellingShingle Methodology Article
Verbist, Bie
Clement, Lieven
Reumers, Joke
Thys, Kim
Vapirev, Alexander
Talloen, Willem
Wetzels, Yves
Meys, Joris
Aerssens, Jeroen
Bijnens, Luc
Thas, Olivier
ViVaMBC: estimating viral sequence variation in complex populations from illumina deep-sequencing data using model-based clustering
title ViVaMBC: estimating viral sequence variation in complex populations from illumina deep-sequencing data using model-based clustering
title_full ViVaMBC: estimating viral sequence variation in complex populations from illumina deep-sequencing data using model-based clustering
title_fullStr ViVaMBC: estimating viral sequence variation in complex populations from illumina deep-sequencing data using model-based clustering
title_full_unstemmed ViVaMBC: estimating viral sequence variation in complex populations from illumina deep-sequencing data using model-based clustering
title_short ViVaMBC: estimating viral sequence variation in complex populations from illumina deep-sequencing data using model-based clustering
title_sort vivambc: estimating viral sequence variation in complex populations from illumina deep-sequencing data using model-based clustering
topic Methodology Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4369097/
https://www.ncbi.nlm.nih.gov/pubmed/25887734
http://dx.doi.org/10.1186/s12859-015-0458-7
work_keys_str_mv AT verbistbie vivambcestimatingviralsequencevariationincomplexpopulationsfromilluminadeepsequencingdatausingmodelbasedclustering
AT clementlieven vivambcestimatingviralsequencevariationincomplexpopulationsfromilluminadeepsequencingdatausingmodelbasedclustering
AT reumersjoke vivambcestimatingviralsequencevariationincomplexpopulationsfromilluminadeepsequencingdatausingmodelbasedclustering
AT thyskim vivambcestimatingviralsequencevariationincomplexpopulationsfromilluminadeepsequencingdatausingmodelbasedclustering
AT vapirevalexander vivambcestimatingviralsequencevariationincomplexpopulationsfromilluminadeepsequencingdatausingmodelbasedclustering
AT talloenwillem vivambcestimatingviralsequencevariationincomplexpopulationsfromilluminadeepsequencingdatausingmodelbasedclustering
AT wetzelsyves vivambcestimatingviralsequencevariationincomplexpopulationsfromilluminadeepsequencingdatausingmodelbasedclustering
AT meysjoris vivambcestimatingviralsequencevariationincomplexpopulationsfromilluminadeepsequencingdatausingmodelbasedclustering
AT aerssensjeroen vivambcestimatingviralsequencevariationincomplexpopulationsfromilluminadeepsequencingdatausingmodelbasedclustering
AT bijnensluc vivambcestimatingviralsequencevariationincomplexpopulationsfromilluminadeepsequencingdatausingmodelbasedclustering
AT thasolivier vivambcestimatingviralsequencevariationincomplexpopulationsfromilluminadeepsequencingdatausingmodelbasedclustering