
QQ-SNV: single nucleotide variant detection at low frequency by comparing the quality quantiles

BACKGROUND: Next generation sequencing enables studying heterogeneous populations of viral infections. When the sequencing is done at high coverage depth (“deep sequencing”), low frequency variants can be detected. Here we present QQ-SNV (http://sourceforge.net/projects/qqsnv), a logistic regression...


Bibliographic Details
Main authors: Van der Borght, Koen, Thys, Kim, Wetzels, Yves, Clement, Lieven, Verbist, Bie, Reumers, Joke, van Vlijmen, Herman, Aerssens, Jeroen
Format: Online Article Text
Language: English
Published: BioMed Central 2015
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4641353/
https://www.ncbi.nlm.nih.gov/pubmed/26554718
http://dx.doi.org/10.1186/s12859-015-0812-9
author Van der Borght, Koen
Thys, Kim
Wetzels, Yves
Clement, Lieven
Verbist, Bie
Reumers, Joke
van Vlijmen, Herman
Aerssens, Jeroen
collection PubMed
description BACKGROUND: Next generation sequencing enables studying heterogeneous populations of viral infections. When the sequencing is done at high coverage depth (“deep sequencing”), low frequency variants can be detected. Here we present QQ-SNV (http://sourceforge.net/projects/qqsnv), a logistic regression classifier model developed for the Illumina sequencing platforms that uses the quantiles of the quality scores to distinguish true single nucleotide variants from sequencing errors based on the estimated SNV probability. To train the model, we created a dataset of an in silico mixture of five HIV-1 plasmids. Testing of our method in comparison to the existing methods LoFreq, ShoRAH, and V-Phaser 2 was performed on two HIV and four HCV plasmid mixture datasets and one influenza H1N1 clinical dataset. RESULTS: For default application of QQ-SNV, variants were called using a SNV probability cutoff of 0.5 (QQ-SNV(D)). To improve the sensitivity we used a SNV probability cutoff of 0.0001 (QQ-SNV(HS)). To also increase specificity, SNVs called were overruled when their frequency was below the 80th percentile calculated on the distribution of error frequencies (QQ-SNV(HS-P80)). When comparing QQ-SNV versus the other methods on the plasmid mixture test sets, QQ-SNV(D) performed similarly to the existing approaches. QQ-SNV(HS) was more sensitive on all test sets but with more false positives. QQ-SNV(HS-P80) was found to be the most accurate method over all test sets by balancing sensitivity and specificity. When applied to a paired-end HCV sequencing study, with lowest spiked-in true frequency of 0.5 %, QQ-SNV(HS-P80) revealed a sensitivity of 100 % (vs. 40–60 % for the existing methods) and a specificity of 100 % (vs. 98.0–99.7 % for the existing methods). In addition, QQ-SNV required the least overall computation time to process the test sets. Finally, when testing on a clinical sample, four putative true variants with frequency below 0.5 % were consistently detected by QQ-SNV(HS-P80) from different generations of Illumina sequencers. CONCLUSIONS: We developed and successfully evaluated a novel method, called QQ-SNV, for highly efficient single nucleotide variant calling on Illumina deep sequencing virology data. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-015-0812-9) contains supplementary material, which is available to authorized users.
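The three calling modes named in the abstract (QQ-SNV(D), QQ-SNV(HS), QQ-SNV(HS-P80)) differ only in the probability cutoff and in an optional frequency filter, which can be illustrated with a minimal Python sketch. This is not the QQ-SNV implementation: the `Candidate` class, the `call_snvs` and `percentile_80` functions, and the way the error-frequency distribution is supplied are all hypothetical, and the SNV probability is assumed to come from an already fitted classifier. Whether the cutoff comparison is inclusive is also an assumption, since the abstract does not say.

```python
from dataclasses import dataclass
from statistics import quantiles

# Cutoffs quoted in the abstract.
DEFAULT_CUTOFF = 0.5       # QQ-SNV(D)
HIGH_SENS_CUTOFF = 0.0001  # QQ-SNV(HS) and QQ-SNV(HS-P80)


@dataclass(frozen=True)
class Candidate:
    """Hypothetical container for one candidate variant (not a QQ-SNV type)."""
    position: int
    base: str
    frequency: float        # observed variant frequency, as a fraction
    snv_probability: float  # assumed output of a fitted logistic regression


def percentile_80(error_frequencies):
    """Return the 80th percentile of the supplied error frequencies."""
    # quantiles(..., n=5) yields the 20th, 40th, 60th and 80th percentiles.
    return quantiles(error_frequencies, n=5, method="inclusive")[3]


def call_snvs(candidates, cutoff, error_frequencies=None):
    """Keep candidates whose SNV probability reaches the cutoff.

    If error_frequencies is given, calls whose frequency lies below the
    80th percentile of that distribution are overruled (HS-P80 style).
    How that distribution is estimated is left to the caller here.
    """
    # Assumption: the cutoff is inclusive (>=).
    called = [c for c in candidates if c.snv_probability >= cutoff]
    if error_frequencies is not None:
        floor = percentile_80(error_frequencies)
        called = [c for c in called if c.frequency >= floor]
    return called


if __name__ == "__main__":
    # Made-up values for illustration only; not data from the article.
    candidates = [
        Candidate(100, "A", 0.020, 0.9),
        Candidate(101, "C", 0.004, 0.01),
        Candidate(102, "G", 0.006, 0.001),
        Candidate(103, "T", 0.003, 0.00001),
    ]
    errors = [0.001, 0.002, 0.003, 0.004, 0.005, 0.006]
    for label, cutoff, errs in [
        ("D", DEFAULT_CUTOFF, None),
        ("HS", HIGH_SENS_CUTOFF, None),
        ("HS-P80", HIGH_SENS_CUTOFF, errors),
    ]:
        positions = [c.position for c in call_snvs(candidates, cutoff, errs)]
        print(label, positions)
```

Lowering the cutoff admits more candidates, and the percentile filter then removes those whose frequency is indistinguishable from the supplied error frequencies, which mirrors the sensitivity and specificity trade-off the abstract describes.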
format Online
Article
Text
id pubmed-4641353
institution National Center for Biotechnology Information
language English
publishDate 2015
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-4641353 2015-11-12 QQ-SNV: single nucleotide variant detection at low frequency by comparing the quality quantiles Van der Borght, Koen Thys, Kim Wetzels, Yves Clement, Lieven Verbist, Bie Reumers, Joke van Vlijmen, Herman Aerssens, Jeroen BMC Bioinformatics Methodology Article BioMed Central 2015-11-10 /pmc/articles/PMC4641353/ /pubmed/26554718 http://dx.doi.org/10.1186/s12859-015-0812-9 Text en © Van der Borght et al. 2015 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title QQ-SNV: single nucleotide variant detection at low frequency by comparing the quality quantiles
topic Methodology Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4641353/
https://www.ncbi.nlm.nih.gov/pubmed/26554718
http://dx.doi.org/10.1186/s12859-015-0812-9