QQ-SNV: single nucleotide variant detection at low frequency by comparing the quality quantiles
BACKGROUND: Next generation sequencing enables studying heterogeneous populations of viral infections. When the sequencing is done at high coverage depth (“deep sequencing”), low frequency variants can be detected. Here we present QQ-SNV (http://sourceforge.net/projects/qqsnv), a logistic regression...
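Main authors: | Van der Borght, Koen; Thys, Kim; Wetzels, Yves; Clement, Lieven; Verbist, Bie; Reumers, Joke; van Vlijmen, Herman; Aerssens, Jeroen |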
---|---|
Format: | Online Article Text |
Language: | English |
Published: |
BioMed Central
2015
|
Subjects: | Methodology Article |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4641353/ https://www.ncbi.nlm.nih.gov/pubmed/26554718 http://dx.doi.org/10.1186/s12859-015-0812-9 |
_version_ | 1782400186441531392 |
---|---|
author | Van der Borght, Koen; Thys, Kim; Wetzels, Yves; Clement, Lieven; Verbist, Bie; Reumers, Joke; van Vlijmen, Herman; Aerssens, Jeroen |
author_sort | Van der Borght, Koen |
collection | PubMed |
description | BACKGROUND: Next generation sequencing enables studying heterogeneous populations of viral infections. When the sequencing is done at high coverage depth (“deep sequencing”), low frequency variants can be detected. Here we present QQ-SNV (http://sourceforge.net/projects/qqsnv), a logistic regression classifier model developed for the Illumina sequencing platforms that uses the quantiles of the quality scores to distinguish true single nucleotide variants from sequencing errors based on the estimated SNV probability. To train the model, we created a dataset of an in silico mixture of five HIV-1 plasmids. Testing of our method in comparison to the existing methods LoFreq, ShoRAH, and V-Phaser 2 was performed on two HIV and four HCV plasmid mixture datasets and one influenza H1N1 clinical dataset. RESULTS: For default application of QQ-SNV, variants were called using an SNV probability cutoff of 0.5 (QQ-SNV(D)). To improve the sensitivity we used an SNV probability cutoff of 0.0001 (QQ-SNV(HS)). To also increase specificity, SNVs called were overruled when their frequency was below the 80th percentile calculated on the distribution of error frequencies (QQ-SNV(HS-P80)). When comparing QQ-SNV versus the other methods on the plasmid mixture test sets, QQ-SNV(D) performed similarly to the existing approaches. QQ-SNV(HS) was more sensitive on all test sets but with more false positives. QQ-SNV(HS-P80) was found to be the most accurate method over all test sets by balancing sensitivity and specificity. When applied to a paired-end HCV sequencing study, with lowest spiked-in true frequency of 0.5 %, QQ-SNV(HS-P80) revealed a sensitivity of 100 % (vs. 40–60 % for the existing methods) and a specificity of 100 % (vs. 98.0–99.7 % for the existing methods). In addition, QQ-SNV required the least overall computation time to process the test sets. Finally, when testing on a clinical sample, four putative true variants with frequency below 0.5 % were consistently detected by QQ-SNV(HS-P80) from different generations of Illumina sequencers. CONCLUSIONS: We developed and successfully evaluated a novel method, called QQ-SNV, for highly efficient single nucleotide variant calling on Illumina deep sequencing virology data. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-015-0812-9) contains supplementary material, which is available to authorized users. |
format | Online Article Text |
id | pubmed-4641353 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2015 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-4641353 2015-11-12 BMC Bioinformatics Methodology Article BioMed Central 2015-11-10 /pmc/articles/PMC4641353/ /pubmed/26554718 http://dx.doi.org/10.1186/s12859-015-0812-9 Text en © Van der Borght et al. 2015 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. |
title | QQ-SNV: single nucleotide variant detection at low frequency by comparing the quality quantiles |
topic | Methodology Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4641353/ https://www.ncbi.nlm.nih.gov/pubmed/26554718 http://dx.doi.org/10.1186/s12859-015-0812-9 |
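
The three calling modes named in the abstract (QQ-SNV(D), QQ-SNV(HS), QQ-SNV(HS-P80)) differ only in the SNV probability cutoff and in whether a frequency floor is applied. The following is a minimal sketch of that decision rule, not the authors' implementation. It assumes the classifier's SNV probability and the list of error frequencies are already available; the names `Candidate`, `call_snvs`, and `nearest_rank_percentile` are hypothetical, and both the nearest-rank percentile definition and the inclusive comparisons (`>=`) are assumptions that the abstract does not specify.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    """One candidate variant (hypothetical record layout)."""
    position: int
    base: str
    frequency: float        # observed variant frequency as a fraction, not a percentage
    snv_probability: float  # output of the classifier, assumed to be given


def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile; the paper's exact percentile definition is not given here."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def call_snvs(candidates, cutoff=0.5, error_frequencies=None, pct=80):
    """Keep candidates whose SNV probability reaches the cutoff.

    If error_frequencies is supplied, also overrule calls whose frequency
    falls below the pct-th percentile of those error frequencies.
    """
    called = [c for c in candidates if c.snv_probability >= cutoff]
    if error_frequencies is not None:
        floor = nearest_rank_percentile(error_frequencies, pct)
        called = [c for c in called if c.frequency >= floor]
    return called


if __name__ == "__main__":
    # Made-up illustrative values, not data from the article.
    candidates = [
        Candidate(10, "A", 0.02, 0.9),
        Candidate(20, "G", 0.004, 0.01),
        Candidate(30, "T", 0.0005, 0.002),
    ]
    errors = [0.0002, 0.0004, 0.0006, 0.0008, 0.001]
    print("D:     ", [c.position for c in call_snvs(candidates, cutoff=0.5)])
    print("HS:    ", [c.position for c in call_snvs(candidates, cutoff=0.0001)])
    print("HS-P80:", [c.position for c in call_snvs(candidates, 0.0001, errors)])
```

Lowering the cutoff from 0.5 to 0.0001 admits more low-probability candidates (higher sensitivity), and the percentile floor then removes those whose frequency is indistinguishable from the bulk of the error frequencies (higher specificity), which mirrors the trade-off described in the RESULTS section.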