Evaluation of variant detection software for pooled next-generation sequence data
BACKGROUND: Despite the tremendous drop in the cost of nucleotide sequencing in recent years, many research projects still utilize sequencing of pools containing multiple samples for the detection of sequence variants as a cost saving measure. Various software tools exist to analyze these pooled seq...
Main Authors: | Huang, Howard W., Mullikin, James C., Hansen, Nancy F. |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2015 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4518579/ https://www.ncbi.nlm.nih.gov/pubmed/26220471 http://dx.doi.org/10.1186/s12859-015-0624-y |
_version_ | 1782383374864744448 |
---|---|
author | Huang, Howard W. Mullikin, James C. Hansen, Nancy F. |
author_facet | Huang, Howard W. Mullikin, James C. Hansen, Nancy F. |
author_sort | Huang, Howard W. |
collection | PubMed |
description | BACKGROUND: Despite the tremendous drop in the cost of nucleotide sequencing in recent years, many research projects still utilize sequencing of pools containing multiple samples for the detection of sequence variants as a cost saving measure. Various software tools exist to analyze these pooled sequence data, yet little has been reported on the relative accuracy and ease of use of these different programs. RESULTS: In this manuscript we evaluate five different variant detection programs—The Genome Analysis Toolkit (GATK), CRISP, LoFreq, VarScan, and SNVer—with regard to their ability to detect variants in synthetically pooled Illumina sequencing data, by creating simulated pooled binary alignment/map (BAM) files using single-sample sequencing data from varying numbers of previously characterized samples at varying depths of coverage per sample. We report the overall runtimes and memory usage of each program, as well as each program’s sensitivity and specificity to detect known true variants. CONCLUSIONS: GATK, CRISP, and LoFreq all gave balanced accuracy of 80 % or greater for datasets with varying per-sample depth of coverage and numbers of samples per pool. VarScan and SNVer generally had balanced accuracy lower than 80 %. CRISP and LoFreq required up to four times less computational time and up to ten times less physical memory than GATK did, and without filtering, gave results with the highest sensitivity. VarScan and SNVer had generally lower false positive rates, but also significantly lower sensitivity than the other three programs. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-015-0624-y) contains supplementary material, which is available to authorized users. |
format | Online Article Text |
id | pubmed-4518579 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2015 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-45185792015-07-30 Evaluation of variant detection software for pooled next-generation sequence data Huang, Howard W. Mullikin, James C. Hansen, Nancy F. BMC Bioinformatics Research Article BACKGROUND: Despite the tremendous drop in the cost of nucleotide sequencing in recent years, many research projects still utilize sequencing of pools containing multiple samples for the detection of sequence variants as a cost saving measure. Various software tools exist to analyze these pooled sequence data, yet little has been reported on the relative accuracy and ease of use of these different programs. RESULTS: In this manuscript we evaluate five different variant detection programs—The Genome Analysis Toolkit (GATK), CRISP, LoFreq, VarScan, and SNVer—with regard to their ability to detect variants in synthetically pooled Illumina sequencing data, by creating simulated pooled binary alignment/map (BAM) files using single-sample sequencing data from varying numbers of previously characterized samples at varying depths of coverage per sample. We report the overall runtimes and memory usage of each program, as well as each program’s sensitivity and specificity to detect known true variants. CONCLUSIONS: GATK, CRISP, and LoFreq all gave balanced accuracy of 80 % or greater for datasets with varying per-sample depth of coverage and numbers of samples per pool. VarScan and SNVer generally had balanced accuracy lower than 80 %. CRISP and LoFreq required up to four times less computational time and up to ten times less physical memory than GATK did, and without filtering, gave results with the highest sensitivity. VarScan and SNVer had generally lower false positive rates, but also significantly lower sensitivity than the other three programs. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-015-0624-y) contains supplementary material, which is available to authorized users. 
BioMed Central 2015-07-29 /pmc/articles/PMC4518579/ /pubmed/26220471 http://dx.doi.org/10.1186/s12859-015-0624-y Text en © Huang et al. 2015 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. |
spellingShingle | Research Article Huang, Howard W. Mullikin, James C. Hansen, Nancy F. Evaluation of variant detection software for pooled next-generation sequence data |
title | Evaluation of variant detection software for pooled next-generation sequence data |
title_full | Evaluation of variant detection software for pooled next-generation sequence data |
title_fullStr | Evaluation of variant detection software for pooled next-generation sequence data |
title_full_unstemmed | Evaluation of variant detection software for pooled next-generation sequence data |
title_short | Evaluation of variant detection software for pooled next-generation sequence data |
title_sort | evaluation of variant detection software for pooled next-generation sequence data |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4518579/ https://www.ncbi.nlm.nih.gov/pubmed/26220471 http://dx.doi.org/10.1186/s12859-015-0624-y |
work_keys_str_mv | AT huanghowardw evaluationofvariantdetectionsoftwareforpoolednextgenerationsequencedata AT evaluationofvariantdetectionsoftwareforpoolednextgenerationsequencedata AT mullikinjamesc evaluationofvariantdetectionsoftwareforpoolednextgenerationsequencedata AT hansennancyf evaluationofvariantdetectionsoftwareforpoolednextgenerationsequencedata |
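
The description above reports sensitivity, specificity, and balanced accuracy for each program. This record does not say how the authors computed them, so the following is only a minimal sketch using the standard definitions, with balanced accuracy taken as the mean of sensitivity and specificity. The function name, the representation of sites as integers, and the toy call sets are all hypothetical and are not taken from the article.

```python
def balanced_accuracy(true_variants, called_variants, all_sites):
    """Return (sensitivity, specificity, balanced accuracy) for one call set.

    Assumption: every site in true_variants and called_variants is also
    present in all_sites, the full set of positions that were evaluated.
    """
    true_variants = set(true_variants)
    called_variants = set(called_variants)
    all_sites = set(all_sites)

    tp = len(true_variants & called_variants)   # known variants that were called
    fn = len(true_variants - called_variants)   # known variants that were missed
    fp = len(called_variants - true_variants)   # calls at non-variant sites
    tn = len(all_sites - true_variants - called_variants)  # correctly uncalled sites

    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity, (sensitivity + specificity) / 2


# Toy example (hypothetical data): 12 evaluated sites, 4 known variants,
# and a caller that finds 3 of them plus 1 false positive.
sites = range(1, 13)
truth = {1, 2, 3, 4}
calls = {1, 2, 3, 5}
sens, spec, bal = balanced_accuracy(truth, calls, sites)
print(f"sensitivity={sens:.3f} specificity={spec:.3f} balanced={bal:.4f}")
```

Averaging the two rates matters in this setting because non-variant sites far outnumber true variants, so plain accuracy would be dominated by the true negatives.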