
Evaluation of variant detection software for pooled next-generation sequence data

BACKGROUND: Despite the tremendous drop in the cost of nucleotide sequencing in recent years, many research projects still utilize sequencing of pools containing multiple samples for the detection of sequence variants as a cost saving measure. Various software tools exist to analyze these pooled seq...


Bibliographic Details
Main authors: Huang, Howard W., Mullikin, James C., Hansen, Nancy F.
Format: Online Article Text
Language: English
Published: BioMed Central 2015
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4518579/
https://www.ncbi.nlm.nih.gov/pubmed/26220471
http://dx.doi.org/10.1186/s12859-015-0624-y
_version_ 1782383374864744448
author Huang, Howard W.
Mullikin, James C.
Hansen, Nancy F.
collection PubMed
description BACKGROUND: Despite the tremendous drop in the cost of nucleotide sequencing in recent years, many research projects still utilize sequencing of pools containing multiple samples for the detection of sequence variants as a cost saving measure. Various software tools exist to analyze these pooled sequence data, yet little has been reported on the relative accuracy and ease of use of these different programs. RESULTS: In this manuscript we evaluate five different variant detection programs—The Genome Analysis Toolkit (GATK), CRISP, LoFreq, VarScan, and SNVer—with regard to their ability to detect variants in synthetically pooled Illumina sequencing data, by creating simulated pooled binary alignment/map (BAM) files using single-sample sequencing data from varying numbers of previously characterized samples at varying depths of coverage per sample. We report the overall runtimes and memory usage of each program, as well as each program’s sensitivity and specificity to detect known true variants. CONCLUSIONS: GATK, CRISP, and LoFreq all gave balanced accuracy of 80 % or greater for datasets with varying per-sample depth of coverage and numbers of samples per pool. VarScan and SNVer generally had balanced accuracy lower than 80 %. CRISP and LoFreq required up to four times less computational time and up to ten times less physical memory than GATK did, and without filtering, gave results with the highest sensitivity. VarScan and SNVer had generally lower false positive rates, but also significantly lower sensitivity than the other three programs. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-015-0624-y) contains supplementary material, which is available to authorized users.
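The abstract reports sensitivity, specificity, and balanced accuracy without defining them. As a minimal sketch, assuming the article uses the conventional definitions (balanced accuracy as the mean of sensitivity and specificity), the metrics can be computed from confusion counts as below. The function name and the counts in the usage example are hypothetical illustrations, not values from the article.

```python
def variant_call_metrics(tp, fp, tn, fn):
    """Return (sensitivity, specificity, balanced accuracy).

    Assumes the conventional definitions; the article's abstract does not
    spell them out, so treat this as an illustration only.
    tp: true variants that were called
    fn: true variants that were missed
    tn: non-variant sites correctly left uncalled
    fp: non-variant sites incorrectly called
    """
    sensitivity = tp / (tp + fn)   # fraction of true variants detected
    specificity = tn / (tn + fp)   # fraction of non-variant sites left uncalled
    balanced_accuracy = (sensitivity + specificity) / 2
    return sensitivity, specificity, balanced_accuracy


# Hypothetical counts, invented for illustration (not results from the article).
sens, spec, bal_acc = variant_call_metrics(tp=90, fp=50, tn=950, fn=10)
print(f"sensitivity={sens:.3f} specificity={spec:.3f} balanced accuracy={bal_acc:.3f}")
```

Under this definition, a caller with high sensitivity but many false positives and a caller with few false positives but low sensitivity can both fall below an 80 % threshold, which is consistent with how the abstract contrasts the five programs.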
format Online
Article
Text
id pubmed-4518579
institution National Center for Biotechnology Information
language English
publishDate 2015
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-4518579 2015-07-30 Evaluation of variant detection software for pooled next-generation sequence data Huang, Howard W. Mullikin, James C. Hansen, Nancy F. BMC Bioinformatics Research Article
BioMed Central 2015-07-29 /pmc/articles/PMC4518579/ /pubmed/26220471 http://dx.doi.org/10.1186/s12859-015-0624-y Text en © Huang et al. 2015 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Evaluation of variant detection software for pooled next-generation sequence data
topic Research Article