
Detailed comparison of two popular variant calling packages for exome and targeted exon studies

The Genome Analysis Toolkit (GATK) is commonly used for variant calling of single nucleotide polymorphisms (SNPs) and small insertions and deletions (indels) from short-read sequencing data aligned against a reference genome. There have been a number of variant calling comparisons against GATK, but...


Bibliographic Details
Main authors: Warden, Charles D., Adamson, Aaron W., Neuhausen, Susan L., Wu, Xiwei
Format: Online Article Text
Language: English
Published: PeerJ Inc. 2014
Subjects: Bioinformatics
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4184249/
https://www.ncbi.nlm.nih.gov/pubmed/25289185
http://dx.doi.org/10.7717/peerj.600
author Warden, Charles D.
Adamson, Aaron W.
Neuhausen, Susan L.
Wu, Xiwei
collection PubMed
description The Genome Analysis Toolkit (GATK) is commonly used for variant calling of single nucleotide polymorphisms (SNPs) and small insertions and deletions (indels) from short-read sequencing data aligned against a reference genome. There have been a number of variant calling comparisons against GATK, but an equally comprehensive comparison for VarScan has not yet been performed. More specifically, we compare (1) the effects of different pre-processing steps prior to variant calling with both GATK and VarScan, (2) VarScan variants called with increasingly conservative parameters, and (3) filtered and unfiltered GATK variant calls (for both the UnifiedGenotyper and the HaplotypeCaller). Variant calling was performed on three datasets (1 targeted exon dataset and 2 exome datasets), each with approximately a dozen subjects. In most cases, pre-processing steps (e.g., indel realignment and quality score base recalibration using GATK) had only a modest impact on the variant calls, but the importance of the pre-processing steps varied between datasets and variant callers. Based upon concordance statistics presented in this study, we recommend GATK users focus on “high-quality” GATK variants by filtering out variants flagged as low-quality. We also found that running VarScan with a conservative set of parameters (referred to as “VarScan-Cons”) resulted in a reproducible list of variants, with high concordance (>97%) to high-quality variants called by the GATK UnifiedGenotyper and HaplotypeCaller. These conservative parameters result in decreased sensitivity, but the VarScan-Cons variant list could still recover 84–88% of the high-quality GATK SNPs in the exome datasets.
This study also provides limited evidence that VarScan-Cons has a decreased false positive rate among novel variants (relative to high-quality GATK SNPs) and that the GATK HaplotypeCaller has an increased false positive rate for indels (relative to VarScan-Cons and high-quality GATK UnifiedGenotyper indels). More broadly, we believe the metrics used for comparison in this study can be useful in assessing the quality of variant calls in the context of a specific experimental design. As an example, a limited number of variant calling comparisons are also performed on two additional variant callers.
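The concordance and recovery percentages quoted in the abstract can be illustrated with a minimal sketch. The abstract does not give the study's exact formulas, so the definitions here are assumptions: concordance is taken as the share of one caller's variants that also appear in a comparison set, and recovery as the share of the comparison set found by that caller. The `Variant` type, the function names, and the toy variants are all hypothetical and are not taken from the article.

```python
from typing import NamedTuple, Set


class Variant(NamedTuple):
    """A variant keyed by position and alleles (hypothetical representation)."""
    chrom: str
    pos: int
    ref: str
    alt: str


def concordance(calls: Set[Variant], comparison: Set[Variant]) -> float:
    """Assumed definition: fraction of `calls` that also appear in `comparison`."""
    if not calls:
        return 0.0
    return len(calls & comparison) / len(calls)


def recovery(calls: Set[Variant], comparison: Set[Variant]) -> float:
    """Assumed definition: fraction of `comparison` that is found in `calls`."""
    if not comparison:
        return 0.0
    return len(calls & comparison) / len(comparison)


# Toy call sets, invented for illustration only (not data from the study).
shared = {
    Variant("chr1", 100, "A", "G"),
    Variant("chr1", 250, "C", "T"),
    Variant("chr2", 75, "G", "A"),
}
conservative_calls = shared | {Variant("chr3", 10, "T", "C")}
high_quality_calls = shared | {
    Variant("chr2", 300, "T", "C"),
    Variant("chr4", 42, "G", "T"),
}

print(f"concordance: {concordance(conservative_calls, high_quality_calls):.2f}")
print(f"recovery:    {recovery(conservative_calls, high_quality_calls):.2f}")
```

Under these assumed definitions, a conservative caller can show high concordance and lower recovery at once, which is the trade-off the abstract describes for VarScan-Cons.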
format Online
Article
Text
id pubmed-4184249
institution National Center for Biotechnology Information
language English
publishDate 2014
publisher PeerJ Inc.
record_format MEDLINE/PubMed
spelling pubmed-4184249 2014-10-06 Detailed comparison of two popular variant calling packages for exome and targeted exon studies Warden, Charles D. Adamson, Aaron W. Neuhausen, Susan L. Wu, Xiwei PeerJ Bioinformatics PeerJ Inc. 2014-09-30 /pmc/articles/PMC4184249/ /pubmed/25289185 http://dx.doi.org/10.7717/peerj.600 Text en © 2014 Warden et al. http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ) and either DOI or URL of the article must be cited.
title Detailed comparison of two popular variant calling packages for exome and targeted exon studies
topic Bioinformatics
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4184249/
https://www.ncbi.nlm.nih.gov/pubmed/25289185
http://dx.doi.org/10.7717/peerj.600