
Comparing the performance of selected variant callers using synthetic data and genome segmentation

BACKGROUND: High-throughput sequencing has rapidly become an essential part of precision cancer medicine. But validating results obtained from analyzing and interpreting genomic data remains a rate-limiting factor. The gold standard, of course, remains manual validation by expert panels, which is no...


Bibliographic Details
Main authors:	Bian, Xiaopeng, Zhu, Bin, Wang, Mingyi, Hu, Ying, Chen, Qingrong, Nguyen, Cu, Hicks, Belynda, Meerzaman, Daoud
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2018
Subjects:	Research Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6245711/ https://www.ncbi.nlm.nih.gov/pubmed/30453880 http://dx.doi.org/10.1186/s12859-018-2440-7

_version_	1783372291260809216
author	Bian, Xiaopeng Zhu, Bin Wang, Mingyi Hu, Ying Chen, Qingrong Nguyen, Cu Hicks, Belynda Meerzaman, Daoud
author_facet	Bian, Xiaopeng Zhu, Bin Wang, Mingyi Hu, Ying Chen, Qingrong Nguyen, Cu Hicks, Belynda Meerzaman, Daoud
author_sort	Bian, Xiaopeng
collection	PubMed
description	BACKGROUND: High-throughput sequencing has rapidly become an essential part of precision cancer medicine. But validating results obtained from analyzing and interpreting genomic data remains a rate-limiting factor. The gold standard, of course, remains manual validation by expert panels, which is not without its weaknesses, namely high costs in both funding and time as well as the necessarily selective nature of manual validation. But it may be possible to develop more economical, complementary means of validation. In this study we employed four synthetic data sets (variants with known mutations spiked into specific genomic locations) of increasing complexity to assess the sensitivity, specificity, and balanced accuracy of five open-source variant callers: FreeBayes v1.0, VarDict v11.5.1, MuTect v1.1.7, MuTect2, and MuSE v1.0rc. FreeBayes, VarDict, and MuTect were run in bcbio-nextgen, and the results were integrated into a single Ensemble call set. The known mutations provided a level of “ground truth” against which we evaluated variant-caller performance. We further facilitated the comparison and evaluation by segmenting the whole genome into 10,000,000 base-pair fragments, which yielded 316 segments. RESULTS: Differences among the numbers of true positives were small among the callers, but the numbers of false positives varied much more when the tools were used to analyze sets one through three. Both FreeBayes and VarDict produced strikingly more false positives than did the others, although VarDict, somewhat paradoxically, also produced the highest number of true positives. The Ensemble approach yielded results characterized by higher specificity and balanced accuracy and fewer false positives than did any of the five tools used alone. Sensitivity and specificity, however, declined for all five callers as the complexity of the data sets increased, but we did not uncover anything more than limited, weak correlations between caller performance and certain DNA structural features: gene density and guanine-cytosine content. Altogether, MuTect2 performed the best among the callers tested, followed by MuSE and MuTect. CONCLUSIONS: Spiking data sets with specific mutations at known locations in the genome (in this study, single-nucleotide variations (SNVs), single-nucleotide polymorphisms (SNPs), or structural variations (SVs)) provides an effective and economical way to compare data analyzed by variant callers with ground truth. The method constitutes a viable alternative to the prolonged, expensive, and noncomprehensive assessment by expert panels. It should be further developed and refined, as should other comparatively “lightweight” methods of assessing accuracy. Given that the scientific community has not yet established gold standards for validating NGS-related technologies such as variant callers, developing multiple alternative means for verifying variant-caller accuracy will eventually lead to the establishment of higher-quality standards than could be achieved by prematurely limiting the range of innovative methods explored by members of the community.
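The abstract names two mechanics that a short sketch can make concrete: cutting the genome into 10,000,000 base-pair fragments and scoring a caller by sensitivity, specificity, and balanced accuracy. The Python below is an illustration only, not the authors' pipeline. It assumes the usual textbook definitions of the three metrics (this record does not state the paper's exact formulas), and the function names, the toy chromosome lengths, and the counts are all hypothetical.

```python
SEGMENT_SIZE = 10_000_000  # fragment length stated in the abstract


def segment_chromosome(name, length, size=SEGMENT_SIZE):
    """Return half-open (name, start, end) windows covering one chromosome.

    The last window is shorter when the length is not a multiple of `size`.
    """
    return [(name, start, min(start + size, length))
            for start in range(0, length, size)]


def sensitivity(tp, fn):
    """Fraction of spiked-in (true) variants that the caller reported."""
    total = tp + fn
    return tp / total if total else float("nan")


def specificity(tn, fp):
    """Fraction of non-variant sites that the caller correctly left uncalled."""
    total = tn + fp
    return tn / total if total else float("nan")


def balanced_accuracy(tp, fp, tn, fn):
    """Mean of sensitivity and specificity (standard definition, assumed here)."""
    return (sensitivity(tp, fn) + specificity(tn, fp)) / 2


# Hypothetical toy chromosome lengths, not taken from any real genome build.
toy_lengths = {"chrA": 25_000_000, "chrB": 10_000_000}
segments = [seg
            for name, length in toy_lengths.items()
            for seg in segment_chromosome(name, length)]

# Hypothetical confusion-matrix counts for one caller within one segment.
example_score = balanced_accuracy(tp=8, fp=1, tn=9, fn=2)
```

Computing the metrics per segment rather than once per genome is what would allow caller performance to be set against regional features such as gene density or GC content, as the abstract describes.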
format	Online Article Text
id	pubmed-6245711
institution	National Center for Biotechnology Information
language	English
publishDate	2018
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-6245711 2018-11-26 Comparing the performance of selected variant callers using synthetic data and genome segmentation Bian, Xiaopeng Zhu, Bin Wang, Mingyi Hu, Ying Chen, Qingrong Nguyen, Cu Hicks, Belynda Meerzaman, Daoud BMC Bioinformatics Research Article BioMed Central 2018-11-19 /pmc/articles/PMC6245711/ /pubmed/30453880 http://dx.doi.org/10.1186/s12859-018-2440-7 Text en © The Author(s). 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle	Research Article Bian, Xiaopeng Zhu, Bin Wang, Mingyi Hu, Ying Chen, Qingrong Nguyen, Cu Hicks, Belynda Meerzaman, Daoud Comparing the performance of selected variant callers using synthetic data and genome segmentation
title	Comparing the performance of selected variant callers using synthetic data and genome segmentation
title_full	Comparing the performance of selected variant callers using synthetic data and genome segmentation
title_fullStr	Comparing the performance of selected variant callers using synthetic data and genome segmentation
title_full_unstemmed	Comparing the performance of selected variant callers using synthetic data and genome segmentation
title_short	Comparing the performance of selected variant callers using synthetic data and genome segmentation
title_sort	comparing the performance of selected variant callers using synthetic data and genome segmentation
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6245711/ https://www.ncbi.nlm.nih.gov/pubmed/30453880 http://dx.doi.org/10.1186/s12859-018-2440-7
work_keys_str_mv	AT bianxiaopeng comparingtheperformanceofselectedvariantcallersusingsyntheticdataandgenomesegmentation AT zhubin comparingtheperformanceofselectedvariantcallersusingsyntheticdataandgenomesegmentation AT wangmingyi comparingtheperformanceofselectedvariantcallersusingsyntheticdataandgenomesegmentation AT huying comparingtheperformanceofselectedvariantcallersusingsyntheticdataandgenomesegmentation AT chenqingrong comparingtheperformanceofselectedvariantcallersusingsyntheticdataandgenomesegmentation AT nguyencu comparingtheperformanceofselectedvariantcallersusingsyntheticdataandgenomesegmentation AT hicksbelynda comparingtheperformanceofselectedvariantcallersusingsyntheticdataandgenomesegmentation AT meerzamandaoud comparingtheperformanceofselectedvariantcallersusingsyntheticdataandgenomesegmentation
