
Detailed simulation of cancer exome sequencing data reveals differences and common limitations of variant callers

BACKGROUND: Next-generation sequencing of matched tumor and normal biopsy pairs has become a technology of paramount importance for precision cancer treatment. Sequencing costs have dropped tremendously, allowing the sequencing of the whole exome of tumors for just a fraction of the total treatment...


Bibliographic Details
Main Authors: Hofmann, Ariane L., Behr, Jonas, Singer, Jochen, Kuipers, Jack, Beisel, Christian, Schraml, Peter, Moch, Holger, Beerenwinkel, Niko
Format: Online Article Text
Language: English
Published: BioMed Central 2017
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5209852/
https://www.ncbi.nlm.nih.gov/pubmed/28049408
http://dx.doi.org/10.1186/s12859-016-1417-7
_version_ 1782490806349725696
author Hofmann, Ariane L.
Behr, Jonas
Singer, Jochen
Kuipers, Jack
Beisel, Christian
Schraml, Peter
Moch, Holger
Beerenwinkel, Niko
author_facet Hofmann, Ariane L.
Behr, Jonas
Singer, Jochen
Kuipers, Jack
Beisel, Christian
Schraml, Peter
Moch, Holger
Beerenwinkel, Niko
author_sort Hofmann, Ariane L.
collection PubMed
description BACKGROUND: Next-generation sequencing of matched tumor and normal biopsy pairs has become a technology of paramount importance for precision cancer treatment. Sequencing costs have dropped tremendously, allowing the sequencing of the whole exome of tumors for just a fraction of the total treatment costs. However, clinicians and scientists cannot take full advantage of the generated data because the accuracy of analysis pipelines is limited. This particularly concerns the reliable identification of subclonal mutations in a cancer tissue sample with very low frequencies, which may be clinically relevant. RESULTS: Using simulations based on kidney tumor data, we compared the performance of nine state-of-the-art variant callers, namely deepSNV, GATK HaplotypeCaller, GATK UnifiedGenotyper, JointSNVMix2, MuTect, SAMtools, SiNVICT, SomaticSniper, and VarScan2. The comparison was done as a function of variant allele frequencies and coverage. Our analysis revealed that deepSNV and JointSNVMix2 perform very well, especially in the low-frequency range. We attributed false positive and false negative calls of the nine tools to specific error sources and assigned them to processing steps of the pipeline. All of these errors can be expected to occur in real data sets. We found that modifying certain steps of the pipeline or parameters of the tools can lead to substantial improvements in performance. Furthermore, a novel integration strategy that combines the ranks of the variants yielded the best performance. More precisely, the rank-combination of deepSNV, JointSNVMix2, MuTect, SiNVICT and VarScan2 reached a sensitivity of 78% when fixing the precision at 90%, and outperformed all individual tools, where the maximum sensitivity was 71% with the same precision. 
CONCLUSIONS: The choice of well-performing tools for alignment and variant calling is crucial for the correct interpretation of exome sequencing data obtained from mixed samples, and common pipelines are suboptimal. We were able to relate observed substantial differences in performance to the underlying statistical models of the tools, and to pinpoint the error sources of false positive and false negative calls. These findings might inspire new software developments that improve exome sequencing pipelines and further the field of precision cancer treatment. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-016-1417-7) contains supplementary material, which is available to authorized users.
format Online
Article
Text
id pubmed-5209852
institution National Center for Biotechnology Information
language English
publishDate 2017
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-5209852 2017-01-04 Detailed simulation of cancer exome sequencing data reveals differences and common limitations of variant callers Hofmann, Ariane L. Behr, Jonas Singer, Jochen Kuipers, Jack Beisel, Christian Schraml, Peter Moch, Holger Beerenwinkel, Niko BMC Bioinformatics Research Article BACKGROUND: Next-generation sequencing of matched tumor and normal biopsy pairs has become a technology of paramount importance for precision cancer treatment. Sequencing costs have dropped tremendously, allowing the sequencing of the whole exome of tumors for just a fraction of the total treatment costs. However, clinicians and scientists cannot take full advantage of the generated data because the accuracy of analysis pipelines is limited. This particularly concerns the reliable identification of subclonal mutations in a cancer tissue sample with very low frequencies, which may be clinically relevant. RESULTS: Using simulations based on kidney tumor data, we compared the performance of nine state-of-the-art variant callers, namely deepSNV, GATK HaplotypeCaller, GATK UnifiedGenotyper, JointSNVMix2, MuTect, SAMtools, SiNVICT, SomaticSniper, and VarScan2. The comparison was done as a function of variant allele frequencies and coverage. Our analysis revealed that deepSNV and JointSNVMix2 perform very well, especially in the low-frequency range. We attributed false positive and false negative calls of the nine tools to specific error sources and assigned them to processing steps of the pipeline. All of these errors can be expected to occur in real data sets. We found that modifying certain steps of the pipeline or parameters of the tools can lead to substantial improvements in performance. Furthermore, a novel integration strategy that combines the ranks of the variants yielded the best performance. 
More precisely, the rank-combination of deepSNV, JointSNVMix2, MuTect, SiNVICT and VarScan2 reached a sensitivity of 78% when fixing the precision at 90%, and outperformed all individual tools, where the maximum sensitivity was 71% with the same precision. CONCLUSIONS: The choice of well-performing tools for alignment and variant calling is crucial for the correct interpretation of exome sequencing data obtained from mixed samples, and common pipelines are suboptimal. We were able to relate observed substantial differences in performance to the underlying statistical models of the tools, and to pinpoint the error sources of false positive and false negative calls. These findings might inspire new software developments that improve exome sequencing pipelines and further the field of precision cancer treatment. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-016-1417-7) contains supplementary material, which is available to authorized users. BioMed Central 2017-01-03 /pmc/articles/PMC5209852/ /pubmed/28049408 http://dx.doi.org/10.1186/s12859-016-1417-7 Text en © The Author(s) 2017 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle Research Article
Hofmann, Ariane L.
Behr, Jonas
Singer, Jochen
Kuipers, Jack
Beisel, Christian
Schraml, Peter
Moch, Holger
Beerenwinkel, Niko
Detailed simulation of cancer exome sequencing data reveals differences and common limitations of variant callers
title Detailed simulation of cancer exome sequencing data reveals differences and common limitations of variant callers
title_full Detailed simulation of cancer exome sequencing data reveals differences and common limitations of variant callers
title_fullStr Detailed simulation of cancer exome sequencing data reveals differences and common limitations of variant callers
title_full_unstemmed Detailed simulation of cancer exome sequencing data reveals differences and common limitations of variant callers
title_short Detailed simulation of cancer exome sequencing data reveals differences and common limitations of variant callers
title_sort detailed simulation of cancer exome sequencing data reveals differences and common limitations of variant callers
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5209852/
https://www.ncbi.nlm.nih.gov/pubmed/28049408
http://dx.doi.org/10.1186/s12859-016-1417-7
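
The abstract reports a rank-combination of several callers and evaluates it as sensitivity at a fixed precision of 90%, but it does not spell out the combination rule. The following is a minimal sketch of how such an evaluation might look, under assumptions that are not taken from the article: ranks are combined by their arithmetic mean, a variant that a caller did not report receives the worst possible rank, and all caller names, variant identifiers and scores are hypothetical toy values.

```python
from typing import Dict, List, Set


def ranks_from_scores(scores: Dict[str, float], universe: List[str]) -> Dict[str, float]:
    """Rank one caller's variants by descending score (1 = most confident).

    Assumption: variants this caller did not report get the worst rank,
    i.e. the size of the candidate universe.
    """
    ordered = sorted(scores, key=lambda v: (-scores[v], v))
    ranks = {v: float(i + 1) for i, v in enumerate(ordered)}
    worst = float(len(universe))
    return {v: ranks.get(v, worst) for v in universe}


def combine_ranks(per_caller: Dict[str, Dict[str, float]]) -> List[str]:
    """Order all candidate variants by mean rank across callers.

    Mean rank is an illustrative choice, not necessarily the article's rule.
    """
    universe = sorted({v for s in per_caller.values() for v in s})
    tables = [ranks_from_scores(s, universe) for s in per_caller.values()]
    mean_rank = {v: sum(t[v] for t in tables) / len(tables) for v in universe}
    return sorted(universe, key=lambda v: (mean_rank[v], v))


def sensitivity_at_precision(ranked: List[str], truth: Set[str], min_precision: float) -> float:
    """Highest sensitivity over all list cutoffs whose precision meets the target."""
    best = 0.0
    true_positives = 0
    for k, variant in enumerate(ranked, start=1):
        if variant in truth:
            true_positives += 1
        if true_positives / k >= min_precision:
            best = max(best, true_positives / len(truth))
    return best


# Hypothetical toy data: v1..v4 are simulated true variants, fp1/fp2 are artifacts.
truth = {"v1", "v2", "v3", "v4"}
callers = {
    "caller_a": {"v1": 0.9, "v2": 0.8, "fp1": 0.7, "v3": 0.2},
    "caller_b": {"v1": 0.95, "v3": 0.9, "fp2": 0.85, "v2": 0.3},
}

combined = combine_ranks(callers)
single = sorted(callers["caller_a"], key=lambda v: -callers["caller_a"][v])

print(combined)
print(sensitivity_at_precision(combined, truth, 0.9))
print(sensitivity_at_precision(single, truth, 0.9))
```

In this toy setting the combined ranking pushes both artifacts below the three reported true variants, so it reaches a higher sensitivity at 90% precision than caller_a alone. That only illustrates the mechanism; it says nothing about the magnitudes reported in the article.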