
GATK hard filtering: tunable parameters to improve variant calling for next generation sequencing targeted gene panel data

BACKGROUND: NGS technology is a powerful alternative to standard Sanger sequencing in clinical settings. The proprietary software generally used for variant calling often depends on preset parameters that may not suit every gene. GAT...


Bibliographic Details
Main authors: De Summa, Simona, Malerba, Giovanni, Pinto, Rosamaria, Mori, Antonio, Mijatovic, Vladan, Tommasi, Stefania
Format: Online Article Text
Language: English
Published: BioMed Central 2017
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5374681/
https://www.ncbi.nlm.nih.gov/pubmed/28361668
http://dx.doi.org/10.1186/s12859-017-1537-8
author De Summa, Simona
Malerba, Giovanni
Pinto, Rosamaria
Mori, Antonio
Mijatovic, Vladan
Tommasi, Stefania
author_sort De Summa, Simona
collection PubMed
description BACKGROUND: Next-generation sequencing (NGS) technology is a powerful alternative to standard Sanger sequencing in clinical settings. The proprietary software generally used for variant calling often depends on preset parameters that may not suit every gene. GATK, which is widely used in academia, offers many parameters for variant calling. However, GATK's self-adjusting parameter calibration requires data from a large number of exomes. When these are not available, as is typically the case in a diagnostic laboratory, the operator must set the parameters manually (hard filtering). The aim of this paper was to set up a procedure for assessing the best parameters to use in GATK hard filtering. This was pursued by using classification trees on true and false variants from sequences simulated from a real dataset. RESULTS: We simulated two datasets with different coverages, including all the sequence alterations identified in a real dataset according to their observed frequencies. Simulated sequences were aligned with standard protocols, and regression trees were then built to identify the most reliable parameters and cutoff values for discriminating true from false variant calls. Moreover, we analyzed the flanking sequences of regions with a high rate of false positive calls and observed that such sequences have a low-complexity makeup. CONCLUSIONS: Our results showed that GATK hard filtering parameter values can be tailored through a simulation study based on the DNA region of interest to improve the accuracy of variant calling. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-017-1537-8) contains supplementary material, which is available to authorized users.
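The tree-based idea in the abstract can be illustrated with a minimal sketch: a single-split "stump" that searches for the cutoff on one annotation that best separates true from false calls by Gini impurity. This is not the authors' pipeline. The annotation name `QD`, the toy values, and the helper names `gini` and `best_cutoff` are assumptions introduced only for illustration; a real analysis would use labelled calls from simulated reads and a full tree over several annotations.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of boolean labels (0.0 for an empty list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_cutoff(values, labels):
    """Return (cutoff, weighted_impurity) for the single split
    `value < cutoff` that minimises the weighted Gini impurity.
    Returns (None, inf) if all values are identical."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, float("inf"))
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no threshold can separate equal values
        cutoff = (lo + hi) / 2.0  # midpoint between neighbouring values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (cutoff, score)
    return best


# Toy, made-up annotation values (NOT taken from the paper).
# True = variant known to be real in the simulation, False = false call.
qd = [1.2, 1.8, 2.5, 3.1, 6.0, 8.4, 12.0, 15.5]
is_true = [False, False, False, False, True, True, True, True]

cutoff, impurity = best_cutoff(qd, is_true)
print(f"candidate hard filter: QD < {cutoff:.2f} (weighted Gini {impurity:.3f})")
```

A cutoff found this way would then be written as a hard-filter expression on that annotation; with several annotations, a full classification tree yields one such threshold per internal node.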
format Online
Article
Text
id pubmed-5374681
institution National Center for Biotechnology Information
language English
publishDate 2017
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-5374681 2017-04-03 BMC Bioinformatics Research BioMed Central 2017-03-23 /pmc/articles/PMC5374681/ /pubmed/28361668 http://dx.doi.org/10.1186/s12859-017-1537-8 Text en © The Author(s). 2017 Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title GATK hard filtering: tunable parameters to improve variant calling for next generation sequencing targeted gene panel data
topic Research