
Effective filtering strategies to improve data quality from population-based whole exome sequencing studies

BACKGROUND: Genotypes generated in next generation sequencing studies contain errors which can significantly impact the power to detect signals in common and rare variant association tests. These genotyping errors are not explicitly filtered by the standard GATK Variant Quality Score Recalibration (...


Bibliographic Details
Main Authors:	Carson, Andrew R, Smith, Erin N, Matsui, Hiroko, Brækkan, Sigrid K, Jepsen, Kristen, Hansen, John-Bjarne, Frazer, Kelly A
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2014
Subjects:	Methodology Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4098776/ https://www.ncbi.nlm.nih.gov/pubmed/24884706 http://dx.doi.org/10.1186/1471-2105-15-125

author	Carson, Andrew R Smith, Erin N Matsui, Hiroko Brækkan, Sigrid K Jepsen, Kristen Hansen, John-Bjarne Frazer, Kelly A
author_sort	Carson, Andrew R
collection	PubMed
description	BACKGROUND: Genotypes generated in next generation sequencing studies contain errors which can significantly impact the power to detect signals in common and rare variant association tests. These genotyping errors are not explicitly filtered by the standard GATK Variant Quality Score Recalibration (VQSR) tool and thus remain a source of errors in whole exome sequencing (WES) projects that follow GATK’s recommended best practices. Therefore, additional data filtering methods are required to effectively remove these errors before performing association analyses with complex phenotypes. Here we empirically derive thresholds for genotype and variant filters that, when used in conjunction with the VQSR tool, achieve higher data quality than when using VQSR alone. RESULTS: The detailed filtering strategies improve the concordance of sequenced genotypes with array genotypes from 99.33% to 99.77%; improve the percent of discordant genotypes removed from 10.5% to 69.5%; and improve the Ti/Tv ratio from 2.63 to 2.75. We also demonstrate that managing batch effects by separating samples based on different target capture and sequencing chemistry protocols results in a final data set containing 40.9% more high-quality variants. In addition, imputation is an important component of WES studies and is used to estimate common variant genotypes to generate additional markers for association analyses. As such, we demonstrate filtering methods for imputed data that improve genotype concordance from 79.3% to 99.8% while removing 99.5% of discordant genotypes. CONCLUSIONS: The described filtering methods are advantageous for large population-based WES studies designed to identify common and rare variation associated with complex diseases. Compared to data processed through standard practices, these strategies result in substantially higher quality data for common and rare association analyses.
format	Online Article Text
id	pubmed-4098776
institution	National Center for Biotechnology Information
language	English
publishDate	2014
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-4098776 2014-07-16 Effective filtering strategies to improve data quality from population-based whole exome sequencing studies Carson, Andrew R Smith, Erin N Matsui, Hiroko Brækkan, Sigrid K Jepsen, Kristen Hansen, John-Bjarne Frazer, Kelly A BMC Bioinformatics Methodology Article BioMed Central 2014-05-02 /pmc/articles/PMC4098776/ /pubmed/24884706 http://dx.doi.org/10.1186/1471-2105-15-125 Text en Copyright © 2014 Carson et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	Effective filtering strategies to improve data quality from population-based whole exome sequencing studies
topic	Methodology Article
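
The abstract in this record reports two quality metrics: concordance between sequenced and array genotypes, and the transition/transversion (Ti/Tv) ratio. The following is a minimal sketch of how such metrics are commonly computed, not the authors' pipeline. The function names, the encoding of genotypes as alternate-allele counts (0, 1, 2, with None for a missing call), and the toy inputs are all assumptions made for illustration.

```python
from typing import Optional, Sequence, Tuple

# Transitions are purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T)
# substitutions; every other single-nucleotide change is a transversion.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def genotype_concordance(
    seq_calls: Sequence[Optional[int]],
    array_calls: Sequence[Optional[int]],
) -> float:
    """Percent of genotypes called on both platforms that agree.

    Hypothetical encoding: alternate-allele count (0, 1, 2), None if missing.
    """
    compared = 0
    matched = 0
    for seq_gt, array_gt in zip(seq_calls, array_calls):
        if seq_gt is None or array_gt is None:
            continue  # skip genotypes missing on either platform
        compared += 1
        if seq_gt == array_gt:
            matched += 1
    if compared == 0:
        raise ValueError("no genotypes were called on both platforms")
    return 100.0 * matched / compared


def ti_tv_ratio(snvs: Sequence[Tuple[str, str]]) -> float:
    """Ratio of transitions to transversions among (ref, alt) SNV pairs."""
    ti = sum(1 for ref, alt in snvs if (ref, alt) in TRANSITIONS)
    tv = len(snvs) - ti
    if tv == 0:
        raise ValueError("no transversions observed")
    return ti / tv


if __name__ == "__main__":
    # Toy data only; these are not values from the study.
    print(genotype_concordance([0, 1, 2, None], [0, 1, 1, 2]))
    print(ti_tv_ratio([("A", "G"), ("C", "T"), ("A", "C")]))
```

Under this sketch, a genotype or variant filter would be judged by recomputing both metrics on the calls that survive it, which mirrors how the abstract presents its before-and-after figures.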
