The struggle to find reliable results in exome sequencing data: filtering out Mendelian errors
Next Generation Sequencing studies generate a large quantity of genetic data in a relatively cost and time efficient manner and provide an unprecedented opportunity to identify candidate causative variants that lead to disease phenotypes. A challenge to these studies is the generation of sequencing...
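Main Authors: | Patel, Zubin H.; Kottyan, Leah C.; Lazaro, Sara; Williams, Marc S.; Ledbetter, David H.; Tromp, Gerard; Rupert, Andrew; Kohram, Mojtaba; Wagner, Michael; Husami, Ammar; Qian, Yaping; Valencia, C. Alexander; Zhang, Kejian; Hostetter, Margaret K.; Harley, John B.; Kaufman, Kenneth M. |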
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Frontiers Media S.A. 2014 |
Subjects: | Genetics |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3921572/ https://www.ncbi.nlm.nih.gov/pubmed/24575121 http://dx.doi.org/10.3389/fgene.2014.00016 |
_version_ | 1782303313149034496 |
---|---|
author | Patel, Zubin H. Kottyan, Leah C. Lazaro, Sara Williams, Marc S. Ledbetter, David H. Tromp, Gerard Rupert, Andrew Kohram, Mojtaba Wagner, Michael Husami, Ammar Qian, Yaping Valencia, C. Alexander Zhang, Kejian Hostetter, Margaret K. Harley, John B. Kaufman, Kenneth M. |
author_facet | Patel, Zubin H. Kottyan, Leah C. Lazaro, Sara Williams, Marc S. Ledbetter, David H. Tromp, Gerard Rupert, Andrew Kohram, Mojtaba Wagner, Michael Husami, Ammar Qian, Yaping Valencia, C. Alexander Zhang, Kejian Hostetter, Margaret K. Harley, John B. Kaufman, Kenneth M. |
author_sort | Patel, Zubin H. |
collection | PubMed |
description | Next Generation Sequencing studies generate a large quantity of genetic data in a relatively cost and time efficient manner and provide an unprecedented opportunity to identify candidate causative variants that lead to disease phenotypes. A challenge to these studies is the generation of sequencing artifacts by current technologies. To identify and characterize the properties that distinguish false positive variants from true variants, we sequenced a child and both parents (one trio) using DNA isolated from three sources (blood, buccal cells, and saliva). The trio strategy allowed us to identify variants in the proband that could not have been inherited from the parents (Mendelian errors) and would most likely indicate sequencing artifacts. Quality control measurements were examined and three measurements were found to identify the greatest number of Mendelian errors. These included read depth, genotype quality score, and alternate allele ratio. Filtering the variants on these measurements removed ~95% of the Mendelian errors while retaining 80% of the called variants. These filters were applied independently. After filtering, the concordance between identical samples isolated from different sources was 99.99% as compared to 87% before filtering. This high concordance suggests that different sources of DNA can be used in trio studies without affecting the ability to identify causative polymorphisms. To facilitate analysis of next generation sequencing data, we developed the Cincinnati Analytical Suite for Sequencing Informatics (CASSI) to store sequencing files, metadata (e.g. relatedness information), file versioning, data filtering, variant annotation, and identify candidate causative polymorphisms that follow either de novo, rare recessive homozygous or compound heterozygous inheritance models. We conclude the data cleaning process improves the signal to noise ratio in terms of variants and facilitates the identification of candidate disease causative polymorphisms. |
format | Online Article Text |
id | pubmed-3921572 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2014 |
publisher | Frontiers Media S.A. |
record_format | MEDLINE/PubMed |
spelling | pubmed-3921572 2014-02-26 The struggle to find reliable results in exome sequencing data: filtering out Mendelian errors Patel, Zubin H. Kottyan, Leah C. Lazaro, Sara Williams, Marc S. Ledbetter, David H. Tromp, Gerard Rupert, Andrew Kohram, Mojtaba Wagner, Michael Husami, Ammar Qian, Yaping Valencia, C. Alexander Zhang, Kejian Hostetter, Margaret K. Harley, John B. Kaufman, Kenneth M. Front Genet Genetics Next Generation Sequencing studies generate a large quantity of genetic data in a relatively cost and time efficient manner and provide an unprecedented opportunity to identify candidate causative variants that lead to disease phenotypes. A challenge to these studies is the generation of sequencing artifacts by current technologies. To identify and characterize the properties that distinguish false positive variants from true variants, we sequenced a child and both parents (one trio) using DNA isolated from three sources (blood, buccal cells, and saliva). The trio strategy allowed us to identify variants in the proband that could not have been inherited from the parents (Mendelian errors) and would most likely indicate sequencing artifacts. Quality control measurements were examined and three measurements were found to identify the greatest number of Mendelian errors. These included read depth, genotype quality score, and alternate allele ratio. Filtering the variants on these measurements removed ~95% of the Mendelian errors while retaining 80% of the called variants. These filters were applied independently. After filtering, the concordance between identical samples isolated from different sources was 99.99% as compared to 87% before filtering. This high concordance suggests that different sources of DNA can be used in trio studies without affecting the ability to identify causative polymorphisms. To facilitate analysis of next generation sequencing data, we developed the Cincinnati Analytical Suite for Sequencing Informatics (CASSI) to store sequencing files, metadata (e.g. relatedness information), file versioning, data filtering, variant annotation, and identify candidate causative polymorphisms that follow either de novo, rare recessive homozygous or compound heterozygous inheritance models. We conclude the data cleaning process improves the signal to noise ratio in terms of variants and facilitates the identification of candidate disease causative polymorphisms. Frontiers Media S.A. 2014-02-12 /pmc/articles/PMC3921572/ /pubmed/24575121 http://dx.doi.org/10.3389/fgene.2014.00016 Text en Copyright © 2014 Patel, Kottyan, Lazaro, Williams, Ledbetter, Tromp, Rupert, Kohram, Wagner, Husami, Qian, Valencia, Zhang, Hostetter, Harley and Kaufman. http://creativecommons.org/licenses/by/3.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms. |
spellingShingle | Genetics Patel, Zubin H. Kottyan, Leah C. Lazaro, Sara Williams, Marc S. Ledbetter, David H. Tromp, Gerard Rupert, Andrew Kohram, Mojtaba Wagner, Michael Husami, Ammar Qian, Yaping Valencia, C. Alexander Zhang, Kejian Hostetter, Margaret K. Harley, John B. Kaufman, Kenneth M. The struggle to find reliable results in exome sequencing data: filtering out Mendelian errors |
title | The struggle to find reliable results in exome sequencing data: filtering out Mendelian errors |
title_full | The struggle to find reliable results in exome sequencing data: filtering out Mendelian errors |
title_fullStr | The struggle to find reliable results in exome sequencing data: filtering out Mendelian errors |
title_full_unstemmed | The struggle to find reliable results in exome sequencing data: filtering out Mendelian errors |
title_short | The struggle to find reliable results in exome sequencing data: filtering out Mendelian errors |
title_sort | struggle to find reliable results in exome sequencing data: filtering out mendelian errors |
topic | Genetics |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3921572/ https://www.ncbi.nlm.nih.gov/pubmed/24575121 http://dx.doi.org/10.3389/fgene.2014.00016 |
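
The description in this record names a trio-based Mendelian error check and three quality measurements (read depth, genotype quality score, alternate allele ratio) that are each applied as an independent filter. The following is a minimal Python sketch of that idea, not the authors' CASSI implementation. The `Call` record, the function names, and every threshold value are hypothetical and chosen only for illustration; the record does not state the cutoffs used in the article. Only diploid autosomal sites are assumed.

```python
from dataclasses import dataclass
from typing import Tuple

Genotype = Tuple[int, int]  # allele indices: 0 = reference, 1 = alternate


@dataclass
class Call:
    genotype: Genotype
    depth: int      # total reads covering the site
    gq: int         # genotype quality score
    alt_reads: int  # reads supporting the alternate allele


# Hypothetical thresholds, for illustration only (not taken from the article).
MIN_DEPTH = 15
MIN_GQ = 20
HET_ALT_RATIO = (0.3, 0.7)
HOM_ALT_MIN_RATIO = 0.9


def is_mendelian_error(child: Genotype, mother: Genotype, father: Genotype) -> bool:
    """True if the child genotype cannot be built from one maternal
    and one paternal allele (diploid autosomal site assumed)."""
    a, b = child
    consistent = (a in mother and b in father) or (b in mother and a in father)
    return not consistent


def passes_depth(call: Call) -> bool:
    return call.depth >= MIN_DEPTH


def passes_gq(call: Call) -> bool:
    return call.gq >= MIN_GQ


def passes_alt_ratio(call: Call) -> bool:
    """Check that the fraction of alternate reads fits the called genotype."""
    if call.depth == 0:
        return False
    ratio = call.alt_reads / call.depth
    n_alt = sum(1 for allele in call.genotype if allele != 0)
    if n_alt == 1:  # heterozygous call
        low, high = HET_ALT_RATIO
        return low <= ratio <= high
    if n_alt == 2:  # homozygous alternate call
        return ratio >= HOM_ALT_MIN_RATIO
    return True  # homozygous reference: no alternate-ratio check in this sketch


# Example: a heterozygous call in the child with both parents homozygous reference.
child = Call(genotype=(0, 1), depth=40, gq=99, alt_reads=4)
mother = Call(genotype=(0, 0), depth=35, gq=99, alt_reads=0)
father = Call(genotype=(0, 0), depth=38, gq=99, alt_reads=0)

error = is_mendelian_error(child.genotype, mother.genotype, father.genotype)
# Each filter is evaluated on its own, so their effects can be compared separately.
results = {
    "depth": passes_depth(child),
    "gq": passes_gq(child),
    "alt_ratio": passes_alt_ratio(child),
}
print(error, results)
```

Keeping the three checks as separate functions makes it possible to count how many Mendelian errors each measurement removes by itself, which matches the record's statement that the filters were applied independently.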