
Reducing INDEL calling errors in whole genome and exome sequencing data

BACKGROUND: INDELs, especially those disrupting protein-coding regions of the genome, have been strongly associated with human diseases. However, there are still many errors with INDEL variant calling, driven by library preparation, sequencing biases, and algorithm artifacts. METHODS: We characteriz...


Bibliographic Details
Main Authors: Fang, Han, Wu, Yiyang, Narzisi, Giuseppe, O’Rawe, Jason A, Barrón, Laura T Jimenez, Rosenbaum, Julie, Ronemus, Michael, Iossifov, Ivan, Schatz, Michael C, Lyon, Gholson J
Format: Online Article Text
Language: English
Published: BioMed Central 2014
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4240813/
https://www.ncbi.nlm.nih.gov/pubmed/25426171
http://dx.doi.org/10.1186/s13073-014-0089-z
_version_ 1782345774223327232
author Fang, Han
Wu, Yiyang
Narzisi, Giuseppe
O’Rawe, Jason A
Barrón, Laura T Jimenez
Rosenbaum, Julie
Ronemus, Michael
Iossifov, Ivan
Schatz, Michael C
Lyon, Gholson J
author_sort Fang, Han
collection PubMed
description BACKGROUND: INDELs, especially those disrupting protein-coding regions of the genome, have been strongly associated with human diseases. However, INDEL variant calling still produces many errors, driven by library preparation, sequencing biases, and algorithm artifacts.
METHODS: We characterized whole genome sequencing (WGS), whole exome sequencing (WES), and PCR-free sequencing data from the same samples to investigate the sources of INDEL errors. We also developed a classification scheme based on coverage and composition to rank high- and low-quality INDEL calls. We performed a large-scale validation experiment on 600 loci and found that high-quality INDELs have a substantially lower error rate than low-quality INDELs (7% vs. 51%).
RESULTS: Simulation and experimental data show that assembly-based callers are significantly more sensitive and robust for detecting large INDELs (>5 bp) than alignment-based callers, consistent with published data. The concordance of INDEL detection between WGS and WES is low (53%), and WGS data uniquely identifies 10.8-fold more high-quality INDELs. The validation rate for WGS-specific INDELs is also much higher than that for WES-specific INDELs (84% vs. 57%), and WES misses many large INDELs. In addition, the concordance for INDEL detection between standard WGS and PCR-free sequencing is 71%, and standard WGS data uniquely identifies 6.3-fold more low-quality INDELs. Furthermore, accurate detection of heterozygous INDELs with Scalpel requires 1.2-fold higher coverage than that for homozygous INDELs. Lastly, homopolymer A/T INDELs are a major source of low-quality INDEL calls, and they are highly enriched in the WES data.
CONCLUSIONS: Overall, we show that the accuracy of INDEL detection with WGS is much greater than with WES, even in the targeted region. We calculated that 60X WGS depth of coverage from the HiSeq platform is needed to recover 95% of INDELs detected by Scalpel. While this is higher than current sequencing practice, the deeper coverage may save total project costs because of the greater accuracy and sensitivity. Finally, we investigate sources of INDEL errors (for example, capture deficiency, PCR amplification, homopolymers) with various data that will serve as a guideline to effectively reduce INDEL errors in genome sequencing.
ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s13073-014-0089-z) contains supplementary material, which is available to authorized users.
format Online
Article
Text
id pubmed-4240813
institution National Center for Biotechnology Information
language English
publishDate 2014
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-4240813 2014-11-25
Reducing INDEL calling errors in whole genome and exome sequencing data
Genome Med
Research
BioMed Central 2014-10-28
/pmc/articles/PMC4240813/ /pubmed/25426171 http://dx.doi.org/10.1186/s13073-014-0089-z
Text en
© Fang et al.; licensee BioMed Central Ltd. 2014 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Reducing INDEL calling errors in whole genome and exome sequencing data
topic Research