
Analysis of error profiles in deep next-generation sequencing data

BACKGROUND: Sequencing errors are key confounding factors for detecting low-frequency genetic variants that are important for cancer molecular diagnosis, treatment, and surveillance using deep next-generation sequencing (NGS). However, there is a lack of comprehensive understanding of errors introdu...


Bibliographic Details
Main authors: Ma, Xiaotu, Shao, Ying, Tian, Liqing, Flasch, Diane A., Mulder, Heather L., Edmonson, Michael N., Liu, Yu, Chen, Xiang, Newman, Scott, Nakitandwe, Joy, Li, Yongjin, Li, Benshang, Shen, Shuhong, Wang, Zhaoming, Shurtleff, Sheila, Robison, Leslie L., Levy, Shawn, Easton, John, Zhang, Jinghui
Format: Online Article Text
Language: English
Published: BioMed Central 2019
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6417284/
https://www.ncbi.nlm.nih.gov/pubmed/30867008
http://dx.doi.org/10.1186/s13059-019-1659-6
_version_ 1783403539072024576
author Ma, Xiaotu
Shao, Ying
Tian, Liqing
Flasch, Diane A.
Mulder, Heather L.
Edmonson, Michael N.
Liu, Yu
Chen, Xiang
Newman, Scott
Nakitandwe, Joy
Li, Yongjin
Li, Benshang
Shen, Shuhong
Wang, Zhaoming
Shurtleff, Sheila
Robison, Leslie L.
Levy, Shawn
Easton, John
Zhang, Jinghui
author_sort Ma, Xiaotu
collection PubMed
description BACKGROUND: Sequencing errors are key confounding factors for detecting low-frequency genetic variants that are important for cancer molecular diagnosis, treatment, and surveillance using deep next-generation sequencing (NGS). However, there is a lack of comprehensive understanding of errors introduced at various steps of a conventional NGS workflow, such as sample handling, library preparation, PCR enrichment, and sequencing. In this study, we use current NGS technology to systematically investigate these questions. RESULTS: By evaluating read-specific error distributions, we discover that the substitution error rate can be computationally suppressed to 10^-5 to 10^-4, which is 10- to 100-fold lower than generally considered achievable (10^-3) in the current literature. We then quantify substitution errors attributable to sample handling, library preparation, enrichment PCR, and sequencing by using multiple deep sequencing datasets. We find that error rates differ by nucleotide substitution type, ranging from 10^-5 for A>C/T>G, C>A/G>T, and C>G/G>C changes to 10^-4 for A>G/T>C changes. Furthermore, C>T/G>A errors exhibit strong sequence context dependency, sample-specific effects dominate elevated C>A/G>T errors, and target-enrichment PCR led to a ~6-fold increase in the overall error rate. We also find that more than 70% of hotspot variants can be detected at frequencies of 0.1% to 0.01% with the current NGS technology by applying in silico error suppression. CONCLUSIONS: We present the first comprehensive analysis of sequencing error sources in conventional NGS workflows. The error profiles revealed by our study highlight new directions for further improving NGS analysis accuracy both experimentally and computationally, ultimately enhancing the precision of deep sequencing. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1186/s13059-019-1659-6) contains supplementary material, which is available to authorized users.
format Online
Article
Text
id pubmed-6417284
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-6417284 2019-03-25 Analysis of error profiles in deep next-generation sequencing data Ma, Xiaotu Shao, Ying Tian, Liqing Flasch, Diane A. Mulder, Heather L. Edmonson, Michael N. Liu, Yu Chen, Xiang Newman, Scott Nakitandwe, Joy Li, Yongjin Li, Benshang Shen, Shuhong Wang, Zhaoming Shurtleff, Sheila Robison, Leslie L. Levy, Shawn Easton, John Zhang, Jinghui Genome Biol Research BioMed Central 2019-03-14 /pmc/articles/PMC6417284/ /pubmed/30867008 http://dx.doi.org/10.1186/s13059-019-1659-6 Text en © The Author(s). 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle Research
title Analysis of error profiles in deep next-generation sequencing data
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6417284/
https://www.ncbi.nlm.nih.gov/pubmed/30867008
http://dx.doi.org/10.1186/s13059-019-1659-6