Gencore: an efficient tool to generate consensus reads for error suppressing and duplicate removing of NGS data

Bibliographic Details
Main authors: Chen, Shifu; Zhou, Yanqing; Chen, Yaru; Huang, Tanxiao; Liao, Wenting; Xu, Yun; Li, Zhicheng; Gu, Jia
Format: Online Article Text
Language: English
Published: BioMed Central 2019
Subjects: Software
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6933617/
https://www.ncbi.nlm.nih.gov/pubmed/31881822
http://dx.doi.org/10.1186/s12859-019-3280-9
author Chen, Shifu
Zhou, Yanqing
Chen, Yaru
Huang, Tanxiao
Liao, Wenting
Xu, Yun
Li, Zhicheng
Gu, Jia
collection PubMed
description BACKGROUND: Duplicate removal might be considered a well-resolved problem in next-generation sequencing (NGS) data processing. However, as NGS technology gains recognition in clinical applications, researchers are paying more attention to its sequencing errors and prefer to remove these errors while performing deduplication. Recently, a technology called the unique molecular identifier (UMI) has been developed to better identify sequencing reads derived from different DNA fragments. Most existing duplicate-removal tools cannot handle UMI-integrated data. Some modern tools can work with UMIs but are usually slow and use too much memory. Furthermore, existing tools rarely report rich statistical results, which are very important for quality control and downstream analysis. These unmet requirements drove us to develop an ultra-fast, simple, lightweight but powerful tool for duplicate removal and sequencing error suppression that handles UMIs and reports informative results.
RESULTS: This paper presents gencore, an efficient tool for duplicate removal and sequencing error suppression in NGS data. The tool clusters the mapped sequencing reads and merges the reads in each cluster to generate a single consensus read. While the consensus read is generated, the random errors introduced by library construction and sequencing can be removed. This error-suppressing feature makes gencore well suited to detecting ultra-low-frequency mutations in deep sequencing data. When UMI technology is applied, gencore can use the UMIs to identify reads derived from the same original DNA fragment. Gencore reports statistical results in both HTML and JSON formats. The HTML report contains many interactive figures plotting coverage and duplication statistics. The JSON report contains all the statistical results and can be parsed by downstream programs.
CONCLUSIONS: Compared with conventional tools such as Picard and SAMtools, gencore greatly reduces mapping mismatches in the output data, which are mostly caused by errors. Compared with newer tools such as UMI-Reducer and UMI-tools, gencore runs much faster, uses less memory, generates better consensus reads and provides simpler interfaces. To the best of our knowledge, gencore is the only duplicate-removal tool that generates both informative HTML and JSON reports. The tool is available at: https://github.com/OpenGene/gencore
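The cluster-then-merge idea described in the abstract can be illustrated with a minimal Python sketch. This is not gencore's implementation: the `MappedRead` type, the clustering key (chromosome, start position, UMI) and the simple majority vote are assumptions made for illustration, and the sketch ignores base qualities, paired-end reads and real alignment file formats.

```python
from collections import Counter, defaultdict
from typing import NamedTuple


class MappedRead(NamedTuple):
    """Hypothetical, simplified stand-in for an aligned read."""
    chrom: str
    start: int
    umi: str
    seq: str


def cluster_reads(reads):
    """Group reads that share mapping position and UMI (assumed key)."""
    clusters = defaultdict(list)
    for read in reads:
        clusters[(read.chrom, read.start, read.umi)].append(read)
    return clusters


def consensus(cluster):
    """Majority vote per position; a tie for first place yields 'N'."""
    length = min(len(read.seq) for read in cluster)
    bases = []
    for i in range(length):
        top = Counter(read.seq[i] for read in cluster).most_common(2)
        if len(top) > 1 and top[0][1] == top[1][1]:
            bases.append("N")
        else:
            bases.append(top[0][0])
    return "".join(bases)


def collapse(reads):
    """Return one consensus sequence per cluster."""
    return {key: consensus(group) for key, group in cluster_reads(reads).items()}


reads = [
    MappedRead("chr1", 100, "ACGT", "TTGACCA"),
    MappedRead("chr1", 100, "ACGT", "TTGACCA"),
    MappedRead("chr1", 100, "ACGT", "TTGTCCA"),  # one read carries a random error
    MappedRead("chr1", 100, "GGCA", "TTGACCA"),  # different UMI, separate cluster
]
result = collapse(reads)
```

In this toy input, the three reads sharing the UMI `ACGT` collapse into one consensus in which the single discordant base is outvoted, while the read with UMI `GGCA` stays in its own cluster even though it maps to the same position.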
format Online
Article
Text
id pubmed-6933617
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-6933617 2019-12-30 Gencore: an efficient tool to generate consensus reads for error suppressing and duplicate removing of NGS data Chen, Shifu Zhou, Yanqing Chen, Yaru Huang, Tanxiao Liao, Wenting Xu, Yun Li, Zhicheng Gu, Jia BMC Bioinformatics Software BioMed Central 2019-12-27 /pmc/articles/PMC6933617/ /pubmed/31881822 http://dx.doi.org/10.1186/s12859-019-3280-9 Text en © The Author(s). 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Gencore: an efficient tool to generate consensus reads for error suppressing and duplicate removing of NGS data
topic Software