Illumina error correction near highly repetitive DNA regions improves de novo genome assembly

BACKGROUND: Several standalone error correction tools have been proposed to correct sequencing errors in Illumina data in order to facilitate de novo genome assembly. However, in a recent survey, we showed that state-of-the-art assemblers often did not benefit from this pre-correction step. We found...

Bibliographic Details
Main authors: Heydari, Mahdi; Miclotte, Giles; Van de Peer, Yves; Fostier, Jan
Format: Online Article Text
Language: English
Published: BioMed Central 2019
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6545690/
https://www.ncbi.nlm.nih.gov/pubmed/31159722
http://dx.doi.org/10.1186/s12859-019-2906-2
_version_ 1783423427437133824
author Heydari, Mahdi
Miclotte, Giles
Van de Peer, Yves
Fostier, Jan
collection PubMed
description BACKGROUND: Several standalone error correction tools have been proposed to correct sequencing errors in Illumina data in order to facilitate de novo genome assembly. However, in a recent survey, we showed that state-of-the-art assemblers often did not benefit from this pre-correction step. We found that many error correction tools introduce new errors in reads that overlap highly repetitive DNA regions such as low-complexity patterns or short homopolymers, ultimately leading to a more fragmented assembly.
RESULTS: We propose BrownieCorrector, an error correction tool for Illumina sequencing data that focuses on the correction of only those reads that overlap short DNA patterns that are highly repetitive in the genome. BrownieCorrector extracts all reads that contain such a pattern and clusters them into different groups using a community detection algorithm that takes into account both the sequence similarity between overlapping reads and their respective paired-end reads. Each cluster holds reads that originate from the same genomic region and hence each cluster can be corrected individually, thus providing a consistent correction for all reads within that cluster.
CONCLUSIONS: BrownieCorrector is benchmarked using six real Illumina datasets for different eukaryotic genomes. The prior use of BrownieCorrector improves assembly results over the use of uncorrected reads in all cases. In comparison with other error correction tools, BrownieCorrector leads to the best assembly results in most cases even though less than 2% of the reads within a dataset are corrected. Additionally, we investigate the impact of error correction on hybrid assembly where the corrected Illumina reads are supplemented with PacBio data. Our results confirm that BrownieCorrector improves the quality of hybrid genome assembly as well. BrownieCorrector is written in standard C++11 and released under the GPL license. BrownieCorrector relies on multithreading to take advantage of multi-core/multi-CPU systems. The source code is available at https://github.com/biointec/browniecorrector.
ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1186/s12859-019-2906-2) contains supplementary material, which is available to authorized users.
format Online
Article
Text
id pubmed-6545690
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-6545690 2019-06-06 Illumina error correction near highly repetitive DNA regions improves de novo genome assembly Heydari, Mahdi Miclotte, Giles Van de Peer, Yves Fostier, Jan BMC Bioinformatics Research Article BioMed Central 2019-06-03 /pmc/articles/PMC6545690/ /pubmed/31159722 http://dx.doi.org/10.1186/s12859-019-2906-2 Text en © The Author(s) 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Illumina error correction near highly repetitive DNA regions improves de novo genome assembly
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6545690/
https://www.ncbi.nlm.nih.gov/pubmed/31159722
http://dx.doi.org/10.1186/s12859-019-2906-2