A hybrid and scalable error correction algorithm for indel and substitution errors of long reads

Bibliographic Details
Main authors:	Das, Arghya Kusum; Goswami, Sayan; Lee, Kisung; Park, Seung-Jong
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2019
Subjects:	Research
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6923905/ https://www.ncbi.nlm.nih.gov/pubmed/31856721 http://dx.doi.org/10.1186/s12864-019-6286-9

author	Das, Arghya Kusum; Goswami, Sayan; Lee, Kisung; Park, Seung-Jong
collection	PubMed
description	BACKGROUND: Long-read sequencing has shown promise in overcoming the short-length limitations of second-generation sequencing by providing more complete assemblies. However, computation on long reads is hampered by their higher error rates (e.g., 13% vs. 1%) and higher cost ($0.3 vs. $0.03 per Mbp) compared with short reads. METHODS: In this paper, we present a new hybrid error correction tool called ParLECH (Parallel Long-read Error Correction using Hybrid methodology). The error correction algorithm of ParLECH is distributed in nature and efficiently uses the k-mer coverage information of high-throughput Illumina short-read sequences to rectify PacBio long-read sequences. ParLECH first constructs a de Bruijn graph from the short reads, and then replaces the indel error regions of the long reads with their corresponding widest path (or maximum min-coverage path) in the short-read-based de Bruijn graph. ParLECH then uses the k-mer coverage information of the short reads to divide each long read into a sequence of low- and high-coverage regions, followed by majority voting to rectify each substitution error. RESULTS: ParLECH outperforms the latest state-of-the-art hybrid error correction methods on real PacBio datasets. Our experimental evaluation demonstrates that ParLECH can correct large-scale real-world datasets in an accurate and scalable manner. ParLECH can correct the indel errors of human genome PacBio long reads (312 GB) with Illumina short reads (452 GB) in less than 29 h using 128 compute nodes. With ParLECH, more than 92% of the bases of an E. coli PacBio dataset can be aligned with the reference genome, demonstrating its accuracy. CONCLUSION: ParLECH can scale to terabytes of sequencing data using hundreds of computing nodes. The proposed hybrid error correction methodology is novel and rectifies both indel and substitution errors, whether present in the original long reads or newly introduced by the short reads.
format	Online Article Text
id	pubmed-6923905
institution	National Center for Biotechnology Information
language	English
publishDate	2019
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-6923905 2019-12-30 A hybrid and scalable error correction algorithm for indel and substitution errors of long reads Das, Arghya Kusum Goswami, Sayan Lee, Kisung Park, Seung-Jong BMC Genomics Research BioMed Central 2019-12-20 /pmc/articles/PMC6923905/ /pubmed/31856721 http://dx.doi.org/10.1186/s12864-019-6286-9 Text en © The Author(s) 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	A hybrid and scalable error correction algorithm for indel and substitution errors of long reads
topic	Research
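The abstract describes replacing indel error regions with the widest path (maximum min-coverage path) in the short-read de Bruijn graph. As a minimal sketch of that idea only, and not the ParLECH implementation, the following finds a widest path with a Dijkstra-style search. The function name `widest_path`, the adjacency-list layout, the toy graph, and the choice to attach coverage to edges with positive weights are all assumptions made for illustration.

```python
import heapq


def widest_path(graph, source, target):
    """Return (bottleneck, path) for the path whose minimum weight is largest.

    graph: dict mapping node -> list of (neighbor, weight) pairs.
    Weights stand in for k-mer coverage and are assumed to be positive.
    Returns (0, []) when target is unreachable from source.
    """
    best = {source: float("inf")}      # best known bottleneck per node
    parent = {source: None}            # back-pointers for path recovery
    heap = [(-best[source], source)]   # max-heap via negated bottleneck

    while heap:
        neg_width, node = heapq.heappop(heap)
        width = -neg_width
        if width < best.get(node, 0):
            continue                   # stale heap entry, already improved
        if node == target:
            break                      # widest bottleneck to target is final
        for neighbor, weight in graph.get(node, []):
            candidate = min(width, weight)
            if candidate > best.get(neighbor, 0):
                best[neighbor] = candidate
                parent[neighbor] = node
                heapq.heappush(heap, (-candidate, neighbor))

    if target not in best:
        return 0, []

    path = []
    node = target
    while node is not None:
        path.append(node)
        node = parent[node]
    return best[target], path[::-1]


# Hypothetical toy graph: two routes from "s" to "t".
# Via "a" the weakest edge has coverage 2; via "b" it has coverage 4.
toy_graph = {
    "s": [("a", 10), ("b", 5)],
    "a": [("t", 2)],
    "b": [("t", 4)],
}

bottleneck, path = widest_path(toy_graph, "s", "t")
print(bottleneck, path)
```

In the toy graph the route through "b" is chosen even though its first edge has lower coverage than the edge to "a", because only the weakest link on each route determines its width.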
