
HALC: High throughput algorithm for long read error correction

BACKGROUND: The third generation PacBio SMRT long reads can effectively address the read length issue of the second generation sequencing technology, but contain approximately 15% sequencing errors. Several error correction algorithms have been designed to efficiently reduce the error rate to 1%, bu...


Bibliographic Details
Main Authors:	Bao, Ergude; Lan, Lingxiao
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2017
Subjects:	Software
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5382505/ https://www.ncbi.nlm.nih.gov/pubmed/28381259 http://dx.doi.org/10.1186/s12859-017-1610-3

_version_	1782520114091917312
author	Bao, Ergude Lan, Lingxiao
author_facet	Bao, Ergude Lan, Lingxiao
author_sort	Bao, Ergude
collection	PubMed
description	BACKGROUND: The third generation PacBio SMRT long reads can effectively address the read length issue of the second generation sequencing technology, but contain approximately 15% sequencing errors. Several error correction algorithms have been designed to efficiently reduce the error rate to 1%, but they discard large amounts of uncorrected bases and thus lead to low throughput. This loss of bases could limit the completeness of downstream assemblies and the accuracy of analysis. RESULTS: Here, we introduce HALC, a high throughput algorithm for long read error correction. HALC aligns the long reads to short read contigs from the same species with a relatively low identity requirement so that a long read region can be aligned to at least one contig region, including its true genome region’s repeats in the contigs sufficiently similar to it (similar repeat based alignment approach). It then constructs a contig graph and, for each long read, references the other long reads’ alignments to find the most accurate alignment and correct it with the aligned contig regions (long read support based validation approach). Even though some long read regions without the true genome regions in the contigs are corrected with their repeats, this approach makes it possible to further refine these long read regions with the initial insufficient short reads and correct the uncorrected regions in between. In our performance tests on E. coli, A. thaliana and Maylandia zebra data sets, HALC was able to obtain 6.7-41.1% higher throughput than the existing algorithms while maintaining comparable accuracy. The HALC corrected long reads can thus result in 11.4-60.7% longer assembled contigs than the existing algorithms. CONCLUSIONS: The HALC software can be downloaded for free from this site: https://github.com/lanl001/halc. 
ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-017-1610-3) contains supplementary material, which is available to authorized users.
format	Online Article Text
id	pubmed-5382505
institution	National Center for Biotechnology Information
language	English
publishDate	2017
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-5382505 2017-04-10 HALC: High throughput algorithm for long read error correction Bao, Ergude Lan, Lingxiao BMC Bioinformatics Software BioMed Central 2017-04-05 /pmc/articles/PMC5382505/ /pubmed/28381259 http://dx.doi.org/10.1186/s12859-017-1610-3 Text en © The Author(s) 2017 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle	Software Bao, Ergude Lan, Lingxiao HALC: High throughput algorithm for long read error correction
title	HALC: High throughput algorithm for long read error correction
title_full	HALC: High throughput algorithm for long read error correction
title_fullStr	HALC: High throughput algorithm for long read error correction
title_full_unstemmed	HALC: High throughput algorithm for long read error correction
title_short	HALC: High throughput algorithm for long read error correction
title_sort	halc: high throughput algorithm for long read error correction
topic	Software
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5382505/ https://www.ncbi.nlm.nih.gov/pubmed/28381259 http://dx.doi.org/10.1186/s12859-017-1610-3
work_keys_str_mv	AT baoergude halchighthroughputalgorithmforlongreaderrorcorrection AT lanlingxiao halchighthroughputalgorithmforlongreaderrorcorrection
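
The abstract in this record describes a "long read support based validation approach", in which each long read's alignment is chosen by referencing the alignments of the other long reads. The following is a minimal toy sketch of that general idea only, not HALC's actual implementation: the function name `pick_supported_alignments`, the input layout (read ID mapped to a list of candidate contig IDs), and the simple vote count are all assumptions made for illustration. The real algorithm works on a contig graph and on alignment regions, which this sketch does not model.

```python
from collections import Counter
from typing import Dict, List


def pick_supported_alignments(candidates: Dict[str, List[str]]) -> Dict[str, str]:
    """Toy selection: for each read, keep the candidate contig that the
    largest number of other reads also align to (hypothetical scheme)."""
    # Count how many distinct reads list each contig among their candidates.
    support: Counter = Counter()
    for contigs in candidates.values():
        support.update(set(contigs))

    chosen: Dict[str, str] = {}
    for read_id, contigs in candidates.items():
        if not contigs:
            continue  # read with no candidate alignment: left out of the result
        # Subtract the read's own vote so that only other reads count.
        # On ties, max() keeps the earliest candidate in the input list.
        chosen[read_id] = max(contigs, key=lambda c: support[c] - 1)
    return chosen


# Hypothetical example data: three aligned reads and one unaligned read.
candidates = {
    "read1": ["contigA", "contigB"],
    "read2": ["contigA"],
    "read3": ["contigB", "contigA"],
    "read4": [],
}
result = pick_supported_alignments(candidates)
print(result)
```

In this toy input, `contigA` is a candidate for three reads and `contigB` for two, so every aligned read is assigned `contigA`, and `read4` is omitted because it has no candidates.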
