EC: an efficient error correction algorithm for short reads

BACKGROUND: In highly parallel next-generation sequencing (NGS) techniques millions to billions of short reads are produced from a genomic sequence in a single run. Due to the limitation of the NGS technologies, there could be errors in the reads. The error rate of the reads can be reduced with trim...

Bibliographic Details
Main authors: Saha, Subrata; Rajasekaran, Sanguthevar
Format: Online Article Text
Language: English
Published: BioMed Central 2015
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4674864/
https://www.ncbi.nlm.nih.gov/pubmed/26678663
http://dx.doi.org/10.1186/1471-2105-16-S17-S2
_version_ 1782404962703114240
author Saha, Subrata
Rajasekaran, Sanguthevar
author_facet Saha, Subrata
Rajasekaran, Sanguthevar
author_sort Saha, Subrata
collection PubMed
description BACKGROUND: In highly parallel next-generation sequencing (NGS) techniques, millions to billions of short reads are produced from a genomic sequence in a single run. Due to the limitations of NGS technologies, the reads may contain errors. The error rate of the reads can be reduced by trimming and by correcting the erroneous bases of the reads. Correction helps to achieve high-quality data, and the computational complexity of many biological applications will be greatly reduced if the reads are corrected first. We have developed a novel error correction algorithm called EC and compared it with four other state-of-the-art algorithms using both real and simulated sequencing reads. RESULTS: Extensive and rigorous experiments reveal that EC is an effective, scalable, and efficient error correction tool. The real reads employed in our performance evaluation are Illumina-generated short reads of various lengths. The six experimental datasets are taken from the Sequence Read Archive (SRA) at NCBI. The simulated reads are obtained by picking substrings from random positions of reference genomes. To introduce errors, some of the bases of the simulated reads are changed to other bases with some probability. CONCLUSIONS: Error correction is a vital problem in biology, especially for NGS data. In this paper we present a novel algorithm, called Error Corrector (EC), for correcting substitution errors in biological sequencing reads. We plan to investigate whether the techniques introduced in this paper can also handle insertion and deletion errors. SOFTWARE AVAILABILITY: The implementation is freely available for non-commercial purposes. It can be downloaded from: http://engr.uconn.edu/~rajasek/EC.zip.
format Online
Article
Text
id pubmed-4674864
institution National Center for Biotechnology Information
language English
publishDate 2015
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-4674864 2015-12-15 EC: an efficient error correction algorithm for short reads Saha, Subrata Rajasekaran, Sanguthevar BMC Bioinformatics Research
BioMed Central 2015-12-07 /pmc/articles/PMC4674864/ /pubmed/26678663 http://dx.doi.org/10.1186/1471-2105-16-S17-S2 Text en Copyright © 2015 Saha and Rajasekaran http://creativecommons.org/licenses/by/4.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle Research
Saha, Subrata
Rajasekaran, Sanguthevar
EC: an efficient error correction algorithm for short reads
title EC: an efficient error correction algorithm for short reads
title_full EC: an efficient error correction algorithm for short reads
title_fullStr EC: an efficient error correction algorithm for short reads
title_full_unstemmed EC: an efficient error correction algorithm for short reads
title_short EC: an efficient error correction algorithm for short reads
title_sort ec: an efficient error correction algorithm for short reads
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4674864/
https://www.ncbi.nlm.nih.gov/pubmed/26678663
http://dx.doi.org/10.1186/1471-2105-16-S17-S2
work_keys_str_mv AT sahasubrata ecanefficienterrorcorrectionalgorithmforshortreads
AT rajasekaransanguthevar ecanefficienterrorcorrectionalgorithmforshortreads
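
The abstract describes the simulated reads only in outline: substrings are picked from random positions of a reference genome, and some bases are then changed to other bases with some probability. The sketch below illustrates that idea under stated assumptions and is not the authors' implementation. The function name `simulate_reads`, its parameters, the uniform choice of start position, and the single per-base `error_rate` are all hypothetical; the record does not give the actual error model or probabilities.

```python
import random

BASES = "ACGT"


def simulate_reads(reference, num_reads, read_length, error_rate, seed=None):
    """Return a list of (start, true_read, noisy_read) tuples.

    Assumptions (not taken from the paper): start positions are uniform,
    and each base is independently replaced by a different base with
    probability `error_rate` (substitution errors only).
    """
    if read_length > len(reference):
        raise ValueError("read_length exceeds reference length")
    rng = random.Random(seed)
    reads = []
    for _ in range(num_reads):
        # Pick a substring from a random position of the reference.
        start = rng.randrange(len(reference) - read_length + 1)
        true_read = reference[start:start + read_length]
        noisy = []
        for base in true_read:
            if rng.random() < error_rate:
                # Substitute with one of the other bases, chosen uniformly.
                noisy.append(rng.choice([b for b in BASES if b != base]))
            else:
                noisy.append(base)
        reads.append((start, true_read, "".join(noisy)))
    return reads


if __name__ == "__main__":
    genome = "ACGTTGCAAGCTTAGGCTAACGTTAGC" * 4  # toy reference, not real data
    for start, true_read, noisy_read in simulate_reads(genome, 3, 12, 0.05, seed=0):
        print(start, true_read, noisy_read)
```

Keeping the error-free read next to the noisy one makes it possible to score a corrector afterwards by comparing its output with `true_read` position by position.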