SparkEC: speeding up alignment-based DNA error correction tools
BACKGROUND: In recent years, huge improvements have been made in the context of sequencing genomic data under what is called Next Generation Sequencing (NGS). However, the DNA reads generated by current NGS platforms are not free of errors, which can affect the quality of downstream analysis. Althou...
Main authors: | Expósito, Roberto R.; Martínez-Sánchez, Marco; Touriño, Juan |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2022 |
Subjects: | Software |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9639292/ https://www.ncbi.nlm.nih.gov/pubmed/36344928 http://dx.doi.org/10.1186/s12859-022-05013-1 |
_version_ | 1784825604337565696 |
---|---|
author | Expósito, Roberto R.; Martínez-Sánchez, Marco; Touriño, Juan |
author_facet | Expósito, Roberto R.; Martínez-Sánchez, Marco; Touriño, Juan |
author_sort | Expósito, Roberto R. |
collection | PubMed |
description | BACKGROUND: In recent years, huge improvements have been made in the context of sequencing genomic data under what is called Next Generation Sequencing (NGS). However, the DNA reads generated by current NGS platforms are not free of errors, which can affect the quality of downstream analysis. Although error correction can be performed as a preprocessing step to overcome this issue, it usually requires long computational times to analyze the large datasets generated nowadays through NGS. Therefore, new software capable of scaling out on a cluster of nodes with high performance is of great importance. RESULTS: In this paper, we present SparkEC, a parallel tool capable of fixing the errors produced during the sequencing process. For this purpose, the algorithms proposed by the CloudEC tool, which has already been proven to perform accurate corrections, have been analyzed and optimized to improve their performance by relying on the Apache Spark framework together with the introduction of other enhancements such as the usage of memory-efficient data structures and the avoidance of any input preprocessing. The experimental results have shown significant improvements in the computational times of SparkEC when compared to CloudEC for all the representative datasets and scenarios under evaluation, providing average and maximum speedups of 4.9× and 11.9×, respectively, over its counterpart. CONCLUSION: As error correction can take excessive computational time, SparkEC provides a scalable solution for correcting large datasets. Due to its distributed implementation, SparkEC speed can increase with respect to the number of nodes in a cluster. Furthermore, the software is freely available under the GPLv3 license and is compatible with different operating systems (Linux, Windows and macOS). SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12859-022-05013-1. |
format | Online Article Text |
id | pubmed-9639292 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-9639292 2022-11-08 SparkEC: speeding up alignment-based DNA error correction tools Expósito, Roberto R.; Martínez-Sánchez, Marco; Touriño, Juan BMC Bioinformatics Software BioMed Central 2022-11-07 /pmc/articles/PMC9639292/ /pubmed/36344928 http://dx.doi.org/10.1186/s12859-022-05013-1 Text en © The Author(s) 2022 https://creativecommons.org/licenses/by/4.0/ Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data. |
spellingShingle | Software Expósito, Roberto R. Martínez-Sánchez, Marco Touriño, Juan SparkEC: speeding up alignment-based DNA error correction tools |
title | SparkEC: speeding up alignment-based DNA error correction tools |
title_sort | sparkec: speeding up alignment-based dna error correction tools |
topic | Software |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9639292/ https://www.ncbi.nlm.nih.gov/pubmed/36344928 http://dx.doi.org/10.1186/s12859-022-05013-1 |