Big Data in metagenomics: Apache Spark vs MPI


Bibliographic Details
Main authors: Abuín, José M., Lopes, Nuno, Ferreira, Luís, Pena, Tomás F., Schmidt, Bertil
Format: Online Article Text
Language: English
Published: Public Library of Science 2020
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7537910/
https://www.ncbi.nlm.nih.gov/pubmed/33022000
http://dx.doi.org/10.1371/journal.pone.0239741
author Abuín, José M.
Lopes, Nuno
Ferreira, Luís
Pena, Tomás F.
Schmidt, Bertil
collection PubMed
description The progress of next-generation sequencing has led to the availability of massive data sets used by a wide range of applications in biology and medicine. This has sparked significant interest in using modern Big Data technologies to process this large amount of information on distributed-memory clusters of commodity hardware. Several approaches based on solutions such as Apache Hadoop or Apache Spark have been proposed. These solutions allow developers to focus on the problem and ignore low-level details such as data distribution schemes or communication patterns among processing nodes. However, performance and scalability are also of high importance when dealing with increasing problem sizes, which makes High Performance Computing (HPC) technologies such as the Message Passing Interface (MPI) a promising alternative. Recently, MetaCacheSpark, an Apache Spark-based software for the detection and quantification of species composition in food samples, has been proposed. This tool can be used to analyze high-throughput sequencing data sets of metagenomic DNA and can handle large-scale collections of complex eukaryotic and bacterial reference genomes. In this work, we propose MetaCache-MPI, a fast and memory-efficient solution for computing clusters that is based on MPI instead of Apache Spark. To evaluate its performance, we compare the original single-CPU version of MetaCache, the Spark version, and the MPI version we are introducing. Results show that, with 32 processes, MetaCache-MPI is 1.65× faster while consuming 48.12% of the RAM used by Spark for building a metagenomics database. For querying this database, also with 32 processes, the MPI version is 3.11× faster while using 55.56% of the memory used by Spark. We conclude that the new MetaCache-MPI version is faster in both building and querying the database and uses less RAM than MetaCacheSpark, while keeping the accuracy of the original implementation.
format Online
Article
Text
id pubmed-7537910
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-7537910 2020-10-19 Big Data in metagenomics: Apache Spark vs MPI Abuín, José M. Lopes, Nuno Ferreira, Luís Pena, Tomás F. Schmidt, Bertil PLoS One Research Article Public Library of Science 2020-10-06 /pmc/articles/PMC7537910/ /pubmed/33022000 http://dx.doi.org/10.1371/journal.pone.0239741 Text en © 2020 Abuín et al http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title Big Data in metagenomics: Apache Spark vs MPI
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7537910/
https://www.ncbi.nlm.nih.gov/pubmed/33022000
http://dx.doi.org/10.1371/journal.pone.0239741