Big Data in metagenomics: Apache Spark vs MPI
The progress of next-generation sequencing has led to the availability of massive data sets used by a wide range of applications in biology and medicine. This has sparked significant interest in using modern Big Data technologies to process this large amount of information in distributed memory clu...
Main authors: | Abuín, José M.; Lopes, Nuno; Ferreira, Luís; Pena, Tomás F.; Schmidt, Bertil |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Public Library of Science 2020 |
Subjects: | Research Article |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7537910/ https://www.ncbi.nlm.nih.gov/pubmed/33022000 http://dx.doi.org/10.1371/journal.pone.0239741 |
_version_ | 1783590760298315776 |
---|---|
author | Abuín, José M. Lopes, Nuno Ferreira, Luís Pena, Tomás F. Schmidt, Bertil |
author_sort | Abuín, José M. |
collection | PubMed |
description | The progress of next-generation sequencing has led to the availability of massive data sets used by a wide range of applications in biology and medicine. This has sparked significant interest in using modern Big Data technologies to process this large amount of information in distributed-memory clusters of commodity hardware. Several approaches based on solutions such as Apache Hadoop or Apache Spark have been proposed. These solutions let developers focus on the problem itself and ignore low-level details such as data distribution schemes or communication patterns among processing nodes. However, performance and scalability are also highly important as problem sizes increase, which makes High Performance Computing (HPC) technologies such as the Message Passing Interface (MPI) a promising alternative. Recently, MetaCacheSpark, an Apache Spark-based software tool for the detection and quantification of species composition in food samples, has been proposed. This tool can be used to analyze high-throughput sequencing data sets of metagenomic DNA and can handle large-scale collections of complex eukaryotic and bacterial reference genomes. In this work, we propose MetaCache-MPI, a fast and memory-efficient solution for computing clusters that is based on MPI instead of Apache Spark. To evaluate its performance, we compare the original single-CPU version of MetaCache, the Spark version, and the MPI version introduced here. Results show that, with 32 processes, MetaCache-MPI is 1.65× faster while consuming 48.12% of the RAM used by Spark when building a metagenomics database. When querying this database, also with 32 processes, the MPI version is 3.11× faster while using 55.56% of the memory used by Spark. We conclude that the new MetaCache-MPI version is faster than MetaCacheSpark in both building and querying the database and uses less RAM, while keeping the accuracy of the original implementation. |
format | Online Article Text |
id | pubmed-7537910 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2020 |
publisher | Public Library of Science |
record_format | MEDLINE/PubMed |
spelling | pubmed-7537910 2020-10-19 Big Data in metagenomics: Apache Spark vs MPI Abuín, José M. Lopes, Nuno Ferreira, Luís Pena, Tomás F. Schmidt, Bertil PLoS One Research Article Public Library of Science 2020-10-06 /pmc/articles/PMC7537910/ /pubmed/33022000 http://dx.doi.org/10.1371/journal.pone.0239741 Text en © 2020 Abuín et al http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. |
title | Big Data in metagenomics: Apache Spark vs MPI |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7537910/ https://www.ncbi.nlm.nih.gov/pubmed/33022000 http://dx.doi.org/10.1371/journal.pone.0239741 |