
Big Data in metagenomics: Apache Spark vs MPI

The progress of next-generation sequencing has led to the availability of massive data sets used by a wide range of applications in biology and medicine. This has sparked significant interest in using modern Big Data technologies to process this large amount of information in distributed memory clu...


Bibliographic Details
Main Authors:	Abuín, José M., Lopes, Nuno, Ferreira, Luís, Pena, Tomás F., Schmidt, Bertil
Format:	Online Article Text
Language:	English
Published:	Public Library of Science 2020
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7537910/ https://www.ncbi.nlm.nih.gov/pubmed/33022000 http://dx.doi.org/10.1371/journal.pone.0239741

author	Abuín, José M., Lopes, Nuno, Ferreira, Luís, Pena, Tomás F., Schmidt, Bertil
author_sort	Abuín, José M.
collection	PubMed
description	The progress of next-generation sequencing has led to the availability of massive data sets used by a wide range of applications in biology and medicine. This has sparked significant interest in using modern Big Data technologies to process this large amount of information on distributed-memory clusters of commodity hardware. Several approaches based on solutions such as Apache Hadoop or Apache Spark have been proposed. These solutions allow developers to focus on the problem while ignoring low-level details such as data distribution schemes or communication patterns among processing nodes. However, performance and scalability are also highly important when dealing with increasing problem sizes, which makes High Performance Computing (HPC) technologies such as the Message Passing Interface (MPI) a promising alternative. Recently, MetaCacheSpark, an Apache Spark-based tool for detecting and quantifying the species composition of food samples, has been proposed. This tool can be used to analyze high-throughput sequencing data sets of metagenomic DNA and can handle large-scale collections of complex eukaryotic and bacterial reference genomes. In this work, we propose MetaCache-MPI, a fast and memory-efficient solution for computing clusters that is based on MPI instead of Apache Spark. To evaluate its performance, we compare the original single-CPU version of MetaCache, the Spark version, and the MPI version introduced here. Results show that, with 32 processes, MetaCache-MPI is 1.65× faster while consuming 48.12% of the RAM used by Spark when building a metagenomics database. When querying this database, also with 32 processes, the MPI version is 3.11× faster while using 55.56% of the memory used by Spark. We conclude that the new MetaCache-MPI version is faster in both building and querying the database and uses less RAM than MetaCacheSpark, while keeping the accuracy of the original implementation.
format	Online Article Text
id	pubmed-7537910
institution	National Center for Biotechnology Information
language	English
publishDate	2020
publisher	Public Library of Science
record_format	MEDLINE/PubMed
spelling	pubmed-7537910 2020-10-19 Big Data in metagenomics: Apache Spark vs MPI Abuín, José M., Lopes, Nuno, Ferreira, Luís, Pena, Tomás F., Schmidt, Bertil PLoS One Research Article The progress of next-generation sequencing has led to the availability of massive data sets used by a wide range of applications in biology and medicine. This has sparked significant interest in using modern Big Data technologies to process this large amount of information on distributed-memory clusters of commodity hardware. Several approaches based on solutions such as Apache Hadoop or Apache Spark have been proposed. These solutions allow developers to focus on the problem while ignoring low-level details such as data distribution schemes or communication patterns among processing nodes. However, performance and scalability are also highly important when dealing with increasing problem sizes, which makes High Performance Computing (HPC) technologies such as the Message Passing Interface (MPI) a promising alternative. Recently, MetaCacheSpark, an Apache Spark-based tool for detecting and quantifying the species composition of food samples, has been proposed. This tool can be used to analyze high-throughput sequencing data sets of metagenomic DNA and can handle large-scale collections of complex eukaryotic and bacterial reference genomes. In this work, we propose MetaCache-MPI, a fast and memory-efficient solution for computing clusters that is based on MPI instead of Apache Spark. To evaluate its performance, we compare the original single-CPU version of MetaCache, the Spark version, and the MPI version introduced here. Results show that, with 32 processes, MetaCache-MPI is 1.65× faster while consuming 48.12% of the RAM used by Spark when building a metagenomics database. When querying this database, also with 32 processes, the MPI version is 3.11× faster while using 55.56% of the memory used by Spark. We conclude that the new MetaCache-MPI version is faster in both building and querying the database and uses less RAM than MetaCacheSpark, while keeping the accuracy of the original implementation. Public Library of Science 2020-10-06 /pmc/articles/PMC7537910/ /pubmed/33022000 http://dx.doi.org/10.1371/journal.pone.0239741 Text en © 2020 Abuín et al http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title	Big Data in metagenomics: Apache Spark vs MPI
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7537910/ https://www.ncbi.nlm.nih.gov/pubmed/33022000 http://dx.doi.org/10.1371/journal.pone.0239741

