SparkGA2: Production-quality memory-efficient Apache Spark based genome analysis framework

Bibliographic Details
Main Authors:	Mushtaq, Hamid; Ahmed, Nauman; Al-Ars, Zaid
Format:	Online Article Text
Language:	English
Published:	Public Library of Science 2019
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6894754/ https://www.ncbi.nlm.nih.gov/pubmed/31805063 http://dx.doi.org/10.1371/journal.pone.0224784

Description:	Due to the rapid decrease in the cost of NGS (Next Generation Sequencing), interest has increased in using data generated from NGS to diagnose genetic diseases. However, the data generated by NGS technology is usually on the order of hundreds of gigabytes per experiment, thus requiring efficient and scalable programs to perform data analysis quickly. This paper presents SparkGA2, a memory-efficient, production-quality framework for high-performance DNA analysis in the cloud, which can scale according to the available computational resources by increasing the number of nodes. Our framework uses Apache Spark's ability to cache data in memory to speed up processing, while also allowing the user to run the framework on systems with less memory at the cost of slightly lower performance. To manage the memory footprint, we implement an on-the-fly compression method for intermediate data and reduce memory requirements by up to 3x. Our framework also uses a streaming approach to gradually stream input data as processing is taking place. This makes our framework faster than other state-of-the-art approaches while at the same time allowing users to adapt it to run on clusters with less memory. Compared to the state of the art, SparkGA2 is up to 22% faster on a large big data cluster of 67 nodes and up to 9% faster on a smaller cluster of 6 nodes. Including the streaming solution, where data pre-processing is considered, SparkGA2 is 51% faster on a 6-node cluster. The source code of SparkGA2 is publicly available at https://github.com/HamidMushtaq/SparkGA2.
Journal:	PLoS One
Publication date:	2019-12-05
License:	© 2019 Mushtaq et al. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.