
SparkRA: Enabling Big Data Scalability for the GATK RNA-seq Pipeline with Apache Spark

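The rapid proliferation of low-cost RNA-seq data has resulted in a growing interest in RNA analysis techniques for various applications, ranging from identifying genotype–phenotype relationships to validating discoveries of other analysis results. However, many practical applications in this field are limited by the available computational resources and associated long computing time needed to perform the analysis. GATK has a popular best practices pipeline specifically designed for variant calling RNA-seq analysis. Some tools in this pipeline are not optimized to scale the analysis to multiple processors or compute nodes efficiently, thereby limiting their ability to process large datasets. In this paper, we present SparkRA, an Apache Spark based pipeline to efficiently scale up the GATK RNA-seq variant calling pipeline on multiple cores in one node or in a large cluster. On a single node with 20 hyper-threaded cores, the original pipeline runs for more than 5 h to process a dataset of 32 GB. In contrast, SparkRA is able to reduce the overall computation time of the pipeline on the same single node by about 4×, reducing the computation time down to 1.3 h. On a cluster with 16 nodes (each with eight single-threaded cores), SparkRA is able to further reduce this computation time by 7.7× compared to a single node. Compared to other scalable state-of-the-art solutions, SparkRA is 1.2× faster while achieving the same accuracy of the results.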

Bibliographic Details
Main Authors:	Al-Ars, Zaid; Wang, Saiyi; Mushtaq, Hamid
Format:	Online Article Text
Language:	English
Published:	MDPI 2020
Subjects:	Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7016739/ https://www.ncbi.nlm.nih.gov/pubmed/31947774 http://dx.doi.org/10.3390/genes11010053

author	Al-Ars, Zaid Wang, Saiyi Mushtaq, Hamid
collection	PubMed
description	The rapid proliferation of low-cost RNA-seq data has resulted in a growing interest in RNA analysis techniques for various applications, ranging from identifying genotype–phenotype relationships to validating discoveries of other analysis results. However, many practical applications in this field are limited by the available computational resources and associated long computing time needed to perform the analysis. GATK has a popular best practices pipeline specifically designed for variant calling RNA-seq analysis. Some tools in this pipeline are not optimized to scale the analysis to multiple processors or compute nodes efficiently, thereby limiting their ability to process large datasets. In this paper, we present SparkRA, an Apache Spark based pipeline to efficiently scale up the GATK RNA-seq variant calling pipeline on multiple cores in one node or in a large cluster. On a single node with 20 hyper-threaded cores, the original pipeline runs for more than 5 h to process a dataset of 32 GB. In contrast, SparkRA is able to reduce the overall computation time of the pipeline on the same single node by about 4×, reducing the computation time down to 1.3 h. On a cluster with 16 nodes (each with eight single-threaded cores), SparkRA is able to further reduce this computation time by 7.7× compared to a single node. Compared to other scalable state-of-the-art solutions, SparkRA is 1.2× faster while achieving the same accuracy of the results.
format	Online Article Text
id	pubmed-7016739
institution	National Center for Biotechnology Information
language	English
publishDate	2020
publisher	MDPI
record_format	MEDLINE/PubMed
spelling	pubmed-70167392020-02-28 SparkRA: Enabling Big Data Scalability for the GATK RNA-seq Pipeline with Apache Spark Al-Ars, Zaid Wang, Saiyi Mushtaq, Hamid Genes (Basel) Article The rapid proliferation of low-cost RNA-seq data has resulted in a growing interest in RNA analysis techniques for various applications, ranging from identifying genotype–phenotype relationships to validating discoveries of other analysis results. However, many practical applications in this field are limited by the available computational resources and associated long computing time needed to perform the analysis. GATK has a popular best practices pipeline specifically designed for variant calling RNA-seq analysis. Some tools in this pipeline are not optimized to scale the analysis to multiple processors or compute nodes efficiently, thereby limiting their ability to process large datasets. In this paper, we present SparkRA, an Apache Spark based pipeline to efficiently scale up the GATK RNA-seq variant calling pipeline on multiple cores in one node or in a large cluster. On a single node with 20 hyper-threaded cores, the original pipeline runs for more than 5 h to process a dataset of 32 GB. In contrast, SparkRA is able to reduce the overall computation time of the pipeline on the same single node by about 4×, reducing the computation time down to 1.3 h. On a cluster with 16 nodes (each with eight single-threaded cores), SparkRA is able to further reduce this computation time by 7.7× compared to a single node. Compared to other scalable state-of-the-art solutions, SparkRA is 1.2× faster while achieving the same accuracy of the results. MDPI 2020-01-03 /pmc/articles/PMC7016739/ /pubmed/31947774 http://dx.doi.org/10.3390/genes11010053 Text en © 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
title	SparkRA: Enabling Big Data Scalability for the GATK RNA-seq Pipeline with Apache Spark
topic	Article
