
Optimising parallel R correlation matrix calculations on gene expression data using MapReduce

BACKGROUND: High-throughput molecular profiling data has been used to improve clinical decision making by stratifying subjects based on their molecular profiles. Unsupervised clustering algorithms can be used for stratification purposes. However, the current speed of the clustering algorithms cannot...


Bibliographic Details
Main authors:	Wang, Shicai; Pandis, Ioannis; Johnson, David; Emam, Ibrahim; Guitton, Florian; Oehmichen, Axel; Guo, Yike
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2014
Subjects:	Research Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4246436/ https://www.ncbi.nlm.nih.gov/pubmed/25371114 http://dx.doi.org/10.1186/s12859-014-0351-9

author	Wang, Shicai; Pandis, Ioannis; Johnson, David; Emam, Ibrahim; Guitton, Florian; Oehmichen, Axel; Guo, Yike
author_sort	Wang, Shicai
collection	PubMed
description	BACKGROUND: High-throughput molecular profiling data has been used to improve clinical decision making by stratifying subjects based on their molecular profiles. Unsupervised clustering algorithms can be used for stratification purposes. However, the current speed of the clustering algorithms cannot meet the requirements of large-scale molecular data because the correlation matrix calculation performs poorly. With high-throughput sequencing technologies promising to produce even larger datasets per subject, we expect the performance of state-of-the-art statistical algorithms to be further impacted unless efforts towards optimisation are carried out. MapReduce is a widely used high-performance parallel framework that can address this problem. RESULTS: In this paper, we evaluate the current parallel modes for correlation calculation methods and introduce an efficient data distribution and parallel calculation algorithm based on MapReduce to optimise the correlation calculation. We studied the performance of our algorithm using two gene expression benchmarks. In the micro-benchmark, our MapReduce implementation, based on the R package RHIPE, achieves a 3.26-5.83 fold speed-up over the default Snowfall and a 1.56-1.64 fold speed-up over the basic RHIPE for the Euclidean, Pearson and Spearman correlations. Though vanilla R and the optimised Snowfall outperform our optimised RHIPE in the micro-benchmark, they do not scale well to the macro-benchmark. In the macro-benchmark, the optimised RHIPE performs 2.03-16.56 times faster than vanilla R. Benefiting from 3.30-5.13 times faster data preparation, the optimised RHIPE performs 1.22-1.71 times faster than the optimised Snowfall. Both the optimised RHIPE and the optimised Snowfall successfully perform the Kendall correlation on the TCGA dataset within 7 hours. Both run more than 30 times faster than the estimated time for vanilla R. CONCLUSIONS: The performance evaluation found that the new MapReduce algorithm and its implementation in RHIPE outperform vanilla R and the conventional parallel algorithms implemented in R Snowfall. We propose that the MapReduce framework holds great promise for large molecular data analysis, in particular for high-dimensional genomic data such as that demonstrated in the performance evaluation described in this paper. We aim to use this new algorithm as a basis for optimising high-throughput molecular data correlation calculation for Big Data.
format	Online Article Text
id	pubmed-4246436
institution	National Center for Biotechnology Information
language	English
publishDate	2014
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-4246436 2014-12-02 Optimising parallel R correlation matrix calculations on gene expression data using MapReduce Wang, Shicai Pandis, Ioannis Johnson, David Emam, Ibrahim Guitton, Florian Oehmichen, Axel Guo, Yike BMC Bioinformatics Research Article BioMed Central 2014-11-05 /pmc/articles/PMC4246436/ /pubmed/25371114 http://dx.doi.org/10.1186/s12859-014-0351-9 Text en © Wang et al.; licensee BioMed Central Ltd. 2014 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	Optimising parallel R correlation matrix calculations on gene expression data using MapReduce
topic	Research Article
