
Optimising parallel R correlation matrix calculations on gene expression data using MapReduce

BACKGROUND: High-throughput molecular profiling data has been used to improve clinical decision making by stratifying subjects based on their molecular profiles. Unsupervised clustering algorithms can be used for stratification purposes. However, the current speed of the clustering algorithms cannot...


Bibliographic Details
Main Authors: Wang, Shicai, Pandis, Ioannis, Johnson, David, Emam, Ibrahim, Guitton, Florian, Oehmichen, Axel, Guo, Yike
Format: Online Article Text
Language: English
Published: BioMed Central 2014
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4246436/
https://www.ncbi.nlm.nih.gov/pubmed/25371114
http://dx.doi.org/10.1186/s12859-014-0351-9
author Wang, Shicai
Pandis, Ioannis
Johnson, David
Emam, Ibrahim
Guitton, Florian
Oehmichen, Axel
Guo, Yike
collection PubMed
description BACKGROUND: High-throughput molecular profiling data has been used to improve clinical decision making by stratifying subjects based on their molecular profiles. Unsupervised clustering algorithms can be used for stratification purposes. However, the current speed of the clustering algorithms cannot meet the requirements of large-scale molecular data because the correlation matrix calculation performs poorly. With high-throughput sequencing technologies promising to produce even larger datasets per subject, we expect the performance of state-of-the-art statistical algorithms to be further impacted unless efforts towards optimisation are carried out. MapReduce is a widely used high-performance parallel framework that can address this problem.
RESULTS: In this paper, we evaluate the current parallel modes for correlation calculation methods and introduce an efficient data distribution and parallel calculation algorithm based on MapReduce to optimise the correlation calculation. We studied the performance of our algorithm using two gene expression benchmarks. In the micro-benchmark, our implementation using MapReduce, based on the R package RHIPE, demonstrates a 3.26-5.83 fold performance increase compared to the default Snowfall and a 1.56-1.64 fold increase compared to the basic RHIPE in the Euclidean, Pearson and Spearman correlations. Though vanilla R and the optimised Snowfall outperform our optimised RHIPE in the micro-benchmark, they do not scale well to the macro-benchmark. In the macro-benchmark, the optimised RHIPE performs 2.03-16.56 times faster than vanilla R. Benefiting from 3.30-5.13 times faster data preparation, the optimised RHIPE performs 1.22-1.71 times faster than the optimised Snowfall. Both the optimised RHIPE and the optimised Snowfall successfully perform the Kendall correlation on the TCGA dataset within 7 hours. Both run more than 30 times faster than the estimated run time of vanilla R.
CONCLUSIONS: The performance evaluation found that the new MapReduce algorithm and its implementation in RHIPE outperform vanilla R and the conventional parallel algorithms implemented in R Snowfall. We propose that the MapReduce framework holds great promise for large molecular data analysis, in particular for high-dimensional genomic data such as that demonstrated in the performance evaluation described in this paper. We aim to use this new algorithm as a basis for optimising high-throughput molecular data correlation calculation for Big Data.
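The abstract does not spell out the data distribution scheme, so the following is only a minimal sketch of the general idea of a MapReduce-style correlation matrix, not the authors' RHIPE implementation. It is written in Python rather than R, and every function name in it (`pearson`, `split_rows`, `map_block_pair`, `reduce_to_matrix`, `correlation_matrix`) is hypothetical. The assumption is that rows (for example, gene expression profiles) are split into blocks, each map task correlates one pair of blocks, and the reduce step assembles the symmetric matrix.

```python
from itertools import combinations_with_replacement
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def split_rows(n_rows, block_size):
    """Partition row indices 0..n_rows-1 into contiguous blocks."""
    return [list(range(start, min(start + block_size, n_rows)))
            for start in range(0, n_rows, block_size)]


def map_block_pair(task):
    """Map step: correlate rows of block A with rows of block B (upper triangle only)."""
    rows, block_a, block_b = task
    return [((i, j), pearson(rows[i], rows[j]))
            for i in block_a for j in block_b if i <= j]


def reduce_to_matrix(mapped, n_rows):
    """Reduce step: place each emitted (i, j) value and its mirror (j, i)."""
    matrix = [[0.0] * n_rows for _ in range(n_rows)]
    for pairs in mapped:
        for (i, j), r in pairs:
            matrix[i][j] = matrix[j][i] = r
    return matrix


def correlation_matrix(rows, block_size=2):
    """Build the full correlation matrix from independent block-pair tasks."""
    blocks = split_rows(len(rows), block_size)
    # The matrix is symmetric, so only block pairs (A, B) with A at or before B are needed.
    tasks = [(rows, a, b) for a, b in combinations_with_replacement(blocks, 2)]
    # Tasks are independent; a worker pool's map could be substituted for the built-in map.
    mapped = map(map_block_pair, tasks)
    return reduce_to_matrix(mapped, len(rows))


# Toy input: four "genes" measured across four "samples" (illustrative values only).
example_rows = [[1, 2, 3, 4], [2, 4, 6, 8], [4, 3, 2, 1], [1, 3, 2, 4]]
example_matrix = correlation_matrix(example_rows, block_size=2)
```

Because each block pair is an independent task, the number of tasks grows with the square of the number of blocks, which is why the choice of block size and the cost of shipping data to workers matter in a real cluster setting.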
format Online
Article
Text
id pubmed-4246436
institution National Center for Biotechnology Information
language English
publishDate 2014
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-4246436 2014-12-02 BMC Bioinformatics Research Article BioMed Central 2014-11-05 /pmc/articles/PMC4246436/ /pubmed/25371114 http://dx.doi.org/10.1186/s12859-014-0351-9 Text en
© Wang et al.; licensee BioMed Central Ltd. 2014 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Optimising parallel R correlation matrix calculations on gene expression data using MapReduce
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4246436/
https://www.ncbi.nlm.nih.gov/pubmed/25371114
http://dx.doi.org/10.1186/s12859-014-0351-9