Optimising parallel R correlation matrix calculations on gene expression data using MapReduce
BACKGROUND: High-throughput molecular profiling data has been used to improve clinical decision making by stratifying subjects based on their molecular profiles. Unsupervised clustering algorithms can be used for stratification purposes. However, the current speed of the clustering algorithms cannot...
Main authors: | Wang, Shicai; Pandis, Ioannis; Johnson, David; Emam, Ibrahim; Guitton, Florian; Oehmichen, Axel; Guo, Yike |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2014 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4246436/ https://www.ncbi.nlm.nih.gov/pubmed/25371114 http://dx.doi.org/10.1186/s12859-014-0351-9 |
_version_ | 1782346513118134272 |
---|---|
author | Wang, Shicai Pandis, Ioannis Johnson, David Emam, Ibrahim Guitton, Florian Oehmichen, Axel Guo, Yike |
author_facet | Wang, Shicai Pandis, Ioannis Johnson, David Emam, Ibrahim Guitton, Florian Oehmichen, Axel Guo, Yike |
author_sort | Wang, Shicai |
collection | PubMed |
description | BACKGROUND: High-throughput molecular profiling data has been used to improve clinical decision making by stratifying subjects based on their molecular profiles. Unsupervised clustering algorithms can be used for stratification purposes. However, the current speed of the clustering algorithms cannot meet the requirements of large-scale molecular data due to the poor performance of the correlation matrix calculation. With high-throughput sequencing technologies promising to produce even larger datasets per subject, we expect the performance of the state-of-the-art statistical algorithms to be further impacted unless efforts towards optimisation are carried out. MapReduce is a widely used high performance parallel framework that can solve the problem. RESULTS: In this paper, we evaluate the current parallel modes for correlation calculation methods and introduce an efficient data distribution and parallel calculation algorithm based on MapReduce to optimise the correlation calculation. We studied the performance of our algorithm using two gene expression benchmarks. In the micro-benchmark, our implementation using MapReduce, based on the R package RHIPE, demonstrates a 3.26-5.83 fold performance increase compared to the default Snowfall and a 1.56-1.64 fold performance increase compared to the basic RHIPE in the Euclidean, Pearson and Spearman correlations. Though vanilla R and the optimised Snowfall outperform our optimised RHIPE in the micro-benchmark, they do not scale well with the macro-benchmark. In the macro-benchmark the optimised RHIPE performs 2.03-16.56 times faster than vanilla R. Benefiting from the 3.30-5.13 times faster data preparation, the optimised RHIPE performs 1.22-1.71 times faster than the optimised Snowfall. Both the optimised RHIPE and the optimised Snowfall successfully perform the Kendall correlation on the TCGA dataset within 7 hours. Both run more than 30 times faster than the estimated vanilla R time. 
CONCLUSIONS: The performance evaluation found that the new MapReduce algorithm and its implementation in RHIPE outperform vanilla R and the conventional parallel algorithms implemented in R Snowfall. We propose that the MapReduce framework holds great promise for large molecular data analysis, in particular for high-dimensional genomic data such as that demonstrated in the performance evaluation described in this paper. We aim to use this new algorithm as a basis for optimising high-throughput molecular data correlation calculation for Big Data. |
format | Online Article Text |
id | pubmed-4246436 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2014 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-4246436 2014-12-02 Optimising parallel R correlation matrix calculations on gene expression data using MapReduce Wang, Shicai Pandis, Ioannis Johnson, David Emam, Ibrahim Guitton, Florian Oehmichen, Axel Guo, Yike BMC Bioinformatics Research Article BACKGROUND: High-throughput molecular profiling data has been used to improve clinical decision making by stratifying subjects based on their molecular profiles. Unsupervised clustering algorithms can be used for stratification purposes. However, the current speed of the clustering algorithms cannot meet the requirements of large-scale molecular data due to the poor performance of the correlation matrix calculation. With high-throughput sequencing technologies promising to produce even larger datasets per subject, we expect the performance of the state-of-the-art statistical algorithms to be further impacted unless efforts towards optimisation are carried out. MapReduce is a widely used high performance parallel framework that can solve the problem. RESULTS: In this paper, we evaluate the current parallel modes for correlation calculation methods and introduce an efficient data distribution and parallel calculation algorithm based on MapReduce to optimise the correlation calculation. We studied the performance of our algorithm using two gene expression benchmarks. In the micro-benchmark, our implementation using MapReduce, based on the R package RHIPE, demonstrates a 3.26-5.83 fold performance increase compared to the default Snowfall and a 1.56-1.64 fold performance increase compared to the basic RHIPE in the Euclidean, Pearson and Spearman correlations. Though vanilla R and the optimised Snowfall outperform our optimised RHIPE in the micro-benchmark, they do not scale well with the macro-benchmark. In the macro-benchmark the optimised RHIPE performs 2.03-16.56 times faster than vanilla R. Benefiting from the 3.30-5.13 times faster data preparation, the optimised RHIPE performs 1.22-1.71 times faster than the optimised Snowfall. 
Both the optimised RHIPE and the optimised Snowfall successfully perform the Kendall correlation on the TCGA dataset within 7 hours. Both run more than 30 times faster than the estimated vanilla R time. CONCLUSIONS: The performance evaluation found that the new MapReduce algorithm and its implementation in RHIPE outperform vanilla R and the conventional parallel algorithms implemented in R Snowfall. We propose that the MapReduce framework holds great promise for large molecular data analysis, in particular for high-dimensional genomic data such as that demonstrated in the performance evaluation described in this paper. We aim to use this new algorithm as a basis for optimising high-throughput molecular data correlation calculation for Big Data. BioMed Central 2014-11-05 /pmc/articles/PMC4246436/ /pubmed/25371114 http://dx.doi.org/10.1186/s12859-014-0351-9 Text en © Wang et al.; licensee BioMed Central Ltd. 2014 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. |
spellingShingle | Research Article Wang, Shicai Pandis, Ioannis Johnson, David Emam, Ibrahim Guitton, Florian Oehmichen, Axel Guo, Yike Optimising parallel R correlation matrix calculations on gene expression data using MapReduce |
title | Optimising parallel R correlation matrix calculations on gene expression data using MapReduce |
title_full | Optimising parallel R correlation matrix calculations on gene expression data using MapReduce |
title_fullStr | Optimising parallel R correlation matrix calculations on gene expression data using MapReduce |
title_full_unstemmed | Optimising parallel R correlation matrix calculations on gene expression data using MapReduce |
title_short | Optimising parallel R correlation matrix calculations on gene expression data using MapReduce |
title_sort | optimising parallel r correlation matrix calculations on gene expression data using mapreduce |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4246436/ https://www.ncbi.nlm.nih.gov/pubmed/25371114 http://dx.doi.org/10.1186/s12859-014-0351-9 |
work_keys_str_mv | AT wangshicai optimisingparallelrcorrelationmatrixcalculationsongeneexpressiondatausingmapreduce AT pandisioannis optimisingparallelrcorrelationmatrixcalculationsongeneexpressiondatausingmapreduce AT johnsondavid optimisingparallelrcorrelationmatrixcalculationsongeneexpressiondatausingmapreduce AT emamibrahim optimisingparallelrcorrelationmatrixcalculationsongeneexpressiondatausingmapreduce AT guittonflorian optimisingparallelrcorrelationmatrixcalculationsongeneexpressiondatausingmapreduce AT oehmichenaxel optimisingparallelrcorrelationmatrixcalculationsongeneexpressiondatausingmapreduce AT guoyike optimisingparallelrcorrelationmatrixcalculationsongeneexpressiondatausingmapreduce |