paraGSEA: a scalable approach for large-scale gene expression profiling

More studies have been conducted using gene expression similarity to identify functional connections among genes, diseases and drugs. Gene Set Enrichment Analysis (GSEA) is a powerful analytical method for interpreting gene expression data. However, due to its enormous computational overhead in the...

Bibliographic Details
Main Authors: Peng, Shaoliang; Yang, Shunyun; Bo, Xiaochen; Li, Fei
Format: Online Article Text
Language: English
Published: Oxford University Press 2017
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5737394/
https://www.ncbi.nlm.nih.gov/pubmed/28973463
http://dx.doi.org/10.1093/nar/gkx679
_version_ 1783287511417618432
author Peng, Shaoliang
Yang, Shunyun
Bo, Xiaochen
Li, Fei
author_facet Peng, Shaoliang
Yang, Shunyun
Bo, Xiaochen
Li, Fei
author_sort Peng, Shaoliang
collection PubMed
description More studies have been conducted using gene expression similarity to identify functional connections among genes, diseases and drugs. Gene Set Enrichment Analysis (GSEA) is a powerful analytical method for interpreting gene expression data. However, due to its enormous computational overhead in the significance-level estimation and multiple hypothesis testing steps, its computational scalability and efficiency are poor on large-scale datasets. We proposed paraGSEA for efficient large-scale transcriptome data analysis. Through optimization, the overall time complexity of paraGSEA is reduced from O(mn) to O(m+n), where m is the length of the gene sets and n is the length of the gene expression profiles, which contributes a more than 100-fold increase in performance compared with other popular GSEA implementations such as GSEA-P, SAM-GS and GSEA2. With further parallelization, a near-linear speed-up is gained on both workstations and clusters, with high scalability and performance on large-scale datasets. The analysis time of the whole LINCS phase I dataset (GSE92742) was reduced to nearly half an hour on a 1000-node cluster on Tianhe-2, or to within 120 hours on a 96-core workstation. The source code of paraGSEA is licensed under the GPLv3 and available at http://github.com/ysycloud/paraGSEA.
format Online
Article
Text
id pubmed-5737394
institution National Center for Biotechnology Information
language English
publishDate 2017
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-5737394 2018-01-08 paraGSEA: a scalable approach for large-scale gene expression profiling Peng, Shaoliang Yang, Shunyun Bo, Xiaochen Li, Fei Nucleic Acids Res Methods Online More studies have been conducted using gene expression similarity to identify functional connections among genes, diseases and drugs. Gene Set Enrichment Analysis (GSEA) is a powerful analytical method for interpreting gene expression data. However, due to its enormous computational overhead in the significance-level estimation and multiple hypothesis testing steps, its computational scalability and efficiency are poor on large-scale datasets. We proposed paraGSEA for efficient large-scale transcriptome data analysis. Through optimization, the overall time complexity of paraGSEA is reduced from O(mn) to O(m+n), where m is the length of the gene sets and n is the length of the gene expression profiles, which contributes a more than 100-fold increase in performance compared with other popular GSEA implementations such as GSEA-P, SAM-GS and GSEA2. With further parallelization, a near-linear speed-up is gained on both workstations and clusters, with high scalability and performance on large-scale datasets. The analysis time of the whole LINCS phase I dataset (GSE92742) was reduced to nearly half an hour on a 1000-node cluster on Tianhe-2, or to within 120 hours on a 96-core workstation. The source code of paraGSEA is licensed under the GPLv3 and available at http://github.com/ysycloud/paraGSEA. Oxford University Press 2017-09-29 2017-07-31 /pmc/articles/PMC5737394/ /pubmed/28973463 http://dx.doi.org/10.1093/nar/gkx679 Text en © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.
http://creativecommons.org/licenses/by-nc/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
spellingShingle Methods Online
Peng, Shaoliang
Yang, Shunyun
Bo, Xiaochen
Li, Fei
paraGSEA: a scalable approach for large-scale gene expression profiling
title paraGSEA: a scalable approach for large-scale gene expression profiling
title_full paraGSEA: a scalable approach for large-scale gene expression profiling
title_fullStr paraGSEA: a scalable approach for large-scale gene expression profiling
title_full_unstemmed paraGSEA: a scalable approach for large-scale gene expression profiling
title_short paraGSEA: a scalable approach for large-scale gene expression profiling
title_sort paragsea: a scalable approach for large-scale gene expression profiling
topic Methods Online
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5737394/
https://www.ncbi.nlm.nih.gov/pubmed/28973463
http://dx.doi.org/10.1093/nar/gkx679
work_keys_str_mv AT pengshaoliang paragseaascalableapproachforlargescalegeneexpressionprofiling
AT yangshunyun paragseaascalableapproachforlargescalegeneexpressionprofiling
AT boxiaochen paragseaascalableapproachforlargescalegeneexpressionprofiling
AT lifei paragseaascalableapproachforlargescalegeneexpressionprofiling
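
Illustrative note on the description field: the abstract reports a reduction in time complexity from O(mn) to O(m+n), where m is the gene set length and n is the profile length. The record does not say how paraGSEA achieves this, so the following is only a minimal sketch of one common way to reach that bound for a GSEA-style running-sum enrichment score: build a hash set of the gene set once (O(m)), then make a single pass over the ranked profile (O(n)). The function name `enrichment_score`, its parameters, and the weighting exponent `p` are assumptions chosen for illustration and are not taken from the paraGSEA source code.

```python
def enrichment_score(ranked_genes, gene_set, weights=None, p=1.0):
    """Running-sum enrichment score in O(m + n) time (illustrative sketch).

    ranked_genes: gene identifiers ordered by the ranking metric (length n).
    gene_set:     gene identifiers in the set of interest (length m).
    weights:      optional ranking-metric values aligned with ranked_genes;
                  if omitted, every hit counts equally.
    p:            exponent applied to |weight| for hits (hypothetical default).
    """
    members = set(gene_set)  # O(m) build, O(1) average membership test
    n = len(ranked_genes)
    if weights is None:
        weights = [1.0] * n

    # One O(n) pass to mark hits; a list scan here would cost O(mn) instead.
    is_hit = [gene in members for gene in ranked_genes]
    n_hits = sum(is_hit)
    n_miss = n - n_hits
    if n_hits == 0 or n_miss == 0:
        return 0.0

    hit_total = sum(abs(w) ** p for w, hit in zip(weights, is_hit) if hit)
    if hit_total == 0:
        return 0.0
    miss_step = 1.0 / n_miss

    running = 0.0
    best = 0.0  # signed maximum deviation of the running sum from zero
    for w, hit in zip(weights, is_hit):
        if hit:
            running += abs(w) ** p / hit_total
        else:
            running -= miss_step
        if abs(running) > abs(best):
            best = running
    return best


if __name__ == "__main__":
    # Gene set concentrated at the top of the ranking.
    print(enrichment_score(["g1", "g2", "g3", "g4"], ["g1", "g2"]))
```

In this sketch the significance-estimation step mentioned in the abstract (permutation testing) would call the function repeatedly on shuffled inputs, which is why the per-call cost matters on large datasets.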