
RGMQL: scalable and interoperable computing of heterogeneous omics big data and metadata in R/Bioconductor

BACKGROUND: Heterogeneous omics data, increasingly collected through high-throughput technologies, can contain hidden answers to very important and still unsolved biomedical questions. Their integration and processing are crucial mostly for tertiary analysis of Next Generation Sequencing data, altho...


Bibliographic Details
Main Authors: Pallotta, Simone; Cascianelli, Silvia; Masseroli, Marco
Format: Online Article Text
Language: English
Published: BioMed Central 2022
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8991469/
https://www.ncbi.nlm.nih.gov/pubmed/35392801
http://dx.doi.org/10.1186/s12859-022-04648-4
_version_ 1784683574464610304
author Pallotta, Simone
Cascianelli, Silvia
Masseroli, Marco
collection PubMed
description BACKGROUND: Heterogeneous omics data, increasingly collected through high-throughput technologies, can contain hidden answers to very important and still unsolved biomedical questions. Their integration and processing are crucial mostly for tertiary analysis of Next Generation Sequencing data, although suitable big data strategies still address mainly primary and secondary analysis. Hence, there is a pressing need for algorithms specifically designed to explore big omics datasets, capable of ensuring scalability and interoperability, possibly relying on high-performance computing infrastructures. RESULTS: We propose RGMQL, an R/Bioconductor package conceived to provide a set of specialized functions to extract, combine, process and compare omics datasets and their metadata from different and differently localized sources. RGMQL is built over the GenoMetric Query Language (GMQL) data management and computational engine, and can leverage its open curated repository as well as its cloud-based resources, with the possibility of outsourcing computational tasks to GMQL remote services. Furthermore, it overcomes the limits of the GMQL declarative syntax by guaranteeing a procedural approach in dealing with omics data within the R/Bioconductor environment. Above all, it provides full interoperability with other packages of the R/Bioconductor framework and extensibility over the most used genomic data structures and processing functions. CONCLUSIONS: RGMQL is able to combine the query expressiveness and computational efficiency of GMQL with a complete processing flow in the R environment, being a fully integrated extension of the R/Bioconductor framework. Here we provide three fully reproducible example use cases of biological relevance that are particularly explanatory of its flexibility of use and interoperability with other R/Bioconductor packages.
They show how RGMQL can easily scale up from local to parallel and cloud computing while it combines and analyzes heterogeneous omics data from local or remote datasets, both public and private, in a completely transparent way to the user. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12859-022-04648-4.
format Online
Article
Text
id pubmed-8991469
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-8991469 2022-04-09 RGMQL: scalable and interoperable computing of heterogeneous omics big data and metadata in R/Bioconductor Pallotta, Simone Cascianelli, Silvia Masseroli, Marco BMC Bioinformatics Software BACKGROUND: Heterogeneous omics data, increasingly collected through high-throughput technologies, can contain hidden answers to very important and still unsolved biomedical questions. Their integration and processing are crucial mostly for tertiary analysis of Next Generation Sequencing data, although suitable big data strategies still address mainly primary and secondary analysis. Hence, there is a pressing need for algorithms specifically designed to explore big omics datasets, capable of ensuring scalability and interoperability, possibly relying on high-performance computing infrastructures. RESULTS: We propose RGMQL, an R/Bioconductor package conceived to provide a set of specialized functions to extract, combine, process and compare omics datasets and their metadata from different and differently localized sources. RGMQL is built over the GenoMetric Query Language (GMQL) data management and computational engine, and can leverage its open curated repository as well as its cloud-based resources, with the possibility of outsourcing computational tasks to GMQL remote services. Furthermore, it overcomes the limits of the GMQL declarative syntax by guaranteeing a procedural approach in dealing with omics data within the R/Bioconductor environment. Above all, it provides full interoperability with other packages of the R/Bioconductor framework and extensibility over the most used genomic data structures and processing functions. CONCLUSIONS: RGMQL is able to combine the query expressiveness and computational efficiency of GMQL with a complete processing flow in the R environment, being a fully integrated extension of the R/Bioconductor framework.
Here we provide three fully reproducible example use cases of biological relevance that are particularly explanatory of its flexibility of use and interoperability with other R/Bioconductor packages. They show how RGMQL can easily scale up from local to parallel and cloud computing while it combines and analyzes heterogeneous omics data from local or remote datasets, both public and private, in a completely transparent way to the user. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12859-022-04648-4. BioMed Central 2022-04-07 /pmc/articles/PMC8991469/ /pubmed/35392801 http://dx.doi.org/10.1186/s12859-022-04648-4 Text en © The Author(s) 2022 https://creativecommons.org/licenses/by/4.0/ Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title RGMQL: scalable and interoperable computing of heterogeneous omics big data and metadata in R/Bioconductor
topic Software
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8991469/
https://www.ncbi.nlm.nih.gov/pubmed/35392801
http://dx.doi.org/10.1186/s12859-022-04648-4