PyGMQL: scalable data extraction and analysis for heterogeneous genomic datasets

Bibliographic Details
Main authors:	Nanni, Luca; Pinoli, Pietro; Canakoglu, Arif; Ceri, Stefano
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2019
Subjects:	Software
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6842186/ https://www.ncbi.nlm.nih.gov/pubmed/31703553 http://dx.doi.org/10.1186/s12859-019-3159-9

author	Nanni, Luca Pinoli, Pietro Canakoglu, Arif Ceri, Stefano
collection	PubMed
description	BACKGROUND: With the growth of available sequenced datasets, analysis of heterogeneous processed data can answer increasingly relevant biological and clinical questions. Scientists are challenged in performing efficient and reproducible data extraction and analysis pipelines over heterogeneously processed datasets. Available software packages are suitable for analyzing experimental files from such datasets one by one, but do not scale to thousands of experiments. Moreover, they lack proper support for metadata manipulation. RESULTS: We present PyGMQL, a novel software for the manipulation of region-based genomic files and their relative metadata, built on top of the GMQL genomic big data management system. PyGMQL provides a set of expressive functions for the manipulation of region data and their metadata that can scale to arbitrary clusters and implicitly apply to thousands of files, producing millions of regions. PyGMQL provides data interoperability, distribution transparency and query outsourcing. The PyGMQL package integrates scalable data extraction over the Apache Spark engine underlying the GMQL implementation with native Python support for interactive data analysis and visualization. It supports data interoperability, solving the impedance mismatch between executing set-oriented queries and programming in Python. PyGMQL provides distribution transparency (the ability to address a remote dataset) and query outsourcing (the ability to assign processing to a remote service) in an orthogonal way. Outsourced processing can address cloud-based installations of the GMQL engine. CONCLUSIONS: PyGMQL is an effective and innovative tool for supporting tertiary data extraction and analysis pipelines. We demonstrate the expressiveness and performance of PyGMQL through a sequence of biological data analysis scenarios of increasing complexity, which highlight reproducibility, expressive power and scalability.
format	Online Article Text
id	pubmed-6842186
institution	National Center for Biotechnology Information
language	English
publishDate	2019
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-6842186 2019-11-14 PyGMQL: scalable data extraction and analysis for heterogeneous genomic datasets Nanni, Luca Pinoli, Pietro Canakoglu, Arif Ceri, Stefano BMC Bioinformatics Software BioMed Central 2019-11-08 /pmc/articles/PMC6842186/ /pubmed/31703553 http://dx.doi.org/10.1186/s12859-019-3159-9 Text en © The Author(s) 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	PyGMQL: scalable data extraction and analysis for heterogeneous genomic datasets
topic	Software
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6842186/ https://www.ncbi.nlm.nih.gov/pubmed/31703553 http://dx.doi.org/10.1186/s12859-019-3159-9
