
Scalable Extraction of Big Macromolecular Data in Azure Data Lake Environment

Calculation of structural features of proteins, nucleic acids, and nucleic acid-protein complexes on the basis of their geometries and studying various interactions within these macromolecules, for which high-resolution structures are stored in the Protein Data Bank (PDB), require parsing and extraction...


Bibliographic Details
Main authors: Mrozek, Dariusz, Dąbek, Tomasz, Małysiak-Mrozek, Bożena
Format: Online Article Text
Language: English
Published: MDPI 2019
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6337464/
https://www.ncbi.nlm.nih.gov/pubmed/30621295
http://dx.doi.org/10.3390/molecules24010179
_version_ 1783388260754522112
author Mrozek, Dariusz
Dąbek, Tomasz
Małysiak-Mrozek, Bożena
collection PubMed
description Calculation of structural features of proteins, nucleic acids, and nucleic acid-protein complexes on the basis of their geometries and studying various interactions within these macromolecules, for which high-resolution structures are stored in the Protein Data Bank (PDB), require parsing and extraction of suitable data stored in text files. To perform these operations on a large scale in the face of the growing amount of macromolecular data in public repositories, we propose to perform them in the distributed environment of Azure Data Lake and scale the calculations in the Cloud. In this paper, we present dedicated data extractors for PDB files that can be used in various types of calculations performed over protein and nucleic acid structures in the Azure Data Lake. Results of our tests show that the Cloud storage space occupied by the macromolecular data can be successfully reduced by compressing PDB files without significant loss of data processing efficiency. Moreover, our experiments show that the calculations can be significantly accelerated by using large sequential files for storing macromolecular data and by parallelizing the calculations and the data extractions that precede them. Finally, the paper shows how all the calculations can be performed in a declarative way in U-SQL scripts for Data Lake Analytics.
format Online
Article
Text
id pubmed-6337464
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-63374642019-01-25 Scalable Extraction of Big Macromolecular Data in Azure Data Lake Environment Mrozek, Dariusz Dąbek, Tomasz Małysiak-Mrozek, Bożena Molecules Article Calculation of structural features of proteins, nucleic acids, and nucleic acid-protein complexes on the basis of their geometries and studying various interactions within these macromolecules, for which high-resolution structures are stored in Protein Data Bank (PDB), require parsing and extraction of suitable data stored in text files. To perform these operations on large scale in the face of the growing amount of macromolecular data in public repositories, we propose to perform them in the distributed environment of Azure Data Lake and scale the calculations on the Cloud. In this paper, we present dedicated data extractors for PDB files that can be used in various types of calculations performed over protein and nucleic acids structures in the Azure Data Lake. Results of our tests show that the Cloud storage space occupied by the macromolecular data can be successfully reduced by using compression of PDB files without significant loss of data processing efficiency. Moreover, our experiments show that the performed calculations can be significantly accelerated when using large sequential files for storing macromolecular data and by parallelizing the calculations and data extractions that precede them. Finally, the paper shows how all the calculations can be performed in a declarative way in U-SQL scripts for Data Lake Analytics. MDPI 2019-01-05 /pmc/articles/PMC6337464/ /pubmed/30621295 http://dx.doi.org/10.3390/molecules24010179 Text en © 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
title Scalable Extraction of Big Macromolecular Data in Azure Data Lake Environment
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6337464/
https://www.ncbi.nlm.nih.gov/pubmed/30621295
http://dx.doi.org/10.3390/molecules24010179