Scalable Extraction of Big Macromolecular Data in Azure Data Lake Environment
Calculation of structural features of proteins, nucleic acids, and nucleic acid-protein complexes on the basis of their geometries and studying various interactions within these macromolecules, for which high-resolution structures are stored in Protein Data Bank (PDB), require parsing and extraction...
Main Authors: | Mrozek, Dariusz; Dąbek, Tomasz; Małysiak-Mrozek, Bożena |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | MDPI 2019 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6337464/ https://www.ncbi.nlm.nih.gov/pubmed/30621295 http://dx.doi.org/10.3390/molecules24010179 |
_version_ | 1783388260754522112 |
---|---|
author | Mrozek, Dariusz Dąbek, Tomasz Małysiak-Mrozek, Bożena |
author_facet | Mrozek, Dariusz Dąbek, Tomasz Małysiak-Mrozek, Bożena |
author_sort | Mrozek, Dariusz |
collection | PubMed |
description | Calculation of structural features of proteins, nucleic acids, and nucleic acid-protein complexes on the basis of their geometries and studying various interactions within these macromolecules, for which high-resolution structures are stored in the Protein Data Bank (PDB), require parsing and extraction of suitable data stored in text files. To perform these operations on a large scale in the face of the growing amount of macromolecular data in public repositories, we propose to perform them in the distributed environment of Azure Data Lake and scale the calculations on the Cloud. In this paper, we present dedicated data extractors for PDB files that can be used in various types of calculations performed over protein and nucleic acid structures in the Azure Data Lake. Results of our tests show that the Cloud storage space occupied by the macromolecular data can be successfully reduced by using compression of PDB files without significant loss of data processing efficiency. Moreover, our experiments show that the performed calculations can be significantly accelerated when using large sequential files for storing macromolecular data and by parallelizing the calculations and data extractions that precede them. Finally, the paper shows how all the calculations can be performed in a declarative way in U-SQL scripts for Data Lake Analytics. |
format | Online Article Text |
id | pubmed-6337464 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2019 |
publisher | MDPI |
record_format | MEDLINE/PubMed |
spelling | pubmed-63374642019-01-25 Scalable Extraction of Big Macromolecular Data in Azure Data Lake Environment Mrozek, Dariusz Dąbek, Tomasz Małysiak-Mrozek, Bożena Molecules Article Calculation of structural features of proteins, nucleic acids, and nucleic acid-protein complexes on the basis of their geometries and studying various interactions within these macromolecules, for which high-resolution structures are stored in Protein Data Bank (PDB), require parsing and extraction of suitable data stored in text files. To perform these operations on large scale in the face of the growing amount of macromolecular data in public repositories, we propose to perform them in the distributed environment of Azure Data Lake and scale the calculations on the Cloud. In this paper, we present dedicated data extractors for PDB files that can be used in various types of calculations performed over protein and nucleic acids structures in the Azure Data Lake. Results of our tests show that the Cloud storage space occupied by the macromolecular data can be successfully reduced by using compression of PDB files without significant loss of data processing efficiency. Moreover, our experiments show that the performed calculations can be significantly accelerated when using large sequential files for storing macromolecular data and by parallelizing the calculations and data extractions that precede them. Finally, the paper shows how all the calculations can be performed in a declarative way in U-SQL scripts for Data Lake Analytics. MDPI 2019-01-05 /pmc/articles/PMC6337464/ /pubmed/30621295 http://dx.doi.org/10.3390/molecules24010179 Text en © 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/). |
spellingShingle | Article Mrozek, Dariusz Dąbek, Tomasz Małysiak-Mrozek, Bożena Scalable Extraction of Big Macromolecular Data in Azure Data Lake Environment |
title | Scalable Extraction of Big Macromolecular Data in Azure Data Lake Environment |
title_full | Scalable Extraction of Big Macromolecular Data in Azure Data Lake Environment |
title_fullStr | Scalable Extraction of Big Macromolecular Data in Azure Data Lake Environment |
title_full_unstemmed | Scalable Extraction of Big Macromolecular Data in Azure Data Lake Environment |
title_short | Scalable Extraction of Big Macromolecular Data in Azure Data Lake Environment |
title_sort | scalable extraction of big macromolecular data in azure data lake environment |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6337464/ https://www.ncbi.nlm.nih.gov/pubmed/30621295 http://dx.doi.org/10.3390/molecules24010179 |
work_keys_str_mv | AT mrozekdariusz scalableextractionofbigmacromoleculardatainazuredatalakeenvironment AT dabektomasz scalableextractionofbigmacromoleculardatainazuredatalakeenvironment AT małysiakmrozekbozena scalableextractionofbigmacromoleculardatainazuredatalakeenvironment |