Toward data lakes as central building blocks for data management and analysis

Data lakes are a fundamental building block for many industrial data analysis solutions and are becoming increasingly popular in research. Often associated with big data use cases, data lakes are, for example, used as central data management systems of research institutions or as the core entity of mach...

Bibliographic Details
Main Authors: Wieder, Philipp; Nolte, Hendrik
Format: Online Article Text
Language: English
Published: Frontiers Media S.A. 2022
Subjects: Big Data
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9442782/
https://www.ncbi.nlm.nih.gov/pubmed/36072823
http://dx.doi.org/10.3389/fdata.2022.945720
author Wieder, Philipp
Nolte, Hendrik
collection PubMed
description Data lakes are a fundamental building block for many industrial data analysis solutions and are becoming increasingly popular in research. Often associated with big data use cases, data lakes are used, for example, as central data management systems of research institutions or as the core entity of machine learning pipelines. The underlying idea of retaining data in its native format within a data lake facilitates a wide range of use cases and improves data reusability, especially when compared to the schema-on-write approach applied in data warehouses, where data is transformed to fit a predefined schema before it is stored. Storing such massive amounts of raw data, however, poses its own challenges, ranging from general data modeling and indexing for concise querying to the integration of suitable and scalable compute capabilities. In this contribution, influential papers of the last decade have been selected to provide a comprehensive overview of developments and obtained results. The papers are analyzed with regard to the applicability of their input to data lakes that serve as central data management systems of research institutions. To achieve this, contributions to data lake architectures, metadata models, data provenance, workflow support, and FAIR principles are investigated. Finally, these capabilities are mapped onto the requirements of two common research personae to identify open challenges. This determines potential research topics that have to be tackled before data lakes can be applied as central building blocks for research data management.
format Online
Article
Text
id pubmed-9442782
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Frontiers Media S.A.
record_format MEDLINE/PubMed
spelling pubmed-9442782 2022-09-06 Toward data lakes as central building blocks for data management and analysis Wieder, Philipp Nolte, Hendrik Front Big Data Big Data Frontiers Media S.A. 2022-08-19 /pmc/articles/PMC9442782/ /pubmed/36072823 http://dx.doi.org/10.3389/fdata.2022.945720 Text en Copyright © 2022 Wieder and Nolte.
https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
title Toward data lakes as central building blocks for data management and analysis
topic Big Data