On the Logical Design of a Prototypical Data Lake System for Biological Resources

Biological resources are multifarious, encompassing organisms, genetic materials, populations, or any other biotic components of ecosystems, and fine-grained data management and processing of these diverse resource types poses a tremendous challenge for both researchers and practitioners. Befo...

Bibliographic Details
Main authors: Che, Haoyang; Duan, Yucong
Format: Online Article Text
Language: English
Published: Frontiers Media S.A. 2020
Subjects: Bioengineering and Biotechnology
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7552915/
https://www.ncbi.nlm.nih.gov/pubmed/33117777
http://dx.doi.org/10.3389/fbioe.2020.553904
_version_ 1783593499797487616
author Che, Haoyang
Duan, Yucong
author_facet Che, Haoyang
Duan, Yucong
author_sort Che, Haoyang
collection PubMed
description Biological resources are multifarious, encompassing organisms, genetic materials, populations, or any other biotic components of ecosystems, and fine-grained data management and processing of these diverse resource types poses a tremendous challenge for both researchers and practitioners. Before data lakes were conceived, earlier big data management platforms in computational biology and biomedicine handled many practical data management tasks poorly. As an effective complement to those systems, data lakes were devised to store voluminous, varied, and diversely structured or unstructured data in their native formats, in support of analyses such as reporting, modeling, data exploration, knowledge discovery, data visualization, advanced analysis, and machine learning. Because of these intrinsic traits, data lakes are thought to be ideal technologies for processing hybrid biological resources in the form of text, image, audio, video, and structured tabular data. This paper proposes a method for constructing a practical data lake system for processing multimodal biological data, using a prototype system named ProtoDLS, with particular attention to explainability, which is indispensable to the rigor, transparency, persuasiveness, and trustworthiness of applications in the field. ProtoDLS adopts a horizontal pipeline to ensure the intra-component explainability factors from data acquisition to data presentation, and a vertical pipeline to ensure the inner-component explainability factors, including mathematics, algorithm, execution time, memory consumption, network latency, security, and sampling size. This dual mechanism can provide explainability guarantees across the entire data lake system. ProtoDLS demonstrates that a single point of explainability cannot thoroughly expound the cause and effect of the matter from an overall perspective, and that adopting a systematic, dynamic, and multisided way of thinking and a system-oriented analysis method is critical when designing a data processing system for biological resources.
format Online
Article
Text
id pubmed-7552915
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher Frontiers Media S.A.
record_format MEDLINE/PubMed
spelling pubmed-7552915 2020-10-27 On the Logical Design of a Prototypical Data Lake System for Biological Resources Che, Haoyang Duan, Yucong Front Bioeng Biotechnol Bioengineering and Biotechnology Frontiers Media S.A. 2020-09-29 /pmc/articles/PMC7552915/ /pubmed/33117777 http://dx.doi.org/10.3389/fbioe.2020.553904 Text en Copyright © 2020 Che and Duan. http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
spellingShingle Bioengineering and Biotechnology
Che, Haoyang
Duan, Yucong
On the Logical Design of a Prototypical Data Lake System for Biological Resources
title On the Logical Design of a Prototypical Data Lake System for Biological Resources
topic Bioengineering and Biotechnology
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7552915/
https://www.ncbi.nlm.nih.gov/pubmed/33117777
http://dx.doi.org/10.3389/fbioe.2020.553904