On the Logical Design of a Prototypical Data Lake System for Biological Resources
Biological resources are multifarious, encompassing organisms, genetic materials, populations, or any other biotic components of ecosystems, and fine-grained data management and processing of these diverse types of resources poses a tremendous challenge for both researchers and practitioners. Befo...
Main Authors: | Che, Haoyang; Duan, Yucong |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Frontiers Media S.A. 2020 |
Subjects: | Bioengineering and Biotechnology |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7552915/ https://www.ncbi.nlm.nih.gov/pubmed/33117777 http://dx.doi.org/10.3389/fbioe.2020.553904 |
_version_ | 1783593499797487616 |
---|---|
author | Che, Haoyang Duan, Yucong |
collection | PubMed |
description | Biological resources are multifarious, encompassing organisms, genetic materials, populations, or any other biotic components of ecosystems, and fine-grained data management and processing of these diverse types of resources poses a tremendous challenge for both researchers and practitioners. Before the conceptualization of data lakes, earlier big data management platforms in computational biology and biomedicine could not handle many practical data management tasks well. As an effective complement to those earlier systems, data lakes were devised to store voluminous, varied, and diversely structured or unstructured data in their native formats, in support of analyses such as reporting, modeling, data exploration, knowledge discovery, data visualization, advanced analysis, and machine learning. Due to their intrinsic traits, data lakes are thought to be ideal technologies for processing hybrid biological resources in the form of text, image, audio, video, and structured tabular data. This paper proposes a method for constructing a practical data lake system for processing multimodal biological data using a prototype system named ProtoDLS, especially from the explainability point of view, which is indispensable to the rigor, transparency, persuasiveness, and trustworthiness of applications in the field. ProtoDLS adopts a horizontal pipeline to ensure the intra-component explainability factors from data acquisition to data presentation, and a vertical pipeline to ensure the inner-component explainability factors including mathematics, algorithm, execution time, memory consumption, network latency, security, and sampling size. This dual mechanism provides explainability guarantees across the entire data lake system. ProtoDLS shows that a single point of explainability cannot thoroughly expound the cause and effect of the matter from an overall perspective, and that adopting a systematic, dynamic, and multisided way of thinking and a system-oriented analysis method is critical when designing a data processing system for biological resources. |
format | Online Article Text |
id | pubmed-7552915 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2020 |
publisher | Frontiers Media S.A. |
record_format | MEDLINE/PubMed |
spelling | pubmed-7552915 2020-10-27 On the Logical Design of a Prototypical Data Lake System for Biological Resources Che, Haoyang Duan, Yucong Front Bioeng Biotechnol Bioengineering and Biotechnology Frontiers Media S.A. 2020-09-29 /pmc/articles/PMC7552915/ /pubmed/33117777 http://dx.doi.org/10.3389/fbioe.2020.553904 Text en Copyright © 2020 Che and Duan. http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms. |
title | On the Logical Design of a Prototypical Data Lake System for Biological Resources |
topic | Bioengineering and Biotechnology |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7552915/ https://www.ncbi.nlm.nih.gov/pubmed/33117777 http://dx.doi.org/10.3389/fbioe.2020.553904 |