Rapid development of cloud-native intelligent data pipelines for scientific data streams using the HASTE Toolkit
BACKGROUND: Large streamed datasets, characteristic of life science applications, are often resource-intensive to process, transport and store. We propose a pipeline model, a design pattern for scientific pipelines, where an incoming stream of scientific data is organized into a tiered or ordered “d...
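Main authors: | Blamey, Ben; Toor, Salman; Dahlö, Martin; Wieslander, Håkan; Harrison, Philip J; Sintorn, Ida-Maria; Sabirsh, Alan; Wählby, Carolina; Spjuth, Ola; Hellander, Andreas |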
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press, 2021 |
Subjects: | Technical Note |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7976223/ https://www.ncbi.nlm.nih.gov/pubmed/33739401 http://dx.doi.org/10.1093/gigascience/giab018 |
_version_ | 1783667012638081024 |
---|---|
author | Blamey, Ben; Toor, Salman; Dahlö, Martin; Wieslander, Håkan; Harrison, Philip J; Sintorn, Ida-Maria; Sabirsh, Alan; Wählby, Carolina; Spjuth, Ola; Hellander, Andreas |
author_sort | Blamey, Ben |
collection | PubMed |
description | BACKGROUND: Large streamed datasets, characteristic of life science applications, are often resource-intensive to process, transport and store. We propose a pipeline model, a design pattern for scientific pipelines, where an incoming stream of scientific data is organized into a tiered or ordered “data hierarchy”. We introduce the HASTE Toolkit, a proof-of-concept cloud-native software toolkit based on this pipeline model, to partition and prioritize data streams to optimize use of limited computing resources. FINDINGS: In our pipeline model, an “interestingness function” assigns an interestingness score to data objects in the stream, inducing a data hierarchy. From this score, a “policy” guides decisions on how to prioritize computational resource use for a given object. The HASTE Toolkit is a collection of tools to adopt this approach. We evaluate with 2 microscopy imaging case studies. The first is a high content screening experiment, where images are analyzed in an on-premise container cloud to prioritize storage and subsequent computation. The second considers edge processing of images for upload into the public cloud for real-time control of a transmission electron microscope. CONCLUSIONS: Through our evaluation, we created smart data pipelines capable of effective use of storage, compute, and network resources, enabling more efficient data-intensive experiments. We note a beneficial separation between scientific concerns of data priority, and the implementation of this behaviour for different resources in different deployment contexts. The toolkit allows intelligent prioritization to be ‘bolted on’ to new and existing systems – and is intended for use with a range of technologies in different deployment scenarios. |
format | Online Article Text |
id | pubmed-7976223 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2021 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-7976223 2021-03-23 GigaScience Technical Note Oxford University Press 2021-03-19 /pmc/articles/PMC7976223/ /pubmed/33739401 http://dx.doi.org/10.1093/gigascience/giab018 Text en © The Author(s) 2021. Published by Oxford University Press GigaScience. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. |
title | Rapid development of cloud-native intelligent data pipelines for scientific data streams using the HASTE Toolkit |
topic | Technical Note |
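
The pipeline model summarized in the description (an interestingness function scores each object in the stream, and a policy turns that score into a resource decision) can be illustrated with a minimal sketch. This is not the HASTE Toolkit API: the `DataObject` class, the byte-counting score, the thresholds, and the tier names are all hypothetical and chosen only to make the idea concrete.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Tuple


@dataclass
class DataObject:
    """Hypothetical stand-in for one item in the stream (e.g. an image)."""
    name: str
    payload: bytes


def interestingness(obj: DataObject) -> float:
    """Hypothetical interestingness function: fraction of non-zero bytes, in [0, 1]."""
    if not obj.payload:
        return 0.0
    return sum(1 for b in obj.payload if b != 0) / len(obj.payload)


def tier_policy(score: float) -> str:
    """Hypothetical policy: map a score to a storage tier (thresholds are assumptions)."""
    if score >= 0.75:
        return "fast-storage"
    if score >= 0.25:
        return "standard-storage"
    return "discard"


def prioritize(
    stream: Iterable[DataObject],
    score_fn: Callable[[DataObject], float] = interestingness,
    policy: Callable[[float], str] = tier_policy,
) -> Iterator[Tuple[str, float, str]]:
    """Score each object as it arrives and let the policy pick its tier."""
    for obj in stream:
        score = score_fn(obj)
        yield obj.name, score, policy(score)


stream = [
    DataObject("img_001", bytes([0, 0, 0, 0])),
    DataObject("img_002", bytes([1, 0, 2, 0])),
    DataObject("img_003", bytes([5, 6, 7, 8])),
]
results = list(prioritize(stream))
for name, score, tier in results:
    print(f"{name}: score={score:.2f} -> {tier}")
```

Keeping the scoring function and the policy as separate callables mirrors the separation the abstract notes between the scientific question of data priority and how that priority is acted on for a particular resource.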