Rapid development of cloud-native intelligent data pipelines for scientific data streams using the HASTE Toolkit

BACKGROUND: Large streamed datasets, characteristic of life science applications, are often resource-intensive to process, transport and store. We propose a pipeline model, a design pattern for scientific pipelines, where an incoming stream of scientific data is organized into a tiered or ordered “d...

Bibliographic Details
Main Authors:	Blamey, Ben; Toor, Salman; Dahlö, Martin; Wieslander, Håkan; Harrison, Philip J; Sintorn, Ida-Maria; Sabirsh, Alan; Wählby, Carolina; Spjuth, Ola; Hellander, Andreas
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2021
Subjects:	Technical Note
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7976223/ https://www.ncbi.nlm.nih.gov/pubmed/33739401 http://dx.doi.org/10.1093/gigascience/giab018

_version_	1783667012638081024
author	Blamey, Ben Toor, Salman Dahlö, Martin Wieslander, Håkan Harrison, Philip J Sintorn, Ida-Maria Sabirsh, Alan Wählby, Carolina Spjuth, Ola Hellander, Andreas
author_facet	Blamey, Ben Toor, Salman Dahlö, Martin Wieslander, Håkan Harrison, Philip J Sintorn, Ida-Maria Sabirsh, Alan Wählby, Carolina Spjuth, Ola Hellander, Andreas
author_sort	Blamey, Ben
collection	PubMed
description	BACKGROUND: Large streamed datasets, characteristic of life science applications, are often resource-intensive to process, transport and store. We propose a pipeline model, a design pattern for scientific pipelines, where an incoming stream of scientific data is organized into a tiered or ordered “data hierarchy”. We introduce the HASTE Toolkit, a proof-of-concept cloud-native software toolkit based on this pipeline model, to partition and prioritize data streams to optimize use of limited computing resources. FINDINGS: In our pipeline model, an “interestingness function” assigns an interestingness score to data objects in the stream, inducing a data hierarchy. From this score, a “policy” guides decisions on how to prioritize computational resource use for a given object. The HASTE Toolkit is a collection of tools to adopt this approach. We evaluate with 2 microscopy imaging case studies. The first is a high content screening experiment, where images are analyzed in an on-premise container cloud to prioritize storage and subsequent computation. The second considers edge processing of images for upload into the public cloud for real-time control of a transmission electron microscope. CONCLUSIONS: Through our evaluation, we created smart data pipelines capable of effective use of storage, compute, and network resources, enabling more efficient data-intensive experiments. We note a beneficial separation between scientific concerns of data priority, and the implementation of this behaviour for different resources in different deployment contexts. The toolkit allows intelligent prioritization to be “bolted on” to new and existing systems – and is intended for use with a range of technologies in different deployment scenarios.
format	Online Article Text
id	pubmed-7976223
institution	National Center for Biotechnology Information
language	English
publishDate	2021
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-7976223 2021-03-23 Gigascience Technical Note Oxford University Press 2021-03-19 /pmc/articles/PMC7976223/ /pubmed/33739401 http://dx.doi.org/10.1093/gigascience/giab018 Text en © The Author(s) 2021. Published by Oxford University Press GigaScience. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Technical Note Blamey, Ben Toor, Salman Dahlö, Martin Wieslander, Håkan Harrison, Philip J Sintorn, Ida-Maria Sabirsh, Alan Wählby, Carolina Spjuth, Ola Hellander, Andreas Rapid development of cloud-native intelligent data pipelines for scientific data streams using the HASTE Toolkit
title	Rapid development of cloud-native intelligent data pipelines for scientific data streams using the HASTE Toolkit
title_sort	rapid development of cloud-native intelligent data pipelines for scientific data streams using the haste toolkit
topic	Technical Note
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7976223/ https://www.ncbi.nlm.nih.gov/pubmed/33739401 http://dx.doi.org/10.1093/gigascience/giab018

Similar Items