Simulation and Evaluation of Cloud Storage Caching for Data Intensive Science

A common task in scientific computing is the data reduction. This workflow extracts the most important information from large input data and stores it in smaller derived data objects. The derived data objects can then be used for further analysis. Typically, these workflows use distributed storage a...

Bibliographic Details
Main authors: Wegner, Tobias, Lassnig, Mario, Ueberholz, Peer, Zeitnitz, Christian
Format: Online Article Text
Language: English
Published: Springer International Publishing 2021
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9805534/
https://www.ncbi.nlm.nih.gov/pubmed/36620727
http://dx.doi.org/10.1007/s41781-021-00076-w
_version_ 1784862347381178368
author Wegner, Tobias
Lassnig, Mario
Ueberholz, Peer
Zeitnitz, Christian
collection PubMed
description A common task in scientific computing is the data reduction. This workflow extracts the most important information from large input data and stores it in smaller derived data objects. The derived data objects can then be used for further analysis. Typically, these workflows use distributed storage and computing resources. A straightforward setup of storage media would be low-cost tape storage and higher-cost disk storage. The large, infrequently accessed input data are stored on tape storage. The smaller, frequently accessed derived data is stored on disk storage. In a best-case scenario, the large input data is only accessed very infrequently and in a well-planned pattern. However, practice shows that often the data has to be processed continuously and unpredictably. This can significantly reduce tape storage performance. A common approach to counter this is storing copies of the large input data on disk storage. This contribution evaluates an approach that uses cloud storage resources to serve as a flexible cache or buffer, depending on the computational workflow. The proposed model is explored for the case of continuously processed data. For the evaluation, a simulation tool was developed, which can be used to analyse models related to storage and network resources. We show that using commercial cloud storage can reduce on-premises disk storage requirements, while maintaining an equal throughput of jobs. Moreover, the key metrics of the model are discussed, and an approach is described, which uses the simulation to assist with the decision process of using commercial cloud storage. The goal is to investigate approaches and propose new evaluation methods to overcome future data challenges.
format Online
Article
Text
id pubmed-9805534
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Springer International Publishing
record_format MEDLINE/PubMed
spelling pubmed-9805534 2023-01-04 Comput Softw Big Sci Original Article
Springer International Publishing 2021-12-22 2022 /pmc/articles/PMC9805534/ /pubmed/36620727 http://dx.doi.org/10.1007/s41781-021-00076-w Text en © The Author(s) 2022 https://creativecommons.org/licenses/by/4.0/ Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
title Simulation and Evaluation of Cloud Storage Caching for Data Intensive Science
topic Original Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9805534/
https://www.ncbi.nlm.nih.gov/pubmed/36620727
http://dx.doi.org/10.1007/s41781-021-00076-w