
Simulation and Evaluation of Cloud Storage Caching for Data Intensive Science

A common task in scientific computing is data reduction. This workflow extracts the most important information from large input data and stores it in smaller derived data objects. The derived data objects can then be used for further analysis. Typically, these workflows use distributed storage a...


Bibliographic Details
Main authors:	Wegner, Tobias; Lassnig, Mario; Ueberholz, Peer; Zeitnitz, Christian
Language:	eng
Published:	2022
Subjects:	cs.DC Computing and Computers
Online access:	https://dx.doi.org/10.1007/s41781-021-00076-w http://cds.cern.ch/record/2812679

_version_	1780973355679088640
author	Wegner, Tobias; Lassnig, Mario; Ueberholz, Peer; Zeitnitz, Christian
author_facet	Wegner, Tobias; Lassnig, Mario; Ueberholz, Peer; Zeitnitz, Christian
author_sort	Wegner, Tobias
collection	CERN
description	A common task in scientific computing is data reduction. This workflow extracts the most important information from large input data and stores it in smaller derived data objects. The derived data objects can then be used for further analysis. Typically, these workflows use distributed storage and computing resources. A straightforward setup of storage media would be low-cost tape storage and higher-cost disk storage. The large, infrequently accessed input data is stored on tape storage. The smaller, frequently accessed derived data is stored on disk storage. In a best-case scenario, the large input data is only accessed very infrequently and in a well-planned pattern. However, practice shows that often the data has to be processed continuously and unpredictably. This can significantly reduce tape storage performance. A common approach to counter this is storing copies of the large input data on disk storage. This contribution evaluates an approach that uses cloud storage resources to serve as a flexible cache or buffer, depending on the computational workflow. The proposed model is explored for the case of continuously processed data. For the evaluation, a simulation tool was developed, which can be used to analyse models related to storage and network resources. We show that using commercial cloud storage can reduce on-premises disk storage requirements, while maintaining an equal throughput of jobs. Moreover, the key metrics of the model are discussed, and an approach is described that uses the simulation to assist with the decision process of using commercial cloud storage. The goal is to investigate approaches and propose new evaluation methods to overcome future data challenges.
id	cern-2812679
institution	Organización Europea para la Investigación Nuclear
language	eng
publishDate	2022
record_format	invenio
spelling	cern-2812679 2023-02-03T12:00:13Z doi:10.1007/s41781-021-00076-w http://cds.cern.ch/record/2812679 eng Wegner, Tobias; Lassnig, Mario; Ueberholz, Peer; Zeitnitz, Christian Simulation and Evaluation of Cloud Storage Caching for Data Intensive Science cs.DC Computing and Computers arXiv:2105.03201 oai:cds.cern.ch:2812679 2022
spellingShingle	cs.DC Computing and Computers Wegner, Tobias Lassnig, Mario Ueberholz, Peer Zeitnitz, Christian Simulation and Evaluation of Cloud Storage Caching for Data Intensive Science
title	Simulation and Evaluation of Cloud Storage Caching for Data Intensive Science
title_full	Simulation and Evaluation of Cloud Storage Caching for Data Intensive Science
title_fullStr	Simulation and Evaluation of Cloud Storage Caching for Data Intensive Science
title_full_unstemmed	Simulation and Evaluation of Cloud Storage Caching for Data Intensive Science
title_short	Simulation and Evaluation of Cloud Storage Caching for Data Intensive Science
title_sort	simulation and evaluation of cloud storage caching for data intensive science
topic	cs.DC Computing and Computers
url	https://dx.doi.org/10.1007/s41781-021-00076-w http://cds.cern.ch/record/2812679
work_keys_str_mv	AT wegnertobias simulationandevaluationofcloudstoragecachingfordataintensivescience AT lassnigmario simulationandevaluationofcloudstoragecachingfordataintensivescience AT ueberholzpeer simulationandevaluationofcloudstoragecachingfordataintensivescience AT zeitnitzchristian simulationandevaluationofcloudstoragecachingfordataintensivescience
