
Caching for dataset-based workloads with heterogeneous file sizes

Caching can effectively reduce the cost of serving content and improve the user experience. In this paper, we explore the benefits of caching for existing scientific workloads, taking the Worldwide LHC (Large Hadron Collider) Computing Grid as an example. It is a globally distributed system that sto...


Bibliographic Details
Main authors: Chuchuk, Olga; Neglia, Giovanni; Schulz, Markus; Duellmann, Dirk
Language: eng
Published: 2022
Subjects: Computing and Computers
Online access: https://dx.doi.org/10.22323/1.415.0009
http://cds.cern.ch/record/2861084
author Chuchuk, Olga
Neglia, Giovanni
Schulz, Markus
Duellmann, Dirk
collection CERN
description Caching can effectively reduce the cost of serving content and improve the user experience. In this paper, we explore the benefits of caching for existing scientific workloads, taking the Worldwide LHC (Large Hadron Collider) Computing Grid as an example. It is a globally distributed system that stores and processes multiple hundred petabytes of data and serves the needs of thousands of scientists around the globe. Scientific computation differs from other applications like video streaming as file sizes vary from a few bytes to terabytes and logical links between the files affect user access patterns. These factors profoundly influence caches' performance and, therefore, should be carefully analyzed to select which caching policy to deploy or to design new ones. In this work, we study how the hierarchical organization of the LHC physics data into files and groups of files called datasets affects the request patterns. We then propose new caching policies that exploit dataset-specific knowledge and compare them with file-based ones. Moreover, we show that limited connectivity between the computing and storage sites leads to the delayed hits phenomenon and estimate the consequent reduction in the potential benefits of caching.
id cern-2861084
institution European Organization for Nuclear Research
language eng
publishDate 2022
record_format invenio
title Caching for dataset-based workloads with heterogeneous file sizes
topic Computing and Computers