A Probabilistic Analysis of Data Popularity in ATLAS Data Caching
Efficient distribution of physics data over ATLAS grid sites is one of the most important tasks for user data processing. ATLAS' initial static data distribution model over-replicated some unpopular data and under-replicated popular data, creating heavy disk space loads while underutilizing som...
Main authors: | Titov, M; Klimentov, A; Záruba, G; De, K |
---|---|
Language: | eng |
Published: | 2012 |
Subjects: | Detectors and Experimental Techniques |
Online access: | http://cds.cern.ch/record/1448160 |
_version_ | 1780924844320227328 |
---|---|
author | Titov, M Klimentov, A Záruba, G De, K |
author_facet | Titov, M Klimentov, A Záruba, G De, K |
author_sort | Titov, M |
collection | CERN |
description | Efficient distribution of physics data over ATLAS grid sites is one of the most important tasks for user data processing. ATLAS' initial static data distribution model over-replicated some unpopular data and under-replicated popular data, creating heavy disk space loads while underutilizing some processing resources due to low data availability. Thus, a new data distribution mechanism was implemented, PD2P (PanDA Dynamic Data Placement) within the production and distributed analysis system PanDA that dynamically reacts to user data needs, basing dataset distribution principally on user demand. Data deletion is also demand driven, reducing replica counts for unpopular data. This dynamic model has led to substantial improvements in efficient utilization of storage and processing resources. Based on this experience, in this work we seek to further improve data placement policy by investigating in detail how data popularity is calculated. For this it is necessary to precisely define what data popularity means, what types of data popularity exist, how it can be measured, and most importantly, how the history of the data can help to predict the popularity of derived data. We introduce locality of the popularity: a dataset may be only of local interest to a subset of clouds/sites or may have a wide (global) interest. We also extend the idea of the “data temperature scale” model and a popularity measure. Using the ATLAS data replication history, we devise data distribution algorithms based on popularity measures and past history. Based on this work we will describe how to explicitly identify why and how datasets become popular and how such information can be used to predict future popularity. |
id | cern-1448160 |
institution | European Organization for Nuclear Research |
language | eng |
publishDate | 2012 |
record_format | invenio |
spelling | cern-14481602019-09-30T06:29:59Zhttp://cds.cern.ch/record/1448160engTitov, MKlimentov, AZáruba, GDe, KA Probabilistic Analysis of Data Popularity in ATLAS Data CachingDetectors and Experimental TechniquesEfficient distribution of physics data over ATLAS grid sites is one of the most important tasks for user data processing. ATLAS' initial static data distribution model over-replicated some unpopular data and under-replicated popular data, creating heavy disk space loads while underutilizing some processing resources due to low data availability. Thus, a new data distribution mechanism was implemented, PD2P (PanDA Dynamic Data Placement) within the production and distributed analysis system PanDA that dynamically reacts to user data needs, basing dataset distribution principally on user demand. Data deletion is also demand driven, reducing replica counts for unpopular data. This dynamic model has led to substantial improvements in efficient utilization of storage and processing resources. Based on this experience, in this work we seek to further improve data placement policy by investigating in detail how data popularity is calculated. For this it is necessary to precisely define what data popularity means, what types of data popularity exist, how it can be measured, and most importantly, how the history of the data can help to predict the popularity of derived data. We introduce locality of the popularity: a dataset may be only of local interest to a subset of clouds/sites or may have a wide (global) interest. We also extend the idea of the “data temperature scale” model and a popularity measure. Using the ATLAS data replication history, we devise data distribution algorithms based on popularity measures and past history. Based on this work we will describe how to explicitly identify why and how datasets become popular and how such information can be used to predict future popularity.ATL-SOFT-SLIDE-2012-195oai:cds.cern.ch:14481602012-05-12 |
spellingShingle | Detectors and Experimental Techniques Titov, M Klimentov, A Záruba, G De, K A Probabilistic Analysis of Data Popularity in ATLAS Data Caching |
title | A Probabilistic Analysis of Data Popularity in ATLAS Data Caching |
title_full | A Probabilistic Analysis of Data Popularity in ATLAS Data Caching |
title_fullStr | A Probabilistic Analysis of Data Popularity in ATLAS Data Caching |
title_full_unstemmed | A Probabilistic Analysis of Data Popularity in ATLAS Data Caching |
title_short | A Probabilistic Analysis of Data Popularity in ATLAS Data Caching |
title_sort | probabilistic analysis of data popularity in atlas data caching |
topic | Detectors and Experimental Techniques |
url | http://cds.cern.ch/record/1448160 |
work_keys_str_mv | AT titovm aprobabilisticanalysisofdatapopularityinatlasdatacaching AT klimentova aprobabilisticanalysisofdatapopularityinatlasdatacaching AT zarubag aprobabilisticanalysisofdatapopularityinatlasdatacaching AT dek aprobabilisticanalysisofdatapopularityinatlasdatacaching AT titovm probabilisticanalysisofdatapopularityinatlasdatacaching AT klimentova probabilisticanalysisofdatapopularityinatlasdatacaching AT zarubag probabilisticanalysisofdatapopularityinatlasdatacaching AT dek probabilisticanalysisofdatapopularityinatlasdatacaching |