A Probabilistic Analysis of Data Popularity in ATLAS Data Caching

Bibliographic Details
Main authors: Titov, M, Klimentov, A, Záruba, G, De, K
Language: eng
Published: 2012
Subjects: Detectors and Experimental Techniques
Online access: http://cds.cern.ch/record/1448160
_version_ 1780924844320227328
author Titov, M
Klimentov, A
Záruba, G
De, K
author_sort Titov, M
collection CERN
description Efficient distribution of physics data over ATLAS grid sites is one of the most important tasks for user data processing. ATLAS' initial static data distribution model over-replicated some unpopular data and under-replicated popular data, creating heavy disk space loads while underutilizing some processing resources due to low data availability. Thus, a new data distribution mechanism, PD2P (PanDA Dynamic Data Placement), was implemented within the production and distributed analysis system PanDA. It reacts dynamically to user data needs, basing dataset distribution principally on user demand. Data deletion is also demand-driven, reducing replica counts for unpopular data. This dynamic model has led to substantial improvements in the utilization of storage and processing resources. Based on this experience, in this work we seek to further improve data placement policy by investigating in detail how data popularity is calculated. This requires precisely defining what data popularity means, what types of data popularity exist, how it can be measured, and, most importantly, how the history of the data can help to predict the popularity of derived data. We introduce locality of popularity: a dataset may be of only local interest to a subset of clouds/sites or may be of wide (global) interest. We also extend the idea of the “data temperature scale” model and a popularity measure. Using the ATLAS data replication history, we devise data distribution algorithms based on popularity measures and past history. Based on this work we describe how to explicitly identify why and how datasets become popular and how such information can be used to predict future popularity.
id cern-1448160
institution European Organization for Nuclear Research
language eng
publishDate 2012
record_format invenio
spelling cern-1448160 2019-09-30T06:29:59Z http://cds.cern.ch/record/1448160 eng Titov, M Klimentov, A Záruba, G De, K A Probabilistic Analysis of Data Popularity in ATLAS Data Caching Detectors and Experimental Techniques ATL-SOFT-SLIDE-2012-195 oai:cds.cern.ch:1448160 2012-05-12
title A Probabilistic Analysis of Data Popularity in ATLAS Data Caching
title_sort probabilistic analysis of data popularity in atlas data caching
topic Detectors and Experimental Techniques
url http://cds.cern.ch/record/1448160