
A Probabilistic Analysis of Data Popularity in ATLAS Data Caching

Efficient distribution of physics data over ATLAS grid sites is one of the most important tasks for user data processing. ATLAS' initial static data distribution model over-replicated some unpopular data and under-replicated popular data, creating heavy disk space loads while underutilizing som...

Bibliographic Details
Main authors:	Titov, M, Klimentov, A, Záruba, G, De, K
Language:	eng
Published:	2012
Subjects:	Detectors and Experimental Techniques
Online access:	http://cds.cern.ch/record/1448160

_version_	1780924844320227328
author	Titov, M Klimentov, A Záruba, G De, K
collection	CERN
description	Efficient distribution of physics data over ATLAS grid sites is one of the most important tasks for user data processing. ATLAS' initial static data distribution model over-replicated some unpopular data and under-replicated popular data, creating heavy disk space loads while underutilizing some processing resources due to low data availability. Thus, a new data distribution mechanism was implemented, PD2P (PanDA Dynamic Data Placement) within the production and distributed analysis system PanDA that dynamically reacts to user data needs, basing dataset distribution principally on user demand. Data deletion is also demand driven, reducing replica counts for unpopular data. This dynamic model has led to substantial improvements in efficient utilization of storage and processing resources. Based on this experience, in this work we seek to further improve data placement policy by investigating in detail how data popularity is calculated. For this it is necessary to precisely define what data popularity means, what types of data popularity exist, how it can be measured, and most importantly, how the history of the data can help to predict the popularity of derived data. We introduce locality of the popularity: a dataset may be only of local interest to a subset of clouds/sites or may have a wide (global) interest. We also extend the idea of the “data temperature scale” model and a popularity measure. Using the ATLAS data replication history, we devise data distribution algorithms based on popularity measures and past history. Based on this work we will describe how to explicitly identify why and how datasets become popular and how such information can be used to predict future popularity.
id	cern-1448160
institution	European Organization for Nuclear Research
language	eng
publishDate	2012
record_format	invenio
spelling	cern-1448160 | 2019-09-30T06:29:59Z | http://cds.cern.ch/record/1448160 | eng | Titov, M | Klimentov, A | Záruba, G | De, K | A Probabilistic Analysis of Data Popularity in ATLAS Data Caching | Detectors and Experimental Techniques | ATL-SOFT-SLIDE-2012-195 | oai:cds.cern.ch:1448160 | 2012-05-12
title	A Probabilistic Analysis of Data Popularity in ATLAS Data Caching
topic	Detectors and Experimental Techniques
url	http://cds.cern.ch/record/1448160
