Issues in petabyte data indexing, retrieval and analysis

We propose several methods for speeding up the processing of particle physics data on clusters of PCs. We present a new way of indexing and retrieving data in a high-dimensional space by making use of two levels of catalogues, enabling efficient data preselection. We propose several scheduling pol...

Bibliographic Details
Main author: Ponce, Sebastien
Language: eng
Published: 2011
Subjects:
Online access: http://cds.cern.ch/record/1375834
_version_ 1780922957103628288
author Ponce, Sebastien
author_facet Ponce, Sebastien
author_sort Ponce, Sebastien
collection CERN
description We propose several methods for speeding up the processing of particle physics data on clusters of PCs. We present a new way of indexing and retrieving data in a high-dimensional space by making use of two levels of catalogues, enabling efficient data preselection. We propose several scheduling policies for parallelizing data-intensive particle physics applications on clusters of PCs. We show that making use of intra-job parallelization, caching data on the cluster node disks and reordering incoming jobs drastically improves the performance of a simple batch-oriented scheduling policy. In addition, we propose the concepts of delayed scheduling and adaptive delayed scheduling, where the deliberate inclusion of a delay improves the disk cache access rate and enables better utilisation of the cluster. We build theoretical models for the different scheduling policies and present a detailed comparison between the theoretical models and the results of the cluster simulations. We study the improvements obtained by pipelining data I/O operations and data processing operations, with respect to both tertiary storage I/O and disk I/O. Pipelining improves performance by approximately 30%. Using the parallelization framework developed at EPFL, we describe a possible implementation of the proposed access policies within the context of the LHCb experiment at CERN. A first prototype is implemented, and the proposed scheduling policies can be easily plugged into it.
id cern-1375834
institution European Organization for Nuclear Research
language eng
publishDate 2011
record_format invenio
spelling cern-1375834 2019-09-30T06:29:59Z http://cds.cern.ch/record/1375834 eng Ponce, Sebastien Issues in petabyte data indexing, retrieval and analysis Computing and Computers We propose several methods for speeding up the processing of particle physics data on clusters of PCs. We present a new way of indexing and retrieving data in a high-dimensional space by making use of two levels of catalogues, enabling efficient data preselection. We propose several scheduling policies for parallelizing data-intensive particle physics applications on clusters of PCs. We show that making use of intra-job parallelization, caching data on the cluster node disks and reordering incoming jobs drastically improves the performance of a simple batch-oriented scheduling policy. In addition, we propose the concepts of delayed scheduling and adaptive delayed scheduling, where the deliberate inclusion of a delay improves the disk cache access rate and enables better utilisation of the cluster. We build theoretical models for the different scheduling policies and present a detailed comparison between the theoretical models and the results of the cluster simulations. We study the improvements obtained by pipelining data I/O operations and data processing operations, with respect to both tertiary storage I/O and disk I/O. Pipelining improves performance by approximately 30%. Using the parallelization framework developed at EPFL, we describe a possible implementation of the proposed access policies within the context of the LHCb experiment at CERN. A first prototype is implemented, and the proposed scheduling policies can be easily plugged into it. CERN-THESIS-2006-097 oai:cds.cern.ch:1375834 2011-08-18T12:29:43Z
spellingShingle Computing and Computers
Ponce, Sebastien
Issues in petabyte data indexing, retrieval and analysis
title Issues in petabyte data indexing, retrieval and analysis
title_full Issues in petabyte data indexing, retrieval and analysis
title_fullStr Issues in petabyte data indexing, retrieval and analysis
title_full_unstemmed Issues in petabyte data indexing, retrieval and analysis
title_short Issues in petabyte data indexing, retrieval and analysis
title_sort issues in petabyte data indexing, retrieval and analysis
topic Computing and Computers
url http://cds.cern.ch/record/1375834
work_keys_str_mv AT poncesebastien issuesinpetabytedataindexingretrievalandanalysis