
Issues in petabyte data indexing, retrieval and analysis

We propose several methods for speeding up the processing of particle physics data on clusters of PCs. We present a new way of indexing and retrieving data in a high dimensional space by making use of two levels of catalogues enabling an efficient data preselection. We propose several scheduling pol...


Bibliographic Details
Main author:	Ponce, Sebastien
Language:	eng
Published:	2011
Subjects:	Computing and Computers
Online access:	http://cds.cern.ch/record/1375834

_version_	1780922957103628288
author	Ponce, Sebastien
author_facet	Ponce, Sebastien
author_sort	Ponce, Sebastien
collection	CERN
description	We propose several methods for speeding up the processing of particle physics data on clusters of PCs. We present a new way of indexing and retrieving data in a high-dimensional space by making use of two levels of catalogues, enabling an efficient data preselection. We propose several scheduling policies for parallelizing data-intensive particle physics applications on clusters of PCs. We show that making use of intra-job parallelization, caching data on the cluster node disks and reordering incoming jobs drastically improves the performance of a simple batch-oriented scheduling policy. In addition, we propose the concepts of delayed scheduling and adaptive delayed scheduling, where the deliberate inclusion of a delay improves the disk cache access rate and enables a better utilisation of the cluster. We build theoretical models for the different scheduling policies and propose a detailed comparison between the theoretical models and the results of the cluster simulations. We study the improvements obtained by pipelining data I/O operations and data processing operations, both with respect to tertiary storage I/O and to disk I/O. Pipelining improves performance by approximately 30%. Using the parallelization framework developed at EPFL, we describe a possible implementation of the proposed access policies within the context of the LHCb experiment at CERN. A first prototype is implemented and the proposed scheduling policies can be easily plugged into it.
id	cern-1375834
institution	European Organization for Nuclear Research
language	eng
publishDate	2011
record_format	invenio
spelling	cern-1375834 2019-09-30T06:29:59Z http://cds.cern.ch/record/1375834 eng Ponce, Sebastien Issues in petabyte data indexing, retrieval and analysis Computing and Computers CERN-THESIS-2006-097 oai:cds.cern.ch:1375834 2011-08-18T12:29:43Z
spellingShingle	Computing and Computers Ponce, Sebastien Issues in petabyte data indexing, retrieval and analysis
title	Issues in petabyte data indexing, retrieval and analysis
title_full	Issues in petabyte data indexing, retrieval and analysis
title_fullStr	Issues in petabyte data indexing, retrieval and analysis
title_full_unstemmed	Issues in petabyte data indexing, retrieval and analysis
title_short	Issues in petabyte data indexing, retrieval and analysis
title_sort	issues in petabyte data indexing, retrieval and analysis
topic	Computing and Computers
url	http://cds.cern.ch/record/1375834
work_keys_str_mv	AT poncesebastien issuesinpetabytedataindexingretrievalandanalysis
