Cargando…

Issues in petabyte data indexing, retrieval and analysis

We propose several methods for speeding up the processing of particle physics data on clusters of PCs. We present a new way of indexing and retrieving data in a high dimensional space by making use of two levels of catalogues enabling an efficient data preselection. We propose several scheduling pol...

Descripción completa

Detalles Bibliográficos
Autor principal:	Ponce, Sebastien
Lenguaje:	eng
Publicado:	2011
Materias:	Computing and Computers
Acceso en línea:	http://cds.cern.ch/record/1375834

Descripción
Sumario:	We propose several methods for speeding up the processing of particle physics data on clusters of PCs. We present a new way of indexing and retrieving data in a high dimensional space by making use of two levels of catalogues enabling an efficient data preselection. We propose several scheduling policies for parallelizing data intensive particle physics applications on clusters of PCs. We show that making use of intra-job parallelization, caching data on the cluster node disks and reordering incoming jobs improves drastically the performances of a simple batch oriented scheduling policy. In addition, we propose the concept of delayed scheduling and adaptive delayed scheduling, where the deliberate inclusion of a delay improves the disk cache access rate and enables a better utilisation of the cluster. We build theoretical models for the different scheduling policies and propose a detailed comparison between the theoretical models and the results of the cluster simulations. We study the improvements obtained by pipelining data I/O operations and data processing operations, both in respect to tertiary storage I/O and to disk I/O. Pipelining improves the performances by approximately 30%. Using the parallelization framework developed EPFL, we describe a possible implementation of the proposed access policies, within the context of the LHCb experiment at CERN. A first prototype is implemented and the proposed scheduling policies can be easily plugged into it.

Issues in petabyte data indexing, retrieval and analysis

Ejemplares similares