Organising scientific data by dataflow optimisation on the petascale

Bibliographic Details
Main Author: Lassnig, Mario
Language: eng
Published: 2008
Subjects: Computing and Computers
Online Access: http://cds.cern.ch/record/1123358
_version_ 1780914640090300416
author Lassnig, Mario
author_facet Lassnig, Mario
author_sort Lassnig, Mario
collection CERN
description Scientific applications on the Grid are in most cases heavily data-dependent. Therefore, improving scheduling decisions based on the co-allocation of data and jobs becomes a primary issue. Hence, it is crucial to analyse the behaviour of existing data management systems in order to provide accurate information to decision-making middleware in a scalable way. We show current research issues in understanding the behaviour of data management systems on the petascale to improve Grid performance. We analyse the Distributed Data Management system Don Quijote 2 (DQ2) of the High-Energy Physics experiment ATLAS at CERN. ATLAS presents unprecedented data transfer and data storage requirements on the petascale, and DQ2 was built to fulfil these requirements. DQ2 is built upon the EGEE infrastructure, while seamlessly enabling interoperability with the American OSG and the Scandinavian NorduGrid infrastructures. It thus serves as a relevant production-quality system for analysing aspects of dataflow behaviour on the petascale. Controlled data transfers are analysed using the central DQ2 bookkeeping service and an external monitoring dashboard provided by ARDA. However, dynamic data transfers issued by jobs and end-user data transfers cannot be monitored centrally, because there is no single point of reference. We therefore provide opportunistic client tools with which all scientists access, query and modify data; these tools report the needed usage information in a non-intrusive, scalable way. We characterise three areas for improvement of dataflow. First, controlled data transfers issued by experiment operators or Grid-site operators: constant data export from the experiment to distributed computing facilities, mostly defined by the experiment computing models. Second, dynamic data transfers issued by jobs on a Grid site: production jobs may need to access data that is only available at remote sites. Third, uncontrolled data transfers issued by end-users: scientists fetching data for direct analysis. We argue that on the petascale complete replication of files is no longer a suitable option, as there is too much data, and that erratic, unpredictable data movements are the norm. Furthermore, to find useful data on the Grid, it is important to weigh the relevance of data with respect to time. Global data movement and usage patterns must therefore be taken into account when doing job/data co-allocation; our model derives those usage patterns implicitly. The objective of organising scientific data sensibly on the Grid is not a new one, and many existing approaches, especially in file replication, already show good improvements. We argue, though, that once we approach the petascale, low-level file reorganisation is no longer sufficient and a global view of Grid dataflow must be taken into account. We provide a preliminary model and its accompanying tools to understand erratic and unpredictable dataflows, and show their usefulness in the production EGEE Grid.
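The abstract's core idea, weighing the time-dependent relevance of data and folding global usage patterns into job/data co-allocation, can be illustrated with a small sketch. The Python fragment below is illustrative only: the exponential decay, the 30-day half-life and the pick_site heuristic are assumptions chosen for exposition, not the paper's model or any DQ2 API.

    import time

    # Illustrative sketch, not the paper's model: we assume access relevance
    # decays exponentially with a 30-day half-life, and a toy co-allocation rule.
    HALF_LIFE_SECONDS = 30 * 86400.0
    DECAY = 0.5 ** (1.0 / HALF_LIFE_SECONDS)  # per-second decay factor

    def relevance(access_times, now=None):
        # Time-weighted relevance of a dataset: recent accesses count more,
        # old accesses decay towards zero rather than being forgotten outright.
        if now is None:
            now = time.time()
        return sum(DECAY ** (now - t) for t in access_times)

    def pick_site(candidate_sites, dataset, replica_map, queue_load):
        # Toy job/data co-allocation: prefer a site that already holds a
        # replica of the job's input dataset; break ties by shorter job queue.
        def cost(site):
            transfer_penalty = 0 if site in replica_map.get(dataset, set()) else 1
            return (transfer_penalty, queue_load.get(site, 0))
        return min(candidate_sites, key=cost)

In this reading, the non-intrusive client tools would supply the access_times and replica_map inputs, so a scheduler could rank candidate sites without any central monitoring point, consistent with the abstract's argument; the concrete scoring shown here is hypothetical.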
id cern-1123358
institution European Organization for Nuclear Research
language eng
publishDate 2008
record_format invenio
spelling cern-1123358 2019-09-30T06:29:59Z http://cds.cern.ch/record/1123358 eng Lassnig, Mario Organising scientific data by dataflow optimisation on the petascale Computing and Computers oai:cds.cern.ch:1123358 2008
spellingShingle Computing and Computers
Lassnig, Mario
Organising scientific data by dataflow optimisation on the petascale
title Organising scientific data by dataflow optimisation on the petascale
title_full Organising scientific data by dataflow optimisation on the petascale
title_fullStr Organising scientific data by dataflow optimisation on the petascale
title_full_unstemmed Organising scientific data by dataflow optimisation on the petascale
title_short Organising scientific data by dataflow optimisation on the petascale
title_sort organising scientific data by dataflow optimisation on the petascale
topic Computing and Computers
url http://cds.cern.ch/record/1123358
work_keys_str_mv AT lassnigmario organisingscientificdatabydataflowoptimisationonthepetascale