Using AWS Athena analytics to monitor pilot job health on WLCG compute sites

ATLAS Distributed Computing (ADC) uses the pilot model to submit jobs to Grid computing resources. This model isolates the resource from the workload management system (WMS) and helps to avoid running jobs on faulty resources. A minor side-effect of this isolation is that the faulty resources are ne...

Bibliographic Details
Main authors: Love, Peter; Hartland, Thomas George
Language: eng
Published: 2018
Subjects:
Online access: http://cds.cern.ch/record/2649941
_version_ 1780960777430106112
author Love, Peter
Hartland, Thomas George
author_facet Love, Peter
Hartland, Thomas George
author_sort Love, Peter
collection CERN
description ATLAS Distributed Computing (ADC) uses the pilot model to submit jobs to Grid computing resources. This model isolates the resource from the workload management system (WMS) and helps to avoid running jobs on faulty resources. A minor side-effect of this isolation is that the faulty resources are neglected and not brought back into production because the problems are not visible to the WMS. In this paper we describe a method to analyse logs from the ADC resource provisioning system (AutoPyFactory) and provide monitoring views which target poorly performing resources and help diagnose the issues in good time. Central to this analysis is the use of Amazon Web Services (AWS) to provide an inexpensive and stable analytics platform. In particular, we use the AWS Athena service as an SQL query interface for logging data stored in the AWS S3 service. We describe details of the data handling pipeline and services involved leading to a summary of key metrics suitable for ADC operations.
id cern-2649941
institution European Organization for Nuclear Research
language eng
publishDate 2018
record_format invenio
spelling cern-2649941
2019-09-30T06:29:59Z
http://cds.cern.ch/record/2649941
eng
Love, Peter
Hartland, Thomas George
Using AWS Athena analytics to monitor pilot job health on WLCG compute sites
Particle Physics - Experiment
ATLAS Distributed Computing (ADC) uses the pilot model to submit jobs to Grid computing resources. This model isolates the resource from the workload management system (WMS) and helps to avoid running jobs on faulty resources. A minor side-effect of this isolation is that the faulty resources are neglected and not brought back into production because the problems are not visible to the WMS. In this paper we describe a method to analyse logs from the ADC resource provisioning system (AutoPyFactory) and provide monitoring views which target poorly performing resources and help diagnose the issues in good time. Central to this analysis is the use of Amazon Web Services (AWS) to provide an inexpensive and stable analytics platform. In particular, we use the AWS Athena service as an SQL query interface for logging data stored in the AWS S3 service. We describe details of the data handling pipeline and services involved leading to a summary of key metrics suitable for ADC operations.
ATL-SOFT-PROC-2018-059
oai:cds.cern.ch:2649941
2018-12-05
spellingShingle Particle Physics - Experiment
Love, Peter
Hartland, Thomas George
Using AWS Athena analytics to monitor pilot job health on WLCG compute sites
title Using AWS Athena analytics to monitor pilot job health on WLCG compute sites
title_full Using AWS Athena analytics to monitor pilot job health on WLCG compute sites
title_fullStr Using AWS Athena analytics to monitor pilot job health on WLCG compute sites
title_full_unstemmed Using AWS Athena analytics to monitor pilot job health on WLCG compute sites
title_short Using AWS Athena analytics to monitor pilot job health on WLCG compute sites
title_sort using aws athena analytics to monitor pilot job health on wlcg compute sites
topic Particle Physics - Experiment
url http://cds.cern.ch/record/2649941
work_keys_str_mv AT lovepeter usingawsathenaanalyticstomonitorpilotjobhealthonwlcgcomputesites
AT hartlandthomasgeorge usingawsathenaanalyticstomonitorpilotjobhealthonwlcgcomputesites