Using AWS Athena analytics to monitor pilot job health on WLCG compute sites
ATLAS Distributed Computing (ADC) uses the pilot model to submit jobs to Grid computing resources. This model isolates the resource from the workload management system (WMS) and helps to avoid running jobs on faulty resources. A minor side-effect of this isolation is that the faulty resources are ne...
Main authors: | Love, Peter; Hartland, Thomas George |
---|---|
Language: | eng |
Published: | 2018 |
Subjects: | Particle Physics - Experiment |
Online access: | http://cds.cern.ch/record/2649941 |
_version_ | 1780960777430106112 |
---|---|
author | Love, Peter; Hartland, Thomas George |
author_facet | Love, Peter; Hartland, Thomas George |
author_sort | Love, Peter |
collection | CERN |
description | ATLAS Distributed Computing (ADC) uses the pilot model to submit jobs to Grid computing resources. This model isolates the resource from the workload management system (WMS) and helps to avoid running jobs on faulty resources. A minor side-effect of this isolation is that the faulty resources are neglected and not brought back into production because the problems are not visible to the WMS. In this paper we describe a method to analyse logs from the ADC resource provisioning system (AutoPyFactory) and provide monitoring views which target poorly performing resources and help diagnose the issues in good time. Central to this analysis is the use of Amazon Web Services (AWS) to provide an inexpensive and stable analytics platform. In particular, we use the AWS Athena service as an SQL query interface for logging data stored in the AWS S3 service. We describe details of the data handling pipeline and services involved leading to a summary of key metrics suitable for ADC operations. |
id | cern-2649941 |
institution | European Organization for Nuclear Research (CERN) |
language | eng |
publishDate | 2018 |
record_format | invenio |
spelling | cern-2649941 2019-09-30T06:29:59Z http://cds.cern.ch/record/2649941 eng Love, Peter Hartland, Thomas George Using AWS Athena analytics to monitor pilot job health on WLCG compute sites Particle Physics - Experiment ATLAS Distributed Computing (ADC) uses the pilot model to submit jobs to Grid computing resources. This model isolates the resource from the workload management system (WMS) and helps to avoid running jobs on faulty resources. A minor side-effect of this isolation is that the faulty resources are neglected and not brought back into production because the problems are not visible to the WMS. In this paper we describe a method to analyse logs from the ADC resource provisioning system (AutoPyFactory) and provide monitoring views which target poorly performing resources and help diagnose the issues in good time. Central to this analysis is the use of Amazon Web Services (AWS) to provide an inexpensive and stable analytics platform. In particular, we use the AWS Athena service as an SQL query interface for logging data stored in the AWS S3 service. We describe details of the data handling pipeline and services involved leading to a summary of key metrics suitable for ADC operations. ATL-SOFT-PROC-2018-059 oai:cds.cern.ch:2649941 2018-12-05 |
spellingShingle | Particle Physics - Experiment Love, Peter Hartland, Thomas George Using AWS Athena analytics to monitor pilot job health on WLCG compute sites |
title | Using AWS Athena analytics to monitor pilot job health on WLCG compute sites |
title_full | Using AWS Athena analytics to monitor pilot job health on WLCG compute sites |
title_fullStr | Using AWS Athena analytics to monitor pilot job health on WLCG compute sites |
title_full_unstemmed | Using AWS Athena analytics to monitor pilot job health on WLCG compute sites |
title_short | Using AWS Athena analytics to monitor pilot job health on WLCG compute sites |
title_sort | using aws athena analytics to monitor pilot job health on wlcg compute sites |
topic | Particle Physics - Experiment |
url | http://cds.cern.ch/record/2649941 |
work_keys_str_mv | AT lovepeter usingawsathenaanalyticstomonitorpilotjobhealthonwlcgcomputesites AT hartlandthomasgeorge usingawsathenaanalyticstomonitorpilotjobhealthonwlcgcomputesites |