
Predictive analytics tools to adjust and monitor performance metrics for the ATLAS Production System

Every scientific workflow involves an organizational part whose purpose is to plan the analysis process thoroughly according to a defined schedule and thus keep work progressing efficiently. Having such information as an estimate of the processing time or the likelihood of a system outage (abnormal behaviour)...


Bibliographic Details
Main authors: Titov, Mikhail; Gubin, Maksim; Klimentov, Alexei; Barreiro Megino, Fernando Harald; Golubkov, Dmitry; Borodin, Mikhail; Maeno, Tadashi; Padolski, Siarhei; Korchuganova, Tatiana; Grigoryeva, Maria
Language: English
Published: 2017
Subjects: Particle Physics - Experiment
Online access: http://cds.cern.ch/record/2279947
author Titov, Mikhail
Gubin, Maksim
Klimentov, Alexei
Barreiro Megino, Fernando Harald
Golubkov, Dmitry
Borodin, Mikhail
Maeno, Tadashi
Padolski, Siarhei
Korchuganova, Tatiana
Grigoryeva, Maria
author_sort Titov, Mikhail
collection CERN
description Every scientific workflow involves an organizational part whose purpose is to plan the analysis process thoroughly according to a defined schedule and thus keep work progressing efficiently. Having such information as an estimate of the processing time or the likelihood of a system outage (abnormal behaviour) will improve the planning process, help to monitor system performance, and help to predict its next state. The ATLAS Production System is an automated scheduling system that is responsible for the central production of Monte-Carlo data, highly specialized production for physics groups, and data pre-processing and analysis using such facilities as grid infrastructures, clouds and supercomputers. Its next generation, ProdSys2, processes around 2M tasks per year, which corresponds to more than 365M jobs per year. ProdSys2 evolves to accommodate a growing number of users and new requirements from the ATLAS Collaboration, physics groups and individual users. ATLAS Distributed Computing in its current state is an aggregation of large and heterogeneous facilities running on the WLCG, academic and commercial clouds, and supercomputers. This cyber-infrastructure presents computing conditions in which contention for resources among high-priority data analyses happens routinely, which may lead to significant workload and data-handling interruptions. The inability to monitor and predict the behaviour of the analysis process (its duration) and the state of the system itself motivated the design of built-in situational-awareness analytics tools. The proposed suite of tools aims to estimate the completion time (the so-called "Time To Complete", TTC) of every production task (i.e., to predict the task duration) and of a chain of tasks, and to predict the failure state of the system (e.g., based on "abnormal" task processing times).
Its implementation is based on machine learning methods and techniques; besides historical information about finished tasks, it uses ProdSys2 job execution information and the resource usage state (real-time parameters and metrics that adjust predicted values according to the state of the computing environment). The WLCG ML R&D project started in 2016. Within the project, the first implementation of the TTC Estimator (for production tasks) was developed, and its visualization was integrated into the ProdSys Monitor.
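The three goals named in the description (a per-task TTC estimate, a TTC for a chain of tasks, and flagging "abnormal" processing times) can be illustrated with a minimal sketch. This is not the machine-learning estimator the record describes: it substitutes a simple per-type median baseline computed from historical durations. The task types, durations, function names, the assumption that tasks in a chain run strictly one after another, and the threshold `factor` are all hypothetical and chosen only for illustration.

```python
from statistics import median

# Hypothetical historical records: (task_type, duration_hours) of finished tasks.
HISTORY = [
    ("simulation", 40.0), ("simulation", 44.0), ("simulation", 52.0),
    ("reconstruction", 10.0), ("reconstruction", 12.0), ("reconstruction", 14.0),
]


def build_baseline(history):
    """Group finished-task durations by task type."""
    by_type = {}
    for task_type, hours in history:
        by_type.setdefault(task_type, []).append(hours)
    return by_type


def estimate_ttc(by_type, task_type):
    """Baseline TTC estimate: median duration of finished tasks of this type."""
    return median(by_type[task_type])


def estimate_chain_ttc(by_type, chain):
    """TTC of a chain, assuming its tasks run strictly one after another."""
    return sum(estimate_ttc(by_type, t) for t in chain)


def is_abnormal(by_type, task_type, elapsed_hours, factor=2.0):
    """Flag a running task whose elapsed time exceeds factor x the estimate."""
    return elapsed_hours > factor * estimate_ttc(by_type, task_type)


baseline = build_baseline(HISTORY)
print(estimate_ttc(baseline, "simulation"))                            # 44.0
print(estimate_chain_ttc(baseline, ["simulation", "reconstruction"]))  # 56.0
print(is_abnormal(baseline, "reconstruction", 30.0))                   # True
```

A real estimator of the kind described would replace `estimate_ttc` with a trained model and would feed in real-time resource-usage metrics; the median baseline only shows how the three outputs relate to one another.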
id cern-2279947
institution European Organization for Nuclear Research (CERN)
language eng
publishDate 2017
record_format invenio
spelling cern-2279947 2019-09-30T06:29:59Z http://cds.cern.ch/record/2279947 eng ATL-SOFT-SLIDE-2017-667 oai:cds.cern.ch:2279947 2017-08-16
title Predictive analytics tools to adjust and monitor performance metrics for the ATLAS Production System
topic Particle Physics - Experiment
url http://cds.cern.ch/record/2279947