
Predictive analytics tools to adjust and monitor performance metrics for the ATLAS Production System

Every scientific workflow has an organizational component whose purpose is to plan the analysis process against a defined schedule and so keep the work progressing efficiently. Information such as an estimate of the processing time or the likelihood of a system outage (abnormal behaviour)...


Bibliographic Details
Main authors:	Titov, Mikhail; Gubin, Maksim; Klimentov, Alexei; Barreiro Megino, Fernando Harald; Golubkov, Dmitry; Borodin, Mikhail; Maeno, Tadashi; Padolski, Siarhei; Korchuganova, Tatiana; Grigoryeva, Maria
Language:	eng
Published:	2017
Subjects:	Particle Physics - Experiment
Online access:	http://cds.cern.ch/record/2279947

_version_	1780955493653544960
author	Titov, Mikhail; Gubin, Maksim; Klimentov, Alexei; Barreiro Megino, Fernando Harald; Golubkov, Dmitry; Borodin, Mikhail; Maeno, Tadashi; Padolski, Siarhei; Korchuganova, Tatiana; Grigoryeva, Maria
author_sort	Titov, Mikhail
collection	CERN
description	Every scientific workflow has an organizational component whose purpose is to plan the analysis process against a defined schedule and so keep the work progressing efficiently. Information such as an estimate of the processing time or the likelihood of a system outage (abnormal behaviour) improves planning, helps monitor system performance, and helps predict the system's next state. The ATLAS Production System is an automated scheduling system responsible for central production of Monte-Carlo data, highly specialized production for physics groups, and data pre-processing and analysis on grid infrastructures, clouds and supercomputers. Its next generation (ProdSys2) processes around 2M tasks per year, which corresponds to more than 365M jobs per year. ProdSys2 evolves to accommodate a growing number of users and new requirements from the ATLAS Collaboration, physics groups and individual users. ATLAS Distributed Computing in its current state is an aggregation of large and heterogeneous facilities running on the WLCG, academic and commercial clouds, and supercomputers. In this cyber-infrastructure, contention for resources among high-priority data analyses happens routinely, which can lead to significant workload and data-handling interruptions. The lack of any means to monitor and predict the behaviour of the analysis process (its duration) and the state of the system itself motivated the design of built-in situational-awareness analytic tools. The proposed suite of tools aims to estimate the completion time (the so-called "Time To Complete", TTC) of every production task (i.e., to predict the task duration), to estimate the completion time of a chain of tasks, and to predict the failure state of the system (e.g., based on "abnormal" task processing times). Its implementation is based on Machine Learning methods and techniques; besides historical information about finished tasks, it uses ProdSys2 job execution information and the state of resource usage (real-time parameters and metrics used to adjust predicted values to the state of the computing environment). The WLCG ML R&D project started in 2016. Within the project, the first implementation of the TTC Estimator (for production tasks) was developed, and its visualization was integrated into the ProdSys Monitor.
id	cern-2279947
institution	Organización Europea para la Investigación Nuclear
language	eng
publishDate	2017
record_format	invenio
spelling	cern-2279947 2019-09-30T06:29:59Z http://cds.cern.ch/record/2279947 eng ATL-SOFT-SLIDE-2017-667 oai:cds.cern.ch:2279947 2017-08-16
title	Predictive analytics tools to adjust and monitor performance metrics for the ATLAS Production System
topic	Particle Physics - Experiment
url	http://cds.cern.ch/record/2279947
