
Performance studies of CMS workflows using Big Data technologies

Bibliographic Details
Main Author: Ambroz, Luca
Language: eng
Published: 2017
Subjects: Particle Physics - Experiment
Online Access: http://cds.cern.ch/record/2263131
_version_ 1780954235664334848
author Ambroz, Luca
author_facet Ambroz, Luca
author_sort Ambroz, Luca
collection CERN
description At the Large Hadron Collider (LHC), more than 30 petabytes of data are produced from particle collisions every year of data taking. Data processing also requires large volumes of simulated events, generated with Monte Carlo techniques. Furthermore, physics analysis implies daily access to derived data formats by hundreds of users. The Worldwide LHC Computing Grid (WLCG) - an international collaboration involving personnel and computing centers worldwide - is successfully coping with these challenges, enabling the LHC physics program. With the continuation of LHC data taking and the approval of ambitious projects such as the High-Luminosity LHC, these challenges will reach the edge of current computing capacity and performance. One of the keys to success in the coming decades - also under severe financial resource constraints - is to optimize the efficiency with which computing resources are exploited. This thesis focuses on performance studies of CMS workflows, namely centrally scheduled production activities and unpredictable distributed analysis. The work aims at developing and evaluating tools that improve the understanding of monitoring data in both production and analysis, and it therefore comprises two parts. Firstly, on the distributed analysis side, tools that quickly analyze the logs of previous Grid job submissions can enable a user to tune the next round of submissions and better exploit the computing resources. Secondly, concerning the monitoring of both analysis and production jobs, commercial Big Data technologies can be used to obtain more efficient and flexible monitoring systems. One aspect of this improvement is the possibility to avoid heavy aggregation at an early stage and instead collect much finer-granularity monitoring data, which can be further processed at a later stage, upon request. This thesis presents work in both directions. Firstly, a lightweight tool for rapid studies of distributed analysis performance is presented as a way to enable physics users to smoothly match their job submissions to the changing conditions of the overall environment. Secondly, a set of performance studies on the CMS Workflow Management and Data Management sectors is performed using a CMS Metrics Service prototype, based on ElasticSearch, Jupyter Notebook and Kibana, that contains high-granularity information on CMS production and analysis jobs derived from the HTCondor ClassAds. Chapter 1 provides an overview of the Standard Model. Chapter 2 discusses the LHC accelerator complex and experiments, with a main focus on CMS. Chapter 3 introduces computing in High Energy Physics and describes the CMS Computing Model. Chapter 4 presents the development of an original tool for evaluating the performance of local analysis jobs. Chapter 5 describes how data from the CMS Metrics Service can be analyzed to provide insights into CMS global activities.
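The description above mentions a CMS Metrics Service prototype that stores high-granularity job records, derived from HTCondor ClassAds, in ElasticSearch and exposes them through Jupyter Notebook and Kibana. As a minimal illustrative sketch only - the cluster endpoint, index name ("cms-job-monitoring") and field names ("Workflow", "CpuTimeHr", "WallClockHr", "RecordTime") used here are assumptions, not taken from this record - the snippet below shows the kind of late-stage, on-request aggregation such a setup enables from a notebook: per-workflow CPU efficiency over the last week of jobs.

    # Minimal sketch, assuming a hypothetical Elasticsearch index of per-job
    # ClassAd documents; index and field names below are illustrative only.
    from elasticsearch import Elasticsearch

    es = Elasticsearch(["http://localhost:9200"])  # hypothetical cluster endpoint

    query = {
        "size": 0,  # we only need the aggregations, not the individual job documents
        "query": {"range": {"RecordTime": {"gte": "now-7d/d"}}},  # jobs from the last week
        "aggs": {
            "per_workflow": {
                "terms": {"field": "Workflow", "size": 20},
                "aggs": {
                    "cpu_hours": {"sum": {"field": "CpuTimeHr"}},
                    "wall_hours": {"sum": {"field": "WallClockHr"}},
                },
            }
        },
    }

    resp = es.search(index="cms-job-monitoring", body=query)

    # CPU efficiency = total CPU time / total wall-clock time, computed per workflow
    # from the summed fields rather than from any pre-aggregated monitoring table.
    for bucket in resp["aggregations"]["per_workflow"]["buckets"]:
        cpu = bucket["cpu_hours"]["value"]
        wall = bucket["wall_hours"]["value"]
        eff = cpu / wall if wall > 0 else 0.0
        print(f"{bucket['key']}: {eff:.2%} CPU efficiency over {wall:.0f} wall-clock hours")

The same terms aggregation could equally be drawn as a Kibana visualization; the point of the approach described in the abstract is that the fine-grained per-job documents remain available and aggregation happens only when a question is actually asked.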
id oai-inspirehep.net-1520805
institution European Organization for Nuclear Research (CERN)
language eng
publishDate 2017
record_format invenio
spelling oai-inspirehep.net-1520805 2019-09-30T06:29:59Z http://cds.cern.ch/record/2263131 eng Ambroz, Luca Performance studies of CMS workflows using Big Data technologies Particle Physics - Experiment At the Large Hadron Collider (LHC), more than 30 petabytes of data are produced from particle collisions every year of data taking. Data processing also requires large volumes of simulated events, generated with Monte Carlo techniques. Furthermore, physics analysis implies daily access to derived data formats by hundreds of users. The Worldwide LHC Computing Grid (WLCG) - an international collaboration involving personnel and computing centers worldwide - is successfully coping with these challenges, enabling the LHC physics program. With the continuation of LHC data taking and the approval of ambitious projects such as the High-Luminosity LHC, these challenges will reach the edge of current computing capacity and performance. One of the keys to success in the coming decades - also under severe financial resource constraints - is to optimize the efficiency with which computing resources are exploited. This thesis focuses on performance studies of CMS workflows, namely centrally scheduled production activities and unpredictable distributed analysis. The work aims at developing and evaluating tools that improve the understanding of monitoring data in both production and analysis, and it therefore comprises two parts. Firstly, on the distributed analysis side, tools that quickly analyze the logs of previous Grid job submissions can enable a user to tune the next round of submissions and better exploit the computing resources. Secondly, concerning the monitoring of both analysis and production jobs, commercial Big Data technologies can be used to obtain more efficient and flexible monitoring systems. One aspect of this improvement is the possibility to avoid heavy aggregation at an early stage and instead collect much finer-granularity monitoring data, which can be further processed at a later stage, upon request. This thesis presents work in both directions. Firstly, a lightweight tool for rapid studies of distributed analysis performance is presented as a way to enable physics users to smoothly match their job submissions to the changing conditions of the overall environment. Secondly, a set of performance studies on the CMS Workflow Management and Data Management sectors is performed using a CMS Metrics Service prototype, based on ElasticSearch, Jupyter Notebook and Kibana, that contains high-granularity information on CMS production and analysis jobs derived from the HTCondor ClassAds. Chapter 1 provides an overview of the Standard Model. Chapter 2 discusses the LHC accelerator complex and experiments, with a main focus on CMS. Chapter 3 introduces computing in High Energy Physics and describes the CMS Computing Model. Chapter 4 presents the development of an original tool for evaluating the performance of local analysis jobs. Chapter 5 describes how data from the CMS Metrics Service can be analyzed to provide insights into CMS global activities. CERN-THESIS-2016-295 oai:inspirehep.net:1520805 2017-05-10T04:54:21Z
spellingShingle Particle Physics - Experiment
Ambroz, Luca
Performance studies of CMS workflows using Big Data technologies
title Performance studies of CMS workflows using Big Data technologies
title_full Performance studies of CMS workflows using Big Data technologies
title_fullStr Performance studies of CMS workflows using Big Data technologies
title_full_unstemmed Performance studies of CMS workflows using Big Data technologies
title_short Performance studies of CMS workflows using Big Data technologies
title_sort performance studies of cms workflows using big data technologies
topic Particle Physics - Experiment
url http://cds.cern.ch/record/2263131
work_keys_str_mv AT ambrozluca performancestudiesofcmsworkflowsusingbigdatatechnologies