Analyzing WLCG File Transfer Errors Through Machine Learning: An Automatic Pipeline to Support Post-mortem Distributed Data Management

The increasingly growing scale of modern computing infrastructures solicits more ingenious and automatic solutions to their management. Our work focuses on file transfer failures within the Worldwide Large Hadron Collider Computing Grid and proposes a pipeline to support distributed data management...

Bibliographic Details
Main authors:	Clissa, Luca; Lassnig, Mario; Rinaldi, Lorenzo
Language:	eng
Published:	2022
Subjects:	Computing and Computers
Online access:	https://dx.doi.org/10.1007/s41781-022-00089-z http://cds.cern.ch/record/2839280

_version_	1780975959936073728
author	Clissa, Luca; Lassnig, Mario; Rinaldi, Lorenzo
author_facet	Clissa, Luca; Lassnig, Mario; Rinaldi, Lorenzo
author_sort	Clissa, Luca
collection	CERN
description	The increasingly growing scale of modern computing infrastructures solicits more ingenious and automatic solutions to their management. Our work focuses on file transfer failures within the Worldwide Large Hadron Collider Computing Grid and proposes a pipeline to support distributed data management operations by suggesting potential issues to investigate. Specifically, we adopt an unsupervised learning approach leveraging Natural Language Processing and Machine Learning tools to automatically parse error messages and group similar failures. The results are presented in the form of a summary table containing the most common textual patterns and time evolution charts. This approach has two main advantages. First, the joint elaboration of the error string and the transfer’s source/destination enables more informative and compact troubleshooting, as opposed to inspecting each site and checking unique messages separately. As a by-product, this also reduces the number of errors to check by some orders of magnitude (from unique error strings to unique categories or patterns). Second, the time evolution plots allow operators to immediately filter out secondary issues (e.g. transient or in resolution) and focus on the most serious problems first (e.g. escalating failures). As a preliminary assessment, we compare our results with the Global Grid User Support ticketing system, showing that most of our suggestions are indeed real issues (direct association), while being able to cover 89% of reported incidents (inverse relationship).
id	cern-2839280
institution	European Organization for Nuclear Research
language	eng
publishDate	2022
record_format	invenio
spelling	cern-2839280 2022-11-02T20:53:24Z doi:10.1007/s41781-022-00089-z http://cds.cern.ch/record/2839280 eng Clissa, Luca; Lassnig, Mario; Rinaldi, Lorenzo Analyzing WLCG File Transfer Errors Through Machine Learning: An Automatic Pipeline to Support Post-mortem Distributed Data Management Computing and Computers The increasingly growing scale of modern computing infrastructures solicits more ingenious and automatic solutions to their management. Our work focuses on file transfer failures within the Worldwide Large Hadron Collider Computing Grid and proposes a pipeline to support distributed data management operations by suggesting potential issues to investigate. Specifically, we adopt an unsupervised learning approach leveraging Natural Language Processing and Machine Learning tools to automatically parse error messages and group similar failures. The results are presented in the form of a summary table containing the most common textual patterns and time evolution charts. This approach has two main advantages. First, the joint elaboration of the error string and the transfer’s source/destination enables more informative and compact troubleshooting, as opposed to inspecting each site and checking unique messages separately. As a by-product, this also reduces the number of errors to check by some orders of magnitude (from unique error strings to unique categories or patterns). Second, the time evolution plots allow operators to immediately filter out secondary issues (e.g. transient or in resolution) and focus on the most serious problems first (e.g. escalating failures). As a preliminary assessment, we compare our results with the Global Grid User Support ticketing system, showing that most of our suggestions are indeed real issues (direct association), while being able to cover 89% of reported incidents (inverse relationship). oai:cds.cern.ch:2839280 2022
spellingShingle	Computing and Computers Clissa, Luca Lassnig, Mario Rinaldi, Lorenzo Analyzing WLCG File Transfer Errors Through Machine Learning: An Automatic Pipeline to Support Post-mortem Distributed Data Management
title	Analyzing WLCG File Transfer Errors Through Machine Learning: An Automatic Pipeline to Support Post-mortem Distributed Data Management
title_sort	analyzing wlcg file transfer errors through machine learning: an automatic pipeline to support post-mortem distributed data management
topic	Computing and Computers
url	https://dx.doi.org/10.1007/s41781-022-00089-z http://cds.cern.ch/record/2839280
work_keys_str_mv	AT clissaluca analyzingwlcgfiletransfererrorsthroughmachinelearninganautomaticpipelinetosupportpostmortemdistributeddatamanagement AT lassnigmario analyzingwlcgfiletransfererrorsthroughmachinelearninganautomaticpipelinetosupportpostmortemdistributeddatamanagement AT rinaldilorenzo analyzingwlcgfiletransfererrorsthroughmachinelearninganautomaticpipelinetosupportpostmortemdistributeddatamanagement
