Analyzing WLCG File Transfer Errors Through Machine Learning: An Automatic Pipeline to Support Post-mortem Distributed Data Management

The increasingly growing scale of modern computing infrastructures solicits more ingenious and automatic solutions to their management. Our work focuses on file transfer failures within the Worldwide Large Hadron Collider Computing Grid and proposes a pipeline to support distributed data management...

Bibliographic Details
Main authors: Clissa, Luca; Lassnig, Mario; Rinaldi, Lorenzo
Language: eng
Published: 2022
Subjects: Computing and Computers
Online access: https://dx.doi.org/10.1007/s41781-022-00089-z
http://cds.cern.ch/record/2839280
_version_ 1780975959936073728
author Clissa, Luca
Lassnig, Mario
Rinaldi, Lorenzo
author_facet Clissa, Luca
Lassnig, Mario
Rinaldi, Lorenzo
author_sort Clissa, Luca
collection CERN
description The increasingly growing scale of modern computing infrastructures solicits more ingenious and automatic solutions to their management. Our work focuses on file transfer failures within the Worldwide Large Hadron Collider Computing Grid and proposes a pipeline to support distributed data management operations by suggesting potential issues to investigate. Specifically, we adopt an unsupervised learning approach leveraging Natural Language Processing and Machine Learning tools to automatically parse error messages and group similar failures. The results are presented in the form of a summary table containing the most common textual patterns and time evolution charts. This approach has two main advantages. First, the joint elaboration of the error string and the transfer’s source/destination enables more informative and compact troubleshooting, as opposed to inspecting each site and checking unique messages separately. As a by-product, this also reduces the number of errors to check by some orders of magnitude (from unique error strings to unique categories or patterns). Second, the time evolution plots allow operators to immediately filter out secondary issues (e.g. transient or in resolution) and focus on the most serious problems first (e.g. escalating failures). As a preliminary assessment, we compare our results with the Global Grid User Support ticketing system, showing that most of our suggestions are indeed real issues (direct association), while being able to cover 89% of reported incidents (inverse relationship).
id cern-2839280
institution European Organization for Nuclear Research
language eng
publishDate 2022
record_format invenio
spelling cern-2839280 ; 2022-11-02T20:53:24Z ; doi:10.1007/s41781-022-00089-z ; http://cds.cern.ch/record/2839280 ; eng ; Clissa, Luca ; Lassnig, Mario ; Rinaldi, Lorenzo ; Analyzing WLCG File Transfer Errors Through Machine Learning: An Automatic Pipeline to Support Post-mortem Distributed Data Management ; Computing and Computers ; The increasingly growing scale of modern computing infrastructures solicits more ingenious and automatic solutions to their management. Our work focuses on file transfer failures within the Worldwide Large Hadron Collider Computing Grid and proposes a pipeline to support distributed data management operations by suggesting potential issues to investigate. Specifically, we adopt an unsupervised learning approach leveraging Natural Language Processing and Machine Learning tools to automatically parse error messages and group similar failures. The results are presented in the form of a summary table containing the most common textual patterns and time evolution charts. This approach has two main advantages. First, the joint elaboration of the error string and the transfer’s source/destination enables more informative and compact troubleshooting, as opposed to inspecting each site and checking unique messages separately. As a by-product, this also reduces the number of errors to check by some orders of magnitude (from unique error strings to unique categories or patterns). Second, the time evolution plots allow operators to immediately filter out secondary issues (e.g. transient or in resolution) and focus on the most serious problems first (e.g. escalating failures). As a preliminary assessment, we compare our results with the Global Grid User Support ticketing system, showing that most of our suggestions are indeed real issues (direct association), while being able to cover 89% of reported incidents (inverse relationship). ; oai:cds.cern.ch:2839280 ; 2022
spellingShingle Computing and Computers
Clissa, Luca
Lassnig, Mario
Rinaldi, Lorenzo
Analyzing WLCG File Transfer Errors Through Machine Learning: An Automatic Pipeline to Support Post-mortem Distributed Data Management
title Analyzing WLCG File Transfer Errors Through Machine Learning: An Automatic Pipeline to Support Post-mortem Distributed Data Management
title_sort analyzing wlcg file transfer errors through machine learning: an automatic pipeline to support post-mortem distributed data management
topic Computing and Computers
url https://dx.doi.org/10.1007/s41781-022-00089-z
http://cds.cern.ch/record/2839280
work_keys_str_mv AT clissaluca analyzingwlcgfiletransfererrorsthroughmachinelearninganautomaticpipelinetosupportpostmortemdistributeddatamanagement
AT lassnigmario analyzingwlcgfiletransfererrorsthroughmachinelearninganautomaticpipelinetosupportpostmortemdistributeddatamanagement
AT rinaldilorenzo analyzingwlcgfiletransfererrorsthroughmachinelearninganautomaticpipelinetosupportpostmortemdistributeddatamanagement