
Exploring the self-service model to visualize the results of the ATLAS Machine Learning analysis jobs in BigPanDA with Openshift OKD3


Bibliographic Details
Main Authors: Stan, Ioan-Mihail, Padolski, Siarhei, Lee, Christopher Jon
Language: eng
Published: 2021
Subjects:
Online Access: http://cds.cern.ch/record/2773425
Description
Summary: A large scientific computing infrastructure must provide sufficient versatility to host any kind of experiment that can lead to innovative ideas and great discoveries. The ATLAS experiment provides wide access possibilities to execute intelligent and complex algorithms and to analyze and interpret the massive amount of data produced in the Large Hadron Collider at CERN. The PanDA (Production and Distributed Analysis) system is a workload management system that serves as an interface between the ATLAS Distributed Computing infrastructure and its tenants (e.g., scientific groups and physicists). The BigPanDA monitoring system is a sub-component of PanDA whose main role is to monitor the entire life cycle of a job or task running in the ATLAS Distributed Computing infrastructure. Because many scientific experiments are now driven by Machine Learning algorithms, the BigPanDA community wants to expand the platform’s capabilities and fill the gap between Machine Learning data processing and data visualization. In this regard, BigPanDA takes on the challenge of embracing the cloud-native paradigm and delegates the data presentation component to MLFlow instances deployed on Openshift OKD. BigPanDA thus interacts with the Openshift OKD native API and instructs the orchestrator on how to locate and display the results of the Machine Learning analysis using MLFlow microservices and Kubernetes/Openshift objects. In addition, the proposed solution architecture introduces various DevOps-specific patterns, including continuous integration for the MLFlow middleware container images and continuous deployment with rolling upgrades for the existing running instances. Machine Learning data visualization services operate on demand and remain available only for a limited time, thus optimizing overall resource consumption.
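To illustrate the on-demand pattern described above, the sketch below shows how a monitoring service could ask the OKD/Kubernetes API to create and later remove a per-task MLFlow visualization Deployment. This is not the authors' implementation: the namespace, image name, labels, and helper functions are illustrative assumptions, and only the standard Kubernetes Python client API is used.

```python
# Minimal sketch: create an on-demand MLflow UI Deployment for one analysis task,
# and delete it once the visualization window expires. All concrete names
# (namespace, image, labels) are hypothetical placeholders.
from kubernetes import client, config


def launch_mlflow_ui(task_id: str, namespace: str = "bigpanda-mlflow") -> None:
    """Create a single-replica MLflow UI Deployment labelled with the task id."""
    config.load_kube_config()  # use config.load_incluster_config() when running inside the cluster
    apps = client.AppsV1Api()

    labels = {"app": "mlflow-ui", "panda-task": task_id}
    container = client.V1Container(
        name="mlflow-ui",
        image="registry.example.org/mlflow-ui:latest",  # hypothetical CI-built middleware image
        args=["mlflow", "ui", "--host", "0.0.0.0", "--port", "5000"],
        ports=[client.V1ContainerPort(container_port=5000)],
    )
    deployment = client.V1Deployment(
        metadata=client.V1ObjectMeta(name=f"mlflow-ui-{task_id}", labels=labels),
        spec=client.V1DeploymentSpec(
            replicas=1,
            selector=client.V1LabelSelector(match_labels=labels),
            template=client.V1PodTemplateSpec(
                metadata=client.V1ObjectMeta(labels=labels),
                spec=client.V1PodSpec(containers=[container]),
            ),
            # The default RollingUpdate strategy matches the rolling-upgrade pattern
            # mentioned in the abstract for already-running instances.
        ),
    )
    apps.create_namespaced_deployment(namespace=namespace, body=deployment)


def teardown_mlflow_ui(task_id: str, namespace: str = "bigpanda-mlflow") -> None:
    """Remove the Deployment when the visualization is no longer needed."""
    client.AppsV1Api().delete_namespaced_deployment(
        name=f"mlflow-ui-{task_id}", namespace=namespace
    )
```

In such a scheme, a controller or cron-style job could call teardown_mlflow_ui after the allotted lifetime, which is one way the limited-time availability and reduced resource consumption mentioned above could be realized.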