Training and Serving ML workloads with Kubeflow at CERN

Bibliographic Details
Main authors: Golubovic, Dejan; Rocha, Ricardo
Language: eng
Published: 2021
Subjects:
Online access: https://dx.doi.org/10.1051/epjconf/202125102067
http://cds.cern.ch/record/2780362
_version_ 1780971865141936128
author Golubovic, Dejan
Rocha, Ricardo
author_facet Golubovic, Dejan
Rocha, Ricardo
author_sort Golubovic, Dejan
collection CERN
description Machine Learning (ML) has been growing in popularity across multiple areas and groups at CERN, with applications including fast simulation, tracking, and anomaly detection, among many others. We describe a new service available at CERN, based on Kubeflow, that manages the full ML lifecycle: data preparation and interactive analysis, large-scale distributed model training, and model serving. We cover specific features available for hyper-parameter tuning and model metadata management, as well as infrastructure details for integrating accelerators and external resources. We also present results and a cost evaluation from scaling out a popular ML use case on public cloud resources, achieving close to linear scaling when using a large number of GPUs.
id cern-2780362
institution European Organization for Nuclear Research (CERN)
language eng
publishDate 2021
record_format invenio
title Training and Serving ML workloads with Kubeflow at CERN
topic Computing and Computers
url https://dx.doi.org/10.1051/epjconf/202125102067
http://cds.cern.ch/record/2780362