Ensemble Prediction of Job Resources to Improve System Performance for Slurm-Based HPC Systems

In this paper, we present a novel methodology for predicting job resources (memory and time) for submitted jobs on HPC systems. Our methodology is based on historical jobs data (saccount data) provided by the Slurm workload manager and uses supervised machine learning. This Machine Learning (ML) predict...

Bibliographic Details
Main Authors:	Tanash, Mohammed; Andresen, Daniel; Yang, Huichen; Hsu, William
Format:	Online Article Text
Language:	English
Published:	2021
Subjects:	Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8974354/ https://www.ncbi.nlm.nih.gov/pubmed/35373221 http://dx.doi.org/10.1145/3437359.3465574

author	Tanash, Mohammed; Andresen, Daniel; Yang, Huichen; Hsu, William
collection	PubMed
description	In this paper, we present a novel methodology for predicting job resources (memory and time) for submitted jobs on HPC systems. Our methodology is based on historical jobs data (saccount data) provided by the Slurm workload manager and uses supervised machine learning. This Machine Learning (ML) prediction model is effective and useful for both HPC administrators and HPC users. Moreover, our ML model increases the efficiency and utilization of HPC systems, thus reducing power consumption as well. Our model uses several supervised machine learning discriminative models from the scikit-learn machine learning library and LightGBM, applied to historical data from Slurm. Our model helps HPC users determine the required amount of resources for their submitted jobs and makes it easier for them to use HPC resources efficiently. This work provides the second step towards implementing our general open source tool for HPC service providers. For this work, our machine learning model has been implemented and tested using two HPC providers: an XSEDE service provider, the University of Colorado-Boulder (RMACC Summit), and Kansas State University (Beocat). We used more than two hundred thousand jobs, one hundred thousand from Summit and one hundred thousand from Beocat, to model and assess our ML model performance. In particular, we measured the improvement in running time, turnaround time, and average waiting time for the submitted jobs, and measured utilization of the HPC clusters. Our model achieved up to 86% accuracy in predicting the amount of time and the amount of memory for both Summit and Beocat HPC resources. Our results show that our model helps dramatically reduce computational average waiting time (from 380 to 4 hours in RMACC Summit and from 662 hours to 28 hours in Beocat), reduced turnaround time (from 403 to 6 hours in RMACC Summit and from 673 hours to 35 hours in Beocat), and achieved up to 100% utilization for both HPC resources.
format	Online Article Text
id	pubmed-8974354
institution	National Center for Biotechnology Information
language	English
publishDate	2021
record_format	MEDLINE/PubMed
spelling	pubmed-8974354 2022-04-01 Ensemble Prediction of Job Resources to Improve System Performance for Slurm-Based HPC Systems Tanash, Mohammed; Andresen, Daniel; Yang, Huichen; Hsu, William Pract Exp Adv Res Comput (2021) Article 2021-07 2021-07-17 /pmc/articles/PMC8974354/ /pubmed/35373221 http://dx.doi.org/10.1145/3437359.3465574 Text en https://creativecommons.org/licenses/by-nc/4.0/ This work is licensed under a Creative Commons Attribution-NonCommercial International 4.0 License.
title	Ensemble Prediction of Job Resources to Improve System Performance for Slurm-Based HPC Systems
topic	Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8974354/ https://www.ncbi.nlm.nih.gov/pubmed/35373221 http://dx.doi.org/10.1145/3437359.3465574
