
Improving the cost efficiency of large-scale cloud systems running hybrid workloads - A case study of Alibaba cluster traces

The coronavirus pandemic has dramatically disrupted the retail industry: many stores have been forced to close, and people across the world are sheltering in place, with online shopping as the inevitable choice. To meet the rapidly increasing demand for e-commerce, more data centers are expected to provi...


Bibliographic Details
Main authors:	Everman, Brad; Rajendran, Narmadha; Li, Xiaomin; Zong, Ziliang
Format:	Online Article Text
Language:	English
Published:	Elsevier Inc. 2021
Subjects:	Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9760204/ https://www.ncbi.nlm.nih.gov/pubmed/37522151 http://dx.doi.org/10.1016/j.suscom.2021.100528

_version_	1784852409063833600
author	Everman, Brad; Rajendran, Narmadha; Li, Xiaomin; Zong, Ziliang
author_sort	Everman, Brad
collection	PubMed
description	The coronavirus pandemic has dramatically disrupted the retail industry: many stores have been forced to close, and people across the world are sheltering in place, with online shopping as the inevitable choice. To meet the rapidly increasing demand for e-commerce, more data centers are expected to provide new cloud services, or significantly improve existing ones, to better support hybrid workloads (e.g., online purchase jobs and batch jobs that support ranking or recommendation systems). Successful cloud systems need to handle huge volumes of traffic from such hybrid workloads efficiently and respond to them quickly. Meanwhile, it is critical to reduce the total cost of ownership (TCO) for profitability. Improving system utilization is one effective technique for achieving the twin goals of high performance and low TCO. This paper presents a comprehensive analysis of the 2017 and 2018 cluster traces released by Alibaba, which provide a case study of Alibaba's best practices in improving the performance and cost efficiency of its large-scale cloud systems by consolidating time-sensitive online service jobs with time-insensitive batch jobs. Our investigation indicates that over-subscription (causing resource waste and low utilization) and under-subscription (causing performance degradation) problems co-exist in the current Alibaba system. We develop a simulator that allows us to evaluate possible solutions to these problems and their impact on performance, energy consumption, and TCO. Our experiments show that the estimated TCO can be reduced by $600,000 for the 2018 trace running on over 4,000 machines without compromising performance. The TCO could decrease by nearly $68 million if a similar strategy were extrapolated to Alibaba's 432,000 web-facing servers.
format	Online Article Text
id	pubmed-9760204
institution	National Center for Biotechnology Information
language	English
publishDate	2021
publisher	Elsevier Inc.
record_format	MEDLINE/PubMed
spelling	pubmed-9760204 2022-12-19 Sustainable Computing: Informatics and Systems Article Elsevier Inc. 2021-06 2021-03-03 /pmc/articles/PMC9760204/ /pubmed/37522151 http://dx.doi.org/10.1016/j.suscom.2021.100528 Text en © 2021 Elsevier Inc. All rights reserved. Since January 2020 Elsevier has created a COVID-19 resource centre with free information in English and Mandarin on the novel coronavirus COVID-19. The COVID-19 resource centre is hosted on Elsevier Connect, the company's public news and information website. Elsevier hereby grants permission to make all its COVID-19-related research that is available on the COVID-19 resource centre, including this research content, immediately available in PubMed Central and other publicly funded repositories, such as the WHO COVID database, with rights for unrestricted research re-use and analyses in any form or by any means with acknowledgement of the original source. These permissions are granted for free by Elsevier for as long as the COVID-19 resource centre remains active.
title	Improving the cost efficiency of large-scale cloud systems running hybrid workloads - A case study of Alibaba cluster traces
topic	Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9760204/ https://www.ncbi.nlm.nih.gov/pubmed/37522151 http://dx.doi.org/10.1016/j.suscom.2021.100528
