
HaRD: a heterogeneity-aware replica deletion for HDFS

The Hadoop distributed file system (HDFS) is responsible for storing very large data-sets reliably on clusters of commodity machines. The HDFS takes advantage of replication to serve data requested by clients with high throughput. Data replication is a trade-off between better data availability and...


Bibliographic Details
Main authors: Ciritoglu, Hilmi Egemen; Murphy, John; Thorpe, Christina
Format: Online Article Text
Language: English
Published: Springer International Publishing 2019
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6803594/
https://www.ncbi.nlm.nih.gov/pubmed/31700766
http://dx.doi.org/10.1186/s40537-019-0256-6
author Ciritoglu, Hilmi Egemen
Murphy, John
Thorpe, Christina
collection PubMed
description The Hadoop distributed file system (HDFS) is responsible for storing very large data-sets reliably on clusters of commodity machines. HDFS uses replication to serve data requested by clients with high throughput. Data replication is a trade-off between better data availability and higher disk usage. Recent studies propose data replication management frameworks that alter the replication factor of files dynamically in response to the popularity of the data, keeping more replicas for in-demand data to enhance the overall performance of the system. When data becomes less popular, these schemes reduce the replication factor, which changes how the data is spread across nodes and leaves the distribution unbalanced. Such an unbalanced data distribution causes hot spots, low data locality and excessive network usage in the cluster. In this work, we first confirm that reducing the replication factor causes unbalanced data distribution when using Hadoop’s default replica deletion scheme. Then, we show that even keeping a balanced data distribution using WBRD, the data-distribution-aware replica deletion scheme that we proposed in previous work, performs sub-optimally on heterogeneous clusters. To overcome this issue, we propose a heterogeneity-aware replica deletion scheme (HaRD). HaRD considers the nodes’ processing capabilities when deleting replicas; hence it stores more replicas on the more powerful nodes. We implemented HaRD on top of HDFS and conducted a performance evaluation on a 23-node dedicated heterogeneous cluster. Our results show that HaRD reduced execution time by up to 60% and 17% compared with Hadoop and WBRD, respectively.
format Online
Article
Text
id pubmed-6803594
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher Springer International Publishing
record_format MEDLINE/PubMed
spelling pubmed-6803594 2019-11-05 HaRD: a heterogeneity-aware replica deletion for HDFS Ciritoglu, Hilmi Egemen Murphy, John Thorpe, Christina J Big Data Research
Springer International Publishing 2019-10-21 2019 /pmc/articles/PMC6803594/ /pubmed/31700766 http://dx.doi.org/10.1186/s40537-019-0256-6 Text en © The Author(s) 2019. Open Access: This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
title HaRD: a heterogeneity-aware replica deletion for HDFS
topic Research