Fault-Detection Managers: More May Not Be the Merrier

A fault management system contains managers that detect faults as well as initiate recovery actions. Such management systems often come in an architecture that is not only a distributed one but also decoupled from the applications. Although an arrangement like this promotes scalability, it unfortuna...

Bibliographic Details
Main Authors: Zamani, Ghazal; Das, Olivia
Format: Online Article Text
Language: English
Published: Springer Netherlands 2021
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7896543/
https://www.ncbi.nlm.nih.gov/pubmed/33642963
http://dx.doi.org/10.1007/s10723-021-09546-2
author Zamani, Ghazal
Das, Olivia
author_facet Zamani, Ghazal
Das, Olivia
author_sort Zamani, Ghazal
collection PubMed
description A fault management system contains managers that detect faults and initiate recovery actions. Such systems often use an architecture that is both distributed and decoupled from the applications. Although this arrangement promotes scalability, it makes the recovery of applications dependent on the fault management system itself. This work introduces two novel equations to help meet the performance objectives of applications. First, we create an equation that estimates the maximum number of jobs an application instance can handle while meeting a given performance objective. An admission control mechanism then uses this formula to restrict the number of jobs (targeted at operational application instances) allowed to enter the system. Next, we create a second equation that computes the response time distribution of an application. We then develop a simulation model that predicts the impact that failures of four sample fault management architectures have on an application’s performance. Using our equations, we compare the architectures under three distinct ways of handling affected jobs when application instances fail: allowing job loss; retrying jobs, which results in overload; and employing admission control to mitigate the overload. Our simulation results show that increasing the number of managers may not always be beneficial; rather, the interconnection topology of the management architecture (i.e., the layout of interconnects linking the architectural components), together with the model parameter values, may sometimes play a bigger role in the application’s performance.
format Online
Article
Text
id pubmed-7896543
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Springer Netherlands
record_format MEDLINE/PubMed
spelling pubmed-7896543 2021-02-22 Fault-Detection Managers: More May Not Be the Merrier Zamani, Ghazal Das, Olivia J Grid Comput Article
Springer Netherlands 2021-02-20 2021 /pmc/articles/PMC7896543/ /pubmed/33642963 http://dx.doi.org/10.1007/s10723-021-09546-2 Text en © The Author(s), under exclusive licence to Springer Nature B.V. part of Springer Nature 2021 This article is made available via the PMC Open Access Subset for unrestricted research re-use and secondary analysis in any form or by any means with acknowledgement of the original source. These permissions are granted for the duration of the World Health Organization (WHO) declaration of COVID-19 as a global pandemic.
spellingShingle Article
Zamani, Ghazal
Das, Olivia
Fault-Detection Managers: More May Not Be the Merrier
title Fault-Detection Managers: More May Not Be the Merrier
title_sort fault-detection managers: more may not be the merrier
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7896543/
https://www.ncbi.nlm.nih.gov/pubmed/33642963
http://dx.doi.org/10.1007/s10723-021-09546-2