Failure analysis for distributed computing environments

Distributed computing systems cover a broad range of computing infrastructures that are heterogeneous, interconnected and architected around stack-based deployments. Failure occurrences within such tightly coupled systems, while expected, do not lend themselves easily to predictive modeling due to the co...

Bibliographic Details
Main authors: Datskova, Olga; Grigoras, Costin; Shi, Weidong
Language: eng
Published: 2017
Subjects:
Online access: https://dx.doi.org/10.1145/3147234.3148134
http://cds.cern.ch/record/2318247
_version_ 1780958449183490048
author Datskova, Olga
Grigoras, Costin
Shi, Weidong
author_facet Datskova, Olga
Grigoras, Costin
Shi, Weidong
author_sort Datskova, Olga
collection CERN
description Distributed computing systems cover a broad range of computing infrastructures that are heterogeneous, interconnected and architected around stack-based deployments. Failure occurrences within such tightly coupled systems, while expected, do not lend themselves easily to predictive modeling due to the complex interactions between interconnected service layers. This work examines service-level instabilities occurring within data centers participating in high-energy physics (HEP) scientific research. We present a stability measure, based on which a failure event selection process is deployed to detect periods of instability within individual data centers. Experts recognize that understanding the conditions for failure is crucial when designing recovery procedures. For distributed computing systems, risk and failure analysis facilitates the implementation of measures for service availability, subsystem recovery and network redundancy.
id oai-inspirehep.net-1670544
institution Organización Europea para la Investigación Nuclear
language eng
publishDate 2017
record_format invenio
spelling oai-inspirehep.net-1670544 2019-09-30T06:29:59Z doi:10.1145/3147234.3148134 http://cds.cern.ch/record/2318247 eng Datskova, Olga Grigoras, Costin Shi, Weidong Failure analysis for distributed computing environments Computing and Computers Distributed computing systems cover a broad range of computing infrastructures that are heterogeneous, interconnected and architected around stack-based deployments. Failure occurrences within such tightly coupled systems, while expected, do not lend themselves easily to predictive modeling due to the complex interactions between interconnected service layers. This work examines service-level instabilities occurring within data centers participating in high-energy physics (HEP) scientific research. We present a stability measure, based on which a failure event selection process is deployed to detect periods of instability within individual data centers. Experts recognize that understanding the conditions for failure is crucial when designing recovery procedures. For distributed computing systems, risk and failure analysis facilitates the implementation of measures for service availability, subsystem recovery and network redundancy. oai:inspirehep.net:1670544 2017
spellingShingle Computing and Computers
Datskova, Olga
Grigoras, Costin
Shi, Weidong
Failure analysis for distributed computing environments
title Failure analysis for distributed computing environments
title_full Failure analysis for distributed computing environments
title_fullStr Failure analysis for distributed computing environments
title_full_unstemmed Failure analysis for distributed computing environments
title_short Failure analysis for distributed computing environments
title_sort failure analysis for distributed computing environments
topic Computing and Computers
url https://dx.doi.org/10.1145/3147234.3148134
http://cds.cern.ch/record/2318247
work_keys_str_mv AT datskovaolga failureanalysisfordistributedcomputingenvironments
AT grigorascostin failureanalysisfordistributedcomputingenvironments
AT shiweidong failureanalysisfordistributedcomputingenvironments