Failure analysis for distributed computing environments
Distributed computing systems cover a broad range of computing infrastructures that are heterogeneous, interconnected, and architected around stack-based deployments. Failures within such tightly coupled systems, while expected, do not lend themselves easily to predictive modeling due to the co...
Main authors: | Datskova, Olga; Grigoras, Costin; Shi, Weidong |
---|---|
Language: | eng |
Published: | 2017 |
Subjects: | Computing and Computers |
Online access: | https://dx.doi.org/10.1145/3147234.3148134 http://cds.cern.ch/record/2318247 |
_version_ | 1780958449183490048 |
---|---|
author | Datskova, Olga Grigoras, Costin Shi, Weidong |
author_facet | Datskova, Olga Grigoras, Costin Shi, Weidong |
author_sort | Datskova, Olga |
collection | CERN |
description | Distributed computing systems cover a broad range of computing infrastructures that are heterogeneous, interconnected, and architected around stack-based deployments. Failures within such tightly coupled systems, while expected, do not lend themselves easily to predictive modeling due to the complex interactions between interconnected service layers. This work examines service-level instabilities occurring within data centers that participate in high-energy physics (HEP) scientific research. We present a stability measure and, based on it, deploy a failure event selection process to detect periods of instability within individual data centers. Understanding the conditions for failure is widely recognized as crucial when designing recovery procedures. For distributed computing systems, risk and failure analysis facilitates the implementation of measures for service availability, subsystem recovery, and network redundancy. |
id | oai-inspirehep.net-1670544 |
institution | Organización Europea para la Investigación Nuclear |
language | eng |
publishDate | 2017 |
record_format | invenio |
spelling | oai-inspirehep.net-1670544 2019-09-30T06:29:59Z doi:10.1145/3147234.3148134 http://cds.cern.ch/record/2318247 eng Datskova, Olga Grigoras, Costin Shi, Weidong Failure analysis for distributed computing environments Computing and Computers Distributed computing systems cover a broad range of computing infrastructures that are heterogeneous, interconnected, and architected around stack-based deployments. Failures within such tightly coupled systems, while expected, do not lend themselves easily to predictive modeling due to the complex interactions between interconnected service layers. This work examines service-level instabilities occurring within data centers that participate in high-energy physics (HEP) scientific research. We present a stability measure and, based on it, deploy a failure event selection process to detect periods of instability within individual data centers. Understanding the conditions for failure is widely recognized as crucial when designing recovery procedures. For distributed computing systems, risk and failure analysis facilitates the implementation of measures for service availability, subsystem recovery, and network redundancy. oai:inspirehep.net:1670544 2017 |
spellingShingle | Computing and Computers Datskova, Olga Grigoras, Costin Shi, Weidong Failure analysis for distributed computing environments |
title | Failure analysis for distributed computing environments |
title_full | Failure analysis for distributed computing environments |
title_fullStr | Failure analysis for distributed computing environments |
title_full_unstemmed | Failure analysis for distributed computing environments |
title_short | Failure analysis for distributed computing environments |
title_sort | failure analysis for distributed computing environments |
topic | Computing and Computers |
url | https://dx.doi.org/10.1145/3147234.3148134 http://cds.cern.ch/record/2318247 |
work_keys_str_mv | AT datskovaolga failureanalysisfordistributedcomputingenvironments AT grigorascostin failureanalysisfordistributedcomputingenvironments AT shiweidong failureanalysisfordistributedcomputingenvironments |