Failure analysis for distributed computing environments

Bibliographic Details
Main authors: Datskova, Olga; Grigoras, Costin; Shi, Weidong
Language: English
Published: 2017
Subjects:
Online access: https://dx.doi.org/10.1145/3147234.3148134
http://cds.cern.ch/record/2318247
Description
Summary: Distributed computing systems cover a broad range of computing infrastructures that are heterogeneous, interconnected and architected around stack-based deployments. Failures within such tightly coupled systems, while expected, do not lend themselves easily to predictive modeling because of the complex interactions between interconnected service layers. This work examines service-level instabilities occurring within data centers that participate in high-energy physics (HEP) scientific research. We present a stability measure, on the basis of which a failure event selection process is deployed to detect periods of instability within individual data centers. Experts recognize that understanding the conditions for failure is crucial when designing recovery procedures. For distributed computing systems, risk and failure analysis facilitates the implementation of measures for service availability, subsystem recovery and network redundancy.