Failure analysis for distributed computing environments
Distributed computing systems cover a broad range of computing infrastructures that are heterogeneous, interconnected, and architected around stack-based deployments. Failures within such tightly coupled systems, while expected, do not lend themselves easily to predictive modeling due to the co...
Main authors: | , , |
---|---|
Language: | eng |
Published: | 2017 |
Subjects: | |
Online access: | https://dx.doi.org/10.1145/3147234.3148134 http://cds.cern.ch/record/2318247 |
Summary: | Distributed computing systems cover a broad range of computing infrastructures that are heterogeneous, interconnected, and architected around stack-based deployments. Failures within such tightly coupled systems, while expected, do not lend themselves easily to predictive modeling because of the complex interactions between interconnected service layers. This work examines service-level instabilities occurring within data centers that participate in high-energy physics (HEP) scientific research. We present a stability measure, on the basis of which a failure event selection process is deployed to detect periods of instability within individual data centers. Experts recognize that understanding the conditions for failure is crucial when designing recovery procedures. For distributed computing systems, risk and failure analysis facilitates the implementation of measures for service availability, subsystem recovery, and network redundancy. |
---|