Cargando…

When One Line Took Thousands of Websites Offline

This talk describes an incident where an innocuous change in a configuration management system caused a highly-visible unavailability of thousands of websites, which was followed by an intense recovery procedure. The talk covers the part of the infrastructure that prevented more widespread damage, t...

Descripción completa

Detalles Bibliográficos
Autores principales: Henschel, Jack, Borges Aurindo Barros, Francisco
Lenguaje:eng
Publicado: 2023
Materias:
Acceso en línea:http://cds.cern.ch/record/2875365
Descripción
Sumario:This talk describes an incident where an innocuous change in a configuration management system caused a highly-visible unavailability of thousands of websites, which was followed by an intense recovery procedure. The talk covers the part of the infrastructure that prevented more widespread damage, the lessons learned (in terms of infrastructure design and operational procedures) as well as improvements significant improvements that have been implemented since then. All of this happened on Kubernetes infrastructure, therefore the talk will dive into the topics of Kubernetes operators, automation, manual intervention, configuration management and backups.