When One Line Took Thousands of Websites Offline
Main authors: | , |
---|---|
Language: | eng |
Published: | 2023 |
Subjects: | |
Online access: | http://cds.cern.ch/record/2875365 |
Summary: | This talk describes an incident in which an innocuous change in a configuration management system caused a highly visible outage of thousands of websites, followed by an intense recovery procedure. The talk covers the part of the infrastructure that prevented more widespread damage, the lessons learned (in terms of infrastructure design and operational procedures), and the significant improvements that have been implemented since then. All of this happened on Kubernetes infrastructure, so the talk dives into the topics of Kubernetes operators, automation, manual intervention, configuration management and backups. |