
When One Line Took Thousands of Websites Offline



Bibliographic Details
Main Authors: Henschel, Jack, Borges Aurindo Barros, Francisco
Language: eng
Published: 2023
Subjects: Talk
Online Access: http://cds.cern.ch/record/2875365
_version_ 1780978894318338048
author Henschel, Jack
Borges Aurindo Barros, Francisco
author_facet Henschel, Jack
Borges Aurindo Barros, Francisco
author_sort Henschel, Jack
collection CERN
description This talk describes an incident in which an innocuous change in a configuration management system made thousands of websites visibly unavailable, and the intense recovery procedure that followed. The talk covers the part of the infrastructure that prevented more widespread damage, the lessons learned (in terms of infrastructure design and operational procedures), and the significant improvements that have been implemented since then. All of this happened on Kubernetes infrastructure, so the talk dives into Kubernetes operators, automation, manual intervention, configuration management and backups.
id cern-2875365
institution European Organization for Nuclear Research
language eng
publishDate 2023
record_format invenio
spelling cern-2875365
2023-10-11T21:48:18Z
http://cds.cern.ch/record/2875365
eng
Henschel, Jack
Borges Aurindo Barros, Francisco
When One Line Took Thousands of Websites Offline
SREcon EMEA 2023
Talk
This talk describes an incident in which an innocuous change in a configuration management system made thousands of websites visibly unavailable, and the intense recovery procedure that followed. The talk covers the part of the infrastructure that prevented more widespread damage, the lessons learned (in terms of infrastructure design and operational procedures), and the significant improvements that have been implemented since then. All of this happened on Kubernetes infrastructure, so the talk dives into Kubernetes operators, automation, manual intervention, configuration management and backups.
IT-TALK-2012-008
oai:cds.cern.ch:2875365
2023
spellingShingle Talk
Henschel, Jack
Borges Aurindo Barros, Francisco
When One Line Took Thousands of Websites Offline
title When One Line Took Thousands of Websites Offline
title_full When One Line Took Thousands of Websites Offline
title_fullStr When One Line Took Thousands of Websites Offline
title_full_unstemmed When One Line Took Thousands of Websites Offline
title_short When One Line Took Thousands of Websites Offline
title_sort when one line took thousands of websites offline
topic Talk
url http://cds.cern.ch/record/2875365
work_keys_str_mv AT henscheljack whenonelinetookthousandsofwebsitesoffline
AT borgesaurindobarrosfrancisco whenonelinetookthousandsofwebsitesoffline
AT henscheljack sreconemea2023
AT borgesaurindobarrosfrancisco sreconemea2023