When One Line Took Thousands of Websites Offline
This talk describes an incident where an innocuous change in a configuration management system caused a highly visible unavailability of thousands of websites, which was followed by an intense recovery procedure. The talk covers the part of the infrastructure that prevented more widespread damage, t...
Main Authors: | Henschel, Jack; Borges Aurindo Barros, Francisco |
---|---|
Language: | eng |
Published: | 2023 |
Subjects: | Talk |
Online Access: | http://cds.cern.ch/record/2875365 |
_version_ | 1780978894318338048 |
---|---|
author | Henschel, Jack Borges Aurindo Barros, Francisco |
author_facet | Henschel, Jack Borges Aurindo Barros, Francisco |
author_sort | Henschel, Jack |
collection | CERN |
description | This talk describes an incident where an innocuous change in a configuration management system caused a highly visible unavailability of thousands of websites, which was followed by an intense recovery procedure. The talk covers the part of the infrastructure that prevented more widespread damage, the lessons learned (in terms of infrastructure design and operational procedures), and the significant improvements that have been implemented since then. All of this happened on Kubernetes infrastructure, so the talk dives into Kubernetes operators, automation, manual intervention, configuration management, and backups. |
id | cern-2875365 |
institution | Organización Europea para la Investigación Nuclear |
language | eng |
publishDate | 2023 |
record_format | invenio |
spelling | cern-2875365 2023-10-11T21:48:18Z http://cds.cern.ch/record/2875365 eng Henschel, Jack Borges Aurindo Barros, Francisco When One Line Took Thousands of Websites Offline SREcon EMEA 2023 Talk This talk describes an incident where an innocuous change in a configuration management system caused a highly visible unavailability of thousands of websites, which was followed by an intense recovery procedure. The talk covers the part of the infrastructure that prevented more widespread damage, the lessons learned (in terms of infrastructure design and operational procedures), and the significant improvements that have been implemented since then. All of this happened on Kubernetes infrastructure, so the talk dives into Kubernetes operators, automation, manual intervention, configuration management, and backups. IT-TALK-2012-008 oai:cds.cern.ch:2875365 2023 |
spellingShingle | Talk Henschel, Jack Borges Aurindo Barros, Francisco When One Line Took Thousands of Websites Offline |
title | When One Line Took Thousands of Websites Offline |
title_full | When One Line Took Thousands of Websites Offline |
title_fullStr | When One Line Took Thousands of Websites Offline |
title_full_unstemmed | When One Line Took Thousands of Websites Offline |
title_short | When One Line Took Thousands of Websites Offline |
title_sort | when one line took thousands of websites offline |
topic | Talk |
url | http://cds.cern.ch/record/2875365 |
work_keys_str_mv | AT henscheljack whenonelinetookthousandsofwebsitesoffline AT borgesaurindobarrosfrancisco whenonelinetookthousandsofwebsitesoffline AT henscheljack sreconemea2023 AT borgesaurindobarrosfrancisco sreconemea2023 |