
SRE fundamentals in EOS


Bibliographic Details
Main Author: Gonzalez Labrador, Hugo
Language: eng
Published: 2021
Subjects:
Online Access: http://cds.cern.ch/record/2754016
Description
Summary: The EOS system is an advanced distributed storage system that handles many extreme use cases (massive data injection from the LHC, latency-critical online home directories, and massive-throughput access from batch farms). EOS implements many site reliability engineering best practices to support these use cases at scale and to support the work of the operations team maintaining the production clusters. In this presentation we explain some of the functionalities implemented in the core of EOS (logging, retry mechanism, QoS) that allow smooth operation of the service while accommodating the diverse use cases cited above.