
Phronesis, a diagnosis and recovery tool for system administrators


Bibliographic Details
Main author: Haen, Christophe
Language: eng
Published: 2014
Subjects:
Online access: http://cds.cern.ch/record/1954861
Description
Summary: The administration of a large computer infrastructure is a great challenge in many respects and requires experts in various domains to be successful. One criterion to which the users of a data center are directly exposed is the availability of the infrastructure. High availability comes at the cost of constant, high-performance monitoring solutions as well as experts ready to diagnose and solve problems. Unfortunately, it is not always possible to have an expert team constantly on site. This work presents a tool meant to support system administrators in their tasks by diagnosing problems, offering recovery solutions, and acting as a history and knowledge database. We will first detail what large data centers are composed of and which competences are required to administer them successfully. This will lead us to consider the problems that administrators traditionally encounter. Those problems are the source of this project, and we will define our goals from them. Finally, we will detail the environment in which this work took place, namely the LHCb experiment at CERN. The second chapter contains the state of the art, which lists the methods that try, in one way or another, to answer problems similar to ours. We will see what these methods are, how they are applied, what their pros and cons are, and how they fit into our context. The third chapter is our proposal to address the issues and reach the goals set in the first chapter. It presents all the approaches and methods we adopted: some are a direct use or an adaptation of what was mentioned in the state of the art, while others are innovative techniques. Each proposal is justified and related to our goals. The fourth chapter is a technical description of how the ideas of the third chapter were put into practice. It details the software packaging and the various tools that were developed, and explains how they are used. The fifth chapter shows the results obtained with our solution. Its first part is an analysis of various simulations that aim to show the efficiency of the methods described in Chapter 3. Its second part describes how the software is applied in the LHCb Online environment and gives feedback on its usage. The last chapter offers perspectives for improving the software, extending its functionalities, and addressing some of the issues detailed in the feedback of Chapter 5.