Cargando…

Data preservation and reproducibility at the LHCb experiment at CERN

This dissertation presents the first study of data preservation and research reproducibility in data science at the Large Hadron Collider at CERN. In particular, provenance capture of the experimental data and the reproducibility of physics analyses at the LHCb experiment were studied. First, the pr...

Descripción completa

Detalles Bibliográficos
Autor principal: Trisovic, Ana
Lenguaje:eng
Publicado: 2018
Materias:
Acceso en línea:http://cds.cern.ch/record/2640461
_version_ 1780960153741295616
author Trisovic, Ana
author_facet Trisovic, Ana
author_sort Trisovic, Ana
collection CERN
description This dissertation presents the first study of data preservation and research reproducibility in data science at the Large Hadron Collider at CERN. In particular, provenance capture of the experimental data and the reproducibility of physics analyses at the LHCb experiment were studied. First, the preservation of the software and hardware dependencies of the LHCb experimental data and simulations was investigated. It was found that the links between the data processing information and the datasets themselves were obscure. In order to document these dependencies, a graph database was designed and implemented. The nodes in the graph represent the data with their processing information, software and computational environment, whilst the edges represent their dependence on the other nodes. The database provides a central place to preserve information that was previously scattered across the LHCb computing infrastructure. Using the developed database, a methodology to recreate the LHCb computational environment and to execute the data processing on the cloud was implemented with the use of virtual containers. It was found that the produced physics events were identical to the official LHCb data, meaning that the system can aid in data preservation. Furthermore, the developed method can be used for outreach purposes, providing a streamlined way for a person external to CERN to process and analyse the LHCb data. Following this, the reproducibility of data analyses was studied. A data provenance tracking service was implemented within the LHCb software framework GAUDI. The service allows analysts to capture their data processing configurations that can be used to reproduce a dataset within the dataset itself. Furthermore, to assess the current status of the reproducibility of LHCb physics analyses, the major parts of an analysis were reproduced by following methods described in publicly and internally available documentation. This study allowed the identification of barriers to reproducibility and specific points where documentation is lacking. With this knowledge, one can specifically target areas that need improvement and encourage practices that would improve reproducibility in the future. Finally, contributions were made to the CERN Analysis Preservation portal, which is a general knowledge preservation framework developed at CERN to be used across all the LHC experiments. In particular, the functionality to preserve source code from GIT repositories and DOCKER images in one central location was implemented.
id cern-2640461
institution Organización Europea para la Investigación Nuclear
language eng
publishDate 2018
record_format invenio
spelling cern-26404612019-09-30T06:29:59Zhttp://cds.cern.ch/record/2640461engTrisovic, AnaData preservation and reproducibility at the LHCb experiment at CERNComputing and ComputersThis dissertation presents the first study of data preservation and research reproducibility in data science at the Large Hadron Collider at CERN. In particular, provenance capture of the experimental data and the reproducibility of physics analyses at the LHCb experiment were studied. First, the preservation of the software and hardware dependencies of the LHCb experimental data and simulations was investigated. It was found that the links between the data processing information and the datasets themselves were obscure. In order to document these dependencies, a graph database was designed and implemented. The nodes in the graph represent the data with their processing information, software and computational environment, whilst the edges represent their dependence on the other nodes. The database provides a central place to preserve information that was previously scattered across the LHCb computing infrastructure. Using the developed database, a methodology to recreate the LHCb computational environment and to execute the data processing on the cloud was implemented with the use of virtual containers. It was found that the produced physics events were identical to the official LHCb data, meaning that the system can aid in data preservation. Furthermore, the developed method can be used for outreach purposes, providing a streamlined way for a person external to CERN to process and analyse the LHCb data. Following this, the reproducibility of data analyses was studied. A data provenance tracking service was implemented within the LHCb software framework GAUDI. The service allows analysts to capture their data processing configurations that can be used to reproduce a dataset within the dataset itself. Furthermore, to assess the current status of the reproducibility of LHCb physics analyses, the major parts of an analysis were reproduced by following methods described in publicly and internally available documentation. This study allowed the identification of barriers to reproducibility and specific points where documentation is lacking. With this knowledge, one can specifically target areas that need improvement and encourage practices that would improve reproducibility in the future. Finally, contributions were made to the CERN Analysis Preservation portal, which is a general knowledge preservation framework developed at CERN to be used across all the LHC experiments. In particular, the functionality to preserve source code from GIT repositories and DOCKER images in one central location was implemented.CERN-THESIS-2018-174oai:cds.cern.ch:26404612018-09-26T18:56:43Z
spellingShingle Computing and Computers
Trisovic, Ana
Data preservation and reproducibility at the LHCb experiment at CERN
title Data preservation and reproducibility at the LHCb experiment at CERN
title_full Data preservation and reproducibility at the LHCb experiment at CERN
title_fullStr Data preservation and reproducibility at the LHCb experiment at CERN
title_full_unstemmed Data preservation and reproducibility at the LHCb experiment at CERN
title_short Data preservation and reproducibility at the LHCb experiment at CERN
title_sort data preservation and reproducibility at the lhcb experiment at cern
topic Computing and Computers
url http://cds.cern.ch/record/2640461
work_keys_str_mv AT trisovicana datapreservationandreproducibilityatthelhcbexperimentatcern