Blurring High Energy Physics Data Analysis Techniques and Data Science Approaches
Main author: | |
---|---|
Language: | eng |
Published: | 2019 |
Subjects: | |
Online access: | http://cds.cern.ch/record/2693575 |
Summary: | Scientific research has always been intertwined with computing to a certain degree, and even more so over the last few years, during which the need for storage and processing power has increased exponentially. This holds true for many joint collaborations in fields such as biology, medicine, earth sciences, physics and astrophysics, among which CERN is a notable example. As the largest centre for research in High Energy Physics (HEP), it has consistently pushed for new discoveries in its executive program. The collaborative efforts of thousands of scientists worldwide have led to important results, most notably in recent years the discovery of the Higgs boson, officially announced in 2012 by the researchers at CMS and ATLAS, the two main experiments taking data at the LHC. This strenuous work demands the most advanced technological instruments to recreate the physics events and, at the same time, hardware and software that keep up with the computing needs. While HEP has historically been at the forefront of developing solutions to cope with these requirements, in recent years other fields and industries have experienced steady advances, helped by an unprecedented abundance of data. The research field born to exploit data, namely Data Science, has brought to the table new computing techniques that may well fit the needs of HEP. In this thesis, a programming model commonly used in Data Science, namely MapReduce, is used together with the most prominent software for HEP analysis, ROOT. The former is used in the implementation available in the Apache Spark framework, which allows computations to be distributed over a remote cluster, while the latter provides the interface to common HEP data formats and analysis models through one of its latest additions, RDataFrame. PyRDF, a purpose-built package, glues all the components together and is used to showcase how this new model can affect the workflow of a physics analysis. |
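The MapReduce model named in the summary can be illustrated with a minimal, pure-Python sketch. It is not PyRDF, ROOT or Spark code: the function names (`fill_histogram`, `merge_histograms`, `distributed_histogram`) and the sample data are hypothetical, and the "partitions" are plain lists standing in for chunks of a dataset that a cluster would process in parallel. The map step fills one partial histogram per partition; the reduce step adds the partial histograms bin by bin.

```python
from functools import reduce


def fill_histogram(values, n_bins, low, high):
    """Map step: bin one partition of values into a list of counts."""
    counts = [0] * n_bins
    width = (high - low) / n_bins
    for v in values:
        if low <= v < high:
            # Clamp the index to guard against floating-point rounding
            # for values very close to the upper edge.
            index = min(int((v - low) / width), n_bins - 1)
            counts[index] += 1
    return counts


def merge_histograms(a, b):
    """Reduce step: add two partial histograms bin by bin."""
    return [x + y for x, y in zip(a, b)]


def distributed_histogram(partitions, n_bins, low, high):
    """Apply the map step to every partition, then reduce the results."""
    partials = map(lambda p: fill_histogram(p, n_bins, low, high), partitions)
    return reduce(merge_histograms, partials, [0] * n_bins)


# Hypothetical data: three partitions of a single measured quantity.
partitions = [[0.5, 1.5, 2.5], [1.2, 3.9], [0.1, 4.0, 2.2]]
result = distributed_histogram(partitions, n_bins=4, low=0.0, high=4.0)
```

Because bin-wise addition is associative and commutative, the partial histograms can be merged in any order, which is what makes this kind of computation straightforward to distribute.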