Mining Predictive Models for Big Data Placement
Main author:
Language: eng
Published: 2018
Subjects:
Online access: http://cds.cern.ch/record/2647981
Summary:

The Compact Muon Solenoid (CMS) experiment at the European Organization for Nuclear Research (CERN) deploys its data collection, simulation and analysis activities on a distributed computing infrastructure involving more than 70 sites worldwide. Over the last few years, the historical usage data produced by this large infrastructure has been recorded on Big Data clusters featuring more than 5 Petabytes of raw storage, with different open-source user-level tools available for analytical purposes. The clusters offer a broad variety of computing and storage logs that represent a valuable, yet scarcely investigated, source of information for system tuning and capacity planning. Among these, the problem of understanding and predicting dataset popularity is of primary interest for CERN: its solution can enable effective placement policies for the most frequently accessed datasets, resulting in remarkably shorter job latencies and increased system throughput and resource usage.

In this thesis, three key requirements for Petabyte-size dataset popularity models in a worldwide computing system such as the Worldwide LHC Computing Grid (WLCG) are investigated: the need for an efficient Hadoop data vault underpinning an effective mining platform capable of collecting computing logs from different monitoring services and organizing them into periodic snapshots; the need for a scalable pipeline of machine learning tools for training, on these snapshots, predictive models able to forecast which datasets will become popular over time, thus discovering patterns and correlations useful for enhancing the overall efficiency of the distributed infrastructure; and the need for a novel caching policy, based on the dataset popularity predictions, that can outperform the current dataset replacement implementation.

The main contributions of this thesis include the following results:
1. we propose and implement a scalable machine learning pipeline, built on top of the CMS Hadoop data store, to predict the popularity of …
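The second requirement above, a pipeline that trains predictive models on periodic usage snapshots, can be pictured with a minimal sketch. Everything here is an illustrative assumption rather than the thesis's actual schema or toolchain: the feature columns, the popularity threshold, and the use of scikit-learn's RandomForestClassifier are hypothetical stand-ins for one weekly snapshot mined from the Hadoop data vault.

```python
# Minimal sketch of a dataset-popularity prediction pipeline.
# All feature names and thresholds are hypothetical, not the thesis's schema.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

rng = np.random.default_rng(0)

# Stand-in for one weekly snapshot: per-dataset aggregates of past access
# behaviour (all four columns are invented for illustration).
n_datasets = 5000
X = np.column_stack([
    rng.poisson(20, n_datasets),       # accesses during the last week
    rng.poisson(80, n_datasets),       # accesses during the last month
    rng.integers(1, 60, n_datasets),   # distinct sites reading the dataset
    rng.exponential(500, n_datasets),  # dataset size in GB
])
# Binary label: did the dataset exceed an access threshold the *next* week?
y = (X[:, 0] + rng.poisson(5, n_datasets) > 25).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Train a classifier to forecast next-week popularity from past usage.
model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

print(classification_report(y_test, model.predict(X_test)))
```

In a production setting the feature matrix would be computed at cluster scale (e.g. with Spark over the Hadoop snapshots) rather than held in memory, but the train-on-snapshot, predict-next-period structure is the same.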
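Likewise, the third requirement, a caching policy driven by popularity predictions, might look roughly like the following: when the cache is full, the resident dataset with the lowest predicted popularity is evicted instead of the least recently used one. The `PopularityCache` class, the scoring function, and the dataset names are all hypothetical; the scorer is a stand-in for a model like the one sketched above.

```python
# Minimal sketch of a popularity-aware replacement policy (hypothetical API).
class PopularityCache:
    def __init__(self, capacity_gb, predict_popularity):
        self.capacity = capacity_gb
        self.used = 0.0
        self.sizes = {}                    # dataset name -> size in GB
        self.predict = predict_popularity  # dataset name -> score in [0, 1]

    def admit(self, dataset, size_gb):
        """Try to cache a dataset, evicting low-popularity residents if needed."""
        if dataset in self.sizes:
            return True
        # Evict the lowest-scoring resident datasets until the newcomer fits.
        while self.used + size_gb > self.capacity and self.sizes:
            victim = min(self.sizes, key=self.predict)
            # Never evict a dataset predicted more popular than the newcomer.
            if self.predict(victim) >= self.predict(dataset):
                return False
            self.used -= self.sizes.pop(victim)
        if self.used + size_gb > self.capacity:
            return False
        self.sizes[dataset] = size_gb
        self.used += size_gb
        return True

# Toy usage with invented dataset names and scores.
scores = {"/A/RAW": 0.2, "/B/AOD": 0.9, "/C/MINIAOD": 0.7}
cache = PopularityCache(100.0, lambda d: scores.get(d, 0.5))
cache.admit("/A/RAW", 60.0)
cache.admit("/B/AOD", 50.0)   # evicts /A/RAW, which has lower predicted score
print(cache.sizes)            # {'/B/AOD': 50.0}
```

Unlike LRU, this policy can decline to cache a dataset at all when every resident is predicted to be more popular, which is the kind of behaviour a prediction-driven replacement scheme would be evaluated on against the current implementation.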