
Hadoop Tutorial - Efficient data ingestion

Bibliographic Details
Main authors: Lanza Garcia, Daniel; Baranowski, Zbigniew
Language: eng
Published: 2016
Subjects:
Online access: http://cds.cern.ch/record/2200734
Description
Summary: The Hadoop ecosystem is the leading open-source platform for distributed storage and processing of "big data". The Hadoop platform is available at CERN as a central service provided by the IT department.

Real-time data ingestion into the Hadoop ecosystem is, because of the system's specificity, a non-trivial process; making it efficient (low latency, optimized data placement, small footprint on the cluster) takes effort that is often underestimated.

In this tutorial attendees will learn about:

- The important aspects of storing data in the Hadoop Distributed File System (HDFS).
- Data ingestion techniques and engines capable of shipping data to Hadoop efficiently.
- Setting up a full data ingestion flow into the Hadoop Distributed File System from various sources (streaming, log files, databases) using best practices and components available around the ecosystem (including Sqoop, Kite, Flume, Kafka); a minimal sketch of the streaming leg of such a flow follows below.
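As an illustrative sketch (not part of the tutorial abstract itself), the Java snippet below shows one streaming leg of such a flow: a minimal producer that ships log lines to a Kafka topic, from which a downstream consumer (for example Flume with a Kafka source and an HDFS sink) can land the records in HDFS. The broker address kafka-broker:9092 and the topic name app-logs are placeholder assumptions, not values from the tutorial.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class LogLineProducer {
        public static void main(String[] args) {
            // Producer configuration; broker address and topic name are assumptions.
            Properties props = new Properties();
            props.put("bootstrap.servers", "kafka-broker:9092");
            props.put("key.serializer", StringSerializer.class.getName());
            props.put("value.serializer", StringSerializer.class.getName());
            props.put("acks", "all"); // wait for full acknowledgement for durability

            try (Producer<String, String> producer = new KafkaProducer<>(props)) {
                // Key by host name so records from one machine stay in one partition,
                // preserving per-host ordering before the data is landed in HDFS.
                producer.send(new ProducerRecord<>("app-logs", "host01",
                        "2016-06-01T12:00:00 INFO job finished"));
                producer.flush();
            }
        }
    }

Setting acks to "all" trades a little latency for durability, which matters when the ingestion pipeline is the system of record. Batch sources such as relational databases would instead be imported with Sqoop, and Kite can be used to define the target dataset layout (file format and partitioning) on HDFS.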