Hadoop Tutorial - Efficient data ingestion

Bibliographic Details
Main authors: Lanza Garcia, Daniel, Baranowski, Zbigniew
Language: eng
Published: 2016
Subjects: Workshops
Online access: http://cds.cern.ch/record/2200734
_version_ 1780951291029094400
author Lanza Garcia, Daniel
Baranowski, Zbigniew
author_facet Lanza Garcia, Daniel
Baranowski, Zbigniew
author_sort Lanza Garcia, Daniel
collection CERN
description The Hadoop ecosystem is the leading open-source platform for distributed storage and processing of "big data". The Hadoop platform is available at CERN as a central service provided by the IT department.
Real-time data ingestion into the Hadoop ecosystem is a non-trivial process because of the system's specifics, and making it efficient (low latency, optimized data placement, and a limited footprint on the cluster) requires effort that is often underestimated.
In this tutorial attendees will learn about:
- The important aspects of storing data in the Hadoop Distributed File System (HDFS).
- Data ingestion techniques and engines that are capable of shipping data to Hadoop efficiently.
- Setting up a full data ingestion flow into HDFS from various sources (streaming, log files, databases) using the best practices and components available in the ecosystem (including Sqoop, Kite, Flume, Kafka).
id cern-2200734
institution European Organization for Nuclear Research
language eng
publishDate 2016
record_format invenio
spelling cern-2200734 2022-11-02T22:18:48Z http://cds.cern.ch/record/2200734 eng Lanza Garcia, Daniel Baranowski, Zbigniew Hadoop Tutorial - Efficient data ingestion Workshops oai:cds.cern.ch:2200734 2016
spellingShingle Workshops
Lanza Garcia, Daniel
Baranowski, Zbigniew
Hadoop Tutorial - Efficient data ingestion
title Hadoop Tutorial - Efficient data ingestion
title_full Hadoop Tutorial - Efficient data ingestion
title_fullStr Hadoop Tutorial - Efficient data ingestion
title_full_unstemmed Hadoop Tutorial - Efficient data ingestion
title_short Hadoop Tutorial - Efficient data ingestion
title_sort hadoop tutorial - efficient data ingestion
topic Workshops
url http://cds.cern.ch/record/2200734
work_keys_str_mv AT lanzagarciadaniel hadooptutorialefficientdataingestion
AT baranowskizbigniew hadooptutorialefficientdataingestion