Hadoop Tutorial - Efficient data ingestion
The Hadoop ecosystem is the leading open-source platform for distributed storage and processing of "big data". The Hadoop platform is available at CERN as a central service provided by the IT department. Real-time data inge...
Main authors: | Lanza Garcia, Daniel; Baranowski, Zbigniew |
---|---|
Language: | eng |
Published: | 2016 |
Subjects: | Workshops |
Online access: | http://cds.cern.ch/record/2200734 |
_version_ | 1780951291029094400 |
---|---|
author | Lanza Garcia, Daniel Baranowski, Zbigniew |
author_facet | Lanza Garcia, Daniel Baranowski, Zbigniew |
author_sort | Lanza Garcia, Daniel |
collection | CERN |
description | <!--HTML--><p>The Hadoop ecosystem is the leading open-source platform for distributed storage and processing of "big data". The Hadoop platform is available at CERN as a central service provided by the IT department.</p>
<p><strong>Real-time data ingestion into the Hadoop</strong> ecosystem is a non-trivial process because of the system's specifics, and making it efficient (low latency, optimized data placement, limited footprint on the cluster) requires effort that is often underestimated.</p>
<p>In this tutorial attendees will learn about:</p>
<ul>
<li>The important aspects of storing data in the Hadoop Distributed File System (<strong>HDFS</strong>).</li>
<li>Data <strong>ingestion techniques</strong> and engines capable of shipping data to Hadoop efficiently.</li>
<li>Setting up a full <strong>data ingestion</strong> flow into the Hadoop Distributed File System from various sources (streaming, log files, databases) using the best practices and components available in the ecosystem (including <strong>Sqoop, Kite, Flume, Kafka</strong>).</li>
</ul> |
id | cern-2200734 |
institution | Organización Europea para la Investigación Nuclear |
language | eng |
publishDate | 2016 |
record_format | invenio |
spelling | cern-22007342022-11-02T22:18:48Zhttp://cds.cern.ch/record/2200734engLanza Garcia, DanielBaranowski, ZbigniewHadoop Tutorial - Efficient data ingestionHadoop Tutorial - Efficient data ingestionWorkshops<!--HTML--><p>The Hadoop ecosystem is the leading opensource platform for distributed storage and processing of "big data". The Hadoop platform is available at CERN as a central service provided by the IT department.</p> <p><strong>Real-time data ingestion to Hadoop</strong> ecosystem due to the system specificity is non-trivial process and requires some efforts (which is often underestimated) in order to make it efficient (low latency, optimize data placement, footprint on the cluster).</p> <p>In this tutorial attendees will learn about:</p> <ul> <li>The important aspects of storing the data in Hadoop Distributed File System (<strong>HDFS</strong>). </li> <li>Data <strong>ingestion techniques</strong> and engines that are capable of shipping data to Hadoop in an efficient way.</li> <li>Setting up a full <strong>data ingestion</strong> flow into a Hadoop Distributed Files System from various sources (streaming, log files, databases) using the best practices and components available around the ecosystem (including <strong>Sqoop, Kite, Flume, Kafka</strong>).</li> </ul>oai:cds.cern.ch:22007342016 |
spellingShingle | Workshops Lanza Garcia, Daniel Baranowski, Zbigniew Hadoop Tutorial - Efficient data ingestion |
title | Hadoop Tutorial - Efficient data ingestion |
title_full | Hadoop Tutorial - Efficient data ingestion |
title_fullStr | Hadoop Tutorial - Efficient data ingestion |
title_full_unstemmed | Hadoop Tutorial - Efficient data ingestion |
title_short | Hadoop Tutorial - Efficient data ingestion |
title_sort | hadoop tutorial - efficient data ingestion |
topic | Workshops |
url | http://cds.cern.ch/record/2200734 |
work_keys_str_mv | AT lanzagarciadaniel hadooptutorialefficientdataingestion AT baranowskizbigniew hadooptutorialefficientdataingestion |