Spark - a modern approach for distributed analytics

Bibliographic Details
Main authors:	Surdy, Kacper; Kothuri, Prasanth
Language:	eng
Published:	2016
Subjects:	Workshops
Online access:	http://cds.cern.ch/record/2214510

Description
Summary:	<p>The <strong>Hadoop</strong> ecosystem is the leading open-source platform for distributed storage and processing of big data. It is a very popular system for implementing data warehouses and data lakes. <strong>Spark</strong> has also emerged as one of the leading engines for data analytics. The Hadoop platform is available at CERN as a central service provided by the IT department.</p> <p>By attending the session, a participant will acquire knowledge of the essential <strong>concepts</strong> needed to benefit from the <strong>parallel data processing</strong> offered by the Spark framework. The session is structured around practical <strong>examples</strong> and tutorials.</p> <p>Main topics:</p> <ul> <li><strong>Architecture</strong> overview - work distribution, concepts of a worker and a driver</li> <li>Computing concepts of <strong>transformations</strong> and <strong>actions</strong></li> <li>Data processing APIs - <strong>RDD</strong>, <strong>DataFrame</strong>, and <strong>SparkSQL</strong></li> </ul>
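
The distinction between transformations and actions listed above can be illustrated with a minimal sketch. This is a toy, single-process model in plain Python, not Spark's API: the class `ToyDataset` and the helper `square` are hypothetical names introduced only for illustration. It assumes the usual meaning of the two terms, namely that a transformation only records a processing step while an action triggers the actual computation.

```python
from typing import Any, Callable, Iterable, Iterator, List, Tuple


class ToyDataset:
    """Toy stand-in for a distributed dataset (hypothetical, not Spark's API)."""

    def __init__(self, source: Iterable[Any],
                 steps: List[Tuple[str, Callable]] = None) -> None:
        self._source = source
        self._steps = list(steps or [])  # recorded steps, not yet executed

    # Transformations: return a new dataset and only record the step.
    def map(self, fn: Callable[[Any], Any]) -> "ToyDataset":
        return ToyDataset(self._source, self._steps + [("map", fn)])

    def filter(self, pred: Callable[[Any], bool]) -> "ToyDataset":
        return ToyDataset(self._source, self._steps + [("filter", pred)])

    def _run(self) -> Iterator[Any]:
        # Apply every recorded step to each element, in order.
        for item in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):  # a "filter" step rejected the element
                    keep = False
                    break
            if keep:
                yield item

    # Actions: execute the recorded steps and return a result.
    def collect(self) -> List[Any]:
        return list(self._run())

    def count(self) -> int:
        return sum(1 for _ in self._run())


calls = []  # records every invocation of square, to make laziness visible


def square(x: int) -> int:
    calls.append(x)
    return x * x


# Two transformations are chained; nothing is computed at this point.
squares_of_evens = ToyDataset(range(1, 7)).filter(lambda x: x % 2 == 0).map(square)
calls_before_action = len(calls)

# The action runs the whole recorded pipeline.
result = squares_of_evens.collect()
print(calls_before_action, result)
```

In this sketch `square` is not called until `collect()` runs, which mirrors the lazy evaluation the session topics refer to. A real Spark job would additionally split the data into partitions and run the recorded steps on workers coordinated by the driver, which this toy model does not attempt.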