Apache Spark usage and deployment models for scientific computing

This talk shares our recent experience in providing a data analytics platform based on Apache Spark for High Energy Physics, the CERN accelerator logging system, and infrastructure monitoring. The Hadoop Service has started to expand its user base to researchers who want to perform analysis wit...

Bibliographic Details
Main authors: Castro, Diogo; Kothuri, Prasanth; Mrowczynski, Piotr; Piparo, Danilo; Tejedor, Enric
Language: eng
Published: 2019
Subjects: Computing and Computers
Online access: https://dx.doi.org/10.1051/epjconf/201921407020
http://cds.cern.ch/record/2699480
Description: This talk shares our recent experience in providing a data analytics platform based on Apache Spark for High Energy Physics, the CERN accelerator logging system, and infrastructure monitoring. The Hadoop Service has started to expand its user base to researchers who want to perform analysis with big data technologies. Among many frameworks, Apache Spark is currently getting the most traction from various user communities, and new ways to deploy Spark, such as Apache Mesos or Spark on Kubernetes, have started to evolve rapidly. Meanwhile, notebook web applications such as Jupyter offer the ability to perform interactive data analytics and visualizations without the need to install additional software. CERN already provides a web platform, called SWAN (Service for Web-based ANalysis), where users can write and run their analyses in the form of notebooks, seamlessly accessing the data and software they need. The first part of the presentation covers several recent integrations and optimizations of the Apache Spark computing platform that enable HEP data processing and CERN accelerator logging system analytics. These include, but are not limited to, access to Kerberized resources, an XRootD connector enabling remote access to EOS storage, and integration with SWAN for interactive data analysis, thus forming a truly Unified Analytics Platform. The second part of the talk covers the evolution of the Apache Spark data analytics platform, in particular the recent work done to run Spark on Kubernetes on the virtualized and container-based infrastructure in OpenStack. This deployment model allows for elastic scaling of data analytics workloads, enabling efficient, on-demand utilization of resources in private or public clouds.
Record ID: oai-inspirehep.net-1761584
Institution: European Organization for Nuclear Research (CERN)