Titian: Data Provenance Support in Spark

Debugging data processing logic in Data-Intensive Scalable Computing (DISC) systems is a difficult and time-consuming effort. Today’s DISC systems offer very little tooling for debugging programs, and as a result programmers spend countless hours collecting evidence (e.g., from log files) and perfor...

Bibliographic Details
Main Authors: Interlandi, Matteo, Shah, Kshitij, Tetali, Sai Deep, Gulzar, Muhammad Ali, Yoo, Seunghyun, Kim, Miryung, Millstein, Todd, Condie, Tyson
Format: Online Article Text
Language: English
Published: 2015
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4697929/
https://www.ncbi.nlm.nih.gov/pubmed/26726305
_version_ 1782408000938442752
author Interlandi, Matteo
Shah, Kshitij
Tetali, Sai Deep
Gulzar, Muhammad Ali
Yoo, Seunghyun
Kim, Miryung
Millstein, Todd
Condie, Tyson
collection PubMed
description Debugging data processing logic in Data-Intensive Scalable Computing (DISC) systems is a difficult and time-consuming effort. Today’s DISC systems offer very little tooling for debugging programs, and as a result programmers spend countless hours collecting evidence (e.g., from log files) and performing trial-and-error debugging. To aid this effort, we built Titian, a library that enables data provenance (tracking data through transformations) in Apache Spark. Data scientists using the Titian Spark extension will be able to quickly identify the input data at the root cause of a potential bug or outlier result. Titian is built directly into the Spark platform and offers data provenance support at interactive speeds, orders of magnitude faster than alternative solutions, while minimally impacting Spark job performance; observed overheads for capturing data lineage rarely exceed 30% above the baseline job execution time.
format Online
Article
Text
id pubmed-4697929
institution National Center for Biotechnology Information
language English
publishDate 2015
record_format MEDLINE/PubMed
spelling pubmed-4697929 2016-01-01 Titian: Data Provenance Support in Spark. Interlandi, Matteo; Shah, Kshitij; Tetali, Sai Deep; Gulzar, Muhammad Ali; Yoo, Seunghyun; Kim, Miryung; Millstein, Todd; Condie, Tyson. Proceedings VLDB Endowment. Article. 2015-11. /pmc/articles/PMC4697929/ /pubmed/26726305. Text. en. This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/. For any use beyond those covered by this license, obtain permission by emailing info@vldb.org.
title Titian: Data Provenance Support in Spark
topic Article