Titian: Data Provenance Support in Spark
Debugging data processing logic in Data-Intensive Scalable Computing (DISC) systems is a difficult and time consuming effort. Today’s DISC systems offer very little tooling for debugging programs, and as a result programmers spend countless hours collecting evidence (e.g., from log files) and perfor...
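Main Authors: | Interlandi, Matteo; Shah, Kshitij; Tetali, Sai Deep; Gulzar, Muhammad Ali; Yoo, Seunghyun; Kim, Miryung; Millstein, Todd; Condie, Tyson |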
---|---|
Format: | Online Article Text |
Language: | English |
Published: | 2015 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4697929/ https://www.ncbi.nlm.nih.gov/pubmed/26726305 |
_version_ | 1782408000938442752 |
---|---|
author | Interlandi, Matteo Shah, Kshitij Tetali, Sai Deep Gulzar, Muhammad Ali Yoo, Seunghyun Kim, Miryung Millstein, Todd Condie, Tyson |
author_facet | Interlandi, Matteo Shah, Kshitij Tetali, Sai Deep Gulzar, Muhammad Ali Yoo, Seunghyun Kim, Miryung Millstein, Todd Condie, Tyson |
author_sort | Interlandi, Matteo |
collection | PubMed |
description | Debugging data processing logic in Data-Intensive Scalable Computing (DISC) systems is a difficult and time consuming effort. Today’s DISC systems offer very little tooling for debugging programs, and as a result programmers spend countless hours collecting evidence (e.g., from log files) and performing trial and error debugging. To aid this effort, we built Titian, a library that enables data provenance—tracking data through transformations—in Apache Spark. Data scientists using the Titian Spark extension will be able to quickly identify the input data at the root cause of a potential bug or outlier result. Titian is built directly into the Spark platform and offers data provenance support at interactive speeds—orders-of-magnitude faster than alternative solutions—while minimally impacting Spark job performance; observed overheads for capturing data lineage rarely exceed 30% above the baseline job execution time. |
format | Online Article Text |
id | pubmed-4697929 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2015 |
record_format | MEDLINE/PubMed |
spelling | pubmed-46979292016-01-01 Titian: Data Provenance Support in Spark Interlandi, Matteo Shah, Kshitij Tetali, Sai Deep Gulzar, Muhammad Ali Yoo, Seunghyun Kim, Miryung Millstein, Todd Condie, Tyson Proceedings VLDB Endowment Article Debugging data processing logic in Data-Intensive Scalable Computing (DISC) systems is a difficult and time consuming effort. Today’s DISC systems offer very little tooling for debugging programs, and as a result programmers spend countless hours collecting evidence (e.g., from log files) and performing trial and error debugging. To aid this effort, we built Titian, a library that enables data provenance—tracking data through transformations—in Apache Spark. Data scientists using the Titian Spark extension will be able to quickly identify the input data at the root cause of a potential bug or outlier result. Titian is built directly into the Spark platform and offers data provenance support at interactive speeds—orders-of-magnitude faster than alternative solutions—while minimally impacting Spark job performance; observed overheads for capturing data lineage rarely exceed 30% above the baseline job execution time. 2015-11 /pmc/articles/PMC4697929/ /pubmed/26726305 Text en This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/. For any use beyond those covered by this license, obtain permission by emailing info@vldb.org. |
spellingShingle | Article Interlandi, Matteo Shah, Kshitij Tetali, Sai Deep Gulzar, Muhammad Ali Yoo, Seunghyun Kim, Miryung Millstein, Todd Condie, Tyson Titian: Data Provenance Support in Spark |
title | Titian: Data Provenance Support in Spark |
title_full | Titian: Data Provenance Support in Spark |
title_fullStr | Titian: Data Provenance Support in Spark |
title_full_unstemmed | Titian: Data Provenance Support in Spark |
title_short | Titian: Data Provenance Support in Spark |
title_sort | titian: data provenance support in spark |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4697929/ https://www.ncbi.nlm.nih.gov/pubmed/26726305 |
work_keys_str_mv | AT interlandimatteo titiandataprovenancesupportinspark AT shahkshitij titiandataprovenancesupportinspark AT tetalisaideep titiandataprovenancesupportinspark AT gulzarmuhammadali titiandataprovenancesupportinspark AT yooseunghyun titiandataprovenancesupportinspark AT kimmiryung titiandataprovenancesupportinspark AT millsteintodd titiandataprovenancesupportinspark AT condietyson titiandataprovenancesupportinspark |