Panda: A System for Provenance and Data

01:08:20

Published December 18, 2012

About this talk

Google Tech Talk, October 26, 2012. Presented by Jennifer Widom, Stanford University.

ABSTRACT

The goal of the Panda (Provenance and Data) project has been to develop a general-purpose system for modeling, capturing, storing, exploiting, and querying data provenance in a wide range of applications. Abstractly, provenance (also referred to as lineage) describes where data came from and how it has been processed over time. In Panda we consider "data-oriented workflows" whose nodes are arbitrary queries and transformations, challenging us to integrate data-based and process-based provenance, to handle a spectrum from well-understood to opaque transformations, and to develop compositional formalisms and algorithms suitable for arbitrary workflows. On the system side, we strive to enable efficient provenance operations while keeping the capture overhead low.

In this talk, we lay the foundations for data-oriented workflows, then discuss how provenance is defined and captured in this environment. We describe the basic provenance-enabled operations of backward tracing, forward tracing, forward propagation, and refresh, and explain how we support these operations in three settings: provenance as general predicates, provenance as attribute mappings, and provenance in workflows composed exclusively of Map and Reduce functions. We briefly describe the prototype Panda system, and we discuss possible follow-on work: extensions to the provenance model and operations; optimizations for provenance capture, storage, and tracing; and an ad-hoc declarative query language for provenance together with data.
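
As a rough illustration of backward and forward tracing, the following minimal sketch records, for each output element of a workflow step, the input elements it was derived from, and then follows those links in either direction. This is not Panda's implementation or API: the `ProvenanceStore` class, its method names, and the element identifiers (`s1`, `f1`, `a1`, and so on) are all hypothetical, and provenance is simplified to explicit element-to-element links rather than the predicates or attribute mappings discussed in the talk.

```python
from collections import defaultdict


class ProvenanceStore:
    """Hypothetical store of element-level provenance links (illustration only)."""

    def __init__(self):
        self.parents = defaultdict(set)   # output id -> ids it was derived from
        self.children = defaultdict(set)  # input id -> ids derived from it

    def record(self, output_id, input_ids):
        """Capture provenance: output_id was derived from input_ids."""
        for input_id in input_ids:
            self.parents[output_id].add(input_id)
            self.children[input_id].add(output_id)

    @staticmethod
    def _trace(start, edges):
        """Collect every element reachable from start by following edges."""
        seen = set()
        stack = [start]
        while stack:
            node = stack.pop()
            for nxt in edges.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    def backward_trace(self, element_id):
        """All upstream elements that contributed to element_id."""
        return self._trace(element_id, self.parents)

    def forward_trace(self, element_id):
        """All downstream elements derived from element_id."""
        return self._trace(element_id, self.children)


# Assumed two-step workflow: a filter over source elements s1..s3,
# followed by an aggregate over the filter's outputs.
store = ProvenanceStore()
store.record("f1", ["s1"])        # filter kept s1
store.record("f2", ["s3"])        # filter kept s3 (s2 was dropped)
store.record("a1", ["f1", "f2"])  # aggregate combined both survivors

upstream = store.backward_trace("a1")    # elements behind the aggregate
downstream = store.forward_trace("s1")   # elements affected by s1
unused = store.forward_trace("s2")       # s2 never reached any output
```

In this sketch, tracing backward from the aggregate result reaches both filter outputs and the two source elements that survived the filter, while tracing forward from the dropped element `s2` reaches nothing. Forward propagation and refresh would build on the same links to recompute affected outputs, which is left out here.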