Putting Lipstick on Pig: Enabling Database-style Workflow Provenance

Yael Amsterdamer2, Susan B. Davidson1, Daniel Deutch3, Tova Milo2, Julia Stoyanovich1, Val Tannen1
1University of Pennsylvania, USA; 2Tel Aviv University, Israel; 3Ben Gurion University, Israel
susan jstoy val @ yaelamst milo @ deutchd@

ABSTRACT

Workflow provenance typically assumes that each module is a black box, so that each output depends on all inputs (coarse-grained dependencies). Furthermore, it does not model the internal state of a module, which can change between repeated executions. In practice, however, an output may depend on only a small subset of the inputs (fine-grained dependencies) as well as on the internal state of the module. We present a novel provenance framework that marries database-style and workflow-style provenance by using Pig Latin to expose the functionality of modules, thus capturing internal state and fine-grained dependencies. A critical ingredient in our solution is the use of a novel form of provenance graph that models module invocations and yields a compact representation of fine-grained workflow provenance. It also enables a number of novel graph transformation operations, allowing users to choose the desired level of granularity in provenance querying (ZoomIn and ZoomOut) and supporting what-if workflow analytic queries. We implemented our approach in the Lipstick system and developed a benchmark in support of a systematic performance evaluation. Our results demonstrate the feasibility of tracking and querying fine-grained workflow provenance.

1. INTRODUCTION

Data-intensive application domains such as science and electronic commerce are increasingly using workflow systems to design and manage the analysis of large datasets and to track the provenance of intermediate and final data products. Provenance is extremely important for verifiability and repeatability of results, as well as for debugging and troubleshooting workflows [10, 11]. The standard ...
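The abstract's distinction between coarse-grained and fine-grained dependencies, and the role of a module's internal state, can be made concrete with a small sketch. Everything in it is an assumption made for illustration: the toy module `min_price_module`, the helper `coarse_provenance`, the input ids, and the deliberately simplified notion of "dependency" (the single value that produced the output). It is not the Lipstick system's API, its Pig Latin encoding, or its provenance-graph model.

```python
# Hypothetical illustration only: all names and the toy module are assumptions,
# not the Lipstick system's API or its provenance-graph model.

def coarse_provenance(inputs, outputs):
    """Black-box view: every output depends on every input."""
    return {out: set(inputs) for out in outputs}


def min_price_module(bids, state):
    """Toy module: return the lowest price seen so far, with its dependencies.

    bids  -- dict mapping an input id to a price
    state -- dict carrying the best (id, price) across invocations
    """
    best_id, best_price = state.get("best", (None, float("inf")))
    # If an earlier invocation left a best price, the output starts out
    # depending on the module's internal state rather than on any input.
    deps = {"state"} if best_id is not None else set()
    for bid_id, price in bids.items():
        if price < best_price:
            best_id, best_price = bid_id, price
            deps = {bid_id}  # the output now depends on this input only
    state["best"] = (best_id, best_price)
    return best_price, deps


state = {}
price1, fine1 = min_price_module({"b1": 30, "b2": 20, "b3": 25}, state)
price2, fine2 = min_price_module({"b4": 40, "b5": 22}, state)
coarse2 = coarse_provenance(["b4", "b5"], ["out"])["out"]
```

In the second invocation the black-box view attributes the output to both new inputs (`coarse2`), whereas the tracked dependencies (`fine2`) point only to the internal state left behind by the first invocation. This is the kind of information that a purely coarse-grained model cannot express.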
