
Provenance Challenge: UTEP Trust Lab team

Participating Team

Team and Project Details

  • Short team name: UTEP
  • Participant names: Paulo Pinheiro da Silva, Nicholas Del Rio, Leonardo Salayandia
  • Project URL: http://trust.utep.edu
  • Project Overview:
  • Relevant Publications:

Workflow Representation

In our approach we use abstract workflows instead of executable workflows. Using the WDO-It! tool (http://trust.utep.edu/wdo/downloads), we first create an ontology of the concepts that will be used in the workflow. The ontology is referred to as a Workflow-Driven Ontology, and it mainly consists of two hierarchies of concepts (or classes): Data and Method. Data concepts represent a parameter, dataset, or user input in the workflow, and are illustrated as directed edges in the workflow graph. Method concepts represent functionality that takes Data as input and transforms it into some other Data output, and are illustrated as rectangles in the workflow graph. The purpose of creating abstract workflows instead of executable workflows is to emphasize the understandability of the process that the workflow represents. Hence, workflow creators are encouraged to use Data and Method concept names that are meaningful to them. For example, in the first workflow below, the workflow author judged that the Method concept name "CheckManifestFile" captured the intended meaning of the sequence of actions "IsCSVReadyFileExist" and "ReadCSVReadyFile" from the specification of the PC3 workflow.

Once the ontological concepts have been identified and captured in the ontology, the abstract workflow is constructed with the WDO-It! tool by creating "instances" of the ontology concepts and connecting Data and Method concepts to specify the intended workflow behavior. In addition, the abstract workflows created with the WDO-It! tool ground Data concepts to Sources (and Sinks), which are concepts reused from the provenance component of the Proof Markup Language (PML-P). With respect to PML, Sources and Sinks are equivalent, and we refer to both as Sources. Sources represent the entities where data comes from (or where the data eventually goes). For example, a Source can be a Database, a Document, or a Human user. Sources are represented as ovals in the workflow graph.
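The graph structure described above can be sketched in a few lines of Python. This is only an illustrative model, not the WDO-It! data model or file format: the class names, the helper methods, and the concept names "WorkflowUser", "TargetDatabase", "ManifestEntries", and "LoadedTables" are hypothetical, and the wiring of CSVRootPath and JobID into CheckManifestFile is an assumption made for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Method:
    """A Method concept instance: drawn as a rectangle in the workflow graph."""
    name: str


@dataclass(frozen=True)
class Source:
    """A PML-P Source (or Sink): drawn as an oval in the workflow graph."""
    name: str
    kind: str  # e.g. "Database", "Document", "Human"


@dataclass(frozen=True)
class Data:
    """A Data concept instance: drawn as a directed edge in the workflow graph."""
    name: str
    producer: object  # Method or Source the data comes from
    consumer: object  # Method or Source (acting as a sink) the data goes to


@dataclass
class AbstractWorkflow:
    name: str
    edges: list = field(default_factory=list)

    def connect(self, data_name, producer, consumer):
        """Add a Data edge from producer to consumer."""
        edge = Data(data_name, producer, consumer)
        self.edges.append(edge)
        return edge

    def inputs_of(self, node):
        """Names of the Data edges flowing into a node."""
        return [e.name for e in self.edges if e.consumer == node]

    def outputs_of(self, node):
        """Names of the Data edges flowing out of a node."""
        return [e.name for e in self.edges if e.producer == node]


def build_example():
    """Build a hypothetical fragment loosely based on the PC3 concept names."""
    wf = AbstractWorkflow("PC3 (illustrative fragment)")
    user = Source("WorkflowUser", "Human")        # hypothetical Source
    db = Source("TargetDatabase", "Database")     # hypothetical Sink
    check = Method("CheckManifestFile")
    populate = Method("PopulateDB")
    wf.connect("CSVRootPath", user, check)        # assumed wiring
    wf.connect("JobID", user, check)              # assumed wiring
    wf.connect("ManifestEntries", check, populate)  # hypothetical Data name
    wf.connect("LoadedTables", populate, db)        # hypothetical Data name
    return wf, check, populate
```

Modeling Data as edges rather than nodes mirrors the description above: a Method is fully characterized by the Data edges entering and leaving it, and Sources appear only as endpoints of those edges.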

Finally, different levels of abstraction are also supported. The first workflow represents the most abstract workflow representation of the PC3 workflow. The second workflow, on the other hand, represents a lower level of abstraction of the "PopulateDB" method shown in the first workflow.

  • First workflow: Abstract workflow representation of the PC3 workflow

  • Second workflow: More detailed abstract workflow of the PopulateDB method shown in the first workflow

Logging Provenance

One benefit of authoring abstract workflows using WDO-It! is the ability to generate “wrappers” and “data annotators,” which are modules designed to capture and encode provenance associated with an abstract workflow, during runtime and post-runtime respectively. The main distinction between the two logging methods is when the provenance is logged, which ultimately has implications for how it is logged. Certain properties of the workflow dictate which method should be used. For example, when intermediate artifacts are not persisted during execution of the workflow, a wrapper approach must be used to capture them before they are lost. This is the case when running the Java version of the PC3 workflow: the intermediate results exist only as Java objects that are removed from memory at the end of execution, so a wrapper is necessary to capture these objects during runtime before they are destroyed. This implies, however, that the workflow must be instrumented to invoke the wrapper modules, which requires alterations to an otherwise tried and tested workflow.
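The wrapper idea can be illustrated with a minimal Python sketch. It is not the code generated by WDO-It! and it does not emit PML: the decorator, the in-memory `PROVENANCE_LOG`, and the body of `check_manifest_file` are all hypothetical stand-ins that only show how a step's inputs and output can be captured while the workflow runs.

```python
import copy
import functools

# Hypothetical in-memory stand-in for a provenance store.
PROVENANCE_LOG = []


def provenance_wrapper(method_name):
    """Wrap a workflow step so its inputs and output are recorded at runtime."""
    def decorate(step):
        @functools.wraps(step)
        def wrapped(*args):
            # Copy the inputs before the step runs, in case it mutates them.
            inputs = copy.deepcopy(args)
            result = step(*args)
            # Copy the output so later changes do not alter the record.
            PROVENANCE_LOG.append({
                "method": method_name,
                "inputs": inputs,
                "output": copy.deepcopy(result),
            })
            return result
        return wrapped
    return decorate


@provenance_wrapper("CheckManifestFile")
def check_manifest_file(csv_root_path):
    # Hypothetical step body; the real PC3 step is not reproduced here.
    return {"root": csv_root_path, "ready": True}
```

The sketch also shows the cost noted above: the step has to be modified (here, decorated) so that the wrapper is invoked.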

If a workflow does not delete intermediate results, then the non-invasive “data annotation” method can be used. A data annotator can “piece together” provenance after execution by chaining the intermediate results based on their “wasDerivedFrom” relationships. When running the batch version of the PC3 workflow, provenance could be captured with a data annotator because the batch files do not clean up the intermediate XML files that are dumped.
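As a rough sketch of the chaining step, the following Python function walks “wasDerivedFrom” links from a final result back to the original inputs. How a real data annotator discovers those links is not shown: the mapping is assumed to be given, and the function name and any file names passed to it are hypothetical.

```python
def derivation_chain(final_artifact, was_derived_from):
    """List artifacts reachable from the final result via wasDerivedFrom links.

    `was_derived_from` maps an artifact name to the list of artifacts it was
    derived from. Artifacts are returned depth-first, starting with the final
    result; each artifact appears once even if it is reachable by several paths.
    """
    chain = []
    seen = set()
    stack = [final_artifact]
    while stack:
        artifact = stack.pop()
        if artifact in seen:
            continue
        seen.add(artifact)
        chain.append(artifact)
        # Reverse so that antecedents are visited in their listed order.
        stack.extend(reversed(was_derived_from.get(artifact, [])))
    return chain
```

For example, given hypothetical links `{"load_result.xml": ["manifest_entries.xml"], "manifest_entries.xml": ["csv_root_path.xml", "job_id.xml"]}`, calling `derivation_chain("load_result.xml", links)` visits the final result first and then each antecedent in turn. Because this runs after execution on files left on disk, the workflow itself does not need to be modified.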

It is important to note that most of the information needed to generate a fully functional wrapper or data annotator is contained in the abstract workflow. All the relationships between data, methods, and PML sources in a particular workflow are captured in WDO-It!, and this knowledge is leveraged to generate a wrapper or data annotator that needs only minor tweaks to work.

For this challenge we opted to use the batch version of the PC3 workflow and employed a wrapper approach for logging provenance, even though we could have used a data annotator. Provenance for this workflow was encoded in the Proof Markup Language (PML), the default encoding language of both the wrappers and data annotators. Our PML-based provenance dump for the PC3 workflow can be found here. The start nodeset of the PML provenance graph can be found here.

Executing Wrappers

The wrappers generated from WDO-It! are not fully functional and need to be enhanced before they can be executed.

Visualizing Provenance

Probe-It! is a browser suited to graphically rendering Proof Markup Language based provenance associated with results derived from both inference engines and workflows. You can open Probe-It! already showing the PC3 PML provenance by clicking here.

Probe-It! consists of three primary views to accommodate the different kinds of provenance information: result view, global justification view, and local information view, which refer to final and intermediate data, descriptions of the generation process as a whole, and information about a specific step in the process, respectively. Below is a partial PML trace of the PC3 workflow as visualized in Probe-It! The orange boxes on the top left and top right correspond to the workflow inputs: the XML file encoding the CSVRootPath and the JobID, respectively. The arcs represent the "usedBy" relationship in OPM. However, this is not an OPM graph; in PML terms, the arcs actually represent the "hasAntecedents" relation.

  • Probe-It! screen shot visualizing PC3 PML

Open Provenance Model Output

Query Results

Suggested Workflow Variants

Suggested Queries

Suggestions for Modification of the Open Provenance Model

Conclusions



Attachments

  • PopulateDBWorkflow.JPG (38.7 K, 04 Jun 2009 - 15:30, PauloPinheirodaSilva): Subworkflow about the process of populating the database
  • ThirdPCWorkflow.JPG (19.7 K, 04 Jun 2009 - 15:31, PauloPinheirodaSilva): Abstract workflow for the process of the third provenance challenge
  • probeit.png (181.6 K, 04 Jun 2009 - 18:58, PauloPinheirodaSilva): Probe-It! screen shot visualizing PC3 PML
