
Provenance Challenge



ES3

Participating Team

Differences from First Challenge

ES3 lineage trace schema

Data Model

The data model for ES3 contains only four types of objects: 1) files, 2) data transformations, 3) links, and 4) workflows.

File objects in ES3 represent files on disk that are read from or written to during the execution of the workflow. File objects may be data files that are manipulated directly by the workflow or may be files that are read and written by the executables used by the workflow, including operating system libraries, directories and temporary files.

File information may be filtered using a configuration file before being sent to ES3, so that files that are not of interest to the investigator, such as system libraries or temporary files that the workflow uses, are ignored.
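A minimal sketch of how such filtering might work, assuming glob-style exclude patterns. The pattern list and the function name `is_of_interest` are hypothetical; ES3's actual configuration file format is not described here.

```python
from fnmatch import fnmatch

# Hypothetical exclude patterns; not taken from a real ES3 configuration file.
EXCLUDE_PATTERNS = ["/lib/*", "/usr/lib/*", "/tmp/*", "*.so", "*.so.*"]


def is_of_interest(path, exclude_patterns=EXCLUDE_PATTERNS):
    """Return True if the path matches none of the exclude patterns."""
    return not any(fnmatch(path, pattern) for pattern in exclude_patterns)


# Paths a monitor might observe during one workflow run (illustrative only).
observed = ["/usr/lib/libc.so.6", "/tmp/scratch123", "/data/anatomy1.img"]
kept = [path for path in observed if is_of_interest(path)]
print(kept)
```

Only the data file survives the filter in this example; the system library and the temporary file are dropped before anything would be sent on.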

Data transformation objects are executable scripts or programs that are run during the execution of the workflow.

Link objects represent the connections between ES3 objects, for example between file objects and transformation objects. A link has a single direction, so each link defines a 'source' object and a 'destination' object. Links must be organized as a Directed Acyclic Graph, such that no links point backward in the graph to create a loop.
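The acyclicity requirement can be checked with a depth-first search. The sketch below assumes links are given as plain (source, destination) pairs; the function name `is_acyclic` is hypothetical and this is not ES3's own validation code.

```python
from collections import defaultdict


def is_acyclic(links):
    """Return True if the (source, destination) pairs contain no loop."""
    successors = defaultdict(list)
    nodes = set()
    for source, destination in links:
        successors[source].append(destination)
        nodes.update((source, destination))

    unvisited, in_progress, done = 0, 1, 2
    state = {node: unvisited for node in nodes}

    def visit(node):
        state[node] = in_progress
        for nxt in successors[node]:
            if state[nxt] == in_progress:
                return False  # a link points back into the current path
            if state[nxt] == unvisited and not visit(nxt):
                return False
        state[node] = done
        return True

    return all(state[node] != unvisited or visit(node) for node in nodes)


print(is_acyclic([("file_a", "transform"), ("transform", "file_b")]))
print(is_acyclic([("file_a", "transform"), ("transform", "file_a")]))
```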

A workflow object is the container in which all file, transformation and link objects belong. The workflow object represents all objects that are used during an instance of scientific processing that begins when recording for a Unix process begins and ends when that process exits.

Workflows can be connected to each other implicitly via files that one workflow writes and another workflow reads. No explicit connection is created between workflows.

A workflow may contain another workflow thereby creating a nested structure.
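The four object types could be sketched as follows. This is an illustration under stated assumptions, not the ES3 schema: the class and field names (`md5sum`, `arguments`, `children`) are hypothetical, and the link direction from transformation to output file is assumed.

```python
from dataclasses import dataclass, field
from typing import List, Optional
from uuid import uuid4


def new_id():
    return str(uuid4())


@dataclass
class File:
    name: str
    md5sum: Optional[str] = None
    uuid: str = field(default_factory=new_id)


@dataclass
class Transformation:
    name: str
    arguments: List[str] = field(default_factory=list)
    uuid: str = field(default_factory=new_id)


@dataclass
class Link:
    source: str       # UUID of the source object
    destination: str  # UUID of the destination object


@dataclass
class Workflow:
    name: str
    files: List[File] = field(default_factory=list)
    transformations: List[Transformation] = field(default_factory=list)
    links: List[Link] = field(default_factory=list)
    children: List["Workflow"] = field(default_factory=list)  # nested workflows
    uuid: str = field(default_factory=new_id)


# A nested workflow holding one transformation that writes one file.
outer = Workflow(name="challenge")
part2 = Workflow(name="part2")
outer.children.append(part2)
softmean = Transformation(name="softmean", arguments=["atlas.hdr", "y", "null"])
atlas = File(name="atlas.hdr")
part2.transformations.append(softmean)
part2.files.append(atlas)
part2.links.append(Link(source=softmean.uuid, destination=atlas.uuid))
```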

Provenance Data for Workflow Parts

Model Integration Results

We imported provenance data from the PASS system and VisTrails.

Translation Details

We wrote a translator to read a foreign provenance data file and translate it to ES3 objects which could then be sent to ES3.

ES3ingest.py is the translator script that was created for the translation step. Here is the command syntax and examples for ES3ingest.py:

    Usage: ES3ingest.py -t <foreign file type> [-e <execution log file>] <filename>

         example:
    
             ES3ingest.py -t PASS challenge-D-mod.xml
    
             ES3ingest.py -t VisTrails -e pc_log.xml pc_part3a.xml

Translating PASS Provenance Data

Provenance data from the PASS system was used for the first portion of the challenge workflow.

The data model used by PASS is very similar to the one used in ES3. The translation process involved converting PASS 'PROC' objects into ES3 'transformation' objects and PASS 'FILE' objects into ES3 'file' objects.
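The type mapping can be pictured with a small sketch. It assumes each PASS record has already been parsed into a dictionary with `type` and `name` keys; those keys, the output layout, and the function name are hypothetical and do not reflect how ES3ingest.py is actually written.

```python
# Mapping of PASS object types to ES3 object types, as described above.
PASS_TO_ES3 = {"PROC": "transformation", "FILE": "file"}


def translate_pass_record(record):
    """Translate one parsed PASS record into an ES3-style dictionary."""
    pass_type = record["type"]
    if pass_type not in PASS_TO_ES3:
        raise ValueError("unsupported PASS object type: %r" % pass_type)
    return {"es3_type": PASS_TO_ES3[pass_type], "name": record["name"]}


records = [
    {"type": "PROC", "name": "align_warp"},
    {"type": "FILE", "name": "warp1.warp"},
]
translated = [translate_pass_record(r) for r in records]
print(translated)
```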

runCmd is a shell script that runs the translator program for the PASS data.

lineageTrace-part1.graphml is the XML returned by an ES3 lineage query that shows the first portion of the challenge workflow.

lineageTrace-part1.png (http://eil.bren.ucsb.edu/ES3/SecondProvenanceChallenge/PHASE2/Teams/PASS/Results/lineageTrace-part1.png) is a graphical rendering of an ES3 lineage query that shows the PASS provenance data in ES3.

Using ES3 Provenance Data

ES3 provenance data was used for the second portion of the challenge workflow. This data was collected by running the provenance challenge workflow scripts while the probulator was monitoring them. The script run was 'workflow-part2.sh' which executed the command:


     $AIR_DIR/bin/softmean atlas.hdr y null resliced1.img resliced2.img resliced3.img resliced4.img

The ES3 transmitter was then run, which sent the information captured by the probulator to ES3.

lineageTrace-part2.graphml is the XML returned by an ES3 lineage query that shows the second portion of the challenge workflow.

lineageTrace-part2 is a graphical rendering of an ES3 lineage query that shows the ES3 provenance data for the second portion of the challenge workflow.

Translating VisTrails Provenance Data

Provenance data from the VisTrails system was used for the third portion of the challenge workflow.

lineageTrace-part3.graphml is the XML returned by an ES3 lineage query that shows the third portion of the challenge workflow.

lineageTrace-part3.png is a graphical rendering of an ES3 lineage query that shows the VisTrails provenance data in ES3.

lineageTrace-part3-Q7.graphml is the XML returned by an ES3 lineage query that shows the third portion of the challenge workflow.

lineageTrace-part3-Q7.png is a graphical rendering of an ES3 lineage query that shows the VisTrails provenance data in ES3.

Combining parts of the workflow

The usual method for ES3 to combine workflows is via the files that they share. One workflow creates output files, then subsequent workflows read these files. The md5sum of each file is calculated and stored when the file is registered, and is used to determine which files are common to workflows. Lineage queries detect these common files and traverse the workflows that share them.
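A sketch of checksum-based matching, assuming each workflow is represented as a list of (file name, md5sum) pairs. The data layout, the file contents, and the function names are hypothetical; only the idea of grouping files by md5sum comes from the description above.

```python
import hashlib
from collections import defaultdict


def md5sum(data):
    """Return the hexadecimal MD5 digest of a bytes object."""
    return hashlib.md5(data).hexdigest()


def shared_files(workflows):
    """Map each md5sum seen in more than one workflow to those workflow names."""
    seen = defaultdict(set)
    for workflow_name, files in workflows.items():
        for _file_name, checksum in files:
            if checksum is not None:
                seen[checksum].add(workflow_name)
    return {c: sorted(names) for c, names in seen.items() if len(names) > 1}


# Illustrative contents only; real checksums come from the files on disk.
workflows = {
    "part1": [("resliced1.img", md5sum(b"resliced image 1"))],
    "part2": [
        ("resliced1.img", md5sum(b"resliced image 1")),
        ("atlas.hdr", md5sum(b"atlas header")),
    ],
}
common = shared_files(workflows)
print(common)
```

Matching on the checksum rather than the file name means a renamed copy is still recognized, while two different files that happen to share a name are not confused.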

If the md5sum is not provided, however, as with the provenance data from the Provenance Challenge, then the workflows have to be stitched together manually by creating "identity" links between common files.

The files

demonstrate how this was done to stitch together part1 to part2 and part2 to part3 of the workflow.

The file

is a graphical representation of a lineage query showing the combined workflow.
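The manual stitching step can be pictured as follows. This sketch assumes that common files are recognized by file name and that each part exposes a name-to-UUID mapping; both assumptions, the UUID strings, and the function name are hypothetical, since the actual stitching files are not reproduced here.

```python
def identity_links(upstream_files, downstream_files):
    """Create one "identity" link per file name present in both parts."""
    common_names = sorted(set(upstream_files) & set(downstream_files))
    return [
        {
            "type": "identity",
            "source": upstream_files[name],
            "destination": downstream_files[name],
        }
        for name in common_names
    ]


# Hypothetical name-to-UUID mappings for two parts of the workflow.
part1_files = {"resliced1.img": "uuid-p1-r1", "resliced2.img": "uuid-p1-r2"}
part2_files = {
    "resliced1.img": "uuid-p2-r1",
    "resliced2.img": "uuid-p2-r2",
    "atlas.hdr": "uuid-p2-atlas",
}
stitched = identity_links(part1_files, part2_files)
print(stitched)
```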

Benchmarks

We used the provenance queries from the First Provenance Challenge as a benchmark, since these queries are well known to every team and their results are easily compared between the first and second challenges. The queries ran successfully for this challenge without changes.

Provenance Queries

Query 1

  1. Find UUID for object named "Atlas X Graphic".
  2. Trace lineage backwards from corresponding UUID.
  3. Display results.
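The three steps can be sketched as a breadth-first traversal over the links in reverse. The UUID strings, the name index, and the function name below are hypothetical stand-ins; the real lookup and trace are performed by ES3 itself.

```python
from collections import defaultdict, deque


def trace_backward(links, start):
    """Return every object reachable by following links backward from start."""
    predecessors = defaultdict(list)
    for source, destination in links:
        predecessors[destination].append(source)
    ancestors = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for parent in predecessors[node]:
            if parent not in ancestors:
                ancestors.add(parent)
                queue.append(parent)
    return ancestors


# Hypothetical name-to-UUID index and links for the last stage of the workflow.
uuid_by_name = {"slicer": "u1", "atlas-x.pgm": "u2", "convert": "u3", "Atlas X Graphic": "u4"}
links = [("u1", "u2"), ("u2", "u3"), ("u3", "u4")]

start = uuid_by_name["Atlas X Graphic"]  # step 1: find the UUID
lineage = trace_backward(links, start)   # step 2: trace backwards
print(sorted(lineage))                   # step 3: display results
```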

Query 2

  1. Find UUID for object named "Atlas X Graphic".
  2. Trace lineage backwards from corresponding UUID until object named "softmean" is encountered.
  3. Display results.
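A sketch of a backward trace with a name-based stopping condition. For readability the example identifies objects by name rather than UUID, and it assumes the stop object itself is included in the result while nothing upstream of it is; both choices are assumptions.

```python
from collections import defaultdict


def trace_backward_until(links, start, stop_name):
    """Trace backward from start, keeping stop_name but not going past it."""
    predecessors = defaultdict(list)
    for source, destination in links:
        predecessors[destination].append(source)
    found = set()
    stack = [start]
    while stack:
        node = stack.pop()
        if node == stop_name:
            continue  # termination condition: do not expand this object
        for parent in predecessors[node]:
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


# Simplified single chain through the challenge workflow (illustrative).
links = [
    ("resliced1.img", "softmean"),
    ("softmean", "atlas.hdr"),
    ("atlas.hdr", "slicer"),
    ("slicer", "atlas-x.pgm"),
    ("atlas-x.pgm", "convert"),
    ("convert", "Atlas X Graphic"),
]
result = trace_backward_until(links, "Atlas X Graphic", "softmean")
print(sorted(result))
```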

Query 3

  1. Find UUID for object named "Atlas X Graphic".
  2. Trace lineage backwards 5 links from corresponding UUID.
  3. Display results.

Discussion

The ES3 Core data model doesn't include a concept of workflow "stages". For this query we simply traced back five links (our interpretation of "Stages 3, 4, and 5" in the challenge workflow) from the "Atlas X Graphic" object. The lineage trace query uses a termination condition that states the trace should end after traversing five links from the starting UUID.
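The link-count termination condition can be sketched as a breadth-first search that records each object's distance from the start. Objects are identified by name for readability, and the chain below is a simplified, hypothetical slice of the workflow.

```python
from collections import defaultdict, deque


def trace_backward_limited(links, start, max_links):
    """Return {object: distance} for objects within max_links backward links."""
    predecessors = defaultdict(list)
    for source, destination in links:
        predecessors[destination].append(source)
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_links:
            continue  # termination condition: link budget is used up
        for parent in predecessors[node]:
            if parent not in distance:
                distance[parent] = distance[node] + 1
                queue.append(parent)
    del distance[start]
    return distance


links = [
    ("resliced1.img", "softmean"),
    ("softmean", "atlas.hdr"),
    ("atlas.hdr", "slicer"),
    ("slicer", "atlas-x.pgm"),
    ("atlas-x.pgm", "convert"),
    ("convert", "Atlas X Graphic"),
]
within_five = trace_backward_limited(links, "Atlas X Graphic", 5)
print(within_five)
```

In this chain the fifth link reaches "softmean", so "resliced1.img", six links away, is excluded.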

Query 4

  1. Find all ES3 transformation objects (i.e. processes) that have the specified name and command line arguments
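A sketch of the name-and-argument match, assuming each transformation is available as a dictionary with `name` and `arguments` keys (a hypothetical layout). The flag and value are matched as adjacent arguments so that, for example, "-m 12" does not match "-m 120".

```python
def has_option(arguments, flag, value):
    """Return True if flag is immediately followed by value."""
    return any(a == flag and b == value for a, b in zip(arguments, arguments[1:]))


def find_transformations(transformations, name, flag, value):
    """Select transformations with the given name and command line option."""
    return [
        t for t in transformations
        if t["name"] == name and has_option(t["arguments"], flag, value)
    ]


# Illustrative records; argument lists are abbreviated.
transformations = [
    {"name": "align_warp", "arguments": ["anatomy1.img", "-m", "12", "-q"]},
    {"name": "align_warp", "arguments": ["anatomy2.img", "-m", "120"]},
    {"name": "reslice", "arguments": ["warp1.warp", "resliced1"]},
]
matches = find_transformations(transformations, "align_warp", "-m", "12")
print(matches)
```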

Discussion

The split score (thumbs up + frown) for this query is due to XQuery's lack of support for queries based on day-of-week.

Query 5

We did not implement Query 5, since the ES3 Probulator currently doesn't examine the contents of the objects it monitors. (See Further Comments below)

Query 6

  1. Retrieve all align_warp transformations with arguments -m 12.
  2. Trace lineage forward to softmean.
  3. Retrieve file objects one lineage step forward from softmean.
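The three steps can be sketched as a filter followed by a forward traversal. The node layout (`kind`, `name`, `arguments`), the identifiers, and the tiny example graph are hypothetical; the sketch only illustrates the order of operations.

```python
from collections import defaultdict


def uses_m12(arguments):
    """Return True if "-m" is immediately followed by "12"."""
    return any(a == "-m" and b == "12" for a, b in zip(arguments, arguments[1:]))


def softmean_outputs(nodes, links):
    """Names of files one step forward from softmean runs fed by align_warp -m 12."""
    successors = defaultdict(list)
    for source, destination in links:
        successors[source].append(destination)
    # Step 1: retrieve the matching align_warp transformations.
    starts = [n for n, info in nodes.items()
              if info["name"] == "align_warp" and uses_m12(info.get("arguments", []))]
    # Step 2: trace lineage forward until softmean is reached.
    reached, stack, softmeans = set(starts), list(starts), set()
    while stack:
        node = stack.pop()
        if nodes[node]["name"] == "softmean":
            softmeans.add(node)
            continue
        for nxt in successors[node]:
            if nxt not in reached:
                reached.add(nxt)
                stack.append(nxt)
    # Step 3: retrieve file objects one lineage step forward from softmean.
    return sorted(nodes[f]["name"] for s in softmeans for f in successors[s]
                  if nodes[f]["kind"] == "file")


nodes = {
    "aw1": {"kind": "transformation", "name": "align_warp", "arguments": ["-m", "12"]},
    "aw2": {"kind": "transformation", "name": "align_warp", "arguments": ["-m", "6"]},
    "w1": {"kind": "file", "name": "warp1.warp"},
    "w2": {"kind": "file", "name": "warp2.warp"},
    "sm1": {"kind": "transformation", "name": "softmean"},
    "sm2": {"kind": "transformation", "name": "softmean"},
    "a1": {"kind": "file", "name": "atlas.img"},
    "a2": {"kind": "file", "name": "atlas.hdr"},
    "o1": {"kind": "file", "name": "other.img"},
}
links = [("aw1", "w1"), ("w1", "sm1"), ("sm1", "a1"), ("sm1", "a2"),
         ("aw2", "w2"), ("w2", "sm2"), ("sm2", "o1")]
outputs = softmean_outputs(nodes, links)
print(outputs)
```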

Query 7

Pending

Discussion

Our solution to Query 7, while not implemented entirely as an ES3 Core query, is nevertheless responsive to one of the primary classes of user queries that ES3 as a whole was designed to support; namely, "what changed?" queries. It's extremely common for scientists developing ad hoc workflows to notice differences in outputs across invocations between which "nothing was changed". Our graph-differencing approach is designed to answer the "what changed?" query as directly (and visually) as possible, while still allowing subsequent drill-down into the details.
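One simple way to picture graph differencing is as set operations over the links of two runs. The sketch below is not the ES3 implementation, which is not described here: it assumes objects can be compared by name across runs, and the link lists and output file names are illustrative only.

```python
def diff_graphs(links_a, links_b):
    """Compare two link sets and report what was removed, added, or kept."""
    a, b = set(links_a), set(links_b)
    return {
        "removed": sorted(a - b),
        "added": sorted(b - a),
        "unchanged": sorted(a & b),
    }


# First run: a single conversion step produces the final graphic.
run1 = [
    ("slicer", "atlas-x.pgm"),
    ("atlas-x.pgm", "convert"),
    ("convert", "atlas-x.gif"),
]
# Second run: the conversion step is replaced by two steps (hypothetical names).
run2 = [
    ("slicer", "atlas-x.pgm"),
    ("atlas-x.pgm", "pgmtoppm"),
    ("pgmtoppm", "atlas-x.ppm"),
    ("atlas-x.ppm", "pnmtojpeg"),
    ("pnmtojpeg", "atlas-x.jpg"),
]
changes = diff_graphs(run1, run2)
print(changes["removed"])
print(changes["added"])
```

The "unchanged" set shows where the two runs still agree, which gives a starting point for drilling down into the parts that differ.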

Further Comments

The manual operation of stitching together the provenance data from different systems to make a complete workflow was cumbersome. ES3 can use md5sums to combine workflows, but md5sums are often expensive to compute and are often not collected. If combining dissimilar provenance data proves beneficial in the future, another method of combining it should be found.

Conclusions

Translating foreign provenance data and importing it into ES3 was fairly straightforward. However, exported data and documentation alone give an incomplete understanding of another system's data model, which affects the implementation of the translation process.

Interoperability would be facilitated by a common set of terms and possibly a common provenance data format.

-- JamesFrew - 25 June 2007