Provenance Challenge

Participating Team

  • Short team name: SDG
  • Participant names: Karen Schuchardt, Eric Stephan, Tara Gibson, George Chin
  • Project URL: http://sdg.pnl.gov/
  • Project Overview: While technologies are emerging from the semantic and grid communities to address data federation and the tracking of data context and provenance, both critical needs for systems-oriented research, there is as yet no coherent "semantic data grid" architecture that will allow these advances to be applied at the scope, scale, and speed necessary to support next-generation science.
  • Provenance-specific Overview:
  • Relevant Publications:

Workflow Representation

Figure 1: Provenance challenge workflow depicted in Kepler

Provenance Trace

Provenance information was stored and queried using Scientific Annotation Middleware (SAM) http://collaboratory.emsl.pnl.gov/docs/collab/sam/. Any “thing” for which we want to capture information (an actor, a piece of data, a parameter, etc.) is given a unique id, and properties and relationships are then associated with that id. Figure 2 below shows the primary “things” we capture provenance on and what that provenance is in the context of this challenge. In an actual system, the set of “things” we capture provenance on would be somewhat larger: at a minimum, we would include the workflow descriptors, the actor descriptors, and the versions of each. We currently use generated URLs for the unique ids but have an LSID server deployed and plan to experiment with using LSIDs instead.

Links are captured by the provenance store in a way that allows forward and backward traversal. Relationship cardinality is captured as information only; the provenance store does not require or act upon it. Because SAM supports schema-less metadata, no predefined metadata structures are required, and as a result any metadata can be captured at any time. This capability is denoted by the [arbitrary triple]* notation in the figure. Inputs, outputs, and parameters can each be captured as one of hasValue, hasHashOfValue, or hasRefToValue. The last of these is used to point to raw data artifacts such as image files.
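
The model described above can be sketched in a few lines of Python. This is only an illustration, not SAM's implementation: the class name ProvenanceStore, its methods, the ids, and the rule for choosing between hasValue and hasHashOfValue (an inline size limit) are all assumptions made for the example.

```python
import hashlib
from collections import defaultdict


class ProvenanceStore:
    """Toy schema-less store: every "thing" is an id with arbitrary triples."""

    def __init__(self):
        self.props = defaultdict(dict)     # id -> {property name: literal}
        self.forward = defaultdict(set)    # (subject, predicate) -> objects
        self.backward = defaultdict(set)   # (object, predicate) -> subjects

    def set_prop(self, thing, name, value):
        # No predefined schema: any property can be attached at any time.
        self.props[thing][name] = value

    def link(self, subject, predicate, obj):
        # Index both directions so a link can be followed either way.
        self.forward[(subject, predicate)].add(obj)
        self.backward[(obj, predicate)].add(subject)

    def record_value(self, thing, data, ref=None, inline_limit=64):
        # Store exactly one of hasValue, hasHashOfValue, or hasRefToValue.
        # The inline_limit policy is an assumption for this sketch.
        if ref is not None:
            self.set_prop(thing, "hasRefToValue", ref)
        elif len(data) <= inline_limit:
            self.set_prop(thing, "hasValue", data.decode())
        else:
            digest = hashlib.sha1(data).hexdigest()
            self.set_prop(thing, "hasHashOfValue", digest)
```

For example, store.link("actor:softmean", "hasOutput", "file:atlas.img") makes the output reachable from the actor through the forward index and the actor reachable from the output through the backward index.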

Figure 2: Provenance Data Model

Provenance Queries

This section describes the methods and technologies we used to produce results for the provenance challenge. We used a query language based closely on the DAV Searching and Locating (DASL) protocol already supported by SAM. We refer to this extension as Semantic Extended DASL (SEDASL); it is described in more depth in our results document.

1) Find the process that led to Atlas X Graphic / everything that caused Atlas X Graphic to be as it is. This should tell us the new brain images from which the averaged atlas was generated, the warping performed etc.

Discussion: The SEDASL scope query was used to first find the Atlas X Graphic file and then reverse semantic links were used to determine all processes (Kepler actors) and files that were responsible for generating this file. The scope was set to infinite to represent all provenance responsible for the generation of the file. The result format was set to GXL and converted to a gif image through a separate process.

<d:search xmlns:d="DAV:">
  <d:basicsearch>
    <d:select>
      <d:prop>
        <d:allindexedprop />
      </d:prop>
      <d:format d:include-links="false">gxl</d:format>
    </d:select>
    <d:from>
      <d:scope>
        <d:following d:direction="reverse">
          <d:prop>
            <d:hasOutput />
            <d:isInput />
            <d:instantiationOf />
            <d:hasParameter d:direction="forward" />
          </d:prop>
        </d:following>
        <d:query>
          <d:href>/sam/users/sdg/karen/workflow2125321801/</d:href>
          <d:eq xmlns:s="http://jakarta.apache.org/slide/">
            <d:prop>
              <d:title />
            </d:prop>
            <d:literal>atlas-x.gif</d:literal>
          </d:eq>
        </d:query>
        <d:depth>INFINITE</d:depth>
        <d:min-depth>0</d:min-depth>
      </d:scope>
    </d:from>
  </d:basicsearch>
</d:search>
Figure 3: SEDASL query for provenance query 1

Figure 4: Query 1 Results
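
The traversal behind Query 1 can be approximated outside SAM as a breadth-first walk over (subject, predicate, object) links. The sketch below is an illustration under assumptions: the link list is a simplified, hypothetical single-branch version of the challenge workflow, and the direction sets mirror the reverse/forward attributes in the query above.

```python
from collections import deque

# Hypothetical, simplified links; the names are illustrative only.
LINKS = [
    ("anatomy1.img", "isInput", "AlignWarp"),
    ("AlignWarp", "hasParameter", "ModelMenuNumber"),
    ("AlignWarp", "hasOutput", "warp1.warp"),
    ("warp1.warp", "isInput", "Reslice"),
    ("Reslice", "hasOutput", "resliced1.img"),
    ("resliced1.img", "isInput", "Softmean"),
    ("Softmean", "hasOutput", "atlas.img"),
    ("atlas.img", "isInput", "Slicer"),
    ("Slicer", "hasOutput", "atlas-x.pgm"),
    ("atlas-x.pgm", "isInput", "Convert"),
    ("Convert", "hasOutput", "atlas-x.gif"),
]

REVERSE = {"hasOutput", "isInput"}   # followed from object back to subject
FORWARD = {"hasParameter"}           # followed from subject to object


def neighbours(node):
    """Yield the nodes one link away from node, honouring link direction."""
    for s, p, o in LINKS:
        if p in REVERSE and o == node:
            yield s
        elif p in FORWARD and s == node:
            yield o


def lineage(start):
    """Everything reachable from start with unbounded depth."""
    seen, queue = {start}, deque([start])
    while queue:
        for nxt in neighbours(queue.popleft()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen
```

Calling lineage("atlas-x.gif") returns every file, actor, and parameter in the toy link list, which corresponds to setting the depth to INFINITE.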

2) Find the process that led to Atlas X Graphic, excluding everything prior to the averaging of images with softmean.

Discussion: Search criteria were used to first find the Atlas X Graphic and then follow semantic links, as in Query 1, until the Softmean process was found. Because the scope was set to stop when the Softmean process was found, only the links between the graphic and the Softmean process were returned. The result format was set to GXL and converted to a gif image.

<d:searchrequest xmlns:d="DAV:">
  <d:basicsearch>
    <d:select>
      <d:prop>
        <d:format />
        <d:title />
        <d:isInput />
        <d:hasOutput />
        <d:hasValue />
        <d:hasHashOfValue />
        <d:hasRefToValue />
        <d:instantiationOf />
        <d:hasParameter />
      </d:prop>
      <d:format d:include-links="false">gxl</d:format>
    </d:select>
    <d:from>
      <d:scope>
        <d:following d:direction="reverse">
          <d:prop>
            <d:isInput />
            <d:hasOutput />
            <d:hasRefToValue />
            <d:instantiationOf />
            <d:hasParameter d:direction="forward" />
          </d:prop>
          <d:stop-condition>
            <d:eq>
              <d:prop>
                <d:title />
              </d:prop>
              <d:literal>Softmean</d:literal>
            </d:eq>
          </d:stop-condition>
        </d:following>
        <d:query>
          <d:href>/sam/users/sdg/karen/workflow2125321801/</d:href>
          <d:eq>
            <d:prop>
              <d:title />
            </d:prop>
            <d:literal>atlas-x.gif</d:literal>
          </d:eq>
        </d:query>
        <d:depth>INFINITE</d:depth>
        <d:min-depth>0</d:min-depth>
      </d:scope>
    </d:from>
  </d:basicsearch>
</d:searchrequest>
Figure 5: SEDASL query for provenance query 2

Figure 6: Query 2 results
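
A stop condition of this kind can be sketched as a traversal that keeps the matching node but does not expand beyond it. The function name, the link list, and the choice to include the stop node in the result are assumptions for illustration; they are not taken from SEDASL itself.

```python
# Hypothetical links for the tail of the workflow; names are illustrative.
LINKS = [
    ("resliced1.img", "isInput", "Softmean"),
    ("Softmean", "hasOutput", "atlas.img"),
    ("atlas.img", "isInput", "Slicer"),
    ("Slicer", "hasOutput", "atlas-x.pgm"),
    ("atlas-x.pgm", "isInput", "Convert"),
    ("Convert", "hasOutput", "atlas-x.gif"),
]


def lineage_until(start, links, stop_title):
    """Walk links backwards from start, halting at nodes named stop_title."""
    seen, frontier = {start}, [start]
    while frontier:
        node = frontier.pop()
        if node == stop_title:
            continue  # keep the stop node, but do not look behind it
        for s, p, o in links:
            if p in ("hasOutput", "isInput") and o == node and s not in seen:
                seen.add(s)
                frontier.append(s)
    return seen
```

With stop_title="Softmean", the walk from atlas-x.gif never reaches resliced1.img, so everything prior to the averaging step is excluded.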

3) Find the Stage 3, 4 and 5 details of the process that led to Atlas X Graphic.

Discussion: Search criteria were used to first find the Atlas X Graphic and then follow semantic links, as in Queries 1 and 2, except that the minimum and maximum scope depths were used to return only certain stages of the workflow. The result format was set to GXL and converted to a gif image.

<d:searchrequest xmlns:d="DAV:">
  <d:basicsearch>
    <d:select>
      <d:prop>
        <d:format />
        <d:title />
        <d:isInput />
        <d:hasOutput />
        <d:hasValue />
        <d:hasHashOfValue />
        <d:hasRefToValue />
        <d:instantiationOf />
        <d:hasParameter />
      </d:prop>
      <d:format d:include-links="false">gxl</d:format>
    </d:select>
    <d:from>
      <d:scope>
        <d:following d:direction="reverse">
          <d:prop>
            <d:isInput />
            <d:hasOutput />
            <d:hasRefToValue />
            <d:instantiationOf />
            <d:hasParameter d:direction="forward" />
          </d:prop>
        </d:following>
        <d:query>
          <d:href>/sam/users/sdg/karen/workflow2125321801/</d:href>
          <d:eq>
            <d:prop>
              <d:title />
            </d:prop>
            <d:literal>atlas-x.gif</d:literal>
          </d:eq>
        </d:query>
        <d:depth>5</d:depth>
        <d:min-depth>0</d:min-depth>
      </d:scope>
    </d:from>
  </d:basicsearch>
</d:searchrequest>
Figure 7: SEDASL query for provenance query 3

Figure 8: Query 3 results
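
Bounding the traversal by depth can be sketched with a breadth-first search that records the depth at which each node is first reached. The function name and the link list below are hypothetical; the depth and min-depth semantics are an assumption about how the two scope settings combine.

```python
from collections import deque

# Hypothetical links for the tail of the workflow; names are illustrative.
LINKS = [
    ("resliced1.img", "isInput", "Softmean"),
    ("Softmean", "hasOutput", "atlas.img"),
    ("atlas.img", "isInput", "Slicer"),
    ("Slicer", "hasOutput", "atlas-x.pgm"),
    ("atlas-x.pgm", "isInput", "Convert"),
    ("Convert", "hasOutput", "atlas-x.gif"),
]


def lineage_window(start, links, min_depth, max_depth):
    """Return nodes whose backward distance from start is within the window."""
    depth = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if depth[node] == max_depth:
            continue  # do not expand past the maximum depth
        for s, p, o in links:
            if p in ("hasOutput", "isInput") and o == node and s not in depth:
                depth[s] = depth[node] + 1
                queue.append(s)
    return {n for n, d in depth.items() if d >= min_depth}
```

A window of 0 to 5 from atlas-x.gif stops at Softmean in this toy graph; raising min_depth trims the stages nearest the graphic.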

4) Find all invocations of procedure align_warp using a twelfth order nonlinear 1365 parameter model (see model menu describing possible values of parameter "-m 12" of align_warp) that ran on a Monday.

Discussion: Search criteria in the scope were used to find “ModelMenuNumber” parameters with the value “-m 12”, and semantic links were used to find the associated process. Search criteria in the where clause were then used to filter the AlignWarp results to those created on a given day of the week (in our case Thursday). The result format was set to GXL and converted to a gif image.

<d:searchrequest xmlns:d="DAV:">
  <d:basicsearch>
    <d:select>
      <d:prop>
        <d:displayName />
        <d:format />
        <d:title />
        <d:isPartOf />
      </d:prop>
      <d:format d:include-links="true">gxl</d:format>
    </d:select>
    <d:from>
      <d:scope>
        <d:following>
          <d:prop>
            <d:hasParameter d:direction="reverse" />
          </d:prop>
        </d:following>
        <d:query>
          <d:href>/sam/users/sdg/karen/</d:href>
          <d:and>
            <d:eq>
              <d:prop>
                <d:title />
              </d:prop>
              <d:literal>ModelMenuNumber</d:literal>
            </d:eq>
            <d:eq>
              <d:prop>
                <d:value />
              </d:prop>
              <d:literal>-m 12</d:literal>
            </d:eq>
          </d:and>
        </d:query>
        <d:depth>INFINITE</d:depth>
        <d:min-depth>0</d:min-depth>
      </d:scope>
    </d:from>
    <d:where>
      <d:and>
        <s:propcontains xmlns:s="http://jakarta.apache.org/slide/">
          <d:prop>
            <d:startedExecution />
          </d:prop>
          <d:literal>Th</d:literal>
        </s:propcontains>
        <s:propcontains xmlns:s="http://jakarta.apache.org/slide/">
          <d:prop>
            <d:instantiationOf />
          </d:prop>
          <d:literal>AlignWarp</d:literal>
        </s:propcontains>
      </d:and>
    </d:where>
  </d:basicsearch>
</d:searchrequest>
Figure 9: SEDASL query for provenance query 4

Figure 10: Query 4 results
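
The query above matches the day of the week by looking for the substring “Th” in startedExecution. An alternative that does not depend on how the timestamp is rendered is to parse it and compare weekdays. The sketch below assumes ISO 8601 timestamps and a hypothetical record layout (id, instantiationOf, startedExecution, parameters); neither is taken from SAM.

```python
from datetime import datetime


def ran_on(invocations, weekday, actor="AlignWarp",
           param=("ModelMenuNumber", "-m 12")):
    """Return ids of matching invocations.

    weekday follows datetime.weekday(): 0 is Monday, 6 is Sunday.
    """
    hits = []
    for inv in invocations:
        started = datetime.fromisoformat(inv["startedExecution"])
        if (inv["instantiationOf"] == actor
                and inv["parameters"].get(param[0]) == param[1]
                and started.weekday() == weekday):
            hits.append(inv["id"])
    return hits


# Hypothetical records: 2006-09-07 was a Thursday, 2006-09-11 a Monday.
INVOCATIONS = [
    {"id": "a", "instantiationOf": "AlignWarp",
     "startedExecution": "2006-09-07T10:15:00",
     "parameters": {"ModelMenuNumber": "-m 12"}},
    {"id": "b", "instantiationOf": "AlignWarp",
     "startedExecution": "2006-09-11T09:00:00",
     "parameters": {"ModelMenuNumber": "-m 12"}},
    {"id": "c", "instantiationOf": "Softmean",
     "startedExecution": "2006-09-07T11:00:00",
     "parameters": {}},
]
```

Here ran_on(INVOCATIONS, 3) selects the Thursday run and ran_on(INVOCATIONS, 0) the Monday run.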

5) Find all Atlas Graphic images outputted from workflows where at least one of the input Anatomy Headers had an entry global maximum=4095. The contents of a header file can be extracted as text using the scanheader AIR utility.

Discussion: Our development team found the scanheader source code, which gave us the binary structure of the image header. Based on this structure, a DFDL schema and a corresponding XSLT file were created and registered on the SAM server. When the workflow runs, metadata from the image header is automatically extracted and associated with the file. Search criteria were used to first find all anatomy header files with a ‘global maximum’ property of ‘4095’ and then follow semantic links to generate a search pool of all items downstream of the header in the workflow. Search criteria in the where clause were then used to return the atlas image.

<?xml version='1.0' encoding='UTF-8'?>
<dfdl:DFDL xmlns:dfdl="DFDL">
  <dfdl:sizeofhdr>348</dfdl:sizeofhdr>
  <dfdl:extents>16384</dfdl:extents>
  <dfdl:regular>r</dfdl:regular>
  <dfdl:dims>4</dfdl:dims>
  <dfdl:xdim>256</dfdl:xdim>
  <dfdl:ydim>256</dfdl:ydim>
  <dfdl:zdim>128</dfdl:zdim>
  <dfdl:tdim>1</dfdl:tdim>
  <dfdl:datatype>4</dfdl:datatype>
  <dfdl:bits>16</dfdl:bits>
  <dfdl:xsize>1.0</dfdl:xsize>
  <dfdl:ysize>1.0</dfdl:ysize>
  <dfdl:zsize>1.25</dfdl:zsize>
  <dfdl:glmax>4095</dfdl:glmax>
  <dfdl:glmin>0</dfdl:glmin>
</dfdl:DFDL> 
Figure 11: XML Result after Defuddle translation of the header
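
To show how a translated header like the one in Figure 11 might become searchable properties, the sketch below flattens an abbreviated copy of that XML into a name/value dictionary using Python's standard library. It is an illustration only: the function name is hypothetical, and SAM's own extraction goes through the registered XSLT rather than code like this.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the translated header shown in Figure 11.
HEADER_XML = """<dfdl:DFDL xmlns:dfdl="DFDL">
  <dfdl:xdim>256</dfdl:xdim>
  <dfdl:zsize>1.25</dfdl:zsize>
  <dfdl:glmax>4095</dfdl:glmax>
  <dfdl:glmin>0</dfdl:glmin>
</dfdl:DFDL>"""


def header_properties(xml_text):
    """Map each header field name to its text value."""
    root = ET.fromstring(xml_text)
    # ElementTree reports tags as "{namespace}name"; keep only the name.
    return {child.tag.split("}", 1)[-1]: child.text for child in root}


props = header_properties(HEADER_XML)
has_wanted_maximum = props.get("glmax") == "4095"
```

The resulting dictionary carries the glmax value that the query in Figure 12 tests for.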

<d:searchrequest xmlns:d="DAV:">
  <d:basicsearch>
    <d:select>
      <d:prop>
        <d:displayName />
        <d:format />
        <d:title />
        <d:hasParameter />
      </d:prop>
      <d:format d:include-links="false">gxl</d:format>
    </d:select>
    <d:from>
      <d:scope>
        <d:following>
          <d:prop>
            <d:isInput />
            <d:hasOutput />
            <d:hasRefToValue />
            <d:instantiationOf />
            <d:hasParameter />
          </d:prop>
        </d:following>
        <d:query>
          <d:href>/sam/users/sdg/karen/</d:href>
          <d:and>
            <s:propcontains xmlns:s="http://jakarta.apache.org/slide/">
              <d:prop>
                <d:title />
              </d:prop>
              <d:literal>anatomy*.hdr</d:literal>
            </s:propcontains>
            <d:eq xmlns:s="http://jakarta.apache.org/slide/">
              <d:prop>
                <d:glmax />
              </d:prop>
              <d:literal>4095</d:literal>
            </d:eq>
          </d:and>
        </d:query>
        <d:depth>INFINITE</d:depth>
        <d:min-depth>0</d:min-depth>
      </d:scope>
    </d:from>
    <d:where>
      <d:eq>
        <d:prop>
          <d:title />
        </d:prop>
        <d:literal>atlas.img</d:literal>
      </d:eq>
    </d:where>
  </d:basicsearch>
</d:searchrequest>
Figure 12: SEDASL query for provenance query 5

<D:multistatus xmlns:D="DAV:">
  <gxl xmlns="" xmlns:cmcs="http://purl.oclc.org/NET/cmcs/internal/schema/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:sam="http://purl.oclc.org/NET/SAM/mms">
    <graph>
      <node id="node0">
        <attr name="Label">
          <string></string>
          <dav:href xmlns:dav="DAV:">
                  /users/sdg/karen/question7/workflow488799345/1378310357atlas.img
          </dav:href>
        </attr>
        <attr name="title" xmlns="DAV:">
          <string>atlas.img</string>
        </attr>
      </node>
    </graph>
  </gxl>
</D:multistatus>
Figure 13: Query 5 raw results
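
A raw result like the one in Figure 13 can be reduced to a list of node records with a short script. The sketch below uses only the Python standard library and compares local element names, because the response mixes un-namespaced GXL elements with elements in the DAV: namespace. The function names are hypothetical, and the embedded response is a copy of Figure 13 with the unused namespace declarations omitted.

```python
import xml.etree.ElementTree as ET

RESPONSE = """<D:multistatus xmlns:D="DAV:">
  <gxl xmlns="">
    <graph>
      <node id="node0">
        <attr name="Label">
          <string></string>
          <dav:href xmlns:dav="DAV:">
            /users/sdg/karen/question7/workflow488799345/1378310357atlas.img
          </dav:href>
        </attr>
        <attr name="title" xmlns="DAV:">
          <string>atlas.img</string>
        </attr>
      </node>
    </graph>
  </gxl>
</D:multistatus>"""


def local(tag):
    """Drop the "{namespace}" prefix that ElementTree adds to tags."""
    return tag.split("}", 1)[-1]


def result_nodes(xml_text):
    """Return one dict per GXL node with its id, href, and named attributes."""
    records = []
    for el in ET.fromstring(xml_text).iter():
        if local(el.tag) != "node":
            continue
        record = {"id": el.get("id")}
        for attr in el:
            if local(attr.tag) != "attr":
                continue
            for child in attr:
                text = (child.text or "").strip()
                if local(child.tag) == "href":
                    record["href"] = text
                elif local(child.tag) == "string" and text:
                    record[attr.get("name")] = text
        records.append(record)
    return records
```

Applied to RESPONSE, result_nodes returns a single record for node0 holding the href and the title atlas.img.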

6) Find all output averaged images of softmean (average) procedures, where the warped images taken as input were align_warped using a twelfth order nonlinear 1365 parameter model, i.e. "where softmean was preceded in the workflow, directly or indirectly, by an align_warp procedure with argument -m 12."

Discussion: Search criteria were used to find output from a Softmean process that was preceded, within a workflow, by processes containing the parameter “ModelMenuNumber” with the value “-m 12”. The result format was set to GXL and converted to a gif image.

<d:searchrequest xmlns:d="DAV:">
  <d:basicsearch>
    <d:select>
      <d:prop>
        <d:format />
        <d:title />
        <d:isInput />
        <d:hasOutput />
        <d:hasValue />
        <d:hasHashOfValue />
        <d:hasRefToValue />
        <d:instantiationOf />
      </d:prop>
      <d:format d:include-links="false">gxl</d:format>
    </d:select>
    <d:from>
      <d:scope>
        <d:following>
          <d:prop>
            <d:isInput />
            <d:hasOutput />
            <d:hasRefToValue />
            <d:instantiationOf />
            <d:hasParameter d:direction="reverse" />
          </d:prop>
        </d:following>
        <d:query>
          <d:href>/sam/users/sdg/karen/</d:href>
          <d:and>
            <d:eq>
              <d:prop>
                <d:title />
              </d:prop>
              <d:literal>ModelMenuNumber</d:literal>
            </d:eq>
            <d:eq>
              <d:prop>
                <d:hasValue />
              </d:prop>
              <d:literal>"-m 12"</d:literal>
            </d:eq>
          </d:and>
        </d:query>
        <d:depth>INFINITE</d:depth>
        <d:min-depth>0</d:min-depth>
      </d:scope>
    </d:from>
    <d:where>
      <d:eq>
        <d:prop>
          <d:title />
        </d:prop>
        <d:literal>Softmean</d:literal>
      </d:eq>
    </d:where>
  </d:basicsearch>
</d:searchrequest>
Figure 14: SEDASL query for provenance query 6

<D:multistatus xmlns:D="DAV:">
  <gxl xmlns="" xmlns:cmcs="http://purl.oclc.org/NET/cmcs/internal/schema/"               
                xmlns:dc="http://purl.org/dc/elements/1.1/"   
                xmlns:sam="http://purl.oclc.org/NET/SAM/mms">
    <graph>
      <node id="node0">
        <attr name="Label">
          <string></string>
          <dav:href xmlns:dav="DAV:">
/users/sdg/karen/workflow-1590847631/-885893539atlas.img</dav:href>
        </attr>
        <attr name="title" xmlns="DAV:">
          <string>atlas.img</string>
        </attr>
      </node>
    </graph>
  </gxl>
</D:multistatus>
Figure 15: Query 6 raw results
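
The logic of Query 6 (start from the processes that carry the parameter, walk forward, then filter) can be sketched as follows. The function name, the process ids, and the two toy workflows are hypothetical, and the actor_of mapping stands in for the instantiationOf relationship.

```python
def softmean_outputs(links, params, actor_of):
    """Outputs of Softmean runs downstream of an AlignWarp run with -m 12."""
    # Start from every process that carries the wanted parameter value.
    start = {proc for proc, name, value in params
             if actor_of.get(proc) == "AlignWarp"
             and (name, value) == ("ModelMenuNumber", "-m 12")}
    seen, frontier = set(start), list(start)
    while frontier:
        node = frontier.pop()
        for s, p, o in links:
            # Follow hasOutput and isInput in their forward direction.
            if s == node and p in ("hasOutput", "isInput") and o not in seen:
                seen.add(o)
                frontier.append(o)
    return sorted(o for s, p, o in links
                  if p == "hasOutput" and s in seen
                  and actor_of.get(s) == "Softmean")


# Two hypothetical runs: run A used "-m 12", run B used "-m 3".
LINKS = [
    ("aw1", "hasOutput", "a.warp"), ("a.warp", "isInput", "rs1"),
    ("rs1", "hasOutput", "a.img"), ("a.img", "isInput", "sm1"),
    ("sm1", "hasOutput", "atlasA.img"),
    ("aw2", "hasOutput", "b.warp"), ("b.warp", "isInput", "rs2"),
    ("rs2", "hasOutput", "b.img"), ("b.img", "isInput", "sm2"),
    ("sm2", "hasOutput", "atlasB.img"),
]
PARAMS = [("aw1", "ModelMenuNumber", "-m 12"), ("aw2", "ModelMenuNumber", "-m 3")]
ACTOR_OF = {"aw1": "AlignWarp", "aw2": "AlignWarp", "rs1": "Reslice",
            "rs2": "Reslice", "sm1": "Softmean", "sm2": "Softmean"}
```

Only the averaged image from run A is returned, because run B's Softmean is not reachable from an AlignWarp run with the wanted parameter.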

7) A user has run the workflow twice, in the second instance replacing each procedure (convert) in the final stage with two procedures: pgmtoppm, then pnmtojpeg. Find the differences between the two workflow runs. The exact level of detail in the difference that is detected by a system is up to each participant.

We developed some code based on Batagelj and Mrvar's subquadratic triad census algorithm. The graph is read in from a file that lists the set of nodes and the set of edges. Internally, the graph is stored as an array of structures, where each array element (structure) represents a particular node in the graph. Each structure also contains a linked list holding every edge that points outward from the node, so each edge is represented only once. A second array is generated to capture neighbor information for the triad computations. In this array each element still represents a node, but the linked list holds all of its neighbors (inward and outward), so each edge is represented twice in this neighbor data structure.
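
The two structures described above can be sketched as follows. This is not the team's code: Python lists stand in for the linked lists, and the function name and input format are assumptions. It covers only the storage layout, not the triad census itself.

```python
def load_graph(nodes, edges):
    """Build the out-edge lists and the neighbour lists for a directed graph.

    nodes is a list of node names; edges is a list of (source, target) pairs.
    """
    index = {name: i for i, name in enumerate(nodes)}
    out_edges = [[] for _ in nodes]    # each edge stored once, at its source
    neighbours = [[] for _ in nodes]   # each edge stored twice, at both ends
    for src, dst in edges:
        s, d = index[src], index[dst]
        out_edges[s].append(d)
        neighbours[s].append(d)
        neighbours[d].append(s)
    return out_edges, neighbours


# Hypothetical three-node fragment of a workflow graph.
NODES = ["atlas.img", "Slicer", "atlas-x.pgm"]
EDGES = [("atlas.img", "Slicer"), ("Slicer", "atlas-x.pgm")]
OUT_EDGES, NEIGHBOURS = load_graph(NODES, EDGES)
```

Summing the list lengths confirms the counts stated in the text: the out-edge lists hold len(EDGES) entries in total and the neighbour lists hold twice that.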

For two graphs to be considered similar, the triad distance should be between 0.0 and 1.0. In our tests the distance was 0.246237, which seems reasonable. The result also shows 68 common nodes across the two files. Nodes are matched when they have the same values for the following metadata attributes: title, instantiationOf, resourcetype, source, and format.
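
The node-matching step can be illustrated with a small sketch that compares nodes by the five attributes listed above. Counting matches as a multiset intersection is an assumption made here; the source does not say how duplicate signatures are handled, and the function names and sample nodes are hypothetical.

```python
from collections import Counter

MATCH_KEYS = ("title", "instantiationOf", "resourcetype", "source", "format")


def signature(node):
    """Tuple of the attribute values used to decide whether two nodes match."""
    return tuple(node.get(key) for key in MATCH_KEYS)


def common_nodes(run_a, run_b):
    """Count nodes whose signatures occur in both runs (multiset intersection)."""
    a = Counter(signature(n) for n in run_a)
    b = Counter(signature(n) for n in run_b)
    return sum((a & b).values())


# Hypothetical nodes: the second run replaces Convert with two other actors.
RUN_1 = [{"title": "Softmean", "instantiationOf": "Softmean"},
         {"title": "Convert", "instantiationOf": "Convert"}]
RUN_2 = [{"title": "Softmean", "instantiationOf": "Softmean"},
         {"title": "pgmtoppm", "instantiationOf": "pgmtoppm"},
         {"title": "pnmtojpeg", "instantiationOf": "pnmtojpeg"}]
```

In this toy case only the Softmean node is common to both runs, so common_nodes(RUN_1, RUN_2) is 1.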

8) A user has annotated some anatomy images with a key-value pair center=UChicago. Find the outputs of align_warp where the inputs are annotated with center=UChicago.

Discussion: We used DavExplorer to annotate an anatomy file with metadata. Because the SAM server supports arbitrary metadata, no schema modifications were required. Search criteria were used to find the AlignWarp process in the workflow of the file carrying the metadata ‘center’ with the value ‘UChicago’. The result format was set to GXL and converted to a gif image. The output of the AlignWarp process was also included in the graph result by including the output generated through links.

<d:searchrequest xmlns:d="DAV:">
  <d:basicsearch>
    <d:select>
      <d:prop>
        <d:format />
        <d:title />
        <d:isInput />
        <d:hasOutput />
        <d:hasValue />
        <d:hasHashOfValue />
        <d:hasRefToValue />
        <d:instantiationOf />
      </d:prop>
      <d:format d:include-links="true">gxl</d:format>
    </d:select>
    <d:from>
      <d:scope>
        <d:following>
          <d:prop>
            <d:isInput />
            <d:hasOutput />
            <d:hasRefToValue />
            <d:instantiationOf />
          </d:prop>
        </d:following>
        <d:query>
          <d:href>/sam/users/sdg/karen/</d:href>
          <d:eq>
            <d:prop>
              <d:center />
            </d:prop>
            <d:literal>UChicago</d:literal>
          </d:eq>
        </d:query>
        <d:depth>INFINITE</d:depth>
        <d:min-depth>0</d:min-depth>
      </d:scope>
    </d:from>
    <d:where>
      <s:propcontains xmlns:s="http://jakarta.apache.org/slide/">
        <d:prop>
          <d:title />
        </d:prop>
        <d:literal>AlignWarp</d:literal>
      </s:propcontains>
    </d:where>
  </d:basicsearch>
</d:searchrequest>
Figure 16: SEDASL query for provenance query 8

Figure 17: Query 8 results

9) A user has annotated some atlas graphics with key-value pair where the key is studyModality. Find all the graphical atlas sets that have metadata annotation studyModality with values speech, visual or audio, and return all other annotations to these files.

Discussion: We used DavExplorer to annotate several atlas graphic files with metadata. Because the SAM server supports arbitrary metadata, no schema modifications were required. Search criteria were used to return all properties on the resources carrying the metadata ‘studyModality’ with the values ‘speech’, ‘audio’, or ‘visual’. The result format was set to GXL and converted to a gif image.

<d:searchrequest xmlns:d="DAV:">
  <d:basicsearch>
    <d:select>
      <d:prop>
        <d:allindexedprop />
      </d:prop>
      <d:format d:include-links="false">gxl</d:format>
    </d:select>
    <d:from>
      <d:scope>
        <d:href>/sam/users/sdg/karen</d:href>
        <d:depth>INFINITE</d:depth>
        <d:min-depth>0</d:min-depth>
      </d:scope>
    </d:from>
    <d:where>
      <d:or>
        <s:propcontains xmlns:s="http://jakarta.apache.org/slide/">
          <d:prop>
            <d:studyModality />
          </d:prop>
          <d:literal>audio</d:literal>
        </s:propcontains>
        <s:propcontains xmlns:s="http://jakarta.apache.org/slide/">
          <d:prop>
            <d:studyModality />
          </d:prop>
          <d:literal>visual</d:literal>
        </s:propcontains>
        <s:propcontains xmlns:s="http://jakarta.apache.org/slide/">
          <d:prop>
            <d:studyModality />
          </d:prop>
          <d:literal>speech</d:literal>
        </s:propcontains>
      </d:or>
    </d:where>
  </d:basicsearch>
</d:searchrequest>
Figure 18: SEDASL query for provenance query 9

Figure 19: Query 9 results
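
The where clause of Query 9 is an OR over three substring tests on a single property, with all properties returned for each match. A minimal sketch of that filter follows; the function name, the hrefs, and the extra annotations are hypothetical.

```python
def by_modality(resources, wanted=("speech", "visual", "audio")):
    """Return all properties of resources whose studyModality contains a wanted value."""
    return {href: props for href, props in resources.items()
            if any(value in props.get("studyModality", "") for value in wanted)}


# Hypothetical annotated resources keyed by href.
RESOURCES = {
    "atlas-x.gif": {"studyModality": "speech", "reviewer": "kls"},
    "atlas-y.gif": {"studyModality": "motor"},
    "atlas-z.gif": {"title": "atlas-z.gif"},
}
```

Only atlas-x.gif matches here, and its other annotation (reviewer) comes back with it, mirroring the use of allindexedprop in the select clause.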

| Team | Q1 | Q2 | Q3 | Q4 | Q5 | Q6 | Q7 | Q8 | Q9 |
| SDG team | thumbs up | thumbs up | thumbs up | thumbs up | thumbs up | thumbs up | thumbs up | thumbs up | thumbs up |

Suggested Workflow Variants

  • Instead of just activities/actors, add other components such as data sources. How readily does each model/system adapt to such a change?
  • How do you handle error/stop and continue of workflows?
  • How do you model workflows with multiple iterations?

Suggested Queries

A few other interesting possibilities for queries were:

  • To query for workflow content/output as well as metadata. (This mostly applies to text-based content, and may not apply to this particular workflow).
  • To query/display one branch of graph, but not another based on metadata values. Stop the generation of graph based on conditional statement(s).
  • something to do with format
  • View result data in a different format than originally stored (e.g. view Excel table as an html table).
  • Analysis of workflow which includes content/hash of content.
  • Find instances of AlignWarp version 3.5
  • Find workflows that include actors of some semantic type.
  • Identify bottlenecks in a workflow.
  • Identify failure patterns.
  • Analyze events by execution time
  • Comparison between workflow description and execution provenance

Categorization of queries

We chose to look at possible categorizations based on several different factors: the structure that the query is based on, how the data/metadata was generated, and how the result is intended to be used.

The structure of the query - How has the query been structured? Is it an iterative query, in which a scope is searched for and the result is obtained by applying conditions to it, or a simple query that specifies what to include or exclude?

  • 2-phase query (or recursive)
  • Specifying what to include
  • Specifying what to exclude

Note: the last two seem odd grouped together with 2-phase; I think I would classify structure as iterative or directed.

How data was generated - Some of the queries differ based on how provenance was added to the store. It could be annotations added by the user, generated through workflow execution, produced by outside tools, or auto-generated. Note that this may not (should not?) necessarily change the structure or appearance of the query, merely the type of data that you are querying.

  • User set parameter values
  • Workflow structure/execution capture
  • Outside tools
  • Auto-generated metadata/content

What it will be used for - The queries also differ based on the intent of the query. For exploratory analysis, a full or partial graph is needed; for a directed query, a more limited subset or a single result is returned.

  • Exploratory analysis
  • Directed query to answer a specific question
  • Debugging
  • Verification
  • Comparison

Live systems

If your system can be accessed live (through portal, web page, web service, or other), provide relevant information here.

Further Comments

We have also provided a full result document for descriptions and comments not on this page.

Conclusions

Provide here your conclusions on the challenge, and issues that you like to see discussed at a face to face meeting.

-- TaraGibson - 13 Sep 2006



| Attachment | Size | Date | Who | Comment |
| prov-workflow-sdg.png | 130.1 K | 11 Sep 2006 - 15:46 | TaraGibson | |
| prov-model-sdg.png | 101.9 K | 11 Sep 2006 - 16:28 | TaraGibson | |
| q1.gif | 20.1 K | 12 Sep 2006 - 00:18 | TaraGibson | Query 1 result image |
| q2.gif | 3.4 K | 11 Sep 2006 - 16:20 | TaraGibson | Query 2 result image |
| q3.gif | 3.4 K | 11 Sep 2006 - 16:21 | TaraGibson | Query 3 result image |
| q4.gif | 2.3 K | 11 Sep 2006 - 16:21 | TaraGibson | Query 4 result image |
| q8.gif | 0.8 K | 11 Sep 2006 - 23:56 | TaraGibson | Query 8 result image |
| q9.gif | 5.4 K | 11 Sep 2006 - 16:22 | TaraGibson | Query 9 result image |
| IPAW_Challenge_SDG.doc | 761.0 K | 13 Sep 2006 - 13:59 | TaraGibson | Full Results Document |

Copyright © 1999-2012 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.