Provenance Challenge

Provenance Challenge: SDSC

Participating Team

Team and Project Details

Workflow Representation

Provenance Challenge 3 workflow PNG

Open Provenance Model Output

Our OPM output is attached below: pc3-J062941.good.xml for a successful execution, and pc3-J062941.opt-query3.xml for a failed execution (IsExistsCSVFile fails). The output is XML using the OPM v1.01.a schema by Paul Groth and Luc Moreau. The opm2dot graph for the successful execution is attached as j41.png.

Query Results

We implemented our queries in XQuery 1.0. For each query, we load the provenance XML document into $graph. Additionally, we created a library (called opmLib) that contains utility queries, such as getAllAncestorProcesses() and getArtifactIdsThatContainsValue().
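The opmLib source is not reproduced on this page. As a rough illustration only, the sketch below shows how two such utilities could be written in XQuery 1.0. The function bodies, the local: prefix, the helper local:ancestorProcessIds, the opmGraph root element name, and the toy graph are all assumptions made for this example; only the two utility names and the element paths (artifacts/artifact, causalDependencies/used, cause/@id, effect/@id) are taken from the queries on this page.

```xquery
xquery version "1.0";

(: Hypothetical re-implementation; the real opmLib may differ. :)

(: Ids of all artifacts whose value contains $text. :)
declare function local:getArtifactIdsThatContainsValue(
    $graph as element(), $text as xs:string) as xs:string*
{
  for $artifact in $graph/artifacts/artifact
  where $artifact/value[contains(., $text)]
  return string($artifact/@id)
};

(: Helper (not an opmLib name): walk backwards from artifacts to the
   processes that generated them, then to the artifacts those
   processes used, until no unvisited process remains. :)
declare function local:ancestorProcessIds(
    $graph as element(), $artifactIds as xs:string*,
    $seen as xs:string*) as xs:string*
{
  let $deps := $graph/causalDependencies
  let $procIds := distinct-values(
      $deps/wasGeneratedBy[effect/@id = $artifactIds]/cause/@id)[not(. = $seen)]
  return
    if (empty($procIds)) then $seen
    else local:ancestorProcessIds(
           $graph,
           distinct-values($deps/used[effect/@id = $procIds]/cause/@id),
           ($seen, $procIds))
};

(: All processes that (transitively) led to the given artifacts. :)
declare function local:getAllAncestorProcesses(
    $graph as element(), $artifactIds as xs:string*) as element()*
{
  let $ids := local:ancestorProcessIds($graph, $artifactIds, ())
  return $graph/processes/process[@id = $ids]
};

(: Toy graph for illustration only, not taken from our OPM output. :)
let $graph :=
  <opmGraph>
    <processes>
      <process id="_p0"><value>ReadCSVReadyFile</value></process>
      <process id="_p1"><value>LoadCSVFileIntoTable</value></process>
    </processes>
    <artifacts>
      <artifact id="a0"><value>"J062941"</value></artifact>
      <artifact id="a1"><value>TargetTable = "P2Detection"</value></artifact>
      <artifact id="a2"><value>261887437030025141</value></artifact>
    </artifacts>
    <causalDependencies>
      <used><effect id="_p0"/><role value="in"/><cause id="a0"/></used>
      <wasGeneratedBy><effect id="a1"/><role value="FileEntry"/><cause id="_p0"/></wasGeneratedBy>
      <used><effect id="_p1"/><role value="FileEntry"/><cause id="a1"/></used>
      <wasGeneratedBy><effect id="a2"/><role value="Detections"/><cause id="_p1"/></wasGeneratedBy>
    </causalDependencies>
  </opmGraph>
let $ids := local:getArtifactIdsThatContainsValue($graph, "261887437030025141")
return local:getAllAncestorProcesses($graph, $ids)
```

On the toy graph this should return both process elements, since _p1 generated the detection artifact and _p0 generated the artifact that _p1 used. The $seen accumulator is there only to guarantee termination if a graph happens to contain a cycle.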

Query 1

For a given detection, which CSV files contributed to it?

LoadCSVFileIntoTable tells the database to import the detections directly from a file. Since we did not instrument the database, we added an output to LoadCSVFileIntoTable, called Detections, which outputs the detection values. We can then query for a specific detection value, e.g., 261887437030025141.

(: get the artifact id containing the detection :)
let $artifactId := opmLib:getArtifactIdsThatContainsValue($graph, "261887437030025141")

(: get the used relations of the process that generated it :)
let $inputs := opmLib:getImmediateAncestorUseds($graph, $artifactId)

(: get the artifact with role FileEntry :)
for $used in $inputs, $artifact in $graph/artifacts/artifact
where $used/role/@value = "FileEntry" and
    $used/cause/@id = $artifact/@id
return $artifact/value

The output is the FileEntry used by LoadCSVFileIntoTable. This is a composite artifact; support for accessing sub-artifacts would allow extracting just the file name.

Output:

<value>
  {Checksum = "f8f9d70711cb3a1cb8b359d99d98fa63", 
   ColumnNames = {"objID", "detectID", "ippObjID", "ippDetectID", "filterID", "imageID", "obsTime", "xPos", "yPos", "xPosErr", "yPosErr", "instFlux",
   "instFluxErr", "psfWidMajor", "psfWidMinor", "psfTheta", "psfLikelihood", "psfCf", "infoFlag", "htmID", "zoneID", "assocDate", "modNum", "ra",
   "dec", "raErr", "decErr", "cx", "cy", "cz", "peakFlux", "calMag", "calMagErr", "calFlux", "calFluxErr", "calColor", "calColorErr", "sky",
   "skyErr", "sgSep", "dataRelease"}, 
   FilePath = "pc3/workflows/data/J062941/P2_J062941_B001_P2fits0_20081115_P2Detection.csv",
   HeaderPath = "pc3/workflows/data/J062941/P2_J062941_B001_P2fits0_20081115_P2Detection.csv.hdr", 
   RowCount = 20, 
   TargetTable = "P2Detection"}
</value>
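Until sub-artifacts can be accessed directly, one possible workaround is plain string slicing on the serialized value. The sketch below is not part of our submission; the function name local:getFilePath is hypothetical, and it assumes the composite value always contains a FilePath = "..." entry formatted as in the output above.

```xquery
xquery version "1.0";

(: Hypothetical helper: pull the FilePath entry out of a serialized
   FileEntry value. Returns the empty string if no entry is found. :)
declare function local:getFilePath($value as xs:string) as xs:string
{
  substring-before(substring-after($value, 'FilePath = "'), '"')
};

(: Shortened copy of the value shown above. :)
local:getFilePath(
  '{FilePath = "pc3/workflows/data/J062941/P2_J062941_B001_P2fits0_20081115_P2Detection.csv", RowCount = 20}')
```

For this input the result is pc3/workflows/data/J062941/P2_J062941_B001_P2fits0_20081115_P2Detection.csv. The search string 'FilePath = "' does not match the HeaderPath entry, so the header file is not picked up by mistake.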

Query 2

The user considers a table to contain values they do not expect. Was the range check (IsMatchTableColumnRanges) performed for this table?

(: find artifact containing table name :)
let $artifactIds := opmLib:getArtifactIdsThatContainsValue($graph, 'TargetTable = "P2Detection"')

(: find the one used by LoadCSVFileIntoTable :)
let $artifactId := opmLib:getArtifactIdsUsedByProcessValue($graph, $artifactIds, "LoadCSVFileIntoTable")

(: see if any descendant processes were IsMatchTableColumnRanges :)
let $found := (for $process in opmLib:getAllDescendantProcesses($graph, $artifactId)
               where contains($process/value, "IsMatchTableColumnRanges")
               return $process)

return if(count($found) = 0) then "no" else "yes"

Output:

yes

Query 3

Which operation executions were strictly necessary for the Image table to contain a particular (non-computed) value?

(: find artifacts containing the image table name :)
let $artifactIds := opmLib:getArtifactIdsThatContainsValue($graph, 'TargetTable = "P2ImageMeta"')

(: get the artifact id that was used by LoadCSVFileIntoTable :)
let $id := (for $id in $artifactIds,
                $used in $graph/causalDependencies/used,
                $process in $graph/processes/process
            where $id = $used/cause/@id and
                $used/effect/@id = $process/@id and
                contains($process/value, "LoadCSVFileIntoTable")
            return $id)

(: return all processes that led to that artifact :)
return opmLib:getAllAncestorProcesses($graph, $id)

Output:

<process id="_p0">
    <value>.load.IsCSVReadyFileExists fire 0</value>
</process>
<process id="_p1">
    <value>.load.StopOnFalse fire 0</value>
</process>
<process id="_p2">
    <value>.load.ReadCSVReadyFile fire 0</value>
</process>
<process id="_p3">
    <value>.load.IsMatchCSVFileTables fire 0</value>
</process>
<process id="_p4">
    <value>.load.StopOnFalse2 fire 0</value>
</process>
<process id="_p5">
    <value>.load.CreateEmptyLoadDB fire 0</value>
</process>
<process id="_p6">
    <value>.load.Array Permute fire 0</value>
</process>
<process id="_p8">
    <value>.load.ForEach.in fire 0</value>
</process>
<process id="_p27">
    <value>.load.ForEach.CompositeActor.in fire 3</value>
</process>
<process id="_p28">
    <value>.load.ForEach.CompositeActor.Record Disassembler fire 1</value>
</process>
<process id="_p29">
    <value>.load.ForEach.CompositeActor.IsExistsCSVFile fire 1</value>
</process>
<process id="_p30">
    <value>.load.ForEach.CompositeActor.StopOnFalse fire 1</value>
</process>
<process id="_p31">
    <value>.load.ForEach.CompositeActor.ReadCSVFileColumnNames fire 1</value>
</process>

Optional Query 1

The workflow halts due to failing an IsMatchTableColumnRanges check. How many tables successfully loaded before the workflow halted due to a failed check?

(: count how many IsMatchTableColumnRangesOutput artifacts were generated,
   i.e., how many times the range check ran. :)
let $num := count(for $wgbs in $graph/causalDependencies/wasGeneratedBy
                  where $wgbs/role/@value = "IsMatchTableColumnRangesOutput"
                  return $wgbs)

(: since it halted, n - 1 tables were loaded. :)
return $num - 1

Output:

2

Optional Query 3

A CSV or header file is deleted during the workflow's execution. How much time expired between a successful IsMatchCSVFileTables test (when the file existed) and an unsuccessful IsExistsCSVFile test (when the file had been deleted)?

(: find the wasGeneratedBy of the false output from IsExistsCSVFile :)
let $fail := opmLib:getWasGeneratedBy($graph, "IsExistsCSVFile", "false")/time
            
(: find the wasGeneratedBy of the true output from IsMatchCSVFileTables :)
let $ok := opmLib:getWasGeneratedBy($graph, "IsMatchCSVFileTables", "true")/time

(: return elapsed seconds :)
let $diff := xs:time($fail/noLaterThan) - xs:time($ok/noEarlierThan)
return $diff div xs:dayTimeDuration('PT1S')

Output:

1.562
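The last two lines of the query rely on XQuery duration arithmetic: subtracting two xs:time values yields an xs:dayTimeDuration, and dividing that by a one-second duration yields a plain decimal number of seconds. A minimal standalone illustration, using made-up timestamps rather than values from our provenance:

```xquery
xquery version "1.0";

(: Made-up timestamps for illustration only. :)
let $ok   := xs:time("10:15:00")
let $fail := xs:time("10:15:02.250")

(: time - time gives an xs:dayTimeDuration (here PT2.25S);
   dividing by PT1S converts it to seconds. :)
return ($fail - $ok) div xs:dayTimeDuration("PT1S")
```

This evaluates to 2.25. Note that xs:time carries no date, so the difference would come out negative if the two events straddled midnight; if the recorded timestamps are full date-times, subtracting them as xs:dateTime would avoid that.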

Optional Query 6

Determine the step where the halt occurred.

(: get the last used or wasGeneratedBy relation :)
let $last := $graph/causalDependencies/(used|wasGeneratedBy)[last()]

let $processId := if(name($last) = "used") then $last/effect/@id else $last/cause/@id

return $graph/processes/process[@id=$processId]

Output:

<process id="_p13">
    <value>.load-for-opt-query3.ForEach.CompositeActor.StopOnFalse fire 0</value>
</process>

Optional Query 8

Which steps were completed successfully before the halt occurred?

(: get the last used or wasGeneratedBy relation :)
let $last := $graph/causalDependencies/(used|wasGeneratedBy)[last()]

let $artifactId := if(name($last) = "used") then $last/cause/@id else $last/effect/@id

return opmLib:getAllAncestorProcesses($graph, $artifactId)

Output:

<process id="_p0">
    <value>.load-for-opt-query3.IsCSVReadyFileExists fire 0</value>
</process>
<process id="_p1">
    <value>.load-for-opt-query3.StopOnFalse fire 0</value>
</process>
<process id="_p2">
    <value>.load-for-opt-query3.ReadCSVReadyFile fire 0</value>
</process>
<process id="_p3">
    <value>.load-for-opt-query3.IsMatchCSVFileTables fire 0</value>
</process>
<process id="_p4">
    <value>.load-for-opt-query3.StopOnFalse2 fire 0</value>
</process>
<process id="_p5">
    <value>.load-for-opt-query3.CreateEmptyLoadDB fire 0</value>
</process>
<process id="_p6">
    <value>.load-for-opt-query3.Array Permute fire 0</value>
</process>
<process id="_p8">
    <value>.load-for-opt-query3.ForEach.in fire 0</value>
</process>
<process id="_p10">
    <value>.load-for-opt-query3.ForEach.CompositeActor.in fire 1</value>
</process>
<process id="_p11">
    <value>.load-for-opt-query3.ForEach.CompositeActor.Record Disassembler fire 0</value>
</process>
<process id="_p12">
    <value>.load-for-opt-query3.ForEach.CompositeActor.IsExistsCSVFileFail fire 0</value>
</process>

Optional Query 10

For a workflow execution, determine the user inputs.

(: find all artifacts in a used relation, but not in a wasGeneratedBy
   relation.
:)

let $used := $graph/causalDependencies/used/cause/@id

let $wasGeneratedBy := $graph/causalDependencies/wasGeneratedBy/effect/@id

(: find the difference :)
let $diff := distinct-values($used[not(.=$wasGeneratedBy)])

(: return the artifacts :)
return $graph/artifacts/artifact[@id=$diff]

Output:

<artifact id="0">
    <value>"pc3/workflows/data/J062941"</value>
</artifact>
<artifact id="6">
    <value>"J062941"</value>
</artifact>
<artifact id="8">
    <value>"Record"</value>
</artifact>
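The "find the difference" step works because a general comparison such as . = $wasGeneratedBy is true if the context item equals any member of the sequence, so negating it inside a predicate keeps only the ids that were never generated. A small standalone illustration with made-up ids (not the ids from our output):

```xquery
xquery version "1.0";

(: Made-up id lists for illustration only. :)
let $used := ("0", "6", "8", "12", "6")
let $wasGeneratedBy := ("12", "13")

(: keep ids that are used but never generated, without duplicates :)
return distinct-values($used[not(. = $wasGeneratedBy)])
```

This yields the three values 0, 6 and 8; the order in which distinct-values returns them is implementation-dependent.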

Optional Query 11

For a workflow execution, determine the steps that required user inputs.

(: get artifacts ids of user inputs (from optional query 10) :)
    
let $used := $graph/causalDependencies/used/cause/@id    
let $wasGeneratedBy := $graph/causalDependencies/wasGeneratedBy/effect/@id   
let $diff := distinct-values($used[not(.=$wasGeneratedBy)])
    
(: find processes that directly used these artifacts :)
return opmLib:getImmediateUsedProcessesForArtifactId($graph, $diff)

Output:

<process id="_p0">
    <value>.load.IsCSVReadyFileExists fire 0</value>
</process>
<process id="_p2">
    <value>.load.ReadCSVReadyFile fire 0</value>
</process>
<process id="_p5">
    <value>.load.CreateEmptyLoadDB fire 0</value>
</process>
<process id="_p6">
    <value>.load.Array Permute fire 0</value>
</process>

Suggested Workflow Variants

Suggested Queries

Suggestions for Modification of the Open Provenance Model

Conclusions

-- DanielCrawl - 31 Mar 2009


Attachment | Size | Date | Who | Comment
pc3-load.png | 43.5 K | 14 Apr 2009 - 16:46 | DanielCrawl | screenshot of load workflow
pc3-J062941.out.xml | 96.7 K | 18 May 2009 - 22:43 | DanielCrawl | provenance for J062941 INVALID
pc3-J062941.opt-query3.xml | 22.1 K | 18 May 2009 - 22:36 | DanielCrawl | provenance for J062941 that fails
j41.png | 941.2 K | 01 May 2009 - 01:07 | DanielCrawl | opm2dot of pc3-J062941.out.xml
pc3-J062941.good.xml | 96.7 K | 18 May 2009 - 22:45 | DanielCrawl | provenance for J062941

Copyright © 1999-2012 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.