Skip to topic | Skip to bottom

Provenance Challenge

Challenge
Challenge.UCDGC

Start of topic | Skip to actions

Provenance Challenge: UCD/Genome Center

Participating Team

Team and Project Details

  • Short team name: UCDGC
  • Participant names: Manish Kumar Anand, Shawn Bowers, Bertram Ludaescher, Sven Kholer, Timothy McPhillips?, Sean Riddle
  • Project Url: http://daks.ucdavis.edu/kepler-ppod/
  • Project Overview: Comad-Kepler provides a complete set of provenance management features. COMAD (Collection-Oriented Modeling and Design) provenance framework supports nested data collections and captures explicit data dependencies introduced by provenance-aware processes at a fine-grained (node) level. COMAD framework automatically captured in a xml file (.trace), through a set of provenance annotations. COMAD framework has been further extended to automatically store provenance information in a relational database., where immediate and transitive closure dependencies for each node and invocation is stored, and the storage cost is reduced by applying a set of reduction techniques. We have also implemented a visualization tool, Provenance Browser, to allow users to browse and navigate workflow execution traces. Thus, COMAD-kepler provenance system provides support for creating scientific workflows, automatically capturing and storing provenance information, querying provenance information through a high-level query language and visualizing relevant views of provenance information.
  • Relevant Publications:

Workflow Representation

pc3-comad-revised.png

The workflow is developed in the Kepler scientific workflow system. It uses the COMAD (Collection-Oriented Modeling and Design) framework, a system which allows nested collections of data to flow through a pipeline of actors, each actor interacting with only those parts of the stream it is interested in. This allows us to model the main aspect of control flow, the iteration over CSVFiles, as a functional map operation; this is very simple in COMAD. One challenge, though, was halting the entire workflow whenever any actor fails. While failures could easily be propagated downstream to actors 'below,' communicating them to upstream actors is challenging. To solve this, the classes simply communicate directly in an out-of-band fashion; if one fails, it informs all the rest, and they then act as passthroughs, just forwarding data and not operating on them further.

Open Provenance Model Output

Exporting COMAD-Kepler provenance traces into OPM traces

We ran the workflow across various data sets such that the workflow halts at each different step (8 halts) and in one case succeeds in executing all steps. The table below states those cases, comad-kepler provenance traces (xml) for the same, corresponding opm trace (xml), opm graph (png) and opm process dependency graph (png).

case comad-kepler trace (xml) opm trace (xml) opm graph (png) opm process dependency graph (png)
Halts at IsCSVReadyFileExists? J062941-halt1-IsCSVReadyFileExists.trace J062941-halt1-IsCSVReadyFileExists.xml J062941-halt1-IsCSVReadyFileExists.png J062941-halt1-IsCSVReadyFileExists_Process.png
Halts at IsMatchCSVFileTables? J062941-halt2-IsMatchCSVFileTables.trace J062941-halt2-IsMatchCSVFileTables.xml J062941-halt2-IsMatchCSVFileTables.png J062941-halt2-IsMatchCSVFileTables_Process.png
Halts at IsExistsCSVFile? J062941-halt3-IsExistsCSVFile.trace J062941-halt3-IsExistsCSVFile.xml J062941-halt3-IsExistsCSVFile.png J062941-halt3-IsExistsCSVFile_Process.png
Halts at IsMatchCSVFileColumnNames? J062941-halt4-IsMatchCSVFileColumnNames.trace J062941-halt4-IsMatchCSVFileColumnNames.xml J062941-halt4-IsMatchCSVFileColumnNames.png J062941-halt4-IsMatchCSVFileColumnNames_Process.png
Halts at LoadCSVFileIntoTable? J062941-halt5-LoadCSVFileIntoTable.trace J062941-halt5-LoadCSVFileIntoTable.xml J062941-halt5-LoadCSVFileIntoTable.png J062941-halt5-LoadCSVFileIntoTable_Process.png
Halts at UpdateComputedColumns? J062941-halt6-UpdateComputedColumns.trace J062941-halt6-UpdateComputedColumns.xml J062941-halt6-UpdateComputedColumns.png J062941-halt6-UpdateComputedColumns_Process.png
Halts at IsMatchTableRowCount? J062941-halt7-IsMatchTableRowCount.trace J062941-halt7-IsMatchTableRowCount.xml J062941-halt7-IsMatchTableRowCount.png J062941-halt7-IsMatchTableRowCount_Process.png
Halts at IsMatchTableColumnRanges? J062941-halt8-IsMatchTableColumnRanges.trace J062941-halt8-IsMatchTableColumnRanges.xml J062941-halt8-IsMatchTableColumnRanges.png J062941-halt8-IsMatchTableColumnRanges_Process.png
Succeeds J062941-success.trace J062941-success.xml J062941-success.png J062941-success_Process.png

J062941-success_Process.png

Importing OPM traces into COMAD-Kepler traces

We imported the opm output from various teams into our comad-kepler trace. We then stored it into our relational system (MySQL?) to query the imported provenance information.

Team opm trace (xml) comad-kepler trace (xml)
UCDGC UCDGC_opm.xml UCDGC_comad.trace
SotonUSCISIPc3 SotonUSCISIPc3_opm.xml SotonUSCISIPc3_comad.trace
TetherlessPC3 TetherlessPC3_opm.xml TetherlessPC3_comad.trace
KCL KCL_opm.xml KCL_comad.trace
PASS3 PASS3_opm.xml PASS3_comad.trace

Query Results

The provenance information (trace) where stored in MySQL?, and queries were written in SQL.

Core Query 1
For a given detection, which CSV files contributed to it?

Basic sample answer: The CSV file containing the Detection table.

Advanced sample answer: The CSV file containing the Detection table, CSV file containing the Image table (as the image is an attribute of the detection), and CSV file containing the FrameMetadata table (as the frame metadata is an attribute of the image).

SELECT value 
FROM runAnnotationView 
WHERE name="file" 
AND targetNodeId IN (SELECT targetNodeId FROM runAnnotationView WHERE value LIKE '%Detection%')
;

Query Result:

+----------------------------------------------------------------------------------------------------------+
| value                                                                                                    |
+----------------------------------------------------------------------------------------------------------+
| 
/Users/manish/Desktop/pc3/datasets/J062941-success/P2_J062941_B001_P2fits0_20081115_P2Detection.csv | 
+----------------------------------------------------------------------------------------------------------+

Core Query 2
The user considers a table to contain values they do not expect. Was the range check (IsMatchTableColumnRanges) performed for this table?

Sample answer: Yes

SELECT "true"
FROM runDataView rd1, runColldata r, runAnnotation ra 
WHERE rd1.name="ccdID" 
AND rd1.value="1"  
AND r.nodeId=rd1.parent
AND ra.targetNodeId=r.parent
AND ra.name="IsMatchTableColumnRanges_success"
;

Query Result:

+------+
| true |
+------+
| true | 
+------+

Core Query 3

Which operation executions were strictly necessary for the Image table to contain a particular (non-computed) value?

Sample answer: call of ReadCSVReadyFile, call of CreateEmptyLoadDB, 2nd call of ReadCSVFileColumnNames, 2nd call of LoadCSVFileIntoTable (2nd calls because Image is loaded in the 2nd iteration of the for loop, excluded checks because they do not change anything, excluded UpdatedComputedColumns because it is non-computed, excluded CompactDatabase because it does not affect the value).

SELECT inv.invocName + inv.invocCount
FROM runitemView rv, dbValue dv, invokedAfterDep invD, invocation inv
WHERE rv.name="ccdID"
AND rv.dbValueId=dv.id
AND dv.value="1"
AND invD.toDbInvocId=rv.pIns
AND inv.dbInvocId=invD.fromDbInvocId
;

Query Result:

+---------------------------+------------+
| invocName                 | invocCount |
+---------------------------+------------+
| SetupTestInputs           |          1 | 
| IsCSVReadyFileExists      |          1 | 
| ReadCSVReadyFile          |          1 | 
| CreateCSVTableFiles       |          1 | 
| IsExistsCSVFile           |          2 | 
| ReadCSVFileColumnNames    |          2 | 
| IsMatchCSVFileColumnNames |          2 | 
| LoadCSVFileIntoTable      |          2 | 
+---------------------------+------------+


Optional Queries ran over workflow execution "_J062941-halt8-IsMatchTableColumnRanges.trace_"

Optional Query #1

The workflow halts due to failing an IsMatchTableColumnRanges check. How many tables successfully loaded before the workflow halted due to a failed check?

Sample answer: 2

SELECT COUNT(*) 
FROM runAnnotation 
WHERE name="IsMatchTableColumnRanges_success"
;

Query Result:

+----------+
| COUNT(*) |
+----------+
|        2 | 
+----------+

Optional Query #3

A CSV or header file is deleted during the workflow's execution. How much time expired between a successful IsMatchCSVFileTables test (when the file existed) and an unsuccessful IsExistsCSVFile? test (when the file had been deleted)?

Sample answer: 3ms

For testing the above query, we it may be simplest to edit the workflow to include deletion of the CSV file as a step.

SELECT ra2.value - ra1.value 
FROM runAnnotationView ra1, runAnnotationView ra2
WHERE ra1.name="IsMatchCSVFileTables_success"
AND ra2.name="halt"
;

Query Result: Difference in time is in ms.

+-----------------------+
| ra2.value - ra1.value |
+-----------------------+
|                  4795 | 
+-----------------------+

Optional Query #4

Why is this entry in the database?

INVOCATION DEPENDENCY CLOSURE

SELECT inv.invocName , inv.invocCount
FROM runitemView rv, dbValue dv, invokedAfterDep invD, invocation inv
WHERE rv.name="ccdID"
AND rv.dbValueId=dv.id
AND dv.value="1"
AND invD.toDbInvocId=rv.pIns
AND inv.dbInvocId=invD.fromDbInvocId
;

Query Result:

+---------------------------+------------+
| invocName                 | invocCount |
+---------------------------+------------+
| SetupTestInputs           |          1 | 
| IsCSVReadyFileExists      |          1 | 
| ReadCSVReadyFile          |          1 | 
| CreateCSVTableFiles       |          1 | 
| IsExistsCSVFile           |          2 | 
| ReadCSVFileColumnNames    |          2 | 
| IsMatchCSVFileColumnNames |          2 | 
| LoadCSVFileIntoTable      |          2 | 
+---------------------------+------------+

DATA DEPENDENCY CLOSURE

SELECT dcv.depNodeId
FROM runitemView rv, dbValue dv, runitemDepcView dcv
WHERE rv.dbValueId=dv.id
AND rv.name="ccdID"
AND dv.value="1"
AND dcv.nodeId=rv.nodeId
;

Query Result:


Empty set 
(for this specific query)

Optional Query #5

A user executes the workflow many times (say 5 times) over different sets of data (j062941, j062942, ... j062945). He wants to determine, which of the execution halted?

Sample answer: workflow_execution_id along with input_data_set_name (say j062941, j062942)

SELECT wfDefName
FROM run r, runAnnotation ra
WHERE ra.name="halt"
AND ra.runId=r.id
;

Query Result:

+-----------------------------------------+
| wfDefName                               |
+-----------------------------------------+
| J062941-halt4-IsMatchCSVFileColumnNames | 
| J062941-halt3-IsExistsCSVFile           | 
| J062941-halt1-IsCSVReadyFileExists      | 
| J062941-halt2-IsMatchCSVFileTables      | 
| J062941-halt8-IsMatchTableColumnRanges  | 
| J062941-halt7-IsMatchTableRowCount      | 
| J062941-halt5-LoadCSVFileIntoTable      | 
+-----------------------------------------+


Optional Query #6

Determine the step where halt occured?

Sample answer: 2nd call of ReadCSVFileColumnNames or 3rd call of LoadCSVFileIntoTable, etc.

SELECT inv.invocName , inv.invocCount
FROM runitemView rv, runAnnotation ra, invocation inv
WHERE ra.name="halt"
AND ra.nodeId=rv.nodeId
AND rv.pIns=inv.dbInvocId
;

Query Result:

+--------------------------+------------+
| invocName                | invocCount |
+--------------------------+------------+
| IsMatchTableColumnRanges |          3 | 
+--------------------------+------------+

Optional Query #7

Determine data and associated granularities of the data being processed, when halt occured?

Sample answer: data_id and associated granularities(column, row, table, file)

SELECT ra.name, ra.value
FROM runCollData r1, runCollData r2, runAnnotationView ra
WHERE r1.numLeft < r2.numLeft
AND r1.numRight > r2.numRight
AND r2.nodeId = (SELECT targetNodeId from runAnnotation WHERE name="halt")
AND r1.nodeId = ra.targetNodeId
;

Query Result:

+-------------------------------+----------------------------------------------------------------------------------------------------------------------------+
| name                          | value                                                                                                                      |
+-------------------------------+----------------------------------------------------------------------------------------------------------------------------+
| db                            | J062941-halt8-IsMatchTableColumnRanges--927058851_LoadDB                                                                   | 
| file                          | /Users/manish/Desktop/pc3/datasets/J062941-halt8-IsMatchTableColumnRanges/P2_J062941_B001_P2fits0_20081115_P2Detection.csv | 
| sql_table                     | P2Detection                                                                                                                | 
| LoadCSVFileIntoTable_success  | 1240629827414                                                                                                              | 
| UpdateComputedColumns_success | 1240629828626                                                                                                              | 
| IsMatchTableRowCount_success  | 1240629828635                                                                                                              | 
+-------------------------------+----------------------------------------------------------------------------------------------------------------------------+

Optional Query #8

Which steps were completed successfully before the halt occurred?

Sample answer: steps name along with call (or count) number, if steps were executed multiple times

SELECT inv.invocName , inv.invocCount
FROM runitemView rv, runAnnotation ra, invocation inv, invokedAfterDep invD
WHERE ra.name="halt"
AND ra.nodeId=rv.nodeId
AND invD.toDbInvocId=rv.pIns
AND inv.dbInvocId=invD.fromDbInvocId
;

Query Result:

+---------------------------+------------+
| invocName                 | invocCount |
+---------------------------+------------+
| SetupTestInputs           |          1 | 
| IsCSVReadyFileExists      |          1 | 
| ReadCSVReadyFile          |          1 | 
| CreateCSVTableFiles       |          1 | 
| IsExistsCSVFile           |          3 | 
| ReadCSVFileColumnNames    |          3 | 
| IsMatchCSVFileColumnNames |          3 | 
| LoadCSVFileIntoTable      |          3 | 
| IsMatchTableColumnRanges  |          3 | 
+---------------------------+------------+

Optional Query #10

For a workflow execution, determine the user inputs?

Sample answer: jobID, CVSRootPath?

SELECT r.nodeId, r.itemType, r.name, r.type, dv.value 
FROM runitemView r, invocation inv, dbValue dv 
WHERE r.pins = inv.dbInvocId
AND inv.invocName="SetupTestInputs"
AND dv.id=r.dbValueId
;

Query Result:

+--------+----------+----------+-------------+---------------------------------------------------------------------------+
| nodeId | itemType | name     | type        | value                                                                     |
+--------+----------+----------+-------------+---------------------------------------------------------------------------+
|   7327 | data     | db_name  | StringToken | J062941-halt8-IsMatchTableColumnRanges--927058851                         | 
|   7328 | data     | data_dir | StringToken | /Users/manish/Desktop/pc3/datasets/J062941-halt8-IsMatchTableColumnRanges | 
+--------+----------+----------+-------------+---------------------------------------------------------------------------+

Optional Query #11

For a workflow execution, determine steps that required user inputs?

Sample answer: IsCVSReadyFileExists?, CreateEmptyLoadDB?

SELECT inv2.invocName, inv2.invocCount, dv.value
FROM runitemview r, invocation inv, dependencyView d, runitemView rv, dbValue dv, invocation inv2 
WHERE r.pins = inv.dbInvocId
AND inv.invocName="SetupTestInputs"
AND d.nodeId=r.nodeId
AND rv.pDep=d.pDep
AND dv.id=r.dbvalueId
AND rv.pIns=inv2.dbInvocId
;

Query Result:

+----------------------+------------+---------------------------------------------------------------------------+
| invocName            | invocCount | value                                                                     |
+----------------------+------------+---------------------------------------------------------------------------+
| CreateEmptyLoadDB    |          2 | J062941-halt8-IsMatchTableColumnRanges--927058851                         | 
| IsCSVReadyFileExists |          1 | /Users/manish/Desktop/pc3/datasets/J062941-halt8-IsMatchTableColumnRanges | 
| ReadCSVReadyFile     |          1 | /Users/manish/Desktop/pc3/datasets/J062941-halt8-IsMatchTableColumnRanges | 
+----------------------+------------+---------------------------------------------------------------------------+

Optional Query #13

_For a workflow execution, display the following provenance views: data dependency views, step dependency view?

Sample answer: either display these views in form of a directed graph, or as a pair of dependency relations between data or steps.

DATA DEPENDENCY VIEW (RELATIONS)

SELECT nodeId, depNodeId 
FROM runitemDepView 
;

Query Result:

Dependency Relations between nodes

PROCESS DEPENDENCY VIEW (RELATIONS)

SELECT fromdbInvocId, todbInvocId
FROM immedInvokedAfter
;

Query Result:

Dependency Relations between processes

Suggested Workflow Variants

Suggested Queries

Suggestions for Modification of the Open Provenance Model

Conclusions

-- ManishAnand - 25 Apr 2009
to top

I Attachment sort Action Size Date Who Comment
pc3-comad-revised.pdf manage 104.1 K 21 Apr 2009 - 23:40 SeanRiddle  
pc3-comad-revised.png manage 131.4 K 21 Apr 2009 - 23:40 SeanRiddle  
pc3-comad-revised.xml manage 19.2 K 21 Apr 2009 - 23:41 SeanRiddle  
J062941-success.trace manage 198.3 K 25 Apr 2009 - 04:51 ManishAnand  
J062941-success.xml manage 425.2 K 25 Apr 2009 - 14:49 ManishAnand  
J062941-success_Process.png manage 165.6 K 07 May 2009 - 21:43 ManishAnand  
J062941-success.png manage 443.1 K 25 Apr 2009 - 04:58 ManishAnand  
J062941-halt1-IsCSVReadyFileExists.trace manage 6.0 K 25 Apr 2009 - 05:09 ManishAnand  
J062941-halt1-IsCSVReadyFileExists.xml manage 2.9 K 25 Apr 2009 - 14:54 ManishAnand  
J062941-halt1-IsCSVReadyFileExists_Process.png manage 12.2 K 25 Apr 2009 - 05:23 ManishAnand  
J062941-halt1-IsCSVReadyFileExists.png manage 32.3 K 25 Apr 2009 - 05:35 ManishAnand  
J062941-halt2-IsMatchCSVFileTables.trace manage 9.8 K 25 Apr 2009 - 05:14 ManishAnand  
J062941-halt3-IsExistsCSVFile.trace manage 13.1 K 25 Apr 2009 - 05:15 ManishAnand  
J062941-halt4-IsMatchCSVFileColumnNames.trace manage 20.3 K 25 Apr 2009 - 05:15 ManishAnand  
J062941-halt5-LoadCSVFileIntoTable.trace manage 56.7 K 25 Apr 2009 - 05:15 ManishAnand  
J062941-halt6-UpdateComputedColumns.trace manage 198.6 K 25 Apr 2009 - 05:15 ManishAnand  
J062941-halt7-IsMatchTableRowCount.trace manage 196.2 K 25 Apr 2009 - 05:16 ManishAnand  
J062941-halt8-IsMatchTableColumnRanges.trace manage 198.0 K 25 Apr 2009 - 05:16 ManishAnand  
J062941-small.trace manage 14.8 K 25 Apr 2009 - 05:16 ManishAnand  
J062941-halt2-IsMatchCSVFileTables.xml manage 12.1 K 25 Apr 2009 - 14:54 ManishAnand  
J062941-halt3-IsExistsCSVFile.xml manage 19.9 K 25 Apr 2009 - 14:54 ManishAnand  
J062941-halt4-IsMatchCSVFileColumnNames.xml manage 33.2 K 25 Apr 2009 - 14:54 ManishAnand  
J062941-halt5-LoadCSVFileIntoTable.xml manage 114.8 K 25 Apr 2009 - 14:53 ManishAnand  
J062941-halt6-UpdateComputedColumns.xml manage 425.4 K 25 Apr 2009 - 14:52 ManishAnand  
J062941-halt7-IsMatchTableRowCount.xml manage 419.8 K 25 Apr 2009 - 14:52 ManishAnand  
J062941-halt8-IsMatchTableColumnRanges.xml manage 424.0 K 25 Apr 2009 - 14:51 ManishAnand  
J062941-halt2-IsMatchCSVFileTables_Process.png manage 26.4 K 25 Apr 2009 - 05:24 ManishAnand  
J062941-halt3-IsExistsCSVFile_Process.png manage 68.7 K 25 Apr 2009 - 05:24 ManishAnand  
J062941-halt4-IsMatchCSVFileColumnNames_Process.png manage 90.5 K 25 Apr 2009 - 05:26 ManishAnand  
J062941-halt5-LoadCSVFileIntoTable.png manage 3191.6 K 25 Apr 2009 - 05:47 ManishAnand  
J062941-halt5-LoadCSVFileIntoTable_Process.png manage 87.6 K 25 Apr 2009 - 05:29 ManishAnand  
J062941-halt6-UpdateComputedColumns_Process.png manage 111.0 K 25 Apr 2009 - 05:30 ManishAnand  
J062941-halt7-IsMatchTableRowCount_Process.png manage 93.5 K 25 Apr 2009 - 05:31 ManishAnand  
J062941-halt8-IsMatchTableColumnRanges_Process.png manage 105.8 K 25 Apr 2009 - 05:32 ManishAnand  
J062941-small.png manage 428.0 K 25 Apr 2009 - 05:33 ManishAnand  
J062941-small_Process.png manage 88.8 K 25 Apr 2009 - 05:34 ManishAnand  
J062941-halt2-IsMatchCSVFileTables.png manage 192.1 K 25 Apr 2009 - 05:37 ManishAnand  
J062941-halt3-IsExistsCSVFile.png manage 364.8 K 25 Apr 2009 - 05:37 ManishAnand  
J062941-halt4-IsMatchCSVFileColumnNames.png manage 654.8 K 25 Apr 2009 - 05:39 ManishAnand  
J062941-halt6-UpdateComputedColumns.png manage 390.2 K 25 Apr 2009 - 05:46 ManishAnand  
J062941-halt7-IsMatchTableRowCount.png manage 535.9 K 25 Apr 2009 - 05:50 ManishAnand  
J062941-halt8-IsMatchTableColumnRanges.png manage 390.6 K 25 Apr 2009 - 05:54 ManishAnand  
J062941-small.xml manage 23.5 K 25 Apr 2009 - 14:51 ManishAnand  
UCDGC_comad.trace manage 195.0 K 27 Apr 2009 - 23:00 ManishAnand  
SotonUSCISIPc3_comad.trace manage 18.3 K 27 Apr 2009 - 23:01 ManishAnand  
TetherlessPC3_comad.trace manage 13.3 K 27 Apr 2009 - 23:01 ManishAnand  
PASS3_comad.trace manage 1610.8 K 27 Apr 2009 - 23:12 ManishAnand  
SotonUSCISIPc3_opm.xml manage 117.9 K 27 Apr 2009 - 23:21 ManishAnand  
TetherlessPC3_opm.xml manage 38.6 K 27 Apr 2009 - 23:21 ManishAnand  
PASS3_opm.xml manage 5618.4 K 27 Apr 2009 - 23:21 ManishAnand  
KCL_opm.xml manage 350.8 K 27 Apr 2009 - 23:22 ManishAnand  
KCL_comad.trace manage 160.4 K 27 Apr 2009 - 23:32 ManishAnand  
UCDGC_opm.xml manage 425.2 K 27 Apr 2009 - 23:41 ManishAnand  

Copyright © 1999-2012 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback