Provenance Challenge: University of Manchester, School of Computer Science

Team and Project Details

Project overview: The PC3 project is undertaken as part of the myGrid project, and in particular of the Taverna scientific workflow management system. Taverna 2.1, soon to be released, will offer a provenance API that allows third-party developers to access the Taverna provenance DB. In particular, the API supports queries on the provenance graph for a workflow run (one run at a time), by specifying (a) the workflow ports for which provenance information is sought, and (b) the workflow tasks ("processors") where provenance information should be reported. This gives users the option to focus provenance information only on the portions of the provenance graph that are of interest.

My immediate plan is to incorporate OPM graph generation functionality into the existing provenance query algorithm. This will be followed by a corresponding import functionality into our internal provenance model (relational).

The former is done and the OPM graphs below are an example. The latter is still in the works and hopefully will be ready in time for the PC3 f2f meeting.

The ability to fine-tune a provenance query means that Taverna can generate and export only the portion of the entire OPM provenance graph that is relevant to answering the query. In practice, the query is answered by our system, and the OPM graph represents the answer to the query. Examples of this are also given below. Note that the entire provenance graph is obtained simply by completely "unfocusing" the query, i.e., it corresponds to the degenerate query on all the output ports and all the workflow tasks.

Slideset presented at the Amsterdam meeting, June 11, 2009:

Workflow representation

The PAN-STARRS workflow is defined as a sequence of steps with exit conditions. While this is naturally expressed using an imperative language, the Taverna workflow model is that of a dataflow, i.e., the entire workflow execution is data-driven and there are (essentially) no control structures. This makes the implementation of the PAN-STARRS workflow less than obvious, in that, for example, one cannot simply "halt" a workflow in response to a data error condition. Taverna's treatment of data errors involves propagating the error from the point where it is generated through the remaining dataflow graph. Errors are treated just like any other piece of data, but can be distinguished as representing an error condition. Thus, the workflow continues to execute, but the workflow controller prevents Taverna processors (i.e., workflow tasks) from executing when any of their inputs are errors, immediately returning the same error values instead. The overall behaviour is that of a dataflow where errors are spotted but do not halt execution.
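
The error-propagation behaviour described above can be illustrated with a minimal Python sketch. This is not Taverna code: the names ErrorValue and run_processor and the two toy processors are hypothetical, and converting an exception into an error value is an assumption made only for the illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Union


@dataclass(frozen=True)
class ErrorValue:
    """A data item that stands for an error and carries its description."""
    description: str


Value = Union[str, ErrorValue]


def run_processor(func: Callable[..., Value], inputs: Dict[str, Value]) -> Value:
    """Run func, unless one of its inputs is an error: then return that error."""
    for value in inputs.values():
        if isinstance(value, ErrorValue):
            return value  # the processor is skipped, the same error flows on
    try:
        return func(**inputs)
    except Exception as exc:  # assumption: a failure becomes an error value
        return ErrorValue(str(exc))


# Two hypothetical processors chained in a dataflow.
def check_csv_file(path: str) -> str:
    if not path.endswith(".csv"):
        raise ValueError(f"not a CSV file: {path}")
    return path


def load_into_table(file_entry: str) -> str:
    return f"loaded {file_entry}"


first = run_processor(check_csv_file, {"path": "detections.txt"})
second = run_processor(load_into_table, {"file_entry": first})

# The workflow ran to the end; the error is visible as an ordinary output.
load_failed = isinstance(second, ErrorValue)
```

Here the second processor never executes, yet the workflow is not halted: the error value simply reaches the output, where a boolean such as load_failed can report it.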

In my rendering of the challenge workflow I provide one boolean output for each possible error condition. In addition, error values appear in the provenance trace as "error values". All intermediate Taverna values are identified using URIs of the form "t2:ref//dataref.taverna.org?". In particular, errors are values that contain a description of the error, and are identified by a URI of the form "t2:error//dataref.taverna.org?". This makes it easy to trace and query errors.
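
As a small illustration of why the URI scheme makes errors easy to query, the sketch below filters a set of artifacts by URI prefix. The function names and the sample artifact IDs and values are made up for the example; only the two prefixes come from the text above.

```python
from typing import Dict

REF_PREFIX = "t2:ref//dataref.taverna.org?"
ERROR_PREFIX = "t2:error//dataref.taverna.org?"


def is_error(uri: str) -> bool:
    """True when the URI identifies an error value."""
    return uri.startswith(ERROR_PREFIX)


def error_artifacts(artifacts: Dict[str, str]) -> Dict[str, str]:
    """Keep only the artifacts whose identifier marks them as errors."""
    return {uri: value for uri, value in artifacts.items() if is_error(uri)}


# Hypothetical artifacts: IDs and values are invented for the illustration.
sample = {
    REF_PREFIX + "a1": "true",
    ERROR_PREFIX + "a2": "file not found",
}
errors = error_artifacts(sample)
```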

A graphical depiction of the challenge workflow appears in the figure below:

  • Taverna version of the challenge workflow (PNG)

The workflow is also available here through the myExperiment workflow repository, part of the myGrid project.

Open Provenance Model Output

The output is in RDF and is produced using the Tupelo provenance API, courtesy of Joe Futrelle at NCSA.

The OPM graphs below, in RDF/XML format, represent (1) the complete OPM graph (the result of the "fully unfocused query" alluded to above), and (2,3,4) for each of the first 3 provenance queries, a graph that contains enough information to answer the query, but does not include parts of the graph that are not relevant for the query. This is done to show how the queries can be answered by Taverna on its native provenance system, and the resulting subgraphs are then exported to OPM.

SPARQL queries

Below are a few simple utility SPARQL queries that may be useful to inspect the OPM RDF graph:

  • allArtifacts.sparql: List all artifacts
  • allProcesses.sparql: List all processes
  • WhoUsedWhat.sparql: List Who Used What (with roles)
  • WhoWasGeneratedByWhom.sparql: List Who Generated What (with roles)
  • processesAndIterations.sparql: List all iterations for all processes

A note on Taverna provenance queries

A provenance query in Taverna consists of two parts:

  1. a set of ports (output processor variables) whose provenance we are interested in. Optionally, one can also specify an iteration step, i.e., when the value is a list whose elements are produced during iterations. We refer to these using the notation:
target = processor/variable/iteration [,processor/variable/iteration]* | ALL

(in reality the grammar allows for a few more options, but this will suffice here)

  2. a set of selected processors, where we want provenance to be reported. This is done to avoid unnecessary noise in the query answer, i.e., we "jump" over uninteresting processors. We refer to these using the notation:
selected = processor [, processor]* | ALL

For example:

target = TOP / LoadCSVFileIntoTableOutput / 2 denotes element 2 of the value on port LoadCSVFileIntoTableOutput of the top-level Taverna workflow (note that workflows can be nested)

selected = LoadCSVFileIntoTable, IsMatchTableColumnRanges

This query returns the lineage of the value specified in the targets, but only on the input and output ports of processors LoadCSVFileIntoTable and IsMatchTableColumnRanges.
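
To make the notation concrete, here is a small Python sketch of a parser for the simplified grammar given above. It is not the parser used by Taverna; the names Target, parse_targets and parse_selected are hypothetical, and representing a bare ALL as Target("ALL", "ALL", None) is an assumption of the sketch.

```python
from typing import List, NamedTuple, Optional


class Target(NamedTuple):
    processor: str
    variable: str              # port name, or "ALL"
    iteration: Optional[int]   # None means every iteration


def parse_targets(text: str) -> List[Target]:
    """Parse 'processor/variable/iteration [, ...]* | ALL'."""
    text = text.strip()
    if text == "ALL":
        return [Target("ALL", "ALL", None)]
    targets = []
    for item in text.split(","):
        parts = [part.strip() for part in item.split("/")]
        if len(parts) == 2:
            parts.append("ALL")  # the iteration step is optional
        if len(parts) != 3 or not all(parts):
            raise ValueError(f"malformed target: {item!r}")
        processor, variable, iteration = parts
        index = None if iteration == "ALL" else int(iteration)
        targets.append(Target(processor, variable, index))
    return targets


def parse_selected(text: str) -> List[str]:
    """Parse 'processor [, processor]* | ALL'."""
    names = [name.strip() for name in text.split(",")]
    if not all(names):
        raise ValueError(f"malformed processor list: {text!r}")
    return names


targets = parse_targets("TOP / LoadCSVFileIntoTableOutput / 2")
selected = parse_selected("LoadCSVFileIntoTable, IsMatchTableColumnRanges")
```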

Complete OPM graph

  1. OPMGraph-complete.rdf: Complete provenance graph (RDF/XML) for one successful run of the PAN-STARRS workflow
  2. OPMGraph-complete.xml: Complete provenance graph (XML) for one successful run of the PAN-STARRS workflow
  3. OPMGraph-complete.dot: Complete provenance graph (dot) for one successful run of the PAN-STARRS workflow
  4. OPMGraph-complete.png: Complete provenance graph (PNG) for one successful run of the PAN-STARRS workflow

Query 1: For a given detection, which CSV files contributed to it?

Note: this will be updated to reflect Paul's latest qualification of the purpose of the query.

This maps to the Taverna provenance query: target = TOP/LoadCSVFileIntoTableOutput/2, selected = LoadCSVFileIntoTable

and generates the following OPM graph:

  1. OPMGraph-query1.xml: Provenance graph (XML) as an answer to provenance query 1
  2. OPMGraph-query1.dot: Provenance graph (dot) as an answer to provenance query 1
  3. OPMGraph-query1.rdf: Provenance graph (RDF) as an answer to provenance query 1
  4. OPMGraph-query1.png: Provenance graph (PNG) as an answer to provenance query 1

Using the SPARQL queries listed above to inspect the graph, one obtains the artifacts:

------------------------------------------------------------------------------------------------------------------------------------------------------
| artifact                             | value                                                                                                       |
======================================================================================================================================================
| <t2:ref//dataref.taverna.org?test20> | "J062941_LoadDB"                                                                                            |
| <t2:ref//dataref.taverna.org?test21> | "true"                                                                                                      |
| <t2:ref//dataref.taverna.org?test15> | "/Users/paolo/Documents/myGRID/OPM/PC3/SampleData/J062941/P2_J062941_B001_P2fits0_20081115_P2Detection.csv" |
------------------------------------------------------------------------------------------------------------------------------------------------------

where the third entry is the file that contributes to the detection. One can inspect the Used relations:

--------------------------------------------------------------------------------------------------------------------------------------------------------
| process                                            | usedArtifact                         | role                                  | processIteration |
========================================================================================================================================================
| "http://taverna.opm.org/LoadCSVFileIntoTable?it=2" | "t2:ref//dataref.taverna.org?test15" | "LoadCSVFileIntoTable/FileEntry?it=2" | "[2]"            |
--------------------------------------------------------------------------------------------------------------------------------------------------------

Query 2: "Was the range check (IsMatchTableColumnRanges) performed for this table?"

This translates simply into a boolean query that tests whether IsMatchTableColumnRanges is mentioned anywhere in the provenance graph, and is basically a reachability query:

target = TOP / ALL / ALL --all output ports of the top level workflow

selected = IsMatchTableColumnRanges

  1. OPMGraph-query2.xml: Provenance graph (XML) as an answer to provenance query 2
  2. OPMGraph-query2.dot: Provenance graph (dot) as an answer to provenance query 2
  3. OPMGraph-query2.rdf: Provenance graph (RDF) as an answer to provenance query 2
  4. OPMGraph-query2.png: Provenance graph (PNG) as an answer to provenance query 2

Query 3: Which operation executions were strictly necessary for the Image table to contain a particular (non-computed) value?

Taverna can only provide the simple answer to the query, because query answers are based purely on the data dependencies that are exposed to the workflow.

In Taverna, this translates as follows. The Image table is populated from the 2nd CSV file. Since the provenance graph only mentions processors that contributed to the execution, a query that traces back from the target port through every processor collects all the required processors along the path:

target = TOP / LoadCSVFileIntoTableOutput / 1

selected = ALL

  1. OPMGraph-query3.xml: Provenance graph (XML) as an answer to provenance query 3
  2. OPMGraph-query3.dot: Provenance graph (dot) as an answer to provenance query 3
  3. OPMGraph-query3.rdf: Provenance graph (RDF) as an answer to provenance query 3
  4. OPMGraph-query3.png: Provenance graph (PNG) as an answer to provenance query 3
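
The three queries above share one pattern: trace back from a target value, then report only the selected processors. A minimal Python sketch of that pattern follows. It is not the Taverna query algorithm; the function name lineage, the dictionary encoding of the graph, and the toy wiring between processors are all assumptions made for the illustration.

```python
from typing import Dict, List, Optional, Set


def lineage(
    target_artifact: str,
    generated_by: Dict[str, str],      # artifact -> process that produced it
    used: Dict[str, List[str]],        # process -> artifacts it consumed
    selected: Optional[Set[str]] = None,
) -> List[str]:
    """Processes the target depends on, restricted to the selected ones."""
    visited: Set[str] = set()
    found: List[str] = []
    stack = [target_artifact]
    while stack:
        artifact = stack.pop()
        if artifact in visited:
            continue
        visited.add(artifact)
        process = generated_by.get(artifact)
        if process is None:
            continue  # a workflow input: nothing further upstream
        if process not in found:
            found.append(process)
        stack.extend(used.get(process, []))
    if selected is None:  # plays the role of "selected = ALL"
        return found
    return [process for process in found if process in selected]


# Toy graph: processor names are from the workflow, the wiring is invented.
generated_by = {
    "file_entry": "IsCSVReadyFileExists",
    "table": "LoadCSVFileIntoTable",
    "range_ok": "IsMatchTableColumnRanges",
}
used = {
    "IsCSVReadyFileExists": ["csv_root"],
    "LoadCSVFileIntoTable": ["file_entry"],
    "IsMatchTableColumnRanges": ["table"],
}

all_steps = lineage("range_ok", generated_by, used)
focused = lineage("range_ok", generated_by, used, {"LoadCSVFileIntoTable"})
range_check_ran = bool(
    lineage("range_ok", generated_by, used, {"IsMatchTableColumnRanges"})
)
```

In this encoding, a query 2 style answer is just a test for a non-empty result, while a query 3 style answer is the unfiltered list.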

Notes on the IDs used in the graph, and on the values of artifacts:

In Taverna, repeated invocation of a processor occurs when the processor expects an atomic value, i.e., a string, but instead its input port is bound to a list of strings (the story is actually a bit longer but this will suffice for this note). So for example, processor LoadCSVFileIntoTable is defined to have one input port, called FileEntry, which expects a string (the file name). In this workflow it receives a list of 3 file names. This causes it to execute independently on each of them. A trace for each of these 3 independent executions appears in the OPM graph, and each occurrence is indexed as [0], [1], and [2]. (This path notation makes it possible to express more complex paths into nested lists.)
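
The indexing of repeated invocations can be sketched in a few lines of Python. This is a deliberate simplification of Taverna's iteration behaviour (which, as noted, is a longer story); the function name iterate and the sample file names are hypothetical.

```python
from typing import Any, Callable, List, Tuple


def iterate(
    func: Callable[[Any], Any], value: Any, path: Tuple[int, ...] = ()
) -> List[Tuple[List[int], Any]]:
    """Apply func to every atomic item, recording the index path of each call."""
    if isinstance(value, list):
        results = []
        for index, item in enumerate(value):
            results.extend(iterate(func, item, path + (index,)))
        return results
    return [(list(path), func(value))]


def load_csv_file_into_table(file_entry: str) -> str:
    """Toy processor that expects a single file name."""
    return f"loaded {file_entry}"


# A list of 3 file names bound to a port that expects one string.
runs = iterate(load_csv_file_into_table, ["a.csv", "b.csv", "c.csv"])
```

Each entry of runs pairs an index path ([0], [1], [2]) with the result of one independent execution; a nested list would yield longer paths such as [1, 0].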

Since there is no explicit provision in OPM to account for indexing of multiple occurrences of the same process, one new ID is created for each occurrence simply by appending the index to the process name. So for example, the following records appear in the list produced by WhoUsedWhat.sparql:

--------------------------------------------------------------------------------------------------------------------------------------------------------
| process                                            | usedArtifact                         | role                                  | processIteration |
========================================================================================================================================================
| "http://taverna.opm.org/LoadCSVFileIntoTable?it=2" | "t2:ref//dataref.taverna.org?test16" | "LoadCSVFileIntoTable/FileEntry?it=2" | "[2]"            |
| "http://taverna.opm.org/LoadCSVFileIntoTable?it=1" | "t2:ref//dataref.taverna.org?test14" | "LoadCSVFileIntoTable/FileEntry?it=1" | "[1]"            |
| "http://taverna.opm.org/LoadCSVFileIntoTable?it=0" | "t2:ref//dataref.taverna.org?test12" | "LoadCSVFileIntoTable/FileEntry?it=0" | "[0]"            |
--------------------------------------------------------------------------------------------------------------------------------------------------------

Each row is interpreted as "occurrence [i] of LoadCSVFileIntoTable used the artifact listed under usedArtifact (e.g. t2:ref//dataref.taverna.org?test16 for i = 2) with role LoadCSVFileIntoTable/FileEntry?it=i during iteration [i]".

The last item, processIteration, is added as an explicit RDF triple to the Used resource to make it possible to query the graph by iteration.

Note also that the role is used, at the moment, to describe the binding of a variable to a value (an artifact ID), e.g. in the records above LoadCSVFileIntoTable/FileEntry?it=0 means that t2:ref//dataref.taverna.org?test12 is the artifact (id) bound to variable FileEntry of processor LoadCSVFileIntoTable during iteration [0].
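
The naming convention for process occurrences, roles and iterations can be summarised by the following Python sketch. The helper used_record is hypothetical; the string formats are copied from the WhoUsedWhat.sparql output shown above, and only single-level indices such as [2] are covered, since nested indices do not appear in that output.

```python
from typing import Dict

PROCESS_BASE = "http://taverna.opm.org/"


def used_record(processor: str, port: str, artifact: str, iteration: int) -> Dict[str, str]:
    """Build the four fields of a Used record for one processor occurrence."""
    suffix = f"?it={iteration}"
    return {
        # One new process ID per occurrence: the index is appended to the name.
        "process": f"{PROCESS_BASE}{processor}{suffix}",
        "usedArtifact": artifact,
        # The role encodes the binding of a port (variable) to the artifact.
        "role": f"{processor}/{port}{suffix}",
        # Explicit extra triple that allows querying the graph by iteration.
        "processIteration": f"[{iteration}]",
    }


record = used_record(
    "LoadCSVFileIntoTable", "FileEntry", "t2:ref//dataref.taverna.org?test12", 0
)
```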

Regarding values, at the moment my implementation can optionally include a property, which is not part of standard OPM, to associate the dereferenced value with the artifact. This is useful to quickly inspect the graph when the values are simple. In the case of the challenge workflow, values are actually Java beans, so it's not immediately clear how to represent their value. In my implementation of the workflow, I am XMLEncoding/XMLDecoding the beans, as Taverna really only likes to work with strings. Rather than associating XMLEncoded beans with the artifacts, I have chosen to extract field values from them. This is done through customizable "data value extractors" that are invoked by the OPM graph builder when a new artifact is produced. So my choice of extractors reflects the needs of the provenance queries, but should be viewed just as an example of a generic data extraction pattern. In particular, my plan is to map beans to JSON, which is a generally useful format when presenting OPM graphs using rich web pages. (I just haven't had the time to do it.)
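
The extractor pattern can be sketched as a small registry, shown here in Python with plain objects standing in for the XMLEncoded Java beans. The class names, the bean fields and the sample values are all hypothetical; the actual extractors in my implementation are not reproduced here.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Callable, Dict


@dataclass
class CSVFileEntry:
    """Hypothetical stand-in for one of the workflow's beans."""
    file_path: str
    row_count: int


class ValueExtractors:
    """Registry of per-type extractors, called when a new artifact is produced."""

    def __init__(self) -> None:
        self._extractors: Dict[type, Callable[[Any], str]] = {}

    def register(self, bean_type: type, extractor: Callable[[Any], str]) -> None:
        self._extractors[bean_type] = extractor

    def extract(self, bean: Any) -> str:
        extractor = self._extractors.get(type(bean))
        # Fall back to the plain string form when no extractor is registered.
        return extractor(bean) if extractor else str(bean)


extractors = ValueExtractors()
extractors.register(CSVFileEntry, lambda bean: bean.file_path)

entry = CSVFileEntry(file_path="P2Detection.csv", row_count=20)
artifact_value = extractors.extract(entry)

# A generic alternative: map the whole bean to JSON.
artifact_json = json.dumps(asdict(entry), sort_keys=True)
```

A query-specific extractor keeps artifact values short and readable, while the JSON mapping trades brevity for generality.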

Query Results

This section describes how third party OPM graphs are imported into the Taverna provenance model. Once imported, the challenge queries can be answered using the Taverna provenance query engine, just like the native Taverna provenance traces.

Representing third party OPM graphs in the Taverna provenance model: the MPOD idea

One peculiarity of the Taverna provenance model is that it relies on the structure of the static workflow graph in order to answer provenance queries efficiently. This is not a problem when provenance is captured from a Taverna workflow execution, of course, but it may become problematic when third party OPM graphs are imported, because those do not carry the original workflow structure.

Our solution is to use the causal relations provided by the graph, to "induce" a Minimal Plausible Originating (Taverna) Dataflow (MPOD) that could have produced the graph. As a consequence, the import algorithm consists of two parts:

  1. generate an MPOD from the artifact-artifact and artifact-process relations, and store it in the provenance DB. This includes processors with input and output ports, connected through data dependencies. A small number of mapping rules are used for this purpose;
  2. generate the bindings of ports to artifacts that would have been observed upon execution of the generated workflow.

The result is a complete provenance DB. One twist to this approach is that, for multi-account graphs, the algorithm maps each account to a separate MPOD, and in addition it generates a comprehensive dataflow that includes all relations found across all accounts.
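
The two parts of the import can be pictured with the Python sketch below, which induces processors, ports, data links and port bindings from Used and WasGeneratedBy relations of a single account. The actual mapping rules are not spelled out here, so the rules in the sketch (one port per role, one link per artifact shared between a producer and a consumer) are assumptions, as are the function name and the sample data.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Port = Tuple[str, str]  # (process, port name)


def induce_dataflow(
    used: List[Tuple[str, str, str]],        # (process, role, artifact)
    generated: List[Tuple[str, str, str]],   # (artifact, role, process)
):
    """Return (processors, links, bindings) for a plausible originating dataflow."""
    processors: Dict[str, Dict[str, Set[str]]] = defaultdict(
        lambda: {"inputs": set(), "outputs": set()}
    )
    producer: Dict[str, Port] = {}   # artifact -> output port that produced it
    bindings: Dict[Port, str] = {}   # port -> artifact observed on it

    # Part 1a: every WasGeneratedBy relation yields an output port.
    for artifact, role, process in generated:
        processors[process]["outputs"].add(role)
        producer[artifact] = (process, role)
        bindings[(process, role)] = artifact

    # Part 1b: every Used relation yields an input port, plus a data link
    # when the artifact was produced by another processor.
    links: Set[Tuple[Port, Port]] = set()
    for process, role, artifact in used:
        processors[process]["inputs"].add(role)
        bindings[(process, role)] = artifact   # Part 2: port bindings
        if artifact in producer:
            links.add((producer[artifact], (process, role)))

    return dict(processors), sorted(links), bindings


# Invented two-step example.
used = [("P1", "in", "a0"), ("P2", "in", "a1")]
generated = [("a1", "out", "P1"), ("a2", "out", "P2")]
processors, links, bindings = induce_dataflow(used, generated)
```

Artifacts that no process generated (a0 in the example) end up bound to input ports with no incoming link, i.e. they play the role of workflow inputs.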

Ideally, if this mapping were completely lossless, the OPM graph obtained by exporting the provenance DB that results from this import method would be identical to the initial third-party graph that was imported. While this is not always the case (in some cases the mapping is lossy), the interesting point is that one can pose (with a few renamings) the same queries described above to the imported provenance DB, and obtain structurally similar query answer OPM graphs.

The following OPM XML graphs have been successfully imported into Taverna:

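---------------------------------------------------------------------------------
| source OPM file                        | MPOD visualization (where available) |
=================================================================================
| UC Davis Halts at IsCSVReadyFileExists |                                      |
| UC Davis Success                       | MPOD for UC Davis                    |
| Soton 2 accounts                       |                                      |
| NCSA 609241_output.xml                 | MPOD for NCSA                        |
---------------------------------------------------------------------------------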

As for posing queries on the DB that stores the content of these graphs, the problem seems to be that, although all the necessary information is, at least apparently, in the graph, there is no simple way to map the provenance queries from the first part of the exercise, i.e., those posed on the native Taverna workflow provenance, to these imported graphs. This is where the current effort is concentrating.

Example: query 3 graph from UC Davis:

query3

Suggested Workflow Variants

Suggested Queries

Suggestions for Modification of the Open Provenance Model

Conclusions

-- PaoloMissier - 07 May 2009



Attachment Action Size Date Who Comment
OPMGraph.rdf manage 62.9 K 07 May 2009 - 14:43 PaoloMissier Complete provenance graph for one successful run of the PAN-STARRS workflow
PAN-STARRSTaverna.png manage 222.5 K 30 Apr 2009 - 15:15 PaoloMissier Taverna version of the challenge workflow (PNG)
PAN-STARRSTaverna.svg manage 30.1 K 30 Apr 2009 - 15:16 PaoloMissier Taverna version of the challenge workflow (SVG)
allArtifacts.sparql manage 0.4 K 07 May 2009 - 15:45 PaoloMissier List all artifacts
allProcesses.sparql manage 0.3 K 30 Apr 2009 - 15:36 PaoloMissier List all processes
WhoUsedWhat.sparql manage 0.7 K 30 Apr 2009 - 15:37 PaoloMissier List Who Used What (with roles)
WhoWasGeneratedByWhom.sparql manage 0.7 K 30 Apr 2009 - 15:37 PaoloMissier List Who Generated What (with roles)
processesAndIterations.sparql manage 0.4 K 30 Apr 2009 - 15:37 PaoloMissier List all iterations for all processes
OPMGraph-query1.rdf manage 2.9 K 14 May 2009 - 15:56 PaoloMissier Provenance graph (RDF) as an answer to provenance query 1
OPMGraph-query2.rdf manage 7.9 K 14 May 2009 - 16:05 PaoloMissier Provenance graph (RDF) as an answer to provenance query 2
OPMGraph-query3.rdf manage 8.7 K 14 May 2009 - 16:09 PaoloMissier Provenance graph (RDF) as an answer to provenance query 3
OPMGraph-complete.rdf manage 51.9 K 14 May 2009 - 15:47 PaoloMissier Complete provenance graph (RDF/XML) for one successful run of the PAN-STARRS workflow
OPMGraph-complete.xml manage 33.4 K 14 May 2009 - 15:48 PaoloMissier Complete provenance graph (XML) for one successful run of the PAN-STARRS workflow
OPMGraph-complete.dot manage 15.1 K 14 May 2009 - 15:48 PaoloMissier Complete provenance graph (dot) for one successful run of the PAN-STARRS workflow
OPMGraph-complete.png manage 233.7 K 14 May 2009 - 15:48 PaoloMissier Complete provenance graph (PNG) for one successful run of the PAN-STARRS workflow
OPMGraph-query1.png manage 22.2 K 14 May 2009 - 15:55 PaoloMissier Provenance graph (PNG) as an answer to provenance query 1
OPMGraph-query1.xml manage 2.4 K 14 May 2009 - 15:56 PaoloMissier Provenance graph (XML) as an answer to provenance query 1
OPMGraph-query1.dot manage 0.8 K 14 May 2009 - 15:56 PaoloMissier Provenance graph (dot) as an answer to provenance query 1
OPMGraph-query2.png manage 37.6 K 14 May 2009 - 16:04 PaoloMissier Provenance graph (PNG) as an answer to provenance query 2
OPMGraph-query2.xml manage 5.5 K 14 May 2009 - 16:04 PaoloMissier Provenance graph (XML) as an answer to provenance query 2
OPMGraph-query2.dot manage 2.3 K 14 May 2009 - 16:05 PaoloMissier Provenance graph (dot) as an answer to provenance query 2
OPMGraph-query3.png manage 75.9 K 09 Jun 2009 - 13:19 PaoloMissier OPM graph for query 3 from native 2 PAN-STARRS
OPMGraph-query3.xml manage 6.0 K 14 May 2009 - 16:09 PaoloMissier Provenance graph (XML) as an answer to provenance query 3
OPMGraph-query3.dot manage 2.5 K 14 May 2009 - 16:09 PaoloMissier Provenance graph (dot) as an answer to provenance query 3
MOPD_OPMDefaultAccount-1525f737-2d11-4a0f-a897-ea9fec09951b.png manage 100.3 K 08 Jun 2009 - 11:20 PaoloMissier MOPD for UC Davis
MOPD_OPMDefaultAccount-4acaef59-c25b-4a7d-9f99-499c290ee378.png manage 112.1 K 08 Jun 2009 - 11:28 PaoloMissier MOPD for NCSA
ImportedOPMGraph-query3.png manage 58.4 K 08 Jun 2009 - 20:54 PaoloMissier query3
PC3-report.pdf manage 735.0 K 12 Jun 2009 - 07:26 PaoloMissier PC3 report presented at OPM PC3 meeting
