Provenance Challenge: KCL

Participating Team

Team and Project Details

  • Short team name: KCL
  • Participant names: Simon Miles
  • Project URL:
  • Project Overview:
  • Relevant Publications:

Quick Overview

This section gives a quick overview of the system used to generate the provenance, to make the contents of the OPM output clearer for other participants.

The system, SourceSource, records provenance from the execution of Java programs and from their use of non-Java services. It aims to do so without changes to the source code and with minimal set-up, while keeping it apparent what will be recorded.

To this end, the source code is transformed into a self-documenting form in a pre-compilation step. Plug-ins can be added for services that are to be treated as black boxes, e.g. a database taking SQL statements as inputs; a plug-in parses the inputs and outputs of the service and adds them to the provenance graph. SourceSource is built upon TXL, a commonly used program transformation tool.

The SourceSource executables will be uploaded and the internal provenance model described soon.

Workflow Representation

The workflow is the Java program exactly as supplied by Yogesh. It is pre-compiled, then run using the JVM as normal.

Open Provenance Model Output

The OPM output is available as the attachment kcl-opm.xml on this page. It is formatted following the example XML in schema v1.01.a provided by Paul Groth and Luc Moreau at the OpenProvenance website.

A note about the serialisation of artifact values: In the current model, each artifact is either a variable with a given value at a given time, or a database entry with a given value at a given time. One particular class of variables is critical to answering the queries, but has no default serialisation: the CSVFileEntry. This object contains a CSV file path and database table name (see LoadWorkflow and LoadAppLogic classes in the workflow for its use). In the OPM output, we have serialised the variable value in the form: file-path#table-name, e.g. D:\Personal\challenge\pc3\PC3\SampleData/J062941\P2_J062941_B001_P2fits0_20081115_P2ImageMeta.csv#P2IMAGEMETA
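As a reading aid for the file-path#table-name form, the following is a minimal sketch of how a consumer of the OPM output might split such a value back into its two parts. The class SerialisedFileEntry and its fields are hypothetical names chosen for illustration and are not part of SourceSource; the sketch also assumes that table names never contain '#', so the last '#' is taken as the separator.

```java
// Hypothetical helper for reading serialised CSVFileEntry values; not part of SourceSource.
final class SerialisedFileEntry {
    final String filePath;   // the part before the separator
    final String tableName;  // the part after the separator

    private SerialisedFileEntry(String filePath, String tableName) {
        this.filePath = filePath;
        this.tableName = tableName;
    }

    // Splits "file-path#table-name" at the last '#'
    // (assumption: table names do not contain '#').
    static SerialisedFileEntry parse(String value) {
        int hash = value.lastIndexOf('#');
        if (hash < 0) {
            throw new IllegalArgumentException("No '#' separator in: " + value);
        }
        return new SerialisedFileEntry(value.substring(0, hash), value.substring(hash + 1));
    }
}
```

Applied to the example value above, parse would yield the table name P2IMAGEMETA and the full CSV path as the file path.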

Query Results

For now we provide the pseudo-code and output for the queries, to guide others in how the OPM is to be interpreted. The query code (in Java) will be uploaded shortly.

A few points about the internal model of SourceSource need to be made clear to understand the query implementations:

Naming Artifacts: Each source code variable is given a name scoped by its class and method, e.g. LoadAppLogic_IsMatchTableRowCount_FileEntry (in class LoadAppLogic, in method IsMatchTableRowCount, the local variable named FileEntry). Each database entry is given a name composed of the table name and the first two fields of the entry, e.g. P2DETECTION_113191992826421637,261887437010025729. These names can be used to query for the provenance of the last values of the variable or entry (OPM artifacts).
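The two artifact naming patterns can be illustrated with a short sketch. The class ArtifactNames and its method names are hypothetical; only the resulting name formats are taken from the examples above, and the separators are inferred from those examples.

```java
// Hypothetical helper that rebuilds artifact names in the format shown above;
// not part of SourceSource.
final class ArtifactNames {
    private ArtifactNames() {}

    // Variable name: class, method and variable joined by underscores.
    static String variable(String className, String methodName, String variableName) {
        return className + "_" + methodName + "_" + variableName;
    }

    // Database entry name: table name, underscore, then the first two fields
    // separated by a comma.
    static String databaseEntry(String tableName, String firstField, String secondField) {
        return tableName + "_" + firstField + "," + secondField;
    }
}
```

For example, ArtifactNames.databaseEntry("P2DETECTION", "113191992826421637", "261887437010025729") gives the database entry name quoted above.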

Naming Processes: Each Java statement is given a name scoped by its class and method, e.g. LoadWorkflow_main_Declaration12 (in class LoadWorkflow, in method main, the 12th declaration). These names can be used to query for the provenance of each execution of the statement. To aid those building queries, a tool is provided that shows which names the statements are given (to be uploaded shortly).
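Statement names of this shape can be taken apart mechanically, which may help when building queries by hand. The sketch below is illustrative only: the class ProcessName is hypothetical, and it assumes that class and method names contain no underscores and that the last component is a kind (such as Declaration or Statement) followed by a number.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser for statement names such as LoadWorkflow_main_Declaration12;
// not part of SourceSource.
final class ProcessName {
    // class _ method _ kind index (assumption: no underscores inside class or method names)
    private static final Pattern FORMAT =
            Pattern.compile("([^_]+)_([^_]+)_([A-Za-z]+)(\\d+)");

    final String className;
    final String methodName;
    final String kind;   // e.g. "Declaration" or "Statement"
    final int index;     // position among statements of that kind in the method

    private ProcessName(String className, String methodName, String kind, int index) {
        this.className = className;
        this.methodName = methodName;
        this.kind = kind;
        this.index = index;
    }

    static ProcessName parse(String name) {
        Matcher m = FORMAT.matcher(name);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a statement name: " + name);
        }
        return new ProcessName(m.group(1), m.group(2), m.group(3), Integer.parseInt(m.group(4)));
    }
}
```

Names that do not follow the pattern are rejected by this sketch rather than guessed at.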

Occurrences: As most querying concerns apply to artifacts and processes in the same way, we generalise and call them both kinds of occurrence.

Features: The OPM value of each occurrence is a set of features, each comprising a type and a Java value. A subset of these are the defining features, which distinguish this occurrence from others, i.e. they give its identity.

Provenance and Future: The provenance of an occurrence is the sub-tree taken from the provenance graph recursively leading backwards from effects to causes starting from that occurrence. The future of an occurrence is the sub-tree taken from the provenance graph recursively leading forwards from causes to effects starting from that occurrence.
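The two definitions are mirror-image traversals of the same graph, which the following sketch tries to make concrete. It is not the SourceSource implementation: the class CausalGraph is hypothetical, occurrences are reduced to plain strings, and the methods return the set of occurrences reached rather than the sub-tree itself.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical cause-to-effect graph over occurrence names; not part of SourceSource.
final class CausalGraph {
    private final Map<String, Set<String>> causesOf = new HashMap<>();
    private final Map<String, Set<String>> effectsOf = new HashMap<>();

    // Records that 'cause' directly led to 'effect'.
    void addEdge(String cause, String effect) {
        causesOf.computeIfAbsent(effect, k -> new LinkedHashSet<>()).add(cause);
        effectsOf.computeIfAbsent(cause, k -> new LinkedHashSet<>()).add(effect);
    }

    // Everything reachable by following edges backwards, from effects to causes.
    Set<String> provenance(String occurrence) {
        return reach(occurrence, causesOf);
    }

    // Everything reachable by following edges forwards, from causes to effects.
    Set<String> future(String occurrence) {
        return reach(occurrence, effectsOf);
    }

    // Iterative traversal; the 'seen' set guards against visiting a node twice.
    private static Set<String> reach(String start, Map<String, Set<String>> next) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> pending = new ArrayDeque<>();
        pending.push(start);
        while (!pending.isEmpty()) {
            String current = pending.pop();
            for (String neighbour : next.getOrDefault(current, Collections.<String>emptySet())) {
                if (seen.add(neighbour)) {
                    pending.push(neighbour);
                }
            }
        }
        return seen;
    }
}
```

With edges a to b and b to c, the provenance of c is {b, a} and the future of a is {b, c}; the starting occurrence itself is not included in either result.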

Query 1

Pseudo-code:

  1. Get the last occurrence of database entry named P2DETECTION_112051986299712706,261887437040025450
  2. Find, within the provenance of this occurrence, occurrences of variables with a value of type CSVFileEntry
  3. Get the file paths from the CSVFileEntry objects

Output: [D:\Personal\challenge\pc3\PC3\SampleData/J062941\P2_J062941_B001_P2fits0_20081115_P2Detection.csv]
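Steps 2 and 3 of the Query 1 pseudo-code amount to a filter followed by a projection. The sketch below shows that shape only; it is not the query code referred to above. The Occurrence class, its fields and the method sourceFiles are hypothetical, and values are assumed to be in the serialised file-path#table-name form used in the OPM output.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

// Hypothetical sketch of Query 1, steps 2 and 3; not the actual SourceSource query code.
final class Query1Sketch {
    // Minimal stand-in for an occurrence: a name, the type of its value, and the serialised value.
    static final class Occurrence {
        final String name;
        final String valueType;
        final String value;

        Occurrence(String name, String valueType, String value) {
            this.name = name;
            this.valueType = valueType;
            this.value = value;
        }
    }

    // Keeps occurrences whose value is a CSVFileEntry and returns their distinct file paths.
    static List<String> sourceFiles(Collection<Occurrence> provenance) {
        List<String> paths = new ArrayList<>();
        for (Occurrence o : provenance) {
            if (!"CSVFileEntry".equals(o.valueType)) {
                continue;
            }
            int hash = o.value.lastIndexOf('#'); // separator in file-path#table-name
            String path = hash < 0 ? o.value : o.value.substring(0, hash);
            if (!paths.contains(path)) {
                paths.add(path);
            }
        }
        return paths;
    }
}
```

The input collection stands for the provenance of the database entry obtained in step 1.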

Query 2

Pseudo-code:

  1. Get the last occurrence of database entry named P2DETECTION_112051986299712706,261887437040025450
  2. Find, within the provenance of this occurrence, occurrences of variables with a value of type CSVFileEntry
  3. For each CSVFileEntry occurrence:
    1. Find, within the future of the CSVFileEntry occurrence, causal relations of the form:
      • The effect is process point named LoadWorkflow_main_Declaration12 (the IsMatchTableColumnRanges check)
      • The relationship (OPM: role of the cause artifact) is Used In Expression
      • The cause is the same variable as the CSVFileEntry
    2. Where such a relation exists, then the table referred to by the CSVFileEntry has been checked

Output: P2DETECTION was checked
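The inner test of the Query 2 pseudo-code is a three-part match on each causal relation. The following sketch illustrates only that match; the Relation class, its fields and the method wasChecked are hypothetical stand-ins, and the relations passed in are assumed to be those found in the future of the CSVFileEntry occurrence.

```java
import java.util.Collection;

// Hypothetical sketch of the relation test in Query 2; not the actual SourceSource query code.
final class Query2Sketch {
    // Minimal stand-in for a causal relation: cause artifact, role, effect process.
    static final class Relation {
        final String cause;
        final String role;
        final String effect;

        Relation(String cause, String role, String effect) {
            this.cause = cause;
            this.role = role;
            this.effect = effect;
        }
    }

    // Statement name and role as given in the pseudo-code above.
    static final String CHECK_PROCESS = "LoadWorkflow_main_Declaration12";
    static final String ROLE = "Used In Expression";

    // True if some relation has the check as effect, the expected role,
    // and the given variable as cause.
    static boolean wasChecked(String fileEntryVariable, Collection<Relation> futureRelations) {
        for (Relation r : futureRelations) {
            if (CHECK_PROCESS.equals(r.effect)
                    && ROLE.equals(r.role)
                    && fileEntryVariable.equals(r.cause)) {
                return true;
            }
        }
        return false;
    }
}
```

A true result for a given CSVFileEntry variable corresponds to the statement that the table it refers to was checked.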

Query 3

Pseudo-code:

  1. Get the last occurrence of database entry named P2IMAGEMETA_6294101,62941
  2. For each occurrence in the provenance of that entry:
    1. If the occurrence is a variable having a given value at a specific statement in the program, get the name of that statement; or if the occurrence is a statement being executed, get the name of that statement
    2. If the occurrence is part of a method's execution (OPM: fine-grained account), get the occurrence representing the method call (OPM: overlapping coarse-grained account) and get the name of that statement.
  3. The statement names gathered in this way identify the statements which affected the database entry.

Output:

[LoadAppLogic_LoadCSVFileIntoTable_Declaration1, LoadAppLogic_LoadCSVFileIntoTable_Declaration2, LoadAppLogic_LoadCSVFileIntoTable_Statement10, LoadAppLogic_LoadCSVFileIntoTable_Statement3, LoadAppLogic_LoadCSVFileIntoTable_Statement4, LoadAppLogic_LoadCSVFileIntoTable_Statement5, LoadAppLogic_LoadCSVFileIntoTable_Statement6, LoadAppLogic_LoadCSVFileIntoTable_Statement7, LoadAppLogic_LoadCSVFileIntoTable_Statement8, LoadAppLogic_LoadCSVFileIntoTable_Statement9, LoadCSVFileIntoTable, LoadWorkflow_main_Declaration1, LoadWorkflow_main_Declaration5, LoadWorkflow_main_Declaration7, LoadWorkflow_main_Declaration9, LoadWorkflow_main_Statement5, main]

The statements in the main workflow (LoadWorkflow_main...) named above correspond to the following statements in the source code:

  • LoadWorkflow_main_Declaration1 : String JobID = args [0], CSVRootPath = args [1];
  • LoadWorkflow_main_Declaration5 : LoadAppLogic.DatabaseEntry CreateEmptyLoadDBOutput = LoadAppLogic.CreateEmptyLoadDB (JobID);
  • LoadWorkflow_main_Statement5 : for (LoadAppLogic.CSVFileEntry FileEntry : ReadCSVReadyFileOutput)
  • LoadWorkflow_main_Declaration7 : LoadAppLogic.CSVFileEntry ReadCSVFileColumnNamesOutput = LoadAppLogic.ReadCSVFileColumnNames (FileEntry);
  • LoadWorkflow_main_Declaration9 : boolean LoadCSVFileIntoTableOutput = LoadAppLogic.LoadCSVFileIntoTable (CreateEmptyLoadDBOutput, ReadCSVFileColumnNamesOutput);

The implication is that every other statement (including, for example, the validation checks) can be removed without affecting the result.
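The collection step in the Query 3 pseudo-code can be sketched as follows. This is an illustration, not the query code: the Occurrence class and its two fields are hypothetical, with enclosingCallName standing for the method-call occurrence in the overlapping coarse-grained account (null when there is none). A sorted set is used here because the output listed above is in lexicographic order; whether the original query sorts in this way is an assumption.

```java
import java.util.Collection;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical sketch of the name-gathering step in Query 3; not the actual SourceSource query code.
final class Query3Sketch {
    static final class Occurrence {
        final String statementName;      // statement the occurrence belongs to
        final String enclosingCallName;  // method-call occurrence containing it, or null

        Occurrence(String statementName, String enclosingCallName) {
            this.statementName = statementName;
            this.enclosingCallName = enclosingCallName;
        }
    }

    // Gathers the statement name of each occurrence and, where present,
    // the name of the method call it is part of.
    static SortedSet<String> affectingStatements(Collection<Occurrence> provenance) {
        SortedSet<String> names = new TreeSet<>();
        for (Occurrence o : provenance) {
            names.add(o.statementName);
            if (o.enclosingCallName != null) {
                names.add(o.enclosingCallName);
            }
        }
        return names;
    }
}
```

Duplicates collapse automatically, so a statement executed many times appears once in the result.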

Suggested Workflow Variants

None as yet.

Suggested Queries

See query page.

Suggestions for Modification of the Open Provenance Model

To be completed soon.

Conclusions

-- SimonMiles - 03 Apr 2009

Attachment | Size | Date | Who | Comment
kcl-opm.xml | 350.8 K | 09 Apr 2009 - 14:46 | SimonMiles | OPM output of provenance from KCL's execution of the workflow
