
Provenance Challenge: Provenance-Aware Storage Systems (PASS)

Participating Team

Team and Project Details

  • Short team name: PASS
  • Participant names: Uri Braun, David Holland, Peter Macko, Diana MacLean, Daniel Margo, Kiran-Kumar Muniswamy-Reddy, Margo Seltzer, Robin Smogor
  • Project URL: http://www.eecs.harvard.edu/syrah/pass
  • Project Overview: PASS stands for Provenance-Aware Storage Systems and refers to systems (in our case, file systems) that treat provenance as a first-class object, collecting, maintaining, and querying it automatically. The second PASS prototype, which we use for this Challenge, is implemented as a set of Linux kernel modules and a file system that automatically capture provenance as users interact with the system in the usual way. Capturing provenance therefore requires no specialized workflow engines or other special-purpose software; PASS captures provenance for any program that runs on Linux 2.6.
  • Relevant Publications:
    • Muniswamy-Reddy, K., Braun, U., Holland, D., Macko, P., MacLean, D., Margo, D., Seltzer, M., and Smogor, R., Layering in Provenance Systems, Proceedings of the 2009 USENIX Annual Technical Conference, San Diego, CA, June 2009.
    • Muniswamy-Reddy, K., Holland, D., Braun, U., Seltzer, M., Provenance-Aware Storage Systems, Proceedings of the 2006 USENIX Annual Technical Conference, Boston, MA, June 2006.
    • Holland, D., Braun, U., MacLean, D., Muniswamy-Reddy, K., and Seltzer, M., Choosing a Data Model and Query Language for Provenance, Proceedings of the 2nd International Provenance and Annotation Workshop, Salt Lake City, UT, June 2008.

Workflow Representation

The workflow is represented as a Bash script that executes a modified version of the supplied Java classes. The script mirrors the supplied .bat files, except that some command-line arguments are passed directly instead of as serialized Java objects; for example, we use "--job J062941" instead of "-f JobIDInput.xml". We also modified the Java classes to use our version of SQLite, which tracks provenance at cell-level granularity, instead of Derby.
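
As a rough illustration of this invocation style, the sketch below shows how a few of the operators are launched. The command lines themselves are taken from the provenance we recorded (they also appear in the Optional Query 8 result below); the surrounding script skeleton and the JOB and CSV_PATH variables are illustrative only and are not our exact script.

#!/bin/bash
# Illustrative sketch of the wrapper script's invocation style, not the exact
# script we ran. The three command lines correspond to those recorded in the
# provenance; JOB and CSV_PATH are introduced here only for readability.
JOB=J062941
CSV_PATH=/disk/disk1/challenge3/PC3/SampleData/J062941/

./PSLoadExecutable.sh CreateEmptyLoadDB -o CreateEmptyLoadDBOutput.xml --job "$JOB"
./PSLoadExecutable.sh IsCSVReadyFileExists -o IsCSVReadyFileExistsOutput.xml --path "$CSV_PATH"
./PSLoadExecutable.sh ReadCSVReadyFile -o ReadCSVReadyFileOutput.xml --path "$CSV_PATH"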

Open Provenance Model Output

XML-formatted OPM: J062941_v2.opm

We use the following naming conventions:

  • Files: full path of the file
  • Processes: the name of the executable
  • Database cells: table_name row_id:column_name
  • Cells in a CSV file: row_id:column_id

There are also several nameless artifacts, which correspond to Unix pipes. The previous version of our XML-formatted OPM export is J062941.opm, in which the command-line arguments are part of the process names rather than separate artifacts.

Query Results

The queries were written in our Path Query Language (PQL) and evaluated on the provenance graph before it was exported to OPM. The version of PQL used for this Challenge follows these edge-labeling conventions: INPUT is a generic ancestry edge, WHERE denotes a where-provenance edge between two database cells, and CONTAINS represents containment (modeled as an ancestry edge).

Core Query 1

select csv.NAME
  from Provenance.% as db, db.CONTAINS as cell, cell.WHERE+.CONTAINS-OF as csv
 where db.NAME glob "*/pc3.db"
   and cell.TABLE = "P2Detection"
   and csv.NAME glob "*.csv";

This query first finds the SQLite database file pc3.db (variable db) and then the set of all cells in table P2Detection (variable cell). It then follows the where-provenance of those cells back to the CSV files that contain their where-ancestors.

Result:

"/challenge3/PC3/SampleData/J062941/P2_J062941_B001_P2fits0_20081115_P2Detection.csv"

Given a particular entry in the table, we can find where exactly it came from:

select csv.NAME, w.ROW, w.COLUMN
  from Provenance.% as db, db.CONTAINS as cell, cell.WHERE+ as w, w.CONTAINS-OF as csv
 where db.NAME glob "*/pc3.db"
   and cell.TABLE = "P2Detection"
   and cell.COLUMN = "peakFlux" and cell.ROW = "8"
   and csv.NAME glob "*.csv";

Result:

{
    "/challenge3/PC3/SampleData/J062941/P2_J062941_B001_P2fits0_20081115_P2Detection.csv"
    "7"
    "30"
}

That is, the particular value came from row 7, column 30 of the given .csv file (counting from 0).

Core Query 2

select count(X.NAME)
  from Provenance.% as X
 where X.NAME = "./PSLoadExecutable.sh" and X.TYPE = "PROC"
   and X.ARG1 = "IsMatchTableColumnRanges" and X.ARGS glob "*-t P2Detection*";

This query searches for all invocations of IsMatchTableColumnRanges on table P2Detection. If the operation was executed, the count aggregate in the query returns a positive number. If it was not executed, the query result is 0.

Result:

1

Core Query 3

select X.ARG4, X.ARG6, X.ARG8, X.ARG10, X.ARG12, X.ARG14 
  from Provenance.% as db, db.CONTAINS as cell, cell.WHERE*.INPUT as X
 where db.NAME glob "*/pc3.db"
   and cell.TABLE = "P2Detection"
   and cell.COLUMN = "imageID" and cell.ROW = "4"
   and X.NAME glob "*/java";

The query identifies all processes that wrote a particular cell (in this example, the fourth row of P2Detection, column imageID), along with the relevant command-line arguments, but it does not examine those processes' inputs any further. The full ancestry of the cell includes every previous process that used the database, since the SQLite database file is both an input and an output of each preceding workflow step; following those ancestors would pull in operations that were not strictly necessary.

Result:

{
    {
        "LoadCSVFileIntoTable"
        "IsLoadedCSVFileIntoTableOutput_FileEntry2.xml"
        "CreateEmptyLoadDBOutput.xml"
        "ReadCSVFileColumnNamesOutput_FileEntry2.xml"
        "P2Detection"
    }
}

Our system does not keep track of control flows that do not result in any data flow, unless we were to modify /bin/bash to insert custom annotations. For example, there is no data flow from IsCSVReadyFileExists to LoadCSVFileIntoTable, so we do not know whether a successful execution of IsCSVReadyFileExists was strictly necessary for a particular cell to appear in the database.

Optional Query 1

select count(java.ARG10) - 1
  from Provenance.% as java
 where java.NAME glob "*/java"
   and java.ARG4 = "IsMatchTableColumnRanges";

In this query, we simply count the invocations of IsMatchTableColumnRanges. Because we know from the workflow specification that the last execution of IsMatchTableColumnRanges failed, the number of correctly loaded tables is the number of successful invocations, which is the total number of invocations minus one.

Result:

2

Optional Query 3

select max(X.FREEZETIME) from Provenance.% as X where X.ARG4 = "IsExistsCSVFile";
select max(X.FREEZETIME) from Provenance.% as X where X.ARG4 = "IsMatchCSVFileTables";

We answer this query by getting the timestamps associated with the last executions of IsExistsCSVFile and IsMatchCSVFileTables, and computing their difference.

Result:

1239499744.640488611 - 1239499740.764267717 ≈ 3.88 seconds
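
The subtraction itself is plain arithmetic and can be reproduced with any arbitrary-precision tool; the snippet below is illustrative only and is not part of the workflow.

echo "1239499744.640488611 - 1239499740.764267717" | bc
# prints 3.876220894, i.e. roughly 3.88 seconds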

Optional Query 4

select I.NAME
  from Provenance.% as db, db.CONTAINS as cell, cell.INPUT+ as I
 where db.NAME glob "*/pc3.db"
   and cell.TABLE = "P2Detection"
   and cell.COLUMN = "peakFlux" and cell.ROW = "8";

This query returns the entire ancestry graph of a particular cell (row 8, column peakFlux of P2Detection), which in PASS is equivalent to the why-provenance.

Result:

(omitted for brevity)

Optional Query 6

First, we find the timestamp of the last execution of a workflow operator.

select max(X.FREEZETIME) from Provenance.% as X where X.ARG0 = "./PSLoadExecutable.sh";

For example, if the result is "1239499744.632488155", we can then query again to find the actual process and its command-line arguments:

select X.ARGS
  from Provenance.% as X
 where X.ARG0 = "./PSLoadExecutable.sh"
   and X.FREEZETIME = "1239499744.632488155";

Result:

"./PSLoadExecutable.sh IsExistsCSVFile -o IsExistsCSVFileOutput_FileEntry2.xml -f FileEntry2.xml"

Optional Query 8

The set of successfully executed steps is just the set of all executed steps minus the failed step. We get the set of all steps using the following query:

select X.ARGS from Provenance.% as X where X.ARG0 = "./PSLoadExecutable.sh";

After subtracting the failed step determined by Optional Query 6 (a sketch of this subtraction follows the listing), we get the following set, in no particular order:

{
    "./PSLoadExecutable.sh CreateEmptyLoadDB -o CreateEmptyLoadDBOutput.xml --job J062941"
    "./PSLoadExecutable.sh IsCSVReadyFileExists -o IsCSVReadyFileExistsOutput.xml --path /disk/disk1/challenge3/PC3/SampleData/J062941/"
    "./PSLoadExecutable.sh IsExistsCSVFile -o IsExistsCSVFileOutput_FileEntry0.xml -f FileEntry0.xml"
    "./PSLoadExecutable.sh IsExistsCSVFile -o IsExistsCSVFileOutput_FileEntry1.xml -f FileEntry1.xml"
    "./PSLoadExecutable.sh IsMatchCSVFileColumnNames -o IsMatchCSVFileColumnNamesOutput_FileEntry0.xml
                           -f ReadCSVFileColumnNamesOutput_FileEntry0.xml"
    "./PSLoadExecutable.sh IsMatchCSVFileColumnNames -o IsMatchCSVFileColumnNamesOutput_FileEntry1.xml
                           -f ReadCSVFileColumnNamesOutput_FileEntry1.xml"
    "./PSLoadExecutable.sh IsMatchCSVFileTables -o IsMatchCSVFileTablesOutput.xml -f ReadCSVReadyFileOutput.xml"
    "./PSLoadExecutable.sh IsMatchTableColumnRanges -o IsMatchTableColumnRangesOutput_FileEntry0.xml -f CreateEmptyLoadDBOutput.xml
                           -f ReadCSVFileColumnNamesOutput_FileEntry0.xml -t P2FrameMeta"
    "./PSLoadExecutable.sh IsMatchTableColumnRanges -o IsMatchTableColumnRangesOutput_FileEntry1.xml -f CreateEmptyLoadDBOutput.xml
                           -f ReadCSVFileColumnNamesOutput_FileEntry1.xml -t P2ImageMeta"
    "./PSLoadExecutable.sh IsMatchTableRowCount -o IsMatchTableRowCountOutput_FileEntry0.xml -f CreateEmptyLoadDBOutput.xml
                           -f ReadCSVFileColumnNamesOutput_FileEntry0.xml -t P2FrameMeta"
    "./PSLoadExecutable.sh IsMatchTableRowCount -o IsMatchTableRowCountOutput_FileEntry1.xml -f CreateEmptyLoadDBOutput.xml
                           -f ReadCSVFileColumnNamesOutput_FileEntry1.xml -t P2ImageMeta"
    "./PSLoadExecutable.sh LoadCSVFileIntoTable -o IsLoadedCSVFileIntoTableOutput_FileEntry0.xml -f CreateEmptyLoadDBOutput.xml
                           -f ReadCSVFileColumnNamesOutput_FileEntry0.xml -t P2FrameMeta"
    "./PSLoadExecutable.sh LoadCSVFileIntoTable -o IsLoadedCSVFileIntoTableOutput_FileEntry1.xml -f CreateEmptyLoadDBOutput.xml
                           -f ReadCSVFileColumnNamesOutput_FileEntry1.xml -t P2ImageMeta"
    "./PSLoadExecutable.sh ReadCSVFileColumnNames -o ReadCSVFileColumnNamesOutput_FileEntry0.xml -f FileEntry0.xml"
    "./PSLoadExecutable.sh ReadCSVFileColumnNames -o ReadCSVFileColumnNamesOutput_FileEntry1.xml -f FileEntry1.xml"
    "./PSLoadExecutable.sh ReadCSVReadyFile -o ReadCSVReadyFileOutput.xml --path /disk/disk1/challenge3/PC3/SampleData/J062941/"
    "./PSLoadExecutable.sh SplitList -o FileEntry?.xml -f ReadCSVReadyFileOutput.xml"
    "./PSLoadExecutable.sh UpdateComputedColumns -o IsUpdatedComputedColumnsOutput_FileEntry0.xml -f CreateEmptyLoadDBOutput.xml
                           -f ReadCSVFileColumnNamesOutput_FileEntry0.xml -t P2FrameMeta"
    "./PSLoadExecutable.sh UpdateComputedColumns -o IsUpdatedComputedColumnsOutput_FileEntry1.xml -f CreateEmptyLoadDBOutput.xml
                           -f ReadCSVFileColumnNamesOutput_FileEntry1.xml -t P2ImageMeta"
}
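
As a sketch of the subtraction step: assuming the output of the query above has been saved to a file (all_steps.txt is a hypothetical name), the failed step identified by Optional Query 6 can be filtered out with a single grep.

# Hypothetical post-processing: all_steps.txt is assumed to hold the query
# output, one command line per step; the pattern matches the failed
# IsExistsCSVFile invocation reported by Optional Query 6.
grep -v "IsExistsCSVFileOutput_FileEntry2.xml" all_steps.txt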

Optional Query 10

select L.ARGS
  from Provenance.% as L
 where L.NAME = "./LoadWorkflow.sh" and L.TYPE = "PROC";

This query just returns all command-line arguments to LoadWorkflow.sh.

Result:

"./LoadWorkflow.sh J062941 /disk/disk1/challenge3/PC3/SampleData/J062941/"

Optional Query 11

select L.ARG1
  from Provenance.% as L
 where (L.ARGS glob "*--path*" or L.ARGS glob "*--job*") and L.NAME = "./PSLoadExecutable.sh";

In our implementation, we pass the two user inputs to other processes via command-line arguments --path and --job, so our query just looks for the processes with at least one of these arguments. Equivalently, given the results of the previous query, we could instead search for all processes with arguments "J062941" or "/disk/disk1/challenge3/PC3/SampleData/J062941/".

Result:

{
    "CreateEmptyLoadDB"
    "IsCSVReadyFileExists"
    "ReadCSVReadyFile"
}

Suggested Workflow Variants

Suggested Queries

Query 1

A particular detection value seems wrong. However, the workflow, queries, and the CSV files are correct, so it is possible that the error is due to something external to the workflow engine. Which shared libraries were involved in computing a given value in the database?

select lib.NAME
  from Provenance.% as db, db.CONTAINS as cell, cell.INPUT+ as lib
 where db.NAME glob "*/pc3.db"
   and cell.TABLE = "P2Detection"
   and cell.COLUMN = "peakFlux" and cell.ROW = "8"
   and lib.NAME glob "*.so*";

The query first finds the SQLite database file and then locates the provenance record that corresponds to the given cell. It then searches for all shared libraries within the ancestry of that cell.

Result:

{
    "/challenge3/sqlite/java/libsqlite-java.so"
    "/etc/ld.so.cache"
    "/lib/libacl.so.1.1.0"
    "/lib/libattr.so.1.1.0"
    "/lib/libblkid.so.1.0"
    "/lib/libdevmapper.so.1.02"
    "/lib/libnss_mdns4_minimal.so.2"
    "/lib/libselinux.so.1"
    "/lib/libsepol.so.1"
    "/lib/libuuid.so.1.2"
    "/lib/tls/i686/cmov/libc-2.3.6.so"
    "/lib/tls/i686/cmov/libdl-2.3.6.so"
    "/lib/tls/i686/cmov/libm-2.3.6.so"
    "/lib/tls/i686/cmov/libnsl-2.3.6.so"
    "/lib/tls/i686/cmov/libnss_compat-2.3.6.so"
    "/lib/tls/i686/cmov/libnss_dns-2.3.6.so"
    "/lib/tls/i686/cmov/libnss_files-2.3.6.so"
    "/lib/tls/i686/cmov/libnss_nis-2.3.6.so"
    "/lib/tls/i686/cmov/libpthread-2.3.6.so"
    "/lib/tls/i686/cmov/libresolv-2.3.6.so"
    "/pmacko/pass/tools/challenge3/sqlite/java/libsqlite-java.so"
    "/usr/local/java/jdk1.5.0_16/jre/lib/i386/libjava.so"
    "/usr/local/java/jdk1.5.0_16/jre/lib/i386/libverify.so"
    "/usr/local/java/jdk1.5.0_16/jre/lib/i386/libzip.so"
    "/usr/local/java/jdk1.5.0_16/jre/lib/i386/native_threads/libhpi.so"
    "/usr/local/java/jdk1.5.0_16/jre/lib/i386/server/libjvm.so"
}

Suggestions for Modification of the Open Provenance Model

Conclusions

-- PeterMacko - 22 May 2009

Attachments:

  • J062941.opm (6078.8 K, 13 Apr 2009, PeterMacko): XML-formatted OPM
  • J062941_brief.opm (3706.9 K, 13 Apr 2009, PeterMacko): XML-formatted OPM without cell-level tracking
  • J062941_v2.opm (20868.6 K, 22 May 2009, PeterMacko): XML-formatted OPM, 2nd version
