The workflow is represented as a Bash script that executes a modified version of the supplied Java classes. The script mirrors the supplied .bat files, except that some command-line arguments are passed directly instead of as serialized Java objects. For example, we use "--job J062941" instead of "-f JobIDInput.xml". We modified the Java classes to use our version of SQLite instead of Derby, which tracks provenance at cell-level granularity.
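As a rough illustration of how such a wrapper might look, the sketch below simply forwards a step name and its arguments to the modified Java loader; the jar names and the main class are illustrative assumptions, not the actual implementation:

```bash
#!/bin/bash
# Hypothetical PSLoadExecutable.sh-style wrapper (jar names and the main
# class are assumptions for illustration only).
exec java -cp ./loadworkflow.jar:./sqlite/java/sqlite-jdbc.jar \
    LoadWorkflowMain "$@"
```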
XML-formatted OPM: J062941_v2.opm
The queries were written in our Path Query Language (PQL) and evaluated on the provenance graph before it was exported to OPM. The version of PQL used for this Challenge uses the following edge labeling conventions: INPUT is a generic ancestry edge, WHERE denotes a where-provenance edge between two database cells, and CONTAINS represents containment (modeled as an ancestry edge).
select csv.NAME from Provenance.% as db, db.CONTAINS as cell, cell.WHERE+.CONTAINS-OF as csv where db.NAME glob "*/pc3.db" and cell.TABLE = "P2Detection" and csv.NAME glob "*.csv";
This query first finds the SQLite database file pc3.db (variable db) and then the set of all cells in table P2Detection (variable cell). The query then looks for all where-ancestors of those cells that originated from a CSV file.
Result:
"/challenge3/PC3/SampleData/J062941/P2_J062941_B001_P2fits0_20081115_P2Detection.csv"
Given a particular entry in the table, we can find where exactly it came from:
select csv.NAME, w.ROW, w.COLUMN from Provenance.% as db, db.CONTAINS as cell, cell.WHERE+ as w, w.CONTAINS-OF as csv where db.NAME glob "*/pc3.db" and cell.TABLE = "P2Detection" and cell.COLUMN = "peakFlux" and cell.ROW = "8" and csv.NAME glob "*.csv";
Result:
{ "/challenge3/PC3/SampleData/J062941/P2_J062941_B001_P2fits0_20081115_P2Detection.csv" "7" "30" }
That is, the particular value came from row 7, column 30 of the given .csv file (counting from 0).
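As a sanity check outside the provenance store, that position can be read directly from the CSV file; a minimal sketch, assuming plain comma-separated fields without embedded commas and that rows are counted over all lines of the file (0-based row 7 is the 8th line, column 30 the 31st field):

```bash
# Print the field at row 7, column 30 (both 0-based) of the detection CSV.
awk -F',' 'NR == 8 { print $31 }' \
    /challenge3/PC3/SampleData/J062941/P2_J062941_B001_P2fits0_20081115_P2Detection.csv
```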
select count(X.NAME) from Provenance.% as X where X.NAME = "./PSLoadExecutable.sh" and X.TYPE = "PROC" and X.ARG1 = "IsMatchTableColumnRanges" and X.ARGS glob "*-t P2Detection*";
This query searches for all invocations of IsMatchTableColumnRanges on table P2Detection. If the operation was executed, the count aggregate in the query returns a positive number. If it was not executed, the query result is 0.
Result:
1
select X.ARG4, X.ARG6, X.ARG8, X.ARG10, X.ARG12, X.ARG14 from Provenance.% as db, db.CONTAINS as cell, cell.WHERE*.INPUT as X where db.NAME glob "*/pc3.db" and cell.TABLE = "P2Detection" and cell.COLUMN = "imageID" and cell.ROW = "4" and X.NAME glob "*/java";
The query identifies the processes that wrote a particular cell (in this example, the fourth row of P2Detection, column imageID) together with the relevant command-line arguments, but it does not follow the cell's full input ancestry. That full ancestry would include every previous process that used the database, since the SQLite database file is both an input and an output of each workflow step, and would therefore include operations that were not strictly necessary.
Result:
{ { "LoadCSVFileIntoTable" "IsLoadedCSVFileIntoTableOutput_FileEntry2.xml" "CreateEmptyLoadDBOutput.xml" "ReadCSVFileColumnNamesOutput_FileEntry2.xml" "P2Detection" } }
Our system does not keep track of control flow that does not result in any data flow, unless we modify /bin/bash to insert custom annotations. For example, there is no data flow from IsCSVReadyFileExists to LoadCSVFileIntoTable, so we do not know whether a successful execution of IsCSVReadyFileExists was strictly necessary for a particular cell to appear in the database.
select count(java.ARG10) - 1 from Provenance.% as java where java.NAME glob "*/java" and java.ARG4 = "IsMatchTableColumnRanges";
In this query, we just count the number of invocations of IsMatchTableColumnRanges. According to the workflow specification, if we know that the last execution of IsMatchTableColumnRanges failed, the number of correctly loaded tables is just the number of successful invocations of IsMatchTableColumnRanges (which is the total number of invocations minus one).
Result:
2
select max(X.FREEZETIME) from Provenance.% as X where X.ARG4 = "IsExistsCSVFile";
select max(X.FREEZETIME) from Provenance.% as X where X.ARG4 = "IsMatchCSVFileTables";
We answer this query by getting the timestamps associated with the last executions of IsExistsCSVFile and IsMatchCSVFileTables, and computing their difference.
Result:
1239499744.640488611 - 1239499740.764267717 = 3.88 seconds
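The subtraction itself can be verified outside the provenance system, for example with bc:

```bash
# Difference between the two FREEZETIME values, in seconds.
echo "1239499744.640488611 - 1239499740.764267717" | bc
# prints 3.876220894
```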
select I.NAME from Provenance.% as db, db.CONTAINS as cell, cell.INPUT+ as I where db.NAME glob "*/pc3.db" and cell.TABLE = "P2Detection" and cell.COLUMN = "peakFlux" and cell.ROW = "8";
This query returns the entire ancestry graph of a particular cell (row 8, column peakFlux of P2Detection), which in PASS is equivalent to the why-provenance.
Result:
(omitted for brevity)
First, we find the timestamp of the last execution of a workflow operator.
select max(X.FREEZETIME) from Provenance.% as X where X.ARG0 = "./PSLoadExecutable.sh";
For example, if the result is "1239499744.632488155", we can then query again to find the actual process and its command-line arguments:
select X.ARGS from Provenance.% as X where X.ARG0 = "./PSLoadExecutable.sh" and X.FREEZETIME = "1239499744.632488155";
Result:
"./PSLoadExecutable.sh IsExistsCSVFile -o IsExistsCSVFileOutput_FileEntry2.xml -f FileEntry2.xml"
The set of successfully executed steps is just the set of all executed steps minus the failed step. We get the set of all steps using the following query:
select X.ARGS from Provenance.% as X where X.ARG0 = "./PSLoadExecutable.sh";
After subtracting the failed step determined by the optional query 6, we get the following set (in no particular order):
{ "./PSLoadExecutable.sh CreateEmptyLoadDB -o CreateEmptyLoadDBOutput.xml --job J062941" "./PSLoadExecutable.sh IsCSVReadyFileExists -o IsCSVReadyFileExistsOutput.xml --path /disk/disk1/challenge3/PC3/SampleData/J062941/" "./PSLoadExecutable.sh IsExistsCSVFile -o IsExistsCSVFileOutput_FileEntry0.xml -f FileEntry0.xml" "./PSLoadExecutable.sh IsExistsCSVFile -o IsExistsCSVFileOutput_FileEntry1.xml -f FileEntry1.xml" "./PSLoadExecutable.sh IsMatchCSVFileColumnNames -o IsMatchCSVFileColumnNamesOutput_FileEntry0.xml -f ReadCSVFileColumnNamesOutput_FileEntry0.xml" "./PSLoadExecutable.sh IsMatchCSVFileColumnNames -o IsMatchCSVFileColumnNamesOutput_FileEntry1.xml -f ReadCSVFileColumnNamesOutput_FileEntry1.xml" "./PSLoadExecutable.sh IsMatchCSVFileTables -o IsMatchCSVFileTablesOutput.xml -f ReadCSVReadyFileOutput.xml" "./PSLoadExecutable.sh IsMatchTableColumnRanges -o IsMatchTableColumnRangesOutput_FileEntry0.xml -f CreateEmptyLoadDBOutput.xml -f ReadCSVFileColumnNamesOutput_FileEntry0.xml -t P2FrameMeta" "./PSLoadExecutable.sh IsMatchTableColumnRanges -o IsMatchTableColumnRangesOutput_FileEntry1.xml -f CreateEmptyLoadDBOutput.xml -f ReadCSVFileColumnNamesOutput_FileEntry1.xml -t P2ImageMeta" "./PSLoadExecutable.sh IsMatchTableRowCount -o IsMatchTableRowCountOutput_FileEntry0.xml -f CreateEmptyLoadDBOutput.xml -f ReadCSVFileColumnNamesOutput_FileEntry0.xml -t P2FrameMeta" "./PSLoadExecutable.sh IsMatchTableRowCount -o IsMatchTableRowCountOutput_FileEntry1.xml -f CreateEmptyLoadDBOutput.xml -f ReadCSVFileColumnNamesOutput_FileEntry1.xml -t P2ImageMeta" "./PSLoadExecutable.sh LoadCSVFileIntoTable -o IsLoadedCSVFileIntoTableOutput_FileEntry0.xml -f CreateEmptyLoadDBOutput.xml -f ReadCSVFileColumnNamesOutput_FileEntry0.xml -t P2FrameMeta" "./PSLoadExecutable.sh LoadCSVFileIntoTable -o IsLoadedCSVFileIntoTableOutput_FileEntry1.xml -f CreateEmptyLoadDBOutput.xml -f ReadCSVFileColumnNamesOutput_FileEntry1.xml -t P2ImageMeta" "./PSLoadExecutable.sh ReadCSVFileColumnNames -o ReadCSVFileColumnNamesOutput_FileEntry0.xml -f FileEntry0.xml" "./PSLoadExecutable.sh ReadCSVFileColumnNames -o ReadCSVFileColumnNamesOutput_FileEntry1.xml -f FileEntry1.xml" "./PSLoadExecutable.sh ReadCSVReadyFile -o ReadCSVReadyFileOutput.xml --path /disk/disk1/challenge3/PC3/SampleData/J062941/" "./PSLoadExecutable.sh SplitList -o FileEntry?.xml -f ReadCSVReadyFileOutput.xml" "./PSLoadExecutable.sh UpdateComputedColumns -o IsUpdatedComputedColumnsOutput_FileEntry0.xml -f CreateEmptyLoadDBOutput.xml -f ReadCSVFileColumnNamesOutput_FileEntry0.xml -t P2FrameMeta" "./PSLoadExecutable.sh UpdateComputedColumns -o IsUpdatedComputedColumnsOutput_FileEntry1.xml -f CreateEmptyLoadDBOutput.xml -f ReadCSVFileColumnNamesOutput_FileEntry1.xml -t P2ImageMeta" }
select L.ARGS from Provenance.% as L where L.NAME = "./LoadWorkflow.sh" and L.TYPE = "PROC";
This query just returns all command-line arguments to LoadWorkflow.sh.
Result:
"./LoadWorkflow.sh J062941 /disk/disk1/challenge3/PC3/SampleData/J062941/"
select L.ARG1 from Provenance.% as L where (L.ARGS glob "*--path*" or L.ARGS glob "*--job*") and L.NAME = "./PSLoadExecutable.sh";
In our implementation, we pass the two user inputs to other processes via command-line arguments --path and --job, so our query just looks for the processes with at least one of these arguments. Equivalently, given the results of the previous query, we could instead search for all processes with arguments "J062941" or "/disk/disk1/challenge3/PC3/SampleData/J062941/".
Result:
{ "CreateEmptyLoadDB" "IsCSVReadyFileExists" "ReadCSVReadyFile" }
A particular detection value seems wrong. However, the workflow, queries, and the CSV files are correct, so it is possible that the error is due to something external to the workflow engine. Which shared libraries were involved in computing a given value in the database?
select lib.NAME from Provenance.% as db, db.CONTAINS as cell, cell.INPUT+ as lib where db.NAME glob "*/pc3.db" and cell.TABLE = "P2Detection" and cell.COLUMN = "peakFlux" and cell.ROW = "8" and lib.NAME glob "*.so*";
The query first finds the SQLite database file and then locates the provenance record that corresponds to the given cell. It then searches for all shared libraries within the ancestry of that cell.
Result:
{ "/challenge3/sqlite/java/libsqlite-java.so" "/etc/ld.so.cache" "/lib/libacl.so.1.1.0" "/lib/libattr.so.1.1.0" "/lib/libblkid.so.1.0" "/lib/libdevmapper.so.1.02" "/lib/libnss_mdns4_minimal.so.2" "/lib/libselinux.so.1" "/lib/libsepol.so.1" "/lib/libuuid.so.1.2" "/lib/tls/i686/cmov/libc-2.3.6.so" "/lib/tls/i686/cmov/libdl-2.3.6.so" "/lib/tls/i686/cmov/libm-2.3.6.so" "/lib/tls/i686/cmov/libnsl-2.3.6.so" "/lib/tls/i686/cmov/libnss_compat-2.3.6.so" "/lib/tls/i686/cmov/libnss_dns-2.3.6.so" "/lib/tls/i686/cmov/libnss_files-2.3.6.so" "/lib/tls/i686/cmov/libnss_nis-2.3.6.so" "/lib/tls/i686/cmov/libpthread-2.3.6.so" "/lib/tls/i686/cmov/libresolv-2.3.6.so" "/pmacko/pass/tools/challenge3/sqlite/java/libsqlite-java.so" "/usr/local/java/jdk1.5.0_16/jre/lib/i386/libjava.so" "/usr/local/java/jdk1.5.0_16/jre/lib/i386/libverify.so" "/usr/local/java/jdk1.5.0_16/jre/lib/i386/libzip.so" "/usr/local/java/jdk1.5.0_16/jre/lib/i386/native_threads/libhpi.so" "/usr/local/java/jdk1.5.0_16/jre/lib/i386/server/libjvm.so" }
-- PeterMacko - 22 May 2009
| Attachment | Size | Date | Who | Comment |
|---|---|---|---|---|
| J062941.opm | 6078.8 K | 13 Apr 2009 - 21:42 | PeterMacko | XML-formatted OPM |
| J062941_brief.opm | 3706.9 K | 13 Apr 2009 - 21:43 | PeterMacko | XML-formatted OPM without cell-level tracking |
| J062941_v2.opm | 20868.6 K | 22 May 2009 - 17:24 | PeterMacko | XML-formatted OPM, 2nd version |