There is a directory in the Swift repository at http://trac.ci.uchicago.edu/swift/browser/SwiftApps/pc3 where the Swift representation of the challenge workflow lives.
Older OPM output for the challenge workflow is at http://www.ci.uchicago.edu/~benc/opm-20090428.xml
Older more broken and less informative output: http://www.ci.uchicago.edu/~benc/opm-20090419.xml corresponding to the version of the workflow in SVN r2724
Swift does not record database provenance data, so we used annotations in the application database to record which process inserted or modified each row.
Suppose we want to determine the provenance of the detection that has the identifier 261887481030000003, the first query can be answered in the following manner:
select provenanceid from ipaw.p2detectionprov where detectid = 261887481030000003; tag:benc@ci.uchicago.edu,2008:swiftlogs:execute2:pc3-20090507-1008-q4dpcm28:ps_load_executable_db_app-b2bclgaj
select execute_id from execute2s where id = 'tag:benc@ci.uchicago.edu,2008:swiftlogs:execute2:pc3-20090507-1140-z7ebbrz0:ps_load_executable_db_app-8d52pgaj'; tag:benc@ci.uchicago.edu,2008:swiftlogs:execute:pc3-20090507-1140-z7ebbrz0:0-5-5-1-5-1-2-0
select filename from trans, dataset_filenames where after='tag:benc@ci.uchicago.edu,2008:swiftlogs:execute:pc3-20090507-1140-z7ebbrz0:0-5-5-1-5-1-2-0' and before=dataset_id and filename like '%split%'; file://localhost/_concurrent/split_list_output-65fe229c-2da2-4054-997e-fb167b8c30ed--array//elt-3 file://localhost/_concurrent/split_list_output-65fe229c-2da2-4054-997e-fb167b8c30ed--array//elt-2 file://localhost/_concurrent/split_list_output-65fe229c-2da2-4054-997e-fb167b8c30ed--array//elt-1
P2_J062941_B001_P2fits0_20081115_P2Detection.csv, P2_J062941_B001_P2fits0_20081115_P2ImageMeta.csv, P2_J062941_B001_P2fits0_20081115_P2FrameMeta.csv
q2.sh in the SVN runs this query.
The guts of the query is this SQL:
select dataset_values.value from processes, invocation_procedure_names, dataset_usage, dataset_values where type='compound' and procedure_name='is_match_table_column_ranges' and dataset_usage.direction='O' and dataset_usage.param_name='inputcontent' and processes.id = invocation_procedure_names.execute_id and dataset_usage.process_id = processes.id and dataset_usage.dataset_id = dataset_values.dataset_id;
This returns the input parameter XML for all is_match_table_column_ranges calls. These are XML values, and it is necessary to examine the resulting XML to determine if it was invoked for the specific table. There is unpleasant cross-format joining necessary here to get an actual yes/no result properly, although probably could use a LIKE clause to peek inside the value.
This uses the additional annotations made for query 1. These annotations only store which process originally inserted a row, not which processes have modified a row. So to some extent, rows are regarded a bit like artifacts (though not first order artifacts in the provenance database); and we can only answer questions about the provenance of rows, not the individual fields within those rows. That is sufficient for this query, though.
First find the row that contains the interesting value and extract its IMAGEID. Then find the process that created the image ID by querying the derby database table P2IMAGEPROV:
ij> select * from ipaw.p2imageprov where imageid=6294301; IMAGEID |PROVENANCEID ----------------------------------------------------------------------------------------------------------------------------------------------------- 6294301 |tag:benc@ci.uchicago.edu,2008:swiftlogs:execute2:pc3-20090519-2057d8dyi9o9:ps_load_executable_db_app-dpc8q1bj
Now we have a process ID for the process that created the row.
Now query the transitive closure table for all predecessors for that process (as in q1). This will produce all processes and artifacts that preceeded this row creation.
Our answer differs from the sample answer because we have sequenced access to the db, rather than regarding each row as a proper first-order artifact. The entire DB state at a particular time is a successor to all previous database accessing operations, so any process which led to any database access before the row in question is regarded as a necessary operations. This is undesirable in some respects, but desirable in others. For example, a row insert only works because previous database operations which inserted other rows did not insert a conflicting primary key - so there is data dependency between the different operations even though they operate on different rows.
This counts how many load processes are known to the database (over all recorded workflows)
select count(*) from invocation_procedure_names where procedure_name='load_csv_file_into_table';
This can be restricted to a particular workflow run like this:
sqlite> select count(process_id) from invocation_procedure_names,processes_in_workflows where procedure_name='load_csv_file_into_table' and workflow_id='tag:benc@ci.uchicago.edu,2008:swiftlogs:execute:pc3-20090519-1659-jqc5od2f:run' and invocation_procedure_names.execute_id = processes_in_workflows.process_id; 3
In our Swift representation of the workflow, we control flow dataflow dependencies. So many of the activities that could be commuted are in our implementation run in parallel. One significant thing can't describe in SwiftScript? (and so cannot answer from the provenance database using this method) is commuting operations on the database. From a Swift perspective, this is a limitation of our SwiftScript? language rather than in the provenance implementation, benc thinks.
The query lists which pairs unix process executions (of which there are 50x50) have no data dependencies on each other. There are 2082 rows. The base SQL query is this:
select L.id, R.id from processes as L, processes as R where L.type='execute' and R.type='execute' and NOT EXISTS (select * from trans where before=L.id and after=R.id);
This answer is deficient in a few ways. We do not take into account non-execute procedures (such as compound procedures, function invocations, and operator executions) - there are 253 processes in total, 50 being executes and the remaineder being the other kinds of process. If we did that naively, we would not take into account compound procedures which contain other procedures (due to lack of decent support for nested processes - something like OPM accounts) and would come up with commutations which do not make sense.
To v1.01.a.xml schema definition, make time be represented by dataTime XSD type. We already do this in our OPM above, meaning that it should not validate with v1.01.a.xml.
Better support for hierarchies - of collections of artifacts; and of processes. Swift makes heavy use of both in its input language, SwiftScript?, though this information is not recorded explicitly in the provenance database.