A small, heavily commented piece of code explaining the process we used when recording a single OPM event in the workflow execution is available here. Additionally, a piece of code illustrating our approach to recording provenance about the loop structure of the workflow can be found here.
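For a flavor of what a single recorded event looks like, here is a minimal sketch (not the attached code; the node URIs are made up for illustration, and the attached example covers details such as timing and accounts) of writing one OPM "used" arc into the Tupelo context using the same Opm vocabulary queried elsewhere on this page:

```java
// Minimal sketch: record that a process used an artifact. The URIs below are
// illustrative only; the attached opmExample.txt shows the real recording code.
Resource process = PC3Utilities.ns("IsCSVReadyFileExistsProcess");
Resource artifact = PC3Utilities.ns("CSVRootPathResource");
// The "used" event is itself a node, so it can carry a role, timing, etc.
Resource usedEvent = PC3Utilities.ns("usedEvent0");
context.addTriple(usedEvent, Opm.USED_BY_PROCESS, process);
context.addTriple(usedEvent, Opm.USED_ARTIFACT, artifact);
```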
| Case | RDF/XML (Tupelo) | OPM XML v1.01.a |
|---|---|---|
| J062941 | J062941_output.rdf | J609241_output.xml |
| J062942 | J062942_output.rdf | J609242_output.xml |
| J062943 | J062943_output.rdf | J609243_output.xml |
| J062944 | J062944_output.rdf | J609244_output.xml |
| J062945 | J062945_output.rdf | J609245_output.xml |
```java
Unifier u = new Unifier();
u.setColumnNames("file", "path");
// Find every entity of rdf type CSV_file...
u.addPattern("file", Rdf.TYPE, PC3Utilities.ns("CSV_file"));
// ...and bind its PathToFile property to the "path" column.
u.addPattern("file", PC3Utilities.ns("PathToFile"), "path");
context.perform(u);
for (Tuple<Resource> r : u.getResult()) {
    System.out.println(r);
}
```
This yields the following output.
```
[http://pc3#FileEntryArtifact1, C:\dev\workspace\PC3\SampleData/J062941\P2_J062941_B001_P2fits0_20081115_P2ImageMeta.csv]
[http://pc3#FileEntryArtifact2, C:\dev\workspace\PC3\SampleData/J062941\P2_J062941_B001_P2fits0_20081115_P2Detection.csv]
[http://pc3#FileEntryArtifact0, C:\dev\workspace\PC3\SampleData/J062941\P2_J062941_B001_P2fits0_20081115_P2FrameMeta.csv]
```
Then, to answer the query, we accept a string representing the table name in question and use Tupelo's Unifier tool to find the process that was "PerformedOn" that table.
```java
boolean checkPerformed = false;
Unifier u = new Unifier();
u.setColumnNames("process");
// Find the process recorded as having been "PerformedOn" the given table name s.
u.addPattern("process", PC3Utilities.ns("PerformedOn"), Resource.literal(s));
context.perform(u);
if (u.getResult().iterator().hasNext()) {
    String process = u.getResult().iterator().next().get(0).toString();
    // Only the range-check conditionals carry a "PerformedOn" property.
    assertTrue(process.substring(process.lastIndexOf("#") + 1, process.length() - 1)
            .equals("IsMatchTableColumnRangesConditionals"));
    checkPerformed = true;
}
if (checkPerformed) {
    System.out.println("IsMatchTableColumnRanges was performed on table " + s);
}
```
Answering the query in this way allows us to say not only that the check was performed on the table, but also that the check was successful. The assertion should always hold, since we recorded provenance such that any process with a "PerformedOn" property will be among the IsMatchTableColumnRangesConditionals processes.
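A sketch of the recording-side convention this relies on follows (illustrative only; the method and parameter names are hypothetical, and only the "PerformedOn" triple itself comes from the code above): the property is recorded only when the check succeeds, so its presence later implies success.

```java
// Sketch of the recording convention (hypothetical method and parameters):
// the "PerformedOn" triple is added only when the check succeeds, so finding
// it in the provenance later implies the check passed.
void recordRangeCheck(boolean checkPassed, Resource conditionalProcess, String tableName)
        throws Exception {
    if (checkPassed) {
        context.addTriple(conditionalProcess,
                PC3Utilities.ns("PerformedOn"), Resource.literal(tableName));
    }
}
```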
With this understanding, we answered this query as follows. We used a Tupelo tool called a Transformer to query the graph and assert new relations on the results. In particular, we created a new relation called "InferredTrigger", one derived after the fact rather than recorded by the provenance API during workflow execution.
```java
Transformer t;

// A process triggered by another process causally depends on it: collapse
// each OPM "wasTriggeredBy" edge into a direct InferredTrigger arc.
t = new Transformer();
t.addInPattern("t", Opm.TRIGGERED_PROCESS, "p1");
t.addInPattern("t", Opm.TRIGGERED_BY_PROCESS, "p2");
t.addOutPattern("p1", PC3Utilities.ns("InferredTrigger"), "p2");
context.perform(t);
context.addTriples(t.getResult());

// A process that used an artifact generated by another process also causally
// depends on it: collapse each generated-by/used chain into an InferredTrigger arc.
t = new Transformer();
t.addInPattern("g", Opm.GENERATED_BY_PROCESS, "p2");
t.addInPattern("g", Opm.GENERATED_ARTIFACT, "a");
t.addInPattern("u", Opm.USED_ARTIFACT, "a");
t.addInPattern("u", Opm.USED_BY_PROCESS, "p1");
t.addOutPattern("p1", PC3Utilities.ns("InferredTrigger"), "p2");
context.perform(t);
context.addTriples(t.getResult());
```
The InferredTrigger relation is an arc from one process to another, asserting that the first process causally depends on the second. Once that relation was constructed, we used another Tupelo tool called TransitiveClosure, which requires two parameters: an entity called the seed, and the relation on which to close. With those two properties set, we closed the InferredTrigger relation using LoadCSVFileIntoTableProcess1 as the seed.
```java
TransitiveClosure tc;
tc = new TransitiveClosure();
// Close the InferredTrigger relation starting from LoadCSVFileIntoTableProcess1.
tc.setSeed(PC3Utilities.ns("LoadCSVFileIntoTableProcess1"));
tc.setPredicate(PC3Utilities.ns("InferredTrigger"));
Set<Resource> result = tc.close(context);
// Record every transitive dependency as a direct InferredTrigger arc from the seed.
for (Resource r : result) {
    context.addTriple(PC3Utilities.ns("LoadCSVFileIntoTableProcess1"),
            PC3Utilities.ns("InferredTrigger"), r);
}
```

Finally, we returned all the processes related to LoadCSVFileIntoTableProcess1 by an InferredTrigger arc.
```java
Unifier u = new Unifier();
u.setColumnNames("p");
// Retrieve every process reachable from the seed via an InferredTrigger arc.
u.addPattern(PC3Utilities.ns("LoadCSVFileIntoTableProcess1"),
        PC3Utilities.ns("InferredTrigger"), "p");
context.perform(u);
System.out.println("Suggested Query Five Results:");
for (Tuple<Resource> r : u.getResult()) {
    System.out.println("\t" + r);
}
```
These processes are the answer to the query.
This is our output:
```
[http://pc3#IsCSVReadyFileExistsProcess]
[http://pc3#LoadCSVFileIntoTableConditional0]
[http://pc3#IsExistsCSVFileConditional1]
[http://pc3#LoadCSVFileIntoTableProcess0]
[http://pc3#IsExistsCSVFileProcess1]
[http://pc3#UpdateComputedColumnsConditional0]
[http://pc3#Iterator1]
[http://pc3#ReadCSVFileColumnNamesProcess0]
[http://pc3#IsMatchCSVFileTablesConditional]
[http://pc3#IsCSVReadyFileExistsConditional]
[http://pc3#UpdateComputedColumnsProcess0]
[http://pc3#IsMatchTableRowCountProcess0]
[http://pc3#IsExistsCSVFileProcess0]
[http://pc3#IsMatchCSVFileTablesProcess]
[http://pc3#ReadCSVFileColumnNamesProcess1]
[http://pc3#IsMatchCSVFileColumnNamesProcess1]
[http://pc3#IsMatchTableColumnRangesProcess0]
[http://pc3#IsMatchTableColumnRangesConditional0]
[http://pc3#IsMatchCSVFileColumnNamesProcess0]
[http://pc3#CreateEmptyLoadDBProcess]
[http://pc3#Iterator0]
[http://pc3#IsMatchCSVFileColumnNamesConditional0]
[http://pc3#ReadCSVReadyFileProcess]
[http://pc3#IsMatchCSVFileColumnNamesConditional1]
[http://pc3#IsExistsCSVFileConditional0]
[http://pc3#IsMatchTableRowCountConditional0]
```
To find the loop iterations that actually occurred, we walked the provenance graph from a given node and collected every reachable entity of rdf type loop_iteration:

```java
void _findIterators(GraphNode n, Set<Resource> beenThere, Set<Resource> result,
        boolean findOne) throws Exception {
    Resource subject = n.getSubject();
    // Stop on a cycle, or as soon as one iteration is found when findOne is set.
    if (beenThere.contains(subject) || (findOne && !result.isEmpty())) {
        return;
    }
    beenThere.add(subject);
    for (GraphEdge edge : n.getOutgoingEdges()) {
        // Keep any reachable entity of rdf type loop_iteration.
        if (session.fetchThing(edge.getSink().getSubject()).getTypes()
                .contains(PC3Utilities.ns("loop_iteration"))) {
            result.add(edge.getSink().getSubject());
        }
        _findIterators(edge.getSink(), beenThere, result, findOne);
    }
}
```
Note that one cannot simply select every entity of rdf type loop_iteration from the Tupelo context and take that count minus one as the answer to the query, since it is possible to create many entities of rdf type loop_iteration long before they are used in the provenance trail.
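For contrast, the naive count that the note above warns against would look like the following sketch, built from the same Unifier pattern used earlier on this page. It tallies every loop_iteration entity in the context, whether or not that entity ever appears in the provenance trail:

```java
// Naive (and, per the note above, incorrect) approach: count every entity of
// rdf type loop_iteration in the context, reachable or not.
Unifier u = new Unifier();
u.setColumnNames("iteration");
u.addPattern("iteration", Rdf.TYPE, PC3Utilities.ns("loop_iteration"));
context.perform(u);
int count = 0;
for (Tuple<Resource> r : u.getResult()) {
    count++;
}
System.out.println("Naive iteration count: " + (count - 1));
```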
Once we had those two arcs, answering this query was a matter of extracting the intervals of time within which both events were known to have occurred and returning the lower and upper bounds on the time that may have elapsed between them.
```java
ProvenanceContextFacade pcf = new ProvenanceContextFacade();
pcf.setContext(context);
ProvenanceArcSeeker s = new ProvenanceArcSeeker(pcf);

// Find the "used" arc for the successful IsMatchCSVFileTables check...
Collection<ProvenanceUsedArc> matchArcs = s.findUsedEdges(
        pcf.getArtifact(PC3Utilities.ns("IsMatchCSVFileTablesArtifact")),
        pcf.getProcess(PC3Utilities.ns("IsMatchCSVFileTablesConditional")),
        null,
        pcf.getAccount(PC3Utilities.ns("account")));
assertTrue(matchArcs.size() == 1);
ObservedTime matchTime = ((ProvenanceRoleArc) matchArcs.toArray()[0]).getInterval();

// ...and the "used" arc for the unsuccessful IsExistsCSVFile check.
Collection<ProvenanceUsedArc> existsArcs = s.findUsedEdges(
        pcf.getArtifact(PC3Utilities.ns("IsExistsCSVFileArtifact1")),
        pcf.getProcess(PC3Utilities.ns("IsExistsCSVFileConditional1")),
        null,
        pcf.getAccount(PC3Utilities.ns("account")));
assertTrue(existsArcs.size() == 1);
ObservedTime existTime = ((ProvenanceRoleArc) existsArcs.toArray()[0]).getInterval();

// The elapsed time is bounded below by the gap between the earliest possible
// second event and the latest possible first event, and above by the converse.
long lowerBoundElapsed = existTime.getNoEarlier().getTime() - matchTime.getNoLater().getTime();
if (lowerBoundElapsed < 0) {
    lowerBoundElapsed = 0;
}
long upperBoundElapsed = existTime.getNoLater().getTime() - matchTime.getNoEarlier().getTime();
System.out.println("Somewhere between [" + lowerBoundElapsed + ", " + upperBoundElapsed
        + "] milliseconds elapsed between the successful IsMatchCSVFileTables"
        + " and the unsuccessful IsExistsCSVFile.");
```

On a particular run of the workflow and query, we generated the following output:
```
Somewhere between [9988, 10007] milliseconds elapsed between the successful IsMatchCSVFileTables and the unsuccessful IsExistsCSVFile.
```
To answer this, we wrote a utility that takes as parameters two graphs, two accounts, and two nodes. The utility returns true if the account contained in the first graph is contained in the account contained in the second. More precisely, beginning at the given nodes, one node in the first graph and one in the second, continue if the set of account-specific arcs on the first node is a subset of the account-specific arcs on the second. Here arc equality is defined as having the same source and sink and, where applicable, the same role; two nodes are equal if they have the same name. To continue, recursively follow each of the account-specific arcs and verify that the set of account-specific arcs of each node in the first graph is contained in the set of account-specific arcs of the corresponding node in the second graph. If we have visited every node in the first graph and the set of arcs on each node satisfies the set-containment requirement, then the account contained in the first graph is a subgraph of the account contained in the second.
The code that accomplishes this can be found here.
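A minimal sketch of the recursion follows. It is not the attached code: the `getRole()` accessor on GraphEdge is assumed for illustration, node equality is approximated by subject equality, and account filtering is omitted. Only the overall shape matches the description above.

```java
// Sketch only: getRole() is an assumed accessor, not necessarily Tupelo's
// actual API. Arc equality = same source, same sink, and (where applicable)
// same role; the sources are the two nodes currently being compared.
boolean isContained(GraphNode first, GraphNode second, Set<Resource> visited)
        throws Exception {
    if (visited.contains(first.getSubject())) {
        return true; // already checked along another path
    }
    visited.add(first.getSubject());
    for (GraphEdge a : first.getOutgoingEdges()) {
        GraphEdge match = null;
        for (GraphEdge b : second.getOutgoingEdges()) {
            // Arcs match when their sinks have the same name and any roles agree.
            if (a.getSink().getSubject().equals(b.getSink().getSubject())
                    && (a.getRole() == null || a.getRole().equals(b.getRole()))) {
                match = b;
                break;
            }
        }
        if (match == null) {
            return false; // an arc of the first account is missing from the second
        }
        // Recurse along the matched arc.
        if (!isContained(a.getSink(), match.getSink(), visited)) {
            return false;
        }
    }
    return true;
}
```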
As a consequence of what we record when the workflow halts, and as we expected, it turns out that the provenance account given in a halted run of the workflow is indeed contained in the provenance account of a successful run of the workflow.
The new approach to this query is as follows. We assert that a user's input is of rdf type User_Input and return only provenance elements of this type.
```java
Set<Resource> result = new HashSet<Resource>();
Unifier u = new Unifier();
u.setColumnNames("artifact");
// Every user input was asserted to be of rdf type User_Input when recorded.
u.addPattern("artifact", Rdf.TYPE, PC3Utilities.ns("User_Input"));
context.perform(u);
for (Tuple<Resource> tuple : u.getResult()) {
    System.out.println(tuple);
    result.add(tuple.get(0));
}
return result;
```
Result:
```
http://pc3#CSVRootPathResource
http://pc3#JobIDresource
```
Our original approach to the query follows below.
To answer this query, we walk the graph down from the final entity produced by the workflow execution, in this case the DisplayPlotProcess, and keep any entities that do not causally depend on any other entity in the graph (i.e., those graph nodes with no outgoing edges).
```java
void findInputs(GraphNode n, Set<Resource> beenThere, Set<Resource> result)
        throws ModelException, IOException, OperatorException {
    Resource subject = n.getSubject();
    if (beenThere.contains(subject)) {
        return; // already visited; avoid cycles
    }
    beenThere.add(subject);
    for (GraphEdge edge : n.getOutgoingEdges()) {
        // A node with no outgoing edges depends on nothing else: it is an input.
        if (edge.getSink().getOutgoingEdges().isEmpty()) {
            result.add(edge.getSink().getSubject());
        }
        findInputs(edge.getSink(), beenThere, result);
    }
}
```
Result:
```
http://pc3#CSVRootPathResource
http://pc3#JobIDresource
```
To find the processes that consumed those user inputs, we looked up, for each input artifact, the processes that used it:

```java
ProvenanceContextFacade pcf = new ProvenanceContextFacade();
Set<Resource> result = new HashSet<Resource>();
pcf.setContext(context);
Collection<ProvenanceUsedArc> arcs = new HashSet<ProvenanceUsedArc>();
for (Resource input : userInputs) {
    // Fetch every "used" arc whose artifact is this user input...
    arcs = pcf.getUsedBy(pcf.getArtifact(input));
    for (ProvenanceUsedArc arc : arcs) {
        // ...and keep the process on the other end of the arc.
        result.add(((RdfProvenanceElement) arc.getProcess()).getSubject());
    }
}
for (Resource r : result) {
    System.out.println(r);
}
```

Result:
```
http://pc3#IsCSVReadyFileExistsProcess
http://pc3#CreateEmptyLoadDBProcess
http://pc3#ReadCSVReadyFileProcess
```
-- RobertClark - 11 May 2009
-- RobertClark - 13 Apr 2009
-- RobertClark - 04 Mar 2009
| Attachment | Size | Date | Who |
|---|---|---|---|
| dot.gif | 519.1 K | 10 Apr 2009 - 19:59 | RobertClark |
| opmExample.txt | 2.3 K | 27 Apr 2009 - 15:32 | RobertClark |
| loopIteration.txt | 6.0 K | 27 Apr 2009 - 16:28 | RobertClark |
| J062941_output.rdf | 71.5 K | 11 May 2009 - 14:05 | RobertClark |
| J609241_output.xml | 49.2 K | 11 May 2009 - 14:06 | RobertClark |
| J062942_output.rdf | 71.0 K | 11 May 2009 - 14:06 | RobertClark |
| J609242_output.xml | 48.9 K | 11 May 2009 - 14:06 | RobertClark |
| J062943_output.rdf | 70.9 K | 11 May 2009 - 14:07 | RobertClark |
| J609243_output.xml | 48.9 K | 11 May 2009 - 14:07 | RobertClark |
| J062944_output.rdf | 70.9 K | 11 May 2009 - 14:07 | RobertClark |
| J609244_output.xml | 48.9 K | 11 May 2009 - 14:07 | RobertClark |
| J062945_output.rdf | 71.0 K | 11 May 2009 - 14:08 | RobertClark |
| J069245_output.xml | 48.9 K | 11 May 2009 - 14:10 | RobertClark |
| PC3Utilities.java | 4.3 K | 11 May 2009 - 18:41 | RobertClark |