Using PASOA for Pegasus
This page describes how the data contained in the logs of running workflows (Condor DAGs) generated by Pegasus can be translated into the data structure used by PASOA. Once this translation is made, the querying facilities of PASOA can be used to determine the provenance of the result and intermediate data produced in such workflows.
From this first mapping step, we can then move on to recording more detailed data directly into the PASOA data model as a workflow runs, recording the transfer of data during workflow execution, and including the generation of the workflow itself through refinements as part of the provenance.
Interaction Records
In PASOA, the process documentation (the data used for determining the provenance of items) follows a schema called the p-structure. The p-structure contains a set of interaction records, each documenting the sending of data or control from one component to another. This structure is shown below.
<pstruct xmlns="http://www.pasoa.org/schemas/version025/PStruct.xsd">
<interactionRecord>
...
</interactionRecord>
<interactionRecord>
...
</interactionRecord>
...
</pstruct>
For each job in the DAX, there are two interaction records in the p-structure: one for the invocation of the job and one for its completion. We refer to these below as the invocation interaction and the completion interaction for a job, respectively.
Interaction Keys
Every interaction record has a globally unique identifier called an interaction key. An interaction key comprises three parts: the address from which the interaction was sent (message source), the address at which the interaction was received (message sink), and an additional interaction ID that distinguishes between different interactions with these addresses. We commonly use the WS-Addressing endpoint reference schema for addresses, though this is not compulsory.
<interactionRecord>
<interactionKey>
<messageSource>
<Address xmlns = "http://schemas.xmlsoap.org/ws/2004/03/addressing"> ... </Address>
</messageSource>
<messageSink>
<Address xmlns = "http://schemas.xmlsoap.org/ws/2004/03/addressing"> ... </Address>
</messageSink>
<interactionId> ... </interactionId>
</interactionKey>
...
</interactionRecord>
We suggest the following mapping for the invocation interaction of a job:
- The message source is http://www.pegasus.edu/CondorDAGMan
- The message sink is the URL of the derivation the job implements, e.g. fmri:convert:1, where the namespace fmri is expanded to its assigned URL
- The interaction ID is the concatenation of the workflow name, the DAX creation time and the job's id attribute, e.g. fmri2006-09-06T17:34:27-07:00Node_convertZ
<interactionKey>
<messageSource>
<Address xmlns = "http://schemas.xmlsoap.org/ws/2004/03/addressing">http://www.pegasus.edu/CondorDAGMan</Address>
</messageSource>
<messageSink>
<Address xmlns = "http://schemas.xmlsoap.org/ws/2004/03/addressing">http://fmri#slicer:1</Address>
</messageSink>
<interactionId>fmri2006-09-06T17:34:27-07:00Node_slicerZ</interactionId>
</interactionKey>
The completion interaction would have the same key as the invocation interaction but with the message source and message sink swapped.
<interactionKey>
<messageSource>
<Address xmlns = "http://schemas.xmlsoap.org/ws/2004/03/addressing">http://fmri#slicer:1</Address>
</messageSource>
<messageSink>
<Address xmlns = "http://schemas.xmlsoap.org/ws/2004/03/addressing">http://www.pegasus.edu/CondorDAGMan</Address>
</messageSink>
<interactionId>fmri2006-09-06T17:34:27-07:00Node_slicerZ</interactionId>
</interactionKey>
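The key mapping above can be sketched in a few lines of Python. This is only an illustrative sketch, not part of PASOA or Pegasus: the names InteractionKey, invocation_key and completion_key are hypothetical, and the derivation URL is passed in already expanded.

```python
from dataclasses import dataclass

# Address suggested on this page for Condor DAGMan.
DAGMAN_ADDRESS = "http://www.pegasus.edu/CondorDAGMan"


@dataclass(frozen=True)
class InteractionKey:
    """Hypothetical in-memory form of an interaction key."""
    message_source: str
    message_sink: str
    interaction_id: str


def invocation_key(workflow, dax_time, job_id, derivation_url):
    """Key of the invocation interaction: DAGMan sends to the job."""
    # Interaction ID = workflow name + DAX creation time + job id.
    interaction_id = f"{workflow}{dax_time}{job_id}"
    return InteractionKey(DAGMAN_ADDRESS, derivation_url, interaction_id)


def completion_key(invocation):
    """Key of the completion interaction: same ID, source and sink swapped."""
    return InteractionKey(
        invocation.message_sink,
        invocation.message_source,
        invocation.interaction_id,
    )


inv = invocation_key(
    "fmri", "2006-09-06T17:34:27-07:00", "Node_slicerZ", "http://fmri#slicer:1"
)
comp = completion_key(inv)
print(inv.interaction_id)
print(comp.message_source, "->", comp.message_sink)
```

Deriving the completion key from the invocation key, rather than building it separately, keeps the two records of a job tied to the same interaction ID.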
Views
Every interaction record contains two views: the sender view and the receiver view. These contain documentation provided by each of the two actors in the interaction, the sender and the receiver. The items of documentation are called p-assertions and come in three forms, described below.
<interactionRecord>
...
<sender>
...
</sender>
<receiver>
...
</receiver>
</interactionRecord>
The sender in an invocation interaction and receiver in a completion interaction is always Condor DAGMan. The receiver in an invocation interaction and sender in a completion interaction is the application/job being executed.
Asserter Identities
In addition to p-assertions, every view contains the asserter identity of the actor that is recording documentation in that view. The asserter identity is an arbitrary fragment of XML. In a secured system, this could contain the distinguished name (DN) of the actor.
<sender>
<asserter>
...
</asserter>
...
</sender>
We suggest that, for this mapping, the asserter identities be the same as the respective message source (for sender view) or message sink (for receiver view).
<asserter>
<Address xmlns = "http://schemas.xmlsoap.org/ws/2004/03/addressing">http://fmri#slicer:1</Address>
</asserter>
Interaction P-Assertions
An interaction p-assertion documents the data content exchanged in an interaction. Both the sender and the receiver in the interaction can record an interaction p-assertion; the two are normally expected to have the same content, which allows communication and other failures to be detected from the documentation. An interaction p-assertion has content, which is the data exchanged, as an XML fragment, and a documentation style, which is a URL used to identify how the data has been encoded in the p-assertion's content. Additionally, every p-assertion has a local p-assertion ID, an identifier that is different for every p-assertion in a view.
<sender>
...
<interactionPAssertion>
<localPAssertionId> ... </localPAssertionId>
<documentationStyle> ... </documentationStyle>
<content>
...
</content>
</interactionPAssertion>
...
</sender>
The invocation interaction should declare the application called and the data (physical filenames) that are used as input to the job. The application is declared in the job submission file.
arguments = "-n fmri::convert:1 -N null -R skynet -L fmri -T 2006-09-06T17:34:27-07:00 /nfs/software/imagemagick/default/bin/convert atlas-z.pgm atlas-z.gif"
The logical filenames and whether they are inputs can be found in the DAX file. For example, in the snippet below, the atlas-z.pgm logical filename is declared to be an input of the Node_convertZ job. The physical filename is the workflow run directory plus the logical filename (this still needs to be confirmed).
<job id="Node_convertZ" namespace="fmri" name="convert" version="1">
<argument><filename file="atlas-z.pgm"/> <filename file="atlas-z.gif"/></argument>
<uses file="atlas-z.pgm" link="input"/>
<uses file="atlas-z.gif" link="output"/>
</job>
We devise an XML format for the representation of the input data for a job, and a documentation style URL to refer to this format. We suggest a local p-assertion ID of "1" for every interaction p-assertion. The interaction p-assertion for the invocation interaction of the job above would be:
<interactionPAssertion>
<localPAssertionId>1</localPAssertionId>
<documentationStyle>http://www.pasoa.org/Pegasus/invocationStyle</documentationStyle>
<content>
<invocation xmlns="http://www.pasoa.org/Pegasus/invocationStyle">
<application>/nfs/software/imagemagick/default/bin/convert</application>
<input>/workflow/run/fmri0254/atlas-z.pgm</input>
</invocation>
</content>
</interactionPAssertion>
Similarly, the interaction p-assertion for the completion interaction would record the outputs.
<interactionPAssertion>
<localPAssertionId>1</localPAssertionId>
<documentationStyle>http://www.pasoa.org/Pegasus/completionStyle</documentationStyle>
<content>
<completion xmlns="http://www.pasoa.org/Pegasus/completionStyle">
<application>/nfs/software/imagemagick/default/bin/convert</application>
<output>/workflow/run/fmri0254/atlas-z.gif</output>
</completion>
</content>
</interactionPAssertion>
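As a rough illustration of this step, the following Python sketch reads the job element shown earlier and builds the content of the invocation interaction p-assertion. The function names are hypothetical, the job element is parsed without an XML namespace exactly as it appears in the snippet on this page (a real DAX file may declare one), and the physical filename is assumed to be the run directory plus the logical filename.

```python
import xml.etree.ElementTree as ET

# Documentation style URL suggested on this page, also used as the namespace.
INVOCATION_NS = "http://www.pasoa.org/Pegasus/invocationStyle"

DAX_JOB = """<job id="Node_convertZ" namespace="fmri" name="convert" version="1">
  <argument><filename file="atlas-z.pgm"/> <filename file="atlas-z.gif"/></argument>
  <uses file="atlas-z.pgm" link="input"/>
  <uses file="atlas-z.gif" link="output"/>
</job>"""


def files_by_link(job, link):
    """Logical filenames of a job whose 'uses' element has the given link."""
    return [u.get("file") for u in job.findall("uses") if u.get("link") == link]


def invocation_content(job, application, run_dir):
    """Build the <invocation> content: the application plus one <input> per input file."""
    root = ET.Element(f"{{{INVOCATION_NS}}}invocation")
    ET.SubElement(root, f"{{{INVOCATION_NS}}}application").text = application
    for name in files_by_link(job, "input"):
        # Assumption: physical filename = run directory + logical filename.
        ET.SubElement(root, f"{{{INVOCATION_NS}}}input").text = f"{run_dir}/{name}"
    return root


job = ET.fromstring(DAX_JOB)
content = invocation_content(
    job, "/nfs/software/imagemagick/default/bin/convert", "/workflow/run/fmri0254"
)
inputs = [e.text for e in content.findall(f"{{{INVOCATION_NS}}}input")]
outputs = files_by_link(job, "output")
print(ET.tostring(content, encoding="unicode"))
```

The completion content could be built the same way from the files whose link is "output".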
Actor State P-Assertions
An actor state p-assertion contains data regarding the state of an actor at an instant, or the resources used in a computation. A view can contain any number of actor state p-assertions regarding the actor asserting that view.
<sender>
...
<actorStatePAssertion>
<localPAssertionId> ... </localPAssertionId>
<content>
...
</content>
</actorStatePAssertion>
...
</sender>
The kickstart log files are equivalent to actor state p-assertions. For each completion interaction of a job, two actor state p-assertions can be made, the contents of which will be the contents of the .out and .err files for that job. For instance, in the completion interaction for job Node_convertZ, there will be the contents of convert_Node_convertZ.out and convert_Node_convertZ.err. The actor state p-assertions will be in the sender view of the interaction record. If the .err file is empty, then its p-assertion can be omitted.
We suggest that the actor state p-assertion containing the .out document has a local p-assertion ID of 2 and the .err p-assertion has a local p-assertion ID of 3.
<actorStatePAssertion>
<localPAssertionId>2</localPAssertionId>
<content>
<invocation xmlns="http://www.griphyn.org/chimera/Invocation" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.griphyn.org/chimera/Invocation http://www.griphyn.org/chimera/iv-1.7.xsd" version="1.7" start="2006-09-06T17:56:59.726-07:00" duration="0.067" transformation="fmri::convert:1" derivation="null" resource="skynet" wf-label="fmri" wf-stamp="2006-09-06T17:34:27-07:00" hostaddr="128.9.233.102" hostname="skynet-92.pegasus.edu" pid="25230" uid="1006" user="gmehta" gid="1006" group="gmehta">
<mainjob start="2006-09-06T17:56:59.728-07:00" duration="0.065" pid="25231">
<usage utime="0.020" stime="0.010" minflt="812" majflt="0" nswap="0" nsignals="0" nvcsw="114" nivcsw="0"/>
...
</invocation>
</content>
</actorStatePAssertion>
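The selection of actor state p-assertions for a completion interaction might be sketched as follows. Both function names are hypothetical. The file naming pattern (transformation name, underscore, job id) is inferred from the single example on this page, and treating a whitespace-only .err file as empty is an assumption.

```python
def kickstart_log_names(transformation_name, job_id):
    """File names of the kickstart logs, following the example on this page."""
    base = f"{transformation_name}_{job_id}"
    return f"{base}.out", f"{base}.err"


def actor_state_assertions(out_text, err_text):
    """Return (local p-assertion ID, content) pairs for the sender view.

    The .out content gets ID 2; the .err content gets ID 3 and is
    omitted when the file is empty.
    """
    assertions = [("2", out_text)]
    if err_text.strip():  # assumption: whitespace-only counts as empty
        assertions.append(("3", err_text))
    return assertions


names = kickstart_log_names("convert", "Node_convertZ")
only_out = actor_state_assertions("<invocation/>", "")
both = actor_state_assertions("<invocation/>", "warning: something")
print(names)
```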
Relationship P-Assertions
A relationship p-assertion describes the causal connection between events and data, e.g. that one event triggered another or that one data item was derived from another. Every relationship p-assertion connects a single effect (the subject) to one or more causes (the objects). The relationship can be typed, with a relation URL, to declare exactly how the effect is related to the causes, e.g. that one data item was produced by a particular function over other data items.
A subject or object is a data item contained within the contents of an interaction p-assertion. Each is identified by the following:
- The interaction key of the interaction record in which the interaction p-assertion is contained
- The view kind (sender or receiver) of the view in which the interaction p-assertion is contained
- The local p-assertion ID of the interaction p-assertion
- The data accessor indicating where the item appears within the interaction p-assertion's content, i.e. an unambiguous XPath to the item
The relationship p-assertion is recorded in the same view as the subject of the relationship. Therefore, of the above identifiers, the interaction key and view kind are not required for the subject.
<sender>
...
<relationshipPAssertion>
<localPAssertionId> ... </localPAssertionId>
<subjectId>
<localPAssertionId> ... </localPAssertionId>
<dataAccessor> ... </dataAccessor>
</subjectId>
<relation> ... </relation>
<objectId>
<interactionKey> ... </interactionKey>
<viewKind xmlns:ps="http://www.pasoa.org/schemas/version025/PStruct.xsd" xsi:type=" ... "/>
<localPAssertionId> ... </localPAssertionId>
<dataAccessor> ... </dataAccessor>
</objectId>
</relationshipPAssertion>
...
</sender>
In our mapping, there will be two kinds of relationship p-assertion, which we name here for ease of discussion. An operation relationship will document the operation of each application/job (one per output data item). A data link relationship will document the connection between data in two dependent jobs in the workflow (one per input data item).
An operation relationship p-assertion for an output of a job, e.g. atlas-z.gif in the example above, will be recorded in the sender view of the completion interaction for that job. The subject of an operation relationship will be the output and have a local p-assertion ID of 1 (to refer to the interaction p-assertion) and an XPath data accessor to the "output" element within the interaction p-assertion's content. The object IDs will be the job inputs documented in the invocation interaction p-assertion. They will have the interaction key of the invocation interaction, the "receiver" view kind, the local p-assertion ID 1 and an XPath data accessor to the "input" element in the interaction p-assertion content. The relation name of the relationship p-assertion will be the qualified name of the derivation, e.g. the expansion of fmri:convert. The relationship p-assertion will have a local p-assertion ID of 4.
<sender>
...
<relationshipPAssertion>
<localPAssertionId>4</localPAssertionId>
<subjectId>
<localPAssertionId>1</localPAssertionId>
<dataAccessor>
<singleNodeXPath xmlns="http://www.pasoa.org/schemas/version025/pquery/ProvenanceQuery.xsd">
<path>/isic:completion[0]/isic:output[0]</path>
<namespaceMapping>
<prefix>isic</prefix>
<namespace>http://www.pasoa.org/Pegasus/completionStyle</namespace>
</namespaceMapping>
</singleNodeXPath>
</dataAccessor>
</subjectId>
<relation>http://fmri#convert</relation>
<objectId>
<interactionKey>
<messageSource>
<Address xmlns = "http://schemas.xmlsoap.org/ws/2004/03/addressing">http://www.pegasus.edu/CondorDAGMan</Address>
</messageSource>
<messageSink>
<Address xmlns = "http://schemas.xmlsoap.org/ws/2004/03/addressing">http://fmri#convert:1</Address>
</messageSink>
<interactionId>fmri2006-09-06T17:34:27-07:00Node_convertZ</interactionId>
</interactionKey>
<viewKind xmlns:ps="http://www.pasoa.org/schemas/version025/PStruct.xsd" xsi:type="ps:ReceiverViewKind"/>
<localPAssertionId>1</localPAssertionId>
<dataAccessor>
<singleNodeXPath xmlns="http://www.pasoa.org/schemas/version025/pquery/ProvenanceQuery.xsd">
<path>/isii:invocation[0]/isii:input[0]</path>
<namespaceMapping>
<prefix>isii</prefix>
<namespace>http://www.pasoa.org/Pegasus/invocationStyle</namespace>
</namespaceMapping>
</singleNodeXPath>
</dataAccessor>
</objectId>
</relationshipPAssertion>
</sender>
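To make the "one operation relationship per output" rule concrete, here is a hedged Python sketch that enumerates the subject and object references for a job. The DataRef and Relationship classes and the operation_relationships function are hypothetical. The XPath strings and their indices simply follow the examples on this page, and numbering additional relationships upwards from 4 is an assumption made so that local p-assertion IDs stay unique within the view.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class DataRef:
    """Reference to a data item inside an interaction p-assertion."""
    local_passertion_id: str
    xpath: str
    # Not needed for a subject: it lives in the same view as the relationship.
    interaction_key: Optional[Tuple[str, str, str]] = None
    view_kind: Optional[str] = None


@dataclass(frozen=True)
class Relationship:
    local_passertion_id: str
    subject: DataRef
    relation: str
    objects: Tuple[DataRef, ...]


def operation_relationships(invocation_key, relation, n_inputs, n_outputs, first_id=4):
    """One relationship per output; every input of the job is an object."""
    objects = tuple(
        DataRef("1", f"/isii:invocation[0]/isii:input[{i}]", invocation_key, "receiver")
        for i in range(n_inputs)
    )
    return [
        Relationship(
            str(first_id + o),  # assumption: IDs 4, 5, ... for successive outputs
            DataRef("1", f"/isic:completion[0]/isic:output[{o}]"),
            relation,
            objects,
        )
        for o in range(n_outputs)
    ]


key = (
    "http://www.pegasus.edu/CondorDAGMan",
    "http://fmri#convert:1",
    "fmri2006-09-06T17:34:27-07:00Node_convertZ",
)
rels = operation_relationships(key, "http://fmri#convert", n_inputs=1, n_outputs=1)
print(rels[0].subject.xpath, "<-", [o.xpath for o in rels[0].objects])
```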
A data link relationship p-assertion documents data dependencies caused by the structure of the workflow. Potential data links are apparent in the DAX file via parent-child relationships.
<child ref="Node_convertZ">
<parent ref="Node_slicerZ"/>
</child>
For every data item output by the parent activity and then input by the child activity, we should record a relationship p-assertion relating the data item in those two states. Again, the data items are documented in interaction p-assertions (a completion interaction for the output and an invocation interaction for the input). The p-assertion is recorded in the sender view of the invocation interaction. The relation name for all data link relationships will be http://www.pasoa.org/Pegasus/dataLink.
<sender>
...
<relationshipPAssertion>
<localPAssertionId>4</localPAssertionId>
<subjectId>
<localPAssertionId>1</localPAssertionId>
<dataAccessor>
<singleNodeXPath xmlns="http://www.pasoa.org/schemas/version025/pquery/ProvenanceQuery.xsd">
<path>/isii:invocation[0]/isii:input[0]</path>
<namespaceMapping>
<prefix>isii</prefix>
<namespace>http://www.pasoa.org/Pegasus/invocationStyle</namespace>
</namespaceMapping>
</singleNodeXPath>
</dataAccessor>
</subjectId>
<relation>http://www.pasoa.org/Pegasus/dataLink</relation>
<objectId>
<interactionKey>
<messageSource>
<Address xmlns = "http://schemas.xmlsoap.org/ws/2004/03/addressing">http://fmri#slicer:1</Address>
</messageSource>
<messageSink>
<Address xmlns = "http://schemas.xmlsoap.org/ws/2004/03/addressing">http://www.pegasus.edu/CondorDAGMan</Address>
</messageSink>
<interactionId>fmri2006-09-06T17:34:27-07:00Node_slicerZ</interactionId>
</interactionKey>
<viewKind xmlns:ps="http://www.pasoa.org/schemas/version025/PStruct.xsd" xsi:type="ps:ReceiverViewKind"/>
<localPAssertionId>1</localPAssertionId>
<dataAccessor>
<singleNodeXPath xmlns="http://www.pasoa.org/schemas/version025/pquery/ProvenanceQuery.xsd">
<path>/isic:completion[0]/isic:output[0]</path>
<namespaceMapping>
<prefix>isic</prefix>
<namespace>http://www.pasoa.org/Pegasus/completionStyle</namespace>
</namespaceMapping>
</singleNodeXPath>
</dataAccessor>
</objectId>
</relationshipPAssertion>
</sender>
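The data links of a workflow could be found from the DAX along the following lines. This is a sketch under stated assumptions: the function names are hypothetical, the adag root element and the files used by Node_slicerZ (atlas.img in particular) are invented for illustration, and the DAX is parsed without an XML namespace, as in the snippets on this page.

```python
import xml.etree.ElementTree as ET

# Illustrative DAX fragment; the slicer job's files are assumptions.
DAX = """<adag>
  <job id="Node_slicerZ" namespace="fmri" name="slicer" version="1">
    <uses file="atlas.img" link="input"/>
    <uses file="atlas-z.pgm" link="output"/>
  </job>
  <job id="Node_convertZ" namespace="fmri" name="convert" version="1">
    <uses file="atlas-z.pgm" link="input"/>
    <uses file="atlas-z.gif" link="output"/>
  </job>
  <child ref="Node_convertZ">
    <parent ref="Node_slicerZ"/>
  </child>
</adag>"""


def files_by_link(job, link):
    """Set of logical filenames a job uses with the given link."""
    return {u.get("file") for u in job.findall("uses") if u.get("link") == link}


def data_links(dax_root):
    """Return (parent job id, child job id, filename) for every data link.

    A data link exists when a file output by the parent is input by the child.
    """
    jobs = {j.get("id"): j for j in dax_root.findall("job")}
    links = []
    for child in dax_root.findall("child"):
        child_inputs = files_by_link(jobs[child.get("ref")], "input")
        for parent in child.findall("parent"):
            parent_outputs = files_by_link(jobs[parent.get("ref")], "output")
            for name in sorted(parent_outputs & child_inputs):
                links.append((parent.get("ref"), child.get("ref"), name))
    return links


links = data_links(ET.fromstring(DAX))
print(links)
```

Each tuple returned corresponds to one data link relationship p-assertion, with the child's input as the subject and the parent's output as the object.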
--
SimonMiles - 10 Nov 2006