Our primary concern is to represent how "things", whether digital data such as simulation results, physical objects such as cars, or immaterial entities such as decisions, came to be in a given state, with a given set of characteristics, at a given moment. Many such "things" are stateful: a car may be at various locations, it can carry different passengers, and its tank can be full or empty; likewise, a file can contain different data at different moments of its existence. Hence, from the perspective of provenance, we introduce the concept of an artifact as an immutable piece of state (see Footnote 1); likewise, we introduce the concept of a process as actions resulting in new artifacts.
A process usually takes place in some context, which enables or facilitates its execution: examples of such contexts are varied and include a place where the process executes, an individual controlling the process, or an institution sponsoring the process. These entities are referred to as agents. Agents, as we shall see when we discuss causal dependencies, are a cause (like a catalyst) of a process taking place.
The Open Provenance Model is based on these three primary entities, which we define now.
Definition 1 (Artifact) Immutable piece of state, which may have a physical embodiment in a physical object, or a digital representation in a computer system.
Definition 2 (Process) Action or series of actions performed on or caused by artifacts, and resulting in new artifacts.
Definition 3 (Agent) Contextual entity acting as a catalyst of a process, enabling, facilitating, controlling, affecting its execution.
The Open Provenance Model is a model of artifacts in the past, explaining how they were derived. Likewise, as far as processes are concerned, they may also be in the past, i.e. they may have already completed their execution; in addition, processes can still be currently running (i.e., they have not completed their execution yet). In no case is OPM intended to describe the state of future artifacts and the activities of future processes.
We introduce a graphical notation and a formal definition for provenance graphs. Specifically, artifacts are represented by circles, and are denoted by elements of the set Artifact. Likewise, processes are represented graphically by rectangles and denoted by elements of the set Process. Finally, agents are represented by octagons and are elements of the set Agent in the formal notation.
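The three entity kinds and their graphical shapes can be sketched in code. The following is only a minimal illustration, not part of OPM: the class layout, the `id` field, and the helper `shape_of` are assumptions introduced here. Frozen dataclasses are used so that an artifact, once created, cannot be modified, which mirrors the "immutable piece of state" wording of Definition 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Artifact:
    """Immutable piece of state (drawn as a circle)."""
    id: str  # hypothetical identifier, not prescribed by OPM


@dataclass(frozen=True)
class Process:
    """Action or series of actions resulting in new artifacts (drawn as a rectangle)."""
    id: str


@dataclass(frozen=True)
class Agent:
    """Contextual entity acting as a catalyst of a process (drawn as an octagon)."""
    id: str


# Shape used for each node kind in the graphical notation.
SHAPES = {Artifact: "circle", Process: "rectangle", Agent: "octagon"}


def shape_of(node):
    """Return the shape with which a node is drawn in a provenance graph."""
    return SHAPES[type(node)]


# A stateful "thing" such as a car is captured as one artifact per state.
car_at_garage = Artifact("car@garage")
car_at_office = Artifact("car@office")
```

A change of state is therefore never an update of an existing artifact; it is a new artifact, here `car_at_office`, produced by some process.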
Footnote 1: In the presence of streams, we consider an artifact to be a slice of the stream in time, i.e. the stream content at a specific instant in the computation. A future version of OPM will refine the model to accommodate streams fully, as they are recognized to be crucial in many applications.
A provenance graph aims to capture the causal dependencies between the entities introduced above. Therefore, a provenance graph is defined as a directed graph, whose nodes are artifacts, processes and agents, and whose edges belong to one of the categories depicted in Figure 1. An edge represents a causal dependency between its source, denoting the effect, and its destination, denoting the cause.
Figure 1: Edges in the Provenance Model
The first two edges express that a process used an artifact and that an artifact was generated by a process. Since a process may have used several artifacts, it is important to identify the roles under which these artifacts were used. (Roles are denoted by the letter R in Figure 1.) Likewise, a process may have generated many artifacts, and each would have a specific role. For instance, the division process uses two numbers, with roles dividend and divisor, and produces two numbers, with roles quotient and remainder. Roles are meaningful only in the context of the process where they are defined. The meaning of roles is not defined by OPM but by application domains; OPM only uses roles syntactically (as "tags") to distinguish the involvement of artifacts in processes.
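The division example can be written down as a small set of role-tagged edges. This is a sketch under stated assumptions: the tuple layout `(effect, relation, cause, role)`, the identifiers `p1` and `a1` to `a4`, and the helper names are hypothetical and chosen only for illustration. Each artifact gets its own identifier so that two artifacts holding the same number remain distinct.

```python
def record_division(dividend, divisor):
    """Perform an integer division and record its provenance as edges."""
    quotient, remainder = divmod(dividend, divisor)
    # Artifact identifier -> the value that artifact holds.
    artifacts = {"a1": dividend, "a2": divisor, "a3": quotient, "a4": remainder}
    # Each edge points from the effect to the cause and carries a role.
    edges = [
        ("p1", "used", "a1", "dividend"),
        ("p1", "used", "a2", "divisor"),
        ("a3", "wasGeneratedBy", "p1", "quotient"),
        ("a4", "wasGeneratedBy", "p1", "remainder"),
    ]
    return artifacts, edges


def value_in_role(artifacts, edges, role):
    """Return the value of the artifact involved in the process under `role`."""
    for effect, relation, cause, edge_role in edges:
        if edge_role == role:
            # For "used" the artifact is the cause; otherwise it is the effect.
            artifact = cause if relation == "used" else effect
            return artifacts[artifact]
    raise KeyError(role)


artifacts, edges = record_division(7, 2)
print(value_in_role(artifacts, edges, "quotient"))   # 3
print(value_in_role(artifacts, edges, "remainder"))  # 1
```

Without the roles, the two used edges would be indistinguishable, and the graph could not tell whether 7 was divided by 2 or 2 by 7.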
A process is caused by an agent, essentially acting as a catalyst or controller: this causal dependency is expressed by the was controlled by edge. Given that a process may have been catalyzed by several agents, we also identify their roles as catalysts. We note that the dependency between an agent and a process represents a control relationship, and not a data derivation relationship. It is introduced in the model to more easily express how a user (or institution) controlled a process.
It is also recognized that we may not be aware of the process that generated some artifact A2, but only that artifact A2 was derived from another artifact A1. Likewise, we may not be aware of the exact artifact that a process P2 used, but only that it was some artifact generated by another process P1. Process P2 is then said to have been triggered by P1. Both edges wasDerivedFrom and wasTriggeredBy are introduced because they allow a dataflow-oriented or a process-oriented view of past executions to be adopted, according to the preference of system designers. (Since wasDerivedFrom and wasTriggeredBy are edges that summarize activities whose details are not all exposed, it was felt that it was not necessary to associate a role with such edges.)
As far as conventions are concerned, we note that causality edges use past tense to indicate that they refer to past execution. Causal relationships are defined as follows.
Definition 4 (Causal Relationship) A causal relationship is represented by an arc and denotes the presence of a causal dependency between the source of the arc (the effect) and the destination of the arc (the cause). Five causal relationships are recognized: a process used an artifact, an artifact was generated by a process, a process was triggered by a process, an artifact was derived from an artifact, and a process was controlled by an agent.
Multiple notions of causal dependencies were considered for OPM. A very strong notion of causal dependency would express that a set of entities was necessary and sufficient to explain the existence of another entity. It was felt that such a notion was not practical, since, with an open world assumption, one could always argue that additional factors may have influenced an outcome (e.g. electricity was used, the temperature range allowed the computer to work, etc.). It was felt that weaker notions, only expressing necessary dependencies, were more appropriate. However, even then, one can distinguish a data dependency (e.g. where a quotient is clearly dependent on the dividend and divisor) from a control dependency, where the mere presence of some artifact or the beginning of a process can explain the presence of another entity. A number of factors have influenced us to adopt a weak notion of causal dependency for OPM.
Definition 5 (Artifact Used by a Process) In a graph, connecting a process to an artifact by a used edge is intended to indicate that the process required the availability of the artifact to complete its execution. When several artifacts are connected to the same process by multiple used edges, all of them were required for the process to complete.
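The five causal relationships of Definition 4 each constrain the kinds of node at the two ends of the arc. A minimal sketch of that constraint follows; the table name `EDGE_SIGNATURES`, the lowercase kind labels, and the function `is_well_formed` are assumptions made for this illustration rather than OPM notation.

```python
# For each causal relationship: (kind of the effect, kind of the cause).
# The arc points from its source (the effect) to its destination (the cause).
EDGE_SIGNATURES = {
    "used": ("process", "artifact"),
    "wasGeneratedBy": ("artifact", "process"),
    "wasTriggeredBy": ("process", "process"),
    "wasDerivedFrom": ("artifact", "artifact"),
    "wasControlledBy": ("process", "agent"),
}


def is_well_formed(relation, effect_kind, cause_kind):
    """Check that an edge connects node kinds permitted for its relation."""
    return EDGE_SIGNATURES.get(relation) == (effect_kind, cause_kind)


print(is_well_formed("used", "process", "artifact"))   # True
print(is_well_formed("used", "artifact", "process"))   # False: direction reversed
```

Note that an agent only ever appears as a cause, and only of a process.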
Alternatively, a stronger interpretation of the used edge would have required the artifact to be available for the process to be able to start. It is believed that such a notion may be useful in some circumstances, and it may be defined as a subtype of used. We note that both interpretations of used coincide, when processes are modelled as instantaneous.
Definition 6 (Artifacts Generated by Processes) In a graph, connecting an artifact to a process by an edge wasGeneratedBy is intended to mean that the process was required to have begun its execution for the artifact to be generated. When several artifacts are connected to the same process by multiple wasGeneratedBy edges, the process had to have begun for all of them to be generated.
Definition 7 (Process Triggered by Process) A connection of a process P2 to a process P1 by a "was triggered by" edge indicates that the start of process P1 was required for P2 to be able to complete.
We note that the relationship P2 wasTriggeredBy P1 (like the other causality relationships we describe in this section) only expresses a necessary condition: P1 was required to have started for P2 to be able to complete. This interpretation is weaker than the common sense definition of "trigger", which tends to express a sufficient condition for an event to take place.
Definition 8 (Artifact Derived from Artifact) The assertion of an edge "was derived from" between an artifact A2 and an artifact A1 indicates that A2 was derived from A1, even when the process that generated A2 is not known or not exposed.
Definition 9 (Process Controlled by Agent) The assertion of an edge "was controlled by" between a process P and an agent Ag indicates that the start and end of process P were controlled by agent Ag.
A role is an annotation on used, wasGeneratedBy and wasControlledBy.
Definition 10 (Role) A role designates an artifact's or agent's function in a process.
A role is used to differentiate among several use, generation, or controlling relations.
A role has meaning only within the context of a given process (and/or agent). For a given process, each used, wasGeneratedBy or wasControlledBy relation has a role specific to the process, though the roles may have no meaning outside that process. In general, for a given process (agent) with several arcs, each role should be distinct for that process. However, it is possible, though not recommended, for roles to be the same within a context. For example, a description of baking a cake with two eggs may define each egg as a separate artifact, and the two used edges might have the identical role, say, egg.
Specifying a role is recommended whenever possible, but the role may be left unspecified when it is not known. For interoperability, communities should define standard sets of roles with agreed meanings. In addition, a reserved value will be defined for "undefined", which should be used when the role is not known or omitted.
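The role conventions can be illustrated with a small check that fills in the reserved value and reports roles repeated within one process. This is a sketch only: the literal `"undefined"` stands in for the reserved value, whose exact form the text leaves open, and the tuple layout `(effect, relation, cause, role)` and function names are assumptions made here.

```python
from collections import Counter

# Assumed spelling of the reserved value for an unknown or omitted role.
UNDEFINED_ROLE = "undefined"


def normalise_role(role):
    """Replace a missing role by the reserved "undefined" value."""
    return role if role else UNDEFINED_ROLE


def repeated_roles(edges, process):
    """List (relation, role) pairs occurring more than once for `process`.

    Repetition is permitted but not recommended, so callers may treat the
    result as a warning rather than an error. Undefined roles are ignored.
    """
    counts = Counter(
        (relation, normalise_role(role))
        for effect, relation, cause, role in edges
        if process in (effect, cause)
    )
    return sorted(
        key for key, n in counts.items()
        if n > 1 and key[1] != UNDEFINED_ROLE
    )


edges = [
    ("bake", "used", "egg1", "egg"),
    ("bake", "used", "egg2", "egg"),      # same role as egg1: allowed, discouraged
    ("bake", "used", "flour1", "flour"),
    ("bake", "used", "sugar1", None),     # role not known
]
print(repeated_roles(edges, "bake"))  # [('used', 'egg')]
```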
An example illustrating all the concepts and a few of the causal dependencies is displayed in Figure 2. This provenance graph expresses that John baked a cake with ingredients butter, eggs, sugar and flour.
Figure 2: Victoria Sponge Cake Provenance
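The cake example can also be encoded as a list of edges and queried one relation at a time. The following is a hedged sketch: the node names, the process name `bake`, and all role labels are assumptions for illustration and are not taken from Figure 2; only the participants (John, the cake, and the four ingredients) come from the text.

```python
# Each edge is (effect, relation, cause, role); names and roles are illustrative.
CAKE_GRAPH = [
    ("bake", "used", "butter", "butter"),
    ("bake", "used", "eggs", "eggs"),
    ("bake", "used", "sugar", "sugar"),
    ("bake", "used", "flour", "flour"),
    ("cake", "wasGeneratedBy", "bake", "cake"),
    ("bake", "wasControlledBy", "John", "baker"),
]


def causes(graph, effect, relation):
    """Return the direct causes of `effect` along edges of one relation."""
    return [c for e, r, c, _role in graph if e == effect and r == relation]


print(causes(CAKE_GRAPH, "cake", "wasGeneratedBy"))   # ['bake']
print(causes(CAKE_GRAPH, "bake", "used"))             # ['butter', 'eggs', 'sugar', 'flour']
print(causes(CAKE_GRAPH, "bake", "wasControlledBy"))  # ['John']
```

The agent John is reached only through a wasControlledBy edge, which keeps the control relationship separate from the data that flowed into the cake.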
A computational example is displayed in Figure 3. The final data product is a scientific-grade mosaic of the sky, which was produced by a process that used scientific images in FITS format (such as the Sloan Digital Sky Survey data set) and a parameter indicating the size of the mosaic to be produced. The process was caused by the Pegasus/Condor Dagman agent.
Figure 3: Montage Provenance
While graphs can be constructed by incrementally connecting artifacts, processes, and agents with individual edges, the meaning of the causality relations can be understood in the context of all the used (or wasGeneratedBy) edges, for each process. By connecting a process to several artifacts by used edges, we are not just stating the individual inputs to the process. We are asserting a causal dependency expressing that the process could take place and complete only because all these artifacts were available. Likewise, when we express that several artifacts were generated by a process, we mean that these artifacts would not have existed if the process had not begun its execution; furthermore, all of them were generated by the process; one could not have been generated without the others. The implication is that any single generated artifact is caused by the process, which itself is caused by the presence of all the artifacts it used. We will use such a property to derive transitive closures of causality relations in Section 6.
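The chain of reasoning above, from a generated artifact to its process and on to the artifacts that process used, can be approximated by following edges from effect to cause. This is a naive reachability sketch and not the inference rules of Section 6, which may be more restrictive; the node names and the function `all_causes` are hypothetical.

```python
def all_causes(edges, node):
    """Collect every node reachable from `node` by following effect -> cause."""
    seen = set()
    stack = [node]
    while stack:
        current = stack.pop()
        for effect, cause in edges:
            if effect == current and cause not in seen:
                seen.add(cause)
                stack.append(cause)
    return seen


# Two processes in sequence: p1 used a0 and a1 and generated a2;
# p2 used a2 and generated a3. Each pair is (effect, cause).
edges = [
    ("p1", "a0"),
    ("p1", "a1"),
    ("a2", "p1"),
    ("p2", "a2"),
    ("a3", "p2"),
]

print(sorted(all_causes(edges, "a3")))  # ['a0', 'a1', 'a2', 'p1', 'p2']
print(sorted(all_causes(edges, "a2")))  # ['a0', 'a1', 'p1']
```

Because every used edge of a process is followed, the result reflects the reading that a process is caused by the presence of all the artifacts it used, not by any one of them alone.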
We can see here the crucial difference between artifacts and the data they represent. For instance, the data may have existed, but the particular artifact did not. For example, a BLAST search can be given a DNA sequence and return a set of "similar" DNA sequences; these returned sequences all existed prior to the process (BLAST) invocation, but the artifacts are novel.
As illustrated by the two examples above, the entities and edges introduced in Figure 1 allow us to capture many of the use cases we have come across in the provenance literature. However, they do not allow us to provide descriptions at multiple levels of abstraction, or from different viewpoints. To support these, we allow multiple descriptions of the same execution to coexist.
-- PatrickPaulson - 18 Aug 2008