
Provenance Challenge

Open Provenance Model Contents
  1. Introduction
  2. Basics
  3. Overlapping and Hierarchical Descriptions
  4. Provenance Graph Definition
  5. Timeless Formal Model
  6. Inferences
  7. Formal Model and Time Annotations
  8. Time Constraints and Inferences
  9. Support for Collections
  10. Example of Representation
  11. Conclusion
  12. Best Practice on the Use of Agents
  13. References

2 Basics

2.1 Entities

Our primary concern is to be able to represent how "things", whether digital data such as simulation results, physical objects such as cars, or immaterial entities such as decisions, came to be in a given state, with a given set of characteristics, at a given moment. It is recognised that many such "things" can be stateful: a car may be at various locations, it can contain different passengers, and it can have a full or an empty tank; likewise, a file can contain different data at different moments of its existence. Hence, from the perspective of provenance, we introduce the concept of an artifact as an immutable piece of state (see Footnote 1); likewise, we introduce the concept of a process as actions resulting in new artifacts.

A process usually takes place in some context, which enables or facilitates its execution: examples of such contexts are varied and include a place where the process executes, an individual controlling the process, or an institution sponsoring the process. These entities are referred to as agents. Agents, as we shall see when we discuss causal dependencies, are a cause (like a catalyst) of a process taking place.

The Open Provenance Model is based on these three primary entities, which we define now.

Definition 1 (Artifact) Immutable piece of state, which may have a physical embodiment in a physical object, or a digital representation in a computer system.

Definition 2 (Process) Action or series of actions performed on or caused by artifacts, and resulting in new artifacts.

Definition 3 (Agent) Contextual entity acting as a catalyst of a process, enabling, facilitating, controlling, affecting its execution.

The Open Provenance Model is a model of artifacts in the past, explaining how they were derived. Likewise, as far as processes are concerned, they may also be in the past, i.e. they may have already completed their execution; in addition, processes can still be currently running (i.e., they have not completed their execution yet). In no case is OPM intended to describe the state of future artifacts and the activities of future processes.

We introduce a graphical notation and a formal definition for provenance graphs. Specifically, artifacts are represented by circles, and are denoted by elements of the set Artifact. Likewise, processes are represented graphically by rectangles and denoted by elements of the set Process. Finally, agents are represented by octagons and are elements of the set Agent in the formal notation.
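The three kinds of entities can be sketched as simple record types. This is only an illustration, not part of OPM: the class and field names (`id`, `value`) are assumptions, and frozen dataclasses are used merely to mirror the requirement that an artifact is an immutable piece of state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Artifact:
    """Immutable piece of state (drawn as a circle)."""
    id: str
    value: str = ""


@dataclass(frozen=True)
class Process:
    """Action or series of actions (drawn as a rectangle)."""
    id: str


@dataclass(frozen=True)
class Agent:
    """Contextual entity acting as a catalyst of a process (drawn as an octagon)."""
    id: str


# A stateful "thing" such as a file is modelled as several artifacts,
# one per state, rather than as one mutable object.
file_v1 = Artifact("file@t1", "draft")
file_v2 = Artifact("file@t2", "final")
```

Attempting to assign to `file_v1.value` raises an error, so a change of state has to be expressed as a new artifact.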

Footnote 1: In the presence of streams, we consider an artifact to be a slice of a stream in time, i.e. the stream content at a specific instant in the computation. A future version of OPM will refine the model to accommodate streams fully, as they are recognized to be crucial in many applications.

2.2 Dependencies

A provenance graph aims to capture the causal dependencies between the abovementioned entities. Therefore, a provenance graph is defined as a directed graph, whose nodes are artifacts, processes and agents, and whose edges belong to one of the following categories depicted in Figure 1. An edge represents a causal dependency between its source, denoting the effect, and its destination, denoting the cause.

Figure 1: Edges in the Provenance Model

The first two edges express that a process used an artifact and that an artifact was generated by a process. Since a process may have used several artifacts, it is important to identify the roles under which these artifacts were used. (Roles are denoted by the letter R in Figure 1.) Likewise, a process may have generated many artifacts, and each would have a specific role. For instance, the division process uses two numbers, with roles dividend and divisor, and produces two numbers, with roles quotient and remainder. Roles are meaningful only in the context of the process where they are defined. The meaning of roles is not defined by OPM but by application domains; OPM only uses roles syntactically (as "tags") to distinguish the involvement of artifacts in processes.
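The division example can be written down as a small sketch. The identifiers (`p_div`, `a1` to `a4`) and the tuple layout `(effect, cause, role)` are assumptions chosen for illustration; only the four role names come from the text.

```python
def record_division(dividend: int, divisor: int):
    """Perform an integer division and describe it with role-tagged edges."""
    quotient, remainder = divmod(dividend, divisor)
    artifacts = {"a1": dividend, "a2": divisor, "a3": quotient, "a4": remainder}
    # Each edge is (effect, cause, role): the process used its two inputs ...
    used = [("p_div", "a1", "dividend"), ("p_div", "a2", "divisor")]
    # ... and each output was generated by the process.
    was_generated_by = [("a3", "p_div", "quotient"), ("a4", "p_div", "remainder")]
    return artifacts, used, was_generated_by


artifacts, used, was_generated_by = record_division(17, 5)
for artifact_id, _process_id, role in was_generated_by:
    print(role, "=", artifacts[artifact_id])
# quotient = 3
# remainder = 2
```

Without the roles, the two used edges would be indistinguishable, and a reader of the graph could not tell which number was divided by which.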

A process is caused by an agent, essentially acting as a catalyst or controller: this causal dependency is expressed by the was controlled by edge. Given that a process may have been catalyzed by several agents, we also identify their roles as catalysts. We note that the dependency between an agent and a process represents a control relationship, and not a data derivation relationship. It is introduced in the model to more easily express how a user (or institution) controlled a process.

It is also recognized that we may not be aware of the process that generated some artifact A2, but that we may know that artifact A2 was derived from another artifact A1. Likewise, we may not be aware of the exact artifact that a process P2 used, but that we may know that it was some artifact generated by another process P1. Process P2 is then said to have been triggered by P1. Both edges wasDerivedFrom and wasTriggeredBy are introduced because they allow a dataflow-oriented or a process-oriented view of past executions to be adopted, according to the preference of system designers. (Since wasDerivedFrom and wasTriggeredBy are edges that summarize some activities for which not all details are being exposed, it was felt that it was not necessary to associate a role with such edges.)

As far as conventions are concerned, we note that causality edges use past tense to indicate that they refer to past execution. Causal relationships are defined as follows.

Definition 4 (Causal Relationship) A causal relationship is represented by an arc and denotes the presence of a causal dependency between the source of the arc (the effect) and the destination of the arc (the cause). Five causal relationships are recognized: a process used an artifact, an artifact was generated by a process, a process was triggered by a process, an artifact was derived from an artifact, and a process was controlled by an agent.
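The five causal relationships, with the kind of node allowed at each end, can be summarized in a lookup table. This is a minimal sketch: the table name, the function name, and the string labels for node kinds are assumptions; the edge names and endpoint kinds follow the definition.

```python
# Edge type -> (kind of the source, i.e. the effect; kind of the destination, i.e. the cause)
EDGE_ENDPOINTS = {
    "used": ("process", "artifact"),
    "wasGeneratedBy": ("artifact", "process"),
    "wasTriggeredBy": ("process", "process"),
    "wasDerivedFrom": ("artifact", "artifact"),
    "wasControlledBy": ("process", "agent"),
}


def is_well_formed(edge_type: str, effect_kind: str, cause_kind: str) -> bool:
    """Check that an edge connects the kinds of nodes its type allows."""
    return EDGE_ENDPOINTS.get(edge_type) == (effect_kind, cause_kind)


print(is_well_formed("used", "process", "artifact"))   # True
print(is_well_formed("used", "artifact", "process"))   # False
```

Note that every edge points from the effect to the cause, so a used edge starts at the process and ends at the artifact it consumed.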

Multiple notions of causal dependencies were considered for OPM. A very strong notion of causal dependency would express that a set of entities was necessary and sufficient to explain the existence of another entity. It was felt that such a notion was not practical, since, with an open world assumption, one could always argue that additional factors may have influenced an outcome (e.g. electricity was used, the temperature range allowed the computer to work, etc.). It was felt that weaker notions, only expressing necessary dependencies, were more appropriate. However, even then, one can distinguish a data dependency (e.g. where a quotient is clearly dependent on the dividend and divisor) from a control dependency, where the mere presence of some artifact or the beginning of a process can explain the presence of another entity. A number of factors have influenced us to adopt a weak notion of causal dependency for OPM.

  • Expressibility. It is anticipated that systems will produce descriptions of what their components are doing, without having intimate knowledge of the exact internal data and control dependencies. Weak notions of dependency are necessary for such systems to be able to use OPM in practice.

  • Composability. We shall see how OPM supports multi-level descriptions (Section 3). In a system consisting of the parallel composition of two subcomponents, the high level summary of the system requires a weaker notion of dependency than the low level descriptions of its subcomponents.

Hence, we adopt the following causal dependencies in OPM. We anticipate that subclasses of these dependencies, capturing stronger notions of causality, may be defined in specific systems, and over time, may be incorporated in OPM.

Definition 5 (Artifact Used by a Process) In a graph, connecting a process to an artifact by a used edge is intended to indicate that the process required the availability of the artifact to complete its execution. When several artifacts are connected to the same process by multiple used edges, all of them were required for the process to complete.

Alternatively, a stronger interpretation of the used edge would have required the artifact to be available for the process to be able to start. It is believed that such a notion may be useful in some circumstances, and it may be defined as a subtype of used. We note that both interpretations of used coincide, when processes are modelled as instantaneous.

Definition 6 (Artifacts Generated by Processes) In a graph, connecting an artifact to a process by an edge wasGeneratedBy is intended to mean that the process was required to initiate its execution for the artifact to be generated. When several artifacts are connected to the same process by multiple wasGeneratedBy edges, the process had to have begun for all of them to be generated.

Definition 7 (Process Triggered by Process) A connection of a process P2 to a process P1 by a "was triggered by" edge indicates that the start of process P1 was required for P2 to be able to complete.

We note that the relationship P2 wasTriggeredBy P1 (like the other causality relationships we describe in this section) only expresses a necessary condition: P1 was required to have started for P2 to be able to complete. This interpretation is weaker than the common sense definition of "trigger", which tends to express a sufficient condition for an event to take place.

Definition 8 (Artifact Derived from Artifact) The assertion of an edge "was derived from" between an artifact A2 and an artifact A1 indicates that A2 was derived from A1, that is, A1 may have been used by a process, possibly unknown, that generated A2.

Definition 9 (Process Controlled by Agent) The assertion of an edge "was controlled by" between a process P and an agent Ag indicates that the start and end of process P were controlled by agent Ag.

2.3 Roles

A role is an annotation on used, wasGeneratedBy and wasControlledBy.

Definition 10 (Role) A role designates an artifact's or agent's function in a process.

A role is used to differentiate among several use, generation, or controlling relations.

  1. A process may use (generate) more than one artifact. Each used (wasGeneratedBy) relation may be distinguished by a role with respect to that process. For example, a process may use several files, reading parameters from one, and reading data from another. The used relations would be labeled with distinct roles.
  2. An artifact might be used by more than one process, possibly for different purposes. In this case, the used relations can be distinguished, or said to be the same, by the roles associated with them. For example, a dictionary might be used by one process to look up the spelling of "provenance" (role = "look up provenance"), while another process uses the same dictionary to hold open the door (role = "doorstop").
  3. An agent may control more than one process. In this case, the different processes may be distinguished by the role associated with the wasControlledBy relation. For example, a gardener may control the digging process (role = "dig the bed"), as well as planting a rose bush (role = "plant") and watering the bush (role = "irrigating").
  4. A process may be controlled by more than one agent. In this case, each agent might have a distinct control function, which would be distinguished by roles associated with the wasControlledBy relations. For example, boarding the train may be controlled by the ticket agent (role = "sell ticket"), the gate agent (role = "take ticket") and the steward (role = "guide to seat").

A role has meaning only within the context of a given process (and/or agent). For a given process, each used, wasGeneratedBy or wasControlledBy relation has a role specific to the process, though the roles may have no meaning outside that process. In general, for a given process (agent) with several arcs, each role should be distinct for that process. However, it is possible, though not recommended, for roles to be the same within a context. For example, a description of baking a cake with two eggs may define each egg as a separate artifact, and the two used edges might have the identical role, say, egg.

Specifying a role is recommended whenever possible, but the role may be left unspecified when it is not known. For interoperability, communities should define standard sets of roles with agreed meanings. In addition, a reserved value will be defined for "undefined", which should be used when the role is not known or omitted.
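The recommendation that roles be distinct within a process can be checked mechanically. The sketch below makes several assumptions: the tuple layout `(edge_type, process, other_node, role)`, the function name, and the placeholder string `"undefined"`, which stands in for the reserved value that the text says will be defined. Repeated roles are reported rather than rejected, since they are discouraged but permitted.

```python
from collections import Counter

# Placeholder only: the text says a reserved value *will be* defined.
UNDEFINED_ROLE = "undefined"


def repeated_roles(edges):
    """Return (process, edge_type, role) triples whose role occurs more than once."""
    counts = Counter(
        (process, edge_type, role)
        for edge_type, process, _other, role in edges
        if role != UNDEFINED_ROLE  # unknown roles are not compared with each other
    )
    return sorted(key for key, n in counts.items() if n > 1)


edges = [
    ("used", "bake", "egg1", "egg"),
    ("used", "bake", "egg2", "egg"),
    ("used", "bake", "flour", "flour"),
    ("wasControlledBy", "bake", "John", UNDEFINED_ROLE),
]
print(repeated_roles(edges))  # [('bake', 'used', 'egg')]
```

Roles are grouped per process and per edge type here, which is one possible reading of "within a context"; a stricter check could group per process only.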

2.4 Examples

An example illustrating all the concepts and a few of the causal dependencies is displayed in Figure 2. This provenance graph expresses that John baked a cake with ingredients butter, eggs, sugar and flour.

Figure 2: Victoria Sponge Cake Provenance
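As a rough textual counterpart to the figure, the cake example can be written as a list of edges. The node names follow the prose; the role labels (in particular "baker") and the tuple layout `(edge_type, effect, cause, role)` are assumptions, since the roles shown in the figure are not reproduced in the text.

```python
cake_graph = [
    ("used", "bake", "butter", "butter"),
    ("used", "bake", "eggs", "eggs"),
    ("used", "bake", "sugar", "sugar"),
    ("used", "bake", "flour", "flour"),
    ("wasGeneratedBy", "cake", "bake", "cake"),
    ("wasControlledBy", "bake", "John", "baker"),
]


def direct_causes(node, graph):
    """List (edge_type, cause) pairs for edges whose effect is `node`."""
    return [(edge_type, cause) for edge_type, effect, cause, _role in graph if effect == node]


for edge_type, cause in direct_causes("bake", cake_graph):
    print("bake", edge_type, cause)
```

The cake has a single direct cause, the bake process, while the process itself depends on the four ingredients and on the agent John.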

A computational example is displayed in Figure 3. The final data product is a scientific-grade mosaic of the sky, which was produced by a process that used scientific images in FITS format (such as the Sloan Digital Sky Survey data set) and a parameter indicating the size of the mosaic to be produced. The process was caused by the Pegasus/Condor Dagman agent.

Figure 3: Montage Provenance

While graphs can be constructed by incrementally connecting artifacts, processes, and agents with individual edges, the meaning of the causality relations can be understood in the context of all the used (or wasGeneratedBy) edges, for each process. By connecting a process to several artifacts by used edges, we are not just stating the individual inputs to the process. We are asserting a causal dependency expressing that the process could take place and complete only because all these artifacts were available. Likewise, when we express that several artifacts were generated by a process, we mean that these artifacts would not have existed if the process had not begun its execution; furthermore, all of them were generated by the process; one could not have been generated without the others. The implication is that any single generated artifact is caused by the process, which itself is caused by the presence of all the artifacts it used. We will use such a property to derive transitive closures of causality relations in Section 6.
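The implication described above, that a generated artifact is caused by its process, which is in turn caused by the artifacts it used, can be illustrated by following edges from effect to cause. This is a plain reachability sketch and not the inference rules of Section 6; the function name and the `(effect, cause)` tuple layout are assumptions.

```python
def all_causes(node, edges):
    """Return every node reachable from `node` by following effect -> cause edges."""
    seen = set()
    stack = [node]
    while stack:
        current = stack.pop()
        for effect, cause in edges:
            if effect == current and cause not in seen:
                seen.add(cause)
                stack.append(cause)
    return seen


edges = [
    ("quotient", "division"),   # quotient wasGeneratedBy division
    ("remainder", "division"),  # remainder wasGeneratedBy division
    ("division", "dividend"),   # division used dividend
    ("division", "divisor"),    # division used divisor
]
print(sorted(all_causes("quotient", edges)))
# ['dividend', 'division', 'divisor']
```

The remainder does not appear among the causes of the quotient: both were generated by the same process, but neither is an input to it.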

We can see here the crucial difference between artifacts and the data they represent: the data may have existed before a process ran, even though the particular artifact did not. For example, a BLAST search can be given a DNA sequence and return a set of "similar" DNA sequences; these returned sequences all existed prior to the invocation of the process (BLAST), but the artifacts are novel.

As illustrated by the two examples above, the entities and edges introduced in Figure 1 allow us to capture many of the use cases we have come across in the provenance literature. However, they do not allow us to provide descriptions at multiple levels of abstraction, or from different viewpoints. To support these, we allow multiple descriptions of the same execution to coexist.


Comments

I don't see the advantage to incorporating streams -- they caused a lot of problems when we came up with the initial model (patrick.paulson@pnl.gov)

-- PatrickPaulson - 18 Aug 2008



Copyright © 1999-2012 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.