Change Proposal: Simplify Core OPM Specification by Removing Unnecessary Parts
This change proposal is now obsolete. Its parts and comments have been divided between ChangeProposalRemoveProcessValues, ChangeProposalRemoveOverlaps, ChangeProposalMoveTimeToProfile, and ChangeProposalRemoveIDs.
Authors
SimonMiles
2009 June 19, revised 2009 July 3.
Subject
Core OPM specification
Background
There are many parts of the OPM specification which I did not use, or consider using, in meeting the third provenance challenge. Since the simpler the specification, the easier it is to encourage adoption, I propose that these parts could be removed. This is not to say they are unimportant, but that they may not be 'core' and that OPM may still be useful and have its requirements met without them.
Problem addressed
Possible unnecessary complexity and length of the specification, and the discouragement of adoption which that may bring.
Proposed solution
I realise that probably not all of the items below should be removed, and that some suggestions may reflect my lack of knowledge about the motivations behind OPM, but I invite discussion to determine which should remain in this proposal to simplify OPM.
I propose the removal of the following from the core specification.
- Account overlaps other than refinement
- Process values
- The time-annotated OPM model
- Artifact or process IDs
Specifically, I propose the following changes to the specification document.
1. Account overlaps other than refinement
Removal of point 13 from chapter 4 of OPM v1.1: "Two account views can be declared to be overlapping to express the fact that they represent different descriptions of an execution."
Removal of the Overlaps set from the formalisation(s) of OPM.
Replacement of rule 11 in the formalisation of OPM v1.1:
"Two accounts α1, α2 are declared to be overlapping in an OPM graph gr = ⟨A, P, AG, U, G, T, D, C, Ov, Re⟩ if ⟨α1, α2⟩ ∈ Ov or ⟨α2, α1⟩ ∈ Ov."
with
"Two accounts are overlapping in an OPM graph iff a subset of artifacts and/or processes in one account are identical to a subset in the other."
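A minimal sketch of how the replacement rule could be evaluated from the graph alone, with no Overlaps set. It assumes the rule is read as "the two accounts share at least one artifact or process", and that account membership is available as a mapping from node to set of accounts. The function name `overlapping_accounts` and the sample node and account names are hypothetical, not taken from the specification.

```python
from itertools import combinations


def overlapping_accounts(membership):
    """Return the unordered pairs of accounts that share at least one node.

    `membership` maps each node (artifact or process) to the set of
    accounts it belongs to. This structure is an assumption of the sketch.
    """
    pairs = set()
    for accounts in membership.values():
        # Every pair of accounts listed on the same node overlaps there.
        for first, second in combinations(sorted(accounts), 2):
            pairs.add((first, second))
    return pairs


# Hypothetical graph: artifact_A is described in both acc1 and acc2.
membership = {
    "artifact_A": {"acc1", "acc2"},
    "process_P": {"acc1"},
    "process_Q": {"acc2"},
    "artifact_C": {"acc3"},
}
overlaps = overlapping_accounts(membership)
```

Here `overlaps` contains only the pair `("acc1", "acc2")`; `acc3` shares no node with the others. The sketch ignores effective account membership of edges, which a full treatment would need to consider.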
2. Process values
Remove "Processes contain a placeholder for domain specific values or references." from the Provenance Graph Definition.
Define Process as mapped to just a set of accounts in the formalism (plus an ID, if change 4 below is rejected).
3. Time-annotated OPM model
Remove the time-annotated model and formalism and put them into a profile instead.
4. Artifact and process IDs
In the Provenance Graph Definition, place the same rules on artifacts and processes as accounts, i.e.
- "Artifacts are entities that we assume can be compared. Artifacts contain a placeholder for a domain specific value or reference to a piece of state. Two artifacts are equal if and only if they have the same identifier (irrespective of their placeholder contents). Artifacts can optionally belong to accounts: account membership is declared by listing the accounts an artifact belongs to."
- "Processes are entities that we assume can be compared. Processes can optionally belong to accounts: account membership is declared by listing the accounts a process belongs to."
In the formalisation, replace the following:
- "We assume the existence of a few primitive sets: identifiers for processes, artifacts and agents, roles, and accounts. These sets of identifiers provide identities to the corresponding entities within the scope of a given provenance graph. A given serialization will standardize on these sets, and provide concrete representations for them."
with:
- "A serialisation of OPM will provide means to declare two accounts, two artifacts or two processes to be equal to each other, e.g. IDs scoped locally to the serialised graph."
Remove the following, and put it instead in the specification of the XML serialisation of OPM:
- "It is important to stress that the purpose of these identifiers is to define the structure of graphs: they are not meant to define identities that are persistent and reliably resolvable over time."
In the formalism, take Account, Role, Process, Agent, Artifact, Value to be the primitive sets; define ArtifactValue to be a mapping from an Artifact to its Value; define ArtifactAccounts to be a mapping from an Artifact to its set of Accounts; define ProcessAccounts to be a mapping from a Process to its set of Accounts; use Artifact/Process instead of ArtifactID/ProcessID in defining the causal relationships.
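One way to picture the revised formalism is the sketch below, in which nodes have identity but carry no identifier field, and the three mappings are keyed on the nodes themselves. The class and field names are illustrative only; roles and accounts are modelled as plain strings purely as a simplifying assumption.

```python
from dataclasses import dataclass, field


class Artifact:
    """A node with identity (object identity) but no identifier field."""


class Process:
    """A node with identity (object identity) but no identifier field."""


@dataclass
class Graph:
    artifact_value: dict = field(default_factory=dict)     # Artifact -> Value
    artifact_accounts: dict = field(default_factory=dict)  # Artifact -> set of Accounts
    process_accounts: dict = field(default_factory=dict)   # Process -> set of Accounts
    used: set = field(default_factory=set)                 # (Process, Role, Artifact)
    was_generated_by: set = field(default_factory=set)     # (Artifact, Role, Process)


g = Graph()
a1, a2, p = Artifact(), Artifact(), Process()

# Two distinct artifacts that happen to hold the same value.
g.artifact_value[a1] = 5
g.artifact_value[a2] = 5
g.artifact_accounts[a1] = {"acc1"}
g.artifact_accounts[a2] = {"acc1"}
g.process_accounts[p] = {"acc1"}

# Causal relationships refer to the nodes directly, not to IDs.
g.used.add((p, "in", a1))
g.was_generated_by.add((a2, "out", p))
```

In this sketch `a1` and `a2` remain distinct even though both map to the value 5, because equality falls back on object identity. Whether that notion of equality is precise enough for an abstract model is exactly what the comments below debate.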
Rationale for the solution
1. Account overlaps other than refinement
If two accounts overlap, then this appears to be evident from the graph itself, and so does not need to be declared separately. What would be lost by removing the Overlaps set from the formal model?
If we do want to talk about overlapping accounts, e.g. to make legality clear, it would also be good to include an example of non-refinement overlap, to make its meaning and implications clear.
For refinement, I am unsure whether it is possible to determine whether one account is a refinement of another without explicit assertion. If a single process P uses A and generates B in one account, and a chain of processes Q initially uses A and finally generates B in another account, is it possible for P not to be a refinement of Q? Given that each artifact is generated by only one process, presumably the 'end' of P must be the same as the 'end' of Q?
2. Process values
I don't understand from the specification what a process's value should be. This problem may be resolved anyway if we treat 'hasValue' as just one of an arbitrary set of optional annotations, as discussed elsewhere.
3. Time-annotated OPM model
Time is not obviously essential to causality, whether defined purely in terms of use, derivation and generation, or by a general definition such as counterfactual causation. The debate over how best to provide annotations of time could be held separately from the core model. The time annotations seem an ideal candidate for putting into a profile.
4. Artifact and process IDs
I would normally consider graphs to be modelled as edges between nodes, but an OPM graph is modelled by edges between IDs which are parts of nodes.
Nodes need identity to allow sharing, but that does not mean they have to have explicit identifiers outside of any one serialisation. If we want to assign identifiers for a particular purpose or in a particular serialisation we can, e.g. that proposed in ChangeProposalDCNaming. The artifact and process IDs seem tied to serialisation for a few reasons.
First, IDs being replaced seems to affect the serialisation but not the meaning of the graph. For example, if I took an OPM graph in the current XML serialisation and loaded it into memory using the JavaBeans deserialiser (effectively creating a new serialisation), then the JavaBeans for the graph edges would address the artifact and process node JavaBeans by memory locations, not their original IDs, and the graph can be fully interpreted without ever using the IDs used in the XML serialisation.
Second, for a particular class of applications, it may be sufficient not to ascribe explicit IDs in the serialisation, because only tree structures are ever present in the causal graphs. This would not make the graph non-interpretable or non-interoperable.
Third, a given serialisation or use of annotations may provide adequate identity for expressing shared nodes without requiring further IDs. For example, an XML serialisation automatically gives each part of the graph a unique XPath. Or, where global identifiers are provided in annotations for other purposes, these can also be used to express that two artifacts are one and the same. This suggests to me that the requirement for core OPM is for identity, not identifiers.
Finally, to answer the point about how to know whether two artifact descriptions denote the same thing or not, an alternative would be to include a serialisation-specific relation between two artifact descriptions saying that they denote the same thing when they do (with the default assumption that they denote different things). I agree that including such a relation requires identifying the artifacts but, as with the relation itself, this can be done in a serialisation-specific way. I am not suggesting this is preferable to using IDs, only that it achieves the same end if a particular serialisation chooses to do it. Either way establishes identity, and the important thing for the core model is that (shared) identity is clear, rather than identifiers.
Comments
Community is invited to provide comments on proposals.
comment 1 by Luc Moreau
PC3 had a limited set of questions (I believe not as broad as in PC1/PC2) and therefore didn't exercise the full OPM model. PC1 had a temporal question (Q4) which would require time annotations. Questions 7, 8 and 9 of PC1 also refer to a "user", a notion intended to be captured by agents. A question such as "identify the code that created the database" could make use of the value field in a process. So I would be against removing these features on the grounds that we have not used them in PC3. I agree, however, that more justification/explanation is required. The issue of agents was raised at the first OPM workshop. We should discuss that in its own right, in particular whether it is the right way to model users, funding bodies, etc.
comment 2 by Luc Moreau
It is quite clear to me that the notions of account, refinement and overlap are novelties of OPM that we need to explore more. I think there is a case for overlapping (non refinement) accounts. I will try to write up the example of a process randomly selecting one of its two inputs. When the actual choice has not been observed by an observer, we can use alternating accounts to describe that one of the inputs was used.
comment 3 by Luc Moreau
About legality: I believe we try to define a model for inter-operability and its associated meaning. Non circularity (within one account) is one of the key properties we had identified at the end of the PC2 workshop.
comment 4 by Luc Moreau
IDs are introduced to help us construct graphs and express sharing of nodes. Without an ID, how can we decide whether
<artifact value="5"/>
is the same or not as
<artifact value="5"/>
I therefore think that IDs are crucial to understand the shape of a graph, and an essential part of OPM.
Regarding the question "Even if artifact/process IDs are desirable, why are 'used' and 'wasGeneratedBy' arcs defined as being between IDs and not between processes and artifacts themselves? Surely the edges of the graph should be between the nodes of the graph?", this applies to the proposed XML serialisation. We might have serialised OPM differently, and proposals are welcome (note that the XML serialisation has never been reviewed!). It is, however, crucial that sharing is expressed in the graph. How would you do that if artifacts/processes were placed in the edges instead of their identifiers? If our underpinning data structure were a tree (without sharing of nodes/branches), then, agreed, we could do without IDs.
comment 1RM by Robert McGrath
WRT accounts: The whole idea of accounts is that they are "hearsay". Note that the definition of an OPM graph is that it is an account. There is no such thing as an OPM graph that is not hearsay.
By definition, there can be many different accounts of the same events. That is the nature of hearsay. Hence, accounts overlap in many ways.
This is critical to the entire OPM!
comment 5 by Luc Moreau
I see a big distinction between IDs in OPM graphs and (global) names of nodes. IDs help express the topology of the graph. Two nodes with different IDs are by definition distinct nodes in the graph. Naming schemes and naming conventions are different. It is not rare for a given entity to have different names. In such a case, different names do not imply that the entities are different.
comment 6 by Simon Miles in reply to comment 1
I agree that absence of use in PC3 does not necessarily mean concepts in core OPM are non-essential. However, the fact that a challenge can be completed without using some part could suggest that it is essential only in some circumstances, and so perhaps more suitably expressed in a profile.
With the timed OPM model, what seems to be added are particular kinds of annotation for answering particular kinds of provenance question. I agree time is important and interoperability requires some standard way to include it with OPM data, but what would be lost by putting this in a profile? Moreover, separating the core causal graph model from the time-annotated model allows each to be refined separately. For example, if the community successfully argued that the time annotations should explicitly identify the clock from which the time readings came, then this could change the profile without affecting users of the core model.
With regards to agents, I can propose, in a separate change proposal, a clearer definition based on that currently in the Dublin Core profile.
comment 7 by Simon Miles in reply to comment 3
OK, I agree that it is important for interoperability and general clarity that an OPM graph creator knows what they should and should not do. My perception of it as non-essential perhaps comes more from the way legality is expressed, but changing that would be another proposal. I will remove legality from the list of proposed removals.
With regards to non-cyclic accounts, I am a little uneasy still. If artifact A is generated at time T by process P, and artifact B is used by P at time T+N, isn't it possible that B was derived from A? Wouldn't these 3 relationships be a cycle in the OPM graph? Is the modeller required to decompose P into sub-processes in this case to avoid the cycle? If so, why do we place this (possibly arduous) requirement on them?
comment 8 by Ben Clifford
I agree that time should be removed from the core specification and made into a profile. The present OPM time stuff feels to me like "just another annotation".
comment 9 by Ben Clifford
Artifacts and processes have identity by virtue of their existence, not by virtue of being given an identifying label. In some representations (such as the present XML format) it is necessary to give IDs to artifacts and processes in order to describe the relations between them. But in another representation, where an OPM graph is drawn on a piece of paper, IDs are not necessary, and the drawing is still a complete representation of the OPM graph. My feeling is that local ID information, if necessary, should be pushed to the specifications defining the particular representation.
comment 10 by Luc Moreau in reply to comment 7
Simon, your example is nice:
A <- B
P <- A
A <- P
It shows a data dependency (edge wasGeneratedBy) from B to A. It also shows that P generated A and used B.
However (cf. ChangeProposalWasDerivedCannotBeInferred), there is no dependency from A to B!
It is the data derivation (wasGeneratedBy) and its transitive closure that cannot have cycles! The current specification does not say so, and it needs to be fixed accordingly.
comment 11 by Luc Moreau in response to comment 7
When drawing a graph on a piece of paper, nodes have a unique "address" given by their position on the paper. You express sharing in the graph by drawing lines between specific positions.
I don't understand the proposal of moving IDs out of the model to specific serialisations. How do we know whether two artifact descriptions in an OPM graph denote the same thing or not?
comment 12 by Simon Miles in response to comments 4, 5, 9, 11
My intuition is, I think, the same as Ben's: nodes need identity to allow sharing, but that does not mean they have to have explicit identifiers outside of any one serialisation. If we want to assign identifiers for a particular purpose or in a particular serialisation we can, e.g. that proposed in ChangeProposalDCNaming. The artifact and process IDs seem tied to serialisation for a few reasons.
First, IDs being replaced seems to affect the serialisation but not the meaning of the graph. For example, if I took an OPM graph in the current XML serialisation and loaded it into memory using the JavaBeans deserialiser (effectively creating a new serialisation), then the JavaBeans for the graph edges would address the artifact and process node JavaBeans by memory locations, not their original IDs, and the graph can be fully interpreted without ever using the IDs used in the XML serialisation.
Second, for a particular class of applications, it may be sufficient not to ascribe explicit IDs in the serialisation, because only tree structures are ever present in the causal graphs. This would not make the graph non-interpretable or non-interoperable.
Third, a given serialisation or use of annotations may provide adequate identity for expressing shared nodes without requiring further IDs. For example, an XML serialisation automatically gives each part of the graph a unique XPath. Or, where global identifiers are provided in annotations for other purposes, these can also be used to express that two artifacts are one and the same. This suggests to me that the requirement for core OPM is for identity, not identifiers.
Finally, to answer the point about how to know whether two artifact descriptions denote the same thing or not, an alternative would be to include a serialisation-specific relation between two artifact descriptions saying that they denote the same thing when they do (with the default assumption that they denote different things). I agree that including such a relation requires identifying the artifacts but, as with the relation itself, this can be done in a serialisation-specific way. I am not suggesting this is preferable to using IDs, only that it achieves the same end if a particular serialisation chooses to do it. Either way establishes identity, and the important thing for the core model is that (shared) identity is clear, rather than identifiers.
comment 13 by Luc Moreau in response to 12
I can't see what your proposal is.
To me, it is crucial that we can reason about node equality in the abstract model, independently of any serialisation. Serialisations (in XML and RDF) or representations (as Java objects) will have to preserve this notion of equality.
Given that we aim at inter-operability, I am not in favour of saying that the model "assumes a notion of equality over nodes based on their identity". This would lead to problems of interpretation, and ultimately, systems will not inter-operate.
Sharing is an essential aspect of a provenance graph, and we must have a precise, unambiguous way of expressing it. This does not prevent a given serialisation from doing without identifiers, but it will be the duty of that serialisation to provide the means to reconstruct identifiers when reconstructing an OPM graph, and to drop them as it sees fit when serialising an OPM graph.
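The round trip described here can be sketched as follows, assuming an in-memory graph whose nodes carry no IDs and a JSON-based serialisation invented purely for illustration (it is not the OPM XML schema). Local IDs are generated when writing and discarded again after reading, while sharing of nodes is preserved. The names `Node`, `serialise` and `deserialise` are hypothetical.

```python
import json
from itertools import count


class Node:
    """In-memory artifact or process; it has no identifier field."""

    def __init__(self, kind, value=None):
        self.kind = kind
        self.value = value


def serialise(nodes, edges):
    """Assign fresh local IDs on the way out; edges are (relation, source, target)."""
    counter = count(1)
    ids = {node: f"n{next(counter)}" for node in nodes}
    return json.dumps({
        "nodes": [{"id": ids[n], "kind": n.kind, "value": n.value} for n in nodes],
        "edges": [[rel, ids[src], ids[dst]] for rel, src, dst in edges],
    })


def deserialise(text):
    """Rebuild the nodes, resolve IDs back to objects, then drop the IDs."""
    data = json.loads(text)
    by_id = {d["id"]: Node(d["kind"], d["value"]) for d in data["nodes"]}
    nodes = list(by_id.values())
    edges = [(rel, by_id[src], by_id[dst]) for rel, src, dst in data["edges"]]
    return nodes, edges


# Two distinct artifacts with the same value, sharing one process.
a, b, p = Node("artifact", 5), Node("artifact", 5), Node("process")
edges = [("used", p, a), ("wasGeneratedBy", b, p)]
nodes2, edges2 = deserialise(serialise([a, b, p], edges))
```

After the round trip, both edges in `edges2` refer to the same process object, and the two artifacts remain distinct objects despite sharing the value 5. The generated IDs exist only inside the serialised text.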
comment 14 by Simon Miles in response to comments 2, 1RM
Sorry, I was not clear enough about the change proposed regarding overlapping accounts. I did not mean to say that accounts cannot be about the same occurrences, or that they would not overlap.
The part of the specification I considered removable is the explicit assertion of non-refinement overlaps, and its inclusion in the formalised model. I don't see what is lost, and can see a gain in simplification, by removing declaration of overlaps between accounts. I have tried to clarify this by revising the change proposal above.
comment 15 by Simon Miles in reply to all above
I have tried to take into account all the comments so far to improve the specification of the proposed changes, and their rationale. Hopefully, this should make it clearer what the proposal is (especially in regards to artifact/process IDs and to overlapping accounts).
comment 16 by Simon Miles in reply to comment 10
I think I understand and agree with your argument, but just to clarify a couple of things. First, I wanted to check there is a typo in your graph, and it should be the following (A becomes B in third arc)?
A <- B
P <- A
B <- P
Second, you say "it is the data derivation (wasGeneratedBy) and its transitive closure that cannot have cycles". I assume you mean wasDerivedFrom?
So, in conclusion, the value in including legality in the OPM specification is to say explicitly: "An account creator must ensure there are no cycles in the sub-graph formed from all the wasDerivedFrom relationships in the account." If so, I agree that this makes sense with an intuitive understanding of "was derived from".
We might want to clarify the definition of wasDerivedFrom to ensure the graph above could not effectively occur in reality, e.g. replacing the current definition:
- Definition 8 (Artifact Derived from Artifact): An edge "was derived from" between two artifacts A1 and A2 indicates that artifact A1 needs to have been generated for A2 to be generated.
with:
- Definition 8 (Artifact Derived from Artifact): An edge "was derived from" between two artifacts A1 and A2 indicates that artifact A1 needs to have been completely generated for A2 to begin being generated.
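The legality rule as stated in this comment (no cycles in the sub-graph formed from the wasDerivedFrom relationships of an account) can be sketched as a depth-first search. The edge list below encodes the scenario from comment 7, with edges pointing from effect to cause; the encoding as `(relation, source, target)` triples and the function name `has_cycle` are assumptions made for illustration.

```python
def has_cycle(edges):
    """Return True if the directed graph given as (source, target) pairs has a cycle."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, [])

    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    colour = {node: WHITE for node in graph}

    def visit(node):
        colour[node] = GREY
        for successor in graph[node]:
            if colour[successor] == GREY:
                return True  # back edge: the current path loops
            if colour[successor] == WHITE and visit(successor):
                return True
        colour[node] = BLACK
        return False

    return any(colour[node] == WHITE and visit(node) for node in graph)


# Comment 7: P generates A, P later uses B, and B was derived from A.
edges = [
    ("wasGeneratedBy", "A", "P"),
    ("used", "P", "B"),
    ("wasDerivedFrom", "B", "A"),
]
all_edges = [(src, dst) for _, src, dst in edges]
derivations = [(src, dst) for rel, src, dst in edges if rel == "wasDerivedFrom"]

full_graph_cyclic = has_cycle(all_edges)
derivation_cyclic = has_cycle(derivations)
```

For this scenario `full_graph_cyclic` is `True` while `derivation_cyclic` is `False`: the cycle only appears when used and wasGeneratedBy edges are mixed with derivations, so the account is legal under the rule restricted to wasDerivedFrom.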
comment 17 by Luc Moreau in reply to comment 16
Yes, you're right, there was a typo. It should be P used B in the third edge.
And yes, I meant to say wasDerivedFrom edges cannot form a cycle.
I don't think your suggested definition change is appropriate, since artifacts are instantaneous state snapshots. There is no such thing as "begin to generate" and "end to generate" an artifact.
comment 18 by Luc Moreau in reply to comment 14
The reason why we have explicit assertions of account relationships (e.g. overlap, refinement, maybe others in the future) is that, in the absence of such assertions, an OPM graph reader would have to process the whole graph to infer them. It's not impossible, but it's tedious to do. Typically, this information is readily available to the OPM graph creator and hence is worth including in the graph.
I think that over time we will come up with new account relationships. Overlapping is a simple property (currently not very useful, I agree). Refinement is more interesting, and more work is required to tighten its definition. I attach an example of overlapping accounts that are not refinements. We may also want to introduce a "mutually exclusive" account relationship, which is another kind of overlap.
comment 19 by Luc Moreau in reply to all the above
It would be nice to see some summary of the discussion, and maybe to create new pages, one for each proposed change, so that we can vote on things easily.
Comment 20 by Luc on the revised proposal
1. I feel that it's important to keep the overlap declaration because it helps a querier/visualisation tool to make sense of the graph, its nesting/overlap, without having to process it! What would be the cost of inferring pairwise overlap relationships in a graph with A accounts, N nodes and E edges? I think in the worst case it's O(A^2 x N^2 x E) (for all edges, propagate effective accounts to all nodes; then, for each potential pairing of accounts, check whether there is overlap).
2. I am opposed to removing the value field of processes. Where are we going to attach the library/procedure name? The WSDL interface? The value field is an extensibility point that allows such application-specific information to be attached. If a better name than "value" is found, I am fine with it. I am also fine with adopting an agreed annotation for this kind of information.
3. Is this a cosmetic change or a fundamental change? I don't understand. I think we are stating that the way to provide time annotation is the one we propose and not any other. This does not seem to be a profile to me.
4. I am opposed to this proposal for the reason I explained before. It is important that the OPM abstract model provides the means to decide if two nodes are equal. This is not an issue to be left to serialisations, because otherwise we will have no means of mapping serialisation X to serialisation Y unless we have intimate knowledge of both X and Y. I also want to be able to implement OPM graph reasoning independently of how I am going to serialise my graphs.
Your comment, however, raises the issue of account equality, and maybe we should introduce identifiers for them too.
Comment 21 by Simon Miles in reply to Comment 19
OK, as suggested I've split the proposal into parts: ChangeProposalRemoveProcessValues, ChangeProposalRemoveOverlaps, ChangeProposalMoveTimeToProfile, and ChangeProposalRemoveIDs. I have copied the relevant comments across to the new pages, and replied to your most recent ones.
There are some points on this page which do not fit into those proposals. I will try to extract these, but it will have to wait for another time.
Vote
No vote on this proposal. Instead vote for the following proposals: ChangeProposalRemoveProcessValues, ChangeProposalRemoveOverlaps, ChangeProposalMoveTimeToProfile, and ChangeProposalRemoveIDs.
--
SimonMiles - 19 Jun 2009