Change Proposal: Remove IDs from Serialisation-Independent Model
Authors
SimonMiles
2009 July 21, extracted from previous discussion in ChangeProposalRemoveNonCore.
Subject
Core OPM specification
Background
Problem addressed
Artifact and process IDs appear to be serialisation-specific, so they should not belong to the core model. In addition, an OPM graph is defined as edges between IDs rather than between artifacts and processes, which is not intuitive and contradicts the figures illustrating OPM (see my arguments for these assertions in the rationale and comments sections below).
Proposed solution
In the Provenance Graph Definition, place the same rules on artifacts and processes as accounts, i.e.
- "Artifacts are entities that we assume can be compared. Artifacts contain a placeholder for a domain specific value or reference to a piece of state. Two artifacts are equal if and only if they have the same identifier (irrespective of their placeholder contents). Artifacts can optionally belong to accounts: account membership is declared by listing the accounts an artifact belongs to."
- "Processes are entities that we assume can be compared. Processes can optionally belong to accounts: account membership is declared by listing the accounts a process belongs to."
In the formalisation, replace the following:
- "We assume the existence of a few primitive sets: identifiers for processes, artifacts and agents, roles, and accounts. These sets of identifiers provide identities to the corresponding entities within the scope of a given provenance graph. A given serialization will standardize on these sets, and provide concrete representations for them."
with:
- "A serialisation of OPM will provide means to declare two accounts, two artifacts or two processes to be equal to each other, e.g. IDs scoped locally to the serialised graph."
Remove the following, and put it instead in the specification of the XML serialisation of OPM:
- "It is important to stress that the purpose of these identifiers is to define the structure of graphs: they are not meant to define identities that are persistent and reliably resolvable over time."
In the formalism, take Account, Role, Process, Agent, Artifact, Value to be the primitive sets; define ArtifactValue to be a mapping from an Artifact to its Value; define ArtifactAccounts to be a mapping from an Artifact to its set of Accounts; define ProcessAccounts to be a mapping from a Process to its set of Accounts; use Artifact/Process instead of ArtifactID/ProcessID in defining the causal relationships.
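The proposed formalism can be sketched in Python as follows. This is only an illustration under stated assumptions, not part of the OPM specification: the class names, the lower-case mapping names, the string accounts and roles, and the tuple layout of the edges are all hypothetical. Nodes are plain objects compared by object identity, and the causal relationships hold the nodes themselves rather than identifiers.

```python
class Artifact:
    """A node with no ID attribute; equality defaults to object identity."""


class Process:
    """A node with no ID attribute; equality defaults to object identity."""


a1, a2 = Artifact(), Artifact()
p1 = Process()

# ArtifactValue: mapping from an Artifact to its Value.
artifact_value = {a1: 5, a2: 5}

# ArtifactAccounts / ProcessAccounts: mappings to sets of Accounts
# (accounts are modelled as strings here purely for illustration).
artifact_accounts = {a1: {"account1"}, a2: set()}
process_accounts = {p1: {"account1"}}

# Causal relationships are defined over the nodes themselves.
used = {(p1, "in", a1)}                # (process, role, artifact)
was_generated_by = {(a2, "out", p1)}   # (artifact, role, process)

# Two artifacts with the same value remain distinct nodes.
print(a1 == a2, artifact_value[a1] == artifact_value[a2])
```

In this sketch the question of how two nodes are declared equal is answered by the representation (Python object identity), which is the kind of serialisation-specific choice the proposal has in mind.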
Rationale for the solution
I would normally consider graphs to be modelled as edges between nodes, but an OPM graph is modelled by edges between IDs which are parts of nodes.
Nodes need identity to allow sharing but that does not mean they have to have explicit identifiers outside of any one serialisation. If we want to assign identifiers for a particular purpose or in a particular serialisation we can, e.g. that proposed in ChangeProposalDCNaming. The artifact and process IDs seem tied to serialisation for a few reasons.
First, IDs being replaced seems to affect the serialisation but not the meaning of the graph. For example, if I took an OPM graph in the current XML serialisation and loaded it into memory using the JavaBeans deserialiser (effectively creating a new serialisation), then the JavaBeans for the graph edges would address the artifact and process node JavaBeans by memory locations not their original IDs, and the graph can be fully interpreted without ever using the IDs used in the XML serialisation.
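The deserialisation argument can be illustrated with a small Python sketch (Python objects standing in for JavaBeans). The XML below is a hypothetical, simplified format and not the actual OPM XML schema; the element and attribute names, the Node class and the load function are assumptions made for this example. The IDs are used only while loading and are then discarded, yet sharing is preserved because both edges reference the same object.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified XML; NOT the actual OPM XML schema.
XML = """
<graph>
  <artifact id="a1" value="5"/>
  <process id="p1"/>
  <process id="p2"/>
  <wasGeneratedBy effect="a1" cause="p1"/>
  <used effect="p2" cause="a1"/>
</graph>
"""


class Node:
    """In-memory node; deliberately stores no identifier."""

    def __init__(self, kind, value=None):
        self.kind = kind
        self.value = value


def load(xml_text):
    root = ET.fromstring(xml_text)
    by_id = {}  # needed only while loading, discarded afterwards
    for el in root:
        if el.tag in ("artifact", "process"):
            by_id[el.get("id")] = Node(el.tag, el.get("value"))
    edges = []
    for el in root:
        if el.tag in ("used", "wasGeneratedBy"):
            # Edges reference the node objects directly, not their IDs.
            edges.append((el.tag, by_id[el.get("effect")], by_id[el.get("cause")]))
    return edges


generated, used = load(XML)
# The artifact generated by p1 is the very object used by p2.
print(generated[1] is used[2])
```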
Second, for a particular class of applications, it may be sufficient not to ascribe explicit IDs in the serialisation, because only tree structures are ever present in the causal graphs. This would not make the graph non-interpretable or non-interoperable.
Third, a given serialisation or use of annotations may provide adequate identity for expressing shared nodes without requiring further IDs. For example, an XML serialisation automatically gives each part of the graph a unique XPath. Or, where global identifiers are provided in annotations for other purposes, these can also be used to express that two artifacts are one and the same. This suggests to me that the requirement for core OPM is for identity, not identifiers.
Finally, to answer the point about how to know whether two artifact descriptions denote the same artifact or not: an alternative would be to include a serialisation-specific relation between two artifact descriptions stating that they denote the same artifact (with the default assumption that they denote different artifacts). I agree that including such a relationship requires identifying the artifacts but, as with the relationship itself, this can be done in a serialisation-specific way. I am not suggesting this is preferable to using IDs, only that it achieves the same end if a particular serialisation chooses to do it. Either way of establishing identity would serve: the important thing for the core model is that (shared) identity is clear, rather than identifiers.
Comments
Community is invited to provide comments on proposals.
comment 1 by Luc Moreau
IDs are introduced to help us construct graphs and express
sharing of nodes. Without an ID, how can we decide whether
<artifact value="5"/>
is the same or not as
<artifact value="5"/>
I therefore think that IDs are crucial to understand the shape of a graph, and an essential part of OPM.
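The point of the two identical elements above can be made concrete with a short Python sketch. It is an illustration only: the id attribute and the helper names observable and same_node are assumptions for this example, not taken from the OPM XML serialisation.

```python
import xml.etree.ElementTree as ET


def observable(el):
    """Everything a consumer can read off one element: tag and attributes."""
    return (el.tag, tuple(sorted(el.attrib.items())))


# Without IDs, the two descriptions are indistinguishable, so a consumer
# cannot tell whether they describe one shared node or two separate nodes.
x = ET.fromstring('<artifact value="5"/>')
y = ET.fromstring('<artifact value="5"/>')
print(observable(x) == observable(y))


def same_node(a, b):
    """Decision procedure based on a graph-scoped id attribute (assumed)."""
    return a.get("id") == b.get("id")


# With IDs, equal values no longer imply the same node.
u = ET.fromstring('<artifact id="a1" value="5"/>')
v = ET.fromstring('<artifact id="a2" value="5"/>')
w = ET.fromstring('<artifact id="a1" value="5"/>')
print(same_node(u, v), same_node(u, w))
```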
Regarding the question "Even if artifact/process IDs are desirable, why are 'used' and 'wasGeneratedBy' arcs defined as being between IDs and not between processes and artifacts themselves? Surely the edges of the graph should be between the nodes of the graph? ", this applies to the proposed XML serialisation. We might have serialised OPM differently, and proposals are welcome (note that the XML serialisation has never been reviewed!). It is however crucial that sharing is expressed in the graph. How would you do it if artifacts/processes were placed in the edges instead of their identifiers? If our underpinning data structure were a tree (without sharing of nodes/branches), then, agreed, we could do without IDs.
comment 2 by Luc Moreau
I see a big distinction between IDs in OPM graphs and (global) names of nodes. IDs help express the topology of the graph. Two nodes with different IDs are by definition distinct nodes in the graph. Naming schemes and naming conventions are different. It is not rare for a given entity to have different names. In such a case, different names do not imply that the entities are different.
comment 3 by Ben Clifford
Artifacts and processes have identity by virtue of their existence, not by virtue of being given an identifying label. In some representations (such as the present XML format) it is necessary to give IDs to artifacts and processes in order to describe the relations between them. But in another representation, where an OPM graph is drawn on a piece of paper, IDs are not necessary, and the drawing is still a complete representation of the OPM graph. My feeling is that local ID information, if necessary, should be pushed to the specifications defining the particular representation.
comment 4 by Luc Moreau in response to comment 3
When drawing a graph on a piece of paper, nodes have a unique "address" given by their position on the paper. You express sharing in the graph by drawing lines between specific positions.
I don't understand the proposal of moving IDs out of the model to specific serialisations. How do we know whether two artifact descriptions in an OPM graph denote the same or not?
comment 5 by Simon Miles in response to comments 1, 2, 3, 4
My intuition is, I think, the same as Ben's: nodes need identity to allow sharing but that does not mean they have to have explicit identifiers outside of any one serialisation. If we want to assign identifiers for a particular purpose or in a particular serialisation we can, e.g. that proposed in ChangeProposalDCNaming. The artifact and process IDs seem tied to serialisation for a few reasons.
First, IDs being replaced seems to affect the serialisation but not the meaning of the graph. For example, if I took an OPM graph in the current XML serialisation and loaded it into memory using the JavaBeans deserialiser (effectively creating a new serialisation), then the JavaBeans for the graph edges would address the artifact and process node JavaBeans by memory locations not their original IDs, and the graph can be fully interpreted without ever using the IDs used in the XML serialisation.
Second, for a particular class of applications, it may be sufficient not to ascribe explicit IDs in the serialisation, because only tree structures are ever present in the causal graphs. This would not make the graph non-interpretable or non-interoperable.
Third, a given serialisation or use of annotations may provide adequate identity for expressing shared nodes without requiring further IDs. For example, an XML serialisation automatically gives each part of the graph a unique XPath. Or, where global identifiers are provided in annotations for other purposes, these can also be used to express that two artifacts are one and the same. This suggests to me that the requirement for core OPM is for identity, not identifiers.
Finally, to answer the point about how to know whether two artifact descriptions denote the same artifact or not: an alternative would be to include a serialisation-specific relation between two artifact descriptions stating that they denote the same artifact (with the default assumption that they denote different artifacts). I agree that including such a relationship requires identifying the artifacts but, as with the relationship itself, this can be done in a serialisation-specific way. I am not suggesting this is preferable to using IDs, only that it achieves the same end if a particular serialisation chooses to do it. Either way of establishing identity would serve: the important thing for the core model is that (shared) identity is clear, rather than identifiers.
comment 6 by Luc Moreau in response to 5
I can't see what your proposal is.
To me, it is crucial that we can reason about node equality in the abstract model, independently of any serialisation. Serialisations (in XML and RDF) or representations (as Java objects) will have to preserve this notion of equality.
Given that we aim at inter-operability, I am not in favour of saying that the model "assumes a notion of equality over nodes based on their identity". This would lead to problems of interpretation, and ultimately, systems will not inter-operate.
Sharing is an essential aspect of a provenance graph, and we must have a precise, unambiguous way of expressing it. This does not prevent a given serialisation from doing without identifiers, but it will be the duty of that serialisation to provide the means to reconstruct identifiers when reconstructing an OPM graph, and to drop them as it sees fit when serializing an OPM graph.
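As a rough illustration of a serialisation that does without identifiers and reconstructs them on loading, consider the Python sketch below. The nested-tuple format, the reconstruct function and the generated names n1, n2, ... are all hypothetical; the sketch only covers the tree-shaped case, where no sharing needs to be expressed.

```python
import itertools

# Hypothetical ID-free serialisation: a nested tree in which each node
# lists the nodes it causally depends on. Sharing cannot be expressed.
tree = ("artifact", [("process", [("artifact", []), ("artifact", [])])])


def reconstruct(tree):
    """Assign fresh, graph-local identifiers while walking the tree."""
    counter = itertools.count(1)
    nodes, edges = {}, []

    def visit(node):
        kind, causes = node
        node_id = f"n{next(counter)}"
        nodes[node_id] = kind
        for cause in causes:
            edges.append((node_id, visit(cause)))  # (effect, cause)
        return node_id

    visit(tree)
    return nodes, edges


nodes, edges = reconstruct(tree)
print(nodes)
print(edges)
```

The identifiers produced this way carry no meaning beyond the loaded graph, so a different loader could choose different names for the same structure.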
Comment 7 by Luc on the revised proposal
I am opposed to this proposal for the reason I explained before. It is important that the opm abstract model provides the means to decide if two nodes are equal. This is not an issue to be left to serialisations, because otherwise we will have no means of mapping serialisation X to serialisation Y, unless we have intimate knowledge of both x and y. I also want to be able to implement opm graph reasoning, independently of how I am going to serialise my graphs.
Your comment however raises the issue of account equality, and maybe we should introduce identifiers for them too.
Comment 8 by Simon Miles in reply to Comment 7
I am afraid I still don't understand why IDs are part of the abstract (serialisation-independent) model. I try to explain why it seems wrong from a couple of perspectives below, then answer specific points in your comment.
First, to make a comparison, if I create a UML model, for example, I would not have to add ID attributes to every class before I can make one object an aggregate of another, or represent one object passing a message to another. I can even make a graph out of inter-referencing objects. It is in the nature of modelling that entities are distinguished and become referenceable. This does not place particular restrictions on how the abstract model may be realised in implementation, i.e. how C++/Java/whatever chooses to give references to the classes and objects.
Second, the formal model also seems to contradict the example figures in the specification depicting OPM graphs (and I agree with the model used by the figures). In the formal model, the edges go between IDs of the graph nodes, but in the figures the edges go between the nodes themselves. It might be argued that the figures only simplify and approximate the OPM graph, but I can't see anything missing in them.
With regards to mapping between serialisations, I'm not sure if you mean translation of a graph from one serialisation to another or combination of two independently produced graphs including documentation of the same artifact/process. If the former, there seems no need for equivalence of IDs (if any) between one serialisation and another: X represents an OPM graph, possibly using IDs to express node sharing in the graph, Y represents the same graph in a different form, possibly using different IDs to express node sharing. If the latter, then you would require IDs whose scope of uniqueness exceeded the graph they are part of, to know that something in one graph is equivalent to something in another, and I understand the usefulness of globally unique ID annotations as a separate issue.
With regards to reasoning, I still see no impediment. Isn't it the graph itself you are reasoning over, in some representation? The IDs are opaque, so provide no information over which to reason?
Comment 9 by Luc to comment 7
Your proposal states: "Artifacts are entities that we assume can be compared". The OPM model for me should be "computable": by this I mean that it is a technology-agnostic representation, in which we should be capable of performing all possible reasoning over OPM graphs. Hence, an assumption that "artifacts can be compared" is not enough. We need to have a decision procedure in the model to compare artifacts. My proposal is to have IDs. If you do not adopt IDs, then what is your comparison procedure?
Once such IDs have been adopted, it is natural in semantics to map IDs to values, etc. Programming language semantics for instance uses a notion of store location, where variable values can be found and assigned. It also allows us to express sharing, like our IDs do.
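The store-location analogy can be sketched in Python as follows. This is a generic textbook-style illustration, not OPM-specific; the names env, store and lookup are assumptions for the example. Locations play the role that IDs play in an OPM graph: two names may hold equal values, but only names bound to the same location share state.

```python
# Environment maps variable names to locations; store maps locations to values.
env = {"x": 0, "y": 0, "z": 1}   # x and y share location 0
store = {0: 5, 1: 5}


def lookup(name):
    return store[env[name]]


# x and z hold the same value, but only x and y denote the same location.
print(env["x"] == env["y"], env["x"] == env["z"], lookup("x") == lookup("z"))

# An assignment through x is visible through y, but not through z.
store[env["x"]] = 7
print(lookup("y"), lookup("z"))
```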
--
JimMyers - 17 Sep 2009
I get a sense that this discussion has aspects that are tied to the formal spec versus the English description and I'm not sure I understand the issues there.
At the English level, I think I'm agreeing with Luc in thinking that regardless of realization there is some notion of identifier that we want to allow us to decide when things are the same. The alternative seems to be to define equality by recipe - things that have the same value, owner, and creation date (or same provenance?) are the same - which makes things unnecessarily complex.
If part of the issue is that the spec waffles between description of identifiers and nodes/the things themselves, I would agree that it would be good to clean it up. I don't know what that entails - it seems most natural to me to talk in terms of identifiers that capture the topology and assertions that, for example, artifacts are assumed to have defined state and thus there is a notion of a canonical value for an artifact that should always be the same for a given artifact ID used in an OPM graph.
Comment 11 by Simon Miles in reply to Comment 9
Maybe I have a different idea of what a model is, as I can't see how you perform reasoning over the model itself rather than representations of instances of the model. I understand you can create a template for particular patterns of reasoning, such as with the inference rules or heuristics, where you could say, for example, "if A1 was generated by P1 and A2 was used by P2 and A1 denotes the same artifact as A2 then P1 triggered P2". I don't see why it would help to add a decision procedure in the model for describing how A1 denotes the same artifact as A2. However, in a particular representation, e.g. as objects in Java, as RDF resources, or whatever, i.e. where there is actual data, I can understand describing an equivalence procedure based on global IDs, IDs scoped by graph or application, equivalence relations, or whatever.
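The example rule above can be written as a template that takes the equivalence procedure as a parameter, so that each representation supplies its own. The Python sketch below is an illustration under assumptions: the function name infer_triggered, the pair layout of the relations and the Node class are hypothetical.

```python
import operator


def infer_triggered(was_generated_by, used, same_artifact):
    """Apply the rule: A1 generated by P1, A2 used by P2, A1 same as A2
    implies P1 triggered P2.

    was_generated_by: pairs (artifact, process)
    used:             pairs (process, artifact)
    same_artifact:    representation-specific equivalence procedure
    """
    return {
        (p1, p2)
        for (a1, p1) in was_generated_by
        for (p2, a2) in used
        if same_artifact(a1, a2)
    }


# Representation 1: graph-scoped IDs, compared lexically.
by_id = infer_triggered({("a1", "p1")}, {("p2", "a1")}, operator.eq)


# Representation 2: in-memory objects, compared by object identity.
class Node:
    pass


a, p1, p2 = Node(), Node(), Node()
by_ref = infer_triggered({(a, p1)}, {(p2, a)}, operator.is_)

print(by_id)
print(by_ref == {(p1, p2)})
```

The rule itself is the same in both cases; only the argument passed as same_artifact changes with the representation.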
Comment 12 by Simon Miles in reply to Jim
I agree that the more important issue is the consistency of formalisation and description. I still think removing IDs from the model would make sense, and would allow for more flexibility in defining OPM representations, but it is not a major issue in the scheme of things. More importantly, the description and figures show arcs going between artifacts and processes, while the formalisation describes the arcs going between IDs, which are only parts of those entities and independent from, for example, artifacts' values. This could cause confusion and so should be avoided.
Comment 13 by Jan Van den Bussche
A graph consists of nodes and edges. OPM likes to call the nodes "identifiers" and that seems harmless to me.
Joe Futrelle
Agreed that identifiers and identity must be treated as equivalent in the formal model; in terms of how that model is explained in the spec, consistently referring to nodes or IDs could potentially improve readability. I don't want a formal model without identifiers though, because then we need some other way of expressing node equivalence classes, and the only way I can think of (existential quantification) is not only awkward to represent in our favorite serialization formats (including RDF) but is also not in any relevant way semantically different from using identifiers (via Skolemization).
Comment 15 by Paolo
When you require that two entities be comparable, in the sense that it can be established whether they are the same or not, you are quite close to the notion of identifiers. Hiding them from the formal model would seem like an unnecessary purism that would then bring about further complications, some of which have been pointed out. IMO the most important is that an equality function must be defined as part of the model, and lexical equality of identifiers seems to be not only common and well-understood, but also harmless. One additional complication has to do with establishing that two serializations are equivalent, when they use different sets of identifiers (this has been pointed out in comment 7).
I agree that the notation used in the text, and the figures, should be harmonized.
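The complication of establishing that two serialisations are equivalent when they use different identifiers amounts to checking that the graphs are equal up to a renaming of IDs. A brute-force Python sketch follows; it is only an illustration, it is exponential in the number of nodes, and it ignores node kinds and values, which a real comparison would also need to check. The function name equivalent and the triple layout (source, label, target) are assumptions.

```python
from itertools import permutations


def equivalent(nodes_x, edges_x, nodes_y, edges_y):
    """True if some renaming of X's identifiers turns X's edges into Y's."""
    if len(nodes_x) != len(nodes_y) or len(edges_x) != len(edges_y):
        return False
    for perm in permutations(nodes_y):
        rename = dict(zip(nodes_x, perm))
        renamed = {(rename[s], label, rename[t]) for (s, label, t) in edges_x}
        if renamed == set(edges_y):
            return True
    return False


# Serialisation X and serialisation Y use different identifier sets.
nodes_x = ["a1", "p1", "p2"]
edges_x = {("a1", "wasGeneratedBy", "p1"), ("p2", "used", "a1")}

nodes_y = ["n7", "n8", "n9"]
edges_y = {("n9", "wasGeneratedBy", "n7"), ("n8", "used", "n9")}

# Same identifiers as Y, but the artifact is no longer shared by both edges.
edges_z = {("n9", "wasGeneratedBy", "n7"), ("n8", "used", "n7")}

print(equivalent(nodes_x, edges_x, nodes_y, edges_y))
print(equivalent(nodes_x, edges_x, nodes_y, edges_z))
```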
Vote
Simon Miles, yes
Paul Groth, no - especially if the formalization is to be separated from the main doc
Luc Moreau, no since it totally breaks what we tried to achieve with OPM.
NataliaKwasnikowska, no, a set of IDs can be used in a formalisation as a set of nodes
Jan Van den Bussche, no in the sense of my comment
Joe Futrelle, no
PaoloMissier, no
EricStephan, no
Outcome
The vote result is: no: 7, yes: 1.
There is no support for this proposal. The proposal is to reject it.
--
SimonMiles - 21 Jul 2009