This document describes the provenance requirements of CombeChem. It does not focus on a single application of CombeChem, because the requirements overlap and provenance is seen by the project to be generic over several applications. We first describe the form of experiments run and then detail the provenance uses envisaged. As CombeChem have already started to look at how to meet the requirements listed, we also describe their initial work in tackling each use case. Thanks to Hugo Mills for much of the information below.
In the crystallography application, the National Crystallography Service analyse crystals submitted to them by chemists. This is a very well-defined process of four or five steps that determines the structure of the crystal and of the compound it is made of. The final result should be a data file containing refined atomic positions.
The synthetic-organic application is slightly less structured, but a rough idea of the workflow to be followed in each experiment will be known and encoded in advance (it is required for health and safety reasons, if nothing else). At each stage of the experiment, the experimenter will decide which step to take next based on the data produced by the previous one. This application is mostly lab-based rather than made up of software processes.
The second-harmonic generation (SHG) application, which analyses properties of liquids by bouncing lasers off them, is very unstructured: different processes and analyses will be attempted without a prior plan.
The computational chemists are processing result data from chemistry experiments that have already been performed, to try to determine connections between properties of materials. Some properties are easy to discover, such as the charge distribution around a molecule, while others are more difficult, such as the melting point of a molecule. If a connection can be made between two such properties, a lot of time will be saved by measuring the easy-to-determine property and deriving the hard-to-determine one from it. Other experiments that the group are involved in include simulating protein folding, protein docking and molecular dynamics.
For lab-based experiments, such as the SHG experiment, the process is largely performed on equipment that does not record all intermediate data or the details of the process taking place. Computer-stored provenance comes from the user interacting with their electronic lab book. This may simply consist of the user ticking off that an experimental task has been done. They choose whether to add extra information, such as intermediate data or process information, as annotations.
The computer scientists in CombeChem are currently working on ways to best encode the provenance information. They are describing the process in RDF, using semantic terms from their own ontology. The record of an experiment is divided into four parts.
Rather than storing the actual data produced, which may be large, in the RDF graph, they intend to keep the data in a separate store and refer to it by URI. They are looking into the Storage Resource Broker (SRB) for this purpose.
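The separation between the process description and the stored data can be illustrated with a minimal sketch. It uses plain Python tuples as triples rather than a real RDF library, and everything named in it is an assumption made for illustration: the namespace http://example.org/combechem#, the property names usedProcess and producedData, and the srb:// URI are hypothetical and are not taken from CombeChem's ontology or from the SRB.

```python
# Hypothetical namespace and terms; CombeChem's actual ontology is not reproduced here.
EX = "http://example.org/combechem#"


def describe_step(graph, step_uri, process, data_uri):
    """Record that a step ran a process and produced data that is held elsewhere."""
    graph.add((step_uri, EX + "usedProcess", process))
    # Only the URI of the data goes into the graph, never the data itself.
    graph.add((step_uri, EX + "producedData", data_uri))


def data_locations(graph, step_uri):
    """Return the URIs of all data items a step is recorded as having produced."""
    return sorted(o for s, p, o in graph
                  if s == step_uri and p == EX + "producedData")


graph = set()  # a set of (subject, predicate, object) triples
describe_step(graph, EX + "step1", EX + "DataCollection",
              "srb://example.org/ncs/sample42/frames")  # illustrative URI only
print(data_locations(graph, EX + "step1"))
```

The graph stays small however large the referenced data is, at the cost of needing the external store to be available whenever the data itself has to be inspected.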
Currently, the National Crystallography Service are using, or setting up to use, e-Bank. e-Bank is a Semantic Grid project and provides an adaptation of the e-Prints publications archive that allows scientific data to be stored and linked.
In the synthetic-organic application, the experiments will be based on pre-defined plans and the interesting information will be in how each experiment differed from its pre-defined plan.
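One way such differences might be extracted is sketched below, assuming that both the plan and the record of what was actually done are available as ordered lists of step names. The step names and the function plan_deviations are hypothetical, and the comparison uses Python's standard difflib module rather than anything CombeChem are known to use.

```python
from difflib import SequenceMatcher


def plan_deviations(planned, performed):
    """Return (tag, planned_steps, performed_steps) for each departure from the plan.

    tag is one of 'replace', 'insert' or 'delete', as reported by difflib.
    """
    matcher = SequenceMatcher(a=planned, b=performed, autojunk=False)
    return [(tag, planned[i1:i2], performed[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]


# Illustrative data only.
planned = ["weigh reagent", "dissolve", "reflux 2h", "filter", "dry"]
performed = ["weigh reagent", "dissolve", "reflux 3h", "filter",
             "recrystallise", "dry"]

for deviation in plan_deviations(planned, performed):
    print(deviation)
```

In this example the changed reflux time is reported as a replacement and the extra recrystallisation step as an insertion, while the steps that followed the plan are omitted.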
A comparable check may be performed if several experiments produce strange results. The scientist may then wish to see whether the experiments used some of the same original material, or material from the same batch, which may suggest contamination. Similarly, equipment will deteriorate over time, affecting experiment results, e.g. the laser in the second-harmonic generation experiment. To detect this deterioration, the results of several experiments using that equipment can be compared with earlier ones where the equipment was known to be reliable.
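Both checks can be sketched over simple provenance records. The record layout (a dictionary with a "batches" entry per experiment), the function names and the tolerance-based drift test are assumptions made for illustration; a real comparison of equipment results would probably need a more careful statistical test.

```python
from collections import defaultdict
from statistics import mean


def shared_inputs(experiments, suspect_ids):
    """Map each batch to the suspect experiments that used it.

    Only batches used by more than one suspect experiment are kept, since
    those are the candidates for a common source of contamination.
    """
    users = defaultdict(set)
    for exp_id in suspect_ids:
        for batch in experiments[exp_id]["batches"]:
            users[batch].add(exp_id)
    return {batch: ids for batch, ids in users.items() if len(ids) > 1}


def has_drifted(baseline, recent, tolerance):
    """True if the mean of recent readings is further than tolerance from the baseline mean."""
    return abs(mean(recent) - mean(baseline)) > tolerance


# Illustrative records only.
experiments = {
    "e1": {"batches": {"B7", "B9"}},
    "e2": {"batches": {"B7"}},
    "e3": {"batches": {"B2"}},
}
print(shared_inputs(experiments, ["e1", "e2", "e3"]))
print(has_drifted([1.00, 1.02, 0.98], [0.80, 0.82, 0.78], tolerance=0.1))
```

Either check is only possible if the provenance record links each experiment to the batches and equipment it used.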
As an extension to this, results may be published on-line, allowing data to be inter-linked between related experiments. This information should also be captured in the provenance.
CombeChem are considering cryptographically time-stamping and digitally signing their RDF graphs as a computational equivalent to this process. There is concern about which part of the graph to sign: the graph will refer to entities outside itself, so what should be included? Where data is referred to by a URI and stored separately (e.g. by an SRB), a hash can be taken of the data and signed by a trusted third party to ensure its validity. A possible time-stamping algorithm they are considering using is detailed in Section 4.1 of Applied Cryptography (second edition) by Bruce Schneier.
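The idea of hashing separately stored data and having a third party time-stamp the hash can be sketched as follows. This is not CombeChem's design and is not claimed to be the specific algorithm they are considering: it is a toy, linked time-stamping service in which each record includes the hash of the previous one. The class TimestampService is hypothetical, the dates are illustrative, and an HMAC with a secret key stands in for a real digital signature, which would use public-key cryptography so that anyone could verify a record.

```python
import hashlib
import hmac
import json


def digest(data: bytes) -> str:
    """SHA-256 hash of the data, as a hex string."""
    return hashlib.sha256(data).hexdigest()


class TimestampService:
    """Toy trusted third party. HMAC is an assumed stand-in for a digital signature."""

    def __init__(self, key: bytes):
        self._key = key
        self._previous = ""  # hash of the previous record; empty for the first

    def _sign(self, payload: dict) -> str:
        message = json.dumps(payload, sort_keys=True).encode()
        return hmac.new(self._key, message, hashlib.sha256).hexdigest()

    def stamp(self, data_hash: str, time: str) -> dict:
        """Issue a signed record binding the hash to a time and to the previous record."""
        payload = {"hash": data_hash, "time": time, "previous": self._previous}
        record = dict(payload, signature=self._sign(payload))
        self._previous = digest(json.dumps(record, sort_keys=True).encode())
        return record

    def verify(self, data: bytes, record: dict) -> bool:
        """Check that the data matches the recorded hash and the signature is intact."""
        payload = {k: record[k] for k in ("hash", "time", "previous")}
        return (record["hash"] == digest(data)
                and hmac.compare_digest(record["signature"], self._sign(payload)))


service = TimestampService(b"demo-key")
first = service.stamp(digest(b"refined atomic positions"), "2004-01-01T00:00:00Z")
second = service.stamp(digest(b"another result"), "2004-01-02T00:00:00Z")
print(service.verify(b"refined atomic positions", first))
print(service.verify(b"tampered data", first))
```

Only the hash is sent to the service, so the data itself can remain in the separate store. Because each record includes the hash of the one before it, a record cannot be back-dated without altering every later record.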