Provenance Requirements of CombeChem Applications

Author: Simon Miles
Project: This work was conducted as part of the PASOA project (EPSRC GR/S67623/01)
Last modified: 5th May 2004

This document describes the provenance requirements of CombeChem. It does not focus on a single application of CombeChem, because the requirements overlap and provenance is seen by the project to be generic over several applications. We first describe the form of experiments run and then detail the provenance uses envisaged. As CombeChem have already started to look at how to meet the requirements listed, we also describe their initial work in tackling each use case. Thanks to Hugo Mills for much of the information below.

Scenario

CombeChem experiments are a mixture of lab-based and software processes, and the group includes lab-based chemists, computer scientists and computational chemists. There are several distinct applications in the project, including crystallography, synthetic-organic and simple-harmonic generation (SHG) experiments. Experiments typically have few stages (12 to 15 at most), but each stage may take anywhere from several hours to several months.

In the crystallography application, the National Crystallography Service analyse crystals submitted to them by chemists. This is a very well-defined process of about 4 or 5 steps that determines the structure of the crystal and of the compound it is made of. The final result should be a data file containing refined atomic positions.

The synthetic-organic application is slightly less structured, but a rough idea of the workflow to be followed in each experiment will be known and encoded in advance (if only because this is required for health and safety reasons). At each stage of the experiment the experimenter will decide which step to take next based on the data produced by the previous one. This application is mostly lab-based rather than software-based.

The simple-harmonic generation application, which analyses properties of liquids by bouncing lasers off them, is very unstructured: different processes and analyses will be attempted without a prior plan.

The computational chemists are processing result data from chemistry experiments that have already been performed, to try to determine connections between properties of materials. Some properties are easy to discover, such as the charge distribution around a molecule, while others are more difficult, such as melting point. Therefore, if a connection can be established between two such properties, a lot of time can be saved by measuring the easy-to-determine property and deriving the hard-to-determine one from it. Other experiments in which the group is involved include simulating protein folding, protein docking and molecular dynamics.

Use Cases

Use of provenance 1: Human determination of the origin of data

CombeChem would like to enable chemists to browse back through the provenance trace to discover the process through which a result (material or data) was obtained, what original material or data was used to produce it, and how the data was later used. They consider the level of detail required to be 'vague', that is, the high-level description of the experiment that a chemist would normally enter in their lab book.
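As a minimal sketch of this kind of browsing, the provenance trace can be treated as a set of "output was derived from inputs by process" records and walked in either direction. The record layout, the item names and the helper functions below are all assumptions made for illustration; they are not CombeChem's representation.

```python
from collections import defaultdict

# Hypothetical trace: (output, inputs, process that produced the output).
RECORDS = [
    ("crystal_sample", ["raw_compound"], "crystallisation"),
    ("diffraction_data", ["crystal_sample"], "x-ray diffraction"),
    ("refined_structure", ["diffraction_data"], "structure refinement"),
    ("paper_figure", ["refined_structure"], "plotting"),
]


def build_index(records):
    """Index the trace in both directions: item -> sources and item -> uses."""
    derived_from = defaultdict(list)
    used_by = defaultdict(list)
    for output, inputs, process in records:
        for source in inputs:
            derived_from[output].append((source, process))
            used_by[source].append((output, process))
    return derived_from, used_by


def walk(item, index):
    """Return every (item, process) pair reachable from `item`, nearest first."""
    seen, queue, found = {item}, [item], []
    while queue:
        current = queue.pop(0)
        for neighbour, process in index.get(current, []):
            if neighbour not in seen:
                seen.add(neighbour)
                found.append((neighbour, process))
                queue.append(neighbour)
    return found


derived_from, used_by = build_index(RECORDS)
origins = walk("refined_structure", derived_from)   # what the result came from
later_uses = walk("refined_structure", used_by)     # how the result was used later
```

Walking `derived_from` answers "what was this produced from and how", while walking `used_by` answers "how was this later used"; both stay at the coarse, lab-book level of detail described above.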

For lab-based experiments, such as the SHG experiment, the process is largely performed on equipment that does not record all intermediate data or the details of the process taking place. Computer-stored provenance comes from the user interacting with their electronic lab book. This may simply consist of the user ticking off that an experimental task has been done. The user chooses whether to add extra information, such as intermediate data or process details, as annotations.

The computer scientists in CombeChem are currently working on ways to best encode the provenance information. They are describing the process in RDF, using semantic terms from their own ontology. The record of an experiment is divided into four parts.

  1. The general plan being followed in human-understandable terms.
  2. The general plan being followed described as entities representing processes and intermediate results expected.
  3. The actual instantiations of the plan processes in a given experiment.
  4. The sub-processes involved in the experiment, e.g. weighing materials ready for a process, which are not included in the general plan. Also, other observations and annotations which provide more information on how the experiment went and the values of intermediate data.

Example RDF graphs are shown for a tea-making experiment in Tea-Experiment-Resources-20040428.png and Tea-Experiment-Properties-20040428.png, the former showing the resources in the graph, the latter showing the properties linking resources.
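The four-part record can be sketched as four groups of subject-predicate-object triples. The example below uses plain Python tuples rather than an RDF library, and every identifier and predicate name in it (plan:..., run1:..., ex:...) is invented for illustration; none of them is taken from the CombeChem ontology or from the graphs in the figures.

```python
# Triples are (subject, predicate, object). All names are illustrative only.
PLAN_TEXT = [  # part 1: the plan in human-understandable terms
    ("plan:tea", "ex:description", "Boil water, then brew the tea."),
]
PLAN_STRUCTURE = [  # part 2: planned processes and expected intermediate results
    ("plan:boil", "ex:partOf", "plan:tea"),
    ("plan:brew", "ex:partOf", "plan:tea"),
    ("plan:brew", "ex:follows", "plan:boil"),
    ("plan:boil", "ex:expectedResult", "plan:hot_water"),
]
INSTANCES = [  # part 3: instantiations of plan processes in one experiment
    ("run1:boil", "ex:instanceOf", "plan:boil"),
    ("run1:brew", "ex:instanceOf", "plan:brew"),
]
ANNOTATIONS = [  # part 4: unplanned sub-processes, observations, annotations
    ("run1:weigh_leaves", "ex:subProcessOf", "run1:brew"),
    ("run1:boil", "ex:observation", "Kettle switched off automatically."),
]

record = PLAN_TEXT + PLAN_STRUCTURE + INSTANCES + ANNOTATIONS


def objects(graph, subject, predicate):
    """All objects of triples matching the given subject and predicate."""
    return [o for s, p, o in graph if s == subject and p == predicate]


def subjects(graph, predicate, obj):
    """All subjects of triples matching the given predicate and object."""
    return [s for s, p, o in graph if p == predicate and o == obj]


def unplanned_steps(graph):
    """Sub-processes recorded in a run that do not instantiate a planned process."""
    instantiated = {s for s, p, _ in graph if p == "ex:instanceOf"}
    return [s for s, p, _ in graph if p == "ex:subProcessOf" and s not in instantiated]
```

Keeping the plan (parts 1 and 2) separate from the run (parts 3 and 4) means that a query such as `subjects(record, "ex:instanceOf", "plan:brew")` finds what was actually done for a planned step, while `unplanned_steps(record)` picks out the work that the plan never mentioned.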

Rather than storing the actual data produced, which may be large, in the RDF graph, they intend to keep the data in a separate store and refer to it by URI. They are looking into the Storage Resource Broker (SRB) for this purpose.
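The separation of graph and data can be illustrated with an in-memory stand-in for the external store. This is only a sketch: the `DataStore` class, the example.org base URI and the choice to derive each URI from a hash of the content are assumptions for the example, and the code does not use or imitate the SRB interface.

```python
import hashlib


class DataStore:
    """In-memory stand-in for a separate bulk-data store (not the SRB API)."""

    def __init__(self, base_uri):
        self.base_uri = base_uri.rstrip("/")
        self._blobs = {}

    def put(self, data: bytes) -> str:
        """Store `data` and return a URI derived from its SHA-256 hash."""
        key = hashlib.sha256(data).hexdigest()
        uri = f"{self.base_uri}/{key}"
        self._blobs[uri] = data
        return uri

    def get(self, uri: str) -> bytes:
        """Fetch the bytes previously stored under `uri`."""
        return self._blobs[uri]


store = DataStore("http://example.org/data")   # hypothetical base URI
raw = b"x,y,z\n0.1,0.2,0.3\n" * 1000           # stands in for a large result file
uri = store.put(raw)

# The graph holds only the short reference, never the payload itself.
graph = [("run1:diffraction", "ex:producedData", uri)]
```

The graph stays small however large the result files become, and anyone holding the triple can retrieve the data with `store.get(uri)`.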

Use of provenance 2: Referencing and linking produced data

CombeChem, and particularly the crystallography application, want the final and intermediate data of an experiment to be available and referenceable, so that they can be linked to from papers and discovered for use in other experiments.

Currently, the National Crystallography Service are using, or setting up to use, e-Bank. e-Bank is a Semantic Grid project and provides an adaptation of the e-Prints publications archive that allows scientific data to be stored and linked.

Use of provenance 3: Recording execution of workflow that was not pre-defined

In the simple-harmonic generation application, the chemists are interested in recording the workflow that actually took place so that they can examine it and possibly follow it again. This is particularly important because they do not know in advance what the workflow will be.

In the synthetic-organic application, the experiments will be based on pre-defined plans and the interesting information will be in how each experiment differed from its pre-defined plan.
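One way to surface how an experiment differed from its plan is to compare the planned and recorded step sequences. The sketch below does this with Python's standard `difflib` module; the step names are invented, and treating each step as a plain string is a simplifying assumption.

```python
from difflib import SequenceMatcher

# Hypothetical planned and recorded step sequences for one experiment.
planned = ["weigh reagents", "dissolve", "heat 2h", "filter", "dry"]
actual = ["weigh reagents", "dissolve", "heat 3h", "filter", "recrystallise", "dry"]


def deviations(planned, actual):
    """List (kind, planned_steps, actual_steps) for every departure from the plan."""
    matcher = SequenceMatcher(a=planned, b=actual, autojunk=False)
    return [
        (tag, planned[i1:i2], actual[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


changes = deviations(planned, actual)
for kind, was, became in changes:
    print(kind, was, "->", became)
```

For these inputs the comparison reports one replaced step (the heating time) and one inserted step (the recrystallisation), which is the "interesting information" for a plan-based experiment.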

Use of provenance 4: Third-party verification

One scenario considered in CombeChem is where a PhD student performs an experiment and shows their results to their supervisor. The supervisor may wish to check that the experiment was performed properly, especially if the results are odd. In that case, they will wish to work back from the result data to see at which step an error may have occurred.

A comparable check may be performed if several experiments produce strange results. The scientist may then wish to see whether the experiments used some of the same original material, or material from the same batch, which may suggest contamination. Similarly, equipment deteriorates over time, which affects experimental results, e.g. the laser in the simple-harmonic generation experiment. To detect this deterioration, the results of several recent experiments using that equipment can be compared with earlier ones from a time when the equipment was known to be reliable.
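The contamination check amounts to asking which inputs the suspect experiments have in common. A minimal sketch follows, assuming the provenance store can list the materials (with batch) and equipment used by each experiment; the experiment names, batch labels and function names are all invented for illustration.

```python
from collections import Counter

# Hypothetical records: experiment -> materials (with batch) and equipment used.
USED = {
    "exp-07": {"solvent:batch-12", "reagent-A:batch-3", "laser-1"},
    "exp-09": {"solvent:batch-12", "reagent-B:batch-1", "laser-1"},
    "exp-11": {"solvent:batch-14", "reagent-A:batch-3", "laser-2"},
    "exp-12": {"solvent:batch-12", "reagent-C:batch-5", "laser-2"},
}


def common_inputs(used, suspects):
    """Inputs shared by every suspect experiment (needs at least one suspect)."""
    return set.intersection(*(used[e] for e in suspects))


def suspicion_ranking(used, suspects):
    """Count how many suspect experiments used each input, most frequent first."""
    counts = Counter(item for e in suspects for item in used[e])
    return counts.most_common()


odd_results = ["exp-07", "exp-09", "exp-12"]
shared = common_inputs(USED, odd_results)
ranking = suspicion_ranking(USED, odd_results)
```

Here the only input common to all three odd experiments is one solvent batch, which would be the first thing to investigate. The ranking is useful when no single input is shared by every suspect experiment, for instance when a piece of equipment appears in most but not all of them.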

Use of provenance 5: Automated publication

Studies in some parts of chemistry, such as crystallography, are very formulaic and the papers written about interesting results follow the same structure. Therefore, if enough information is recorded about an experiment, the paper describing it can be automatically created. CombeChem aims to record enough information to allow this, both by providing software and hardware and by changing the practices of chemists.
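The dependence of automatic paper generation on complete records can be shown with a fixed template. The sketch below is illustrative only: the template wording, the field names and the values are made up, and a real crystallography report would need far more fields.

```python
from string import Template

# Hypothetical fixed report structure; field names are illustrative.
REPORT = Template(
    "Crystal structure of $compound\n"
    "Data were collected at $temperature K. "
    "The structure was refined to R = $r_factor "
    "in space group $space_group."
)


def draft_report(record):
    """Fill the template; raises KeyError if a required field was never recorded."""
    return REPORT.substitute(record)


# Made-up example values, not real experimental data.
record = {
    "compound": "compound X",
    "temperature": 120,
    "r_factor": 0.041,
    "space_group": "P1",
}
draft = draft_report(record)
```

The `KeyError` raised for a missing field is the point of the example: a paper can only be generated automatically if every value the template needs was captured when the experiment was run.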

As an extension to this, results may be published on-line, allowing data to be inter-linked between related experiments. This information should also be captured in the provenance.

Use of provenance 6: Intellectual property rights

Completed experiments must be dated and signed off by someone other than the experimenter, to ensure that intellectual property rights are protected. For a PhD student, the signer will typically be their supervisor; for a staff member it will be a colleague or head of department. The signing chemist will use their expertise to determine whether the experiment was performed correctly, and the provenance should be complete enough that they could potentially re-run the experiment to check the results. Signing off is currently done using carbon paper, so that one record is kept in the experimenter's lab book and a copy is kept independently.

CombeChem are considering cryptographically time-stamping and digitally signing their RDF graphs as a computational equivalent to this process. There is concern about what part of the graph to sign: entities outside of the graph will be referred to, so what should be included? Where data is referred to by a URI and stored separately (e.g. in SRB), a hash can be taken of the data and signed by a trusted third party to ensure its validity. One time-stamping algorithm under consideration is detailed in Section 4.1 of Applied Cryptography (second edition) by Bruce Schneier.
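The idea of hashing separately stored data and tying each time-stamp to the one before it can be sketched with the standard library alone. This is a simplified hash-linking illustration, not the protocol as specified in the cited text and not CombeChem's design: the record fields, URIs and function names are assumptions, and the signature by the trusted third party is deliberately left out (in practice that party would sign each `record_hash` with a real public-key scheme).

```python
import hashlib
import json

FIELDS = ("uri", "data_hash", "time", "previous")


def digest(data: bytes) -> str:
    """Hex SHA-256 digest of `data`."""
    return hashlib.sha256(data).hexdigest()


def stamp(chain, data_uri, data, when):
    """Append a record committing to the data and to the previous record."""
    previous = chain[-1]["record_hash"] if chain else "0" * 64
    body = {"uri": data_uri, "data_hash": digest(data), "time": when, "previous": previous}
    record_hash = digest(json.dumps(body, sort_keys=True).encode("utf-8"))
    chain.append({**body, "record_hash": record_hash})
    return chain[-1]


def verify(chain, fetch):
    """Check every link and every data hash; `fetch(uri)` returns the stored bytes."""
    previous = "0" * 64
    for record in chain:
        body = {k: record[k] for k in FIELDS}
        if record["previous"] != previous:
            return False
        if digest(json.dumps(body, sort_keys=True).encode("utf-8")) != record["record_hash"]:
            return False
        if digest(fetch(record["uri"])) != record["data_hash"]:
            return False
        previous = record["record_hash"]
    return True


# Hypothetical URIs and payloads standing in for externally stored data.
store = {
    "http://example.org/data/run1": b"refined atomic positions, run 1",
    "http://example.org/data/run2": b"refined atomic positions, run 2",
}
chain = []
stamp(chain, "http://example.org/data/run1", store["http://example.org/data/run1"], "2004-05-05T10:00:00Z")
stamp(chain, "http://example.org/data/run2", store["http://example.org/data/run2"], "2004-05-05T11:30:00Z")

ok_before = verify(chain, store.__getitem__)
store["http://example.org/data/run1"] = b"tampered"
ok_after = verify(chain, store.__getitem__)
```

Because each record includes the hash of its predecessor, altering or back-dating one record invalidates every later one, and because only a hash of the data is recorded, the data itself can remain in the external store. The question of which parts of the RDF graph to include in what is signed is not addressed by this sketch.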