Provenance of Published Crystal Images
Scenario Authors: Simon Miles, Mark Hedges, Stella Fabiane
Characteristics
This proposed provenance challenge scenario is based on a simple (largely linear) process. However, it exhibits features not found in previous challenges, namely:
- It is not purely automated. Some steps involve a user making a decision.
- The results of each execution of the process are accessible via the web.
- There is an emphasis on maintaining the ability to determine provenance in the long term, not just immediately following process execution.
Brief Summary
Crystallography is the experimental science of determining the arrangement of atoms in solids. Crystallographic methods depend on the analysis of the diffraction patterns that emerge from a crystal sample that is targeted by X-ray beams.
In the scenario described below, scientists perform a series of steps to produce a set of atom coordinates from a crystal, and then publish these coordinates in a public database.
To judge the quality of a crystal image, others need to know the raw data and how the experiment that produced the image was conducted.
Scenario Diagram
The figure below shows the process around which this scenario is based. Artefacts (data or physical) are depicted as ovals, while boxes represent processes. Where a process is marked with a U, this means it is conducted by the user rather than being automated.
Reading the process from top-left downwards, the artefacts and processes denote the following.
- A scientist produces a number of crystal samples from a particular protein substance.
- The developed crystals are subjected to X-ray beams to produce diffraction images.
- Each diffraction image contains several hundred spots.
- To inspect the diffraction images, the data must be transformed using Mosflm.
- A set of visual images is produced.
- These are then inspected by the user, and if found inadequate, new diffraction images will be generated.
- The location and intensity of the spots in each diffraction image are determined using specialised software.
- This results in a dataset called a reflection file.
- The reflections data are merged.
- The merged reflections state averaged intensities.
- The merged reflections are then processed...
- ...to produce a model of the atomic coordinates of the protein being studied.
- The coordinates are used to produce a visualisation of the crystal image.
- This image is the final product of the process.
- Having a visualisation enables the coordinates to be checked, and if found inadequate a new set will be produced.
- If adequate, the coordinates and any metadata will be submitted to the database.
- On publication, the database provides a persistent unique URL for web access to the coordinate data.
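As a rough illustration only, the steps listed above could be written down as a small table of processes, each with its input and output artefacts. The step names, the artefact names and the `regenerate_on_reject` field are assumptions made for this sketch; they are not taken from the diagram or from any tool used in the experiment.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Step:
    """One process in the scenario, with the artefacts it consumes and produces."""
    name: str
    inputs: Tuple[str, ...]
    output: str
    # Set only for user decisions: the artefact that is produced again
    # if the user finds the inspected result inadequate (hypothetical field).
    regenerate_on_reject: Optional[str] = None


# Hypothetical names, following the order of the list above.
PROCESS = (
    Step("grow crystals", ("protein substance",), "crystal samples"),
    Step("expose to X-rays", ("crystal samples",), "diffraction images"),
    Step("transform with Mosflm", ("diffraction images",), "visual images"),
    Step("inspect visual images", ("visual images",), "image verdict",
         regenerate_on_reject="diffraction images"),
    Step("locate and measure spots", ("diffraction images",), "reflection file"),
    Step("merge reflections", ("reflection file",), "merged reflections"),
    Step("build model", ("merged reflections",), "coordinates"),
    Step("visualise", ("coordinates",), "crystal image"),
    Step("check coordinates", ("crystal image",), "coordinate verdict",
         regenerate_on_reject="coordinates"),
    Step("publish", ("coordinates",), "published URL"),
)


def user_decisions(process=PROCESS):
    """Return the steps at which a user accepts or rejects a result."""
    return [step for step in process if step.regenerate_on_reject is not None]


for step in user_decisions():
    print(f"{step.name}: on rejection, regenerate {step.regenerate_on_reject}")
```

The two decision steps are what make the process only largely linear: each can send the experiment back to an earlier artefact.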
Users
This experiment is one performed by crystallographers working at King's College London. The process abstracts from the details but has been confirmed to be realistic, and the provenance questions are ones which have been confirmed as valuable to answer.
Requirement for Provenance
The quality of the data produced is critical not only in understanding the crystallised molecule, but also because the data from one experiment is used in creating images in future experiments. The public database can store only the coordinate and reflection files, not the diffraction files (as they are too large).
Provenance Questions
We wish to ask the following questions about the provenance of a crystal image.
Question 1: It is 10 years after the process was conducted, and the process has become obsolete. For a given published crystal image (named by web reference), what were the raw diffraction images from which the crystal image was produced? Assume that the public database can contain only the coordinate and reflection files, that the data kept on the desktop PC which ran the process is gone, and that the knowledge in people's heads has been forgotten.
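One way a participant might frame Question 1, sketched under strong assumptions: suppose every artefact was recorded with an identifier, a kind, and the identifiers of the artefacts it was derived from, and that these records were kept somewhere that outlives the desktop PC. The record layout, the identifiers and the `example.org` URL below are all hypothetical.

```python
# Hypothetical provenance records: artefact id -> (kind, ids it was derived from).
RECORDS = {
    "http://example.org/entry/1": ("published entry", ["coords-1"]),
    "coords-1": ("coordinates", ["merged-1"]),
    "merged-1": ("merged reflections", ["refl-1"]),
    "refl-1": ("reflection file", ["diff-001", "diff-002"]),
    "diff-001": ("diffraction image", ["crystal-A"]),
    "diff-002": ("diffraction image", ["crystal-A"]),
    "crystal-A": ("crystal sample", []),
}


def raw_diffraction_images(url, records=RECORDS):
    """Follow derivation links back from a published URL to diffraction images."""
    found, seen, stack = [], set(), [url]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        kind, sources = records[current]
        if kind == "diffraction image":
            found.append(current)
            continue  # the raw images are the answer; no need to go further back
        stack.extend(sources)
    return sorted(found)


print(raw_diffraction_images("http://example.org/entry/1"))
```

The sketch only identifies the images. Because the diffraction files themselves are too large for the public database, where they are kept for 10 years remains part of the challenge.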
Question 2: For a given crystal, how often did a crystallographer reject and reproduce coordinates (the later stages of the experiment)? This is important because difficulty in obtaining an adequate crystal image can indicate that the original diffraction data was of poor quality.
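Question 2 could be answered by counting recorded user decisions, provided each decision was logged when it was made. The following is a minimal sketch; the log format, the crystal identifiers and the verdict labels are assumptions for illustration.

```python
# Hypothetical log of user decisions: (crystal id, artefact checked, verdict).
DECISIONS = [
    ("crystal-A", "visual images", "accepted"),
    ("crystal-A", "coordinates", "rejected"),
    ("crystal-A", "coordinates", "rejected"),
    ("crystal-A", "coordinates", "accepted"),
    ("crystal-B", "coordinates", "accepted"),
]


def coordinate_rejections(crystal, decisions=DECISIONS):
    """Count how many times coordinates for the given crystal were rejected."""
    return sum(
        1
        for crystal_id, artefact, verdict in decisions
        if crystal_id == crystal
        and artefact == "coordinates"
        and verdict == "rejected"
    )


print(coordinate_rejections("crystal-A"))
print(coordinate_rejections("crystal-B"))
```

In this made-up log, a high count for one crystal would be the signal that its diffraction data may have been poor.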
Technologies Used
A crystal image may be identified by a URL when browsing the web interface of the database, or may be seen as a row in a database table, as the challenge participant prefers. Tools and sample data are available for the software stages of the process.
Background
This scenario was developed in the context of the Biophysical Repositories in the Lab (BRIL) project.
--
SimonMiles - 20 May 2010