Provenance Requirements of Proteomics Applications
Author: Simon Miles
Project: This work was conducted as part of the PASOA project (EPSRC GR/S67623/01)
Last modified: 4th November 2004
This document describes the provenance requirements taken from proteomics applications. We are grateful to David O'Connor and Paul Skipp of the Centre for Proteomics Research at the University of Southampton for providing the information on these use cases.
Scenario
Proteomics is the study of the proteome, where a proteome is all the proteins produced by a single organism. One of the main sets of experiments conducted by the scientists is the identification of proteins from a given sample, e.g. to determine which proteins are present only in someone with a certain disease. To do this they measure the characteristics of fragments of a protein, which can provide evidence for the identification of the protein. This involves first breaking the protein at well-identified points, i.e. at given amino acids, resulting in a set of peptides. The peptides are examined using a mass spectrometer to determine their mass-to-charge (m/z) ratio, a useful identifying characteristic. To obtain more accurate results, the peptides are then further fragmented, at random points, by bombarding the peptides with a charged gas, and these fragments are again fed to the spectrometer. Databases of previously analysed results and databases downloaded from public sources, such as the European Bioinformatics Institute, are used to match peptide characteristics to possible proteins, as well as to provide further information on the proteins, such as the functional group to which they belong.
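The cleavage and mass-to-charge steps described above can be illustrated with a minimal Python sketch. It is not the scientists' software: the function names are hypothetical, the default rule of cleaving after lysine (K) or arginine (R) is an assumed example of "given amino acids", and the m/z calculation assumes that each unit of charge comes from one added proton.

```python
PROTON_MASS = 1.007276  # mass of a proton in daltons


def cleave(sequence, cleave_after="KR"):
    """Split a protein sequence after each residue listed in cleave_after.

    The default "KR" rule is an assumption made for illustration only.
    """
    peptides = []
    current = ""
    for residue in sequence:
        current += residue
        if residue in cleave_after:
            peptides.append(current)
            current = ""
    if current:
        # Keep the trailing peptide that does not end at a cleavage point.
        peptides.append(current)
    return peptides


def mass_to_charge(neutral_mass, charge):
    """Return m/z for a peptide of the given neutral mass (Da) carrying
    `charge` protons."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return (neutral_mass + charge * PROTON_MASS) / charge
```

For example, cleave("MKTAYIAKQRQISFVK") yields the peptides MK, TAYIAK, QR and QISFVK under the assumed rule.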
Use Cases
Use of provenance 1: Determining experiment context
In determining the quality of an experiment's results, it is useful to know the context of the experiment, such as the versions of databases used, the settings used on the lab machines and other parameters of the experiment. The scientists would like to be able to trace back from a piece of data to determine the context and configuration of the experiment in which it was produced.
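One possible way to support this trace is to record, for each experiment, its context and the identifiers of the data it produced. The Python sketch below is an illustration under assumptions only: the class names, field names and identifiers are hypothetical and do not describe any existing provenance system.

```python
from dataclasses import dataclass, field


@dataclass
class ExperimentContext:
    """Context of one experiment (all field names are hypothetical)."""
    experiment_id: str
    database_versions: dict  # e.g. {"protein_db": "2004-10"}
    machine_settings: dict   # e.g. {"spectrometer_mode": "positive"}


@dataclass
class ProvenanceStore:
    """Maps each piece of data back to the experiment that produced it."""
    contexts: dict = field(default_factory=dict)     # experiment_id -> context
    produced_by: dict = field(default_factory=dict)  # data_id -> experiment_id

    def record_experiment(self, context):
        self.contexts[context.experiment_id] = context

    def record_output(self, data_id, experiment_id):
        if experiment_id not in self.contexts:
            raise KeyError(f"unknown experiment: {experiment_id}")
        self.produced_by[data_id] = experiment_id

    def context_of(self, data_id):
        """Trace back from a piece of data to its experiment context."""
        return self.contexts[self.produced_by[data_id]]
```

Recording the link at the time the data is produced, rather than reconstructing it afterwards, is the design choice that makes the later trace a simple lookup.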
Use of provenance 2: Design based on previous success
Individual experiments on proteins identify peptides by their characteristics. The data from multiple experiments is collected in a database and used collectively to identify proteins that may be in the original material. To be certain of a successful protein identification, there should be multiple pieces of evidence, i.e. peptides found to occur in that protein, and this evidence may or may not come from multiple experiments. Experiments can be configured in different ways, e.g. by changing the settings on the lab machines, and so, if a peptide is used to successfully identify a protein, it suggests that the experiment producing that peptide was configured well for the given material. This would ideally inform later experiments on the material, and provenance data could help to trace back from the successful identification of a protein to the configuration of the experiments that produced its evidence.
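The trace from an identified protein back to experiment configurations can be sketched as a lookup across three mappings. This is a hedged illustration: the function name, the mapping layout and the default threshold of two pieces of evidence are assumptions, not details given by the scientists.

```python
def supporting_configurations(protein, evidence, produced_by, configurations,
                              min_evidence=2):
    """Return the configurations of the experiments whose peptides support
    the identification of `protein`.

    evidence:       protein id -> list of peptide ids found in that protein
    produced_by:    peptide id -> id of the experiment that produced it
    configurations: experiment id -> configuration settings
    min_evidence:   assumed minimum number of peptides for a confident
                    identification (hypothetical threshold)
    """
    peptides = evidence.get(protein, [])
    if len(peptides) < min_evidence:
        # Not enough evidence to call the identification successful.
        return {}
    experiments = {produced_by[peptide] for peptide in peptides}
    return {exp: configurations[exp] for exp in sorted(experiments)}
```

The configurations returned are candidates for reuse in later experiments on the same material.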
Use of provenance 3: Proving that standards are met
In the near future, substantial research council funding for biology-related research will depend on the labs involved demonstrating that they conform to given standards in process and quality. The Joint Code of Practice being developed by BBSRC, DEFRA, NERC and the FSA specifies that a Quality of research Process (QP) must be observed, along with a broader Quality of research Science. Among its requirements are that the organisation plans its research processes and regularly reviews whether they are being followed, that samples are tracked through their analysis, that all processes conducted are recorded, that there is an audit trail linking secondary data to the primary data from which it is derived, and that records are stored in a way that ensures their integrity and security and prevents unauthorised modification. The records can be stored as indexed computer files. Provenance recording software can help in all the preceding tasks.
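As an illustration of the audit trail and integrity requirements, the following Python sketch chains records together with SHA-256 hashes, so that each record can name the primary data from which it is derived and any later change to a stored record becomes detectable. It is one possible technique, not something mandated by the Joint Code of Practice; the function names and record fields are hypothetical. Note that hashing only detects modification; preventing it would also need access control.

```python
import hashlib
import json


def _digest(payload, previous_hash):
    """Hash a record together with the hash of the record before it."""
    body = json.dumps({"payload": payload, "previous": previous_hash},
                      sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append_record(trail, payload):
    """Append a process record (a JSON-serialisable dict) to the trail."""
    previous_hash = trail[-1]["hash"] if trail else ""
    entry = {
        "payload": payload,
        "previous": previous_hash,
        "hash": _digest(payload, previous_hash),
    }
    trail.append(entry)
    return entry


def verify(trail):
    """Return True if no record has been altered, removed or reordered."""
    previous_hash = ""
    for entry in trail:
        if entry["previous"] != previous_hash:
            return False
        if entry["hash"] != _digest(entry["payload"], previous_hash):
            return False
        previous_hash = entry["hash"]
    return True
```

A record such as {"data": "peak-list-3", "derived_from": "raw-spectrum-3"} would link secondary data to its primary data, and verify would fail if that record were edited after being appended.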
Use of provenance 4: Consistency in fragmentation
A peptide tends to fragment with some degree of consistency, i.e. it is more likely to fragment at the weakest bonds in the molecule. If a peptide produces good fragmentation data and is clearly identified, it can be used to help identify and quantify the same peptide in other biological samples. Therefore, the quality of previous experiments can influence future experiments.
Use of provenance 5: Unidentified spectra
As the identification of peptides depends on the database containing the corresponding information, it is not always possible to make a positive identification when the experiment is first run. It would be useful to keep track of unidentified spectra and re-run the experiment later, when the database may contain more extensive information.
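Keeping track of unidentified spectra could be as simple as noting, for each one, the database version it was last searched against. The Python sketch below assumes, purely for illustration, that database versions are increasing integers and that spectra have string identifiers; the function name is hypothetical.

```python
def spectra_to_rerun(unidentified, current_db_version):
    """Return the ids of unidentified spectra that were last searched
    against an older database version than the current one.

    unidentified: spectrum id -> database version used in the last search
    """
    return sorted(
        spectrum_id
        for spectrum_id, searched_version in unidentified.items()
        if searched_version < current_db_version
    )
```

Spectra that remain unidentified after a re-run would have their recorded version updated, so that they are only retried when the database changes again.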
Use of provenance 6: Extension of existing standards
Other standards, covering the recorded structure of common types of data and experimental processes and with strong support in the proteomics community, are being developed as part of the Proteomics Standards Initiative. Software is also being developed to aid in recording data according to these standards. Any software developed to provide more extensive use of provenance data should take account of the existing standards and ensure that it can interoperate with them.