Provenance Requirements of a Bioinformatics Application
Author: Simon Miles
Project: This work was conducted as part of the PASOA project (EPSRC GR/S67623/01)
Last modified: 27th October 2004
This document describes a bioinformatics application we are looking at and summarises the provenance requirements found in discussion with the bioinformatician involved, Klaus-Peter Zauner at the University of Southampton. We first describe the biology behind the investigation being pursued, then detail each use of provenance information that Klaus identified, and the data required to fulfil that purpose.
The DNA of an organism consists of one or more chromosomes, which code for the proteins that create that organism. Each chromosome consists of two complementary strands, each of which is a sequence of bases. A chromosome contains genes. A part of a gene is transcribed into a piece of pre-mRNA consisting of exons interspersed with introns. An exon is a sequence of bases that encodes a part of a protein. An intron is a sequence of bases that does not code for any part of the final protein. Introns tend to be much larger than exons. Klaus has given a name to something that is either an exon or an intron: eion. The introns are cut out of the pre-mRNA to create the mRNA, consisting of only the exons, which is then used to construct proteins. Over time, mutations occur in the DNA. For a mutation to affect the ability of the organism, and so the gene, to survive and reproduce, it must cause an effect in the phenotype of the organism, i.e. in the proteins. Therefore, for evolution to have a deciding effect on whether the mutation exists in the future, the mutation must occur in an exon or in the part needed to recognise boundaries between exons and introns. Klaus is interested in the complexity of evolved sequences.
The process of analysing a given chromosome involves downloading the chromosome's data and its annotations as a giant text file by FTP from a curated database in the US, then processing it with a series of tools to eventually generate statistical analyses. Some tools are authored by third parties, such as the European Bioinformatics Institute, and some are scripts authored by Klaus. In myGrid, we are in the process of wrapping these tools as Web Services and constructing a workflow from them.
When a potentially interesting result is found, Klaus will re-run some of the later parts of the workflow with different configuration parameters to try to determine what caused the result.
Use of provenance 1: Determining the difference in the system during two runs of an experiment.
Unreliability in the computational environment means that it is hard to determine what has caused an interesting result. It could be a validation or contradiction of the hypotheses, but could also be due to algorithms changing (either Klaus' or those provided by a third party), chromosome data or its annotations being corrected (versions of the database), data coming from different sources in a different experiment, e.g. mouse or human or yeast chromosomes, errors in the experimental process etc.
Potentially relevant differences include the following.
- The exact source input data used and its metadata, e.g. the version of the source database, which chromosome and which organism it came from.
- Configuration input parameters given on the command-line to scripts, which help determine why interesting anomalies have occurred.
- Which version of a script was used, and preferably the script itself so that differences in code can be detected. Variations in a script come not only from improvement and development, but also from changes Klaus makes to configuration parameters set in the script itself. The latter changes occur frequently in the course of experimentation, and Klaus needs to keep track of which version of a script has been used. His current arrangements are unsatisfactory.
- Versions of, and information logged by, services used internally by a script.
It would be useful to generate a difference report from two runs of an experiment. At different times, different sets of differences will be important, so it would be useful to be able to configure which differences are highlighted in the report. For example, if chromosome data is updated then all differences derived from that data are relevant, while if a script is changed all differences that could be caused by that script should be shown.
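A configurable difference report of this kind could be sketched as follows. This is only an illustration under stated assumptions, not part of any existing tool: each run is represented as a plain dictionary, and the field names (`database_version`, `script_version`, `parameters` and so on), the example values, and the `diff_runs` function are all hypothetical.

```python
def diff_runs(run_a, run_b, highlight=None):
    """Return {field: (value_in_a, value_in_b)} for every field that differs.

    If `highlight` is given, only those fields are considered, which lets
    the user choose which kinds of difference appear in the report.
    """
    fields = set(run_a) | set(run_b)
    if highlight is not None:
        fields &= set(highlight)
    report = {}
    for field in sorted(fields):
        a, b = run_a.get(field), run_b.get(field)
        if a != b:
            report[field] = (a, b)
    return report


# Hypothetical records of two runs of the same experiment.
run_1 = {
    "organism": "human",
    "chromosome": "22",
    "database_version": "v1",
    "script_version": "1.3",
    "parameters": {"window": 100},
}
run_2 = {
    "organism": "human",
    "chromosome": "22",
    "database_version": "v2",
    "script_version": "1.3",
    "parameters": {"window": 250},
}

# Report every difference, then only differences in the source data.
full_report = diff_runs(run_1, run_2)
data_report = diff_runs(
    run_1, run_2, highlight=["organism", "chromosome", "database_version"]
)
```

Here `full_report` lists both the database version and the changed parameter, while `data_report` is restricted to fields describing the source data. A fuller version would also need to follow which outputs were derived from which inputs, so that differences caused by a changed script or data set can be selected.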
Use of provenance 2: Determining how best to run the experiment in future
Klaus finds the following information useful for determining when best to execute a script or set of scripts and on what server.
- The time at which a script was run or service used and the time taken for it.
- The server on which a script was run.
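One way to capture these two items is to wrap each script or service call so that the start time, duration and host name are recorded alongside the result. The sketch below assumes the step can be called as a Python function; `timed_run`, the record field names and the stand-in `count_bases` step are hypothetical.

```python
import socket
import time
from datetime import datetime, timezone


def timed_run(step_name, func, *args, **kwargs):
    """Call `func` and return (result, record), where record holds the
    start time, elapsed time and the name of the host that ran the step."""
    started = datetime.now(timezone.utc)
    t0 = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - t0
    record = {
        "step": step_name,
        "server": socket.gethostname(),
        "started_at": started.isoformat(),
        "duration_seconds": elapsed,
    }
    return result, record


# Stand-in for a real analysis script (hypothetical).
def count_bases(sequence):
    return len(sequence)


result, record = timed_run("count_bases", count_bases, "ACGTACGT")
```

Records collected this way over many runs could then be compared to see at which times and on which servers a step tends to complete fastest.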
Use of provenance 3: Historical record and proof of process
Klaus wants all experimental information recorded in order to have an accurate historical record. He can use this as a reminder of what he has done, and also show it to others as proof that his results were gathered in a valid manner.
In particular, when a hypothesis is reconsidered, a scientist may want to go back to previous experiments and re-interpret the results. This may involve looking at parts of the results that were treated as irrelevant before. Therefore, one cannot determine at the time that an experiment is performed exactly the information that will need to be kept. Within reason, all potentially relevant information should be kept.
Information that should be recorded, in addition to that mentioned in the sections above, includes the following.
- Graphical output, such as the graphs produced by the R statistics package.
Use of provenance 4: Checks on validity of bioinformatics process
The provenance logs could also be used to check whether the experiment performed (workflow run) was biologically valid. For instance, if a protein sequence had been given to a tool that processed DNA sequences then this should be highlighted as invalid. Configuration parameters could also be checked for being within valid ranges. However, Klaus cautioned that these checks should appear only as warnings and should not prevent the experiment from actually being done, as he sometimes deliberately provides data to a tool which the tool provider may consider invalid, e.g. giving the start and end points of introns to EMBOSS rather than those of exons as it expects.
For such checks to take place, we require information including but not limited to the following.
- The biological types expected as input by services.
- The biological types of data output by services.
- The valid ranges of script configuration parameters.
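A check of this kind, producing warnings rather than errors, might look like the following sketch. The service description format, the service name `dna_complexity`, its `window` parameter and the `check_invocation` function are all assumptions made for illustration; they do not describe EMBOSS or any other real tool.

```python
# Hypothetical service descriptions: the biological types each service
# expects and produces, and the valid range of each parameter.
SERVICES = {
    "dna_complexity": {
        "input_type": "DNA",
        "output_type": "statistics",
        "parameter_ranges": {"window": (1, 10000)},
    },
}


def check_invocation(service_name, input_type, parameters):
    """Return a list of warning strings for one logged invocation of a
    known service. The list is empty when nothing looks suspicious."""
    spec = SERVICES[service_name]
    warnings = []
    if input_type != spec["input_type"]:
        warnings.append(
            f"{service_name}: expected {spec['input_type']} input, "
            f"got {input_type}"
        )
    for name, value in parameters.items():
        bounds = spec["parameter_ranges"].get(name)
        if bounds is None:
            continue  # no declared range, so nothing to check
        low, high = bounds
        if not low <= value <= high:
            warnings.append(
                f"{service_name}: parameter {name}={value} is outside "
                f"the range {low}..{high}"
            )
    return warnings


suspicious = check_invocation("dna_complexity", "protein", {"window": 0})
clean = check_invocation("dna_complexity", "DNA", {"window": 100})
```

Because the function only returns messages, the caller decides what to do with them, which matches the requirement that a deliberate use of nominally invalid input is flagged but never blocked.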
Use of provenance 5: Tracing the origin of data
For a given piece of data, Klaus needs to know where it originated from and where it can be found in the output of each stage of analysis. For this to be possible there must be consistent identification of entities through all stages of a provenance trace. Klaus currently constructs identifiers for exons, introns and other pieces of data and includes them in the output files so that he can find them later and determine where each piece of data has come from.
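Assuming every stage writes the same identifier next to each record, tracing a piece of data reduces to looking that identifier up in each stage's output. The sketch below illustrates this; the identifier scheme (`chr1:exon:0001`), the stage names and the record fields are invented for the example and are not Klaus's actual format.

```python
def trace_identifier(identifier, stage_outputs):
    """Return, in stage order, (stage_name, record) for every record
    carrying the given identifier."""
    found = []
    for stage_name, records in stage_outputs:
        for record in records:
            if record["id"] == identifier:
                found.append((stage_name, record))
    return found


# Hypothetical outputs of three analysis stages, in execution order.
stage_outputs = [
    ("parse_annotations", [
        {"id": "chr1:exon:0001", "start": 100, "end": 250},
        {"id": "chr1:intron:0001", "start": 251, "end": 900},
    ]),
    ("extract_sequences", [
        {"id": "chr1:exon:0001", "length": 151},
        {"id": "chr1:intron:0001", "length": 650},
    ]),
    ("compute_statistics", [
        {"id": "chr1:exon:0001", "score": 0.5},
    ]),
]

exon_trace = trace_identifier("chr1:exon:0001", stage_outputs)
intron_trace = trace_identifier("chr1:intron:0001", stage_outputs)
```

In this example the exon can be followed through all three stages, while the intron appears only in the first two, showing where it dropped out of the analysis.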
Use of provenance 6: Conflicting intellectual property rights
A bioinformatics database may have certain licensing restrictions with regard to IP generated based on the database. If a company uses the database and then patents a cancer drug, it may be important for them to be able to prove that they discovered the drug by a method that did not involve the database. For example, the Ecoli database is free for non-commercial research use. If a lab gets such a free licence and then makes a discovery that should be patented, they and the owners of the Ecoli database may want to verify to what degree the database was used in the research. Provenance data could be used to do this verification.
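If provenance is recorded as "this item was derived from these items", such a verification could be approached by walking back through the derivation links of a result and checking whether the database appears among its ancestors. The following is a minimal sketch under that assumption; the item names and the `ancestors` function are hypothetical, and a real check would also have to establish that the provenance record itself is complete and trustworthy.

```python
def ancestors(item, derived_from):
    """Return the set of all items that `item` was directly or
    indirectly derived from."""
    seen = set()
    stack = [item]
    while stack:
        current = stack.pop()
        for parent in derived_from.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Hypothetical derivation record: each item maps to its direct sources.
derived_from = {
    "candidate_result": ["statistics_table"],
    "statistics_table": ["sequence_set", "in_house_assay"],
    "sequence_set": ["licensed_database_v1"],
    "unrelated_result": ["in_house_assay"],
}

uses_database = "licensed_database_v1" in ancestors(
    "candidate_result", derived_from
)
independent = "licensed_database_v1" not in ancestors(
    "unrelated_result", derived_from
)
```

Here `candidate_result` depends on the licensed database through two intermediate steps, whereas `unrelated_result` was derived only from in-house data.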