Name:
Paper Submission
Scenario Authors:
PASS group (Margo Seltzer et al.)
Harvard
Brief Summary:
You are in the last stages of submitting an important paper to your favorite major conference when you notice a glaring discrepancy in the figures, even though they all ostensibly come from the same raw data.
Your collaborators are at different universities (or at least different departments) and are using different provenance toolsets.
Your mission is to use the provenance of the paper, part of which came from your collaborators, to trace what went wrong in the figures and fix it in time to submit the paper before the deadline.
The workload consists of multiple broad steps; as in previous challenges the goal is to interoperate, so each of the four steps will be done by a different group, simulating a different collaborator. Each step involves importing both data and provenance, doing some processing, and exporting both data and provenance.
The important property (besides sending both data and provenance) is that there are to be four steps organized into a "diamond". This means that the fourth step requires importing two foreign data sets, and specifically requires merging their provenance in a way that preserves the diamond structure. (That is, it will need to be possible to recognize common objects in the ancestry and unify them. We suspect that this is difficult.)
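To make the merge requirement concrete, here is a minimal sketch of what step four might have to do, assuming (purely for illustration) that provenance arrives as a dictionary of records keyed by local identifiers and that common objects can be recognized by a shared global identifier such as a content hash; none of the field names come from an actual toolset.
<verbatim>
def merge_provenance(graph_a, graph_b):
    """Union two provenance graphs, unifying nodes that denote the same
    underlying object (same global_id) under a single key."""
    merged = {}
    alias = {}  # local id -> unified key
    for graph in (graph_a, graph_b):
        for local_id, record in graph.items():
            key = record["global_id"]
            alias[local_id] = key
            merged.setdefault(key, {"global_id": key, "parents": set()})
            merged[key]["parents"].update(record["parents"])
    # Rewrite parent references through the alias map so the shared
    # ancestry (the top of the diamond) collapses into one node.
    for record in merged.values():
        record["parents"] = {alias.get(p, p) for p in record["parents"]}
    return merged

if __name__ == "__main__":
    # Two collaborators' independent views of the same raw data set.
    a = {"rawA": {"global_id": "sha1:raw",    "parents": set()},
         "t1":   {"global_id": "sha1:tableA", "parents": {"rawA"}}}
    b = {"rawB": {"global_id": "sha1:raw",    "parents": set()},
         "t2":   {"global_id": "sha1:tableB", "parents": {"rawB"}}}
    merged = merge_provenance(a, b)
    # After unification both tables share one ancestor: the diamond's apex.
    assert merged["sha1:tableA"]["parents"] == merged["sha1:tableB"]["parents"] == {"sha1:raw"}
</verbatim>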
Each broad step will be composed of several smaller concrete workflow stages.
Scenario Diagram:
(I have one but I can't get it to upload)
Users:
Anyone who publishes papers; while the deadline element does not apply so much to journal submissions and may not be familiar outside of computer science, the basic experience is universal.
Requirement for provenance:
In order to trace why the figures are inconsistent it must be possible to trace where they came from and how they were generated; this requires provenance. Even partial automation of this search requires machine-readable provenance rather than lab notebooks.
Without a provenance toolset you would sooner or later find the problem anyway, but it might take a long time or require re-running expensive analyses.
Fine-grained provenance (at e.g. the tuple or single datum level) would allow tracing the specific inconsistency and ignoring other material; combining multiple abstraction layers or multiple accounts, such as provenance-aware Python running in a workflow engine, would allow tracing the inconsistency within the execution of individual analysis stages.
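As a purely hypothetical illustration of what "multiple accounts" might look like, the fragment below nests an application-level account inside a workflow-level stage record and walks both layers; the structure and field names are invented, not taken from any existing system.
<verbatim>
# Hypothetical nesting of a workflow-level stage record with an
# application-level account from a provenance-aware interpreter.
stage_record = {
    "stage": "analysis-2",
    "inputs": ["intermediate-1.dat", "calibration-B.dat"],
    "outputs": ["table-2.dat"],
    "accounts": {
        "application": [
            {"op": "read",  "object": "intermediate-1.dat"},
            {"op": "apply", "object": "calibration-B.dat"},
            {"op": "write", "object": "table-2.dat"},
        ],
    },
}

def trace(record):
    """Report the coarse workflow-level inputs, then descend into the
    finer application-level operations recorded inside the stage."""
    print("stage", record["stage"], "reads", record["inputs"])
    for step in record["accounts"]["application"]:
        print("   ", step["op"], step["object"])

trace(stage_record)
</verbatim>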
Provenance Questions:
If this happened in the real world with real data, identifying the inconsistency would probably require quite a bit of manual intervention, e.g. to compare intermediate results of fundamentally different analyses. Since the goal of the Challenge is to test provenance handling rather than domain knowledge, we expect to encode the inconsistency as a wrong value that can be found by simple inspection (e.g. with grep).
The cause of the inconsistency should actually be that both values are wrong, one because of something that appears in workflow-level provenance (e.g., using different calibration data) and the other because of something that appears in application-level provenance (e.g. a bug in a Python script). This allows groups to find either or both depending on the properties of their provenance toolset.
The questions are thus:
** What are the inputs to the paper PDF file, so we can (probably out of band) find the identity of the two inconsistent figures?
** For each figure, what are the individual analysis stages that were done, and what was the ultimate raw data object? (This and the neighboring questions are the kind of ancestry query sketched after this list.)
** Is that ultimate raw data object really the same object in each case?
** In what stages, in either or both cases, is the intermediate result wrong? (This requires finding or regenerating the intermediate result object(s) and running some simple test on each.)
** Why is the/each identified wrong result actually wrong? What's different? (This is slightly fuzzy but should be sensible given the context that the previous questions establish.)
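The first three questions are ordinary ancestry queries. A toy sketch follows, using an invented record format (a dictionary mapping each object to its parents) and invented object names:
<verbatim>
# Toy provenance graph and the queries behind the first three questions:
# immediate inputs, full ancestry, and the root raw objects.
PROV = {
    "paper.pdf":   {"parents": {"paper.tex", "fig1.eps", "fig2.eps"}},
    "fig1.eps":    {"parents": {"table-1.dat"}},
    "fig2.eps":    {"parents": {"table-2.dat"}},
    "table-1.dat": {"parents": {"raw.dat"}},
    "table-2.dat": {"parents": {"raw.dat"}},
    "paper.tex":   {"parents": set()},
    "raw.dat":     {"parents": set()},
}

def inputs(obj):                       # question 1: inputs to the PDF
    return PROV[obj]["parents"]

def ancestry(obj):                     # question 2: stages back to the raw data
    chain, frontier = [], [obj]
    while frontier:
        node = frontier.pop()
        chain.append(node)
        frontier.extend(PROV[node]["parents"])
    return chain

def ultimate_sources(obj):             # question 3: the parentless roots
    return {n for n in ancestry(obj) if not PROV[n]["parents"]}

print(inputs("paper.pdf"))
print(ultimate_sources("fig1.eps") == ultimate_sources("fig2.eps"))  # True: same raw object
</verbatim>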
Technologies Used:
This is designed so it can be done entirely as workflows; however, to do the multiple accounts part we'll want a provenance-aware application language. Since different groups have different capabilities I think this part should be a script that does something exceptionally simple so it can be implemented in just about anything (Python, Matlab, Octave, awk, Bourne shell, etc.), with the intent being that groups will pick what they use to show off the integration capabilities of their tools.
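For instance (and this is only one possible instantiation; the offset, the one-column format, and the file names are made up), the script could be a trivial calibration step like the following, easy to re-express in awk, Octave, or shell:
<verbatim>
#!/usr/bin/env python
# Hypothetical "exceptionally simple" analysis stage: read one number per
# line, add a calibration offset, write the result.
import sys

def calibrate(lines, offset):
    return [float(line) + offset for line in lines if line.strip()]

if __name__ == "__main__":
    offset = float(sys.argv[1]) if len(sys.argv) > 1 else 0.0
    for value in calibrate(sys.stdin.readlines(), offset):
        print(value)
</verbatim>
It might be invoked as, say, python calibrate.py 0.5 < intermediate-1.dat > table-1.dat, leaving each group free to capture that invocation with whatever application-level provenance machinery its tools provide.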
Within the workflow stages everything else can be off-the-shelf processing tools of one kind or another. We'll probably want the last stage, the paper build, to be LaTeX, in order to make the whole thing vaguely tractable; a WYSIWYG word processor would be a nuisance.
Background and description:
The paper should consist of LaTeX source files, raw data from lab instruments (including instrument-sourced provenance), intermediate partially analyzed data, and several charts and graphs.
The first broad step is to import the raw data (and its provenance) and do some analysis on it. The second and third steps each do further analysis; the fourth step transforms these results into charts and graphs and builds the paper PDF file.
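A skeleton of one broad step, under the made-up assumption that data and provenance travel as a pair of files and that provenance is serialized as a JSON list of stage records (any agreed interchange format would do), might look like this:
<verbatim>
# Skeleton of a single broad step: import foreign data and provenance,
# process, append a record for this stage, export both. The JSON
# list-of-records serialization is an assumption made for this sketch only.
import json

def run_step(data_in, prov_in, data_out, prov_out, stage_name, process):
    with open(data_in) as f:
        data = f.read()
    with open(prov_in) as f:
        prov = json.load(f)               # imported foreign provenance

    result = process(data)                # the domain analysis itself

    prov.append({"stage": stage_name,     # this collaborator's contribution
                 "inputs": [data_in],
                 "outputs": [data_out]})

    with open(data_out, "w") as f:
        f.write(result)
    with open(prov_out, "w") as f:
        json.dump(prov, f, indent=2)      # exported for the next collaborator
</verbatim>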
The precise nature of the data and the analysis transformations has yet to be determined; ultimately, however, it doesn't matter very much. The assumption is that two of the analyses done by different collaborators overlap in part and that one of the overlapping parts differs. In a real instance this would probably show up as different values in a graph, and some of the transformations would be graphing programs (gnuplot, ploticus, etc.), but in order to keep things simple and concentrate on the provenance we'll assume that there are two tables of numbers where one pair should be the same, and they aren't.
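Under that simplification, the planted inconsistency can be found with something as small as the following comparison (file names invented); this is also the "simple test" that the intermediate-result question above has in mind:
<verbatim>
# Compare two whitespace-separated tables of numbers that are supposed to
# agree, and report the first cell where they differ.

def load_table(path):
    with open(path) as f:
        return [[float(x) for x in line.split()] for line in f if line.strip()]

def first_difference(a, b):
    for i, (row_a, row_b) in enumerate(zip(a, b)):
        for j, (x, y) in enumerate(zip(row_a, row_b)):
            if x != y:
                return i, j, x, y
    return None

if __name__ == "__main__":
    diff = first_difference(load_table("table-1.dat"), load_table("table-2.dat"))
    print("tables agree" if diff is None
          else "mismatch at row %d, column %d: %r vs %r" % diff)
</verbatim>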
To be complete we should also include the provenance of the
LaTeX source files (which have been written/edited by various combinations of collaborators and perhaps managed in a source management tool like CVS or git), but it isn't clear that this adds value in the context of the Provenance Challenge. (And if it does, probably it should be its own scenario, which would in fact be rather similar to the
WikipediaRevisionHistory scenario.) So this is probably best left off.
We'll assume that each collaborator/group makes all their intermediate results (as well as their end result) available with full provenance. Regenerating objects from foreign provenance should be its own scenario.
--
PassProject - 15 May 2010