ES3
PowerPoint presentation
Participating Team
- Short team name: ES3
- Participant names: James Frew, Dominic Metzger, Peter Slaughter
- Project URL: http://eil.bren.ucsb.edu
- Project Overview: The Earth System Science Server (ES3) project is developing a local infrastructure for managing Earth science data products derived from satellite remote sensing. By “local,” we mean the infrastructure that a scientist uses to manage the creation and dissemination of her own data products, particularly those that are constantly incorporating corrections or improvements based on the scientist’s own research. Therefore, in addition to being robust and capacious enough to support public access, ES3 is intended to be flexible enough to manage the idiosyncratic computing ensembles that typify scientific research.
- Provenance-specific Overview: Instead of specifying provenance explicitly with a workflow model, ES3 extracts provenance information automatically from arbitrary applications by monitoring their interactions with their execution environment. These interactions (arguments, file I/O, system calls, etc.) are logged to the ES3 database, which assembles them into provenance graphs. These graphs resemble workflow specifications, but are really reports -- they describe what actually happened, as opposed to what was requested. The ES3 database supports forward and backward navigation through provenance graphs (i.e. ancestor/descendant queries), as well as graph retrieval.
- Relevant Publications: The work described here has been presented publicly but has not yet been published. ES3 provenance management is based partly on the ideas in:
- Valeur, H., 2005. (unpublished M.S. thesis). http://www.haavar.com/Thesis_Haavar_Valeur.pdf
- Bose, R., and Frew, J., 2004. Composing Lineage Metadata with XML for Custom Satellite-Derived Data Products, Sixteenth International Conference on Scientific and Statistical Database Management, Santorini Island, Greece, 21-23 June 2004. http://dx.doi.org/10.1109/SSDM.2004.1311219
- Frew, J. and Bose, R., 2001. Earth System Science Workbench: A Data Management Infrastructure for Earth Science Products. In: L. Kerschberg and M. Kafatos (Editors), SSDBM 2001 Thirteenth International Conference on Scientific and Statistical Database Management. IEEE Computer Society, George Mason University, Fairfax, VA, pp. 180-189. http://dx.doi.org/10.1109/SSDM.2001.938550
Workflow Representation
ES3 executes the challenge workflow shell script directly, without any modification.
The corresponding workflow representation is assembled post hoc (as described below) by ES3, and is retrieved from ES3 as a GraphML document. The workflow diagrams in this report were generated by yWorks' yEd Graph Editor, using reformatted ES3 GraphML documents as input. Files are represented as circles and transformations as squares. Process arguments are omitted to minimize clutter.
Provenance Trace
Provenance in ES3 is managed by two components: the Probulator and the ES3 Core.
Unlike its namesake, the ES3 Probulator is designed to non-intrusively monitor the execution of complex scientific applications. All operations of the Probulator are completely transparent to ES3 users, and the default mode of operation requires no modification whatsoever of existing codes.
The Probulator comprises two applications, the Logger and the Transmitter. The Logger automatically instruments, monitors, and logs the execution of targeted programs and their interactions with their environment (files, parameters, system calls, etc.). A family of plugins adapts the Logger to different scientific processing environments. Currently two plugins are provided:
- The default plugin uses system call tracing to intercept and log a subset of the probulated process's system calls. This plugin currently works on Linux, and should work on any UNIX-like system that supports the "strace" facility.
- A plugin for the IDL analysis environment preprocesses IDL scripts to insert ES3 specific logging information, and to replace calls to certain IDL built-in functions with calls to instrumented ES3 equivalents. Although this plugin does modify the targeted application code, it does so transparently and reversibly -- no user intervention is required beyond setting a flag in an environment variable to enable or disable probulation.
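As a rough illustration of the system-call-tracing approach, the sketch below parses strace-style output into file and process events. It is not the Logger's actual code: the function name `parse_line`, the event dictionary layout, the choice of traced calls, and the sample log are all assumptions made for the example.

```python
import re

# Assumed line shape, as written by `strace -f -o FILE`:
#   <pid> <syscall>(<args>) = <return value> [error text]
LINE_RE = re.compile(r'^(\d+)\s+(\w+)\((.*)\)\s+=\s+(-?\d+)')
FIRST_STRING_RE = re.compile(r'"([^"]*)"')
WRITE_FLAGS = ("O_WRONLY", "O_RDWR", "O_CREAT")


def parse_line(line):
    """Return a provenance event dict, or None for lines we ignore."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    pid, call, args, ret = match.groups()
    path = FIRST_STRING_RE.search(args)
    if int(ret) < 0 or path is None:
        return None  # failed calls are skipped in this sketch
    if call == "execve":
        event = "exec"
    elif call in ("open", "openat"):
        event = "write" if any(flag in args for flag in WRITE_FLAGS) else "read"
    else:
        return None  # only a subset of system calls is of interest
    return {"pid": int(pid), "event": event, "path": path.group(1)}


# Illustrative log lines (hypothetical, not captured from a real run).
SAMPLE_LOG = '''\
101 execve("/usr/bin/align_warp", ["align_warp", "anatomy1.img", "-m", "12"], [/* 20 vars */]) = 0
101 open("anatomy1.img", O_RDONLY) = 3
101 open("warp1.warp", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 4
101 open("missing.hdr", O_RDONLY) = -1 ENOENT (No such file or directory)
'''

events = [e for e in map(parse_line, SAMPLE_LOG.splitlines()) if e]
```

Here `events` holds one "exec", one "read", and one "write" record; the failed open is dropped.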
Upon termination of a Logger session (or on specific request), Logger log files are read by the Transmitter, which:
- assigns a universally unique identifier (UUID) to every provenance-relevant object (file or process) referenced in the log file;
- converts the plugin-specific log files into standard ES3 execution reports; and
- sends these reports as XML messages via a web service interface to the ES3 Core.
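The Transmitter steps above might look roughly like the following sketch, which assigns a UUID to each referenced object and serializes the result as XML. The element and attribute names (`executionReport`, `link`, `object`) are hypothetical and do not reflect the real ES3 report schema; identifying a process by its name alone is also a simplification, and the web service call is omitted.

```python
import uuid
import xml.etree.ElementTree as ET


def build_report(events):
    """Turn logged events into one XML execution report (hypothetical schema)."""
    ids = {}

    def uuid_for(kind, name):
        # One UUID per distinct (kind, name) pair seen in the log.
        key = (kind, name)
        if key not in ids:
            ids[key] = str(uuid.uuid4())
        return ids[key]

    report = ET.Element("executionReport")
    for event in events:
        ET.SubElement(report, "link", {
            "process": uuid_for("process", event["process"]),
            "file": uuid_for("file", event["path"]),
            "mode": event["mode"],
        })
    for (kind, name), uid in ids.items():
        ET.SubElement(report, "object", {"uuid": uid, "kind": kind, "name": name})
    return ET.tostring(report, encoding="unicode")


# Illustrative events, as a Logger plugin might have recorded them.
events = [
    {"process": "align_warp", "path": "anatomy1.img", "mode": "read"},
    {"process": "align_warp", "path": "warp1.warp", "mode": "write"},
    {"process": "reslice", "path": "warp1.warp", "mode": "read"},
]
xml_message = build_report(events)
```

Because `warp1.warp` receives the same UUID in both links, a receiver can connect the process that wrote it to the process that read it.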
The ES3 Core decomposes the execution reports into object references and linkages between objects, using the Transmitter-supplied UUIDs as primary keys. This allows the Core to reconstruct the provenance graph at arbitrary starting points, forward and backward in time, by following the UUID references. The Core can also use file name, process name, and argument information captured by the Probulator to map between UUIDs and external names, allowing ES3 users to form queries in terms of objects they're familiar with.
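A minimal in-memory sketch of this idea follows: objects keyed by identifier, linkages stored in both directions, and a name lookup for user-facing queries. The class `ProvenanceStore`, its method names, and the `uuid-N` placeholder identifiers are assumptions for illustration; the actual Core is a database-backed service.

```python
from collections import defaultdict


class ProvenanceStore:
    """Toy stand-in for the Core: objects and linkages keyed by UUID."""

    def __init__(self):
        self.names = {}                    # uuid -> external name
        self.parents = defaultdict(set)    # uuid -> uuids it was derived from
        self.children = defaultdict(set)   # uuid -> uuids derived from it

    def add_object(self, uid, name):
        self.names[uid] = name

    def add_link(self, source, target):
        self.children[source].add(target)
        self.parents[target].add(source)

    def find(self, name):
        """Map an external name to the UUIDs that carry it."""
        return [uid for uid, n in self.names.items() if n == name]

    def trace(self, start, forward=False):
        """Return every UUID reachable from start, backward by default."""
        edges = self.children if forward else self.parents
        seen, stack = set(), [start]
        while stack:
            for nxt in edges.get(stack.pop(), ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen


store = ProvenanceStore()
chain = ["anatomy1.img", "align_warp", "warp1.warp", "reslice", "resliced1.img"]
for i, name in enumerate(chain):
    store.add_object(f"uuid-{i}", name)   # placeholder ids, not real UUIDs
for i in range(len(chain) - 1):
    store.add_link(f"uuid-{i}", f"uuid-{i + 1}")

warp = store.find("warp1.warp")[0]
ancestors = {store.names[u] for u in store.trace(warp)}
descendants = {store.names[u] for u in store.trace(warp, forward=True)}
```

Storing each linkage in both directions is what lets the same traversal serve ancestor and descendant queries.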
Example Provenance Trace for workflow.sh
- User installs the Probulator and sets an environment variable to activate tracing
- User runs workflow.sh
- Logger writes log file to disk
- Transmitter processes log file and sends execution report to ES3 Core
- ES3 Core stores execution report in its database
- User requests provenance information for workflow.sh (or for the UUID under which the workflow was submitted)
- ES3 Core returns provenance information (in GraphML or ES3 XML format)
- If necessary, provenance report is post-processed for input to display tool
- Display tool (e.g. yEd) creates workflow DAG
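For the last two steps, a provenance graph can be written as plain GraphML that a display tool is able to open. The sketch below uses the standard GraphML namespace and elements; the `name` and `kind` data keys, the function name `to_graphml`, and the sample graph are assumptions, and any tool-specific styling (such as node shapes) is left out.

```python
import xml.etree.ElementTree as ET

NS = "http://graphml.graphdrawing.org/xmlns"
ET.register_namespace("", NS)  # serialize GraphML as the default namespace


def tag(name):
    return f"{{{NS}}}{name}"


def to_graphml(nodes, edges):
    """nodes: id -> (kind, name); edges: list of (source id, target id)."""
    root = ET.Element(tag("graphml"))
    for key in ("name", "kind"):
        ET.SubElement(root, tag("key"), {
            "id": key, "for": "node", "attr.name": key, "attr.type": "string",
        })
    graph = ET.SubElement(root, tag("graph"), {"id": "G", "edgedefault": "directed"})
    for uid, (kind, name) in nodes.items():
        node = ET.SubElement(graph, tag("node"), {"id": uid})
        for key, value in (("name", name), ("kind", kind)):
            ET.SubElement(node, tag("data"), {"key": key}).text = value
    for i, (source, target) in enumerate(edges):
        ET.SubElement(graph, tag("edge"), {
            "id": f"e{i}", "source": source, "target": target,
        })
    return ET.tostring(root, encoding="unicode")


# Illustrative fragment of a workflow graph with placeholder ids.
nodes = {
    "n0": ("file", "atlas-x.pgm"),
    "n1": ("transformation", "convert"),
    "n2": ("file", "atlas-x.gif"),
}
document = to_graphml(nodes, [("n0", "n1"), ("n1", "n2")])
```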
Provenance Queries
| Q1 | Q2 | Q3 | Q4 | Q5 | Q6 | Q7 | Q8 | Q9 |
| | | | + | | | | | |
Query 1
- Find UUID for object named "Atlas X Graphic".
- trace lineage backwards from corresponding UUID
- display results
Query 2
- Find UUID for object named "Atlas X Graphic".
- trace lineage backwards from corresponding UUID until object named "softmean" is encountered
- display results
Query 3
- Find UUID for object named "Atlas X Graphic".
- trace lineage backwards 5 links from corresponding UUID
- display results
Discussion
The ES3 Core data model doesn't include a concept of workflow "stages". For this query we simply traced back five links (our interpretation of "Stages 3, 4, and 5" in the challenge workflow) from the "Atlas X Graphic" object. The lineage trace query uses a termination condition that states the trace should end after traversing five links from the starting UUID.
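A lineage trace with a pluggable termination condition can be sketched as a breadth-first walk over parent links. This is an illustration under assumptions, not the ES3 query itself: the function `trace_back`, the `stop` callback, and the object names in the sample chain are hypothetical.

```python
from collections import deque


def trace_back(parents, start, stop):
    """Walk parent links breadth-first from start.

    stop(node, depth) returns True when the trace should not continue
    past node. Returns a dict mapping each visited node to its depth.
    """
    depth = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if stop(node, depth[node]):
            continue
        for parent in parents.get(node, ()):
            if parent not in depth:
                depth[parent] = depth[node] + 1
                queue.append(parent)
    return depth


# Illustrative lineage chain: each object maps to what it was derived from.
parents = {
    "atlas-x.gif": ["convert"],
    "convert": ["atlas-x.pgm"],
    "atlas-x.pgm": ["slicer"],
    "slicer": ["atlas.img"],
    "atlas.img": ["softmean"],
    "softmean": ["resliced1.img"],
    "resliced1.img": ["reslice"],
}

# Terminate after five links, as in Query 3.
five_links = trace_back(parents, "atlas-x.gif", lambda node, d: d >= 5)

# Terminate at a named object instead, as in Query 2.
to_softmean = trace_back(parents, "atlas-x.gif", lambda node, d: node == "softmean")
```

On this sample chain both conditions stop at `softmean`, which sits five links behind the starting object.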
Query 4
- Find all ES3 transformation objects (i.e. processes) that have the specified name and command line arguments
Discussion
The split score ( + ) for this query is due to XQuery's lack of support for queries based on day-of-week.
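One way to supply the missing day-of-week test is to apply it outside the query language, on the records a name-and-argument query returns. The sketch below assumes each returned record carries a start timestamp, which the report does not confirm; the record layout and the function names are hypothetical.

```python
from datetime import datetime

MONDAY = 0  # datetime.weekday() numbers Monday as 0


def has_option(args, flag, value):
    """True if flag is immediately followed by value in the argument list."""
    return any(a == flag and b == value for a, b in zip(args, args[1:]))


def invocations_on(records, name, flag, value, weekday):
    """UUIDs of matching transformations that started on the given weekday."""
    return [
        r["uuid"] for r in records
        if r["name"] == name
        and has_option(r["args"], flag, value)
        and r["started"].weekday() == weekday
    ]


# Illustrative records; 2006-09-11 was a Monday.
records = [
    {"uuid": "p1", "name": "align_warp", "args": ["anatomy1.img", "-m", "12"],
     "started": datetime(2006, 9, 11, 10, 0)},
    {"uuid": "p2", "name": "align_warp", "args": ["anatomy2.img", "-m", "12"],
     "started": datetime(2006, 9, 12, 10, 0)},
    {"uuid": "p3", "name": "align_warp", "args": ["anatomy3.img", "-m", "3"],
     "started": datetime(2006, 9, 11, 11, 0)},
    {"uuid": "p4", "name": "reslice", "args": ["warp1.warp"],
     "started": datetime(2006, 9, 11, 12, 0)},
]
monday_runs = invocations_on(records, "align_warp", "-m", "12", MONDAY)
```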
Query 5
We did not implement Query 5, since the ES3 Probulator currently doesn't examine the contents of the objects it monitors. (See Further Comments below)
Query 6
- retrieve all align_warp transformations with arguments -m 12
- trace lineage forward to softmean
- retrieve file objects one lineage step forward from softmean
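The three steps can be sketched as a forward traversal that stops at each softmean transformation and collects its immediate outputs. The data layout (a `children` adjacency map plus a `procs` map of process name and arguments), the function names, and the sample identifiers are assumptions for illustration only.

```python
def has_option(args, flag, value):
    """True if flag is immediately followed by value in the argument list."""
    return any(a == flag and b == value for a, b in zip(args, args[1:]))


def softmean_outputs(children, procs):
    """Outputs of softmean runs that descend from align_warp with -m 12."""
    stack = [
        uid for uid, (name, args) in procs.items()
        if name == "align_warp" and has_option(args, "-m", "12")
    ]
    seen, outputs = set(), set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        if node in procs and procs[node][0] == "softmean":
            # One lineage step forward from softmean, then stop.
            outputs.update(children.get(node, ()))
            continue
        stack.extend(children.get(node, ()))
    return outputs


# Illustrative graph: ids prefixed p are processes, f are files.
procs = {
    "p1": ("align_warp", ["anatomy1.img", "-m", "12"]),
    "p2": ("reslice", ["warp1.warp"]),
    "p3": ("softmean", ["atlas.hdr"]),
    "p4": ("slicer", ["atlas.hdr"]),
}
children = {
    "p1": ["f1"], "f1": ["p2"], "p2": ["f2"], "f2": ["p3"],
    "p3": ["f3", "f4"], "f3": ["p4"], "p4": ["f5"],
}
averaged = softmean_outputs(children, procs)
```

The traversal does not continue past softmean, so later products such as `f5` are excluded from the result.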
Query 7
- workflow.sh is modified as instructed.
- We allow the added programs to communicate via pipes (as opposed to intermediate files).
- We supply arbitrary arguments for pgmtoppm
- The modified workflow (workflow_Q7.sh) is probulated, saved in ES3, and retrieved as GraphML
- We use a simple home-brewed graph differencing tool to flag the differences between the original and modified graphs on a per-element basis (ignoring UUIDs) with a diff=[true|false] attribute.
- The flagged graphs are rendered, with differing portions marked by (in this example) red dashed lines.
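The differencing step could work along the following lines, comparing elements by their labels rather than by UUID. This is a guess at the general technique, not the tool the team wrote: the graph layout, the function `flag_differences`, and the sample graphs (including the program and file names in the modified one) are hypothetical. Matching on labels alone also conflates distinct objects that share a name.

```python
def flag_differences(graph, other):
    """Flag each node and edge of graph that has no label-equal match in other.

    A graph is {"nodes": {id: (kind, name)}, "edges": [(source id, target id)]}.
    Ids are ignored when comparing; only (kind, name) labels are used.
    """
    other_nodes = set(other["nodes"].values())
    other_edges = {
        (other["nodes"][s], other["nodes"][t]) for s, t in other["edges"]
    }
    node_flags = {
        uid: label not in other_nodes for uid, label in graph["nodes"].items()
    }
    edge_flags = {
        (s, t): (graph["nodes"][s], graph["nodes"][t]) not in other_edges
        for s, t in graph["edges"]
    }
    return node_flags, edge_flags


# Illustrative tail of the original and modified workflows.
original = {
    "nodes": {"a1": ("file", "atlas-x.pgm"), "a2": ("process", "convert"),
              "a3": ("file", "atlas-x.gif")},
    "edges": [("a1", "a2"), ("a2", "a3")],
}
modified = {
    "nodes": {"b1": ("file", "atlas-x.pgm"), "b2": ("process", "pgmtoppm"),
              "b3": ("process", "pnmtojpeg"), "b4": ("file", "atlas-x.jpg")},
    "edges": [("b1", "b2"), ("b2", "b3"), ("b3", "b4")],
}
node_diff, edge_diff = flag_differences(modified, original)
```

Running the function in both directions yields the flags for each graph, which a renderer can then map to a visual style.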
Discussion
Our solution to Query 7, while not implemented entirely as an ES3 Core query, is nevertheless responsive to one of the primary classes of user queries that ES3 as a whole was designed to support; namely, "what changed?" queries. It's extremely common for scientists developing ad hoc workflows to notice differences in outputs across invocations between which "nothing was changed". Our graph-differencing approach is designed to answer the "what changed?" query as directly (and visually) as possible, while still allowing subsequent drill-down into the details.
Queries 8 and 9
We did not implement Queries 8 and 9, since the ES3 Core currently doesn't support annotations. (See Further Comments below)
Further Comments
ES3's provenance management currently concentrates on the automatic, transparent acquisition of structural provenance; i.e., reverse-engineering workflow. There is nothing that prevents one from storing in ES3 the additional content-based information required by Queries 5, 8, and 9; however, we have not yet implemented a way to "slipstream" this information into the Probulator logs or Transmitter messages while remaining unobtrusive to the ES3 user. This is definitely within ES3's scope, which is why we've scored these queries as we have, and is the part of ES3 currently being developed.
-- JamesFrew - 12 Sep 2006