Virtual Data Language Preprocessing
To aid the definition and configuration of our derivation workflow, we wrote a pre-processor program, vdlgen.sh. We kept unprocessed source versions of the VDL (analysisbase.vdl), the transformation catalog (tc.data) and the pool.config file (pool.config.kickstart or pool.config.nokickstart). These were transformed by the pre-processor into versions ready to be used by the abstract DAG generator. An example post-processed VDL script is shown in analysis.vdl.
New constructs
We added four simple constructs to VDL to aid our derivation workflow definition. The pre-processor extracted these constructs and altered the VDL accordingly.
$${var}
- expands to the value of variable var, supplied when pre-processing in a particular instance
MAP lfn pfn
- checks the RLS for a mapping from the logical filename lfn to the physical filename pfn, and adds the mapping if it is not already present
FOR var FROM start TO end STEP step
block
END
- counts from start to end incrementing by step, and outputs block each time
- if the token %var% appears in block, it is expanded to the value of the counter
- start, end and step may be calculations
- these loops can be nested
LIST block start end step
- a single-line convenience form of FOR, where var is always 'item'
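The variable and FOR constructs above can be sketched as follows. This is an illustrative Python sketch, not the code of vdlgen.sh. The function names (calc, expand_loops, preprocess) are hypothetical, and several details are assumptions not stated above: FOR and END each sit on their own line, the end value is inclusive, and calculations are limited to integer +, -, * and //. MAP and LIST are not handled.

```python
import ast
import operator
import re

# Operators accepted in FOR bounds (an assumption; the real set is not documented).
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.FloorDiv: operator.floordiv}

FOR_RE = re.compile(r"^\s*FOR\s+(\w+)\s+FROM\s+(.+?)\s+TO\s+(.+?)\s+STEP\s+(.+?)\s*$")
VAR_RE = re.compile(r"\$\$\{(\w+)\}")


def calc(expr):
    """Evaluate a small integer expression such as '1+2*3' without eval()."""
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, int):
            return node.value
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError(f"unsupported expression: {expr!r}")
    return walk(ast.parse(expr.strip(), mode="eval").body)


def expand_loops(lines):
    """Expand FOR ... END blocks, including nested ones, into plain lines."""
    out, i = [], 0
    while i < len(lines):
        m = FOR_RE.match(lines[i])
        if not m:
            out.append(lines[i])
            i += 1
            continue
        # Find the END that closes this FOR, skipping over nested loops.
        depth, j = 1, i + 1
        while j < len(lines):
            if FOR_RE.match(lines[j]):
                depth += 1
            elif lines[j].strip() == "END":
                depth -= 1
                if depth == 0:
                    break
            j += 1
        if depth != 0:
            raise ValueError("FOR without matching END")
        var, start, end, step = m.groups()
        lo, hi, inc = calc(start), calc(end), calc(step)
        if inc == 0:
            raise ValueError("STEP must be non-zero")
        stop = hi + 1 if inc > 0 else hi - 1  # assumes an inclusive end value
        body = lines[i + 1:j]
        for n in range(lo, stop, inc):
            # Substitute the counter first so inner loop bounds can use it.
            substituted = [ln.replace(f"%{var}%", str(n)) for ln in body]
            out.extend(expand_loops(substituted))
        i = j + 1
    return out


def preprocess(text, variables):
    """Expand $${var} references, then FOR loops."""
    text = VAR_RE.sub(lambda m: str(variables[m.group(1)]), text)
    return "\n".join(expand_loops(text.splitlines()))


source = """\
owner $${owner}
FOR i FROM 1 TO 1+1 STEP 1
FOR j FROM %i% TO 2 STEP 1
pair %i% %j%
END
END"""
result = preprocess(source, {"owner": "alice"})
print(result)
```

With these assumptions the example prints "owner alice" followed by "pair 1 1", "pair 1 2" and "pair 2 2". Substituting the outer counter before recursing into the block is what lets an inner loop use %i% in its own bounds.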
Wrapping
Our pre-processor also takes configuration options that determine whether and how to 'wrap' the transformations. We have a wrapper program similar to kickstart, called recordProvenance.sh, that records provenance in our provenance store before, and potentially after, a script is executed. We could use it by replacing kickstart in the pool.config, but we generally wanted kickstart as well.
If told to, the pre-processor calls a wrapping script, passing some configuration parameters. The wrapping script changes the VDL transformations to take extra inputs, including the path of the 'unwrapped' program to be executed at that step. It also changes the VDL derivations to pass that extra information. Finally, the wrapping script generates a new transformation catalog in which each transformation's physical location is replaced with that of recordProvenance.sh.
An extra field, delegates-provenance, in the unprocessed transformation catalog marks an entry that should not be wrapped but should still be given the information required to record provenance. This is useful for transformations that are workflow scripts running several smaller activities locally, each of which should record its own provenance (inputs, outputs etc.), rather than the workflow script recording only its own inputs and outputs. This allows us to independently control the granularity of the distributed workflow, where each task should last about 15 minutes, and the granularity of the workflow of provenance-recording activities, which may be much finer.
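The catalog rewriting described in this section can be sketched as below. This is a hypothetical Python illustration only: the Entry class, the wrap_catalog function, the example paths and the "store" setting are all invented for the sketch, and the real tc.data layout and wrapping script are not reproduced here. The sketch assumes each catalog entry has a name, a physical path and a delegates-provenance flag.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Entry:
    """Hypothetical in-memory form of one transformation catalog entry."""
    name: str                           # logical transformation name
    path: str                           # physical location of the executable
    delegates_provenance: bool = False  # stands in for the delegates-provenance field


def wrap_catalog(entries, wrapper_path, provenance_info):
    """Return (new_catalog, extra_inputs).

    extra_inputs maps each transformation name to the extra values its
    derivations would have to pass after wrapping.
    """
    new_catalog, extra_inputs = [], {}
    for entry in entries:
        extras = dict(provenance_info)
        if entry.delegates_provenance:
            # Not wrapped: the script records provenance for its own
            # sub-activities, so it keeps its path but still gets the settings.
            new_catalog.append(entry)
        else:
            # Wrapped: the catalog points at the wrapper, and the original
            # program path becomes an extra input.
            new_catalog.append(replace(entry, path=wrapper_path))
            extras["unwrapped_path"] = entry.path
        extra_inputs[entry.name] = extras
    return new_catalog, extra_inputs


# Example names and paths are placeholders, not taken from the real catalog.
catalog = [
    Entry("align", "/opt/bin/align"),
    Entry("subworkflow", "/opt/bin/subworkflow.sh", delegates_provenance=True),
]
wrapped, extras = wrap_catalog(
    catalog, "/opt/bin/recordProvenance.sh", {"store": "example-store"}
)
for entry in wrapped:
    print(entry.name, entry.path, extras[entry.name])
```

Returning the extra inputs alongside the new catalog mirrors the two halves of the wrapping step: the catalog change redirects execution to the wrapper, while the extra inputs are what the modified transformations and derivations would need to carry.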
--
SimonMiles - 23 Feb 2005