The challenge workflow is encoded in the Virtual Data Language (VDL). VDL is described in the VDS USers Guide.
The full workflow is at: genatlas.vdl. An excerpt is included here:
TR air::align_warp( in refimg, in refhdr, in subimg, in subhdr, model, quick, out warp ) { argument = ${refimg}; argument = ${subimg}; argument = ${warp}; argument = "-m " ${model}; argument = ${quick}; } TR air::reslice( in warp, in subimg, in subhdr, out slicedimg, out slicedhdr ) { argument = ${warp}; argument = ${slicedimg}; } ... DV wilde.pc1.wf01::align_warp.1->air::align_warp( refhdr = @{in:"Data/Raw/reference.hdr"}, refimg = @{in:"Data/Raw/reference.img"}, subhdr = @{in:"Data/Raw/anatomy1.hdr"}, subimg = @{in:"Data/Raw/anatomy1.img"}, warp = @{out:"Data/Derived/warp1.warp"}, model = "12", quick = "-q" ); DV wilde.pc1.wf01::reslice.2->air::reslice( warp = @{in:"Data/Derived/warp1.warp"}, subhdr = @{in:"Data/Raw/anatomy1.hdr"}, subimg = @{in:"Data/Raw/anatomy1.img"}, slicedhdr = @{out:"Data/Derived/resliced1.hdr"}, slicedimg = @{out:"Data/Derived/resliced1.img"} );
A few notes on VDL:
Upload a representation of the information you captured when executing the workflow. Explain the structure (provide pointers to documents describing your schemas etc.)
We generated a DAX using the VDLC command, which compiles the VDL, inserts a representation of it into the virtual data catalog database, and then traverses the graph of transformation calls ("derivations" - the nodes) and their datadependencies (the edges) in the VDL produce a workflow.
[Directed acyclic graph in XML � �DAX� document] [Compiled workflow graph] This document describes the explicit control flow graph. It contains three sections:
The DAX is a primary element of the data structure with which runtime provenance will be associated, and is useful in querying the provenance. Note that in future releases, DAX information may be fully maintained in the VDC schema itself. Currently, it is not. The full DAX workflow graph for the challenge workflow is in the file genatlas.dax.
We augmented the VDC database schema with a few changes to better support provenance query. The current database schema can be obtained from the VDS CVS repository. The schema schanges consist of two tables that break out the details of transformation and derivation arguments.
The primary provenence records we capture are the XML invocation documents, or "kickstart records" - one XML file per application invocation. The records from the execution of the challenge workflow are in this tar file of invocation documents. Here's the invocation document for a run of reslice: reslice_ID000002.out
Details of the structure and interpretation of the invocation documents are in the VDS Users Guide chapter Interpreting Results from Kickstart.
We merge the information from the runtime invocation records with the workflow graph structure in the dax, in a formatted "provenance" file. The full workflow, documented in this fashion, can be found in the file genatlas.pro.
The full workflow was executed using the initialization and execution script contained in these files: runpc1, inittc and initrc which run the workflow, and set up the list of input files and application programs, respectively.
Provence quriees in the VDS can take several forms: SQL queries on the virtual data catalog, XML-filtering queries (performed in this exercise using text tools such as awk, but which can conceptually (and more properly) be done using XPath and XQuery), and plain text-filtering queries.
These three mechanisms are joined in ad-hoc but expedient manners as illustrated below. In general, throughout our queries, we use gendax to perform graph traversals on the VDL definitions in the virtual data catalog database, and then tex tools to traverse that graph in its XML file format, and join this information with SQL queries using shell scripts. Note that several queries are performed entirely in SQL.
Find the process that led to Atlas X Graphic / everything that caused Atlas X Graphic to be as it is. This should tell us the new brain images from which the averaged atlas was generated, the warping performed etc.
We perform a gendax to create the workflow graph for just the atlas x graphic. (Note that the executed workflow comprises the workflow graph to derive all three Atlas graphic files).
# Get provenance for just the Atlas-X file gendax -o genatlasx.dax -l pc1.wf01 -f Data/Derived/atlas-x.gif prdax < genatlasx.dax >genatlasx.pro drawwf.sh genatlasx
prdax is an awk script that formats the DAX document into humanly readable report-like format, and augments the workflow graph from the DAX with the runtime invocation provenance information from the virtual data catalog.
Note that the workflow has already been executed � gendax (a Java app) performs a graph traversal of the virtual data catalog (RDBMS), to select the derivation definitions required to derive the requested file. Passing gendax the same workflow id as the wf we ran allows us to match the invocation records in the database with the steps of the requested subgraph.
The result files for this query are:
The DAX graph: genatlasx.dax
The Provenance (process record) file: genatlasx.pro
Graphical renderings of the process graph, at various levels of detail, in png and svg:
(Note that you may need a browser plugin to view the SVG. They can be viewed nicely in visio).
Find the process that led to Atlas X Graphic, excluding everything prior to the averaging of images with softmean.
This query asks us to filter a workflow graph. We take �prior to� to mean all graph predecessors of softmean. We do a graph traversal from the specified node (softmean) and go backwards through the workflow graph � opposite to the direction of execution, from the sync to the source � eliminating all predecessor nodes of the target node.
The script exclude_prior performs this pruning on the XML workflow graph document (the genatlas.dax file), producing a new workflow document, the exuptosoftmean.dax We then run the prdax script on this graph to produce the provenance list.
$ exclude_prio ID000009 <genatlas.dax >exuptosoftmean.dax $ prdax
The provenance of the pruned workflow can be seen in these two files:
Note that we specify the softmean job here with its Job ID (with no loss of generality). The exclude_prio script reads the dax file, constructs the graph in associative array objects (as a successor list) and uses the recursive function �elim� to remove graph elements that we wish to prune:
function elim(node,plist) { delete jobtext[node]; if ( ! (node in parlist) ) return; split(parlist[node], plist, " ") for ( p in plist ) { elim(plist[p]); } delete parlist[node]; } j = jobtext["$exjob"]; elim("$exjob"); jobtext["$exjob"] = j;
The full source for exclude_prior is here: exclude_prior.
Find the Stage 3, 4 and 5 details of the process that led to Atlas X Graphic.
Query 3 is similar to query 2 but is based on stages of the workflow graph rather than on graph paths. Each derivation (invocation) in our workflow graphs is annotated with a level attribute, that signifies the level of the breadth-first graph traversal that was done to go from the requested data product of the workflow backwards to all the dependent data products and the derivations needed to produce them. These levels are the inverse of the stage numbers specied in the challenge problem. We apply a function maxlevel which obtains the maximum level of the workflow graph, and use this number to convert the request stage numbers to requested levels. Then, the work of this query is done in a manner very similar to that of query 2, using select_stage to prune out all levels of the graph outside of a specific range of stages: Select_stage calls maxlevel to convert stages to levels.
$ select_stage 3 5 <genatlas.dax >genatlas.3-5.dax $ prdax <genatlas.3-5.dax >genatlas.3-5.pro
Find all invocations of procedure align_warp using a twelfth order nonlinear 1365 parameter model (see model menu describing possible values of parameter "-m 12" of align_warp) that ran on a Monday.
This is a straightforward SQL join of the invocation tables with the parameters of the DV (as given by the darg table).
For days of the week, we use a SQL expression on the DV invocation date: SELECT EXTRACT(DOW FROM start) to pick a specific date. ((Sunday=0. Note that we queried for Tuesday (2) rather than Monday (1) to match our example run) but show a similar technique to query on a range of �seconds of minute� to test the approach.
The query is:
-- invocations from Tuesday SELECT iv.* from ptc_invocation iv, anno_definition def WHERE def.id IN ( SELECT dvid FROM darg da WHERE dvid IN ( SELECT id FROM anno_definition WHERE xml like '<derivation%uses="align_warp"%' ) AND da.name = 'model' AND da.value = 12 ) AND def.name = iv.dv_name AND EXTRACT(DOW from iv.start) = 2; -- invocations occuring < 20 seconds into the minute SELECT iv.* from ptc_invocation iv, anno_definition def WHERE def.id IN ( SELECT dvid FROM darg da WHERE dvid IN ( SELECT id FROM anno_definition WHERE xml like '<derivation%uses="align_warp"%' ) AND da.name = 'model' AND da.value = 12 ) AND def.name = iv.dv_name AND EXTRACT(SECONDS from iv.start) < 20;
The output in this simple test was a crued dump of all the fields of the invocation record as stored in the VDC database. While hard to read, it shows the richnes of the fields captured in the provenance record: uid, gid, pid, cwd, system architecture, and a pointer to the getruasge() record info:
-bash-3.00$ cat find_invo.out id | creator | creationtime | wf_label | wf_time | version | start | duration | tr_namespace | tr_name | tr_version | dv_namespace | dv_name | dv_version | resource | host | pid | uid | gid | cwd | arch | total ----+---------+----------------------------+----------+----------------------------+---------+----------------------------+----------+--------------+------------+------------+----------------+--------------+------------+----------+-----------------+-------+------+------+-------------------------+------+------- 80 | wilde | 2006-09-12 09:46:34.068-05 | pc1.wf01 | 2006-10-10 19:00:01.219-05 | 1.7 | 2006-09-12 09:44:16.749-05 | 3.759 | air | align_warp | | wilde.pc1.wf01 | align_warp.3 | | local | 128.135.125.191 | 19843 | 1031 | 1000 | /autonfs/home/wilde/pc1 | 3 | 160 79 | wilde | 2006-09-12 09:46:33.946-05 | pc1.wf01 | 2006-10-10 19:00:01.219-05 | 1.7 | 2006-09-12 09:44:11.526-05 | 4.008 | air | align_warp | | wilde.pc1.wf01 | align_warp.1 | | local | 128.135.125.191 | 19817 | 1031 | 1000 | /autonfs/home/wilde/pc1 | 3 | 158 81 | wilde | 2006-09-12 09:46:34.141-05 | pc1.wf01 | 2006-10-10 19:00:01.219-05 | 1.7 | 2006-09-12 09:44:26.981-05 | 3.75 | air | align_warp | | wilde.pc1.wf01 | align_warp.5 | | local | 128.135.125.191 | 19893 | 1031 | 1000 | /autonfs/home/wilde/pc1 | 3 | 162 82 | wilde | 2006-09-12 09:46:34.222-05 | pc1.wf01 | 2006-10-10 19:00:01.219-05 | 1.7 | 2006-09-12 09:44:21.771-05 | 4.008 | air | align_warp | | wilde.pc1.wf01 | align_warp.7 | | local | 128.135.125.191 | 19867 | 1031 | 1000 | /autonfs/home/wilde/pc1 | 3 | 164 (4 rows) id | creator | creationtime | wf_label | wf_time | version | start | duration | tr_namespace | tr_name | tr_version | dv_namespace | dv_name | dv_version | resource | host | pid | uid | gid | cwd | arch | total ----+---------+----------------------------+----------+----------------------------+---------+----------------------------+----------+--------------+------------+------------+----------------+--------------+------------+----------+-----------------+-------+------+------+-------------------------+------+------- 80 | wilde | 2006-09-12 09:46:34.068-05 | pc1.wf01 | 2006-10-10 19:00:01.219-05 | 1.7 | 2006-09-12 09:44:16.749-05 | 3.759 | air | align_warp | | wilde.pc1.wf01 | align_warp.3 | | local | 128.135.125.191 | 19843 | 1031 | 1000 | /autonfs/home/wilde/pc1 | 3 | 160 79 | wilde | 2006-09-12 09:46:33.946-05 | pc1.wf01 | 2006-10-10 19:00:01.219-05 | 1.7 | 2006-09-12 09:44:11.526-05 | 4.008 | air | align_warp | | wilde.pc1.wf01 | align_warp.1 | | local | 128.135.125.191 | 19817 | 1031 | 1000 | /autonfs/home/wilde/pc1 | 3 | 158 (2 rows)
Find all Atlas Graphic images outputted from workflows where at least one of the input Anatomy Headers had an entry global maximum=4095. The contents of a header file can be extracted as text using the scanheader AIR utility.
This query treats the entire workflow as a black box, and assumes, for example, that many workflows have been run, and that we are looking for the output products of workflows whose inputs meet a metadata criterion.
We run a simple script (which could itself be captured as a workflow) to turn the headers into metadata and reattach this metadata to the headers.
The query script - which integrates SQL on the VDC with awk to query the DAX XML workflow - is:
targetfiles=`mktemp /tmp/query5.XXXXXX` derivedfiles=`mktemp /tmp/query5.XXXXXX` targetpattern=`mktemp /tmp/query5.XXXXXX` psql -t -d mwvdc1 -U wilde -f - <<EOF >$targetfiles SELECT lfn.name FROM anno_lfn lfn, anno_int ai WHERE lfn.name like '%/anatomy%.hdr' and lfn.id = ai.id and ai.value = 4095; EOF psql -t -d mwvdc1 -U wilde -f - <<EOF >$derivedfiles SELECT of.name FROM anno_lfn_o of WHERE of.name like '%/atlas%.gif' OR of.name like '%/atlas%.pgm' EOF sed -e 's/^ //' \ -e '/^$/d' \ -e 's/^/ <filename file="/' \ -e 's/$/" link="input"\/>/' <$targetfiles >$targetpattern for d in `cat $derivedfiles` ; do gendax -f $d | grep -q -F "`cat $targetpattern`" && echo $d done rm $targetfiles $derivedfiles $derivedpattern
Find all output averaged images of softmean (average) procedures, where the warped images taken as input were align_warped using a twelfth order nonlinear 1365 parameter model, i.e. "where softmean was preceded in the workflow, directly or indirectly, by an align_warp procedure with argument -m 12.
We use a similar approach to that of query 5:
# Find align_warp calls with argument model=12 psql -t -d mwvdc1 -U wilde -A -F ' ' -f - <<EOF >$alignwarpdvs SELECT def.namespace, def.name, def.version from darg da, anno_definition def WHERE da.name = 'model' AND da.value = 12 AND da.dvid = def.id EOF # Find all softmean derivations psql -t -d mwvdc1 -U wilde -A -F ' ' -f - <<EOF >$softmeandvs SELECT def.namespace, def.name, def.version from anno_definition def WHERE def.xml LIKE '%uses="softmean"%'; EOF # Form DV pattern to search workflow graphs cat <<EOF >$awkp { line=""; if( \$1 != "") {line = line "dv-namespace=\"" \$1 "\""; } if( \$2 != "") {line = line " dv-name=\"" \$2 "\""; } if( \$3 != "") {line = line " dv-version=\"" \$3 "\""; } print line; } EOF awk -f $awkp <$alignwarpdvs >$dvpattern # Find workflows of softmean derivations that contain the desired preceding align_warp DVs # and select the output images of these workflows while read ns name ver; do gendax -o $dax -n "$ns" -i "$name" -v "$ver" grep -q -F "`cat $dvpattern`" $dax && grep 'filename.*img.*output' $dax done <$softmeandvs
A user has annotated some anatomy images with a key-value pair center=UChicago. Find the outputs of align_warp where the inputs are annotated with center=UChicago.
This query is done entirely in SQL on the VDC:
Find lfns with annotation center=uc Find alignwarp Dvs with those lfns as inputs Find outputs of these DVs (need to add more data to show the selectivity of this query) SELECT outf.name FROM anno_lfn_i inf, anno_lfn_o outf, anno_definition dv WHERE inf.name in ( SELECT lfn.name FROM anno_lfn lfn, anno_text txt WHERE mkey='center' AND txt.id = lfn.id AND txt.value = 'UChicago' ) AND inf.did = dv.id AND dv.xml like '%uses="align_warp"%' AND dv.id = outf.did; name ------------------------- Data/Derived/warp2.warp Data/Derived/warp4.warp (2 rows)
A user has annotated some atlas graphics with key-value pair where the key is studyModality. Find all the graphical atlas sets that have metadata annotation studyModality with values speech, visual or audio, and return all other annotations to these files.
For this query we create 10 new logical file records, and annotate them with test data. The Query, like #8, is entirely in SQL on the VDC:
SELECT lfn.name, lfn.mkey, txt.value FROM anno_lfn lfn, anno_text txt WHERE lfn.id = txt.id AND lfn.name in ( SELECT lfn1.name FROM anno_lfn lfn1, anno_lfn lfn2, anno_text txt1, anno_text txt2 WHERE lfn1.mkey='datatype' AND txt1.id = lfn1.id AND txt1.value = 'graphics' AND lfn2.mkey='studyModality' AND lfn2.id = txt2.id AND txt2.value in ('speech', 'audio', 'visual') AND lfn1.name = lfn2.name ) ORDER BY lfn.name; SELECT lfn.name, lfn.mkey, flt.value FROM anno_lfn lfn, anno_float flt WHERE lfn.id = flt.id AND lfn.name in ( SELECT lfn1.name FROM anno_lfn lfn1, anno_lfn lfn2, anno_text txt1, anno_text txt2 WHERE lfn1.mkey='datatype' AND txt1.id = lfn1.id AND txt1.value = 'graphics' AND lfn2.mkey='studyModality' AND lfn2.id = txt2.id AND txt2.value in ('speech', 'audio', 'visual') AND lfn1.name = lfn2.name ) ORDER BY lfn.name; $ query9 name | mkey | value +----------------------+---------------+---------- Data/Raw/anatomy10.img | studyAgency | NSF Data/Raw/anatomy10.img | studyPI | Moreau Data/Raw/anatomy10.img | studyModality | speech Data/Raw/anatomy10.img | datatype | graphics Data/Raw/anatomy10.img | center | UChicago Data/Raw/anatomy11.img | studyState | complete Data/Raw/anatomy11.img | studyAgency | NSF Data/Raw/anatomy11.img | studyPI | Moreau Data/Raw/anatomy11.img | studyModality | audio Data/Raw/anatomy11.img | datatype | graphics Data/Raw/anatomy11.img | center | Oxford Data/Raw/anatomy12.img | studyAgency | NSF Data/Raw/anatomy12.img | studyPI | Moreau Data/Raw/anatomy12.img | studyModality | visual Data/Raw/anatomy12.img | datatype | graphics Data/Raw/anatomy12.img | center | UChicago Data/Raw/anatomy14.img | studyState | complete Data/Raw/anatomy14.img | studyAgency | NSF Data/Raw/anatomy14.img | studyPI | Groth Data/Raw/anatomy14.img | studyModality | speech Data/Raw/anatomy14.img | datatype | graphics Data/Raw/anatomy14.img | center | UChicago Data/Raw/anatomy15.img | studyAgency | NSF Data/Raw/anatomy15.img | studyPI | Moreau Data/Raw/anatomy15.img | studyModality | speech Data/Raw/anatomy15.img | datatype | graphics Data/Raw/anatomy15.img | center | London Data/Raw/anatomy16.img | studyAgency | NSF Data/Raw/anatomy16.img | studyPI | Moreau Data/Raw/anatomy16.img | studyModality | speech Data/Raw/anatomy16.img | datatype | graphics Data/Raw/anatomy16.img | center | Kyoto Data/Raw/anatomy18.img | studyAgency | NSF Data/Raw/anatomy18.img | studyPI | Moreau Data/Raw/anatomy18.img | studyModality | speech Data/Raw/anatomy18.img | datatype | graphics Data/Raw/anatomy18.img | center | UChicago (37 rows) name | mkey | value +-----------------------+-----------+------------------ Data/Raw/anatomy15.img | studyCost | 12500.9501953125 (1 row)
Suggest variants of the workflow that can exhibit capabilities that your system support.
Suggest significant queries that your system can support and are not in the proposed list of queries, and how you have implemented/would implement them. These queries may be with regards to a variant of the workflow suggested above.
According to your provenance approach, you may be able to provide a categorisation of queries. Can you elaborate on the categorisation and its rationale.
If your system can be accessed live (through portal, web page, web service, or other), provide relevant information here.
Provide here further comments.
Provide here your conclusions on the challenge, and issues that you like to see discussed at a face to face meeting.
-- MichaelWilde - 12 Sep 2006
to top
I | Attachment ![]() | Action | Size | Date | Who | Comment |
---|---|---|---|---|---|---|
![]() | initrc.sh | manage | 1.7 K | 13 Sep 2006 - 10:38 | MichaelWilde |