The workflow is developed in the Kepler scientific workflow system. It uses the COMAD (Collection-Oriented Modeling and Design) framework, a system which allows nested collections of data to flow through a pipeline of actors, each actor interacting with only those parts of the stream it is interested in. This allows us to model the main aspect of control flow, the iteration over CSVFiles, as a functional map operation; this is very simple in COMAD. One challenge, though, was halting the entire workflow whenever any actor fails. While failures could easily be propagated downstream to actors 'below,' communicating them to upstream actors is challenging. To solve this, the classes simply communicate directly in an out-of-band fashion; if one fails, it informs all the rest, and they then act as passthroughs, just forwarding data and not operating on them further.
We ran the workflow across various data sets such that the workflow halts at each different step (8 halts) and in one case succeeds in executing all steps. The table below states those cases, comad-kepler provenance traces (xml) for the same, corresponding opm trace (xml), opm graph (png) and opm process dependency graph (png).
We imported the opm output from various teams into our comad-kepler trace. We then stored it into our relational system (MySQL?) to query the imported provenance information.
Team | opm trace (xml) | comad-kepler trace (xml) |
---|---|---|
UCDGC | UCDGC_opm.xml | UCDGC_comad.trace |
SotonUSCISIPc3 | SotonUSCISIPc3_opm.xml | SotonUSCISIPc3_comad.trace |
TetherlessPC3 | TetherlessPC3_opm.xml | TetherlessPC3_comad.trace |
KCL | KCL_opm.xml | KCL_comad.trace |
PASS3 | PASS3_opm.xml | PASS3_comad.trace |
The provenance information (trace) where stored in MySQL?, and queries were written in SQL.
Basic sample answer: The CSV file containing the Detection table.
Advanced sample answer: The CSV file containing the Detection table, CSV file containing the Image table (as the image is an attribute of the detection), and CSV file containing the FrameMetadata table (as the frame metadata is an attribute of the image).
SELECT value FROM runAnnotationView WHERE name="file" AND targetNodeId IN (SELECT targetNodeId FROM runAnnotationView WHERE value LIKE '%Detection%') ;
Query Result:
+----------------------------------------------------------------------------------------------------------+ | value | +----------------------------------------------------------------------------------------------------------+ | /Users/manish/Desktop/pc3/datasets/J062941-success/P2_J062941_B001_P2fits0_20081115_P2Detection.csv | +----------------------------------------------------------------------------------------------------------+
Sample answer: Yes
SELECT "true" FROM runDataView rd1, runColldata r, runAnnotation ra WHERE rd1.name="ccdID" AND rd1.value="1" AND r.nodeId=rd1.parent AND ra.targetNodeId=r.parent AND ra.name="IsMatchTableColumnRanges_success" ;
Query Result:
+------+ | true | +------+ | true | +------+
Which operation executions were strictly necessary for the Image table to contain a particular (non-computed) value?
Sample answer: call of ReadCSVReadyFile, call of CreateEmptyLoadDB, 2nd call of ReadCSVFileColumnNames, 2nd call of LoadCSVFileIntoTable (2nd calls because Image is loaded in the 2nd iteration of the for loop, excluded checks because they do not change anything, excluded UpdatedComputedColumns because it is non-computed, excluded CompactDatabase because it does not affect the value).
SELECT inv.invocName + inv.invocCount FROM runitemView rv, dbValue dv, invokedAfterDep invD, invocation inv WHERE rv.name="ccdID" AND rv.dbValueId=dv.id AND dv.value="1" AND invD.toDbInvocId=rv.pIns AND inv.dbInvocId=invD.fromDbInvocId ;
Query Result:
+---------------------------+------------+ | invocName | invocCount | +---------------------------+------------+ | SetupTestInputs | 1 | | IsCSVReadyFileExists | 1 | | ReadCSVReadyFile | 1 | | CreateCSVTableFiles | 1 | | IsExistsCSVFile | 2 | | ReadCSVFileColumnNames | 2 | | IsMatchCSVFileColumnNames | 2 | | LoadCSVFileIntoTable | 2 | +---------------------------+------------+
Optional Queries ran over workflow execution "_J062941-halt8-IsMatchTableColumnRanges.trace_"
The workflow halts due to failing an IsMatchTableColumnRanges check. How many tables successfully loaded before the workflow halted due to a failed check?
Sample answer: 2
SELECT COUNT(*) FROM runAnnotation WHERE name="IsMatchTableColumnRanges_success" ;
Query Result:
+----------+ | COUNT(*) | +----------+ | 2 | +----------+
A CSV or header file is deleted during the workflow's execution. How much time expired between a successful IsMatchCSVFileTables test (when the file existed) and an unsuccessful IsExistsCSVFile? test (when the file had been deleted)?
Sample answer: 3ms
For testing the above query, we it may be simplest to edit the workflow to include deletion of the CSV file as a step.
SELECT ra2.value - ra1.value FROM runAnnotationView ra1, runAnnotationView ra2 WHERE ra1.name="IsMatchCSVFileTables_success" AND ra2.name="halt" ;
Query Result: Difference in time is in ms.
+-----------------------+ | ra2.value - ra1.value | +-----------------------+ | 4795 | +-----------------------+
Why is this entry in the database?
INVOCATION DEPENDENCY CLOSURE
SELECT inv.invocName , inv.invocCount FROM runitemView rv, dbValue dv, invokedAfterDep invD, invocation inv WHERE rv.name="ccdID" AND rv.dbValueId=dv.id AND dv.value="1" AND invD.toDbInvocId=rv.pIns AND inv.dbInvocId=invD.fromDbInvocId ;
Query Result:
+---------------------------+------------+ | invocName | invocCount | +---------------------------+------------+ | SetupTestInputs | 1 | | IsCSVReadyFileExists | 1 | | ReadCSVReadyFile | 1 | | CreateCSVTableFiles | 1 | | IsExistsCSVFile | 2 | | ReadCSVFileColumnNames | 2 | | IsMatchCSVFileColumnNames | 2 | | LoadCSVFileIntoTable | 2 | +---------------------------+------------+
DATA DEPENDENCY CLOSURE
SELECT dcv.depNodeId FROM runitemView rv, dbValue dv, runitemDepcView dcv WHERE rv.dbValueId=dv.id AND rv.name="ccdID" AND dv.value="1" AND dcv.nodeId=rv.nodeId ;
Query Result:
Empty set (for this specific query)
A user executes the workflow many times (say 5 times) over different sets of data (j062941, j062942, ... j062945). He wants to determine, which of the execution halted?
Sample answer: workflow_execution_id along with input_data_set_name (say j062941, j062942)
SELECT wfDefName FROM run r, runAnnotation ra WHERE ra.name="halt" AND ra.runId=r.id ;
Query Result:
+-----------------------------------------+ | wfDefName | +-----------------------------------------+ | J062941-halt4-IsMatchCSVFileColumnNames | | J062941-halt3-IsExistsCSVFile | | J062941-halt1-IsCSVReadyFileExists | | J062941-halt2-IsMatchCSVFileTables | | J062941-halt8-IsMatchTableColumnRanges | | J062941-halt7-IsMatchTableRowCount | | J062941-halt5-LoadCSVFileIntoTable | +-----------------------------------------+
Determine the step where halt occured?
Sample answer: 2nd call of ReadCSVFileColumnNames or 3rd call of LoadCSVFileIntoTable, etc.
SELECT inv.invocName , inv.invocCount FROM runitemView rv, runAnnotation ra, invocation inv WHERE ra.name="halt" AND ra.nodeId=rv.nodeId AND rv.pIns=inv.dbInvocId ;
Query Result:
+--------------------------+------------+ | invocName | invocCount | +--------------------------+------------+ | IsMatchTableColumnRanges | 3 | +--------------------------+------------+
Determine data and associated granularities of the data being processed, when halt occured?
Sample answer: data_id and associated granularities(column, row, table, file)
SELECT ra.name, ra.value FROM runCollData r1, runCollData r2, runAnnotationView ra WHERE r1.numLeft < r2.numLeft AND r1.numRight > r2.numRight AND r2.nodeId = (SELECT targetNodeId from runAnnotation WHERE name="halt") AND r1.nodeId = ra.targetNodeId ;
Query Result:
+-------------------------------+----------------------------------------------------------------------------------------------------------------------------+ | name | value | +-------------------------------+----------------------------------------------------------------------------------------------------------------------------+ | db | J062941-halt8-IsMatchTableColumnRanges--927058851_LoadDB | | file | /Users/manish/Desktop/pc3/datasets/J062941-halt8-IsMatchTableColumnRanges/P2_J062941_B001_P2fits0_20081115_P2Detection.csv | | sql_table | P2Detection | | LoadCSVFileIntoTable_success | 1240629827414 | | UpdateComputedColumns_success | 1240629828626 | | IsMatchTableRowCount_success | 1240629828635 | +-------------------------------+----------------------------------------------------------------------------------------------------------------------------+
Which steps were completed successfully before the halt occurred?
Sample answer: steps name along with call (or count) number, if steps were executed multiple times
SELECT inv.invocName , inv.invocCount FROM runitemView rv, runAnnotation ra, invocation inv, invokedAfterDep invD WHERE ra.name="halt" AND ra.nodeId=rv.nodeId AND invD.toDbInvocId=rv.pIns AND inv.dbInvocId=invD.fromDbInvocId ;
Query Result:
+---------------------------+------------+ | invocName | invocCount | +---------------------------+------------+ | SetupTestInputs | 1 | | IsCSVReadyFileExists | 1 | | ReadCSVReadyFile | 1 | | CreateCSVTableFiles | 1 | | IsExistsCSVFile | 3 | | ReadCSVFileColumnNames | 3 | | IsMatchCSVFileColumnNames | 3 | | LoadCSVFileIntoTable | 3 | | IsMatchTableColumnRanges | 3 | +---------------------------+------------+
For a workflow execution, determine the user inputs?
Sample answer: jobID, CVSRootPath?
SELECT r.nodeId, r.itemType, r.name, r.type, dv.value FROM runitemView r, invocation inv, dbValue dv WHERE r.pins = inv.dbInvocId AND inv.invocName="SetupTestInputs" AND dv.id=r.dbValueId ;
Query Result:
+--------+----------+----------+-------------+---------------------------------------------------------------------------+ | nodeId | itemType | name | type | value | +--------+----------+----------+-------------+---------------------------------------------------------------------------+ | 7327 | data | db_name | StringToken | J062941-halt8-IsMatchTableColumnRanges--927058851 | | 7328 | data | data_dir | StringToken | /Users/manish/Desktop/pc3/datasets/J062941-halt8-IsMatchTableColumnRanges | +--------+----------+----------+-------------+---------------------------------------------------------------------------+
For a workflow execution, determine steps that required user inputs?
Sample answer: IsCVSReadyFileExists?, CreateEmptyLoadDB?
SELECT inv2.invocName, inv2.invocCount, dv.value FROM runitemview r, invocation inv, dependencyView d, runitemView rv, dbValue dv, invocation inv2 WHERE r.pins = inv.dbInvocId AND inv.invocName="SetupTestInputs" AND d.nodeId=r.nodeId AND rv.pDep=d.pDep AND dv.id=r.dbvalueId AND rv.pIns=inv2.dbInvocId ;
Query Result:
+----------------------+------------+---------------------------------------------------------------------------+ | invocName | invocCount | value | +----------------------+------------+---------------------------------------------------------------------------+ | CreateEmptyLoadDB | 2 | J062941-halt8-IsMatchTableColumnRanges--927058851 | | IsCSVReadyFileExists | 1 | /Users/manish/Desktop/pc3/datasets/J062941-halt8-IsMatchTableColumnRanges | | ReadCSVReadyFile | 1 | /Users/manish/Desktop/pc3/datasets/J062941-halt8-IsMatchTableColumnRanges | +----------------------+------------+---------------------------------------------------------------------------+
_For a workflow execution, display the following provenance views: data dependency views, step dependency view?
Sample answer: either display these views in form of a directed graph, or as a pair of dependency relations between data or steps.
DATA DEPENDENCY VIEW (RELATIONS)
SELECT nodeId, depNodeId FROM runitemDepView ;
Query Result:
Dependency Relations between nodes
PROCESS DEPENDENCY VIEW (RELATIONS)
SELECT fromdbInvocId, todbInvocId FROM immedInvokedAfter ;
Query Result:
Dependency Relations between processes
-- ManishAnand - 25 Apr 2009
to top