Note here any changes in your provenance representation, workflow enactment or system since the first challenge. Alternatively, if you did not participate in the first challenge, please provide the same details as were required for those who did (particularly workflow representation and provenance representation).
The CESNET implementation of the First Provenance Challenge relied
on an explicit representation of workflow structure, extracted from
the native workflow representation in gLite -- the dependencies among
DAG subjobs specified by the user at submission time.
These dependencies were decoded, recorded as ancestor
and successor
attributes of the DAG subjobs, and used for the query implementation.
This restriction is relaxed in the Second Challenge.
Instead, the dependence between two workflow processes is inherited from data:
process A is marked as an ancestor
of B (and, vice versa, B as a successor
of A) if there is a data file F that is an output of A and an input of B.
Logical filenames are considered for this purpose
(the name
attribute of the file
elements in the format definition below), not
physical filenames (the content of the url
elements).
For the purpose of the challenge we implement this process in
an external "sew" script.
The script is seeded with one or more process identifiers,
queries JP recursively,
and traverses data dependencies (common input/output files)
in both directions until the complete graph closure is found.
The discovered dependencies are recorded with the processes
as the ancestor
and successor
attributes of the First Challenge;
the implementation of the challenge queries therefore remains unchanged in this respect.
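For illustration, the closure computation performed by the script can be sketched as follows. This is a minimal sketch only: the get_job and find_jobs callbacks are hypothetical stand-ins for the real JP query interface (not shown here); only the traversal logic corresponds to the description above.

```python
# Minimal sketch of the closure computed by the "sew" script.
# get_job(id) -> {"inputs": set, "outputs": set} of logical file names, and
# find_jobs(direction, filename) -> ids of jobs having that file as input
# resp. output, are hypothetical wrappers around the actual JP queries.
from collections import deque

def sew(seed_job_ids, get_job, find_jobs):
    """Traverse data dependencies in both directions until the closure is found."""
    ancestors, successors = {}, {}
    seen = set()
    queue = deque(seed_job_ids)
    while queue:
        jid = queue.popleft()
        if jid in seen:
            continue
        seen.add(jid)
        job = get_job(jid)
        ancestors.setdefault(jid, set())
        successors.setdefault(jid, set())
        for f in job["inputs"]:                      # who produced my inputs?
            for producer in find_jobs("output", f):
                ancestors[jid].add(producer)
                successors.setdefault(producer, set()).add(jid)
                queue.append(producer)
        for f in job["outputs"]:                     # who consumes my outputs?
            for consumer in find_jobs("input", f):
                successors[jid].add(consumer)
                ancestors.setdefault(consumer, set()).add(jid)
                queue.append(consumer)
    # the resulting sets are then recorded back into JP as job attributes
    return ancestors, successors
```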
Currently the script is invoked on demand. However, it can be turned into a part of the JP infrastructure -- an agent that subscribes to notifications on input/output file assignments to processes and generates the workflow dependencies automatically. The mechanism for generating such notifications is already available in JP; it is used in the communication between the JP Primary storage and the JP Index server.
The implementation of the queries remains unchanged from the first challenge, except for the small adaptations described in the next paragraphs.
The First Challenge query scripts used hardcoded executable names. This was not a problem then, as the names matched exactly the values recorded by our implementation of the workflow.
However, the naming varies among the teams, e.g. it may or may not contain an absolute path to the executable. Therefore the scripts had to be parametrized so that they can be run with the names appropriate for the particular data source.
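As a hedged illustration of this parametrization (the actual query scripts are not shown here; the function below and the annotations_by_job structure are our own), a matcher can accept the executable name as an argument and tolerate values with or without a leading path:

```python
# Illustration only: match a program name against the values recorded per job,
# where values may or may not carry an absolute path. annotations_by_job is
# assumed to map job id -> {annotation name: [values]}, with the program
# recorded under IPAW_PROGRAM as in the format example shown below.
def jobs_running(annotations_by_job, program):
    """Return ids of jobs whose recorded program matches `program`."""
    matching = []
    for jid, ann in annotations_by_job.items():
        for value in ann.get("IPAW_PROGRAM", []):
            # "align_warp" and "/usr/local/bin/align_warp" should both match
            if value == program or value.rsplit("/", 1)[-1] == program:
                matching.append(jid)
                break
    return matching
```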
JP starts gathering data on a job virtually at the same time the job is submitted to the Grid. Therefore, during the First Challenge, we could use the times of job registration with JP to approximate the job run time quite accurately. (Queries on the exact execution time were not implemented in JP at that time.)
This is no longer true in the Second Challenge: the job is registered with JP when the data are imported, i.e. typically much later than its real execution.
The query scripts were adjusted to use the true execution time.
Give links here to your provenance data files for the workflow parts of the challenge: three parts for the original workflow and three parts for the modified workflow (as per provenance query 7). The data files could be attached to the results page.
The format is custom-made specifically for the Challenge in order to facilitate data exchange with the other teams; nevertheless, it is a full-featured export format from Job Provenance:
An export utility used to generate the exchange files with JP queries is available here.
Here we show an example of the data format. This example was hand-edited for the sake of better readability.
<?xml version="1.0"?>
<workflow xmlns="http://egee.cesnet.cz/en/Schema/JP/Challenge2">
  <exportedStages>1 2</exportedStages>
  <job id="https://skurut1.cesnet.cz:9000/yM3sz8v6WCIPgi5-0m8L4w">
    <owner>/DC=cz/DC=cesnet-ca/O=Masaryk University/CN=Ales Krenek</owner>
    <regtime>2006-07-11T12:22:34</regtime>
    <!-- input and output files of this job -->
    <inputs>
      <file name="urn:challenge:anatomy1.img">
        <url>gsiftp://umbar.ics.muni.cz:1414/home/mulac/pch06/anatomy1.img</url>
        <url>gsiftp://umbar.ics.muni.cz:1414/home/mulac/pch06/anatomy1.hdr</url>
      </file>
    </inputs>
    <outputs>
      <file name="urn:challenge:anatomy1_yM3sz8v6WCIPgi5-0m8L4w.warp">
        <url>gsiftp://umbar.ics.muni.cz:1414/home/mulac/pch06/anatomy1_yM3sz8v6WCIPgi5-0m8L4w.warp</url>
      </file>
    </outputs>
    <!-- workflow structure: jobs that precede and follow this one in the workflow -->
    <ancestors>
      <!-- empty for stage 1 -->
    </ancestors>
    <successors>
      <!-- note the reference to the other job below -->
      <jobid>https://skurut1.cesnet.cz:9000/wdWQHL0-RXkd3VeNcSrTaw</jobid>
    </successors>
    <!-- gLite middleware processing and job execution details -->
    <gliteJobRecord>
      <!-- omitted for readability -->
    </gliteJobRecord>
    <!-- user annotations, including Challenge-specific; only the latter are shown -->
    <annotations>
      <annotation>
        <name>http://egee.cesnet.cz/en/WSDL/jp-lbtag:IPAW_STAGE</name>
        <value>1</value>
      </annotation>
      <annotation>
        <name>http://egee.cesnet.cz/en/WSDL/jp-lbtag:IPAW_PROGRAM</name>
        <value>align_warp</value>
      </annotation>
      <annotation>
        <name>http://egee.cesnet.cz/en/WSDL/jp-lbtag:IPAW_PARAM</name>
        <value>-m 12</value>
      </annotation>
      <annotation>
        <name>http://egee.cesnet.cz/en/WSDL/jp-lbtag:IPAW_PARAM</name>
        <value>-q</value>
      </annotation>
      <annotation>
        <name>http://egee.cesnet.cz/en/WSDL/jp-lbtag:IPAW_HEADER</name>
        <value>global_maximum=4095</value>
      </annotation>
    </annotations>
  </job>
  <job id="https://skurut1.cesnet.cz:9000/wdWQHL0-RXkd3VeNcSrTaw">
    <!-- another job in the workflow, omitted -->
  </job>
  <!-- further jobs follow -->
</workflow>
The root element of the file is workflow, corresponding to an entire exported workflow or to its parts as given by the Challenge definition. The stages present in this file are listed in exportedStages.
Further second-level elements are job elements, representing the individual processes in the workflow. Each one is assigned a unique ID already when processed by the gLite middleware.
Besides the general metadata (owner and registration time), the data are organized in the following sections:
Inputs and outputs
file elements refer to the concrete inputs and outputs of the job.
The name attribute is a URI identifying the particular file uniquely.
As we did not follow any given file naming scheme in Challenge 1,
custom urn: identifiers are shown in the example; however, any suitable
file identifier can be used instead.
The name of the input file of the job shown above has no suffix, as it is an input of the entire workflow and only a single set of inputs was given. In contrast, the output file name contains a unique suffix, indicating that this file was generated by a particular workflow run.
As some of the files in the Challenge workflow are in fact collections of files (.img and .hdr pairs),
we use nested url elements (which may occur multiple times) to also record the physical file locations.
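As a minimal sketch of how this part of the format can be read, the snippet below extracts the logical names and physical URLs of each job's inputs and outputs with Python's standard xml.etree library. The element and attribute names are exactly those of the example above; the read_files helper and the shape of the returned dictionary are our own choice, not part of the format.

```python
# Sketch of reading the per-job file records from the Challenge2 export format,
# using only the elements shown in the example (workflow, job, inputs, outputs,
# file, url).
import xml.etree.ElementTree as ET

NS = {"c2": "http://egee.cesnet.cz/en/Schema/JP/Challenge2"}

def read_files(path):
    """Return {job id: {"inputs": {logical name: [urls]}, "outputs": {...}}}."""
    root = ET.parse(path).getroot()
    jobs = {}
    for job in root.findall("c2:job", NS):
        record = {"inputs": {}, "outputs": {}}
        for section in ("inputs", "outputs"):
            for f in job.findall(f"c2:{section}/c2:file", NS):
                urls = [u.text for u in f.findall("c2:url", NS)]
                record[section][f.get("name")] = urls
        jobs[job.get("id")] = record
    return jobs
```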
Workflow structure
The structure of the workflow is expressed by links between job elements, using
their unique identifiers, grouped in ancestors
and successors.
These links are present in the exported format regardless of whether or not their
targets are exported in this part of the workflow.
The links are sufficient to "stitch" together separately exported workflow parts in a unique and reliable way. However, if they are not available explicitly, they can still be reconstructed by matching inputs and outputs of the jobs.
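The following sketch illustrates such a reconstruction, working on the dictionary produced by the read_files helper sketched in the previous section. It is an illustration of the matching rule described in the text, not the code of the "sew" script itself.

```python
# Sketch: rebuild ancestor/successor links purely from logical file names.
def reconstruct_links(jobs):
    """jobs: {job id: {"inputs": {name: urls}, "outputs": {name: urls}}}."""
    producers = {}                                # logical name -> producing jobs
    for jid, rec in jobs.items():
        for name in rec["outputs"]:
            producers.setdefault(name, set()).add(jid)
    ancestors = {jid: set() for jid in jobs}
    successors = {jid: set() for jid in jobs}
    for jid, rec in jobs.items():
        for name in rec["inputs"]:
            for producer in producers.get(name, ()):
                ancestors[jid].add(producer)      # producer precedes this job
                successors[producer].add(jid)
    return ancestors, successors
```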
Job processing details
gliteJobRecord contains details on the processing of the job in the gLite middleware.
It conforms to the schema originally
defined for the purpose of computing job statistics in the EGEE project.
These data are virtually irrelevant for the Challenge and are therefore omitted in this example; however, they are present in the full exported data below.
The contained elements are either described within the schema or are self-explanatory.
User annotations
JP allows the user to add arbitrary "namespace:name = value" annotations to a job, where "value" can have an arbitrarily complex XML structure. The same "name" can also occur multiple times. The annotations can be added either during job execution (usually via L&B, the gLite service that tracks the job during its active life) or later via the native JP interface.
The annotations of particular interest for the Challenge are shown above. They correspond to the tags recorded and described in Challenge 1, with the exception of IPAW_INPUT and IPAW_OUTPUT, which are mapped specifically in this format.
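As an illustration, the Challenge annotations can be collected from the export format as sketched below. The namespace and element names are those of the example above; the read_annotations helper and the shortening of names by stripping the jp-lbtag prefix are our own assumptions.

```python
# Sketch: collect the Challenge annotations (IPAW_PROGRAM, IPAW_PARAM,
# IPAW_HEADER, ...) per job from the Challenge2 export format.
import xml.etree.ElementTree as ET

NS = {"c2": "http://egee.cesnet.cz/en/Schema/JP/Challenge2"}
TAG_PREFIX = "http://egee.cesnet.cz/en/WSDL/jp-lbtag:"

def read_annotations(path):
    """Return {job id: {short annotation name: [values]}} -- names may repeat."""
    root = ET.parse(path).getroot()
    result = {}
    for job in root.findall("c2:job", NS):
        annotations = {}
        for a in job.findall("c2:annotations/c2:annotation", NS):
            name = a.findtext("c2:name", default="", namespaces=NS)
            value = a.findtext("c2:value", default="", namespaces=NS)
            short = name[len(TAG_PREFIX):] if name.startswith(TAG_PREFIX) else name
            annotations.setdefault(short, []).append(value)
        result[job.get("id")] = annotations
    return result
```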
Modified workflow
Not addressed in this challenge.
In order to gain a better understanding of the issues of translation between the provenance data models, we extend the challenge specification into two stages:
softmean process (part 2) to match outputs and inputs of parts 1 and 3
Steps 2--4 are rather artificial and serve the purpose of the challenge only.
Unification of the names of softmean
inputs/outputs is necessary to trigger the inheriting of
dependencies. If all the provenance systems had gathered data on the same workflow execution,
the matching filenames in all parts of the workflow would have been the same anyway.
Similarly, adding a unique suffix to all filenames allows us to run multiple imports on the same input data without the need to purge the JP database between attempts. The same holds for assigning new unique IDs to the imported processes in step 4.
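A small illustration of this renaming, assuming the dictionary shape used in the sketches above; this is not the actual import code, only a sketch of the idea.

```python
# Sketch: append a per-import suffix to logical file names and job ids so that
# repeated imports of the same data do not clash in the JP database.
import uuid

def add_import_suffix(jobs, suffix=None):
    """jobs: {job id: {"inputs": {name: urls}, "outputs": {name: urls}}}."""
    suffix = suffix or uuid.uuid4().hex[:8]

    def rename(name):
        return f"{name}_{suffix}"

    return {
        rename(jid): {
            section: {rename(name): urls for name, urls in files.items()}
            for section, files in record.items()
        }
        for jid, record in jobs.items()
    }
```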
Step 6, as a side effect, produces a graph representation of the imported data. These graphs are shown in the results section below.
Provenance Query summary:
- align_warp parameters
- global maximum parameter
- align_warp parameters
- studyModality annotation
TODO:
More complicated due to duplicated arcs. This is caused by using different logical names for the .img and .hdr pairs of files (unlike the CESNET format, which groups them together under a single logical name). Otherwise the graph matches expectations exactly.
Provenance Query summary:
- global maximum parameter
- studyModality annotation is present, should be doable
TODO: more comments on Q9
The graph contains a number of "producer" nodes (see Translation Details below); a manually adjusted version (with these nodes removed) meets the expectation.
Provenance Query summary:
- align_warp parameters are present but not processed by our translator
- global maximum parameter may be present in the j.0:global tag; however, the name is not unique, so the translator can't rely on it
- align_warp parameters are present but not processed by our translator
- studyModality annotation is present, should be doable
The graph contains the first row of "producer" jobs; otherwise it matches expectations.
Provenance Query summary:
- studyModality annotation
Provenance Query summary:
- ipaw_header missing.
- studyModality annotation
Most of the challenge queries are affected by the availability of data in a particular part of the workflow. Therefore, in general, the results of heterogeneous queries follow the results of the homogeneous queries on the involved provenance systems.
In particular:
- align_warp parameters: follow the results of workflow part 1
- global maximum parameter: workflow part 1 again
- studyModality annotation: part 3
Provenance Query summary:
- studyModality annotation missing in SDG data
Provenance Query summary:
- ipaw_param not present in ES3
- ipaw_head not present in ES3
- ipaw_param not present in ES3
- studyModality annotation missing in SDG data
The graph contains a number of "producer" nodes from MyGrid.
Provenance Query summary:
- ipaw_param not present in MyGrid
- ipaw_head not present in MyGrid
- ipaw_param not present in MyGrid
- studyModality annotation missing in SDG data
Provenance Query summary:
- ipaw_head not present in Karma
- studyModality annotation missing in MINDSWAP data
Describe details regarding how data models were translated (or otherwise used to answer the query following the team's approach), any data which was absent from a downloaded model, and whether this affected the possibility of translation or successful provenance query, and any data which was excluded in translation from a downloaded model because it was extraneous
The sections below briefly describe issues that arose from translating the particular provenance systems' data and importing them into JP. The list is not complete with respect to all the participating teams: we were not able to put the necessary effort into evaluating all of them, so we chose a more or less random sample, based on a very subjective and brief look at the provided data. Therefore we are not able to provide any serious assessment of the data formats of the systems not listed in this section.
Our CVS repository is organized as follows:
export/: JP export and import utilities, the "sew" script for inheriting the dependencies, and common code for the automated translations
JP assigns a job owner (the X509 certificate subject) to each process. There seems to be no analogue in the other formats; therefore we supplied the value as a parameter of the translators.
Most of the formats do not explicitly include information on the part of the workflow (which matches the notion of a stage in our format). This was also supplied as an additional parameter of the translator.
reslice outputs are not the same as softmean inputs). We believe this to be an artifact of the challenge data rather than a feature of the system, though, and we fixed the problem by manually renaming the files accordingly.
align_warp seem to be defined according to the Challenge 1 example; however, these data are missing in Challenge 2.
global maximum parameter and studyModality annotation are not supported; therefore queries 5 and 9 can't be run.
align_warp parameters and global maximum are present in the format; however, their naming is ambiguous (the key of the parameter is String Value, and the global maximum seems to be encoded in Ontology:4095), according to our understanding. Therefore we could not extract them from the format.
global maximum is missing, making query 5 impossible
workflowNodeID and serviceID, believing it to be sufficiently unique.
stage is missing; we supply its value as a parameter of the translator.
global maximum is missing, making query 5 impossible.
align_warp jobs -- "-m 12" is stored as "-m -12".
Describe your proposed benchmark queries, how the comparable quantities are determined, and the results of applying the benchmark to your own system
On Fri, 22 Jun 2007, Simon Miles wrote: There is nothing particular to prepare for this prior to the workshop, though having thought about possible suitable scenarios or queries that would make suitable benchmarks would be welcome when we come to discuss it.
Provide here further comments.
Provide here your conclusions on the challenge, and issues that you like to see discussed at a face to face meeting.
TODO (ljocha)
-- SimonMiles - 26 Oct 2006
-- AlesKrenek - 19 Feb 2007
| Attachment | Action | Size | Date | Who | Comment |
|---|---|---|---|---|---|
| out1.xml | manage | 30.3 K | 20 Feb 2007 - 21:41 | AlesKrenek | Original workflow, part1 |
| out2.xml | manage | 4.9 K | 20 Feb 2007 - 21:45 | AlesKrenek | Original workflow, part2 |
| out3.xml | manage | 19.2 K | 20 Feb 2007 - 21:46 | AlesKrenek | Original workflow, part3 |
| es3.ps | manage | 16.1 K | 22 Jun 2007 - 11:06 | AlesKrenek | ES3 import graph |
| es3-q1.log | manage | 6.5 K | 22 Jun 2007 - 11:34 | AlesKrenek | Query #1 results |
| es3-q2.log | manage | 2.0 K | 22 Jun 2007 - 11:35 | AlesKrenek | Query #2 results |
| es3-q3.log | manage | 1.9 K | 22 Jun 2007 - 11:35 | AlesKrenek | Query #3 results |
| karma.ps | manage | 20.5 K | 22 Jun 2007 - 11:43 | AlesKrenek | Karma import graph |
| karma-q1.log | manage | 7.0 K | 22 Jun 2007 - 11:44 | AlesKrenek | |
| karma-q2.log | manage | 2.2 K | 22 Jun 2007 - 11:44 | AlesKrenek | |
| karma-q3.log | manage | 2.2 K | 22 Jun 2007 - 11:44 | AlesKrenek | |
| karma-q4.log | manage | 3.1 K | 22 Jun 2007 - 11:44 | AlesKrenek | |
| karma-q6.log | manage | 7.6 K | 22 Jun 2007 - 11:44 | AlesKrenek | |
| mygrid.ps | manage | 44.0 K | 22 Jun 2007 - 12:08 | AlesKrenek | MyGrid import graph |
| mygrid2.ps | manage | 21.4 K | 22 Jun 2007 - 12:20 | AlesKrenek | |
| mygrid-q1.log | manage | 15.8 K | 22 Jun 2007 - 12:29 | AlesKrenek | |
| mygrid-q2.log | manage | 3.4 K | 22 Jun 2007 - 12:29 | AlesKrenek | |
| mygrid-q3.log | manage | 5.4 K | 22 Jun 2007 - 12:29 | AlesKrenek | |
| sdg.ps | manage | 28.5 K | 22 Jun 2007 - 12:50 | AlesKrenek | |
| sdg-q1.log | manage | 8.0 K | 22 Jun 2007 - 13:02 | AlesKrenek | |
| sdg-q2.log | manage | 1.3 K | 22 Jun 2007 - 13:02 | AlesKrenek | |
| sdg-q3.log | manage | 1.3 K | 22 Jun 2007 - 13:02 | AlesKrenek | |
| sdg-q4.log | manage | 1.8 K | 22 Jun 2007 - 13:02 | AlesKrenek | |
| sdg-q5.log | manage | 1.3 K | 22 Jun 2007 - 13:03 | AlesKrenek | |
| sdg-q6.log | manage | 0.6 K | 22 Jun 2007 - 13:03 | AlesKrenek | |
| cks.ps | manage | 15.4 K | 22 Jun 2007 - 13:07 | AlesKrenek | CESNET-Karma-SDG import |
| cks-q1.log | manage | 5.6 K | 22 Jun 2007 - 13:08 | AlesKrenek | |
| cks-q2.log | manage | 1.9 K | 22 Jun 2007 - 13:08 | AlesKrenek | |
| cks-q3.log | manage | 3.9 K | 22 Jun 2007 - 13:08 | AlesKrenek | |
| cks-q4.log | manage | 13.4 K | 22 Jun 2007 - 13:08 | AlesKrenek | |
| cks-q5.log | manage | 1.4 K | 22 Jun 2007 - 13:09 | AlesKrenek | |
| cks-q6.log | manage | 1.1 K | 22 Jun 2007 - 13:09 | AlesKrenek | |
| ems-q1.log | manage | 9.1 K | 25 Jun 2007 - 12:34 | JiriSitera | es3-mygrid-sdg2 query 1 |
| ems-q2.log | manage | 2.0 K | 25 Jun 2007 - 12:36 | JiriSitera | es3-mygrid-sdg2 query 2 |
| ems-q3.log | manage | 4.0 K | 25 Jun 2007 - 12:36 | JiriSitera | es3-mygrid-sdg2 query 3 |
| mes-q1.log | manage | 12.8 K | 25 Jun 2007 - 12:50 | JiriSitera | mygrid-es3-sdg2 query 1 |
| mes-q2.log | manage | 2.0 K | 25 Jun 2007 - 12:51 | JiriSitera | mygrid-es3-sdg2 query 2 |
| mes-q3.log | manage | 2.0 K | 25 Jun 2007 - 12:51 | JiriSitera | mygrid-es3-sdg2 query 3 |
| ksm-q1.log | manage | 7.3 K | 25 Jun 2007 - 13:04 | JiriSitera | karma-sdg2-mindswap2 query 1 |
| ksm-q2.log | manage | 1.9 K | 25 Jun 2007 - 13:05 | JiriSitera | karma-sdg2-mindswap2 query 2 |
| ksm-q3.log | manage | 0.9 K | 25 Jun 2007 - 13:05 | JiriSitera | karma-sdg2-mindswap2 query 3 |
| ksm-q4.log | manage | 15.4 K | 25 Jun 2007 - 13:05 | JiriSitera | karma-sdg2-mindswap2 query 4 |
| ksm-q6.log | manage | 1.0 K | 25 Jun 2007 - 13:06 | JiriSitera | karma-sdg2-mindswap2 query 6 |
| ems.ps | manage | 21.5 K | 25 Jun 2007 - 13:10 | JiriSitera | |
| mes.ps | manage | 31.5 K | 25 Jun 2007 - 13:10 | JiriSitera | |
| ksm.ps | manage | 15.8 K | 25 Jun 2007 - 13:11 | JiriSitera | |
| mindswap-q1.log | manage | 6.3 K | 25 Jun 2007 - 13:14 | JiriSitera | |
| mindswap-q2.log | manage | 1.8 K | 25 Jun 2007 - 13:14 | JiriSitera | |
| mindswap-q3.log | manage | 2.9 K | 25 Jun 2007 - 13:15 | JiriSitera | |
| mindswap-q4.log | manage | 15.2 K | 25 Jun 2007 - 13:15 | JiriSitera | |
| mindswap-q6.log | manage | 2.3 K | 25 Jun 2007 - 13:15 | JiriSitera | |
| mindswap.ps | manage | 17.3 K | 25 Jun 2007 - 13:16 | JiriSitera | |