gd_query

Performs queries over metadata or Matlab structures stored in the archive.

 

Syntax 

gd_query

qresults = gd_query(query)

qresults = gd_query(query,datasource)

qresults = gd_query(query,datasource,resultfields)

qresults = gd_query(query,datasource,resultfields,

                    orderby)

qresults = gd_query(query,datasource,resultfields,

                    orderby,resultlimit)

qresults = gd_query(query,datasource,resultfields,

                    resultlimit)

 

Description

gd_query with no input arguments starts the query GUI, a Graphical User Interface for querying metadata and structures which also allows hyperlink browsing between related data. See the Geodise Database Toolbox Tutorial for more details.

 

qresults = gd_query(query) sends a query string to the database requesting all file metadata that meets the criteria specified in the string. A query takes the form 'field = value', where = can be replaced by other comparison operators. More than one query condition can be included in the string using & to join them together. The function returns a cell array of metadata structures, one for each matching result. To view the query results, use function gd_display.

 

qresults = gd_query(query,datasource) sends a query string to the database requesting matching archived structures or metadata of a certain type, depending on the value of the datasource string. To query metadata setdatasource to ‘file’ (default), ‘varmeta’ (metadata about variables), ‘datagroup’ or ‘monitor’. A cell array of matching structures is returned, one for each result. To query variables stored in the database set datasource to ‘var’. In this case the function will return a cell array of matching variables. The only variables that can be queried in this way are structures, because they contain named fields that can be searched for.

 

qresults = gd_query(query,datasource,resultfields) sends a query string to the database as above but only returns selected fields for each matching result. The resultfields string can be one of the following:

1.      A comma separated list indicating which fields should be returned for each result, for example just the standard.ID fields. The default, *, returns all fields. Prefix with the keyword 'distinct' to remove duplicates from the results, e.g. 'distinct field1, field2'.

2.      An aggregate function (count, max, min, sum, avg) applied to a field, followed by an optional 'groupby' clause. For example, gd_query(query,'file','count(field1)') executes the query and returns the number of matching results which contain field1. The number of matching results for each different value of field2 can be returned with 'count(*) groupby field2'. Further details are given in the Input Arguments section.

 

qresults = gd_query(query,datasource,resultfields, orderby) sends a query to the database and sorts the results by one or more fields, specified as a comma separated list prefixed with the keyword 'orderby'. Use num() or str() on each field in the list to indicate whether to sort it numerically or alphabetically, e.g. 'orderby num(field1), str(field2)'.The default sort order is ascending; append the keyword 'desc' after a field to sort it in descending order, e.g. ‘orderby num(field1) desc’.

 

qresults = gd_query(query,datasource,resultfields, orderby, resultlimit) limits the number of results returned from a query. An integer value means return the top resultlimit values from the sorted results. An ordinal string of the form '1st' or '2nd' etc means return a specific result based on its position.

 

qresults=gd_query(query,datasource,resultfields, resultlimit) limits the number of results returned using resultlimit as above but without the overhead of sorting, which makes the operation quicker. This can be used to get a sample set of results when the order does not matter.

 

Input Arguments

query             A query takes the form 'field = value' where field is the name of a field in the archived metadata/variable structure, for example iterations or standard.ID  (dot notation is used to access the subfields of a structure). The value is an alphanumeric value the field should contain. The operator & (meaning ‘and’) can be used to specify more than one search condition.

 
The following operators can be used to compare fields with values:

 

=

Equal to

!=

Not equal to

Greater than

Less than

>=

Greater than or equal to

<=

Less than or equal to

like

Similar to

not like

Not similar to


Similarity matches with like and not like use the following wildcards:

_

Matches any single character.

%

Matches any string of any length (including 0).


For example, 'standard.localName like %dat%' will match strings containing the phrase ‘dat’, and 'model.name like _est%'
will match strings starting with any character followed by ‘est’ and then any string. To search for the characters _ and %, precede them with the \ escape character.


The operators do case sensitive comparison when used with string values. To make an operator case insensitive surround it with two # characters. For example, #=#, #!=#, #like#, #not like#.

Another wildcard, *, provides flexibility in describing the field path. For example, model.name  can be replaced by *.name for a less specific search.

 

                        In addition to user defined metadata fields, the following standard metadata fields can be queried:

standard.ID

ID that uniquely identifies a file, variable or datagroup.

standard.datagroupname

Name of datagroup. Only used when querying datagroups.

standard.localName

Name of a local file before it was archived.

standard.byteSize

Size in bytes of a file.

standard.format

Format of file (default is its extension).

standard.createDate

Date the file was created/modified.

standard.archiveDate

Date the file or variable was archived, or the datagroup was created.

standard.userID

ID of the user who archived the data or created the datagroup.

standard.comment

Comment about the file, variable or datagroup.

standard.version

User defined version number for the file, variable or datagroup.

standard.tree

String representing a user defined data hierarchy, similar to a directory path.

standard.files.fileID

Each file in a datagroup.

standard.vars.varID

Each variable in a datagroup.

standard.

subdatagroups.

datagroupID

Each subdatagroup in a datagroup.

standard.datagroups.

datagroupID

Each datagroup a file, variable or subdatagroup belongs to.

                       
Datagroups are collections that can contain files, variables or other datagroups, see gd_datagroup and gd_datagroupadd.

The fields in an archived structure variable can also be queried in conjunction with the standard metadata fields for that variable. However, this can be an expensive operation because two data sources, ‘var’and ‘varmeta’ (see below), are queried.

 

datasource  The data source indicates which type of data to query, and can be specified by one of the following strings (the default datasource value is 'file'):

 

‘file’

Metadata about files.

‘datagroup’

Metadata about datagroups.

‘monitor’

Metadata about monitorable datagroups.

‘varmeta’

Metadata about Matlab variables.

‘var’

Matlab variables.


A datagroup that was created with the ‘monitor’ flag can be queried as an ordinary datagroup, or as a collection of data about a computational job, by setting datasource to ‘monitor’. This provides a quick and easy query mechanism for finding a user’s most recent job, or the latest job meeting certain other metadata criteria. It is provided for convenience so that the user does not have to remember any particular field names, values, or what time the datagroup was created. In addition to standard.ID, standard.userID and user defined metadata, the following standard metadata can be used together with ‘monitor’ to query a job monitoring datagroup.

 

standard.jobIndex

Job index. Special query syntax jobIndex = max gets the highest index (most recent job).

standard.jobName

Name of job (same as datagroupname).

standard.startDate

Start date of job (when the datagroup was created).

 

resultfields
The resultfields string indicates a subset or summary of the query result fields that should be returned. The default, *, returns all fields. A comma separated list indicates particular fields that should be returned. This list can be prefixed with the 'distinct' keyword to remove duplicates from the results.


Alternatively the resultfields string contains an aggregate function (count, max, min, sum, avg) applied to one field, e.g. 'count(b)'. These aggregate functions return summary data from all the results matching a query as follows:

 

count

Total number of results.

'count(*)' counts all results

'count(b)' counts only results containing the field b.

'count(distinct b)' counts only results containing unique values of b.

max

Maximum value for a field.

'max(b)' or 'max(num(b))' treats all values as numbers.

'max(str(b))' treats all values as strings.

min

Minimum value for a field.

'min(b)' or 'min(num(b))' treats all values as numbers.

'min(str(b))' treats all values as strings.

sum

All values of a particular field added together.

avg

Average of all the values of a particular field.


By default an aggregate function is applied to a particular field over all the results. For example, 'avg(b)' returns a single result containing a number, the average of all b field values across the result set. It is also possible to apply an aggregate function to groups of values within the result set using the groupby keyword. For example, 'avg(b) groupby standard.format' returns the average value of b for each different file format.

 

Examples

Query file metadata to find files archived on or after 1st September 2004 where iterations = 9000. A datasource argument is not required because ‘file’ is the default.

 

q = 'standard.archiveDate>=2004-09-01 & iterations=9000';

qresults = gd_query(q)

 

qresults =

    [1x1 struct]    [1x1 struct]

 

disp(qresults{1});

 

      standard: [1x1 struct]

         model: [1x1 struct]

        params: [1 4.7000 5.3000]

    iterations: 9000

 

disp(qresults{1}.standard.archiveDate);

 

2004-09-03 15:25:45

 

See gd_display for an example of displaying the full contents of query results.

 

Query to find files which have a name field equal to ‘test_design’ in their metadata and only return the fields standard.ID and params, removing any duplicates.

 

q = '*.name = test_design';

qresults = gd_query(q,'file','distinct standard.ID, params');

 

disp(qresults{1})

 

    standard: [1x1 struct]

      params: [1 4.7000 5.3000]

 

Query to find datagroups with comments containing the text ‘experiment’.

 

q = 'standard.comment like %experiment%';

gd_query(q,'datagroup');

 

Query variable metadata to find the metadata for all variables that are in a particular datagroup.

 

q = 'standard.datagroups.datagroupID = dg_ce868f40-8ds0-45...';

gd_query(q,'varmeta');

 

Query variables to find structures where field width is between 9 and 14 inclusive.

 

gd_query('width >= 9 & width <= 14','var');


Find files that have a comment in their metadata, using "" (two double quotes) to indicate an empty value.

 

gd_query('standard.comment != ""');

 

Find the 10 most recent files archived by user ‘bob’. The default sort order is ascending so keyword ‘desc’ is used to list the dates in descending order.

 

q = 'standard.userID = bob';

gd_query(q,'file','*','orderby standard.archiveDate desc',10);

 

Find all the files in a particular datagroup and order the results first by iterations then by model.name. For custom metadata you must specify whether to sort numerically (num) or alphabetically (str).

 

q = 'standard.datagroups.datagroupID = dg_ce868f40-8ds0-45...';

gd_query(q,'file','*',...

         'orderby num(iterations), str(model.name)');

 

Find the total number of variables archived by user ‘bob’.

 

qresults = gd_query('standard.userID = bob','var','count(*)');

gd_display(qresults{1})

 

*** Content of the structure  ***

  standard.count: 150

 

Find the maximum value for iterations in file metadata archived by user ‘bob’.

 

q = 'standard.userID = bob';

qresults = gd_query(q,'file','max(iterations)');

gd_display(qresults{1})

 

*** Content of the structure  ***

  standard.max: 12000

 

Find the maximum value for model.name in file metadata archived by user ‘bob’. Find the maximum alphabetically (str) rather than numerically (default).

 

gd_query(q,'file','max(str(model.name))');

 

Count the number of files of each different format that user bob has archived.

 

r = gd_query(q,'file','count(*) groupby standard.format');

gd_display(r)

 

*** Content of structure r{1} (Total structures: 2) ***

  standard.format: dat

  standard.count: 73

Press ENTER to continue ..., q to quit:

 

*** Content of structure r{2} (Total structures: 2) ***

  standard.format: txt

  standard.count: 22

*** No more results. ***

 

Find the latest job monitoring datagroup then find the latest job monitoring datagroup which matches some other criteria.

 

m.modelver = 0.6; m2.modelver = 0.71;

gd_datagroup('design model job xyz',m,'monitor');

gd_datagroup('design model job abc',m,'monitor');

gd_datagroup('design model job 999',m2,'monitor');

 

r1 = gd_query('standard.jobIndex = max','monitor');

r1{1}.standard.jobName

 

ans =

design model job 999

 

r2 = gd_query('standard.jobIndex = max & modelver <= 0.6', 'monitor');

r2{1}.standard.jobName

 

ans =

design model job abc

 

Notes

When querying standard date information (archiveDate or createDate), specify the date/time using the International Standard Date and Time Notation (ISO 8601) which is: "YYYY-MM-DD hh:mm:ss" (hh:mm:ss is optional).

 

The sum and avg aggregate functions will only work on fields containing numerical data. Non-numerical custom metadata is ignored in the calculation and non-numerical standard metadata (e.g. standard.comment) throws an error.

 

Treating numerical data as strings when using orderby, max or min can lead to unexpected results. This can be illustrated in Matlab where sort({‘1’,’5’,’10’}) returns '1'    '10'    '5'. Always use the numerical syntax, e.g. orderby(num(a)) or max(a), when working with numbers in custom metadata.

 

In the current release orderby, groupby, distinct and the aggregate functions cannot be used on standard metadata about datagroup relationships (standard.datagroups, standard.subdatagroups, standard.files and standard.vars).

 

Only results for data you are authorised to access will be returned. Function gd_addusers can be used to grant access to others.

 

A valid proxy certificate is required to query the database (see gd_createproxy from the Geodise Compute Toolbox).

 

Your certificate subject must have been added to the authorisation database.

 

See also

gd_display, gd_createproxy, gd_archive, gd_retrieve, gd_archivefiles, gd_retrievefiles, gd_datagroup, gd_datagroupadd, gd_addusers

 



gd_markfordeletion

contents

gd_querydeleted

Copyright © 2007, The Geodise Project, University of Southampton