D5.1orig-2 Data Management Subsystem

Warning: this is a copy of the D5.1 document as originally submitted to the EC.

To contribute please use: Data Management Subsystem.

This subsystem is responsible for managing all the data that forms part of the eChase system.

This module is used by the Application subsystem for all the application flows. It is also used by the Presentation subsystem for the administration part of the eChase system.

This subsystem is based on the following modules:

  • Metadata Repository: manages and provides access to the core eChase system metadata.
  • Users Repository: contains all the data that represents the configured users of the eChase system.
  • Users Aggregated Data: gives users the ability to store aggregated data derived from search results.
  • Log Repository: contains information about the main events occurring in the system.
  • Accounting Repository: keeps track of the business transactions that take place in eChase and that involve the licensing of eChase content.
  • Media Repository: comprises a database containing information about the media, and file system storage of the media on a server machine.

User repository

Overview

The Users Repository deals with the representation, storage and management of identity and profiling information, and provides an API for accessing it. Policy and profiling information governing access to and use of information in the system is stored here as well.

This module allows the modules that use it to abstract from the physical representation of the eChase user identities (e.g. a particular database or directory implementation).

The eChase user repository is implemented as an LDAP directory.

A directory stores information that is read often by a large group of users but modified infrequently by a much smaller group of administrators.

LDAP is a widespread standard for directory access that runs directly on the TCP/IP stack.

Directories define user and resource hierarchies, assist domain-based user management, and provide distributed services such as naming, location, and security.

The directory will hold the following data:

  • Users
    • User identification token
  • Roles
  • Resources
  • User Profiles
    • Profile identification token

Each type of data will be described by a schema.

A precomputed index is built to support fast access to the data, so that all queries achieve an adequate response time.
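
The abstraction from the physical representation described above can be sketched as follows. This is only an illustration: the names UserRepository, InMemoryUserRepository and User, and the record fields, are hypothetical and do not come from the eChase specification. An LDAP-backed class would implement the same interface, so client modules would not need to change.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class User:
    """Hypothetical user record; the field names are illustrative only."""
    token: str                                   # user identification token
    roles: List[str] = field(default_factory=list)
    profile_token: Optional[str] = None          # profile identification token


class UserRepository(ABC):
    """Interface seen by client modules, independent of the storage used."""

    @abstractmethod
    def add(self, user: User) -> None: ...

    @abstractmethod
    def get(self, token: str) -> Optional[User]: ...

    @abstractmethod
    def users_with_role(self, role: str) -> List[User]: ...


class InMemoryUserRepository(UserRepository):
    """Stand-in backend used here in place of a real directory."""

    def __init__(self) -> None:
        self._by_token: Dict[str, User] = {}
        # Index from role to user tokens, maintained on insertion so that
        # role lookups do not have to scan every user.
        self._by_role: Dict[str, List[str]] = {}

    def add(self, user: User) -> None:
        if user.token in self._by_token:
            raise ValueError(f"user {user.token!r} already exists")
        self._by_token[user.token] = user
        for role in user.roles:
            self._by_role.setdefault(role, []).append(user.token)

    def get(self, token: str) -> Optional[User]:
        return self._by_token.get(token)

    def users_with_role(self, role: str) -> List[User]:
        return [self._by_token[t] for t in self._by_role.get(role, [])]
```

The role index in this sketch plays the part of the precomputed index mentioned above, on a much smaller scale.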

Accounting and Log repository

Overview

The design of an Accounting and Log Repository can be approached in many different ways, depending on which information, and which interactions among it, are judged most significant during detailed analysis.

This chapter summarizes some related best practices, with particular reference to recording log activity in a relational database (RDBMS).

Recall that:

The Log engine module provides the mechanism to track how information in the repository is created, modified and used.

And:

This module may be based on the customization of an open source module (e.g. Log4J).

It provides an API to trace and record important information at different levels of severity:

  1. Error: information about situations that prevent the normal behavior of the system.
  2. Warning: information about problems that are abnormal but do not prevent the system from working.
  3. Information: information related to important system events.
  4. Verbose: detailed information that is very useful for system troubleshooting.

We could imagine an agent that provides the mechanism to track every single action performed in the system (internal/external).

Naturally, the level of tracking detail may be tuned by assigning a different level of importance (and hence a different tracking level) to a particular software module, class or function/subroutine.

The entire process, however, must be refined during the detailed low-level design phase to reach the desired tracking granularity.
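
The four levels and the per-module tuning described above can be illustrated with Python's standard logging module, used here only as a stand-in for a Log4J-style library. The mapping of level names, the logger names ("echase", "echase.media") and the in-memory handler are assumptions made for this sketch, not part of the specification.

```python
import logging

# Assumed mapping of the four levels above onto Python's standard levels.
LEVELS = {
    "Error": logging.ERROR,
    "Warning": logging.WARNING,
    "Information": logging.INFO,
    "Verbose": logging.DEBUG,
}


class ListHandler(logging.Handler):
    """Collects records in memory; a real handler would write to the Log Repository."""

    def __init__(self) -> None:
        super().__init__()
        self.records = []

    def emit(self, record: logging.LogRecord) -> None:
        self.records.append(
            f"{record.levelname}:{record.name}:{record.getMessage()}"
        )


def configure(handler: logging.Handler) -> None:
    """Set a default tracking level and a finer one for a single module."""
    root = logging.getLogger("echase")        # hypothetical logger hierarchy
    root.handlers = [handler]
    root.propagate = False                    # keep records out of the global root logger
    root.setLevel(LEVELS["Warning"])          # default for every module
    # One module is tuned to a more detailed level than the default.
    logging.getLogger("echase.media").setLevel(LEVELS["Verbose"])
```

With this configuration a module logging through "echase.search" would record only warnings and errors, while "echase.media" would also record verbose messages.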

A typical start-up scenario

In the initial phase of implementing this service, one can imagine working with a reduced set of database tables in which to store the chosen data.

One possible (minimal) physical DB schema is shown in the following picture:

Example of Minimal DB Schema

The previous picture contains three entities:

  1. A personal data (users) table
  2. An error data repository
  3. An activities data repository

Each tracking table is logically linked to the master table through an idClient key.

The two service tables have a similar record structure, and their fields have the following meaning:

  1. idClient: foreign key referencing the tClient table
  2. ErrorId: primary key of the tErrors table
  3. CallingObject: name of the software module that threw the exception
  4. CallingUser: detailed description of the user that called this module
  5. ErrorDate: timestamp of the error
  6. ErrorDescription: an extensive, detailed description of the error thrown

As noted above, this simple schema can serve as a starting point and be extended to hold the additional tracking data that eChase requires.
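
A minimal sketch of such a schema is given below, using SQLite from Python purely as a convenient stand-in for whichever RDBMS is eventually chosen. The table names tClient and tErrors and the tErrors columns follow the text; the column types, the Name column, and the whole of tActivities (assumed to mirror tErrors) are assumptions for illustration only.

```python
import sqlite3
from typing import Optional

SCHEMA = """
CREATE TABLE tClient (
    idClient INTEGER PRIMARY KEY,
    Name     TEXT NOT NULL              -- assumed personal-data column
);
CREATE TABLE tErrors (
    ErrorId          INTEGER PRIMARY KEY,
    idClient         INTEGER NOT NULL REFERENCES tClient(idClient),
    CallingObject    TEXT NOT NULL,
    CallingUser      TEXT,
    ErrorDate        TEXT NOT NULL,     -- ISO 8601 timestamp
    ErrorDescription TEXT
);
CREATE TABLE tActivities (              -- assumed to mirror tErrors
    ActivityId          INTEGER PRIMARY KEY,
    idClient            INTEGER NOT NULL REFERENCES tClient(idClient),
    CallingObject       TEXT NOT NULL,
    CallingUser         TEXT,
    ActivityDate        TEXT NOT NULL,
    ActivityDescription TEXT
);
"""


def open_repository(path: str = ":memory:") -> sqlite3.Connection:
    """Create the three tables in a new database and return the connection."""
    conn = sqlite3.connect(path)
    # SQLite only enforces foreign keys when this pragma is enabled.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def log_error(conn: sqlite3.Connection, id_client: int, calling_object: str,
              calling_user: Optional[str], when: str, description: str) -> int:
    """Insert one error record and return its ErrorId."""
    cur = conn.execute(
        "INSERT INTO tErrors"
        " (idClient, CallingObject, CallingUser, ErrorDate, ErrorDescription)"
        " VALUES (?, ?, ?, ?, ?)",
        (id_client, calling_object, calling_user, when, description),
    )
    conn.commit()
    return cur.lastrowid
```

Because both tracking tables reference tClient through idClient, an error or activity cannot be recorded for a client that does not exist.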

Media repository

Overview

The term "databases" describes the core unified database containing all metadata, together with the media repository. The media repository itself will comprise a database containing information about the media, and file system storage of the media on a server machine. The structure of the unified database will be heavily influenced by the CIDOC CRM structure, since this is how we propose to expose data to applications. Additionally, an RDF export of the CRM data stored in the unified database may be created for enhanced semantic searching.

The results of the mapping and import translation modules are stored in a MySQL database that is structured with a schema created with the requirements of eCHASE in mind.

The design of this schema will be strongly influenced by the CIDOC Conceptual Reference Model (CRM), a reference model for the interchange of information in the cultural heritage domain. More details on the CRM are described in the technological baseline document. Although the CRM is an ontological model, it can inform the design of the database schema to guarantee high quality metadata structures that are able to cope with the wide range of information sources that the eCHASE architecture is expected to accommodate. This structure will be flexible and extensible for the requirements of integrating new systems used by content providers joining eCHASE in the future, and is well suited for interoperability with other systems.

The unified database schema will define identifiers for metadata records to reference media content such as images and video files in the repository supported by the media engine.

Media Repository

The media repository stores media files in a file system hierarchy. The file system path to the media file is a function of the media file’s resource identifier (unique id) in the unified database.
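
One way such a mapping from identifier to path could look is sketched below. The text does not define the actual function, so everything here is an assumption: integer identifiers, the three-level sharding scheme, and the name media_path.

```python
from pathlib import PurePosixPath


def media_path(root: str, resource_id: int, extension: str) -> PurePosixPath:
    """Map a resource identifier to a deterministic path under root.

    Assumes non-negative integer identifiers below 10**9. Splitting the
    zero-padded identifier into groups of three digits is one arbitrary
    choice that keeps the number of entries per directory small.
    """
    if not 0 <= resource_id < 10**9:
        raise ValueError("resource_id out of range for this sketch")
    digits = f"{resource_id:09d}"
    shards = [digits[i:i + 3] for i in range(0, 9, 3)]
    return PurePosixPath(root, shards[0], shards[1], f"{shards[2]}.{extension}")
```

Because the path depends only on the identifier, no lookup table is needed to locate a media file once its record in the unified database is known.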

Descriptor Repository

The Descriptor Repository stores multimedia descriptor files in a file system. The file system hierarchy is the same as in the Media Repository. Each of the media entities from the Media Repository is represented as a directory in the Descriptor Repository with the same name. The directories contain the versioned media descriptors.
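
The mirrored layout can be sketched as follows. The text does not specify how versions are encoded, so the ".vN" file-name suffix, the descriptor name used in the example and both function names are hypothetical.

```python
from pathlib import PurePosixPath
from typing import Iterable, Optional


def descriptor_path(media_root: str, descriptor_root: str, media_file: str,
                    descriptor_name: str, version: int) -> PurePosixPath:
    """Return the path of one versioned descriptor for a media file.

    The media file's path relative to media_root is reused as a directory
    under descriptor_root, so both hierarchies have the same shape.
    """
    relative = PurePosixPath(media_file).relative_to(media_root)
    return PurePosixPath(descriptor_root) / relative / f"{descriptor_name}.v{version}"


def latest_version(file_names: Iterable[str], descriptor_name: str) -> Optional[int]:
    """Pick the highest version among file names of the form '<name>.v<N>'."""
    prefix = f"{descriptor_name}.v"
    versions = [
        int(name[len(prefix):])
        for name in file_names
        if name.startswith(prefix) and name[len(prefix):].isdecimal()
    ]
    return max(versions, default=None)
```

For example, under these assumptions the descriptors of /media/001/234/567.jpg would live in the directory /descriptors/001/234/567.jpg.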

The Descriptor Repository could reside on a different physical disk to the Media Repository. This would enable the overall system to continue to work (albeit without content-based search) in case of failure or maintenance.

Media and Descriptor Index

The Media and Descriptor Index allows high performance content-based searching. The Index provides indexes for some or all of the media descriptors to avoid having to perform slow, brute-force descriptor-by-descriptor comparisons. The individual indexes could be in a database table or a custom format (e.g. multidimensional index, or inverted files), depending on the requirements of the descriptor being indexed.

The Media and Descriptor Index is particularly important for media descriptors that contain temporal and/or spatial localization information. The index should probably reside on a separate physical disk to the repositories for performance purposes.
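
As an illustration of the inverted-file option mentioned above, the following sketch indexes media objects by discrete descriptor terms. It assumes that a descriptor has already been reduced to a set of such terms (for example by quantization), which the text does not prescribe; the class name and ranking rule are likewise assumptions.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Set, Tuple


class InvertedIndex:
    """Maps each descriptor term to the media objects that contain it."""

    def __init__(self) -> None:
        self._postings: Dict[str, Set[str]] = defaultdict(set)

    def add(self, media_id: str, terms: Iterable[str]) -> None:
        for term in terms:
            self._postings[term].add(media_id)

    def query(self, terms: Iterable[str]) -> List[Tuple[str, int]]:
        """Return (media_id, shared_term_count), best match first.

        Only objects sharing at least one term with the query are visited,
        which avoids a descriptor-by-descriptor scan of the repository.
        """
        hits: Counter = Counter()
        for term in set(terms):
            for media_id in self._postings.get(term, ()):
                hits[media_id] += 1
        # Sort by descending count, then by identifier for a stable order.
        return sorted(hits.items(), key=lambda kv: (-kv[1], kv[0]))
```

Other descriptor types, such as those with spatial or temporal localization, would more likely need a multidimensional index than this term-based structure.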

Classifier

The classifier can use clustering techniques to help speed up the retrieval process. By comparing the query object to a representative object of each cluster and then considering only the objects in the nearest clusters, the total number of objects in the comparison can be drastically reduced.
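
The two-stage comparison can be sketched as follows. The data layout (each cluster holding a representative vector and a dictionary of member descriptors), the Euclidean distance and the function name are assumptions made for the example; math.dist requires Python 3.8 or later.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]
# cluster id -> (representative descriptor, {object id: descriptor})
Clusters = Dict[str, Tuple[Vector, Dict[str, Vector]]]


def cluster_search(query: Vector, clusters: Clusters,
                   n_clusters: int = 1, k: int = 3) -> List[str]:
    """Return up to k object ids closest to query, searching few clusters.

    Stage 1 compares the query with one representative per cluster.
    Stage 2 compares it only with members of the n_clusters nearest clusters.
    """
    ranked = sorted(clusters.values(), key=lambda c: math.dist(query, c[0]))
    candidates: Dict[str, Vector] = {}
    for _representative, members in ranked[:n_clusters]:
        candidates.update(members)
    by_distance = sorted(candidates, key=lambda oid: math.dist(query, candidates[oid]))
    return by_distance[:k]
```

The result is approximate: an object in a cluster that was not searched can never be returned, so increasing n_clusters trades speed for recall.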

The classifier may also be able to create associations between metadata and concepts in the ontology and media objects using their associated descriptors. Using existing information as training data, metadata can automatically be assigned to new objects entering the system.
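
A very simple way to assign metadata from training data is a nearest-neighbour vote, sketched below. The text does not say which learning technique the classifier uses, so k-nearest neighbours is an assumption chosen for brevity, as are the function name and the example labels.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def suggest_metadata(descriptor: Vector,
                     training: List[Tuple[Vector, str]],
                     k: int = 3) -> str:
    """Suggest a metadata value for a new object from labelled examples.

    training holds (descriptor, metadata value) pairs for objects already
    in the system; the k closest examples vote on the suggested value.
    """
    if not training:
        raise ValueError("training data must not be empty")
    nearest = sorted(training, key=lambda item: math.dist(descriptor, item[0]))[:k]
    votes = Counter(label for _vector, label in nearest)
    return votes.most_common(1)[0][0]
```

In practice a suggestion of this kind would probably be reviewed by a person before being stored as metadata.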

