From PreservWiki


The aim is to implement a digital preservation solution that can be built into the main EPrints software and distributed with it, so that people can use as little or as much of it as they want.

There are several stages to the process, each requiring a different level of interaction with the repository and its users and administrators.


Stage 1 - File Classification

File classification comes in several types:

  • mime-type identification:
    • EPrints does not do enough automatic classification; instead it relies on the submitter to do this on behalf of the repository.
    • Current mime-types are stored in the mime_type field within the File object.
  • PRONOM-DROID classification:
    • Offers automated classification of not just the correct mime-type but also the version of the file.
    • Each mime-type/version pair has a PRONOM-UID (URL) from which further classification information, such as format risk, can be obtained.
  • Namespace conformance:
    • Mainly used in plain text/XML/RDF files, a conforming namespace declaration helps to distinguish between the many thousands of XML namespaces (a number that is still growing).

Each of these classifications offers a level of detail which is very important when searching for data in a repository or in a distributed environment. Each level also aids the process of digital preservation and migration of a repository.
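As a rough illustration of the first and third classification levels, the sketch below guesses a mime-type from the file name and reads the namespace of an XML document's root element. It uses only the Python standard library and is not EPrints code; the function names and the shape of the returned record are assumptions made for this example. Note that extension-based guessing is weaker than the content-based identification DROID performs.

```python
import mimetypes
import xml.etree.ElementTree as ET


def guess_mime_type(filename):
    """Guess a mime-type from the file extension only (no content sniffing)."""
    mime_type, _encoding = mimetypes.guess_type(filename)
    return mime_type  # None when the extension is not recognised


def root_namespace(xml_text):
    """Return the namespace URI of the root element, or None if it has none."""
    root = ET.fromstring(xml_text)
    # ElementTree represents qualified tags as "{namespace-uri}localname".
    if root.tag.startswith("{"):
        return root.tag[1:].split("}", 1)[0]
    return None


def classify(filename, content=None):
    """Combine both levels into one record (hypothetical record layout)."""
    record = {
        "filename": filename,
        "mime_type": guess_mime_type(filename),
        "namespace": None,
    }
    # The namespace level only applies when the file looks like XML.
    if content is not None and record["mime_type"] in ("application/xml", "text/xml"):
        record["namespace"] = root_namespace(content)
    return record
```

Calling classify("paper.pdf") fills in only the mime-type, while an XML or RDF file passed with its content also gets the namespace of its root element recorded.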

Stage 2 - Enabling File Classification in EPrints

We could make greater use of tools such as DROID and JHOVE to classify files, and write a couple of simple import plugins which map the output of these classifiers to fields in the database. I would also argue that the classification object (XML) for each EPrint should be stored as an object against that EPrint, so that the provenance information is kept.

A PRONOM-DROID implementation

  • All-in solution
    • In the eprints-svn, under tools, there is an update_pronom_puids script which relies on the user having Java and DROID installed on their local system. The script invokes DROID against the file objects in the repository and writes the resulting PRONOM UID URLs to the pronom_uid field within the File object.
  • Partial solution
    • The classification of the objects is done outside of EPrints (thus no reliance on Java and DROID).
    • An import plugin can then read a DROID XML file and import the results.
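The partial solution could be sketched as follows: read a DROID XML report and produce a mapping from file path to PRONOM UID URL, ready to be written to pronom_uid. This is not the EPrints import plugin itself. The sample report is a hypothetical, simplified layout (a real DROID report carries an XML namespace and more fields), and the base URL used to turn a PUID into a URL is likewise an assumption.

```python
import xml.etree.ElementTree as ET

# Assumed base URL for turning a PUID such as "fmt/18" into a URL.
PRONOM_BASE = "http://www.nationalarchives.gov.uk/pronom/"

# Hypothetical, simplified DROID report, for illustration only.
SAMPLE_REPORT = """\
<FileCollection>
  <IdentificationFile>
    <FilePath>/data/eprint/1/paper.pdf</FilePath>
    <FileFormatHit>
      <PUID>fmt/18</PUID>
    </FileFormatHit>
  </IdentificationFile>
  <IdentificationFile>
    <FilePath>/data/eprint/2/unknown.bin</FilePath>
  </IdentificationFile>
</FileCollection>
"""


def parse_droid_report(xml_text):
    """Map each file path to a PRONOM UID URL, or None if unidentified."""
    result = {}
    for entry in ET.fromstring(xml_text).iter("IdentificationFile"):
        path = entry.findtext("FilePath")
        # findtext returns None when the file has no format hit.
        puid = entry.findtext("FileFormatHit/PUID")
        result[path] = PRONOM_BASE + puid if puid else None
    return result


if __name__ == "__main__":
    for path, uid in parse_droid_report(SAMPLE_REPORT).items():
        print(path, "->", uid)
```

Keeping unidentified files in the mapping with a value of None lets the importer record that classification was attempted and failed, rather than silently skipping the file.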

Stage 3 - Analysing format risks

PRONOM is now being extended so that it can be queried for a format risk based upon the PRONOM UID it receives. Based upon this, the idea is to present users with a traffic light scale showing the files in their repository and the associated risks (red to green, from high risk to no risk).

My intention is to make this an admin screen plugin which simply informs the admin of the state of their repository.
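A minimal sketch of how such a screen might summarise a repository is given below. The form of the risk value PRONOM will return is not defined here, so the numeric score between 0 and 1, the thresholds, and the function names are all assumptions made for illustration.

```python
from collections import Counter


def traffic_light(risk_score):
    """Map a risk score in [0, 1] to a colour (thresholds are assumptions)."""
    if risk_score is None:
        return "unknown"  # no risk information available for this format
    if risk_score >= 0.66:
        return "red"
    if risk_score >= 0.33:
        return "amber"
    return "green"


def summarise(file_puids, risk_by_puid):
    """Count the files in each colour band, given one PUID per file."""
    colours = (traffic_light(risk_by_puid.get(puid)) for puid in file_puids)
    return Counter(colours)


if __name__ == "__main__":
    # Hypothetical data: one PUID per file, plus assumed risk scores.
    puids = ["fmt/18", "fmt/18", "x-fmt/111", "fmt/999"]
    risks = {"fmt/18": 0.1, "x-fmt/111": 0.8}
    print(summarise(puids, risks))
```

Files whose format has no known risk are counted separately as "unknown" rather than being treated as safe, so the administrator can see how much of the repository is still unassessed.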

So what is required from EPrints?

In theory, a single field in the File object (pronom_uid) is enough; however, caching of information may also be advisable.
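One way the caching could work is sketched below: risk lookups are remembered per pronom_uid and refreshed only after they expire, so the admin screen does not need to query PRONOM for every file. The class name, the one-day default lifetime, and the fetch callable (which stands in for whatever PRONOM query is eventually available) are all hypothetical.

```python
import time


class RiskCache:
    """Remember risk lookups per pronom_uid for a limited time."""

    def __init__(self, fetch, max_age_seconds=86400, clock=time.time):
        self._fetch = fetch              # callable: pronom_uid -> risk (assumed)
        self._max_age = max_age_seconds  # how long a cached entry stays valid
        self._clock = clock              # injectable so expiry can be tested
        self._entries = {}               # pronom_uid -> (fetched_at, risk)

    def get(self, pronom_uid):
        """Return the cached risk, fetching it again once it has expired."""
        now = self._clock()
        cached = self._entries.get(pronom_uid)
        if cached is not None and now - cached[0] < self._max_age:
            return cached[1]
        risk = self._fetch(pronom_uid)
        self._entries[pronom_uid] = (now, risk)
        return risk
```

Because many files in a repository share the same format, caching by pronom_uid rather than by file keeps the number of remote queries proportional to the number of distinct formats.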
