  • Identifying the test data: what is it, and where does it come from?

The test data represents a snapshot of ROAR taken at the time the import_test_data script was run. The script asks ROAR for a current snapshot of all the repositories registered with it, as well as a snapshot of each of the individual repositories identified. Each snapshot consists of a table listing how many files of each format (and version) are in the repository, as classified by an earlier version of DROID. From this table the import script works out how many files of each format are required to build a smaller snapshot with a similar profile. For the typical repository, derived from the snapshot of all repositories registered in ROAR (more than 1400), it was decided to harvest around 1000 files. For each of the individual repositories, 100 files were targeted.
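The scaling step can be illustrated with a minimal sketch. It assumes the snapshot is available as a mapping from format name to file count and uses largest-remainder rounding so that the scaled counts add up exactly to the target. The function name, the rounding rule and the sample figures are illustrative assumptions; the import script may well do this differently.

```python
from math import floor


def scale_counts(format_counts, target_total):
    """Scale per-format file counts to target_total files in total,
    keeping the proportions of the original snapshot as closely as possible."""
    total = sum(format_counts.values())
    if total == 0:
        return {fmt: 0 for fmt in format_counts}

    # Exact (fractional) share of the target for each format.
    exact = {fmt: n * target_total / total for fmt, n in format_counts.items()}
    # Round every share down first ...
    scaled = {fmt: floor(share) for fmt, share in exact.items()}
    # ... then hand the files still missing to the largest fractional parts.
    leftover = target_total - sum(scaled.values())
    by_remainder = sorted(exact, key=lambda f: exact[f] - scaled[f], reverse=True)
    for fmt in by_remainder[:leftover]:
        scaled[fmt] += 1
    return scaled


# Made-up snapshot figures, for illustration only.
snapshot = {"PDF 1.4": 6000, "PDF 1.3": 3000, "HTML 4.01": 900, "JPEG 1.01": 100}
sample_plan = scale_counts(snapshot, 1000)
```

A rare format whose share rounds down to zero drops out of the smaller snapshot entirely, so a real script might want to enforce a minimum of one file per format.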

  • How did I get the data?
  • What did I have to change to make it work?

To harvest the content, ROAR was used again to randomly select file URLs for each format, and the files were then downloaded directly from the source repository. Any URL that failed to connect or download (of which there were a lot!) was ignored and another URL was requested from ROAR, until a complete dataset had been downloaded. Since ROAR is based on data obtained from a repository's OAI-PMH interface, which is intended to be a reliable service, it was surprising how many URLs had to be attempted before a file was returned.
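The skip-and-retry behaviour described above might look roughly like the following sketch. A plain list of candidate URLs stands in for the requests made to ROAR, and the names `download` and `harvest` are hypothetical; none of this is taken from the actual import script.

```python
import http.client
import random
import urllib.request


def download(url, timeout=30):
    """Return the bytes at url, or None if connecting or downloading fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except (OSError, ValueError, http.client.HTTPException):
        # URLError and HTTPError are subclasses of OSError; ValueError
        # covers malformed URLs. Any failure simply means "try another URL".
        return None


def harvest(candidate_urls, wanted, fetch=download, rng=random):
    """Try candidate URLs in random order until `wanted` files have been
    downloaded or the candidates run out. Failed URLs are ignored."""
    pool = list(candidate_urls)
    rng.shuffle(pool)
    harvested = {}
    for url in pool:
        if len(harvested) >= wanted:
            break
        data = fetch(url)
        if data is not None:
            harvested[url] = data
    return harvested
```

Passing `fetch` and `rng` as parameters keeps the loop testable without network access; for example, `harvest(urls, 100, fetch=my_fake_fetch)` exercises the retry logic against canned responses.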

  • What did I do with the data?
  • What I did with the files. Where they went.

Once the data had been harvested from the source repositories, it was briefly analyzed to check for completeness before being injected into an EPrints 3.2 (svn) repository extended with the Preserv2/Eprints Preservation Toolkit. Each dataset was fed into a separate repository, keeping the datasets apart. In total, 2144 files were fed into 13 repositories.

  • What is the toolkit and what does it include?

The Preserv2/Eprints Preservation Toolkit consists of three main parts, two of which have been covered elsewhere within this project. The first is the DroidWrapper, which has been customized to read and manipulate EPrints 3.2 datasets directly, giving it reliable access to all of the files within the repository. Why this was done is discussed in the later results section. This version of the DroidWrapper otherwise behaves like any other version: it locates the files and feeds them to DROID for classification, and the results are then fed back into the repository. Because this particular wrapper works on the EPrints datasets directly, the classification is written straight into the repository rather than made available for a separate parser to process.

The next part of the Preserv2/Eprints Preservation Toolkit consists of a configuration file. It includes an EPrints dataset configuration which extends the EPrints file dataset with the fields pronomid, classification_quality and classification_date. The configuration file also adds the PRONOM dataset, a cache table that stores data from the PRONOM registry so that the registry doesn't have to be queried every time the repository administrator asks for a preservation report. This dataset also caches the file counts and risk scores, enabling the report page to load much faster.

The final part of the Preserv2/Eprints Preservation Toolkit is the page which displays the results to a repository administrator. More about this page and the toolkit can be found at http://wiki.preserv.org.uk/index.php/EPrintsPreservation.

  • Where were those files stored and managed when I fed them into the classification process?

The files are fed to DROID via the new EPrints storage controller, so DROID either has direct access to the object (if it is stored locally) or receives it via a download (if the file is offsite). Which of these applies depends on the separation between DROID and the objects. For more information, please refer to the papers on Smart Storage and the EPrints Storage Layer.

  • What tool did I use for the process?

Not included in the toolkit is the DROID tool (as well as Java, which is required to run DROID). In order to classify our objects we used DROID v3.00 and Signature File v13.

  • What happens to the results?

Each object is fed to DROID individually via a DROID XML classification file, which avoids command line interpretation problems involving escaped characters. DROID then classifies the object and feeds back the results in the same format. These results are processed by the wrapper before being injected directly into the EPrints dataset fields.
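To illustrate why passing file names in an XML file sidesteps shell escaping, here is a small sketch. The element names `FileList` and `File` are hypothetical and are not DROID's actual classification file schema; the point is only that an XML library escapes and restores awkward characters without any shell being involved.

```python
import xml.etree.ElementTree as ET


def write_file_list(paths):
    """Serialise file paths into an XML document.
    Element names are made up for this sketch, not DROID's real format."""
    root = ET.Element("FileList")
    for path in paths:
        ET.SubElement(root, "File").text = path
    return ET.tostring(root, encoding="unicode")


def read_file_list(xml_text):
    """Recover the file paths from a document made by write_file_list."""
    return [element.text for element in ET.fromstring(xml_text).iter("File")]


# File names like these are awkward to quote on a command line,
# but survive the XML round trip unchanged.
awkward = ['report & "final" <v2>.pdf', "it's a file (copy).doc"]
round_tripped = read_file_list(write_file_list(awkward))
```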

Once all the objects are classified, the wrapper triggers an update of the risk scores relating to those objects. It then updates the file counts for each format, readying the data to be displayed by the Admin page.
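The update of file counts and cached risk scores could be sketched as below. This assumes the classifications are available as a mapping from file to PRONOM identifier and that risk scores come from some registry lookup, represented here by a caller-supplied `lookup_risk` function. All names are hypothetical and the toolkit's own implementation is not shown in this page.

```python
from collections import Counter


def build_report(classifications, risk_cache, lookup_risk):
    """Return (pronomid, file_count, risk_score) rows, most common format first.

    classifications: mapping of file name -> PRONOM identifier
    risk_cache:      dict used as the cache table; updated in place
    lookup_risk:     stand-in for a (slow) registry query, called only on a cache miss
    """
    counts = Counter(classifications.values())
    report = []
    for pronomid, count in counts.most_common():
        if pronomid not in risk_cache:
            risk_cache[pronomid] = lookup_risk(pronomid)
        report.append((pronomid, count, risk_cache[pronomid]))
    return report
```

Because the cache is filled as a side effect, a second call with the same formats needs no registry lookups at all, which is the effect the cache table is described as having on the report page.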

  • How is that processed and displayed?

The Preserv2/Eprints Preservation Toolkit comes with an Admin page which displays the results. More on this page, including screenshots of it acting upon the typical repository, can be found at http://wiki.preserv.org.uk/index.php/EPrintsPreservation

  • Preparing the results for comparison to ROAR and conclusions

To process these results further, an extra system component was produced to analyse them. This component consists of several extension services:

    • The ability to compare the Preserv2/Eprints Preservation Toolkit classification to that originally established by ROAR to find differences.
    • The ability to fill in gaps in mime-types.
    • The ability to display file extensions and compare these to the DROID classification.
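The three services listed above can be sketched as small functions over plain mappings. This is only an illustration under assumed data shapes (file name mapped to format identifier or mime-type); the function names are hypothetical and the actual component is not documented here.

```python
import os
from collections import defaultdict


def classification_differences(toolkit, roar):
    """For files known to both sources, return {file: (roar_format, toolkit_format)}
    wherever the two classifications disagree."""
    shared = toolkit.keys() & roar.keys()
    return {f: (roar[f], toolkit[f]) for f in shared if toolkit[f] != roar[f]}


def fill_mime_gaps(mime_by_file, format_by_file, mime_for_format):
    """Return a copy of mime_by_file in which missing or empty mime-types are
    filled in from the file's classified format, where a mapping is known."""
    filled = dict(mime_by_file)
    for file_name, fmt in format_by_file.items():
        if not filled.get(file_name) and fmt in mime_for_format:
            filled[file_name] = mime_for_format[fmt]
    return filled


def extensions_by_format(format_by_file):
    """Group the (lower-cased) file extensions seen for each classified format."""
    table = defaultdict(set)
    for file_name, fmt in format_by_file.items():
        table[fmt].add(os.path.splitext(file_name)[1].lower())
    return dict(table)
```

A format that shows up with several unrelated extensions in `extensions_by_format`, or a file listed by `classification_differences`, is a natural candidate for closer manual inspection.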