Preservation - Analyse
The analysis stage of active preservation has been a main focus of both Preserv1 and Preserv2, which worked with The National Archives (UK) and Oxford University to develop and integrate tools and registries designed to assist digital preservation. This section asks questions about our digital objects, the properties of those objects and the tools available to manipulate them.
What is the type of the file? Is the file valid?
To get an accurate indication of file format, Preserv believes that the file extension (the part of the name after the final ".") should not be relied upon; sometimes it is simply not present. Instead, proven tools should be used to analyse the contents of the file to determine its type. We can also verify that the contents conform to the format specification, if there is one. Examples include examining XML-based files, or compiling source formats such as LaTeX documents.
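As a minimal sketch of what content-based identification means, the following Python inspects the leading bytes of a file rather than its name, and then checks whether XML content is well-formed. It is not how DROID works internally: the signature table is a tiny illustrative subset, the names `MAGIC_SIGNATURES`, `sniff_format` and `is_well_formed_xml` are hypothetical, and well-formedness is a weaker check than validation against a format specification or schema.

```python
import xml.etree.ElementTree as ET

# Illustrative subset of well-known leading byte signatures.
# A real tool such as DROID uses a far richer signature file.
MAGIC_SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"PK\x03\x04", "ZIP container"),
]


def sniff_format(data: bytes) -> str:
    """Guess a format from the first bytes of the content, ignoring any file name."""
    for magic, label in MAGIC_SIGNATURES:
        if data.startswith(magic):
            return label
    return "unknown"


def is_well_formed_xml(data: bytes) -> bool:
    """Return True if the content parses as XML (well-formed, not schema-valid)."""
    try:
        ET.fromstring(data)
    except ET.ParseError:
        return False
    return True
```

A file named `report.doc` whose content begins with `%PDF-` would be reported as PDF here, which is the behaviour the extension alone cannot give.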
In both Preserv1 and Preserv2 we use a tool, DROID, produced by The National Archives (UK). This tool, which can be downloaded and run locally, uses a signature file available via the PRONOM registry to classify files and provide specific details relating to individual files. Each file is classified using a PRONOM unique identifier, enabling extra information to be obtained from the PRONOM registry about the format, more of which we shall explain later.
Preserv1 used DROID to classify the files in many repositories indexed by the Registry of Open Access Repositories (ROAR), presenting profiles for the repositories that could be classified in this way. An example Preserv profile from ROAR is shown below.
Preserv profile from ROAR
Although this approach provides a good breakdown of the file formats in the target repository, it has some problems that we wanted to address in Preserv2. ROAR relies on harvesting tools to access remote repository content, which introduces bandwidth issues when downloading data, especially for large files. To limit the bandwidth used by ROAR and by the repository, files over 2 MB were not downloaded and so could not be classified. It is preferable to control this process within the repository software to ensure successful and complete classification. Doing so also goes some way to solving the 'no files found' problem that dominates the illustration above.
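The effect of a download size cap on a format profile can be illustrated with a short Python sketch. This is not the ROAR harvester: `build_profile`, the bucket labels and the sample identifiers are hypothetical, and the cap is assumed to mean 2 × 1024 × 1024 bytes.

```python
from collections import Counter

# Assumed reading of the size limit described above.
SIZE_CAP_BYTES = 2 * 1024 * 1024


def build_profile(files, size_cap=SIZE_CAP_BYTES):
    """Tally format identifiers for an iterable of (size_in_bytes, puid_or_None).

    Files larger than size_cap are counted as not downloaded, so they never
    reach the classifier. Pass size_cap=None to classify every file.
    """
    profile = Counter()
    for size, puid in files:
        if size_cap is not None and size > size_cap:
            profile["not downloaded (over size cap)"] += 1
        elif puid is None:
            profile["unclassified"] += 1
        else:
            profile[puid] += 1
    return profile
```

Running the same file list with `size_cap=None`, as would be possible when classification happens inside the repository, moves the large files out of the "not downloaded" bucket and into their proper format counts.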
By linking DROID more closely with the repository software (as Preserv2 has been doing) we can ensure successful completion without limiting file size. The next illustration shows the interface to an EPrints repository where DROID has been run in the background and has classified all of the files in this repository. Had it not completed successfully, this would be indicated in the risk scores box.
EPrints Format Classification Screen
Thanks to the modular nature of the EPrints software, this screen is implemented as a screen plug-in that reads a single extra database record per file and displays a summary of this information. Some EPrints-based functionality has been added to make the page readable, but as far as the classification is concerned, the extra database field is populated by an import plug-in that parses the XML output from a vanilla DROID installation.
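The import step can be pictured with the following Python sketch, which reads a classification report and produces one identifier per file, ready to be stored in a per-file database field. The XML layout used here is a simplified, hypothetical stand-in and not the actual DROID output schema, and `parse_report` is an illustrative name, not part of EPrints or DROID.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified report; real DROID output uses its own schema.
SAMPLE_REPORT = """<report>
  <file path="docs/thesis.pdf"><puid>fmt/18</puid></file>
  <file path="docs/notes.txt"><puid>x-fmt/111</puid></file>
  <file path="docs/mystery.bin"/>
</report>"""


def parse_report(xml_text):
    """Map each file path to its format identifier, or None if unclassified."""
    root = ET.fromstring(xml_text)
    results = {}
    for file_el in root.findall("file"):
        puid_el = file_el.find("puid")
        results[file_el.get("path")] = puid_el.text if puid_el is not None else None
    return results
```

Keeping the result to one value per file mirrors the single extra database record described above; a summary screen then only needs to count those values.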
DROID need not be tied to the repository software, however, and can instead be used as part of a 'smart storage' approach.