TestDataConclusions

These conclusions are split into several sections relating to the different stages of the testing process, from the initial harvesting of the data through to the classification itself.

Building the Test Datasets

To build the test datasets a script was written that queries ROAR directly for the URLs of files relevant to the dataset being harvested. The script samples randomly from ROAR, so running it twice would produce different datasets. Initially it was thought that generating a 100-file dataset would only require asking ROAR for 100 URLs. This soon proved not to be the case: the URLs supplied by ROAR appear to be valid, but almost a third of them fail to return a file when harvested directly. Most of this third was probably made up of "file not found", "file withdrawn" or "request a copy" responses, which should return the relevant error codes. On receiving an error code the harvester simply tried another file. One could argue that the correct result was received for these files (i.e. the error code), but as we will see later, the error codes returned are rarely relevant to the actual error.
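
The retry behaviour can be summarised with a short sketch (illustrative only, not the actual harvesting script; the candidate URL list and the target dataset size are assumed for the example):

# Sketch of the harvest-with-retry loop described above (not the actual
# Preserv script). The candidate_urls list and target_size are assumptions
# for illustration; the real script queried ROAR directly for candidate URLs.
import requests

def harvest_dataset(candidate_urls, target_size=100):
    """Keep fetching candidate URLs until target_size files are harvested,
    skipping any URL that returns an HTTP error code ("file not found",
    "file withdrawn", "request a copy", ...)."""
    harvested = []
    for url in candidate_urls:
        if len(harvested) >= target_size:
            break
        try:
            response = requests.get(url, timeout=30)
        except requests.RequestException:
            continue                      # network failure: try the next URL
        if response.status_code != 200:
            continue                      # error code received: try another file
        harvested.append((url, response.content))
    return harvested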

Constructing the Test Repositories

With the datasets constructed it was time to submit them into EPrints 3.2 repositories. This proved to be the easiest stage (having done this process many times before) and each 100-item repository took about 5 minutes to fully populate. I did have one issue with a long file name that was wrongly escaped on disk. Renaming this one file solved the problem (a bit of a hack, but it was one file out of 2144).

The Classification Process

The Preserv2-EPrints-Toolkit, which includes a DROID wrapper, a caching database (as an EPrints dataset) and a results page (as an EPrints Admin Screen), was fully tested and found to be working correctly (after some minor bug fixes). The toolkit can be downloaded at http://files.eprints.org/422/.
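
The caching behaviour can be illustrated roughly as follows (a hypothetical Python sketch, not the toolkit's own code; the real toolkit is an EPrints component that stores its cache in an EPrints dataset, and classify_with_droid here simply stands in for the DROID wrapper):

# Hypothetical illustration of the caching behaviour: results are keyed on a
# content hash so unchanged files are never sent to DROID twice.
import hashlib

class ClassificationCache:
    def __init__(self):
        self._cache = {}          # sha1 digest -> classification result

    def classify(self, path, classify_with_droid):
        with open(path, "rb") as f:
            digest = hashlib.sha1(f.read()).hexdigest()
        if digest not in self._cache:
            self._cache[digest] = classify_with_droid(path)
        return self._cache[digest]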

The classification part of the toolkit, which is meant to be scheduled under all our models, was in fact invoked manually on the datasets to save waiting for the scheduled job to launch. The command used to do this manually was, however, the same as that used in the scheduled process. The process generally ran very smoothly once it had been discovered that DROID's command-line invocation does not support the full shell escaping used in a Linux environment running the bash shell (other shells were not tested). It is suspected that DROID has mostly been tested on Windows and thus supports the DOS-style command prompt. Rather than investigate further, the problem was solved by using the XML classification file syntax, which DROID can both read from and write to. This slows the process slightly (by a few milliseconds) but is much more stable as a result.
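
In outline the approach looks something like the sketch below (illustrative only: the XML element names and DROID command-line flags shown are assumptions, not the documented interface; the point is that file paths live inside an XML classification file rather than on the command line, so shell escaping of unusual file names is no longer an issue):

# Sketch of driving DROID through its XML file-collection syntax instead of
# passing file paths as shell arguments. The element names and the DROID
# flags below are illustrative assumptions; consult the DROID documentation
# for the exact schema and options.
import subprocess
import xml.etree.ElementTree as ET

def write_file_collection(paths, collection_path):
    root = ET.Element("FileCollection")                        # assumed element name
    for p in paths:
        ET.SubElement(root, "IdentificationFile", Name=p)      # assumed attribute
    ET.ElementTree(root).write(collection_path, encoding="utf-8",
                               xml_declaration=True)

def run_droid(collection_path, results_path):
    # Flag names are assumptions for illustration only.
    subprocess.run(["java", "-jar", "DROID.jar",
                    "-L", collection_path, "-O", results_path], check=True)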

With the above fix applied the classification process went extremely smoothly: all 2144 files across the 13 repositories were classified and the output data was fed back into EPrints.

Classification Conclusions

DROID v3.00 & Signature File v13 classifies more files than the version used by ROAR

275 files were unknown on import and 146 remain unknown. However, 89 invalid and badly formed files appear in both counts. Discounting these leaves 186 files (275 - 89) originally unknown versus 57 (146 - 89) now unknown, which is a major improvement.

DROID v3.00 + Sig File v13 has issues determining exact version of certain file formats

The newer version of DROID is much worse at identifying file versions when asked to identify formats including text, RTF and the TIFF image format. In other words, DROID knew the broad format, e.g. that a file was a text file, but could not tell whether it was a Comma Separated Values file, a Macintosh-formatted text file or any other particular version.

This is likely to be the case with other formats too, but the three listed exist in significant numbers within our dataset. For these types DROID identified every file as the same version, grouping them together inaccurately.

Mime-types should be considered as a means of basic classification

While DROID gets the basic classification correct, e.g. knowing a text file is a text file, it is important to consider mime-types when deciding which files have changed their format classification in a way that may put the file at risk. Without this factor, comparison with ROAR implies that over a quarter of the files in the 1000-item typical dataset have changed classification. When mime-types are taken into account, however, we find that 256 files match by mime-type and only 40 files genuinely change classification (investigated later).

To compare the mime-types of the files, this data was obtained from the DROID/PRONOM identification file, which was fine for the records that had a mime-type listed against them. However, PRONOM lacks mime-type information for many file formats that do have an established mime-type. In most cases this mime-type has existed for a while and it is unclear why the data is missing from the PRONOM registry. TNA suspects that mime-type is not a compulsory field when data is provided to the registry.
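
The comparison itself amounts to something like the following sketch (illustrative only; the puid_to_mime mapping stands in for data read from the DROID/PRONOM identification file, and records are assumed to be pairs of old and new PUIDs per file):

# Illustrative sketch of the mime-type comparison: a file only counts as
# "changed" when its old and new classifications map to different mime-types.
def count_changes(records, puid_to_mime):
    matched, changed, unknown = 0, 0, 0
    for old_puid, new_puid in records:
        old_mime = puid_to_mime.get(old_puid)
        new_mime = puid_to_mime.get(new_puid)
        if old_mime is None or new_mime is None:
            unknown += 1                 # PRONOM has no mime-type recorded
        elif old_mime == new_mime:
            matched += 1                 # same mime-type: not really at risk
        else:
            changed += 1                 # genuine change of classification
    return matched, changed, unknown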

DROID v3.00 - The percentages!

If we include the 89 malformed files, DROID v3.00 was unable to classify a total of 146 of the 2144 files across all 13 repositories; in other words 1998 files were classified, a 93.1% classification rate. If we also take into account those which were wrongly classified (discussed later), this comes down to a 92.75% success rate by mime-type.

This figure does not take into account the success rate per file version, a task deemed unnecessary for completing this part of the system testing. However, as already outlined, it is believed that one of the DROID versions (likely the newer one) has problems differentiating file versions.

Classification by Extension & Percentages

The extension test was designed to see whether DROID is more accurate than simply looking at file extensions when it comes to deciding mime-types. Across the entire 2144-item dataset only 4 items had no extension. Processing the rest gave file extensions an accuracy rate of 99.8%, which includes the 89 malformed files that were classified as possible HTML files by their extension (these files contained HTML but without an HTML header, hence they were malformed).

It should be noted that not every file was opened to check that it was indeed what its extension stated. However, fringe cases and DROID unknowns were checked in this way and found to be the file types their extensions declared.

It should also be noted that while the file extension is a good way of finding the file type, many files carried additional data after their extension, e.g. a .1, .2, .3, etc., or a _backup or _old. It is therefore necessary to parse these names manually to find the correct extension, which may sit in the middle of the file name (e.g. test.txt.backup).
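
The parsing amounts to something like the sketch below (illustrative only; the extension and noise-suffix lists are assumptions based on the examples above, and paper.pdf.1 is a hypothetical example file name):

# Sketch of extension recovery as described above.
KNOWN_EXTENSIONS = {"pdf", "doc", "txt", "rtf", "tif", "html", "htm"}  # illustrative
NOISE = {"backup", "old"}                      # suffixes seen in the dataset

def recover_extension(filename):
    """Return the most plausible extension, skipping trailing noise such as
    numeric suffixes (.1, .2, .3) or _backup/_old markers."""
    parts = filename.lower().replace("_", ".").split(".")
    for part in reversed(parts[1:]):           # walk right-to-left past the noise
        if not part or part.isdigit() or part in NOISE:
            continue
        return part if part in KNOWN_EXTENSIONS else None
    return None

print(recover_extension("test.txt.backup"))    # -> txt
print(recover_extension("paper.pdf.1"))        # -> pdf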

DROID vs File Extensions

Although we have stated here that file extensions are potentially a more accurate way of classifying the content of a repository, the file version cannot be found this way: an extension cannot differentiate between a Word 95 document and one conforming to the Word 2003 specification. DROID is still required at this stage. The general conclusion is that by combining the two techniques DROID may become more accurate.

Word Documents being wrongly classified as Excel files

There are three of these, all from the Stirling Repository dataset: two were classified as Excel files by both versions of PRONOM, while the third was originally classified as a Word document (correctly) and newly classified as an Excel spreadsheet.

The files can be downloaded from http://www.preserv.org.uk/files/doc-excel-errors.zip and are named in a way that should make it obvious which is which.

Again, this did not need to be addressed further to complete this workpackage, other than to affirm that this was a problem with DROID and not created by our use of it within the Preserv framework.

File classification changes - So many HTML files!

A lot of files changed classification because the harvester downloaded an HTML page in place of the required document. This was not a problem with the way the harvester was written, but with the source repository providing incorrect or incomplete header information. In the majority of cases the repository returned an HTTP status code stating that the content had been found when in fact it had not, and the content delivered was a web page stating that fact.

In the case where a copy has to be requested, even I am not sure what the correct header would be. However, it may have been worth modifying the harvester to check that the mime-type the repository's web server supplies matches the mime-type it hoped to get back in response.
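
Such a check might look something like this sketch (illustrative only; in practice the expected mime-type would have to come from the repository metadata):

# Sketch of the suggested harvester check: compare the Content-Type header the
# repository's web server returns against the mime-type we expected.
import requests

def fetch_if_expected(url, expected_mime):
    response = requests.get(url, timeout=30)
    if response.status_code != 200:
        return None                    # honest error code: try another file
    served = response.headers.get("Content-Type", "").split(";")[0].strip()
    if served != expected_mime:
        return None                    # e.g. text/html served where application/pdf was expected
    return response.content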

Inconclusive file classification changes

Lastly, there are a few file classification changes that cannot be explained. These include files that appear to have moved from an older version to a newer one, e.g. PDF 1.4 to PDF 1.5. Without further metadata from the repository it is unclear whether this is the fault of DROID or simply because the user has updated the document in the repository.

The EPrints Admin Page GUI

With all this data classified, the last test was of the EPrints GUI, which summarizes the data on one easy-to-read page. Due to the amount of data it was found necessary to extend the amount of cached data, and this was done within a new user dataset for EPrints. Even with this, the page still takes time to return, the time taken depending on the number of at-risk resources. Further options such as the use of AJAX need to be considered in time. That said, the page returns in less than 10 seconds: not instant, but not bad.
