From PreservWiki


About the dataset

The test dataset is a snapshot from roar.eprints.org taken at 17:00 on Thursday 27th November 2008.

The Typical Repository Dataset

The following table and chart represent the typical repository according to ROAR [[1]].

The formats making up the top 95% are listed individually, leaving the final 5% consisting of other formats (a total of 111 formats in this case).

Format                                     Percentage
Unknown                                    13
Portable Document Format (1.4)             13
Portable Document Format (1.3)             10
Portable Document Format - Archival (1)     8
Portable Document Format (1.2)              5
Portable Document Format (1.6)              5
Hypertext Markup Language                   4
Portable Document Format (1.5)              4
Fixed Width Values Text File                2
MS-DOS Text File with line breaks           2
Unicode Text File                           2
IBM DisplayWrite Document (3)               2
IBM DisplayWrite Document (2)               2
MS-DOS Text File                            2
Macintosh Text File                         2
Tab-Delimited Text File                     2
Plain Text File                             2
Other (111 formats)                         5

[Pie chart of the format distribution above [2]]

In Preserv2 we are going to select 1000 items in a weighted fashion (e.g. 13% PDF 1.4) to make up a test dataset. These items will be drawn at random from the 1,200+ repositories currently registered with ROAR.
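As a minimal sketch of the weighted selection described above, the following Python fragment draws 1000 format labels using `random.choices`, with weights taken from the table (abridged here; the 2% text-file formats are collapsed for brevity). The item lists behind each format label are assumed, not part of the source.

```python
import random

# Weights are the published percentages from the table above (abridged:
# the ten text/DisplayWrite formats at 2% each are folded into one bucket
# for brevity; expand as needed).
FORMAT_WEIGHTS = {
    "Unknown": 13,
    "Portable Document Format (1.4)": 13,
    "Portable Document Format (1.3)": 10,
    "Portable Document Format - Archival (1)": 8,
    "Portable Document Format (1.2)": 5,
    "Portable Document Format (1.6)": 5,
    "Hypertext Markup Language": 4,
    "Portable Document Format (1.5)": 4,
    "Text-based formats (2% each, combined)": 20,
    "Other (111 formats)": 5,
}

def sample_formats(n=1000, seed=None):
    """Draw n format labels, weighted by the distribution above."""
    rng = random.Random(seed)
    formats = list(FORMAT_WEIGHTS)
    weights = list(FORMAT_WEIGHTS.values())
    # random.choices samples with replacement, proportional to weights
    # (they need not sum to 100).
    return rng.choices(formats, weights=weights, k=n)

sample = sample_formats(1000, seed=42)
```

In practice each sampled label would then be mapped to a randomly chosen item of that format from the ROAR-registered repositories.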

The dataset will then be loaded into a fully functional EPrints repository with the Preserv2 extensions and used for testing those extensions. This dataset can also be used as the exemplar dataset from which risk scores can be generated, although, being atypical of any single repository in the community, it is better treated as a guide to which formats we need risk scores for.

A Sample of Specific Repositories

In the specific repository tests we are going to choose at least 10 repositories from ROAR which exhibit the diversity of repositories. We are then going to consider the effect of each of the following factors on a repository's preservation profile and possible strategy.

  • Size of repository.
  • Preserv profile.
  • Type of repository (software and content).
  • Level of activity.
  • What effects do mandates have?

Around 100 files will be taken at random from each repository to represent an accurate cross-section of that repository (similar to the first tests, but this time repository-specific). Each of these datasets will then be subject to the same tests as the dataset in part 1.
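The per-repository sampling above can be sketched in a few lines of Python using `random.sample` for sampling without replacement; the file identifiers here are hypothetical placeholders, not real ROAR data.

```python
import random

def sample_repository_files(file_ids, k=100, seed=None):
    """Draw a simple random sample of k files from one repository.

    If the repository holds fewer than k files, take them all.
    """
    rng = random.Random(seed)
    file_ids = list(file_ids)
    if len(file_ids) <= k:
        return file_ids
    # random.sample draws without replacement, so no file is picked twice.
    return rng.sample(file_ids, k)

# Hypothetical repository of 500 files.
files = [f"file-{i}" for i in range(1, 501)]
subset = sample_repository_files(files, k=100, seed=1)
```

Repeating this for each of the chosen repositories yields one repository-specific dataset per repository.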
