Institutional repository preservation contexts


Digital preservation for institutional repositories (IRs) has become muddled. Some open access (OA) advocates say that repositories are for OA and therefore need not be concerned with, or distracted by, the needs of preservation (because either there is not yet enough content to be preserved, or open access is about copies of journal papers that are being preserved elsewhere by publishers). Advocates for digital preservation argue that preservation is a responsibility of repository software, without specifying what is required, for what and by whom.

With the number of repositories growing relentlessly and the volume of content growing accordingly, digital preservation is no longer a distraction for repositories, and we are beginning to find practical solutions to undo the muddle and answer the key questions, beginning with the most basic question.

1 What is digital preservation?

Digital preservation is 90% planning, administration and management; the remaining 10% is purely technical: "Determining which preservation strategies will be developed and when and in what circumstance the strategies will be implemented is the essence of digital preservation." (from the Cornell tutorial, Digital Preservation Management)

The goal of preservation is easier to state: to ensure content can continue to be accessed and used over time. The means by which this may be achieved vary widely, from reliable storage, backup and replication, to security and recovery, to managing different media formats (from tape to disc of various formats to hard drive) and other forms of technology preservation, to format migration or emulation, as ways to deal with the obsolescence over time of software applications and formats. Any one of these could be part of a digital preservation strategy, although none alone would be sufficient, and cost becomes an issue if many means of preservation are required.

2 What IR content should be preserved?

Whatever the institution decides. An institutional repository (rather than a repository in an institution) is the responsibility of the institution. Forget the arguments for or against preserving IR contents. Ultimately the institution decides what the repository is for, how and to what extent it is funded, what content to admit, and consequently what content to preserve and how.

3 How can institutions define what must be preserved?

This should all emerge naturally from mission statements and repository policy, including preservation policy. The problem, as surveys have shown, is that two-thirds of repositories have no policy, and even fewer have a preservation policy. This is not a basis for making preservation decisions. If there is an argument for not being concerned about preservation for IRs yet, this is it: lack of policy. Preservation should not precede policy. Repositories can begin to build policy using this tool. We anticipate that serious IRs will assume responsibility for preservation for most or all of the content that is allowed to be admitted.

4 Should preservation be built into repository software?

Repository software will never be a complete solution to all preservation requirements, given the broad scope of digital preservation activities described above and that preservation is only in some part technical. Repository software such as EPrints increasingly supports preservation activities, in terms of recording preservation metadata and audit histories of changes to digital objects, and dissemination formats such as METS. In this way repository software of most types provides a sound foundation for managing institutional digital content in the longer-term and, through software interfaces, a starting point for interacting with more specialised and extensive preservation services.

5 Who should perform technical preservation activities?

Specialist preservation services. Each country has well known organisations with long experience of preservation, not just digital preservation: the national libraries, the national archives, etc. There have always been many more content creators and publishers than preservation agencies. Why should this be any different for digital repositories? We can cut the preservation muddle by recognising that repositories and repository software do not need to become preservation specialists. Leave it to the experts. The Preserv project has been working with three organisations with this expertise: the British Library, The National Archives and Oxford University. Other projects are investigating preservation services for repositories (Sherpa-DP, Repository Bridge, ECHO DEPository Project) or preservation services for digital content more generally (Shared Infrastructure Preservation Models, Cornell Format & Media Migration Service, and the MetaArchive project). Others have described service-oriented preservation architectures. It's surprising there aren't more examples of preservation services, but it's enough to put an end to the preservation-by-repository-software canard.

6 What services are available?

Some experimental services exist, but none are available commercially. It has been announced that Dutch university repositories will be curated by the National Library of the Netherlands. Preserv has been exploring models for preservation services. These models have tended to be based on super-aggregated 'black box' approaches in which the IR contracts one-to-one with a service provider for a complete preservation solution: to download, store and act on the content to be preserved. We now see how a new, more flexible and cost-effective approach may be possible based on interacting Web services. An example of this is the PRONOM-ROAR repository file format identification service. PRONOM is a Web services-based file format identification tool. For PRONOM to interact with many repositories, what is needed is a database of repositories and a means of communicating with those repositories in an automated way. These are provided by the Registry of Open Access Repositories (ROAR) and the Celestial service (an OAI-PMH harvesting/caching tool), both developed by Tim Brody of the Preserv team at Southampton University. PRONOM-ROAR provides file format profiles (Preserv profiles) for 200+ repositories.
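To make the idea of a file format profile concrete, here is a minimal sketch in Python. It is not the PRONOM-ROAR implementation and does not call PRONOM, ROAR or Celestial: the function names, the example URLs and the small extension table are all assumptions made for illustration, and guessing a format from a file extension is a crude stand-in for proper format identification.

```python
from collections import Counter
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical extension-to-format table (an assumption for this sketch).
# A real identification service would consult a format registry and inspect
# the files themselves rather than trust file names.
EXTENSION_FORMATS = {
    ".pdf": "Portable Document Format",
    ".doc": "Microsoft Word Document",
    ".ps": "PostScript",
    ".html": "Hypertext Markup Language",
}


def identify_format(file_url: str) -> str:
    """Guess a format name from the file extension in a URL."""
    suffix = PurePosixPath(urlparse(file_url).path).suffix.lower()
    return EXTENSION_FORMATS.get(suffix, "unknown")


def format_profile(file_urls):
    """Count the files of one repository by guessed format."""
    return Counter(identify_format(url) for url in file_urls)


# Illustrative input standing in for file URLs harvested from one repository.
harvested = [
    "http://repository.example.org/123/1/paper.pdf",
    "http://repository.example.org/124/1/thesis.PDF",
    "http://repository.example.org/125/1/report.doc",
    "http://repository.example.org/126/1/data.xyz",
]

profile = format_profile(harvested)
for name, count in profile.most_common():
    print(f"{count:3d}  {name}")
```

The point of the sketch is the shape of the result: a per-repository count of files by format, which is the kind of summary a preservation service could act on.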

7 What happens next?

The range of services will be expanded. Format identification is only a first step towards a preservation strategy. The question is what you do with this information. Format IDs need to be verified, and file formats may need to be migrated to other formats in the event of obsolescence. This is where preservation services can help.

In an effort to reengineer workflow in the creation, management and preservation of electronic records, The National Archives in 2004 initiated a programme called Seamless Flow. Applying this approach to repository content suggests the following structured process for active preservation:
  1. Characterisation: identification (as in PRONOM-ROAR), validation, and property extraction
  2. Preservation planning: e.g. risk assessment (of generic risks associated with particular formats/representation networks), technology watch (monitoring technology change impacting on risk assessment), impact assessment (impact of risks on specific IR content), preservation plan generation (to mitigate identified impacts, e.g. migration pathways)
  3. Preservation action: e.g. migration (including validation of the results), either as ongoing preservation intervention to ensure continued access or as on-demand preservation action, performing migrations or supplying appropriate rendering tools at the point of user access.
For example, building on PRONOM-ROAR, the aim is to provide, once the PRONOM database supports it, a 'technology watch' service that can warn of file formats that are at risk of becoming inaccessible.
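A technology watch of this kind can be pictured as a comparison between a repository's format profile and a list of formats judged to be at risk. The sketch below assumes both are available as plain Python dictionaries. The format names, the risk reasons and the function name are hypothetical placeholders, not data or interfaces from PRONOM.

```python
from dataclasses import dataclass


@dataclass
class FormatWarning:
    format_name: str
    file_count: int
    reason: str


# Hypothetical watch list: format name -> reason it is considered at risk.
# These entries are placeholders, not real risk assessments.
AT_RISK = {
    "Example Legacy Format A": "no maintained rendering software",
    "Example Legacy Format B": "proprietary and undocumented",
}


def technology_watch(profile, at_risk):
    """Return warnings for at-risk formats present in a profile.

    Warnings are ordered by the number of affected files, largest first,
    so that the biggest exposure is reviewed first.
    """
    warnings = [
        FormatWarning(name, count, at_risk[name])
        for name, count in profile.items()
        if name in at_risk and count > 0
    ]
    return sorted(warnings, key=lambda w: w.file_count, reverse=True)


# Illustrative profile for one repository: format name -> number of files.
repository_profile = {
    "Example Current Format": 950,
    "Example Legacy Format A": 12,
    "Example Legacy Format B": 40,
}

watch_warnings = technology_watch(repository_profile, AT_RISK)
for w in watch_warnings:
    print(f"{w.format_name}: {w.file_count} files ({w.reason})")
```

In terms of the process above, the profile comes from characterisation, the watch list from preservation planning, and the warnings are the input to any preservation action such as migration.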

Active preservation services could be supplemented by a passive preservation package (aka bitstream preservation) offering core functionality for any repository system, including security and access control, integrity, storage management, backup and disaster recovery. This package would aim to offer either more resilient and comprehensive preservation storage than could be provided cost-effectively locally, or a cheaper backup solution that simply provides an offsite alternative to the storage provided locally by an institution.
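The integrity element of passive preservation is commonly handled by recording a checksum for each file and re-checking it later. The following is a minimal sketch of that technique using Python's standard library. The function names and the manifest layout are assumptions for illustration, not part of any Preserv package.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=65536):
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_fixity(paths):
    """Build a manifest mapping each file path to its current checksum."""
    return {str(p): sha256_of(p) for p in paths}


def check_fixity(manifest):
    """Return the paths that are missing or no longer match the manifest."""
    failures = []
    for path, expected in manifest.items():
        if not Path(path).exists() or sha256_of(path) != expected:
            failures.append(path)
    return failures


# Demonstration with a temporary file standing in for a deposited item.
with tempfile.TemporaryDirectory() as folder:
    deposit = Path(folder) / "paper.pdf"
    deposit.write_bytes(b"original bitstream")
    manifest = record_fixity([deposit])

    intact = check_fixity(manifest)        # nothing has changed yet
    deposit.write_bytes(b"corrupted bitstream")
    damaged = check_fixity(manifest)       # the altered file is reported

print("failures before change:", intact)
print("failures after change:", damaged)
```

A check like this only detects damage. Recovery still depends on the backup and replication that the package is meant to provide.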

Preserv hopes to have the opportunity to continue to investigate this structured range of preservation services, and to build the passive preservation package, beyond the end of its current phase in January 2007. If not, surely another project will.

Clear and present danger?

Preservation is a scary business. It's a long-term business. Or so it is believed. Preservation services are expected to take on extensive responsibilities and commitments. Rusbridge has suggested how these expectations should be moderated. Yet on the long-term view it's also potentially an expensive business, not dissimilar to taking out insurance cover. The problem is that on this basis the risks and costs of digital preservation are difficult to quantify, and without being able to quantify them preservation services find it hard to attract paying customers, even though the principle of preservation is well understood by most content managers. As a result some services have resorted to scare stories to raise interest levels. Often these stories are aimed at government and large funders rather than individual content owners, typically targeting concerns over national heritage.

For IRs the picture, and scale, are different. IRs are relatively new (since 2000) so the heritage angle is limited today, and they are there principally to provide immediate open access to the content that is deposited. Scare stories are not an option because (1) they would discourage deposits, and (2) there is nothing to be scared about at the moment. The problem is manageable while IRs are relatively new.

The first challenge for IRs is generating and acquiring content, and to do that the IR will need a policy. Don't stop with content policy, however. Preservation policy will naturally flow from that analysis. Preservation begins on ingest (actually, preservation should begin with authors, but let's be practical for now), at the point of deposit, and based on the preservation process above there are simple and cost-effective actions that IRs can take. It won't be scary if IRs develop appropriate policies and engage with preservation service providers at an early stage. Services such as PRONOM-ROAR shouldn't just lead to a restructuring of preservation models, but of philosophy and expectations too. Preservation isn't long-term, it's progressive, with simple and practical steps taken now paying dividends later.

Steve Hitchcock, Preserv project
This page created on 1 December 2006