Institutional repository preservation contexts
Digital preservation for institutional repositories (IRs) has become
muddled. Some open access (OA) advocates say that repositories are for
OA and therefore do not need to be concerned with, or distracted by, the needs
of preservation (because either there is not enough content to be
preserved yet, or open access is about copies of journal papers that
are being preserved elsewhere by publishers). Advocates for digital
preservation argue that this is a responsibility of repository
software, without specifying what is required, for what and by whom.
With the number of repositories growing relentlessly and the volume of
content growing accordingly, digital preservation is no longer a
distraction for repositories, and we are beginning to find practical
solutions to undo the muddle and answer the key questions, beginning
with the most basic question.
1 What is digital preservation?
Digital preservation is 90% planning, administration and management;
the remaining 10% is purely technical: "Determining which preservation
strategies will be developed and when and in what circumstance the
strategies will be implemented is the essence of digital preservation."
(from the Cornell tutorial, Digital Preservation Management)
The goal of preservation is easier to state: to ensure content can
continue to be accessed and used over time. The means by which this may
be achieved vary widely, from reliable storage, backup and replication,
to security and recovery, to managing different storage media (tape,
discs of various formats, hard drives) and other forms of
technology preservation, to format migration or emulation as ways to
deal with the obsolescence over time of software applications and
formats. Any one of these could be part of a digital preservation
strategy, although none alone would be sufficient, and cost becomes an
issue if many means of preservation are required.
2 What IR content should be preserved?
Whatever the institution decides. An institutional repository (rather
than a repository in an institution) is the responsibility of the
institution. Forget the arguments for or against
preserving IR contents. Ultimately the institution decides what the
repository is for, how and to what extent it is funded, what content to
admit, and consequently what content to preserve and how.
3 How can institutions define what must be preserved?
This should all emerge naturally from mission statements and repository
policy, including preservation policy. The problem, as surveys have
shown, is that two-thirds of repositories have no policy, and even
fewer have a preservation policy. This is not a basis for making
preservation decisions. If there is an argument for not being concerned
about preservation for IRs yet, this is it: lack of policy.
Preservation should not precede policy. Repositories can begin to build
policy using this tool. We anticipate that serious IRs will assume
responsibility for preservation for most or all of the content that is
allowed to be admitted.
4 Should preservation be built into repository software?
Repository software will never be a complete solution to all
preservation requirements, given the broad scope of digital
preservation activities described above and that preservation is only
in some part technical. Repository software such as EPrints
increasingly supports preservation activities, by recording
preservation metadata and audit histories of changes to digital
objects, and by supporting dissemination formats such as METS. In this
way repository software of most types provides a sound foundation for
managing institutional digital content in the longer term and, through
software interfaces, a starting point for interacting with more
specialised and extensive preservation services.
5 Who should perform technical preservation activities?
Specialist preservation services. Each country has well known
organisations with long experience of preservation, not just digital
preservation: the national libraries, the national archives, etc. There
have always been many more content creators and publishers than
preservation agencies. Why should this be any different for digital
repositories? We can cut the preservation muddle by recognising that
repositories and repository software do not need to become preservation
specialists. Leave it to the experts. The Preserv project has been
working with three organisations with this expertise: the British
Library, The National Archives and Oxford University. Other projects
are investigating preservation services for repositories (Sherpa-DP,
Repository Bridge, ECHO DEPository Project) or preservation services
for digital content more generally (Shared Infrastructure Preservation
Models, Cornell Format & Media Migration Service, and the MetaArchive
project). Others have described service-oriented preservation
architectures. It's surprising there aren't more examples of
preservation services, but there are enough to put an end to the
preservation-by-repository-software canard.
6 What services are available?
Some are available experimentally, but none commercially. It has been
announced that Dutch university repositories will be curated by the
National Library of the Netherlands.
Preserv has been exploring models for preservation services. These
models have tended to be based on super-aggregated 'black box'
approaches in which the IR contracts one-to-one for the service
provider to provide a complete preservation solution: to download,
store and act on the content to be preserved. We now see how a new,
more flexible and cost-effective approach may be possible based on
interacting Web services. An example of this is the PRONOM-ROAR
repository file format identification service. PRONOM is a Web
services-based file format identification tool. To enable PRONOM to
interact with many repositories, what's needed is a database of
repositories and a means of communicating with those repositories in an
automated way. Both are provided by the Registry of Open Access
Repositories (ROAR) and the Celestial service (an OAI-PMH
harvesting/caching tool), both developed by Tim Brody of the Preserv
team at Southampton University. PRONOM-ROAR provides file format
profiles (Preserv profiles) for 200+ repositories.
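
As a rough illustration of what a file format profile contains, the
following Python sketch tallies formats across a list of file names. It
is not how PRONOM-ROAR works internally, which is not described here. A
real identification service would typically inspect file contents
rather than trust file names. The extension table (EXTENSION_FORMATS),
the function names and the sample file names are all hypothetical,
chosen only to show the shape of the output.

```python
from collections import Counter
from pathlib import PurePosixPath

# Hypothetical extension-to-format table, for illustration only.
# A real identification service would not rely on file names alone.
EXTENSION_FORMATS = {
    ".pdf": "Portable Document Format",
    ".html": "Hypertext Markup Language",
    ".doc": "Microsoft Word Document",
    ".ps": "PostScript",
}


def identify_format(filename):
    """Guess a format name from the file extension (case-insensitive)."""
    suffix = PurePosixPath(filename).suffix.lower()
    return EXTENSION_FORMATS.get(suffix, "unidentified")


def format_profile(filenames):
    """Count how many files fall under each identified format."""
    return Counter(identify_format(name) for name in filenames)


# Hypothetical file names standing in for one repository's content.
sample_files = ["thesis.pdf", "paper.PDF", "index.html", "data.xyz"]
profile = format_profile(sample_files)

for format_name, count in profile.most_common():
    print(f"{format_name}: {count}")
```

The "unidentified" bucket matters as much as the named formats: files
that cannot be identified are the ones that most need a closer look.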
7 What happens next?
The range of services will be expanded. Format identification is only a
first step towards a preservation strategy. The question is what you do
with this information. Format IDs need to be verified, and file formats
may need to be migrated to other formats in the event of obsolescence.
This is where preservation services can help.
In an effort to reengineer workflow in the creation, management and
preservation of electronic records, The National Archives in 2004
initiated a programme called Seamless
Flow. Applying this approach to repository content suggests the
following structured process for active preservation:
- Characterisation: identification (as in PRONOM-ROAR), validation,
and property extraction
- Preservation planning: e.g. risk assessment (of generic risks
associated with particular formats/representation networks), technology
watch (monitoring technology change impacting on risk assessment),
impact assessment (impact of risks on specific IR content),
preservation plan generation (to mitigate identified impacts, e.g.
migration pathways)
- Preservation action: e.g. migration (including validation of the
results), either as ongoing preservation intervention to ensure
continued access or as on-demand preservation action, performing
migrations or supplying appropriate rendering tools at the point of
user access.
For example, building on PRONOM-ROAR, the aim is to provide, once the
PRONOM database supports it, a 'technology watch' service that can warn of
file formats that are at risk of becoming inaccessible.
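
A minimal sketch of how such a warning might be derived, assuming a
format profile (format name mapped to file count) and a risk register
(format name mapped to a risk level). The register, its three risk
levels, the threshold parameter and the placeholder format names are
all assumptions made for illustration. They do not describe the PRONOM
database or any planned service.

```python
# Hypothetical ordering of risk levels, lowest to highest.
RISK_LEVELS = {"low": 0, "medium": 1, "high": 2}


def formats_at_risk(profile, risk_register, threshold="medium"):
    """Return (format, file count, risk) tuples needing attention.

    A format is flagged when its risk is at or above the threshold,
    or when it is missing from the register altogether.
    """
    flagged = []
    for format_name, count in profile.items():
        risk = risk_register.get(format_name, "unknown")
        if risk == "unknown" or RISK_LEVELS[risk] >= RISK_LEVELS[threshold]:
            flagged.append((format_name, count, risk))
    # Largest holdings first, so the biggest exposure is seen first.
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Placeholder data: neither the formats nor the risks are real assessments.
profile = {"Format A": 120, "Format B": 15, "Format C": 3}
risk_register = {"Format A": "low", "Format B": "high"}

for format_name, count, risk in formats_at_risk(profile, risk_register):
    print(f"Warning: {count} file(s) in {format_name} (risk: {risk})")
```

Treating unassessed formats as warnings is a design choice in this
sketch: it errs on the side of reporting too much rather than too
little.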
Active preservation services could be supplemented by a passive
preservation package (aka bitstream preservation) offering core
functionality for any repository system, including security and access
control, integrity, storage management, backup and disaster recovery.
This package would aim to offer more resilient and comprehensive
preservation storage than could be cost-effectively provided locally,
or alternatively a cheaper backup solution that simply provides an offsite
alternative to that provided locally by an institution.
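
The integrity element of passive preservation is commonly handled with
checksums: record a digest when content is deposited, then recompute it
later to detect any change to the bitstream. The Python sketch below
shows the idea using the standard hashlib module. The choice of
SHA-256, the function names and the temporary file are assumptions for
illustration, not part of any Preserv design.

```python
import hashlib
import tempfile
from pathlib import Path


def checksum(path, chunk_size=65536):
    """Compute a SHA-256 digest, reading in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path, recorded_checksum):
    """Return True if the file still matches the checksum recorded earlier."""
    return checksum(path) == recorded_checksum


# Demonstration on a throwaway file standing in for a deposited object.
with tempfile.TemporaryDirectory() as workdir:
    deposit = Path(workdir) / "deposit.bin"
    deposit.write_bytes(b"original content")

    recorded = checksum(deposit)               # stored at ingest
    intact_before = is_intact(deposit, recorded)

    deposit.write_bytes(b"corrupted content")  # simulate damage
    intact_after = is_intact(deposit, recorded)

print("Intact after ingest:", intact_before)
print("Intact after damage:", intact_after)
```

A check like this only detects damage. Repairing it depends on the
backup and replication also listed above.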
Preserv hopes to have the opportunity to continue to investigate this
structured range of preservation services, and to build the passive
preservation package, beyond the end of its current phase in January
2007. If not, surely another project will.
Clear and present danger?
Preservation is a scary business. It's a long-term business. Or so it
is believed. Preservation services are expected to take on extensive
responsibilities and commitments. Rusbridge has suggested how these
expectations should be moderated. Yet on the long-term view it's also
potentially an expensive business, not dissimilar to taking out
insurance cover. The problem is that on this basis the risks and costs
of digital preservation are difficult to quantify, and without being
able to do that preservation services find it hard to attract paying
customers, even though the principle of preservation is well understood
by most content managers. As a result some services have resorted to
scare stories to raise interest levels. Often these stories are aimed
at government and large funders rather than individual content owners,
typically targeting concerns over national heritage.
For IRs the picture, and scale, are different. IRs are relatively new
(since 2000) so the heritage angle is limited today, and they are there
principally to provide immediate open access to the content that is
deposited. Scare stories are not an option because (1) they would
discourage deposits, and (2) there is nothing to be scared
about at the moment. The problem is manageable while IRs are relatively
new.
The first challenge for IRs is generating and acquiring content, and to
do that the IR will need a policy. Don't stop with content policy,
however. Preservation policy will naturally flow from that analysis.
Preservation begins on ingest (actually, preservation should begin with
authors, but let's be practical for now), at the point of deposit, and
based on the preservation process above there are simple and
cost-effective actions that IRs can take. It won't be scary if IRs
develop appropriate policies and engage with preservation service
providers at an early stage. Services such as PRONOM-ROAR should lead
to a restructuring not just of preservation models, but of philosophy
and expectations too. Preservation isn't long-term, it's progressive,
with simple and practical steps taken now paying dividends later.
Steve Hitchcock, Preserv project
This page created on 1 December 2006