Preservation Metadata for Institutional Repositories: applying PREMIS, January 2007
Introduction
Metadata designed for managing digital content over a long period of
time is commonly referred to as 'preservation metadata', and typically
informs, describes and records a range of activities concerned with
preserving specific digital objects.
The broad aim of digital preservation is to ensure that content
remains accessible regardless of changes in hardware and software
technologies, notably presentation formats, and of changes in
organisational responsibilities for managing the content, and to
mitigate environmental risks (e.g. Bradley 2005).
What characterises many approaches to digital preservation is the
implicit assumption
that, from data generation to input to archiving, preservation is
managed within an expert and specialised preservation environment.
Digitisation projects are good examples of this approach. This paper
considers the case where data creation, deposit and content management
in a repository will
be performed by a range of players, many non-specialists from a
preservation viewpoint. Institutional repositories (IRs) are such a
case, where authors of
research papers, for example, 'self-archive' their works in the IR, and
where the management of the IR often has no specialist preservation
skills (see Hitchcock et al.
2006). It is proposed that in
such cases the IR might contract preservation
requirements to an external service provider. Thus preservation
metadata here must inform not just long-term management of the data but
also the relationship between the IR and the service provider.
This paper provides a brief overview of preservation metadata, which we
then seek to develop for an application involving IRs. Currently, the
authoritative reference on preservation metadata is the
Preservation Metadata Implementation
Strategies Data Dictionary (PREMIS 2005), on which we have focussed
our initial investigation. Since some seek
to make IRs a malleable concept, we explain some background that
establishes the role of IRs, and place IRs in a preservation
context leading to the introduction of three OAIS-based models. Here
we focus on one of these models, the service provider model, to
frame the analysis of the PREMIS metadata set for this
application.
"It is difficult to anticipate the
metadata needed to support technical
and administrative processes that are not
fully developed, are
not fully tested, and in some ways, are not even fully understood.
Compounding the problem is the proviso that preservation metadata
recommendations must
be restrained by economic realities. Creating and maintaining metadata
is expensive, so
any recommended
preservation metadata elements should be backed by persuasive evidence
of necessity, as well as practical means for populating them" (Lavoie
and Gartner, 2005).
In this paper we begin our investigation of
how to support preservation metadata for the practical, very real and
growing content of IRs, with reference to a number of relevant
business
models built into our OAIS examples.
What is preservation metadata?
"Preservation metadata is the information necessary to carry out,
document, and evaluate the processes that support the long-term
retention and accessibility of digital materials." (PREMIS 2005)
In terms of digital technology and the
widespread creation of digital materials, preservation metadata has had
a respectable period
of gestation and development. Various generic approaches have been
identified within projects, with the baton seemingly passing
periodically from
one group to another (e.g. National Library of Australia 1999;
NEDLIB, Lupovici and Masanès 2000; CEDARS 2002). The
OCLC/RLG Working Group on Preservation Metadata (2002)
introduced an
international consensus,
while implementations that emerged, such as that at the National Library
of New Zealand
(2002),
were largely application-specific in nature. According to Lavoie
and Gartner (2005) the earlier efforts "largely were speculative in
nature, seeking to anticipate the metadata needs of programmatic
digital preservation initiatives that would emerge in the future. On
the other hand, development of the more recent element sets, such as
OCLC, NLNZ were more closely aligned with planning and implementation
of 'production' digital archiving systems."
Given the changing descriptions, slightly different terminologies and
the fuzziness of the overlap between preservation metadata and other
forms of more widely used metadata, such as metadata for resource
discovery, administrative metadata, etc., it can be quite hard to
unravel the different perspectives, although a review by Day (2003a)
makes a good attempt.
Fortunately, a more coherent view has appeared in what is currently
the authoritative reference on preservation metadata, the PREMIS Data
Dictionary (2005). This is based, as the full name indicates, on the
idea of implementation, and for the first time provides, on
examination, a thorough, rigorous and comprehensive set of preservation
metadata elements that a "working archive
needs to support the
functions of ensuring viability, renderability, understandability,
authenticity, and identity in a preservation context." (Guenther
2004)
Evolution of institutional repositories: OAIRs
Before examining PREMIS for an IR application it is helpful to consider
what
IRs are and what they do, because like preservation metadata this
relatively recent development has been subject to misunderstanding,
confusion and attempted re-inventions.
The impetus for IRs could be said to have emerged from the Open
Archives
Initiative (OAI) in 1999, not to be confused in preservation terms with
the Open Archival Information System (OAIS; Hirtle 2001). Although
institutionally-based, or more typically departmental, 'archives' were
known before this, especially in areas such as computer science and
economics which were served by NCSTRL and RePEc, respectively, OAI
introduced the Protocol for Metadata Harvesting (OAI-PMH) to provide
common services that could operate over more
general, independent sites (Lynch 2001). Search is the most obvious
example of
such a service. OAI-PMH enables compliant sites to be interoperable, thus making
institutional, rather than only disciplinary, repositories visible and
viable. For
the first time institutions such as universities have the ability to
capture, store and disseminate
copies of the
published work of their own researchers. The significance of this
should not be underestimated.
OAI was aimed initially at eprint archives (Van de Sompel and Lagoze
2000), and although the protocol was
soon widened to handle other digital library content, the first
software to
support it was EPrints, on which we
base our work here at Southampton. EPrints is software for building IRs
that capture and
provide open access to an
institution's research outputs, which are deposited directly by
authors, in principle using the
version they created, a process known as 'self-archiving'.
EPrints first appeared in 2000, and an OAI-PMH 1.0-compliant version
was
announced on the
same day this breakthrough version of the protocol was unveiled in
January 2001 (Harnad 2001, OAI 2001). This application to institutional
archives, or
repositories, was reinforced with the emergence of DSpace software a
year later. A large number of repositories have been built using
software such as EPrints, DSpace and others (see the
Registry of Open Access Repositories, undated).
Despite the growth of such archives, interest in the wider use of
repositories has led periodically to attempts by newcomers to broaden
the term 'institutional repositories' for other purposes.
Rather than redefine IRs, a better starting point is the broader term,
a digital
repository, characterised by Heery and Anderson (2005):
- content is deposited in a repository, whether by the content
creator,
owner or third party
- the repository architecture manages content as well as metadata
- the repository offers a minimum set of basic services e.g. put,
get,
search, access control
- the repository must be sustainable and trusted, well-supported
and
well-managed
An institutional
repository, based on experience and practice, builds on the above
definition and more specifically can be characterised by type of
content, the
source of that content, and its dissemination:
- content is the research
outputs of an institution
- submission to the archive is typically by author self-archiving
- dissemination is primarily through common services, such as
search, but in particular by OAI-PMH services
With growing interest in digital repositories for various purposes
within institutions, such archives might be more helpfully and
specifically referred to as Open Access IRs (OAIRs), or eprints
repositories (James et al.
2003). As such,
where digital preservation might generally be
concerned with preserving access, for OAIRs it is concerned with
preserving open access.
What are the implications for preservation of repositories of this
type? In this context Hitchcock et al.
(2005) identified two
principal features:
- Heterogeneous data formats (Hitchcock et al. 2006)
- Low cost per item deposited. IRs must
keep costs low enough not to jeopardize open access.
This highlights the need for automation in the collection of
preservation metadata, and in the
efficient labelling, selection and delivery of content to preservation
services.
This characterisation of OAIRs for preservation can perhaps also be
informed by comparison with a large-scale
preservation test, the Archive Ingest and Handling Test.
The
AIHT practical preservation strategy will require "mechanisms for
continuous transfer of content from the wider world into the hands of
preserving institutions. The AIHT is designed to test the feasibility
of transferring digital archives in toto from one institution to
another" (Shirky 2005). This approach involving more than one agency in
content
management parallels our service provider model outlined below. AIHT
reveals important practical experience, although there are
some differences with anticipated preservation
service models for IRs. For example, in AIHT:
- There is no scope for interaction between creator and archive
- There is no moderated ongoing transfer process or protocol, just
a single disc of compressed data containing all files
- There is no business model (i.e. who is doing what for whom, and
why)
- The scope of the test archive may or may not reflect a typical
profile of an IR
This description of what IRs are and what they represent is not to say
that the role and target content of IRs won't evolve
legitimately and in an informed way to serve institutional needs and
research purposes, as suggested by Dempsey (2006). It is likely that
other types of content, such as research data sets
(Lyon et al. 2004), will be
deposited and managed within OAIRs, but not all
types of content produced in universities -- teaching and learning
materials, administrative documents, for example -- are best stored in
OAIRs. Such materials may require more specialised submission and
updating facilities, and may need to restrict dissemination.
Three OAIS preservation models for IRs
Having defined the preservation task in terms of preservation metadata,
and the target content for preservation in terms of OAIRs, we can
consider the types of services that can be offered. The OAIS reference
model
(Figure
1a) provides a framework in which we can construct these services (OAIS
2002).
At a very general level it can be seen that IRs provide a range
of functionality similar to that found in OAIS -- input and output, data management
and storage. OAIS imposes more formality and discipline on these
processes for the purpose of long-term preservation. Thus deposit
becomes ingest, and we are
concerned with archival storage,
all enveloped by preservation planning, administrative and management
roles. To understand these distinctions and these support processes,
see the Cornell tutorial (2003).
Information in this system is managed in packages: submission
information packages (SIPs) at point of ingest, archival information
packages (AIPs) in the preservation store, and dissemination
information packages (DIPs) for access by users or other services.
Within the types of
services we could construct we wish to support a range of business
models to allow IRs some flexibility in managing the preservation risk
in terms of their real resources:
- service provider model (service provider is OAIS, Figure 1b): the
original and core project model (Hitchcock et al. 2005)
- institutional model (institution is OAIS, Figure 1c): an institution
may have more than one repository, e.g. EPrints-Fedora
- software model (repository is OAIS, Figure 1d): preservation features
built into IR software, e.g. DSpace
Figure 1. Three preservation models based on OAIS: a, Base OAIS functional
model; b, Service provider model; c, Institutional model; d, IR model
The basis of the three service models in the formal OAIS model is
apparent in Figure 1. Representations of the OAIS reference model are
ubiquitous in
the digital preservation literature and may differ in presentation if
rarely in detail; for reference, this version (Figure 1a) was taken
from a
presentation by Day (2003b). The changes in the service models are
shown
in red and are all focussed on the ingest-data management-archival
storage roles and the relations between these as shown by the
connecting arrows. In the service provider model a case could be made
to re-introduce the arrow connecting the service provider and the
access point (e.g. EVIE 2006), depending on the agreement between the
IR and
service provider partners.
The three models illustrated have no specific costs attached, but
represent a hierarchy in terms of level of cost that might be incurred
to support preservation, based on Chapman's (2003) observation: "though
quantity, quality and size of the digital materials ingested
has
an impact on scale, the cost of long term digital sustainability
correlates more to the range of digital services offered." The range of
services offered in the first model is potentially greater, and
more flexible, than in the latter two, with the software model providing a
baseline requirement.
Other models might include the federated model (where the federation is
OAIS). A
prominent federated example is LOCKSS (Rosenthal et al. 2005), which focusses on
journal
applications rather than more heterogeneous collections such as in
OAIRs.
Since it would be best to pursue such a model within LOCKSS rather than
re-invent it, this model is beyond the immediate concern of this
project, but could be considered if the opportunity arose.
Preservation service provider model
The preservation service provider model was broadly outlined in terms
of shared, or third-party, preservation services by Beagrie (2002),
while RLG-OCLC (2002) reported the need for third-party
preservation services to fulfill the need for trusted digital
repositories. This model was proposed
for IRs by James et al.
(2003). Referring to this as a disaggregated
OAIS-compliant model,
Knight (2005) extended the idea for a model-based, rather than
evidence- or experience-based, analysis. Knight presented a detailed
breakdown of the model and workflow from the service provider's
perspective. Experience is likely to
bring both more complexity and more clarity.
A similar although less detailed service provider model to be adopted
in the Preserv project was developed in stages in Hitchcock
(2005). This model is formalised in Figure 2. A notable feature of the
illustrated model is the integration
of an automated file format identification tool, PRONOM-DROID,
developed by The National Archives (Brown 2005). The service provider
model also fits well with an OAI
application, which as we have seen is core to IR software, as OAI is
predicated on the data provider-service provider relationship (Lynch
2001).
Figure 2. Schematic of Preserv service provider model, showing IR functions,
format ID tool and OAI interface to preservation service provider
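To make the data provider-service provider relationship more concrete, the following is a minimal sketch of how a preservation service provider might construct OAI-PMH harvesting requests against an IR. The verb and argument names (ListRecords, metadataPrefix, from, set, resumptionToken) are those of the OAI-PMH protocol; the base URL and the function name are hypothetical, chosen only for illustration, and the sketch does not describe how Preserv itself implements the interface.

```python
from urllib.parse import urlencode

# Hypothetical OAI-PMH base URL of an IR (an assumption, not a real endpoint).
BASE_URL = "https://repository.example.org/cgi/oai2"


def build_harvest_url(base_url, metadata_prefix="oai_dc", from_date=None,
                      set_spec=None, resumption_token=None):
    """Build a ListRecords request URL for an OAI-PMH data provider."""
    if resumption_token is not None:
        # In OAI-PMH, resumptionToken is an exclusive argument: it is sent
        # with the verb only, to continue an incomplete list response.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if from_date:
            # Selective harvesting: only records changed since this date.
            params["from"] = from_date
        if set_spec:
            params["set"] = set_spec
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_harvest_url(BASE_URL, from_date="2007-01-01"))
```

Date-based selective harvesting of this kind is one way a service provider could collect only new or changed records on each visit, which bears on the low cost per item that IRs require.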
As in Knight (2005), the Preserv model as presented easily lends itself
to
analogy
with the ubiquitous OAIS representation. In terms of the main OAIS
functionality -- ingest, data management, storage,
dissemination, etc. -- these models highlight how responsibilities
might be shared between partners. For example, in the service provider
model (OAIR-SP) the IR could be OAIS-compliant, but it need not
necessarily be if
the service provider delivers that compliance. At the other extreme, in
the software model
where there is no other partner, the IR clearly has to be OAIS-aware to
provide a minimal level of compliance. There are
essentially three variations:
- the whole illustrated model forms an OAIS unit (as in Figure 1b)
- both partners -- IR and service provider -- are OAIS-compliant
- the service provider is OAIS-compliant
In IR terms, however, the formalisation of the deposit interface to
embrace
the requirements of OAIS ingest has particular significance: "until it
becomes common practice to integrate
digital stewardship and preservation concerns into the entire digital
content lifecycle -- especially front-end content creation --
most digital preservation workflows intended to be inclusive will be
reactive instead of prescriptive." (Anderson et al. 2005)
In Preserv the main service
provider partner is the British Library, which of course will offer an
OAIS service. Thus the second of the three variations is most likely to
be the case. Figure 3 shows two simplified, conjoined OAIS models
representing the IR and the service provider. The OAIS administrative
functions are shown shared between the two partners pending further
investigation into this model to determine practical allocations.
Figure 3 also explodes the service provider into a range of optional
services, which are described by Hitchcock et al. (2006).
Figure 3. Two OAIS repositories in Preserv preservation service provider
scenario
The model chosen enables us to
analyse and select preservation metadata elements from the PREMIS
Dictionary.
This paper will focus on the service provider model, while the other
two models will be developed in later papers.
Developing a preservation metadata set for IRs
"in contrast to the support for resource discovery metadata, managers
of e-print repositories have practically no preservation metadata
support provided by the common repository software packages." (James et
al. 2003)
The aim of this investigation is to rectify that omission, bearing in
mind that the IR software is only one player in our preservation
service provider scenario. In building on the detailed PREMIS metadata
set, critically we have identified in the
preliminary selection not just those elements that might be needed in
an IR preservation scenario, but the sources of the necessary metadata
in terms of the service provider model outlined above (Figure 2):
author/IR
submitter, IR software and associated tools (in this case EPrints and
PRONOM-DROID), IR policy profile and service provider.
There are precedents for the first three sources. Prior to producing
the data dictionary, PREMIS surveyed
repositories. Despite using a somewhat leading
questionnaire, it discovered (Caplan 2004):
- Three-quarters of all
repositories obtained
metadata from their depositors and the same number extracted metadata
automatically by program.
- Automatic extraction by repository
software was most often limited to
technical metadata – size, file format, and file characteristics stored
in file headers.
- Nearly two-thirds of the respondents
also had
some metadata supplied by repository staff, either through manual data
entry or by automatic derivation from bibliographic databases.
The
validity and need for some elements will be determined by IR policies.
Broadly, a preservation profile can be developed for an IR, including
content formats, usage, as well as policy profiles (Hitchcock et al.
2006). Such a profile can be used
to guide selection of preservation services by institution. The Preserv
project has IR partners, at Southampton and Oxford, and is building
profiles of both of these IRs together with profiles of other EPrints
IRs.
In the PREMIS data model there are
five
types of entity: intellectual entities, objects, events, agents and
rights. While "Intellectual entities
and agents are not
fully described", the majority of the
entries in the data dictionary involve
objects and events (Guenther 2004):
- Objects: "Semantic units associated with Object entities include identifiers,
environment information (e.g., hardware and software), location
information, technical characteristics that apply regardless of format
(e.g., fixity, size, significant properties, inhibitors, creating
application information), and relationships to other objects."
- Events: "Digital provenance metadata
is centered
around events that have acted upon objects and is intended to record
processes during the period of archival retention. Semantic units
include event identifier, event type (e.g., compression, fixity check,
migration, validation, etc.), event outcome, event date/time, and
related agents."
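As an illustration of how objects and events relate, the sketch below models a few of the semantic units named in the quotations above as simple Python records. It is a simplified reading under stated assumptions, not the PREMIS schema: the class names, field names and the record_event helper are hypothetical, and only a handful of semantic units are included.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class PremisObject:
    """A cut-down Object entity (hypothetical field names)."""
    object_identifier: str
    object_category: str                # bitstream, file or representation
    size: Optional[int] = None          # size in bytes, if known
    format_name: Optional[str] = None
    linking_event_identifiers: List[str] = field(default_factory=list)


@dataclass
class PremisEvent:
    """A cut-down Event entity (hypothetical field names)."""
    event_identifier: str
    event_type: str                     # e.g. migration, fixity check
    event_date_time: datetime
    event_outcome: Optional[str] = None
    linking_object_identifiers: List[str] = field(default_factory=list)


def record_event(obj, event_identifier, event_type, outcome=None, when=None):
    """Create an event acting on obj and link the two in both directions."""
    event = PremisEvent(
        event_identifier=event_identifier,
        event_type=event_type,
        event_date_time=when or datetime.now(timezone.utc),
        event_outcome=outcome,
        linking_object_identifiers=[obj.object_identifier],
    )
    obj.linking_event_identifiers.append(event.event_identifier)
    return event
```

For example, a service provider recording a fixity check on a deposited file would call record_event with the object, a locally generated event identifier and the event type, leaving the object with a link to the event and the event with a link back to the object.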
Selection and ongoing refinement of the preservation metadata
set is not
only
concerned with managing preservation but should also consider
minimising
the costs of preservation actions and services. These major entity
types in PREMIS, and the rights entity, can be aligned with what
according to James et
al. (2003) are the most significant factors affecting costs in
eprint
repositories:
- Potential costs involved in managing proprietary formats should
repositories choose to accept whatever is offered to them (the Object
elements within PREMIS)
- The cost of creating additional metadata, particularly that
associated
with the technical and administrative needs for long-term management of
e-prints (the Event elements)
- The cost of negotiating rights (the Rights elements)
Particular attention, then, should be paid to elements in these
categories
that will assist the efficient collection, generation and delivery of
the necessary metadata to the service provider to control these costs.
With regard to rights, according to PREMIS the
minimum core rights information that a
preservation repository must know is the permissions that have been
granted to the repository itself to carry out actions related to
objects within the repository. It should be noted that OAIRs
could be considered
a special case as far as the deposit of papers published elsewhere, in
journals and proceedings for example, is concerned. In these cases the
IR should require
only a simple licence agreement with the author. Several US
institutional repositories have publicly available
agreements, e.g. Caltech, California Digital Library, DSpace at MIT.
For
preservation purposes any rights statement should be extended
minimally to allow copying and uses prescribed by the service provider.
The DSpace at MIT licence includes the following clauses relating to
possible preservation actions within the IR, illustrated by MacColl
(2004):
"You agree that MIT may, without
changing the content, translate the
submission to any medium or format for the purpose of preservation. You
also agree that MIT may keep more than one copy of this submission for
purposes of security, back-up and preservation."
If an IR anticipates using an external preservation
service provider as described here, the MIT example may need to be
developed further in conjunction with the service provider.
Mapping PREMIS elements to the IR-service provider model
Within the detailed and lengthy PREMIS Data Dictionary the five
entity types -- intellectual entities,
objects, events, agents and
rights -- are described by entries for the main elements and
subelements. In this analysis we attempt to map these elements to the
potential metadata sources -- IR
submitter/author (via the EPrints deposit interface), IR software
(within-code EPrints), file format ID tool (PRONOM-DROID), IR policy
profile and service provider -- identified in our OAIR-service provider
model (Figure 2); the mapping is presented in Tables 1-5.
A possible additional source of metadata is environment registries
(Table 6), which
are recognised in PREMIS and other preservation activities, although
there are not yet any concrete examples of such registries based on the
broadest, most ambitious designs that could support and source the
elements identified here. PRONOM, GDFR and JHOVE are examples of more
specific registry types (e.g. for file format ID and validation) that
might be the basis of more expansive implementations. Representation
Networks are another ongoing preservation development that may inform
environment registries.
The version of the mapping presented here includes the main PREMIS
metadata elements without expanding on the
subelements, which are assumed to fall into the same source
categories unless otherwise indicated.
Since PREMIS is oriented towards implementation, this requires the
development of schema to define the use of certain elements within this
application. According to PREMIS, the schema should, where
possible, provide controlled vocabularies or codes for populating
elements, rather than relying on “free text”. In addition, the schema
should be adaptable to automated workflows for metadata collection and
management. This analysis has not yet extended to
identifying, building or including controlled
vocabularies or schema that may be required by some elements within the
preservation metadata set.
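A controlled vocabulary of the kind PREMIS recommends could be sketched as follows. This is an illustrative assumption rather than a vocabulary defined by PREMIS or by this analysis: the values are simply the event types mentioned as examples in this paper, and the names EventType and parse_event_type are hypothetical.

```python
from enum import Enum


class EventType(Enum):
    """Illustrative controlled vocabulary for eventType (not a PREMIS list)."""
    CAPTURE = "capture"
    COMPRESSION = "compression"
    DECRYPTION = "decryption"
    FIXITY_CHECK = "fixity check"
    MIGRATION = "migration"
    VALIDATION = "validation"


def parse_event_type(value):
    """Map a supplied string onto the vocabulary, rejecting free text."""
    normalised = value.strip().lower()
    try:
        return EventType(normalised)
    except ValueError:
        allowed = ", ".join(member.value for member in EventType)
        raise ValueError(
            f"{value!r} is not in the controlled vocabulary ({allowed})"
        ) from None
```

Rejecting unknown values at the point of entry, rather than storing them as free text, is what makes the element usable in automated workflows for metadata collection and management.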
Tables 1-6 map the PREMIS elements to the principal sources
in the OAIR-SP model. It should be noted that elements are not fixed in
these tables. Some may apply to more than one table, especially where
related subelements are simply wrapped in a single entry, but for
clarity we have not duplicated elements between tables. For example, it
may be necessary for the relationship element to be informed by the IR
author, but subsequent use of that information in related subelements
may be the responsibility of other sources or services.
Key to tables: O optional, R required, R* conditionally required, +
includes
related subelements from PREMIS Data Dictionary
Table 1: From the IR submitter/author (via EPrints interface)
| | PREMIS metadata elements | Part of (if not main element) | PREMIS entity type | Comment | Other OAIR-SP sources |
| O | creatingApplication + | | Object | probably needs author (e.g. would an MS-Word file generated from OpenOffice be flagged by ID tool?) | PRONOM? |
| | originalName | | Object | this refers to files uploaded with the eprint submission, named by the author and recorded by EPrints as part of the upload directions; could be a DOI? Will files be renamed by SP? | |
| R* | dependency + | environment | Object | e.g. schema; while the submitter must indicate a relationship, the related subelements may be generated elsewhere (see also table for environment registries) | |
| R* | relationship + | | Object | relationship between objects, high-level categorization, e.g. structural, transformation, and other types to be determined by SP | SP |
| R* | linkingIntellectualEntityIdentifier + | | Object | e.g. collection; while evidence of a higher entity is provided by the submitter, the ID (e.g. URI) may be generated by another service | |
Table 2: From within-code EPrints
| | PREMIS metadata elements | Part of (if not main element) | PREMIS entity type | Comment | Other OAIR-SP sources |
| R | objectIdentifier + | | Object | identifier of the eprint record (identifiers of the digital objects are their URLs) | |
| | fixity + | objectCharacteristics | Object | verifies if an object has been altered. Where is fixity check first performed? Not within EPrints currently, but a script that crawls the archive comparing files with checksums is possible | |
| | size | objectCharacteristics | Object | | |
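The comment on fixity in Table 2 suggests that a script that crawls the archive comparing files with checksums is possible. A minimal sketch of such a script follows. It is not part of EPrints: the function names, the choice of SHA-256 and the in-memory manifest are all assumptions made for illustration.

```python
import hashlib
import os


def file_checksum(path, algorithm="sha256", chunk_size=65536):
    """Return the hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Walk the archive and record a checksum for every file found."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            manifest[os.path.relpath(path, root)] = file_checksum(path)
    return manifest


def check_fixity(root, manifest):
    """Compare the archive with a stored manifest.

    Returns two sorted lists of relative paths: files whose checksum has
    changed, and files that are no longer present.
    """
    altered, missing = [], []
    for rel_path, expected in sorted(manifest.items()):
        path = os.path.join(root, rel_path)
        if not os.path.isfile(path):
            missing.append(rel_path)
        elif file_checksum(path) != expected:
            altered.append(rel_path)
    return altered, missing
```

In practice the manifest would be stored with the other preservation metadata at deposit, and each later run of check_fixity could be recorded as a fixity check event.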
Table 3: From PRONOM-DROID file format ID tool
| | PREMIS metadata elements | Part of (if not main element) | PREMIS entity type | Comment | Other OAIR-SP sources |
| | objectCategory | | Object | bitstream, file, representation; this will be "implicit" in the harvesting service | IR policy |
| | compositionLevel | objectCharacteristics | Object | e.g. compression, encryption, zip; EPrints won't tell you this, but a file format ID tool might | IR policy |
| R | format + | objectCharacteristics | Object | | |
| R | software + | environment | Object | software to render or use the object; SP decides which software environments are to be supported | |
Table 4: From IR policy
| | PREMIS metadata elements | Part of (if not main element) | PREMIS entity type | Comment | Other OAIR-SP sources |
| R | preservationLevel | | Object | Depends on "preservability", cost, etc. | SP? |
| | significantProperties | objectCharacteristics | Object | e.g. pdf + links | IR submitter |
| O | inhibitors + | objectCharacteristics | Object | inhibit access, use or migration, e.g. encryption | IR submitter, IR policy |
| R* | signatureInformation + | | Object | validates submitter, for IRs e.g. identifying authors among services, authenticating material coming from a repository, etc. These appear to be fairly 'weak' needs assuming the repository and preservation services are 'secure'. The signature itself and associated elements (e.g. keyInformation) would be generated by an appropriate tool to be decided by IR/SP policy | SP policy |
| R | permissionStatement + | | Rights | while the author is the ultimate arbiter of which permissions to grant, the IR policy sets a framework for standardising permissions by type of object to cover preservation requirements; the SP records and formalises the management of this information (see permissionStatementIdentifier) | IR submitter, SP |
| R | permissionGranted + | permissionStatement | Rights | actions the grantingAgent allows the preservation repository, using controlled values | |
Table 5: From preservation service provider
| | PREMIS metadata elements | Part of (if not main element) | PREMIS entity type | Comment | Other OAIR-SP sources |
| R | storage + | | Object | direction to locate object stored in preservation repository | |
| R | storageMedium | | Object | e.g. tape, hard disk, CD-ROM, DVD | |
| R* | relatedEventIdentification + | relationship | Object | relates objects after an event, e.g. migration | |
| R* | linkingEventIdentifier + | | Object | Use to link to events not associated with relationships, e.g. format validation, virus checking | |
| R | linkingPermissionStatementIdentifier + | | Object | identifier for permission statement associated with the object (see permissionStatementIdentifier below) | |
| R* | eventIdentifier + | | Event | Events are e.g. SP actions. Each event must have unique, locally-generated ID | |
| R* | eventType | | Event | define controlled vocabulary, e.g. capture, compression, migration, decryption | |
| R* | eventDateTime | | Event | | |
| O | eventDetail | | Event | e.g. why the event occurred | |
| O | eventOutcomeInformation + | | Event | | |
| R* | linkingAgentIdentifier + | | Event | about an agent associated with an event | |
| | linkingObjectIdentifier + | | Event | about an object associated with an event | |
| R* | agentIdentifier + | | Agent | identifies the agent uniquely within the preservation repository system | |
| | agentName | | Agent | | |
| | agentType | | Agent | from controlled vocabulary, e.g. person, organisation, software | |
| R | permissionStatementIdentifier + | permissionStatement | Rights | designation used within the preservation repository system | |
| R | linkingObject | permissionStatement | Rights | objects to which permission pertains, e.g. by IR | |
| | grantingAgent | permissionStatement | Rights | identifying designation for agent (IR?) granting permission, if agent is described as entity, e.g. agentIdentifier | |
| | grantingAgreement | permissionStatement | Rights | agreement between IR and SP, as recorded by SP | IR policy |
Table 6: From environment registries
| PREMIS metadata elements | Part of (if not main element) | PREMIS entity type | Comment | Other OAIR-SP sources |
| environment | | Object | omit if bit-level preservation storage | |
| environmentCharacteristic | environment | Object | assessment of the described environment | IR policy, SP |
| environmentPurpose | environment | Object | uses supported by the environment, e.g. render, edit | |
R* | hardware + | environment | Object | hardware components needed by software, e.g. hardware performance required, does this object require a minimum hardware level? | |
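The comment on the environment element in Table 6 (omit if bit-level preservation storage) can be sketched as a simple rule. This is an illustration under stated assumptions, not a prescribed implementation: the function name, the `"bit-level"` label and the dictionary layout are hypothetical, and only the element names are taken from the table.

```python
from typing import Dict, List, Optional


def build_environment(
    preservation_level: str,
    purposes: List[str],
    hardware: List[str],
    characteristic: Optional[str] = None,
) -> Optional[Dict[str, object]]:
    """Return an environment block, or None when it should be omitted.

    Assumes the preservation level is supplied as a plain string and that
    "bit-level" is the label agreed for bit-level preservation storage.
    """
    # Table 6: omit the environment element for bit-level preservation.
    if preservation_level == "bit-level":
        return None

    environment: Dict[str, object] = {
        "environmentPurpose": list(purposes),  # e.g. render, edit
        "hardware": list(hardware),            # hardware needed by software
    }
    # environmentCharacteristic is recorded only when an assessment exists.
    if characteristic is not None:
        environment["environmentCharacteristic"] = characteristic
    return environment


# Example: a fuller preservation service records the environment,
# while bit-level storage records none.
full = build_environment("full", ["render"], ["minimum 256 MB RAM"])
bits_only = build_environment("bit-level", ["render"], ["minimum 256 MB RAM"])
```

Returning nothing for bit-level storage keeps the decision in one place, so the rest of a metadata record can be assembled in the same way whatever level of service has been agreed.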
Selection metadata
At the outset of the Preserv project we had expected, given an
apparently
umbilical connection between preservation and selection, to find
selection factors included in preservation metadata. This is not the
case. PREMIS is not concerned with selection for preservation, but with
content to be preserved.
Selection is generally regarded as a vital element of preservation
services, for reasons of cost. In the general,
hypothetical case, selection may be principled but impractical. In
terms of digital content, especially Web content, new forms of content
such as email lists, blogs and wikis (and there are many others)
raise new questions about
selection that simply cannot be answered from the standard reference
points. In fact, the selection question presents an inverted logic as
far as preservation is concerned: it is about first deciding what not to preserve in order to
identify
what to preserve. Put simply, selection used inappropriately could
easily be counter-productive, demanding greater analysis,
adding cost
and risking mis-selections.
Fortunately, IRs present a more concrete example. It could be argued
that in setting up an IR an institution commits to a responsibility for
all content that it admits. At the least, defining the types of content that
can be deposited in an IR makes the selection issue more tractable.
It is also possible to identify other factors in terms of selection.
Other parties may be interested in preservation of certain materials
found in IRs, for example, research funders who may wish to set up
alternative preservation services for these materials (e.g. PubMed UK
is being set up by the Wellcome Trust).
Authors may be invited
to identify content, or special features within content, for
preservation. This more subjective approach has been applied to
artistic and multimedia works in the PANIC project (Hunter and
Choudhury 2003). Anderson et al.
(2005) describe the TAG Team Questionnaire, which tries to build the views
of creators into preservation
decisions, and NLM has devised a set of permanence ratings (Byrnes
2000), informed to
some extent by creators and authors, to guide selection and
preservation decisions. Both examples appear more suited to management
of in-house digital library materials than to IR authors and
submitters, but could be adapted.
In the digital environment, selection questions raise new issues
that remain to be framed. So we anticipate an extension to this
investigation as we attempt to combine 'selection metadata' with
preservation metadata to assist the efficiency and automation of
preservation workflows for IRs.
Next steps
"Creating and maintaining metadata is expensive, so
any recommended preservation metadata elements should be backed by
persuasive evidence
of necessity, as well as practical means for populating them." (Lavoie
and Gartner 2005)
The Preserv project will work towards test implementations of the
identified preservation metadata sets for each of the three OAIS-based
models identified. The elements
will be refined through consultation with project partners and
stakeholders, and where
necessary schema and vocabularies will be developed to control the data
used for certain metadata elements. On the data
input side we are developing institutional profiles (Hitchcock et
al. 2006) to include IR policy profiles, and working directly
with
service providers such as the
British Library and the National Archives to identify the needs of
preservation services in terms
of these metadata elements.
This paper has focussed on mapping the PREMIS data dictionary to one of
our models: the service provider model in which the service provider is
an expert preservation agency external to the IR. In later papers we
will explore the application of the PREMIS metadata
set to the other two OAIS-based IR models - the institutional
and software models - through further implementations.
References
Anderson, Richard, Hannah Frost, Nancy Hoebelheinrich, and Keith
Johnson (2005) The AIHT at Stanford University:
Automated Preservation Assessment of Heterogeneous Digital Collections, D-Lib Magazine,
Vol. 11, No.
12, December
http://www.dlib.org/dlib/december05/johnson/12johnson.html
Beagrie, Neil (2002) A Continuing Access and
Digital Preservation Strategy for the Joint Information Systems
Committee
(JISC) 2002-2005, JISC, 01 November
http://www.jisc.ac.uk/index.cfm?name=pres_continuing
Bradley, Kevin (2005) APSR Sustainability Issues Discussion
Paper, Australian
Partnership for Sustainable Repositories - National Library of
Australia, 28 January
http://www.apsr.edu.au/documents/APSR_Sustainability_Issues_Paper.pdf
Brown, Adrian (2005) Automatic Format Identification Using PRONOM and
DROID,
The National Archives,
Digital Preservation Technical Paper: 1, 17 September
http://www.nationalarchives.gov.uk/aboutapps/fileformat/pdf/automatic_format_identification.pdf
Byrnes, Margaret (2000) A, York, England, December 6-8
http://www.rlg.org/en/page.php?Page_ID=244
Caplan, Priscilla (2004) PREMIS - Preservation Metadata -
Implementation
Strategies Update 1. Implementing Preservation Repositories for Digital
Materials: Current Practice and Emerging Trends in the Cultural
Heritage Community, RLG DigiNews,
Vol. 8, No. 5, October
http://www.rlg.org/en/page.php?Page_ID=20462#article2
Cedars (2002) Guide To Preservation
Metadata,
March
http://www.leeds.ac.uk/cedars/guideto/metadata/guidetometadata.pdf
Chapman, S. (2003) Counting the Costs of Digital Preservation:
Is Repository
Storage
Affordable?
Journal of Digital Information,
Vol. 4 No. 2, May
http://jodi.ecs.soton.ac.uk/Articles/v04/i02/Chapman/
Cornell Tutorial (2003) The OAIS Reference Model,
section 4B in Digital
Preservation Management: Implementing Short-Term Strategies
for Long-Term Problems, Cornell University, September
http://www.library.cornell.edu/iris/tutorial/dpm/
Day, Michael (2003a) Preservation metadata initiatives: practicality,
sustainability, and interoperability, ERPANET
Training Seminar on
Metadata in Digital Preservation, Marburg, Germany, 3-5
September (revised)
http://www.ukoln.ac.uk/preservation/publications/erpanet-marburg/day-paper.pdf
Day,
Michael (2003b) Integrating metadata schema registries with digital
preservation systems to support interoperability.
2003 Dublin Core Conference,
Seattle, Washington, USA, 28 September - 2
October
http://www.ukoln.ac.uk/metadata/presentations/dc-2003/day/slides-draft.ppt
Dempsey, Lorcan (2006) Networkflows, Lorcan Dempsey's weblog, January 28
http://orweblog.oclc.org/archives/000933.html
EVIE (2006) Embedding a VRE in an
Institutional Environment (EVIE), Workpackage
4: VRE Preservation Requirements Analysis, to appear
Guenther, Rebecca (2004) PREMIS - Preservation Metadata Implementation
Strategies Update 2: Core Elements for Metadata to Support Digital
Preservation, RLG DigiNews,
Volume 8, Number 6, December
http://www.rlg.org/en/page.php?Page_ID=20492#article2
Harnad, Stevan (2001) Re: Eprints Open Archive Software, posting to
american-scientist-open-access-forum, January 23
http://www.ecs.soton.ac.uk/~harnad/Hypermail/Amsci/1079.html
Heery, Rachel, and Sheila Anderson (2005) Digital Repositories Review,
UKOLN-AHDS, 19 February
http://www.jisc.ac.uk/uploaded_documents/digital-repositories-review-2005.pdf
Hirtle, Peter (2001) OAI and OAIS: What's in a Name? D-Lib Magazine,
Vol. 7 No. 4, April
http://www.dlib.org/dlib/april01/04editorial.html
Hitchcock, Steve, Tim Brody, Jessie M.N. Hey, Paul
Wheatley, Adam Farquhar and Leslie
Carr (2006) Building an institutional preservation profile. Preserv
project paper
Hitchcock, Steve, Tim
Brody, Jessie M.N. Hey and Leslie Carr (2005) Preservation for
Institutional Repositories: practical and invisible. Ensuring Long-term
Preservation and Adding Value to Scientific and Technical data (PV
2005), Edinburgh, November 21-23
http://www.ukoln.ac.uk/events/pv-2005/pv-2005-final-papers/033.pdf
Hitchcock, Steve (2005) Capturing preservation metadata from
institutional repositories. DCC
Workshop on the Long-term
Curation within Digital Repositories, Cambridge, July 6
http://preserv.eprints.org/talks/hitchcock-dcccambridge060705.ppt
Hunter, Jane and Sharmin Choudhury (2003) Implementing Preservation
Strategies for Complex Multimedia Objects. Seventh European Conference on Research
and Advanced Technology for Digital Libraries, ECDL 2003,
Trondheim, Norway, August
http://metadata.net/newmedia/Papers/ECDL2003_paper.pdf
James, Hamish; Ruusalepp, Raivo; Anderson, Sheila; and Pinfield,
Stephen (2003) Feasibility and Requirements Study on Preservation of
E-Prints, JISC, October 29
http://www.jisc.ac.uk/uploaded_documents/e-prints_report_final.pdf
Knight, Gareth (2005) An OAIS compliant model for Disaggregated
services,
SHERPA-DP Report, version 1.1, 5/09/2005
http://ahds.ac.uk/about/projects/sherpa-dp/sherpa-dp-oais-report.pdf
Lavoie, Brian, and Richard Gartner (2005) Preservation metadata. DPC
Technology Watch Series Report 05-01, September
http://www.dpconline.org/docs/reports/dpctw05-01.pdf
Lupovici, Catherine, Julien Masanès (2000) Metadata for long
term-preservation, Nedlib Consortium, July
http://www.kb.nl/coop/nedlib/results/D4.2/D4.2.htm
Lynch, Clifford (2001)
Metadata Harvesting and the Open Archives Initiative. ARL Bimonthly
Report, No. 217, August http://www.arl.org/newsltr/217/mhp.html
Lyon, Liz, Heery,
Rachel, Duke, Monica, Coles, Simon J., Frey, Jeremy G., Hursthouse,
Michael B., Carr, Leslie A. and Gutteridge, Christopher J.
(2004)
eBank UK: linking research data, scholarly
communication and learning.
In All Hands Meeting 2004,
Nottingham, 31 Aug - 03 Sep 2004
http://eprints.soton.ac.uk/8183/
MacColl, John (2004) DSpace Institutional Repositories
and Digital Preservation, DPC Forum
on Digital Preservation in
Institutional Repositories, London, 19th October, slide 5
http://www.dpconline.org/docs/events/041019maccoll.pdf
National Library of Australia (1999) Preservation
Metadata for Digital
Collections, 15 October 1999
http://www.nla.gov.au/preserve/pmeta.html
National Library of New Zealand (2002) Metadata Standards Framework –
Preservation Metadata, November
http://www.natlib.govt.nz/files/4initiatives_metaschema.pdf
OAI (2001) Open Meeting, Washington DC, January 23
http://www.openarchives.org/meetings/DC2001/OpenMeeting.html
OAIS (2002) Reference
Model for an Open Archival Information System (OAIS),
Consultative Committee for Space Data Systems, CCSDS 650.0-B-1, Blue
Book, Issue 1, January, adopted as ISO 14721:2003
http://ssdoo.gsfc.nasa.gov/nost/wwwclassic/documents/pdf/CCSDS-650.0-B-1.pdf
OCLC/RLG Working Group on Preservation Metadata (2002) Preservation
Metadata
and the OAIS Information Model: A Metadata Framework
to Support the
Preservation of Digital Objects, June
http://www.oclc.org/research/projects/pmwg/pm_framework.pdf
PREMIS (2005) PREservation Metadata:
Implementation Strategies
Working
Group Data Dictionary for Preservation Metadata: Final Report of
the
PREMIS Working Group, May
http://www.oclc.org/research/projects/pmwg/
Registry of Open Access Repositories (ROAR), School of Electronics and
Computer Science, University of Southampton (undated) http://archives.eprints.org/
RLG-OCLC (2002)
Trusted Digital Repositories:
Attributes and Responsibilities, An
RLG-OCLC Report, May
http://www.rlg.org/longterm/repositories.pdf
Rosenthal, David S. H., Thomas Lipkis, Thomas S. Robertson, and
Seth
Morabito (2005) Transparent Format Migration of Preserved Web Content, D-Lib Magazine, Vol. 11 No.
1, January
http://www.dlib.org/dlib/january05/rosenthal/01rosenthal.html
Shirky, Clay (2005) AIHT: Conceptual Issues from Practical Tests, D-Lib Magazine,
Vol. 11, No.
12, December
http://www.dlib.org/dlib/december05/shirky/12shirky.html
Van de Sompel, Herbert, and Carl Lagoze (2000) The Santa Fe
Convention of the Open Archives Initiative. D-Lib Magazine, Vol. 6 No.
2, February
http://www.dlib.org/dlib/february00/vandesompel-oai/02vandesompel-oai.html