"It is difficult to anticipate the
metadata needed to support technical and administrative processes that
are not fully developed, are not fully tested, and in some ways, are not
even fully understood. Compounding the problem is the proviso that
preservation metadata recommendations must be restrained by economic
realities. Creating and maintaining metadata is expensive, so any
recommended preservation metadata elements should be backed by
persuasive evidence of necessity, as well as practical means for
populating them" (Lavoie and Gartner, 2005).
"You agree that MIT may, without
changing the content, translate the
submission to any medium or format for the purpose of preservation. You
also agree that MIT may keep more than one copy of this submission for
purposes of security, back-up and preservation."
"in contrast to the support for resource discovery metadata, managers
of e-print repositories have practically no preservation metadata
support provided by the common repository software packages." (James et
al. 2003)
IR software is only one player in our preservation service provider
scenario. In the detailed PREMIS Data Dictionary the five entity types
-- intellectual entities, objects, events, agents and rights -- are
described by entries for the main elements and subelements. In this
analysis Tables 1-5 attempt to map these elements to the potential
metadata sources identified in our IR-service provider (IR-SP) model
outlined by Hitchcock et al. (2007a):
- Author/IR
submitter (via the repository deposit interface)
- IR software (in this case EPrints)
- Associated tools (in this case file format ID tool PRONOM-DROID)
- IR policy
- Preservation service providers
A possible additional source of metadata is environment registries
(Table 6), which
are recognised in PREMIS and other preservation activities, although
there are not yet any concrete examples of such registries based on the
broadest, most ambitious designs that could support and source the
elements identified here. PRONOM, Global Digital Format Registry (GDFR
http://hul.harvard.edu/gdfr/) and JSTOR/Harvard Object Validation
Environment (JHOVE http://hul.harvard.edu/jhove/) are examples of more
specific registry types (e.g. for file format ID and validation) that
might be the basis of more expansive implementations. Representation
Networks are another ongoing preservation development that may inform
environment registries, e.g. DCC Representation Information Registry
(http://registry.dcc.ac.uk/).
The version of the mapping presented here includes the main PREMIS
metadata elements without expanding on the subelements, which are
assumed to fall into the same source categories unless otherwise
indicated.
Since PREMIS is oriented towards implementation, this requires the
development of schema to define the use of certain elements within this
application. According to PREMIS, the schema should, where
possible, provide controlled vocabularies or codes for populating
elements, rather than relying on “free text”. In addition, the schema
should be adaptable to automated workflows for metadata collection and
management. This analysis does not extend to
identifying, building or including controlled
vocabularies or schema that may be required by some elements within the
preservation metadata set.
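As an illustration only of what populating an element from a controlled vocabulary rather than free text can look like, the following minimal Python sketch restricts an event type to a fixed list of terms. The four terms are the examples given for eventType in Table 5; the names EventType and parse_event_type are hypothetical and are not part of PREMIS or of any repository software.

```python
from enum import Enum


class EventType(Enum):
    """Illustrative controlled vocabulary for the PREMIS eventType element.

    The terms are the examples quoted in Table 5, not an official list.
    """
    CAPTURE = "capture"
    COMPRESSION = "compression"
    MIGRATION = "migration"
    DECRYPTION = "decryption"


def parse_event_type(value: str) -> EventType:
    """Return the matching vocabulary term, or raise ValueError for free text."""
    normalised = value.strip().lower()
    try:
        return EventType(normalised)
    except ValueError:
        allowed = ", ".join(term.value for term in EventType)
        raise ValueError(
            f"{value!r} is not a controlled term; expected one of: {allowed}"
        ) from None
```

Rejecting unknown values at the point of entry is what makes such an element usable in an automated workflow: downstream services can rely on a closed set of values.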
Tables 1-6 map the PREMIS elements to the principal sources
in the IR-SP model. It should be noted that elements are
not fixed in
these tables. Some may apply to more than one table, especially where
related subelements are simply wrapped in a single entry, but for
clarity we have not duplicated elements between tables. For example, it
may be necessary for the relationship element to be informed by the IR
author, but subsequent use of that information in related subelements
may be the responsibility of other sources or services.
Key to tables: O optional, R required, R* conditionally required,
+ includes related subelements from PREMIS Data Dictionary
Table 3: From PRONOM-DROID file format ID tool
| Requirement | PREMIS metadata elements | Part of (if not main element) | PREMIS entity type | Comment | Other IR-SP sources |
|---|---|---|---|---|---|
| | objectCategory | | Object | bitstream, file, representation; this will be "implicit" in the harvesting service | IR policy |
| | compositionLevel | objectCharacteristics | Object | e.g. compression, encryption, zip; EPrints won't tell you this, but a file format ID tool might | IR policy |
| R | format + | objectCharacteristics | Object | | |
| R | software + | environment | Object | software to render or use the object; SP decides which software environments are to be supported | |
Table 4: From IR policy
| Requirement | PREMIS metadata elements | Part of (if not main element) | PREMIS entity type | Comment | Other IR-SP sources |
|---|---|---|---|---|---|
| R | preservationLevel | | Object | Depends on "preservability", cost, etc. | SP? |
| | significantProperties | objectCharacteristics | Object | e.g. pdf + links | IR submitter |
| O | inhibitors + | objectCharacteristics | Object | inhibit access, use or migration, e.g. encryption | IR submitter, IR policy |
| R* | signatureInformation + | | Object | validates submitter, for IRs e.g. identifying authors among services, authenticating material coming from a repository, etc. These appear to be fairly 'weak' needs assuming the repository and preservation services are 'secure'. The signature itself and associated elements (e.g. keyInformation) would be generated by an appropriate tool to be decided by IR/SP policy | SP policy |
| R | permissionStatement + | | Rights | while the author is the ultimate arbiter of which permissions to grant, the IR policy sets a framework for standardising permissions by type of object to cover preservation requirements; the SP records and formalises the management of this information (see permissionStatementIdentifier) | IR submitter, SP |
| R | permissionGranted + | permissionStatement | Rights | actions the grantingAgent allows the preservation repository, using controlled values | |
Table 5: From preservation service provider
| Requirement | PREMIS metadata elements | Part of (if not main element) | PREMIS entity type | Comment | Other IR-SP sources |
|---|---|---|---|---|---|
| R | storage + | | Object | direction to locate object stored in preservation repository | |
| R | storageMedium | | Object | e.g. tape, hard disk, CD-ROM, DVD | |
| R* | relatedEventIdentification + | relationship | Object | relates objects after an event, e.g. migration | |
| R* | linkingEventIdentifier + | | Object | Use to link to events not associated with relationships, e.g. format validation, virus checking | |
| R | linkingPermissionStatementIdentifier + | | Object | identifier for permission statement associated with the object (see permissionStatementIdentifier below) | |
| R* | eventIdentifier + | | Event | Events are e.g. SP actions. Each event must have unique, locally-generated ID | |
| R* | eventType | | Event | define controlled vocabulary, e.g. capture, compression, migration, decryption | |
| R* | eventDateTime | | Event | | |
| O | eventDetail | | Event | e.g. why the event occurred | |
| O | eventOutcomeInformation + | | Event | | |
| R* | linkingAgentIdentifier + | | Event | about an agent associated with an event | |
| | linkingObjectIdentifier + | | Event | about an object associated with an event | |
| R* | agentIdentifier + | | Agent | identifies the agent uniquely within the preservation repository system | |
| | agentName | | Agent | | |
| | agentType | | Agent | from controlled vocabulary, e.g. person, organisation, software | |
| R | permissionStatementIdentifier + | permissionStatement | Rights | designation used within the preservation repository system | |
| R | linkingObject | permissionStatement | Rights | objects to which permission pertains, e.g. by IR | |
| | grantingAgent | permissionStatement | Rights | identifying designation for agent (IR?) granting permission, if agent is described as entity, e.g. agentIdentifier | |
| | grantingAgreement | permissionStatement | Rights | agreement between IR and SP, as recorded by SP | IR policy |
Table 6: From environment registries
| Requirement | PREMIS metadata elements | Part of (if not main element) | PREMIS entity type | Comment | Other IR-SP sources |
|---|---|---|---|---|---|
| | environment | | Object | omit if bit-level preservation storage | |
| | environmentCharacteristic | environment | Object | assessment of the described environment | IR policy, SP |
| | environmentPurpose | environment | Object | uses supported by the environment, e.g. render, edit | |
| R* | hardware + | environment | Object | hardware components needed by software, e.g. hardware performance required, does this object require a minimum hardware level? | |
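To make the shape of the mapping concrete, the sketch below holds a handful of rows from Tables 3-6 as plain Python records and asks which elements involve a given source, either as the principal source or as a secondary one. It is a minimal illustration, not a schema: the class name Mapping, the field names, the source labels and the function elements_involving are all hypothetical, and only five rows are transcribed.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Mapping:
    element: str                        # PREMIS element name
    entity: str                         # PREMIS entity type
    source: str                         # principal source (the table the row sits in)
    requirement: Optional[str] = None   # "R", "R*", "O", or None where the table is blank
    other_sources: Tuple[str, ...] = ()  # the "Other IR-SP sources" column


# A few rows transcribed from Tables 3-6; not the complete mapping.
MAPPINGS = [
    Mapping("format", "Object", "PRONOM-DROID", "R"),
    Mapping("inhibitors", "Object", "IR policy", "O",
            ("IR submitter", "IR policy")),
    Mapping("eventType", "Event", "Preservation service provider", "R*"),
    Mapping("grantingAgreement", "Rights", "Preservation service provider",
            None, ("IR policy",)),
    Mapping("hardware", "Object", "Environment registry", "R*"),
]


def elements_involving(source: str) -> List[str]:
    """List elements for which `source` is the principal or a secondary source."""
    return [
        m.element
        for m in MAPPINGS
        if m.source == source or source in m.other_sources
    ]


print(elements_involving("IR policy"))  # ['inhibitors', 'grantingAgreement']
```

Keeping the secondary sources alongside the principal one reflects the point made earlier that elements are not fixed in a single table.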
Testing the mappings
Modelling preservation scenarios and mapping
preservation metadata elements from an authoritative source to these
models
informs development but ultimately needs to be tested in examples using real preservation services.
Given that the underlying preservation service provider models have
been evolving in Preserv, this has so far not been possible. This
approach to preservation metadata has been tested in another form,
however, as it was used as a basis for an objective survey of
repository managers of larger IRs
with known content profiles (Hitchcock
et al. 2007b). This
mapping gave us the opportunity to place the emphasis on
what repositories do, and the implications for preservation, rather
than on what they may plan to do or what repository managers think
about preservation.
Below are some findings
from the survey that may affect the proposed mappings:
- Approximately two-thirds of IR
deposits are mediated, either by
repository staff or by an agent acting for the author, with one-third
self-deposited by authors (the proportion of self-archiving appears to
be slightly higher for subject, rather than institutional,
repositories). This has implications for Table 1.
- At least 85% of surveyed repositories use IDs
generated by the repository software (Table
2)
- Compression
and zipping of files are acceptable but encryption is generally not
permitted (Tables 3, 4)
- Most repositories are unaware of the inclusion of special
format features (e.g. pdf + links), unless informed by authors, so we
cannot easily estimate the scale of usage of such features
in repository content (Table 4)
- The majority of repositories have some kind
of log-in process to authenticate depositors, but it is not clear how
to authenticate files and hardly any of the surveyed repositories do
this (Table 4)
- None of the repositories surveyed has
a formal preservation policy of the type that would inform the
collection of preservation metadata, e.g. Table 4
- Almost all repositories surveyed present some sort of
licence or rights agreement to authors, but only a small minority refer
to
rights for
preservation, which will affect Table 4
A survey
of repositories by PREMIS, despite using a more leading
questionnaire, discovered (Caplan 2004):
- Three-quarters of all
repositories obtained
metadata from their depositors and the same number extracted metadata
automatically by program.
- Nearly two-thirds of the respondents
also had
some metadata supplied by repository staff, either through manual data
entry or by automatic derivation from bibliographic databases.
- Automatic extraction by repository
software was most often limited to
technical metadata – size, file format, and file characteristics stored
in file headers.
The validity of, and need for, the elements in Table 3 can additionally
be informed by PRONOM-ROAR format profiles. The Preserv project has
presented format profiles
('Preserv profiles') of over 200 IRs through the Registry of Open
Access Repositories (ROAR) by applying the PRONOM-DROID format
recognition tools from the National Archives of the UK to OAI data
harvested from the repositories (see Preserv Format Profiling:
PRONOM-ROAR An illustrated guide
http://trac.eprints.org/projects/iar/wiki/Profile).
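In essence, a format profile is a tally of identified formats across a repository's files. The sketch below shows that tally in Python under stated assumptions: the input is a hypothetical list of (file name, format label) pairs such as a format identification tool might report, the sample data are invented, and the code does not call DROID or reproduce the Preserv profiling service.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def format_profile(identified: Iterable[Tuple[str, str]]) -> List[Tuple[str, int]]:
    """Count files per identified format, most common format first."""
    counts = Counter(format_label for _file_name, format_label in identified)
    return counts.most_common()


# Invented sample data standing in for the results of format identification.
sample = [
    ("paper1.pdf", "Portable Document Format"),
    ("paper2.pdf", "Portable Document Format"),
    ("thesis.doc", "Microsoft Word"),
]
print(format_profile(sample))
# [('Portable Document Format', 2), ('Microsoft Word', 1)]
```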
Given the inclusion of a History Module in
EPrints v3
(http://wiki.eprints.org/w/Preservation_Support#History_Module) it is
possible that some of the Event elements from Table 5 could be
generated within the IR and shared with the service provider, depending
on the nature of the services provided, and the number of service
providers contracted to provide them (Hitchcock et al. 2007a).
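As a rough sketch of what a shared Event record might contain, the Python below builds a dictionary keyed by the Event element names from Table 5. It does not use the EPrints History Module or any EPrints API; the function name make_event, the use of a random UUID as the unique, locally-generated identifier, and the example identifiers are assumptions made for illustration.

```python
import uuid
from datetime import datetime, timezone
from typing import Optional


def make_event(event_type: str, object_id: str, agent_id: str,
               detail: Optional[str] = None) -> dict:
    """Build an event record keyed by the PREMIS Event element names in Table 5."""
    return {
        # Assumption: a random UUID serves as the unique, locally-generated ID.
        "eventIdentifier": str(uuid.uuid4()),
        "eventType": event_type,  # should come from a controlled vocabulary
        "eventDateTime": datetime.now(timezone.utc).isoformat(),
        "eventDetail": detail,    # optional, e.g. why the event occurred
        "linkingAgentIdentifier": agent_id,
        "linkingObjectIdentifier": object_id,
    }


# Hypothetical identifiers, for illustration only.
event = make_event("migration", "object-0001", "agent-sp-01",
                   detail="source format no longer supported")
```

A record of this kind could be produced by whichever party performs the action, which is why the number of contracted service providers matters for deciding where events are generated.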
The Rights elements in Table 5 suggest a greater degree of granularity
may be required than even the most preservation-aware examples among
current author agreements.
Are the elements allocated to Tables 5 and 6
viable and useful for service providers? Determining this will require
the setting up of realistic service provider testbeds and has not been
performed so far in Preserv.
Mapping PREMIS to repositories: the PRESTA example
Lee et al. (2006) have also mapped PREMIS to a repository service
framework in PRESTA - PREMIS Requirement Statement, an Australian
Partnership for Sustainable Repositories (APSR) project. In PRESTA the
submission system (cf. an IR in Preserv) and archive (cf. a
preservation service provider in Preserv) are less clearly defined than
in Preserv, although the test repositories are IRs at the Australian
National University (ANU) and the University of Queensland (UQ). In
addition the framework includes a preservation monitoring and
management system (cf. a distributed service in Preserv) and a partner
archive (no immediate equivalent in Preserv).
Recognising that preservation services are likely to be supplementary
to
repositories rather than part of the core definition, in PRESTA: "It
was decided there would be more emphasis on
what metadata was collected than
how it was collected."
PRESTA is wider than the analysis presented here. PRESTA considered
"all metadata, including PREMIS, necessary to support long term
sustainability", including descriptive metadata (describes content
including metadata providing context or meaning to a digital object)
and structural metadata (how parts relate to the whole and to each
other), as well as inclusion of PREMIS in a METS profile for exchanging
preservation metadata. In this paper only PREMIS is considered, without
an exchange profile.
The results of PRESTA are detailed and merit close study by those
implementing preservation metadata for repositories, but the summary
recommendations tend to emphasise the role of repositories to a greater
degree than in Preserv's service provider model, with specific actions
required of repositories while the role of the National Library of
Australia seems to be as a general support framework rather than an
active service provider. This perception may be due to the nature of
the findings for the two target IRs, which revealed gaps in
the collection of preservation metadata:
- recording of preservation events
- recording of structural relationships
- file format validation (ANU)
- checksum generation (UQ)
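Of the gaps listed, checksum generation is the most mechanical to close. The following Python sketch computes a digest for a file in fixed-size chunks using the standard hashlib module. It is illustrative only and is not drawn from PRESTA or from either repository; the function name file_checksum is hypothetical, and the dictionary keys are assumed to correspond to the PREMIS fixity subelements, which do not appear in the tables shown here.

```python
import hashlib


def file_checksum(path: str, algorithm: str = "sha256",
                  chunk_size: int = 65536) -> dict:
    """Compute a message digest for the file at `path`, reading it in chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # Reading in chunks avoids loading large files into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return {
        # Assumed to match PREMIS fixity subelement names.
        "messageDigestAlgorithm": algorithm,
        "messageDigest": digest.hexdigest(),
    }
```

Recomputing the digest later and comparing it with the stored value is the usual way to detect whether a stored file has changed.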
Appendix 4 (Gap reports for ANU DSpace and UQ Fez/Fedora repositories)
in Lee
et al. (2006) provides
a useful point of reference for Tables 1-6, although important
differences in the underlying models mean that direct comparison is not
possible.
It should be noted that the PREMIS Data Dictionary for
Preservation Metadata and its related XML schemas are currently being
reviewed.