EPrints Honeycomb

From PreservWiki

We know the Honeycomb metadata manager is not the best. However, the idea here is to set in stone the minimal amount of metadata needed for a working multi-repository Honeycomb, one that can be queried easily enough that an entire repository can be rebuilt from the data in the Honeycomb.

Honeycomb built in metadata

  • system.object_hash_alg : sha1
  • system.object_size : 28298
  • system.object_layoutMapId : 3793
    • I believe this is the distribution map of the data bits across the nodes.
  • system.object_hash : 0795cb987ccdc145d4c34722e88b2972672d5ec0
  • system.object_ctime : 1219419767074
    • Submission timestamp (with milliseconds)
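
The built-in fields above can be used to sanity-check a retrieved object. The following is a minimal sketch, not the Honeycomb client API: it assumes that system.object_ctime counts milliseconds since the Unix epoch and that system.object_hash is the hex digest of the raw object bytes. The function names are hypothetical.

```python
import hashlib
from datetime import datetime, timezone


def ctime_to_datetime(ctime_ms: int) -> datetime:
    """Convert a millisecond timestamp to an aware UTC datetime.

    Assumption: the value counts milliseconds since the Unix epoch.
    """
    seconds, millis = divmod(ctime_ms, 1000)
    return datetime.fromtimestamp(seconds, tz=timezone.utc).replace(
        microsecond=millis * 1000
    )


def matches_system_metadata(data: bytes, system: dict) -> bool:
    """Compare raw bytes against system.object_size and system.object_hash.

    Assumption: the stored hash covers exactly the object bytes.
    """
    if len(data) != int(system["system.object_size"]):
        return False
    digest = hashlib.new(system["system.object_hash_alg"], data).hexdigest()
    return digest == system["system.object_hash"].lower()
```

Under the epoch assumption, the example value 1219419767074 corresponds to 2008-08-22 15:42:47.074 UTC.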

EPrints Required Metadata

  • dc.isPartOf
    • Often the repository ID. This is sufficient to identify which repository the data originated from.
  • dc.identifier
    • The ID of the object. Again, this is often just a URL from EPrints, but it is a sufficient identifier.
  • dc.format
    • The format or MIME type of the object.
  • dc.conformsTo
    • Namespace of the schema this object conforms to, used for identification of specific object types.
  • eprints.revision
    • The revision number of this object. Two objects may share the same dc.identifier, but no two objects can ever share both the same identifier and the same revision number.
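
As an illustration of the required fields, here is a small sketch that checks a metadata record for completeness and derives the (identifier, revision) pair that uniquely names an object. The helper names and the sample values are hypothetical and not part of EPrints or the Honeycomb.

```python
# The five fields listed above as required for every stored object.
REQUIRED_FIELDS = (
    "dc.isPartOf",
    "dc.identifier",
    "dc.format",
    "dc.conformsTo",
    "eprints.revision",
)


def missing_fields(metadata: dict) -> list:
    """Return the required field names that are absent or empty."""
    return [name for name in REQUIRED_FIELDS if metadata.get(name) in (None, "")]


def object_key(metadata: dict) -> tuple:
    """Return the (dc.identifier, eprints.revision) pair for a record."""
    return (metadata["dc.identifier"], int(metadata["eprints.revision"]))


# Hypothetical record; all values are placeholders for illustration only.
record = {
    "dc.isPartOf": "example-repo",
    "dc.identifier": "http://example.org/id/eprint/1",
    "dc.format": "text/xml",
    "dc.conformsTo": "http://example.org/ns/epxml",
    "eprints.revision": 3,
}
```

With this record, `missing_fields(record)` is empty and `object_key(record)` gives the pair that must never repeat within the store.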

Performing a Rebuild

  • Find all files that conform to the EPXML schema type (matched via dc.conformsTo) and that are dc.isPartOf your repository ID.
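
The selection step can be sketched in plain Python over metadata records that have already been fetched. This is only an in-memory stand-in for a real Honeycomb query. Keeping just the highest eprints.revision per dc.identifier is an assumption about what a rebuild wants, and the function name and parameters are hypothetical.

```python
def select_for_rebuild(records, repository_id, epxml_namespace):
    """Pick the records needed to rebuild one repository.

    Keeps records whose dc.isPartOf and dc.conformsTo match, and for each
    dc.identifier retains only the highest eprints.revision (an assumption).
    """
    latest = {}
    for record in records:
        if record["dc.isPartOf"] != repository_id:
            continue
        if record["dc.conformsTo"] != epxml_namespace:
            continue
        identifier = record["dc.identifier"]
        revision = int(record["eprints.revision"])
        current = latest.get(identifier)
        if current is None or revision > int(current["eprints.revision"]):
            latest[identifier] = record
    return latest
```

The result maps each dc.identifier to a single record, which a rebuild script could then fetch and import in turn.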

The Schema

This is the current schema in the Honeycomb. NOTE: every string field is limited to a length of 512!

<metadataConfig>
 <schema>
   <namespace name="eprints" writable="false" extensible="false">
     <field name="revision" type="long" queryable="true" />
   </namespace>
   <namespace name="dc" writable="false" extensible="false">
     <field name="conformsTo" type="string" length="512" queryable="true" />
     <field name="identifier" type="string" length="512" queryable="true" />
     <field name="isPartOf" type="string" length="512" queryable="true" />
     <field name="format" type="string" length="512" queryable="true" />
   </namespace>
 </schema>
 <fsViews>
   <fsView name="byDCisPartOf" filename="${dc.isPartOf}.${dc.identifier}" namespace="eprints" fsattrs="true" readonly="true">
     <attribute name="dc.isPartOf"/>
     <attribute name="dc.identifier"/>
   </fsView>
   <fsView name="byST5800SystemId" filename="${system.object_id}" fsattrs="false" readonly="true">
     <attribute name="system.object_id" unset=""/>
     <attribute name="system.object_ctime" unset=""/>
   </fsView>
 </fsViews>
 <tables>
   <_table name="dc" >
     <column name="dc.isPartOf"/>
     <column name="dc.identifier"/>
     <column name="dc.conformsTo"/>
     <column name="dc.format"/>
   </_table>
   <_table name="eprints" >
     <column name="eprints.revision"/>
   </_table>
 </tables>
</metadataConfig>
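
Because of the field length limits, it can help to check values before submitting them. The sketch below reads the limits from the schema part of the config and reports over-long values. It measures length with Python's len(), which counts characters; whether the Honeycomb counts characters or bytes is not stated here, so treat that as an assumption. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# The <schema> part of the config above, trimmed to what this check needs.
SCHEMA_XML = """
<metadataConfig>
 <schema>
   <namespace name="eprints" writable="false" extensible="false">
     <field name="revision" type="long" queryable="true" />
   </namespace>
   <namespace name="dc" writable="false" extensible="false">
     <field name="conformsTo" type="string" length="512" queryable="true" />
     <field name="identifier" type="string" length="512" queryable="true" />
     <field name="isPartOf" type="string" length="512" queryable="true" />
     <field name="format" type="string" length="512" queryable="true" />
   </namespace>
 </schema>
</metadataConfig>
"""


def string_length_limits(schema_xml: str) -> dict:
    """Map 'namespace.field' to its declared length for string fields."""
    root = ET.fromstring(schema_xml.strip())
    limits = {}
    for namespace in root.iter("namespace"):
        for field in namespace.iter("field"):
            if field.get("type") == "string" and field.get("length"):
                name = f"{namespace.get('name')}.{field.get('name')}"
                limits[name] = int(field.get("length"))
    return limits


def too_long(metadata: dict, limits: dict) -> list:
    """Return the field names whose values exceed the declared length."""
    return [
        name
        for name, limit in limits.items()
        if len(str(metadata.get(name, ""))) > limit
    ]
```

Long EPrints URLs in dc.identifier are the most likely values to run into the limit, so that field is worth checking first.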