Honeycomb H2

Background

So Sun Microsystems has officially pulled the plug on the STK5800 system because the hardware it is built from has gone end of life (EOL). The STK5800 became, somewhat wrongly, known by the codename Honeycomb after the underlying software which managed the hardware, and so that software effectively dies with it. Or does it...

The point is that the software was always meant to be platform independent; however, in the process of bringing the STK5800 to market this stopped being true.

The idea now is to make this true again: fix the problems, restore the originally promised functionality, and improve or upgrade those parts which can be upgraded.

The Proposal

The proposal put forward by Sun Microsystems is to bring together a small team of specialists to act as the Design Analysts, who put together the specification and architecture models which can then be taken forward into the new version of the product. This team will hopefully consist of leading researchers in the repository field from Oxford, Southampton and Stanford.

The Team

Southampton - Les Carr

Southampton - David Tarrant

Personal Note: David Tarrant

So in this page's early days I've so far written the introduction and proposal sections. I also thought that we should codename the project H2, as this could stand for Honeycomb2 (obvious, I know) but is also suggestive of H2G2, which has the motto "Don't Panic" inscribed in big letters upon its cover.

The Specification

Must support the 4 key points of the "Open Storage" paradigm.

  • Reliable
    • Must contain a self checking and self healing file system
  • Resilient
    • Must be robust in the case of part failure.
    • Must be able to handle typical lifetime predictions of components with ease.
  • Simple and Expandable
    • Must be made of parts which are easy to expand/upgrade way into the future.
  • Open
    • Any software developed to enable the above must be open.
    • Any hardware specifications must also be open.

Extended Specifications

  • Metadata Handling
    • Must have a tightly coupled but loosely specified metadata layer.
  • Load Balancing
    • Must be able to handle large numbers of requests and processes by load balancing these across the entire system.

Note: The STK5800 failed on the "Simple and Expandable" point, even though it wasn't meant to.

Architectures

In this section we first outline the STK5800 + Honeycomb architecture and analyse it. We then provide several other architectures which could lead to a final version of H2.

STK5800 + Honeycomb Architecture

Hardware

  • 16 nodes - Each with 4 × 500 GB hard disks
  • 1 service node - Handles upgrades to software/firmware (not needed to maintain access to the system)
  • 2 switches - Perform the load balancing. One active, one failover.

Note: The number of nodes can be as few as 8 or as many as you like; the way the software operates does not change under these conditions.

Operating Specifications

  • High level RAID
    • On receipt of a file, a node splits it into 5 data parts + 2 parity parts and distributes these 7 parts onto 7 of the 64 hard disks, on 7 of the 16 nodes in the system (see the placement sketch after this list).
    • In this way 2 complete nodes can fail without data loss.
  • Metadata layer
    • Provided by the proprietary HADB, this was tightly coupled but not very flexible in its usage (no multivalue fields).
    • The metadata layer was also extremely slow to update and tried to self heal for errors created by a human (where it should have just thrown an informative error).
  • Datadoctor
    • Designed to provide the reliability and the resilience.
    • However this was a slow and uninformative part of the system, often causing much grief.
    • It works, but just needs to be improved.
  • Simple Protocols
    • Tick in this box: HTTP-based communication makes object manipulation easy.
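
To make the 5+2 point above concrete, here is a minimal placement sketch in Java. It only shows how the 7 fragments (5 data + 2 parity) might be assigned to 7 distinct nodes out of 16, and why losing any 2 whole nodes still leaves the 5 fragments needed for a rebuild. The node selection rule and class names are mine for illustration; the actual Honeycomb parity maths (Reed-Solomon style) is not shown.

 import java.util.ArrayList;
 import java.util.Collections;
 import java.util.List;

 // Illustrative placement only: 5 data + 2 parity fragments on 7 of 16 nodes.
 // Any 5 of the 7 fragments are enough to rebuild the object, so the loss of
 // any 2 whole nodes (and the fragments they hold) is survivable.
 public class PlacementSketch {
     static final int NODES = 16;
     static final int DATA_FRAGMENTS = 5;
     static final int PARITY_FRAGMENTS = 2;

     // Hypothetical selection rule: shuffle the node list and take the first 7.
     static List<Integer> pickNodes() {
         List<Integer> nodes = new ArrayList<>();
         for (int n = 0; n < NODES; n++) nodes.add(n);
         Collections.shuffle(nodes);
         return nodes.subList(0, DATA_FRAGMENTS + PARITY_FRAGMENTS);
     }

     public static void main(String[] args) {
         List<Integer> chosen = pickNodes();
         System.out.println("Fragments placed on nodes: " + chosen);
         int surviving = chosen.size() - 2;           // two whole nodes fail
         System.out.println("Fragments left after 2 node failures: " + surviving
                 + " (need " + DATA_FRAGMENTS + " to rebuild: "
                 + (surviving >= DATA_FRAGMENTS) + ")");
     }
 }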

Failure Points

  • Too tied to the switches for load balancing
  • Metadata layer capabilities and speed
  • Not enough information is available to the system administrator as to the system state.
  • Small files don't split well 5 ways and thus problems were encountered.
  • Not expandable; it is unclear why, but possibly due to the calculations on how much free space has to be left to accommodate a double node failure.
  • Tied to Solaris, so not cross-Unix.

Cleversafe

This project recently won a Wall Street Journal prize. It distributes data geographically over many machines which run some simple Java code. There are 3 parts to the Cleversafe DSN (Distributed Storage Network) system: accessors, slicestors and managers. Accessors read and write the data, slicestors store data blocks and managers recover from failure. At first it sounded like an H2 solution already done, but it isn't.

How it Works

  • Step 1: Create a load of slicestors by dropping some code on a box.
  • Step 2: Bind them together, creating a network.
  • Step 3: Create an iSCSI drive on the network.
  • Step 4: Use it like a normal drive.

Positives

  • The slicestor and accessor code is open source via sourceforge, and will remain this way.
  • You can specify the number of slices your data is split into, along with the threshold. So if you want your data split into 16 slices with a threshold of 10, you can then lose 6 slices. This uses the Cauchy Reed-Solomon algorithm (a small worked example follows this list).
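
The slices/threshold relationship above is simple arithmetic, but worth spelling out. A tiny sketch (the variable names are mine, not part of the Cleversafe API):

 // Information-dispersal arithmetic: with n slices and a threshold of k,
 // any k slices are enough to reconstruct, so up to n - k slices can be lost.
 public class DispersalMath {
     public static void main(String[] args) {
         int slices = 16;     // total slices written (n)
         int threshold = 10;  // slices needed to reconstruct (k)
         System.out.println("Tolerable slice losses: " + (slices - threshold)); // prints 6
     }
 }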

Failures (Of current Version)

  • Contains all the failures of RAID: your point of failure now lies in the controller, and it is not expandable without creating a new RAID or drive on the slicestors.
  • The manager is a separate piece of software from the slicestors; it is unclear why, other than the sales potential of the company. EDIT: Which is now confirmed by the company as exactly what it is. The manager performs the bit checking and reconstruction on the network, and the company is not open sourcing this code as this is how it intends to make any money.
  • No load balancing

Future Versions

  • A little birdie (from within the company) has informed me: "We are soon introducing a new vault structure based upon a file/object orientation that will allow for easy expansion". This moves them away from the limitations of iSCSI and opens up endless possibilities for easy expansion by simply adding a new slicestor to the network. All code will remain open source and old versions will continue to be supported.

H2

Hardware

  • Anything running a Unix-based operating system and a copy of the Java JRE.

Requirements

  • Datadoctor
    • Carried forward but with better performance and more information to admins.
  • Simple APIs/Protocols
    • Carried forward
  • High level RAID
    • Carried forward, but ONLY for large files (>4 MB?). Small files are just replicated 3 times, allowing for 2 node failures. This costs space but improves performance, reliability and rebuild times (see the ingest sketch after this list).
  • Metadata Layer
    • Replaced by a database abstraction and made semantic and triple based.
    • In this manner you could swap out the underlying database system to use MySQL/JavaDB/Oracle (I don't see this as an easy task due to the challenges of load balancing and replication).
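
To show how the two storage strategies fit together, here is a hedged sketch of the ingest decision. The 4 MB cut-off comes from the requirement above; the class and method names, and the exact threshold handling, are my assumptions rather than anything already specified.

 // Sketch of the H2 ingest decision: small objects are replicated whole onto
 // three nodes, large objects are split 5+2 and spread over seven nodes.
 public class IngestPolicy {
     static final long SMALL_FILE_LIMIT = 4L * 1024 * 1024; // ~4 MB, per the spec above
     static final int REPLICA_COUNT = 3;                    // survives 2 node failures
     static final int DATA_PARTS = 5;
     static final int PARITY_PARTS = 2;

     // Hypothetical entry point: returns a human-readable description of the plan.
     static String planFor(long sizeBytes) {
         if (sizeBytes <= SMALL_FILE_LIMIT) {
             return "replicate whole object onto " + REPLICA_COUNT + " nodes";
         }
         return "split into " + DATA_PARTS + "+" + PARITY_PARTS
                 + " fragments over " + (DATA_PARTS + PARITY_PARTS) + " nodes";
     }

     public static void main(String[] args) {
         System.out.println(planFor(100 * 1024));         // small file -> replication
         System.out.println(planFor(700L * 1024 * 1024)); // large file -> 5+2 split
     }
 }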

New Features

  • The system can run on any machine right down to donated space on desktop machines.
  • From the above, the network has to be self-expandable.
  • Load balancing is handled by each node in the system, and users always point at their local node first, so the system load balances from the initial setup.
  • Authentication. This wasn't present in any useful way in the Honeycomb software.
  • Although this specification is based upon a minimum of 8 nodes due to the 5+2 data splitting, these figures could be made customisable by the user on first setup. They could choose to split their data into an 8+5 layout, meaning 5 nodes could fail, but you would need 14 nodes in the first place (see the sizing sketch after this list).
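
The node counts quoted above (7 fragments needing a minimum of 8 nodes, 13 fragments needing 14) suggest a minimum of data + parity + 1 nodes, presumably so a rebuild always has a spare node to target. That "+1" reading is my inference, not something stated anywhere, but here it is as a sketch:

 // Inferred relationship between the fragment layout and minimum cluster size:
 // data + parity fragments must land on distinct nodes, plus one spare node so
 // a rebuild after a failure has a free target. (The "+1" is an assumption.)
 public class ClusterSizing {
     static int minNodes(int dataParts, int parityParts) {
         return dataParts + parityParts + 1;
     }

     public static void main(String[] args) {
         System.out.println("5+2 layout -> minimum nodes: " + minNodes(5, 2)); // 8
         System.out.println("8+5 layout -> minimum nodes: " + minNodes(8, 5)); // 14
     }
 }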

Physical Requirements

  • In order to run storelets we need a "Buffer", which is a space where large files can be reconstructed from their split-up parts in order to be operated on in some way (e.g. virus scanned). So the "Buffer" must be a large contiguous space on a single disk somewhere in the cluster which is larger than the biggest file stored. Either that, or the "Buffer" limits the biggest single file which can be accepted into the system (see the buffer sketch after this list).
  • Option 2 here is to rely on fsviews and Java.IO to stream the file as required to the process which wants to operate on it, thus avoiding this issue. Not my specialist area.
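
Here is a hedged sketch of the "Buffer" constraint from the first bullet: the biggest object the cluster can accept is bounded by the largest contiguous free space available for reassembly. The free-space probe just uses the local filesystem as a stand-in for whatever the real buffer location turns out to be.

 import java.io.File;

 // Sketch of the "Buffer" constraint: an object can only be accepted if some
 // node has enough contiguous space to reassemble it for storelet processing.
 public class BufferCheck {
     // Stand-in probe: usable space at one candidate buffer location.
     static long bufferCapacity(File bufferDir) {
         return bufferDir.getUsableSpace();
     }

     static boolean canAccept(long objectSizeBytes, File bufferDir) {
         return objectSizeBytes <= bufferCapacity(bufferDir);
     }

     public static void main(String[] args) {
         File buffer = new File("/tmp");                 // hypothetical buffer location
         long twoGb = 2L * 1024 * 1024 * 1024;
         System.out.println("Buffer capacity: " + bufferCapacity(buffer) + " bytes");
         System.out.println("Accept 2 GB object? " + canAccept(twoGb, buffer));
     }
 }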

How it would all work

  • Each node is running a copy of mysql and java.
  • Minimum number of nodes is 8.
  • The space donated to the H2 system could be an entire disk, set of disks or just a single directory with a quota limit handled by the H2 software.
  • The H2 software requires enough space on disk to contain a running copy of the software, plus space to add storelets (based on how much space the user is donating to H2).
  • Small files are replicated onto 3 different nodes (allowing for 2 node failure)*
  • In a community-wide system, since small files are just replicated, do they need to be encrypted against the user's credentials to stop anyone from viewing them? Larger files need to be reconstructed, so you can do authentication at that point (unless one user happens to own all the nodes the file parts are on).
  • Large files are split into 5 bits + 2 parity and spread over 7 nodes.
  • Each node acts as a local cache for that user, and the user can define the amount of cache space where new and regularly accessed files can be stored in their entirety. For files less than the small file limit, this would mean that one of the 3 copies would be on the local node if space is available; if not, older small files could be quickly farmed out.
  • The largest "Buffer" is the largest contiguous space, and this can be a combination of "Cache" plus free storage space, as the "Cache" is not a separate partition.
  • Metadata is stored in a MySQL database, and as yet I'm unclear on the distribution of metadata in the system. Metadata should probably be stored on the node which is the point of insertion and then replicated onto 2 other nodes, one of which should be the requested node of insertion (if different from the actual one).
  • Metadata files should also be stored as normal files on disk. These files should be stored on at least 3 nodes other than the ones whose database already holds that same data!
  • Querying is thus done to all nodes; for the most part the results the users are after will come from the node which is their point of insertion, however this may not always be the case (see the fan-out sketch after this list).
  • If the user's node (their point of entry) falls over, I'd like to just use Anycast to find the next one, or some H2 registry. Previously this was the main role of the switches, and it is going to be the bit which is harder to replicate. However, if we look at the web, google.com redirects you to your nearest working server; can we replicate this functionality?
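
Since querying fans out to every node, a sketch of that fan-out might help: ask each node in parallel and merge the partial results. The node addresses and the queryNode() call are placeholders, not part of any existing H2 API.

 import java.util.ArrayList;
 import java.util.List;
 import java.util.concurrent.Callable;
 import java.util.concurrent.ExecutorService;
 import java.util.concurrent.Executors;
 import java.util.concurrent.Future;

 // Sketch of metadata query fan-out: ask every node in parallel and merge the
 // partial result lists. queryNode() is a placeholder for a real HTTP/SQL call.
 public class QueryFanOut {
     static List<String> queryNode(String nodeAddress, String query) {
         // Placeholder: a real implementation would hit the node's MySQL or HTTP API.
         return List.of(nodeAddress + " -> no matches for '" + query + "'");
     }

     static List<String> query(List<String> nodes, String query) throws Exception {
         ExecutorService pool = Executors.newFixedThreadPool(nodes.size());
         List<Future<List<String>>> futures = new ArrayList<>();
         for (String node : nodes) {
             futures.add(pool.submit((Callable<List<String>>) () -> queryNode(node, query)));
         }
         List<String> merged = new ArrayList<>();
         for (Future<List<String>> f : futures) merged.addAll(f.get());
         pool.shutdown();
         return merged;
     }

     public static void main(String[] args) throws Exception {
         List<String> nodes = List.of("node01", "node02", "node03"); // placeholder addresses
         query(nodes, "creator=Tarrant").forEach(System.out::println);
     }
 }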

Issues

  • Can the above work, or do all the underlying file systems need to be ZFS? Can you split a file into 5 parts with 2 parity and distribute these over a networked file system? What are the limitations?
  • Q: If the H2 network becomes full and you add a single but large node, how do you redistribute in a way which causes the least disruption?
    • A1: Small files are highly available, and if there are a lot of them (>50%) in your system then redistribution can be done with very little loss in service.
    • A2: If your files are all large then this is a RAID rebuild and will cause some loss in service as file parts are moved and indexes are updated.

H2 behind a NAT

This is what the STK5800 is, and it saves on global IPs and also makes the cluster look like one system. So you have a switch/router which has only the logic within it to know which nodes are online. The switch assigns a master, and when this fails it picks another node. Nodes can be easily added to the system, as the switch will just give them a DHCP address and the software handles the rest.

Q: How does the switch know the new node is a Honeycomb node? Does it need to?

Implementation Questions

  • Q: RAID-Z2 vs software RAID
    • RAID-Z2 is designed to work and give you the ZFS capabilities, but I think these capabilities are only important in the local cache, if at all. Software RAID is the way to go if you want the expansion capabilities without having to re-write bits of ZFS. I like ZFS on file servers, but H2 is not a file server; it's an object store on/in the web.
    • Also, with software RAID it is easier to recover the bits of the files if they are just that, sitting on various disks.
  • Q: If we perform load balancing on the node, how do we redirect a dumb client?
    • A: HTTP/307 (see the redirect sketch after this list)
  • Q: How do we ensure that an H2 client doesn't use all the processor on the server box, causing problems with load balancing and interfering with the user's day-to-day work?
    • A: Firstly, most people work on a dual-core machine now and only ever really use a single core per big process; secondly, we can be "nice" and "renice" (they're Linux commands).
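
A minimal sketch of the HTTP/307 answer: a node that is too busy sends the dumb client a Temporary Redirect pointing at another node, and the client retries the same request there. The port, the target node name and the "busy" test are all placeholders.

 import com.sun.net.httpserver.HttpServer;
 import java.io.OutputStream;
 import java.net.InetSocketAddress;

 // Sketch of node-side load balancing for dumb clients: if this node is busy,
 // answer with 307 Temporary Redirect pointing at another node; otherwise serve
 // the request. Hostname, port and the busy test are placeholders.
 public class RedirectSketch {
     public static void main(String[] args) throws Exception {
         HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
         server.createContext("/", exchange -> {
             boolean busy = args.length > 0; // placeholder: pass any argument to simulate load
             if (busy) {
                 exchange.getResponseHeaders().add("Location",
                         "http://node02.example:8080" + exchange.getRequestURI());
                 exchange.sendResponseHeaders(307, -1);  // 307 keeps the method and body
             } else {
                 byte[] body = "served locally\n".getBytes();
                 exchange.sendResponseHeaders(200, body.length);
                 try (OutputStream os = exchange.getResponseBody()) { os.write(body); }
             }
             exchange.close();
         });
         server.start();
     }
 }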