
dSys-II: A High-Level Description
=================================

What is dSys-II?
================

dSys-II is an experimental, low-level, decentralised, distributed
and autonomously controlled block storage system that is agnostic
to both word size and endianness. It is exceptionally robust and
self-healing. At this stage it has only a rudimentary
self-optimisation mechanism, which the agent software may
extend in the future. Applications do not need to be modified
to access dSys-II.

What isn't dSys-II?
===================

dSys-II is not a file sharing system. It does, however, provide example
high-level personality access methods:

  - NFSv3 and v4 with ACL stubs.
  - SMBFS (rudimentary SMBv2 without user UUID control).
  - Simple SQLfs support with transaction journal and roll-back.
  - Raw BlockFS for use with COMSTAR iSCSI initiator.
  - VMS cluster disk access, simulating enough RMS attributes
    for ODS-2-like operation and ODS-5 case support. VMS DLM
    support is complete for locked quorum access.

What is a Personality Access Method?
====================================

PAMs form a layer that sits above dSys-II's logical block
storage layer and provides access to that data in a form acceptable
to an operating system's native access methods. A PAM is a bridge.

Personality access modules are generally quite small, and the
intention is that they are trivial to write. At the time of writing,
the smallest is the VMS cluster disk access module with DLM, at
under 8 KB of comment-stripped code; the largest is SMBFS, at
over 70 KB of comment-stripped code.

Supported Systems:
==================

dSys-II is an experimental platform that is deliberately designed to
function on a variety of processor architectures, from a minimum
register width of 32 bits up to, currently, 64 bits. Both little- and
big-endian modes are catered for, and data structures are converted
on the fly, transparently to the PAMs. dSys-II is currently developed
and tested on the following system configurations:

 First Class Development and Testing:
   - SGI IRIX 6.5.24    (64bit-BE)
   - SGI IRIX 6.5.18    (32bit-BE) 
   - OpenVMS 7.2 VAX    (32bit-LE)
   - OpenVMS 7.3 AXP    (64bit-LE)
   - HP-UX 11i v1.5     (32bit-BE)
   - NetBSD 4.x i386    (32bit-LE)

 Second Class Build and Test-Only Systems:
   - AIX 4.3.3ml7 Power (32bit-BE)
   - AIX 5.1ml2   Power (64bit-BE)
   - MacOS-X 10.3 Power (32bit-BE)
   - Windows NT AXP     (32bit-LE)
   - Windows 2000 i386  (32bit-LE)
   - Windows XP   i386  (32bit-LE)
   - Debian Linux i386  (32bit-LE)
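
The on-the-fly endian conversion mentioned above can be illustrated
with a minimal sketch. The header layout (a 64-bit fragment id followed
by a 32-bit version) and all function names are assumptions made for
illustration only; they are not taken from the dSys-II sources.

```python
import struct

# Hypothetical header layout: 64-bit id ("Q") then 32-bit version ("I").
# An explicit "<" or ">" prefix selects the byte order and disables
# native padding, so the packed size is identical on every platform.
LAYOUT = "QI"


def _prefix(big_endian):
    return ">" if big_endian else "<"


def pack_header(frag_id, version, big_endian):
    """Serialise the header in the requested byte order."""
    return struct.pack(_prefix(big_endian) + LAYOUT, frag_id, version)


def unpack_header(raw, big_endian):
    """Decode a header that was written in the given byte order."""
    return struct.unpack(_prefix(big_endian) + LAYOUT, raw)


def convert(raw, from_big_endian, to_big_endian):
    """Re-encode a header from one byte order to the other."""
    frag_id, version = unpack_header(raw, from_big_endian)
    return pack_header(frag_id, version, to_big_endian)
```

A PAM written against such helpers would only ever see decoded integer
values, which is one way the conversion could stay transparent to it.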

The Data Layers.
================

Data is organised in three layers.

  - Fragments. These are raw blocks of data on a disk.
  - Chunks. These are logical fragment groups on many 
    hosts and disks.
  - Views. These are many chunks allocated into a single flat or
    logically flat contiguous space.

Fragments:
Data fragments are the lowest-level blocks of storage. Each contains
raw application data and a header that uniquely identifies the
fragment and records its version number, its last-access TTL (for
orphan detection) and its checksum. Fragments are independent and do
not store information about other fragments, fragment sequences
or version sequences.
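
As a rough illustration of the fragment header described above, the
sketch below models a fragment as a small record. The field names, the
use of CRC32 as the checksum and the timestamp-based orphan test are
all assumptions; the actual on-disk format is not described here.

```python
import time
import zlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    frag_id: str        # unique fragment identifier
    version: int        # version number of this fragment
    last_access: float  # last access time, seconds since the epoch
    checksum: int       # CRC32 of the payload (assumed algorithm)
    payload: bytes      # raw application data


def make_fragment(frag_id, version, payload, now=None):
    """Build a fragment and compute its checksum from the payload."""
    ts = time.time() if now is None else now
    return Fragment(frag_id, version, ts, zlib.crc32(payload), payload)


def is_intact(frag):
    """True if the payload still matches the stored checksum."""
    return zlib.crc32(frag.payload) == frag.checksum


def is_orphan(frag, ttl_seconds, now=None):
    """True if the fragment has not been accessed within its TTL."""
    ts = time.time() if now is None else now
    return ts - frag.last_access > ttl_seconds
```

Note that the record carries no reference to any other fragment, which
mirrors the independence property stated above.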

Chunk:
Chunks keep track of the host and fragment hashes needed to construct
a run of consecutive storage to present to a view. Each chunk contains
its unique chunk id, a reference counter and the list of fragments
to be accessed to present a specific version of the chunk to a view.
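
A chunk as described above might be modelled along the following
lines. This is only a sketch: the class and field names are
hypothetical, and keeping one fragment list per version number is an
assumed way of representing "a specific version of a chunk".

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class FragmentRef:
    host_hash: str  # hash identifying the host holding the fragment
    frag_hash: str  # hash identifying the fragment itself


@dataclass
class Chunk:
    chunk_id: str
    ref_count: int = 0
    # version number -> ordered fragment references for that version
    versions: Dict[int, List[FragmentRef]] = field(default_factory=dict)

    def fragments_for(self, version=None):
        """Return the fragment list for a version (latest by default)."""
        if not self.versions:
            raise LookupError("chunk has no versions")
        if version is None:
            version = max(self.versions)
        return self.versions[version]
```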

View: 
A view presents one or more chunks to the Personality Access Method
requesting access to data. The view layer is where chunks may be
shared between views. A view consists of a list of one or more
chunks. A view itself cannot be shared, only replicated; this is a
limitation at this stage.
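
To make the idea of a flat address space built from chunks concrete,
the sketch below maps a byte offset in a view to a chunk and an offset
within that chunk. The fixed 4096-byte chunk size and the class and
method names are assumptions for illustration only.

```python
CHUNK_SIZE = 4096  # assumed fixed chunk size in bytes


class View:
    def __init__(self, chunk_ids):
        if not chunk_ids:
            raise ValueError("a view needs at least one chunk")
        self.chunk_ids = list(chunk_ids)

    @property
    def size(self):
        """Total size of the flat space presented to a PAM."""
        return len(self.chunk_ids) * CHUNK_SIZE

    def locate(self, offset):
        """Map a flat offset to (chunk id, offset inside that chunk)."""
        if not 0 <= offset < self.size:
            raise IndexError("offset outside view")
        index, within = divmod(offset, CHUNK_SIZE)
        return self.chunk_ids[index], within

    def replicate(self):
        """Views are replicated, not shared: return an independent copy
        that refers to the same chunks."""
        return View(self.chunk_ids)
```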


The User Experience.
====================

A user installs, configures and runs a peer application on their
chosen O/S. This peer software may, if the user wishes, donate local
storage to the overall system. The user then creates a storage view
with the application by initially generating or supplying a list of
view ids. Once the view is initialised, the PAM module may be loaded
and pointed at the localhost (127.0.0.1), and the contents of the
view mounted, mapped or otherwise initialised for use and access by
native applications. If this is the first time a view is to be used,
a new view is created and the identifiers for its chunks are
requested.

Local Storage Donation.
=======================

Local storage may be donated either thin-provisioned as files in
the O/S or, on selected O/S combinations, as a raw block-addressable
device. Donating and sharing local storage has the following effects:

   - Increased system-wide data fragment redundancy.
   - Decreased local data access times, as locally held
     fragments act as an effective cache.

Once storage has been initialised and populated, it can easily 
be moved to a new host. Fragments will be discovered automatically in
time as needed, or they will expire and be purged by the agent during
a periodic storage vacuum.
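
The periodic storage vacuum can be pictured as a simple sweep over the
local store. In this sketch the store is assumed to be a mapping from
fragment id to last-access timestamp, and the function name and TTL
handling are hypothetical.

```python
def vacuum(store, ttl_seconds, now):
    """Purge fragments whose last access is older than the TTL.

    `store` maps fragment id -> last access time (seconds). It is
    modified in place; the sorted list of purged ids is returned.
    """
    expired = [fid for fid, last in store.items()
               if now - last > ttl_seconds]
    for fid in expired:
        del store[fid]
    return sorted(expired)
```

Collecting the expired ids before deleting avoids mutating the mapping
while it is being iterated.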

Autonomous Agent Controls.
==========================

Controlling redundancy, servicing requests for data and chunks,
optimising location and access, and migrating and replicating chunks
are the job of the peer agents. dSys-II is easily classified as a
"Multi-Agent System". Agents used in the system must fulfil the
basic storage request and management functions as required, but
the logic behind the agents' actions and goals is deliberately
left as an experimental platform for further work. A number of
rudimentary agent examples currently exist in dSys-II.

  - Greedy: Greedy attempts to fill all storage possible with
    as many redundant copies and versions as possible for
    maximum data redundancy. This causes the reap and termination
    agent, Arnie, to spend a great deal of time cleaning up.
  - Scrooge: Scrooge is the opposite of Greedy: it attempts
    to find the minimum number of fragments and chunks necessary
    to satisfy minimum fragment redundancy and maximise available
    space.
  - Gonzales: Speedy Gonzales is a modification of Scrooge that
    additionally looks at the access rates of fragments and
    attempts to broker storage on the nodes which access them
    more frequently, to minimise network overheads. Speedy is
    unique in that it tries to "guess" fragment access
    frequency from the number of requests it receives from other,
    external agents. Unlike Greedy and Scrooge, it will attempt to
    replicate fragments that it does not control, and hope that
    the remote agent is sufficiently functional to retrieve the
    pushed local fragment rather than interrogate the network.
  - Arnie: The Terminator is a special agent which attempts to
    reap and purge fragments periodically that have expired 
    beyond their TTL, or have a fragment redundancy greatly
    above the reference count high-water mark. Note: Greedy 
    and Arnie do not get along well.
  - Lazy: Lazy is a fuzz tester that places fragments and
    chunks wherever, without a care. Lazy does not always
    update reference counters correctly, and this deliberately
    creates problems for the other agents to notice, hunt down and
    correct. This is a modernised version of the outdated "Slob"
    agent. Lazy will occasionally also forget to service remote
    requests for fragments, which is great for redundancy and
    roll-back testing.
  - Slob: Slob is a local-only agent version of Lazy. It will
    often attempt to break reference count integrity and 'forget'
    to send fragments to other hosts. This can cause mayhem if
    fragments are next requested by another remote agent and it
    is only able to retrieve the fragment from Slob.
  - OrderlyDave: Dave does everything by the book. He is slow,
    painstakingly slow. Everything must be done and synchronised,
    and all fragments quiesced, before he will return from the
    IOP. This is a newer version of LocalFreak.
  - CaveDiver: This agent pokes around the agents it can find,
    attempting to spot fragments which have been migrated, and
    updates the chunk data by handing it to Betty.
    CaveDiver also has hooks into Betty and into local Arnie
    instances for reporting spotted problems. CaveDiver is an
    updated version of the obsolete SaltMiner agent.
  - NurseBetty: Betty attempts to locate and fix missing chunk
    fragment lists. CaveDiver notifies her when something has 
    been found and Betty is also able to tell CaveDiver and the
    older SaltMiner to spelunk for missing fragments.
  - GOM: GOM acknowledges all requests and ignores them completely.
    This works well in combination with Lazy and OrderlyDave when
    fuzz testing.
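
As one example of the kind of agent logic described above, the sketch
below shows how a Gonzales-style agent might "guess" where a fragment
is wanted by counting incoming requests per remote node. The class
name, the method names and the request threshold are all hypothetical;
the real agents may use a quite different heuristic.

```python
from collections import Counter, defaultdict


class AccessTracker:
    """Counts fragment requests per requesting node."""

    def __init__(self):
        # fragment id -> Counter mapping node name -> request count
        self._hits = defaultdict(Counter)

    def record_request(self, frag_id, node):
        """Note that `node` asked this agent for `frag_id`."""
        self._hits[frag_id][node] += 1

    def preferred_node(self, frag_id, threshold=3):
        """Return the node that requested the fragment most often,
        or None if no node has reached the (assumed) threshold."""
        hits = self._hits.get(frag_id)
        if not hits:
            return None
        node, count = hits.most_common(1)[0]
        return node if count >= threshold else None
```

An agent could then offer to push a replica of the fragment to the
returned node, in the hope of reducing later network round trips.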

General Features.
=================

Redundancy:
  - Fragments are stored on multiple hosts in case a host is 
    unavailable.
  - Fragments may be fetched sequentially and speculatively from 
    multiple agents.
  - Fragments may be migrated to different hosts to satisfy 

Multiple Versions:
  - Multiple versions of fragments exist in the system. This 
    can allow some types of failure or data access to roll 
    back to an earlier version of a chunk.

Highly Resilient Storage:
  - It is extremely difficult to compromise data integrity.
  - Self-healing via SaltMiner and NurseBetty.
  - Roll back to an older version if a catastrophic data error
    occurs.
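
Rolling back to an older version after a data error could work roughly
as sketched here: try the newest version first and fall back to the
most recent version whose fragments all pass their checksum. The data
layout, the function name and the use of CRC32 are assumptions for
illustration.

```python
import zlib


def read_chunk(versions):
    """Return (version, data) for the newest fully intact version.

    `versions` maps version number -> list of (payload, checksum)
    pairs, one pair per fragment, in order.
    """
    for version in sorted(versions, reverse=True):
        frags = versions[version]
        # Accept this version only if every fragment verifies.
        if all(zlib.crc32(p) == c for p, c in frags):
            return version, b"".join(p for p, _ in frags)
    raise IOError("no intact version available")
```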

Storage Performance:
  - Performance is quite high: throughput approaching 91% of wire
    speed has been observed in purely decentralised benchmarks,
    and higher than wire speed where local caching is allowed.




