diff --git a/DESIGN.md b/DESIGN.md
index 9eec04f..5fb367b 100644
--- a/DESIGN.md
+++ b/DESIGN.md
@@ -1,6 +1,6 @@
+# Hanoi's Design
-# Hanoi's Design: How this LSM-BTree Works
-
+### Basics
 If there are N records, there are in log2(N) levels (each being a plain B-tree in a file named "A-*level*.data"). The file `A-0.data` has 1 record, `A-1.data` has 2 records, `A-2.data` has 4 records, and so on: `A-n.data` has 2^n records.
 In "stable state", each level file is either full (there) or empty (not there); so if there are e.g. 20 records stored, then there are only data in filed `A-2.data` (4 records) and `A-4.data` (16 records).
@@ -28,7 +28,6 @@ Deletes are the same: they are also done by inserting a tombstone (a special val
 ## Merge Logic
-
 The really clever thing about this storage mechanism is that merging is guaranteed to be able to "keep up" with insertion. Bitcask for instance has a similar merging phase, but it is separated from insertion. This means that there can suddenly be a lot of catching up to do. The flip side is that you can then decide to do all merging at off-peak hours, but it is yet another thing that need to be configured.
 With LSM B-Trees; back-pressure is provided by the injection mechanism, which only returns when an injection is complete. Thus, every 2nd insert needs to wait for level #0 to finish the required merging; which - assuming merging has linear I/O complexity - is enough to guarantee that the merge mechanism can keep up at higher-numbered levels.
@@ -71,6 +70,4 @@ When X is closed and clean, it is actually intermittently renamed M so that if t
 ABC files have 2^level KVs in it, regardless of the size of those KVs. XM files have 2^(level+1) approximately ... since tombstone merges might reduce the numbers or repeat PUTs of cause.
 ### File Descriptors
-
 Hanoi needs a lot of file descriptors, currently 6*⌈log2(N)-TOP_LEVEL⌉, with a nursery of size 2^TOP_LEVEL, and N Key/Value pairs in the store.
 Thus, storing 1.000.000 KV's need 72 file descriptors, storing 1.000.000.000 records needs 132 file descriptors, 1.000.000.000.000 records needs 192.
-
diff --git a/README.md b/README.md
index afb12e3..52885cb 100644
--- a/README.md
+++ b/README.md
@@ -1,29 +1,42 @@
 # Hanoi Key/Value Storage Engine
-This Erlang-based storage engine implements a structure somewhat like LSM-trees (Log-Structured Merge Trees, see docs/10.1.1.44.2782.pdf). The notes below describe how this storage engine work; I have not done extensive studies as how it differs from other storage mechanisms, but a brief brows through available online resources on LSM-trees indicates that this storage engine is quite different in several respects.
-
-The storage engine can function as an alternative backend for Basho's Riak/KV.
+This storage engine implements a structure somewhat like LSM-trees
+(Log-Structured Merge Trees, see docs/10.1.1.44.2782.pdf). The notes in
+DESIGN.md describe how this storage engine works; I have not done extensive
+studies as to how it differs from other storage mechanisms, but a brief review of
+available research on LSM-trees indicates that this storage engine is quite
+different in several respects.
 Here's the bullet list:
-- Insert, Delete and Read all have worst case log2(N) complexity.
-- The cost of evicting stale key/values is amortized into insertion, so you don't need to schedule merge to happen at off-peak hours.
-- Operations-friendly "append-only" storage (allows you to backup live system, and crash-recovery is very fast)
-- Supports range queries (and thus eventually Riak 2i.)
-- Doesn't need much RAM, but does need a lot of file descriptors
-- All around 3000 lines of pure Erlang code
+- Insert, Delete and Read all have worst case log2(N) complexity.
+- The cost of evicting stale key/values is amortized into insertion
+  - you don't need a separate eviction thread to keep memory use low
+  - you don't need to schedule merges to happen at off-peak hours
+- Operations-friendly "append-only" storage
+  - allows you to back up a live system
+  - crash-recovery is very fast and the logic is straightforward
+- Supports efficient range queries
+- Uses bloom filters to avoid unnecessary lookups on disk
+- Efficient resource utilization
+  - Doesn't store all keys in memory
+  - Uses a modest number of file descriptors proportional to the number of levels
+  - IO is generally balanced between random and sequential
+  - Low CPU overhead
+- ~2000 lines of pure Erlang code in src/*.erl
-### Deploying the hanoi for testing with Riak/KV
+### How to deploy Hanoi as a Riak/KV backend
-You can deploy `hanoi` into a Riak devrel cluster using the
-`enable-hanoi` script. Clone the `riak` repo, change your working directory
-to it, and then execute the `enable-hanoi` script. It adds `hanoi` as a
-dependency, runs `make all devrel`, and then modifies the configuration
-settings of the resulting dev nodes to use the hanoi storage backend.
+This storage engine can function as an alternative backend for Basho's Riak/KV.
+
+You can deploy `hanoi` into a Riak devrel cluster using the `enable-hanoi`
+script. Clone the `riak` repo, change your working directory to it, and then
+execute the `enable-hanoi` script. It adds `hanoi` as a dependency, runs `make
+all devrel`, and then modifies the configuration settings of the resulting dev
+nodes to use the hanoi storage backend.
 1. `git clone git://github.com/basho/riak.git`
 1. `cd riak/deps`
 1. `git clone git://github.com/basho/hanoi.git`
 1. `cd ..`
-1. `./deps/hanoi/enable-hanoi` # which does `make all devrel`
-
+1. `./deps/hanoi/enable-hanoi`
diff --git a/TODO b/TODO
index 9dd8112..1e72907 100644
--- a/TODO
+++ b/TODO
@@ -1,22 +1,28 @@
-* hanoi (in order of priority)
-  * [2i] secondary index support
+* Phase 1: Minimum viable product (in order of priority)
+  * lager; check for uses of lager:error/2
+  * configurable TOP_LEVEL size
+  * support for future file format changes
+  * Define a standard struct which is the metadata added at the end of the
+    file, e.g. [btree-nodes] [meta-data] [offset of meta-data]. This is written
+    in hanoi_writer:flush_nodes, and read in hanoi_reader:open2.
+  * test new snappy compression support
+  * Riak/KV secondary index (2i) support
   * atomic multi-commit/recovery
-  * add checkpoint/1 and sync/1 - flush pending writes to stable storage
-    (nursery:finish() and finish/flush any merges)
-  * [config] add config parameters on open
-    * {cache, bytes(), name} share max(bytes) cache named 'name' via etc
-  * [stats] statistics
-    * For each level {#merges, {merge-time-min, max, average}}
-  * [expiry] support for time based expiry, merge should eliminate expired data
+  * support for time based expiry, merge should eliminate expired data
+  * statistics
+    * for each level {#merges, {merge-time-min, max, average}}
   * add @doc strings and and -spec's
   * check to make sure every error returns with a reason {error, Reason}
-  * lager; check for uses of lager:error/2
-  * add version 1, crc to the files
-  * add compression via snappy (https://github.com/fdmanana/snappy-erlang-nif)
-  * add encryption
-  * adaptive nursery sizing
+
+* Phase 2: Production Ready
+  * dual-nursery
+  * cache for read-path
+    * {cache, bytes(), name} share max(bytes) cache named 'name' via etc
+
+* Phase 3: Wish List
   * add truncate/1 - quickly truncates a database to 0 items
   * count/1 - return number of items currently in tree
+  * adaptive nursery sizing
   * backpressure on fold operations
     - The "sync_fold" creates a snapshot (hard link to btree files), which provides consistent behavior but may
       use a lot of disk space if there is
@@ -24,22 +30,10 @@
     - The "async_fold" folds a limited number, and remembers the last key serviced, then picks up from there again. So you could see intermittent puts in a subsequent batch of results.
+  * add block-level encryption support
-PHASE 2:
-* hanoi
-  * Define a standard struct which is the metadata added at the end of the
-    file, e.g. [btree-nodes] [meta-data] [offset of meta-data]. This is written
-    in hanoi_writer:flush_nodes, and read in hanoi_reader:open2.
-  * [feature] compression, encryption on disk
-
-
-
-REVIEW LITERATURE AND OTHER SIMILAR IMPLEMENTATAIONS:
-* nessdb https://code.google.com/p/nessdb/source/browse/LSM-BTREE?r=3a1df166a19505a2369dd954e8fc6d0a545f3d7b
-* http://tokutek.com/downloads/mysqluc-2010-fractal-trees.pdf page 14+
-* http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.44.2782&rep=rep1&type=pdf
-
+## NOTES:
 1: make the "first level" have more thatn 2^5 entries (controlled by the constant TOP_LEVEL in hanoi.hrl); this means a new set of files is opened/closed/merged for every 32 insert/updates/deletes. Setting this higher will just make the nursery correspondingly larger, which should be absolutely fine.
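
Review note: the TOP_LEVEL note above and the File Descriptors hunk in DESIGN.md can be cross-checked with a little arithmetic. The sketch below is Python, not project code, and the function names `stable_levels` and `file_descriptors` are hypothetical. It assumes the descriptor formula is exactly 6*⌈log2(N)-TOP_LEVEL⌉ as written in DESIGN.md. The figures quoted there (72, 132, 192) come out when TOP_LEVEL is 8, which is an inference from the arithmetic and not something the patch states. With the 2^5 value mentioned in the note, the same formula gives 90, 150 and 210.

```python
import math


def stable_levels(n_records: int) -> list[int]:
    """Levels whose A-<level>.data file is full in "stable state".

    Level k holds 2^k records, so the full levels are the set bits of N.
    This ignores the nursery (an assumption made to keep the sketch small).
    """
    return [k for k in range(n_records.bit_length()) if (n_records >> k) & 1]


def file_descriptors(n_records: int, top_level: int) -> int:
    """Descriptor estimate from DESIGN.md: 6 * ceil(log2(N) - TOP_LEVEL)."""
    return 6 * math.ceil(math.log2(n_records) - top_level)


# 20 records = 4 + 16, so only A-2.data and A-4.data exist.
print(stable_levels(20))

for n in (10**6, 10**9, 10**12):
    # top_level=8 is inferred from the quoted figures; 5 is the value in the TODO note.
    print(n, file_descriptors(n, top_level=8), file_descriptors(n, top_level=5))
```

If the two documents are meant to describe the same build, either the File Descriptors figures or the 2^5 in the note may need updating; the sketch only shows that they do not agree under the formula as written.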