Gregory Burd 2012-04-24 10:37:45 -04:00
parent 86516d4b2d
commit e4d8615a99
3 changed files with 54 additions and 50 deletions

DESIGN.md

@@ -1,6 +1,6 @@
-# Hanoi's Design
+# Hanoi's Design: How this LSM-BTree Works

### Basics

If there are N records, there are log<sub>2</sub>(N) levels (each being a plain B-tree in a file named "A-*level*.data"). The file `A-0.data` has 1 record, `A-1.data` has 2 records, `A-2.data` has 4 records, and so on: `A-n.data` has 2<sup>n</sup> records.

In "stable state", each level file is either full (there) or empty (not there); so if there are e.g. 20 records stored, then there is data only in the files `A-2.data` (4 records) and `A-4.data` (16 records).
@@ -28,7 +28,6 @@ Deletes are the same: they are also done by inserting a tombstone (a special val

## Merge Logic

The really clever thing about this storage mechanism is that merging is guaranteed to be able to "keep up" with insertion. Bitcask, for instance, has a similar merging phase, but it is separated from insertion, which means that there can suddenly be a lot of catching up to do. The flip side is that you can then decide to do all merging at off-peak hours, but it is yet another thing that needs to be configured.

With LSM B-Trees, back-pressure is provided by the injection mechanism, which only returns when an injection is complete. Thus, every 2nd insert needs to wait for level #0 to finish the required merging; which - assuming merging has linear I/O complexity - is enough to guarantee that the merge mechanism can keep up at higher-numbered levels.
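To make the keep-up argument concrete: a merge into level L moves about 2<sup>L+1</sup> records but is triggered only once per 2<sup>L+1</sup> inserts, so each level adds roughly one record-copy of amortized work per insert, about log<sub>2</sub>(N) record-copies in total. A back-of-envelope Erlang sketch (illustrative, not hanoi code):

```erlang
%% Illustrative sketch, not hanoi code: amortized merge I/O per insert.
%% Each record is rewritten once for every level it descends through, so
%% N inserts cost roughly N * log2(N) record-copies of merge I/O overall.
-module(merge_cost_sketch).
-export([per_insert/1]).

%% Approximate record-copies of merge work per insert, for N records.
per_insert(N) when is_integer(N), N > 1 ->
    trunc(math:log2(N)) + 1.
```

For a store of 2<sup>20</sup> records this gives 21 record-copies per insert, which is why merges with linear I/O complexity are enough to keep up.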
@@ -71,6 +70,4 @@ When X is closed and clean, it is actually intermittently renamed M so that if t

ABC files have 2^level KVs in them, regardless of the size of those KVs. XM files have approximately 2^(level+1), since tombstone merges or repeated PUTs might of course reduce the numbers.

### File Descriptors

Hanoi needs a lot of file descriptors, currently 6*⌈log<sub>2</sub>(N)-TOP_LEVEL⌉, with a nursery of size 2<sup>TOP_LEVEL</sup> and N key/value pairs in the store. Thus, storing 1,000,000 KVs needs 72 file descriptors, storing 1,000,000,000 records needs 132, and 1,000,000,000,000 records needs 192.
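As a sanity check, the formula can be evaluated directly; the quoted figures come out if one assumes TOP_LEVEL = 8 (i.e. a nursery of 2<sup>8</sup> = 256 entries), an assumption made here purely for illustration:

```erlang
%% Illustrative sketch of the formula above: 6 * ceil(log2(N) - TOP_LEVEL).
%% TopLevel = 8 is an assumption that reproduces the quoted figures.
-module(fd_sketch).
-export([fd_count/2]).

fd_count(N, TopLevel) when is_integer(N), N > 1 ->
    6 * ceil(math:log2(N) - TopLevel).
```

With `TopLevel = 8` this yields `fd_count(1000000, 8) -> 72`, `fd_count(1000000000, 8) -> 132`, and `fd_count(1000000000000, 8) -> 192`, matching the text.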

README.md

@@ -1,29 +1,42 @@
# Hanoi Key/Value Storage Engine

-This Erlang-based storage engine implements a structure somewhat like LSM-trees (Log-Structured Merge Trees, see docs/10.1.1.44.2782.pdf). The notes below describe how this storage engine works; I have not done extensive study of how it differs from other storage mechanisms, but a brief browse through available online resources on LSM-trees indicates that this storage engine is quite different in several respects.
-
-The storage engine can function as an alternative backend for Basho's Riak/KV.
+This storage engine implements a structure somewhat like LSM-trees
+(Log-Structured Merge Trees, see docs/10.1.1.44.2782.pdf). The notes in
+DESIGN.md describe how this storage engine works; I have not done extensive
+study of how it differs from other storage mechanisms, but a brief review of
+available research on LSM-trees indicates that this storage engine is quite
+different in several respects.

Here's the bullet list:

- Insert, Delete and Read all have worst case log<sub>2</sub>(N) complexity.
-- The cost of evicting stale key/values is amortized into insertion, so you don't need to schedule merges to happen at off-peak hours.
-- Operations-friendly "append-only" storage (allows you to back up a live system, and crash-recovery is very fast)
-- Supports range queries (and thus eventually Riak 2i).
-- Doesn't need much RAM, but does need a lot of file descriptors
-- All around 3000 lines of pure Erlang code
+- The cost of evicting stale key/values is amortized into insertion
+  - you don't need a separate eviction thread to keep memory use low
+  - you don't need to schedule merges to happen at off-peak hours
+- Operations-friendly "append-only" storage
+  - allows you to back up a live system
+  - crash-recovery is very fast and the logic is straightforward
+- Supports efficient range queries
+- Uses bloom filters to avoid unnecessary lookups on disk
+- Efficient resource utilization
+  - Doesn't store all keys in memory
+  - Uses a modest number of file descriptors, proportional to the number of levels
+  - IO is generally balanced between random and sequential
+  - Low CPU overhead
+- ~2000 lines of pure Erlang code in src/*.erl

-### Deploying the hanoi for testing with Riak/KV
+### How to deploy Hanoi as a Riak/KV backend
+
+This storage engine can function as an alternative backend for Basho's Riak/KV.

You can deploy `hanoi` into a Riak devrel cluster using the `enable-hanoi`
script. Clone the `riak` repo, change your working directory to it, and then
execute the `enable-hanoi` script. It adds `hanoi` as a dependency, runs `make
all devrel`, and then modifies the configuration settings of the resulting dev
nodes to use the hanoi storage backend.

1. `git clone git://github.com/basho/riak.git`
1. `cd riak/deps`
1. `git clone git://github.com/basho/hanoi.git`
1. `cd ..`
-1. `./deps/hanoi/enable-hanoi` # which does `make all devrel`
+1. `./deps/hanoi/enable-hanoi`

TODO

@@ -1,22 +1,28 @@
-* hanoi (in order of priority)
-  * [2i] secondary index support
+* Phase 1: Minimum viable product (in order of priority)
+  * lager; check for uses of lager:error/2
+  * configurable TOP_LEVEL size
+  * support for future file format changes
+    * Define a standard struct which is the metadata added at the end of the
+      file, e.g. [btree-nodes] [meta-data] [offset of meta-data]. This is written
+      in hanoi_writer:flush_nodes, and read in hanoi_reader:open2.
+  * test new snappy compression support
+  * Riak/KV secondary index (2i) support
  * atomic multi-commit/recovery
-  * add checkpoint/1 and sync/1 - flush pending writes to stable storage
-    (nursery:finish() and finish/flush any merges)
-  * [config] add config parameters on open
-    * {cache, bytes(), name} share max(bytes) cache named 'name' via ets
-  * [stats] statistics
-    * For each level {#merges, {merge-time-min, max, average}}
-  * [expiry] support for time based expiry, merge should eliminate expired data
+  * support for time based expiry, merge should eliminate expired data
+  * statistics
+    * for each level {#merges, {merge-time-min, max, average}}
  * add @doc strings and -spec's
  * check to make sure every error returns with a reason {error, Reason}
-  * lager; check for uses of lager:error/2
-  * add version 1, crc to the files
-  * add compression via snappy (https://github.com/fdmanana/snappy-erlang-nif)
-  * add encryption
-  * adaptive nursery sizing
+* Phase 2: Production Ready
+  * dual-nursery
+  * cache for read-path
+    * {cache, bytes(), name} share max(bytes) cache named 'name' via ets
+* Phase 3: Wish List
  * add truncate/1 - quickly truncates a database to 0 items
  * count/1 - return number of items currently in tree
+  * adaptive nursery sizing
  * backpressure on fold operations
    - The "sync_fold" creates a snapshot (hard link to btree files), which
      provides consistent behavior but may use a lot of disk space if there is
@@ -24,22 +30,10 @@
    - The "async_fold" folds a limited number, and remembers the last key
      serviced, then picks up from there again. So you could see intermittent
      puts in a subsequent batch of results.
+  * add block-level encryption support

-PHASE 2:
-* hanoi
-  * Define a standard struct which is the metadata added at the end of the
-    file, e.g. [btree-nodes] [meta-data] [offset of meta-data]. This is written
-    in hanoi_writer:flush_nodes, and read in hanoi_reader:open2.
-  * [feature] compression, encryption on disk
+## NOTES:

-REVIEW LITERATURE AND OTHER SIMILAR IMPLEMENTATIONS:
-* nessdb https://code.google.com/p/nessdb/source/browse/LSM-BTREE?r=3a1df166a19505a2369dd954e8fc6d0a545f3d7b
-* http://tokutek.com/downloads/mysqluc-2010-fractal-trees.pdf page 14+
-* http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.44.2782&rep=rep1&type=pdf

1: make the "first level" have more than 2^5 entries (controlled by the constant TOP_LEVEL in hanoi.hrl); this means a new set of files is opened/closed/merged for every 32 inserts/updates/deletes. Setting this higher will just make the nursery correspondingly larger, which should be absolutely fine.