Update README

Kresten Krab Thorup 2012-01-07 17:54:35 +01:00
parent 15736dd82a
commit eae41cec83


@@ -1,45 +1,57 @@
> **NOTE** I erroneously called this engine `fractal_btree` when first released, but I have since learned that what we have implemented is indeed much less sophisticated than Tokutek's Fractal Tree®. Hence, the name change to `lsm_btree`.

# LSM B-Tree Storage

This Erlang-based storage engine may eventually provide an alternative to Basho Bitcask and Google's LevelDB. Of those two, `lsm_btree` is closer to LevelDB in operational characteristics, except that it uses fewer file descriptors than LevelDB and is not as performant, but it is implemented in just ~1000 lines of Erlang. For some, the benefit of having a clean and simple Erlang implementation makes it worth it.

Here's the bullet list:
- Very fast writes and deletes,
- Reasonably fast reads (N records are stored in log<sub>2</sub>(N) B-trees),
- Operations-friendly "append-only" storage (allows you to back up a live system, and crash-recovery is very simple)
- The cost of evicting stale key/values is amortized into insertion, so you don't need to schedule merges to happen at off-peak hours.
- Will support range queries (and thus, eventually, Riak 2i)
- Doesn't need a boatload of RAM
- All in 1000 lines of pure Erlang code

Once we're a bit more stable, we'll provide a Riak backend.

### LSM B-Trees vs. Fractal Trees

LSM B-Trees bear some resemblance to so-called "Fractal Trees&reg;", but LSM B-Trees are much simpler. For instance, our LSM B-Tree does not use any in-place updating (in a Fractal Tree, inner nodes have a buffer space which is updated in-place); also, we use bloom filters rather than fractional cascading to speed up multiple B-tree lookups. From published online documents, it does indeed look like fractal trees can be *much* faster (I have not run any performance tests myself).

You can read more about Fractal Trees in this slide deck from [Tokutek](http://www.tokutek.com/2011/11/how-fractal-trees-work-at-mit-today/), a company providing a MySQL backend based on Fractal Trees. They also own some patents related to fractal trees. I have not tried their TokuDB, but it looks truly amazing; I recommend that you try it out if you're hitting the limits of your current MySQL setup.
In "stable state", each level is either full or empty; so if there are e.g. 20 records stored, then levels #5 and #2 are full; the other ones are empty. ## How a LSM-BTree Works
If there are N records, there are log<sub>2</sub>(N) levels (each being a plain B-tree in a file named "A-*level*.data"). The file `A-0.data` has 1 record, `A-1.data` has 2 records, `A-2.data` has 4 records, and so on: `A-n.data` has 2<sup>n</sup> records.
In "stable state", each level file is either full (there) or empty (not there); so if there are e.g. 20 records stored, then there are only data in filed `A-2.data` (4 records) and `A-5.data` (16 records).

### Lookup

Lookup is quite simple: starting at `A-0.data`, the key is searched for in the B-tree there. If nothing is found, search continues to the next data file. So if there are *N* levels, then *N* disk-based B-tree lookups are performed. Each lookup is "guarded" by a bloom filter to improve the likelihood that disk-based searches are only done when likely to succeed.
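
A minimal sketch of that cascade, assuming nothing about the real `lsm_btree` API (an in-memory set stands in for each level's bloom filter, and a map stands in for the on-disk B-tree file):

```erlang
-module(lookup_demo).
-export([lookup/2]).

%% Levels are given as {Bloom, Tree} pairs, ordered from lowest (newest)
%% to highest (oldest) level.
lookup(_Key, []) ->
    not_found;
lookup(Key, [{Bloom, Tree} | HigherLevels]) ->
    case bloom_member(Bloom, Key) of
        false ->
            lookup(Key, HigherLevels);           % skip the "disk" read
        true ->
            case tree_lookup(Tree, Key) of
                {ok, Value} -> {ok, Value};      % newest value wins
                not_found   -> lookup(Key, HigherLevels)
            end
    end.

%% Stand-in for a real bloom filter: an exact set of keys.
bloom_member(Bloom, Key) ->
    sets:is_element(Key, Bloom).

%% Stand-in for a disk-based B-tree: an in-memory map.
tree_lookup(Tree, Key) ->
    case maps:find(Key, Tree) of
        {ok, Value} -> {ok, Value};
        error       -> not_found
    end.
```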

### Insertion

Insertion works by a mechanism known as B-tree injection. Insertion always starts by constructing a fresh B-tree with 1 element in it, and "injecting" that B-tree into level #0. So you always inject a B-tree of the same size as the size of the level you're injecting it into.

- If the level being injected into is empty (there is no A-*level*.data file), then the injected B-tree becomes the contents for that level (we just rename the file).
- Otherwise,
  - The injected tree file is renamed to B-*level*.data;
  - The files A-*level*.data and B-*level*.data are merged into a new temporary B-tree (of roughly double size), X-*level*.data.
  - The outcome of the merge is then injected into the next level.

While merging, lookups at level *n* first consult the B-*n*.data file, then the A-*n*.data file. At a given level, there can only be one merge operation active.
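
The injection cascade itself can be sketched in a few lines if each level is modelled as an in-memory sorted list of `{Key, Value}` entries rather than a B-tree file (names and representation are made up for this example; the real engine renames and merges files as described above):

```erlang
-module(inject_demo).
-export([insert/3]).

%% Levels are modelled as {LevelNo, SortedEntries} pairs.
%% Inserting one element builds a 1-entry "tree" and injects it into level 0.
insert(Key, Value, Levels) ->
    inject([{Key, Value}], 0, Levels).

inject(Tree, LevelNo, Levels) ->
    case lists:keyfind(LevelNo, 1, Levels) of
        false ->
            %% Level is empty: the injected tree becomes its contents
            %% (in the real engine this is just a file rename).
            lists:keysort(1, [{LevelNo, Tree} | Levels]);
        {LevelNo, Existing} ->
            %% Level is occupied: merge (the injected entries are newer and
            %% win), then push the roughly doubled tree into the next level.
            inject(merge(Tree, Existing), LevelNo + 1,
                   lists:keydelete(LevelNo, 1, Levels))
    end.

%% Merge two sorted entry lists; on equal keys the injected (newer) entry wins.
merge([], Old) -> Old;
merge(New, []) -> New;
merge([{K1, _} = E1 | T1], [{K2, _} | _] = Old) when K1 < K2 ->
    [E1 | merge(T1, Old)];
merge([{K1, _} = E1 | T1], [{K2, _} | T2]) when K1 =:= K2 ->
    [E1 | merge(T1, T2)];
merge(New, [E2 | T2]) ->
    [E2 | merge(New, T2)].
```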

### Overwrite and Delete

Overwrite is done by simply doing a new insertion. Since search always starts from the top (level #0 ... level #*n*), newer values will be at a lower level, and thus be found before older values. When merging, values stored in the injected tree (that come from a lower-numbered level) have priority over the contained tree.

Deletes are the same: they are also done by inserting a tombstone (a special value outside the domain of values). When a tombstone is merged at the currently highest numbered level it will be discarded. So tombstones have to bubble "down" to the highest numbered level before they can be truly evicted.
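
As an illustration of that last rule only (the actual tombstone representation in `lsm_btree` may differ; the atom `deleted` is an assumption here), a merge could decide whether to keep an entry like this:

```erlang
%% Keep or drop a merged entry; tombstones are only dropped when merging
%% at the highest-numbered level, otherwise they keep bubbling down.
merge_keep({_Key, deleted}, _AtLastLevel = true)  -> false;  % gone for good
merge_keep({_Key, deleted}, _AtLastLevel = false) -> true;   % keep bubbling
merge_keep({_Key, _Value},  _AtLastLevel)         -> true.   % ordinary entry

%% e.g. lists:filter(fun(E) -> merge_keep(E, AtLastLevel) end, MergedEntries)
```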

## Merge Logic

The really clever thing about this storage mechanism is that merging is guaranteed to be able to "keep up" with insertion. Bitcask, for instance, has a similar merging phase, but it is separated from insertion. This means that there can suddenly be a lot of catching up to do. The flip side is that you can then decide to do all merging at off-peak hours, but it is yet another thing that needs to be configured.

With LSM B-Trees, back-pressure is provided by the injection mechanism, which only returns when an injection is complete. Thus, every 2nd insert needs to wait for level #0 to finish the required merging; which (assuming merging has linear I/O complexity) is enough to guarantee that the merge mechanism can keep up at higher-numbered levels.

OK, I've told you a lie. In practice, it is not practical to create a new file for each insert (injection at level #0), so we allow you to define the "top level" to be a number higher than #0; currently defaulting to #6 (32 records). That means that you take the amortization "hit" for every 32 inserts.
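
One way to realize that batching (a sketch under assumptions, not the actual implementation) is to buffer writes in memory and only inject a whole batch once the configured top-level size is reached, so the merge cost is paid once per batch rather than on every insert:

```erlang
-module(nursery_demo).
-export([put/3]).

-define(TOP_SIZE, 32).   %% assumed batch size, mirroring the "32 records" default above

%% Writes first land in an in-memory buffer (a map); when the buffer is full,
%% the whole batch is sorted and handed off for injection.
put(Key, Value, Buffer0) ->
    Buffer = maps:put(Key, Value, Buffer0),
    case maps:size(Buffer) >= ?TOP_SIZE of
        true ->
            inject_batch(lists:keysort(1, maps:to_list(Buffer))),
            #{};                              % start a fresh buffer
        false ->
            Buffer
    end.

%% Placeholder: in the real engine this would write a B-tree file and inject it.
inject_batch(SortedEntries) ->
    io:format("injecting ~p entries~n", [length(SortedEntries)]).
```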