Right now, this is controlled by the macro
INC_MERGE_STEP in hanoidb_nursery; eventually
we should turn this into a configuration option.
Making this small (the minimum is 1) hurts average
performance but reduces the 99.9th percentile latency.
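For illustration only, the tradeoff in code form (the value shown
is made up, and the option name below is an assumption):

    %% Today: a compile-time macro in hanoidb_nursery.
    -define(INC_MERGE_STEP, 1).  %% minimum; best tail latency, worst average

    %% Eventually, perhaps a per-store option along the lines of:
    %%   hanoidb:open(Dir, [{inc_merge_step, 1}])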
Tree nodes now hold entries of the form
{Key, ?TOMBSTONE
| BinValue
| {?TOMBSTONE, TStamp}
| {BinValue, TStamp}}
We use the form without TStamp when expiry_secs
is unset or set to 0 (i.e., values don't expire).
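A minimal sketch of how an entry might be built, assuming TStamp
records the write time in seconds (the helper name is hypothetical):

    %% Value is either a binary or hanoidb's ?TOMBSTONE marker;
    %% both take the same path.
    entry(Key, Value, 0) ->                          %% expiry disabled: plain form
        {Key, Value};
    entry(Key, Value, ExpirySecs) when ExpirySecs > 0 ->
        {Key, {Value, erlang:system_time(second)}}.  %% timestamped form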
merger/writer: Move the KV count into the writer, because
writer:add now determines whether a value is expired and
thus whether it is actually written. The writer therefore
gets a new API function which returns the KV count
written so far.
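Roughly, as a sketch (the record and function names are
assumptions, not the real writer API):

    -record(w, {count = 0, expiry_secs = 0}).

    add(#w{expiry_secs = E} = W, _Key, {_Val, TStamp})
      when E > 0, TStamp + E =< erlang:system_time(second) ->
        W;                            %% already expired: skip, count unchanged
    add(#w{count = N} = W, _Key, _Value) ->
        %% ...append the entry to the block being written...
        W#w{count = N + 1}.

    count(#w{count = N}) -> N.        %% the new accessor: KVs written so far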
reader: The lookup/fold API hides the TStamp tuples; only
the next_node API used by the merger sees the
{Key, {_, TStamp}} entries.
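For illustration, the unwrapping amounts to something like this
(strip/2 is a made-up helper; the tombstone value is a stand-in
for the real marker):

    -define(TOMBSTONE, deleted).

    strip(?TOMBSTONE, _ExpirySecs)            -> not_found;
    strip({?TOMBSTONE, _TStamp}, _ExpirySecs) -> not_found;
    strip({Value, TStamp}, ExpirySecs) when is_binary(Value) ->
        case TStamp + ExpirySecs =< erlang:system_time(second) of
            true  -> not_found;                %% expired: hide it
            false -> {ok, Value}
        end;
    strip(Value, _ExpirySecs) when is_binary(Value) -> {ok, Value}.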
nursery: As in the reader, the TStamp'ed tuples are not
exposed in the client API; expired values are simply not
returned from fold/lookup.
hanoidb: Add config option {expiry_secs, N}.
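Usage might then look like this (assuming the option is given to
hanoidb:open; the path and TTL are illustrative):

    {ok, Tree} = hanoidb:open("/tmp/hanoi", [{expiry_secs, 3600}]),  %% 1 hour TTL
    ok = hanoidb:put(Tree, <<"key">>, <<"value">>).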
other modules: Make sure the config is passed all the way
down through (sub)processes so the option can be honored
everywhere.
test: Update to work with the new option.
BREAKING CHANGE! This prepares for future file format
changes, but it also breaks backwards compatibility.
Also describe the file format in design_document.
With this change, GETs will flow concurrently
down through the level controllers, replying
directly to the caller via gen_server:reply.
Very actor-like :-)
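The pattern in a nutshell (the state record and lookup helper are
made up):

    -record(state, {next}).

    handle_call({get, Key}, From, State = #state{next = Next}) ->
        case lookup_here(Key, State) of
            {ok, Value} ->
                {reply, {ok, Value}, State};
            not_found ->
                Next ! {get, From, Key},     %% delegate to the next level
                {noreply, State}             %% the caller stays blocked; we don't
        end.

    %% Whichever process finally resolves the key answers the caller
    %% directly: gen_server:reply(From, {ok, Value}).
    lookup_here(_Key, _State) -> not_found.  %% stand-in for the real lookup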
The current code base silently ignores CRC errors, meaning
that KVs with errors will simply disappear, or may show up
as a previously stored value for the same key.
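A minimal sketch of surfacing the error instead (the on-disk
framing assumed here is a CRC32 prefix, which may not match the
real format):

    decode(<<CRC:32/unsigned, Body/binary>>) ->
        case erlang:crc32(Body) of
            CRC -> {ok, binary_to_term(Body)};
            _   -> {error, crc_mismatch}     %% don't silently drop the KV
        end.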
With this change, the fold worker does not
link to the receiver; now it simply monitors
the receiving process. If the receiver dies,
the fold worker dies normally.
The individual fold processes running on level files are
linked to the fold worker, so between the fold worker and
those, normal link/kill semantics apply.
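In outline (names are assumptions):

    start(Receiver, LevelPids) ->
        MRef = erlang:monitor(process, Receiver),  %% monitor, don't link
        [link(P) || P <- LevelPids],               %% per-level folds stay linked
        loop(Receiver, MRef).

    loop(Receiver, MRef) ->
        receive
            {'DOWN', MRef, process, Receiver, _Reason} ->
                ok;                                %% receiver died: exit normally
            Msg ->
                Receiver ! Msg,
                loop(Receiver, MRef)
        end.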
This makes fold-from-snapshot use the back pressure model:
chunks of up to 100 KVs are delivered to the merge worker
via plain_rpc:call.
The back pressure is entirely internal to hanoi,
designed to ensure that the process that merges
fold results from the individual levels is not
swamped with fold data.
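On the producing side this amounts to something like the
following sketch (the message shape is an assumption; plain_rpc
is hanoi's internal call/reply helper):

    deliver(MergeWorker, Chunk) when length(Chunk) =< 100 ->
        %% The call blocks until the merge worker has taken the
        %% chunk; that blocking is the back pressure.
        _Ack = plain_rpc:call(MergeWorker, {fold_chunk, Chunk}),
        ok.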
Folds with a limit < 10 still do "blocking fold"
which is more efficient and uses fewer FDs, but
blocks concurrent put/get operations.
This first step of the fold back pressure implementation
changes the fold worker so that it does not get flooded
by messages. Now, we take messages
and put them in queues (one per fold source),
so we don't have to do selective receive on
bazillions of messages.
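Sketch of the queueing (the message shape is assumed; uses the
stdlib queue module):

    drain(Queues) ->
        receive
            {kv, SourcePid, KV} ->
                Q0 = maps:get(SourcePid, Queues, queue:new()),
                drain(Queues#{SourcePid => queue:in(KV, Q0)})
        after 0 ->
            Queues   %% mailbox drained: merge from the queue heads
        end.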
Now merge work computation is close to ideal.
It does not take into account the actual size
of files at each level, but we have not figured
out how to utilize that knowledge.
We were delegating too much work. The original
algorithm description said that for each insert,
"1" unit of merge work has to be done
*at each level* … implying that if nothing needs
doing at a level, that "not done work" does not
add to work done elsewhere. This fix gets us back
to that situation (by always subtracting at least
2^TOP_LEVEL from the presented work amount), while
maintaining the (beneficial) effect of chunking
merge work at anything but the last level.
Effectively, this reduces the maximum amount of
merge work done, also reducing our worst case
latency.
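As a sketch (the ?TOP_LEVEL value is illustrative):

    -define(TOP_LEVEL, 8).

    %% A level deducts at least 2^TOP_LEVEL before presenting the
    %% remainder to the next level, so an idle level cannot pass
    %% its full quota downstream.
    pass_down(PresentedWork, DoneHere) ->
        max(0, PresentedWork - max(DoneHere, 1 bsl ?TOP_LEVEL)).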
Now that we understand this, we can refactor the
algorithm to delegate "DoneWork", because then
each level can determine the total work, and see
if any work is left "for me". That's next.
When scanning just one file (because all its keys come
after the ones in the other file), we may also need
hibernation to save memory. In particular, the bloom
filters being built take a lot of memory.
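A sketch of the shape (the chunk function is a stand-in):

    scan_step(State0) ->
        State = copy_chunk(State0),                     %% next run of entries
        erlang:hibernate(?MODULE, scan_step, [State]).  %% compact heap between chunks

    copy_chunk(State) -> State.  %% stand-in for the real chunk of scan work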