We were delegating too much work. The original
algorithm description said that for each insert,
"1" unit of merge work has to be done
*at each level* … implying that if nothing needs
doing at a level, that "not done work" does not
add to work done elsewhere. This fix gets us back
to that situation (by always subtracting at least
2^TOP_LEVEL from the presented work amount), while
maintaining the (beneficial) effect of chunking
merge work at anything but the last level.
Effectively, this reduces the maximum amount of
merge work done, also reducing our worst case
latency.
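A minimal sketch of the accounting described above, written in Python rather than the Erlang of the implementation. The function name, the list-of-levels representation, and the value TOP_LEVEL = 8 are assumptions made for illustration; only the rule of subtracting at least 2^TOP_LEVEL from the presented work comes from the text.

```python
TOP_LEVEL = 8          # assumed value, for illustration only
UNIT = 2 ** TOP_LEVEL  # minimum amount charged at every level


def delegate(budget, pending):
    """Walk the levels top-down and return the work done at each one.

    `pending` holds the merge work outstanding at each level.
    Every level is charged at least UNIT, even when it does nothing,
    so an idle level no longer hands its share to the next level.
    """
    done = []
    for needed in pending:
        here = min(budget, needed)
        done.append(here)
        budget = max(0, budget - max(here, UNIT))
    return done


# The first level is idle, yet it still consumes one unit of the budget.
print(delegate(3 * UNIT, [0, 1000, 1000]))
```

With these inputs the second level does 512 units of work instead of the full 768 it would have inherited without the floor.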
Now that we understand this, we can refactor the
algorithm to delegate "DoneWork", because then
each level can determine the total work, and see
if any work is left "for me". That's next.
When scanning just one file (because all its keys
are after the ones in the other file), we may also
need hibernation to save memory. In particular,
the bloom filters being built take a lot of memory.
These two parameters (defaulting to 512k) control
the amount of erlang file buffer space to allocate
for delayed_write and read_ahead when merging.
This config parameter is *per merge task*, of which
there can be many for each open HanoiDB; and again
multiplied by the number of active vnodes in Riak.
As such, this config parameter is significant
for the memory usage of a Riak with Hanoi, but setting
it too low will kill the performance.
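To get a feel for the multiplication described above, here is a rough estimate in Python. The function and its parameter names are hypothetical, and the assumption that one merge task reads two input files and writes one output file is mine, as are the example counts of 4 merges and 64 vnodes.

```python
KIB = 1024


def merge_buffer_bytes(read_buf, write_buf, merges_per_store, vnodes,
                       inputs_per_merge=2):
    """Estimate file buffer space used by merge tasks on one node.

    Assumes each merge task holds one read_ahead buffer per input
    file and one delayed_write buffer for its output file.
    """
    per_task = inputs_per_merge * read_buf + write_buf
    return per_task * merges_per_store * vnodes


# Defaults of 512k each, with 4 concurrent merges in each of 64 vnodes.
total = merge_buffer_bytes(512 * KIB, 512 * KIB,
                           merges_per_store=4, vnodes=64)
print(total // (KIB * KIB), "MiB")
```

Under those assumptions the buffers alone come to 384 MiB, which is why the setting matters on a node with many vnodes.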
Analysis seems to indicate that merge processes
(from high-numbered levels) tend to be activated
quite infrequently. Thus, we term-to-bin/gzip the
merge process state, and invoke explicit gc
before waiting for a {step, …} message again.
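The idea can be illustrated with a Python analogue, where pickle plus zlib stand in for Erlang's term_to_binary with compression. The function names and the shape of the example state are assumptions for illustration, not the module's actual state record.

```python
import pickle
import zlib


def freeze(state):
    """Serialize and compress an idle process state into one blob."""
    return zlib.compress(pickle.dumps(state))


def thaw(blob):
    """Restore the state when the next step message arrives."""
    return pickle.loads(zlib.decompress(blob))


# Hypothetical merge state with a buffer of pending key/value pairs.
state = {"level": 12,
         "buffer": [("key%d" % i, b"v" * 100) for i in range(1000)]}

blob = freeze(state)
assert thaw(blob) == state

# Compare the serialized size with the compressed size.
print(len(pickle.dumps(state)), "->", len(blob))
```

The trade-off is extra CPU on each wake-up in exchange for a smaller resident state, which only pays off because these processes are woken rarely.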
Looks like we're generating a lot of garbage
here. Moving this to a separate process lets
us avoid a lot of garbage collection work, since
we don't cache these parsed nodes anyway.
In some cases, inner nodes were not being emitted.
This would sometimes cause queries (get / range_fold)
to only include results in a right-most branch.
When merge is completed, and inject-to-next-level
is pending, there is still a B file, but no
current merge_pid. In this case, don't try
to do merge work at this level.
This involves some cleanup/reorg of code
in hanoi_util. Streaming trees and nursery
now use the same CRC checking code.
Future: Keep the CRC-encoded binary around,
and reuse it when writing trees. This will reduce
cpu costs involved in re-computing those all the
time.
Institutionalize the way hanoi_level handles RPC.
This is embodied in a new module, which should be
pushed to plain_fsm, but we'll keep it here for
now.
Now incremental merge has a new strategy.
Instead of doing the same amount of merge
work at all levels, we now compute the total
merge work load, and do as much as possible
on the first level, subtract work done, and
delegate to the next level, etc.
The effect of this is that we do more IO on
fewer files, improving sequential-ness of
the workload involved in the incremental merge.
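The strategy above can be sketched as follows, in Python rather than Erlang. The function name and the representation of levels as a list of outstanding work amounts are assumptions for illustration.

```python
def incremental_merge(unit, pending):
    """Spend the total work budget greedily, top level first.

    `unit` is the work owed per level for one insert, and `pending`
    is the merge work outstanding at each level, top level first.
    """
    total = unit * len(pending)  # total merge work load
    done = []
    for needed in pending:
        here = min(total, needed)  # do as much as possible here
        done.append(here)
        total -= here              # subtract work done, then delegate
    return done


# Three levels: the idle middle level passes its share downwards.
print(incremental_merge(256, [100, 0, 1000]))
```

Here the last level ends up doing 668 units in one go, so the same total work touches fewer files per step.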
This refactoring just adds state to the
master gen_server of a Hanoi instance so that
it knows the current number of levels. Until now,
we've only held a reference to the current
top level.
If we're opening a hanoi store configured with
smaller nursery size than the default, then
we need to make sure that we also open the
small levels.
Future feature is to actually squash the
smaller levels.
This improves recovery two-fold:
1. make sure that we actually wait for initial
merge to complete (issue incremental_merge(0))
2. compute the minimum merge work required
to establish the invariant that there's room
for a new nursery inject at any time.
option {compression, none|gzip|snappy}
... except right now using snappy is broken;
it seems to cause bloom filters to
crash. Needs investigation.
option {block_size, 32768}
... writes data to disk in chunks of ~32k.
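As a rough picture of how the two options interact, the Python sketch below groups encoded entries into blocks of about block_size bytes and compresses each block. The function name is hypothetical, zlib stands in for the gzip option, and the real on-disk format is not described here, so treat this purely as an illustration.

```python
import zlib


def write_blocks(entries, block_size=32768, compress=zlib.compress):
    """Group encoded entries into compressed blocks of ~block_size bytes.

    A block is closed as soon as its uncompressed size reaches
    block_size, so blocks can overshoot by up to one entry.
    """
    blocks, current, size = [], [], 0
    for entry in entries:
        current.append(entry)
        size += len(entry)
        if size >= block_size:
            blocks.append(compress(b"".join(current)))
            current, size = [], 0
    if current:  # flush the final, partially filled block
        blocks.append(compress(b"".join(current)))
    return blocks


# Ten 10000-byte entries end up in blocks of 4, 4 and 2 entries.
entries = [bytes([i]) * 10000 for i in range(10)]
blocks = write_blocks(entries)
print(len(blocks))
```

Passing `compress=lambda data: data` would correspond to the `none` setting in this sketch.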
When re-opening a Hanoi data store, we need to
reestablish the invariant that there is always
room to inject a data file at the top level.
In a worst case scenario, every level has all of
A, B, and C; and thus needs to merge A+B -> X
fully in order to accommodate what the parent
will inject. That merge is bounded, because
2*BTREE_SIZE(Level) >= sizeof(A+B).
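A small Python sketch of the resulting upper bound on recovery work. It assumes BTREE_SIZE(Level) is 2^Level and represents each level as the set of files it holds; both the representation and the function names are illustrative, not taken from the code.

```python
def btree_size(level):
    """Assumed capacity of one file at `level`: 2^level entries."""
    return 2 ** level


def recovery_work(levels):
    """Upper bound on merge work needed before injects are safe.

    `levels` maps a level number to the set of files present there.
    A level holding A, B and C has no room for another inject, so
    its A+B merge must run to completion, costing at most
    2 * btree_size(level).
    """
    work = 0
    for level, files in levels.items():
        if {"A", "B", "C"} <= files:
            work += 2 * btree_size(level)
    return work


# Levels 8 and 9 are full; level 10 still has room for an inject.
print(recovery_work({8: {"A", "B", "C"},
                     9: {"A", "B", "C"},
                     10: {"A", "B"}}))
```

For this example the bound is 2*256 + 2*512 = 1536 units of merge work.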
This change makes incremental merge run concurrently
with filling up the nursery. So instead of waiting
for an incremental merge to complete before returning
from insert, it
- blocks waiting for a possible previous incremental merge to complete
- issues a new incremental merge.
This improves put latencies, but not throughput.
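The two steps above can be sketched in Python with a single background worker. The class, its method names, and the use of a thread pool are assumptions for illustration; the actual implementation uses Erlang processes and messages.

```python
from concurrent.futures import ThreadPoolExecutor


class Store:
    """Toy store whose insert overlaps with the previous merge step."""

    def __init__(self, merge_step):
        self.nursery = {}
        self._merge_step = merge_step
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._pending = None  # merge issued by the previous insert

    def insert(self, key, value):
        # Block only on a possible previous incremental merge.
        if self._pending is not None:
            self._pending.result()
        self.nursery[key] = value
        # Issue a new incremental merge and return without waiting.
        self._pending = self._pool.submit(self._merge_step)

    def close(self):
        if self._pending is not None:
            self._pending.result()
        self._pool.shutdown()


steps = []
store = Store(lambda: steps.append("merge"))
for i in range(3):
    store.insert(i, str(i))
store.close()
print(len(steps))
```

Each insert still triggers exactly one merge step, so the total work is unchanged; only the waiting moves off the caller's path, which matches the note that latency improves but throughput does not.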
This slows down insert to be O(log2(N)), where N is
the total number of objects in the store. The upside
is that it also removes the terrible worst case
scenarios for insert.