With this change, the fold worker no longer links
to the receiver; instead, it monitors the receiving
process. If the receiver dies, the fold worker
exits normally.
The individual fold processes running on the level
files are still linked to the fold worker, so normal
link/kill semantics apply between the fold merge
worker and those processes.
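A minimal sketch of this monitor-based relationship (module and
function names here are invented for illustration, not HanoiDB's
actual code): the worker monitors its receiver and, when the
receiver goes away, simply returns, exiting with reason normal.

```erlang
-module(fold_monitor_sketch).
-export([demo/0]).

demo() ->
    Receiver = spawn(fun() -> receive stop -> ok end end),
    Worker = spawn(fun() -> worker(Receiver) end),
    WorkerMon = erlang:monitor(process, Worker),
    Receiver ! stop,                       % receiver dies...
    receive
        {'DOWN', WorkerMon, process, Worker, normal} ->
            ok                             % ...worker exits *normally*
    after 1000 ->
        timeout
    end.

worker(Receiver) ->
    Mon = erlang:monitor(process, Receiver),
    receive
        {'DOWN', Mon, process, Receiver, _Reason} ->
            ok  % return instead of crashing via a link
    end.
```

Had the worker been linked instead, the receiver's death would have
propagated as an exit signal rather than a normal shutdown.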
This makes fold-from-snapshot use the back-pressure
model of issuing a plain_rpc:call to the merge
worker, delivering chunks of 100 KVs per call.
The back pressure is entirely internal to hanoi;
it is designed to ensure that the process merging
fold results from the individual levels is not
swamped with fold data.
Folds with a limit < 10 still do a "blocking fold",
which is more efficient and uses fewer file
descriptors, but blocks concurrent put/get operations.
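The call-based back pressure can be sketched as follows (the module
and message shapes are illustrative, not hanoi's plain_rpc): the
fold side sends at most 100 KVs per synchronous call and cannot
proceed until the merge worker has replied.

```erlang
-module(chunked_fold_sketch).
-export([demo/0]).

-define(CHUNK_SIZE, 100).

demo() ->
    Merger = spawn(fun() -> merge_loop(0) end),
    KVs = [{K, K * K} || K <- lists:seq(1, 250)],
    ok = send_chunks(Merger, KVs),
    Merger ! {total, self()},
    receive {total_is, N} -> N end.

send_chunks(_Merger, []) ->
    ok;
send_chunks(Merger, KVs) ->
    {Chunk, Rest} = take(?CHUNK_SIZE, KVs),
    %% synchronous "call": block until the merge worker acknowledges,
    %% so the fold can never run ahead of the merger
    Ref = make_ref(),
    Merger ! {chunk, self(), Ref, Chunk},
    receive {ack, Ref} -> ok end,
    send_chunks(Merger, Rest).

take(N, List) when length(List) =< N -> {List, []};
take(N, List) -> lists:split(N, List).

merge_loop(Count) ->
    receive
        {chunk, From, Ref, Chunk} ->
            From ! {ack, Ref},
            merge_loop(Count + length(Chunk));
        {total, From} ->
            From ! {total_is, Count}
    end.
```

Because each chunk is acknowledged before the next is sent, the
merge worker's mailbox never holds more than one chunk per source.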
This first step of the fold back-pressure
implementation changes the fold worker so that it
does not get flooded with messages. Incoming
messages are now placed in queues (one per fold
source), so we do not have to do selective receive
over a mailbox with bazillions of messages.
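The per-source queueing can be sketched like this (names invented
for the example): each incoming message is appended to a FIFO queue
keyed by its source, and consumers pull from a specific source's
queue instead of pattern-matching against the whole mailbox.

```erlang
-module(fold_queues_sketch).
-export([demo/0]).

demo() ->
    Q0 = maps:new(),
    Q1 = enqueue(level_1, a, Q0),
    Q2 = enqueue(level_2, x, Q1),
    Q3 = enqueue(level_1, b, Q2),
    {a, Q4} = dequeue(level_1, Q3),   % per-source FIFO order
    {b, _}  = dequeue(level_1, Q4),
    {x, _}  = dequeue(level_2, Q3),
    ok.

%% append Msg to the queue belonging to Source
enqueue(Source, Msg, Queues) ->
    Q = maps:get(Source, Queues, queue:new()),
    maps:put(Source, queue:in(Msg, Q), Queues).

%% pop the oldest message queued for Source
dequeue(Source, Queues) ->
    {{value, Msg}, Q2} = queue:out(maps:get(Source, Queues)),
    {Msg, maps:put(Source, Q2, Queues)}.
```

Selective receive scans the entire mailbox on every match attempt;
draining messages into explicit queues keeps each lookup O(1).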
The merge work computation is now close to ideal.
It does not take the actual size of the files at
each level into account, but we have not yet
figured out how to exploit that knowledge.
We were delegating too much work. The original
algorithm description said that for each insert,
"1" unit of merge work has to be done
*at each level* … implying that if nothing needs
doing at a level, that "not done" work does not
add to the work done elsewhere. This fix gets us
back to that situation (by always subtracting at
least 2^TOP_LEVEL from the presented work amount),
while maintaining the (beneficial) effect of
chunking merge work at anything but the last level.
Effectively, this reduces the maximum amount of
merge work done at a time, which also reduces our
worst-case latency.
Now that we understand this, we can refactor the
algorithm to delegate "DoneWork", because then
each level can determine the total work and see
whether any work is left "for me". That's next.
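The arithmetic of the fix can be illustrated as follows (the
function and its name are made up for this example, not hanoi's
actual code): subtracting at least 2^TOP_LEVEL from the presented
work amount, and never going below zero, caps how much merge work
is actually delegated.

```erlang
-module(merge_work_sketch).
-export([effective_work/2]).

%% PresentedWork: the work amount presented to a level.
%% TopLevel: the TOP_LEVEL constant, so 2^TOP_LEVEL = 1 bsl TopLevel.
effective_work(PresentedWork, TopLevel) ->
    max(0, PresentedWork - (1 bsl TopLevel)).
```

For example, with TOP_LEVEL = 10 a presented amount of 1024
(= 2^10) yields no extra delegated work, while 2048 yields 1024.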
When scanning just one file (because all its keys
come after the ones in the other file), we can
also benefit from hibernation to save memory. In
particular, the bloom filters being built take up
a lot of memory.
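The hibernation pattern looks roughly like this (a sketch, not the
actual scanner): erlang:hibernate/3 discards the call stack and
compacts the heap, so a process waiting for its next step holds a
minimal footprint even if it built large terms earlier.

```erlang
-module(hibernate_sketch).
-export([demo/0, loop/1]).

demo() ->
    Pid = spawn(?MODULE, loop, [0]),
    Pid ! {step, self()},
    receive {stepped, 1} -> ok end.

loop(N) ->
    receive
        {step, From} ->
            From ! {stepped, N + 1},
            %% park with a compacted heap until the next message;
            %% hibernate/3 never returns, it re-enters loop/1
            erlang:hibernate(?MODULE, loop, [N + 1])
    end.
```

The trade-off is a full garbage collection on every hibernate and a
re-entry cost on wake-up, which is why it pays off only for
processes that are touched infrequently.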
These two parameters (each defaulting to 512 KB)
control the amount of Erlang file buffer space to
allocate for delayed_write and read_ahead when
merging.
This configuration applies *per merge task*, of
which there can be many for each open HanoiDB; that
is again multiplied by the number of active vnodes
in Riak.
As such, this configuration is significant for the
memory usage of a Riak node running Hanoi, but
setting it too low will kill performance.
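The options involved are the standard Erlang/OTP file options; how
hanoi wires its config to them is not shown here, and the module
below is just an illustrative sketch using the 512 KB default:
delayed_write batches small writes in the driver, read_ahead
pre-fetches data for sequential reads.

```erlang
-module(file_buffer_sketch).
-export([demo/0]).

-define(BUFFER_SIZE, 512 * 1024).

demo() ->
    Path = "file_buffer_sketch.tmp",
    %% writer: buffer up to 512 KB (or 2 s) before hitting the OS
    {ok, W} = file:open(Path, [write, binary,
                               {delayed_write, ?BUFFER_SIZE, 2000}]),
    ok = file:write(W, <<"hello">>),
    ok = file:close(W),
    %% reader: pre-fetch 512 KB at a time for sequential access
    {ok, R} = file:open(Path, [read, binary,
                               {read_ahead, ?BUFFER_SIZE}]),
    {ok, <<"hello">>} = file:read(R, 5),
    ok = file:close(R),
    ok = file:delete(Path),
    ok.
```

Each open file with these options pins roughly one buffer of that
size, which is where the per-merge-task, per-vnode multiplication
of memory usage comes from.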
Analysis seems to indicate that merge processes
(from high-numbered levels) tend to be activated
quite infrequently. Thus, we term_to_binary/gzip
the merge process state and invoke an explicit
garbage collection before waiting for the next
{step, …} message again.
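A sketch of parking rarely-touched state as a compressed binary
(illustrative only, not the merge process itself): term_to_binary/2
with the compressed option shrinks the representation, and the
binary is unpacked again when the next step arrives.

```erlang
-module(parked_state_sketch).
-export([demo/0]).

demo() ->
    State = lists:duplicate(1000, {key, <<0:4096>>}),
    Parked = term_to_binary(State, [compressed]),
    %% in the real process the plain term is no longer referenced
    %% at this point, so an explicit gc reclaims it while idling
    erlang:garbage_collect(),
    true = byte_size(Parked) < erlang:external_size(State),
    %% on the next {step, ...} message, unpack and carry on
    State = binary_to_term(Parked),
    ok.
```

Holding one compressed binary instead of a large live term keeps
the idle heap small, at the cost of an encode/decode round-trip per
activation, which is cheap for infrequently woken processes.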
It looks like we are generating a lot of garbage
here. Moving this work to a separate process lets
us avoid a lot of garbage-collection effort, since
we do not cache these parsed nodes anyway.
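The spawn-to-parse pattern can be sketched like this (function
names invented; the stand-in do_parse/1 is not real node decoding):
all intermediate garbage created during parsing dies with the
short-lived process, and the caller only ever holds the result.

```erlang
-module(parse_proc_sketch).
-export([demo/0, parse_node/1]).

demo() ->
    {node, 3} = parse_node(<<1, 2, 3>>),
    ok.

parse_node(Bin) ->
    Parent = self(),
    Ref = make_ref(),
    %% parsing garbage never lands on the caller's heap; the
    %% temporary process's whole heap is freed when it exits
    spawn(fun() -> Parent ! {Ref, do_parse(Bin)} end),
    receive {Ref, Result} -> Result end.

do_parse(Bin) ->
    {node, byte_size(Bin)}.   % stand-in for real node decoding
```

Process death is the cheapest garbage collector in Erlang, which is
why this works well when the parsed nodes are not cached anyway.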
In some cases, inner nodes were not being emitted.
This would sometimes cause queries (get / range_fold)
to include only results from the right-most branch.
When a merge has completed and inject-to-next-level
is pending, there is still a "B" file but no
current merge_pid. In this case, don't try to do
merge work at this level.