29868678a4
Small cleanups Small cleanups Refactoring argnames & order for more consistency Add server-side-calculated MD5 checksum + logging file:consult() style checksum management, too slow! 513K csums = 105 seconds, ouch Much faster checksum recording Add checksum_list. Alas, line-by-line I/O is slow, neh? Much faster checksum listing Add file0_verify_checksums.escript and supporting code Adjust escript +A and -smp flags Add file0_compare_filelists.escript First draft of file0_repair_server.escript First draft of file0_repair_server.escript, part 2 WIP of file0_repair_server.escript, part 3 WIP of file0_repair_server.escript, part 4 Basic repair works, it seems, hooray! When checksum file ordering is different, try a cheap(?) 'cmp' on sorted results instead Add README.md Initial import of szone_chash.erl Add file0_cc_make_projection.escript and supporting code Add file0_cc_map_prefix.escript and supporting code Change think-o: hash output is a chain, silly boy Add file0_cc_1file_write_redundant.escript and support Add file0_cc_read_client.escript and supporting code Add examples/servers.map & file0_start_servers.escript WIP: working on file0_cc_migrate_files.escript File migration finished, works, yay! Add basic 'what am I' docs to each script Add file0_server_daemon.escript Minor fixes Fix broken unit test Add basho_bench run() commands for append & read ops with projection Add to examples dir WIP: erasure coding hack, part 1 Fix broken unit test WIP: erasure coding hack, part 2 WIP: erasure coding hack, part 3, EC data write is finished! WIP: erasure coding hack, part 4, EC data read still in progress WIP: erasure coding hack, part 5, EC data read still in progress WIP: erasure coding hack, part 5b, EC data read still in progress WIP: erasure coding hack, EC data read finished! README update, part 1 README update, part 2 Oops, put back the printed ouput for file-write-client and 1file-write-redundant-client README update, part 3 Fix 'user' output bug in list-client Ugly hacks to get output/no-output from write clients Clean up minor output bugs Clean up minor output bugs, part 2 README update, part 4 Clean up minor output bugs, part 3 Clean up minor output bugs, part 5 Clean up minor output bugs, part 6 README update, part 6 README update, part 7 README update, part 7 README update, part 8 Final edits/fixes for demo day Fix another oops in the README/demo day script |
||
---|---|---|
.. | ||
examples | ||
jerasure.Darwin/bin | ||
.gitignore | ||
cc.hrl | ||
file0.config | ||
file0.erl | ||
file0_1file_write_redundant.escript | ||
file0_cc_1file_write_redundant.escript | ||
file0_cc_ec_encode.escript | ||
file0_cc_make_projection.escript | ||
file0_cc_map_prefix.escript | ||
file0_cc_migrate_files.escript | ||
file0_cc_read_client.escript | ||
file0_checksum_list.escript | ||
file0_compare_filelists.escript | ||
file0_list.escript | ||
file0_read_client.escript | ||
file0_repair_server.escript | ||
file0_server.escript | ||
file0_server_daemon.escript | ||
file0_start_servers.escript | ||
file0_test.escript | ||
file0_verify_checksums.escript | ||
file0_write_client.escript | ||
README.md | ||
REPAIR-SORT-JOIN.sh | ||
seq0.erl | ||
szone_chash.erl |
Basic glossary/summary/review for the impatient
- Machi: An immutable, append-only distributed file storage service.
- For more detail, see: https://github.com/basho/internal_wiki/wiki/RFCs#riak-cs-rfcs
- Machi cluster: A small cluster of machines, e.g. 2 or 3, which form a single Machi cluster.
- All cluster data is replicated via Chain Replication.
- In nominal/full-repair state, all servers store exactly the same copies of all files.
- Strong Consistency and CP-style operation: Not here, please look elsewhere. This is an eventual consistency joint.
- Sequencer: Logical component that specifies where any new data will be stored.
- For each chunk of data written to a Machi server, the sequencer is solely responsible for assigning a file name and byte offset within that file.
- http://martinfowler.com/bliki/TwoHardThings.html
- Server: A mostly-dumb file server, with fewer total functions than an NFS version 2 file server.
- Compare with https://tools.ietf.org/html/rfc1094#section-2.2
- Chunk: The basic storage unit of Machi: an ordered sequence of bytes. Each chunk is stored together with its MD5 checksum.
- Think of a single Erlang binary term...
- File: A strictly-ordered collection of chunks.
- Chain Replication: A variation of master/slave replication where writes and reads are strictly ordered in a "chain-like" manner
- Chain Repair: Identify chunks that are missing between servers in a chain and then fix them.
- Cluster of clusters: A desirable rainbow unicorn with purple fur and files automagically distributed across many Machi clusters, scalable to infinite size within a single data center.
- Projection: A data structure which specifies file distribution and file migration across a cluster of clusters.
- Also used by a single Machi cluster to determine the current membership & status of the cluster & its chain.
- Erasure Coding: A family of techniques of encoding data with redundant data. The redundant data is smaller than the original data, yet can be used to reconstruct the original data if some original parts are lost/destroyed.
Present in this POC
-
Basic sequencer and file server
- Sequencer and file server are combined for efficiency
- Basic ops: append chunk to file (with implicit sequencer assignment), read chunk, list files, fetch chunk checksums (by file)
- Chain replication ops: write chunk (location already determined by sequencer), delete file (cluster-of-clusters file migration use only), truncate file (cluster-of-clusters + erasure coding use only)
-
Basic file clients
- Write chunks to a single server (via sequencer) and read from a single server.
- Write and read chunks to multiple servers, using both projection and chain replication
- All chain replication and projection features are implemented on the client side.
-
Basic cluster-of-cluster functions
- Hibari style consistent hashing for file uploads & downloads
- There is no ring.
- Arbitrary weighted distribution of files across multiple clusters
- Append/write/read ops are automatically routed to the correct chain/cluster.
- File migration/rebalancing
- Hibari style consistent hashing for file uploads & downloads
-
Administrative features
- Verify chunk checksums in a file.
- Find and fix missing chunks between any two servers in a single cluster.
- Normal cluster operations are uninterrupted.
- Create new projection definitions
- Demonstrate file -> cluster mapping via projection
- Identify files in current projection that must be moved in the new projection
- ... and move them to the new projection
- Normal cluster operations are uninterrupted.
-
Erasure coding implementation
- Inefficient & just barely enough to work
- No NIF or Erlang port integration
- Reed-Solomon 10,4 encoding only
- 10 independent sub-chunks of original data
- 4 independent sub-chunks of parity data
- Use any 10 sub-chunks to reconstruct all original and parity data
- Encode an entire file and write sub-chunks to a secondary projection of 14 servers
- Download chunks automatically via regular projection (for replicated chunks) or a secondary projection (for EC-coded sub-chunks).
- Inefficient & just barely enough to work
Missing from this POC
- Chunk immutability is not enforced
- Read-repair ... no good reason, I just ran out of 2014 to implement it
- Per-chunk metadata
- Repair of erasure coded missing blocks: all major pieces are there, but I ran out of 2014 here, also.
- Automation: all chain management functions are manual, no health/status monitoring, etc.
Simple examples
Run a server:
./file0_server.escript file0_server 7071 /tmp/path-to-data-dir
Upload chunks of a local file in 1 million byte chunks:
./file0_write_client.escript localhost 7071 1000000 someprefix /path/to/some/local/file
To fetch the chunks, we need a list of filenames, offsets, and chunk
sizes. The easiest way to get such a list is to save the output from
./file0_write_client.escript
when uploading the file in the first
place. For example, save the following command's output into some
file e.g. /tmp/input
. Then:
./file0_read_client.escript localhost 7071 /tmp/input
More details
Server protocol
All negative server responses start with ERROR
unless otherwise noted.
-
No distributed Erlang message passing ... "just ASCII on a socket".
-
A LenHex Prefix
... Append a chunk of data to a file with name prefixPrefix
. File name & offset will be assigned by server-side sequencer.OK OffsetHex Filename
-
R OffsetHex LenHex Filename
... Read a chunk fromOffsetHex
for lengthLenhex
from fileFileName
.OK [chunk follows]
-
L
... List all files stored on the server, one per lineOK [list of files & file sizes, ending with .]
-
C Filename
... Fetch chunk offset & checksumsOK [list of chunk offset, lengths, and checksums, ending with .]
-
QUIT
... Close connectionOK
-
W-repl OffsetHex LenHex Filename
... Write/replicate a chunk of data to a file with anOffsetHex
that has already been assigned by a sequencer.OK
-
TRUNC-hack--- Filename
... Truncate a file (after erasure encoding is successful).OK
Start a server, upload & download chunks to a single server
Mentioned above.
Upload chunks to a cluster (members specified manually)
./file0_1file_write_redundant.escript 1000000 otherprefix /path/to/local/file localhost 7071 localhost 7072
List files (and file sizes) on a server
./file0_list.escript localhost 7071
List all chunk sizes & checksums for a single file
setenv M_FILE otherprefix.KCL2CC9.1
./file0_checksum_list.escript localhost 7071 $M_FILE
Verify checksums in a file
./file0_verify_checksums.escript localhost 7071 $M_FILE
Compare two servers and file all missing files
TODO script is broken, but is just a proof-of-concept for early repair work.
$ ./file0_compare_filelists.escript localhost 7071 localhost 7072
{legend, {file, list_of_servers_without_file}}.
{all, [{"localhost","7071"},{"localhost","7072"}]}.
Repair server A -> server B, replicating all missing data
./file0_repair_server.escript localhost 7071 localhost 7072 verbose check
./file0_repair_server.escript localhost 7071 localhost 7072 verbose repair
./file0_repair_server.escript localhost 7071 localhost 7072 verbose repair
And then repair in the reverse direction:
./file0_repair_server.escript localhost 7072 localhost 7071 verbose check
./file0_repair_server.escript localhost 7072 localhost 7071 verbose repair
./file0_repair_server.escript localhost 7072 localhost 7071 verbose repair
Projections: how to create & use them
- Easiest use: all projection data is stored in a single directory; use path to the directory for all PoC scripts
- Files in the projection directory:
chains.map
- Define all members of all chains at the current time
*.weight
- Define chain capacity weights
*.proj
- Define a projection (based on previous projection file + weight file)
Sample chains.map file
%% Chain names (key): please use binary
%% Chain members (value): two-tuples of binary hostname and integer TCP port
{<<"chain1">>, [{<<"localhost">>, 7071}]}.
{<<"chain10">>, [{<<"localhost">>, 7071}, {<<"localhost">>, 7072}]}.
Sample weight file
%% Please use binaries for chain names
[
{<<"chain1">>, 1000}
].
Sample projection file
%% Created by:
% ./file0_cc_make_projection.escript new examples/weight_map1.weight examples/1.proj
{projection,1,0,[{<<"chain1">>,1.0}],undefined,undefined,undefined,undefined}.
Create the examples/1.proj
file
./file0_cc_make_projection.escript new examples/weight_map_ec1.weight examples/ec1.proj
Demo consistent hashing of file prefixes: what chain do we map to?
./file0_cc_map_prefix.escript examples/1.proj foo-prefix
./file0_cc_map_prefix.escript examples/3.proj foo-prefix
./file0_cc_map_prefix.escript examples/56.proj foo-prefix
Write replica chunks to a chain via projection
./file0_cc_1file_write_redundant.escript 4444 some-prefix /path/to/local/file ./examples/1.proj
Read replica chunks to a chain via projection
./file0_cc_read_client.escript examples/1.proj /tmp/input console
Change of projection: migrate 1 server's files from old to new projection
./file0_cc_migrate_files.escript examples/2.proj localhost 7071 verbose check
./file0_cc_migrate_files.escript examples/2.proj localhost 7071 verbose repair delete-source
Erasure encode a file, then delete the original replicas, then fetch via EC sub-chunks
The current prototype fetches the file directly from the file system, rather than fetching the file over the network via the RegularProjection
projection definition.
WARNING: For this part of demo to work correctly, $M_PATH
must be a
file that exists on localhost 7071
and not on localhost 7072
or
any other server.
setenv M_PATH /tmp/SAM1/seq-tests/data1/otherprefix.77Z34TQ.1
./file0_cc_ec_encode.escript examples/1.proj $M_PATH /tmp/zooenc examples/ec1.proj verbose check nodelete-tmp
less $M_PATH
./file0_cc_ec_encode.escript examples/1.proj $M_PATH /tmp/zooenc examples/ec1.proj verbose repair progress delete-source
less $M_PATH
./file0_read_client.escript localhost 7071 /tmp/input
./file0_read_client.escript localhost 7072 /tmp/input
./file0_cc_read_client.escript examples/1.proj /tmp/input console examples/55.proj