From 70f3c02644e387f71e3a934fb2dba9b6c32b698d Mon Sep 17 00:00:00 2001 From: Scott Lystig Fritchie Date: Fri, 17 Apr 2015 12:16:55 +0900 Subject: [PATCH 01/14] Base high level design doc, prior to splitting Major changes, when compared to the original Basho-internal document: * Start removing strong consistency topics to a separate doc (unfinished) * Remove section on per-file metadata management: it was too speculative IMHO * Remove the following sections (numbering is relative to v3 of internal doc): 7.2.1 scenario 1, 13.3, 14 * Move the "Recommended Reading" section to the end --- doc/src.high-level/.gitignore | 4 + doc/src.high-level/Makefile | 6 + doc/src.high-level/append-flow.eps | 268 +++ doc/src.high-level/append-flow2.eps | 349 ++++ doc/src.high-level/figure6.eps | 557 +++++ doc/src.high-level/high-level-machi.tex | 2527 +++++++++++++++++++++++ doc/src.high-level/read-flow.eps | 145 ++ doc/src.high-level/sigplanconf.cls | 1312 ++++++++++++ 8 files changed, 5168 insertions(+) create mode 100644 doc/src.high-level/.gitignore create mode 100644 doc/src.high-level/Makefile create mode 100644 doc/src.high-level/append-flow.eps create mode 100644 doc/src.high-level/append-flow2.eps create mode 100644 doc/src.high-level/figure6.eps create mode 100644 doc/src.high-level/high-level-machi.tex create mode 100644 doc/src.high-level/read-flow.eps create mode 100644 doc/src.high-level/sigplanconf.cls diff --git a/doc/src.high-level/.gitignore b/doc/src.high-level/.gitignore new file mode 100644 index 0000000..2a517a6 --- /dev/null +++ b/doc/src.high-level/.gitignore @@ -0,0 +1,4 @@ +*.aux +*.dvi +*.log +*.pdf diff --git a/doc/src.high-level/Makefile b/doc/src.high-level/Makefile new file mode 100644 index 0000000..f8216da --- /dev/null +++ b/doc/src.high-level/Makefile @@ -0,0 +1,6 @@ +all: + latex high-level-machi.tex + dvipdfm high-level-machi.dvi + +clean: + rm -f *.aux *.dvi *.log diff --git a/doc/src.high-level/append-flow.eps b/doc/src.high-level/append-flow.eps new file mode 100644 index 0000000..9302919 --- /dev/null +++ b/doc/src.high-level/append-flow.eps @@ -0,0 +1,268 @@ +%!PS-Adobe-3.0 EPSF-2.0 +%%BoundingBox: 0 0 416.500000 280.000000 +%%Creator: mscgen 0.18 +%%EndComments +0.700000 0.700000 scale +0 0 moveto +0 400 lineto +595 400 lineto +595 0 lineto +closepath +clip +%PageTrailer +%Page: 1 1 +/Helvetica findfont +10 scalefont +setfont +/Helvetica findfont +12 scalefont +setfont +0 400 translate +/mtrx matrix def +/ellipse + { /endangle exch def + /startangle exch def + /ydia exch def + /xdia exch def + /y exch def + /x exch def + /savematrix mtrx currentmatrix def + x y translate + xdia 2 div ydia 2 div scale + 1 -1 scale + 0 0 1 startangle endangle arc + savematrix setmatrix +} def +(client) dup stringwidth +1.000000 1.000000 1.000000 setrgbcolor +pop dup dup newpath 42 -17 moveto 2 div neg 0 rmoveto 0 rlineto 0 11 rlineto neg 0 rlineto closepath fill +0.000000 0.000000 0.000000 setrgbcolor +42 -15 moveto dup stringwidth pop 2 div neg 0 rmoveto show +(Projection) dup stringwidth +1.000000 1.000000 1.000000 setrgbcolor +pop dup dup newpath 127 -17 moveto 2 div neg 0 rmoveto 0 rlineto 0 11 rlineto neg 0 rlineto closepath fill +0.000000 0.000000 0.000000 setrgbcolor +127 -15 moveto dup stringwidth pop 2 div neg 0 rmoveto show +(ProjStore_A) dup stringwidth +1.000000 1.000000 1.000000 setrgbcolor +pop dup dup newpath 212 -17 moveto 2 div neg 0 rmoveto 0 rlineto 0 11 rlineto neg 0 rlineto closepath fill +0.000000 0.000000 0.000000 setrgbcolor +212 -15 moveto dup stringwidth pop 2 
div neg 0 rmoveto show +(Sequencer_A) dup stringwidth +1.000000 1.000000 1.000000 setrgbcolor +pop dup dup newpath 297 -17 moveto 2 div neg 0 rmoveto 0 rlineto 0 11 rlineto neg 0 rlineto closepath fill +0.000000 0.000000 0.000000 setrgbcolor +297 -15 moveto dup stringwidth pop 2 div neg 0 rmoveto show +(FLU_A) dup stringwidth +1.000000 1.000000 1.000000 setrgbcolor +pop dup dup newpath 382 -17 moveto 2 div neg 0 rmoveto 0 rlineto 0 11 rlineto neg 0 rlineto closepath fill +0.000000 0.000000 0.000000 setrgbcolor +382 -15 moveto dup stringwidth pop 2 div neg 0 rmoveto show +(FLU_B) dup stringwidth +1.000000 1.000000 1.000000 setrgbcolor +pop dup dup newpath 467 -17 moveto 2 div neg 0 rmoveto 0 rlineto 0 11 rlineto neg 0 rlineto closepath fill +0.000000 0.000000 0.000000 setrgbcolor +467 -15 moveto dup stringwidth pop 2 div neg 0 rmoveto show +(FLU_C) dup stringwidth +1.000000 1.000000 1.000000 setrgbcolor +pop dup dup newpath 552 -17 moveto 2 div neg 0 rmoveto 0 rlineto 0 11 rlineto neg 0 rlineto closepath fill +0.000000 0.000000 0.000000 setrgbcolor +552 -15 moveto dup stringwidth pop 2 div neg 0 rmoveto show +newpath 42 -22 moveto 42 -49 lineto stroke +newpath 127 -22 moveto 127 -49 lineto stroke +newpath 212 -22 moveto 212 -49 lineto stroke +newpath 297 -22 moveto 297 -49 lineto stroke +newpath 382 -22 moveto 382 -49 lineto stroke +newpath 467 -22 moveto 467 -49 lineto stroke +newpath 552 -22 moveto 552 -49 lineto stroke +newpath 42 -35 moveto 127 -35 lineto stroke +newpath 127 -35 moveto 117 -41 lineto stroke +(get current) dup stringwidth +1.000000 1.000000 1.000000 setrgbcolor +pop dup newpath 57 -33 moveto 0 rlineto 0 11 rlineto neg 0 rlineto closepath fill +0.000000 0.000000 0.000000 setrgbcolor +57 -33 moveto show +newpath 42 -49 moveto 42 -76 lineto stroke +newpath 127 -49 moveto 127 -76 lineto stroke +newpath 212 -49 moveto 212 -76 lineto stroke +newpath 297 -49 moveto 297 -76 lineto stroke +newpath 382 -49 moveto 382 -76 lineto stroke +newpath 467 -49 moveto 467 -76 lineto stroke +newpath 552 -49 moveto 552 -76 lineto stroke +newpath 127 -62 moveto 42 -62 lineto stroke +newpath 42 -62 moveto 52 -68 lineto stroke +(ok, #12...) dup stringwidth +1.000000 1.000000 1.000000 setrgbcolor +pop dup newpath 61 -60 moveto 0 rlineto 0 11 rlineto neg 0 rlineto closepath fill +0.000000 0.000000 0.000000 setrgbcolor +61 -60 moveto show +newpath 42 -76 moveto 42 -103 lineto stroke +newpath 127 -76 moveto 127 -103 lineto stroke +newpath 212 -76 moveto 212 -103 lineto stroke +newpath 297 -76 moveto 297 -103 lineto stroke +newpath 382 -76 moveto 382 -103 lineto stroke +newpath 467 -76 moveto 467 -103 lineto stroke +newpath 552 -76 moveto 552 -103 lineto stroke +newpath 42 -89 moveto 297 -89 lineto stroke +newpath 297 -89 moveto 287 -95 lineto stroke +(Req. 
123 bytes, prefix="foo", epoch=12) dup stringwidth +1.000000 1.000000 1.000000 setrgbcolor +pop dup newpath 66 -87 moveto 0 rlineto 0 11 rlineto neg 0 rlineto closepath fill +0.000000 0.000000 0.000000 setrgbcolor +66 -87 moveto show +newpath 42 -103 moveto 42 -130 lineto stroke +newpath 127 -103 moveto 127 -130 lineto stroke +newpath 212 -103 moveto 212 -130 lineto stroke +newpath 297 -103 moveto 297 -130 lineto stroke +newpath 382 -103 moveto 382 -130 lineto stroke +newpath 467 -103 moveto 467 -130 lineto stroke +newpath 552 -103 moveto 552 -130 lineto stroke +newpath 297 -116 moveto 42 -116 lineto stroke +newpath 42 -116 moveto 52 -122 lineto stroke +1.000000 0.000000 0.000000 setrgbcolor +(bad_epoch, 13) dup stringwidth +1.000000 1.000000 1.000000 setrgbcolor +pop dup newpath 131 -114 moveto 0 rlineto 0 11 rlineto neg 0 rlineto closepath fill +1.000000 0.000000 0.000000 setrgbcolor +131 -114 moveto show +0.000000 0.000000 0.000000 setrgbcolor +newpath 42 -130 moveto 42 -157 lineto stroke +newpath 127 -130 moveto 127 -157 lineto stroke +newpath 212 -130 moveto 212 -157 lineto stroke +newpath 297 -130 moveto 297 -157 lineto stroke +newpath 382 -130 moveto 382 -157 lineto stroke +newpath 467 -130 moveto 467 -157 lineto stroke +newpath 552 -130 moveto 552 -157 lineto stroke +newpath 42 -143 moveto 212 -143 lineto stroke +newpath 212 -143 moveto 202 -149 lineto stroke +(get epoch #13) dup stringwidth +1.000000 1.000000 1.000000 setrgbcolor +pop dup newpath 89 -141 moveto 0 rlineto 0 11 rlineto neg 0 rlineto closepath fill +0.000000 0.000000 0.000000 setrgbcolor +89 -141 moveto show +newpath 42 -157 moveto 42 -184 lineto stroke +newpath 127 -157 moveto 127 -184 lineto stroke +newpath 212 -157 moveto 212 -184 lineto stroke +newpath 297 -157 moveto 297 -184 lineto stroke +newpath 382 -157 moveto 382 -184 lineto stroke +newpath 467 -157 moveto 467 -184 lineto stroke +newpath 552 -157 moveto 552 -184 lineto stroke +newpath 212 -170 moveto 42 -170 lineto stroke +newpath 42 -170 moveto 52 -176 lineto stroke +(ok, #13...) dup stringwidth +1.000000 1.000000 1.000000 setrgbcolor +pop dup newpath 103 -168 moveto 0 rlineto 0 11 rlineto neg 0 rlineto closepath fill +0.000000 0.000000 0.000000 setrgbcolor +103 -168 moveto show +newpath 42 -184 moveto 42 -211 lineto stroke +newpath 127 -184 moveto 127 -211 lineto stroke +newpath 212 -184 moveto 212 -211 lineto stroke +newpath 297 -184 moveto 297 -211 lineto stroke +newpath 382 -184 moveto 382 -211 lineto stroke +newpath 467 -184 moveto 467 -211 lineto stroke +newpath 552 -184 moveto 552 -211 lineto stroke +newpath 42 -197 moveto 297 -197 lineto stroke +newpath 297 -197 moveto 287 -203 lineto stroke +(Req. 
123 bytes, prefix="foo", epoch=13) dup stringwidth +1.000000 1.000000 1.000000 setrgbcolor +pop dup newpath 66 -195 moveto 0 rlineto 0 11 rlineto neg 0 rlineto closepath fill +0.000000 0.000000 0.000000 setrgbcolor +66 -195 moveto show +newpath 42 -211 moveto 42 -238 lineto stroke +newpath 127 -211 moveto 127 -238 lineto stroke +newpath 212 -211 moveto 212 -238 lineto stroke +newpath 297 -211 moveto 297 -238 lineto stroke +newpath 382 -211 moveto 382 -238 lineto stroke +newpath 467 -211 moveto 467 -238 lineto stroke +newpath 552 -211 moveto 552 -238 lineto stroke +newpath 297 -224 moveto 42 -224 lineto stroke +newpath 42 -224 moveto 52 -230 lineto stroke +(ok, "foo.seq_a.009" offset=447) dup stringwidth +1.000000 1.000000 1.000000 setrgbcolor +pop dup newpath 89 -222 moveto 0 rlineto 0 11 rlineto neg 0 rlineto closepath fill +0.000000 0.000000 0.000000 setrgbcolor +89 -222 moveto show +newpath 42 -238 moveto 42 -265 lineto stroke +newpath 127 -238 moveto 127 -265 lineto stroke +newpath 212 -238 moveto 212 -265 lineto stroke +newpath 297 -238 moveto 297 -265 lineto stroke +newpath 382 -238 moveto 382 -265 lineto stroke +newpath 467 -238 moveto 467 -265 lineto stroke +newpath 552 -238 moveto 552 -265 lineto stroke +newpath 42 -251 moveto 382 -251 lineto stroke +newpath 382 -251 moveto 372 -257 lineto stroke +(write "foo.seq_a.009" offset=447 <<...123...>> epoch=13) dup stringwidth +1.000000 1.000000 1.000000 setrgbcolor +pop dup newpath 62 -249 moveto 0 rlineto 0 11 rlineto neg 0 rlineto closepath fill +0.000000 0.000000 0.000000 setrgbcolor +62 -249 moveto show +newpath 42 -265 moveto 42 -292 lineto stroke +newpath 127 -265 moveto 127 -292 lineto stroke +newpath 212 -265 moveto 212 -292 lineto stroke +newpath 297 -265 moveto 297 -292 lineto stroke +newpath 382 -265 moveto 382 -292 lineto stroke +newpath 467 -265 moveto 467 -292 lineto stroke +newpath 552 -265 moveto 552 -292 lineto stroke +newpath 382 -278 moveto 42 -278 lineto stroke +newpath 42 -278 moveto 52 -284 lineto stroke +(ok) dup stringwidth +1.000000 1.000000 1.000000 setrgbcolor +pop dup newpath 206 -276 moveto 0 rlineto 0 11 rlineto neg 0 rlineto closepath fill +0.000000 0.000000 0.000000 setrgbcolor +206 -276 moveto show +newpath 42 -292 moveto 42 -319 lineto stroke +newpath 127 -292 moveto 127 -319 lineto stroke +newpath 212 -292 moveto 212 -319 lineto stroke +newpath 297 -292 moveto 297 -319 lineto stroke +newpath 382 -292 moveto 382 -319 lineto stroke +newpath 467 -292 moveto 467 -319 lineto stroke +newpath 552 -292 moveto 552 -319 lineto stroke +newpath 42 -305 moveto 467 -305 lineto stroke +newpath 467 -305 moveto 457 -311 lineto stroke +(write "foo.seq_a.009" offset=447 <<...123...>> epoch=13) dup stringwidth +1.000000 1.000000 1.000000 setrgbcolor +pop dup newpath 105 -303 moveto 0 rlineto 0 11 rlineto neg 0 rlineto closepath fill +0.000000 0.000000 0.000000 setrgbcolor +105 -303 moveto show +newpath 42 -319 moveto 42 -346 lineto stroke +newpath 127 -319 moveto 127 -346 lineto stroke +newpath 212 -319 moveto 212 -346 lineto stroke +newpath 297 -319 moveto 297 -346 lineto stroke +newpath 382 -319 moveto 382 -346 lineto stroke +newpath 467 -319 moveto 467 -346 lineto stroke +newpath 552 -319 moveto 552 -346 lineto stroke +newpath 467 -332 moveto 42 -332 lineto stroke +newpath 42 -332 moveto 52 -338 lineto stroke +(ok) dup stringwidth +1.000000 1.000000 1.000000 setrgbcolor +pop dup newpath 249 -330 moveto 0 rlineto 0 11 rlineto neg 0 rlineto closepath fill +0.000000 0.000000 0.000000 setrgbcolor +249 -330 moveto show 
+newpath 42 -346 moveto 42 -373 lineto stroke +newpath 127 -346 moveto 127 -373 lineto stroke +newpath 212 -346 moveto 212 -373 lineto stroke +newpath 297 -346 moveto 297 -373 lineto stroke +newpath 382 -346 moveto 382 -373 lineto stroke +newpath 467 -346 moveto 467 -373 lineto stroke +newpath 552 -346 moveto 552 -373 lineto stroke +newpath 42 -359 moveto 552 -359 lineto stroke +newpath 552 -359 moveto 542 -365 lineto stroke +(write "foo.seq_a.009" offset=447 <<...123...>> epoch=13) dup stringwidth +1.000000 1.000000 1.000000 setrgbcolor +pop dup newpath 147 -357 moveto 0 rlineto 0 11 rlineto neg 0 rlineto closepath fill +0.000000 0.000000 0.000000 setrgbcolor +147 -357 moveto show +newpath 42 -373 moveto 42 -400 lineto stroke +newpath 127 -373 moveto 127 -400 lineto stroke +newpath 212 -373 moveto 212 -400 lineto stroke +newpath 297 -373 moveto 297 -400 lineto stroke +newpath 382 -373 moveto 382 -400 lineto stroke +newpath 467 -373 moveto 467 -400 lineto stroke +newpath 552 -373 moveto 552 -400 lineto stroke +newpath 552 -386 moveto 42 -386 lineto stroke +newpath 42 -386 moveto 52 -392 lineto stroke +(ok) dup stringwidth +1.000000 1.000000 1.000000 setrgbcolor +pop dup newpath 291 -384 moveto 0 rlineto 0 11 rlineto neg 0 rlineto closepath fill +0.000000 0.000000 0.000000 setrgbcolor +291 -384 moveto show diff --git a/doc/src.high-level/append-flow2.eps b/doc/src.high-level/append-flow2.eps new file mode 100644 index 0000000..ad285a3 --- /dev/null +++ b/doc/src.high-level/append-flow2.eps @@ -0,0 +1,349 @@ +%!PS-Adobe-3.0 EPSF-2.0 +%%BoundingBox: 0 0 416.500000 355.600006 +%%Creator: mscgen 0.18 +%%EndComments +0.700000 0.700000 scale +0 0 moveto +0 508 lineto +595 508 lineto +595 0 lineto +closepath +clip +%PageTrailer +%Page: 1 1 +/Helvetica findfont +10 scalefont +setfont +/Helvetica findfont +12 scalefont +setfont +0 508 translate +/mtrx matrix def +/ellipse + { /endangle exch def + /startangle exch def + /ydia exch def + /xdia exch def + /y exch def + /x exch def + /savematrix mtrx currentmatrix def + x y translate + xdia 2 div ydia 2 div scale + 1 -1 scale + 0 0 1 startangle endangle arc + savematrix setmatrix +} def +(client) dup stringwidth +1.000000 1.000000 1.000000 setrgbcolor +pop dup dup newpath 42 -17 moveto 2 div neg 0 rmoveto 0 rlineto 0 11 rlineto neg 0 rlineto closepath fill +0.000000 0.000000 0.000000 setrgbcolor +42 -15 moveto dup stringwidth pop 2 div neg 0 rmoveto show +(Projection) dup stringwidth +1.000000 1.000000 1.000000 setrgbcolor +pop dup dup newpath 127 -17 moveto 2 div neg 0 rmoveto 0 rlineto 0 11 rlineto neg 0 rlineto closepath fill +0.000000 0.000000 0.000000 setrgbcolor +127 -15 moveto dup stringwidth pop 2 div neg 0 rmoveto show +(ProjStore_A) dup stringwidth +1.000000 1.000000 1.000000 setrgbcolor +pop dup dup newpath 212 -17 moveto 2 div neg 0 rmoveto 0 rlineto 0 11 rlineto neg 0 rlineto closepath fill +0.000000 0.000000 0.000000 setrgbcolor +212 -15 moveto dup stringwidth pop 2 div neg 0 rmoveto show +(Sequencer_A) dup stringwidth +1.000000 1.000000 1.000000 setrgbcolor +pop dup dup newpath 297 -17 moveto 2 div neg 0 rmoveto 0 rlineto 0 11 rlineto neg 0 rlineto closepath fill +0.000000 0.000000 0.000000 setrgbcolor +297 -15 moveto dup stringwidth pop 2 div neg 0 rmoveto show +(FLU_A) dup stringwidth +1.000000 1.000000 1.000000 setrgbcolor +pop dup dup newpath 382 -17 moveto 2 div neg 0 rmoveto 0 rlineto 0 11 rlineto neg 0 rlineto closepath fill +0.000000 0.000000 0.000000 setrgbcolor +382 -15 moveto dup stringwidth pop 2 div neg 0 rmoveto show 
+(FLU_B) dup stringwidth +1.000000 1.000000 1.000000 setrgbcolor +pop dup dup newpath 467 -17 moveto 2 div neg 0 rmoveto 0 rlineto 0 11 rlineto neg 0 rlineto closepath fill +0.000000 0.000000 0.000000 setrgbcolor +467 -15 moveto dup stringwidth pop 2 div neg 0 rmoveto show +(FLU_C) dup stringwidth +1.000000 1.000000 1.000000 setrgbcolor +pop dup dup newpath 552 -17 moveto 2 div neg 0 rmoveto 0 rlineto 0 11 rlineto neg 0 rlineto closepath fill +0.000000 0.000000 0.000000 setrgbcolor +552 -15 moveto dup stringwidth pop 2 div neg 0 rmoveto show +newpath 42 -22 moveto 42 -49 lineto stroke +newpath 127 -22 moveto 127 -49 lineto stroke +newpath 212 -22 moveto 212 -49 lineto stroke +newpath 297 -22 moveto 297 -49 lineto stroke +newpath 382 -22 moveto 382 -49 lineto stroke +newpath 467 -22 moveto 467 -49 lineto stroke +newpath 552 -22 moveto 552 -49 lineto stroke +newpath 42 -35 moveto 127 -35 lineto stroke +newpath 127 -35 moveto 117 -41 lineto stroke +(get current) dup stringwidth +1.000000 1.000000 1.000000 setrgbcolor +pop dup newpath 57 -33 moveto 0 rlineto 0 11 rlineto neg 0 rlineto closepath fill +0.000000 0.000000 0.000000 setrgbcolor +57 -33 moveto show +newpath 42 -49 moveto 42 -76 lineto stroke +newpath 127 -49 moveto 127 -76 lineto stroke +newpath 212 -49 moveto 212 -76 lineto stroke +newpath 297 -49 moveto 297 -76 lineto stroke +newpath 382 -49 moveto 382 -76 lineto stroke +newpath 467 -49 moveto 467 -76 lineto stroke +newpath 552 -49 moveto 552 -76 lineto stroke +newpath 127 -62 moveto 42 -62 lineto stroke +newpath 42 -62 moveto 52 -68 lineto stroke +(ok, #12...) dup stringwidth +1.000000 1.000000 1.000000 setrgbcolor +pop dup newpath 61 -60 moveto 0 rlineto 0 11 rlineto neg 0 rlineto closepath fill +0.000000 0.000000 0.000000 setrgbcolor +61 -60 moveto show +newpath 42 -76 moveto 42 -103 lineto stroke +newpath 127 -76 moveto 127 -103 lineto stroke +newpath 212 -76 moveto 212 -103 lineto stroke +newpath 297 -76 moveto 297 -103 lineto stroke +newpath 382 -76 moveto 382 -103 lineto stroke +newpath 467 -76 moveto 467 -103 lineto stroke +newpath 552 -76 moveto 552 -103 lineto stroke +newpath 42 -89 moveto 382 -89 lineto stroke +newpath 382 -89 moveto 372 -95 lineto stroke +(write prefix="foo" <<...123...>> epoch=12) dup stringwidth +1.000000 1.000000 1.000000 setrgbcolor +pop dup newpath 104 -87 moveto 0 rlineto 0 11 rlineto neg 0 rlineto closepath fill +0.000000 0.000000 0.000000 setrgbcolor +104 -87 moveto show +newpath 42 -103 moveto 42 -130 lineto stroke +newpath 127 -103 moveto 127 -130 lineto stroke +newpath 212 -103 moveto 212 -130 lineto stroke +newpath 297 -103 moveto 297 -130 lineto stroke +newpath 382 -103 moveto 382 -130 lineto stroke +newpath 467 -103 moveto 467 -130 lineto stroke +newpath 552 -103 moveto 552 -130 lineto stroke +newpath 382 -116 moveto 42 -116 lineto stroke +newpath 42 -116 moveto 52 -122 lineto stroke +1.000000 0.000000 0.000000 setrgbcolor +(bad_epoch, 13) dup stringwidth +1.000000 1.000000 1.000000 setrgbcolor +pop dup newpath 173 -114 moveto 0 rlineto 0 11 rlineto neg 0 rlineto closepath fill +1.000000 0.000000 0.000000 setrgbcolor +173 -114 moveto show +0.000000 0.000000 0.000000 setrgbcolor +newpath 42 -130 moveto 42 -157 lineto stroke +newpath 127 -130 moveto 127 -157 lineto stroke +newpath 212 -130 moveto 212 -157 lineto stroke +newpath 297 -130 moveto 297 -157 lineto stroke +newpath 382 -130 moveto 382 -157 lineto stroke +newpath 467 -130 moveto 467 -157 lineto stroke +newpath 552 -130 moveto 552 -157 lineto stroke +newpath 42 -143 moveto 212 -143 
lineto stroke +newpath 212 -143 moveto 202 -149 lineto stroke +(get epoch #13) dup stringwidth +1.000000 1.000000 1.000000 setrgbcolor +pop dup newpath 89 -141 moveto 0 rlineto 0 11 rlineto neg 0 rlineto closepath fill +0.000000 0.000000 0.000000 setrgbcolor +89 -141 moveto show +newpath 42 -157 moveto 42 -184 lineto stroke +newpath 127 -157 moveto 127 -184 lineto stroke +newpath 212 -157 moveto 212 -184 lineto stroke +newpath 297 -157 moveto 297 -184 lineto stroke +newpath 382 -157 moveto 382 -184 lineto stroke +newpath 467 -157 moveto 467 -184 lineto stroke +newpath 552 -157 moveto 552 -184 lineto stroke +newpath 212 -170 moveto 42 -170 lineto stroke +newpath 42 -170 moveto 52 -176 lineto stroke +(ok, #13...) dup stringwidth +1.000000 1.000000 1.000000 setrgbcolor +pop dup newpath 103 -168 moveto 0 rlineto 0 11 rlineto neg 0 rlineto closepath fill +0.000000 0.000000 0.000000 setrgbcolor +103 -168 moveto show +newpath 42 -184 moveto 42 -211 lineto stroke +newpath 127 -184 moveto 127 -211 lineto stroke +newpath 212 -184 moveto 212 -211 lineto stroke +newpath 297 -184 moveto 297 -211 lineto stroke +newpath 382 -184 moveto 382 -211 lineto stroke +newpath 467 -184 moveto 467 -211 lineto stroke +newpath 552 -184 moveto 552 -211 lineto stroke +newpath 42 -197 moveto 382 -197 lineto stroke +newpath 382 -197 moveto 372 -203 lineto stroke +(write prefix="foo" <<...123...>> epoch=13) dup stringwidth +1.000000 1.000000 1.000000 setrgbcolor +pop dup newpath 104 -195 moveto 0 rlineto 0 11 rlineto neg 0 rlineto closepath fill +0.000000 0.000000 0.000000 setrgbcolor +104 -195 moveto show +newpath 42 -211 moveto 42 -238 lineto stroke +newpath 127 -211 moveto 127 -238 lineto stroke +newpath 212 -211 moveto 212 -238 lineto stroke +newpath 297 -211 moveto 297 -238 lineto stroke +newpath 382 -211 moveto 382 -238 lineto stroke +newpath 467 -211 moveto 467 -238 lineto stroke +newpath 552 -211 moveto 552 -238 lineto stroke +1.000000 1.000000 1.000000 setrgbcolor +newpath 263 -211 moveto 417 -211 lineto 417 -236 lineto 263 -236 lineto closepath fill +0.000000 0.000000 0.000000 setrgbcolor +newpath 269 -211 moveto 411 -211 lineto stroke +newpath 269 -236 moveto 411 -236 lineto stroke +newpath 269 -211 moveto 263 -223 lineto stroke +newpath 263 -223 moveto 269 -236 lineto stroke +newpath 411 -211 moveto 417 -223 lineto stroke +newpath 417 -223 moveto 411 -236 lineto stroke +(Co-located on same box) dup stringwidth +1.000000 1.000000 1.000000 setrgbcolor +pop dup newpath 275 -227 moveto 0 rlineto 0 11 rlineto neg 0 rlineto closepath fill +0.000000 0.000000 0.000000 setrgbcolor +275 -227 moveto show +newpath 42 -238 moveto 42 -265 lineto stroke +newpath 127 -238 moveto 127 -265 lineto stroke +newpath 212 -238 moveto 212 -265 lineto stroke +newpath 297 -238 moveto 297 -265 lineto stroke +newpath 382 -238 moveto 382 -265 lineto stroke +newpath 467 -238 moveto 467 -265 lineto stroke +newpath 552 -238 moveto 552 -265 lineto stroke +newpath 382 -251 moveto 297 -251 lineto stroke +newpath 297 -251 moveto 307 -257 lineto stroke +(Req. 
123 bytes, prefix="foo", epoch=13) dup stringwidth +1.000000 1.000000 1.000000 setrgbcolor +pop dup newpath 236 -249 moveto 0 rlineto 0 11 rlineto neg 0 rlineto closepath fill +0.000000 0.000000 0.000000 setrgbcolor +236 -249 moveto show +newpath 42 -265 moveto 42 -292 lineto stroke +newpath 127 -265 moveto 127 -292 lineto stroke +newpath 212 -265 moveto 212 -292 lineto stroke +newpath 297 -265 moveto 297 -292 lineto stroke +newpath 382 -265 moveto 382 -292 lineto stroke +newpath 467 -265 moveto 467 -292 lineto stroke +newpath 552 -265 moveto 552 -292 lineto stroke +newpath 297 -278 moveto 382 -278 lineto stroke +newpath 382 -278 moveto 372 -284 lineto stroke +(ok, "foo.seq_a.009" offset=447) dup stringwidth +1.000000 1.000000 1.000000 setrgbcolor +pop dup newpath 259 -276 moveto 0 rlineto 0 11 rlineto neg 0 rlineto closepath fill +0.000000 0.000000 0.000000 setrgbcolor +259 -276 moveto show +newpath 42 -292 moveto 42 -319 lineto stroke +newpath 127 -292 moveto 127 -319 lineto stroke +newpath 212 -292 moveto 212 -319 lineto stroke +newpath 297 -292 moveto 297 -319 lineto stroke +newpath 382 -292 moveto 382 -319 lineto stroke +newpath 467 -292 moveto 467 -319 lineto stroke +newpath 552 -292 moveto 552 -319 lineto stroke +(FLU_A writes to local storage @ "foo.seq_a.009" offset=447) dup stringwidth +1.000000 1.000000 1.000000 setrgbcolor +pop dup newpath 138 -308 moveto 0 rlineto 0 11 rlineto neg 0 rlineto closepath fill +0.000000 0.000000 0.000000 setrgbcolor +138 -308 moveto show +[2] 0 setdash +newpath 21 -305 moveto 136 -305 lineto stroke +[] 0 setdash +[2] 0 setdash +newpath 459 -305 moveto 574 -305 lineto stroke +[] 0 setdash +newpath 42 -319 moveto 42 -346 lineto stroke +newpath 127 -319 moveto 127 -346 lineto stroke +newpath 212 -319 moveto 212 -346 lineto stroke +newpath 297 -319 moveto 297 -346 lineto stroke +newpath 382 -319 moveto 382 -346 lineto stroke +newpath 467 -319 moveto 467 -346 lineto stroke +newpath 552 -319 moveto 552 -346 lineto stroke +newpath 382 -332 moveto 467 -332 lineto stroke +newpath 467 -332 moveto 457 -338 lineto stroke +(write "foo.seq_a.009" offset=447 <<...123...>> epoch=13) dup stringwidth +1.000000 1.000000 1.000000 setrgbcolor +pop dup newpath 275 -330 moveto 0 rlineto 0 11 rlineto neg 0 rlineto closepath fill +0.000000 0.000000 0.000000 setrgbcolor +275 -330 moveto show +newpath 42 -346 moveto 42 -373 lineto stroke +newpath 127 -346 moveto 127 -373 lineto stroke +newpath 212 -346 moveto 212 -373 lineto stroke +newpath 297 -346 moveto 297 -373 lineto stroke +newpath 382 -346 moveto 382 -373 lineto stroke +newpath 467 -346 moveto 467 -373 lineto stroke +newpath 552 -346 moveto 552 -373 lineto stroke +newpath 467 -359 moveto 552 -359 lineto stroke +newpath 552 -359 moveto 542 -365 lineto stroke +(write "foo.seq_a.009" offset=447 <<...123...>> epoch=13) dup stringwidth +1.000000 1.000000 1.000000 setrgbcolor +pop dup newpath 295 -357 moveto 0 rlineto 0 11 rlineto neg 0 rlineto closepath fill +0.000000 0.000000 0.000000 setrgbcolor +295 -357 moveto show +newpath 42 -373 moveto 42 -400 lineto stroke +newpath 127 -373 moveto 127 -400 lineto stroke +newpath 212 -373 moveto 212 -400 lineto stroke +newpath 297 -373 moveto 297 -400 lineto stroke +newpath 382 -373 moveto 382 -400 lineto stroke +newpath 467 -373 moveto 467 -400 lineto stroke +newpath 552 -373 moveto 552 -400 lineto stroke +newpath 552 -386 moveto 42 -386 lineto stroke +newpath 42 -386 moveto 52 -392 lineto stroke +(ok, "foo.seq_a.009" offset=447) dup stringwidth +1.000000 1.000000 1.000000 
setrgbcolor +pop dup newpath 216 -384 moveto 0 rlineto 0 11 rlineto neg 0 rlineto closepath fill +0.000000 0.000000 0.000000 setrgbcolor +216 -384 moveto show +newpath 42 -400 moveto 42 -427 lineto stroke +newpath 127 -400 moveto 127 -427 lineto stroke +newpath 212 -400 moveto 212 -427 lineto stroke +newpath 297 -400 moveto 297 -427 lineto stroke +newpath 382 -400 moveto 382 -427 lineto stroke +newpath 467 -400 moveto 467 -427 lineto stroke +newpath 552 -400 moveto 552 -427 lineto stroke +(The above is "fast path" for FLU->FLU forwarding.) dup stringwidth +1.000000 1.000000 1.000000 setrgbcolor +pop dup newpath 164 -416 moveto 0 rlineto 0 11 rlineto neg 0 rlineto closepath fill +0.000000 0.000000 0.000000 setrgbcolor +164 -416 moveto show +[2] 0 setdash +newpath 21 -413 moveto 162 -413 lineto stroke +[] 0 setdash +[2] 0 setdash +newpath 432 -413 moveto 574 -413 lineto stroke +[] 0 setdash +newpath 42 -427 moveto 42 -454 lineto stroke +newpath 127 -427 moveto 127 -454 lineto stroke +newpath 212 -427 moveto 212 -454 lineto stroke +newpath 297 -427 moveto 297 -454 lineto stroke +newpath 382 -427 moveto 382 -454 lineto stroke +newpath 467 -427 moveto 467 -454 lineto stroke +newpath 552 -427 moveto 552 -454 lineto stroke +(If, instead, FLU_C has an error...) dup stringwidth +1.000000 1.000000 1.000000 setrgbcolor +pop dup newpath 210 -443 moveto 0 rlineto 0 11 rlineto neg 0 rlineto closepath fill +0.000000 0.000000 0.000000 setrgbcolor +210 -443 moveto show +[2] 0 setdash +newpath 21 -440 moveto 208 -440 lineto stroke +[] 0 setdash +[2] 0 setdash +newpath 386 -440 moveto 574 -440 lineto stroke +[] 0 setdash +newpath 42 -454 moveto 42 -481 lineto stroke +newpath 127 -454 moveto 127 -481 lineto stroke +newpath 212 -454 moveto 212 -481 lineto stroke +newpath 297 -454 moveto 297 -481 lineto stroke +newpath 382 -454 moveto 382 -481 lineto stroke +newpath 467 -454 moveto 467 -481 lineto stroke +newpath 552 -454 moveto 552 -481 lineto stroke +newpath 552 -467 moveto 42 -467 lineto stroke +newpath 42 -467 moveto 52 -473 lineto stroke +1.000000 0.000000 0.000000 setrgbcolor +(bad_epoch, 15) dup stringwidth +1.000000 1.000000 1.000000 setrgbcolor +pop dup newpath 258 -465 moveto 0 rlineto 0 11 rlineto neg 0 rlineto closepath fill +1.000000 0.000000 0.000000 setrgbcolor +258 -465 moveto show +0.000000 0.000000 0.000000 setrgbcolor +newpath 42 -481 moveto 42 -508 lineto stroke +newpath 127 -481 moveto 127 -508 lineto stroke +newpath 212 -481 moveto 212 -508 lineto stroke +newpath 297 -481 moveto 297 -508 lineto stroke +newpath 382 -481 moveto 382 -508 lineto stroke +newpath 467 -481 moveto 467 -508 lineto stroke +newpath 552 -481 moveto 552 -508 lineto stroke +(Repair is now the client's responsibility \("slow path"\).) 
dup stringwidth +1.000000 1.000000 1.000000 setrgbcolor +pop dup newpath 158 -497 moveto 0 rlineto 0 11 rlineto neg 0 rlineto closepath fill +0.000000 0.000000 0.000000 setrgbcolor +158 -497 moveto show +[2] 0 setdash +newpath 21 -494 moveto 156 -494 lineto stroke +[] 0 setdash +[2] 0 setdash +newpath 439 -494 moveto 574 -494 lineto stroke +[] 0 setdash diff --git a/doc/src.high-level/figure6.eps b/doc/src.high-level/figure6.eps new file mode 100644 index 0000000..f445c26 --- /dev/null +++ b/doc/src.high-level/figure6.eps @@ -0,0 +1,557 @@ +%!PS-Adobe-3.0 EPSF-3.0 +%%Title: figure6.fig +%%Creator: fig2dev Version 3.2 Patchlevel 5d +%%CreationDate: Mon Oct 20 21:56:33 2014 +%%For: fritchie@sbb3.local (Scott Lystig Fritchie) +%%BoundingBox: 0 0 633 332 +%Magnification: 1.0000 +%%EndComments +%%BeginProlog +/$F2psDict 200 dict def +$F2psDict begin +$F2psDict /mtrx matrix put +/col-1 {0 setgray} bind def +/col0 {0.000 0.000 0.000 srgb} bind def +/col1 {0.000 0.000 1.000 srgb} bind def +/col2 {0.000 1.000 0.000 srgb} bind def +/col3 {0.000 1.000 1.000 srgb} bind def +/col4 {1.000 0.000 0.000 srgb} bind def +/col5 {1.000 0.000 1.000 srgb} bind def +/col6 {1.000 1.000 0.000 srgb} bind def +/col7 {1.000 1.000 1.000 srgb} bind def +/col8 {0.000 0.000 0.560 srgb} bind def +/col9 {0.000 0.000 0.690 srgb} bind def +/col10 {0.000 0.000 0.820 srgb} bind def +/col11 {0.530 0.810 1.000 srgb} bind def +/col12 {0.000 0.560 0.000 srgb} bind def +/col13 {0.000 0.690 0.000 srgb} bind def +/col14 {0.000 0.820 0.000 srgb} bind def +/col15 {0.000 0.560 0.560 srgb} bind def +/col16 {0.000 0.690 0.690 srgb} bind def +/col17 {0.000 0.820 0.820 srgb} bind def +/col18 {0.560 0.000 0.000 srgb} bind def +/col19 {0.690 0.000 0.000 srgb} bind def +/col20 {0.820 0.000 0.000 srgb} bind def +/col21 {0.560 0.000 0.560 srgb} bind def +/col22 {0.690 0.000 0.690 srgb} bind def +/col23 {0.820 0.000 0.820 srgb} bind def +/col24 {0.500 0.190 0.000 srgb} bind def +/col25 {0.630 0.250 0.000 srgb} bind def +/col26 {0.750 0.380 0.000 srgb} bind def +/col27 {1.000 0.500 0.500 srgb} bind def +/col28 {1.000 0.630 0.630 srgb} bind def +/col29 {1.000 0.750 0.750 srgb} bind def +/col30 {1.000 0.880 0.880 srgb} bind def +/col31 {1.000 0.840 0.000 srgb} bind def + +end + +/cp {closepath} bind def +/ef {eofill} bind def +/gr {grestore} bind def +/gs {gsave} bind def +/sa {save} bind def +/rs {restore} bind def +/l {lineto} bind def +/m {moveto} bind def +/rm {rmoveto} bind def +/n {newpath} bind def +/s {stroke} bind def +/sh {show} bind def +/slc {setlinecap} bind def +/slj {setlinejoin} bind def +/slw {setlinewidth} bind def +/srgb {setrgbcolor} bind def +/rot {rotate} bind def +/sc {scale} bind def +/sd {setdash} bind def +/ff {findfont} bind def +/sf {setfont} bind def +/scf {scalefont} bind def +/sw {stringwidth} bind def +/tr {translate} bind def +/tnt {dup dup currentrgbcolor + 4 -2 roll dup 1 exch sub 3 -1 roll mul add + 4 -2 roll dup 1 exch sub 3 -1 roll mul add + 4 -2 roll dup 1 exch sub 3 -1 roll mul add srgb} + bind def +/shd {dup dup currentrgbcolor 4 -2 roll mul 4 -2 roll mul + 4 -2 roll mul srgb} bind def +/reencdict 12 dict def /ReEncode { reencdict begin +/newcodesandnames exch def /newfontname exch def /basefontname exch def +/basefontdict basefontname findfont def /newfont basefontdict maxlength dict def +basefontdict { exch dup /FID ne { dup /Encoding eq +{ exch dup length array copy newfont 3 1 roll put } +{ exch newfont 3 1 roll put } ifelse } { pop pop } ifelse } forall +newfont /FontName newfontname put newcodesandnames 
aload pop +128 1 255 { newfont /Encoding get exch /.notdef put } for +newcodesandnames length 2 idiv { newfont /Encoding get 3 1 roll put } repeat +newfontname newfont definefont pop end } def +/isovec [ +8#055 /minus 8#200 /grave 8#201 /acute 8#202 /circumflex 8#203 /tilde +8#204 /macron 8#205 /breve 8#206 /dotaccent 8#207 /dieresis +8#210 /ring 8#211 /cedilla 8#212 /hungarumlaut 8#213 /ogonek 8#214 /caron +8#220 /dotlessi 8#230 /oe 8#231 /OE +8#240 /space 8#241 /exclamdown 8#242 /cent 8#243 /sterling +8#244 /currency 8#245 /yen 8#246 /brokenbar 8#247 /section 8#250 /dieresis +8#251 /copyright 8#252 /ordfeminine 8#253 /guillemotleft 8#254 /logicalnot +8#255 /hyphen 8#256 /registered 8#257 /macron 8#260 /degree 8#261 /plusminus +8#262 /twosuperior 8#263 /threesuperior 8#264 /acute 8#265 /mu 8#266 /paragraph +8#267 /periodcentered 8#270 /cedilla 8#271 /onesuperior 8#272 /ordmasculine +8#273 /guillemotright 8#274 /onequarter 8#275 /onehalf +8#276 /threequarters 8#277 /questiondown 8#300 /Agrave 8#301 /Aacute +8#302 /Acircumflex 8#303 /Atilde 8#304 /Adieresis 8#305 /Aring +8#306 /AE 8#307 /Ccedilla 8#310 /Egrave 8#311 /Eacute +8#312 /Ecircumflex 8#313 /Edieresis 8#314 /Igrave 8#315 /Iacute +8#316 /Icircumflex 8#317 /Idieresis 8#320 /Eth 8#321 /Ntilde 8#322 /Ograve +8#323 /Oacute 8#324 /Ocircumflex 8#325 /Otilde 8#326 /Odieresis 8#327 /multiply +8#330 /Oslash 8#331 /Ugrave 8#332 /Uacute 8#333 /Ucircumflex +8#334 /Udieresis 8#335 /Yacute 8#336 /Thorn 8#337 /germandbls 8#340 /agrave +8#341 /aacute 8#342 /acircumflex 8#343 /atilde 8#344 /adieresis 8#345 /aring +8#346 /ae 8#347 /ccedilla 8#350 /egrave 8#351 /eacute +8#352 /ecircumflex 8#353 /edieresis 8#354 /igrave 8#355 /iacute +8#356 /icircumflex 8#357 /idieresis 8#360 /eth 8#361 /ntilde 8#362 /ograve +8#363 /oacute 8#364 /ocircumflex 8#365 /otilde 8#366 /odieresis 8#367 /divide +8#370 /oslash 8#371 /ugrave 8#372 /uacute 8#373 /ucircumflex +8#374 /udieresis 8#375 /yacute 8#376 /thorn 8#377 /ydieresis] def +/Times-Bold /Times-Bold-iso isovec ReEncode +/Times-Roman /Times-Roman-iso isovec ReEncode +/$F2psBegin {$F2psDict begin /$F2psEnteredState save def} def +/$F2psEnd {$F2psEnteredState restore end} def + +/pageheader { +save +newpath 0 332 moveto 0 0 lineto 633 0 lineto 633 332 lineto closepath clip newpath +-331.7 473.8 translate +1 -1 scale +$F2psBegin +10 setmiterlimit +0 slj 0 slc + 0.06000 0.06000 sc +} bind def +/pagefooter { +$F2psEnd +restore +} bind def +%%EndProlog +pageheader +% +% Fig objects follow +% +% +% here starts figure with depth 50 +/Times-Roman-iso ff 166.67 scf sf +5925 7275 m +gs 1 -1 sc (Step 6: Client now knows that projection 12 is invalid. Fetch projection 13, then retry at step #8.) 
col16 sh gr +/Times-Bold-iso ff 200.00 scf sf +8550 3225 m +gs 1 -1 sc (Get epoch 13) col0 sh gr +/Times-Bold-iso ff 200.00 scf sf +5925 6900 m +gs 1 -1 sc (Active=[a,b,c]) col0 sh gr +/Times-Bold-iso ff 200.00 scf sf +5925 6675 m +gs 1 -1 sc (Members=[a,b,c]) col0 sh gr +/Times-Bold-iso ff 200.00 scf sf +5925 6450 m +gs 1 -1 sc (Epoch=13) col0 sh gr +/Times-Bold-iso ff 200.00 scf sf +5925 5580 m +gs 1 -1 sc (Epoch=12) col0 sh gr +/Times-Bold-iso ff 200.00 scf sf +5925 5835 m +gs 1 -1 sc (Members=[a,b,c]) col0 sh gr +/Times-Bold-iso ff 200.00 scf sf +5925 6090 m +gs 1 -1 sc (Active=[a,b]) col0 sh gr +/Times-Bold-iso ff 200.00 scf sf +5925 5175 m +gs 1 -1 sc (Projection \(data structure\)) col0 sh gr +% Polyline +0 slj +0 slc +15.000 slw +n 8400 4950 m 5625 4950 l 5625 7050 l 8400 7050 l + cp gs col0 s gr +/Times-Bold-iso ff 200.00 scf sf +12825 6405 m +gs 1 -1 sc (- write once) col0 sh gr +/Times-Bold-iso ff 200.00 scf sf +12825 6660 m +gs 1 -1 sc (- key=integer) col0 sh gr +/Times-Bold-iso ff 200.00 scf sf +12825 6915 m +gs 1 -1 sc (- value=projection data structure) col0 sh gr +/Times-Bold-iso ff 200.00 scf sf +12825 7170 m +gs 1 -1 sc (k=11, v=...) col0 sh gr +/Times-Bold-iso ff 200.00 scf sf +12825 7425 m +gs 1 -1 sc (k=12, v=...) col0 sh gr +/Times-Bold-iso ff 200.00 scf sf +12825 7680 m +gs 1 -1 sc (k=13, v=...) col0 sh gr +/Times-Bold-iso ff 200.00 scf sf +12825 6150 m +gs 1 -1 sc (FLU projection store \(proc\)) col0 sh gr +% Polyline +n 12750 5925 m 15900 5925 l 15900 7725 l 12750 7725 l + cp gs col0 s gr +% Polyline +gs clippath +14788 5055 m 14940 5055 l 14940 4995 l 14788 4995 l 14788 4995 l 14908 5025 l 14788 5055 l cp +14612 4995 m 14460 4995 l 14460 5055 l 14612 5055 l 14612 5055 l 14492 5025 l 14612 4995 l cp +eoclip +n 14475 5025 m + 14925 5025 l gs col0 s gr gr + +% arrowhead +7.500 slw +n 14612 4995 m 14492 5025 l 14612 5055 l col0 s +% arrowhead +n 14788 5055 m 14908 5025 l 14788 4995 l col0 s +% Polyline +15.000 slw +gs clippath +15688 5055 m 15840 5055 l 15840 4995 l 15688 4995 l 15688 4995 l 15808 5025 l 15688 5055 l cp +15137 4995 m 14985 4995 l 14985 5055 l 15137 5055 l 15137 5055 l 15017 5025 l 15137 4995 l cp +eoclip +n 15000 5025 m + 15825 5025 l gs col0 s gr gr + +% arrowhead +7.500 slw +n 15137 4995 m 15017 5025 l 15137 5055 l col0 s +% arrowhead +n 15688 5055 m 15808 5025 l 15688 4995 l col0 s +% Polyline +15.000 slw +gs clippath +14638 5355 m 14790 5355 l 14790 5295 l 14638 5295 l 14638 5295 l 14758 5325 l 14638 5355 l cp +14612 5295 m 14460 5295 l 14460 5355 l 14612 5355 l 14612 5355 l 14492 5325 l 14612 5295 l cp +eoclip +n 14475 5325 m 14550 5325 l 14625 5325 l 14700 5325 l 14775 5325 l 14700 5325 l + + 14775 5325 l gs col0 s gr gr + +% arrowhead +7.500 slw +n 14612 5295 m 14492 5325 l 14612 5355 l col0 s +% arrowhead +n 14638 5355 m 14758 5325 l 14638 5295 l col0 s +% Polyline +15.000 slw +gs clippath +15163 5355 m 15315 5355 l 15315 5295 l 15163 5295 l 15163 5295 l 15283 5325 l 15163 5355 l cp +15137 5295 m 14985 5295 l 14985 5355 l 15137 5355 l 15137 5355 l 15017 5325 l 15137 5295 l cp +eoclip +n 15000 5325 m 15075 5325 l 15150 5325 l 15225 5325 l + 15300 5325 l gs col0 s gr gr + +% arrowhead +7.500 slw +n 15137 5295 m 15017 5325 l 15137 5355 l col0 s +% arrowhead +n 15163 5355 m 15283 5325 l 15163 5295 l col0 s +% Polyline +15.000 slw +gs clippath +15688 5355 m 15840 5355 l 15840 5295 l 15688 5295 l 15688 5295 l 15808 5325 l 15688 5355 l cp +15587 5295 m 15435 5295 l 15435 5355 l 15587 5355 l 15587 5355 l 15467 5325 l 15587 5295 l cp +eoclip +n 15450 
5325 m 15525 5325 l 15600 5325 l 15675 5325 l 15750 5325 l + 15825 5325 l gs col0 s gr gr + +% arrowhead +7.500 slw +n 15587 5295 m 15467 5325 l 15587 5355 l col0 s +% arrowhead +n 15688 5355 m 15808 5325 l 15688 5295 l col0 s +% Polyline + [60] 0 sd +n 14475 5025 m + 15825 5025 l gs col0 s gr [] 0 sd +% Polyline + [60] 0 sd +n 14475 5325 m + 15825 5325 l gs col0 s gr [] 0 sd +% Polyline + [60] 0 sd +n 14475 5550 m + 15825 5550 l gs col0 s gr [] 0 sd +/Times-Bold-iso ff 200.00 scf sf +12825 4575 m +gs 1 -1 sc (epoch=13) col0 sh gr +/Times-Bold-iso ff 200.00 scf sf +12825 4830 m +gs 1 -1 sc (files:) col0 sh gr +/Times-Bold-iso ff 200.00 scf sf +12825 5085 m +gs 1 -1 sc ( "foo.seq_a.006") col0 sh gr +/Times-Bold-iso ff 200.00 scf sf +12825 5340 m +gs 1 -1 sc ( "foo.seq_b.007") col0 sh gr +/Times-Bold-iso ff 200.00 scf sf +12825 5595 m +gs 1 -1 sc ( "foo.seq_b.008") col0 sh gr +/Times-Bold-iso ff 200.00 scf sf +12825 4275 m +gs 1 -1 sc (FLU \(proc\)) col0 sh gr +% Polyline +15.000 slw +n 12750 4050 m 15975 4050 l 15975 5775 l 12750 5775 l + cp gs col0 s gr +% Polyline +n 12750 2775 m 15150 2775 l 15150 3900 l 12750 3900 l + cp gs col0 s gr +/Times-Bold-iso ff 200.00 scf sf +12825 3000 m +gs 1 -1 sc (Sequencer \(proc\)) col0 sh gr +/Times-Bold-iso ff 200.00 scf sf +12825 3300 m +gs 1 -1 sc (epoch=13) col0 sh gr +/Times-Bold-iso ff 200.00 scf sf +12825 3555 m +gs 1 -1 sc (map=[{"foo", next_file=8,) col0 sh gr +/Times-Bold-iso ff 200.00 scf sf +13500 3750 m +gs 1 -1 sc (next_offset=0}...]) col0 sh gr +% Polyline +n 5700 3975 m 5700 4275 l 8250 4275 l 8250 3075 l 7950 3075 l 7950 3975 l + + 5700 3975 l cp gs col0 s gr +/Times-Bold-iso ff 200.00 scf sf +5775 4200 m +gs 1 -1 sc (server logic) col0 sh gr +/Times-Bold-iso ff 200.00 scf sf +5775 3060 m +gs 1 -1 sc (Append <<123 bytes>>) col0 sh gr +/Times-Bold-iso ff 200.00 scf sf +5775 3315 m +gs 1 -1 sc (to a file with prefix) col0 sh gr +/Times-Bold-iso ff 200.00 scf sf +5775 3570 m +gs 1 -1 sc ("foo".) 
col0 sh gr +/Times-Bold-iso ff 200.00 scf sf +5775 2700 m +gs 1 -1 sc (CLIENT \(proc\)) col0 sh gr +% Polyline +gs clippath +5970 3763 m 5970 3915 l 6030 3915 l 6030 3763 l 6030 3763 l 6000 3883 l 5970 3763 l cp +eoclip +n 6000 3600 m + 6000 3900 l gs col0 s gr gr + +% arrowhead +7.500 slw +n 5970 3763 m 6000 3883 l 6030 3763 l col0 s +% Polyline +15.000 slw +gs clippath +6630 3737 m 6630 3585 l 6570 3585 l 6570 3737 l 6570 3737 l 6600 3617 l 6630 3737 l cp +eoclip +n 6600 3900 m 6600 3825 l 6600 3750 l 6600 3675 l + 6600 3600 l gs col0 s gr gr + +% arrowhead +7.500 slw +n 6630 3737 m 6600 3617 l 6570 3737 l col0 s +/Times-Bold-iso ff 200.00 scf sf +6675 3900 m +gs 1 -1 sc (ok) col0 sh gr +% Polyline +15.000 slw +n 5550 4350 m 8325 4350 l 8325 2475 l 5550 2475 l + cp gs col0 s gr +% Polyline +gs clippath +8143 4500 m 8035 4393 l 7993 4435 l 8100 4543 l 8100 4543 l 8037 4437 l 8143 4500 l cp +eoclip +n 12525 5175 m 8775 5175 l + 8025 4425 l gs col0 s gr gr + +% arrowhead +7.500 slw +n 8143 4500 m 8037 4437 l 8100 4543 l col0 s +/Times-Bold-iso ff 200.00 scf sf +11625 5100 m +gs 1 -1 sc (ok) col0 sh gr +% Polyline +15.000 slw +gs clippath +5970 4663 m 5970 4815 l 6030 4815 l 6030 4663 l 6030 4663 l 6000 4783 l 5970 4663 l cp +eoclip +n 6000 4425 m + 6000 4800 l gs col0 s gr gr + +% arrowhead +7.500 slw +n 5970 4663 m 6000 4783 l 6030 4663 l col0 s +% Polyline +15.000 slw +gs clippath +6630 4562 m 6630 4410 l 6570 4410 l 6570 4562 l 6570 4562 l 6600 4442 l 6630 4562 l cp +eoclip +n 6600 4800 m 6600 4650 l 6600 4575 l 6600 4500 l + 6600 4425 l gs col0 s gr gr + +% arrowhead +7.500 slw +n 6630 4562 m 6600 4442 l 6570 4562 l col0 s +% Polyline +15.000 slw +gs clippath +12388 2730 m 12540 2730 l 12540 2670 l 12388 2670 l 12388 2670 l 12508 2700 l 12388 2730 l cp +eoclip +n 8475 2700 m + 12525 2700 l gs col0 s gr gr + +% arrowhead +7.500 slw +n 12388 2730 m 12508 2700 l 12388 2670 l col0 s +% Polyline +15.000 slw +gs clippath +8612 2970 m 8460 2970 l 8460 3030 l 8612 3030 l 8612 3030 l 8492 3000 l 8612 2970 l cp +eoclip +n 12525 3000 m + 8475 3000 l gs col0 s gr gr + +% arrowhead +7.500 slw +n 8612 2970 m 8492 3000 l 8612 3030 l col0 s +% Polyline +15.000 slw +gs clippath +8612 3645 m 8460 3645 l 8460 3705 l 8612 3705 l 8612 3705 l 8492 3675 l 8612 3645 l cp +eoclip +n 12525 6900 m 12000 6900 l 12000 3675 l + 8475 3675 l gs col0 s gr gr + +% arrowhead +7.500 slw +n 8612 3645 m 8492 3675 l 8612 3705 l col0 s +% Polyline +15.000 slw +gs clippath +12388 3330 m 12540 3330 l 12540 3270 l 12388 3270 l 12388 3270 l 12508 3300 l 12388 3330 l cp +eoclip +n 8475 3975 m 12300 3975 l 12300 3300 l + 12525 3300 l gs col0 s gr gr + +% arrowhead +7.500 slw +n 12388 3330 m 12508 3300 l 12388 3270 l col0 s +% Polyline +15.000 slw +gs clippath +12388 4905 m 12540 4905 l 12540 4845 l 12388 4845 l 12388 4845 l 12508 4875 l 12388 4905 l cp +eoclip +n 8250 4425 m 8700 4875 l + 12525 4875 l gs col0 s gr gr + +% arrowhead +7.500 slw +n 12388 4905 m 12508 4875 l 12388 4845 l col0 s +% Polyline +15.000 slw +n 12675 2400 m 16050 2400 l 16050 7875 l 12675 7875 l + cp gs col0 s gr +% Polyline +gs clippath +8612 4245 m 8460 4245 l 8460 4305 l 8612 4305 l 8612 4305 l 8492 4275 l 8612 4245 l cp +eoclip +n 12525 3600 m 12375 3600 l 12375 4275 l + 8475 4275 l gs col0 s gr gr + +% arrowhead +7.500 slw +n 8612 4245 m 8492 4275 l 8612 4305 l col0 s +/Times-Bold-iso ff 200.00 scf sf +8850 5625 m +gs 1 -1 sc (Write to FLU B -> ok) col0 sh gr +/Times-Bold-iso ff 200.00 scf sf +8850 6135 m +gs 1 -1 sc (Write to FLU C -> ok) col0 sh gr 
+/Times-Bold-iso ff 200.00 scf sf +8550 2625 m +gs 1 -1 sc (Request 123 bytes, prefix="foo", epoch=12) col0 sh gr +/Times-Bold-iso ff 200.00 scf sf +11100 2925 m +gs 1 -1 sc ({bad_epoch,13}) col4 sh gr +/Times-Bold-iso ff 200.00 scf sf +10875 3600 m +gs 1 -1 sc ({ok, proj=...}) col0 sh gr +/Times-Bold-iso ff 200.00 scf sf +8550 4500 m +gs 1 -1 sc (Write <<123 bytes>> to) col0 sh gr +/Times-Bold-iso ff 200.00 scf sf +8550 4755 m +gs 1 -1 sc (file="foo.seq_a.008", offset=0) col0 sh gr +/Times-Bold-iso ff 200.00 scf sf +13575 2625 m +gs 1 -1 sc (Server A) col0 sh gr +/Times-Bold-iso ff 200.00 scf sf +8550 3900 m +gs 1 -1 sc (Req. 123 bytes, prefix="foo", epoch=13) col0 sh gr +/Times-Roman-iso ff 166.67 scf sf +6075 3825 m +gs 1 -1 sc (1) col16 sh gr +/Times-Roman-iso ff 166.67 scf sf +6075 4650 m +gs 1 -1 sc (2) col16 sh gr +/Times-Roman-iso ff 166.67 scf sf +6675 4650 m +gs 1 -1 sc (3) col16 sh gr +/Times-Roman-iso ff 166.67 scf sf +10950 2850 m +gs 1 -1 sc (5) col16 sh gr +/Times-Roman-iso ff 166.67 scf sf +10725 3525 m +gs 1 -1 sc (7) col16 sh gr +/Times-Bold-iso ff 200.00 scf sf +9375 4200 m +gs 1 -1 sc (file="foo.seq_a.008", offset=0) col0 sh gr +/Times-Roman-iso ff 166.67 scf sf +8400 3225 m +gs 1 -1 sc (6) col16 sh gr +/Times-Roman-iso ff 166.67 scf sf +8400 2625 m +gs 1 -1 sc (4) col16 sh gr +/Times-Roman-iso ff 166.67 scf sf +6675 3750 m +gs 1 -1 sc (16) col16 sh gr +/Times-Roman-iso ff 166.67 scf sf +8400 3900 m +gs 1 -1 sc (8) col16 sh gr +/Times-Roman-iso ff 166.67 scf sf +9225 4200 m +gs 1 -1 sc (9) col16 sh gr +/Times-Roman-iso ff 166.67 scf sf +8400 4500 m +gs 1 -1 sc (10) col16 sh gr +/Times-Roman-iso ff 166.67 scf sf +8475 5625 m +gs 1 -1 sc (12,13) col16 sh gr +/Times-Roman-iso ff 166.67 scf sf +8475 6075 m +gs 1 -1 sc (14,15) col16 sh gr +/Times-Roman-iso ff 166.67 scf sf +11400 5100 m +gs 1 -1 sc (11) col16 sh gr +% Polyline +15.000 slw +gs clippath +12388 6630 m 12540 6630 l 12540 6570 l 12388 6570 l 12388 6570 l 12508 6600 l 12388 6630 l cp +eoclip +n 8475 3300 m 12150 3300 l 12150 6600 l + 12525 6600 l gs col0 s gr gr + +% arrowhead +7.500 slw +n 12388 6630 m 12508 6600 l 12388 6570 l col0 s +% here ends figure; +pagefooter +showpage +%%Trailer +%EOF diff --git a/doc/src.high-level/high-level-machi.tex b/doc/src.high-level/high-level-machi.tex new file mode 100644 index 0000000..77fd142 --- /dev/null +++ b/doc/src.high-level/high-level-machi.tex @@ -0,0 +1,2527 @@ + +%% \documentclass[]{report} +\documentclass[preprint,10pt]{sigplanconf} +% The following \documentclass options may be useful: + +% preprint Remove this option only once the paper is in final form. +% 10pt To set in 10-point type instead of 9-point. +% 11pt To set in 11-point type instead of 9-point. +% authoryear To obtain author/year citation style instead of numeric. 
+
+% \usepackage[a4paper]{geometry}
+\usepackage[dvips]{graphicx} % to include images
+%\usepackage{pslatex} % to use PostScript fonts
+
+\begin{document}
+
+%%\special{papersize=8.5in,11in}
+%%\setlength{\pdfpageheight}{\paperheight}
+%%\setlength{\pdfpagewidth}{\paperwidth}
+
+\conferenceinfo{}{}
+\copyrightyear{2014}
+\copyrightdata{978-1-nnnn-nnnn-n/yy/mm}
+\doi{nnnnnnn.nnnnnnn}
+
+\titlebanner{Draft \#0, April 2014}
+\preprintfooter{Draft \#0, April 2014}
+
+\title{Machi: an immutable file store}
+\subtitle{High level design \& strawman implementation suggestions \\
+  with focus on eventual consistency/``EC'' mode of operation}
+
+\authorinfo{Basho Japan KK}{}
+
+\maketitle
+
+\section{Origins}
+\label{sec:origins}
+
+This document was first written during the autumn of 2014 for a
+Basho-only internal audience. Since its original drafts, Machi has
+been designated by Basho as a full open source software project. This
+document has been rewritten in 2015 to address an external audience.
+Furthermore, many strong consistency design elements have been removed
+and will appear later in separate documents.
+
+\section{Abstract}
+\label{sec:abstract}
+
+Our goal
+is the creation of a robust \& reliable, distributed, highly
+available\footnote{Capable of operating in ``AP mode'' or
+  ``CP mode'' relative to the CAP Theorem, see
+  Section~\ref{sub:wedge}.}
+large file
+store based upon write-once registers, append-only files,
+Chain Replication, and
+client-server style architecture. All
+members of the cluster store all of the files. Distributed load
+balancing/sharding of files is {\em outside} of the scope of this system.
+However, it is a high priority that this system be able to integrate
+easily into systems that do provide distributed load balancing, e.g.,
+Riak Core. Although strong consistency is a major feature of Chain
+Replication, this document will focus mainly on eventual consistency
+features --- strong consistency design will be discussed in a separate
+document.
+
+\section{Introduction}
+\label{sec:introduction}
+
+\begin{quotation}
+``I must not scope creep. Scope creep is the mind-killer. Scope creep
+  is the little-death that brings total obliteration. I will face my
+  scope.''
+\par
+\hfill{--- Fred Hebert, {\tt @mononcqc}}
+\end{quotation}
+\subsection{Name}
+\label{sub:name}
+
+This file store will be called ``Machi''.
+``Machi'' is a Japanese word for
+``village'' or ``small town''. A village is a rather self-contained
+thing, but it is small, not like a city.
+
+One use case for Machi is for file storage, as-is. However, just as Tokyo
+City is built from a huge collection of machis, this project
+is also designed to work well as part of a larger system, such as Riak
+Core. Tokyo wasn't built in a day, after all, and definitely wasn't
+built out of a single village.
+
+\subsection{Assumptions}
+\label{sub:assumptions}
+
+Machi is a client-server system. All servers in a Machi cluster store
+identical copies/replicas of all files, preferably large files.
+\begin{itemize}
+  \item This puts an effective limit on the size of a Machi cluster.
+        For example, five servers will replicate all files
+        for an effective replication $N$ factor of 5.
+  \item Any mechanism to distribute files across a subset of Machi
+        servers is outside the scope of Machi and of this design.
+\end{itemize}
+
+``Large file'' is intended to mean hundreds of MBytes or more
+per file. The design ``sweet spot'' targets about
+1 GByte/file and/or managing up to a few million files in a
+single cluster.
+The maximum size of a single Machi file is
+limited by the server's underlying OS and file system; a
+practical estimate is 2 TBytes or less, but it may be larger.
+
+Machi files are write-once, read-many data structures; the label
+``append-only'' is mostly correct. However, to be 100\% truthful,
+the bytes of a Machi file can be written in any order.
+
+Machi files are always named by the server; Machi clients have no
+direct control of the name assigned by a Machi server. Machi servers
+specify the file name and byte offset for all client write requests.
+(Machi clients may advise servers with a desired file name prefix.)
+
+Machi is not a Hadoop file system (HDFS) replacement.
+%% \begin{itemize}
+% \item
+There is no mechanism for writing Machi files to a subset of
+  available storage servers: all servers in a Machi cluster store
+  identical copies/replicas of all files.
+% \item
+However, Machi is intended to play very nicely with a layer above it,
+  where that layer {\em does} handle file scattering and on-the-fly
+  file migration across servers and all of the nice things that
+  HDFS, Riak CS, and similar systems can do.
+
+Robust and reliable means that Machi will not lose data until a
+fundamental assumption has been violated, e.g., all servers have
+crashed permanently. Machi's file replication algorithms can provide
+strong or eventual consistency and are provably correct. Our only
+task is to not put bugs into the implementation of the algorithms. Machi's
+small pieces and restricted API and semantics will reduce
+(we believe) the effort required to test
+and verify the implementation.
+
+Machi should not have ``big'' external runtime dependencies when
+practical. For example, the feature set of ZooKeeper makes it a
+popular distributed systems coordination service. When possible,
+Machi should try to avoid using such a big runtime dependency. For
+the purposes of explaining ``big'', the Riak KV service is too big and
+thus runs afoul of this requirement.
+
+Machi clients must assume that any interrupted or incomplete write
+operation may be readable at some later time. Read repair or
+incomplete writes may happen long after the client has finished or
+even crashed. In effect, Machi will provide clients with
+``at least once'' behavior for writes.
+
+\subsection{Defining a Machi file}
+
+A Machi ``file'' is an undifferentiated, one-dimensional array of
+bytes. This definition matches the POSIX definition of a file.
+However, the Machi API does not conform to the UNIX/POSIX file
+I/O API.
+
+A list of client operations is shown in
+Figure~\ref{fig:example-client-API}. This list may change, but it
+shows the basic shape of the service.
+
+\begin{figure}
+\begin{itemize}
+  \item Append bytes $B$ to a file with name prefix {\tt "foo"}.
+  \item Read $N$ bytes from offset $O$ from file $F$.
+  \item List files: name, size, etc.
+\end{itemize}
+\caption{Full (?) list of file API operations}
+\label{fig:example-client-API}
+\end{figure}
+
+The file read \& write granularity of Machi is one byte. (In CORFU
+operation mode, perhaps, the granularity would be a page size on the
+order of 4 KBytes or 16 KBytes.)
+
+\begin{figure}
+  \begin{enumerate}
+  \item Client1: Write 1 byte at offset 0.
+  \item Client1: Read 1 byte at offset 0.
+  \item Client2: Write 1 byte at offset 2.
+  \item Client2: Read 1 byte at offset 2.
+  \item Client3: (an intermittently slow client) Write 1 byte at offset 1.
+  \item Client3: Read 1 byte at offset 1.
+  \end{enumerate}
+\caption{Example of a temporally out-of-order file append sequence that
+  is valid within a Machi cluster.}
+\label{fig:temporal-out-of-order}
+\end{figure}
+
+\subsubsection{Append-only files}
+\label{sub:assume-append-only}
+
+Machi's file writing semantics are append-only.
+Machi's append-only behavior is spatial and is {\em not}
+enforced temporally. For example, Figure~\ref{fig:temporal-out-of-order}
+shows client operations
+upon a single file, in strictly increasing wall clock time ticks.
+The sequence shown in Figure~\ref{fig:temporal-out-of-order} is
+perfectly valid Machi behavior.
+
+%% In this example, client 3 was
+%% very quick and was actually the second client to request
+%% appending to the file and therefore was assigned to write to
+%% offset \#1. However, client 3 then became slow and didn't
+%% actually write its data to offset 1 until after step 4.
+
+Any byte in a file may have three states:
+\begin{enumerate}
+  \item unwritten: no value has been assigned to the byte.
+  \item written: exactly one value has been assigned to the byte.
+  \item trimmed: only used for garbage collection \& disk space
+        reclamation purposes.
+\end{enumerate}
+
+Transitions between these states are strictly ordered. Valid
+orders are:
+  \begin{itemize}
+  \item unwritten $\rightarrow$ written
+  \item unwritten $\rightarrow$ trimmed
+  \item written $\rightarrow$ trimmed
+  \end{itemize}
+%% The trim operation may be used internally to mark byte ranges
+%% which have been marked ``no longer in use'', e.g. with a reference
+%% count of zero. Such regions may be garbage collected by Machi
+%% at its convenience.\footnote{Advanced feature, implementation TBD.}
+
+Client append operations are atomic: the transition from
+one state to another happens for all bytes, or else no
+transition is made for any bytes.
+
+\subsubsection{Machi servers choose all file names}
+
+A Machi server always chooses the full file name of the file
+  that will have data appended to it.
+A Machi server always chooses the offset within the file
+  that will have data appended to it.
+
+All file names chosen by Machi are unique within the Machi cluster itself. Any
+duplicate file names can cause correctness violations.\footnote{For
+  participation in a larger system, Machi can construct file names that
+  are unique within that larger system, e.g. by embedding a unique
+  Machi cluster name or perhaps a UUID-style
+  string in the name.}
+
+\subsubsection{File integrity and bit-rot}
+\label{sub:bit-rot}
+
+Clients may specify a per-write checksum of the data being written,
+e.g., SHA1. These checksums will be appended to the file's
+metadata. Checksums are first-class metadata and are replicated with
+the same consistency and availability guarantees as their corresponding
+file data.
+Clients may optionally fetch the checksum of the bytes they
+read.
+
+Bit-rot can and will happen. To guard against bit-rot on disk, strong
+  checksums are used to detect bit-rot at all possible places.
+\begin{itemize}
+  \item Client-calculated checksums of appended data
+  \item Whole-file checksums, calculated by Machi servers for internal
+        sanity checking. See Section~\ref{sub:detecting-corrupted} for
+        commentary on how this may not be feasible.
+  \item Any other place that makes sense for the paranoid.
+\end{itemize}
+
+Full 100\% protection against arbitrary RAM bit-flips is not a design
+goal \ldots but would be cool as research for the great and
+glorious future. Meanwhile, Machi will use as many ``defense in
+depth'' techniques as feasible.
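+
+To make the per-append checksum handling above more concrete, the
+sketch below shows, in Erlang, one way a Machi client could compute a
+SHA1 checksum for the bytes it appends and later re-verify that
+checksum against the bytes it reads back. This is an illustration
+only: the module and function names are hypothetical and are not part
+of any API specified by this document.
+
+\begin{verbatim}
+-module(checksum_sketch).
+-export([checksum/1, verify/2]).
+
+%% Compute the client-side checksum that accompanies an append.
+checksum(Bytes) when is_binary(Bytes) ->
+    crypto:hash(sha, Bytes).
+
+%% Re-verify bytes fetched by a later read against the stored checksum.
+verify(Bytes, ExpectedChecksum) when is_binary(Bytes) ->
+    case crypto:hash(sha, Bytes) of
+        ExpectedChecksum -> ok;
+        _Mismatch        -> {error, checksum_mismatch}
+    end.
+\end{verbatim}
+
+A cautious client would call {\tt verify/2} on every read; the
+whole-file checksums mentioned above would be layered on top of these
+per-append checksums.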

\subsubsection{File metadata}

Files may have metadata associated with them.
Clients may request appending metadata to a file, for example,
  {\tt \{file F, bytes X-Y, property list of 2-tuples\}}.
This metadata receives second-class handling with regard to
consistency and availability, as described below and in contrast to
the per-append checksums described in Section~\ref{sub:bit-rot}.

\begin{itemize}
  \item File metadata is strictly append-only.
  \item File metadata is always eventually consistent.
  \item A complete history of all metadata updates is maintained for
        each file.
  \item Temporal order of metadata entries is not preserved.
  \item Multiple histories for a file may be merged at any time.
  \begin{itemize}
    \item If a client requires idempotency, then the property list
          should contain all information required to identify multiple
          copies of the same metadata item.
    \item Metadata properties should be considered CRDT-like: the
          final metadata list should converge eventually to a single
          list of properties.
  \end{itemize}
\end{itemize}

\subsubsection{File replica management via Chain Replication}
\label{sub-chain-replication}

Machi uses Chain Replication (CR) internally to maintain file
replicas and inter-replica consistency.
A Machi cluster of $F+1$ servers can sustain the failure of up
to $F$ servers without data loss.

A simple explanation of Chain Replication is that it is a variation of
single-primary/multiple-secondary replication with the following
restrictions:

\begin{enumerate}
\item All writes are strictly performed by servers that are arranged
  in a single order, known as the ``chain order'', beginning at the
  chain's head.
\item All strongly consistent reads are performed only by the tail of
  the chain, i.e., the last server in the chain order.
\item Inconsistent reads may be performed by any single server in the
  chain.
\end{enumerate}

Machi contains enough Chain Replication implementation to maintain its
chain state, file data integrity, and file metadata eventual
consistency. See also Section~\ref{sub:self-management}.

The first version of Machi would use a single chain for managing all
files in the cluster. If the system is quiescent,
then all chain members store the same data: all
Machi servers will store identical files. Later versions of Machi
may play clever games with projection data structures and algorithms
that interpret these projections to implement alternative replication
schemes. However, such clever games are scope creep and are therefore
research topics for the future.

Machi will probably not\footnote{Final decision TBD} implement chain
replication using CORFU's description of its protocol. CORFU's
authors made an implementation choice to make the FLU servers
(Section~\ref{sub:flu}) as dumb as possible. The CORFU authors were
(in part) experimenting with a FLU server implemented on an FPGA; a
dumb-as-possible server was a feature.

Machi does not have CORFU's minimalism as a design principle.
Therefore, it's likely that Machi will implement CR using the original
Chain Replication \cite{chain-replication} paper's pattern of message
passing, i.e., with direct server-to-server message
passing.\footnote{Also, the original CR algorithm's requirement for
  message passing back up the chain to enforce write consistency is
  not required: Machi's combination of client-driven data repair and
  write-once registers makes inter-server synchronization unnecessary.}
However, the
description of the protocols in this document will use CORFU-style
Chain Replication. The two variations are equivalent from a
correctness point of view --- the only difference is the communication
pattern and total number of messages required per operation.
CORFU's
client-driven messaging patterns feel easier to describe and to
align with CORFU- and Tango-related research papers.

\subsubsection{Data integrity self-management}
\label{sub:self-management}

Machi servers automatically monitor each other's health. Signs
of poor health will automatically trigger reconfiguration of the Machi
cluster to avoid data loss and to provide maximum availability.
For example, if a server $S$ crashes and later
restarts, Machi will automatically bring the data on $S$ back to full sync.

Machi will provide an administration API for managing Machi servers, e.g.,
cluster membership, file integrity and checksum verification, etc.

%% Machi's use of Chain Replication internally means that certain
%% combinations of server $S$ fails, $S$ restarts, recovery repair $R_s$
%% starts to repair $S$'s data,
%% and a separate failure happen before the $R_s$ repair has
%% completed ... can lead to data loss. Such data loss events will
%% be avoided by fail-stop behavior of the entire Machi cluster
%% until external/human intervention can restart nodes that contain
%% at-risk-of-loss data.

%% All of Machi's participants, client and server alike, fully observe
%% Machi's protocols, write-once enforcement, projection changes (see
%% below), ``wedge'' enforcement (see below), etc.

\subsection{Out of Machi's scope}

Anything not mentioned in this paper is outside of Machi's scope.
However, it's worth mentioning (again!) that the following are explicitly
considered out-of-scope for Machi.

  Machi does not distribute/shard files across disjoint sets of servers.
  Distribution of files across Machi servers is left for a higher
  level of abstraction, e.g. Riak Core. See also
  Sections~\ref{sub:name} and \ref{sub:assumptions} and the quote at
  the top of Section~\ref{sec:introduction}.

  Later versions of Machi may support erasure
  coding directly, or Machi can be used as-is to store files for
  client applications that are aware that they are manipulating
  erasure coded data. In the latter case,
  the client can read a 1 GByte file from a Machi cluster with a chain
  length of $N$, erasure encode it in a
  15-choose-any-10 encoding scheme, concatenate the coded pieces into a
  1.5 GByte file,
  then store each of the fifteen
  0.1 GByte chunks in a different Machi cluster, each with a chain
  length of only $1$. Using separate Machi clusters makes the
  burden of physical separation of each coded piece (i.e., ``rack
  awareness'') someone/something else's problem.

Why would someone wish to run a Machi cluster with only one
server (i.e., chain length of one) rather than using the FLU service
(Section~\ref{sub:flu}) by itself? One answer is that data
migration is much easier with all of Machi than with only the FLU
server.
To migrate all files from FLU $F_a$ to FLU $F_b$, the administrator
merely needs to add $F_b$ to the end of $F_a$'s chain. When the data
repair is finished, we know that $F_b$ stores full replicas of all of
$F_a$'s data. The administrator removes $F_a$ from the chain, and the
data migration is finished.

\section{Architecture: base components and ideas}

This section presents the major architectural components. They are:

\begin{itemize}
\item The FLU: the server that stores a single replica of a file.
(Section \ref{sub:flu})
\item The Sequencer: assigns a unique file name + offset to each file
  append request.
(Section \ref{sub:sequencer})
\item The Projection Store: a write-once key-value blob store, used by
  Machi for storing projections.
(Section \ref{sub:proj-store})
\item The auto-administration monitor: monitors the health of the
  chain and calculates new projections when failure is detected.
(Section \ref{sub:auto-admin})
\end{itemize}

Also presented here are the major concepts used by Machi components:
\begin{itemize}
\item The Projection: the data structure that describes the current
  state of the Machi chain and is stored in the write-once Projection
  Store.
(Section \ref{sub:projection})
\item The Projection Epoch Number (a.k.a.~The Epoch): Each projection
  is numbered with an epoch.
(Also section \ref{sub:projection})
\item The Bad Epoch Error: a response when a protocol operation uses a
  projection epoch number smaller than the current projection epoch.
(Section \ref{sub:bad-epoch})
\item The Wedge: a response when a protocol operation uses a
  projection epoch number larger than the current projection epoch.
(Section \ref{sub:wedge})
\item AP Mode and CP Mode: the general mode of a Machi cluster may be
  in ``AP Mode'' or ``CP Mode'', which are short-hand notations for
  Machi clusters with eventual consistency or strong consistency
  behavior. Both modes have different availability profiles and
  slightly different feature sets. (Section \ref{sub:ap-cp-mode})
\end{itemize}

\subsection{The FLU}
\label{sub:flu}

The basic idea of the FLU is borrowed from CORFU. The base CORFU
data server is called a ``flash unit''. For Machi, the equivalent
server is nicknamed a FLU, a ``FiLe replica Unit''. A FLU is
responsible for maintaining a single replica/copy of each file
(and its associated metadata) stored in a Machi cluster.

The FLU's API is very simple: see Figure~\ref{fig:flu-api} for its
data types and operations. This description is not 100\% complete but
is sufficient for discussion purposes.

\begin{figure*}[]
\begin{verbatim}
-type m_bytes() :: iolist().
-type m_csum() :: {none | sha1 | sha1_excl_final_20, binary(20)}.
-type m_epoch() :: {m_epoch_n(), m_csum()}.
-type m_epoch_n() :: non_neg_integer().
-type m_err_r() :: error_unwritten | error_trimmed.
-type m_err_w() :: error_written | error_trimmed.
-type m_file_info() :: {m_name(), Size::integer(), ...}.
-type m_fill_err() :: error_not_permitted.
-type m_generr() :: error_bad_epoch | error_wedged |
                    error_bad_checksum | error_unavailable.
-type m_name() :: binary().
-type m_offset() :: non_neg_integer().
-type m_rerror() :: m_err_r() | m_generr().
-type m_werror() :: m_generr() | m_err_w().

-spec fill(m_name(), m_offset(), integer(), m_epoch()) -> ok | m_fill_err() |
                                                          m_werror().
-spec list_files() -> {ok, [m_file_info()]} | m_generr().
-spec read(m_name(), m_offset(), integer(), m_epoch()) -> {ok, binary()} | m_rerror().
-spec trim(m_name(), m_offset(), integer(), m_epoch()) -> ok | m_generr().
-spec write(m_name(), m_offset(), m_bytes(), m_csum(),
            m_epoch()) -> ok | m_werror().

-spec proj_get_largest_key() -> m_epoch_n() | error_unavailable.
-spec proj_get_largest_keyval() -> {ok, m_epoch_n(), binary()} |
                                   error_unavailable.
-spec proj_list() -> {ok, [m_epoch_n()]}.
-spec proj_read(m_epoch_n()) -> {ok, binary()} | m_err_r().
-spec proj_write(m_epoch_n(), m_bytes(), m_csum()) -> ok | m_err_w() |
                 error_unwritten | error_unavailable.
\end{verbatim}
\caption{FLU data and projection operations, viewed as an API and data types (excluding metadata operations)}
\label{fig:flu-api}
\end{figure*}

The FLU must enforce the state of each byte of each file: unwritten,
written, or trimmed.
Transitions between these states are strictly ordered.
See Section~\ref{sub:assume-append-only} for the state transitions and
the restrictions related to those transitions.

The FLU also keeps track of the projection number (number and checksum
both, see also Section~\ref{sub:flu-divergence}) of the last modification to a
file. This projection number is used for quick comparisons during
repair (Section~\ref{sec:repair}) to determine if files are in sync or
not.

\subsubsection{Divergence from CORFU}
\label{sub:flu-divergence}

In Machi, the type signature of {\tt
  m\_epoch()} includes both the projection epoch number and a checksum
of the projection's contents. This checksum is used in cases where
Machi is configured to run in ``AP mode'', which allows a running Machi
cluster to fragment into multiple running sub-clusters during network
partitions. Each sub-cluster can choose a projection number
$P_{side}$ for its side of the cluster.

After the partition is
healed, it may be true that epoch numbers assigned to two different
projections $P_{left}$ and $P_{right}$
are equal. However, their checksum signatures will differ. If a
Machi client or server detects a difference in either the epoch number
or the epoch checksum, it must wedge itself (Section~\ref{sub:wedge})
until a new projection with a larger epoch number is available.

\subsection{The Sequencer}
\label{sub:sequencer}

For every file append request, the Sequencer assigns a unique
{\tt \{file-name,byte-offset\}} location tuple.

Each FLU server runs a sequencer server. Typically, only the
sequencer of the head of the chain is used by clients. However, for
development and administration ease, each FLU should have a sequencer
running at all times. If a client were to use a sequencer other than
the chain head's sequencer, no harm would be done.

The sequencer must assign a new file name whenever any of the
following events happen:
\begin{itemize}
\item The current file size is too big, per cluster administration policy.
\item The sequencer or the entire FLU restarts.
\item The FLU receives a projection or client API call
  that includes a newer/larger projection epoch
  number than its current projection epoch number.
\end{itemize}

The sequencer assignment given to a Machi client is valid only for the
projection epoch in which it was assigned. Machi FLUs must enforce
this requirement. If a Machi client's write attempt is interrupted in
the middle by a projection change, then the following rules must be
used to continue:

\begin{itemize}
\item If the client's write has been successful on at least the head
  FLU in the chain, then the client may continue to use the old
  location. The client is now performing read repair of this location in
  the new epoch.
  (The client may have to add a ``read repair'' option
  to its requests to bypass the FLU's usual enforcement of the
  location's epoch.)
\item If the client's write to the head FLU has not started yet, or if
  it doesn't know the status of the write to the head (e.g., timeout),
  then the client must abandon the current location assignment and
  request a new assignment from the sequencer.
\end{itemize}

\subsubsection{Divergence from CORFU}
\label{sub:sequencer-divergence}

CORFU's sequencer is not strictly
necessary in a CORFU system; it is merely a performance optimization.

In Machi, the sequencer is required because it assigns both a file
byte offset and also a full file name. The client can request a
certain file name prefix, e.g. {\tt "foo"}. The sequencer must make
the file name unique across the entire Machi system. A Machi cluster
has a name that is shared by all servers. The client's prefix
wish is combined with the cluster name, sequencer name, and a
per-sequencer strictly unique ID (such as a counter) to form an opaque
suffix.
For example,
\begin{quote}
{\tt "foo.m=machi4.s=flu-A.n=72006"}
\end{quote}

One reviewer asked, ``Why not just use UUIDs?'' Any naming system
that generates unique file names is sufficient.

\subsection{The Projection Store}
\label{sub:proj-store}

Each FLU maintains a key-value store for the purpose of storing
projections. Reads \& writes to this store are provided by the FLU
administration API. The projection store runs on each server that
provides FLU service, for two reasons of convenience. First, the
projection data structure
need not include extra server names to identify projection
store servers or their locations.
Second, writes to the projection store require
notification to a FLU of the projection update anyway.

The store's basic operation set is simple: get, put, get largest key
(and optionally its value), and list all keys.
The projection store's data types are:

\begin{itemize}
\item key = the projection number
\item value = the entire projection data structure, serialized as an
  opaque byte blob stored in a write-once register. The value is
  typically a few KBytes but may be up to 10s of MBytes in size.
  (A Machi projection data structure will likely be much less than 10
  KBytes.)
\end{itemize}

As a write-once register, any attempt to write a key $K$ when the
local store already has a value written for $K$ will always fail
with an {\tt error\_written} error.

Any write of a key (epoch number) that is larger than the FLU's current
projection number will move the FLU to the wedged state
(Section~\ref{sub:wedge}).

The contents of the projection blob store are maintained by neither
Chain Replication techniques nor any other server-side technique. All
replication and read repair is done only by the projection store
client. Astute readers may theorize that race conditions exist in
such management; see Section~\ref{sec:projections} for details and
restrictions that make it practical.

\subsection{The auto-administration monitor}
\label{sub:auto-admin}

NOTE: This needs a better name.

Each FLU runs an administration agent that is responsible for
monitoring the health of the entire Machi cluster. If a change of
state is noticed (via measurement) or is requested (via the
administration API), zero or more actions may be taken:

\begin{itemize}
\item Enter wedge state (Section~\ref{sub:wedge}).
\item Calculate a new projection to fit the new environment.
\item Attempt to store the new projection locally and remotely.
\item Read a newer projection from local + remote stores (and possibly
  perform read repair).
\item Adopt a new unanimous projection, as read from all
  currently available readable blob stores.
\item Exit wedge state.
\end{itemize}

See also Section~\ref{sec:projections}.

\subsection{The Projection and the Projection Epoch Number}
\label{sub:projection}

The projection data
structure defines the current administration \& operational/runtime
configuration of a Machi cluster's single Chain Replication chain.
Each projection is identified by a strictly increasing counter called
the Projection Epoch Number (or more simply ``the epoch'').

\begin{figure}
\begin{verbatim}
-type m_server_info() :: {Hostname, Port, ...}.

-record(projection, {
            epoch_number    :: m_epoch_n(),
            epoch_csum      :: m_csum(),
            prev_epoch_num  :: m_epoch_n(),
            prev_epoch_csum :: m_csum(),
            creation_time   :: now(),
            author_server   :: m_server(),
            all_members     :: [m_server()],
            active_repaired :: [m_server()],
            active_all      :: [m_server()],
            dbg_annotations :: proplist()
        }).
\end{verbatim}
\caption{Sketch of the projection data structure}
\label{fig:projection}
\end{figure}

Projections are calculated by each FLU using input from local
measurement data, calculations by the FLU's auto-administration
monitor (see below), and input from the administration API.
Each time that the configuration changes (automatically or by
administrator's request), a new epoch number is assigned
to the entire configuration data structure and is distributed to
all FLUs via the FLU's administration API. Each FLU maintains the
current projection epoch number as part of its soft state.

Pseudo-code for the projection's definition is shown in
Figure~\ref{fig:projection}.
See also Section~\ref{sub:flu-divergence} for discussion of the
projection epoch checksum.

\subsection{The Bad Epoch Error}
\label{sub:bad-epoch}

Most Machi protocol actions are tagged with the actor's best knowledge
of the current epoch. However, Machi does not have a single/master
coordinator for making configuration changes. Instead, change is
performed in a fully asynchronous manner. During a cluster
configuration change, some servers will use the old projection number,
$P_p$, whereas others know of a newer projection, $P_{p+x}$ where $x>0$.

When a protocol operation with $P_p$ arrives at an actor who knows
$P_{p+x}$, the response must be {\tt error\_bad\_epoch}. This is a signal
that the actor using $P_p$ is indeed out-of-date and that a newer
projection must be found and used.

\subsection{The Wedge}
\label{sub:wedge}

If a FLU server is using a projection $P_p$ and receives a protocol
message that mentions a newer projection $P_{p+x}$, i.e., one with an
epoch number larger than its current projection's, then it must enter
``wedge'' state and stop
processing all new requests. The server remains in wedge state until
a new projection (with a larger/higher epoch number) is discovered and
appropriately acted upon.
In the Windows Azure storage system \cite{was}, this state is called
the ``sealed'' state.

\subsection{``AP Mode'' and ``CP Mode''}
\label{sub:ap-cp-mode}

Machi's first use cases require only eventual consistency semantics
and behavior, a.k.a.~``AP mode''. However, with only small
modifications, Machi can operate in a strongly consistent manner,
a.k.a.~``CP mode''.
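
Both modes depend on the epoch checks described in the two preceding
subsections. As an illustration, here is a minimal Erlang-style sketch,
using assumed (not final) names, of how a FLU might compare the epoch
attached to an incoming request against its own current epoch: an older
epoch is rejected with {\tt error\_bad\_epoch}, while evidence of a
newer epoch, or of a divergent checksum in AP mode
(Section~\ref{sub:flu-divergence}), wedges the FLU.

\begin{verbatim}
%% Illustrative sketch only, not the final Machi API: compare the
%% epoch attached to an incoming request against the FLU's current
%% epoch, using the m_epoch() shape {EpochNumber, EpochChecksum}.
check_epoch({ReqN, ReqCSum}, {CurN, CurCSum}) ->
    if
        ReqN < CurN        -> error_bad_epoch; % requester is out of date
        ReqN > CurN        -> wedge_self;      % we are out of date: wedge
        ReqCSum /= CurCSum -> wedge_self;      % same number, different
                                               % projection (AP mode split)
        true               -> ok               % epochs match; proceed
    end.
\end{verbatim}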

The auto-administration service (Section \ref{sub:auto-admin}) is
sufficient for an ``AP Mode'' Machi service. In AP Mode, all mutations
to any file on any side of a network partition are guaranteed to use
unique locations (file names and/or byte offsets). When network
partitions are healed, all files can be merged together
(while considering the file format detail discussed in
the footnote of Section~\ref{ssec:just-rsync-it}) in any order
without conflict.

``CP mode'' will be extensively covered in other documents. In summary,
to support ``CP mode'', we believe that the auto-administra\-tion
service proposed here can guarantee strong consistency
at all times.

\section{Sketches of single operations}
\label{sec:sketches}

\subsection{Single operation: append a single sequence of bytes to a file}
\label{sec:sketch-append}

To write/append atomically a single sequence/hunk of bytes to a file,
here's the sequence of steps required.

\begin{enumerate}

\item The client chooses a file name prefix. This prefix gives the
sequencer implicit advice about where the client wants data to be
placed. For example, if two different append requests are for file
prefixes $Pref1$ and $Pref2$ where $Pref1 \ne Pref2$, then the two byte
sequences will definitely be written to different files. If
$Pref1 = Pref2$,
then the sequencer may choose the same file for both (but there is no
guarantee of how ``close together'' the two requests might be).

\item (cacheable) Find the list of Machi member servers. This step is
only needed at client initialization time or when all Machi members
are down/unavailable. This step is out of scope of Machi, i.e., the
list is found via another source: local configuration file, DNS, LDAP,
Riak KV, ZooKeeper, carrier pigeon, etc.

\item (cacheable) Find the current projection number and projection data
structure by fetching it from the projection store service of one of
the Machi FLU servers. This info
may be cached and reused for as long as Machi server requests do not
result in {\tt error\_bad\_epoch}.

\item Client sends a sequencer op to the sequencer process on the head of
the Machi chain (as defined by the projection data structure):
{\tt \{sequence\_req, Filename\_Prefix, Number\_of\_Bytes\}}. The reply
includes {\tt \{Full\_Filename, Offset\}}.

\item The client sends a write request to the head of the Machi chain:
{\tt \{write\_req, Full\_Filename, Offset, Bytes, Options\}}. The
client-calculated checksum is a recommended option.

\item If the head's reply is {\tt ok}, then repeat for all remaining chain
members in strict chain order.

\item If all chain members' replies are {\tt ok}, then the append was
successful. The client now knows the full Machi file name and byte
offset, so that future attempts to read the data can do so by file
name and offset.

\item Upon any non-{\tt ok} reply from a FLU server, {\em the client must
consider the entire append operation a failure}. If the client
wishes, it may retry the append operation using a new location
assignment from the sequencer or, if permitted by Machi restrictions,
perform read repair on the original location. If this read repair is
fully successful, then the client may consider the append operation
successful.

\item If a FLU server $FLU$ is unavailable, notify another up/available
chain member that $FLU$ appears unavailable. This info may be used by
the auto-administration monitor to change projections.
If the client
wishes, it may retry the append op or perhaps wait until a new projection is
available.

\item If any FLU server reports {\tt error\_written}, then either of two
things has happened:
\begin{itemize}
  \item The appending client $C_w$ was too slow when attempting to write
    to the head of the chain.
    Another client, $C_r$, attempted a read, noticed that the tail's value was
    unwritten and noticed that the head's value was also unwritten.
    Then $C_r$ initiated a ``fill'' operation to write junk into
    this offset of
    the file. The fill operation succeeded, and now the slow
    appending client $C_w$ discovers that it was too slow via the
    {\tt error\_written} response.
  \item The appending client $C_w$ was too slow after at least one
    successful write.
    Client $C_r$ attempted a read, noticed the partial write, and
    then engaged in read repair. Client $C_w$ should also check all
    replicas to verify that the repaired data matches its write
    attempt -- in all cases, the values written by $C_w$ and $C_r$ are
    identical.
\end{itemize}

\end{enumerate}

%% NOTE: append-whiteboard.eps was created by 'jpeg2ps'.
\begin{figure*}[htp]
\resizebox{\textwidth}{!}{
  \includegraphics[width=\textwidth]{figure6}
  %% \includegraphics[width=\textwidth]{append-whiteboard}
  }
\caption{Flow diagram: append 123 bytes onto a file with prefix {\tt "foo"}.}
\label{fig:append-flow}
\end{figure*}

See Figure~\ref{fig:append-flow} for a diagram showing an example
append; the same example is also shown in
Figure~\ref{fig:append-flowMSC} using MSC style (message sequence chart).
In
this case, the first FLU contacted has a newer projection epoch,
$P_{13}$, than the $P_{12}$ epoch that the client first attempts to use.

\subsection{TODO: Single operation: reading a chunk of bytes from a file}
\label{sec:sketch-read}

\section{Projections: calculation, then storage, then (perhaps) use}
\label{sec:projections}

Machi uses a ``projection'' to determine how its Chain Replication replicas
should operate; see Section~\ref{sub-chain-replication} and
\cite{corfu1}. At runtime, a cluster must be able to respond both to
administrative changes (e.g., substituting a failed server box with
replacement hardware) and to local network conditions (e.g., is
there a network partition?). The concept of a projection is borrowed
from CORFU but has a longer history, e.g., the Hibari key-value store
\cite{cr-theory-and-practice}, and goes back decades in the research
literature, e.g., Porcupine \cite{porcupine}.

\subsection{Phases of projection change}

Machi's use of projections has four discrete phases, which are
discussed below: network monitoring,
projection calculation, projection storage, and
adoption of new projections.

\subsubsection{Network monitoring}
\label{sub:network-monitoring}

Monitoring of local network conditions can be implemented in many
ways. None are mandatory, as far as this RFC is concerned.
Easy-to-maintain code should be the primary driver for any
implementation. Early versions of Machi may use some/all of the
following techniques:

\begin{itemize}
\item Internal ``no op'' FLU-level protocol request \& response.
\item Use of distributed Erlang {\tt net\_ticktime} node monitoring.
\item Explicit connections to remote {\tt epmd} services, e.g., to
tell the difference between a dead Erlang VM and a dead
machine/hardware node.
\item Network tests via ICMP {\tt ECHO\_REQUEST}, a.k.a.
{\tt ping(8)} +\end{itemize} + +Output of the monitor should declare the up/down (or +available/unavailable) status of each server in the projection. Such +Boolean status does not eliminate ``fuzzy logic'' or probabilistic +methods for determining status. Instead, hard Boolean up/down status +decisions are required by the projection calculation phase +(Section~\ref{subsub:projection-calculation}). + +\subsubsection{Projection data structure calculation} +\label{subsub:projection-calculation} + +Each Machi server will have an independent agent/process that is +responsible for calculating new projections. A new projection may be +required whenever an administrative change is requested or in response +to network conditions (e.g., network partitions). + +Projection calculation will be a pure computation, based on input of: + +\begin{enumerate} +\item The current projection epoch's data structure +\item Administrative request (if any) +\item Status of each server, as determined by network monitoring +(Section~\ref{sub:network-monitoring}). +\end{enumerate} + +All decisions about {\em when} to calculate a projection must be made +using additional runtime information. Administrative change requests +probably should happen immediately. Change based on network status +changes may require retry logic and delay/sleep time intervals. + +Some of the items in Figure~\ref{fig:projection}'s sketch include: + +\begin{itemize} +\item {\tt prev\_epoch\_num} and {\tt prev\_epoch\_csum} The previous + projection number and checksum, respectively. +\item {\tt creation\_time} Wall-clock time, useful for humans and + general debugging effort. +\item {\tt author\_server} Name of the server that calculated the projection. +\item {\tt all\_members} All servers in the chain, regardless of current + operation status. If all operating conditions are perfect, the + chain should operate in the order specified here. + (See also the limitations in Section~\ref{sub:repair-chain-re-ordering}.) +\item {\tt active\_repaired} All active chain members that we know are + fully repaired/in-sync with each other and therefore the Update + Propagation Invariant (Section~\ref{sub:cr-proof}) is always true. + See also Section~\ref{sec:repair}. +\item {\tt active\_all} All active chain members, including those that + are under active repair procedures. +\item {\tt dbg\_annotations} A ``kitchen sink'' proplist, for code to + add any hints for why the projection change was made, delay/retry + information, etc. +\end{itemize} + +\subsection{Projection storage: writing} +\label{sub:proj-storage-writing} + +All projection data structures are stored in the write-once Projection +Store (Section~\ref{sub:proj-store}) that is run by each FLU +(Section~\ref{sub:flu}). + +Writing the projection follows the two-step sequence below. +In cases of writing +failure at any stage, the process is aborted. The most common case is +{\tt error\_written}, which signifies that another actor in the system has +already calculated another (perhaps different) projection using the +same projection epoch number and that +read repair is necessary. Note that {\tt error\_written} may also +indicate that another actor has performed read repair on the exact +projection value that the local actor is trying to write! + +\begin{enumerate} +\item Write $P_{new}$ to the local projection store. This will trigger + ``wedge'' status in the local FLU, which will then cascade to other + projection-related behavior within the FLU. 
\item Write $P_{new}$ to the remote projection store of {\tt all\_members}.
  Some members may be unavailable, but that is OK.
\end{enumerate}

(Recall: Other parts of the system are responsible for reading new
projections from other actors in the system and for deciding to try to
create a new projection locally.)

\subsection{Projection storage: reading}
\label{sub:proj-storage-reading}

Reading data from the projection store is similar in principle to
reading from a Chain Replication-managed FLU system. However, the
projection store does not require the strict replica ordering that
Chain Replication does. For any projection store key $K_n$, the
participating servers may have different values for $K_n$. As a
write-once store, it is impossible to mutate a replica of $K_n$. If
replicas of $K_n$ differ, then other parts of the system (projection
calculation and storage) are responsible for reconciling the
differences by writing a later key,
$K_{n+x}$ where $x>0$, with a new projection.

Projection store reads are ``best effort''. The projection used is chosen from
all replica servers that are available at the time of the read. The
minimum number of replicas is only one: the local projection store
should always be available, even if no other remote replica projection
stores are available.

For any key $K$, different projection stores $S_a$ and $S_b$ may store
nothing (i.e., {\tt error\_unwritten} when queried) or store different
values, $P_a \ne P_b$, despite having the same projection epoch
number. The following ranking rules are used to
determine the ``best value'' of a projection, where the highest rank of
{\em any single projection} is considered the ``best value'':

\begin{enumerate}
\item An unwritten value is ranked at a value of $-1$.
\item A value whose {\tt author\_server} is at the $I^{th}$ position
  in the {\tt all\_members} list has a rank of $I$.
\item A value whose {\tt dbg\_annotations} and/or other fields have
  additional information may increase/decrease its rank, e.g.,
  increase the rank by $10.25$.
\end{enumerate}

Rank rules \#2 and \#3 are intended to avoid worst-case ``thrashing''
of different projection proposals.

The concept of ``read repair'' of an unwritten key is the same as
Chain Replication's. If a read attempt for a key $K$ at some server
$S$ results in {\tt error\_unwritten}, then all of the other stores in
the {\tt \#projection.all\_members} list are consulted. If there is a
unanimous value $V_{u}$ elsewhere, then $V_{u}$ is used to repair all
unwritten replicas. If the value of $K$ is not unanimous, then the
``best value'' $V_{best}$ is used for the repair. If all respond with
{\tt error\_unwritten}, repair is not required.

\subsection{Adoption of new projections}

The projection store's ``best value'' for the largest written epoch
number at the time of the read is the projection used by the FLU.
If the read attempt for projection $P_p$
also yields other non-best values, then the
projection calculation subsystem is notified. This notification
may/may not trigger a calculation of a new projection $P_{p+1}$ which
may eventually be stored and so
resolve $P_p$'s replicas' ambiguity.

\subsubsection{Alternative implementations: Hibari's ``Admin Server''
  and Elastic Chain Replication}

See Section 7 of \cite{cr-theory-and-practice} for details of Hibari's
chain management agent, the ``Admin Server''.
In brief:

\begin{itemize}
\item The Admin Server is intentionally a single point of failure in
  the same way that the instance of Stanchion in a Riak CS cluster
  is an intentional single
  point of failure. In both cases, strict
  serialization of state changes is more important than 100\%
  availability.

\item For higher availability, the Hibari Admin Server is usually
  configured in an active/standby manner. Status monitoring and
  application failover logic are provided by the built-in capabilities
  of the Erlang/OTP application controller.

\end{itemize}

Elastic chain replication is a technique described in
\cite{elastic-chain-replication}. It describes using multiple chains
to monitor each other, arranged in a ring where a chain at position
$x$ is responsible for chain configuration and management of the chain
at position $x+1$. This technique is likely the fall-back to be used
in case the chain management method described in this RFC proves
infeasible.

\subsection{Likely problems and possible solutions}
\label{sub:likely-problems}

There are some unanswered questions about Machi's proposed chain
management technique. The problems that we guess are likely/possible
include:

\begin{itemize}

\item Thrashing or oscillating between a pair (or more) of
  projections. It's hoped that the ``best projection'' ranking system
  will be sufficient to prevent endless thrashing of projections, but
  it isn't yet clear that it will be.

\item Partial (and/or one-way) network splits which cause partially
  connected graphs of inter-node connectivity. Groups of nodes that
  are completely isolated aren't a problem. However, partially
  connected groups of nodes are an unknown. Intuition says that
  communication (via the projection store) with ``bridge nodes'' in a
  partially-connected network ought to settle eventually on a
  projection with high rank, e.g., the projection on an island
  subcluster of nodes with the largest author node name. Some corner
  case(s) may exist where this intuition is not correct.

\item CP Mode management via the method proposed in
  Section~\ref{sec:split-brain-management} may not be sufficient in
  all cases.

\end{itemize}

\section{Chain Replication repair: how to fix servers after they crash
and return to service}
\label{sec:repair}

%% Section~\ref{sec:safety-of-transitions} mentions that there are some
%% not-obvious ways that a Machi cluster could inadvertently lose data.
%% It is possible to avoid data loss in all cases, short of all servers
%% being destroyed by a fire.
The theory of why it's possible to avoid
data loss with chain replication is summarized in this section,
followed by a discussion of Machi-specific details that must be
included in any production-quality implementation.

{\bf NOTE:} Beginning with Section~\ref{sub:repair-entire-files}, the
techniques presented here are novel and not described (to the best of
our knowledge) in other papers or public open source software.
Reviewers should give this new stuff
{\em an extremely careful reading}. All novelty in this section and
also in the projection management techniques of
Section~\ref{sec:projections} must be the first things to be
thoroughly vetted with tools such as Concuerror, QuickCheck, TLA+,
etc.

\subsection{Chain Replication: proof of correctness}
\label{sub:cr-proof}

\begin{quote}
``You want the truth?
You can't handle the truth!''
\par
\hfill{ --- Colonel Jessep, ``A Few Good Men'', 1992}
\end{quote}

See Section~3 of \cite{chain-replication} for a proof of the
correctness of Chain Replication. A short summary is provided here.
Readers interested in good karma should read the entire paper.

The four basic rules of Chain Replication and its strong
consistency guarantee:

\begin{enumerate}

\item All replica servers are arranged in an ordered list $C$.

\item All mutations of a datum are performed upon each replica of $C$
  strictly in the order in which they appear in $C$. A mutation is considered
  completely successful if the writes by all replicas are successful.

\item The head of the chain makes the determination of the order of
  all mutations to all members of the chain. If the head determines
  that some mutation $M_i$ happened before another mutation $M_j$,
  then mutation $M_i$ happens before $M_j$ on all other members of
  the chain.\footnote{While necessary for general Chain Replication,
    Machi does not need this property. Instead, the property is
    provided by Machi's sequencer and the write-once register of each
    byte in each file.}

\item All read-only operations are performed by the ``tail'' replica,
  i.e., the last replica in $C$.

\end{enumerate}

The basis of the proof lies in a simple logical trick, which is to
consider the history of all operations made to any server in the chain
as a literal list of unique symbols, one for each mutation.

Each replica of a datum will have a mutation history list. We will
call this history list $H$. For the $i^{th}$ replica in the chain list
$C$, we call $H_i$ the mutation history list for the $i^{th}$ replica.

Before the $i^{th}$ replica in the chain list begins service, its mutation
history $H_i$ is empty, $[]$. After this replica runs in a Chain
Replication system for a while, its mutation history list grows to
look something like
$[M_0, M_1, M_2, ..., M_{m-1}]$ where $m$ is the total number of
mutations of the datum that this server has processed successfully.

Let's assume for a moment that all mutation operations have stopped.
If the order of the chain was constant, and if all mutations are
applied to each replica in the chain's order, then all replicas of a
datum will have the exact same mutation history: $H_i = H_j$ for any
two replicas $i$ and $j$ in the chain
(i.e., $\forall i,j \in C, H_i = H_j$). That's a lovely property,
but it is much more interesting to assume that the service is
not stopped. Let's look next at a running system.

\begin{figure*}
\centering
\begin{tabular}{ccc}
{\bf {{On left side of $C$}}} & & {\bf On right side of $C$} \\
\hline
\multicolumn{3}{l}{Looking at replica order in chain $C$:} \\
$i$ & $<$ & $j$ \\

\multicolumn{3}{l}{For example:} \\

0 & $<$ & 2 \\
\hline
\multicolumn{3}{l}{It {\em must} be true: history lengths per replica:} \\
length($H_i$) & $\geq$ & length($H_j$) \\
\multicolumn{3}{l}{For example, a quiescent chain:} \\
48 & $\geq$ & 48 \\
\multicolumn{3}{l}{For example, a chain being mutated:} \\
55 & $\geq$ & 48 \\
\multicolumn{3}{l}{Example ordered mutation sets:} \\
$[M_0,M_1,\ldots,M_{46},M_{47},\ldots,M_{53},M_{54}]$ & $\supset$ & $[M_0,M_1,\ldots,M_{46},M_{47}]$ \\
\multicolumn{3}{c}{\bf Therefore the right side is always an ordered
  subset} \\
\multicolumn{3}{c}{\bf of the left side.
Furthermore, the ordered
  sets on both} \\
\multicolumn{3}{c}{\bf sides have the exact same order of those elements they have in common.} \\
\multicolumn{3}{c}{The notation used by the Chain Replication paper is
shown below:} \\
$[M_0,M_1,\ldots,M_{46},M_{47},\ldots,M_{53},M_{54}]$ & $\succeq$ & $[M_0,M_1,\ldots,M_{46},M_{47}]$ \\

\end{tabular}
\caption{A demonstration of the Chain Replication protocol history ``Update Propagation Invariant''.}
\label{tab:chain-order}
\end{figure*}

If the entire chain $C$ is processing any number of concurrent
mutations, then we can still understand $C$'s behavior.
Figure~\ref{tab:chain-order} shows us two replicas in chain $C$:
a replica $R_i$ that is earlier in the chain $C$ (further to the left)
than some other replica $R_j$. We know that $i$'s position index in
the chain is smaller than $j$'s position index, so therefore $i < j$.
The restrictions of Chain Replication make it true that length($H_i$)
$\ge$ length($H_j$) because it is also true that $H_i \supset H_j$, i.e.,
$H_i$ on the left is always a superset of $H_j$ on the right.

When considering $H_i$ and $H_j$ as strictly ordered lists, we have
$H_i \succeq H_j$, where the right side is always an exact prefix of the left
side's list. This prefixing property is exactly what strong
consistency requires. If a value is read from the tail of the chain,
then no other chain member can have a prior/older value because their
respective mutation histories cannot be shorter than the tail
member's history.

\paragraph{``Update Propagation Invariant''}
is the original chain replication paper's name for the
$H_i \succeq H_j$
property. This paper will use the same name.

\subsection{When to trigger read repair of single values}

Assume now that some client $X$ wishes to fetch a datum that's managed
by Chain Replication. Client $X$ must discover the chain's
configuration for that datum, then send its read request to the tail
replica of the chain, $R_{tail}$.

In CORFU and in Machi, the store is a set of write-once registers.
Therefore, the only possible responses that client $X$ might get from a
query to the chain's $R_{tail}$ are:

\begin{enumerate}
\item {\tt error\_unwritten}
\item {\tt \{ok, <<...data bytes...>>\}}
\item {\tt error\_trimmed} (in environments where space
  reclamation/garbage collection is permitted)
\end{enumerate}

Let's explore each of these responses in the following subsections.

\subsubsection{Tail replica replies {\tt error\_unwritten}}

There are only a few reasons why this value is possible. All are
discussed here.

\paragraph{Scenario: A client $X_w$ has received a sequencer's
  assignment for this
  location, but the client has crashed somewhere in the middle of
  writing the value to the chain.}

The correct action to take here depends on the $R_{head}$
replica's value. If $R_{head}$'s value is unwritten, then the writing
client $X_w$ crashed before writing to $R_{head}$. The reading client
$X_r$ must ``fill'' the page with junk bytes (see
Section~\ref{sub:fill-single}) or else do nothing.

If $R_{head}$'s value is indeed written, then the reading client $X_r$
must finish a ``read repair'' operation before the client may proceed.
See Section~\ref{sub:read-repair-single} for details.
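
The decision just described (consult $R_{head}$ when $R_{tail}$ reports
{\tt error\_unwritten}, then either fill or read repair) can be sketched
in Erlang as follows. This is an illustrative sketch only, not the real
Machi client library: the {\tt ReadF}, {\tt FillF}, and {\tt WriteF}
helpers stand in for per-FLU operations, and error handling and the
{\tt error\_trimmed} case are omitted.

\begin{verbatim}
%% Illustrative sketch only: what a reading client might do when the
%% tail replies error_unwritten.  ReadF/FillF/WriteF are assumed
%% per-FLU helpers, not part of any real Machi API.
repair_unwritten_tail([Head | Rest] = Chain, Loc, ReadF, FillF, WriteF) ->
    case ReadF(Head, Loc) of
        error_unwritten ->
            %% The writer never reached the head: the reader may fill
            %% the chain with junk (see the fill subsection below) or
            %% simply do nothing.
            FillF(Chain, Loc);
        {ok, Bytes} ->
            %% The writer reached the head but not the tail: copy the
            %% head's value to the rest of the chain, in chain order.
            [ok = WriteF(FLU, Loc, Bytes) || FLU <- Rest],
            {ok, Bytes}
    end.
\end{verbatim}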

\paragraph{Scenario: A client has received a sequencer's assignment for this
  location, but the client has become extremely slow (or is
  experiencing a network partition, or any other reason) and has not
  yet updated $R_{tail}$ $\ldots$ but that client {\em will eventually
  finish its work} and will eventually update $R_{tail}$.}

It should come as little surprise that the reading client $X_r$
cannot know whether the writing client $X_w$
has really crashed or if $X_w$ is merely very slow.
It is therefore very nice that
the action that $X_r$ must take in either case is the same --- see the
previous scenario for details.

\subsubsection{Tail replica replies {\tt \{ok, <<...>>\}}}

There is no need to perform single item read repair in this case.
The Update Propagation Invariant guarantees that this value is the one
strictly consistent value for this register.

\subsubsection{Tail replica replies {\tt error\_trimmed}}

There is no need to perform single item read repair in this case.

{\bf NOTE:} It isn't yet clear how much support early versions of
Machi will need for GC/space reclamation via trimming.

\subsection{How to read repair a single value}
\label{sub:read-repair-single}

If a value at $R_{tail}$ is unwritten, then the answer to ``what value
should I use to repair the chain's value?'' is simple: the value at the
head $R_{head}$ is the value $V_{head}$ that must be used. The client
then writes $V_{head}$ to all other members of the chain $C$, in
order.

The client may not proceed with its upper-level logic until the read
repair operation is successful. If the read repair operation is not
successful, then the client must react in the same manner as if the
original read attempt of $R_{tail}$'s value had failed.

\subsection{How to ``fill'' a single value}
\label{sub:fill-single}

A Machi FLU
implementation may (or may not) maintain enough metadata to be able to
unambiguously inform clients that a written value is the result of a
``fill'' operation. It is not yet clear if that information is valuable
enough for FLUs to maintain.

A ``fill'' operation is simply writing a value of junk. The value of
the junk does not matter, as long as any client reading the value does
not mistake the junk for an application's legitimate data: for
example, all zero bytes, written in Erlang notation as {\tt <<0,0,0,\ldots>>}.

CORFU requires a fill operation to be able to meet its promise of
low-latency operation, in case of failure. Its use can be illustrated
in this sequence of events:

\begin{enumerate}
\item Client $X$ obtains a position from the sequencer at offset $O$
  for a new log write of value $V_X$.
%% \item Client $Z$ obtains a position for a new log write from the
%% sequences at offset $O+1$.
\item Client $X$ pauses. The reason does not matter: a crash, a
  network partition, garbage collection pause, gone scuba diving, etc.
\item Client $Y$ is reading the log forward and finds the entry at
  offset $O$ is unwritten. A CORFU log is very strictly ordered, so
  client $Y$ is blocked and cannot read any further in the log until
  the status of offset $O$ has been unambiguously determined.
\item Client $Y$ attempts a fill operation on offset $O$ at the head
  of the chain with value $V_{fill}$.
  If this succeeds, then $Y$ and all other clients know
  that a partial write is in progress, and the value is
  fill bytes.
If this fails because of {\tt error\_written}, then
  client $Y$ knows that client $X$ was alive long enough to write to
  the head and that $Y$ has
  lost a race with $X$: the head's value at offset $O$ is $V_X$.
\item Client $Y$ writes to the remaining members of the chain,
  using the value at the chain's head, $V_X$ or $V_{fill}$.
\item Client $Y$ (and all other CORFU clients) now unambiguously know
  the state of offset $O$: it is either a fully-written junk page
  written by $Y$ or it is a fully-written page $V_X$ written by $X$.
\item If client $X$ has not crashed but is merely slow with any write
  attempt to any chain member, $X$ may encounter {\tt error\_written}
  responses. However, all values stored by that chain member must be
  either $V_X$ or $V_{fill}$, and all chain members will agree on
  which value it is.
\end{enumerate}

A fill operation in Machi is {\em prohibited} at any time that split
brain runtime support is enabled (i.e., in AP mode).

CORFU does not need such a restriction on ``fill'': CORFU always replaces
all of the repair destination's data with
the repair source's data. (See also
Section~\ref{sub:repair-divergence}.) Machi must be able
to perform data repair of many 10s of TBytes of data very quickly;
CORFU's brute-force solution is not sufficient for Machi. Until a
work-around is found for Machi, fill operations will simply be
prohibited if split brain operation is enabled.

\subsection{Repair of entire files}
\label{sub:repair-entire-files}

There are some situations where repair of entire files is necessary.

\begin{itemize}
\item To repair FLUs added to a chain in a projection change,
  specifically adding a new FLU to the chain. This case covers both
  adding a new, data-less FLU and re-adding a previous, data-full FLU
  back to the chain.
\item To avoid data loss when changing the order of the chain's servers.
\end{itemize}

Both situations can set the stage for data loss in the future.
If a violation of the Update Propagation Invariant (see end of
Section~\ref{sub:cr-proof}) is permitted, then the strong consistency
guarantee of Chain Replication is violated. Because Machi uses
write-once registers, the number of possible strong consistency
violations is small: any client that witnesses a written $\rightarrow$
unwritten transition is a violation of strong consistency. But
avoiding even this one bad scenario is a bit tricky.

As explained in Section~\ref{sub:data-loss1}, data
unavailability/loss when all chain servers fail is unavoidable. We
wish to avoid data loss whenever a chain has at least one surviving
server. One method to avoid data loss is to preserve the Update
Propagation Invariant at all times.

\subsubsection{Just ``rsync'' it!}
\label{ssec:just-rsync-it}

A simpler replication method might be perhaps 90\% sufficient.
That method could loosely be described as ``just {\tt rsync}
all of the files to all servers in an infinite loop.''\footnote{The
  file format suggested in
  Section~\ref{sub:on-disk-data-format} does not permit {\tt rsync}
  as-is to be sufficient. A variation of {\tt rsync} would need to be
  aware of the data/metadata split within each file and only replicate
  the data section \ldots and the metadata would still need to be
  managed outside of {\tt rsync}.}

However, such an informal method
cannot tell you exactly when you are in danger of data loss and when
data loss has actually happened.
If we maintain the Update
Propagation Invariant, then we know exactly when data loss is imminent
or has happened.

Furthermore, we hope to use Machi for multiple use cases, including
ones that require strong consistency.
For uses such as CORFU, strong consistency is a non-negotiable
requirement. Therefore, we will use the Update Propagation Invariant
as the foundation for Machi's data loss prevention techniques.

\subsubsection{Divergence from CORFU: repair}
\label{sub:repair-divergence}

The original repair design for CORFU is simple and effective,
mostly. See Figure~\ref{fig:corfu-style-repair} for a full
description of the algorithm and
Figure~\ref{fig:corfu-repair-sc-violation} for an example of a strong
consistency violation that can follow. (NOTE: This is a variation of
the data loss scenario that is described in
Figure~\ref{fig:data-loss2}.)

\begin{figure}
\begin{enumerate}
\item Destroy all data on the repair destination FLU.
\item Add the repair destination FLU to the tail of the chain in a new
  projection $P_{p+1}$.
\item Change projection from $P_p$ to $P_{p+1}$.
\item Let single item read repair fix all of the problems.
\end{enumerate}
\caption{Simplest CORFU-style repair algorithm.}
\label{fig:corfu-style-repair}
\end{figure}

\begin{figure}
\begin{enumerate}
\item Write value $V$ to offset $O$ in the log with chain $[F_a]$.
  This write is considered successful.
\item Change projection to configure chain as $[F_a,F_b]$. Prior to
  the change, all values on FLU $F_b$ are unwritten.
\item FLU server $F_a$ crashes. The new projection defines the chain
  as $[F_b]$.
\item A client attempts to read offset $O$ and finds an unwritten
  value. This is a strong consistency violation.
%% \item The same client decides to fill $O$ with the junk value
%% $V_{junk}$. Now value $V$ is lost.
\end{enumerate}
\caption{An example scenario where the CORFU simplest repair algorithm
  can lead to a violation of strong consistency.}
\label{fig:corfu-repair-sc-violation}
\end{figure}

A variation of the repair
algorithm is presented in section~2.5 of a later CORFU paper \cite{corfu2}.
However, the re-use of a failed
server is not discussed there, either: the example of a failed server
$F_6$ uses a new server, $F_8$, to replace $F_6$. Furthermore, the
repair process is described as:

\begin{quote}
``Once $F_6$ is completely rebuilt on $F_8$ (by copying entries from
  $F_7$), the system moves to projection (C), where $F_8$ is now used
  to service all reads in the range $[40K,80K)$.''
\end{quote}

The phrase ``by copying entries'' does not give enough
detail to avoid the same data race as described in
Figure~\ref{fig:corfu-repair-sc-violation}. We believe that if
``copying entries'' means copying only written pages, then CORFU
remains vulnerable. If ``copying entries'' also means ``fill any
unwritten pages prior to copying them'', then perhaps the
vulnerability is eliminated.\footnote{SLF's note: Probably? This is my
  gut feeling right now. However, given that I've just convinced
  myself 100\% that fill during any possibility of split brain is {\em
  not safe} in Machi, I'm not 100\% certain anymore that this ``easy''
  fix for CORFU is correct.}

\subsubsection{Whole-file repair as FLUs are (re-)added to a chain}
\label{sub:repair-add-to-chain}

Machi's repair process must preserve the Update Propagation
Invariant. To avoid data races with data copying from
``U.P.~Invariant preserving'' servers (i.e.
fully repaired with +respect to the Update Propagation Invariant) +to servers of unreliable/unknown state, a +projection like the one shown in +Figure~\ref{fig:repair-chain-of-chains} is used. In addition, the +operations rules for data writes and reads must be observed in a +projection of this type. + +\begin{figure*} +\centering +$ +[\overbrace{\underbrace{H_1}_\textbf{Head of Heads}, M_{11}, + \underbrace{T_1}_\textbf{Tail \#1}}^\textbf{Chain \#1 (U.P.~Invariant preserving)} +\mid +\overbrace{H_2, M_{21}, + \underbrace{T_2}_\textbf{Tail \#2}}^\textbf{Chain \#2 (repairing)} +\mid \ldots \mid +\overbrace{H_n, M_{n1}, + \underbrace{T_n}_\textbf{Tail \#n \& Tail of Tails ($T_{tails}$)}}^\textbf{Chain \#n (repairing)} +] +$ +\caption{Representation of a ``chain of chains'': a chain prefix of + Update Propagation Invariant preserving FLUs (``Chain \#1'') + with FLUs from $n-1$ other chains under repair.} +\label{fig:repair-chain-of-chains} +\end{figure*} + +\begin{itemize} + +\item The system maintains the distinction between ``U.P.~preserving'' + and ``repairing'' FLUs at all times. This allows the system to + track exactly which servers are known to preserve the Update + Propagation Invariant and which servers may/may not. + +\item All ``repairing'' FLUs must be added only at the end of the + chain-of-chains. + +\item All write operations must flow successfully through the + chain-of-chains from beginning to end, i.e., from the ``head of + heads'' to the ``tail of tails''. This rule also includes any + repair operations. + +\item In AP Mode, all read operations are attempted from the list of +$[T_1,\-T_2,\-\ldots,\-T_n]$, where these FLUs are the tails of each of the +chains involved in repair. +In CP mode, all read operations are attempted only from $T_1$. +The first reply of {\tt \{ok, <<...>>\}} is a correct answer; +the rest of the FLU list can be ignored and the result returned to the +client. If all FLUs in the list have an unwritten value, then the +client can return {\tt error\_unwritten}. + +\end{itemize} + +While the normal single-write and single-read operations are performed +by the cluster, a file synchronization process is initiated. The +sequence of steps differs depending on the AP or CP mode of the system. + +\paragraph{In cases where the cluster is operating in CP Mode:} + +CORFU's repair method of ``just copy it all'' (from source FLU to repairing +FLU) is correct, {\em except} for the small problem pointed out in +Section~\ref{sub:repair-divergence}. The problem for Machi is one of +time \& space. Machi wishes to avoid transferring data that is +already correct on the repairing nodes. If a Machi node is storing +20TBytes of data, we really do not wish to use 20TBytes of bandwidth +to repair only 1 GByte of truly-out-of-sync data. + +However, it is {\em vitally important} that all repairing FLU data be +clobbered/overwritten with exactly the same data as the Update +Propagation Invariant preserving chain. If this rule is not strictly +enforced, then fill operations can corrupt Machi file data. The +algorithm proposed is: + +\begin{enumerate} + +\item Change the projection to a ``chain of chains'' configuration + such as depicted in Figure~\ref{fig:repair-chain-of-chains}. + +\item For all files on all FLUs in all chains, extract the lists of + written/unwritten byte ranges and their corresponding file data + checksums. (The checksum metadata is not strictly required for + recovery in AP Mode.) 
+ Send these lists to the tail of tails + $T_{tails}$, which will collate all of the lists into a list of + tuples such as {\tt \{FName, $O_{start}, O_{end}$, CSum, FLU\_List\}} + where {\tt FLU\_List} is the list of all FLUs in the entire chain of + chains where the bytes at the location {\tt \{FName, $O_{start}, + O_{end}$\}} are known to be written (as of the current repair period). + +\item For chain \#1 members, i.e., the + leftmost chain relative to Figure~\ref{fig:repair-chain-of-chains}, + repair files byte ranges for any chain \#1 members that are not members + of the {\tt FLU\_List} set. This will repair any partial + writes to chain \#1 that were unsuccessful (e.g., client crashed). + (Note however that this step only repairs FLUs in chain \#1.) + +\item For all file byte ranges in all files on all FLUs in all + repairing chains where Tail \#1's value is unwritten, force all + repairing FLUs to also be unwritten. + +\item For file byte ranges in all files on all FLUs in all repairing + chains where Tail \#1's value is written, send repair file byte data + \& metadata to any repairing FLU if the value repairing FLU's + value is unwritten or the checksum is not exactly equal to Tail \#1's + checksum. + +\end{enumerate} + +\begin{figure} +\centering +$ +[\overbrace{\underbrace{H_1}_\textbf{Head}, M_{11}, T_1, + H_2, M_{21}, T_2, + \ldots + H_n, M_{n1}, + \underbrace{T_n}_\textbf{Tail}}^\textbf{Chain (U.P.~Invariant preserving)} +] +$ +\caption{Representation of Figure~\ref{fig:repair-chain-of-chains} + after all repairs have finished successfully and a new projection has + been calculated.} +\label{fig:repair-chain-of-chains-finished} +\end{figure} + +When the repair is known to have copied all missing data successfully, +then the chain can change state via a new projection that includes the +repaired FLU(s) at the end of the U.P.~Invariant preserving chain \#1 +in the same order in which they appeared in the chain-of-chains during +repair. See Figure~\ref{fig:repair-chain-of-chains-finished}. + +The repair can be coordinated and/or performed by the $T_{tails}$ FLU +or any other FLU or cluster member that has spare capacity. + +There is no serious race condition here between the enumeration steps +and the repair steps. Why? Because the change in projection at +step \#1 will force any new data writes to adapt to a new projection. +Consider the mutations that either happen before or after a projection +change: + + +\begin{itemize} + +\item For all mutations $M_1$ prior to the projection change, the + enumeration steps \#3 \& \#4 and \#5 will always encounter mutation + $M_1$. Any repair must write through the entire chain-of-chains and + thus will preserve the Update Propagation Invariant when repair is + finished. + +\item For all mutations $M_2$ starting during or after the projection + change has finished, a new mutation $M_2$ may or may not be included in the + enumeration steps \#3 \& \#4 and \#5. + However, in the new projection, $M_2$ must be + written to all chain of chains members, and such + in-order writes will also preserve the Update + Propagation Invariant and therefore is also be safe. + +\end{itemize} + +%% Then the only remaining safety problem (as far as I can see) is +%% avoiding this race: + +%% \begin{enumerate} +%% \item Enumerate byte ranges $[B_0,B_1,\ldots]$ in file $F$ that must +%% be copied to the repair target, based on checksum differences for +%% those byte ranges. 
+%% \item A real-time concurrent write for byte range $B_x$ arrives at the +%% U.P.~Invariant preserving chain for file $F$ but was not a member of +%% step \#1's list of byte ranges. +%% \item Step \#2's update is propagated down the chain of chains. +%% \item Step \#1's clobber updates are propagated down the chain of +%% chains. +%% \item The value for $B_x$ is lost on the repair targets. +%% \end{enumerate} + +\paragraph{In cases the cluster is operating in AP Mode:} + +\begin{enumerate} +\item Follow the first two steps of the ``CP Mode'' + sequence (above). +\item Follow step \#3 of the ``strongly consistent mode'' sequence + (above), but in place of repairing only FLUs in Chain \#1, AP mode + will repair the byte range of any FLU that is not a member of the + {\tt FLU\_List} set. +\item End of procedure. +\end{enumerate} + +The end result is a huge ``merge'' where any +{\tt \{FName, $O_{start}, O_{end}$\}} range of bytes that is written +on FLU $F_w$ but missing/unwritten from FLU $F_m$ is written down the full chain +of chains, skipping any FLUs where the data is known to be written. +Such writes will also preserve Update Propagation Invariant when +repair is finished. + +\subsubsection{Whole-file repair when changing FLU ordering within a chain} +\label{sub:repair-chain-re-ordering} + +Changing FLU order within a chain is an operations optimization only. +It may be that the administrator wishes the order of a chain to remain +as originally configured during steady-state operation, e.g., +$[F_a,F_b,F_c]$. As FLUs are stopped \& restarted, the chain may +become re-ordered in a seemingly-arbitrary manner. + +It is certainly possible to re-order the chain, in a kludgy manner. +For example, if the desired order is $[F_a,F_b,F_c]$ but the current +operating order is $[F_c,F_b,F_a]$, then remove $F_b$ from the chain, +then add $F_b$ to the end of the chain. Then repeat the same +procedure for $F_c$. The end result will be the desired order. + +From an operations perspective, re-ordering of the chain +using this kludgy manner has a +negative effect on availability: the chain is temporarily reduced from +operating with $N$ replicas down to $N-1$. This reduced replication +factor will not remain for long, at most a few minutes at a time, but +even a small amount of time may be unacceptable in some environments. + +Reordering is possible with the introduction of a ``temporary head'' +of the chain. This temporary FLU does not need to be a full replica +of the entire chain --- it merely needs to store replicas of mutations +that are made during the chain reordering process. This method will +not be described here. However, {\em if reviewers believe that it should +be included}, please let the authors know. + +\paragraph{In both Machi operating modes:} +After initial implementation, it may be that the repair procedure is a +bit too slow. In order to accelerate repair decisions, it would be +helpful have a quicker method to calculate which files have exactly +the same contents. In traditional systems, this is done with a single +file checksum; see also Section~\ref{sub:detecting-corrupted}. +Machi's files can be written out-of-order from a file offset point of +view, which violates the order which the traditional method for +calculating a full-file hash. If we recall +Figure~\ref{fig:temporal-out-of-order}, the traditional method cannot +continue calculating the file checksum at offset 2 until the byte at +file offset 1 is written. 
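+
+To illustrate the kind of order-independent comparison that repair
+needs, the following sketch compares two FLUs' copies of a file using
+only the per-file written-range metadata that each FLU already
+maintains, rather than the raw byte stream.  (This is a minimal
+sketch only; the function name is hypothetical and is not part of any
+proposed Machi API.)
+
+\begin{verbatim}
+%% Each Ranges argument is a list of {Start, End, CSum} tuples, one
+%% tuple per written chunk of the file, as reported by one FLU.
+%% Sorting makes the comparison independent of the temporal order in
+%% which the chunks were originally written.
+same_file_contents(RangesA, RangesB) ->
+    lists:sort(RangesA) =:= lists:sort(RangesB).
+\end{verbatim}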
+ +It may be advantageous for each FLU to maintain for each file a +checksum of a canonical representation of the +{\tt \{$O_{start},O_{end},$ CSum\}} tuples that the FLU must already +maintain. Then for any two FLUs that claim to store a file $F$, if +both FLUs have the same hash of $F$'s written map + checksums, then +the copies of $F$ on both FLUs are the same. + +\section{``Split brain'' management in CP Mode} +\label{sec:split-brain-management} + +Split brain management is a thorny problem. The method presented here +is one based on pragmatics. If it doesn't work, there isn't a serious +worry, because Machi's first serious use case all require only AP Mode. +If we end up falling back to ``use Riak Ensemble'' or ``use ZooKeeper'', +then perhaps that's +fine enough. Meanwhile, let's explore how a +completely self-contained, no-external-dependencies +CP Mode Machi might work. + +Wikipedia's description of the quorum consensus solution\footnote{See + {\tt http://en.wikipedia.org/wiki/Split-brain\_(computing)}.} is nice +and short: + +\begin{quotation} +A typical approach, as described by Coulouris et al.,[4] is to use a +quorum-consensus approach. This allows the sub-partition with a +majority of the votes to remain available, while the remaining +sub-partitions should fall down to an auto-fencing mode. +\end{quotation} + +This is the same basic technique that +both Riak Ensemble and ZooKeeper use. Machi's +extensive use of write-registers are a big advantage when implementing +this technique. Also very useful is the Machi ``wedge'' mechanism, +which can automatically implement the ``auto-fencing'' that the +technique requires. All Machi servers that can communicate with only +a minority of other servers will automatically ``wedge'' themselves +and refuse all requests for service until communication with the +majority can be re-established. + +\subsection{The quorum: witness servers vs. full servers} + +In any quorum-consensus system, at least $2f+1$ participants are +required to survive $f$ participant failures. Machi can implement a +technique of ``witness servers'' servers to bring the total cost +somewhere in the middle, between $2f+1$ and $f+1$, depending on your +point of view. + +A ``witness server'' is one that participates in the network protocol +but does not store or manage all of the state that a ``full server'' +does. A ``full server'' is a Machi server as +described by this RFC document. A ``witness server'' is a server that +only participates in the projection store and projection epoch +transition protocol and a small subset of the file access API. +A witness server doesn't actually store any +Machi files. A witness server is almost stateless, when compared to a +full Machi server. + +A mixed cluster of witness and full servers must still contain at +least $2f+1$ participants. However, only $f+1$ of them are full +participants, and the remaining $f$ participants are witnesses. In +such a cluster, any majority quorum must have at least one full server +participant. + +Witness FLUs are always placed at the front of the chain. As stated +above, there may be at most $f$ witness FLUs. A functioning quorum +majority +must have at least $f+1$ FLUs that can communicate and therefore +calculate and store a new unanimous projection. Therefore, any FLU at +the tail of a functioning quorum majority chain must be full FLU. 
Full FLUs
+actually store Machi files, so they have no problem answering {\tt
+  read\_req} API requests.\footnote{We hope that it is now clear that
+  a witness FLU cannot answer any Machi file read API request.}
+
+Any FLU that can only communicate with a minority of other FLUs will
+find that it cannot calculate a new projection that includes a
+majority of FLUs.  Any such FLU, when in CP mode, would then move to
+wedge state and remain wedged until the network partition heals enough
+to communicate with the majority side.  This is a nice property: we
+automatically get ``fencing'' behavior.\footnote{Any FLU on the minority side
+  is wedged and therefore refuses to serve because it is, so to speak,
+  ``on the wrong side of the fence.''}
+
+There is one case where ``fencing'' may not happen: if both the client
+and the tail FLU are on the same minority side of a network partition.
+Assume the client and FLU $F_z$ are on the ``wrong side'' of a network
+split; both are using projection epoch $P_1$.  The tail of the
+chain is $F_z$.
+
+Also assume that the ``right side'' has reconfigured and is using
+projection epoch $P_2$.  The right side has mutated key $K$.  Meanwhile,
+nobody on the ``wrong side'' has noticed anything wrong and is happy to
+continue using projection $P_1$.
+
+\begin{itemize}
+\item {\bf Option a}: Now the wrong side client reads $K$ using $P_1$ via
+  $F_z$.  $F_z$ does not detect an epoch problem and thus returns an
+  answer.  Given our assumptions, this value is stale.  For some
+  client use cases, this kind of staleness may be OK in trade for
+  fewer network messages per read \ldots so Machi may
+  have a configurable option to permit it.
+\item {\bf Option b}: The wrong side client must confirm that $P_1$ is
+  in use by a full majority of chain members, including $F_z$.
+\end{itemize}
+
+Attempts using Option b will fail for one of two reasons.  First, if
+the client can talk to a FLU that is using $P_2$, the client's
+operation must be retried using $P_2$.  Second, the client will time
+out talking to enough FLUs so that it fails to get a quorum's worth of
+$P_1$ answers.  In either case, Option b will always fail a client
+read and thus cannot return a stale value of $K$.
+
+\subsection{Witness FLU data and protocol changes}
+
+Some small changes to the projection's data structure
+(Figure~\ref{fig:projection}) are required.  The projection itself
+needs a new annotation to indicate the operating mode, AP mode or CP
+mode.  This mode annotation tells the auto-administration service how
+to react to network partitions, how to calculate new, safe projection
+transitions, and which file repair mode to use
+(Section~\ref{sub:repair-entire-files}).
+Also, we need to label member FLU servers as full- or
+witness-type servers.
+
+Write API requests are processed by witness servers in {\em almost but
+  not quite} no-op fashion.  The only requirement of a witness server
+is to return correct interpretations of local projection epoch
+numbers, via the {\tt error\_bad\_epoch} and {\tt error\_wedged} error
+codes.  In fact, a new API call is sufficient for querying witness
+servers: {\tt \{check\_epoch, m\_epoch()\}}.
+Any client write operation sends the {\tt
+  check\_\-epoch} API command to witness FLUs and sends the usual {\tt
+  write\_\-req} command to full FLUs.
+
+\section{On-disk storage and file corruption detection}
+\label{sec:on-disk}
+
+An individual FLU has a couple of goals: store file data and metadata
+as efficiently as possible, and make it easy to detect and fix file
+corruption. 
+ +FLUs have a lot of flexibility to implement their on-disk data formats in +whatever manner allow them to be safe and fast. Any format that +allows safe management of file names, per-file data chunks, and +per-data-chunk metadata is sufficient. + +\subsection{First draft/strawman proposal for on-disk data format} +\label{sub:on-disk-data-format} + +{\bf NOTE:} The suggestions in this section are ``strawman quality'' +only. + +\begin{figure*} +\begin{verbatim} +|<--- Data section --->|<---- Metadata section (starts at fixed offset) ----> + |<- trailer --> +V1,C1 | V2,C2 | ||| C1t,O1a,O1z,C1 | C2t,O2a,O2z,C2 | Summ | SummBytes |eof + |<- trailer --> +V1,C1 | V2,C2 | V3,C3 ||| C1t,O1a,O1z,C1 | C2t,O2a,O2z,C2 | C3t,O3a,O3z,C3 | Summ | SummBytes |eof +\end{verbatim} +\caption{File format draft \#1, a snapshot at two different times.} +\label{fig:file-format-d1} +\end{figure*} + +See Figure~\ref{fig:file-format-d1} for an example file layout. +Prominent features are: + +\begin{itemize} +\item The data section is a fixed size, e.g. 1 GByte, so the metadata + section is known to start at a particular offset. + The sequencers on all FLUs must also be aware of of this file size + limit. +\item Data section $V_n,C_n$ tuples: client-written data plus the 20 + byte SHA1 hash of that data, concatenated. The client must be aware + that the hash is the final 20 bytes of the value that it reads + \ldots but this feels like a small price to pay to have the checksum + co-located exactly adjacent to the data that it protects. + The client may elect not to store the checksum explicitly in the + file body, knowing that there is likely a performance penalty when + it wishes to fetch the checksum via the file metadata API. +\item Metadata section $C_{nt},O_{na},O_{nz},C_n$ tuples: + The chunk's + checksum type (e.g. SHA1 for all but the final + 20 bytes),\footnote{Other types may include: no checksum, checksum + of the entire value, and checksums using other hash algorithms.} + the starting + offset (``a''), ending offset (``z'') of a chunk, and the + chunk's SHA1 checksum (which is intentionally duplicated in this + example in both sections). The approximate size is + $4 + 4 + 1 + 20 = 25$ bytes per metadata entry. +\item Metadata section {\tt Summ}: a compact summary of the + unwritten/written status of all bytes in the file, e.g., using byte + range encoding for contiguous regions of writes. +\item Metadata section {\tt SummBytes}: the number of bytes backward + to look for the start of the {\tt Summ} summary. +\item {\tt eof} The end of file. +\end{itemize} + +When a chunk write is requested by a client, the FLU must verify that +the byte range has entirely ``unwritten'' status. If that information +is not cached by the FLU somehow, it can be easily read by reading the +trailer, which is always positioned at the end of the file. + +If the FLU is queried for checksum information and/or chunk boundary +information, and that info is not cached, then the FLU can simply read +all data beyond the start of the metadata section. For a 1 GByte file +written in 1 MByte chunks, the metadata section +would be approximately 25 KBytes. For 4 KByte pages (CORFU style), the +metadata section would be approximately 6.4 MBytes. + +Each time that a new chunk(s) is written within the data section, no +matter its offset, the old {\tt Summ} and {\tt SummBytes} trailer is +overwritten by the offset$+$checksum metadata for the new chunk(s) +followed by the new trailer. 
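+
+As a strawman-quality sketch of that trailer rewrite (all encodings,
+field widths, and names below are illustrative assumptions, not part
+of the proposed format):
+
+\begin{verbatim}
+%% Build the bytes that overwrite the old trailer: one metadata entry
+%% per newly written chunk, then the new Summ, then SummBytes.
+new_metadata_suffix(NewChunks, Summ) when is_binary(Summ) ->
+    Entries = [ <<CSumType:8, Start:32, End:32, CSum/binary>>
+                || {CSumType, Start, End, CSum} <- NewChunks ],
+    [Entries, Summ, <<(byte_size(Summ)):32>>].
+\end{verbatim}
+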
Overwriting the trailer is justified in +that if corruption happens in the metadata section, the +system's worst-case reaction would be as if +the corruption had happened in the data section: the file +is invalid, and Machi will repair the file from another replica. +A more likely scenario is that some early part of the file is correct, +and only a part of the end of the file requires repair from another +replica. + +\subsection{If the client does not provide a checksum?} + +If the client doesn't provide a checksum, then it's almost certainly a +good idea to have the FLU calculate the checksum before writing. The +$C_t$ value should be a type that indicates that the checksum was not +calculated by the client. In all other fields, the metadata section +data would be identical. + +\subsection{Detecting corrupted files (``checksum scrub'')} +\label{sub:detecting-corrupted} + +This task is a bit more difficult than with a typical append-only, +file-written-in-order file. In most append-only situations, the file +is really written in a strict order, both temporally and spatially, +from offset 0 to the (eventual) +end-of-file. The order in which the bytes were written is the same +order as the bytes are fed into a checksum or +hashing function, such as SHA1. + +However, a Machi file is not written strictly in order from offset 0 +to some larger offset. Machi's append-only file guarantee is +{\em guaranteed in space, i.e., the offset within the file} and is +definitely {\em not guaranteed in time}. + +The file format proposed in Figure~\ref{fig:file-format-d1} +contains the checksum of each client write, using the checksum value +that the client or the FLU provides. A FLU could then: + +\begin{enumerate} +\item Read the metadata section to discover all written chunks and + their checksums. +\item For each written chunk, read the chunk and calculate the + checksum (with the same algorithm specified by the metadata). +\item For any checksum mismatch, ask the FLU to trigger a repair from + another FLU in the chain. +\end{enumerate} + +The corruption detection should run at a lower priority than normal +FLU activities. FLUs should implement a basic rate limiting +mechanism. + +FLUs should also be able to schedule their checksum scrubbing activity +periodically and limit their activity to certain times, per a +only-as-complex-as-it-needs-to-be administrative policy. + +\section{The safety of projection epoch transitions} +\label{sec:safety-of-transitions} + +Machi uses the projection epoch transition algorithm and +implementation from CORFU, which is believed to be safe. However, +CORFU assumes a single, external, strongly consistent projection +store. Further, CORFU assumes that new projections are calculated by +an oracle that the rest of the CORFU system agrees is the sole agent +for creating new projections. Such an assumption is impractical for +Machi's intended purpose. + +Machi could use Riak Ensemble or ZooKeeper as an oracle (or perhaps as a oracle +coordinator), but we wish to keep Machi free of big external +dependencies. We would also like to see Machi be able to +operate in an ``AP mode'', which means providing service even +if all network communication to an oracle is broken. + +The model of projection calculation and storage described in +Section~\ref{sec:projections} allows for each server to operate +independently, if necessary. 
This autonomy allows the server in AP +mode to +always accept new writes: new writes are written to unique file names +and unique file offsets using a chain consisting of only a single FLU, +if necessary. How is this possible? Let's look at a scenario in +Section~\ref{sub:split-brain-scenario}. + +\subsection{A split brain scenario} +\label{sub:split-brain-scenario} + +\begin{enumerate} + +\item Assume 3 Machi FLUs, all in good health and perfect data sync: $[F_a, + F_b, F_c]$ using projection epoch $P_p$. + +\item Assume data $D_0$ is written at offset $O_0$ in Machi file + $F_0$. + +\item Then a network partition happens. Servers $F_a$ and $F_b$ are + on one side of the split, and server $F_c$ is on the other side of + the split. We'll call them the ``left side'' and ``right side'', + respectively. + +\item On the left side, $F_b$ calculates a new projection and writes + it unanimously (to two projection stores) as epoch $P_B+1$. The + subscript $_B$ denotes a + version of projection epoch $P_{p+1}$ that was created by server $F_B$ + and has a unique checksum (used to detect differences after the + network partition heals). + +\item In parallel, on the right side, $F_c$ calculates a new + projection and writes it unanimously (to a single projection store) + as epoch $P_c+1$. + +\item In parallel, a client on the left side writes data $D_1$ + at offset $O_1$ in Machi file $F_1$, and also + a client on the right side writes data $D_2$ + at offset $O_2$ in Machi file $F_2$. We know that $F_1 \ne F_2$ + because each sequencer is forced to choose disjoint filenames from + any prior epoch whenever a new projection is available. + +\end{enumerate} + +Now, what happens when various clients attempt to read data values +$D_0$, $D_1$, and $D_2$? + +\begin{itemize} +\item All clients can read $D_0$. +\item Clients on the left side can read $D_1$. +\item Attempts by clients on the right side to read $D_1$ will get + {\tt error\_unavailable}. +\item Clients on the right side can read $D_2$. +\item Attempts by clients on the left side to read $D_2$ will get + {\tt error\_unavailable}. +\end{itemize} + +The {\tt error\_unavailable} result is not an error in the CAP Theorem +sense: it is a valid and affirmative response. In both cases, the +system on the client's side definitely knows that the cluster is +partitioned. If Machi were not a write-once store, perhaps there +might be an old/stale value to read on the local side of the network +partition \ldots but the system also knows definitely that no +old/stale value exists. Therefore Machi remains available in the +CAP Theorem sense both for writes and reads. + +We know that all files $F_0$, +$F_1$, and $F_2$ are disjoint and can be merged (in a manner analogous +to set union) onto each server in $[F_a, F_b, F_c]$ safely +when the network partition is healed. However, +unlike pure theoretical set union, Machi's data merge \& repair +operations must operate within some constraints that are designed to +prevent data loss. + +\subsection{Aside: defining data availability and data loss} +\label{sub:define-availability} + +Let's take a moment to be clear about definitions: + +\begin{itemize} +\item ``data is available at time $T$'' means that data is available + for reading at $T$: the Machi cluster knows for certain that the + requested data is not been written or it is written and has a single + value. +\item ``data is unavailable at time $T$'' means that data is + unavailable for reading at $T$ due to temporary circumstances, + e.g. network partition. 
If a read request is issued at some time + after $T$, the data will be available. +\item ``data is lost at time $T$'' means that data is permanently + unavailable at $T$ and also all times after $T$. +\end{itemize} + +Chain Replication is a fantastic technique for managing the +consistency of data across a number of whole replicas. There are, +however, cases where CR can indeed lose data. + +\subsection{Data loss scenario \#1: too few servers} +\label{sub:data-loss1} + +If the chain is $N$ servers long, and if all $N$ servers fail, then +of course data is unavailable. However, if all $N$ fail +permanently, then data is lost. + +If the administrator had intended to avoid data loss after $N$ +failures, then the administrator would have provisioned a Machi +cluster with at least $N+1$ servers. + +\subsection{Data Loss scenario \#2: bogus configuration change sequence} +\label{sub:data-loss2} + +Assume that the sequence of events in Figure~\ref{fig:data-loss2} takes place. + +\begin{figure} +\begin{enumerate} +%% NOTE: the following list 9 items long. We use that fact later, see +%% string YYY9 in a comment further below. If the length of this list +%% changes, then the counter reset below needs adjustment. +\item Projection $P_p$ says that chain membership is $[F_a]$. +\item A write of data $D$ to file $F$ at offset $O$ is successful. +\item Projection $P_{p+1}$ says that chain membership is $[F_a,F_b]$, via + an administration API request. +\item Machi will trigger repair operations, copying any missing data + files from FLU $F_a$ to FLU $F_b$. For the purpose of this + example, the sync operation for file $F$'s data and metadata has + not yet started. +\item FLU $F_a$ crashes. +\item The auto-administration monitor on $F_b$ notices $F_a$'s crash, + decides to create a new projection $P_{p+2}$ where chain membership is + $[F_b]$ + successfully stores $P_{p+2}$ in its local store. FLU $F_b$ is now wedged. +\item FLU $F_a$ is down, therefore the + value of $P_{p+2}$ is unanimous for all currently available FLUs + (namely $[F_b]$). +\item FLU $F_b$ sees that projection $P_{p+2}$ is the newest unanimous + projection. It unwedges itself and continues operation using $P_{p+2}$. +\item Data $D$ is definitely unavailable for now, perhaps lost forever? +\end{enumerate} +\caption{Data unavailability scenario with danger of permanent data loss} +\label{fig:data-loss2} +\end{figure} + +At this point, the data $D$ is not available on $F_b$. However, if +we assume that $F_a$ eventually returns to service, and Machi +correctly acts to repair all data within its chain, then $D$ +all of its contents will be available eventually. + +However, if server $F_a$ never returns to service, then $D$ is lost. The +Machi administration API must always warn the user that data loss is +possible. In Figure~\ref{fig:data-loss2}'s scenario, the API must +warn the administrator in multiple ways that fewer than the full {\tt + length(all\_members)} number of replicas are in full sync. + +A careful reader should note that $D$ is also lost if step \#5 were +instead, ``The hardware that runs FLU $F_a$ was destroyed by fire.'' +For any possible step following \#5, $D$ is lost. This is data loss +for the same reason that the scenario of Section~\ref{sub:data-loss1} +happens: the administrator has not provisioned a sufficient number of +replicas. + +Let's revisit Figure~\ref{fig:data-loss2}'s scenario yet again. 
This time, we add a final step at the end of the sequence:
+
+\begin{enumerate}
+\setcounter{enumi}{9}   % YYY9
+\item The administration API is used to change the chain
+configuration to {\tt all\_members=$[F_b]$}.
+\end{enumerate}
+
+Step \#10 causes data loss.  Specifically, the only copy of file
+$F$ is on FLU $F_a$.  By administration policy, FLU $F_a$ is now
+permanently inaccessible.
+
+The auto-administration monitor {\em must} keep track of all
+repair operations and their status.  If such information is tracked by
+all FLUs, then data loss by bogus administrator action can be
+prevented.  In this scenario, FLU $F_b$ knows that $F_a \rightarrow
+F_b$ repair has not yet finished and therefore it is unsafe to remove
+$F_a$ from the cluster.
+
+\subsection{Data Loss scenario \#3: chain replication repair done badly}
+\label{sub:data-loss3}
+
+It's quite possible to lose data through careless/buggy Chain
+Replication chain configuration changes.  For example, in the split
+brain scenario of Section~\ref{sub:split-brain-scenario}, we have two
+pieces of data written to different ``sides'' of the split brain,
+$D_1$ and $D_2$.  If the chain is naively reconfigured after the network
+partition heals to be $[F_a=\emptyset,F_b=\emptyset,F_c=D_2],$\footnote{Where $\emptyset$
+  denotes the unwritten value.} then $D_2$
+is in danger of being lost.  Why?
+The Update Propagation Invariant is violated.
+Any Chain Replication read will be
+directed to the tail, $F_c$.  The value exists there, so there is no
+need to do any further work; the unwritten values at $F_a$ and $F_b$
+will not be repaired.  If the $F_c$ server fails sometime
+later, then $D_2$ will be lost.  Section~\ref{sec:repair} discusses
+how data loss can be avoided after servers are added (or re-added) to
+an active chain configuration.
+
+\subsection{Summary}
+
+We believe that maintaining the Update Propagation Invariant is a
+hassle and a pain, but that hassle and pain are well worth the
+sacrifices required to maintain the invariant at all times.  It avoids
+data loss in all cases where the U.P.~Invariant preserving chain
+contains at least one FLU.
+
+\section{Load balancing read vs. write ops}
+\label{sec:load-balancing}
+
+Consistent reads in Chain Replication require reading only from the
+tail of the chain.  This requirement can cause workload imbalances for
+any chain longer than length one under high read-only workloads.  For
+example, for chain $[F_a, F_b, F_c]$ and a 100\% read-only workload,
+FLUs $F_a$ and $F_b$ will be completely idle, and FLU $F_c$ must
+handle all of the workload.
+
+CORFU suggests a strategy of rotating the chain every so often, e.g.,
+rotating the chain members every 10K or 20K pages or so.  In this
+manner, the head and tail roles would rotate in a deterministic
+way and balance the workload evenly.\footnote{If we ignore cases of
+  small numbers of extremely ``hot''/frequently-accessed pages.}
+
+The same scheme could be applied pretty easily to the Machi projection
+data structure.  For example, using a rotation ``stripe'' of 1 MByte,
+any write where the offset satisfies $O \textit{ div } 1024^2 = 0$ would use chain
+variation $[F_a, F_b, F_c]$, any write where $O \textit{ div } 1024^2 = 1$ would use chain
+variation $[F_b, F_c, F_a]$, and so on.  In some use cases, e.g., if the first
+1 MByte of a file were always ``hot'', this simple scheme would be
+insufficient. 
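+
+Setting that caveat aside, the simple rotation scheme is small enough
+to sketch directly (a minimal sketch only; the function and variable
+names are hypothetical, and the 1 MByte stripe size is an assumption):
+
+\begin{verbatim}
+%% Pick the chain variation for a read or write at byte offset Offset,
+%% rotating the configured chain once per 1 MByte stripe.
+chain_for_offset(Offset, Chain) ->
+    StripeSize = 1024 * 1024,
+    Rotation = (Offset div StripeSize) rem length(Chain),
+    {Front, Back} = lists:split(Rotation, Chain),
+    Back ++ Front.
+\end{verbatim}
+
+With such a function, the head and tail roles rotate deterministically
+from stripe to stripe, so a read-mostly workload is spread across all
+chain members instead of landing entirely on a single tail.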
+
+Other more complicated striping solutions can be applied.\footnote{It
+  may not be worth discussing any of them here, but SLF has several
+  ideas of how to do it.}  All have the problem of ``tearing'' a byte
+range write into two pieces, if that byte range straddles a stripe
+boundary, e.g., $\{1024^2 - 1, 1024^2 + 1\}$.  It feels
+like the cost of a few torn writes (relative to the entire file size)
+should be fairly low.  In cases like CORFU, where the stripe size
+is an exact multiple of the page size, torn writes cannot happen
+\ldots and it is likely that the CORFU use case is the one most likely
+to require this kind of load balancing.
+
+\section{Integration strategy with Riak Core and other distributed systems}
+\label{sec:integration}
+
+We assume that any integration technique is able to perform extremely basic
+parsing of the file names that Machi sequencers create.  The example
+shown in Section~\ref{sub:sequencer-divergence} depicts a client write
+specifying the file prefix {\tt "foo"}; Machi assigns that write to a
+file name such as:
+\begin{quote}
+{\tt "foo.m=machi4.s=flu-A.n=72006"}
+\end{quote}
+
+Given a Machi file name, the client-specified prefix will always be
+easily parseable, e.g., all characters to the left of the first
+dot/period character.  However, anything following the separator
+character should strictly be considered opaque.
+
+\subsection{Machi and the Riak Core ring}
+\label{sub:integration-riak-core}
+
+\paragraph{Simplest scheme:}
+Get rid of the power-of-2 partition number restriction of the Riak
+Core ring data structure.  Have exactly one partition per Machi
+cluster, where the ring data includes each Machi cluster name.  We
+{\em don't bother} using successive partitions on the ring for
+deciding the membership of any of the Machi clusters: that is a Riak KV
+style pattern that is not applicable here.
+
+Also, it would be handy to remove the current Core assumption of equal
+partition sizes.
+
+Parse the Machi file name $F$ (per above) to find the original
+file prefix $F_{prefix}$ given to Machi at write time.
+Hash the empty bucket {\tt <<>>} and key $F_{prefix}$ to
+calculate the preflist.  Take only the head of
+the preflist, which names the Machi cluster $M$ that stores $F$.  Ask
+one of $M$'s nodes for the current projection (if not already cached).
+Then fetch the desired byte range(s) from $F$.
+
+To add/remove Machi clusters, use ring resizing.
+
+\subsection{Machi and Random Slicing}
+\label{sub:integration-random-slicing}
+
+\paragraph{Simplest scheme:}
+Instead of using the machinery of Riak Core to hash a Machi file name
+$F$ to some Machi cluster $M$, we suggest Random Slicing
+\cite{random-slicing}.  It appears that \cite{random-slicing} was
+co-invented at about the same time that Hibari
+\cite{cr-theory-and-practice} implemented it.
+
+The data structure to describe a Random Slicing scheme is pretty
+small, about 100 KBytes in a convenient but space-inefficient
+representation in Erlang.  A pure function with domain of Machi file
+name plus Random Slicing map and range of all available Machi clusters
+is straightforward.
+
+Parse the Machi file name $F$ (per above) to find the original
+file prefix $F_{prefix}$ given to Machi at write time.
+To move/relocate files from one Machi cluster to another, two different
+Random Slicing maps, $RSM_{old}$ and $RSM_{new}$, are used.  For each Machi file
+in all Machi clusters, if
+%% Break the math mode below to make line breaks easier.....
+$MAP(F_{prefix},$ $RSM_{old})$ $=$ $MAP(F_{prefix},$ $RSM_{new})$, +then the file does not need to move. + +A file migration process iterates over all files where the value of +$MAP(F, RSM_{new})$ differs. All Machi files are immutable, which +makes the coordination effort much easier than many other distributed +systems. For file lookup, try using the $RSM_{new}$ first. If the +file doesn't exist there, use $RSM_{old})$. An honest race may +then force a second attempt with $RSM_{new}$ again. + +Multiple migrations can be concurrent, at the expense of additional +latency. The generalization of the move/relocate algorithm above is: + +\begin{enumerate} +\item For each $RSM_j$ mapping for the ``new'' location map list, + query the Machi cluster $MAP(F_{prefix}, RSM_j)$ and take the + first {\tt \{ok,\ldots\}} response. +\item For each $RSM_i$ mapping for the ``old'' location map list, + query the Machi cluster $MAP(F_{prefix}, RSM_i)$ and take the + first {\tt \{ok,\ldots\}} response. +\item To deal with races when moving files and then removing them from + the ``old'' locations, perform step \#1 again to look in the new + location(s). +\item If the data is not found at this stage, then the data does not exist. +\end{enumerate} + +\section{Recommended reading \& related work} + +A big reason for the large size of this document is that it includes a +lot of background information. +Basho people tend to be busy, and sitting down to +read 4--6 research papers to get familiar with a topic \ldots doesn't +happen very quickly. We recommend you read the papers mentioned in +this section and in the ``References'' at the end, but if our job is +done well enough, it isn't necessary. + +Familiarity with the CAP Theorem, the concepts \& semantics \& +trade-offs of eventual consistency and strong consistency in the +context of asynchronous distributed systems, network partitions and +failure detection in asynchronous distributed systems, and ``split +brain'' syndrome are all assumed.\footnote{Heh, let's see how well +{\em the authors} actually know those things\ldots.} + +The replication protocol for Machi is based almost entirely on the CORFU +ordered log protocol \cite{corfu1}. If the reader is familiar with +the content of this paper, understanding the implementation details of +Machi will be easy. The longer paper \cite{corfu2} goes into much +more detail -- developers are strongly recommended to read this paper +also. + +CORFU is, in turn, a very close cousin of the Paxos distributed +consensus protocol \cite{paxos-made-simple}. Understanding Paxos is +not required for understanding Machi, but reading about it can certainly +increase your good karma. + +CORFU also uses the Chain Replication algorithm +\cite{chain-replication}. This paper is recommended for Machi +developers who need to understand the guarantees and restrictions of +the protocol. For other readers, it is recommended for good karma. + + Machi's function +roughly corresponds to the Windows Azure Storage (WAS) paper \cite{was} +``stream layer'' as described in section~4. +The main features from that section that WAS does support are file +distribution/sharding across multiple servers and erasure coding; both +are explicitly outside of Machi's scope. + +The Kafka paper \cite{kafka} is highly recommended reading for why +you'd want to have an ordered log service and how you'd build one +(though this particular paper is too short to describe how it's +actually done). 
+Machi feels like a better foundation to build a +distributed immutable file store than Kafka's internals, but +that's debate for another forum. The blog posting by Kreps +\cite{the-log-what} is long but does a good job of explaining +the why and how of using a strongly ordered distributed log to build +complicated-seeming distributed systems in an easy way. + +The Hibari paper \cite{cr-theory-and-practice} describes some of the +implementation details of chain replication that are not explored in +detail in the CR paper. It is also recommended for Machi developers, +especially sections 2 and 12. + +\bibliographystyle{abbrvnat} +\begin{thebibliography}{} +\softraggedright + +\bibitem{elastic-chain-replication} +Abu-Libdeh, Hussam et al. +Leveraging Sharding in the Design of Scalable Replication Protocols. +Proceedings of the 4th Annual Symposium on Cloud Computing (SOCC'13), 2013. +{\tt http://www.ymsir.com/papers/sharding-socc.pdf} + +\bibitem{corfu1} +Balakrishnan, Mahesh et al. +CORFU: A Shared Log Design for Flash Clusters. +Proceedings of the 9th USENIX Conference on Networked Systems Design +and Implementation (NSDI'12), 2012. +{\tt http://research.microsoft.com/pubs/157204/ corfumain-final.pdf} + +\bibitem{corfu2} +Balakrishnan, Mahesh et al. +CORFU: A Distributed Shared Log +ACM Transactions on Computer Systems, Vol. 31, No. 4, Article 10, December 2013. +{\tt http://www.snookles.com/scottmp/corfu/ corfu.a10-balakrishnan.pdf} + +\bibitem{was} +Calder, Brad et al. +Windows Azure Storage: A Highly Available Cloud Storage Service with Strong Consistency +Proceedings of the 23rd ACM Symposium on Operating Systems Principles (SOSP'11), 2011. +{\tt http://sigops.org/sosp/sosp11/current/ 2011-Cascais/printable/11-calder.pdf} + +\bibitem{cr-theory-and-practice} +Fritchie, Scott Lystig. +Chain Replication in Theory and in Practice. +Proceedings of the 9th ACM SIGPLAN Workshop on Erlang (Erlang'10), 2010. +{\tt http://www.snookles.com/scott/publications/ erlang2010-slf.pdf} + +\bibitem{the-log-what} +Kreps, Jay. +The Log: What every software engineer should know about real-time data's unifying abstraction +{\tt http://engineering.linkedin.com/distributed- + systems/log-what-every-software-engineer-should- + know-about-real-time-datas-unifying} + +\bibitem{kafka} +Kreps, Jay et al. +Kafka: a distributed messaging system for log processing. +NetDB’11. +{\tt http://research.microsoft.com/en-us/UM/people/ + srikanth/netdb11/netdb11papers/netdb11-final12.pdf} + +\bibitem{paxos-made-simple} +Lamport, Leslie. +Paxos Made Simple. +In SIGACT News \#4, Dec, 2001. +{\tt http://research.microsoft.com/users/ lamport/pubs/paxos-simple.pdf} + +\bibitem{random-slicing} +Miranda, Alberto et al. +Random Slicing: Efficient and Scalable Data Placement for Large-Scale Storage Systems. +ACM Transactions on Storage, Vol. 10, No. 3, Article 9, July 2014. +{\tt http://www.snookles.com/scottmp/corfu/random- slicing.a9-miranda.pdf} + +\bibitem{porcupine} +Saito, Yasushi et al. +Manageability, availability and performance in Porcupine: a highly scalable, cluster-based mail service. +7th ACM Symposium on Operating System Principles (SOSP’99). +{\tt http://homes.cs.washington.edu/\%7Elevy/ porcupine.pdf} + +\bibitem{chain-replication} +van Renesse, Robbert et al. +Chain Replication for Supporting High Throughput and Availability. +Proceedings of the 6th Conference on Symposium on Operating Systems +Design \& Implementation (OSDI'04) - Volume 6, 2004. 
+{\tt http://www.cs.cornell.edu/home/rvr/papers/ osdi04.pdf} + +\end{thebibliography} + +%% \pagebreak + +%% \section{Appendix: MSC diagrams} +%% \label{sec:appendix-msc} + +\begin{figure*}[tp] +\resizebox{\textwidth}{!}{ + \includegraphics{append-flow} + } +\caption{MSC diagram: append 123 bytes onto a file with prefix {\tt + "foo"}. In error-free cases and with a correct cached projection, the + number of network messages is $2 + 2N$ where $N$ is chain length.} +\label{fig:append-flowMSC} +\end{figure*} + +\begin{figure*}[tp] +\resizebox{\textwidth}{!}{ + \includegraphics{read-flow} + } +\caption{MSC diagram: read 123 bytes from a file} +\label{fig:read-flowMSC} +\end{figure*} + +\begin{figure*}[tp] +\resizebox{\textwidth}{!}{ + \includegraphics{append-flow2} + } +\caption{MSC diagram: append 123 bytes onto a file with prefix {\tt + "foo"}, using FLU$\rightarrow$FLU direct communication in original + Chain Replication's messaging pattern. In error-free cases and with + a correct cached projection, the number of network messages is $N+1$ + where $N$ is chain length.} +\label{fig:append-flow2MSC} +\end{figure*} + + +\end{document} diff --git a/doc/src.high-level/read-flow.eps b/doc/src.high-level/read-flow.eps new file mode 100644 index 0000000..1a0edf4 --- /dev/null +++ b/doc/src.high-level/read-flow.eps @@ -0,0 +1,145 @@ +%!PS-Adobe-3.0 EPSF-2.0 +%%BoundingBox: 0 0 420.000000 166.599991 +%%Creator: mscgen 0.18 +%%EndComments +0.700000 0.700000 scale +0 0 moveto +0 238 lineto +600 238 lineto +600 0 lineto +closepath +clip +%PageTrailer +%Page: 1 1 +/Helvetica findfont +10 scalefont +setfont +/Helvetica findfont +12 scalefont +setfont +0 238 translate +/mtrx matrix def +/ellipse + { /endangle exch def + /startangle exch def + /ydia exch def + /xdia exch def + /y exch def + /x exch def + /savematrix mtrx currentmatrix def + x y translate + xdia 2 div ydia 2 div scale + 1 -1 scale + 0 0 1 startangle endangle arc + savematrix setmatrix +} def +(client) dup stringwidth +1.000000 1.000000 1.000000 setrgbcolor +pop dup dup newpath 75 -17 moveto 2 div neg 0 rmoveto 0 rlineto 0 11 rlineto neg 0 rlineto closepath fill +0.000000 0.000000 0.000000 setrgbcolor +75 -15 moveto dup stringwidth pop 2 div neg 0 rmoveto show +(Projection) dup stringwidth +1.000000 1.000000 1.000000 setrgbcolor +pop dup dup newpath 225 -17 moveto 2 div neg 0 rmoveto 0 rlineto 0 11 rlineto neg 0 rlineto closepath fill +0.000000 0.000000 0.000000 setrgbcolor +225 -15 moveto dup stringwidth pop 2 div neg 0 rmoveto show +(ProjStore_C) dup stringwidth +1.000000 1.000000 1.000000 setrgbcolor +pop dup dup newpath 375 -17 moveto 2 div neg 0 rmoveto 0 rlineto 0 11 rlineto neg 0 rlineto closepath fill +0.000000 0.000000 0.000000 setrgbcolor +375 -15 moveto dup stringwidth pop 2 div neg 0 rmoveto show +(FLU_C) dup stringwidth +1.000000 1.000000 1.000000 setrgbcolor +pop dup dup newpath 525 -17 moveto 2 div neg 0 rmoveto 0 rlineto 0 11 rlineto neg 0 rlineto closepath fill +0.000000 0.000000 0.000000 setrgbcolor +525 -15 moveto dup stringwidth pop 2 div neg 0 rmoveto show +newpath 75 -22 moveto 75 -49 lineto stroke +newpath 225 -22 moveto 225 -49 lineto stroke +newpath 375 -22 moveto 375 -49 lineto stroke +newpath 525 -22 moveto 525 -49 lineto stroke +newpath 75 -35 moveto 225 -35 lineto stroke +newpath 225 -35 moveto 215 -41 lineto stroke +(get current) dup stringwidth +1.000000 1.000000 1.000000 setrgbcolor +pop dup newpath 122 -33 moveto 0 rlineto 0 11 rlineto neg 0 rlineto closepath fill +0.000000 0.000000 0.000000 setrgbcolor +122 
-33 moveto show +newpath 75 -49 moveto 75 -76 lineto stroke +newpath 225 -49 moveto 225 -76 lineto stroke +newpath 375 -49 moveto 375 -76 lineto stroke +newpath 525 -49 moveto 525 -76 lineto stroke +newpath 225 -62 moveto 75 -62 lineto stroke +newpath 75 -62 moveto 85 -68 lineto stroke +(ok, #12...) dup stringwidth +1.000000 1.000000 1.000000 setrgbcolor +pop dup newpath 126 -60 moveto 0 rlineto 0 11 rlineto neg 0 rlineto closepath fill +0.000000 0.000000 0.000000 setrgbcolor +126 -60 moveto show +newpath 75 -76 moveto 75 -103 lineto stroke +newpath 225 -76 moveto 225 -103 lineto stroke +newpath 375 -76 moveto 375 -103 lineto stroke +newpath 525 -76 moveto 525 -103 lineto stroke +newpath 75 -89 moveto 525 -89 lineto stroke +newpath 525 -89 moveto 515 -95 lineto stroke +(read "foo.seq_a.009" offset=447 bytes=123 epoch=12) dup stringwidth +1.000000 1.000000 1.000000 setrgbcolor +pop dup newpath 157 -87 moveto 0 rlineto 0 11 rlineto neg 0 rlineto closepath fill +0.000000 0.000000 0.000000 setrgbcolor +157 -87 moveto show +newpath 75 -103 moveto 75 -130 lineto stroke +newpath 225 -103 moveto 225 -130 lineto stroke +newpath 375 -103 moveto 375 -130 lineto stroke +newpath 525 -103 moveto 525 -130 lineto stroke +newpath 525 -116 moveto 75 -116 lineto stroke +newpath 75 -116 moveto 85 -122 lineto stroke +1.000000 0.000000 0.000000 setrgbcolor +(bad_epoch, 13) dup stringwidth +1.000000 1.000000 1.000000 setrgbcolor +pop dup newpath 261 -114 moveto 0 rlineto 0 11 rlineto neg 0 rlineto closepath fill +1.000000 0.000000 0.000000 setrgbcolor +261 -114 moveto show +0.000000 0.000000 0.000000 setrgbcolor +newpath 75 -130 moveto 75 -157 lineto stroke +newpath 225 -130 moveto 225 -157 lineto stroke +newpath 375 -130 moveto 375 -157 lineto stroke +newpath 525 -130 moveto 525 -157 lineto stroke +newpath 75 -143 moveto 375 -143 lineto stroke +newpath 375 -143 moveto 365 -149 lineto stroke +(get epoch #13) dup stringwidth +1.000000 1.000000 1.000000 setrgbcolor +pop dup newpath 187 -141 moveto 0 rlineto 0 11 rlineto neg 0 rlineto closepath fill +0.000000 0.000000 0.000000 setrgbcolor +187 -141 moveto show +newpath 75 -157 moveto 75 -184 lineto stroke +newpath 225 -157 moveto 225 -184 lineto stroke +newpath 375 -157 moveto 375 -184 lineto stroke +newpath 525 -157 moveto 525 -184 lineto stroke +newpath 375 -170 moveto 75 -170 lineto stroke +newpath 75 -170 moveto 85 -176 lineto stroke +(ok, #13...) 
dup stringwidth +1.000000 1.000000 1.000000 setrgbcolor +pop dup newpath 201 -168 moveto 0 rlineto 0 11 rlineto neg 0 rlineto closepath fill +0.000000 0.000000 0.000000 setrgbcolor +201 -168 moveto show +newpath 75 -184 moveto 75 -211 lineto stroke +newpath 225 -184 moveto 225 -211 lineto stroke +newpath 375 -184 moveto 375 -211 lineto stroke +newpath 525 -184 moveto 525 -211 lineto stroke +newpath 75 -197 moveto 525 -197 lineto stroke +newpath 525 -197 moveto 515 -203 lineto stroke +(read "foo.seq_a.009" offset=447 bytes=123 epoch=13) dup stringwidth +1.000000 1.000000 1.000000 setrgbcolor +pop dup newpath 157 -195 moveto 0 rlineto 0 11 rlineto neg 0 rlineto closepath fill +0.000000 0.000000 0.000000 setrgbcolor +157 -195 moveto show +newpath 75 -211 moveto 75 -238 lineto stroke +newpath 225 -211 moveto 225 -238 lineto stroke +newpath 375 -211 moveto 375 -238 lineto stroke +newpath 525 -211 moveto 525 -238 lineto stroke +newpath 525 -224 moveto 75 -224 lineto stroke +newpath 75 -224 moveto 85 -230 lineto stroke +(ok, <<...123...>>) dup stringwidth +1.000000 1.000000 1.000000 setrgbcolor +pop dup newpath 257 -222 moveto 0 rlineto 0 11 rlineto neg 0 rlineto closepath fill +0.000000 0.000000 0.000000 setrgbcolor +257 -222 moveto show diff --git a/doc/src.high-level/sigplanconf.cls b/doc/src.high-level/sigplanconf.cls new file mode 100644 index 0000000..cbe4031 --- /dev/null +++ b/doc/src.high-level/sigplanconf.cls @@ -0,0 +1,1312 @@ +%----------------------------------------------------------------------------- +% +% LaTeX Class/Style File +% +% Name: sigplanconf.cls +% +% Purpose: A LaTeX 2e class file for SIGPLAN conference proceedings. +% This class file supercedes acm_proc_article-sp, +% sig-alternate, and sigplan-proc. +% +% Author: Paul C. Anagnostopoulos +% Windfall Software +% 978 371-2316 +% paul [atsign] windfall.com +% +% Created: 12 September 2004 +% +% Revisions: See end of file. +% +% This work is licensed under the Creative Commons Attribution License. +% To view a copy of this license, visit +% http://creativecommons.org/licenses/by/3.0/ +% or send a letter to Creative Commons, 171 2nd Street, Suite 300, +% San Francisco, California, 94105, U.S.A. +% +%----------------------------------------------------------------------------- + + +\NeedsTeXFormat{LaTeX2e}[1995/12/01] +\ProvidesClass{sigplanconf}[2013/07/02 v2.8 ACM SIGPLAN Proceedings] + +% The following few pages contain LaTeX programming extensions adapted +% from the ZzTeX macro package. + +% Token Hackery +% ----- ------- + + +\def \@expandaftertwice {\expandafter\expandafter\expandafter} +\def \@expandafterthrice {\expandafter\expandafter\expandafter\expandafter + \expandafter\expandafter\expandafter} + +% This macro discards the next token. + +\def \@discardtok #1{}% token + +% This macro removes the `pt' following a dimension. + +{\catcode `\p = 12 \catcode `\t = 12 + +\gdef \@remover #1pt{#1} + +} % \catcode + +% This macro extracts the contents of a macro and returns it as plain text. +% Usage: \expandafter\@defof \meaning\macro\@mark + +\def \@defof #1:->#2\@mark{#2} + +% Control Sequence Names +% ------- -------- ----- + + +\def \@name #1{% {\tokens} + \csname \expandafter\@discardtok \string#1\endcsname} + +\def \@withname #1#2{% {\command}{\tokens} + \expandafter#1\csname \expandafter\@discardtok \string#2\endcsname} + +% Flags (Booleans) +% ----- ---------- + +% The boolean literals \@true and \@false are appropriate for use with +% the \if command, which tests the codes of the next two characters. 
+ +\def \@true {TT} +\def \@false {FL} + +\def \@setflag #1=#2{\edef #1{#2}}% \flag = boolean + +% IF and Predicates +% -- --- ---------- + +% A "predicate" is a macro that returns \@true or \@false as its value. +% Such values are suitable for use with the \if conditional. For example: +% +% \if \@oddp{\x} \else \fi + +% A predicate can be used with \@setflag as follows: +% +% \@setflag \flag = {} + +% Here are the predicates for TeX's repertoire of conditional +% commands. These might be more appropriately interspersed with +% other definitions in this module, but what the heck. +% Some additional "obvious" predicates are defined. + +\def \@eqlp #1#2{\ifnum #1 = #2\@true \else \@false \fi} +\def \@neqlp #1#2{\ifnum #1 = #2\@false \else \@true \fi} +\def \@lssp #1#2{\ifnum #1 < #2\@true \else \@false \fi} +\def \@gtrp #1#2{\ifnum #1 > #2\@true \else \@false \fi} +\def \@zerop #1{\ifnum #1 = 0\@true \else \@false \fi} +\def \@onep #1{\ifnum #1 = 1\@true \else \@false \fi} +\def \@posp #1{\ifnum #1 > 0\@true \else \@false \fi} +\def \@negp #1{\ifnum #1 < 0\@true \else \@false \fi} +\def \@oddp #1{\ifodd #1\@true \else \@false \fi} +\def \@evenp #1{\ifodd #1\@false \else \@true \fi} +\def \@rangep #1#2#3{\if \@orp{\@lssp{#1}{#2}}{\@gtrp{#1}{#3}}\@false \else + \@true \fi} +\def \@tensp #1{\@rangep{#1}{10}{19}} + +\def \@dimeqlp #1#2{\ifdim #1 = #2\@true \else \@false \fi} +\def \@dimneqlp #1#2{\ifdim #1 = #2\@false \else \@true \fi} +\def \@dimlssp #1#2{\ifdim #1 < #2\@true \else \@false \fi} +\def \@dimgtrp #1#2{\ifdim #1 > #2\@true \else \@false \fi} +\def \@dimzerop #1{\ifdim #1 = 0pt\@true \else \@false \fi} +\def \@dimposp #1{\ifdim #1 > 0pt\@true \else \@false \fi} +\def \@dimnegp #1{\ifdim #1 < 0pt\@true \else \@false \fi} + +\def \@vmodep {\ifvmode \@true \else \@false \fi} +\def \@hmodep {\ifhmode \@true \else \@false \fi} +\def \@mathmodep {\ifmmode \@true \else \@false \fi} +\def \@textmodep {\ifmmode \@false \else \@true \fi} +\def \@innermodep {\ifinner \@true \else \@false \fi} + +\long\def \@codeeqlp #1#2{\if #1#2\@true \else \@false \fi} + +\long\def \@cateqlp #1#2{\ifcat #1#2\@true \else \@false \fi} + +\long\def \@tokeqlp #1#2{\ifx #1#2\@true \else \@false \fi} +\long\def \@xtokeqlp #1#2{\expandafter\ifx #1#2\@true \else \@false \fi} + +\long\def \@definedp #1{% + \expandafter\ifx \csname \expandafter\@discardtok \string#1\endcsname + \relax \@false \else \@true \fi} + +\long\def \@undefinedp #1{% + \expandafter\ifx \csname \expandafter\@discardtok \string#1\endcsname + \relax \@true \else \@false \fi} + +\def \@emptydefp #1{\ifx #1\@empty \@true \else \@false \fi}% {\name} + +\let \@emptylistp = \@emptydefp + +\long\def \@emptyargp #1{% {#n} + \@empargp #1\@empargq\@mark} +\long\def \@empargp #1#2\@mark{% + \ifx #1\@empargq \@true \else \@false \fi} +\def \@empargq {\@empargq} + +\def \@emptytoksp #1{% {\tokenreg} + \expandafter\@emptoksp \the#1\@mark} + +\long\def \@emptoksp #1\@mark{\@emptyargp{#1}} + +\def \@voidboxp #1{\ifvoid #1\@true \else \@false \fi} +\def \@hboxp #1{\ifhbox #1\@true \else \@false \fi} +\def \@vboxp #1{\ifvbox #1\@true \else \@false \fi} + +\def \@eofp #1{\ifeof #1\@true \else \@false \fi} + + +% Flags can also be used as predicates, as in: +% +% \if \flaga \else \fi + + +% Now here we have predicates for the common logical operators. 
+ +\def \@notp #1{\if #1\@false \else \@true \fi} + +\def \@andp #1#2{\if #1% + \if #2\@true \else \@false \fi + \else + \@false + \fi} + +\def \@orp #1#2{\if #1% + \@true + \else + \if #2\@true \else \@false \fi + \fi} + +\def \@xorp #1#2{\if #1% + \if #2\@false \else \@true \fi + \else + \if #2\@true \else \@false \fi + \fi} + +% Arithmetic +% ---------- + +\def \@increment #1{\advance #1 by 1\relax}% {\count} + +\def \@decrement #1{\advance #1 by -1\relax}% {\count} + +% Options +% ------- + + +\@setflag \@authoryear = \@false +\@setflag \@blockstyle = \@false +\@setflag \@copyrightwanted = \@true +\@setflag \@explicitsize = \@false +\@setflag \@mathtime = \@false +\@setflag \@natbib = \@true +\@setflag \@ninepoint = \@true +\newcount{\@numheaddepth} \@numheaddepth = 3 +\@setflag \@onecolumn = \@false +\@setflag \@preprint = \@false +\@setflag \@reprint = \@false +\@setflag \@tenpoint = \@false +\@setflag \@times = \@false + +% Note that all the dangerous article class options are trapped. + +\DeclareOption{9pt}{\@setflag \@ninepoint = \@true + \@setflag \@explicitsize = \@true} + +\DeclareOption{10pt}{\PassOptionsToClass{10pt}{article}% + \@setflag \@ninepoint = \@false + \@setflag \@tenpoint = \@true + \@setflag \@explicitsize = \@true} + +\DeclareOption{11pt}{\PassOptionsToClass{11pt}{article}% + \@setflag \@ninepoint = \@false + \@setflag \@explicitsize = \@true} + +\DeclareOption{12pt}{\@unsupportedoption{12pt}} + +\DeclareOption{a4paper}{\@unsupportedoption{a4paper}} + +\DeclareOption{a5paper}{\@unsupportedoption{a5paper}} + +\DeclareOption{authoryear}{\@setflag \@authoryear = \@true} + +\DeclareOption{b5paper}{\@unsupportedoption{b5paper}} + +\DeclareOption{blockstyle}{\@setflag \@blockstyle = \@true} + +\DeclareOption{cm}{\@setflag \@times = \@false} + +\DeclareOption{computermodern}{\@setflag \@times = \@false} + +\DeclareOption{executivepaper}{\@unsupportedoption{executivepaper}} + +\DeclareOption{indentedstyle}{\@setflag \@blockstyle = \@false} + +\DeclareOption{landscape}{\@unsupportedoption{landscape}} + +\DeclareOption{legalpaper}{\@unsupportedoption{legalpaper}} + +\DeclareOption{letterpaper}{\@unsupportedoption{letterpaper}} + +\DeclareOption{mathtime}{\@setflag \@mathtime = \@true} + +\DeclareOption{natbib}{\@setflag \@natbib = \@true} + +\DeclareOption{nonatbib}{\@setflag \@natbib = \@false} + +\DeclareOption{nocopyrightspace}{\@setflag \@copyrightwanted = \@false} + +\DeclareOption{notitlepage}{\@unsupportedoption{notitlepage}} + +\DeclareOption{numberedpars}{\@numheaddepth = 4} + +\DeclareOption{numbers}{\@setflag \@authoryear = \@false} + +%%%\DeclareOption{onecolumn}{\@setflag \@onecolumn = \@true} + +\DeclareOption{preprint}{\@setflag \@preprint = \@true} + +\DeclareOption{reprint}{\@setflag \@reprint = \@true} + +\DeclareOption{times}{\@setflag \@times = \@true} + +\DeclareOption{titlepage}{\@unsupportedoption{titlepage}} + +\DeclareOption{twocolumn}{\@setflag \@onecolumn = \@false} + +\DeclareOption*{\PassOptionsToClass{\CurrentOption}{article}} + +\ExecuteOptions{9pt,indentedstyle,times} +\@setflag \@explicitsize = \@false +\ProcessOptions + +\if \@onecolumn + \if \@notp{\@explicitsize}% + \@setflag \@ninepoint = \@false + \PassOptionsToClass{11pt}{article}% + \fi + \PassOptionsToClass{twoside,onecolumn}{article} +\else + \PassOptionsToClass{twoside,twocolumn}{article} +\fi +\LoadClass{article} + +\def \@unsupportedoption #1{% + \ClassError{proc}{The standard '#1' option is not supported.}} + +% This can be used with the 'reprint' option to get the final 
folios. + +\def \setpagenumber #1{% + \setcounter{page}{#1}} + +\AtEndDocument{\label{sigplanconf@finalpage}} + +% Utilities +% --------- + + +\newcommand{\setvspace}[2]{% + #1 = #2 + \advance #1 by -1\parskip} + +% Document Parameters +% -------- ---------- + + +% Page: + +\setlength{\hoffset}{-1in} +\setlength{\voffset}{-1in} + +\setlength{\topmargin}{1in} +\setlength{\headheight}{0pt} +\setlength{\headsep}{0pt} + +\if \@onecolumn + \setlength{\evensidemargin}{.75in} + \setlength{\oddsidemargin}{.75in} +\else + \setlength{\evensidemargin}{.75in} + \setlength{\oddsidemargin}{.75in} +\fi + +% Text area: + +\newdimen{\standardtextwidth} +\setlength{\standardtextwidth}{42pc} + +\if \@onecolumn + \setlength{\textwidth}{40.5pc} +\else + \setlength{\textwidth}{\standardtextwidth} +\fi + +\setlength{\topskip}{8pt} +\setlength{\columnsep}{2pc} +\setlength{\textheight}{54.5pc} + +% Running foot: + +\setlength{\footskip}{30pt} + +% Paragraphs: + +\if \@blockstyle + \setlength{\parskip}{5pt plus .1pt minus .5pt} + \setlength{\parindent}{0pt} +\else + \setlength{\parskip}{0pt} + \setlength{\parindent}{12pt} +\fi + +\setlength{\lineskip}{.5pt} +\setlength{\lineskiplimit}{\lineskip} + +\frenchspacing +\pretolerance = 400 +\tolerance = \pretolerance +\setlength{\emergencystretch}{5pt} +\clubpenalty = 10000 +\widowpenalty = 10000 +\setlength{\hfuzz}{.5pt} + +% Standard vertical spaces: + +\newskip{\standardvspace} +\setvspace{\standardvspace}{5pt plus 1pt minus .5pt} + +% Margin paragraphs: + +\setlength{\marginparwidth}{36pt} +\setlength{\marginparsep}{2pt} +\setlength{\marginparpush}{8pt} + + +\setlength{\skip\footins}{8pt plus 3pt minus 1pt} +\setlength{\footnotesep}{9pt} + +\renewcommand{\footnoterule}{% + \hrule width .5\columnwidth height .33pt depth 0pt} + +\renewcommand{\@makefntext}[1]{% + \noindent \@makefnmark \hspace{1pt}#1} + +% Floats: + +\setcounter{topnumber}{4} +\setcounter{bottomnumber}{1} +\setcounter{totalnumber}{4} + +\renewcommand{\fps@figure}{tp} +\renewcommand{\fps@table}{tp} +\renewcommand{\topfraction}{0.90} +\renewcommand{\bottomfraction}{0.30} +\renewcommand{\textfraction}{0.10} +\renewcommand{\floatpagefraction}{0.75} + +\setcounter{dbltopnumber}{4} + +\renewcommand{\dbltopfraction}{\topfraction} +\renewcommand{\dblfloatpagefraction}{\floatpagefraction} + +\setlength{\floatsep}{18pt plus 4pt minus 2pt} +\setlength{\textfloatsep}{18pt plus 4pt minus 3pt} +\setlength{\intextsep}{10pt plus 4pt minus 3pt} + +\setlength{\dblfloatsep}{18pt plus 4pt minus 2pt} +\setlength{\dbltextfloatsep}{20pt plus 4pt minus 3pt} + +% Miscellaneous: + +\errorcontextlines = 5 + +% Fonts +% ----- + + +\if \@times + \renewcommand{\rmdefault}{ptm}% + \if \@mathtime + \usepackage[mtbold,noTS1]{mathtime}% + \else +%%% \usepackage{mathptm}% + \fi +\else + \relax +\fi + +\if \@ninepoint + +\renewcommand{\normalsize}{% + \@setfontsize{\normalsize}{9pt}{10pt}% + \setlength{\abovedisplayskip}{5pt plus 1pt minus .5pt}% + \setlength{\belowdisplayskip}{\abovedisplayskip}% + \setlength{\abovedisplayshortskip}{3pt plus 1pt minus 2pt}% + \setlength{\belowdisplayshortskip}{\abovedisplayshortskip}} + +\renewcommand{\tiny}{\@setfontsize{\tiny}{5pt}{6pt}} + +\renewcommand{\scriptsize}{\@setfontsize{\scriptsize}{7pt}{8pt}} + +\renewcommand{\small}{% + \@setfontsize{\small}{8pt}{9pt}% + \setlength{\abovedisplayskip}{4pt plus 1pt minus 1pt}% + \setlength{\belowdisplayskip}{\abovedisplayskip}% + \setlength{\abovedisplayshortskip}{2pt plus 1pt}% + \setlength{\belowdisplayshortskip}{\abovedisplayshortskip}} + 
+\renewcommand{\footnotesize}{% + \@setfontsize{\footnotesize}{8pt}{9pt}% + \setlength{\abovedisplayskip}{4pt plus 1pt minus .5pt}% + \setlength{\belowdisplayskip}{\abovedisplayskip}% + \setlength{\abovedisplayshortskip}{2pt plus 1pt}% + \setlength{\belowdisplayshortskip}{\abovedisplayshortskip}} + +\renewcommand{\large}{\@setfontsize{\large}{11pt}{13pt}} + +\renewcommand{\Large}{\@setfontsize{\Large}{14pt}{18pt}} + +\renewcommand{\LARGE}{\@setfontsize{\LARGE}{18pt}{20pt}} + +\renewcommand{\huge}{\@setfontsize{\huge}{20pt}{25pt}} + +\renewcommand{\Huge}{\@setfontsize{\Huge}{25pt}{30pt}} + +\else\if \@tenpoint + +\relax + +\else + +\relax + +\fi\fi + +% Abstract +% -------- + + +\renewenvironment{abstract}{% + \section*{Abstract}% + \normalsize}{% + } + +% Bibliography +% ------------ + + +\renewenvironment{thebibliography}[1] + {\section*{\refname + \@mkboth{\MakeUppercase\refname}{\MakeUppercase\refname}}% + \list{\@biblabel{\@arabic\c@enumiv}}% + {\settowidth\labelwidth{\@biblabel{#1}}% + \leftmargin\labelwidth + \advance\leftmargin\labelsep + \@openbib@code + \usecounter{enumiv}% + \let\p@enumiv\@empty + \renewcommand\theenumiv{\@arabic\c@enumiv}}% + \bibfont + \clubpenalty4000 + \@clubpenalty \clubpenalty + \widowpenalty4000% + \sfcode`\.\@m} + {\def\@noitemerr + {\@latex@warning{Empty `thebibliography' environment}}% + \endlist} + +\if \@natbib + +\if \@authoryear + \typeout{Using natbib package with 'authoryear' citation style.} + \usepackage[authoryear,square]{natbib} + \bibpunct{(}{)}{;}{a}{}{,} % Change fences to parentheses; + % citation separator to semicolon; + % eliminate comma between author and year. + \let \cite = \citep +\else + \typeout{Using natbib package with 'numbers' citation style.} + \usepackage[numbers,sort&compress,square]{natbib} +\fi +\setlength{\bibsep}{3pt plus .5pt minus .25pt} + +\fi + +\def \bibfont {\small} + +% Categories +% ---------- + + +\@setflag \@firstcategory = \@true + +\newcommand{\category}[3]{% + \if \@firstcategory + \paragraph*{Categories and Subject Descriptors}% + \@setflag \@firstcategory = \@false + \else + \unskip ;\hspace{.75em}% + \fi + \@ifnextchar [{\@category{#1}{#2}{#3}}{\@category{#1}{#2}{#3}[}} + +\def \@category #1#2#3[#4]{% + {\let \and = \relax + #1 [\textit{#2}]% + \if \@emptyargp{#4}% + \if \@notp{\@emptyargp{#3}}: #3\fi + \else + :\space + \if \@notp{\@emptyargp{#3}}#3---\fi + \textrm{#4}% + \fi}} + +% Copyright Notice +% --------- ------ + + +\def \ftype@copyrightbox {8} +\def \@toappear {} +\def \@permission {} +\def \@reprintprice {} + +\def \@copyrightspace {% + \@float{copyrightbox}[b]% + \vbox to 0.0001in{% + \vfill + \parbox[b]{0pc}{% + \scriptsize + \if \@preprint + [%Copyright notice will appear here + %once 'preprint' option is removed. +]\par + \else + \@toappear + \fi + \if \@reprint + \noindent Reprinted from \@conferencename, + \@proceedings, + \@conferenceinfo, + pp.~\number\thepage--\pageref{sigplanconf@finalpage}.\par + \fi}}% + \end@float} + +\newcommand{\reprintprice}[1]{% + \gdef \@reprintprice {#1}} + +\reprintprice{\$15.00} + +\long\def \toappear #1{% + \def \@toappear {#1}} + +\toappear{% + \noindent \@permission \par + \vspace{2pt} + \noindent \textsl{\@conferencename}, \quad \@conferenceinfo. 
\par + \noindent Copyright \copyright\ \@copyrightyear\ ACM \@copyrightdata + \dots \@reprintprice.\par + \noindent http://dx.doi.org/10.1145/\@doi } + +\newcommand{\permission}[1]{% + \gdef \@permission {#1}} + +\permission{% + Permission to make digital or hard copies of all or part of this work for + personal or classroom use is granted without fee provided that copies are + not made or distributed for profit or commercial advantage and that copies + bear this notice and the full citation on the first page. Copyrights for + components of this work owned by others than ACM must be honored. + Abstracting with credit is permitted. To copy otherwise, or republish, to + post on servers or to redistribute to lists, requires prior specific + permission and/or a fee. Request permissions from permissions@acm.org.} + +% These are two new rights management and bibstrip text blocks. + +\newcommand{\exclusivelicense}{% + \permission{% + Permission to make digital or hard copies of all or part of this work for + personal or classroom use is granted without fee provided that copies are + not made or distributed for profit or commercial advantage and that copies + bear this notice and the full citation on the first page. Copyrights for + components of this work owned by others than the author(s) must be honored. + Abstracting with credit is permitted. To copy otherwise, or republish, to + post on servers or to redistribute to lists, requires prior specific + permission and/or a fee. Request permissions from permissions@acm.org.} + \toappear{% + \noindent \@permission \par + \vspace{2pt} + \noindent \textsl{\@conferencename}, \quad \@conferenceinfo. \par + \noindent Copyright is held by the owner/author(s). Publication rights licensed to ACM. \par + \noindent ACM \@copyrightdata \dots \@reprintprice.\par + \noindent http://dx.doi.org/10.1145/\@doi}} + +\newcommand{\permissiontopublish}{% + \permission{% + Permission to make digital or hard copies of part or all of this work for + personal or classroom use is granted without fee provided that copies are + not made or distributed for profit or commercial advantage and that copies + bear this notice and the full citation on the first page. Copyrights for + third-party components of this work must be honored. + For all other uses, contact the owner/author(s).}% + \toappear{% + \noindent \@permission \par + \vspace{2pt} + \noindent \textsl{\@conferencename}, \quad \@conferenceinfo. \par + \noindent Copyright is held by the owner/author(s). \par + \noindent ACM \@copyrightdata.\par + \noindent http://dx.doi.org/10.1145/\@doi}} + +% The following permission notices are +% for the traditional copyright transfer agreement option. + +% Exclusive license and permission-to-publish +% give more complicated permission notices. +% These are not covered here. + +\newcommand{\ACMCanadapermission}{% + \permission{% + ACM acknowledges that this contribution was authored or + co-authored by an affiliate of the Canadian National + Government. As such, the Crown in Right of Canada retains an equal + interest in the copyright. Reprint requests should be forwarded to + ACM.}} + +\newcommand{\ACMUSpermission}{% + \permission{% + ACM acknowledges that this contribution was authored or + co-authored by a contractor or affiliate of the United States + Government. 
As such, the United States Government retains a + nonexclusive, royalty-free right to publish or reproduce this + article, or to allow others to do so, for Government purposes + only.}} + +\newcommand{\USpublicpermission}{% + \permission{% + This paper is authored by an employee(s) of the United States + Government and is in the public domain. Non-exclusive copying or + redistribution is allowed, provided that the article citation is + given and the authors and the agency are clearly identified as its + source.}% + \toappear{% + \noindent \@permission \par + \vspace{2pt} + \noindent \textsl{\@conferencename}, \quad \@conferenceinfo. \par + \noindent ACM \@copyrightdata.\par + \noindent http://dx.doi.org/10.1145/\@doi}} + +\newcommand{\authorversion}[4]{% + \permission{% + Copyright \copyright\ ACM, #1. This is the author's version of the work. + It is posted here by permission of ACM for your personal use. + Not for redistribution. The definitive version was published in + #2, #3, http://dx.doi.org/10.1145/#4.}} + +% Enunciations +% ------------ + + +\def \@begintheorem #1#2{% {name}{number} + \trivlist + \item[\hskip \labelsep \textsc{#1 #2.}]% + \itshape\selectfont + \ignorespaces} + +\def \@opargbegintheorem #1#2#3{% {name}{number}{title} + \trivlist + \item[% + \hskip\labelsep \textsc{#1\ #2}% + \if \@notp{\@emptyargp{#3}}\nut (#3).\fi]% + \itshape\selectfont + \ignorespaces} + +% Figures +% ------- + + +\@setflag \@caprule = \@true + +\long\def \@makecaption #1#2{% + \addvspace{4pt} + \if \@caprule + \hrule width \hsize height .33pt + \vspace{4pt} + \fi + \setbox \@tempboxa = \hbox{\@setfigurenumber{#1.}\nut #2}% + \if \@dimgtrp{\wd\@tempboxa}{\hsize}% + \noindent \@setfigurenumber{#1.}\nut #2\par + \else + \centerline{\box\@tempboxa}% + \fi} + +\newcommand{\nocaptionrule}{% + \@setflag \@caprule = \@false} + +\def \@setfigurenumber #1{% + {\rmfamily \bfseries \selectfont #1}} + +% Hierarchy +% --------- + + +\setcounter{secnumdepth}{\@numheaddepth} + +\newskip{\@sectionaboveskip} +\setvspace{\@sectionaboveskip}{10pt plus 3pt minus 2pt} + +\newskip{\@sectionbelowskip} +\if \@blockstyle + \setlength{\@sectionbelowskip}{0.1pt}% +\else + \setlength{\@sectionbelowskip}{4pt}% +\fi + +\renewcommand{\section}{% + \@startsection + {section}% + {1}% + {0pt}% + {-\@sectionaboveskip}% + {\@sectionbelowskip}% + {\large \bfseries \raggedright}} + +\newskip{\@subsectionaboveskip} +\setvspace{\@subsectionaboveskip}{8pt plus 2pt minus 2pt} + +\newskip{\@subsectionbelowskip} +\if \@blockstyle + \setlength{\@subsectionbelowskip}{0.1pt}% +\else + \setlength{\@subsectionbelowskip}{4pt}% +\fi + +\renewcommand{\subsection}{% + \@startsection% + {subsection}% + {2}% + {0pt}% + {-\@subsectionaboveskip}% + {\@subsectionbelowskip}% + {\normalsize \bfseries \raggedright}} + +\renewcommand{\subsubsection}{% + \@startsection% + {subsubsection}% + {3}% + {0pt}% + {-\@subsectionaboveskip} + {\@subsectionbelowskip}% + {\normalsize \bfseries \raggedright}} + +\newskip{\@paragraphaboveskip} +\setvspace{\@paragraphaboveskip}{6pt plus 2pt minus 2pt} + +\renewcommand{\paragraph}{% + \@startsection% + {paragraph}% + {4}% + {0pt}% + {\@paragraphaboveskip} + {-1em}% + {\normalsize \bfseries \if \@times \itshape \fi}} + +\renewcommand{\subparagraph}{% + \@startsection% + {subparagraph}% + {4}% + {0pt}% + {\@paragraphaboveskip} + {-1em}% + {\normalsize \itshape}} + +% Standard headings: + +\newcommand{\acks}{\section*{Acknowledgments}} + +\newcommand{\keywords}{\paragraph*{Keywords}} + +\newcommand{\terms}{\paragraph*{General 
Terms}} + +% Identification +% -------------- + + +\def \@conferencename {} +\def \@conferenceinfo {} +\def \@copyrightyear {} +\def \@copyrightdata {[to be supplied]} +\def \@proceedings {[Unknown Proceedings]} + + +\newcommand{\conferenceinfo}[2]{% + \gdef \@conferencename {#1}% + \gdef \@conferenceinfo {#2}} + +\newcommand{\copyrightyear}[1]{% + \gdef \@copyrightyear {#1}} + +\let \CopyrightYear = \copyrightyear + +\newcommand{\copyrightdata}[1]{% + \gdef \@copyrightdata {#1}} + +\let \crdata = \copyrightdata + +\newcommand{\doi}[1]{% + \gdef \@doi {#1}} + +\newcommand{\proceedings}[1]{% + \gdef \@proceedings {#1}} + +% Lists +% ----- + + +\setlength{\leftmargini}{13pt} +\setlength\leftmarginii{13pt} +\setlength\leftmarginiii{13pt} +\setlength\leftmarginiv{13pt} +\setlength{\labelsep}{3.5pt} + +\setlength{\topsep}{\standardvspace} +\if \@blockstyle + \setlength{\itemsep}{1pt} + \setlength{\parsep}{3pt} +\else + \setlength{\itemsep}{1pt} + \setlength{\parsep}{3pt} +\fi + +\renewcommand{\labelitemi}{{\small \centeroncapheight{\textbullet}}} +\renewcommand{\labelitemii}{\centeroncapheight{\rule{2.5pt}{2.5pt}}} +\renewcommand{\labelitemiii}{$-$} +\renewcommand{\labelitemiv}{{\Large \textperiodcentered}} + +\renewcommand{\@listi}{% + \leftmargin = \leftmargini + \listparindent = 0pt} +%%% \itemsep = 1pt +%%% \parsep = 3pt} +%%% \listparindent = \parindent} + +\let \@listI = \@listi + +\renewcommand{\@listii}{% + \leftmargin = \leftmarginii + \topsep = 1pt + \labelwidth = \leftmarginii + \advance \labelwidth by -\labelsep + \listparindent = \parindent} + +\renewcommand{\@listiii}{% + \leftmargin = \leftmarginiii + \labelwidth = \leftmarginiii + \advance \labelwidth by -\labelsep + \listparindent = \parindent} + +\renewcommand{\@listiv}{% + \leftmargin = \leftmarginiv + \labelwidth = \leftmarginiv + \advance \labelwidth by -\labelsep + \listparindent = \parindent} + +% Mathematics +% ----------- + + +\def \theequation {\arabic{equation}} + +% Miscellaneous +% ------------- + + +\newcommand{\balancecolumns}{% + \vfill\eject + \global\@colht = \textheight + \global\ht\@cclv = \textheight} + +\newcommand{\nut}{\hspace{.5em}} + +\newcommand{\softraggedright}{% + \let \\ = \@centercr + \leftskip = 0pt + \rightskip = 0pt plus 10pt} + +% Program Code +% ------- ---- + + +\newcommand{\mono}[1]{% + {\@tempdima = \fontdimen2\font + \texttt{\spaceskip = 1.1\@tempdima #1}}} + +% Running Heads and Feet +% ------- ----- --- ---- + + +\def \@preprintfooter {} + +\newcommand{\preprintfooter}[1]{% + \gdef \@preprintfooter {#1}} + +\if \@preprint + +\def \ps@plain {% + \let \@mkboth = \@gobbletwo + \let \@evenhead = \@empty + \def \@evenfoot {\scriptsize + \rlap{\textit{\@preprintfooter}}\hfil + \thepage \hfil + \llap{\textit{\@formatyear}}}% + \let \@oddhead = \@empty + \let \@oddfoot = \@evenfoot} + +\else\if \@reprint + +\def \ps@plain {% + \let \@mkboth = \@gobbletwo + \let \@evenhead = \@empty + \def \@evenfoot {\scriptsize \hfil \thepage \hfil}% + \let \@oddhead = \@empty + \let \@oddfoot = \@evenfoot} + +\else + +\let \ps@plain = \ps@empty +\let \ps@headings = \ps@empty +\let \ps@myheadings = \ps@empty + +\fi\fi + +\def \@formatyear {% + \number\year/\number\month/\number\day} + +% Special Characters +% ------- ---------- + + +\DeclareRobustCommand{\euro}{% + \protect{\rlap{=}}{\sf \kern .1em C}} + +% Title Page +% ----- ---- + + +\@setflag \@addauthorsdone = \@false + +\def \@titletext {\@latex@error{No title was provided}{}} +\def \@subtitletext {} + +\newcount{\@authorcount} + 
+\newcount{\@titlenotecount} +\newtoks{\@titlenotetext} + +\def \@titlebanner {} + +\renewcommand{\title}[1]{% + \gdef \@titletext {#1}} + +\newcommand{\subtitle}[1]{% + \gdef \@subtitletext {#1}} + +\newcommand{\authorinfo}[3]{% {names}{affiliation}{email/URL} + \global\@increment \@authorcount + \@withname\gdef {\@authorname\romannumeral\@authorcount}{#1}% + \@withname\gdef {\@authoraffil\romannumeral\@authorcount}{#2}% + \@withname\gdef {\@authoremail\romannumeral\@authorcount}{#3}} + +\renewcommand{\author}[1]{% + \@latex@error{The \string\author\space command is obsolete; + use \string\authorinfo}{}} + +\newcommand{\titlebanner}[1]{% + \gdef \@titlebanner {#1}} + +\renewcommand{\maketitle}{% + \pagestyle{plain}% + \if \@onecolumn + {\hsize = \standardtextwidth + \@maketitle}% + \else + \twocolumn[\@maketitle]% + \fi + \@placetitlenotes + \if \@copyrightwanted \@copyrightspace \fi} + +\def \@maketitle {% + \begin{center} + \@settitlebanner + \let \thanks = \titlenote + {\leftskip = 0pt plus 0.25\linewidth + \rightskip = 0pt plus 0.25 \linewidth + \parfillskip = 0pt + \spaceskip = .7em + \noindent \LARGE \bfseries \@titletext \par} + \vskip 6pt + \noindent \Large \@subtitletext \par + \vskip 12pt + \ifcase \@authorcount + \@latex@error{No authors were specified for this paper}{}\or + \@titleauthors{i}{}{}\or + \@titleauthors{i}{ii}{}\or + \@titleauthors{i}{ii}{iii}\or + \@titleauthors{i}{ii}{iii}\@titleauthors{iv}{}{}\or + \@titleauthors{i}{ii}{iii}\@titleauthors{iv}{v}{}\or + \@titleauthors{i}{ii}{iii}\@titleauthors{iv}{v}{vi}\or + \@titleauthors{i}{ii}{iii}\@titleauthors{iv}{v}{vi}% + \@titleauthors{vii}{}{}\or + \@titleauthors{i}{ii}{iii}\@titleauthors{iv}{v}{vi}% + \@titleauthors{vii}{viii}{}\or + \@titleauthors{i}{ii}{iii}\@titleauthors{iv}{v}{vi}% + \@titleauthors{vii}{viii}{ix}\or + \@titleauthors{i}{ii}{iii}\@titleauthors{iv}{v}{vi}% + \@titleauthors{vii}{viii}{ix}\@titleauthors{x}{}{}\or + \@titleauthors{i}{ii}{iii}\@titleauthors{iv}{v}{vi}% + \@titleauthors{vii}{viii}{ix}\@titleauthors{x}{xi}{}\or + \@titleauthors{i}{ii}{iii}\@titleauthors{iv}{v}{vi}% + \@titleauthors{vii}{viii}{ix}\@titleauthors{x}{xi}{xii}% + \else + \@latex@error{Cannot handle more than 12 authors}{}% + \fi + \vspace{1.75pc} + \end{center}} + +\def \@settitlebanner {% + \if \@andp{\@preprint}{\@notp{\@emptydefp{\@titlebanner}}}% + \vbox to 0pt{% + \vskip -32pt + \noindent \textbf{\@titlebanner}\par + \vss}% + \nointerlineskip + \fi} + +\def \@titleauthors #1#2#3{% + \if \@andp{\@emptyargp{#2}}{\@emptyargp{#3}}% + \noindent \@setauthor{40pc}{#1}{\@false}\par + \else\if \@emptyargp{#3}% + \noindent \@setauthor{17pc}{#1}{\@false}\hspace{3pc}% + \@setauthor{17pc}{#2}{\@false}\par + \else + \noindent \@setauthor{12.5pc}{#1}{\@false}\hspace{2pc}% + \@setauthor{12.5pc}{#2}{\@false}\hspace{2pc}% + \@setauthor{12.5pc}{#3}{\@true}\par + \relax + \fi\fi + \vspace{20pt}} + +\def \@setauthor #1#2#3{% {width}{text}{unused} + \vtop{% + \def \and {% + \hspace{16pt}} + \hsize = #1 + \normalfont + \centering + \large \@name{\@authorname#2}\par + \vspace{5pt} + \normalsize \@name{\@authoraffil#2}\par + \vspace{2pt} + \textsf{\@name{\@authoremail#2}}\par}} + +\def \@maybetitlenote #1{% + \if \@andp{#1}{\@gtrp{\@authorcount}{3}}% + \titlenote{See page~\pageref{@addauthors} for additional authors.}% + \fi} + +\newtoks{\@fnmark} + +\newcommand{\titlenote}[1]{% + \global\@increment \@titlenotecount + \ifcase \@titlenotecount \relax \or + \@fnmark = {\ast}\or + \@fnmark = {\dagger}\or + \@fnmark = {\ddagger}\or + \@fnmark = {\S}\or 
+ \@fnmark = {\P}\or + \@fnmark = {\ast\ast}% + \fi + \,$^{\the\@fnmark}$% + \edef \reserved@a {\noexpand\@appendtotext{% + \noexpand\@titlefootnote{\the\@fnmark}}}% + \reserved@a{#1}} + +\def \@appendtotext #1#2{% + \global\@titlenotetext = \expandafter{\the\@titlenotetext #1{#2}}} + +\newcount{\@authori} + +\iffalse +\def \additionalauthors {% + \if \@gtrp{\@authorcount}{3}% + \section{Additional Authors}% + \label{@addauthors}% + \noindent + \@authori = 4 + {\let \\ = ,% + \loop + \textbf{\@name{\@authorname\romannumeral\@authori}}, + \@name{\@authoraffil\romannumeral\@authori}, + email: \@name{\@authoremail\romannumeral\@authori}.% + \@increment \@authori + \if \@notp{\@gtrp{\@authori}{\@authorcount}} \repeat}% + \par + \fi + \global\@setflag \@addauthorsdone = \@true} +\fi + +\let \addauthorsection = \additionalauthors + +\def \@placetitlenotes { + \the\@titlenotetext} + +% Utilities +% --------- + + +\newcommand{\centeroncapheight}[1]{% + {\setbox\@tempboxa = \hbox{#1}% + \@measurecapheight{\@tempdima}% % Calculate ht(CAP) - ht(text) + \advance \@tempdima by -\ht\@tempboxa % ------------------ + \divide \@tempdima by 2 % 2 + \raise \@tempdima \box\@tempboxa}} + +\newbox{\@measbox} + +\def \@measurecapheight #1{% {\dimen} + \setbox\@measbox = \hbox{ABCDEFGHIJKLMNOPQRSTUVWXYZ}% + #1 = \ht\@measbox} + +\long\def \@titlefootnote #1#2{% + \insert\footins{% + \reset@font\footnotesize + \interlinepenalty\interfootnotelinepenalty + \splittopskip\footnotesep + \splitmaxdepth \dp\strutbox \floatingpenalty \@MM + \hsize\columnwidth \@parboxrestore +%%% \protected@edef\@currentlabel{% +%%% \csname p@footnote\endcsname\@thefnmark}% + \color@begingroup + \def \@makefnmark {$^{#1}$}% + \@makefntext{% + \rule\z@\footnotesep\ignorespaces#2\@finalstrut\strutbox}% + \color@endgroup}} + +% LaTeX Modifications +% ----- ------------- + +\def \@seccntformat #1{% + \@name{\the#1}% + \@expandaftertwice\@seccntformata \csname the#1\endcsname.\@mark + \quad} + +\def \@seccntformata #1.#2\@mark{% + \if \@emptyargp{#2}.\fi} + +% Revision History +% -------- ------- + + +% Date Person Ver. Change +% ---- ------ ---- ------ + +% 2004.09.12 PCA 0.1--4 Preliminary development. + +% 2004.11.18 PCA 0.5 Start beta testing. + +% 2004.11.19 PCA 0.6 Obsolete \author and replace with +% \authorinfo. +% Add 'nocopyrightspace' option. +% Compress article opener spacing. +% Add 'mathtime' option. +% Increase text height by 6 points. + +% 2004.11.28 PCA 0.7 Add 'cm/computermodern' options. +% Change default to Times text. + +% 2004.12.14 PCA 0.8 Remove use of mathptm.sty; it cannot +% coexist with latexsym or amssymb. + +% 2005.01.20 PCA 0.9 Rename class file to sigplanconf.cls. + +% 2005.03.05 PCA 0.91 Change default copyright data. + +% 2005.03.06 PCA 0.92 Add at-signs to some macro names. + +% 2005.03.07 PCA 0.93 The 'onecolumn' option defaults to '11pt', +% and it uses the full type width. + +% 2005.03.15 PCA 0.94 Add at-signs to more macro names. +% Allow margin paragraphs during review. + +% 2005.03.22 PCA 0.95 Implement \euro. +% Remove proof and newdef environments. + +% 2005.05.06 PCA 1.0 Eliminate 'onecolumn' option. +% Change footer to small italic and eliminate +% left portion if no \preprintfooter. +% Eliminate copyright notice if preprint. +% Clean up and shrink copyright box. + +% 2005.05.30 PCA 1.1 Add alternate permission statements. + +% 2005.06.29 PCA 1.1 Publish final first edition of guide. + +% 2005.07.14 PCA 1.2 Add \subparagraph. 
+% Use block paragraphs in lists, and adjust +% spacing between items and paragraphs. + +% 2006.06.22 PCA 1.3 Add 'reprint' option and associated +% commands. + +% 2006.08.24 PCA 1.4 Fix bug in \maketitle case command. + +% 2007.03.13 PCA 1.5 The title banner only displays with the +% 'preprint' option. + +% 2007.06.06 PCA 1.6 Use \bibfont in \thebibliography. +% Add 'natbib' option to load and configure +% the natbib package. + +% 2007.11.20 PCA 1.7 Balance line lengths in centered article +% title (thanks to Norman Ramsey). + +% 2009.01.26 PCA 1.8 Change natbib \bibpunct values. + +% 2009.03.24 PCA 1.9 Change natbib to use the 'numbers' option. +% Change templates to use 'natbib' option. + +% 2009.09.01 PCA 2.0 Add \reprintprice command (suggested by +% Stephen Chong). + +% 2009.09.08 PCA 2.1 Make 'natbib' the default; add 'nonatbib'. +% SB Add 'authoryear' and 'numbers' (default) to +% control citation style when using natbib. +% Add \bibpunct to change punctuation for +% 'authoryear' style. + +% 2009.09.21 PCA 2.2 Add \softraggedright to the thebibliography +% environment. Also add to template so it will +% happen with natbib. + +% 2009.09.30 PCA 2.3 Remove \softraggedright from thebibliography. +% Just include in the template. + +% 2010.05.24 PCA 2.4 Obfuscate class author's email address. + +% 2011.11.08 PCA 2.5 Add copyright notice to this file. +% Remove 'sort' option from natbib when using +% 'authoryear' style. +% Add the \authorversion command. + +% 2013.02.22 PCA 2.6 Change natbib fences to parentheses when +% using 'authoryear' style. + +% 2013.05.17 PCA 2.7 Change standard and author copyright text. + +% 2013.07.02 TU 2.8 More changes to permission/copyright notes. +% Replaced ambiguous \authorpermission with +% \exclusivelicense and \permissiontopublish + + From 62d3dadf9841be015e545c1282fe88e7eecc4d37 Mon Sep 17 00:00:00 2001 From: Scott Lystig Fritchie Date: Fri, 17 Apr 2015 16:02:39 +0900 Subject: [PATCH 02/14] Doc split to high-level-chain-mgr.tex finished All of the major surgery required to move Chain Manager design & discussion details out of the high-level-machi.tex document are complete. I've done only a very small amount of work on the original high-level-machi.tex to fix document flow problems. There's probably a good way to have LaTeX automatically manage the mutual references between the now-split documents, but I didn't know about, sorry. --- doc/src.high-level/Makefile | 2 + doc/src.high-level/high-level-chain-mgr.tex | 1186 +++++++++++++++++ doc/src.high-level/high-level-machi.tex | 1269 ++----------------- 3 files changed, 1304 insertions(+), 1153 deletions(-) create mode 100644 doc/src.high-level/high-level-chain-mgr.tex diff --git a/doc/src.high-level/Makefile b/doc/src.high-level/Makefile index f8216da..3c99a71 100644 --- a/doc/src.high-level/Makefile +++ b/doc/src.high-level/Makefile @@ -1,6 +1,8 @@ all: latex high-level-machi.tex dvipdfm high-level-machi.dvi + latex high-level-chain-mgr.tex + dvipdfm high-level-chain-mgr.dvi clean: rm -f *.aux *.dvi *.log diff --git a/doc/src.high-level/high-level-chain-mgr.tex b/doc/src.high-level/high-level-chain-mgr.tex new file mode 100644 index 0000000..0683374 --- /dev/null +++ b/doc/src.high-level/high-level-chain-mgr.tex @@ -0,0 +1,1186 @@ + +%% \documentclass[]{report} +\documentclass[preprint,10pt]{sigplanconf} +% The following \documentclass options may be useful: + +% preprint Remove this option only once the paper is in final form. +% 10pt To set in 10-point type instead of 9-point. 
+
+% 11pt          To set in 11-point type instead of 9-point.
+% authoryear    To obtain author/year citation style instead of numeric.
+
+% \usepackage[a4paper]{geometry}
+\usepackage[dvips]{graphicx}   % to include images
+%\usepackage{pslatex}     % to use PostScript fonts
+
+\begin{document}
+
+%%\special{papersize=8.5in,11in}
+%%\setlength{\pdfpageheight}{\paperheight}
+%%\setlength{\pdfpagewidth}{\paperwidth}
+
+\conferenceinfo{}{}
+\copyrightyear{2014}
+\copyrightdata{978-1-nnnn-nnnn-n/yy/mm}
+\doi{nnnnnnn.nnnnnnn}
+
+\titlebanner{Draft \#0, April 2014}
+\preprintfooter{Draft \#0, April 2014}
+
+\title{Machi Chain Replication: management theory and design}
+\subtitle{}
+
+\authorinfo{Basho Japan KK}{}{}
+
+\maketitle
+
+\section{Origins}
+\label{sec:origins}
+
+This document was first written during the autumn of 2014 for a
+Basho-only internal audience.  Since its original drafts, Machi has
+been designated by Basho as a full open source software project.  This
+document has been rewritten in 2015 to address an external audience.
+For an overview of the design of the larger Machi system, please see
+\cite{machi-design}.
+
+\section{Abstract}
+\label{sec:abstract}
+
+TODO
+
+\section{Introduction}
+\label{sec:introduction}
+
+TODO
+
+\section{Projections: calculation, then storage, then (perhaps) use}
+\label{sec:projections}
+
+Machi uses a ``projection'' to determine how its Chain Replication replicas
+should operate; see \cite{machi-design} and
+\cite{corfu1}.  At runtime, a cluster must be able to respond both to
+administrative changes (e.g., substituting a failed server box with
+replacement hardware) and to local network conditions (e.g., is
+there a network partition?).  The concept of a projection is borrowed
+from CORFU but has a longer history, e.g., the Hibari key-value store
+\cite{cr-theory-and-practice}, and goes back decades in the research
+literature, e.g., Porcupine \cite{porcupine}.
+
+\subsection{Phases of projection change}
+
+Machi's use of projections has four discrete phases, which are
+discussed below: network monitoring,
+projection calculation, projection storage, and
+adoption of new projections.
+
+\subsubsection{Network monitoring}
+\label{sub:network-monitoring}
+
+Monitoring of local network conditions can be implemented in many
+ways.  None are mandatory, as far as this RFC is concerned.
+Easy-to-maintain code should be the primary driver for any
+implementation.  Early versions of Machi may use some/all of the
+following techniques:
+
+\begin{itemize}
+\item Internal ``no op'' FLU-level protocol request \& response.
+\item Use of distributed Erlang {\tt net\_ticktime} node monitoring.
+\item Explicit connections to remote {\tt epmd} services, e.g., to
+tell the difference between a dead Erlang VM and a dead
+machine/hardware node.
+\item Network tests via ICMP {\tt ECHO\_REQUEST}, a.k.a. {\tt ping(8)}.
+\end{itemize}
+
+Output of the monitor should declare the up/down (or
+available/unavailable) status of each server in the projection.  Such
+Boolean status does not eliminate ``fuzzy logic'' or probabilistic
+methods for determining status.  Instead, hard Boolean up/down status
+decisions are required by the projection calculation phase
+(Section~\ref{subsub:projection-calculation}).
+
+\subsubsection{Projection data structure calculation}
+\label{subsub:projection-calculation}
+
+Each Machi server will have an independent agent/process that is
+responsible for calculating new projections.  
A new projection may be +required whenever an administrative change is requested or in response +to network conditions (e.g., network partitions). + +Projection calculation will be a pure computation, based on input of: + +\begin{enumerate} +\item The current projection epoch's data structure +\item Administrative request (if any) +\item Status of each server, as determined by network monitoring +(Section~\ref{sub:network-monitoring}). +\end{enumerate} + +All decisions about {\em when} to calculate a projection must be made +using additional runtime information. Administrative change requests +probably should happen immediately. Change based on network status +changes may require retry logic and delay/sleep time intervals. + +\subsection{Projection storage: writing} +\label{sub:proj-storage-writing} + +All projection data structures are stored in the write-once Projection +Store that is run by each FLU. (See also \cite{machi-design}.) + +Writing the projection follows the two-step sequence below. +In cases of writing +failure at any stage, the process is aborted. The most common case is +{\tt error\_written}, which signifies that another actor in the system has +already calculated another (perhaps different) projection using the +same projection epoch number and that +read repair is necessary. Note that {\tt error\_written} may also +indicate that another actor has performed read repair on the exact +projection value that the local actor is trying to write! + +\begin{enumerate} +\item Write $P_{new}$ to the local projection store. This will trigger + ``wedge'' status in the local FLU, which will then cascade to other + projection-related behavior within the FLU. +\item Write $P_{new}$ to the remote projection store of {\tt all\_members}. + Some members may be unavailable, but that is OK. +\end{enumerate} + +(Recall: Other parts of the system are responsible for reading new +projections from other actors in the system and for deciding to try to +create a new projection locally.) + +\subsection{Projection storage: reading} +\label{sub:proj-storage-reading} + +Reading data from the projection store is similar in principle to +reading from a Chain Replication-managed FLU system. However, the +projection store does not require the strict replica ordering that +Chain Replication does. For any projection store key $K_n$, the +participating servers may have different values for $K_n$. As a +write-once store, it is impossible to mutate a replica of $K_n$. If +replicas of $K_n$ differ, then other parts of the system (projection +calculation and storage) are responsible for reconciling the +differences by writing a later key, +$K_{n+x}$ when $x>0$, with a new projection. + +Projection store reads are ``best effort''. The projection used is chosen from +all replica servers that are available at the time of the read. The +minimum number of replicas is only one: the local projection store +should always be available, even if no other remote replica projection +stores are available. + +For any key $K$, different projection stores $S_a$ and $S_b$ may store +nothing (i.e., {\tt error\_unwritten} when queried) or store different +values, $P_a \ne P_b$, despite having the same projection epoch +number. The following ranking rules are used to +determine the ``best value'' of a projection, where highest rank of +{\em any single projection} is considered the ``best value'': + +\begin{enumerate} +\item An unwritten value is ranked at a value of $-1$. 
+\item A value whose {\tt author\_server} is at the $I^{th}$ position
+  in the {\tt all\_members} list has a rank of $I$.
+\item A value whose {\tt dbg\_annotations} and/or other fields have
+  additional information may increase/decrease its rank, e.g.,
+  increase the rank by $10.25$.
+\end{enumerate}
+
+Rank rules \#2 and \#3 are intended to avoid worst-case ``thrashing''
+of different projection proposals.
+
+The concept of ``read repair'' of an unwritten key is the same as
+Chain Replication's.  If a read attempt for a key $K$ at some server
+$S$ results in {\tt error\_unwritten}, then all of the other stores in
+the {\tt \#projection.all\_members} list are consulted.  If there is a
+unanimous value $V_{u}$ elsewhere, then $V_{u}$ is used to repair all
+unwritten replicas.  If the value of $K$ is not unanimous, then the
+``best value'' $V_{best}$ is used for the repair.  If all respond with
+{\tt error\_unwritten}, repair is not required.
+
+\subsection{Adoption of new projections}
+
+The projection store's ``best value'' for the largest written epoch
+number at the time of the read is the projection used by the FLU.
+If the read attempt for projection $P_p$
+also yields other non-best values, then the
+projection calculation subsystem is notified.  This notification
+may or may not trigger a calculation of a new projection $P_{p+1}$ which
+may eventually be stored and so
+resolve $P_p$'s replicas' ambiguity.
+
+\subsubsection{Alternative implementations: Hibari's ``Admin Server''
+  and Elastic Chain Replication}
+
+See Section 7 of \cite{cr-theory-and-practice} for details of Hibari's
+chain management agent, the ``Admin Server''.  In brief:
+
+\begin{itemize}
+\item The Admin Server is intentionally a single point of failure in
+  the same way that the instance of Stanchion in a Riak CS cluster
+  is an intentional single
+  point of failure.  In both cases, strict
+  serialization of state changes is more important than 100\%
+  availability.
+
+\item For higher availability, the Hibari Admin Server is usually
+  configured in an active/standby manner.  Status monitoring and
+  application failover logic are provided by the built-in capabilities
+  of the Erlang/OTP application controller.
+
+\end{itemize}
+
+Elastic chain replication is a technique described in
+\cite{elastic-chain-replication}.  It describes using multiple chains
+to monitor each other, as arranged in a ring where a chain at position
+$x$ is responsible for chain configuration and management of the chain
+at position $x+1$.  This technique is likely the fall-back to be used
+in case the chain management method described in this RFC proves
+infeasible.
+
+\subsection{Likely problems and possible solutions}
+\label{sub:likely-problems}
+
+There are some unanswered questions about Machi's proposed chain
+management technique.  The problems that we guess are likely/possible
+include:
+
+\begin{itemize}
+
+\item Thrashing or oscillating between a pair (or more) of
+  projections.  It's hoped that the ``best projection'' ranking system
+  will be sufficient to prevent endless thrashing of projections, but
+  it isn't yet clear that it will be.
+
+\item Partial (and/or one-way) network splits which cause partially
+  connected graphs of inter-node connectivity.  Groups of nodes that
+  are completely isolated aren't a problem.  However, partially
+  connected groups of nodes are an unknown.
  Intuition says that
+  communication (via the projection store) with ``bridge nodes'' in a
+  partially-connected network ought to settle eventually on a
+  projection with high rank, e.g., the projection on an island
+  subcluster of nodes with the largest author node name.  Some corner
+  case(s) may exist where this intuition is not correct.
+
+\item CP Mode management via the method proposed in
+  Section~\ref{sec:split-brain-management} may not be sufficient in
+  all cases.
+
+\end{itemize}
+
+\section{Chain Replication: proof of correctness}
+\label{sub:cr-proof}
+
+See Section~3 of \cite{chain-replication} for a proof of the
+correctness of Chain Replication.  A short summary is provided here.
+Readers interested in good karma should read the entire paper.
+
+The four basic rules of Chain Replication and its strong
+consistency guarantee are:
+
+\begin{enumerate}
+
+\item All replica servers are arranged in an ordered list $C$.
+
+\item All mutations of a datum are performed upon each replica of $C$
+  strictly in the order in which they appear in $C$.  A mutation is considered
+  completely successful if the writes by all replicas are successful.
+
+\item The head of the chain makes the determination of the order of
+  all mutations to all members of the chain.  If the head determines
+  that some mutation $M_i$ happened before another mutation $M_j$,
+  then mutation $M_i$ happens before $M_j$ on all other members of
+  the chain.\footnote{While necessary for general Chain Replication,
+  Machi does not need this property.  Instead, the property is
+  provided by Machi's sequencer and the write-once register of each
+  byte in each file.}
+
+\item All read-only operations are performed by the ``tail'' replica,
+  i.e., the last replica in $C$.
+
+\end{enumerate}
+
+The basis of the proof lies in a simple logical trick, which is to
+consider the history of all operations made to any server in the chain
+as a literal list of unique symbols, one for each mutation.
+
+Each replica of a datum will have a mutation history list.  We will
+call this history list $H$.  For the $i^{th}$ replica in the chain list
+$C$, we call $H_i$ the mutation history list for the $i^{th}$ replica.
+
+Before the $i^{th}$ replica in the chain list begins service, its mutation
+history $H_i$ is empty, $[]$.  After this replica runs in a Chain
+Replication system for a while, its mutation history list grows to
+look something like
+$[M_0, M_1, M_2, \ldots, M_{m-1}]$ where $m$ is the total number of
+mutations of the datum that this server has processed successfully.
+
+Let's assume for a moment that all mutation operations have stopped.
+If the order of the chain was constant, and if all mutations are
+applied to each replica in the chain's order, then all replicas of a
+datum will have the exact same mutation history: $H_i = H_j$ for any
+two replicas $i$ and $j$ in the chain
+(i.e., $\forall i,j \in C, H_i = H_j$).  That's a lovely property,
+but it is much more interesting to assume that the service is
+not stopped.  Let's look next at a running system.
+
+\begin{figure*}
+\centering
+\begin{tabular}{ccc}
+{\bf On left side of $C$} & & {\bf On right side of $C$} \\
+\hline
+\multicolumn{3}{l}{Looking at replica order in chain $C$:} \\
+$i$  &  $<$  &  $j$ \\
+
+\multicolumn{3}{l}{For example:} \\
+
+0  &  $<$  &  2 \\
+\hline
+\multicolumn{3}{l}{It {\em must} be true: history lengths per replica:} \\
+length($H_i$)  &  $\geq$  &  length($H_j$) \\
+\multicolumn{3}{l}{For example, a quiescent chain:} \\
+48  &  $\geq$  &  48 \\
+\multicolumn{3}{l}{For example, a chain being mutated:} \\
+55  &  $\geq$  &  48 \\
+\multicolumn{3}{l}{Example ordered mutation sets:} \\
+$[M_0,M_1,\ldots,M_{46},M_{47},\ldots,M_{53},M_{54}]$ & $\supset$ & $[M_0,M_1,\ldots,M_{46},M_{47}]$ \\
+\multicolumn{3}{c}{\bf Therefore the right side is always an ordered
+  subset} \\
+\multicolumn{3}{c}{\bf of the left side.  Furthermore, the ordered
+  sets on both} \\
+\multicolumn{3}{c}{\bf sides have the exact same order of those elements they have in common.} \\
+\multicolumn{3}{c}{The notation used by the Chain Replication paper is
+shown below:} \\
+$[M_0,M_1,\ldots,M_{46},M_{47},\ldots,M_{53},M_{54}]$ & $\succeq$ & $[M_0,M_1,\ldots,M_{46},M_{47}]$ \\
+
+\end{tabular}
+\caption{A demonstration of the Chain Replication protocol's history
+  ``Update Propagation Invariant''.}
+\label{tab:chain-order}
+\end{figure*}
+
+If the entire chain $C$ is processing any number of concurrent
+mutations, then we can still understand $C$'s behavior.
+Figure~\ref{tab:chain-order} shows us two replicas in chain $C$:
+replica $R_i$, which is earlier in the replica chain $C$
+than some other replica $R_j$.  We know that $i$'s position index in
+the chain is smaller than $j$'s position index, so therefore $i < j$.
+The restrictions of Chain Replication make it true that length($H_i$)
+$\ge$ length($H_j$) because it is also true that $H_i \supseteq H_j$, i.e.,
+$H_i$ on the left is always a superset of $H_j$ on the right.
+
+When considering $H_i$ and $H_j$ as strictly ordered lists, we have
+$H_i \succeq H_j$, where the right side is always an exact prefix of the left
+side's list.  This prefixing property is exactly what strong
+consistency requires.  If a value is read from the tail of the chain,
+then no other chain member can have a prior/older value because their
+respective mutation histories cannot be shorter than the tail
+member's history.
+
+\paragraph{``Update Propagation Invariant''}
+is the original Chain Replication paper's name for the
+$H_i \succeq H_j$
+property.  This document will use the same name.
+
+\section{Repair of entire files}
+\label{sec:repair-entire-files}
+
+There are some situations where repair of entire files is necessary.
+
+\begin{itemize}
+\item To repair FLUs added to a chain in a projection change,
+  specifically adding a new FLU to the chain.  This case covers both
+  adding a new, data-less FLU and re-adding a previous, data-full FLU
+  back to the chain.
+\item To avoid data loss when changing the order of the chain's servers.
+\end{itemize}
+
+Both situations can set the stage for data loss in the future.
+If a violation of the Update Propagation Invariant (see end of
+Section~\ref{sub:cr-proof}) is permitted, then the strong consistency
+guarantee of Chain Replication is violated.  Because Machi uses
+write-once registers, the number of possible strong consistency
+violations is small: any client that witnesses a written $\rightarrow$
+unwritten transition is a violation of strong consistency.  But
+avoiding even this one bad scenario is a bit tricky.
+
+As explained in Section~\ref{sub:data-loss1}, data
+unavailability/loss when all chain servers fail is unavoidable.  We
+wish to avoid data loss whenever a chain has at least one surviving
+server.  Another method to avoid data loss is to preserve the Update
+Propagation Invariant at all times.
+
+\subsubsection{Just ``rsync'' it!}
+\label{ssec:just-rsync-it}
+
+A simple repair method might be 90\% sufficient.
+That method could loosely be described as ``just {\tt rsync}
+all files to all servers in an infinite loop.''\footnote{The
+  file format suggested in
+  \cite{machi-design} does not permit {\tt rsync}
+  as-is to be sufficient.  A variation of {\tt rsync} would need to be
+  aware of the data/metadata split within each file and only replicate
+  the data section \ldots and the metadata would still need to be
+  managed outside of {\tt rsync}.}
+
+However, such an informal method
+cannot tell you exactly when you are in danger of data loss and when
+data loss has actually happened.  If we maintain the Update
+Propagation Invariant, then we know exactly when data loss is imminent
+or has happened.
+
+Furthermore, we hope to use Machi for multiple use cases, including
+ones that require strong consistency.
+For uses such as CORFU, strong consistency is a non-negotiable
+requirement.  Therefore, we will use the Update Propagation Invariant
+as the foundation for Machi's data loss prevention techniques.
+
+\subsubsection{Divergence from CORFU: repair}
+\label{sub:repair-divergence}
+
+The original repair design for CORFU is simple and effective,
+mostly.  See Figure~\ref{fig:corfu-style-repair} for a full
+description of the algorithm and
+Figure~\ref{fig:corfu-repair-sc-violation} for an example of a strong
+consistency violation that can follow.  (NOTE: This is a variation of
+the data loss scenario that is described in
+Figure~\ref{fig:data-loss2}.)
+
+\begin{figure}
+\begin{enumerate}
+\item Destroy all data on the repair destination FLU.
+\item Add the repair destination FLU to the tail of the chain in a new
+  projection $P_{p+1}$.
+\item Change projection from $P_p$ to $P_{p+1}$.
+\item Let single item read repair fix all of the problems.
+\end{enumerate}
+\caption{Simplest CORFU-style repair algorithm.}
+\label{fig:corfu-style-repair}
+\end{figure}
+
+\begin{figure}
+\begin{enumerate}
+\item Write value $V$ to offset $O$ in the log with chain $[F_a]$.
+  This write is considered successful.
+\item Change projection to configure chain as $[F_a,F_b]$.  Prior to
+  the change, all values on FLU $F_b$ are unwritten.
+\item FLU server $F_a$ crashes.  The new projection defines the chain
+  as $[F_b]$.
+\item A client attempts to read offset $O$ and finds an unwritten
+  value.  This is a strong consistency violation.
+%% \item The same client decides to fill $O$ with the junk value
+%% $V_{junk}$.  Now value $V$ is lost.
+\end{enumerate}
+\caption{An example scenario where the simplest CORFU-style repair algorithm
+  can lead to a violation of strong consistency.}
+\label{fig:corfu-repair-sc-violation}
+\end{figure}
+
+A variation of the repair
+algorithm is presented in Section~2.5 of a later CORFU paper \cite{corfu2}.
+However, the re-use of a failed
+server is not discussed there, either: the example of a failed server
+$F_6$ uses a new server, $F_8$, to replace $F_6$.  
Furthermore, the +repair process is described as: + +\begin{quote} +``Once $F_6$ is completely rebuilt on $F_8$ (by copying entries from + $F_7$), the system moves to projection (C), where $F_8$ is now used + to service all reads in the range $[40K,80K)$.'' +\end{quote} + +The phrase ``by copying entries'' does not give enough +detail to avoid the same data race as described in +Figure~\ref{fig:corfu-repair-sc-violation}. We believe that if +``copying entries'' means copying only written pages, then CORFU +remains vulnerable. If ``copying entries'' also means ``fill any +unwritten pages prior to copying them'', then perhaps the +vulnerability is eliminated.\footnote{SLF's note: Probably? This is my + gut feeling right now. However, given that I've just convinced + myself 100\% that fill during any possibility of split brain is {\em + not safe} in Machi, I'm not 100\% certain anymore than this ``easy'' + fix for CORFU is correct.}. + +\subsubsection{Whole-file repair as FLUs are (re-)added to a chain} +\label{sub:repair-add-to-chain} + +Machi's repair process must preserve the Update Propagation +Invariant. To avoid data races with data copying from +``U.P.~Invariant preserving'' servers (i.e. fully repaired with +respect to the Update Propagation Invariant) +to servers of unreliable/unknown state, a +projection like the one shown in +Figure~\ref{fig:repair-chain-of-chains} is used. In addition, the +operations rules for data writes and reads must be observed in a +projection of this type. + +\begin{figure*} +\centering +$ +[\overbrace{\underbrace{H_1}_\textbf{Head of Heads}, M_{11}, + \underbrace{T_1}_\textbf{Tail \#1}}^\textbf{Chain \#1 (U.P.~Invariant preserving)} +\mid +\overbrace{H_2, M_{21}, + \underbrace{T_2}_\textbf{Tail \#2}}^\textbf{Chain \#2 (repairing)} +\mid \ldots \mid +\overbrace{H_n, M_{n1}, + \underbrace{T_n}_\textbf{Tail \#n \& Tail of Tails ($T_{tails}$)}}^\textbf{Chain \#n (repairing)} +] +$ +\caption{Representation of a ``chain of chains'': a chain prefix of + Update Propagation Invariant preserving FLUs (``Chain \#1'') + with FLUs from $n-1$ other chains under repair.} +\label{fig:repair-chain-of-chains} +\end{figure*} + +\begin{itemize} + +\item The system maintains the distinction between ``U.P.~preserving'' + and ``repairing'' FLUs at all times. This allows the system to + track exactly which servers are known to preserve the Update + Propagation Invariant and which servers may/may not. + +\item All ``repairing'' FLUs must be added only at the end of the + chain-of-chains. + +\item All write operations must flow successfully through the + chain-of-chains from beginning to end, i.e., from the ``head of + heads'' to the ``tail of tails''. This rule also includes any + repair operations. + +\item In AP Mode, all read operations are attempted from the list of +$[T_1,\-T_2,\-\ldots,\-T_n]$, where these FLUs are the tails of each of the +chains involved in repair. +In CP mode, all read operations are attempted only from $T_1$. +The first reply of {\tt \{ok, <<...>>\}} is a correct answer; +the rest of the FLU list can be ignored and the result returned to the +client. If all FLUs in the list have an unwritten value, then the +client can return {\tt error\_unwritten}. + +\end{itemize} + +While the normal single-write and single-read operations are performed +by the cluster, a file synchronization process is initiated. The +sequence of steps differs depending on the AP or CP mode of the system. 
+
+\paragraph{In cases where the cluster is operating in CP Mode:}
+
+CORFU's repair method of ``just copy it all'' (from source FLU to repairing
+FLU) is correct, {\em except} for the small problem pointed out in
+Section~\ref{sub:repair-divergence}.  The problem for Machi is one of
+time \& space.  Machi wishes to avoid transferring data that is
+already correct on the repairing nodes.  If a Machi node is storing
+20 TBytes of data, we really do not wish to use 20 TBytes of bandwidth
+to repair only 1 GByte of truly-out-of-sync data.
+
+However, it is {\em vitally important} that all repairing FLU data be
+clobbered/overwritten with exactly the same data as the Update
+Propagation Invariant preserving chain.  If this rule is not strictly
+enforced, then fill operations can corrupt Machi file data.  The
+algorithm proposed is:
+
+\begin{enumerate}
+
+\item Change the projection to a ``chain of chains'' configuration
+  such as depicted in Figure~\ref{fig:repair-chain-of-chains}.
+
+\item For all files on all FLUs in all chains, extract the lists of
+  written/unwritten byte ranges and their corresponding file data
+  checksums.  (The checksum metadata is not strictly required for
+  recovery in AP Mode.)
+  Send these lists to the tail of tails
+  $T_{tails}$, which will collate all of the lists into a list of
+  tuples such as {\tt \{FName, $O_{start}, O_{end}$, CSum, FLU\_List\}}
+  where {\tt FLU\_List} is the list of all FLUs in the entire chain of
+  chains where the bytes at the location {\tt \{FName, $O_{start},
+    O_{end}$\}} are known to be written (as of the current repair period).
+
+\item For chain \#1 members, i.e., the
+  leftmost chain relative to Figure~\ref{fig:repair-chain-of-chains},
+  repair file byte ranges for any chain \#1 members that are not members
+  of the {\tt FLU\_List} set.  This will repair any partial
+  writes to chain \#1 that were unsuccessful (e.g., client crashed).
+  (Note however that this step only repairs FLUs in chain \#1.)
+
+\item For all file byte ranges in all files on all FLUs in all
+  repairing chains where Tail \#1's value is unwritten, force all
+  repairing FLUs to also be unwritten.
+
+\item For file byte ranges in all files on all FLUs in all repairing
+  chains where Tail \#1's value is written, send repair file byte data
+  \& metadata to any repairing FLU if the repairing FLU's
+  value is unwritten or the checksum is not exactly equal to Tail \#1's
+  checksum.
+
+\end{enumerate}
+
+\begin{figure}
+\centering
+$
+[\overbrace{\underbrace{H_1}_\textbf{Head}, M_{11}, T_1,
+            H_2, M_{21}, T_2,
+            \ldots
+            H_n, M_{n1},
+            \underbrace{T_n}_\textbf{Tail}}^\textbf{Chain (U.P.~Invariant preserving)}
+]
+$
+\caption{Representation of Figure~\ref{fig:repair-chain-of-chains}
+  after all repairs have finished successfully and a new projection has
+  been calculated.}
+\label{fig:repair-chain-of-chains-finished}
+\end{figure}
+
+When the repair is known to have copied all missing data successfully,
+then the chain can change state via a new projection that includes the
+repaired FLU(s) at the end of the U.P.~Invariant preserving chain \#1
+in the same order in which they appeared in the chain-of-chains during
+repair.  See Figure~\ref{fig:repair-chain-of-chains-finished}.
+
+The repair can be coordinated and/or performed by the $T_{tails}$ FLU
+or any other FLU or cluster member that has spare capacity.
+
+There is no serious race condition here between the enumeration steps
+and the repair steps.  Why? 
 Because the change in projection at
+step \#1 will force any new data writes to adapt to a new projection.
+Consider the mutations that either happen before or after a projection
+change:
+
+
+\begin{itemize}
+
+\item For all mutations $M_1$ prior to the projection change, the
+  enumeration steps \#3 \& \#4 and \#5 will always encounter mutation
+  $M_1$.  Any repair must write through the entire chain-of-chains and
+  thus will preserve the Update Propagation Invariant when repair is
+  finished.
+
+\item For all mutations $M_2$ starting during or after the projection
+  change has finished, a new mutation $M_2$ may or may not be included in the
+  enumeration steps \#3 \& \#4 and \#5.
+  However, in the new projection, $M_2$ must be
+  written to all chain of chains members, and such
+  in-order writes will also preserve the Update
+  Propagation Invariant and are therefore also safe.
+
+\end{itemize}
+
+%% Then the only remaining safety problem (as far as I can see) is
+%% avoiding this race:
+
+%% \begin{enumerate}
+%% \item Enumerate byte ranges $[B_0,B_1,\ldots]$ in file $F$ that must
+%% be copied to the repair target, based on checksum differences for
+%% those byte ranges.
+%% \item A real-time concurrent write for byte range $B_x$ arrives at the
+%% U.P.~Invariant preserving chain for file $F$ but was not a member of
+%% step \#1's list of byte ranges.
+%% \item Step \#2's update is propagated down the chain of chains.
+%% \item Step \#1's clobber updates are propagated down the chain of
+%% chains.
+%% \item The value for $B_x$ is lost on the repair targets.
+%% \end{enumerate}
+
+\paragraph{In cases where the cluster is operating in AP Mode:}
+
+\begin{enumerate}
+\item Follow the first two steps of the ``CP Mode''
+  sequence (above).
+\item Follow step \#3 of the ``CP Mode'' sequence
+  (above), but in place of repairing only FLUs in chain \#1, AP Mode
+  will repair the byte ranges of any FLU that is not a member of the
+  {\tt FLU\_List} set.
+\item End of procedure.
+\end{enumerate}
+
+The end result is a huge ``merge'' where any
+{\tt \{FName, $O_{start}, O_{end}$\}} range of bytes that is written
+on FLU $F_w$ but missing/unwritten from FLU $F_m$ is written down the full chain
+of chains, skipping any FLUs where the data is known to be written.
+Such writes will also preserve the Update Propagation Invariant when
+repair is finished.
+
+\subsubsection{Whole-file repair when changing FLU ordering within a chain}
+\label{sub:repair-chain-re-ordering}
+
+Changing FLU order within a chain is an operations optimization only.
+It may be that the administrator wishes the order of a chain to remain
+as originally configured during steady-state operation, e.g.,
+$[F_a,F_b,F_c]$.  As FLUs are stopped \& restarted, the chain may
+become re-ordered in a seemingly-arbitrary manner.
+
+It is certainly possible to re-order the chain in a kludgy manner.
+For example, if the desired order is $[F_a,F_b,F_c]$ but the current
+operating order is $[F_c,F_b,F_a]$, then remove $F_b$ from the chain,
+then add $F_b$ to the end of the chain.  Then repeat the same
+procedure for $F_c$.  The end result will be the desired order; a
+worked trace appears below.
+
+From an operations perspective, re-ordering the chain
+in this kludgy manner has a
+negative effect on availability: the chain is temporarily reduced from
+operating with $N$ replicas down to $N-1$.  This reduced replication
+factor will not remain for long, at most a few minutes at a time, but
+even a small amount of time may be unacceptable in some environments.
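+
+To make the kludgy re-ordering procedure concrete, here is the worked
+trace for the example above (desired order $[F_a,F_b,F_c]$, current
+order $[F_c,F_b,F_a]$).  The intermediate two-member chains are where
+the temporary $N-1$ replication factor appears:
+
+\begin{displaymath}
+[F_c,F_b,F_a]
+  \stackrel{\mathrm{remove}\ F_b}{\longrightarrow} [F_c,F_a]
+  \stackrel{\mathrm{append}\ F_b}{\longrightarrow} [F_c,F_a,F_b]
+\end{displaymath}
+\begin{displaymath}
+[F_c,F_a,F_b]
+  \stackrel{\mathrm{remove}\ F_c}{\longrightarrow} [F_a,F_b]
+  \stackrel{\mathrm{append}\ F_c}{\longrightarrow} [F_a,F_b,F_c]
+\end{displaymath}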
+

Reordering is possible with the introduction of a ``temporary head''
of the chain.  This temporary FLU does not need to be a full replica
of the entire chain --- it merely needs to store replicas of mutations
that are made during the chain reordering process.  This method will
not be described here.  However, {\em if reviewers believe that it should
be included}, please let the authors know.

\paragraph{In both Machi operating modes:}
After initial implementation, it may be that the repair procedure is a
bit too slow.  In order to accelerate repair decisions, it would be
helpful to have a quicker method to calculate which files have exactly
the same contents.  In traditional systems, this is done with a single
whole-file checksum; see also the ``checksum scrub'' subsection in
\cite{machi-design}.
Machi's files can be written out of order from a file offset point of
view, which violates the strictly sequential order that the traditional
method of calculating a whole-file hash requires.  If we recall the
out-of-temporal-order example in the ``Append-only files'' section of
\cite{machi-design}, the traditional method cannot
continue calculating the file checksum at offset 2 until the byte at
file offset 1 is written.

It may be advantageous for each FLU to maintain for each file a
checksum of a canonical representation of the
{\tt \{$O_{start},O_{end},$ CSum\}} tuples that the FLU must already
maintain.  Then for any two FLUs that claim to store a file $F$, if
both FLUs have the same hash of $F$'s written map + checksums, then
the copies of $F$ on both FLUs are the same.

\section{``Split brain'' management in CP Mode}
\label{sec:split-brain-management}

Split brain management is a thorny problem.  The method presented here
is one based on pragmatics.  If it doesn't work, there isn't a serious
worry, because Machi's first serious use cases all require only AP Mode.
If we end up falling back to ``use Riak Ensemble'' or ``use ZooKeeper'',
then perhaps that's
fine enough.  Meanwhile, let's explore how a
completely self-contained, no-external-dependencies
CP Mode Machi might work.

Wikipedia's description of the quorum consensus solution\footnote{See
  {\tt http://en.wikipedia.org/wiki/Split-brain\_(computing)}.} is nice
and short:

\begin{quotation}
A typical approach, as described by Coulouris et al.,[4] is to use a
quorum-consensus approach. This allows the sub-partition with a
majority of the votes to remain available, while the remaining
sub-partitions should fall down to an auto-fencing mode.
\end{quotation}

This is the same basic technique that
both Riak Ensemble and ZooKeeper use.  Machi's
extensive use of write-once registers is a big advantage when implementing
this technique.  Also very useful is the Machi ``wedge'' mechanism,
which can automatically implement the ``auto-fencing'' that the
technique requires.  All Machi servers that can communicate with only
a minority of other servers will automatically ``wedge'' themselves
and refuse all requests for service until communication with the
majority can be re-established.

\subsection{The quorum: witness servers vs. full servers}

In any quorum-consensus system, at least $2f+1$ participants are
required to survive $f$ participant failures.  Machi can implement a
technique of ``witness servers'' to bring the total cost
somewhere in the middle, between $2f+1$ and $f+1$, depending on your
point of view.
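
As a concrete example of this quorum arithmetic: to survive $f=2$
simultaneous failures, a cluster needs at least $2f+1 = 5$ quorum
participants.  With the witness technique described below, only
$f+1 = 3$ of those five need to be full servers that actually store
Machi file data; the remaining $f = 2$ participants can be lightweight
witnesses.
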
+

A ``witness server'' is one that participates in the network protocol
but does not store or manage all of the state that a ``full server''
does.  A ``full server'' is a Machi server as
described by this RFC document.  A ``witness server'' is a server that
participates only in the projection store, the projection epoch
transition protocol, and a small subset of the file access API.
A witness server doesn't actually store any
Machi files.  A witness server is almost stateless when compared to a
full Machi server.

A mixed cluster of witness and full servers must still contain at
least $2f+1$ participants.  However, only $f+1$ of them are full
participants, and the remaining $f$ participants are witnesses.  In
such a cluster, any majority quorum must have at least one full server
participant.

Witness FLUs are always placed at the front of the chain.  As stated
above, there may be at most $f$ witness FLUs.  A functioning quorum
majority
must have at least $f+1$ FLUs that can communicate and therefore
calculate and store a new unanimous projection.  Therefore, any FLU at
the tail of a functioning quorum majority chain must be a full FLU.
Full FLUs actually store Machi files, so they have no problem
answering {\tt read\_req} API requests.\footnote{We hope that it is
  now clear that a witness FLU cannot answer any Machi file read API
  request.}

Any FLU that can only communicate with a minority of other FLUs will
find that none can calculate a new projection that includes a
majority of FLUs.  Any such FLU, when in CP mode, would then move to
wedge state and remain wedged until the network partition heals enough
to communicate with the majority side.  This is a nice property: we
automatically get ``fencing'' behavior.\footnote{Any FLU on the minority side
  is wedged and therefore refuses to serve because it is, so to speak,
  ``on the wrong side of the fence.''}

There is one case where ``fencing'' may not happen: if both the client
and the tail FLU are on the same minority side of a network partition.
Assume the client and FLU $F_z$ are on the ``wrong side'' of a network
split; both are using projection epoch $P_1$.  The tail of the
chain is $F_z$.

Also assume that the ``right side'' has reconfigured and is using
projection epoch $P_2$.  The right side has mutated key $K$.  Meanwhile,
nobody on the ``wrong side'' has noticed anything wrong and is happy to
continue using projection $P_1$.

\begin{itemize}
\item {\bf Option a}: Now the wrong side client reads $K$ using $P_1$ via
  $F_z$.  $F_z$ does not detect an epoch problem and thus returns an
  answer.  Given our assumptions, this value is stale.  For some
  client use cases, this kind of staleness may be OK in trade for
  fewer network messages per read \ldots so Machi may
  have a configurable option to permit it.
\item {\bf Option b}: The wrong side client must confirm that $P_1$ is
  in use by a full majority of chain members, including $F_z$.
\end{itemize}

Attempts using Option b will fail for one of two reasons.  First, if
the client can talk to a FLU that is using $P_2$, the client's
operation must be retried using $P_2$.  Second, the client will time
out talking to enough FLUs so that it fails to get a quorum's worth of
$P_1$ answers.  In either case, Option b will always fail a client
read and thus cannot return a stale value of $K$.
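
A minimal sketch of the Option b check follows, assuming a hypothetical
per-server epoch query in the style of the {\tt check\_epoch} call
described in the next subsection; the module name, function names, and
return values here are illustrative assumptions, not the actual Machi
client API.

\begin{verbatim}
%% Illustrative sketch only: names and return values are assumptions,
%% not the actual Machi client API.  Before trusting a read served
%% under projection epoch Epoch, confirm that a majority of all chain
%% members (witness or full) still agree that Epoch is current.
-module(epoch_quorum).
-export([confirm/3]).

%% CheckFun(Member, Epoch) is expected to return:
%%   ok | {error, bad_epoch, NewerEpoch} | {error, timeout}
confirm(Members, Epoch, CheckFun) ->
    Majority = (length(Members) div 2) + 1,
    confirm(Members, Epoch, CheckFun, Majority, 0).

confirm(_Members, _Epoch, _CheckFun, Majority, Acks) when Acks >= Majority ->
    ok;                                    % a majority agrees on Epoch
confirm([], _Epoch, _CheckFun, _Majority, _Acks) ->
    {error, no_quorum};                    % caller must retry or give up
confirm([M | Rest], Epoch, CheckFun, Majority, Acks) ->
    case CheckFun(M, Epoch) of
        ok ->
            confirm(Rest, Epoch, CheckFun, Majority, Acks + 1);
        {error, bad_epoch, Newer} ->
            {retry_with_epoch, Newer};     % a newer projection exists
        {error, timeout} ->
            confirm(Rest, Epoch, CheckFun, Majority, Acks)
    end.
\end{verbatim}

Either failure mode described above maps onto a non-{\tt ok} return
value here: a {\tt bad\_epoch} reply forces a retry with the newer
epoch, and too many timeouts yield {\tt no\_quorum}, so a stale value
of $K$ is never returned.
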
+ +\subsection{Witness FLU data and protocol changes} + +Some small changes to the projection's data structure +are required (relative to the initial spec described in +\cite{machi-design}). The projection itself +needs new annotation to indicate the operating mode, AP mode or CP +mode. The state type notifies the chain manager how to +react in network partitions and how to calculate new, safe projection +transitions and which file repair mode to use +(Section~\ref{sec:repair-entire-files}). +Also, we need to label member FLU servers as full- or +witness-type servers. + +Write API requests are processed by witness servers in {\em almost but + not quite} no-op fashion. The only requirement of a witness server +is to return correct interpretations of local projection epoch +numbers, via the {\tt error\_bad\_epoch} and {\tt error\_wedged} error +codes. In fact, a new API call is sufficient for querying witness +servers: {\tt \{check\_epoch, m\_epoch()\}}. +Any client write operation sends the {\tt + check\_\-epoch} API command to witness FLUs and sends the usual {\tt + write\_\-req} command to full FLUs. + +\section{The safety of projection epoch transitions} +\label{sec:safety-of-transitions} + +Machi uses the projection epoch transition algorithm and +implementation from CORFU, which is believed to be safe. However, +CORFU assumes a single, external, strongly consistent projection +store. Further, CORFU assumes that new projections are calculated by +an oracle that the rest of the CORFU system agrees is the sole agent +for creating new projections. Such an assumption is impractical for +Machi's intended purpose. + +Machi could use Riak Ensemble or ZooKeeper as an oracle (or perhaps as a oracle +coordinator), but we wish to keep Machi free of big external +dependencies. We would also like to see Machi be able to +operate in an ``AP mode'', which means providing service even +if all network communication to an oracle is broken. + +The model of projection calculation and storage described in +Section~\ref{sec:projections} allows for each server to operate +independently, if necessary. This autonomy allows the server in AP +mode to +always accept new writes: new writes are written to unique file names +and unique file offsets using a chain consisting of only a single FLU, +if necessary. How is this possible? Let's look at a scenario in +Section~\ref{sub:split-brain-scenario}. + +\subsection{A split brain scenario} +\label{sub:split-brain-scenario} + +\begin{enumerate} + +\item Assume 3 Machi FLUs, all in good health and perfect data sync: $[F_a, + F_b, F_c]$ using projection epoch $P_p$. + +\item Assume data $D_0$ is written at offset $O_0$ in Machi file + $F_0$. + +\item Then a network partition happens. Servers $F_a$ and $F_b$ are + on one side of the split, and server $F_c$ is on the other side of + the split. We'll call them the ``left side'' and ``right side'', + respectively. + +\item On the left side, $F_b$ calculates a new projection and writes + it unanimously (to two projection stores) as epoch $P_B+1$. The + subscript $_B$ denotes a + version of projection epoch $P_{p+1}$ that was created by server $F_B$ + and has a unique checksum (used to detect differences after the + network partition heals). + +\item In parallel, on the right side, $F_c$ calculates a new + projection and writes it unanimously (to a single projection store) + as epoch $P_c+1$. 
+

\item In parallel, a client on the left side writes data $D_1$
  at offset $O_1$ in Machi file $F_1$, and also
  a client on the right side writes data $D_2$
  at offset $O_2$ in Machi file $F_2$.  We know that $F_1 \ne F_2$
  because each sequencer is forced to choose disjoint filenames from
  any prior epoch whenever a new projection is available.

\end{enumerate}

Now, what happens when various clients attempt to read data values
$D_0$, $D_1$, and $D_2$?

\begin{itemize}
\item All clients can read $D_0$.
\item Clients on the left side can read $D_1$.
\item Attempts by clients on the right side to read $D_1$ will get
  {\tt error\_unavailable}.
\item Clients on the right side can read $D_2$.
\item Attempts by clients on the left side to read $D_2$ will get
  {\tt error\_unavailable}.
\end{itemize}

The {\tt error\_unavailable} result is not an error in the CAP Theorem
sense: it is a valid and affirmative response.  In both cases, the
system on the client's side definitely knows that the cluster is
partitioned.  If Machi were not a write-once store, perhaps there
might be an old/stale value to read on the local side of the network
partition \ldots but the system also knows definitely that no
old/stale value exists.  Therefore Machi remains available in the
CAP Theorem sense for both writes and reads.

We know that all files $F_0$,
$F_1$, and $F_2$ are disjoint and can be merged (in a manner analogous
to set union) onto each server in $[F_a, F_b, F_c]$ safely
when the network partition is healed.  However,
unlike pure theoretical set union, Machi's data merge \& repair
operations must operate within some constraints that are designed to
prevent data loss.

\subsection{Aside: defining data availability and data loss}
\label{sub:define-availability}

Let's take a moment to be clear about definitions:

\begin{itemize}
\item ``data is available at time $T$'' means that data is available
  for reading at $T$: the Machi cluster knows for certain that the
  requested data either has not been written, or it has been written
  and has a single value.
\item ``data is unavailable at time $T$'' means that data is
  unavailable for reading at $T$ due to temporary circumstances,
  e.g., a network partition.  If a read request is issued at some
  later time, the data will be available.
\item ``data is lost at time $T$'' means that data is permanently
  unavailable at $T$ and at all times after $T$.
\end{itemize}

Chain Replication is a fantastic technique for managing the
consistency of data across a number of whole replicas.  There are,
however, cases where CR can indeed lose data.

\subsection{Data loss scenario \#1: too few servers}
\label{sub:data-loss1}

If the chain is $N$ servers long, and if all $N$ servers fail, then
of course data is unavailable.  However, if all $N$ fail
permanently, then data is lost.

If the administrator had intended to avoid data loss after $N$
failures, then the administrator would have provisioned a Machi
cluster with at least $N+1$ servers.

\subsection{Data Loss scenario \#2: bogus configuration change sequence}
\label{sub:data-loss2}

Assume that the sequence of events in Figure~\ref{fig:data-loss2} takes place.

\begin{figure}
\begin{enumerate}
%% NOTE: the following list is 9 items long.  We use that fact later, see
%% string YYY9 in a comment further below.  If the length of this list
%% changes, then the counter reset below needs adjustment.
\item Projection $P_p$ says that chain membership is $[F_a]$.
+
\item A write of data $D$ to file $F$ at offset $O$ is successful.
\item Projection $P_{p+1}$ says that chain membership is $[F_a,F_b]$, via
  an administration API request.
\item Machi will trigger repair operations, copying any missing data
  files from FLU $F_a$ to FLU $F_b$.  For the purpose of this
  example, the sync operation for file $F$'s data and metadata has
  not yet started.
\item FLU $F_a$ crashes.
\item The chain manager on $F_b$ notices $F_a$'s crash,
  decides to create a new projection $P_{p+2}$ where chain membership is
  $[F_b]$, and successfully stores $P_{p+2}$ in its local store.
  FLU $F_b$ is now wedged.
\item FLU $F_a$ is down, therefore the
  value of $P_{p+2}$ is unanimous for all currently available FLUs
  (namely $[F_b]$).
\item FLU $F_b$ sees that projection $P_{p+2}$ is the newest unanimous
  projection.  It unwedges itself and continues operation using $P_{p+2}$.
\item Data $D$ is definitely unavailable for now.  Perhaps it is lost forever?
\end{enumerate}
\caption{Data unavailability scenario with danger of permanent data loss}
\label{fig:data-loss2}
\end{figure}

At this point, the data $D$ is not available on $F_b$.  However, if
we assume that $F_a$ eventually returns to service, and that Machi
correctly acts to repair all data within its chain, then $D$ and all
of file $F$'s other contents will eventually be available.

However, if server $F_a$ never returns to service, then $D$ is lost.  The
Machi administration API must always warn the user that data loss is
possible.  In Figure~\ref{fig:data-loss2}'s scenario, the API must
warn the administrator in multiple ways that fewer than the full {\tt
  length(all\_members)} number of replicas are in full sync.

A careful reader should note that $D$ is also lost if step \#5 were
instead, ``The hardware that runs FLU $F_a$ was destroyed by fire.''
For any possible step following \#5, $D$ is lost.  This is data loss
for the same reason that the scenario of Section~\ref{sub:data-loss1}
happens: the administrator has not provisioned a sufficient number of
replicas.

Let's revisit Figure~\ref{fig:data-loss2}'s scenario yet again.  This
time, we add a final step at the end of the sequence:

\begin{enumerate}
\setcounter{enumi}{9} % YYY9
\item The administration API is used to change the chain
configuration to {\tt all\_members=$[F_b]$}.
\end{enumerate}

Step \#10 causes data loss.  Specifically, the only copy of file
$F$ is on FLU $F_a$.  By administration policy, FLU $F_a$ is now
permanently inaccessible.

The chain manager {\em must} keep track of all
repair operations and their status.  If such information is tracked by
all FLUs, then data loss caused by bogus administrator actions can be
prevented.  In this scenario, FLU $F_b$ knows that the $F_a \rightarrow
F_b$ repair has not yet finished and therefore it is unsafe to remove
$F_a$ from the cluster.

\subsection{Data Loss scenario \#3: chain replication repair done badly}
\label{sub:data-loss3}

It's quite possible to lose data through careless/buggy Chain
Replication chain configuration changes.  For example, in the split
brain scenario of Section~\ref{sub:split-brain-scenario}, we have two
pieces of data written to different ``sides'' of the split brain,
$D_1$ and $D_2$.  If the chain is naively reconfigured after the network
partition heals to be $[F_a=\emptyset,F_b=\emptyset,F_c=D_2],$\footnote{Where $\emptyset$
  denotes the unwritten value.} then $D_2$
is in danger of being lost.  Why?
The Update Propagation Invariant is violated.
+
Any Chain Replication read will be
directed to the tail, $F_c$.  The value exists there, so there is no
need to do any further work; the unwritten values at $F_a$ and $F_b$
will not be repaired.  If the $F_c$ server fails sometime
later, then $D_2$ will be lost.  The ``Chain Replication Repair''
section of \cite{machi-design} discusses
how data loss can be avoided after servers are added (or re-added) to
an active chain configuration.

\subsection{Summary}

We believe that maintaining the Update Propagation Invariant is a
hassle and a pain, but that the hassle and pain are well worth the
sacrifices required to maintain the invariant at all times.  It avoids
data loss in all cases where the U.P.~Invariant preserving chain
contains at least one FLU.

\bibliographystyle{abbrvnat}
\begin{thebibliography}{}
\softraggedright

\bibitem{elastic-chain-replication}
Abu-Libdeh, Hussam et al.
Leveraging Sharding in the Design of Scalable Replication Protocols.
Proceedings of the 4th Annual Symposium on Cloud Computing (SOCC'13), 2013.
{\tt http://www.ymsir.com/papers/sharding-socc.pdf}

\bibitem{corfu1}
Balakrishnan, Mahesh et al.
CORFU: A Shared Log Design for Flash Clusters.
Proceedings of the 9th USENIX Conference on Networked Systems Design
and Implementation (NSDI'12), 2012.
{\tt http://research.microsoft.com/pubs/157204/ corfumain-final.pdf}

\bibitem{corfu2}
Balakrishnan, Mahesh et al.
CORFU: A Distributed Shared Log.
ACM Transactions on Computer Systems, Vol. 31, No. 4, Article 10, December 2013.
{\tt http://www.snookles.com/scottmp/corfu/ corfu.a10-balakrishnan.pdf}

\bibitem{machi-design}
Basho Japan KK.
Machi: an immutable file store.
{\tt https://github.com/basho/machi/tree/ master/doc/high-level-machi.pdf}

\bibitem{was}
Calder, Brad et al.
Windows Azure Storage: A Highly Available Cloud Storage Service with Strong Consistency.
Proceedings of the 23rd ACM Symposium on Operating Systems Principles (SOSP'11), 2011.
{\tt http://sigops.org/sosp/sosp11/current/ 2011-Cascais/printable/11-calder.pdf}

\bibitem{cr-theory-and-practice}
Fritchie, Scott Lystig.
Chain Replication in Theory and in Practice.
Proceedings of the 9th ACM SIGPLAN Workshop on Erlang (Erlang'10), 2010.
{\tt http://www.snookles.com/scott/publications/ erlang2010-slf.pdf}

\bibitem{the-log-what}
Kreps, Jay.
The Log: What every software engineer should know about real-time data's unifying abstraction.
{\tt http://engineering.linkedin.com/distributed-
  systems/log-what-every-software-engineer-should-
  know-about-real-time-datas-unifying}

\bibitem{kafka}
Kreps, Jay et al.
Kafka: a distributed messaging system for log processing.
NetDB'11, 2011.
{\tt http://research.microsoft.com/en-us/UM/people/
  srikanth/netdb11/netdb11papers/netdb11-final12.pdf}

\bibitem{paxos-made-simple}
Lamport, Leslie.
Paxos Made Simple.
ACM SIGACT News \#4, December 2001.
{\tt http://research.microsoft.com/users/ lamport/pubs/paxos-simple.pdf}

\bibitem{random-slicing}
Miranda, Alberto et al.
Random Slicing: Efficient and Scalable Data Placement for Large-Scale Storage Systems.
ACM Transactions on Storage, Vol. 10, No. 3, Article 9, July 2014.
{\tt http://www.snookles.com/scottmp/corfu/random- slicing.a9-miranda.pdf}

\bibitem{porcupine}
Saito, Yasushi et al.
Manageability, availability and performance in Porcupine: a highly scalable, cluster-based mail service.
Proceedings of the 17th ACM Symposium on Operating Systems Principles (SOSP'99), 1999.
+{\tt http://homes.cs.washington.edu/\%7Elevy/ porcupine.pdf} + +\bibitem{chain-replication} +van Renesse, Robbert et al. +Chain Replication for Supporting High Throughput and Availability. +Proceedings of the 6th Conference on Symposium on Operating Systems +Design \& Implementation (OSDI'04) - Volume 6, 2004. +{\tt http://www.cs.cornell.edu/home/rvr/papers/ osdi04.pdf} + +\end{thebibliography} + + +\end{document} diff --git a/doc/src.high-level/high-level-machi.tex b/doc/src.high-level/high-level-machi.tex index 77fd142..587ed2e 100644 --- a/doc/src.high-level/high-level-machi.tex +++ b/doc/src.high-level/high-level-machi.tex @@ -41,8 +41,9 @@ This document was first written during the autumn of 2014 for a Basho-only internal audience. Since its original drafts, Machi has been designated by Basho as a full open source software project. This document has been rewritten in 2015 to address an external audience. -Furthermore, many strong consistency design elements have been removed -and will appear later in separate documents. +Furthermore, discussion of the ``chain manager'' service and of strong +consistency have been moved to a separate document, please see +\cite{machi-chain-manager-design}. \section{Abstract} \label{sec:abstract} @@ -298,7 +299,7 @@ the per-append checksums described in Section~\ref{sub:bit-rot} \end{itemize} \subsubsection{File replica management via Chain Replication} -\label{sub-chain-replication} +\label{sub:chain-replication} Machi uses Chain Replication (CR) internally to maintain file replicas and inter-replica consistency. @@ -429,9 +430,9 @@ This section presents the major architectural components. They are: \item The Projection Store: a write-once key-value blob store, used by Machi for storing projections. (Section \ref{sub:proj-store}) -\item The auto-administration monitor: monitors the health of the +\item The chain manager: monitors the health of the chain and calculates new projections when failure is detected. -(Section \ref{sub:auto-admin}) +(Section \ref{sub:chain-manager}) \end{itemize} Also presented here are the major concepts used by Machi components: @@ -486,7 +487,7 @@ is sufficient for discussion purposes. -type m_rerror() :: m_err_r() m_generr(). -type m_werror() :: m_generr() | m_err_w(). --spec fill(m_name(), m_offset(), integer(), m_epoch()) -> ok | m_fill_err | +-spec fill(m_name(), m_offset(), integer(), m_epoch()) -> ok | m_fill_err() | m_werror(). -spec list_files() -> {ok, [m_file_info()]} | m_generr(). -spec read(m_name(), m_offset(), integer(), m_epoch()) -> {ok, binary()} | m_rerror(). @@ -639,10 +640,8 @@ client. Astute readers may theorize that race conditions exist in such management; see Section~\ref{sec:projections} for details and restrictions that make it practical. -\subsection{The auto-administration monitor} -\label{sub:auto-admin} - -NOTE: This needs a better name. +\subsection{The chain manager} +\label{sub:chain-manager} Each FLU runs an administration agent that is responsible for monitoring the health of the entire Machi cluster. If a change of @@ -660,7 +659,8 @@ administration API), zero or more actions may be taken: \item Exit wedge state. \end{itemize} -See also Section~\ref{sec:projections}. +See also Section~\ref{sec:projections} and also the Chain Manager +design document \cite{machi-chain-manager-design}. \subsection{The Projection and the Projection Epoch Number} \label{sub:projection} @@ -673,18 +673,16 @@ the Epoch Projection Number (or more simply ``the epoch''). 
\begin{figure} \begin{verbatim} --type m_server_info() :: {Hostname, Port, ...}. - +-type m_server_info() :: {Hostname, Port,...}. -record(projection, { epoch_number :: m_epoch_n(), epoch_csum :: m_csum(), - prev_epoch_num :: m_epoch_n(), - prev_epoch_csum :: m_csum(), creation_time :: now(), author_server :: m_server(), all_members :: [m_server()], - active_repaired :: [m_server()], + active_upi :: [m_server()], active_all :: [m_server()], + down_members :: [m_server()], dbg_annotations :: proplist() }). \end{verbatim} @@ -693,8 +691,8 @@ the Epoch Projection Number (or more simply ``the epoch''). \end{figure} Projections are calculated by each FLU using input from local -measurement data, calculations by the FLU's auto-administration -monitor (see below), and input from the administration API. +measurement data, calculations by the FLU's chain manager +(see below), and input from the administration API. Each time that the configuration changes (automatically or by administrator's request), a new epoch number is assigned to the entire configuration data structure and is distributed to @@ -702,9 +700,29 @@ all FLUs via the FLU's administration API. Each FLU maintains the current projection epoch number as part of its soft state. Pseudo-code for the projection's definition is shown in -Figure~\ref{fig:projection}. -See also Section~\ref{sub:flu-divergence} for discussion of the -projection epoch checksum. +Figure~\ref{fig:projection}. To summarize the major components: + +\begin{itemize} +\item {\tt creation\_time} Wall-clock time, useful for humans and + general debugging effort. +\item {\tt author\_server} Name of the server that calculated the projection. +\item {\tt all\_members} All servers in the chain, regardless of current + operation status. If all operating conditions are perfect, the + chain should operate in the order specified here. + (See also the limitations in \cite{machi-chain-manager-design}, + ``Whole-file repair when changing FLU ordering within a chain''.) +\item {\tt active\_upi} All active chain members that we know are + fully repaired/in-sync with each other and therefore the Update + Propagation Invariant \cite{machi-chain-manager-design} is always true. + See also Section~\ref{sec:repair}. +\item {\tt active\_all} All active chain members, including those that + are under active repair procedures. +\item {\tt down\_members} All members that the {\tt author\_server} + believes are currently down or partitioned. +\item {\tt dbg\_annotations} A ``kitchen sink'' proplist, for code to + add any hints for why the projection change was made, delay/retry + information, etc. +\end{itemize} \subsection{The Bad Epoch Error} \label{sub:bad-epoch} @@ -741,7 +759,7 @@ and behavior, a.k.a.~``AP mode''. However, with only small modifications, Machi can operate in a strongly consistent manner, a.k.a.~``CP mode''. -The auto-administration service (Section \ref{sub:auto-admin}) is +The chain manager service (Section \ref{sub:chain-manager}) is sufficient for an ``AP Mode'' Machi service. In AP Mode, all mutations to any file on any side of a network partition are guaranteed to use unique locations (file names and/or byte offsets). When network @@ -751,7 +769,7 @@ the footnote of Section~\ref{ssec:just-rsync-it}) in any order without conflict. ``CP mode'' will be extensively covered in other documents. 
In summary, -to support ``CP mode'', we believe that the auto-administra\-tion +to support ``CP mode'', we believe that the chain manager service proposed here can guarantee strong consistency at all times. @@ -761,8 +779,24 @@ at all times. \subsection{Single operation: append a single sequence of bytes to a file} \label{sec:sketch-append} +%% NOTE: append-whiteboard.eps was created by 'jpeg2ps'. +\begin{figure*}[htp] +\resizebox{\textwidth}{!}{ + \includegraphics[width=\textwidth]{figure6} + %% \includegraphics[width=\textwidth]{append-whiteboard} + } +\caption{Flow diagram: append 123 bytes onto a file with prefix {\tt "foo"}.} +\label{fig:append-flow} +\end{figure*} + To write/append atomically a single sequence/hunk of bytes to a file, here's the sequence of steps required. +See Figure~\ref{fig:append-flow} for a diagram showing an example +append; the same example is also shown in +Figure~\ref{fig:append-flowMSC} using MSC style (message sequence chart). +In +this case, the first FLU contacted has a newer projection epoch, +$P_{13}$, than the $P_{12}$ epoch that the client first attempts to use. \begin{enumerate} @@ -814,7 +848,7 @@ successful. \item If a FLU server $FLU$ is unavailable, notify another up/available chain member that $FLU$ appears unavailable. This info may be used by -the auto-administration monitor to change projections. If the client +the chain manager service to change projections. If the client wishes, it may retry the append op or perhaps wait until a new projection is available. @@ -841,23 +875,6 @@ things has happened: \end{enumerate} -%% NOTE: append-whiteboard.eps was created by 'jpeg2ps'. -\begin{figure*}[htp] -\resizebox{\textwidth}{!}{ - \includegraphics[width=\textwidth]{figure6} - %% \includegraphics[width=\textwidth]{append-whiteboard} - } -\caption{Flow diagram: append 123 bytes onto a file with prefix {\tt "foo"}.} -\label{fig:append-flow} -\end{figure*} - -See Figure~\ref{fig:append-flow} for a diagram showing an example -append; the same example is also shown in -Figure~\ref{fig:append-flowMSC} using MSC style (message sequence chart). -In -this case, the first FLU contacted has a newer projection epoch, -$P_{13}$, than the $P_{12}$ epoch that the client first attempts to use. - \subsection{TODO: Single operation: reading a chunk of bytes from a file} \label{sec:sketch-read} @@ -865,7 +882,7 @@ $P_{13}$, than the $P_{12}$ epoch that the client first attempts to use. \label{sec:projections} Machi uses a ``projection'' to determine how its Chain Replication replicas -should operate; see Section~\ref{sub-chain-replication} and +should operate; see Section~\ref{sub:chain-replication} and \cite{corfu1}. At runtime, a cluster must be able to respond both to administrative changes (e.g., substituting a failed server box with replacement hardware) as well as local network conditions (e.g., is @@ -874,229 +891,8 @@ from CORFU but has a longer history, e.g., the Hibari key-value store \cite{cr-theory-and-practice} and goes back in research for decades, e.g., Porcupine \cite{porcupine}. -\subsection{Phases of projection change} - -Machi's use of projections is in four discrete phases and are -discussed below: network monitoring, -projection calculation, projection storage, and -adoption of new projections. - -\subsubsection{Network monitoring} -\label{sub:network-monitoring} - -Monitoring of local network conditions can be implemented in many -ways. None are mandatory, as far as this RFC is concerned. 
-Easy-to-maintain code should be the primary driver for any -implementation. Early versions of Machi may use some/all of the -following techniques: - -\begin{itemize} -\item Internal ``no op'' FLU-level protocol request \& response. -\item Use of distributed Erlang {\tt net\_ticktime} node monitoring -\item Explicit connections of remote {\tt epmd} services, e.g., to -tell the difference between a dead Erlang VM and a dead -machine/hardware node. -\item Network tests via ICMP {\tt ECHO\_REQUEST}, a.k.a. {\tt ping(8)} -\end{itemize} - -Output of the monitor should declare the up/down (or -available/unavailable) status of each server in the projection. Such -Boolean status does not eliminate ``fuzzy logic'' or probabilistic -methods for determining status. Instead, hard Boolean up/down status -decisions are required by the projection calculation phase -(Section~\ref{subsub:projection-calculation}). - -\subsubsection{Projection data structure calculation} -\label{subsub:projection-calculation} - -Each Machi server will have an independent agent/process that is -responsible for calculating new projections. A new projection may be -required whenever an administrative change is requested or in response -to network conditions (e.g., network partitions). - -Projection calculation will be a pure computation, based on input of: - -\begin{enumerate} -\item The current projection epoch's data structure -\item Administrative request (if any) -\item Status of each server, as determined by network monitoring -(Section~\ref{sub:network-monitoring}). -\end{enumerate} - -All decisions about {\em when} to calculate a projection must be made -using additional runtime information. Administrative change requests -probably should happen immediately. Change based on network status -changes may require retry logic and delay/sleep time intervals. - -Some of the items in Figure~\ref{fig:projection}'s sketch include: - -\begin{itemize} -\item {\tt prev\_epoch\_num} and {\tt prev\_epoch\_csum} The previous - projection number and checksum, respectively. -\item {\tt creation\_time} Wall-clock time, useful for humans and - general debugging effort. -\item {\tt author\_server} Name of the server that calculated the projection. -\item {\tt all\_members} All servers in the chain, regardless of current - operation status. If all operating conditions are perfect, the - chain should operate in the order specified here. - (See also the limitations in Section~\ref{sub:repair-chain-re-ordering}.) -\item {\tt active\_repaired} All active chain members that we know are - fully repaired/in-sync with each other and therefore the Update - Propagation Invariant (Section~\ref{sub:cr-proof}) is always true. - See also Section~\ref{sec:repair}. -\item {\tt active\_all} All active chain members, including those that - are under active repair procedures. -\item {\tt dbg\_annotations} A ``kitchen sink'' proplist, for code to - add any hints for why the projection change was made, delay/retry - information, etc. -\end{itemize} - -\subsection{Projection storage: writing} -\label{sub:proj-storage-writing} - -All projection data structures are stored in the write-once Projection -Store (Section~\ref{sub:proj-store}) that is run by each FLU -(Section~\ref{sub:flu}). - -Writing the projection follows the two-step sequence below. -In cases of writing -failure at any stage, the process is aborted. 
The most common case is -{\tt error\_written}, which signifies that another actor in the system has -already calculated another (perhaps different) projection using the -same projection epoch number and that -read repair is necessary. Note that {\tt error\_written} may also -indicate that another actor has performed read repair on the exact -projection value that the local actor is trying to write! - -\begin{enumerate} -\item Write $P_{new}$ to the local projection store. This will trigger - ``wedge'' status in the local FLU, which will then cascade to other - projection-related behavior within the FLU. -\item Write $P_{new}$ to the remote projection store of {\tt all\_members}. - Some members may be unavailable, but that is OK. -\end{enumerate} - -(Recall: Other parts of the system are responsible for reading new -projections from other actors in the system and for deciding to try to -create a new projection locally.) - -\subsection{Projection storage: reading} -\label{sub:proj-storage-reading} - -Reading data from the projection store is similar in principle to -reading from a Chain Replication-managed FLU system. However, the -projection store does not require the strict replica ordering that -Chain Replication does. For any projection store key $K_n$, the -participating servers may have different values for $K_n$. As a -write-once store, it is impossible to mutate a replica of $K_n$. If -replicas of $K_n$ differ, then other parts of the system (projection -calculation and storage) are responsible for reconciling the -differences by writing a later key, -$K_{n+x}$ when $x>0$, with a new projection. - -Projection store reads are ``best effort''. The projection used is chosen from -all replica servers that are available at the time of the read. The -minimum number of replicas is only one: the local projection store -should always be available, even if no other remote replica projection -stores are available. - -For any key $K$, different projection stores $S_a$ and $S_b$ may store -nothing (i.e., {\tt error\_unwritten} when queried) or store different -values, $P_a \ne P_b$, despite having the same projection epoch -number. The following ranking rules are used to -determine the ``best value'' of a projection, where highest rank of -{\em any single projection} is considered the ``best value'': - -\begin{enumerate} -\item An unwritten value is ranked at a value of $-1$. -\item A value whose {\tt author\_server} is at the $I^{th}$ position - in the {\tt all\_members} list has a rank of $I$. -\item A value whose {\tt dbg\_annotations} and/or other fields have - additional information may increase/decrease its rank, e.g., - increase the rank by $10.25$. -\end{enumerate} - -Rank rules \#2 and \#3 are intended to avoid worst-case ``thrashing'' -of different projection proposals. - -The concept of ``read repair'' of an unwritten key is the same as -Chain Replication's. If a read attempt for a key $K$ at some server -$S$ results in {\tt error\_unwritten}, then all of the other stores in -the {\tt \#projection.all\_members} list are consulted. If there is a -unanimous value $V_{u}$ elsewhere, then $V_{u}$ is use to repair all -unwritten replicas. If the value of $K$ is not unanimous, then the -``best value'' $V_{best}$ is used for the repair. If all respond with -{\tt error\_unwritten}, repair is not required. - -\subsection{Adoption of new projections} - -The projection store's ``best value'' for the largest written epoch -number at the time of the read is projection used by the FLU. 
-If the read attempt for projection $P_p$ -also yields other non-best values, then the -projection calculation subsystem is notified. This notification -may/may not trigger a calculation of a new projection $P_{p+1}$ which -may eventually be stored and so -resolve $P_p$'s replicas' ambiguity. - -\subsubsection{Alternative implementations: Hibari's ``Admin Server'' - and Elastic Chain Replication} - -See Section 7 of \cite{cr-theory-and-practice} for details of Hibari's -chain management agent, the ``Admin Server''. In brief: - -\begin{itemize} -\item The Admin Server is intentionally a single point of failure in - the same way that the instance of Stanchion in a Riak CS cluster - is an intentional single - point of failure. In both cases, strict - serialization of state changes is more important than 100\% - availability. - -\item For higher availability, the Hibari Admin Server is usually - configured in an active/standby manner. Status monitoring and - application failover logic is provided by the built-in capabilities - of the Erlang/OTP application controller. - -\end{itemize} - -Elastic chain replication is a technique described in -\cite{elastic-chain-replication}. It describes using multiple chains -to monitor each other, as arranged in a ring where a chain at position -$x$ is responsible for chain configuration and management of the chain -at position $x+1$. This technique is likely the fall-back to be used -in case the chain management method described in this RFC proves -infeasible. - -\subsection{Likely problems and possible solutions} -\label{sub:likely-problems} - -There are some unanswered questions about Machi's proposed chain -management technique. The problems that we guess are likely/possible -include: - -\begin{itemize} - -\item Thrashing or oscillating between a pair (or more) of - projections. It's hoped that the ``best projection'' ranking system - will be sufficient to prevent endless thrashing of projections, but - it isn't yet clear that it will be. - -\item Partial (and/or one-way) network splits which cause partially - connected graphs of inter-node connectivity. Groups of nodes that - are completely isolated aren't a problem. However, partially - connected groups of nodes is an unknown. Intuition says that - communication (via the projection store) with ``bridge nodes'' in a - partially-connected network ought to settle eventually on a - projection with high rank, e.g., the projection on an island - subcluster of nodes with the largest author node name. Some corner - case(s) may exist where this intuition is not correct. - -\item CP Mode management via the method proposed in - Section~\ref{sec:split-brain-management} may not be sufficient in - all cases. - -\end{itemize} +See \cite{machi-chain-manager-design} for the design and discussion of +all aspects of projection management and storage. \section{Chain Replication repair: how to fix servers after they crash and return to service} @@ -1111,135 +907,6 @@ data loss with chain replication is summarized in this section, followed by a discussion of Machi-specific details that must be included in any production-quality implementation. -{\bf NOTE:} Beginning with Section~\ref{sub:repair-entire-files}, the -techniques presented here are novel and not described (to the best of -our knowledge) in other papers or public open source software. -Reviewers should give this new stuff -{\em an extremely careful reading}. 
All novelty in this section and -also in the projection management techniques of -Section~\ref{sec:projections} must be the first things to be -thoroughly vetted with tools such as Concuerror, QuickCheck, TLA+, -etc. - -\subsection{Chain Replication: proof of correctness} -\label{sub:cr-proof} - -\begin{quote} -``You want the truth? You can't handle the truth!'' -\par -\hfill{ --- Colonel Jessep, ``A Few Good Men'', 2002} -\end{quote} - -See Section~3 of \cite{chain-replication} for a proof of the -correctness of Chain Replication. A short summary is provide here. -Readers interested in good karma should read the entire paper. - -The three basic rules of Chain Replication and its strong -consistency guarantee: - -\begin{enumerate} - -\item All replica servers are arranged in an ordered list $C$. - -\item All mutations of a datum are performed upon each replica of $C$ - strictly in the order which they appear in $C$. A mutation is considered - completely successful if the writes by all replicas are successful. - -\item The head of the chain makes the determination of the order of - all mutations to all members of the chain. If the head determines - that some mutation $M_i$ happened before another mutation $M_j$, - then mutation $M_i$ happens before $M_j$ on all other members of - the chain.\footnote{While necesary for general Chain Replication, - Machi does not need this property. Instead, the property is - provided by Machi's sequencer and the write-once register of each - byte in each file.} - -\item All read-only operations are performed by the ``tail'' replica, - i.e., the last replica in $C$. - -\end{enumerate} - -The basis of the proof lies in a simple logical trick, which is to -consider the history of all operations made to any server in the chain -as a literal list of unique symbols, one for each mutation. - -Each replica of a datum will have a mutation history list. We will -call this history list $H$. For the $i^{th}$ replica in the chain list -$C$, we call $H_i$ the mutation history list for the $i^{th}$ replica. - -Before the $i^{th}$ replica in the chain list begins service, its mutation -history $H_i$ is empty, $[]$. After this replica runs in a Chain -Replication system for a while, its mutation history list grows to -look something like -$[M_0, M_1, M_2, ..., M_{m-1}]$ where $m$ is the total number of -mutations of the datum that this server has processed successfully. - -Let's assume for a moment that all mutation operations have stopped. -If the order of the chain was constant, and if all mutations are -applied to each replica in the chain's order, then all replicas of a -datum will have the exact same mutation history: $H_i = H_J$ for any -two replicas $i$ and $j$ in the chain -(i.e., $\forall i,j \in C, H_i = H_J$). That's a lovely property, -but it is much more interesting to assume that the service is -not stopped. Let's look next at a running system. 
- -\begin{figure*} -\centering -\begin{tabular}{ccc} -{\bf {{On left side of $C$}}} & & {\bf On right side of $C$} \\ -\hline -\multicolumn{3}{l}{Looking at replica order in chain $C$:} \\ -$i$ & $<$ & $j$ \\ - -\multicolumn{3}{l}{For example:} \\ - -0 & $<$ & 2 \\ -\hline -\multicolumn{3}{l}{It {\em must} be true: history lengths per replica:} \\ -length($H_i$) & $\geq$ & length($H_j$) \\ -\multicolumn{3}{l}{For example, a quiescent chain:} \\ -48 & $\geq$ & 48 \\ -\multicolumn{3}{l}{For example, a chain being mutated:} \\ -55 & $\geq$ & 48 \\ -\multicolumn{3}{l}{Example ordered mutation sets:} \\ -$[M_0,M_1,\ldots,M_{46},M_{47},\ldots,M_{53},M_{54}]$ & $\supset$ & $[M_0,M_1,\ldots,M_{46},M_{47}]$ \\ -\multicolumn{3}{c}{\bf Therefore the right side is always an ordered - subset} \\ -\multicolumn{3}{c}{\bf of the left side. Furthermore, the ordered - sets on both} \\ -\multicolumn{3}{c}{\bf sides have the exact same order of those elements they have in common.} \\ -\multicolumn{3}{c}{The notation used by the Chain Replication paper is -shown below:} \\ -$[M_0,M_1,\ldots,M_{46},M_{47},\ldots,M_{53},M_{54}]$ & $\succeq$ & $[M_0,M_1,\ldots,M_{46},M_{47}]$ \\ - -\end{tabular} -\caption{A demonstration of Chain Replication protocol history ``Update Propagation Invariant''.} -\label{tab:chain-order} -\end{figure*} - -If the entire chain $C$ is processing any number of concurrent -mutations, then we can still understand $C$'s behavior. -Figure~\ref{tab:chain-order} shows us two replicas in chain $C$: -replica $R_i$ that's on the left/earlier side of the replica chain $C$ -than some other replica $R_j$. We know that $i$'s position index in -the chain is smaller than $j$'s position index, so therefore $i < j$. -The restrictions of Chain Replication make it true that length($H_i$) -$\ge$ length($H_j$) because it's also that $H_i \supset H_j$, i.e, -$H_i$ on the left is always is a superset of $H_j$ on the right. - -When considering $H_i$ and $H_j$ as strictly ordered lists, we have -$H_i \succeq H_j$, where the right side is always an exact prefix of the left -side's list. This prefixing propery is exactly what strong -consistency requires. If a value is read from the tail of the chain, -then no other chain member can have a prior/older value because their -respective mutations histories cannot be shorter than the tail -member's history. - -\paragraph{``Update Propagation Invariant''} -is the original chain replication paper's name for the -$H_i \succeq H_j$ -property. This paper will use the same name. - \subsection{When to trigger read repair of single values} Assume now that some client $X$ wishes to fetch a datum that's managed @@ -1265,7 +932,25 @@ Let's explore each of these responses in the following subsections. There are only a few reasons why this value is possible. All are discussed here. -\paragraph{Scenario: A client $X_w$ has received a sequencer's +\paragraph{Scenario 1: The block truly hasn't been written yet} + +A read from any other server in the chain will also yield {\tt + error\_unwritten}. + +\paragraph{Scenario 2: The block has not yet finished being written} + +A read from any other server in the chain may yield {\tt + error\_unwritten} or may find written data. (In this scenario, the +head server has written data; we don't know the state of the middle +and tail server(s).) The client ought to perform read repair of this +data. (See also, scenario \#4 below.) + +During read repair, the client's writes operations may race with the +original writer's operations. 
However, both the original writer and +the repairing client are always writing the same data. Therefore, +data corruption by conflicting client writes is not possible. + +\paragraph{Scenario 3: A client $X_w$ has received a sequencer's assignment for this location, but the client has crashed somewhere in the middle of writing the value to the chain.} @@ -1273,14 +958,13 @@ discussed here. The correct action to take here depends on the value of the $R_{head}$ replica's value. If $R_{head}$'s value is unwritten, then the writing client $X_w$ crashed before writing to $R_{head}$. The reading client -$X_r$ must ``fill'' the page with junk bytes (see -Section~\ref{sub:fill-single}) or else do nothing. +$X_r$ must ``fill'' the page with junk bytes or else do nothing. If $R_{head}$'s value is indeed written, then the reading client $X_r$ must finish a ``read repair'' operation before the client may proceed. See Section~\ref{sub:read-repair-single} for details. -\paragraph{Scenario: A client has received a sequencer's assignment for this +\paragraph{Scenario 4: A client has received a sequencer's assignment for this location, but the client has become extremely slow (or is experiencing a network partition, or any other reason) and has not yet updated $R_{tail}$ $\ldots$ but that client {\em will eventually @@ -1320,66 +1004,6 @@ repair operation is successful. If the read repair operation is not successful, then the client must react in the same manner as if the original read attempt of $R_{tail}$'s value had failed. -\subsection{How to ``fill'' a single value} -\label{sub:fill-single} - -A Machi FLU -implementation may (or may not) maintain enough metadata to be able to -unambiguously inform clients that a written value is the result of a -``fill'' operation. It is not yet clear if that information is value -enough for FLUs to maintain. - -A ``fill'' operation is simply writing a value of junk. The value of -the junk does not matter, as long as any client reading the value does -not mistake the junk for an application's legitimate data. For -example, the Erlang notation of {\tt <<0,0,0,\ldots>>} - -CORFU requires a fill operation to be able to meet its promise of -low-latency operation, in case of failure. Its use can be illustrated -in this sequence of events: - -\begin{enumerate} -\item Client $X$ obtains a position from the sequencer at offset $O$ - for a new log write of value $V_X$. -%% \item Client $Z$ obtains a position for a new log write from the -%% sequences at offset $O+1$. -\item Client $X$ pauses. The reason does not matter: a crash, a - network partition, garbage collection pause, gone scuba diving, etc. -\item Client $Y$ is reading the log forward and finds the entry at - offset $O$ is unwritten. A CORFU log is very strictly ordered, so - client $Y$ is blocked and cannot read any further in the log until - the status of offset $O$ has been unambiguously determined. -\item Client $Y$ attempts a fill operation on offset $O$ at the head - of the chain with value $V_{fill}$. - If this succeeds, then $Y$ and all other clients know - that a partial write is in progress, and the value is - fill bytes. If this fails because of {\tt error\_written}, then - client $Y$ knows that client $X$ isn't truly dead and that it has - lost a race with $X$: the head's value at offset $O$ is $V_x$. -\item Client $Y$ writes to the remaining members of the chain, - using the value at the chain's head, $V_x$ or $V_{fill}$. 
-\item Client $Y$ (and all other CORFU clients) now unambiguously know - the state of offset $O$: it is either a fully-written junk page - written by $Y$ or it is a fully-written page $V_x$ written by $X$. -\item If client $X$ has not crashed but is merely slow with any write - attempt to any chain member, $X$ may encounter {\tt error\_written} - responses. However, all values stored by that chain member must be - either $V_x$ or $V_{fill}$, and all chain members will agree on - which value it is. -\end{enumerate} - -A fill operation in Machi is {\em prohibited} at any time that split -brain runtime support is enabled (i.e., in AP mode). - -CORFU does not need such a restriction on ``fill'': CORFU always replaces -all of the repair destination's data, server $R_a$ in the figure, with -the repair source $R_a$'s data. (See also -Section~\ref{sub:repair-divergence}.) Machi must be able -to perform data repair of many 10s of TBytes of data very quickly; -CORFU's brute-force solution is not sufficient for Machi. Until a -work-around is found for Machi, fill operations will simply be -prohibited if split brain operation is enabled. - \subsection{Repair of entire files} \label{sub:repair-entire-files} @@ -1393,485 +1017,44 @@ There are some situations where repair of entire files is necessary. \item To avoid data loss when changing the order of the chain's servers. \end{itemize} -Both situations can set the stage for data loss in the future. -If a violation of the Update Propagation Invariant (see end of -Section~\ref{sub:cr-proof}) is permitted, then the strong consistency -guarantee of Chain Replication is violated. Because Machi uses -write-once registers, the number of possible strong consistency -violations is small: any client that witnesses a written $\rightarrow$ -unwritten transition is a violation of strong consistency. But -avoiding even this one bad scenario is a bit tricky. +The full file repair discussion in \cite{machi-chain-manager-design} +argues for correctness in both eventually consistent and strongly +consistent environments. Discussion in this section will be limited +to eventually consistent environments (``AP mode'') . -As explained in Section~\ref{sub:data-loss1}, data -unavailability/loss when all chain servers fail is unavoidable. We -wish to avoid data loss whenever a chain has at least one surviving -server. Another method to avoid data loss is to preserve the Update -Propagation Invariant at all times. - -\subsubsection{Just ``rsync'' it!} +\subsubsection{``Just `rsync' it!''} \label{ssec:just-rsync-it} -A simpler replication method might be perhaps 90\% sufficient. -That method could loosely be described as ``just {\tt rsync} -out of all files to all servers in an infinite loop.''\footnote{The - file format suggested in - Section~\ref{sub:on-disk-data-format} does not permit {\tt rsync} - as-is to be sufficient. A variation of {\tt rsync} would need to be - aware of the data/metadata split within each file and only replicate - the data section \ldots and the metadata would still need to be - managed outside of {\tt rsync}.} - -However, such an informal method -cannot tell you exactly when you are in danger of data loss and when -data loss has actually happened. If we maintain the Update -Propagation Invariant, then we know exactly when data loss is immanent -or has happened. - -Furthermore, we hope to use Machi for multiple use cases, including -ones that require strong consistency. -For uses such as CORFU, strong consistency is a non-negotiable -requirement. 
Therefore, we will use the Update Propagation Invariant -as the foundation for Machi's data loss prevention techniques. - -\subsubsection{Divergence from CORFU: repair} -\label{sub:repair-divergence} - -The original repair design for CORFU is simple and effective, -mostly. See Figure~\ref{fig:corfu-style-repair} for a full -description of the algorithm -Figure~\ref{fig:corfu-repair-sc-violation} for an example of a strong -consistency violation that can follow. (NOTE: This is a variation of -the data loss scenario that is described in -Figure~\ref{fig:data-loss2}.) - -\begin{figure} -\begin{enumerate} -\item Destroy all data on the repair destination FLU. -\item Add the repair destination FLU to the tail of the chain in a new - projection $P_{p+1}$. -\item Change projection from $P_p$ to $P_{p+1}$. -\item Let single item read repair fix all of the problems. -\end{enumerate} -\caption{Simplest CORFU-style repair algorithm.} -\label{fig:corfu-style-repair} -\end{figure} - -\begin{figure} -\begin{enumerate} -\item Write value $V$ to offset $O$ in the log with chain $[F_a]$. - This write is considered successful. -\item Change projection to configure chain as $[F_a,F_b]$. Prior to - the change, all values on FLU $F_b$ are unwritten. -\item FLU server $F_a$ crashes. The new projection defines the chain - as $[F_b]$. -\item A client attempts to read offset $O$ and finds an unwritten - value. This is a strong consistency violation. -%% \item The same client decides to fill $O$ with the junk value -%% $V_{junk}$. Now value $V$ is lost. -\end{enumerate} -\caption{An example scenario where the CORFU simplest repair algorithm - can lead to a violation of strong consistency.} -\label{fig:corfu-repair-sc-violation} -\end{figure} - -A variation of the repair -algorithm is presented in section~2.5 of a later CORFU paper \cite{corfu2}. -However, the re-use a failed -server is not discussed there, either: the example of a failed server -$F_6$ uses a new server, $F_8$ to replace $F_6$. Furthermore, the -repair process is described as: - -\begin{quote} -``Once $F_6$ is completely rebuilt on $F_8$ (by copying entries from - $F_7$), the system moves to projection (C), where $F_8$ is now used - to service all reads in the range $[40K,80K)$.'' -\end{quote} - -The phrase ``by copying entries'' does not give enough -detail to avoid the same data race as described in -Figure~\ref{fig:corfu-repair-sc-violation}. We believe that if -``copying entries'' means copying only written pages, then CORFU -remains vulnerable. If ``copying entries'' also means ``fill any -unwritten pages prior to copying them'', then perhaps the -vulnerability is eliminated.\footnote{SLF's note: Probably? This is my - gut feeling right now. However, given that I've just convinced - myself 100\% that fill during any possibility of split brain is {\em - not safe} in Machi, I'm not 100\% certain anymore than this ``easy'' - fix for CORFU is correct.}. - -\subsubsection{Whole-file repair as FLUs are (re-)added to a chain} -\label{sub:repair-add-to-chain} - -Machi's repair process must preserve the Update Propagation -Invariant. To avoid data races with data copying from -``U.P.~Invariant preserving'' servers (i.e. fully repaired with -respect to the Update Propagation Invariant) -to servers of unreliable/unknown state, a -projection like the one shown in -Figure~\ref{fig:repair-chain-of-chains} is used. In addition, the -operations rules for data writes and reads must be observed in a -projection of this type. 
 
-
-\begin{figure*}
-\centering
-$
-[\overbrace{\underbrace{H_1}_\textbf{Head of Heads}, M_{11},
-    \underbrace{T_1}_\textbf{Tail \#1}}^\textbf{Chain \#1 (U.P.~Invariant preserving)}
-\mid
-\overbrace{H_2, M_{21},
-    \underbrace{T_2}_\textbf{Tail \#2}}^\textbf{Chain \#2 (repairing)}
-\mid \ldots \mid
-\overbrace{H_n, M_{n1},
-    \underbrace{T_n}_\textbf{Tail \#n \& Tail of Tails ($T_{tails}$)}}^\textbf{Chain \#n (repairing)}
-]
-$
-\caption{Representation of a ``chain of chains'': a chain prefix of
-  Update Propagation Invariant preserving FLUs (``Chain \#1'')
-  with FLUs from $n-1$ other chains under repair.}
-\label{fig:repair-chain-of-chains}
-\end{figure*}
+The ``just {\tt rsync} it!'' method could loosely be described as,
+``run {\tt rsync} on all files to all servers.''  This simple repair
+method is nearly sufficient for Machi's eventual consistency
+mode of operation.  There's only one small problem that {\tt rsync}
+cannot handle by itself: handling late writes to a file.  It is
+possible that the same file could contain the following pattern of
+written and unwritten data:
 
 \begin{itemize}
-
-\item The system maintains the distinction between ``U.P.~preserving''
-  and ``repairing'' FLUs at all times.  This allows the system to
-  track exactly which servers are known to preserve the Update
-  Propagation Invariant and which servers may/may not.
-
-\item All ``repairing'' FLUs must be added only at the end of the
-  chain-of-chains.
-
-\item All write operations must flow successfully through the
-  chain-of-chains from beginning to end, i.e., from the ``head of
-  heads'' to the ``tail of tails''.  This rule also includes any
-  repair operations.
-
-\item In AP Mode, all read operations are attempted from the list of
-$[T_1,\-T_2,\-\ldots,\-T_n]$, where these FLUs are the tails of each of the
-chains involved in repair.
-In CP mode, all read operations are attempted only from $T_1$.
-The first reply of {\tt \{ok, <<...>>\}} is a correct answer;
-the rest of the FLU list can be ignored and the result returned to the
-client.  If all FLUs in the list have an unwritten value, then the
-client can return {\tt error\_unwritten}.
-
+\item Server $A$: $x$ bytes written, $y$ bytes unwritten
+\item Server $B$: $x$ bytes unwritten, $y$ bytes written
 \end{itemize}
 
-While the normal single-write and single-read operations are performed
-by the cluster, a file synchronization process is initiated.  The
-sequence of steps differs depending on the AP or CP mode of the system.
+If {\tt rsync} is used as-is to replicate this file, then one of the
+two written sections will be overwritten by NUL bytes.  Obviously, we
+don't want this kind of data loss.  However, we already have a
+requirement that Machi file servers must enforce write-once behavior
+on all file byte ranges.  The same data used to maintain written and
+unwritten state can be used to merge file state so that both the $x$
+and $y$ byte ranges will be correct after repair.
 
-\paragraph{In cases where the cluster is operating in CP Mode:}
+\subsubsection{The larger problem with ``Just `rsync' it!''}
 
-CORFU's repair method of ``just copy it all'' (from source FLU to repairing
-FLU) is correct, {\em except} for the small problem pointed out in
-Section~\ref{sub:repair-divergence}.  The problem for Machi is one of
-time \& space.  Machi wishes to avoid transferring data that is
-already correct on the repairing nodes.
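+
+To make the merge step above concrete, here is a minimal sketch of the
+bookkeeping involved.  It assumes, purely for illustration, that the
+written-chunk boundaries happen to be identical on both replicas (as in
+the $x$/$y$ example above); the module and function names are
+hypothetical and are not part of any Machi API.
+
+{\small
+\begin{verbatim}
+-module(merge_sketch).
+-export([merge_plan/2]).
+
+%% WrittenA and WrittenB are the written maps of two replicas of the
+%% same file: lists of {Offset, Size} byte ranges that each replica
+%% has already written.  The result says which ranges must be copied
+%% A->B and B->A so that both replicas end up with the union of all
+%% written ranges; unwritten ranges are never copied, so the
+%% write-once rule is preserved.
+merge_plan(WrittenA, WrittenB) ->
+    SetA = ordsets:from_list(WrittenA),
+    SetB = ordsets:from_list(WrittenB),
+    #{copy_a_to_b => ordsets:subtract(SetA, SetB),
+      copy_b_to_a => ordsets:subtract(SetB, SetA)}.
+\end{verbatim}
+}
+
+For example, {\tt merge\_plan([\{0,1024\}], [\{1024,512\}])} reports
+that the first 1024 bytes must be copied to replica $B$ and the next
+512 bytes to replica $A$; both destination ranges are unwritten, so
+neither copy violates write-once enforcement.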
If a Machi node is storing -20TBytes of data, we really do not wish to use 20TBytes of bandwidth -to repair only 1 GByte of truly-out-of-sync data. - -However, it is {\em vitally important} that all repairing FLU data be -clobbered/overwritten with exactly the same data as the Update -Propagation Invariant preserving chain. If this rule is not strictly -enforced, then fill operations can corrupt Machi file data. The -algorithm proposed is: - -\begin{enumerate} - -\item Change the projection to a ``chain of chains'' configuration - such as depicted in Figure~\ref{fig:repair-chain-of-chains}. - -\item For all files on all FLUs in all chains, extract the lists of - written/unwritten byte ranges and their corresponding file data - checksums. (The checksum metadata is not strictly required for - recovery in AP Mode.) - Send these lists to the tail of tails - $T_{tails}$, which will collate all of the lists into a list of - tuples such as {\tt \{FName, $O_{start}, O_{end}$, CSum, FLU\_List\}} - where {\tt FLU\_List} is the list of all FLUs in the entire chain of - chains where the bytes at the location {\tt \{FName, $O_{start}, - O_{end}$\}} are known to be written (as of the current repair period). - -\item For chain \#1 members, i.e., the - leftmost chain relative to Figure~\ref{fig:repair-chain-of-chains}, - repair files byte ranges for any chain \#1 members that are not members - of the {\tt FLU\_List} set. This will repair any partial - writes to chain \#1 that were unsuccessful (e.g., client crashed). - (Note however that this step only repairs FLUs in chain \#1.) - -\item For all file byte ranges in all files on all FLUs in all - repairing chains where Tail \#1's value is unwritten, force all - repairing FLUs to also be unwritten. - -\item For file byte ranges in all files on all FLUs in all repairing - chains where Tail \#1's value is written, send repair file byte data - \& metadata to any repairing FLU if the value repairing FLU's - value is unwritten or the checksum is not exactly equal to Tail \#1's - checksum. - -\end{enumerate} - -\begin{figure} -\centering -$ -[\overbrace{\underbrace{H_1}_\textbf{Head}, M_{11}, T_1, - H_2, M_{21}, T_2, - \ldots - H_n, M_{n1}, - \underbrace{T_n}_\textbf{Tail}}^\textbf{Chain (U.P.~Invariant preserving)} -] -$ -\caption{Representation of Figure~\ref{fig:repair-chain-of-chains} - after all repairs have finished successfully and a new projection has - been calculated.} -\label{fig:repair-chain-of-chains-finished} -\end{figure} - -When the repair is known to have copied all missing data successfully, -then the chain can change state via a new projection that includes the -repaired FLU(s) at the end of the U.P.~Invariant preserving chain \#1 -in the same order in which they appeared in the chain-of-chains during -repair. See Figure~\ref{fig:repair-chain-of-chains-finished}. - -The repair can be coordinated and/or performed by the $T_{tails}$ FLU -or any other FLU or cluster member that has spare capacity. - -There is no serious race condition here between the enumeration steps -and the repair steps. Why? Because the change in projection at -step \#1 will force any new data writes to adapt to a new projection. -Consider the mutations that either happen before or after a projection -change: - - -\begin{itemize} - -\item For all mutations $M_1$ prior to the projection change, the - enumeration steps \#3 \& \#4 and \#5 will always encounter mutation - $M_1$. 
Any repair must write through the entire chain-of-chains and - thus will preserve the Update Propagation Invariant when repair is - finished. - -\item For all mutations $M_2$ starting during or after the projection - change has finished, a new mutation $M_2$ may or may not be included in the - enumeration steps \#3 \& \#4 and \#5. - However, in the new projection, $M_2$ must be - written to all chain of chains members, and such - in-order writes will also preserve the Update - Propagation Invariant and therefore is also be safe. - -\end{itemize} - -%% Then the only remaining safety problem (as far as I can see) is -%% avoiding this race: - -%% \begin{enumerate} -%% \item Enumerate byte ranges $[B_0,B_1,\ldots]$ in file $F$ that must -%% be copied to the repair target, based on checksum differences for -%% those byte ranges. -%% \item A real-time concurrent write for byte range $B_x$ arrives at the -%% U.P.~Invariant preserving chain for file $F$ but was not a member of -%% step \#1's list of byte ranges. -%% \item Step \#2's update is propagated down the chain of chains. -%% \item Step \#1's clobber updates are propagated down the chain of -%% chains. -%% \item The value for $B_x$ is lost on the repair targets. -%% \end{enumerate} - -\paragraph{In cases the cluster is operating in AP Mode:} - -\begin{enumerate} -\item Follow the first two steps of the ``CP Mode'' - sequence (above). -\item Follow step \#3 of the ``strongly consistent mode'' sequence - (above), but in place of repairing only FLUs in Chain \#1, AP mode - will repair the byte range of any FLU that is not a member of the - {\tt FLU\_List} set. -\item End of procedure. -\end{enumerate} - -The end result is a huge ``merge'' where any -{\tt \{FName, $O_{start}, O_{end}$\}} range of bytes that is written -on FLU $F_w$ but missing/unwritten from FLU $F_m$ is written down the full chain -of chains, skipping any FLUs where the data is known to be written. -Such writes will also preserve Update Propagation Invariant when -repair is finished. - -\subsubsection{Whole-file repair when changing FLU ordering within a chain} -\label{sub:repair-chain-re-ordering} - -Changing FLU order within a chain is an operations optimization only. -It may be that the administrator wishes the order of a chain to remain -as originally configured during steady-state operation, e.g., -$[F_a,F_b,F_c]$. As FLUs are stopped \& restarted, the chain may -become re-ordered in a seemingly-arbitrary manner. - -It is certainly possible to re-order the chain, in a kludgy manner. -For example, if the desired order is $[F_a,F_b,F_c]$ but the current -operating order is $[F_c,F_b,F_a]$, then remove $F_b$ from the chain, -then add $F_b$ to the end of the chain. Then repeat the same -procedure for $F_c$. The end result will be the desired order. - -From an operations perspective, re-ordering of the chain -using this kludgy manner has a -negative effect on availability: the chain is temporarily reduced from -operating with $N$ replicas down to $N-1$. This reduced replication -factor will not remain for long, at most a few minutes at a time, but -even a small amount of time may be unacceptable in some environments. - -Reordering is possible with the introduction of a ``temporary head'' -of the chain. This temporary FLU does not need to be a full replica -of the entire chain --- it merely needs to store replicas of mutations -that are made during the chain reordering process. This method will -not be described here. 
However, {\em if reviewers believe that it should -be included}, please let the authors know. - -\paragraph{In both Machi operating modes:} -After initial implementation, it may be that the repair procedure is a -bit too slow. In order to accelerate repair decisions, it would be -helpful have a quicker method to calculate which files have exactly -the same contents. In traditional systems, this is done with a single -file checksum; see also Section~\ref{sub:detecting-corrupted}. -Machi's files can be written out-of-order from a file offset point of -view, which violates the order which the traditional method for -calculating a full-file hash. If we recall -Figure~\ref{fig:temporal-out-of-order}, the traditional method cannot -continue calculating the file checksum at offset 2 until the byte at -file offset 1 is written. - -It may be advantageous for each FLU to maintain for each file a -checksum of a canonical representation of the -{\tt \{$O_{start},O_{end},$ CSum\}} tuples that the FLU must already -maintain. Then for any two FLUs that claim to store a file $F$, if -both FLUs have the same hash of $F$'s written map + checksums, then -the copies of $F$ on both FLUs are the same. - -\section{``Split brain'' management in CP Mode} -\label{sec:split-brain-management} - -Split brain management is a thorny problem. The method presented here -is one based on pragmatics. If it doesn't work, there isn't a serious -worry, because Machi's first serious use case all require only AP Mode. -If we end up falling back to ``use Riak Ensemble'' or ``use ZooKeeper'', -then perhaps that's -fine enough. Meanwhile, let's explore how a -completely self-contained, no-external-dependencies -CP Mode Machi might work. - -Wikipedia's description of the quorum consensus solution\footnote{See - {\tt http://en.wikipedia.org/wiki/Split-brain\_(computing)}.} is nice -and short: - -\begin{quotation} -A typical approach, as described by Coulouris et al.,[4] is to use a -quorum-consensus approach. This allows the sub-partition with a -majority of the votes to remain available, while the remaining -sub-partitions should fall down to an auto-fencing mode. -\end{quotation} - -This is the same basic technique that -both Riak Ensemble and ZooKeeper use. Machi's -extensive use of write-registers are a big advantage when implementing -this technique. Also very useful is the Machi ``wedge'' mechanism, -which can automatically implement the ``auto-fencing'' that the -technique requires. All Machi servers that can communicate with only -a minority of other servers will automatically ``wedge'' themselves -and refuse all requests for service until communication with the -majority can be re-established. - -\subsection{The quorum: witness servers vs. full servers} - -In any quorum-consensus system, at least $2f+1$ participants are -required to survive $f$ participant failures. Machi can implement a -technique of ``witness servers'' servers to bring the total cost -somewhere in the middle, between $2f+1$ and $f+1$, depending on your -point of view. - -A ``witness server'' is one that participates in the network protocol -but does not store or manage all of the state that a ``full server'' -does. A ``full server'' is a Machi server as -described by this RFC document. A ``witness server'' is a server that -only participates in the projection store and projection epoch -transition protocol and a small subset of the file access API. -A witness server doesn't actually store any -Machi files. 
A witness server is almost stateless, when compared to a -full Machi server. - -A mixed cluster of witness and full servers must still contain at -least $2f+1$ participants. However, only $f+1$ of them are full -participants, and the remaining $f$ participants are witnesses. In -such a cluster, any majority quorum must have at least one full server -participant. - -Witness FLUs are always placed at the front of the chain. As stated -above, there may be at most $f$ witness FLUs. A functioning quorum -majority -must have at least $f+1$ FLUs that can communicate and therefore -calculate and store a new unanimous projection. Therefore, any FLU at -the tail of a functioning quorum majority chain must be full FLU. Full FLUs -actually store Machi files, so they have no problem answering {\tt - read\_req} API requests.\footnote{We hope that it is now clear that - a witness FLU cannot answer any Machi file read API request.} - -Any FLU that can only communicate with a minority of other FLUs will -find that none can calculate a new projection that includes a -majority of FLUs. Any such FLU, when in CP mode, would then move to -wedge state and remain wedged until the network partition heals enough -to communicate with the majority side. This is a nice property: we -automatically get ``fencing'' behavior.\footnote{Any FLU on the minority side - is wedged and therefore refuses to serve because it is, so to speak, - ``on the wrong side of the fence.''} - -There is one case where ``fencing'' may not happen: if both the client -and the tail FLU are on the same minority side of a network partition. -Assume the client and FLU $F_z$ are on the "wrong side" of a network -split; both are using projection epoch $P_1$. The tail of the -chain is $F_z$. - -Also assume that the "right side" has reconfigured and is using -projection epoch $P_2$. The right side has mutated key $K$. Meanwhile, -nobody on the "right side" has noticed anything wrong and is happy to -continue using projection $P_1$. - -\begin{itemize} -\item {\bf Option a}: Now the wrong side client reads $K$ using $P_1$ via - $F_z$. $F_z$ does not detect an epoch problem and thus returns an - answer. Given our assumptions, this value is stale. For some - client use cases, this kind of staleness may be OK in trade for - fewer network messages per read \ldots so Machi may - have a configurable option to permit it. -\item {\bf Option b}: The wrong side client must confirm that $P_1$ is - in use by a full majority of chain members, including $F_z$. -\end{itemize} - -Attempts using Option b will fail for one of two reasons. First, if -the client can talk to a FLU that is using $P_2$, the client's -operation must be retried using $P_2$. Second, the client will time -out talking to enough FLUs so that it fails to get a quorum's worth of -$P_1$ answers. In either case, Option B will always fail a client -read and thus cannot return a stale value of $K$. - -\subsection{Witness FLU data and protocol changes} - -Some small changes to the projection's data structure -(Figure~\ref{fig:projection}) are required. The projection itself -needs new annotation to indicate the operating mode, AP mode or CP -mode. The state type notifies the auto-administration service how to -react in network partitions and how to calculate new, safe projection -transitions and which file repair mode to use -(Section~\ref{sub:repair-entire-files}). -Also, we need to label member FLU servers as full- or -witness-type servers. 
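+
+The rule stated above, that any functioning majority quorum must
+include at least one full FLU, can be captured by a small predicate.
+The sketch below is hypothetical: the data layout (a list of
+{\tt \{Name, full | witness\}} pairs) and the function name are
+illustrative only and are not part of any Machi API.
+
+{\small
+\begin{verbatim}
+-module(witness_quorum_sketch).
+-export([usable_majority/2]).
+
+%% All is the entire chain membership, e.g.
+%% [{f1,full},{f2,full},{w1,witness}]; Alive is the subset that is
+%% currently reachable.  A quorum is usable only if it is a strict
+%% majority of the membership AND contains at least one full
+%% (non-witness) FLU, because only full FLUs can answer file reads.
+usable_majority(Alive, All) ->
+    length(Alive) * 2 > length(All)
+        andalso lists:any(fun({_Name, Type}) -> Type =:= full end, Alive).
+\end{verbatim}
+}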
 
-
-Write API requests are processed by witness servers in {\em almost but
-  not quite} no-op fashion.  The only requirement of a witness server
-is to return correct interpretations of local projection epoch
-numbers, via the {\tt error\_bad\_epoch} and {\tt error\_wedged} error
-codes.  In fact, a new API call is sufficient for querying witness
-servers: {\tt \{check\_epoch, m\_epoch()\}}.
-Any client write operation sends the {\tt
-  check\_\-epoch} API command to witness FLUs and sends the usual {\tt
-  write\_\-req} command to full FLUs.
+Assume for a moment that the {\tt rsync} utility could indeed preserve
+Machi written chunk boundaries as described above.  A larger
+administration problem still remains: this informal method cannot tell
+you exactly when you are in danger of data loss or when data loss has
+actually happened.  If we maintain the Update Propagation Invariant
+(as argued in \cite{machi-chain-manager-design}),
+then we always know exactly when data loss is imminent or has happened.
 
 \section{On-disk storage and file corruption detection}
 \label{sec:on-disk}
@@ -2006,231 +1189,6 @@ FLUs should also be able to schedule their checksum scrubbing activity
 periodically and limit their activity to certain times, per a
 only-as-complex-as-it-needs-to-be administrative policy.
 
-\section{The safety of projection epoch transitions}
-\label{sec:safety-of-transitions}
-
-Machi uses the projection epoch transition algorithm and
-implementation from CORFU, which is believed to be safe.  However,
-CORFU assumes a single, external, strongly consistent projection
-store.  Further, CORFU assumes that new projections are calculated by
-an oracle that the rest of the CORFU system agrees is the sole agent
-for creating new projections.  Such an assumption is impractical for
-Machi's intended purpose.
-
-Machi could use Riak Ensemble or ZooKeeper as an oracle (or perhaps as a oracle
-coordinator), but we wish to keep Machi free of big external
-dependencies.  We would also like to see Machi be able to
-operate in an ``AP mode'', which means providing service even
-if all network communication to an oracle is broken.
-
-The model of projection calculation and storage described in
-Section~\ref{sec:projections} allows for each server to operate
-independently, if necessary.  This autonomy allows the server in AP
-mode to
-always accept new writes: new writes are written to unique file names
-and unique file offsets using a chain consisting of only a single FLU,
-if necessary.  How is this possible?  Let's look at a scenario in
-Section~\ref{sub:split-brain-scenario}.
-
-\subsection{A split brain scenario}
-\label{sub:split-brain-scenario}
-
-\begin{enumerate}
-
-\item Assume 3 Machi FLUs, all in good health and perfect data sync: $[F_a,
-      F_b, F_c]$ using projection epoch $P_p$.
-
-\item Assume data $D_0$ is written at offset $O_0$ in Machi file
-      $F_0$.
-
-\item Then a network partition happens.  Servers $F_a$ and $F_b$ are
-      on one side of the split, and server $F_c$ is on the other side of
-      the split.  We'll call them the ``left side'' and ``right side'',
-      respectively.
-
-\item On the left side, $F_b$ calculates a new projection and writes
-      it unanimously (to two projection stores) as epoch $P_B+1$.  The
-      subscript $_B$ denotes a
-      version of projection epoch $P_{p+1}$ that was created by server $F_B$
-      and has a unique checksum (used to detect differences after the
-      network partition heals).
- -\item In parallel, on the right side, $F_c$ calculates a new - projection and writes it unanimously (to a single projection store) - as epoch $P_c+1$. - -\item In parallel, a client on the left side writes data $D_1$ - at offset $O_1$ in Machi file $F_1$, and also - a client on the right side writes data $D_2$ - at offset $O_2$ in Machi file $F_2$. We know that $F_1 \ne F_2$ - because each sequencer is forced to choose disjoint filenames from - any prior epoch whenever a new projection is available. - -\end{enumerate} - -Now, what happens when various clients attempt to read data values -$D_0$, $D_1$, and $D_2$? - -\begin{itemize} -\item All clients can read $D_0$. -\item Clients on the left side can read $D_1$. -\item Attempts by clients on the right side to read $D_1$ will get - {\tt error\_unavailable}. -\item Clients on the right side can read $D_2$. -\item Attempts by clients on the left side to read $D_2$ will get - {\tt error\_unavailable}. -\end{itemize} - -The {\tt error\_unavailable} result is not an error in the CAP Theorem -sense: it is a valid and affirmative response. In both cases, the -system on the client's side definitely knows that the cluster is -partitioned. If Machi were not a write-once store, perhaps there -might be an old/stale value to read on the local side of the network -partition \ldots but the system also knows definitely that no -old/stale value exists. Therefore Machi remains available in the -CAP Theorem sense both for writes and reads. - -We know that all files $F_0$, -$F_1$, and $F_2$ are disjoint and can be merged (in a manner analogous -to set union) onto each server in $[F_a, F_b, F_c]$ safely -when the network partition is healed. However, -unlike pure theoretical set union, Machi's data merge \& repair -operations must operate within some constraints that are designed to -prevent data loss. - -\subsection{Aside: defining data availability and data loss} -\label{sub:define-availability} - -Let's take a moment to be clear about definitions: - -\begin{itemize} -\item ``data is available at time $T$'' means that data is available - for reading at $T$: the Machi cluster knows for certain that the - requested data is not been written or it is written and has a single - value. -\item ``data is unavailable at time $T$'' means that data is - unavailable for reading at $T$ due to temporary circumstances, - e.g. network partition. If a read request is issued at some time - after $T$, the data will be available. -\item ``data is lost at time $T$'' means that data is permanently - unavailable at $T$ and also all times after $T$. -\end{itemize} - -Chain Replication is a fantastic technique for managing the -consistency of data across a number of whole replicas. There are, -however, cases where CR can indeed lose data. - -\subsection{Data loss scenario \#1: too few servers} -\label{sub:data-loss1} - -If the chain is $N$ servers long, and if all $N$ servers fail, then -of course data is unavailable. However, if all $N$ fail -permanently, then data is lost. - -If the administrator had intended to avoid data loss after $N$ -failures, then the administrator would have provisioned a Machi -cluster with at least $N+1$ servers. - -\subsection{Data Loss scenario \#2: bogus configuration change sequence} -\label{sub:data-loss2} - -Assume that the sequence of events in Figure~\ref{fig:data-loss2} takes place. - -\begin{figure} -\begin{enumerate} -%% NOTE: the following list 9 items long. We use that fact later, see -%% string YYY9 in a comment further below. 
If the length of this list -%% changes, then the counter reset below needs adjustment. -\item Projection $P_p$ says that chain membership is $[F_a]$. -\item A write of data $D$ to file $F$ at offset $O$ is successful. -\item Projection $P_{p+1}$ says that chain membership is $[F_a,F_b]$, via - an administration API request. -\item Machi will trigger repair operations, copying any missing data - files from FLU $F_a$ to FLU $F_b$. For the purpose of this - example, the sync operation for file $F$'s data and metadata has - not yet started. -\item FLU $F_a$ crashes. -\item The auto-administration monitor on $F_b$ notices $F_a$'s crash, - decides to create a new projection $P_{p+2}$ where chain membership is - $[F_b]$ - successfully stores $P_{p+2}$ in its local store. FLU $F_b$ is now wedged. -\item FLU $F_a$ is down, therefore the - value of $P_{p+2}$ is unanimous for all currently available FLUs - (namely $[F_b]$). -\item FLU $F_b$ sees that projection $P_{p+2}$ is the newest unanimous - projection. It unwedges itself and continues operation using $P_{p+2}$. -\item Data $D$ is definitely unavailable for now, perhaps lost forever? -\end{enumerate} -\caption{Data unavailability scenario with danger of permanent data loss} -\label{fig:data-loss2} -\end{figure} - -At this point, the data $D$ is not available on $F_b$. However, if -we assume that $F_a$ eventually returns to service, and Machi -correctly acts to repair all data within its chain, then $D$ -all of its contents will be available eventually. - -However, if server $F_a$ never returns to service, then $D$ is lost. The -Machi administration API must always warn the user that data loss is -possible. In Figure~\ref{fig:data-loss2}'s scenario, the API must -warn the administrator in multiple ways that fewer than the full {\tt - length(all\_members)} number of replicas are in full sync. - -A careful reader should note that $D$ is also lost if step \#5 were -instead, ``The hardware that runs FLU $F_a$ was destroyed by fire.'' -For any possible step following \#5, $D$ is lost. This is data loss -for the same reason that the scenario of Section~\ref{sub:data-loss1} -happens: the administrator has not provisioned a sufficient number of -replicas. - -Let's revisit Figure~\ref{fig:data-loss2}'s scenario yet again. This -time, we add a final step at the end of the sequence: - -\begin{enumerate} -\setcounter{enumi}{9} % YYY9 -\item The administration API is used to change the chain -configuration to {\tt all\_members=$[F_b]$}. -\end{enumerate} - -Step \#10 causes data loss. Specifically, the only copy of file -$F$ is on FLU $F_a$. By administration policy, FLU $F_a$ is now -permanently inaccessible. - -The auto-administration monitor {\em must} keep track of all -repair operations and their status. If such information is tracked by -all FLUs, then the data loss by bogus administrator action can be -prevented. In this scenario, FLU $F_b$ knows that `$F_a \rightarrow -F_b$` repair has not yet finished and therefore it is unsafe to remove -$F_a$ from the cluster. - -\subsection{Data Loss scenario \#3: chain replication repair done badly} -\label{sub:data-loss3} - -It's quite possible to lose data through careless/buggy Chain -Replication chain configuration changes. For example, in the split -brain scenario of Section~\ref{sub:split-brain-scenario}, we have two -pieces of data written to different ``sides'' of the split brain, -$D_0$ and $D_1$. 
If the chain is naively reconfigured after the network -partition heals to be $[F_a=\emptyset,F_b=\emptyset,F_c=D_1],$\footnote{Where $\emptyset$ - denotes the unwritten value.} then $D_1$ -is in danger of being lost. Why? -The Update Propagation Invariant is violated. -Any Chain Replication read will be -directed to the tail, $F_c$. The value exists there, so there is no -need to do any further work; the unwritten values at $F_a$ and $F_b$ -will not be repaired. If the $F_c$ server fails sometime -later, then $D_1$ will be lost. Section~\ref{sec:repair} discusses -how data loss can be avoided after servers are added (or re-added) to -an active chain configuration. - -\subsection{Summary} - -We believe that maintaining the Update Propagation Invariant is a -hassle anda pain, but that hassle and pain are well worth the -sacrifices required to maintain the invariant at all times. It avoids -data loss in all cases where the U.P.~Invariant preserving chain -contains at least one FLU. - \section{Load balancing read vs. write ops} \label{sec:load-balancing} @@ -2429,6 +1387,11 @@ Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation (NSDI'12), 2012. {\tt http://research.microsoft.com/pubs/157204/ corfumain-final.pdf} +\bibitem{machi-chain-manager-design} +Basho Japan KK. +Machi Chain Replication: management theory and design +{\tt https://github.com/basho/machi/tree/ master/doc/high-level-chain-mgr.pdf} + \bibitem{corfu2} Balakrishnan, Mahesh et al. CORFU: A Distributed Shared Log From 60dfff0c86538b7ec8c08333094397ae675f7d1b Mon Sep 17 00:00:00 2001 From: Scott Lystig Fritchie Date: Mon, 20 Apr 2015 10:36:54 +0900 Subject: [PATCH 03/14] Type up Friday's edits --- doc/src.high-level/append-flow.eps | 18 +- doc/src.high-level/append-flow2.eps | 58 ++--- doc/src.high-level/high-level-machi.tex | 330 +++++++++++++----------- 3 files changed, 219 insertions(+), 187 deletions(-) diff --git a/doc/src.high-level/append-flow.eps b/doc/src.high-level/append-flow.eps index 9302919..9df810e 100644 --- a/doc/src.high-level/append-flow.eps +++ b/doc/src.high-level/append-flow.eps @@ -191,11 +191,11 @@ newpath 467 -238 moveto 467 -265 lineto stroke newpath 552 -238 moveto 552 -265 lineto stroke newpath 42 -251 moveto 382 -251 lineto stroke newpath 382 -251 moveto 372 -257 lineto stroke -(write "foo.seq_a.009" offset=447 <<...123...>> epoch=13) dup stringwidth +(write "foo.seq_a.009" offset=447 <<123 bytes...>> epoch=13) dup stringwidth 1.000000 1.000000 1.000000 setrgbcolor -pop dup newpath 62 -249 moveto 0 rlineto 0 11 rlineto neg 0 rlineto closepath fill +pop dup newpath 51 -249 moveto 0 rlineto 0 11 rlineto neg 0 rlineto closepath fill 0.000000 0.000000 0.000000 setrgbcolor -62 -249 moveto show +51 -249 moveto show newpath 42 -265 moveto 42 -292 lineto stroke newpath 127 -265 moveto 127 -292 lineto stroke newpath 212 -265 moveto 212 -292 lineto stroke @@ -219,11 +219,11 @@ newpath 467 -292 moveto 467 -319 lineto stroke newpath 552 -292 moveto 552 -319 lineto stroke newpath 42 -305 moveto 467 -305 lineto stroke newpath 467 -305 moveto 457 -311 lineto stroke -(write "foo.seq_a.009" offset=447 <<...123...>> epoch=13) dup stringwidth +(write "foo.seq_a.009" offset=447 <<123 bytes...>> epoch=13) dup stringwidth 1.000000 1.000000 1.000000 setrgbcolor -pop dup newpath 105 -303 moveto 0 rlineto 0 11 rlineto neg 0 rlineto closepath fill +pop dup newpath 94 -303 moveto 0 rlineto 0 11 rlineto neg 0 rlineto closepath fill 0.000000 0.000000 0.000000 setrgbcolor -105 -303 moveto show 
+94 -303 moveto show newpath 42 -319 moveto 42 -346 lineto stroke newpath 127 -319 moveto 127 -346 lineto stroke newpath 212 -319 moveto 212 -346 lineto stroke @@ -247,11 +247,11 @@ newpath 467 -346 moveto 467 -373 lineto stroke newpath 552 -346 moveto 552 -373 lineto stroke newpath 42 -359 moveto 552 -359 lineto stroke newpath 552 -359 moveto 542 -365 lineto stroke -(write "foo.seq_a.009" offset=447 <<...123...>> epoch=13) dup stringwidth +(write "foo.seq_a.009" offset=447 <<123 bytes...>> epoch=13) dup stringwidth 1.000000 1.000000 1.000000 setrgbcolor -pop dup newpath 147 -357 moveto 0 rlineto 0 11 rlineto neg 0 rlineto closepath fill +pop dup newpath 136 -357 moveto 0 rlineto 0 11 rlineto neg 0 rlineto closepath fill 0.000000 0.000000 0.000000 setrgbcolor -147 -357 moveto show +136 -357 moveto show newpath 42 -373 moveto 42 -400 lineto stroke newpath 127 -373 moveto 127 -400 lineto stroke newpath 212 -373 moveto 212 -400 lineto stroke diff --git a/doc/src.high-level/append-flow2.eps b/doc/src.high-level/append-flow2.eps index ad285a3..0acff62 100644 --- a/doc/src.high-level/append-flow2.eps +++ b/doc/src.high-level/append-flow2.eps @@ -105,11 +105,11 @@ newpath 467 -76 moveto 467 -103 lineto stroke newpath 552 -76 moveto 552 -103 lineto stroke newpath 42 -89 moveto 382 -89 lineto stroke newpath 382 -89 moveto 372 -95 lineto stroke -(write prefix="foo" <<...123...>> epoch=12) dup stringwidth +(append prefix="foo" <<123 bytes...>> epoch=12) dup stringwidth 1.000000 1.000000 1.000000 setrgbcolor -pop dup newpath 104 -87 moveto 0 rlineto 0 11 rlineto neg 0 rlineto closepath fill +pop dup newpath 85 -87 moveto 0 rlineto 0 11 rlineto neg 0 rlineto closepath fill 0.000000 0.000000 0.000000 setrgbcolor -104 -87 moveto show +85 -87 moveto show newpath 42 -103 moveto 42 -130 lineto stroke newpath 127 -103 moveto 127 -130 lineto stroke newpath 212 -103 moveto 212 -130 lineto stroke @@ -163,11 +163,11 @@ newpath 467 -184 moveto 467 -211 lineto stroke newpath 552 -184 moveto 552 -211 lineto stroke newpath 42 -197 moveto 382 -197 lineto stroke newpath 382 -197 moveto 372 -203 lineto stroke -(write prefix="foo" <<...123...>> epoch=13) dup stringwidth +(append prefix="foo" <<123 bytes...>> epoch=13) dup stringwidth 1.000000 1.000000 1.000000 setrgbcolor -pop dup newpath 104 -195 moveto 0 rlineto 0 11 rlineto neg 0 rlineto closepath fill +pop dup newpath 85 -195 moveto 0 rlineto 0 11 rlineto neg 0 rlineto closepath fill 0.000000 0.000000 0.000000 setrgbcolor -104 -195 moveto show +85 -195 moveto show newpath 42 -211 moveto 42 -238 lineto stroke newpath 127 -211 moveto 127 -238 lineto stroke newpath 212 -211 moveto 212 -238 lineto stroke @@ -224,17 +224,13 @@ newpath 297 -292 moveto 297 -319 lineto stroke newpath 382 -292 moveto 382 -319 lineto stroke newpath 467 -292 moveto 467 -319 lineto stroke newpath 552 -292 moveto 552 -319 lineto stroke -(FLU_A writes to local storage @ "foo.seq_a.009" offset=447) dup stringwidth +newpath 382 -305 85 13 270 90 ellipse stroke +newpath 382 -311 moveto 392 -317 lineto stroke +(write "foo.seq_a.009" offset=447 <<123 bytes...>> epoch=13) dup stringwidth 1.000000 1.000000 1.000000 setrgbcolor -pop dup newpath 138 -308 moveto 0 rlineto 0 11 rlineto neg 0 rlineto closepath fill +pop dup newpath 58 -303 moveto 0 rlineto 0 11 rlineto neg 0 rlineto closepath fill 0.000000 0.000000 0.000000 setrgbcolor -138 -308 moveto show -[2] 0 setdash -newpath 21 -305 moveto 136 -305 lineto stroke -[] 0 setdash -[2] 0 setdash -newpath 459 -305 moveto 574 -305 lineto stroke -[] 0 setdash 
+58 -303 moveto show newpath 42 -319 moveto 42 -346 lineto stroke newpath 127 -319 moveto 127 -346 lineto stroke newpath 212 -319 moveto 212 -346 lineto stroke @@ -244,11 +240,11 @@ newpath 467 -319 moveto 467 -346 lineto stroke newpath 552 -319 moveto 552 -346 lineto stroke newpath 382 -332 moveto 467 -332 lineto stroke newpath 467 -332 moveto 457 -338 lineto stroke -(write "foo.seq_a.009" offset=447 <<...123...>> epoch=13) dup stringwidth +(write "foo.seq_a.009" offset=447 <<123 bytes...>> epoch=13) dup stringwidth 1.000000 1.000000 1.000000 setrgbcolor -pop dup newpath 275 -330 moveto 0 rlineto 0 11 rlineto neg 0 rlineto closepath fill +pop dup newpath 264 -330 moveto 0 rlineto 0 11 rlineto neg 0 rlineto closepath fill 0.000000 0.000000 0.000000 setrgbcolor -275 -330 moveto show +264 -330 moveto show newpath 42 -346 moveto 42 -373 lineto stroke newpath 127 -346 moveto 127 -373 lineto stroke newpath 212 -346 moveto 212 -373 lineto stroke @@ -258,11 +254,11 @@ newpath 467 -346 moveto 467 -373 lineto stroke newpath 552 -346 moveto 552 -373 lineto stroke newpath 467 -359 moveto 552 -359 lineto stroke newpath 552 -359 moveto 542 -365 lineto stroke -(write "foo.seq_a.009" offset=447 <<...123...>> epoch=13) dup stringwidth +(write "foo.seq_a.009" offset=447 <<123 bytes...>> epoch=13) dup stringwidth 1.000000 1.000000 1.000000 setrgbcolor -pop dup newpath 295 -357 moveto 0 rlineto 0 11 rlineto neg 0 rlineto closepath fill +pop dup newpath 273 -357 moveto 0 rlineto 0 11 rlineto neg 0 rlineto closepath fill 0.000000 0.000000 0.000000 setrgbcolor -295 -357 moveto show +273 -357 moveto show newpath 42 -373 moveto 42 -400 lineto stroke newpath 127 -373 moveto 127 -400 lineto stroke newpath 212 -373 moveto 212 -400 lineto stroke @@ -302,16 +298,16 @@ newpath 297 -427 moveto 297 -454 lineto stroke newpath 382 -427 moveto 382 -454 lineto stroke newpath 467 -427 moveto 467 -454 lineto stroke newpath 552 -427 moveto 552 -454 lineto stroke -(If, instead, FLU_C has an error...) dup stringwidth +(If, in an alternate scenario, FLU_C has an error...) dup stringwidth 1.000000 1.000000 1.000000 setrgbcolor -pop dup newpath 210 -443 moveto 0 rlineto 0 11 rlineto neg 0 rlineto closepath fill +pop dup newpath 167 -443 moveto 0 rlineto 0 11 rlineto neg 0 rlineto closepath fill 0.000000 0.000000 0.000000 setrgbcolor -210 -443 moveto show +167 -443 moveto show [2] 0 setdash -newpath 21 -440 moveto 208 -440 lineto stroke +newpath 21 -440 moveto 165 -440 lineto stroke [] 0 setdash [2] 0 setdash -newpath 386 -440 moveto 574 -440 lineto stroke +newpath 429 -440 moveto 574 -440 lineto stroke [] 0 setdash newpath 42 -454 moveto 42 -481 lineto stroke newpath 127 -454 moveto 127 -481 lineto stroke @@ -336,14 +332,14 @@ newpath 297 -481 moveto 297 -508 lineto stroke newpath 382 -481 moveto 382 -508 lineto stroke newpath 467 -481 moveto 467 -508 lineto stroke newpath 552 -481 moveto 552 -508 lineto stroke -(Repair is now the client's responsibility \("slow path"\).) dup stringwidth +(... then repair becomes the client's responsibility \("slow path"\).) 
dup stringwidth 1.000000 1.000000 1.000000 setrgbcolor -pop dup newpath 158 -497 moveto 0 rlineto 0 11 rlineto neg 0 rlineto closepath fill +pop dup newpath 133 -497 moveto 0 rlineto 0 11 rlineto neg 0 rlineto closepath fill 0.000000 0.000000 0.000000 setrgbcolor -158 -497 moveto show +133 -497 moveto show [2] 0 setdash -newpath 21 -494 moveto 156 -494 lineto stroke +newpath 21 -494 moveto 131 -494 lineto stroke [] 0 setdash [2] 0 setdash -newpath 439 -494 moveto 574 -494 lineto stroke +newpath 464 -494 moveto 574 -494 lineto stroke [] 0 setdash diff --git a/doc/src.high-level/high-level-machi.tex b/doc/src.high-level/high-level-machi.tex index 587ed2e..6f2a37f 100644 --- a/doc/src.high-level/high-level-machi.tex +++ b/doc/src.high-level/high-level-machi.tex @@ -23,8 +23,8 @@ \copyrightdata{978-1-nnnn-nnnn-n/yy/mm} \doi{nnnnnnn.nnnnnnn} -\titlebanner{Draft \#0, April 2014} -\preprintfooter{Draft \#0, April 2014} +\titlebanner{Draft \#1, April 2014} +\preprintfooter{Draft \#1, April 2014} \title{Machi: an immutable file store} \subtitle{High level design \& strawman implementation suggestions \\ @@ -76,10 +76,9 @@ document. \par \hfill{--- Fred Hebert, {\tt @mononcqc}} \end{quotation} -\subsection{Name} +\subsection{Origin of the name ``Machi''} \label{sub:name} -This file store will be called ``Machi''. ``Machi'' is a Japanese word for ``village'' or ``small town''. A village is a rather self-contained thing, but it is small, not like a city. @@ -95,15 +94,15 @@ built out of a single village. Machi is a client-server system. All servers in a Machi cluster store identical copies/replicas of all files, preferably large files. -\begin{itemize} - \item This puts an effective limit on the size of a Machi cluster. - For example, five servers will replicate all files - for an effective replication $N$ factor of 5. - \item Any mechanism to distribute files across a subset of Machi - servers is outside the scope of Machi and of this design. -\end{itemize} +This puts an effective limit on the size of a Machi cluster. +For example, five servers will replicate all files +for an effective replication $N$ factor of 5. -``Large file'' is intended to mean hundreds of MBytes or more +Any mechanism to distribute files across a subset of Machi +servers is outside the scope of Machi and of this design. + +Machi's design assumes that it stores mostly large files. +``Large file'' means hundreds of MBytes or more per file. The design ``sweet spot'' targets about 1 GByte/file and/or managing up to a few million files in a single cluster. The maximum size of a single Machi file is @@ -112,26 +111,15 @@ practical estimate is 2Tbytes or less but may be larger. Machi files are write-once, read-many data structures; the label ``append-only'' is mostly correct. However, to be 100\% truthful -truth, the bytes a Machi file can be written in any order. +truth, the bytes a Machi file can be written temporally in any order. Machi files are always named by the server; Machi clients have no direct control of the name assigned by a Machi server. Machi servers -specify the file name and byte offset to all client write requests. +determine the file name and byte offset to all client write requests. (Machi clients may advise servers with a desired file name prefix.) -Machi is not a Hadoop file system (HDFS) replacement. -%% \begin{itemize} -% \item -There is no mechanism for writing Machi files to a subset of - available storage servers: all servers in a Machi server store - identical copies/replicas of all files. 
-% \item -However, Machi is intended to play very nicely with a layer above it, - where that layer {\em does} handle file scattering and on-the-fly - file migration across servers and all of the nice things that - HDFS, Riak CS, and similar systems can do. - -Robust and reliable means that Machi will not lose data until a +Machi shall be a +robust and reliable system. Machi will not lose data until a fundamental assumption has been violated, e.g., all servers have crashed permanently. Machi's file replicaion algorithms can provide strong or eventual consistency and is provably correct. Our only @@ -153,6 +141,18 @@ incomplete writes may happen long after the client has finished or even crashed. In effect, Machi will provide clients with ``at least once'' behavior for writes. +Machi is not a Hadoop file system (HDFS) replacement. +%% \begin{itemize} +% \item +There is no mechanism for writing Machi files to a subset of + available storage servers: all servers in a Machi server store + identical copies/replicas of all files. +% \item +However, Machi is intended to play very nicely with a layer above it, + where that layer {\em does} handle file scattering and on-the-fly + file migration across servers and all of the nice things that + HDFS, Riak CS, and similar systems can do. + \subsection{Defining a Machi file} A Machi ``file'' is an undifferentiated, one-dimensional array of @@ -167,10 +167,11 @@ shows the basic shape of the service. \begin{figure} \begin{itemize} \item Append bytes $B$ to a file with name prefix {\tt "foo"}. - \item Read $N$ bytes from offset $O$ from file $F$. + \item Write bytes $B$ to offset $O$ of file $F$. + \item Read $N$ bytes from offset $O$ of file $F$. \item List files: name, size, etc. \end{itemize} -\caption{Full (?) list of file API operations} +\caption{Nearly complete list of file API operations} \label{fig:example-client-API} \end{figure} @@ -180,12 +181,12 @@ order of 4 KBytes or 16 KBytes.) \begin{figure} \begin{enumerate} - \item Client1: Write 1 byte at offset 0. - \item Client1: Read 1 byte at offset 0. - \item Client2: Write 1 byte at offset 2. - \item Client2: Read 1 byte at offset 2. - \item Client3: (an intermittently slow client) Write 1 byte at offset 1. - \item Client3: Read 1 byte at offset 1. + \item Client1: Write 1 byte at offset 0 of file $F$. +% \item Client1: Read 1 byte at offset 0 of file $F$. + \item Client2: Write 1 byte at offset 2 of file $F$. +% \item Client2: Read 1 byte at offset 2 of file $F$. + \item Client3: (an intermittently slow client) Write 1 byte at offset 1 of file $F$. +% \item Client3: Read 1 byte at offset 1 of file $F$. \end{enumerate} \caption{Example of temporally out-of-order file append sequence that is valid within a Machi cluster.} @@ -262,7 +263,7 @@ Bit-rot can and will happen. To guard against bit-rot on disk, strong \begin{itemize} \item Client-calculated checksums of appended data \item Whole-file checksums, calculated by Machi servers for internal - sanity checking. See \ref{sub:detecting-corrupted} for + sanity checking. See Section~\ref{sub:detecting-corrupted} for commentary on how this may not be feasible. \item Any other place that makes sense for the paranoid. \end{itemize} @@ -284,10 +285,8 @@ the per-append checksums described in Section~\ref{sub:bit-rot} \begin{itemize} \item File metadata is strictly append-only. \item File metadata is always eventually consistent. - \item A complete history of all metadata updates is maintained for - each file. 
\item Temporal order of metadata entries is not preserved. - \item Multiple histories for a file may be merged at any time. + \item Multiple metadata stores for a file may be merged at any time. \begin{itemize} \item If a client requires idempotency, then the property list should contain all information required to identify multiple @@ -298,6 +297,9 @@ the per-append checksums described in Section~\ref{sub:bit-rot} \end{itemize} \end{itemize} +{\bf NOTE:} It isn't yet clear how much support early versions of +Machi will need for file metadata features. + \subsubsection{File replica management via Chain Replication} \label{sub:chain-replication} @@ -313,7 +315,7 @@ restrictions: \begin{enumerate} \item All writes are strictly performed by servers that are arranged in a single order, known as the ``chain order'', beginning at the - chain's head. + chain's head and ending at the chain's tail. \item All strongly consistent reads are performed only by the tail of the chain, i.e., the last server in the chain order. \item Inconsistent reads may be performed by any single server in the @@ -321,10 +323,10 @@ restrictions: \end{enumerate} Machi contains enough Chain Replication implementation to maintain its -chain state, file data integrity, and file metadata eventual +chain state, strict file data integrity, and file metadata eventual consistency. See also Section~\ref{sub:self-management}. -The first version of Machi would use a single chain for managing all +The first version of Machi will use a single chain for managing all files in the cluster. If the system is quiescent, then all chain members store the same data: all Machi servers will all store identical files. Later versions of Machi @@ -365,6 +367,8 @@ of poor health will automatically reconfigure the Machi cluster to avoid data loss and to provide maximum availability. For example, if a server $S$ crashes and later restarts, Machi will automatically bring the data on $S$ back to full sync. +This service will be provided by the ``chain manager'', which is +described in \cite{machi-chain-manager-design}. Machi will provide an administration API for managing Machi servers, e.g., cluster membership, file integrity and checksum verification, etc. @@ -407,16 +411,6 @@ considered out-of-scope for Machi. burden of physical separation of each coded piece (i.e., ``rack awareness'') someone/something else's problem. -Why would would someone wish to run a Machi cluster with only one -server (i.e., chain length of one) rather than using the FLU service -(Section~\ref{sub:flu}) by itself? One answer is that data -migration is much easier with all of Machi than with only the FLU -server. To migrate all files from FLU $F_a$ to FLU $F_b$, the administrator -merely needs to add $F_b$ to the end of $F_a$'s chain. When the data -repair is finished, we know that $F_b$ stores full replicas of all of -$F_a$'s data. The administrator removes $F_a$ from the chain, and the -data migration is finished. - \section{Architecture: base components and ideas} This section presents the major architectural components. They are: @@ -427,19 +421,19 @@ This section presents the major architectural components. They are: \item The Sequencer: assigns a unique file name + offset to each file append request. (Section \ref{sub:sequencer}) -\item The Projection Store: a write-once key-value blob store, used by - Machi for storing projections. 
-(Section \ref{sub:proj-store}) \item The chain manager: monitors the health of the chain and calculates new projections when failure is detected. (Section \ref{sub:chain-manager}) +\item The Projection Store: a write-once key-value blob store, used by + Machi's chain manager for storing projections. +(Section \ref{sub:proj-store}) \end{itemize} Also presented here are the major concepts used by Machi components: \begin{itemize} \item The Projection: the data structure that describes the current state of the Machi chain. - and is stored in the write-once Projection Store. + Projections are stored in the write-once Projection Store. (Section \ref{sub:projection}) \item The Projection Epoch Number (a.k.a.~The Epoch): Each projection is numbered with an epoch. @@ -464,7 +458,7 @@ The basic idea of the FLU is borrowed from CORFU. The base CORFU data server is called a ``flash unit''. For Machi, the equivalent server is nicknamed a FLU, a ``FiLe replica Unit''. A FLU is responsible for maintaining a single replica/copy of each file -(and its associated metadata) stored in a Machi cluster +(and its associated metadata) stored in a Machi cluster. The FLU's API is very simple: see Figure~\ref{fig:flu-api} for its data types and operations. This description is not 100\% complete but @@ -484,9 +478,12 @@ is sufficient for discussion purposes. error_bad_checksum | error_unavailable. -type m_name() :: binary(). -type m_offset() :: non_neg_integer(). +-type m_prefix() :: binary(). -type m_rerror() :: m_err_r() m_generr(). -type m_werror() :: m_generr() | m_err_w(). +-spec append(m_prefix(), m_bytes(), m_epoch()) -> {ok, m_name(), m_offset()} | + m_werror(). -spec fill(m_name(), m_offset(), integer(), m_epoch()) -> ok | m_fill_err() | m_werror(). -spec list_files() -> {ok, [m_file_info()]} | m_generr(). @@ -511,7 +508,7 @@ Transitions between these states are strictly ordered. See Section~\ref{sub:assume-append-only} for state transitions and the restrictions related to those transitions. -The FLU also keeps track of the projection number (number and checksum +The FLU also keeps track of the projection epoch number (number and checksum both, see also Section~\ref{sub:flu-divergence}) of the last modification to a file. This projection number is used for quick comparisons during repair (Section~\ref{sec:repair}) to determine if files are in sync or @@ -525,7 +522,7 @@ In Machi, the type signature of {\tt of the projection's contents. This checksum is used in cases where Machi is configured to run in ``AP mode'', which allows a running Machi cluster to fragment into multiple running sub-clusters during network -partitions. Each sub-cluster can choose a projection number +partitions. Each sub-cluster can choose an epoch projection number $P_{side}$ for its side of the cluster. After the partition is @@ -568,7 +565,7 @@ used to continue: \item If the client's write has been successful on at least the head FLU in the chain, then the client may continue to use the old location. The client is now performing read repair of this location in - the new epoch. (The client may have to add a ``read repair'' option + the new epoch. (The client may be required to add a ``read repair'' option to its requests to bypass the FLUs usual enforcement of the location's epoch.) \item If the client's write to the head FLU has not started yet, or if @@ -577,6 +574,13 @@ used to continue: request a new assignment from the sequencer. 
\end{itemize} +If the client eventually wishes to write a contiguous chunk of $Y$ +bytes, but only $X$ bytes ($X < Y$) are available right now, the +client may make a sequencer request for the larger $Y$ byte range +immediately. The client then uses this file~+~byte range assignment +to write the $X$ bytes now and all of the remaining $Y-X$ bytes at +some later time. + \subsubsection{Divergence from CORFU} \label{sub:sequencer-divergence} @@ -602,15 +606,19 @@ that generates unique file names is sufficient. \subsection{The Projection Store} \label{sub:proj-store} -Each FLU maintains a key-value store for the purpose of storing +Each FLU maintains a key-value store of write-once registers +for the purpose of storing projections. Reads \& writes to this store are provided by the FLU administration API. The projection store runs on each server that -provides FLU service, for two reasons of convenience. First, the +provides FLU service, for several reasons. First, the projection data structure need not include extra server names to identify projection store servers or their locations. Second, writes to the projection store require notification to a FLU of the projection update anyway. +Third, certain kinds of writes to the projection store indicate +changes in cluster status which require prompt changes of state inside +of the FLU (e.g., entering wedge state). The store's basic operation set is simple: get, put, get largest key (and optionally its value), and list all keys. @@ -627,7 +635,7 @@ The projection store's data types are: As a write-once register, any attempt to write a key $K$ when the local store already has a value written for $K$ will always fail -with a {\tt error\_written} error. +with a {\tt error\_written} status. Any write of a key whose value is larger than the FLU's current projection number will move the FLU to the wedged state @@ -636,17 +644,21 @@ projection number will move the FLU to the wedged state The contents of the projection blob store are maintained by neither Chain Replication techniques nor any other server-side technique. All replication and read repair is done only by the projection store -client. Astute readers may theorize that race conditions exist in +clients. Astute readers may theorize that race conditions exist in such management; see Section~\ref{sec:projections} for details and restrictions that make it practical. \subsection{The chain manager} \label{sub:chain-manager} -Each FLU runs an administration agent that is responsible for -monitoring the health of the entire Machi cluster. If a change of -state is noticed (via measurement) or is requested (via the -administration API), zero or more actions may be taken: +Each FLU runs an administration agent, the chain manager, that is +responsible for monitoring the health of the entire Machi cluster. +Each chain manager instance is fully autonomous and communicates with +other chain managers indirectly via writes and reads to its peers' +projection stores. + +If a change of state is noticed (via measurement) or is requested (via +the administration API), one or more actions may be taken: \begin{itemize} \item Enter wedge state (Section~\ref{sub:wedge}). @@ -703,6 +715,8 @@ Pseudo-code for the projection's definition is shown in Figure~\ref{fig:projection}. To summarize the major components: \begin{itemize} +\item {\tt epoch\_number} and {\tt epoch\_csum} The epoch number and + projection checksum are unique identifiers for this projection. 
\item {\tt creation\_time} Wall-clock time, useful for humans and general debugging effort. \item {\tt author\_server} Name of the server that calculated the projection. @@ -730,13 +744,14 @@ Figure~\ref{fig:projection}. To summarize the major components: Most Machi protocol actions are tagged with the actor's best knowledge of the current epoch. However, Machi does not have a single/master coordinator for making configuration changes. Instead, change is -performed in a fully asynchronous manner. During a cluster +performed in a fully asynchronous manner by +each local chain manager. During a cluster configuration change, some servers will use the old projection number, $P_p$, whereas others know of a newer projection, $P_{p+x}$ where $x>0$. -When a protocol operation with $P_p$ arrives at an actor who knows -$P_{p+x}$, the response must be {\tt error\_bad\_epoch}. This is a signal -that the actor using $P_p$ is indeed out-of-date and that a newer +When a protocol operation with $P_{p-x}$ arrives at an actor who knows +$P_p$, the response must be {\tt error\_bad\_epoch}. This is a signal +that the actor using $P_{p-x}$ is indeed out-of-date and that a newer projection must be found and used. \subsection{The Wedge} @@ -744,12 +759,12 @@ projection must be found and used. If a FLU server is using a projection $P_p$ and receives a protocol message that mentions a newer projection $P_{p+x}$ that is larger than its -current projection value, then it must enter ``wedge'' state and stop +current projection value, then it enters ``wedge'' state and stops processing all new requests. The server remains in wedge state until a new projection (with a larger/higher epoch number) is discovered and appropriately acted upon. -In the Windows Azure storage system \cite{was}, this state is called -the ``sealed'' state. +(In the Windows Azure storage system \cite{was}, this state is called +the ``sealed'' state.) \subsection{``AP Mode'' and ``CP Mode''} \label{sub:ap-cp-mode} @@ -764,14 +779,14 @@ sufficient for an ``AP Mode'' Machi service. In AP Mode, all mutations to any file on any side of a network partition are guaranteed to use unique locations (file names and/or byte offsets). When network partitions are healed, all files can be merged together -(while considering the file format detail discussed in -the footnote of Section~\ref{ssec:just-rsync-it}) in any order +(while considering the details discussed in +Section~\ref{ssec:just-rsync-it}) in any order without conflict. -``CP mode'' will be extensively covered in other documents. In summary, -to support ``CP mode'', we believe that the chain manager -service proposed here can guarantee strong consistency -at all times. +``CP mode'' will be extensively covered in~\cite{machi-chain-manager-design}. +In summary, to support ``CP mode'', we believe that the chain manager +service proposed by~\cite{machi-chain-manager-design} can guarantee +strong consistency at all times. \section{Sketches of single operations} \label{sec:sketches} @@ -791,8 +806,8 @@ at all times. To write/append atomically a single sequence/hunk of bytes to a file, here's the sequence of steps required. -See Figure~\ref{fig:append-flow} for a diagram showing an example -append; the same example is also shown in +See Figure~\ref{fig:append-flow} for a diagram that illustrates this +example; the same example is also shown in Figure~\ref{fig:append-flowMSC} using MSC style (message sequence chart). 
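+
+Before walking through the example in the figures, here is a minimal
+sketch of the epoch guard implied by the {\tt error\_bad\_epoch} and
+wedge rules described above.  The sketch is hypothetical: it compares
+epoch numbers only, whereas a real implementation would also compare
+the projection checksum, and the return conventions are illustrative
+rather than part of any Machi API.
+
+{\small
+\begin{verbatim}
+-module(epoch_guard_sketch).
+-export([check_epoch/2]).
+
+%% ReqEpoch is the epoch number carried by an incoming request;
+%% MyEpoch is the number of the newest projection this FLU has seen.
+check_epoch(ReqEpoch, MyEpoch) when ReqEpoch =:= MyEpoch ->
+    proceed;                    % epochs agree: process the request
+check_epoch(ReqEpoch, MyEpoch) when ReqEpoch < MyEpoch ->
+    {reply, error_bad_epoch};   % the caller is out of date
+check_epoch(ReqEpoch, MyEpoch) when ReqEpoch > MyEpoch ->
+    {wedge_self, ReqEpoch}.     % this FLU is out of date: enter wedge state
+\end{verbatim}
+}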
In this case, the first FLU contacted has a newer projection epoch, @@ -807,21 +822,26 @@ prefixes $Pref1$ and $Pref2$ where $Pref1 \ne Pref2$, then the two byte sequences will definitely be written to different files. If $Pref1 = Pref2$, then the sequencer may choose the same file for both (but no -guarantee of how ``close together'' the two requests might be). +guarantee of how ``close together'' the two requests might be time-wise). \item (cacheable) Find the list of Machi member servers. This step is only needed at client initialization time or when all Machi members are down/unavailable. This step is out of scope of Machi, i.e., found via another source: local configuration file, DNS, LDAP, Riak KV, ZooKeeper, -carrier pigeon, etc. +carrier pigeon, papyrus, etc. \item (cacheable) Find the current projection number and projection data structure by fetching it from one of the Machi FLU server's projection store service. This info -may be cached and reused for as long as Machi server requests do not +may be cached and reused for as long as Machi API operations do not result in {\tt error\_bad\_epoch}. -\item Client sends a sequencer op to the sequencer process on the head of +\item Client sends a sequencer op\footnote{The {\tt append()} API + operation is performed by the server as if it were two different API +operations in sequence: {\tt sequence()} and {\tt write()}. The {\tt + append()} operation is provided as an optimization to reduce latency +by reducing messages sent \& received by a client.} +to the sequencer process on the head of the Machi chain (as defined by the projection data structure): {\tt \{sequence\_req, Filename\_Prefix, Number\_of\_Bytes\}}. The reply includes {\tt \{Full\_Filename, Offset\}}. @@ -838,15 +858,18 @@ successful. The client now knows the full Machi file name and byte offset, so that future attempts to read the data can do so by file name and offset. -\item Upon any non-{\tt ok} reply from a FLU server, {\em the client must -consider the entire append operation a failure}. If the client +\item Upon any non-{\tt ok} reply from a FLU server, the client must +either perform read repair or else consider the entire append +operation a failure. +If the client wishes, it may retry the append operation using a new location assignment from the sequencer or, if permitted by Machi restrictions, perform read repair on the original location. If this read repair is fully successful, then the client may consider the append operation successful. -\item If a FLU server $FLU$ is unavailable, notify another up/available +\item (optional) +If a FLU server $FLU$ is unavailable, notify another up/available chain member that $FLU$ appears unavailable. This info may be used by the chain manager service to change projections. If the client wishes, it may retry the append op or perhaps wait until a new projection is @@ -855,15 +878,6 @@ available. \item If any FLU server reports {\tt error\_written}, then either of two things has happened: \begin{itemize} - \item The appending client $C_w$ was too slow when attempting to write - to the head of the chain. - Another client, $C_r$, attempted a read, noticed that the tail's value was - unwritten and noticed that the head's value was also unwritten. - Then $C_r$ initiated a ``fill'' operation to write junk into - this offset of - the file. The fill operation succeeded, and now the slow - appending client $C_w$ discovers that it was too slow via the - {\tt error\_written} response. 
\item The appending client $C_w$ was too slow after at least one successful write. Client $C_r$ attempted a read, noticed the partial write, and @@ -871,14 +885,21 @@ things has happened: replicas to verify that the repaired data matches its write attempt -- in all cases, the values written by $C_w$ and $C_r$ are identical. + \item The appending client $C_w$ was too slow when attempting to write + to the head of the chain. + Another client, $C_r$, attempted a read. + $C_r$ observes that the tail's value was + unwritten and observes that the head's value was also unwritten. + Then $C_r$ initiated a ``fill'' operation to write junk into + this offset of + the file. The fill operation succeeded, and now the slow + appending client $C_w$ discovers that it was too slow via the + {\tt error\_written} response. \end{itemize} \end{enumerate} -\subsection{TODO: Single operation: reading a chunk of bytes from a file} -\label{sec:sketch-read} - -\section{Projections: calculation, then storage, then (perhaps) use} +\section{Projections: calculation, storage, then use} \label{sec:projections} Machi uses a ``projection'' to determine how its Chain Replication replicas @@ -909,7 +930,7 @@ included in any production-quality implementation. \subsection{When to trigger read repair of single values} -Assume now that some client $X$ wishes to fetch a datum that's managed +Assume that some client $X$ wishes to fetch a datum that's managed by Chain Replication. Client $X$ must discover the chain's configuration for that datum, then send its read request to the tail replica of the chain, $R_{tail}$. @@ -941,14 +962,14 @@ A read from any other server in the chain will also yield {\tt A read from any other server in the chain may yield {\tt error\_unwritten} or may find written data. (In this scenario, the -head server has written data; we don't know the state of the middle +head server has written data, but we don't know the state of the middle and tail server(s).) The client ought to perform read repair of this data. (See also, scenario \#4 below.) During read repair, the client's write operations may race with the original writer's operations. However, both the original writer and the repairing client are always writing the same data. Therefore, -data corruption by conflicting client writes is not possible. +data corruption by concurrent client writes is not possible. \paragraph{Scenario 3: A client $X_w$ has received a sequencer's assignment for this @@ -1031,19 +1052,19 @@ method is nearly sufficient enough for Machi's eventual consistency mode of operation. There's only one small problem that {\tt rsync} cannot handle by itself: handling late writes to a file. It is possible that the same file could contain the following pattern of -written and unwritten data: +written and unwritten data on two different replicas $A$ and $B$: \begin{itemize} \item Server $A$: $x$ bytes written, $y$ bytes unwritten \item Server $B$: $x$ bytes unwritten, $y$ bytes written \end{itemize} -If {\tt rsync} is uses as-is to replicate this file, then one of the -two written sections will overwritten by NUL bytes. Obviously, we +If {\tt rsync} is used as-is to replicate this file, then one of the +two written sections will be lost, i.e., overwritten by NUL bytes. Obviously, we don't want this kind of data loss. However, we already have a requirement that Machi file servers must enforce write-once behavior
-on all file byte ranges. The same data used to maintain written and -unwritten state can be used to merge file state so that both the $x$ +on all file byte ranges. The same metadata used to maintain written and +unwritten state can be used to merge file state safely so that both the $x$ and $y$ byte ranges will be correct after repair. \subsubsection{The larger problem with ``Just `rsync' it!''} @@ -1053,8 +1074,9 @@ Machi written chunk boundaries as described above. A larger administration problem still remains: this informal method cannot tell you exactly when you are in danger of data loss or when data loss has actually happened. If we maintain the Update Propagation Invariant -(as argued in \cite{machi-chain-manager-design}, -then we always know exactly when data loss is immanent or has happened. +(as argued in \cite{machi-chain-manager-design}), +then we always know exactly when data loss is imminent or has +probably happened. \section{On-disk storage and file corruption detection} \label{sec:on-disk} @@ -1064,9 +1086,13 @@ as efficiently as possible, and make it easy to detect and fix file corruption. FLUs have a lot of flexibility to implement their on-disk data formats in -whatever manner allow them to be safe and fast. Any format that +whatever manner allows them to be safe and fast. Any scheme that allows safe management of file names, per-file data chunks, and per-data-chunk metadata is sufficient. +\footnote{The proof-of-concept implementation at GitHub in the {\tt + prototype/demo-day} directory uses two files in the local file + system per Machi file: one for Machi file data and one for + checksum metadata.} \subsection{First draft/strawman proposal for on-disk data format} \label{sub:on-disk-data-format} @@ -1199,34 +1225,27 @@ example, for chain $[F_a, F_b, F_c]$ and a 100\% read-only workload, FLUs $F_a$ and $F_b$ will be completely idle, and FLU $F_c$ must handle all of the workload. -CORFU suggests a strategy of rotating the chain every so often, e.g., -rotating the chain members every 10K or 20K pages or so. In this -manner, then, the head and tail roles would rotate in a deterministic -way and balance the workload evenly.\footnote{If we ignore cases of - small numbers of extremely ``hot''/frequently-accessed pages.} - -The same scheme could be applied pretty easily to the Machi projection -data structure. For example, using a rotation ``stripe'' of 1 MByte, then -any write where the offset $O \textit{ div } 1024^2 = 0$ would use chain -variation $[F_a, F_b, F_c]$, and $O \textit{ div } 1024^2 = 1$, would use chain -variation $[F_b, F_c, F_a]$, and so on. Some use cases, if the first -1 MByte of a file were always ``hot'', then this simple scheme would be -insufficient. - -Other more complicated striping solutions can be applied.\footnote{It - may not be worth discussing any of them here, but SLF has several - ideas of how to do it.} All have the problem of ``tearing'' a byte -range write into two pieces, if that byte range falls on either size -of a stripe boundary, e.g., $\{1024^2 - 1, 1024^2 + 1\}$. It feels -like the cost of a few torn writes (relative to the entire file size) -should be fairly low? And in cases like CORFU where the stripe size -is an exact multiple of the page size, then torn writes cannot happen -\ldots and it is likely that the CORFU use case is the one most likely -to requite this kind of load balancing. +Because all bytes of a Machi file are immutable, the extra +synchronization between servers as suggested by \cite{cr-craq} is not +needed.
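One consequence is that a client may read from any chain member without any server-side coordination. The following is a minimal sketch (in Erlang) of such a load-balanced read; {\tt flu\_read/4} and {\tt tail\_read\_with\_repair/4} are hypothetical helpers, not the actual Machi client API.

{\small
\begin{verbatim}
%% Sketch only.  Because written byte ranges are immutable, an {ok, Bytes}
%% answer from any chain member is as good as an answer from the tail.
%% An unwritten answer sends the client back to the normal tail read plus
%% read repair path required by the protocol.
read_balanced(Chain, File, Offset, Size) ->
    FLU = lists:nth(rand:uniform(length(Chain)), Chain),
    case flu_read(FLU, File, Offset, Size) of
        {ok, Bytes}        -> {ok, Bytes};
        {error, unwritten} -> tail_read_with_repair(Chain, File, Offset, Size)
    end.
\end{verbatim}
}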
+Machi's use of write-once registers makes any server choice correct. +The implementation is +therefore free to make any load balancing choice for read operations, +as long as the read repair protocol is honored. \section{Integration strategy with Riak Core and other distributed systems} \label{sec:integration} +We have repeatedly stated that load balancing/sharding files across +multiple Machi clusters is out of scope of this document. This +section ignores that warning and explores a couple of extremely simple +methods to implement a cluster-of-Machi-clusters. Note that the +method sketched in Section~\ref{sub:integration-random-slicing} has +been implemented in the Machi proof-of-concept implementation at +GitHub in the {\tt prototype/demo-day} directory. + +\subsection{Assumptions} + We assume that any technique is able to perform extremely basic parsing of the file names that Machi sequencers create. The example shown in Section~\ref{sub:sequencer-divergence} depicts a client write @@ -1276,8 +1295,9 @@ co-invented at about the same time that Hibari \cite{cr-theory-and-practice} implemented it. The data structure to describe a Random Slicing scheme is pretty -small, about 100 KBytes in a conveninet but space-inefficient -representation in Erlang. A pure function with domain of Machi file +small, about 100 KBytes in a convenient but space-inefficient +representation in Erlang for a few hundred chains. +A pure function implementation with domain of Machi file name plus Random Slicing map and range of all available Machi clusters is straightforward. @@ -1303,24 +1323,33 @@ latency. The generalization of the move/relocate algorithm above is: \begin{enumerate} \item For each $RSM_j$ mapping for the ``new'' location map list, query the Machi cluster $MAP(F_{prefix}, RSM_j)$ and take the - first {\tt \{ok,\ldots\}} response. + first {\tt \{ok,\ldots\}} response. If no results are found, then \ldots \item For each $RSM_i$ mapping for the ``old'' location map list, query the Machi cluster $MAP(F_{prefix}, RSM_i)$ and take the - first {\tt \{ok,\ldots\}} response. + first {\tt \{ok,\ldots\}} response. If no results are found, then \ldots \item To deal with races when moving files and then removing them from the ``old'' locations, perform step \#1 again to look in the new location(s). \item If the data is not found at this stage, then the data does not exist. \end{enumerate} +\subsubsection{Problems with the ``simplest scheme''} + +The major drawback to the ``simplest schemes'' sketched above is a +problem of uneven file distributions across the cluster-of-clusters. +The risk of this imbalance is directly proportional to the risk of +clients that make poor prefix choices. The worst case is if all +clients always request the same prefix. Research for effective, +well-balancing file prefix choices is an area for future work. + \section{Recommended reading \& related work} A big reason for the large size of this document is that it includes a lot of background information. -Basho people tend to be busy, and sitting down to +People tend to be busy, and sitting down to read 4--6 research papers to get familiar with a topic \ldots doesn't happen very quickly. We recommend you read the papers mentioned in -this section and in the ``References'' at the end, but if our job is +this section and in the ``References'' section, but if our job is done well enough, it isn't necessary. 
Familiarity with the CAP Theorem, the concepts \& semantics \& @@ -1334,7 +1363,7 @@ The replication protocol for Machi is based almost entirely on the CORFU ordered log protocol \cite{corfu1}. If the reader is familiar with the content of this paper, understanding the implementation details of Machi will be easy. The longer paper \cite{corfu2} goes into much -more detail -- developers are strongly recommended to read this paper +more detail --- Machi developers are strongly recommended to read this paper also. CORFU is, in turn, a very close cousin of the Paxos distributed @@ -1442,6 +1471,12 @@ Manageability, availability and performance in Porcupine: a highly scalable, clu 7th ACM Symposium on Operating System Principles (SOSP’99). {\tt http://homes.cs.washington.edu/\%7Elevy/ porcupine.pdf} +\bibitem{cr-craq} +Jeff Terrace and Michael J.~Freedman +Object Storage on CRAQ. +In Usenix ATC 2009. +{\tt https://www.usenix.org/legacy/event/usenix09/ tech/full\_papers/terrace/terrace.pdf} + \bibitem{chain-replication} van Renesse, Robbert et al. Chain Replication for Supporting High Throughput and Availability. @@ -1479,8 +1514,9 @@ Design \& Implementation (OSDI'04) - Volume 6, 2004. \includegraphics{append-flow2} } \caption{MSC diagram: append 123 bytes onto a file with prefix {\tt - "foo"}, using FLU$\rightarrow$FLU direct communication in original - Chain Replication's messaging pattern. In error-free cases and with + "foo"}, using the {\tt append()} API function and also + using FLU$\rightarrow$FLU direct communication (i.e., the original + Chain Replication's messaging pattern). In error-free cases and with a correct cached projection, the number of network messages is $N+1$ where $N$ is chain length.} \label{fig:append-flow2MSC} From 3a0fbb7e7c72fbf7f4478e0c4ba2fe395769ebc3 Mon Sep 17 00:00:00 2001 From: Scott Lystig Fritchie Date: Mon, 20 Apr 2015 12:54:05 +0900 Subject: [PATCH 04/14] Add the 1st draft of high-level-machi.pdf --- doc/README.md | 22 +++++++++++++++++++++- doc/high-level-machi.pdf | Bin 0 -> 110887 bytes doc/src.high-level/high-level-machi.tex | 3 +-- 3 files changed, 22 insertions(+), 3 deletions(-) create mode 100644 doc/high-level-machi.pdf diff --git a/doc/README.md b/doc/README.md index 74849f7..c4d1c8c 100644 --- a/doc/README.md +++ b/doc/README.md @@ -6,7 +6,27 @@ Erlang documentation, please use this link: ## Documents in this directory -* __chain-self-management-sketch.org__ is an introduction to the +### chain-self-management-sketch.org + +__chain-self-management-sketch.org__ is an introduction to the self-management algorithm proposed for Machi. This algorithm is (hoped to be) sufficient for managing the Chain Replication state of a Machi cluster. + +### high-level-machi.pdf + +__high-level-machi.pdf__ is an overview of the high level design for +Machi. Its abstract: + +> Our goal is a robust & reliable, distributed, highly available large +> file store based upon write-once registers, append-only files, Chain +> Replication, and client-server style architecture. All members of +> the cluster store all of the files. Distributed load +> balancing/sharding of files is outside of the scope of this system. +> However, it is a high priority that this system be able to integrate +> easily into systems that do provide distributed load balancing, +> e.g., Riak Core. 
Although strong consistency is a major feature of +> Chain Replication, this document will focus mainly on eventual +> consistency features --- strong consistency design will be discussed +> in a separate document. + diff --git a/doc/high-level-machi.pdf b/doc/high-level-machi.pdf new file mode 100644 index 0000000000000000000000000000000000000000..c357c2d359b50c115f88cf184d040b5e81415c7e GIT binary patch literal 110887
zS*mfVaIb!pl1d52dEii5&6;hVX7MUc*j!xukwTg5a?Px)#Ab8cx{R>kVE0#axoY-a zGP-}^w#(ADe3Vzvx!~<7&Ukmr{fd|P*m{*#Y%RThPHVBdG9WDoYaLV9GcGzjZnpSh zMgekwpPhBDIxjq#@=TXSii~Hdt~xk-iE@@4{iOj2Wis-&Fi&bDw=tu%7^@)M{HbzW zWf1LhtX^)GU&KkAd&CCb%LIlO@rfZ`LS@QHEfOj1j>2TzKtv0CvW~($?cl^3zz5U~ z(2!K@_PheDIp zxw%{gBav0-7?mjJbw}G;eMAg+=);M51wxXE2p?ikrJy9CA|J9PZ%9@ATB9;}nh4=Q zO&KEbI3(oZn;w~-k*o5DL-0*T8PapzMv4@9fzH|gb74nbiY8VgF`Wh*Hi>#5;&wgi z%2_j@{k)U1LlU(J_B0$E91=8C-bAI)*mk67dcQFXXMOvapIl$8LGS1LG5xpOGbI=# z0mF!w@h=oeQjyX}^G{`j0=FRvs#H%~pJscF>jtJQP{$@tmk21<;J~qPp77KCpO&FQ zO5*lyME5^%Px&m6JJpIjL#kz(LLEfZ2!6;49!r@wA={(1`fSuAOh$H_t7vNomQq#F zyg|geT4yyp&}ef!wXG#mJbJ}5)8@X5VGyhFzGLSR)oz^bxO{Z4>AA5|96no?RnHEL z;ax5E+sho+Ll`IV=2eFPIWeW$BQXYJK!^{>3Ssqtxeg@(^V>%Y^}<83uYV9`_G2B} zL^eZ7rVnFYK<)-z&$=-_C9ZYNXRJb_%|d&>VaovuC<9}Gn?#jGfFPrngReT^pHOrH zMn8gbllu#&a^V#s3j~$gip1p?7K6Td{rV{_M()k$at{7f2Df$m$lT&uP?8TIc0R<*ABWCk!_+Wg8PpL}i47AW`Ejlv7V zPQnL9!iLA{<}8@%v=Zgy4?E-Qk@Jj%zVXa7UN-{nH&AQa%rkYCP@q^WY>wrlCTZ}e zIcPIq0(Y7Se5gpJ^1!Nk>9^{l{5c_!FC^77VaGNaXQ6ACN$JE*(H3T`Qfs#;NSc6OCbZw}9FCovBL&JPQg zH;K1H`fJO>U!jAYotM$tighNxNta1gjnRZS8rEpUW|}TF34^yN^eyDNZ8HFx^g<>- zz|ZL?B&XSeDi$R2xwyb$sXDxkX@w-~4wARjY=^n33i6#@Xr-_L!Op$@sui4IZ8jMc zVByM7%CKHbIj`vZa$RVfmHCO5Q8ur1YcE`d7!F-cPzn)%8pu%SbENBy%Zv+feh_eF zMNe(EvR{}ch-@5k*Ij75?!=y6*2Ml59N#Aqj!>1%r>Tn-@#f&bP-l*tO1o^MsI$89MFE4*t|bvDMxD`#X*o-?9K8> zWbcAMY+U5K)$5I#apcrcTB1odvuOhDA`d)S)VMQZEg&7uMaNtq;|vc@bR^#`_2PKZdM!BfBhLi=0lDqkrNyjs#$5S1$`!YZ}`to0M}G{x5&fo;RI zZ&=P|pgMr5^pEL|H|P1q8wQ*R zjm_QQAf)!HF{(4gZd61Skm4C#;#{sgzT)EIY;M4k9DroL-~jd>w@N!Ojf}v>ci2Zp zU~bI0^&FY4t$?fxQz|wqyP|G`CZFDWx|*5pEP8IAaoV^M@^M)J7*!`e`;7604aQb! zc@4hXd%h~ySS|6F6q9$*W8~ta0I|ZNSYAV3>$kePk~}0!nb;CSfpR6XldR>_1uHY@ zW6_g5-k9yq@C;wf8@!?GXYw1b#UGSW`yWwS1G|ixTbzT~fCImdps+g(lxsz04|P$V z?^69r1MFu%m;}NlL%M1yVR&my?_kGAkIUksZtU%pTh76(;+O5F9Rn0sLV_6d*T6I_ zm0YjqKofVU7&C9uJz)Z{eaKkAMdVRq5i z$k#%-$k%)eptU3~)^N1~XYAb^bUX#V^W`+(c8c$jMIEl69C4$qRxTpU}2O( z-81%|OpSyjW7Nyibl2BbCX^<2Q4=or60}TQ#YK^# zqn6Z7xdk5%68LhZ{wmo|qTG+{+@xJzugtg#&7|g`_oKhCPq*@k^5JS+2e%ZRRV^)| zM$&qn;tL3SX0B&X>$-^o8cyID%zVA!Ag8UOAg85mMP38gx`B3+1*BqI1WftB!0E)CFxQVxHwEwO+`}0 zpl0s|iQVvF*}S6jZ!sH>WqW#~NwopO>)ba|HdWNzhe4Lut!rD=(J=8>Ec;U<$CtdC zzAbikeY=2`CfRo2Y3fLo?ml5{E*FI(nCR<+OrHpcFxmKEro9oma#illooxF6`Yd3N9? 
zxF^2C&OzJ|>C*@GH``~FKxM9ga9Fot zR!K`&Nn729f;(QW)kg>;PRG1}0pH2`_tv)Sr(BS>L>ox!o~XC(3BT%rwws@p4;yCn zhbYyyG0JBR;=Ep>?8nbn0Me{4$#7DdQ1&n%OaHGbMYTGM{4s1xM#ERYM?4aSh`ZeA z)3KTo)dNoB)*rPyT8tT{^pq?`{gb`%h6TZ6xbsy*jH7a}=l!}pyseDk)yUWps*pl+ z+WpUuSQQ=^XKRq_f%ec+)|u7_8!g8n69U^=(A?<1kRFAu-(02Z>#{hpW-xIkf}lpT z_r9=)zZgrCRmSr=KUcFC1m~n0?k+U7uiQXg`V|>b}%;RK`*k9XNN83p-4tLY41!1r3u@tyrW+g0G{M z&H)JxRN(!(KtF55X9n#r)|5~@&ky7T@tV{+JmL>c8{WCmio>Orfg-}J&*S@p4M%y) zk;*a`$K`ynYFoo?<{|7Ea?ZkpKvZw$BV$WbiUsL0eU!(1ysq75=YUH4a60+igl7!r zkKn=!t?mfUBXmB~;y?QR8v=F2rY$2xT$Q zvyp+#_NM7>lXNxca#PgvNUK-1k|2MG(4KG@oB%kMP0qUOb_M|HjQ8q{(c-;NeMm-l z1MMkLtzwHjXbXL@JYpy%Jr8-16U3rkJ3~0N4QSew?aJrNeQ&5~6}#yBwh4ATo-B>j z*|IQ3Ojvf@1N%}+Gy36K(y0>bf08J!?|3|1E2cp{CT73n97SZ%0)1JOd zYU|2!OQN2X_9+jn8zXnBgw+d@OmoN6PaH<4A7%G8)I(N~X%Uxv33>R$^8Oj0NXne` zP0d0#@)U^L@x#wInCMvzoSy96gpkLkzW(Aak!p`8sMcq#7rwL3)bgAQR;7qcihzoX zhlpe;%{T&-}!o_#5NH9Qy0%Fv6Er{i3vukC3 z-*M(h&O|0A<4I{@@TU02a0%rYknAvgp<3J1vlwpfM>`tjRVcEVT|73I^tY`P2B6eo z1;CPMG2o9qAJddu-~i`ayDG1u(0xvv;qg6i^J;fUd}wG%+{Yik;jZyt9Zd9#2QZ zZS~Uaod()ka)B8a%H3$yLKb!NzCBDMgn}1sQN*N`~>{gcZGtiG#Nm#eZejFjg4zQT<9`p9h7dH03(k9B)@4dgGBF&CNT|$@B;DcsVxz^U zrk<^p+1lCJUfwA^2alW6L;6PjvLzy95HjXvBW#q6J1a%&O;$;lR?2vK*A@-r+*Vek z4K4^s*muwnraI*~5l^DgDqjfp&;@^!_Q08DX34FxrnxF8^&Fd_3SG7bpoCnlz&*V% zySacu`UTPoLy&n2`iQ1W+gAp&ACYyp8K9-)FpF+NFTPD$48XvLsE8M zPv8+}@d5*7EVkQ*^}Teia{;6_6QmX@pSET3cQVmfsdGvo2^fI$3VKV7FxMX|MUz+y z%-`bS3>}@~tJ=PHMwY74+nz?2c^v8lRx-`?1h2=Er$TLZN$48_`m5a>`wnT=P*wmt z+r2EO~J;sTN+{WE>b>QKCzC2!)7=LhJJ z&M&1ss7q10;$nBhh7LFwgSo%Am_L}0ad`_M$4vPMv#8|;wad07bVwX2>bnoN3vkS@ z@)uHped78$)G~3wGN$hgXOyU}XERat;l|$h)D>ve7_x-A(^5KOaWivP4n%(&lpA(K zB39d>?3u11y<^FIItkmq9U{=oH6-uM>FmMegHEv$r)S+v2uP$CH;ilS$g9JhHd#|y zV&T}EQ<1HM^9}rj(q^AZ1|MJkrzDK=DD6bO`G#kiq!S6AE z*)VqHe3`40DRhG8T1G)ve{v#Zu49hd5D&<+5vQG{ot2Tx#=d|UG@6)b6K&8Hm%8$l zn$j}6W-ZP^>1o!A@7bWUnXVEK3M$S@!BUGQG;ffoLeagJT;F<7Q~JLcJBJ`mqi##5 zS!vrwrETj=+qUhjv~AnAZQC{~ZD&_^L`UC0`VQ{+yn{14@7`;zXR!-osZT_9z?2y$ z=3LE8Ojd8J z9rq?o&cGS+#OB(b)6&_{IuiKAv;|>f+8x597GspCAtK=3JTtpi+tjHxp|)?1{q=02 zZDnZ%!giEq?t>ah60va?3t`;B%v4BF$vyGpi zbg^8BQE>>Lz@1nz8Gx-ULN4>q7%|creSnpclAf?6LAZ!e!dW`Li=$1limqc8b_W4~ z%s#wS5yRr2j^TW3K{+Zax%?fMZDn0!zNX_UOYRVMF-hm$O*nm~c=-qKs|xgA`l8rE zb|rfk3B~E-WvyI^lV0c9S*PW2RHJEmYw7&kILeuHk7!`&pS{j&)R7 z`lBLOcQV@_z{jTdJToep_9K{wtfXI*UkI;k&Jyeqqxh9Tc!@lS6e%TQ7tO2c?MHye z14eJ%rp-;H(`iqaE*2678Y*I$t7e_`I#a2zgY1v4!$zJwICu~@ubl4?rBE`GL4^Gi zPpJ+}a3Yud{t*Fcg}vra$z)G8H#>KCoyVzd>2!p>Xp&u7qV!|_93el&DIvdTo77!F zVlUGKmGk}ZxEz)*hXh{HFSThW%}U367e!IZ&ww$a;1M~b{@-1fsQFL zN&-oCZJG{ilGV;IFfXvtd{d1SKQC$_mD$7vNyfZURfvl8QpuXf|1O>HN{p++W}T6w zfOl4T@>UTr){wZ6d&yJv7Qy`yheUZ$|^(v?UoJZ!z_d8>`?oipFR?3eu#+#jR3M#WG~+n{o*u>p`LPIQW3s? 
zx|p!&Wl43t2eT!{I5@fQ7*Q^8o_1+a$UhUICW5T6)WlRc9fO7RQvOLOvYDH)$g$?|gwa!QJlNLZsYWT*3tlTe;; z^VuUa0K9V3`MPIP#^4JtJ%chsmEoqvgH)J`LxXZGaGLvABrP>ubDcvmDn?pCGJ>)+qjsE$I(W;Oz!TIIhMAqm(H*9mVCuXeOReuHzQ|dHZ(n$()PMWRc|}p{Ka;paeDhj>FK2G z-5#t(+a6&zU&kxiOOv=<00rAfBc}&prv5nZbF%pv#p|fu-*>WfglF-iJ0r(5xJlc< zAP>jNS*m_FdEQQZ=bO89`-@-yE&v!C!b$6%W&*-9BUXS!{i#`S1*$P0WQ|%{)a9pV zg#D3oxcw%rDJ!X}=uxDwkX$1*<(Gy2 z*z!r^!Pi8>EeP5rq+Rm8{hHw5{^RNz9LxJV)um&?cl*U_UKO^K5;cjsLz#+&Yz;8V z(ecq~fKF4DvCn)axR+x=1Opjs zbwW_NO1{DJ{B$+z^7M2?2SYn=DlprT0|bCOuw@-g8>%9CD&NLD<1Cf=_x$wPpRr{4 zK}eCcc-wS$=wm&6S8=5?Odddx#*5y$J}xzO{-%5x6oBOKr!?#jabEqeb}7R*pT#q3 zp+(=zlAV-|D5=d!Ed_lYo5uMQ`#bcAicW7>(Ur*V>u9DxT=OAWhPX zEs~SVld%8{P){X0*q;wKO^=v#IUYCSwla!{nU&ypk)o3fjXd%yt5ku$o68}GRqD>Q z=gw|kut&nOs)Cy7JuKB!!OaAX7joC#Gd^1vb3ez+o#y@JJgC%sj3>k0ME&`Uk9SQ> zph5lpVcVN}yDoG`L9<_#)h=k!bSkVT@a9IfH(WiAcPG=4CV7;*(^RK^Xs?>j0TK30~+;L`}*!5OZB!oW;o`CCMpIhAZaC!7rmmNENB{Vn9 zXn1w|V&rJzgSt)5LP##_Q=aqlpMm7CE@61urb5xQ*sBw@`ev*+6dXnsp80zjtuZ!DoO!5%nEbBXei2A1Y$`uoN}v zb?#CcNDh3Kxg}foAYX?+6$Xn(q^Gml#l8sxx~^e6@5IKZji#S;gje~`UxDX@|KvRW zTsK38#)zuk*Z^eWeD}5x*V;;4rN=GPK$)MUNnEM32SHXk8=kjBk*_{#q2y=Kh$aMi zk*l^Mk~j`vA_OLA)WeBe7M{hl24>;SePdtuMqk%|)Kt>jF8G^Fh9Q4`2MZ^CC&Co+ z+Ch8Dr_@2BNoOgnXhr!TBYlSxkQyk&=kL2H0O+1v0Ce$Bq@W6sGi>A}mEA)eTIRSf zALR2Y0sL4U?~I|If>*6$pVf%zV+Gu=BOj0**NXb41I__#pL6p%I(!t$B=#T>)Lh(R zGtMOFbA`0`*|9mpa+6H|a!cZMP&9P|qZW6Dxq9;ksP16UH*&^3!Q3jY+4k_nHCUv- zZd2Zf@su+35>iu@EGoum$Y0fq;NoL%j0ibIv*W4&D<-d|NcvU{3^a`>5&M`3);#4} z+S35w_k*4|r2;e3cW*y-pL<~YAu1Nla5nE}m07yB!`H$4O=gqpI`)m--=q>a$i4a0 zfQ^sXnH8(Pm>ef9;!u^HhZNeO%(_aJ_{@>TWfw<1ZF>RsYiirD+I+RS(S!&2M2$>u zs*X`OwtDHUmg;0tfvWLdwkoC_RD>$c%1CJ3Fkdxaoa+^WEA9<7evq5K@Swtdy87h| zMFJX{9=9uV9gHkf{w?2Um36+`a0Dpm$T=~wUJJ?8=We&6D?0t(1a}T8!hhHZsi|pY z%D1ZwDf}I0Hn*rVLAe}Hqc;y~+vJl7)2G|t=U%fN)aG56V$>K82EMD|x6w07IVeB%c%#~~;PNv%7hT|P z1>j`l`;7WPY+K$As&s92h|Sna%zdLxd}xogM_0$||D$?oocfouYcs&PW7Ar?+vw}H z&RQ&!>CYd&^a?UP9(5ovlDzn*DVvFcW<*{h2BR-Ez9L@7Z}Iow^ZS?L?zG=EZ|fDR z9orYw&6}1}+fiKrzq^x)^D{8kopWg^%qEbD0B6}8S z>r>i6UN=~5%3fD%yH&qbOWaZ1Qc$1@Z!)52BWkFjiGUSIp)2kpfplvZvn{QemRiBLMWT;cE)ywFQ2xL7& zIf2)%y|BEoZpQOBV!&L**+5-EROqPwENZX#H-)j-xGbm8pBcq(41*`X+xJz6kQR1pv)jr~gLH?PveNY) zxEg5rjkXJl-2Wns!Eby>y=uFL-GYjnZIF7U-}+&HcKRoXcHUfVZ1=r7c;VizE6R#3$}t{Uw<+W9lR6)^;!9^a-EErqBGuZB@vXZ8)DYSAO)Ev(V zN!{7B^g4Et%3^uwLbD{7PZyI3Kaw$iL;(&g47Ra37r~(&Wgg4C#JnpW-OMU2 zQ>+o5P4?BJP_{G1yj5_}h4E8BUlijffh0}|>67AD+7Oz<23pX+5+P?kInR;ZxSlVJ z^XOl+fCIQvK^xVRXdWG`Eqw)!jA?Huw2g&--awb`t=jGYRS_rIUFgy0d_A%@h8X-Y z%4teKJt;=nK4GMunUJek)u{9dA@)^L|7x_}FL(raJl{qR( z0aO4Et3wASgMg52*7={|&Rd{(5B{9+)4f1P_j-eAjZ`-_pGpCxJ-$2&E>CXtqrtv6 z#RrLRW3aq^!6g&4&v&3~A3_nBATo0VPv}U zFB4OR9!HN3a87nifGbXM9rl;sj?SQ*8v}T$eO_c_1n6;qZtN2h5`+2}ycZ*Q(vg4Z z5Pe#iYG!BBB=H3M2vzad=iq35<-rhs@9u=vw!w+22{~Farb71g_X>G$xI9O&YDe7({4?Mu*~U*yj!s-jF|o1kt>o1Xk5)~ ze7AL8TZ;OvbAKPf@f_)+{QsJmxIKik;Qe4N9enZ+Y={oJFpSP|#ednwert*L6RN!l za6EIp)CAqUrI5?Q$Obs;@Ify~s_C$2+TBZeGIBWbKvLLq5?_|*^Gh*P+_U+F`)hDb zE0S=gjzPM4!+nl9I$(P^5VA z;{v}Tjo%SyLj0O}Y45!K#RKa#xu3l9R;nYO#n1gob?*I~k2MAz|Km&GJNp&z&`g%? 
zG;VIe4kql7O;pOOr3?HC)po`{d<}?+h*hiv1w|OT0=DCj$69M}qkI()$i2-;XX zp6#=L!s@{|U@Yvwz&rJS2!*8d2Qs7N`U64sl#ZtHWDhBefv}`&e$@q~mZ!IhH-E(5 zR(6M$kc!sS7SKjlyml)tRE()wD*J9|Qj>z4>O%t@ZH zw?Yv5qG7DV@9UnT`l31vUPi}TG_1?WJGeP_aR}^F(|g%@7qLx_EevC)HeZ5sU)0$> z*Z8uq{KaKS$vZH@!-oS{4Tns>bUyDZZJq;WOmKxS(*d4=pFBHTtV1)VX zq8PaYik_*xW&(O)yRkT+otv|E%}qTPu-5f(Tmr8oXG?C0haex;$MmQPQh6aEOA$lN z9A^5YgZ8t?y%$Bxz5}S0hJ%$X9%Ma}^`I`5!Cz)6=Md-xRd+BAsVE>@Pa8gW(-c-Yik1nRZjxMS-MvlqzQn>P3O5PczR*MX$ z4H1c(YE5*BT>J1mk(n8q>AJvVG4eX5T=CK^#+GR)7Ab8tYIE0jt~pOUw&0p9JwnF$ zH~9?*Vx$7%)fCbZvS@OsBAaWd1qIB%r5Zx`vNdvGEa%9FgCXO>waA}%dmnNvfhzF& zO{uIWZKqN+@37#UI13d&5NO3gx+Een_O@Yj0n#rL6h4(R6K(&PJ|Oks#Whw?yaBp# zzJb{rJT!ph*+Y5Em9j*oltcAYDuE_{17dbN)*JzrX^dIZQ@4zSh?h?z*f!T-cNG%i0s}6_3a1E46iOuwf`*(g9%9P~rH~9R5kco*lTHVv~N=742EPcFJ zfcRzpA0NJn>AHkrx$W%GAg~J0Np3A{%WAmkQwmak=;N)y=zx$9N>}Bu9%hNTti%FG za(WsRCJA5w6;^_i9kT_e?wxh-gsPeGM<+XSM7)Y1|cHc&{RqQR(2KdAya3(%F87S<{(28|I`(u}T29xgvzB(*6g zzzp^ODgY4db-Xj6cb{yqP=;bBw*PQs#`Yv+*`lmwI+P$pG4s6$F71~CmG2pI{flfBVm-z3trc;rVX;9y*Ko| zVb^euk(`8+ijrDS-oxxK71DL3KLX{JPji&#!xt)HF>0`a$w#y_(1!h0 z92CH7N#spX@4Plwh=6;|v8j)c4S^c<5t= zTp`t|UH<9O0Riub>JQu`G&g-S6dyKP#z+t{F;D(X3)5S;t$PK!=Vj!cfsxUg{TViHD)iP7Oq(dfZi=DAz)l{e6zFdHN`mTkVr z3-U_&g=Dp~pzayc%TZd4WmB+4M;Kfmg*_isZcF@t8RW8GD#1#2E~Ydjy3- zigu*n&XTScY~vRzhTFbx?xdidE}mxJA@t(|9H(UHgIFma8CknHwG7q~6fP4|&YmYl zd*76fCqF@|*PP*;tqh$(dvXc^skj)oEKQxSH`5B}rgAYYV4mtr{Q|NE_2bHF%<}WV zm-;S*(4arxX;Rh0Wa-d6gv`B8Z?1O+m73x1r}sUuusFivfCF+18)kzkQ)Bl_PiBXi zc)lu6<3Sx?pK~Qba(tg~K8v(#lcI8RrE$r)>O7)98D3XM9;xNXyQD{2>|%5~X3R+7 zE5mA|gbj+;*7WS%{ZK`BF-p-KW+fsT07Zjln*fM|P61>ENlAJ!)lY98DWf4s>U7ri zcjO+pe@kPqRqS`kecxAU<51JWQFhhjiG_sxkQ&od^l;u$lzX4Q1z9}E%;ToZmCMv~$;i%>__J-u4g2N8f zwIr0LOxpfcA-O#46sPRbHAo`rnV@VDV+{cklaA3%QbDxXfq~k8T}lJCXUya)g4eTf z?EIZ*BTO|RW7x-F@+hl^d`g|oTHge8@x+6o{Cg$a{ z%@0r3AD7`N=<`i(BJ{-6=D8D`S%i~B%uLBkP2-=Z=b=Kfio8{|k#sP##pk?>Ge~X9QZJ-^ZDAj00?37z$asjFi0* z&=R=Fd!BiyjE6Pa#WXA%7&({o?D@}<<7Ogm$cnL*h~StSn4Oq4?wXZeS-Okf_B_@Z zoL5`2mr^B+3J&1r``)c^b`lEl#k_CCzh@>G4onYI)XpYwYhnukg~knvc&SD6JCa3^ z0=oynQEhZN?MVuje|%_VjRn$67Rf|KLdipby*P@x_2_E^-0mK@PSnNH(YrLVW~+Ek z?sBSNJtuImAs~vRvFe?4PaL$&#>1KTc%9j7ivfM2_o5m##ahg?5OaTs7UOK)fm5Z- z3W<$0#IR7~>05`E;wLrhLQ~FXSg=A&^UK@RGK{7G(*UY48VRaDAPnhrI%ks3N*m72 zeM?!~jGO?bHYrT=^yj4m|G=@|9ZV;DXO+3+k z^o}$ZrHxDuz+))}GUGOzw*)JYpJX1(>o=s?xXl}hFqHBb#UMo)<4Xb%PD(8ndZHXL z!o!og5K2%8k|-#~hN6AU5HFeewLQ4&LjS>lA<@fR4M&e0@;@4`kj7Pe5O z)en2_fqW?M4gwQ+p$f*#m8ZKY8}p+Jk~rLsG(B{g(vDJ%pgBH;rOljLP3azPTB^ zW`t~u?$#I2t4pKSmzm4C^p;)rpY0p(-M8m)U9Z~imG|mhyHL?+Y|(mk+Wm;6B-Xzv zvN~kRr5I(2_eYiYCspnwk_kF>GKu7o>i+_~MQN$(Q#T{>Q4A4CqRgKP1D%;I(cuF( z`y;uy`~b6`n1TXZJ`9gl9uyCp95TBLwEGAhMCtlMO7vksnK30} zDb@v^U@qNHf6Xj%RT?rZ#Rdt*EFuLl@`E9%a4C9~%5imCmB>aF3z%q9rK)ihX*8^c zfT=LAdoR@-ArjS!`?0R9skmz7hb3&65UGl#u<+5Lpke-8g~Y^?@K+V-@g|Pnk?N)= zCa59>Zzi&dqb(k+S=MI_>s^3rJQy`Hfsm89zy6*B1} zybX4wG`%Af-+(0xsbU%K1~$8)3ydGdXiehDh@WPNpWcekU$~5dvKc}$eI&X>F?r%C zCCK84q|>izpb>`A;RWRKwF(Z(pJf_V2xVZ@(J9f2q8LbPW915{fav0!{G8&PLPIM0 z;lh!B-Vl^>@FgL`_J6jJ|7!}7vP5gnkSbb{w<>p0W@^rn%GZtQql*|ewNR(4n$>tI zC@X6!TbxfUWG;jh2jDhm^5*jPAvr)8l?>q)Op43lZ+|Y8mDccQe>|Quz8TfTOyhdF zWPPSXA7WC>tVN^ai?uCc3SES#f8L+|`gr%B-1-Qc4fAmI=`WJ$(v8t$x21`2nrfmGJ#Pqqq>Z6V1taD| z>}u>|yz6jA&rB24AVkX&sWH*%b2NtdYoWU$<|-n9xEN6iq~tvzyBHWYjYzL>D^57} z!3h86d7dQj#}b3~Nh>`2-i%u*?CFO8_Nc-;y+Cq#W4)`ho1|BR9y^q)DSy=a;CA59 zeJ9c3hdb9Nc`(Y0hab74`pH3trt$%u%ZUd2>-f$ZKga*jo{`P)yFDF21Tz2k+Unh4 zvjGdnTqTPq^Ia#EkQTmOp3mP-;=Z30|5H*x60LNqFPO9AbvQTiYdwL@b?>f|t=}k7 z9!WbK-$}6gsDCI|AlFREzy$=7a`H`u^}VVmD1PqftT;;A#(iEZ%x8dO>MRwQz5gb! 
z+-z>Xi~q8TBlIJq@Q#?=9F!hr~n5*)B&TuJj&bo_$s>=Bzq?>=SulW|f)$;Z}$%w<{8>l%BqvgbCB(ZyQ> z!FtrkjdQ0I$uCb+Oa_zZ#9|{Z!ccd!-{06>8teMZB@Vwj#^aSWe0w8OST_3mmd^z| zOCAWlbY6F_q21axbaG_$H?Kqeq#?`Ty&wzPv?%O+)MTbCEgF3Q+&MHy1p%2#6RudV z+v^(+-THH}Dz?c1!;Svr^AoXp3bCwqI+-M$47g^=?mT4x^A*B-4IVaM*Nu8nFEW5P zigL=bH*!i(q%f5Keuc~w-5P{T((^U~>e|_l`PD+p>!#*QfE}V*42eTKhADD-ufac9 z6;j%*HAr|`obN(6x>TPrC#ZB3uIi)l7TbHl;Y5s-M_9ru%7YIT=M$#7tDnx4NuK4m zNAb8KNYlDzvoIJp_k0^Y>arvM=dol{a8TV&k_`m|b@ zreudBk1r0z0Tpz2`Kc(zzWnJhfeaNruL8Jjbc5_B>1E(I=$D$L#;vt;`sy_dx-eVx z;lE*IvFIY@OXzhGT12mHYon#LXPGs@e{kv{x+)#2o2mL3(~$1i$xq%A2WEpvy-G8!BogLQk1d zK$-DYPGw1=GdVP9x9|9e=Z|Vs8F3`anXysCVuAshISY(PKLp}U-q$ds1!_sZjAbQNR#$dJ>wFl4@BKHb*T-UQ z(Czw(umfPz^=MTn4xWu|6{>p=JhyY9c~J+P<)!!L>o}HORQ~U;b>T;7_!kPEsGB>U zJr}QSKDXw^mAk{|ZV%(hBUY@N-cFNUPNGZrrR9~?r6(7ageU%j@u|qJ?xzbn;xR^B znv?`-YkyR8Nx{GKAHsS$81Zei;`4=xw9b1eH>NB|hx1)WWMYZ%=8@GAM=c%CjUC@D zHy5@2wjFBg2~^tf(31yuA}k>qqi(OOm`pa6HXuGuQIKPY>sV_#AjR?*@*1eNt&K~% zfAqW+hkIwvMekU>Jb&i={vNE`<^{5ffwt#yz?pQoa0J?|87KP3&(tn9L&*ubO{Uip zFSFyBxF4kmBwc#4C=$WTWm`370LaMRksumWks&Ic!D%k~Pb+U#7j`gXU#y(B&$E@a zkEWbqoEb|cBJ!ZBWV5@?nRSO&tLrE8!e%PyEHUu?yRU>>&xKAfk>1D$cDQk)Bj%ZTBExw`{sq#zQ=0M$1Df=MZ;bpX3ySc6XH0AnhD)$WSckJ6 zJgASAMD9&oH|_26S{!f#=$SdrE;|Zj4fvf~vww2PA7vB1`M+o%R}CJYtItLwC03@z zD%UkqQgzgYz{W11lq37}baB>vkdhyGk9TyyHhC-1%j+;BtS!GvwjQS=9h4WN_YjvB zw=>Ag?a^p8+2G`OqDj3=-Ky*BtXp*RtRgIG>d!^>7i4Bfw5M@P*W-lrPH}~q1HE1z z5q3D4+d`X!f`f6_6Cy`z`XIOzE6?*$C~9$g-4En(=C~`SN}$5A@HSZ6g6zhrX~@}m z-eh$eZz*vvxLl_7`hO^*3>9u%UI^OV(v*=E40?#AD`~)isZ_|uvHE8EiIQecb?-wb zar7y+3#}(meW7Sndn3$zVo;CP#!ChoG`JFm6R)wk%d@s^+_*~)Lcs*Y=QCB{Lg{nb z{qBc!LGq{*Rl&qQTq~_{at2R#7>hWh$LmSCU33Np=Wu(R{Y0n-WW3Xs9dTH9W5n3g zlX(W(6Yp4f4y*f*yb56LYrY=mf*-ow_PD-ES%vLjql3Ewt2HEv)$6{RZ6+I6q?Ln~ zTE;y2koucjYGd`*`PWu`?T=ziD$GoSB82N76tM4oC-Lx&$se#e1v*v_%b{5?0`Y0r zUz2gC)Ot!{8mVNo;6{|$=P#P{r&1Cs81nBrGV`qfAvz|}0)>*4_|wA7cDCFZgqrY~ zPODQkAO1I$AogBcy8e@cCo|7&z(HFEsW_t;bqaKx<&}H{D}7AETIDp0$n2ABK-p@j zq;B23yl25JR&*G01pG)yGQdG;;YZY5={J}@EHz0w-&3%-V;ppmN~8bdwfjyLZF_} z@HZSa9hFys>73O#wg6m3yp`{~9Vvr8R6!z_iZRVgnyN5t3|^7rxk%|1iZ^6iP5Mq3 zibzKxN*@U?rAxyMe)Q>DfRR6lNF_!1>5aV%TP7CX#tzr@UeBBW0{^_t&gCoK&?22nr)Ul^_O z{6CYrNeQq`^NM(=f}HNWJzi&aMQ{nf)>}w3^H@v?4Bn~5`;J@V zP}QN!r5bN(ylnyiFq&LmY7Xr23re7r5M(q%9mDbxq}n~wo3hxAjLvt!L>U$vh!3t{ zrBi`MTiwlIMMN=>-h2PZ43*Pp39*TMzU%C@-sHQP=gq3|)XS_-+V?k5F0Hxm5P;Lx z{oh?uwT&QP$?`he3&(d!ACq5aWREj&sIaGJ&jAqans$_^0sVH>%c~clrVK(AELgpN zcE?3yU$=(v=yP?>fJ!F~hX39o?s}|0jyb2F$g*dd z2qOut*sU5V4Y$xzC6?IUM_0-cWmqB&4Uth;8J#9Qrrv&5X#+;5re@ETd@6P-U#{pHUwD~c1me%ic+yGl4RZ^lS#*l<5&C) zFw)_YNxh|QvUsBGg&?}jA)V{5+C%g-)~=F9qnC|Ki|PTDlcXNLALHy_vl`jhu%5)EL|X$mMf8vkd(SZ zgaDvbJ!Uf*)`=F>t`6%qXr`pm_M4{-EbQYp43f?AQiW6Jl{$LR!68F}D!Y~8Yx+04 zlm(Y&@@tt6O%mJLjb*2_E8kY6(~qMnx1u#ByMBMft&dq7@`y7=Wu1>cnn5tOOzE3K zF(+Y-P^RzynvF8POiNAgWdI>zs%5BUDokHZW1QkJ1z`?OpWt9x{0CaoOXvKhlli|X zYv@doX~R>Gv=#lA}^^A9ke?r6QH!YmTccs>0 znN?%0$cK5&`H6fp=Er6@5LSRBCuX{~zJa!x?-SrNEp23ThmN26^&7=6|E+YBf1FXG zW5|r)WjZ#L5=v@nrzz9AVR6_*95WxnHRbOFhjw)4&%>`x&X7_APb_+dNwQN`o}{7^ z5sY}EU27390W0@50#WbtkA;CQT+03X#ix8hW9E7iaF96Q(Gpfad((YDtL2`h+r`6h zPV;m6@hjNu!yiz#od?eL=l9PRhCOAAcq-x{w36$ozI72*Aya?Ys7+as`vxOzMJcW! 
z2degX)~7bCt2%)?vH~KI$uapshOIwFYX45xvZhDdZNmU-@kvP8@bTK(k}?{Lhq3b$ zP5;Q-TFd_5x=$6nLi9)H_YE9cRXx;JiMYoLxK$^S8Tsw&kjNsA(STcEf?QKdWEDLx^AK-3qOESp+5KKt#Wm0b|U#CS!`+GTbCl90G}GZ8ah>w zu%)&p&)d76^p&DfGZ2-wvt*QRSx{HP)^?Unf#d}HxNOE93HB->&*)(VBS(Td(qD%o zvI8MiUUHIom}-Ip=`&(Pf`r-Lo>w=piY#J&&hHCv5 zp2~fBKivAuZN}&X)NOd5^!ty^m{X_*C~89fHRNfrKNEnzr?0DKT}Vo=oSvRK z25RCI_;CgYqJWH9IWb8Q#CN+J81LQYYV_`>Jzb6j+Y||0GEBf@^b<1Rl;E041UD}J z4GnBKY|&TSUR716yB1FO%0L2a@$WNIX0zf)Z+X^B0;} zzYwwWHCcl37jfd4$6t)~Wz#se^G?k78q`UQldG9fm>TK|N>z!FU)AL4TjW|-35y3x zW$weL6xUM2H({wd4vsA;#Dz@-Wl}*f?>w=%0%qJ=+HOz(G6KHiI`r)a!ae2)WS_2c zVMl7JjoHY)2xL5bM2T+)=v)V8@d17f(7w(^;1R<;D^(+!J%p=5ARS6|_BFnT^u*4Bl^W4lg|j_N%d zMLB%g^WRdcWbYFTH9CeVZdT7g2%!v#)+2viFsxY!z*(e+TUbaU&VsK9rQfE})t21G zfZ5ewxZ!vQR%6>7;z^dv!yO+OwN#;Fi>zh$&={C$x^fIDIH61Kcg1weOtB5k6wqNB zG~9l9F^{D4H|hJ52V=pT^j14RZ64h4D0oK?v-W=$Pmq-E6TA5=+Yn<&fK@^V3q*p2 zyWnA~zL5q+tIX?IS(wUC7EA{}{}yHNb6+nHa3UAwQ7{C9xom}l0%fqOl!CrKIgTWK_df8dbzx!)6o}|03 zxtd9~j>r724#N*n0Nngy`V*}*?<^mTW)%i`Cm4Gwo*Qli2o+lvr-po-QJe>Ri-S_R zk(!Nm-C)>|!6NS52w}BxfwF|-VF&w( zytuY@u0#U)DZngK!LaKRUvu8QU2N-eD>*4eC`RhDqn@5(n4S7Ywu_o^=!jE(l>W9t z1wfcv%L`_miGlVAC*n@HoWAb<4le9;pTuf77R@{)}xa=;R3br2vkBK-HgyR#rRl3>}*h zkE?P=3T0v7r!OiURZlj}BA?80^=7-bcMo47Ok{dQL){!$MmUtb4#C^KC5Wk)F-eRv zE}3P;ua2KUx5M@7q&TsRLZ=0R+XqYfx0jCU&ZuhDVYG#w_<)S<_KDjmOn)7pnN_yg zc8u_bqfL6t76AuASQu(}HZi&~T$9BF9lbdhqIpC&7_MXPt_{Y1!b^V8wibNQQMbwr z6#-|m7rF=Qpu>k!HAM~d8iADVEZi9UZ<$3mCFCpA?5NUU4stSk>4ywrgbA0o&vv`8 zy3e$SRfDX1->xB6SRRXT%wB>&!q-nb@?%j<)13f_`+ zD;%at%hJkWS3XAO_NJOp^DDLFQkLVux+KKM#e^=3CyBP5&E&Q@Hlkx8h*@_+_Z*8Bz3?9yQw zCBe1C*apyg{(o`y4!oj<%eLk;*Id)KZQHhO+qP}nwr$(CZM)Y^ZaU{AyOYlT z2VcH7RjH~mGMn+y5?EZ{sj5Q8TcLaK z*O-0ruvN8`Gr63P=h2SPrZO4_4M)_uounWgs6P4$+N_3Zomv|)3_EBb<^ohi#w8*p z!=lz|09%-mH9o!rVNhP{cko1!9Jx?1+yly(6Bo6yV6Y6o*bEk6=x>K;SRn^Hx)UA1 z9>t44-rFy~k3_dH8NdV1Elnkf$Qbj26UMrCp;D&F)jrRf60C|m6C~0k5)4N>L%d~} z)^=qxYP0#h&j!JnqgU>cyX`AKkIR#)w9Osut{wCjKz~`sXEN$D0xW4d?n9u=_+5r+ zjX==p%(1AmT4F~CVqHdWm({Y6%sn&WF_3?LGLG~rd>>^;d`8b{?uJ)qbGxCM*5#19 zR>%WUrP-2zY>VIfvDr@~sA5~Z>&!i~z8@IP;9QWbELT*MF=)FW>uqN?*%Z|{W#-RQ zf~{i5hU>evhA(t(qK&P^pFp+l#@C_S?g#2V9OX@%Sbd#tx@VJr#e`sJPFCG;cpt~( z2}^l-n)I(Pq%&8-7roxDroJy!_I%oY*pqugjy})|^UF9)TWMH}D4bF4jFp9-mF1%? 
zhlJE>|CNE}nPNSB+bZg8=|xNXR4q}2A_56L3p*B2f2o>sTwNb<3O@nso;L@Tt3s46qGkCeVFc0y4OHoLrV#)v_SUkrDFg)qx$rdlD||?;Sk+1awBm z6LT@B9d?^blfQcAi7n3RdWDlfzk=A7xyjCm;aA}+>@0FhV5qGUQfTE5$a03XH%14# zRXwp51;&kZTadS#;cq$PaQ`g?k%Ojc*dUbWH^QDljfsS!||up_{RtotvATopIZ|n!0!_EzyL~C7||)5NAvi)sIkv zPs^LrKZVg=X!Me5$WOIQHcS=3*b|P(XxR9?!Tjoht+0bi02>GK>4nw~ZuDx(cMExd5!5cL!#t~!UtM%2QSh3PFz(^+vB;eS+KDumgD98 zZe`b?w}T&%y}8o)mZ``0_5sj)5;!M`$8%?O1}XY%4z6eR`9B{*8Fy9&iZWpzU?diFM`y?(9v;O`TBPVBm#aBSZja?6BF>k z-R{4ilc}}dKI6mo5tm<}1=RBCb``S)-QO!Sz6!Y4INw61;k=;3^ z1^2t|lP^{tg^xta!ql#*SZ_6kqEV@E3cbZH%*ytc^w|_YheAyVwSDy{g2m6@Wa?9j6afwh_T=h(;cO7j>Whwr)eae2^`kyU^- z3c4DlvTRJ6UbrekowY|vx>wf)0QDEb-_#EC<*?uppLSw1F2wq4rzBEzplM05cSy$> z9Hb1I&5UHG>hNAe%Ph=4L|!n%{_FW2X;o<>Hzo^Pi;_2UMkO{3Rsm!e&zTqOI|NeR z2#H>7x2=d0|LXDqyRxj2qlrQk{16k?a!>2UkdZgO-JwAsk z`!5c3Z*S&=1|Zd!7evfC`1@c0ikY?gVd>TLqRWQFsk3N$<{I?oIGX#tVgI!u z4YUsMa@|hA?s?Q5=m~opv6YRKfvL+frbvJ(M^>a;t~r@?oEtl+JTjL- zFbY_~VNiBFQ0BkcNw?RSyPgW&){=H=rywHXSaQ5wQQ61G67I|z$nk5+xW`+LVtz|v zJu_FQI(!KcOyav~^dTqrIS?6c+@-Xjm*l5OTnQ7i1{+%YR1pZqgE66WhFSgI1&HdBh<5v6&;^<1c^Zqy`?|Ci26|&!EEJ2s3#dPJryH1e ze#Tv?nYst033(uEG60uuiW9$nJcQUbwhaH(Z_e~R0bA>jFNWGsb4#R8&n5w1;!Ae} z_gP5YS~wmMr*n8kTIxvfF8-8|IjiXGvV~YA4%g=}NKRfKStmh}h=jVk7Dy2~C71f1 z29^Q_&jA06@S!OR4tqta{P&l)nrV3ZKb>O#4{`4QA{zgHpJM-^?pXe}Tg=@FakU*6 zpO9n(EkngvDtv?wC9wewK_EX!F(=T!hzJm)C1`jJjMgpDShK-c9fKO@95ea_hHCvG zz;)^-F65XP34$F&3MVNjBt<|i;iM~w=r)Go!u=LXFoT5lHhYoR$(d`V`}NbO_xAO( zCmv2*JV=N@T)W%I`(>lJcS@tjYrSe}C=qm+;8$8V3Z48b0FFc=NK^y@s+^8T-2VE# z=vM~)9)e%&uy1`Dq;LpFWM@$SSA1WL|9AYxy zpbtzFH$rF(D-A0RHw|ka;xgcgm!`s3fxirU9twgF3KAQ@j0oG6R3MrrY!K-py_Uv6 zODC30%tm{DD_cD32~`@8C4P;*5hIsqXDZ=fCV!r;*YKAX-o9%-+E`V1>O6d?nP^#9 zn3z~tXql*~X=$MW{i%JEfc}_XMQEa;5YA+KS3mCQD!?#ScXt%dB>>7VvF6s_-kSZx0a0v1`xXBmIt2t<}4) zbuM|6rx!Y>7J+8s3miRYg$M`dPsDb^`_`$=1?wGO--{517xlj7D@ysy#vuj4({sAW zzAq#bI^SXg!+ZWySn6+|AxMHx;h=;?`%Dsvp2pOjdQa(ACpafvGV;&6iRlqEJdMx< z5}j~LGhMkm#IlA4z-;cK!4;v+?Pa=qvy(JlgD3R$$FxBCMFO@-u=?kN<#KpUY+~b2j7C|YNa8wQ z#Kju}yGCfuON6Ru$%qwOQKNiQttf$)vt~rj51-9gBL*sfdM4XCv_8hLkr|7G$awDj z-A@EB*ZD8Q!?N&OP*>ZPHoY+ZPF#7YFM9qpoIi=>+C{VfUkq!iO+J=zmv_#v$D5YN zP&}A%<5f+aou{QNa{~(M8id-(OL|(9nZ9}~H#$fc(>Jb*;yI}G(sF+4s@xq>5^4i9WnZRm{cM1ID=v}VwErrdmITAVl$3!)wjc+ z!%$n_B_~9;1yg{^fj=mf%VS9uIg>i^-50IRmVQ~x=V%i}`$EtjC%p55rL zg_8Tgd)fP~a*)HH^q*YjtsXE?z-v^|&dakEOtS##dfASH)bHet0pn$7}wHpJ*Ni2wD&{_<}Zvn214k z`*FL04Rf?*oTsns9qzEFhj>yPYOV6}#YU8^$JUbRRu+C3QT~a}n*}Phpcnj#1q_-8 zh(m`<$WHT|davK$Ma`c1>y}NK=M7XXRZr~-U9zEvL$c?1LDqnqGL?~_-*)?9hzH-i zVx@>QciD45@|Zt-OB+B%oMDyOfX^!oTIG8#!tdDmhW&^AhGrB5&fPCh@tKYE0Rg=&wUrstq+X2w6~h4%4=$7{Z`>sam3hM9QV zzGPbC;4uSnBNYz>2h~fz18W?;PW=VJZ_?LY#T>R$7xwH_=vCT?UjN;pFjr|1Z07}KUe zzz4^R)1O(GZ{T+&XGZNjJc75;f=@ojnI{`Ci_bC$uLuyZ3rY4X`M5m*GE^ zJCDY!`|SlW6rm-L&1Q&%!W4(E4cgTRS0W$f7Y@(lmHa5=(HZ{5cOT4-Got)L$}{~5 znLBLb&;EUAG{gmZpC7FpDrYiEj!Z{CSX0%_dVSF9WBtPVoY?v61a=tjZ~bQNq!WWO zF=r&?8=gQz+esEvIz0JHTKJowr{zL=N-z(y@BCgToR~XNgF`!8f?sPQv(w^?smow9 zu)-mpf)Q9j7IW{0ggtn1*0JELNfCw86W_VvrW^F?(2XYX0F)pA1{3VO?(U9^Tp4M) z-!0CNRGa9FBU~7|=lHNuH=Sb0u5ol$xbLOk%Ti>q%wf{1(%$lp=Gop_2OX2>+wqVp zTa&XxMM9p}CJ$_~044>%iXG$1v~Qsi#)!5{$nL6+fte|{>Tmb@x}>bu_B^`8EEy#q zd;nsi$9eU%N#4qX82aXfwWPbA-xn z6tSxf$mgq~U9gz_WZaQonS18HED)Ui`%fVBHmA3ly&`Gkxsf`?+$!L9=4-8l;MhGv zrgFMvHlZ?W?S`g!Z=eKavAmyJOdL;HFjeMyN~(gpWL#65GeR_! 
zg$qGvwWgEUfUxO95wfYPuchyTz1jp2=|DtMfuMJe#p&=>hu+6B~_LGlbX)v2O1Vh#sEhm2iZ6 zVR^NplbOu*%q0LLf?2}!_8Y+Fd^w-mkq$g z$JCtd3}R}ul0ogaW+eP^WVS3r!X;_MRr3zwq5PE2TH#X9YzJGwxc@5si+j=yTo-7y z46=k=to)7%P63WzbsrZVt)#thmi zp|gT(1HoAMK7^1tFrtx;Zm1JT6Z}C2(ESx8q4g8(O23VgNkdxrT#_Qvf2#8Gx)lE+ z;se}%i(~4c{}Q}#!08Xsea&x@QW>%jfh*C)mUPHvw+7e3YC3xLvcc9)8NG7a#Pl)d zaIsRM%j1DDU(GV?P5!>!BIoz2D^8!IF5|n$eU9^DOGGR^EM6n0>rF>vFJ|e1d4K^% zQV3KchiiffahJ#pkD`M@J?uv&2dA?o#>#0;4cQd!w%_PwwAXb{wwoO+sFx0|k0O!h z1U8tm{Pu@)pysapR&XaP=e!t0GX;gw`!~N5LUK6!k9@>YQTsGGY`bE@KRyKa_-(9> zYjIJYCGRKhDuh$}r|6YALqkEaFUE5eA=OF}i! z(>FRItG-~Oqg!F5qjQ*{Eu)%{nkAE;6I)TDYieRzm8WB%t5dR@lBS-Jl9^JxlcpLM zo0g^mbZ|flcuZtel%zmrvSN^sfU0_Wo}8R~MsSRLeu!MSW^zWFV3cG)3Q&ktjGRzV zMA~|IltB7Ia8yL@vsndBoOy2tEha7(hU$4IFBf*_6QzFu&Q=}h8yXuc_$~~*9bNK< z?)lA_bvp`Wc#KWE^)A>|f5N5wGGU%q{F;8)dfRke5vA>!xJRb_G4GALZk=LsVW&ZP zeCNtabZhf)YiqT)KHOmm4FwN#@8v}T1V}S80sBu+@;?ao|K9{z21e%p`3Bl@{T73C zf4V;@ovzRZWq^%C;mH2ckc==L1=r{6lc(y|M(qE-e-7K6>4QT)J>?XUYA~mcoB=#7 zYL3(Q@;6Jg~mQ*L#p!I{0fRBmoMEhJcJZ7t1pl}VP@&D<>~#*H;43iJ3gvbd_$vQPrt z%kqI%94L=YS51{w8t$xHf5)EF*LSR_BZnY*m0WhOmcq-co0l5J(~Raz((0Do(nwzD zS5}B-u{k||DnzpMI_Vm9Zi?SRib`RKRV$s@iHheWYXrn=1i=FN<-FjbIfI7 zN0AWU3#3X``mD$bTV{u?Ie%`?EE!WoYiFi2_T z`z8C03Au&RjH2r7=D*teXq?N#f9YH@%Vz`?<*9j6se0Rk3@)^~2!m=gZ-v!GQ`}MSZf> zH7oC<`_Mojbf*}L#RZJGTw>**{S}BP|I9#<*!vj^kIX__F30)a<&4&M=9?jnga^TP zWpK>7fNzE^EIXh9H2x&ak3?xh?m&Kg(?PM0>NC9c{|>~Rg}Vud_>(;+0mGc2P8B={ zy26JEVIdGnMJ&siHj%m>D(eX-Ws$!1JE%PrPqpt?StmsM=A`Qeu>Mv@srM%(S5Xy5 zFX#1oRR!5fN)2TCK~Dek6=8h{-4!eo96iY9%uolF3MRA)aUxu2zta5M@}!$Qz}j~} zzYk-?5#d_9_K1l;U4M)pm0XWfjA%8Uhqu`sUEFR?2y%~{D%^nMN_1a z+v`(gD2Qj7tr6?rH3p?20t{7NaJ20bDR3JZ_0ndg2)RB5s$)a1q}TySR6tPqm4TtO zpjN%fM!beNYLvc$5PoAW~$Li373-6_H2@?kK49+C{;S2C@cE z`lP6m^n?bIY#Mg?Qq4FOwWFrv6Sb=iw?!-K1f5x%)P z&KK&!ppkC@?3oi5&fqh8Z*EgbK9fKb__9%M*p&1!UPI};NiNiL?JKfBSNB! 
zlgv*)bIapR=3jTib~B$p?kF|NhC?DrwWDRisFr@^wpLZEewZ0tGoJUm2}XD(c$BVj z%0?XF?{!caebY;=EG_n*d#|^%g&&kjPWebSbn*SMI@iOrKySsVC`W+hbg+|ZSaMFD zeV5e15-{j|y#<(Df5DSf9EN1SCMFc@*tq7Z`Rf&w1j<*Tb8Ki7>{6dV4L{FjYYj)8 z%j;lh8|hZ)Oxg|5R&kjIJjAw}x#?}42OZR)YMV)5s=0cO8V-^E49a$s-u#KWX$RwF zBns6sY19|&b?ov0`SE+a!`8Y$d(nG~_5AFIP0h-A^G>frSn0Q{EAg-v76^F;)*EW= zC;jsw6UIr=K`@-ykA8)3R_}OjnTQXY=2fiP#Nn8h&XG4|uS#t1{vJ{IMGXOB(=Wd) z*~4X=(CH|L*w7w&AEXhi{nI^-MkX^sLe9)98FAE_t3@??ui3GFK zQO%6(T*I1oM9$4~ix@@lGhRu+bFm&gkyUbmWaHXn5XK8U8j~H25AW2Kr^1KJU6{xHWMlXrwub2*s4YACZi`=Ebrdm%c& zX{JZ7_3J06gA_6K zA5dpdd6)mx5E=f9hRDFi!1O;GVzn})U7|6Dj2O9Zy^?(63i2;+O_(q`*+oP_m_oG# zHMOesU0zjU*t@oC~g+~F#7zb!QEt~p-))uJzlg3>4`$}%XYTn^JHhqMWIr9VOu7O zV5uQNIuHB%`0)6+Y|-7D0>An>3qEDJ8NTIt<1CHlkOu+#yZVYGR!d%0nct)YEu4*^ zBUI2*DAbCFjy$?1nK3bUuQb^at!{%56$&lQo)Yy5q=c?oTNp77qK#p-p=v$o@|A%K zMJ%OJHThnh`r;a@>?#sYC@n$*XqA%=y=Gr=-3ZE%@?0Au2;-0wI;ayeb?^(rHL?c9 zwep5#)2ex;xp|Y51{if>OG-;li+@Xeiu7c@bpgHU+*uQv1}t~*4tJs3Na!E_Ka3|| zQZrL#RHiG5jR2CvJmHAL-AKA z!@EITG^GfSTVeVq4xb6uTod<=z zotesinQK71b*gphgQ9@R^~jG8MjN=8 z6vy2j&WVpEKu2>0?0Z{sq6)N8AuWh?PWh{d2e1TU@j?o zb)r*V3PWo+2%i>L=aMa>&JN`R-leNnL48`WiiQr>06I#In~En7UOzeVAqV#4ktQ%D z`6I(OEP?m0{r%^+^jLDh+kt#y+Jw5BGluF;4Tl}zmfQ0J*$^$~diZzB44A`=C|Ugh zEBFqu9MV7X(H|!@g1utmx}9y0L41Hjb{s!GWx?jMMMzx`bstSr3FJ+Je6E2AH;}W( znOXya4IDmrg7D(ch#KtHUs%a|P7+?@eTi^$d@ZJl4Uj~DI6H(T5ig;cyj;|E#Cag} zpvR;Y9OulhL0DJF?Ze3(^-Z-{um=-?Ed@ePVSZTY@r}ak zdKRU&aDE;*eSnGttXygL1_7a}iQ%5P$+z8yYwEvAph=Is-1$?9S6N1stKqmXjh?2v zApg&5P>O!G^_)KSFS>5@RK>x#=E6>u4Pj*b3lpV8W+yLC!)9lg=Z%tj7>0&-dEuh| z{aRmi4ZgwoMVFpa+9eV3YHg>$A4Xhr%$=bfQ>qO!gNj7->_O~Trffr<@%KFA!NxFD9{<|4*Ws1Df^HEjTNzNn8|DLRn_WzfCx?nc2wM z)Vjd|=;GNq1dIyMyEaNI7PAnybZq-R^2ov_{<1B5tJ9_o0W&$oL(afF5!uj<)XEm} z`u^If_mOvZ&eB=IJ%%Xy0j!5)P26wrh@PEH+j#p=##zJS2;-k5yR3~a6@cipGQsr_ zSH8?igT^%t4=SF>QTh!ms~6T!YQWLB`eIA62Fu%@Mn7zuiIgvr;v-a)?*N55OWG#W z{n@?`E1CF{Kq+7Rqrf@=8l>cg%i1az^^>I=J>PdsVZ)dtXLQYR~8L;`NAD(f z4sUDwb2xo}qf)wr_>u|8Z;@S0MBHbTV!R|wo(qdIF)O;p*d$5hcq&yD9mg*7I(AAL z;qrAH+ide#e`HxKif=gN!ru0X`V3&wHM4CyVcA?bhqs4C0R$n>056*DIM&0Q#whk4 zjS?@lvB*LyCNO&1sz?GGIyS+1uF~6>?Yebr<_t`arv^_+G(3F}JOj42%GW&AJ!WM5 z*3GOL6z2vda4~u>O8j`AnSsgYR7?~d_n-@d4QvWZ7v~XP{sv+PfYo+s74P{wZ5RS~ z=WKtnzBQEZZEIww5^{Hp=ztRA!L8f{4Le)7#kf?c)SVt2F0M@ic*#z{j**$QZ_R$Ucfca?Y(s(``Bmj;!9VR#h6ZgYnX zE`wqdt0c?otzbx2Dd4{0&%wWR4qkIcUGD{cNB=NDDMJI9 zJrO!5GQbk0T75z%>&sJ}bh9<8-Q~;X&#ue`=v+5#QWSN2DDj&AVw|u3H4y?nV_L*0 z`Lh+^ZkffP1*8zHvQnmkVg|A#e`dKl$tou^%N)6!KoQd=LPZH&#kT34Zu}i>nQ~?j^-zVmp{ZKF|xh9bib#KfbZIlH1Qy59=lCj}p=;quuxYDM*zl@Tj5kiSZuwvk3StrbM4adud^!TB|NH#1d%3Q#y1 z&PdbIQr6LVQ-%DcWir!lj@){}%lm*p0@bYYSAXn|tvi3?ZB9!|e zh2vLd+xBB?V1N+j+tJ|2Di8w1x)j62l5n1dDohoF?h4JTn-)o#a0;IqRsmXY`|^g_Ey8sAF&?RsU(aX8JFtYZmtZb<9Ja zu@ytK#~QugwRY^Zx?m9HcXt)YFF1cM4N(!O^2(E$ieotIXRozxaldrofFUV2pUM5` zb21}eoX#j`^G?7Q(4&)#qpDMYBFYtm3@GniyyU+bhu)Fx$)yEDg7VW(`w%(bez*iB z_Yf?n*+RWmSl$2OR}69k3p1ss*C9c0xKAR$0jP5(xDt_C%gv}ejg50R@CO?F2h|~$ zHZ7iaD#>YFeMCuA&r66JR}ztlp-G-7I>)-mh(4wkDal}<3rKIKACOlo*UXw9fO=M1 zKyJiab6i}^mNCUJbS$#(BBy0fcF+J}RG#2a>q)6DdT6)sr!cXiLiS;&j>>6Vw**Nk z{i8)ZsS4~juD}=a2Y@*Yl$0MUEDrDw86V`?sM4C>8j5$Ac)Z}*Cf3Kz@X!={yO~h4 z29m(l8-zK(h8#0nUf^rL4+3^_10cs2;oPdMP^2UsSqzA2 z7uv-Sz9=R*#xnTAWxN$*~@I#CZmWV7>L=g(JU8 zGQ=mz*j>JJQ*$f&5X!6EV~W<55U56ELv=}gQGR_pF>D7O4KleaD=F|ls8;@Q>>X=N z0fI^qn|R@*Oz&e!^C~X~9F`Y@dnVfFgSi{Z!RBqvNdnzw+;1OIvzT-#W@a9yLz@_7 z>`K;QZafw-NSuYK50d6C|5guHw@`0WlT;tR*VcZc!&YGg zv~Q$562Y&P4WeDMopQpvOKEBE0PGtEcEVA@m%_U7q}}!0jP}k*=XPB)wkvBVZl}i{0Q&V)#K&n_N9$;15?U+;TDuF^om91s zXMPL;0WwGJM;aRHE3S|@1@C{VhX0`j{Qo!h(f_Y%_uWm!kUNRGDPtjrTdK0|H7HZd!7wH60_=!2=^^=hf 
z^$m=T$!jc|>*-aQ>ggS2>B^}mrRT^OzSFG*A(g*>FJg4rDbX)rDvs;?q#Ya z#Ajrx0UaHYft-?>mZT_An5`KlB_XMuU#6y}UJ{?6TpptqZ<=3_C7Grek%1JWl%gaT zmyop`pC*&N5ucP$_-)sKlVm?SLQ6<0L?U@TDk?-i_{A82bj0YR10xa=My|yE?a>9= zoqVD5F?}=d|7tw(Oh3CC-i}O7a#9`tb?;Zv-qO};D)Wrt(!D>(eUV)AnUpn6Jv7KE z`N$1*M1WO=g?+`T;okB*dAtLlzkLlQ0O*M}p88KO^52LJ^z3Z^>l~V_pL>0PRn;r5(pgQ->$% z4?`AE5UnC-Y1S-r$aqP}T)t97o|94X;$_(Lk~KphB>gEno9<}ZlWu$S{e5%$s-1B6 zN-CC+)!@8b($)skXn4SCqjlw3UxuVP)E{~YG?a^Q09}N3K-$1?&;AP+fG1cY8Sh57 zjM|%Z5=kq8PxpX6kV|xK<)|(+G?IY9+I*GTG=SzN@tUa_6ObYd@>c zCR;L%sLg+fNPu68upm5Hh^juqEmBRgiVz{JHT-!9%z*IzZ+-Zhloc@=ZvfF(7;CtN zK^z(}<2ZcpyN|Jt?gUkX=a1vJCNL220&q_$iz4Zf<;3uX97L1%QD@rrk(m!V}j_M%>(YEJyGj;TpPzJCL z`Uj5pKpsXHFn2AB#xy7ul8U;ei*WM*AG25;mGaV;(=-wX0;^%O2oGj{6`le@m z6RvV2)RIRu7a>hjxTHL#A}_! zQl=#I{poMa-yghQth_XqCJl|Xzh&k-in5e^e2u=p&3rje1z6!$e>nQKJT58zuD>t{ znY1@>S3z@NVtLCMwz6_uTu;vdJmJO`*&HRkX>;dbIOE@F-jHP7F|D**V>MxecRZ%f zZB_;X#e^0>SrT`D5dgF6NsUXg7AN(>Ite`Z2V!Lj`QQj`nhTfv%|IU+(<S!F?&Zy=pQ%4J=FFSSq<&QSiTQ8c3jYn?w7BPL+|{TwDJ=V;EF0=6 z;#TJ~T(4?(15@!rj`(Uzq&vI0DR7Q+4eFL)c9zI$z{`~^Nhv%{dk&ve>8e+0do<;5 zV_;}9B(-e4%1sj~xxC;H%Pm;=3MI;3x7#t7su zw8!MaY26$=aN`QBHzVf3S)0U1fIZ~9xc@+FUM2Kj6$_XW%9S_#!Yw8aogtWT9T&pM zHrN1spEX^MvszD%UoEv08uUB3WL{&;Fw@^^LNa zmcyyIVD)ikC4_I`gd}ZhPx05(;DFZK`E#pXGIgd}9Sy#hAw*HH+bM{IPm)LQTy2)i z-;nY--lWe9$&9G;7be5Mgz63R<+O#rfn8b@cOoRK6ym$`SX{urApG6vXtlSKt-(Wn z=%Q_{Y)O-JEj+>$^dS$mS)#5r96t(riwkz0LuEC6t4F-Sac+u^3x-le5PsPE?`S9m zrFmAD_Rktdt^CZ5y`@#E*Ee>fjpEZoS)1&y)9Zu#t*NN$kSrQ#u)F-!DnnA>0h*9b zuTQLVkrOg2;P(8=!nQug7YFL!TdRdj^H{Gb-LxVorbFc}H@Ejg*MWYS`a7JcXM4UV%y&F${j z^-j5Xb5?c+&ODqmAG^*GlM(cNPIIH#dx)bh2hwa zbH`;=w%CCn*24qs{jK9~4Kf2){`_Vw-IT19VPI^~(#_Gv-#u&kH{Ym2B@6YY;MkxW z8~WFMJNP)~%9Z0s6t4OTY)o{y&d&ADCB263w7a0*Co&(*dOH-zUwbT$#< zgC9SKi4}eHmKQadWR_weSlkTi%;(67sur(~EDu;15G>DtHzJEHOHS4RTjj?Z2G*}5 z2qhLaS(mLARyM7s?1tMKl~&=~Z3T1mAQZFE3qTEjN6>n0yjZa)RdKtFe8dW@=Va&R zIe}M;{wV`=;}ix*y~a~!RMkz6Jam`JXv#-K>hN{jlrv(gmTqF|@h~^cZiK6+;_9;M zvdX{{+;82BG3R8ZWM}bTLSi2{EqSX}e)KgLYHEgbMi+EN>*kuhha9U?z_Ti?M3Lo2 z!?aGo!PSjlKFTU*11T5{dC~nYigvnjPIU;V2QX=U()Ua!et(12#$UjEOOxUMFcL8Q zHwq~eE93tX1hW5cRxdq546Rx8IaUp$c!?sVH8!qy6m_+*C31a%k}!M`8Xn(NA{U-N zQBY7^-$6165D0$P0AUd8n~zHMY)ectY-RT|{mhcfhRo@V>#P0W*UV1AQ<+ci$KCH! 
zukrQItL7Ek?-kqXX#$g35;ZCmU(*Rp$V?XQR%L?O1@jilW6k7Rt9R=JHqnGK=>VDkt=XC>R zH~n-uVwFi^445D4B(orQi+BOb8F-6ifp7xNl1a$S(Obu4sA{=ll>=vXoqkbbN@fk| z7L9@hsQTd}R@R@hLKfX3n9XIey9T-RNY@jxM(2IApXL%|aU_gKVjxMO_B?H)ak)aTE*2uRk=G5e|3Ipjnn#LQ$MCB{@D`_k6 z&&u_cbyb1|LZ0KQ{jFH32rsdn48z>V!L6(I(G%twUvr=7fKn7G5nZG$ ztNZrYRnhrU^m(|C>|7KX{A++n@g8YMd8SW*Z>(FynTT^_hLhEi0I9J0HA`)vvdZ#m zS7xS_jy}nka}J~kg6ya4@o&@j(S0{rGYU4Gjuo#9{OQ8E#;jCr)L{kLd%jQ5yp=d^XPY!LmdjwUQ38U_-L62?rL*#s&YDwy)V0csJud$$wF09* zsXGDHscKMGPZgX?Rk)RE_2P`(@uSTQfKxdke}iZ0M7!KAVc8d~G=Z63qW_f;uI1@+ zK7f8BI@$k-O?`Cj-zk`~66OyVNncQxknUn#_kj7 zkAHJ|1o<9*4mSHE+#BY$#c2lrk;AIIesjixyZlr%Nb>OB#M<&8nK$&$^Rlvr8R6M9 zNW24t_sGN{z9GJjdPW=+FDeVY_gzqb4W6lm^t$o9AAe})-}S!ozI%VDXX&!=@RyO< zy{7h>EUq;4!4+!n+1mV%EHh#bF84k4cR;T2E&p;?G!|2=OxW5$dQ7fPr56JbJG=T+ zBfJ^BDHJ#6J-(^OOe+^tlt3+gK13E0V}!76!CJ63nG0#e@yICewH@a(um1~EH_PW|K(uo0-yJr-)$TOz6GTj|F?{vlh zwA)LRN~gzHOZ#oc&g5m!16u$`L4{$Eg$d8xnKoxVhJ2eMSiPeNO%6OKWW9aeA3`(G zNL)O|k~HMfpy)}MKN>x{V-`V35;n6q^U#iQg9{H&|B-E#POP2cSJlrM{h$(vrGnV_lWA1T{ix<# zK|$>(p1zMpflCGZcP8j^r|>;)La7j6NKmmHF zV8sI4$4UJ{T4=cfckXK@d-nU<`~BtGloF>^&&uE`fl%i!ckAb!pDcAgNYYV!8)bPM zL;zBOU`1-cV*Wb80<15o>7byfbZ%Qv_YY`DOn5rw$MgLjokz_JovcnNPEAfl&g#l| z^znj#YleORXIEbjl9~{FnFZ$htn1$BhE2dnVoHt{>fR~cmMN?CQC`t((P41h!VuD| zM4Ogbmoz;TvnVXXqu(W?gXlF^Hdn$xBDJb7R@=(y{Rr)I7133$;O;of+o8uqWqod> zTrvg3AK6TtFU9PWIJ#YJe-f>ZTm3(sokMgc;Irmqt7CR--kN7Bce?w=$mk|YY#bBfs~=dB{Mn(XaXk5a|TVx+?Dd8W!x?u z-~`z{dt5y>?;G(|V`D!|j_c6nfOhs31>tyE=!0e7oqk$l=?XqtykQnC-ob{F*2c!k zy-KUzm}afHN>$7G$a{ws$WQd>Mu;~`QBS$B&()8Xr)!*j5<+|XslX z`NcH{0CH_zh3x4HQ~M#JU7$^q1VrK_XO~AORFe`~(c&B%t-C)Uioob1KXy@|8A?{` z!}(qeUI8zIQb=L_S3R7WuU=y3bo4@AMNf=i@abykojK_x+67jJ(zNVPmHUf>qk!Hl z%2W*V{a?BtcCR9*E_*+zu?}2Qbkf=Ms`9&ffy!l0YD1+j~9f(io>9)q#zjEI1R(x=*W?is9lx6TEbu8r7U?8S_^PJAMH`q$_KHQEi8Pt5tUp? 
z!jRuK z!$*IKJe&)ZG#|}x-h+B!GtrQ@s%+MPdDZ7d=d|yRKt@d47#Ua@i(hSB=8sB+a>N2} zq`=k@=ebe8_Ly;zJ3J4%vSc_Dn!s#etQwjSu{$n2RMx|5Mp1LV-`i`*^F9&0u#02K z6_PUTds@8f{I<%}ziFxVoVSS+!jzd9$zExMB;Wcx+r+$c>j^-6mtw+ff8<~}Rrn=J ze>(VWZqGx~qH$tI6zt*qbcfu9Z7`MUuCTJ$-C&n5dgJ@z3 ziY+DbmWqDG9(9`zgF1249_HG*D#{?|!^4b$DRLzZ=de0{#njaeF=N9_9*{VGx=7H^ zARRpE7bUg=rzPT1tdyx%+Lu+E(#HI!9fW`oUz8dBiFQkH3L9cL;-g2}7tGkW+O+f$ z?GofZLFv>#yc(iKZ=$r?r{D4|6`#Ds*JD6$w)%Y_7I8 z1(2A_szz4Q@DKtk-dT6-))6Df4zHJ%cr#_p{ev?=reSp++bR8Ac?v-?_TpcfEN)(P z$tTIQm$IYM*=R=>4Tgw(uE@oh^)~0VExdE9M#QWKc=ATaB ztJI%fOaSN4Dc064xvu#s#{v&3_z9yz1@5RJI+NgW_;IVQH~q0(rWW4_)jYrx8e=uO zI!5~^yG{-Ws_{75EMg>C3kjikOD7)I`Gms#{dIA7!e>=aIwa0a1&5EYwB-m`#Z$L*@{V6K4vk;JS{?Wk3sIqBLXH6OrF9V;zJyIh& zM>xWKW2m4oBJd)fTE^D{K}1nX((%0Qg3NMt2wwTMK3Q?d^klYuqBiMc^h4k@QtP7` z#~sS^YxzUONYBb`%Dt3Q{-sdMxBihdx&mbs;1>ZHNbXjBu9S2H{e3pFuSyb(9?yDmQM&!KJRqkNU~`4^HPKhQ?sNJjbh z_vdi4Slx|&K!Y4?n`8R%JK4d{5YbR*c&A^3C)Qs<5phscNYHP+pZ)&1v3GTzeu3ij z2AJebq4s<}Vwy;cFHm2*cxU{Woa-xd(PMu%cN$+KGi@pbkRBlZP}GA)5Mk6R>KWzu z5k~yA5$ynK$Yb)Y@Ca1jV})S;{P4WRv!!X3igJgG6}jr?V4+pFLaEvl#5)MI^=&4idQ(ymOnZIC~vzrlJ~ZnZX3Zb*uah&n!)2LL=WdJuKZoJ5)` zH+GqWcgFAPZQ0UqM`#&zM_&xlip&!+iGpq z#vE91itO$HBx+gu1fnt2#T}dTkA+!C8{qEIMC}}0(PYd}wn589#eneYZ3z>gcSc|8 zE22lNEsgcJ8sfMAU3U@y>BN=0>j^t1F{!RqOXgfk2ilzMSpM3zuJ9+J`YW%Au_BFN zBFQ+_cbkkUPk7`90v^y+@CV`71q~hL3W3N@=FLrnl3!Z7^1CzG;^7*{awTVrpt*hP zICRk_mH-SA)n6KCrmi9|d>%P^PS4j3EwrOwlilcCbc9LN%931!ksBr8BAvUhEL1_> zDynp>YXOQENAPyV$(96+sfH8UH-#bBC%n`02h>k^nkh>x?->B}MtPNFy$7U)F)p{I z1Frj({3#<`O{-I$P6ktiE@26(ho6xFj3e;{dNhpkk}vv>>E0pqo~bx^Nn@-zeu|No z@-&l{rVKVEaNgYoW`BlO0lh>_Og8C1OmXrwEq=2 zedu64s3})+OTM_}H+?6XfW~rEn;fy&rC4B;%~HcxfR%z1ROj<<%Q;S|vmqx?)3z-L=Wm*#e<&v)$Lam78 z*)*ZEUS!L(O|yVb_4rg43VGGL5aG5p1U!UKQ#QHAaujE-A>2YS_MUE*YD2W4NJZ3} z+lQJFr6{>alMFk=%v%X9JSp~goN#3+8KN@iU)-X2H|3CG#{(nratZeN6(bFncgOyiSTlAf}^!-p5p63yA_Ku4%Z-E9d zl9NzLK?`-s-iR}vQ;LwY*URc$Hs~-im~f*QYvbhx{Q8n zw&AEdM}@4#=ntgwkcOEvVp?Ah~=nQ87@tnoSQ?jj2dB zoUL}35-s_T&%G^?mzA2wa8oTVXgxT{S0w7Y;LFJXJLIN{d5+2FSw=H&3&B=kG%B(R zuty_sMY)vv{{EO3k1P~bV3;?zL^u$Y$tzKEqGGg!ZAi+}kgd#9RV*)En8$VWr9TjQ zQ9LW%QXu@Hbw=%H3Eq&fDlCKlduAS`EQ?#pMUAxqk+$Y-k#D_oz@SAa72FZQsh!h#ZF)|&N>-xpJPapmBl}QC zq_DbxS8XP(+}TUcZE9;ojfCNCCSr_nasxi3g1M6?AP%&Hofd=R(B?}>;V-A&^5YW+42|kugC>$!dX$4Es1^89-P>& z@XxqFT!kyCgCPm_Qv{K**i-=<)vw?{=cXW zYjLsK3Xs^kVQA2;&*N;~Pr14buVljk&Z;`u+5<~M4j8aQ)C**R5qvhHHl`8SF%0(R z7w-4$DTXT$Jh;_^R(iWl3Fsrbc{ApKxIPEICpDc7Ytu^sWsf1wv%b|pO2 zP`#wM`Pu@I(jKJsjROLSOu&EG(tw$xE+MapM5|OzA5TwivbvoZ`JO_%w8q^6q2G1w zxY0d5HLMv!xQ9R`@z-Pje|3dVxIAVFef@3AIBWFkOky(`s=9OX9`RwCJ_e?O2N7t9 zNB$o`k27n~l$RH(UX9)JWC>%~*Ug_J1`t^ckhKPs<>Ol}X<|?ugr=!*B)xA6RmP(k zFr((;nBKbSHpn^0chhlD7h0O+qC`6jTFF7z@39HWT%i!8y`>n5&_+fL;`JJ+4M1WjEp0=nvXRXK_beh|C zr#|+3bGo0n_^+joKDMF6eeok;*en6wjyIUzit-fm@fuYxe^N-nj%oYjUHI>GeMft2x3wtC(Ae*a|~; zV=(!2*+tAVOJ4FeaS+*!&6u)2Jq+3G%I-rn;6GM3jYX7TM_aN{)PwZ-=IwZmmE(m>}9UGp4L>Sx*^jw_A z>V4leh0tQ^(+PwB1OzpKf=2IkGM6~H<*XY*5r|6Rn zXEf%uv(F37hO-ghWgiAxbQ4=s0G({-4E`o64BTO$;m+#W1{E*ws#J9acQuTRMDXd? 
zFTyQd(<_Lf%7R3ZP308)7&dr>r^9Fz&^Ihi93U;mV$69t9~{8An$qr=wK^Hn%eg|n z$cXstjV9>=hWCUYS^}d0-f-t&*%Z87%=S1nPAOZ@ui9hC$MpHAqmH-v6l?KD(+44d z5*3g&b)`9XD<12f9IKl&jpdVSGzp_vD-T_U1%AqLM6R9Bwl8D-k&Gl?Ma66v14 zo0=iu5Av+ws?k^^c2v~3yCxg3mh2nhLe4ImJ1oYhj?V|ux$o+Zhme+E-Xf0a*mFlEFxDE7?@W-Jrm4IN;@&aKUBHmDB|%|e5`b)9V#>1!4c zO)67)h&*QA{hxwkl=eToYlUAB+~o9KsU^lSUGq&#D$>(xnB2kkg7>QPUUIYB0vGl2 zWlZ<2Y&*yViBq|Fd;UyGkz|R}FJN4)Aioh4O2S9t0kxdzyZM;6-bMXYb(9Q+bSyk< z(+%vTD?1U?ja*cG3QJ5n@Irm1_4ubc#{EFNLmva9 zkiO+!Bo{C{`{(rXVbS`WQQVm#SqNOelcD5&rmq1oQ)P(|4sVfFn!e@XIZWo^lFA)c zH?2)&QAgb!IA{m2r!zFP{Kj=>#9@Tr1kPT`MeCh!?u08O0SpVMbOUrx&=^c#&Um2K9=0WMU8KSt>a32WZc0E@&s<+%;h0i7_wd@_; zPJeXO_S4wsCawfN!wV(gJLWv%kJVVO3bU+n@n=hiDxJ)k4W|#molknatb4y7j~3(Q z>OJOg1xX(VqQwI8|5giXv2mggIR@EV8+-;Gx)z~R`N_Uz^nk8J!LDa!Skn_`NPGXC z!Mz-42n*l63(TB-0zOt%T9hVsgcY;Pe!{vj*6`-j;N8{iyYP9+ znCD86ct&!sjf6O#1e&lD%q*o{6)?OBOi?S=?aEIu@ zM`~US+-Fpvs(i8i!B{^SLD?8h71A(Jm|K~KmlsN8?)sb(8|KAaK~<~bMA2fZui4TbuceAdHh~`u6Z-HuH6l zj5lviGydE>W^Ni1pvi?;JW&MM#>ALh% zrlt2~yxxl0?b(%6bJ~T8zYMC(zvIKHWT8NfqE$CKB|HlDxd$!0Lk1QXo5R{sh(e<; zuXPzYV+fmt&{qf7I#|>F4=$2`1@tfd*5A^};acG*0XTv9_?E2RDJzk2V~%MRgGKzu z$P|8wDnAg2|leilx zp_u$=0Fy_V2_Yi=N|Wy+9r1I;B3A1~ROZiPbyi+&AK$;w&-` zyUWkOUew#h-+2`3NR-s{VSDr(c7m>~5%@6!TW#32cWywnZTyE`<6_Z#>6i(I3fB?a zxD7FMkTLOa zPNUd?PU^EJe>}YrJao|ilJVbQsh>}ry!a(HtKbGd$%&Q{(oJTj=uQDpLJcumF>)5-^AGNBmG*wRbZ z|1wG!X$u2^=e9!(!r$Qy)20AJcibi<>NoHf_`^!oVLJKSsq|zR>>@3diBv#npeUFR zo1_?Z94`rb-yjo#DN&ldjY+o;ov}1n`+6>vRRq#FZw==7Q2sZ8eF~KjA{hn8`{C*82@o*<4!hF=fZL+eRCq?%@jvu)#sdgJ(KARg z<*Q?yVAr8cQD0o;ZVkD{mHAb(;`R9)%rTtPF8%``Ef37Y!Q)b3Ga3$ zAD&s(hNB+Zm$|;xr-o24x$!CsTr46^^3P;z{?{VvC+Z@X0DN1>Tko;a zZ}?=maj|&i8o0^*jUIgkJE|2uSQ|L!Va0A-!D7JxY#xw%{#Y$^{uN$x`e)I0k6))* z9|!-T-P6&Y2m5<-)0prkt7AR6NXkQZX5Gol{-I+tS4;D-ZCz%awQIq|wWp&K2Q6n+ zoEc6!FHf4bSzW`!$%2P0KAsd(eNV#Xo15OzyacImVvo5eZ>NR5xO_C2O;W$@zD0FK z(&zilSyx@%ae$R|Nr`}Yi`wf1cz5;1UZ_=&2@Q*+NG>PtB|r(ylW$eCq51bh1)8Ik z4;IhewTh=gY9s(V^E%HZ-?*l^jrUZ#u__0B;eBT~1}Sa&05n)=84UmXfv!rpw) zJ|{4z4eUe?X%di#zvPEYS#U_b{ctO2>Xprow~}#TXVt;Zqj*Ww!ZekoG$jmknLe#8 z>{GXLe~Agt@7Or}OS1`@M7xH3oI2*uBVuXv+KL214%i@6!-W{VbfWenIzz^6`3x&U zbD#OQ<=q>zGc#*S)(~NChYXW}UYGF)HQ6PaAl_HTa2BjEn#6Y10CCUyqTtu<24MBmJvig)mc#2rRa8bSy^e6nm< z+scb^JKu-H?;vR=aBfX2=Iea@HiyiZqq()CbE3EL61Tf6Hz!f}6aTY-{jR;^D&1*Z zZ(p~R4L<%av)hmN-B;+hGcG8Ls^GV?fvBtzW6!#X%EOn#E8!_X6_M&LIdYf7=USAu zZhamb?`l1BPF!_K>M&-~Xsg$!S$BrM0=GT#?DZ}sypGAB$A zxMY(JI7cvE*!=IwHK!%A=F#4?Qgp&S(}XHMCvT)U9=ZzCJ(G-@NnQ4M_>;&6)|POlK)r^-k+nbso<|_9N2}U_*4Zt+!y+0SfI)O-IbgBx{C(fiPI)qH`8>AG5Z= z*(?dMf0R#eUTP@$ekS6W@r_0e%em##Xr!F5gDO4OU8OBZG>qBVEhAi*RP& zxj!H$v)GCQE2GX{n0#MJ|vfm%lv?&{d_h1S`XPYfp_o4Z^-eBU=RZ?huQn z$sAQ`yj^8YeOCU7Mg5%kIiNb3QHwQnP44}Yc=q*7xGZr5h5qd9ax+?v3K078)wi#j z1y3CcPz6_5Id{R{9^ZcHuHAn2|CvIIYe(+B{MeqA^b{? z)}H~|F#y zYB3RY*ewN%Q7eFfWxVj$H9X+Pwkbf`Y@twPM=|j!{vkLx%@j#?xMb+8VAsg|ms0CC z<*W~A^s4=fMV2t<-i0;7wITw`Sj$#ld0yH?t24pON&@!uAHB{ z-Vz*@JXh6yRQ%%w8Opp`TiL%|%hE6lt2Q%pzAvPS$$!n*H*F4Lo!&oxm^e@we75(? 
z(INaT#JApxtGBz1mtHl;WhwbDYack zyUl6{={;0e<{zo$ zQTD>!Z~A?{R|j1LYqL205M09(ES!aL1+zyS@>9w?_6{i6qsImxJ}w7gz%)Rx!B)$d zgO}{_fR-x8)0dnJZ~y4ycw1xSa79Lf(CZ(47?cc$qgA%5D3TV!n{Y>-ghN~qy6CcdczAowIT`bS%n z&dqmHvXtS`mx&ItE$(m>?lETip&r9(&nrPjNkPK9_JW~$s;`6yyUh-3Ugd5xcn1` z1#{8TGGqa#mL|@Er9*_+OeCf7#7HTtNE4?vK6}BvnD?VQY2!enh5V+W?&Wb~bM3fv zAX^WZr#7EwJACoMw5UF31Ee+tWp{ZOCbaKyLq^NaM6c0=ya zh*Y-^jIO^NJLzMSuj`>{!wzFdhpsPEJ<6>{!`1oxNUNww_Dc1y4yyE2s1>&Vg zla$ieq~7FxD z_P`KlFezs!+B16Y9$9BUP8N{wXGS(?dKDkl|Cu;W&IXRo?v5s=aLnJu5f=cCoLods H6z+ckKQf12 literal 0 HcmV?d00001 diff --git a/doc/src.high-level/high-level-machi.tex b/doc/src.high-level/high-level-machi.tex index 6f2a37f..c59c9ff 100644 --- a/doc/src.high-level/high-level-machi.tex +++ b/doc/src.high-level/high-level-machi.tex @@ -48,8 +48,7 @@ consistency have been moved to a separate document, please see \section{Abstract} \label{sec:abstract} -Our goal -is creation of a robust \& reliable, distributed, highly +Our goal is a robust \& reliable, distributed, highly available\footnote{Capable of operating in ``AP mode'' or ``CP mode'' relative to the CAP Theorem, see Section~\ref{sub:wedge}.} From ed6c54c0d525fea455186bb86f26272a6682cba1 Mon Sep 17 00:00:00 2001 From: Scott Lystig Fritchie Date: Mon, 20 Apr 2015 15:56:34 +0900 Subject: [PATCH 05/14] WIP: integration of chain-self-management-sketch.org into high-level-chain-mgr.tex --- doc/chain-self-management-sketch.org | 327 +---------- doc/src.high-level/high-level-chain-mgr.tex | 606 ++++++++++++++++---- 2 files changed, 515 insertions(+), 418 deletions(-) diff --git a/doc/chain-self-management-sketch.org b/doc/chain-self-management-sketch.org index fcd8f8b..1be3268 100644 --- a/doc/chain-self-management-sketch.org +++ b/doc/chain-self-management-sketch.org @@ -5,20 +5,14 @@ #+SEQ_TODO: TODO WORKING WAITING DONE * 1. Abstract -Yo, this is the first draft of a document that attempts to describe a +Yo, this is the second draft of a document that attempts to describe a proposed self-management algorithm for Machi's chain replication. Welcome! Sit back and enjoy the disjointed prose. -We attempt to describe first the self-management and self-reliance -goals of the algorithm. Then we make a side trip to talk about -write-once registers and how they're used by Machi, but we don't -really fully explain exactly why write-once is so critical (why not -general purpose registers?) ... but they are indeed critical. Then we -sketch the algorithm by providing detailed annotation of a flowchart, -then let the flowchart speak for itself, because writing good prose is -prose is damn hard, but flowcharts are very specific and concise. +The high level design of the Machi "chain manager" has moved to the +[[high-level-chain-manager.pdf][Machi chain manager high level design]] document. -Finally, we try to discuss the network partition simulator that the +We try to discuss the network partition simulator that the algorithm runs in and how the algorithm behaves in both symmetric and asymmetric network partition scenarios. The symmetric partition cases are all working well (surprising in a good way), and the asymmetric @@ -46,319 +40,10 @@ the simulator. %% under the License. #+END_SRC -* 3. Naming: possible ideas (TODO) -** Humming consensus? - -See [[https://tools.ietf.org/html/rfc7282][On Consensus and Humming in the IETF]], RFC 7282. - -See also: [[http://www.snookles.com/slf-blog/2015/03/01/on-humming-consensus-an-allegory/][On “Humming Consensus”, an allegory]]. - -** Foggy consensus? 
- -CORFU-like consensus between mist-shrouded islands of network -partitions - -** Rough consensus - -This is my favorite, but it might be too close to handwavy/vagueness -of English language, even with a precise definition and proof -sketching? - -** Let the bikeshed continue! - -I agree with Chris: there may already be a definition that's close -enough to "rough consensus" to continue using that existing tag than -to invent a new one. TODO: more research required - -* 4. What does "self-management" mean in this context? - -For the purposes of this document, chain replication self-management -is the ability for the N nodes in an N-length chain replication chain -to manage the state of the chain without requiring an external party -to participate. Chain state includes: - -1. Preserve data integrity of all data stored within the chain. Data - loss is not an option. -2. Stably preserve knowledge of chain membership (i.e. all nodes in - the chain, regardless of operational status). A systems - administrators is expected to make "permanent" decisions about - chain membership. -3. Use passive and/or active techniques to track operational - state/status, e.g., up, down, restarting, full data sync, partial - data sync, etc. -4. Choose the run-time replica ordering/state of the chain, based on - current member status and past operational history. All chain - state transitions must be done safely and without data loss or - corruption. -5. As a new node is added to the chain administratively or old node is - restarted, add the node to the chain safely and perform any data - synchronization/"repair" required to bring the node's data into - full synchronization with the other nodes. - -* 5. Goals -** Better than state-of-the-art: Chain Replication self-management - -We hope/believe that this new self-management algorithem can improve -the current state-of-the-art by eliminating all external management -entities. Current state-of-the-art for management of chain -replication chains is discussed below, to provide historical context. - -*** "Leveraging Sharding in the Design of Scalable Replication Protocols" by Abu-Libdeh, van Renesse, and Vigfusson. - -Multiple chains are arranged in a ring (called a "band" in the paper). -The responsibility for managing the chain at position N is delegated -to chain N-1. As long as at least one chain is running, that is -sufficient to start/bootstrap the next chain, and so on until all -chains are running. (The paper then estimates mean-time-to-failure -(MTTF) and suggests a "band of bands" topology to handle very large -clusters while maintaining an MTTF that is as good or better than -other management techniques.) - -If the chain self-management method proposed for Machi does not -succeed, this paper's technique is our best fallback recommendation. - -*** An external management oracle, implemented by ZooKeeper - -This is not a recommendation for Machi: we wish to avoid using ZooKeeper. -However, many other open and closed source software products use -ZooKeeper for exactly this kind of data replica management problem. - -*** An external management oracle, implemented by Riak Ensemble - -This is a much more palatable choice than option #2 above. We also -wish to avoid an external dependency on something as big as Riak -Ensemble. However, if it comes between choosing Riak Ensemble or -choosing ZooKeeper, the choice feels quite clear: Riak Ensemble will -win, unless there is some critical feature missing from Riak -Ensemble. 
If such an unforseen missing feature is discovered, it -would probably be preferable to add the feature to Riak Ensemble -rather than to use ZooKeeper (and document it and provide product -support for it and so on...). - -** Support both eventually consistent & strongly consistent modes of operation - -Machi's first use case is for Riak CS, as an eventually consistent -store for CS's "block" storage. Today, Riak KV is used for "block" -storage. Riak KV is an AP-style key-value store; using Machi in an -AP-style mode would match CS's current behavior from points of view of -both code/execution and human administrator exectations. - -Later, we wish the option of using CP support to replace other data -store services that Riak KV provides today. (Scope and timing of such -replacement TBD.) - -We believe this algorithm allows a Machi cluster to fragment into -arbitrary islands of network partition, all the way down to 100% of -members running in complete network isolation from each other. -Furthermore, it provides enough agreement to allow -formerly-partitioned members to coordinate the reintegration & -reconciliation of their data when partitions are healed. - -** Preserve data integrity of Chain Replicated data - -While listed last in this section, preservation of data integrity is -paramount to any chain state management technique for Machi. - -** Anti-goal: minimize churn - -This algorithm's focus is data safety and not availability. If -participants have differing notions of time, e.g., running on -extremely fast or extremely slow hardware, then this algorithm will -"churn" in different states where the chain's data would be -effectively unavailable. - -In practice, however, any series of network partition changes that -case this algorithm to churn will cause other management techniques -(such as an external "oracle") similar problems. [Proof by handwaving -assertion.] See also: "time model" assumptions (below). - -* 6. Assumptions -** Introduction to assumptions, why they differ from other consensus algorithms - -Given a long history of consensus algorithms (viewstamped replication, -Paxos, Raft, et al.), why bother with a slightly different set of -assumptions and a slightly different protocol? - -The answer lies in one of our explicit goals: to have an option of -running in an "eventually consistent" manner. We wish to be able to -make progress, i.e., remain available in the CAP sense, even if we are -partitioned down to a single isolated node. VR, Paxos, and Raft -alone are not sufficient to coordinate service availability at such -small scale. - -** The CORFU protocol is correct - -This work relies tremendously on the correctness of the CORFU -protocol, a cousin of the Paxos protocol. If the implementation of -this self-management protocol breaks an assumption or prerequisite of -CORFU, then we expect that the implementation will be flawed. - -** Communication model: Asyncronous message passing -*** Unreliable network: messages may be arbitrarily dropped and/or reordered -**** Network partitions may occur at any time -**** Network partitions may be asymmetric: msg A->B is ok but B->A fails -*** Messages may be corrupted in-transit -**** Assume that message MAC/checksums are sufficient to detect corruption -**** Receiver informs sender of message corruption -**** Sender may resend, if/when desired -*** System particpants may be buggy but not actively malicious/Byzantine -** Time model: per-node clocks, loosely synchronized (e.g. 
NTP) - -The protocol & algorithm presented here do not specify or require any -timestamps, physical or logical. Any mention of time inside of data -structures are for human/historic/diagnostic purposes only. - -Having said that, some notion of physical time is suggested for -purposes of efficiency. It's recommended that there be some "sleep -time" between iterations of the algorithm: there is no need to "busy -wait" by executing the algorithm as quickly as possible. See below, -"sleep intervals between executions". - -** Failure detector model: weak, fallible, boolean - -We assume that the failure detector that the algorithm uses is weak, -it's fallible, and it informs the algorithm in boolean status -updates/toggles as a node becomes available or not. - -If the failure detector is fallible and tells us a mistaken status -change, then the algorithm will "churn" the operational state of the -chain, e.g. by removing the failed node from the chain or adding a -(re)started node (that may not be alive) to the end of the chain. -Such extra churn is regrettable and will cause periods of delay as the -"rough consensus" (decribed below) decision is made. However, the -churn cannot (we assert/believe) cause data loss. - -** The "wedge state", as described by the Machi RFC & CORFU - -A chain member enters "wedge state" when it receives information that -a newer projection (i.e., run-time chain state reconfiguration) is -available. The new projection may be created by a system -administrator or calculated by the self-management algorithm. -Notification may arrive via the projection store API or via the file -I/O API. - -When in wedge state, the server/FLU will refuse all file write I/O API -requests until the self-management algorithm has determined that -"rough consensus" has been decided (see next bullet item). The server -may also refuse file read I/O API requests, depending on its CP/AP -operation mode. - -See the Machi RFC for more detail of the wedge state and also the -CORFU papers. - -** "Rough consensus": consensus built upon data that is *visible now* - -CS literature uses the word "consensus" in the context of the problem -description at -[[http://en.wikipedia.org/wiki/Consensus_(computer_science)#Problem_description]]. -This traditional definition differs from what is described in this -document. - -The phrase "rough consensus" will be used to describe -consensus derived only from data that is visible/known at the current -time. This implies that a network partition may be in effect and that -not all chain members are reachable. The algorithm will calculate -"rough consensus" despite not having input from all/majority/minority -of chain members. "Rough consensus" may proceed to make a -decision based on data from only a single participant, i.e., the local -node alone. - -When operating in AP mode, i.e., in eventual consistency mode, "rough -consensus" could mean that an chain of length N could split into N -independent chains of length 1. When a network partition heals, the -rough consensus is sufficient to manage the chain so that each -replica's data can be repaired/merged/reconciled safely. -(Other features of the Machi system are designed to assist such -repair safely.) - -When operating in CP mode, i.e., in strong consistency mode, "rough -consensus" would require additional supplements. For example, any -chain that didn't have a minimum length of the quorum majority size of -all members would be invalid and therefore would not move itself out -of wedged state. 
In very general terms, this requirement for a quorum -majority of surviving participants is also a requirement for Paxos, -Raft, and ZAB. - -(Aside: The Machi RFC also proposes using "witness" chain members to -make service more available, e.g. quorum majority of "real" plus -"witness" nodes *and* at least one member must be a "real" node. See -the Machi RFC for more details.) - -** Heavy reliance on a key-value store that maps write-once registers - -The projection store is implemented using "write-once registers" -inside a key-value store: for every key in the store, the value must -be either of: - -- The special 'unwritten' value -- An application-specific binary blob that is immutable thereafter -* 7. The projection store, built with write-once registers +* -- NOTE to the reader: The notion of "public" vs. "private" projection - stores does not appear in the Machi RFC. - -Each participating chain node has its own "projection store", which is -a specialized key-value store. As a whole, a node's projection store -is implemented using two different key-value stores: - -- A publicly-writable KV store of write-once registers -- A privately-writable KV store of write-once registers - -Both stores may be read by any cluster member. - -The store's key is a positive integer; the integer represents the -epoch number of the projection. The store's value is an opaque -binary blob whose meaning is meaningful only to the store's clients. - -See the Machi RFC for more detail on projections and epoch numbers. - -** The publicly-writable half of the projection store - -The publicly-writable projection store is used to share information -during the first half of the self-management algorithm. Any chain -member may write a projection to this store. - -** The privately-writable half of the projection store - -The privately-writable projection store is used to store the "rough -consensus" result that has been calculated by the local node. Only -the local server/FLU may write values into this store. - -The private projection store serves multiple purposes, including: - -- remove/clear the local server from "wedge state" -- act as the store of record for chain state transitions -- communicate to remote nodes the past states and current operational - state of the local node - -* 8. Modification of CORFU-style epoch numbering and "wedge state" triggers - -According to the CORFU research papers, if a server node N or client -node C believes that epoch E is the latest epoch, then any information -that N or C receives from any source that an epoch E+delta (where -delta > 0) exists will push N into the "wedge" state and C into a mode -of searching for the projection definition for the newest epoch. - -In the algorithm sketch below, it should become clear that it's -possible to have a race where two nodes may attempt to make proposals -for a single epoch number. In the simplest case, assume a chain of -nodes A & B. Assume that a symmetric network partition between A & B -happens, and assume we're operating in AP/eventually consistent mode. - -On A's network partitioned island, A can choose a UPI list of `[A]'. -Similarly B can choose a UPI list of `[B]'. Both might choose the -epoch for their proposal to be #42. Because each are separated by -network partition, neither can realize the conflict. However, when -the network partition heals, it can become obvious that there are -conflicting values for epoch #42 ... 
but if we use CORFU's protocol -design, which identifies the epoch identifier as an integer only, then -the integer 42 alone is not sufficient to discern the differences -between the two projections. - -The proposal modifies all use of CORFU's projection identifier -to use the identifier below instead. (A later section of this -document presents a detailed example.) +* 8. #+BEGIN_SRC {epoch #, hash of the entire projection (minus hash field itself)} diff --git a/doc/src.high-level/high-level-chain-mgr.tex b/doc/src.high-level/high-level-chain-mgr.tex index 0683374..9fd634a 100644 --- a/doc/src.high-level/high-level-chain-mgr.tex +++ b/doc/src.high-level/high-level-chain-mgr.tex @@ -27,7 +27,7 @@ \preprintfooter{Draft \#0, April 2014} \title{Machi Chain Replication: management theory and design} -\subtitle{} +\subtitle{Includes ``humming consensus'' overview} \authorinfo{Basho Japan KK}{} @@ -46,14 +46,272 @@ For an overview of the design of the larger Machi system, please see \section{Abstract} \label{sec:abstract} -TODO +We attempt to describe first the self-management and self-reliance +goals of the algorithm. Then we make a side trip to talk about +write-once registers and how they're used by Machi, but we don't +really fully explain exactly why write-once is so critical (why not +general purpose registers?) ... but they are indeed critical. Then we +sketch the algorithm, supplemented by a detailed annotation of a flowchart. + +A discussion of ``humming consensus'' follows next. This type of +consensus does not require active participation by all or even a +majority of participants to make decisions. Machi's chain manager +bases its logic on humming consensus to make decisions about how to +react to changes in its environment, e.g. server crashes, network +partitions, and changes by Machi cluster admnistrators. Once a +decision is made during a virtual time epoch, humming consensus will +eventually discover if other participants have made a different +decision during that epoch. When a differing decision is discovered, +new time epochs are proposed in which a new consensus is reached and +disseminated to all available participants. \section{Introduction} \label{sec:introduction} -TODO +\subsection{What does "self-management" mean?} +\label{sub:self-management} -\section{Projections: calculation, then storage, then (perhaps) use} +For the purposes of this document, chain replication self-management +is the ability for the $N$ nodes in an $N$-length chain replication chain +to manage the state of the chain without requiring an external party +to participate. Chain state includes: + +\begin{itemize} +\item Preserve data integrity of all data stored within the chain. Data + loss is not an option. +\item Stably preserve knowledge of chain membership (i.e. all nodes in + the chain, regardless of operational status). A systems + administrators is expected to make "permanent" decisions about + chain membership. +\item Use passive and/or active techniques to track operational + state/status, e.g., up, down, restarting, full data sync, partial + data sync, etc. +\item Choose the run-time replica ordering/state of the chain, based on + current member status and past operational history. All chain + state transitions must be done safely and without data loss or + corruption. 
+\item As a new node is added to the chain administratively or old node is + restarted, add the node to the chain safely and perform any data + synchronization/"repair" required to bring the node's data into + full synchronization with the other nodes. +\end{itemize} + +\subsection{Ultimate goal: Preserve data integrity of Chain Replicated data} + +Preservation of data integrity is paramount to any chain state +management technique for Machi. Even when operating in an eventually +consistent mode, Machi must not lose data without cause outside of all +design, e.g., all particpants crash permanently. + +\subsection{Goal: better than state-of-the-art Chain Replication management} + +We hope/believe that this new self-management algorithem can improve +the current state-of-the-art by eliminating all external management +entities. Current state-of-the-art for management of chain +replication chains is discussed below, to provide historical context. + +\subsubsection{``Leveraging Sharding in the Design of Scalable Replication Protocols'' by Abu-Libdeh, van Renesse, and Vigfusson} +\label{ssec:elastic-replication} +Multiple chains are arranged in a ring (called a "band" in the paper). +The responsibility for managing the chain at position N is delegated +to chain N-1. As long as at least one chain is running, that is +sufficient to start/bootstrap the next chain, and so on until all +chains are running. The paper then estimates mean-time-to-failure +(MTTF) and suggests a "band of bands" topology to handle very large +clusters while maintaining an MTTF that is as good or better than +other management techniques. + +{\bf NOTE:} If the chain self-management method proposed for Machi does not +succeed, this paper's technique is our best fallback recommendation. + +\subsubsection{An external management oracle, implemented by ZooKeeper} +\label{ssec:an-oracle} +This is not a recommendation for Machi: we wish to avoid using ZooKeeper. +However, many other open source software products use +ZooKeeper for exactly this kind of data replica management problem. + +\subsubsection{An external management oracle, implemented by Riak Ensemble} + +This is a much more palatable choice than option~\ref{ssec:an-oracle} +above. We also +wish to avoid an external dependency on something as big as Riak +Ensemble. However, if it comes between choosing Riak Ensemble or +choosing ZooKeeper, the choice feels quite clear: Riak Ensemble will +win, unless there is some critical feature missing from Riak +Ensemble. If such an unforseen missing feature is discovered, it +would probably be preferable to add the feature to Riak Ensemble +rather than to use ZooKeeper (and document it and provide product +support for it and so on...). + +\subsection{Goal: Support both eventually consistent \& strongly consistent modes of operation} + +Machi's first use case is for Riak CS, as an eventually consistent +store for CS's "block" storage. Today, Riak KV is used for "block" +storage. Riak KV is an AP-style key-value store; using Machi in an +AP-style mode would match CS's current behavior from points of view of +both code/execution and human administrator exectations. + +Later, we wish the option of using CP support to replace other data +store services that Riak KV provides today. (Scope and timing of such +replacement TBD.) + +We believe this algorithm allows a Machi cluster to fragment into +arbitrary islands of network partition, all the way down to 100% of +members running in complete network isolation from each other. 
+Furthermore, it provides enough agreement to allow +formerly-partitioned members to coordinate the reintegration \& +reconciliation of their data when partitions are healed. + +\subsection{Anti-goal: minimize churn} + +This algorithm's focus is data safety and not availability. If +participants have differing notions of time, e.g., running on +extremely fast or extremely slow hardware, then this algorithm will +"churn" in different states where the chain's data would be +effectively unavailable. + +In practice, however, any series of network partition changes that +case this algorithm to churn will cause other management techniques +(such as an external "oracle") similar problems. +{\bf [Proof by handwaving assertion.]} +See also: Section~\ref{sub:time-model} + +\section{Assumptions} +\label{sec:assumptions} + +Given a long history of consensus algorithms (viewstamped replication, +Paxos, Raft, et al.), why bother with a slightly different set of +assumptions and a slightly different protocol? + +The answer lies in one of our explicit goals: to have an option of +running in an "eventually consistent" manner. We wish to be able to +make progress, i.e., remain available in the CAP sense, even if we are +partitioned down to a single isolated node. VR, Paxos, and Raft +alone are not sufficient to coordinate service availability at such +small scale. + +\subsection{The CORFU protocol is correct} + +This work relies tremendously on the correctness of the CORFU +protocol \cite{corfu1}, a cousin of the Paxos protocol. +If the implementation of +this self-management protocol breaks an assumption or prerequisite of +CORFU, then we expect that Machi's implementation will be flawed. + +\subsection{Communication model: asyncronous message passing} + +The network is unreliable: messages may be arbitrarily dropped and/or +reordered. Network partitions may occur at any time. +Network partitions may be asymmetric, e.g., a message can be sent +from $A \rightarrow B$, but messages from $B \rightarrow A$ can be +lost, dropped, and/or arbitrarily delayed. + +System particpants may be buggy but not actively malicious/Byzantine. + +\subsection{Time model} +\label{sub:time-model} + +Our time model is per-node wall-clock time clocks, loosely +synchronized by NTP. + +The protocol and algorithm presented here do not specify or require any +timestamps, physical or logical. Any mention of time inside of data +structures are for human/historic/diagnostic purposes only. + +Having said that, some notion of physical time is suggested for +purposes of efficiency. It's recommended that there be some "sleep +time" between iterations of the algorithm: there is no need to "busy +wait" by executing the algorithm as quickly as possible. See below, +"sleep intervals between executions". + +\subsection{Failure detector model: weak, fallible, boolean} + +We assume that the failure detector that the algorithm uses is weak, +it's fallible, and it informs the algorithm in boolean status +updates/toggles as a node becomes available or not. + +If the failure detector is fallible and tells us a mistaken status +change, then the algorithm will "churn" the operational state of the +chain, e.g. by removing the failed node from the chain or adding a +(re)started node (that may not be alive) to the end of the chain. +Such extra churn is regrettable and will cause periods of delay as the +"rough consensus" (decribed below) decision is made. However, the +churn cannot (we assert/believe) cause data loss. 
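+
+For illustration only, the sketch below shows one way such a weak,
+boolean failure detector might be structured: a periodic loop that
+pings each peer, sleeps between iterations (as suggested in
+Section~\ref{sub:time-model}), and reports only up/down toggles to a
+chain manager process. The module, function, and message names here
+are assumptions invented for this sketch and are not part of any
+Machi API.
+
+\begin{verbatim}
+%% Sketch only; {node_status_change, ...} is a hypothetical
+%% notification message, not a real Machi message.
+-module(sketch_failure_detector).
+-export([start/2]).
+
+start(ChainMgrPid, Peers) ->
+    spawn(fun() -> loop(ChainMgrPid, Peers, #{}) end).
+
+loop(ChainMgrPid, Peers, LastStatus) ->
+    NewStatus = maps:from_list([{P, ping(P)} || P <- Peers]),
+    %% Report only boolean up/down toggles, not raw measurements.
+    _ = [ChainMgrPid ! {node_status_change, P, maps:get(P, NewStatus)}
+         || P <- Peers,
+            maps:get(P, NewStatus) =/= maps:get(P, LastStatus, undefined)],
+    timer:sleep(2000),   % "sleep time" between iterations; no busy waiting
+    loop(ChainMgrPid, Peers, NewStatus).
+
+ping(Node) ->
+    case net_adm:ping(Node) of  % fallible: a 'pong' now proves nothing later
+        pong -> up;
+        pang -> down
+    end.
+\end{verbatim}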
+ +\subsection{Use of the ``wedge state''} + +A participant in Chain Replication will enter "wedge state", as +described by the Machi high level design \cite{machi-design} and by CORFU, +when it receives information that +a newer projection (i.e., run-time chain state reconfiguration) is +available. The new projection may be created by a system +administrator or calculated by the self-management algorithm. +Notification may arrive via the projection store API or via the file +I/O API. + +When in wedge state, the server will refuse all file write I/O API +requests until the self-management algorithm has determined that +"rough consensus" has been decided (see next bullet item). The server +may also refuse file read I/O API requests, depending on its CP/AP +operation mode. + +\subsection{Use of ``humming consensus''} + +CS literature uses the word "consensus" in the context of the problem +description at +{\tt http://en.wikipedia.org/wiki/ Consensus\_(computer\_science)\#Problem\_description}. +This traditional definition differs from what is described here as +``humming consensus''. + +"Humming consensus" describes +consensus that is derived only from data that is visible/known at the current +time. This implies that a network partition may be in effect and that +not all chain members are reachable. The algorithm will calculate +an approximate consensus despite not having input from all/majority +of chain members. Humming consensus may proceed to make a +decision based on data from only a single participant, i.e., only the local +node. + +See Section~\ref{sec:humming-consensus} for detailed discussion. + +\section{The projection store} + +The Machi chain manager relies heavily on a key-value store of +write-once registers called the ``projection store''. +Each participating chain node has its own projection store. +The store's key is a positive integer; +the integer represents the epoch number of the projection. The +store's value is either the special `unwritten' value\footnote{We use + $\emptyset$ to denote the unwritten value.} or else an +application-specific binary blob that is immutable thereafter. + +The projection store is vital for the correct implementation of humming +consensus (Section~\ref{sec:humming-consensus}). + +All parts store described below may be read by any cluster member. + +\subsection{The publicly-writable half of the projection store} + +The publicly-writable projection store is used to share information +during the first half of the self-management algorithm. Any chain +member may write a projection to this store. + +\subsection{The privately-writable half of the projection store} + +The privately-writable projection store is used to store the humming consensus +result that has been chosen by the local node. Only +the local server may write values into this store. + +The private projection store serves multiple purposes, including: + +\begin{itemize} +\item remove/clear the local server from ``wedge state'' +\item act as the store of record for chain state transitions +\item communicate to remote nodes the past states and current operational + state of the local node +\end{itemize} + +\section{Projections: calculation, storage, and use} \label{sec:projections} Machi uses a ``projection'' to determine how its Chain Replication replicas @@ -61,19 +319,122 @@ should operate; see \cite{machi-design} and \cite{corfu1}. 
At runtime, a cluster must be able to respond both to administrative changes (e.g., substituting a failed server box with replacement hardware) as well as local network conditions (e.g., is -there a network partition?). The concept of a projection is borrowed +there a network partition?). + +The concept of a projection is borrowed from CORFU but has a longer history, e.g., the Hibari key-value store \cite{cr-theory-and-practice} and goes back in research for decades, e.g., Porcupine \cite{porcupine}. -\subsection{Phases of projection change} +\subsection{The projection data structure} +\label{sub:the-projection} + +{\bf NOTE:} This section is a duplicate of the ``The Projection and +the Projection Epoch Number'' section of \cite{machi-design}. + +The projection data +structure defines the current administration \& operational/runtime +configuration of a Machi cluster's single Chain Replication chain. +Each projection is identified by a strictly increasing counter called +the Epoch Projection Number (or more simply ``the epoch''). + +Projections are calculated by each server using input from local +measurement data, calculations by the server's chain manager +(see below), and input from the administration API. +Each time that the configuration changes (automatically or by +administrator's request), a new epoch number is assigned +to the entire configuration data structure and is distributed to +all servers via the server's administration API. Each server maintains the +current projection epoch number as part of its soft state. + +Pseudo-code for the projection's definition is shown in +Figure~\ref{fig:projection}. To summarize the major components: + +\begin{figure} +\begin{verbatim} +-type m_server_info() :: {Hostname, Port,...}. +-record(projection, { + epoch_number :: m_epoch_n(), + epoch_csum :: m_csum(), + creation_time :: now(), + author_server :: m_server(), + all_members :: [m_server()], + active_upi :: [m_server()], + active_all :: [m_server()], + down_members :: [m_server()], + dbg_annotations :: proplist() + }). +\end{verbatim} +\caption{Sketch of the projection data structure} +\label{fig:projection} +\end{figure} + +\begin{itemize} +\item {\tt epoch\_number} and {\tt epoch\_csum} The epoch number and + projection checksum are unique identifiers for this projection. +\item {\tt creation\_time} Wall-clock time, useful for humans and + general debugging effort. +\item {\tt author\_server} Name of the server that calculated the projection. +\item {\tt all\_members} All servers in the chain, regardless of current + operation status. If all operating conditions are perfect, the + chain should operate in the order specified here. +\item {\tt active\_upi} All active chain members that we know are + fully repaired/in-sync with each other and therefore the Update + Propagation Invariant (Section~\ref{sub:upi} is always true. +\item {\tt active\_all} All active chain members, including those that + are under active repair procedures. +\item {\tt down\_members} All members that the {\tt author\_server} + believes are currently down or partitioned. +\item {\tt dbg\_annotations} A ``kitchen sink'' proplist, for code to + add any hints for why the projection change was made, delay/retry + information, etc. 
+\end{itemize} + +\subsection{Why the checksum field?} + +According to the CORFU research papers, if a server node $S$ or client +node $C$ believes that epoch $E$ is the latest epoch, then any information +that $S$ or $C$ receives from any source that an epoch $E+\delta$ (where +$\delta > 0$) exists will push $S$ into the "wedge" state and $C$ into a mode +of searching for the projection definition for the newest epoch. + +In the humming consensus description in +Section~\ref{sec:humming-consensus}, it should become clear that it's +possible to have a situation where two nodes make proposals +for a single epoch number. In the simplest case, assume a chain of +nodes $A$ and $B$. Assume that a symmetric network partition between +$A$ and $B$ happens. Also, let's assume that operating in +AP/eventually consistent mode. + +On $A$'s network-partitioned island, $A$ can choose +an active chain definition of {\tt [A]}. +Similarly $B$ can choose a definition of {\tt [B]}. Both $A$ and $B$ +might choose the +epoch for their proposal to be \#42. Because each are separated by +network partition, neither can realize the conflict. + +When the network partition heals, it can become obvious to both +servers that there are conflicting values for epoch \#42. If we +use CORFU's protocol design, which identifies the epoch identifier as +an integer only, then the integer 42 alone is not sufficient to +discern the differences between the two projections. + +Humming consensus requires that any projection be identified by both +the epoch number and the projection checksum, as described in +Section~\ref{sub:the-projection}. + +\section{Phases of projection change} +\label{sec:phases-of-projection-change} Machi's use of projections is in four discrete phases and are discussed below: network monitoring, projection calculation, projection storage, and -adoption of new projections. +adoption of new projections. The phases are described in the +subsections below. The reader should then be able to recognize each +of these phases when reading the humming consensus algorithm +description in Section~\ref{sec:humming-consensus}. -\subsubsection{Network monitoring} +\subsection{Network monitoring} \label{sub:network-monitoring} Monitoring of local network conditions can be implemented in many @@ -84,7 +445,6 @@ following techniques: \begin{itemize} \item Internal ``no op'' FLU-level protocol request \& response. -\item Use of distributed Erlang {\tt net\_ticktime} node monitoring \item Explicit connections of remote {\tt epmd} services, e.g., to tell the difference between a dead Erlang VM and a dead machine/hardware node. @@ -98,7 +458,7 @@ methods for determining status. Instead, hard Boolean up/down status decisions are required by the projection calculation phase (Section~\ref{subsub:projection-calculation}). -\subsubsection{Projection data structure calculation} +\subsection{Projection data structure calculation} \label{subsub:projection-calculation} Each Machi server will have an independent agent/process that is @@ -124,7 +484,7 @@ changes may require retry logic and delay/sleep time intervals. \label{sub:proj-storage-writing} All projection data structures are stored in the write-once Projection -Store that is run by each FLU. (See also \cite{machi-design}.) +Store that is run by each server. (See also \cite{machi-design}.) Writing the projection follows the two-step sequence below. In cases of writing @@ -138,22 +498,29 @@ projection value that the local actor is trying to write! 
\begin{enumerate} \item Write $P_{new}$ to the local projection store. This will trigger - ``wedge'' status in the local FLU, which will then cascade to other - projection-related behavior within the FLU. + ``wedge'' status in the local server, which will then cascade to other + projection-related behavior within the server. \item Write $P_{new}$ to the remote projection store of {\tt all\_members}. Some members may be unavailable, but that is OK. \end{enumerate} -(Recall: Other parts of the system are responsible for reading new -projections from other actors in the system and for deciding to try to -create a new projection locally.) +\subsection{Adoption of new projections} -\subsection{Projection storage: reading} +The projection store's ``best value'' for the largest written epoch +number at the time of the read is projection used by the server. +If the read attempt for projection $P_p$ +also yields other non-best values, then the +projection calculation subsystem is notified. This notification +may/may not trigger a calculation of a new projection $P_{p+1}$ which +may eventually be stored and so +resolve $P_p$'s replicas' ambiguity. + +\section{Reading from the projection store} \label{sub:proj-storage-reading} Reading data from the projection store is similar in principle to -reading from a Chain Replication-managed FLU system. However, the -projection store does not require the strict replica ordering that +reading from a Chain Replication-managed server system. However, the +projection store does not use the strict replica ordering that Chain Replication does. For any projection store key $K_n$, the participating servers may have different values for $K_n$. As a write-once store, it is impossible to mutate a replica of $K_n$. If @@ -196,48 +563,7 @@ unwritten replicas. If the value of $K$ is not unanimous, then the ``best value'' $V_{best}$ is used for the repair. If all respond with {\tt error\_unwritten}, repair is not required. -\subsection{Adoption of new projections} - -The projection store's ``best value'' for the largest written epoch -number at the time of the read is projection used by the FLU. -If the read attempt for projection $P_p$ -also yields other non-best values, then the -projection calculation subsystem is notified. This notification -may/may not trigger a calculation of a new projection $P_{p+1}$ which -may eventually be stored and so -resolve $P_p$'s replicas' ambiguity. - -\subsubsection{Alternative implementations: Hibari's ``Admin Server'' - and Elastic Chain Replication} - -See Section 7 of \cite{cr-theory-and-practice} for details of Hibari's -chain management agent, the ``Admin Server''. In brief: - -\begin{itemize} -\item The Admin Server is intentionally a single point of failure in - the same way that the instance of Stanchion in a Riak CS cluster - is an intentional single - point of failure. In both cases, strict - serialization of state changes is more important than 100\% - availability. - -\item For higher availability, the Hibari Admin Server is usually - configured in an active/standby manner. Status monitoring and - application failover logic is provided by the built-in capabilities - of the Erlang/OTP application controller. - -\end{itemize} - -Elastic chain replication is a technique described in -\cite{elastic-chain-replication}. It describes using multiple chains -to monitor each other, as arranged in a ring where a chain at position -$x$ is responsible for chain configuration and management of the chain -at position $x+1$. 
This technique is likely the fall-back to be used -in case the chain management method described in this RFC proves -infeasible. - -\subsection{Likely problems and possible solutions} -\label{sub:likely-problems} +\section{Just in case Humming Consensus doesn't work for us} There are some unanswered questions about Machi's proposed chain management technique. The problems that we guess are likely/possible @@ -266,13 +592,102 @@ include: \end{itemize} +\subsection{Alternative: Elastic Replication} + +Using Elastic Replication (Section~\ref{ssec:elastic-replication}) is +our preferred alternative, if Humming Consensus is not usable. + +\subsection{Alternative: Hibari's ``Admin Server'' + and Elastic Chain Replication} + +See Section 7 of \cite{cr-theory-and-practice} for details of Hibari's +chain management agent, the ``Admin Server''. In brief: + +\begin{itemize} +\item The Admin Server is intentionally a single point of failure in + the same way that the instance of Stanchion in a Riak CS cluster + is an intentional single + point of failure. In both cases, strict + serialization of state changes is more important than 100\% + availability. + +\item For higher availability, the Hibari Admin Server is usually + configured in an active/standby manner. Status monitoring and + application failover logic is provided by the built-in capabilities + of the Erlang/OTP application controller. + +\end{itemize} + +Elastic chain replication is a technique described in +\cite{elastic-chain-replication}. It describes using multiple chains +to monitor each other, as arranged in a ring where a chain at position +$x$ is responsible for chain configuration and management of the chain +at position $x+1$. This technique is likely the fall-back to be used +in case the chain management method described in this RFC proves +infeasible. + +\section{Humming Consensus} +\label{sec:humming-consensus} + +Sources for background information include: + +\begin{itemize} +\item ``On Consensus and Humming in the IETF'' \cite{rfc-7282}, for +background on the use of humming during meetings of the IETF. + +\item ``On `Humming Consensus', an allegory'' \cite{humming-consensus-allegory}, +for an allegory in the style (?) of Leslie Lamport's original Paxos +paper +\end{itemize} + + +"Humming consensus" describes +consensus that is derived only from data that is visible/known at the current +time. This implies that a network partition may be in effect and that +not all chain members are reachable. The algorithm will calculate +an approximate consensus despite not having input from all/majority +of chain members. Humming consensus may proceed to make a +decision based on data from only a single participant, i.e., only the local +node. + +When operating in AP mode, i.e., in eventual consistency mode, humming +consensus may reconfigure chain of length $N$ into $N$ +independent chains of length 1. When a network partition heals, the +humming consensus is sufficient to manage the chain so that each +replica's data can be repaired/merged/reconciled safely. +Other features of the Machi system are designed to assist such +repair safely. + +When operating in CP mode, i.e., in strong consistency mode, humming +consensus would require additional restrictions. For example, any +chain that didn't have a minimum length of the quorum majority size of +all members would be invalid and therefore would not move itself out +of wedged state. 
In very general terms, this requirement for a quorum +majority of surviving participants is also a requirement for Paxos, +Raft, and ZAB. \footnote{The Machi RFC also proposes using +``witness'' chain members to +make service more available, e.g. quorum majority of ``real'' plus +``witness'' nodes {\bf and} at least one member must be a ``real'' node.} + \section{Chain Replication: proof of correctness} -\label{sub:cr-proof} +\label{sec:cr-proof} See Section~3 of \cite{chain-replication} for a proof of the correctness of Chain Replication. A short summary is provide here. Readers interested in good karma should read the entire paper. +\subsection{The Update Propagation Invariant} +\label{sub:upi} + +``Update Propagation Invariant'' is the original chain replication +paper's name for the +$H_i \succeq H_j$ +property mentioned in Figure~\ref{tab:chain-order}. +This paper will use the same name. +This property may also be referred to by its acronym, ``UPI''. + +\subsection{Chain Replication and strong consistency} + The three basic rules of Chain Replication and its strong consistency guarantee: @@ -337,9 +752,9 @@ $i$ & $<$ & $j$ \\ \multicolumn{3}{l}{It {\em must} be true: history lengths per replica:} \\ length($H_i$) & $\geq$ & length($H_j$) \\ \multicolumn{3}{l}{For example, a quiescent chain:} \\ -48 & $\geq$ & 48 \\ +length($H_i$) = 48 & $\geq$ & length($H_j$) = 48 \\ \multicolumn{3}{l}{For example, a chain being mutated:} \\ -55 & $\geq$ & 48 \\ +length($H_i$) = 55 & $\geq$ & length($H_j$) = 48 \\ \multicolumn{3}{l}{Example ordered mutation sets:} \\ $[M_0,M_1,\ldots,M_{46},M_{47},\ldots,M_{53},M_{54}]$ & $\supset$ & $[M_0,M_1,\ldots,M_{46},M_{47}]$ \\ \multicolumn{3}{c}{\bf Therefore the right side is always an ordered @@ -374,27 +789,22 @@ then no other chain member can have a prior/older value because their respective mutations histories cannot be shorter than the tail member's history. -\paragraph{``Update Propagation Invariant''} -is the original chain replication paper's name for the -$H_i \succeq H_j$ -property. This paper will use the same name. - \section{Repair of entire files} \label{sec:repair-entire-files} There are some situations where repair of entire files is necessary. \begin{itemize} -\item To repair FLUs added to a chain in a projection change, - specifically adding a new FLU to the chain. This case covers both - adding a new, data-less FLU and re-adding a previous, data-full FLU +\item To repair servers added to a chain in a projection change, + specifically adding a new server to the chain. This case covers both + adding a new, data-less server and re-adding a previous, data-full server back to the chain. \item To avoid data loss when changing the order of the chain's servers. \end{itemize} Both situations can set the stage for data loss in the future. If a violation of the Update Propagation Invariant (see end of -Section~\ref{sub:cr-proof}) is permitted, then the strong consistency +Section~\ref{sec:cr-proof}) is permitted, then the strong consistency guarantee of Chain Replication is violated. Because Machi uses write-once registers, the number of possible strong consistency violations is small: any client that witnesses a written $\rightarrow$ @@ -407,7 +817,7 @@ wish to avoid data loss whenever a chain has at least one surviving server. Another method to avoid data loss is to preserve the Update Propagation Invariant at all times. 
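+
+For illustration only, the following sketch checks the Update
+Propagation Invariant over the ordered mutation histories of a chain:
+for every pair of servers $i < j$ in chain order, server $j$'s history
+must be a prefix of server $i$'s history. The module and function
+names are hypothetical helpers invented for this sketch, not part of
+Machi.
+
+\begin{verbatim}
+%% Sketch only.  The argument is a list of per-server mutation lists
+%% in chain order, e.g. [[m0,m1,m2], [m0,m1,m2], [m0,m1]] for a chain
+%% whose tail is still one mutation behind.
+-module(sketch_upi).
+-export([upi_holds/1]).
+
+upi_holds([]) ->
+    true;
+upi_holds([H_i | Rest]) ->
+    lists:all(fun(H_j) -> is_prefix(H_j, H_i) end, Rest)
+        andalso upi_holds(Rest).
+
+%% is_prefix(Shorter, Longer): true if Shorter is a prefix of Longer.
+is_prefix([], _) -> true;
+is_prefix([X | Xs], [X | Ys]) -> is_prefix(Xs, Ys);
+is_prefix(_, _) -> false.
+\end{verbatim}
+
+For the quiescent and in-flight examples of Figure~\ref{tab:chain-order},
+this check succeeds; it fails as soon as some replica nearer the tail
+holds a mutation that a replica nearer the head lacks.
+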
-\subsubsection{Just ``rsync'' it!} +\subsection{Just ``rsync'' it!} \label{ssec:just-rsync-it} A simple repair method might be perhaps 90\% sufficient. @@ -432,7 +842,7 @@ For uses such as CORFU, strong consistency is a non-negotiable requirement. Therefore, we will use the Update Propagation Invariant as the foundation for Machi's data loss prevention techniques. -\subsubsection{Divergence from CORFU: repair} +\subsection{Divergence from CORFU: repair} \label{sub:repair-divergence} The original repair design for CORFU is simple and effective, @@ -498,7 +908,7 @@ vulnerability is eliminated.\footnote{SLF's note: Probably? This is my not safe} in Machi, I'm not 100\% certain anymore than this ``easy'' fix for CORFU is correct.}. -\subsubsection{Whole-file repair as FLUs are (re-)added to a chain} +\subsection{Whole-file repair as FLUs are (re-)added to a chain} \label{sub:repair-add-to-chain} Machi's repair process must preserve the Update Propagation @@ -560,8 +970,9 @@ While the normal single-write and single-read operations are performed by the cluster, a file synchronization process is initiated. The sequence of steps differs depending on the AP or CP mode of the system. -\paragraph{In cases where the cluster is operating in CP Mode:} +\subsubsection{Cluster in CP mode} +In cases where the cluster is operating in CP Mode, CORFU's repair method of ``just copy it all'' (from source FLU to repairing FLU) is correct, {\em except} for the small problem pointed out in Section~\ref{sub:repair-divergence}. The problem for Machi is one of @@ -661,23 +1072,9 @@ change: \end{itemize} -%% Then the only remaining safety problem (as far as I can see) is -%% avoiding this race: +\subsubsection{Cluster in AP Mode} -%% \begin{enumerate} -%% \item Enumerate byte ranges $[B_0,B_1,\ldots]$ in file $F$ that must -%% be copied to the repair target, based on checksum differences for -%% those byte ranges. -%% \item A real-time concurrent write for byte range $B_x$ arrives at the -%% U.P.~Invariant preserving chain for file $F$ but was not a member of -%% step \#1's list of byte ranges. -%% \item Step \#2's update is propagated down the chain of chains. -%% \item Step \#1's clobber updates are propagated down the chain of -%% chains. -%% \item The value for $B_x$ is lost on the repair targets. -%% \end{enumerate} - -\paragraph{In cases the cluster is operating in AP Mode:} +In cases the cluster is operating in AP Mode: \begin{enumerate} \item Follow the first two steps of the ``CP Mode'' @@ -696,7 +1093,7 @@ of chains, skipping any FLUs where the data is known to be written. Such writes will also preserve Update Propagation Invariant when repair is finished. -\subsubsection{Whole-file repair when changing FLU ordering within a chain} +\subsection{Whole-file repair when changing FLU ordering within a chain} \label{sub:repair-chain-re-ordering} Changing FLU order within a chain is an operations optimization only. @@ -725,7 +1122,7 @@ that are made during the chain reordering process. This method will not be described here. However, {\em if reviewers believe that it should be included}, please let the authors know. -\paragraph{In both Machi operating modes:} +\subsubsection{In both Machi operating modes:} After initial implementation, it may be that the repair procedure is a bit too slow. In order to accelerate repair decisions, it would be helpful have a quicker method to calculate which files have exactly @@ -1080,8 +1477,7 @@ Replication chain configuration changes. 
For example, in the split brain scenario of Section~\ref{sub:split-brain-scenario}, we have two pieces of data written to different ``sides'' of the split brain, $D_0$ and $D_1$. If the chain is naively reconfigured after the network -partition heals to be $[F_a=\emptyset,F_b=\emptyset,F_c=D_1],$\footnote{Where $\emptyset$ - denotes the unwritten value.} then $D_1$ +partition heals to be $[F_a=\emptyset,F_b=\emptyset,F_c=D_1],$ then $D_1$ is in danger of being lost. Why? The Update Propagation Invariant is violated. Any Chain Replication read will be @@ -1105,6 +1501,11 @@ contains at least one FLU. \begin{thebibliography}{} \softraggedright +\bibitem{rfc-7282} +RFC 7282: On Consensus and Humming in the IETF. +Internet Engineering Task Force. +{\tt https://tools.ietf.org/html/rfc7282} + \bibitem{elastic-chain-replication} Abu-Libdeh, Hussam et al. Leveraging Sharding in the Design of Scalable Replication Protocols. @@ -1141,6 +1542,11 @@ Chain Replication in Theory and in Practice. Proceedings of the 9th ACM SIGPLAN Workshop on Erlang (Erlang'10), 2010. {\tt http://www.snookles.com/scott/publications/ erlang2010-slf.pdf} +\bibitem{humming-consensus-allegory} +Fritchie, Scott Lystig. +On “Humming Consensus”, an allegory. +{\tt http://www.snookles.com/slf-blog/2015/03/ 01/on-humming-consensus-an-allegory/} + \bibitem{the-log-what} Kreps, Jay. The Log: What every software engineer should know about real-time data's unifying abstraction @@ -1155,6 +1561,12 @@ NetDB’11. {\tt http://research.microsoft.com/en-us/UM/people/ srikanth/netdb11/netdb11papers/netdb11-final12.pdf} +\bibitem{part-time-parliament} +Lamport, Leslie. +The Part-Time Parliament. +DEC technical report SRC-049, 1989. +{\tt ftp://apotheca.hpl.hp.com/gatekeeper/pub/ DEC/SRC/research-reports/SRC-049.pdf} + \bibitem{paxos-made-simple} Lamport, Leslie. Paxos Made Simple. From 451d7d458c4c6fcf1d1dcbe93827c794eaaaba05 Mon Sep 17 00:00:00 2001 From: Scott Lystig Fritchie Date: Mon, 20 Apr 2015 16:54:00 +0900 Subject: [PATCH 06/14] WIP: more restructuring --- doc/chain-self-management-sketch.org | 16 +- doc/src.high-level/high-level-chain-mgr.tex | 905 +++++++++++--------- 2 files changed, 483 insertions(+), 438 deletions(-) diff --git a/doc/chain-self-management-sketch.org b/doc/chain-self-management-sketch.org index 1be3268..ae950c7 100644 --- a/doc/chain-self-management-sketch.org +++ b/doc/chain-self-management-sketch.org @@ -5,10 +5,6 @@ #+SEQ_TODO: TODO WORKING WAITING DONE * 1. Abstract -Yo, this is the second draft of a document that attempts to describe a -proposed self-management algorithm for Machi's chain replication. -Welcome! Sit back and enjoy the disjointed prose. - The high level design of the Machi "chain manager" has moved to the [[high-level-chain-manager.pdf][Machi chain manager high level design]] document. @@ -41,15 +37,7 @@ the simulator. #+END_SRC -* - -* 8. - -#+BEGIN_SRC -{epoch #, hash of the entire projection (minus hash field itself)} -#+END_SRC - -* 9. Diagram of the self-management algorithm +* 3. Diagram of the self-management algorithm ** Introduction Refer to the diagram [[https://github.com/basho/machi/blob/master/doc/chain-self-management-sketch.Diagram1.pdf][chain-self-management-sketch.Diagram1.pdf]], @@ -264,7 +252,7 @@ use of quorum majority for UPI members is out of scope of this document. Also out of scope is the use of "witness servers" to augment the quorum majority UPI scheme.) -* 10. The Network Partition Simulator +* 4. 
The Network Partition Simulator ** Overview The function machi_chain_manager1_test:convergence_demo_test() executes the following in a simulated network environment within a diff --git a/doc/src.high-level/high-level-chain-mgr.tex b/doc/src.high-level/high-level-chain-mgr.tex index 9fd634a..37771ac 100644 --- a/doc/src.high-level/high-level-chain-mgr.tex +++ b/doc/src.high-level/high-level-chain-mgr.tex @@ -46,12 +46,11 @@ For an overview of the design of the larger Machi system, please see \section{Abstract} \label{sec:abstract} -We attempt to describe first the self-management and self-reliance -goals of the algorithm. Then we make a side trip to talk about -write-once registers and how they're used by Machi, but we don't -really fully explain exactly why write-once is so critical (why not -general purpose registers?) ... but they are indeed critical. Then we -sketch the algorithm, supplemented by a detailed annotation of a flowchart. +We describe the self-management and self-reliance +goals of the algorithm: preserve data integrity, advance the current +state of the art, and supporting multiple consisistency levels. + +TODO Fix, after all of the recent changes to this document. A discussion of ``humming consensus'' follows next. This type of consensus does not require active participation by all or even a @@ -156,7 +155,7 @@ store services that Riak KV provides today. (Scope and timing of such replacement TBD.) We believe this algorithm allows a Machi cluster to fragment into -arbitrary islands of network partition, all the way down to 100% of +arbitrary islands of network partition, all the way down to 100\% of members running in complete network isolation from each other. Furthermore, it provides enough agreement to allow formerly-partitioned members to coordinate the reintegration \& @@ -277,7 +276,7 @@ See Section~\ref{sec:humming-consensus} for detailed discussion. \section{The projection store} The Machi chain manager relies heavily on a key-value store of -write-once registers called the ``projection store''. +write-once registers called the ``projection store''. Each participating chain node has its own projection store. The store's key is a positive integer; the integer represents the epoch number of the projection. The @@ -483,26 +482,11 @@ changes may require retry logic and delay/sleep time intervals. \subsection{Projection storage: writing} \label{sub:proj-storage-writing} -All projection data structures are stored in the write-once Projection -Store that is run by each server. (See also \cite{machi-design}.) - -Writing the projection follows the two-step sequence below. -In cases of writing -failure at any stage, the process is aborted. The most common case is -{\tt error\_written}, which signifies that another actor in the system has -already calculated another (perhaps different) projection using the -same projection epoch number and that -read repair is necessary. Note that {\tt error\_written} may also -indicate that another actor has performed read repair on the exact -projection value that the local actor is trying to write! - -\begin{enumerate} -\item Write $P_{new}$ to the local projection store. This will trigger - ``wedge'' status in the local server, which will then cascade to other - projection-related behavior within the server. -\item Write $P_{new}$ to the remote projection store of {\tt all\_members}. - Some members may be unavailable, but that is OK. 
-\end{enumerate} +Individual replicas of the projections written to participating +projection stores are not managed by Chain Replication --- if they +were, we would have a circular dependency! See +Section~\ref{sub:proj-store-writing} for the technique for writing +projections to all participating servers' projection stores. \subsection{Adoption of new projections} @@ -515,8 +499,64 @@ may/may not trigger a calculation of a new projection $P_{p+1}$ which may eventually be stored and so resolve $P_p$'s replicas' ambiguity. +\section{Humming consensus's management of multiple projection store} + +Individual replicas of the projections written to participating +projection stores are not managed by Chain Replication. + +An independent replica management technique very similar to the style +used by both Riak Core \cite{riak-core} and Dynamo is used. +The major difference is +that successful return status from (minimum) a quorum of participants +{\em is not required}. + +\subsection{Read repair: repair only when unwritten} + +The idea of ``read repair'' is also shared with Riak Core and Dynamo +systems. However, there is a case read repair cannot truly ``fix'' a +key because two different values have been written by two different +replicas. + +Machi's projection store is write-once, and there is no ``undo'' or +``delete'' or ``overwrite'' in the projection store API. It doesn't +matter what caused the two different values. In case of multiple +values, all participants in humming consensus merely agree that there +were multiple opinions at that epoch which must be resolved by the +creation and writing of newer projections with later epoch numbers. + +Machi's projection store read repair can only repair values that are +unwritten, i.e., storing $\emptyset$. + +\subsection{Projection storage: writing} +\label{sub:proj-store-writing} + +All projection data structures are stored in the write-once Projection +Store that is run by each server. (See also \cite{machi-design}.) + +Writing the projection follows the two-step sequence below. + +\begin{enumerate} +\item Write $P_{new}$ to the local projection store. (As a side + effect, + this will trigger + ``wedge'' status in the local server, which will then cascade to other + projection-related behavior within that server.) +\item Write $P_{new}$ to the remote projection store of {\tt all\_members}. + Some members may be unavailable, but that is OK. +\end{enumerate} + +In cases of {\tt error\_written} status, +the process may be aborted and read repair +triggered. The most common reason for {\tt error\_written} status +is that another actor in the system has +already calculated another (perhaps different) projection using the +same projection epoch number and that +read repair is necessary. Note that {\tt error\_written} may also +indicate that another actor has performed read repair on the exact +projection value that the local actor is trying to write! + \section{Reading from the projection store} -\label{sub:proj-storage-reading} +\label{sec:proj-store-reading} Reading data from the projection store is similar in principle to reading from a Chain Replication-managed server system. However, the @@ -563,6 +603,61 @@ unwritten replicas. If the value of $K$ is not unanimous, then the ``best value'' $V_{best}$ is used for the repair. If all respond with {\tt error\_unwritten}, repair is not required. 
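+
+As an illustrative sketch only, the read-with-repair procedure
+described above might look roughly as follows. The
+{\tt \{proj\_store, Node\}} registered process and its
+{\tt \{read, K\}} / {\tt \{write, K, V\}} messages are assumptions
+made for this sketch; they are not the real Machi projection store
+API.
+
+\begin{verbatim}
+%% Sketch only: read projection key (epoch) K from every member's
+%% public projection store, pick a "best" value, and repair only the
+%% replicas that are still unwritten.  Differing *written* values are
+%% never overwritten here; that conflict is left for a later epoch.
+-module(sketch_proj_read).
+-export([read_projection_with_repair/2]).
+
+read_projection_with_repair(Members, K) ->
+    Replies = [{M, proj_store_read(M, K)} || M <- Members],
+    Written = [V || {_M, {ok, V}} <- Replies],
+    case Written of
+        [] ->
+            error_unwritten;                 % all replicas are unwritten
+        _ ->
+            Best = pick_best(Written),
+            _ = [proj_store_write(M, K, Best)
+                 || {M, error_unwritten} <- Replies],
+            {ok, Best}
+    end.
+
+pick_best(Values) ->
+    %% Assumption: plain Erlang term order over the written projection
+    %% values is an acceptable deterministic tie-breaker for a sketch.
+    lists:last(lists:sort(Values)).
+
+proj_store_read(Member, K) ->
+    gen_server:call({proj_store, Member}, {read, K}).
+
+proj_store_write(Member, K, V) ->
+    gen_server:call({proj_store, Member}, {write, K, V}).
+\end{verbatim}
+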
+\section{Humming Consensus} +\label{sec:humming-consensus} + +Sources for background information include: + +\begin{itemize} +\item ``On Consensus and Humming in the IETF'' \cite{rfc-7282}, for +background on the use of humming during meetings of the IETF. + +\item ``On `Humming Consensus', an allegory'' \cite{humming-consensus-allegory}, +for an allegory in the style (?) of Leslie Lamport's original Paxos +paper. +\end{itemize} + + +\subsection{Summary of humming consensus} + +"Humming consensus" describes +consensus that is derived only from data that is visible/known at the current +time. This implies that a network partition may be in effect and that +not all chain members are reachable. The algorithm will calculate +an approximate consensus despite not having input from all/majority +of chain members. Humming consensus may proceed to make a +decision based on data from only a single participant, i.e., only the local +node. + +\begin{itemize} + +\item When operating in AP mode, i.e., in eventual consistency mode, humming +consensus may reconfigure chain of length $N$ into $N$ +independent chains of length 1. When a network partition heals, the +humming consensus is sufficient to manage the chain so that each +replica's data can be repaired/merged/reconciled safely. +Other features of the Machi system are designed to assist such +repair safely. + +\item When operating in CP mode, i.e., in strong consistency mode, humming +consensus would require additional restrictions. For example, any +chain that didn't have a minimum length of the quorum majority size of +all members would be invalid and therefore would not move itself out +of wedged state. In very general terms, this requirement for a quorum +majority of surviving participants is also a requirement for Paxos, +Raft, and ZAB. See Section~\ref{sec:split-brain-management} for a +proposal to handle ``split brain'' scenarios while in CPU mode. + +\end{itemize} + +If a decision is made during epoch $E$, humming consensus will +eventually discover if other participants have made a different +decision during epoch $E$. When a differing decision is discovered, +newer \& later time epochs are defined by creating new projections +with epochs numbered by $E+\delta$ (where $\delta > 0$). +The distribution of the $E+\delta$ projections will bring all visible +participants into the new epoch $E+delta$ and then into consensus. + \section{Just in case Humming Consensus doesn't work for us} There are some unanswered questions about Machi's proposed chain @@ -597,6 +692,12 @@ include: Using Elastic Replication (Section~\ref{ssec:elastic-replication}) is our preferred alternative, if Humming Consensus is not usable. +Elastic chain replication is a technique described in +\cite{elastic-chain-replication}. It describes using multiple chains +to monitor each other, as arranged in a ring where a chain at position +$x$ is responsible for chain configuration and management of the chain +at position $x+1$. + \subsection{Alternative: Hibari's ``Admin Server'' and Elastic Chain Replication} @@ -618,56 +719,358 @@ chain management agent, the ``Admin Server''. In brief: \end{itemize} -Elastic chain replication is a technique described in -\cite{elastic-chain-replication}. It describes using multiple chains -to monitor each other, as arranged in a ring where a chain at position -$x$ is responsible for chain configuration and management of the chain -at position $x+1$. 
This technique is likely the fall-back to be used -in case the chain management method described in this RFC proves -infeasible. +\section{``Split brain'' management in CP Mode} +\label{sec:split-brain-management} -\section{Humming Consensus} -\label{sec:humming-consensus} +Split brain management is a thorny problem. The method presented here +is one based on pragmatics. If it doesn't work, there isn't a serious +worry, because Machi's first serious use case all require only AP Mode. +If we end up falling back to ``use Riak Ensemble'' or ``use ZooKeeper'', +then perhaps that's +fine enough. Meanwhile, let's explore how a +completely self-contained, no-external-dependencies +CP Mode Machi might work. -Sources for background information include: +Wikipedia's description of the quorum consensus solution\footnote{See + {\tt http://en.wikipedia.org/wiki/Split-brain\_(computing)}.} is nice +and short: + +\begin{quotation} +A typical approach, as described by Coulouris et al.,[4] is to use a +quorum-consensus approach. This allows the sub-partition with a +majority of the votes to remain available, while the remaining +sub-partitions should fall down to an auto-fencing mode. +\end{quotation} + +This is the same basic technique that +both Riak Ensemble and ZooKeeper use. Machi's +extensive use of write-registers are a big advantage when implementing +this technique. Also very useful is the Machi ``wedge'' mechanism, +which can automatically implement the ``auto-fencing'' that the +technique requires. All Machi servers that can communicate with only +a minority of other servers will automatically ``wedge'' themselves +and refuse all requests for service until communication with the +majority can be re-established. + +\subsection{The quorum: witness servers vs. full servers} + +In any quorum-consensus system, at least $2f+1$ participants are +required to survive $f$ participant failures. Machi can implement a +technique of ``witness servers'' servers to bring the total cost +somewhere in the middle, between $2f+1$ and $f+1$, depending on your +point of view. + +A ``witness server'' is one that participates in the network protocol +but does not store or manage all of the state that a ``full server'' +does. A ``full server'' is a Machi server as +described by this RFC document. A ``witness server'' is a server that +only participates in the projection store and projection epoch +transition protocol and a small subset of the file access API. +A witness server doesn't actually store any +Machi files. A witness server is almost stateless, when compared to a +full Machi server. + +A mixed cluster of witness and full servers must still contain at +least $2f+1$ participants. However, only $f+1$ of them are full +participants, and the remaining $f$ participants are witnesses. In +such a cluster, any majority quorum must have at least one full server +participant. + +Witness FLUs are always placed at the front of the chain. As stated +above, there may be at most $f$ witness FLUs. A functioning quorum +majority +must have at least $f+1$ FLUs that can communicate and therefore +calculate and store a new unanimous projection. Therefore, any FLU at +the tail of a functioning quorum majority chain must be full FLU. 
Full FLUs +actually store Machi files, so they have no problem answering {\tt + read\_req} API requests.\footnote{We hope that it is now clear that + a witness FLU cannot answer any Machi file read API request.} + +Any FLU that can only communicate with a minority of other FLUs will +find that none can calculate a new projection that includes a +majority of FLUs. Any such FLU, when in CP mode, would then move to +wedge state and remain wedged until the network partition heals enough +to communicate with the majority side. This is a nice property: we +automatically get ``fencing'' behavior.\footnote{Any FLU on the minority side + is wedged and therefore refuses to serve because it is, so to speak, + ``on the wrong side of the fence.''} + +There is one case where ``fencing'' may not happen: if both the client +and the tail FLU are on the same minority side of a network partition. +Assume the client and FLU $F_z$ are on the "wrong side" of a network +split; both are using projection epoch $P_1$. The tail of the +chain is $F_z$. + +Also assume that the "right side" has reconfigured and is using +projection epoch $P_2$. The right side has mutated key $K$. Meanwhile, +nobody on the "right side" has noticed anything wrong and is happy to +continue using projection $P_1$. \begin{itemize} -\item ``On Consensus and Humming in the IETF'' \cite{rfc-7282}, for -background on the use of humming during meetings of the IETF. - -\item ``On `Humming Consensus', an allegory'' \cite{humming-consensus-allegory}, -for an allegory in the style (?) of Leslie Lamport's original Paxos -paper +\item {\bf Option a}: Now the wrong side client reads $K$ using $P_1$ via + $F_z$. $F_z$ does not detect an epoch problem and thus returns an + answer. Given our assumptions, this value is stale. For some + client use cases, this kind of staleness may be OK in trade for + fewer network messages per read \ldots so Machi may + have a configurable option to permit it. +\item {\bf Option b}: The wrong side client must confirm that $P_1$ is + in use by a full majority of chain members, including $F_z$. \end{itemize} +Attempts using Option b will fail for one of two reasons. First, if +the client can talk to a FLU that is using $P_2$, the client's +operation must be retried using $P_2$. Second, the client will time +out talking to enough FLUs so that it fails to get a quorum's worth of +$P_1$ answers. In either case, Option B will always fail a client +read and thus cannot return a stale value of $K$. -"Humming consensus" describes -consensus that is derived only from data that is visible/known at the current -time. This implies that a network partition may be in effect and that -not all chain members are reachable. The algorithm will calculate -an approximate consensus despite not having input from all/majority -of chain members. Humming consensus may proceed to make a -decision based on data from only a single participant, i.e., only the local -node. +\subsection{Witness FLU data and protocol changes} -When operating in AP mode, i.e., in eventual consistency mode, humming -consensus may reconfigure chain of length $N$ into $N$ -independent chains of length 1. When a network partition heals, the -humming consensus is sufficient to manage the chain so that each -replica's data can be repaired/merged/reconciled safely. -Other features of the Machi system are designed to assist such -repair safely. +Some small changes to the projection's data structure +are required (relative to the initial spec described in +\cite{machi-design}). 
The projection itself +needs new annotation to indicate the operating mode, AP mode or CP +mode. The state type notifies the chain manager how to +react in network partitions and how to calculate new, safe projection +transitions and which file repair mode to use +(Section~\ref{sec:repair-entire-files}). +Also, we need to label member FLU servers as full- or +witness-type servers. -When operating in CP mode, i.e., in strong consistency mode, humming -consensus would require additional restrictions. For example, any -chain that didn't have a minimum length of the quorum majority size of -all members would be invalid and therefore would not move itself out -of wedged state. In very general terms, this requirement for a quorum -majority of surviving participants is also a requirement for Paxos, -Raft, and ZAB. \footnote{The Machi RFC also proposes using -``witness'' chain members to -make service more available, e.g. quorum majority of ``real'' plus -``witness'' nodes {\bf and} at least one member must be a ``real'' node.} +Write API requests are processed by witness servers in {\em almost but + not quite} no-op fashion. The only requirement of a witness server +is to return correct interpretations of local projection epoch +numbers, via the {\tt error\_bad\_epoch} and {\tt error\_wedged} error +codes. In fact, a new API call is sufficient for querying witness +servers: {\tt \{check\_epoch, m\_epoch()\}}. +Any client write operation sends the {\tt + check\_\-epoch} API command to witness FLUs and sends the usual {\tt + write\_\-req} command to full FLUs. + +\section{The safety of projection epoch transitions} +\label{sec:safety-of-transitions} + +Machi uses the projection epoch transition algorithm and +implementation from CORFU, which is believed to be safe. However, +CORFU assumes a single, external, strongly consistent projection +store. Further, CORFU assumes that new projections are calculated by +an oracle that the rest of the CORFU system agrees is the sole agent +for creating new projections. Such an assumption is impractical for +Machi's intended purpose. + +Machi could use Riak Ensemble or ZooKeeper as an oracle (or perhaps as a oracle +coordinator), but we wish to keep Machi free of big external +dependencies. We would also like to see Machi be able to +operate in an ``AP mode'', which means providing service even +if all network communication to an oracle is broken. + +The model of projection calculation and storage described in +Section~\ref{sec:projections} allows for each server to operate +independently, if necessary. This autonomy allows the server in AP +mode to +always accept new writes: new writes are written to unique file names +and unique file offsets using a chain consisting of only a single FLU, +if necessary. How is this possible? Let's look at a scenario in +Section~\ref{sub:split-brain-scenario}. + +\subsection{A split brain scenario} +\label{sub:split-brain-scenario} + +\begin{enumerate} + +\item Assume 3 Machi FLUs, all in good health and perfect data sync: $[F_a, + F_b, F_c]$ using projection epoch $P_p$. + +\item Assume data $D_0$ is written at offset $O_0$ in Machi file + $F_0$. + +\item Then a network partition happens. Servers $F_a$ and $F_b$ are + on one side of the split, and server $F_c$ is on the other side of + the split. We'll call them the ``left side'' and ``right side'', + respectively. + +\item On the left side, $F_b$ calculates a new projection and writes + it unanimously (to two projection stores) as epoch $P_B+1$. 
The + subscript $_B$ denotes a + version of projection epoch $P_{p+1}$ that was created by server $F_B$ + and has a unique checksum (used to detect differences after the + network partition heals). + +\item In parallel, on the right side, $F_c$ calculates a new + projection and writes it unanimously (to a single projection store) + as epoch $P_c+1$. + +\item In parallel, a client on the left side writes data $D_1$ + at offset $O_1$ in Machi file $F_1$, and also + a client on the right side writes data $D_2$ + at offset $O_2$ in Machi file $F_2$. We know that $F_1 \ne F_2$ + because each sequencer is forced to choose disjoint filenames from + any prior epoch whenever a new projection is available. + +\end{enumerate} + +Now, what happens when various clients attempt to read data values +$D_0$, $D_1$, and $D_2$? + +\begin{itemize} +\item All clients can read $D_0$. +\item Clients on the left side can read $D_1$. +\item Attempts by clients on the right side to read $D_1$ will get + {\tt error\_unavailable}. +\item Clients on the right side can read $D_2$. +\item Attempts by clients on the left side to read $D_2$ will get + {\tt error\_unavailable}. +\end{itemize} + +The {\tt error\_unavailable} result is not an error in the CAP Theorem +sense: it is a valid and affirmative response. In both cases, the +system on the client's side definitely knows that the cluster is +partitioned. If Machi were not a write-once store, perhaps there +might be an old/stale value to read on the local side of the network +partition \ldots but the system also knows definitely that no +old/stale value exists. Therefore Machi remains available in the +CAP Theorem sense both for writes and reads. + +We know that all files $F_0$, +$F_1$, and $F_2$ are disjoint and can be merged (in a manner analogous +to set union) onto each server in $[F_a, F_b, F_c]$ safely +when the network partition is healed. However, +unlike pure theoretical set union, Machi's data merge \& repair +operations must operate within some constraints that are designed to +prevent data loss. + +\subsection{Aside: defining data availability and data loss} +\label{sub:define-availability} + +Let's take a moment to be clear about definitions: + +\begin{itemize} +\item ``data is available at time $T$'' means that data is available + for reading at $T$: the Machi cluster knows for certain that the + requested data is not been written or it is written and has a single + value. +\item ``data is unavailable at time $T$'' means that data is + unavailable for reading at $T$ due to temporary circumstances, + e.g. network partition. If a read request is issued at some time + after $T$, the data will be available. +\item ``data is lost at time $T$'' means that data is permanently + unavailable at $T$ and also all times after $T$. +\end{itemize} + +Chain Replication is a fantastic technique for managing the +consistency of data across a number of whole replicas. There are, +however, cases where CR can indeed lose data. + +\subsection{Data loss scenario \#1: too few servers} +\label{sub:data-loss1} + +If the chain is $N$ servers long, and if all $N$ servers fail, then +of course data is unavailable. However, if all $N$ fail +permanently, then data is lost. + +If the administrator had intended to avoid data loss after $N$ +failures, then the administrator would have provisioned a Machi +cluster with at least $N+1$ servers. 
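
To make the three definitions above concrete, the following sketch
(illustrative only: the {\tt Ms}, {\tt Down}, and {\tt Dead} sets and
the function name are invented for this example, not part of Machi's
API) classifies the fate of a datum replicated on a chain. Loss
requires that {\em every} replica is gone permanently; a mere
partition yields unavailability.

\begin{verbatim}
%% Sketch only, not Machi code.  Ms = full chain membership,
%% Down = temporarily unreachable servers, Dead = permanently
%% failed servers.
classify(Ms, Down, Dead) ->
    Alive = [M || M <- Ms, not lists:member(M, Down ++ Dead)],
    case {Alive, Ms -- Dead} of
        {[_|_], _} -> available;   %% some replica can still serve
        {[],  []}  -> lost;        %% all members failed permanently
        {[],  _}   -> unavailable  %% may heal, e.g. after a partition
    end.
\end{verbatim}

With a chain of length $N=3$ and all three members in {\tt Dead}, the
sketch returns {\tt lost}, which restates the provisioning point
above: surviving $N$ permanent failures requires at least $N+1$
replicas.
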
+ +\subsection{Data Loss scenario \#2: bogus configuration change sequence} +\label{sub:data-loss2} + +Assume that the sequence of events in Figure~\ref{fig:data-loss2} takes place. + +\begin{figure} +\begin{enumerate} +%% NOTE: the following list 9 items long. We use that fact later, see +%% string YYY9 in a comment further below. If the length of this list +%% changes, then the counter reset below needs adjustment. +\item Projection $P_p$ says that chain membership is $[F_a]$. +\item A write of data $D$ to file $F$ at offset $O$ is successful. +\item Projection $P_{p+1}$ says that chain membership is $[F_a,F_b]$, via + an administration API request. +\item Machi will trigger repair operations, copying any missing data + files from FLU $F_a$ to FLU $F_b$. For the purpose of this + example, the sync operation for file $F$'s data and metadata has + not yet started. +\item FLU $F_a$ crashes. +\item The chain manager on $F_b$ notices $F_a$'s crash, + decides to create a new projection $P_{p+2}$ where chain membership is + $[F_b]$ + successfully stores $P_{p+2}$ in its local store. FLU $F_b$ is now wedged. +\item FLU $F_a$ is down, therefore the + value of $P_{p+2}$ is unanimous for all currently available FLUs + (namely $[F_b]$). +\item FLU $F_b$ sees that projection $P_{p+2}$ is the newest unanimous + projection. It unwedges itself and continues operation using $P_{p+2}$. +\item Data $D$ is definitely unavailable for now, perhaps lost forever? +\end{enumerate} +\caption{Data unavailability scenario with danger of permanent data loss} +\label{fig:data-loss2} +\end{figure} + +At this point, the data $D$ is not available on $F_b$. However, if +we assume that $F_a$ eventually returns to service, and Machi +correctly acts to repair all data within its chain, then $D$ +all of its contents will be available eventually. + +However, if server $F_a$ never returns to service, then $D$ is lost. The +Machi administration API must always warn the user that data loss is +possible. In Figure~\ref{fig:data-loss2}'s scenario, the API must +warn the administrator in multiple ways that fewer than the full {\tt + length(all\_members)} number of replicas are in full sync. + +A careful reader should note that $D$ is also lost if step \#5 were +instead, ``The hardware that runs FLU $F_a$ was destroyed by fire.'' +For any possible step following \#5, $D$ is lost. This is data loss +for the same reason that the scenario of Section~\ref{sub:data-loss1} +happens: the administrator has not provisioned a sufficient number of +replicas. + +Let's revisit Figure~\ref{fig:data-loss2}'s scenario yet again. This +time, we add a final step at the end of the sequence: + +\begin{enumerate} +\setcounter{enumi}{9} % YYY9 +\item The administration API is used to change the chain +configuration to {\tt all\_members=$[F_b]$}. +\end{enumerate} + +Step \#10 causes data loss. Specifically, the only copy of file +$F$ is on FLU $F_a$. By administration policy, FLU $F_a$ is now +permanently inaccessible. + +The chain manager {\em must} keep track of all +repair operations and their status. If such information is tracked by +all FLUs, then the data loss by bogus administrator action can be +prevented. In this scenario, FLU $F_b$ knows that `$F_a \rightarrow +F_b$` repair has not yet finished and therefore it is unsafe to remove +$F_a$ from the cluster. 
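
One plausible reading of the ``must keep track of all repair
operations'' requirement is a simple administrative guard like the
sketch below. It is illustrative only: the {\tt Repairs} list of
{\tt \{Source, Destination, Status\}} tuples and the function name are
assumptions made for this example, not the chain manager's actual
state.

\begin{verbatim}
%% Sketch only.  Refuse to drop a server from the chain while any
%% unfinished repair still depends on it as the data source.
safe_to_remove(Server, Repairs) ->
    not lists:any(fun({Src, _Dst, Status}) ->
                     Src =:= Server andalso Status =/= finished
                  end, Repairs).
\end{verbatim}

In the scenario above, $F_b$ would hold something like
{\tt Repairs = [\{f\_a, f\_b, in\_progress\}]}, so
{\tt safe\_to\_remove(f\_a, Repairs)} returns {\tt false}, and the
step \#10 membership change would be rejected instead of silently
discarding the only copy of file $F$.
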
+ +\subsection{Data Loss scenario \#3: chain replication repair done badly} +\label{sub:data-loss3} + +It's quite possible to lose data through careless/buggy Chain +Replication chain configuration changes. For example, in the split +brain scenario of Section~\ref{sub:split-brain-scenario}, we have two +pieces of data written to different ``sides'' of the split brain, +$D_0$ and $D_1$. If the chain is naively reconfigured after the network +partition heals to be $[F_a=\emptyset,F_b=\emptyset,F_c=D_1],$ then $D_1$ +is in danger of being lost. Why? +The Update Propagation Invariant is violated. +Any Chain Replication read will be +directed to the tail, $F_c$. The value exists there, so there is no +need to do any further work; the unwritten values at $F_a$ and $F_b$ +will not be repaired. If the $F_c$ server fails sometime +later, then $D_1$ will be lost. The ``Chain Replication Repair'' +section of \cite{machi-design} discusses +how data loss can be avoided after servers are added (or re-added) to +an active chain configuration. + +\subsection{Summary} + +We believe that maintaining the Update Propagation Invariant is a +hassle anda pain, but that hassle and pain are well worth the +sacrifices required to maintain the invariant at all times. It avoids +data loss in all cases where the U.P.~Invariant preserving chain +contains at least one FLU. \section{Chain Replication: proof of correctness} \label{sec:cr-proof} @@ -681,7 +1084,7 @@ Readers interested in good karma should read the entire paper. ``Update Propagation Invariant'' is the original chain replication paper's name for the -$H_i \succeq H_j$ +$H_i \succeq H_j$ property mentioned in Figure~\ref{tab:chain-order}. This paper will use the same name. This property may also be referred to by its acronym, ``UPI''. @@ -1144,363 +1547,17 @@ maintain. Then for any two FLUs that claim to store a file $F$, if both FLUs have the same hash of $F$'s written map + checksums, then the copies of $F$ on both FLUs are the same. -\section{``Split brain'' management in CP Mode} -\label{sec:split-brain-management} - -Split brain management is a thorny problem. The method presented here -is one based on pragmatics. If it doesn't work, there isn't a serious -worry, because Machi's first serious use case all require only AP Mode. -If we end up falling back to ``use Riak Ensemble'' or ``use ZooKeeper'', -then perhaps that's -fine enough. Meanwhile, let's explore how a -completely self-contained, no-external-dependencies -CP Mode Machi might work. - -Wikipedia's description of the quorum consensus solution\footnote{See - {\tt http://en.wikipedia.org/wiki/Split-brain\_(computing)}.} is nice -and short: - -\begin{quotation} -A typical approach, as described by Coulouris et al.,[4] is to use a -quorum-consensus approach. This allows the sub-partition with a -majority of the votes to remain available, while the remaining -sub-partitions should fall down to an auto-fencing mode. -\end{quotation} - -This is the same basic technique that -both Riak Ensemble and ZooKeeper use. Machi's -extensive use of write-registers are a big advantage when implementing -this technique. Also very useful is the Machi ``wedge'' mechanism, -which can automatically implement the ``auto-fencing'' that the -technique requires. All Machi servers that can communicate with only -a minority of other servers will automatically ``wedge'' themselves -and refuse all requests for service until communication with the -majority can be re-established. - -\subsection{The quorum: witness servers vs. 
full servers} - -In any quorum-consensus system, at least $2f+1$ participants are -required to survive $f$ participant failures. Machi can implement a -technique of ``witness servers'' servers to bring the total cost -somewhere in the middle, between $2f+1$ and $f+1$, depending on your -point of view. - -A ``witness server'' is one that participates in the network protocol -but does not store or manage all of the state that a ``full server'' -does. A ``full server'' is a Machi server as -described by this RFC document. A ``witness server'' is a server that -only participates in the projection store and projection epoch -transition protocol and a small subset of the file access API. -A witness server doesn't actually store any -Machi files. A witness server is almost stateless, when compared to a -full Machi server. - -A mixed cluster of witness and full servers must still contain at -least $2f+1$ participants. However, only $f+1$ of them are full -participants, and the remaining $f$ participants are witnesses. In -such a cluster, any majority quorum must have at least one full server -participant. - -Witness FLUs are always placed at the front of the chain. As stated -above, there may be at most $f$ witness FLUs. A functioning quorum -majority -must have at least $f+1$ FLUs that can communicate and therefore -calculate and store a new unanimous projection. Therefore, any FLU at -the tail of a functioning quorum majority chain must be full FLU. Full FLUs -actually store Machi files, so they have no problem answering {\tt - read\_req} API requests.\footnote{We hope that it is now clear that - a witness FLU cannot answer any Machi file read API request.} - -Any FLU that can only communicate with a minority of other FLUs will -find that none can calculate a new projection that includes a -majority of FLUs. Any such FLU, when in CP mode, would then move to -wedge state and remain wedged until the network partition heals enough -to communicate with the majority side. This is a nice property: we -automatically get ``fencing'' behavior.\footnote{Any FLU on the minority side - is wedged and therefore refuses to serve because it is, so to speak, - ``on the wrong side of the fence.''} - -There is one case where ``fencing'' may not happen: if both the client -and the tail FLU are on the same minority side of a network partition. -Assume the client and FLU $F_z$ are on the "wrong side" of a network -split; both are using projection epoch $P_1$. The tail of the -chain is $F_z$. - -Also assume that the "right side" has reconfigured and is using -projection epoch $P_2$. The right side has mutated key $K$. Meanwhile, -nobody on the "right side" has noticed anything wrong and is happy to -continue using projection $P_1$. - -\begin{itemize} -\item {\bf Option a}: Now the wrong side client reads $K$ using $P_1$ via - $F_z$. $F_z$ does not detect an epoch problem and thus returns an - answer. Given our assumptions, this value is stale. For some - client use cases, this kind of staleness may be OK in trade for - fewer network messages per read \ldots so Machi may - have a configurable option to permit it. -\item {\bf Option b}: The wrong side client must confirm that $P_1$ is - in use by a full majority of chain members, including $F_z$. -\end{itemize} - -Attempts using Option b will fail for one of two reasons. First, if -the client can talk to a FLU that is using $P_2$, the client's -operation must be retried using $P_2$. 
Second, the client will time -out talking to enough FLUs so that it fails to get a quorum's worth of -$P_1$ answers. In either case, Option B will always fail a client -read and thus cannot return a stale value of $K$. - -\subsection{Witness FLU data and protocol changes} - -Some small changes to the projection's data structure -are required (relative to the initial spec described in -\cite{machi-design}). The projection itself -needs new annotation to indicate the operating mode, AP mode or CP -mode. The state type notifies the chain manager how to -react in network partitions and how to calculate new, safe projection -transitions and which file repair mode to use -(Section~\ref{sec:repair-entire-files}). -Also, we need to label member FLU servers as full- or -witness-type servers. - -Write API requests are processed by witness servers in {\em almost but - not quite} no-op fashion. The only requirement of a witness server -is to return correct interpretations of local projection epoch -numbers, via the {\tt error\_bad\_epoch} and {\tt error\_wedged} error -codes. In fact, a new API call is sufficient for querying witness -servers: {\tt \{check\_epoch, m\_epoch()\}}. -Any client write operation sends the {\tt - check\_\-epoch} API command to witness FLUs and sends the usual {\tt - write\_\-req} command to full FLUs. - -\section{The safety of projection epoch transitions} -\label{sec:safety-of-transitions} - -Machi uses the projection epoch transition algorithm and -implementation from CORFU, which is believed to be safe. However, -CORFU assumes a single, external, strongly consistent projection -store. Further, CORFU assumes that new projections are calculated by -an oracle that the rest of the CORFU system agrees is the sole agent -for creating new projections. Such an assumption is impractical for -Machi's intended purpose. - -Machi could use Riak Ensemble or ZooKeeper as an oracle (or perhaps as a oracle -coordinator), but we wish to keep Machi free of big external -dependencies. We would also like to see Machi be able to -operate in an ``AP mode'', which means providing service even -if all network communication to an oracle is broken. - -The model of projection calculation and storage described in -Section~\ref{sec:projections} allows for each server to operate -independently, if necessary. This autonomy allows the server in AP -mode to -always accept new writes: new writes are written to unique file names -and unique file offsets using a chain consisting of only a single FLU, -if necessary. How is this possible? Let's look at a scenario in -Section~\ref{sub:split-brain-scenario}. - -\subsection{A split brain scenario} -\label{sub:split-brain-scenario} - -\begin{enumerate} - -\item Assume 3 Machi FLUs, all in good health and perfect data sync: $[F_a, - F_b, F_c]$ using projection epoch $P_p$. - -\item Assume data $D_0$ is written at offset $O_0$ in Machi file - $F_0$. - -\item Then a network partition happens. Servers $F_a$ and $F_b$ are - on one side of the split, and server $F_c$ is on the other side of - the split. We'll call them the ``left side'' and ``right side'', - respectively. - -\item On the left side, $F_b$ calculates a new projection and writes - it unanimously (to two projection stores) as epoch $P_B+1$. The - subscript $_B$ denotes a - version of projection epoch $P_{p+1}$ that was created by server $F_B$ - and has a unique checksum (used to detect differences after the - network partition heals). 
- -\item In parallel, on the right side, $F_c$ calculates a new - projection and writes it unanimously (to a single projection store) - as epoch $P_c+1$. - -\item In parallel, a client on the left side writes data $D_1$ - at offset $O_1$ in Machi file $F_1$, and also - a client on the right side writes data $D_2$ - at offset $O_2$ in Machi file $F_2$. We know that $F_1 \ne F_2$ - because each sequencer is forced to choose disjoint filenames from - any prior epoch whenever a new projection is available. - -\end{enumerate} - -Now, what happens when various clients attempt to read data values -$D_0$, $D_1$, and $D_2$? - -\begin{itemize} -\item All clients can read $D_0$. -\item Clients on the left side can read $D_1$. -\item Attempts by clients on the right side to read $D_1$ will get - {\tt error\_unavailable}. -\item Clients on the right side can read $D_2$. -\item Attempts by clients on the left side to read $D_2$ will get - {\tt error\_unavailable}. -\end{itemize} - -The {\tt error\_unavailable} result is not an error in the CAP Theorem -sense: it is a valid and affirmative response. In both cases, the -system on the client's side definitely knows that the cluster is -partitioned. If Machi were not a write-once store, perhaps there -might be an old/stale value to read on the local side of the network -partition \ldots but the system also knows definitely that no -old/stale value exists. Therefore Machi remains available in the -CAP Theorem sense both for writes and reads. - -We know that all files $F_0$, -$F_1$, and $F_2$ are disjoint and can be merged (in a manner analogous -to set union) onto each server in $[F_a, F_b, F_c]$ safely -when the network partition is healed. However, -unlike pure theoretical set union, Machi's data merge \& repair -operations must operate within some constraints that are designed to -prevent data loss. - -\subsection{Aside: defining data availability and data loss} -\label{sub:define-availability} - -Let's take a moment to be clear about definitions: - -\begin{itemize} -\item ``data is available at time $T$'' means that data is available - for reading at $T$: the Machi cluster knows for certain that the - requested data is not been written or it is written and has a single - value. -\item ``data is unavailable at time $T$'' means that data is - unavailable for reading at $T$ due to temporary circumstances, - e.g. network partition. If a read request is issued at some time - after $T$, the data will be available. -\item ``data is lost at time $T$'' means that data is permanently - unavailable at $T$ and also all times after $T$. -\end{itemize} - -Chain Replication is a fantastic technique for managing the -consistency of data across a number of whole replicas. There are, -however, cases where CR can indeed lose data. - -\subsection{Data loss scenario \#1: too few servers} -\label{sub:data-loss1} - -If the chain is $N$ servers long, and if all $N$ servers fail, then -of course data is unavailable. However, if all $N$ fail -permanently, then data is lost. - -If the administrator had intended to avoid data loss after $N$ -failures, then the administrator would have provisioned a Machi -cluster with at least $N+1$ servers. - -\subsection{Data Loss scenario \#2: bogus configuration change sequence} -\label{sub:data-loss2} - -Assume that the sequence of events in Figure~\ref{fig:data-loss2} takes place. - -\begin{figure} -\begin{enumerate} -%% NOTE: the following list 9 items long. We use that fact later, see -%% string YYY9 in a comment further below. 
If the length of this list -%% changes, then the counter reset below needs adjustment. -\item Projection $P_p$ says that chain membership is $[F_a]$. -\item A write of data $D$ to file $F$ at offset $O$ is successful. -\item Projection $P_{p+1}$ says that chain membership is $[F_a,F_b]$, via - an administration API request. -\item Machi will trigger repair operations, copying any missing data - files from FLU $F_a$ to FLU $F_b$. For the purpose of this - example, the sync operation for file $F$'s data and metadata has - not yet started. -\item FLU $F_a$ crashes. -\item The chain manager on $F_b$ notices $F_a$'s crash, - decides to create a new projection $P_{p+2}$ where chain membership is - $[F_b]$ - successfully stores $P_{p+2}$ in its local store. FLU $F_b$ is now wedged. -\item FLU $F_a$ is down, therefore the - value of $P_{p+2}$ is unanimous for all currently available FLUs - (namely $[F_b]$). -\item FLU $F_b$ sees that projection $P_{p+2}$ is the newest unanimous - projection. It unwedges itself and continues operation using $P_{p+2}$. -\item Data $D$ is definitely unavailable for now, perhaps lost forever? -\end{enumerate} -\caption{Data unavailability scenario with danger of permanent data loss} -\label{fig:data-loss2} -\end{figure} - -At this point, the data $D$ is not available on $F_b$. However, if -we assume that $F_a$ eventually returns to service, and Machi -correctly acts to repair all data within its chain, then $D$ -all of its contents will be available eventually. - -However, if server $F_a$ never returns to service, then $D$ is lost. The -Machi administration API must always warn the user that data loss is -possible. In Figure~\ref{fig:data-loss2}'s scenario, the API must -warn the administrator in multiple ways that fewer than the full {\tt - length(all\_members)} number of replicas are in full sync. - -A careful reader should note that $D$ is also lost if step \#5 were -instead, ``The hardware that runs FLU $F_a$ was destroyed by fire.'' -For any possible step following \#5, $D$ is lost. This is data loss -for the same reason that the scenario of Section~\ref{sub:data-loss1} -happens: the administrator has not provisioned a sufficient number of -replicas. - -Let's revisit Figure~\ref{fig:data-loss2}'s scenario yet again. This -time, we add a final step at the end of the sequence: - -\begin{enumerate} -\setcounter{enumi}{9} % YYY9 -\item The administration API is used to change the chain -configuration to {\tt all\_members=$[F_b]$}. -\end{enumerate} - -Step \#10 causes data loss. Specifically, the only copy of file -$F$ is on FLU $F_a$. By administration policy, FLU $F_a$ is now -permanently inaccessible. - -The chain manager {\em must} keep track of all -repair operations and their status. If such information is tracked by -all FLUs, then the data loss by bogus administrator action can be -prevented. In this scenario, FLU $F_b$ knows that `$F_a \rightarrow -F_b$` repair has not yet finished and therefore it is unsafe to remove -$F_a$ from the cluster. - -\subsection{Data Loss scenario \#3: chain replication repair done badly} -\label{sub:data-loss3} - -It's quite possible to lose data through careless/buggy Chain -Replication chain configuration changes. For example, in the split -brain scenario of Section~\ref{sub:split-brain-scenario}, we have two -pieces of data written to different ``sides'' of the split brain, -$D_0$ and $D_1$. 
If the chain is naively reconfigured after the network -partition heals to be $[F_a=\emptyset,F_b=\emptyset,F_c=D_1],$ then $D_1$ -is in danger of being lost. Why? -The Update Propagation Invariant is violated. -Any Chain Replication read will be -directed to the tail, $F_c$. The value exists there, so there is no -need to do any further work; the unwritten values at $F_a$ and $F_b$ -will not be repaired. If the $F_c$ server fails sometime -later, then $D_1$ will be lost. The ``Chain Replication Repair'' -section of \cite{machi-design} discusses -how data loss can be avoided after servers are added (or re-added) to -an active chain configuration. - -\subsection{Summary} - -We believe that maintaining the Update Propagation Invariant is a -hassle anda pain, but that hassle and pain are well worth the -sacrifices required to maintain the invariant at all times. It avoids -data loss in all cases where the U.P.~Invariant preserving chain -contains at least one FLU. - \bibliographystyle{abbrvnat} \begin{thebibliography}{} \softraggedright +\bibitem{riak-core} +Klophaus, Rusty. +"Riak Core." +ACM SIGPLAN Commercial Users of Functional Programming (CUFP'10), 2010. +{\tt http://dl.acm.org/citation.cfm?id=1900176} and +{\tt https://github.com/basho/riak\_core} + \bibitem{rfc-7282} RFC 7282: On Consensus and Humming in the IETF. Internet Engineering Task Force. From d90d11ae7d58534e9775aa4143f49ae2c32ad982 Mon Sep 17 00:00:00 2001 From: Scott Lystig Fritchie Date: Mon, 20 Apr 2015 16:54:55 +0900 Subject: [PATCH 07/14] Cut out "The safety of epoch transitions" section (commentary follows) I don't want to cut this section, because the points that it makes are important ... but those points aren't a good fit for the purposes of this document. If someone needs some examples of why badly managed chain replication can lose data, this is the section to look in. ^_^ --- doc/src.high-level/high-level-chain-mgr.tex | 225 -------------------- 1 file changed, 225 deletions(-) diff --git a/doc/src.high-level/high-level-chain-mgr.tex b/doc/src.high-level/high-level-chain-mgr.tex index 37771ac..f858d23 100644 --- a/doc/src.high-level/high-level-chain-mgr.tex +++ b/doc/src.high-level/high-level-chain-mgr.tex @@ -847,231 +847,6 @@ Any client write operation sends the {\tt check\_\-epoch} API command to witness FLUs and sends the usual {\tt write\_\-req} command to full FLUs. -\section{The safety of projection epoch transitions} -\label{sec:safety-of-transitions} - -Machi uses the projection epoch transition algorithm and -implementation from CORFU, which is believed to be safe. However, -CORFU assumes a single, external, strongly consistent projection -store. Further, CORFU assumes that new projections are calculated by -an oracle that the rest of the CORFU system agrees is the sole agent -for creating new projections. Such an assumption is impractical for -Machi's intended purpose. - -Machi could use Riak Ensemble or ZooKeeper as an oracle (or perhaps as a oracle -coordinator), but we wish to keep Machi free of big external -dependencies. We would also like to see Machi be able to -operate in an ``AP mode'', which means providing service even -if all network communication to an oracle is broken. - -The model of projection calculation and storage described in -Section~\ref{sec:projections} allows for each server to operate -independently, if necessary. 
This autonomy allows the server in AP -mode to -always accept new writes: new writes are written to unique file names -and unique file offsets using a chain consisting of only a single FLU, -if necessary. How is this possible? Let's look at a scenario in -Section~\ref{sub:split-brain-scenario}. - -\subsection{A split brain scenario} -\label{sub:split-brain-scenario} - -\begin{enumerate} - -\item Assume 3 Machi FLUs, all in good health and perfect data sync: $[F_a, - F_b, F_c]$ using projection epoch $P_p$. - -\item Assume data $D_0$ is written at offset $O_0$ in Machi file - $F_0$. - -\item Then a network partition happens. Servers $F_a$ and $F_b$ are - on one side of the split, and server $F_c$ is on the other side of - the split. We'll call them the ``left side'' and ``right side'', - respectively. - -\item On the left side, $F_b$ calculates a new projection and writes - it unanimously (to two projection stores) as epoch $P_B+1$. The - subscript $_B$ denotes a - version of projection epoch $P_{p+1}$ that was created by server $F_B$ - and has a unique checksum (used to detect differences after the - network partition heals). - -\item In parallel, on the right side, $F_c$ calculates a new - projection and writes it unanimously (to a single projection store) - as epoch $P_c+1$. - -\item In parallel, a client on the left side writes data $D_1$ - at offset $O_1$ in Machi file $F_1$, and also - a client on the right side writes data $D_2$ - at offset $O_2$ in Machi file $F_2$. We know that $F_1 \ne F_2$ - because each sequencer is forced to choose disjoint filenames from - any prior epoch whenever a new projection is available. - -\end{enumerate} - -Now, what happens when various clients attempt to read data values -$D_0$, $D_1$, and $D_2$? - -\begin{itemize} -\item All clients can read $D_0$. -\item Clients on the left side can read $D_1$. -\item Attempts by clients on the right side to read $D_1$ will get - {\tt error\_unavailable}. -\item Clients on the right side can read $D_2$. -\item Attempts by clients on the left side to read $D_2$ will get - {\tt error\_unavailable}. -\end{itemize} - -The {\tt error\_unavailable} result is not an error in the CAP Theorem -sense: it is a valid and affirmative response. In both cases, the -system on the client's side definitely knows that the cluster is -partitioned. If Machi were not a write-once store, perhaps there -might be an old/stale value to read on the local side of the network -partition \ldots but the system also knows definitely that no -old/stale value exists. Therefore Machi remains available in the -CAP Theorem sense both for writes and reads. - -We know that all files $F_0$, -$F_1$, and $F_2$ are disjoint and can be merged (in a manner analogous -to set union) onto each server in $[F_a, F_b, F_c]$ safely -when the network partition is healed. However, -unlike pure theoretical set union, Machi's data merge \& repair -operations must operate within some constraints that are designed to -prevent data loss. - -\subsection{Aside: defining data availability and data loss} -\label{sub:define-availability} - -Let's take a moment to be clear about definitions: - -\begin{itemize} -\item ``data is available at time $T$'' means that data is available - for reading at $T$: the Machi cluster knows for certain that the - requested data is not been written or it is written and has a single - value. -\item ``data is unavailable at time $T$'' means that data is - unavailable for reading at $T$ due to temporary circumstances, - e.g. network partition. 
If a read request is issued at some time - after $T$, the data will be available. -\item ``data is lost at time $T$'' means that data is permanently - unavailable at $T$ and also all times after $T$. -\end{itemize} - -Chain Replication is a fantastic technique for managing the -consistency of data across a number of whole replicas. There are, -however, cases where CR can indeed lose data. - -\subsection{Data loss scenario \#1: too few servers} -\label{sub:data-loss1} - -If the chain is $N$ servers long, and if all $N$ servers fail, then -of course data is unavailable. However, if all $N$ fail -permanently, then data is lost. - -If the administrator had intended to avoid data loss after $N$ -failures, then the administrator would have provisioned a Machi -cluster with at least $N+1$ servers. - -\subsection{Data Loss scenario \#2: bogus configuration change sequence} -\label{sub:data-loss2} - -Assume that the sequence of events in Figure~\ref{fig:data-loss2} takes place. - -\begin{figure} -\begin{enumerate} -%% NOTE: the following list 9 items long. We use that fact later, see -%% string YYY9 in a comment further below. If the length of this list -%% changes, then the counter reset below needs adjustment. -\item Projection $P_p$ says that chain membership is $[F_a]$. -\item A write of data $D$ to file $F$ at offset $O$ is successful. -\item Projection $P_{p+1}$ says that chain membership is $[F_a,F_b]$, via - an administration API request. -\item Machi will trigger repair operations, copying any missing data - files from FLU $F_a$ to FLU $F_b$. For the purpose of this - example, the sync operation for file $F$'s data and metadata has - not yet started. -\item FLU $F_a$ crashes. -\item The chain manager on $F_b$ notices $F_a$'s crash, - decides to create a new projection $P_{p+2}$ where chain membership is - $[F_b]$ - successfully stores $P_{p+2}$ in its local store. FLU $F_b$ is now wedged. -\item FLU $F_a$ is down, therefore the - value of $P_{p+2}$ is unanimous for all currently available FLUs - (namely $[F_b]$). -\item FLU $F_b$ sees that projection $P_{p+2}$ is the newest unanimous - projection. It unwedges itself and continues operation using $P_{p+2}$. -\item Data $D$ is definitely unavailable for now, perhaps lost forever? -\end{enumerate} -\caption{Data unavailability scenario with danger of permanent data loss} -\label{fig:data-loss2} -\end{figure} - -At this point, the data $D$ is not available on $F_b$. However, if -we assume that $F_a$ eventually returns to service, and Machi -correctly acts to repair all data within its chain, then $D$ -all of its contents will be available eventually. - -However, if server $F_a$ never returns to service, then $D$ is lost. The -Machi administration API must always warn the user that data loss is -possible. In Figure~\ref{fig:data-loss2}'s scenario, the API must -warn the administrator in multiple ways that fewer than the full {\tt - length(all\_members)} number of replicas are in full sync. - -A careful reader should note that $D$ is also lost if step \#5 were -instead, ``The hardware that runs FLU $F_a$ was destroyed by fire.'' -For any possible step following \#5, $D$ is lost. This is data loss -for the same reason that the scenario of Section~\ref{sub:data-loss1} -happens: the administrator has not provisioned a sufficient number of -replicas. - -Let's revisit Figure~\ref{fig:data-loss2}'s scenario yet again. 
This -time, we add a final step at the end of the sequence: - -\begin{enumerate} -\setcounter{enumi}{9} % YYY9 -\item The administration API is used to change the chain -configuration to {\tt all\_members=$[F_b]$}. -\end{enumerate} - -Step \#10 causes data loss. Specifically, the only copy of file -$F$ is on FLU $F_a$. By administration policy, FLU $F_a$ is now -permanently inaccessible. - -The chain manager {\em must} keep track of all -repair operations and their status. If such information is tracked by -all FLUs, then the data loss by bogus administrator action can be -prevented. In this scenario, FLU $F_b$ knows that `$F_a \rightarrow -F_b$` repair has not yet finished and therefore it is unsafe to remove -$F_a$ from the cluster. - -\subsection{Data Loss scenario \#3: chain replication repair done badly} -\label{sub:data-loss3} - -It's quite possible to lose data through careless/buggy Chain -Replication chain configuration changes. For example, in the split -brain scenario of Section~\ref{sub:split-brain-scenario}, we have two -pieces of data written to different ``sides'' of the split brain, -$D_0$ and $D_1$. If the chain is naively reconfigured after the network -partition heals to be $[F_a=\emptyset,F_b=\emptyset,F_c=D_1],$ then $D_1$ -is in danger of being lost. Why? -The Update Propagation Invariant is violated. -Any Chain Replication read will be -directed to the tail, $F_c$. The value exists there, so there is no -need to do any further work; the unwritten values at $F_a$ and $F_b$ -will not be repaired. If the $F_c$ server fails sometime -later, then $D_1$ will be lost. The ``Chain Replication Repair'' -section of \cite{machi-design} discusses -how data loss can be avoided after servers are added (or re-added) to -an active chain configuration. - -\subsection{Summary} - -We believe that maintaining the Update Propagation Invariant is a -hassle anda pain, but that hassle and pain are well worth the -sacrifices required to maintain the invariant at all times. It avoids -data loss in all cases where the U.P.~Invariant preserving chain -contains at least one FLU. - \section{Chain Replication: proof of correctness} \label{sec:cr-proof} From 7badb93f9afabcf47cc1a3432216f6dc651e8766 Mon Sep 17 00:00:00 2001 From: Scott Lystig Fritchie Date: Mon, 20 Apr 2015 17:16:04 +0900 Subject: [PATCH 08/14] WIP: more restructuring --- doc/src.high-level/high-level-chain-mgr.tex | 156 ++++++++++++-------- 1 file changed, 91 insertions(+), 65 deletions(-) diff --git a/doc/src.high-level/high-level-chain-mgr.tex b/doc/src.high-level/high-level-chain-mgr.tex index f858d23..6460c61 100644 --- a/doc/src.high-level/high-level-chain-mgr.tex +++ b/doc/src.high-level/high-level-chain-mgr.tex @@ -776,30 +776,30 @@ participants, and the remaining $f$ participants are witnesses. In such a cluster, any majority quorum must have at least one full server participant. -Witness FLUs are always placed at the front of the chain. As stated -above, there may be at most $f$ witness FLUs. A functioning quorum +Witness servers are always placed at the front of the chain. As stated +above, there may be at most $f$ witness servers. A functioning quorum majority -must have at least $f+1$ FLUs that can communicate and therefore -calculate and store a new unanimous projection. Therefore, any FLU at -the tail of a functioning quorum majority chain must be full FLU. Full FLUs +must have at least $f+1$ servers that can communicate and therefore +calculate and store a new unanimous projection. 
Therefore, any server at +the tail of a functioning quorum majority chain must be full server. Full servers actually store Machi files, so they have no problem answering {\tt read\_req} API requests.\footnote{We hope that it is now clear that - a witness FLU cannot answer any Machi file read API request.} + a witness server cannot answer any Machi file read API request.} -Any FLU that can only communicate with a minority of other FLUs will +Any server that can only communicate with a minority of other servers will find that none can calculate a new projection that includes a -majority of FLUs. Any such FLU, when in CP mode, would then move to +majority of servers. Any such server, when in CP mode, would then move to wedge state and remain wedged until the network partition heals enough to communicate with the majority side. This is a nice property: we -automatically get ``fencing'' behavior.\footnote{Any FLU on the minority side +automatically get ``fencing'' behavior.\footnote{Any server on the minority side is wedged and therefore refuses to serve because it is, so to speak, ``on the wrong side of the fence.''} There is one case where ``fencing'' may not happen: if both the client -and the tail FLU are on the same minority side of a network partition. -Assume the client and FLU $F_z$ are on the "wrong side" of a network +and the tail server are on the same minority side of a network partition. +Assume the client and server $S_z$ are on the "wrong side" of a network split; both are using projection epoch $P_1$. The tail of the -chain is $F_z$. +chain is $S_z$. Also assume that the "right side" has reconfigured and is using projection epoch $P_2$. The right side has mutated key $K$. Meanwhile, @@ -808,23 +808,23 @@ continue using projection $P_1$. \begin{itemize} \item {\bf Option a}: Now the wrong side client reads $K$ using $P_1$ via - $F_z$. $F_z$ does not detect an epoch problem and thus returns an + $S_z$. $S_z$ does not detect an epoch problem and thus returns an answer. Given our assumptions, this value is stale. For some client use cases, this kind of staleness may be OK in trade for fewer network messages per read \ldots so Machi may have a configurable option to permit it. \item {\bf Option b}: The wrong side client must confirm that $P_1$ is - in use by a full majority of chain members, including $F_z$. + in use by a full majority of chain members, including $S_z$. \end{itemize} Attempts using Option b will fail for one of two reasons. First, if -the client can talk to a FLU that is using $P_2$, the client's +the client can talk to a server that is using $P_2$, the client's operation must be retried using $P_2$. Second, the client will time -out talking to enough FLUs so that it fails to get a quorum's worth of +out talking to enough servers so that it fails to get a quorum's worth of $P_1$ answers. In either case, Option B will always fail a client read and thus cannot return a stale value of $K$. -\subsection{Witness FLU data and protocol changes} +\subsection{Witness server data and protocol changes} Some small changes to the projection's data structure are required (relative to the initial spec described in @@ -834,7 +834,7 @@ mode. The state type notifies the chain manager how to react in network partitions and how to calculate new, safe projection transitions and which file repair mode to use (Section~\ref{sec:repair-entire-files}). -Also, we need to label member FLU servers as full- or +Also, we need to label member server servers as full- or witness-type servers. 
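
A rough sketch of what those annotations might look like is shown
below. The field and type names are invented for illustration only;
the actual projection record may differ.

\begin{verbatim}
%% Sketch only: hypothetical projection fields, not the real record.
-type consistency_mode() :: ap_mode | cp_mode.

-record(projection,
        {epoch       :: non_neg_integer(),
         mode        :: consistency_mode(), %% AP or CP behavior
         all_members :: [atom()],  %% every configured server
         witnesses   :: [atom()],  %% members storing no file data
         upi         :: [atom()],  %% active, fully repaired chain
         repairing   :: [atom()]   %% members still under repair
        }).
\end{verbatim}
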
Write API requests are processed by witness servers in {\em almost but @@ -844,8 +844,8 @@ numbers, via the {\tt error\_bad\_epoch} and {\tt error\_wedged} error codes. In fact, a new API call is sufficient for querying witness servers: {\tt \{check\_epoch, m\_epoch()\}}. Any client write operation sends the {\tt - check\_\-epoch} API command to witness FLUs and sends the usual {\tt - write\_\-req} command to full FLUs. + check\_\-epoch} API command to witness servers and sends the usual {\tt + write\_\-req} command to full servers. \section{Chain Replication: proof of correctness} \label{sec:cr-proof} @@ -989,7 +989,7 @@ violations is small: any client that witnesses a written $\rightarrow$ unwritten transition is a violation of strong consistency. But avoiding even this one bad scenario is a bit tricky. -As explained in Section~\ref{sub:data-loss1}, data +Data unavailability/loss when all chain servers fail is unavoidable. We wish to avoid data loss whenever a chain has at least one surviving server. Another method to avoid data loss is to preserve the Update @@ -1045,12 +1045,12 @@ Figure~\ref{fig:data-loss2}.) \begin{figure} \begin{enumerate} -\item Write value $V$ to offset $O$ in the log with chain $[F_a]$. +\item Write value $V$ to offset $O$ in the log with chain $[S_a]$. This write is considered successful. -\item Change projection to configure chain as $[F_a,F_b]$. Prior to - the change, all values on FLU $F_b$ are unwritten. -\item FLU server $F_a$ crashes. The new projection defines the chain - as $[F_b]$. +\item Change projection to configure chain as $[S_a,S_b]$. Prior to + the change, all values on FLU $S_b$ are unwritten. +\item FLU server $S_a$ crashes. The new projection defines the chain + as $[S_b]$. \item A client attempts to read offset $O$ and finds an unwritten value. This is a strong consistency violation. %% \item The same client decides to fill $O$ with the junk value @@ -1061,16 +1061,43 @@ Figure~\ref{fig:data-loss2}.) \label{fig:corfu-repair-sc-violation} \end{figure} +\begin{figure} +\begin{enumerate} +\item Projection $P_p$ says that chain membership is $[S_a]$. +\item A write of data $D$ to file $F$ at offset $O$ is successful. +\item Projection $P_{p+1}$ says that chain membership is $[S_a,S_b]$, via + an administration API request. +\item Machi will trigger repair operations, copying any missing data + files from FLU $S_a$ to FLU $S_b$. For the purpose of this + example, the sync operation for file $F$'s data and metadata has + not yet started. +\item FLU $S_a$ crashes. +\item The chain manager on $S_b$ notices $S_a$'s crash, + decides to create a new projection $P_{p+2}$ where chain membership is + $[S_b]$ + successfully stores $P_{p+2}$ in its local store. FLU $S_b$ is now wedged. +\item FLU $S_a$ is down, therefore the + value of $P_{p+2}$ is unanimous for all currently available FLUs + (namely $[S_b]$). +\item FLU $S_b$ sees that projection $P_{p+2}$ is the newest unanimous + projection. It unwedges itself and continues operation using $P_{p+2}$. +\item Data $D$ is definitely unavailable for now. If server $S_a$ is + never re-added to the chain, then data $D$ is lost forever. +\end{enumerate} +\caption{Data unavailability scenario with danger of permanent data loss} +\label{fig:data-loss2} +\end{figure} + A variation of the repair algorithm is presented in section~2.5 of a later CORFU paper \cite{corfu2}. However, the re-use a failed server is not discussed there, either: the example of a failed server -$F_6$ uses a new server, $F_8$ to replace $F_6$. 
Furthermore, the +$S_6$ uses a new server, $S_8$ to replace $S_6$. Furthermore, the repair process is described as: \begin{quote} -``Once $F_6$ is completely rebuilt on $F_8$ (by copying entries from - $F_7$), the system moves to projection (C), where $F_8$ is now used +``Once $S_6$ is completely rebuilt on $S_8$ (by copying entries from + $S_7$), the system moves to projection (C), where $S_8$ is now used to service all reads in the range $[40K,80K)$.'' \end{quote} @@ -1089,16 +1116,6 @@ vulnerability is eliminated.\footnote{SLF's note: Probably? This is my \subsection{Whole-file repair as FLUs are (re-)added to a chain} \label{sub:repair-add-to-chain} -Machi's repair process must preserve the Update Propagation -Invariant. To avoid data races with data copying from -``U.P.~Invariant preserving'' servers (i.e. fully repaired with -respect to the Update Propagation Invariant) -to servers of unreliable/unknown state, a -projection like the one shown in -Figure~\ref{fig:repair-chain-of-chains} is used. In addition, the -operations rules for data writes and reads must be observed in a -projection of this type. - \begin{figure*} \centering $ @@ -1118,6 +1135,16 @@ $ \label{fig:repair-chain-of-chains} \end{figure*} +Machi's repair process must preserve the Update Propagation +Invariant. To avoid data races with data copying from +``U.P.~Invariant preserving'' servers (i.e. fully repaired with +respect to the Update Propagation Invariant) +to servers of unreliable/unknown state, a +projection like the one shown in +Figure~\ref{fig:repair-chain-of-chains} is used. In addition, the +operations rules for data writes and reads must be observed in a +projection of this type. + \begin{itemize} \item The system maintains the distinction between ``U.P.~preserving'' @@ -1200,22 +1227,6 @@ algorithm proposed is: \end{enumerate} -\begin{figure} -\centering -$ -[\overbrace{\underbrace{H_1}_\textbf{Head}, M_{11}, T_1, - H_2, M_{21}, T_2, - \ldots - H_n, M_{n1}, - \underbrace{T_n}_\textbf{Tail}}^\textbf{Chain (U.P.~Invariant preserving)} -] -$ -\caption{Representation of Figure~\ref{fig:repair-chain-of-chains} - after all repairs have finished successfully and a new projection has - been calculated.} -\label{fig:repair-chain-of-chains-finished} -\end{figure} - When the repair is known to have copied all missing data successfully, then the chain can change state via a new projection that includes the repaired FLU(s) at the end of the U.P.~Invariant preserving chain \#1 @@ -1231,7 +1242,6 @@ step \#1 will force any new data writes to adapt to a new projection. 
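
The phrase ``force any new data writes to adapt to a new projection''
is, in essence, the epoch check already described for
{\tt error\_bad\_epoch}. A minimal sketch, with invented function and
variable names, follows; it is not the actual server code.

\begin{verbatim}
%% Sketch only.  A write tagged with an epoch that differs from the
%% server's current epoch is refused; the client must then fetch the
%% newest projection and retry under it.
handle_write(ClientEpoch, MyEpoch, _Op) when ClientEpoch =/= MyEpoch ->
    {error_bad_epoch, MyEpoch};
handle_write(_ClientEpoch, _MyEpoch, Op) ->
    Op().
\end{verbatim}

Because the post-repair projection carries a larger epoch number, any
writer still using the old projection is pushed onto the new one
before it can mutate the chain.
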
Consider the mutations that either happen before or after a projection change: - \begin{itemize} \item For all mutations $M_1$ prior to the projection change, the @@ -1250,6 +1260,22 @@ change: \end{itemize} +\begin{figure} +\centering +$ +[\overbrace{\underbrace{H_1}_\textbf{Head}, M_{11}, T_1, + H_2, M_{21}, T_2, + \ldots + H_n, M_{n1}, + \underbrace{T_n}_\textbf{Tail}}^\textbf{Chain (U.P.~Invariant preserving)} +] +$ +\caption{Representation of Figure~\ref{fig:repair-chain-of-chains} + after all repairs have finished successfully and a new projection has + been calculated.} +\label{fig:repair-chain-of-chains-finished} +\end{figure} + \subsubsection{Cluster in AP Mode} In cases the cluster is operating in AP Mode: @@ -1266,7 +1292,7 @@ In cases the cluster is operating in AP Mode: The end result is a huge ``merge'' where any {\tt \{FName, $O_{start}, O_{end}$\}} range of bytes that is written -on FLU $F_w$ but missing/unwritten from FLU $F_m$ is written down the full chain +on FLU $S_w$ but missing/unwritten from FLU $S_m$ is written down the full chain of chains, skipping any FLUs where the data is known to be written. Such writes will also preserve Update Propagation Invariant when repair is finished. @@ -1277,14 +1303,14 @@ repair is finished. Changing FLU order within a chain is an operations optimization only. It may be that the administrator wishes the order of a chain to remain as originally configured during steady-state operation, e.g., -$[F_a,F_b,F_c]$. As FLUs are stopped \& restarted, the chain may +$[S_a,S_b,S_c]$. As FLUs are stopped \& restarted, the chain may become re-ordered in a seemingly-arbitrary manner. It is certainly possible to re-order the chain, in a kludgy manner. -For example, if the desired order is $[F_a,F_b,F_c]$ but the current -operating order is $[F_c,F_b,F_a]$, then remove $F_b$ from the chain, -then add $F_b$ to the end of the chain. Then repeat the same -procedure for $F_c$. The end result will be the desired order. +For example, if the desired order is $[S_a,S_b,S_c]$ but the current +operating order is $[S_c,S_b,S_a]$, then remove $S_b$ from the chain, +then add $S_b$ to the end of the chain. Then repeat the same +procedure for $S_c$. The end result will be the desired order. From an operations perspective, re-ordering of the chain using this kludgy manner has a @@ -1318,9 +1344,9 @@ file offset 1 is written. It may be advantageous for each FLU to maintain for each file a checksum of a canonical representation of the {\tt \{$O_{start},O_{end},$ CSum\}} tuples that the FLU must already -maintain. Then for any two FLUs that claim to store a file $F$, if -both FLUs have the same hash of $F$'s written map + checksums, then -the copies of $F$ on both FLUs are the same. +maintain. Then for any two FLUs that claim to store a file $S$, if +both FLUs have the same hash of $S$'s written map + checksums, then +the copies of $S$ on both FLUs are the same. 
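
A sketch of that canonical-representation checksum appears below. It
is illustrative only and assumes {\tt Ranges} is the list of
{\tt \{$O_{start},O_{end},$ CSum\}} tuples that each FLU already
maintains.

\begin{verbatim}
%% Sketch only.  Sorting the tuples yields a canonical order even
%% when the bytes themselves were written out of temporal order.
written_map_hash(Ranges) ->
    crypto:hash(sha256, term_to_binary(lists:sort(Ranges))).

same_file_contents(RangesA, RangesB) ->
    written_map_hash(RangesA) =:= written_map_hash(RangesB).
\end{verbatim}

If two FLUs report the same hash for a file, a repair scan can skip
that file entirely; only when the hashes differ does a tuple-by-tuple
comparison need to run.
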
\bibliographystyle{abbrvnat} \begin{thebibliography}{} From 36ce2c75bdd0caab0d455ff046199c85a189ebcc Mon Sep 17 00:00:00 2001 From: Scott Lystig Fritchie Date: Mon, 20 Apr 2015 17:27:16 +0900 Subject: [PATCH 09/14] WIP: more restructuring --- doc/src.high-level/high-level-chain-mgr.tex | 319 +++++++++----------- 1 file changed, 137 insertions(+), 182 deletions(-) diff --git a/doc/src.high-level/high-level-chain-mgr.tex b/doc/src.high-level/high-level-chain-mgr.tex index 6460c61..32b8982 100644 --- a/doc/src.high-level/high-level-chain-mgr.tex +++ b/doc/src.high-level/high-level-chain-mgr.tex @@ -847,126 +847,6 @@ Any client write operation sends the {\tt check\_\-epoch} API command to witness servers and sends the usual {\tt write\_\-req} command to full servers. -\section{Chain Replication: proof of correctness} -\label{sec:cr-proof} - -See Section~3 of \cite{chain-replication} for a proof of the -correctness of Chain Replication. A short summary is provide here. -Readers interested in good karma should read the entire paper. - -\subsection{The Update Propagation Invariant} -\label{sub:upi} - -``Update Propagation Invariant'' is the original chain replication -paper's name for the -$H_i \succeq H_j$ -property mentioned in Figure~\ref{tab:chain-order}. -This paper will use the same name. -This property may also be referred to by its acronym, ``UPI''. - -\subsection{Chain Replication and strong consistency} - -The three basic rules of Chain Replication and its strong -consistency guarantee: - -\begin{enumerate} - -\item All replica servers are arranged in an ordered list $C$. - -\item All mutations of a datum are performed upon each replica of $C$ - strictly in the order which they appear in $C$. A mutation is considered - completely successful if the writes by all replicas are successful. - -\item The head of the chain makes the determination of the order of - all mutations to all members of the chain. If the head determines - that some mutation $M_i$ happened before another mutation $M_j$, - then mutation $M_i$ happens before $M_j$ on all other members of - the chain.\footnote{While necesary for general Chain Replication, - Machi does not need this property. Instead, the property is - provided by Machi's sequencer and the write-once register of each - byte in each file.} - -\item All read-only operations are performed by the ``tail'' replica, - i.e., the last replica in $C$. - -\end{enumerate} - -The basis of the proof lies in a simple logical trick, which is to -consider the history of all operations made to any server in the chain -as a literal list of unique symbols, one for each mutation. - -Each replica of a datum will have a mutation history list. We will -call this history list $H$. For the $i^{th}$ replica in the chain list -$C$, we call $H_i$ the mutation history list for the $i^{th}$ replica. - -Before the $i^{th}$ replica in the chain list begins service, its mutation -history $H_i$ is empty, $[]$. After this replica runs in a Chain -Replication system for a while, its mutation history list grows to -look something like -$[M_0, M_1, M_2, ..., M_{m-1}]$ where $m$ is the total number of -mutations of the datum that this server has processed successfully. - -Let's assume for a moment that all mutation operations have stopped. 
-If the order of the chain was constant, and if all mutations are -applied to each replica in the chain's order, then all replicas of a -datum will have the exact same mutation history: $H_i = H_J$ for any -two replicas $i$ and $j$ in the chain -(i.e., $\forall i,j \in C, H_i = H_J$). That's a lovely property, -but it is much more interesting to assume that the service is -not stopped. Let's look next at a running system. - -\begin{figure*} -\centering -\begin{tabular}{ccc} -{\bf {{On left side of $C$}}} & & {\bf On right side of $C$} \\ -\hline -\multicolumn{3}{l}{Looking at replica order in chain $C$:} \\ -$i$ & $<$ & $j$ \\ - -\multicolumn{3}{l}{For example:} \\ - -0 & $<$ & 2 \\ -\hline -\multicolumn{3}{l}{It {\em must} be true: history lengths per replica:} \\ -length($H_i$) & $\geq$ & length($H_j$) \\ -\multicolumn{3}{l}{For example, a quiescent chain:} \\ -length($H_i$) = 48 & $\geq$ & length($H_j$) = 48 \\ -\multicolumn{3}{l}{For example, a chain being mutated:} \\ -length($H_i$) = 55 & $\geq$ & length($H_j$) = 48 \\ -\multicolumn{3}{l}{Example ordered mutation sets:} \\ -$[M_0,M_1,\ldots,M_{46},M_{47},\ldots,M_{53},M_{54}]$ & $\supset$ & $[M_0,M_1,\ldots,M_{46},M_{47}]$ \\ -\multicolumn{3}{c}{\bf Therefore the right side is always an ordered - subset} \\ -\multicolumn{3}{c}{\bf of the left side. Furthermore, the ordered - sets on both} \\ -\multicolumn{3}{c}{\bf sides have the exact same order of those elements they have in common.} \\ -\multicolumn{3}{c}{The notation used by the Chain Replication paper is -shown below:} \\ -$[M_0,M_1,\ldots,M_{46},M_{47},\ldots,M_{53},M_{54}]$ & $\succeq$ & $[M_0,M_1,\ldots,M_{46},M_{47}]$ \\ - -\end{tabular} -\caption{A demonstration of Chain Replication protocol history ``Update Propagation Invariant''.} -\label{tab:chain-order} -\end{figure*} - -If the entire chain $C$ is processing any number of concurrent -mutations, then we can still understand $C$'s behavior. -Figure~\ref{tab:chain-order} shows us two replicas in chain $C$: -replica $R_i$ that's on the left/earlier side of the replica chain $C$ -than some other replica $R_j$. We know that $i$'s position index in -the chain is smaller than $j$'s position index, so therefore $i < j$. -The restrictions of Chain Replication make it true that length($H_i$) -$\ge$ length($H_j$) because it's also that $H_i \supset H_j$, i.e, -$H_i$ on the left is always is a superset of $H_j$ on the right. - -When considering $H_i$ and $H_j$ as strictly ordered lists, we have -$H_i \succeq H_j$, where the right side is always an exact prefix of the left -side's list. This prefixing propery is exactly what strong -consistency requires. If a value is read from the tail of the chain, -then no other chain member can have a prior/older value because their -respective mutations histories cannot be shorter than the tail -member's history. - \section{Repair of entire files} \label{sec:repair-entire-files} @@ -1113,7 +993,7 @@ vulnerability is eliminated.\footnote{SLF's note: Probably? This is my not safe} in Machi, I'm not 100\% certain anymore than this ``easy'' fix for CORFU is correct.}. 
-\subsection{Whole-file repair as FLUs are (re-)added to a chain} +\subsection{Whole file repair as servers are (re-)added to a chain} \label{sub:repair-add-to-chain} \begin{figure*} @@ -1135,6 +1015,22 @@ $ \label{fig:repair-chain-of-chains} \end{figure*} +\begin{figure} +\centering +$ +[\overbrace{\underbrace{H_1}_\textbf{Head}, M_{11}, T_1, + H_2, M_{21}, T_2, + \ldots + H_n, M_{n1}, + \underbrace{T_n}_\textbf{Tail}}^\textbf{Chain (U.P.~Invariant preserving)} +] +$ +\caption{Representation of Figure~\ref{fig:repair-chain-of-chains} + after all repairs have finished successfully and a new projection has + been calculated.} +\label{fig:repair-chain-of-chains-finished} +\end{figure} + Machi's repair process must preserve the Update Propagation Invariant. To avoid data races with data copying from ``U.P.~Invariant preserving'' servers (i.e. fully repaired with @@ -1175,7 +1071,7 @@ While the normal single-write and single-read operations are performed by the cluster, a file synchronization process is initiated. The sequence of steps differs depending on the AP or CP mode of the system. -\subsubsection{Cluster in CP mode} +\subsubsection{Repair in CP mode} In cases where the cluster is operating in CP Mode, CORFU's repair method of ``just copy it all'' (from source FLU to repairing @@ -1260,23 +1156,7 @@ change: \end{itemize} -\begin{figure} -\centering -$ -[\overbrace{\underbrace{H_1}_\textbf{Head}, M_{11}, T_1, - H_2, M_{21}, T_2, - \ldots - H_n, M_{n1}, - \underbrace{T_n}_\textbf{Tail}}^\textbf{Chain (U.P.~Invariant preserving)} -] -$ -\caption{Representation of Figure~\ref{fig:repair-chain-of-chains} - after all repairs have finished successfully and a new projection has - been calculated.} -\label{fig:repair-chain-of-chains-finished} -\end{figure} - -\subsubsection{Cluster in AP Mode} +\subsubsection{Repair in AP Mode} In cases the cluster is operating in AP Mode: @@ -1297,56 +1177,131 @@ of chains, skipping any FLUs where the data is known to be written. Such writes will also preserve Update Propagation Invariant when repair is finished. -\subsection{Whole-file repair when changing FLU ordering within a chain} +\subsection{Whole-file repair when changing server ordering within a chain} \label{sub:repair-chain-re-ordering} -Changing FLU order within a chain is an operations optimization only. -It may be that the administrator wishes the order of a chain to remain -as originally configured during steady-state operation, e.g., -$[S_a,S_b,S_c]$. As FLUs are stopped \& restarted, the chain may -become re-ordered in a seemingly-arbitrary manner. +This section has been cut --- please see Git commit history for discussion. -It is certainly possible to re-order the chain, in a kludgy manner. -For example, if the desired order is $[S_a,S_b,S_c]$ but the current -operating order is $[S_c,S_b,S_a]$, then remove $S_b$ from the chain, -then add $S_b$ to the end of the chain. Then repeat the same -procedure for $S_c$. The end result will be the desired order. +\section{Chain Replication: why is it correct?} +\label{sec:cr-proof} -From an operations perspective, re-ordering of the chain -using this kludgy manner has a -negative effect on availability: the chain is temporarily reduced from -operating with $N$ replicas down to $N-1$. This reduced replication -factor will not remain for long, at most a few minutes at a time, but -even a small amount of time may be unacceptable in some environments. +See Section~3 of \cite{chain-replication} for a proof of the +correctness of Chain Replication. 
A short summary is provide here. +Readers interested in good karma should read the entire paper. -Reordering is possible with the introduction of a ``temporary head'' -of the chain. This temporary FLU does not need to be a full replica -of the entire chain --- it merely needs to store replicas of mutations -that are made during the chain reordering process. This method will -not be described here. However, {\em if reviewers believe that it should -be included}, please let the authors know. +\subsection{The Update Propagation Invariant} +\label{sub:upi} -\subsubsection{In both Machi operating modes:} -After initial implementation, it may be that the repair procedure is a -bit too slow. In order to accelerate repair decisions, it would be -helpful have a quicker method to calculate which files have exactly -the same contents. In traditional systems, this is done with a single -file checksum; see also the ``checksum scrub'' subsection in -\cite{machi-design}. -Machi's files can be written out-of-order from a file offset point of -view, which violates the order which the traditional method for -calculating a full-file hash. If we recall out-of-temporal-order -example in the ``Append-only files'' section of \cite{machi-design}, -the traditional method cannot -continue calculating the file checksum at offset 2 until the byte at -file offset 1 is written. +``Update Propagation Invariant'' is the original chain replication +paper's name for the +$H_i \succeq H_j$ +property mentioned in Figure~\ref{tab:chain-order}. +This paper will use the same name. +This property may also be referred to by its acronym, ``UPI''. -It may be advantageous for each FLU to maintain for each file a -checksum of a canonical representation of the -{\tt \{$O_{start},O_{end},$ CSum\}} tuples that the FLU must already -maintain. Then for any two FLUs that claim to store a file $S$, if -both FLUs have the same hash of $S$'s written map + checksums, then -the copies of $S$ on both FLUs are the same. +\subsection{Chain Replication and strong consistency} + +The three basic rules of Chain Replication and its strong +consistency guarantee: + +\begin{enumerate} + +\item All replica servers are arranged in an ordered list $C$. + +\item All mutations of a datum are performed upon each replica of $C$ + strictly in the order which they appear in $C$. A mutation is considered + completely successful if the writes by all replicas are successful. + +\item The head of the chain makes the determination of the order of + all mutations to all members of the chain. If the head determines + that some mutation $M_i$ happened before another mutation $M_j$, + then mutation $M_i$ happens before $M_j$ on all other members of + the chain.\footnote{While necesary for general Chain Replication, + Machi does not need this property. Instead, the property is + provided by Machi's sequencer and the write-once register of each + byte in each file.} + +\item All read-only operations are performed by the ``tail'' replica, + i.e., the last replica in $C$. + +\end{enumerate} + +The basis of the proof lies in a simple logical trick, which is to +consider the history of all operations made to any server in the chain +as a literal list of unique symbols, one for each mutation. + +Each replica of a datum will have a mutation history list. We will +call this history list $H$. For the $i^{th}$ replica in the chain list +$C$, we call $H_i$ the mutation history list for the $i^{th}$ replica. 
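+
+The bookkeeping used by the proof can be made concrete with a toy
+simulation, sketched below.  This is not Machi code; it merely
+restates rules 1--3 above under the assumption that a replica's
+entire state is its mutation history list and that a mutation reaches
+replica $i+1$ only after it has reached replica $i$.
+
+{\footnotesize
+\begin{verbatim}
+-module(chain_history_sketch).
+-export([new_chain/1, apply_mutation/2, apply_mutation/3]).
+
+%% A chain is an ordered list of {Name, History} pairs, head first;
+%% every history starts out empty.
+new_chain(Names) ->
+    [{Name, []} || Name <- Names].
+
+%% Apply mutation M at every replica, strictly in chain order.
+apply_mutation(M, Chain) ->
+    apply_mutation(M, Chain, length(Chain)).
+
+%% Apply M to the first UpTo replicas only; an UpTo smaller than the
+%% chain length models a mutation that is still in flight and has not
+%% yet reached the tail.
+apply_mutation(M, Chain, UpTo) ->
+    {Reached, NotYet} = lists:split(UpTo, Chain),
+    [{Name, H ++ [M]} || {Name, H} <- Reached] ++ NotYet.
+\end{verbatim}
+}
+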
+ +Before the $i^{th}$ replica in the chain list begins service, its mutation +history $H_i$ is empty, $[]$. After this replica runs in a Chain +Replication system for a while, its mutation history list grows to +look something like +$[M_0, M_1, M_2, ..., M_{m-1}]$ where $m$ is the total number of +mutations of the datum that this server has processed successfully. + +Let's assume for a moment that all mutation operations have stopped. +If the order of the chain was constant, and if all mutations are +applied to each replica in the chain's order, then all replicas of a +datum will have the exact same mutation history: $H_i = H_J$ for any +two replicas $i$ and $j$ in the chain +(i.e., $\forall i,j \in C, H_i = H_J$). That's a lovely property, +but it is much more interesting to assume that the service is +not stopped. Let's look next at a running system. + +\begin{figure*} +\centering +\begin{tabular}{ccc} +{\bf {{On left side of $C$}}} & & {\bf On right side of $C$} \\ +\hline +\multicolumn{3}{l}{Looking at replica order in chain $C$:} \\ +$i$ & $<$ & $j$ \\ + +\multicolumn{3}{l}{For example:} \\ + +0 & $<$ & 2 \\ +\hline +\multicolumn{3}{l}{It {\em must} be true: history lengths per replica:} \\ +length($H_i$) & $\geq$ & length($H_j$) \\ +\multicolumn{3}{l}{For example, a quiescent chain:} \\ +length($H_i$) = 48 & $\geq$ & length($H_j$) = 48 \\ +\multicolumn{3}{l}{For example, a chain being mutated:} \\ +length($H_i$) = 55 & $\geq$ & length($H_j$) = 48 \\ +\multicolumn{3}{l}{Example ordered mutation sets:} \\ +$[M_0,M_1,\ldots,M_{46},M_{47},\ldots,M_{53},M_{54}]$ & $\supset$ & $[M_0,M_1,\ldots,M_{46},M_{47}]$ \\ +\multicolumn{3}{c}{\bf Therefore the right side is always an ordered + subset} \\ +\multicolumn{3}{c}{\bf of the left side. Furthermore, the ordered + sets on both} \\ +\multicolumn{3}{c}{\bf sides have the exact same order of those elements they have in common.} \\ +\multicolumn{3}{c}{The notation used by the Chain Replication paper is +shown below:} \\ +$[M_0,M_1,\ldots,M_{46},M_{47},\ldots,M_{53},M_{54}]$ & $\succeq$ & $[M_0,M_1,\ldots,M_{46},M_{47}]$ \\ + +\end{tabular} +\caption{The ``Update Propagation Invariant'' as + illustrated by Chain Replication protocol history.} +\label{tab:chain-order} +\end{figure*} + +If the entire chain $C$ is processing any number of concurrent +mutations, then we can still understand $C$'s behavior. +Figure~\ref{tab:chain-order} shows us two replicas in chain $C$: +replica $R_i$ that's on the left/earlier side of the replica chain $C$ +than some other replica $R_j$. We know that $i$'s position index in +the chain is smaller than $j$'s position index, so therefore $i < j$. +The restrictions of Chain Replication make it true that length($H_i$) +$\ge$ length($H_j$) because it's also that $H_i \supset H_j$, i.e, +$H_i$ on the left is always is a superset of $H_j$ on the right. + +When considering $H_i$ and $H_j$ as strictly ordered lists, we have +$H_i \succeq H_j$, where the right side is always an exact prefix of the left +side's list. This prefixing propery is exactly what strong +consistency requires. If a value is read from the tail of the chain, +then no other chain member can have a prior/older value because their +respective mutations histories cannot be shorter than the tail +member's history. 
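+
+Stated as executable pseudocode, the Update Propagation Invariant is
+just a prefix check over the history lists defined above.  The module
+and function names below are invented for illustration.
+
+{\footnotesize
+\begin{verbatim}
+-module(upi_check_sketch).
+-export([holds/1, is_prefix/2]).
+
+%% The UPI holds for a chain (given as its list of histories, ordered
+%% head..tail) if every later history is a prefix of every earlier one.
+holds([]) ->
+    true;
+holds([_H]) ->
+    true;
+holds([Hi, Hj | Rest]) ->
+    is_prefix(Hj, Hi) andalso holds([Hj | Rest]).
+
+%% is_prefix(Shorter, Longer): true if Shorter is an exact prefix of
+%% Longer, element for element and in the same order.
+is_prefix([], _Longer) ->
+    true;
+is_prefix([X | Xs], [X | Ys]) ->
+    is_prefix(Xs, Ys);
+is_prefix(_, _) ->
+    false.
+\end{verbatim}
+}
+
+Because the prefix relation is transitive, checking adjacent pairs in
+chain order is enough to cover every pair $i < j$.
+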
\bibliographystyle{abbrvnat} \begin{thebibliography}{} From cc6988ead6a48f6cf7493f42208dfe1d42ccc70c Mon Sep 17 00:00:00 2001 From: Scott Lystig Fritchie Date: Mon, 20 Apr 2015 18:38:32 +0900 Subject: [PATCH 10/14] WIP: more restructuring --- doc/chain-self-management-sketch.org | 11 +- doc/src.high-level/high-level-chain-mgr.tex | 308 ++++++++++++-------- 2 files changed, 193 insertions(+), 126 deletions(-) diff --git a/doc/chain-self-management-sketch.org b/doc/chain-self-management-sketch.org index ae950c7..dd11a35 100644 --- a/doc/chain-self-management-sketch.org +++ b/doc/chain-self-management-sketch.org @@ -36,8 +36,13 @@ the simulator. %% under the License. #+END_SRC - -* 3. Diagram of the self-management algorithm + +* 3. Document restructuring + +Much of the text previously appearing in this document has moved to the +[[high-level-chain-manager.pdf][Machi chain manager high level design]] document. + +* 4. Diagram of the self-management algorithm ** Introduction Refer to the diagram [[https://github.com/basho/machi/blob/master/doc/chain-self-management-sketch.Diagram1.pdf][chain-self-management-sketch.Diagram1.pdf]], @@ -252,7 +257,7 @@ use of quorum majority for UPI members is out of scope of this document. Also out of scope is the use of "witness servers" to augment the quorum majority UPI scheme.) -* 4. The Network Partition Simulator +* 5. The Network Partition Simulator ** Overview The function machi_chain_manager1_test:convergence_demo_test() executes the following in a simulated network environment within a diff --git a/doc/src.high-level/high-level-chain-mgr.tex b/doc/src.high-level/high-level-chain-mgr.tex index 32b8982..fc39855 100644 --- a/doc/src.high-level/high-level-chain-mgr.tex +++ b/doc/src.high-level/high-level-chain-mgr.tex @@ -264,8 +264,8 @@ This traditional definition differs from what is described here as "Humming consensus" describes consensus that is derived only from data that is visible/known at the current -time. This implies that a network partition may be in effect and that -not all chain members are reachable. The algorithm will calculate +time. +The algorithm will calculate an approximate consensus despite not having input from all/majority of chain members. Humming consensus may proceed to make a decision based on data from only a single participant, i.e., only the local @@ -281,7 +281,7 @@ Each participating chain node has its own projection store. The store's key is a positive integer; the integer represents the epoch number of the projection. The store's value is either the special `unwritten' value\footnote{We use - $\emptyset$ to denote the unwritten value.} or else an + $\bot$ to denote the unwritten value.} or else an application-specific binary blob that is immutable thereafter. The projection store is vital for the correct implementation of humming @@ -422,6 +422,98 @@ Humming consensus requires that any projection be identified by both the epoch number and the projection checksum, as described in Section~\ref{sub:the-projection}. +\section{Managing multiple projection stores} + +An independent replica management technique very similar to the style +used by both Riak Core \cite{riak-core} and Dynamo is used to manage +replicas of Machi's projection data structures. +The major difference is that humming consensus +{\em does not necessarily require} +successful return status from a minimum number of participants (e.g., +a quorum). 
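+
+As a sketch of that difference, the write path below merely tries the
+local store first and then every remote store, collecting whatever
+statuses come back; no minimum count of successful replies is
+demanded.  The module and function names are assumptions for
+illustration, not Machi's actual API, and the store access function
+is supplied by the caller.
+
+{\footnotesize
+\begin{verbatim}
+-module(proj_write_sketch).
+-export([write_projection/4]).
+
+%% WriteFun(Server, Epoch, Projection) is a caller-supplied function
+%% that returns ok, error_written, or {error, unavailable}.
+%% The local store is written first (in the real system this also
+%% wedges the local server); the remote writes are best effort.
+write_projection(WriteFun, LocalServer, RemoteServers, {Epoch, Proj}) ->
+    LocalStatus = WriteFun(LocalServer, Epoch, Proj),
+    RemoteStatuses = [{S, WriteFun(S, Epoch, Proj)} || S <- RemoteServers],
+    %% Unlike a quorum system, no minimum number of ok replies is
+    %% required here; a caller may choose to trigger read repair if
+    %% any reply was error_written.
+    {LocalStatus, RemoteStatuses}.
+\end{verbatim}
+}
+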
+ +\subsection{Read repair: repair only unwritten values} + +The idea of ``read repair'' is also shared with Riak Core and Dynamo +systems. However, Machi has situations where read repair cannot truly +``fix'' a key because two different values have been written by two +different replicas. +Machi's projection store is write-once, and there is no ``undo'' or +``delete'' or ``overwrite'' in the projection store API.\footnote{It doesn't +matter what caused the two different values. In case of multiple +values, all participants in humming consensus merely agree that there +were multiple opinions at that epoch which must be resolved by the +creation and writing of newer projections with later epoch numbers.} +Machi's projection store read repair can only repair values that are +unwritten, i.e., storing $\bot$. + +The value used to repair $\bot$ values is the ``best'' projection that +is currently available for the current epoch $E$. If there is a single, +unanimous value $V_{u}$ for the projection at epoch $E$, then $V_{u}$ +is use to repair all projections stores at $E$ that contain $\bot$ +values. If the value of $K$ is not unanimous, then the ``highest +ranked value'' $V_{best}$ is used for the repair; see +Section~\ref{sub:projection-ranking} for a summary of projection +ranking. + +Read repair may complete successfully regardless of availability of any of the +participants. This applies to both phases, reading and writing. + +\subsection{Projection storage: writing} +\label{sub:proj-store-writing} + +All projection data structures are stored in the write-once Projection +Store that is run by each server. (See also \cite{machi-design}.) + +Writing the projection follows the two-step sequence below. + +\begin{enumerate} +\item Write $P_{new}$ to the local projection store. (As a side + effect, + this will trigger + ``wedge'' status in the local server, which will then cascade to other + projection-related behavior within that server.) +\item Write $P_{new}$ to the remote projection store of {\tt all\_members}. + Some members may be unavailable, but that is OK. +\end{enumerate} + +In cases of {\tt error\_written} status, +the process may be aborted and read repair +triggered. The most common reason for {\tt error\_written} status +is that another actor in the system has +already calculated another (perhaps different) projection using the +same projection epoch number and that +read repair is necessary. Note that {\tt error\_written} may also +indicate that another server has performed read repair on the exact +projection $P_{new}$ that the local server is trying to write! + +The writing phase may complete successfully regardless of availability +of any of the participants. + +\subsection{Reading from the projection store} +\label{sub:proj-store-reading} + +Reading data from the projection store is similar in principle to +reading from a Chain Replication-managed server system. However, the +projection store does not use the strict replica ordering that +Chain Replication does. For any projection store key $K_n$, the +participating servers may have different values for $K_n$. As a +write-once store, it is impossible to mutate a replica of $K_n$. If +replicas of $K_n$ differ, then other parts of the system (projection +calculation and storage) are responsible for reconciling the +differences by writing a later key, +$K_{n+x}$ when $x>0$, with a new projection. + +The reading phase may complete successfully regardless of availability +of any of the participants. 
+The minimum number of replicas is only one: the local projection store +should always be available, even if no other remote replica projection +stores are available. +If all available servers return a single, unanimous value $V_u, V_u +\ne \bot$, then $V_u$ is the final result. Any non-unanimous value is +considered disagreement and is resolved by writes to the projection +store by the humming consensus algorithm. + \section{Phases of projection change} \label{sec:phases-of-projection-change} @@ -457,7 +549,7 @@ methods for determining status. Instead, hard Boolean up/down status decisions are required by the projection calculation phase (Section~\ref{subsub:projection-calculation}). -\subsection{Projection data structure calculation} +\subsection{Calculating new projection data structures} \label{subsub:projection-calculation} Each Machi server will have an independent agent/process that is @@ -474,134 +566,38 @@ Projection calculation will be a pure computation, based on input of: (Section~\ref{sub:network-monitoring}). \end{enumerate} -All decisions about {\em when} to calculate a projection must be made +Decisions about {\em when} to calculate a projection are made using additional runtime information. Administrative change requests probably should happen immediately. Change based on network status changes may require retry logic and delay/sleep time intervals. -\subsection{Projection storage: writing} +\subsection{Writing a new projection} \label{sub:proj-storage-writing} -Individual replicas of the projections written to participating -projection stores are not managed by Chain Replication --- if they +The replicas of Machi projection data that are used by humming consensus +are not managed by Chain Replication --- if they were, we would have a circular dependency! See Section~\ref{sub:proj-store-writing} for the technique for writing projections to all participating servers' projection stores. -\subsection{Adoption of new projections} +\subsection{Adoption a new projection} -The projection store's ``best value'' for the largest written epoch -number at the time of the read is projection used by the server. -If the read attempt for projection $P_p$ -also yields other non-best values, then the -projection calculation subsystem is notified. This notification -may/may not trigger a calculation of a new projection $P_{p+1}$ which -may eventually be stored and so -resolve $P_p$'s replicas' ambiguity. +A projection $P_{new}$ is used by a server only if: -\section{Humming consensus's management of multiple projection store} +\begin{itemize} +\item The server can determine that the projection has been replicated + unanimously across all currently available servers. +\item The change in state from the local server's current projection to new + projection, $P_{current} \rightarrow P_{new}$ will not cause data loss, + e.g., the Update Propagation Invariant and all other safety checks + required by chain repair in Section~\ref{sec:repair-entire-files} + are correct. +\end{itemize} -Individual replicas of the projections written to participating -projection stores are not managed by Chain Replication. - -An independent replica management technique very similar to the style -used by both Riak Core \cite{riak-core} and Dynamo is used. -The major difference is -that successful return status from (minimum) a quorum of participants -{\em is not required}. - -\subsection{Read repair: repair only when unwritten} - -The idea of ``read repair'' is also shared with Riak Core and Dynamo -systems. 
However, there is a case read repair cannot truly ``fix'' a -key because two different values have been written by two different -replicas. - -Machi's projection store is write-once, and there is no ``undo'' or -``delete'' or ``overwrite'' in the projection store API. It doesn't -matter what caused the two different values. In case of multiple -values, all participants in humming consensus merely agree that there -were multiple opinions at that epoch which must be resolved by the -creation and writing of newer projections with later epoch numbers. - -Machi's projection store read repair can only repair values that are -unwritten, i.e., storing $\emptyset$. - -\subsection{Projection storage: writing} -\label{sub:proj-store-writing} - -All projection data structures are stored in the write-once Projection -Store that is run by each server. (See also \cite{machi-design}.) - -Writing the projection follows the two-step sequence below. - -\begin{enumerate} -\item Write $P_{new}$ to the local projection store. (As a side - effect, - this will trigger - ``wedge'' status in the local server, which will then cascade to other - projection-related behavior within that server.) -\item Write $P_{new}$ to the remote projection store of {\tt all\_members}. - Some members may be unavailable, but that is OK. -\end{enumerate} - -In cases of {\tt error\_written} status, -the process may be aborted and read repair -triggered. The most common reason for {\tt error\_written} status -is that another actor in the system has -already calculated another (perhaps different) projection using the -same projection epoch number and that -read repair is necessary. Note that {\tt error\_written} may also -indicate that another actor has performed read repair on the exact -projection value that the local actor is trying to write! - -\section{Reading from the projection store} -\label{sec:proj-store-reading} - -Reading data from the projection store is similar in principle to -reading from a Chain Replication-managed server system. However, the -projection store does not use the strict replica ordering that -Chain Replication does. For any projection store key $K_n$, the -participating servers may have different values for $K_n$. As a -write-once store, it is impossible to mutate a replica of $K_n$. If -replicas of $K_n$ differ, then other parts of the system (projection -calculation and storage) are responsible for reconciling the -differences by writing a later key, -$K_{n+x}$ when $x>0$, with a new projection. - -Projection store reads are ``best effort''. The projection used is chosen from -all replica servers that are available at the time of the read. The -minimum number of replicas is only one: the local projection store -should always be available, even if no other remote replica projection -stores are available. - -For any key $K$, different projection stores $S_a$ and $S_b$ may store -nothing (i.e., {\tt error\_unwritten} when queried) or store different -values, $P_a \ne P_b$, despite having the same projection epoch -number. The following ranking rules are used to -determine the ``best value'' of a projection, where highest rank of -{\em any single projection} is considered the ``best value'': - -\begin{enumerate} -\item An unwritten value is ranked at a value of $-1$. -\item A value whose {\tt author\_server} is at the $I^{th}$ position - in the {\tt all\_members} list has a rank of $I$. 
-\item A value whose {\tt dbg\_annotations} and/or other fields have - additional information may increase/decrease its rank, e.g., - increase the rank by $10.25$. -\end{enumerate} - -Rank rules \#2 and \#3 are intended to avoid worst-case ``thrashing'' -of different projection proposals. - -The concept of ``read repair'' of an unwritten key is the same as -Chain Replication's. If a read attempt for a key $K$ at some server -$S$ results in {\tt error\_unwritten}, then all of the other stores in -the {\tt \#projection.all\_members} list are consulted. If there is a -unanimous value $V_{u}$ elsewhere, then $V_{u}$ is use to repair all -unwritten replicas. If the value of $K$ is not unanimous, then the -``best value'' $V_{best}$ is used for the repair. If all respond with -{\tt error\_unwritten}, repair is not required. +Both of these steps are performed as part of humming consensus's +normal operation. It may be non-intuitive that the minimum number of +available servers is only one, but ``one'' is the correct minimum +number for humming consensus. \section{Humming Consensus} \label{sec:humming-consensus} @@ -613,14 +609,12 @@ Sources for background information include: background on the use of humming during meetings of the IETF. \item ``On `Humming Consensus', an allegory'' \cite{humming-consensus-allegory}, -for an allegory in the style (?) of Leslie Lamport's original Paxos +for an allegory in homage to the style of Leslie Lamport's original Paxos paper. \end{itemize} -\subsection{Summary of humming consensus} - -"Humming consensus" describes +Humming consensus describes consensus that is derived only from data that is visible/known at the current time. This implies that a network partition may be in effect and that not all chain members are reachable. The algorithm will calculate @@ -658,6 +652,47 @@ with epochs numbered by $E+\delta$ (where $\delta > 0$). The distribution of the $E+\delta$ projections will bring all visible participants into the new epoch $E+delta$ and then into consensus. +The remainder of this section follows the same patter as +Section~\ref{sec:phases-of-projection-change}: network monitoring, +calculating new projections, writing projections, then perhaps +adopting the newest projection (which may or may not be the projection +that we just wrote). + +\subsection{Network monitoring} + +\subsection{Calculating new projection data structures} + +\subsection{Projection storage: writing} + +\subsection{Adopting a of new projection, perhaps} + +TODO finish + +A new projection is adopted by a Machi server if two requirements are +met: + +\subsubsection{All available copies of the projection are unanimous/identical} + +If we query all available servers for their latest projection, assume +that $E$ is the largest epoch number found. If we read public +projections from all available servers, and if all are equal to some +projection $P_E$, then projection $P_E$ is the best candidate for +adoption by the local server. + +If we see a projection $P^2_E$ that has the same epoch $E$ but a +different checksum value, then we must consider $P^2_E \ne P_E$. + +Any TODO FINISH + +The projection store's ``best value'' for the largest written epoch +number at the time of the read is projection used by the server. +If the read attempt for projection $P_p$ +also yields other non-best values, then the +projection calculation subsystem is notified. 
This notification +may/may not trigger a calculation of a new projection $P_{p+1}$ which +may eventually be stored and so +resolve $P_p$'s replicas' ambiguity. + \section{Just in case Humming Consensus doesn't work for us} There are some unanswered questions about Machi's proposed chain @@ -1303,6 +1338,33 @@ then no other chain member can have a prior/older value because their respective mutations histories cannot be shorter than the tail member's history. +\section{TODO: orphaned text} + +\subsection{1} + +For any key $K$, different projection stores $S_a$ and $S_b$ may store +nothing (i.e., {\tt error\_unwritten} when queried) or store different +values, $P_a \ne P_b$, despite having the same projection epoch +number. The following ranking rules are used to +determine the ``best value'' of a projection, where highest rank of +{\em any single projection} is considered the ``best value'': + +\begin{enumerate} +\item An unwritten value is ranked at a value of $-1$. +\item A value whose {\tt author\_server} is at the $I^{th}$ position + in the {\tt all\_members} list has a rank of $I$. +\item A value whose {\tt dbg\_annotations} and/or other fields have + additional information may increase/decrease its rank, e.g., + increase the rank by $10.25$. +\end{enumerate} + +Rank rules \#2 and \#3 are intended to avoid worst-case ``thrashing'' +of different projection proposals. + +\subsection{ranking} +\label{sub:projection-ranking} + + \bibliographystyle{abbrvnat} \begin{thebibliography}{} \softraggedright From 8481e2321407633a1a273b847a25b570963b5891 Mon Sep 17 00:00:00 2001 From: Scott Lystig Fritchie Date: Mon, 20 Apr 2015 20:30:26 +0900 Subject: [PATCH 11/14] WIP: more restructuring --- doc/src.high-level/high-level-chain-mgr.tex | 81 ++++++++++++++++----- 1 file changed, 63 insertions(+), 18 deletions(-) diff --git a/doc/src.high-level/high-level-chain-mgr.tex b/doc/src.high-level/high-level-chain-mgr.tex index fc39855..11dd5d5 100644 --- a/doc/src.high-level/high-level-chain-mgr.tex +++ b/doc/src.high-level/high-level-chain-mgr.tex @@ -197,8 +197,9 @@ If the implementation of this self-management protocol breaks an assumption or prerequisite of CORFU, then we expect that Machi's implementation will be flawed. -\subsection{Communication model: asyncronous message passing} +\subsection{Communication model} +The communication model is asynchronous point-to-point messaging. The network is unreliable: messages may be arbitrarily dropped and/or reordered. Network partitions may occur at any time. Network partitions may be asymmetric, e.g., a message can be sent @@ -223,7 +224,7 @@ time" between iterations of the algorithm: there is no need to "busy wait" by executing the algorithm as quickly as possible. See below, "sleep intervals between executions". -\subsection{Failure detector model: weak, fallible, boolean} +\subsection{Failure detector model} We assume that the failure detector that the algorithm uses is weak, it's fallible, and it informs the algorithm in boolean status @@ -234,8 +235,8 @@ change, then the algorithm will "churn" the operational state of the chain, e.g. by removing the failed node from the chain or adding a (re)started node (that may not be alive) to the end of the chain. Such extra churn is regrettable and will cause periods of delay as the -"rough consensus" (decribed below) decision is made. However, the -churn cannot (we assert/believe) cause data loss. +humming consensus algorithm (decribed below) makes decisions. 
However, the +churn cannot {\bf (we assert/believe)} cause data loss. \subsection{Use of the ``wedge state''} @@ -250,7 +251,7 @@ I/O API. When in wedge state, the server will refuse all file write I/O API requests until the self-management algorithm has determined that -"rough consensus" has been decided (see next bullet item). The server +humming consensus has been decided (see next bullet item). The server may also refuse file read I/O API requests, depending on its CP/AP operation mode. @@ -310,6 +311,16 @@ The private projection store serves multiple purposes, including: state of the local node \end{itemize} +The private half of the projection store is not replicated. +Projections that are stored in the private projection store are +meaningful only to the local projection store and are, furthermore, +merely ``soft state''. Data loss in the private projection store +cannot result in loss of ``hard state'' information. Therefore, +replication of the private projection store is not required. The +replication techniques described by +Section~\ref{sec:managing-multiple-projection-stores} applies only to +the public half of the projection store. + \section{Projections: calculation, storage, and use} \label{sec:projections} @@ -320,6 +331,13 @@ administrative changes (e.g., substituting a failed server box with replacement hardware) as well as local network conditions (e.g., is there a network partition?). +The projection defines the operational state of Chain Replication's +chain order as well the (re-)synchronization of data managed by by +newly-added/failed-and-now-recovering members of the chain. This +chain metadata, together with computational processes that manage the +chain, must be managed in a safe manner in order to avoid unintended +data loss of data managed by the chain. + The concept of a projection is borrowed from CORFU but has a longer history, e.g., the Hibari key-value store \cite{cr-theory-and-practice} and goes back in research for decades, @@ -423,6 +441,7 @@ the epoch number and the projection checksum, as described in Section~\ref{sub:the-projection}. \section{Managing multiple projection stores} +\label{sec:managing-multiple-projection-stores} An independent replica management technique very similar to the style used by both Riak Core \cite{riak-core} and Dynamo is used to manage @@ -597,31 +616,30 @@ A projection $P_{new}$ is used by a server only if: Both of these steps are performed as part of humming consensus's normal operation. It may be non-intuitive that the minimum number of available servers is only one, but ``one'' is the correct minimum -number for humming consensus. + number for humming consensus. \section{Humming Consensus} \label{sec:humming-consensus} -Sources for background information include: +Additional sources for information humming consensus include: \begin{itemize} \item ``On Consensus and Humming in the IETF'' \cite{rfc-7282}, for -background on the use of humming during meetings of the IETF. +background on the use of humming by IETF meeting participants during +IETF meetings. \item ``On `Humming Consensus', an allegory'' \cite{humming-consensus-allegory}, for an allegory in homage to the style of Leslie Lamport's original Paxos paper. \end{itemize} - -Humming consensus describes -consensus that is derived only from data that is visible/known at the current -time. This implies that a network partition may be in effect and that -not all chain members are reachable. 
The algorithm will calculate -an approximate consensus despite not having input from all/majority -of chain members. Humming consensus may proceed to make a -decision based on data from only a single participant, i.e., only the local -node. +Humming consensus describes consensus that is derived only from data +that is visible/known at the current time. It's OK if a network +partition is in effect and that not all chain members are available; +the algorithm will calculate an approximate consensus despite not +having input from all/majority of chain members. Humming consensus +may proceed to make a decision based on data from only one +participant, i.e., only the local node. \begin{itemize} @@ -652,12 +670,39 @@ with epochs numbered by $E+\delta$ (where $\delta > 0$). The distribution of the $E+\delta$ projections will bring all visible participants into the new epoch $E+delta$ and then into consensus. -The remainder of this section follows the same patter as +The remainder of this section follows the same pattern as Section~\ref{sec:phases-of-projection-change}: network monitoring, calculating new projections, writing projections, then perhaps adopting the newest projection (which may or may not be the projection that we just wrote). +\subsubsection{Aside: origin of the analogy to humming a song} + +The ``humming'' part of humming consensus comes from the action taken +when the environment changes. If we imagine an egalitarian group of +people, all in the same room humming some pitch together, then we take +action to change our humming pitch if: + +\begin{itemize} +\item Some member departs the room (because they witness the person +walking out the door) or if someone else in the room starts humming a +new pitch with a new epoch number.\footnote{It's very difficult for + the human ear to hear the epoch number part of a hummed pitch, but + for the sake of the analogy, assume that it can.} +\item If a member enters the room and starts humming with the same + epoch number but a different note. +\end{itemize} + +If someone were to transcribe onto a musical score the pitches that +are hummed in the room over a period of time, we might have something +that approximates music. If this musical core uses chord progressions +and rhythms that obey the rules of a musical genre, e.g., Gregorian +chant, then the final musical score is a valid Gregorian chant. + +By analogy, if the rules of the musical score are obeyed, then the +Chain Replication invariants that are managed by humming consensus are +obeyed. Such safe management of Chain Replication is our end goal. + \subsection{Network monitoring} \subsection{Calculating new projection data structures} From 9ab104933eb2d6bf56c3c29ad43608eda9d4c716 Mon Sep 17 00:00:00 2001 From: Scott Lystig Fritchie Date: Mon, 20 Apr 2015 20:32:20 +0900 Subject: [PATCH 12/14] WIP: more restructuring --- doc/src.high-level/high-level-chain-mgr.tex | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/doc/src.high-level/high-level-chain-mgr.tex b/doc/src.high-level/high-level-chain-mgr.tex index 11dd5d5..3426caf 100644 --- a/doc/src.high-level/high-level-chain-mgr.tex +++ b/doc/src.high-level/high-level-chain-mgr.tex @@ -658,7 +658,7 @@ all members would be invalid and therefore would not move itself out of wedged state. In very general terms, this requirement for a quorum majority of surviving participants is also a requirement for Paxos, Raft, and ZAB. 
See Section~\ref{sec:split-brain-management} for a -proposal to handle ``split brain'' scenarios while in CPU mode. +proposal to handle ``split brain'' scenarios while in CP mode. \end{itemize} @@ -670,11 +670,13 @@ with epochs numbered by $E+\delta$ (where $\delta > 0$). The distribution of the $E+\delta$ projections will bring all visible participants into the new epoch $E+delta$ and then into consensus. -The remainder of this section follows the same pattern as +The next portion of this section follows the same pattern as Section~\ref{sec:phases-of-projection-change}: network monitoring, calculating new projections, writing projections, then perhaps adopting the newest projection (which may or may not be the projection that we just wrote). +Beginning with Section~9.5\footnote{TODO correction needed?}, we will +explore TODO topics. \subsubsection{Aside: origin of the analogy to humming a song} From cd6282b76d8b7806ae4c8b8be12d0f50028c3342 Mon Sep 17 00:00:00 2001 From: Scott Lystig Fritchie Date: Mon, 20 Apr 2015 21:09:25 +0900 Subject: [PATCH 13/14] WIP: more restructuring --- doc/src.high-level/high-level-chain-mgr.tex | 111 ++++++++++++-------- 1 file changed, 66 insertions(+), 45 deletions(-) diff --git a/doc/src.high-level/high-level-chain-mgr.tex b/doc/src.high-level/high-level-chain-mgr.tex index 3426caf..56d044e 100644 --- a/doc/src.high-level/high-level-chain-mgr.tex +++ b/doc/src.high-level/high-level-chain-mgr.tex @@ -267,7 +267,7 @@ This traditional definition differs from what is described here as consensus that is derived only from data that is visible/known at the current time. The algorithm will calculate -an approximate consensus despite not having input from all/majority +a rough consensus despite not having input from all/majority of chain members. Humming consensus may proceed to make a decision based on data from only a single participant, i.e., only the local node. @@ -563,13 +563,14 @@ machine/hardware node. Output of the monitor should declare the up/down (or available/unavailable) status of each server in the projection. Such -Boolean status does not eliminate ``fuzzy logic'' or probabilistic -methods for determining status. Instead, hard Boolean up/down status -decisions are required by the projection calculation phase -(Section~\ref{subsub:projection-calculation}). +Boolean status does not eliminate fuzzy logic, probabilistic +methods, or other techniques for determining availability status. +Instead, hard Boolean up/down status +decisions are required only by the projection calculation phase +(Section~\ref{sub:projection-calculation}). -\subsection{Calculating new projection data structures} -\label{subsub:projection-calculation} +\subsection{Calculating a new projection data structure} +\label{sub:projection-calculation} Each Machi server will have an independent agent/process that is responsible for calculating new projections. A new projection may be @@ -600,6 +601,7 @@ Section~\ref{sub:proj-store-writing} for the technique for writing projections to all participating servers' projection stores. 
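+
+Before turning to adoption, here is a small sketch of the ``pure
+computation'' framing from the calculation subsection above.  It
+models only the failure detector's input and ignores administrative
+requests and chain repair ordering; the map keys echo projection
+fields mentioned elsewhere in this document, but the function itself
+is an invented illustration, not Machi's implementation.
+
+{\footnotesize
+\begin{verbatim}
+-module(proj_calc_sketch).
+-export([calc_next/3]).
+
+%% Current is a map with (at least) keys epoch and all_members;
+%% UpNodes is the failure detector's current list of members believed
+%% to be up; Author is the local server's name.
+calc_next(#{epoch := E, all_members := All}, UpNodes, Author) ->
+    %% Keep the configured member order, but drop members that the
+    %% failure detector currently reports as down.
+    NewUPI = [M || M <- All, lists:member(M, UpNodes)],
+    #{epoch => E + 1,
+      all_members => All,
+      upi => NewUPI,
+      author_server => Author}.
+\end{verbatim}
+}
+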
\subsection{Adoption a new projection} +\label{sub:proj-adoption} A projection $P_{new}$ is used by a server only if: @@ -621,22 +623,10 @@ available servers is only one, but ``one'' is the correct minimum \section{Humming Consensus} \label{sec:humming-consensus} -Additional sources for information humming consensus include: - -\begin{itemize} -\item ``On Consensus and Humming in the IETF'' \cite{rfc-7282}, for -background on the use of humming by IETF meeting participants during -IETF meetings. - -\item ``On `Humming Consensus', an allegory'' \cite{humming-consensus-allegory}, -for an allegory in homage to the style of Leslie Lamport's original Paxos -paper. -\end{itemize} - Humming consensus describes consensus that is derived only from data that is visible/known at the current time. It's OK if a network partition is in effect and that not all chain members are available; -the algorithm will calculate an approximate consensus despite not +the algorithm will calculate a rough consensus despite not having input from all/majority of chain members. Humming consensus may proceed to make a decision based on data from only one participant, i.e., only the local node. @@ -668,7 +658,7 @@ decision during epoch $E$. When a differing decision is discovered, newer \& later time epochs are defined by creating new projections with epochs numbered by $E+\delta$ (where $\delta > 0$). The distribution of the $E+\delta$ projections will bring all visible -participants into the new epoch $E+delta$ and then into consensus. +participants into the new epoch $E+delta$ and then eventually into consensus. The next portion of this section follows the same pattern as Section~\ref{sec:phases-of-projection-change}: network monitoring, @@ -676,9 +666,21 @@ calculating new projections, writing projections, then perhaps adopting the newest projection (which may or may not be the projection that we just wrote). Beginning with Section~9.5\footnote{TODO correction needed?}, we will -explore TODO topics. +explore TODO TOPICS. -\subsubsection{Aside: origin of the analogy to humming a song} +Additional sources for information humming consensus include: + +\begin{itemize} +\item ``On Consensus and Humming in the IETF'' \cite{rfc-7282}, for +background on the use of humming by IETF meeting participants during +IETF meetings. + +\item ``On `Humming Consensus', an allegory'' \cite{humming-consensus-allegory}, +for an allegory in homage to the style of Leslie Lamport's original Paxos +paper. +\end{itemize} + +\subsubsection{Aside: origin of the analogy to composing music} The ``humming'' part of humming consensus comes from the action taken when the environment changes. If we imagine an egalitarian group of @@ -697,48 +699,64 @@ new pitch with a new epoch number.\footnote{It's very difficult for If someone were to transcribe onto a musical score the pitches that are hummed in the room over a period of time, we might have something -that approximates music. If this musical core uses chord progressions +that is roughly like music. If this musical score uses chord progressions and rhythms that obey the rules of a musical genre, e.g., Gregorian chant, then the final musical score is a valid Gregorian chant. By analogy, if the rules of the musical score are obeyed, then the Chain Replication invariants that are managed by humming consensus are -obeyed. Such safe management of Chain Replication is our end goal. +obeyed. Such safe management of Chain Replication metadata is our end goal. 
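+
+Returning from the analogy to the epoch arithmetic above: when the
+projections visible at the largest epoch $E$ disagree, the reaction
+is simply to author a new projection at a strictly later epoch.  A
+hedged sketch follows, with invented names; {\tt CalcFun} stands in
+for the projection calculation subsystem.
+
+{\footnotesize
+\begin{verbatim}
+-module(epoch_react_sketch).
+-export([react/2]).
+
+%% Seen is a non-empty list of {Epoch, Checksum} pairs gathered from
+%% all reachable public projection stores; CalcFun(NewEpoch) asks the
+%% calculation subsystem for a projection at NewEpoch.
+react(Seen, CalcFun) ->
+    MaxEpoch = lists:max([Epoch || {Epoch, _CSum} <- Seen]),
+    CSumsAtMax = lists:usort([CSum || {Epoch, CSum} <- Seen,
+                                      Epoch =:= MaxEpoch]),
+    case CSumsAtMax of
+        [_SingleChecksum] -> {unanimous, MaxEpoch};
+        _Several          -> {propose, CalcFun(MaxEpoch + 1)}
+    end.
+\end{verbatim}
+}
+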
\subsection{Network monitoring} -\subsection{Calculating new projection data structures} +See also: Section~\ref{sub:network-monitoring}. -\subsection{Projection storage: writing} +\subsection{Calculating a new projection data structure} -\subsection{Adopting a of new projection, perhaps} +See also: Section~\ref{sub:projection-calculation}. + +\subsection{Writing a new projection} + +See also: Section~\ref{sub:proj-storage-writing}. + +\subsection{Adopting a new projection} + +See also: Section~\ref{sub:proj-adoption}. TODO finish -A new projection is adopted by a Machi server if two requirements are -met: +A new projection $P_E$ is adopted by a Machi server at epoch $E$ if +two requirements are met: -\subsubsection{All available copies of the projection are unanimous/identical} +\paragraph{\#1: All available copies of $P_E$ are unanimous/identical} -If we query all available servers for their latest projection, assume +If we query all available servers for their latest projection, then assume that $E$ is the largest epoch number found. If we read public -projections from all available servers, and if all are equal to some -projection $P_E$, then projection $P_E$ is the best candidate for -adoption by the local server. +projections for epoch $E$ from all available servers, and if all are +equal to some projection $P_E$, then projection $P_E$ is +(by definition) the best candidate for adoption by the local server. If we see a projection $P^2_E$ that has the same epoch $E$ but a different checksum value, then we must consider $P^2_E \ne P_E$. - -Any TODO FINISH - -The projection store's ``best value'' for the largest written epoch -number at the time of the read is projection used by the server. -If the read attempt for projection $P_p$ -also yields other non-best values, then the +If we see multiple different values $P^*_E$ for epoch $E$, then the projection calculation subsystem is notified. This notification -may/may not trigger a calculation of a new projection $P_{p+1}$ which -may eventually be stored and so -resolve $P_p$'s replicas' ambiguity. +will trigger a calculation of a new projection $P_{E+1}$ which +may eventually be stored and therefore help +resolve epoch $E$'s ambiguous and unusable condition. + +\paragraph{\#2: The transition from current $\rightarrow$ new projection is +safe} + +The projection $P_E = P_{latest}$ is evaluated by numerous rules and +invariants, relative to the projection that the server is currently +using, $P_{current}$. If such rule or invariant is violated/false, +then the local server will discard $P_{latest}$. Instead, it will +trigger the projection calculation subsystem to create an alternative, +safe projection $P_{latest+1}$ that will hopefully create a unanimous +epoch $E+1$. + +See Section~\ref{sub:humming-rules-and-invariants} for detail about +these rules and invariants. \section{Just in case Humming Consensus doesn't work for us} @@ -1411,6 +1429,9 @@ of different projection proposals. 
\subsection{ranking} \label{sub:projection-ranking} +\subsection{rules \& invariants} +\label{sub:humming-rules-and-invariants} + \bibliographystyle{abbrvnat} \begin{thebibliography}{} \softraggedright From 3c70fff0033274e994c21c8f1b35e5282678cdcf Mon Sep 17 00:00:00 2001 From: Scott Lystig Fritchie Date: Mon, 20 Apr 2015 21:21:11 +0900 Subject: [PATCH 14/14] WIP: more restructuring --- doc/src.high-level/high-level-chain-mgr.tex | 35 ++++++++++++++++++--- 1 file changed, 31 insertions(+), 4 deletions(-) diff --git a/doc/src.high-level/high-level-chain-mgr.tex b/doc/src.high-level/high-level-chain-mgr.tex index 56d044e..6a06c22 100644 --- a/doc/src.high-level/high-level-chain-mgr.tex +++ b/doc/src.high-level/high-level-chain-mgr.tex @@ -680,19 +680,18 @@ for an allegory in homage to the style of Leslie Lamport's original Paxos paper. \end{itemize} -\subsubsection{Aside: origin of the analogy to composing music} - +\paragraph{Aside: origin of the analogy to composing music} The ``humming'' part of humming consensus comes from the action taken when the environment changes. If we imagine an egalitarian group of people, all in the same room humming some pitch together, then we take action to change our humming pitch if: \begin{itemize} -\item Some member departs the room (because they witness the person +\item Some member departs the room (because we can witness the person walking out the door) or if someone else in the room starts humming a new pitch with a new epoch number.\footnote{It's very difficult for the human ear to hear the epoch number part of a hummed pitch, but - for the sake of the analogy, assume that it can.} + for the sake of the analogy, let's assume that it can.} \item If a member enters the room and starts humming with the same epoch number but a different note. \end{itemize} @@ -711,14 +710,38 @@ obeyed. Such safe management of Chain Replication metadata is our end goal. \subsection{Network monitoring} See also: Section~\ref{sub:network-monitoring}. +In today's implementation, there is only a single criterion for +determining the available/not-available status of a remote server $S$: +is $S$'s projection store available. If yes, then we assume that all of +$S$ is available. If $S$'s projection store is not available for any +reason, we assume $S$ is entirely unavailable. This simple single +criterion appears to be sufficient for humming consensus, according to +simulations of arbitrary network partitions. + \subsection{Calculating a new projection data structure} See also: Section~\ref{sub:projection-calculation}. +TODO: +0. incorporate text from ORG file at all relevant places!!!!!!!!!! +1. calculating a new projection is straightforward +2. define flapping? +3. a simple criterion for flapping/not-flapping is pretty easy +4. if flapping, then calculate an ``inner'' projection +5. if flapping $\rightarrow$ not-flapping, then copy inner +$\rightarrow$ outer projection and reset flapping counter. + \subsection{Writing a new projection} See also: Section~\ref{sub:proj-storage-writing}. +TODO: +1. We write a new projection based on flowchart A* and B* states and +state transitions. + +TODO: include the flowchart into the doc. We'll probably need to +insert it landscape? + \subsection{Adopting a new projection} See also: Section~\ref{sub:proj-adoption}. TODO finish @@ -758,6 +781,10 @@ epoch $E+1$. See Section~\ref{sub:humming-rules-and-invariants} for detail about these rules and invariants. +TODO: +1. We write a new projection based on flowchart A* and B* and C1* states and +state transitions.
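+
+Putting requirements \#1 and \#2 from the adoption discussion
+together, the adoption decision can be sketched as a single small
+function.  The names below are invented for illustration; {\tt SafeFun}
+stands in for the rules-and-invariants checks that
+Section~\ref{sub:humming-rules-and-invariants} will describe.
+
+{\footnotesize
+\begin{verbatim}
+-module(adopt_sketch).
+-export([maybe_adopt/3]).
+
+%% Copies is the list of projections read at the largest epoch from
+%% every available public store; SafeFun(Current, Candidate) stands in
+%% for the safety checks (e.g., no Update Propagation Invariant
+%% violation during the transition Current -> Candidate).
+maybe_adopt(Current, Copies, SafeFun) ->
+    case lists:usort(Copies) of
+        [Candidate] ->                          %% requirement #1: unanimity
+            case SafeFun(Current, Candidate) of %% requirement #2: safety
+                true  -> {adopt, Candidate};
+                false -> {keep, Current}
+            end;
+        _Disagreement ->
+            {keep, Current}
+    end.
+\end{verbatim}
+}
+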
+ \section{Just in case Humming Consensus doesn't work for us} There are some unanswered questions about Machi's proposed chain