Commit graph

587 commits

Author SHA1 Message Date
Scott Lystig Fritchie
d2ac5b0583 Bugfix: arg type to machi_util:parse_filename() 2015-10-21 18:37:30 +09:00
Scott Lystig Fritchie
028ddc79ff Data type cleanups, other 2015-10-21 18:37:30 +09:00
Scott Lystig Fritchie
595f9a463e Unexported funcs 2015-10-21 18:37:30 +09:00
Scott Lystig Fritchie
177aca0a68 Merge pull request #22 from basho/ss-flu1-init-sync
Make flu1 initialization synchronous
2015-10-21 18:36:12 +09:00
Shunichi Shinohara
478107915b Make flu1 initialization synchronous 2015-10-21 16:16:03 +09:00
Scott Lystig Fritchie
30d7e592a3 Merge pull request #20 from basho/ku/read-all-chunks
Allow reading multiple chunks at once
2015-10-21 15:28:10 +09:00
Scott Lystig Fritchie
1c8e436a64 Fix race #3 2015-10-21 15:01:11 +09:00
UENISHI Kota
a43397a7b8 Update to review comments 2015-10-21 10:58:00 +09:00
UENISHI Kota
ebb9bc3f5a Allow reading multiple chunks at once
* If any single repair fails while repairing multiple chunks at once,
  the whole read request and repair work fail
* Rename read_repair3 and read_repair4 to do_repair_chunks and
  do_repair_chunk in machi_file_proxy
* This pull request changes the return semantics of read_chunk(), which
  now returns every chunk included in the requested range
* The first and last chunks may be trimmed to fit the requested range
* In machi_file_proxy, unwritten_bytes is removed and replaced by
  machi_csum_table
2015-10-20 17:59:09 +09:00
Scott Lystig Fritchie
6f9814ffb4 Merge ss/deps-for-debugging (with rebar.config conflict fix) 2015-10-19 16:41:03 +09:00
UENISHI Kota
3e975f53b8 Allow read_chunk() to return partial chunks
This is simply a change of the read_chunk() protocol: the response of
read_chunk() becomes a list of written bytes along with their checksums.
All related code, including repair, is changed accordingly. This change
only makes all tests pass; partial chunks are not actually supported yet.
2015-10-19 15:37:17 +09:00
Shunichi Shinohara
208c02853f Add cluster_info to deps and small callback module
For debugging from the shell, some functions in machi_cinfo are exported:

- public_projection/1
- private_projection/1
- fitness/1
- chain_manager/1
- flu1/1
2015-10-19 15:36:05 +09:00
UENISHI Kota
cb67764273 Merge pull request #12 from basho/slf/packaging1
Slf/packaging1
2015-10-16 17:56:27 +09:00
Scott Lystig Fritchie
00ac0f4cd3 Reduce compiler warnings and verbose output that clutters eunit test output 2015-10-16 17:41:01 +09:00
Scott Lystig Fritchie
299016cafb FLU startup via app.config 2015-10-16 16:28:46 +09:00
UENISHI Kota
6f790527f5 Follow with missing tests and related fix 2015-10-16 10:10:05 +09:00
UENISHI Kota
e45469b5ce Move checksum file related code to machi_csum_table 2015-10-15 11:28:40 +09:00
Mark Allen
baeffbab0b Merge pull request #6 from basho/mra/write-once-clean
Integrate write once invariant into current FLU implementation
2015-10-14 10:15:57 -05:00
Scott Lystig Fritchie
e344ee42ff Remove stale TODO comment about write-once enforcement 2015-10-14 16:56:51 +09:00
Scott Lystig Fritchie
d6a3180ecd Use pattern matching instead of length() BIF 2015-10-14 16:52:03 +09:00
UENISHI Kota
07ceff095a Fix gen_server style return value 2015-10-14 16:22:11 +09:00
Scott Lystig Fritchie
8eb9cc9700 Fix "HEY, machi_pb_translate:852 got {error,bad_csum}" errors
s/bad_csum/bad_checksum/ as needed in machi_file_proxy.erl
2015-10-14 14:26:46 +09:00
Scott Lystig Fritchie
ed112bfb52 Argument fix for read_chunk() when write_chunk() says 'written' 2015-10-14 14:16:51 +09:00
Scott Lystig Fritchie
6dbf52db6f Remove some debugging verbosity 2015-10-14 12:50:10 +09:00
UENISHI Kota
1b612bd969 Fix typo in comment 2015-10-14 12:40:56 +09:00
Mark Allen
fe71b72494 Add filename parse and validation functions 2015-10-13 21:12:14 -05:00
Mark Allen
f8707c61c0 Choose new filename when epoch changes
The filename manager needs to choose a new file name
for a prefix when the epoch number changes. This helps
ensure safety of file merges across the cluster.
(Prevents conflicts across divergent cluster members.)
2015-10-13 21:09:31 -05:00
Mark Allen
161e6cd9f9 Pass epoch id to append operations
Needed to handle a filename change when the epoch changes.
2015-10-13 21:08:48 -05:00
Mark Allen
85e1e5a26d Handle {error, bad_arg} on read 2015-10-13 21:08:24 -05:00
UENISHI Kota
e113f6ffdd Reach the trim stub to CR client 2015-10-13 17:25:59 +09:00
UENISHI Kota
dfe953b7d8 Add surface of trim to scrub 2015-10-13 17:14:44 +09:00
Scott Lystig Fritchie
777909b0f5 TODO MARK todo comment and bugfix for machi_cr_client_test 2015-10-12 15:30:37 +09:00
Mark Allen
289b2bcc7c Debug WIP 2015-10-11 23:04:29 -05:00
Mark Allen
c1b9038447 The return value of ets is generally 'true' 2015-10-08 15:47:11 -05:00
Mark Allen
aca3759e45 Bug fixes found during testing runs 2015-10-08 15:46:40 -05:00
Mark Allen
1ecbb5cffe Fixed order of start_link parameters 2015-10-08 15:45:04 -05:00
Mark Allen
303aad97e9 Use {error, bad_checksum} directly
We previously copied {error, bad_csum} as it was used in the main
FLU code.  The protobufs stuff expects the full atom bad_checksum
though.
2015-10-08 15:43:54 -05:00
Scott Lystig Fritchie
952d2fa508 Change flag_checksum -> flag_no_checksum for consistency 2015-10-08 20:41:59 +09:00
Mark Allen
679046600f Merge remote-tracking branch 'origin/bug/from-bp-request-error' into mra/write-once-clean 2015-10-07 23:02:03 -05:00
Scott Lystig Fritchie
796937fe75 Add LL generic error PB response decoding 2015-10-08 12:33:55 +09:00
Scott Lystig Fritchie
0054445f13 Delete spammy message from fitness servers every 5 seconds 2015-10-07 18:52:24 +09:00
Mark Allen
d627f238bf Cache generated names until disk files are written 2015-10-06 22:44:31 -05:00
Mark Allen
f83b0973f2 Have to call filename mgr with FluName 2015-10-06 22:43:19 -05:00
Mark Allen
7a6999465a Make sure we use '^' as filename separators 2015-10-06 22:02:31 -05:00
Mark Allen
2d0c03ef35 Integration with current FLU implementation 2015-10-05 22:18:29 -05:00
Mark Allen
36c11e7d08 Add a metadata manager supervisor 2015-10-05 16:37:53 -05:00
Mark Allen
d3fe7ee181 Pull write-once files over to clean branch
I am treating the original write-once branch as a prototype
which I am now throwing away. I had too much work interleaved
in there, so I felt like the best thing to do would be to cut
a new clean branch and pull the files over and start over
against a recent-ish master.

We will have to refactor the other things in FLU in a more
piecemeal fashion.
2015-10-02 16:29:09 -05:00
Scott Lystig Fritchie
6d5b61f747 Tweaks to sleep_ranked_order() call in C200 2015-09-21 21:47:25 +09:00
Scott Lystig Fritchie
5eecb2b935 Change to P_current_calc epoch @ C100 2015-09-21 21:44:03 +09:00
Scott Lystig Fritchie
340af05f0f WIP: server-side of CP mode repairing-as-witness 2015-09-21 21:44:03 +09:00
Scott Lystig Fritchie
d9b9397e75 Avoid some projection churn in C100's sanity check 2015-09-21 21:44:03 +09:00
Scott Lystig Fritchie
5010d03677 Call manage_last_down_list() at C220 and C310 2015-09-21 15:36:54 +09:00
Scott Lystig Fritchie
69a304102e Write public proj in all_members order only 2015-09-21 15:09:16 +09:00
Scott Lystig Fritchie
6b4ed1c061 Verbose debugging cruft 2015-09-19 14:25:07 +09:00
Scott Lystig Fritchie
72bfa163ba Small test bugfixes & verbose/debugging cruft 2015-09-19 14:16:54 +09:00
Scott Lystig Fritchie
d695f30e4f Avoid using host/port combo for machi_fitness (ab)use of machi_projection 2015-09-17 16:43:08 +09:00
Scott Lystig Fritchie
09ae2db0ba Bugfix: double-check local private projection write with a read 2015-09-16 16:31:10 +09:00
Scott Lystig Fritchie
79b1d156c4 Add backlog option to gen_tcp:listen 2015-09-16 13:52:36 +09:00
Scott Lystig Fritchie
778bd015ee Bugfix: pattern matching error in C110 2015-09-16 12:41:53 +09:00
Scott Lystig Fritchie
d3b116bd9e Bugfix: CP mode: ignore P_latest if it has UPI or down server in my down list 2015-09-15 17:55:18 +09:00
Scott Lystig Fritchie
75c94420e0 Add test_ets_table to give programmatic slowdown 2015-09-14 22:52:41 +09:00
Scott Lystig Fritchie
7bf1132142 Bugfix: IsRelevantToMe_p adjustment for P_latest.upi == [] 2015-09-14 17:28:50 +09:00
Scott Lystig Fritchie
b4f8bc8058 Add pretty_time(). Add CONFIRM verbose logging for none proj 2015-09-14 17:00:09 +09:00
Scott Lystig Fritchie
4e11cdd50f Bugfix: derp, pattern match for UniqueHistoryTrigger_p 2015-09-14 16:59:58 +09:00
Scott Lystig Fritchie
a036f119a6 Add send_spam_to_everyone(), add 1% chance of using it 2015-09-14 16:01:26 +09:00
Scott Lystig Fritchie
6c543dfc18 Re-use the flapping criteria for a different use (more)
Hooray, very early I ended up with a simulator example which kicked
in and tested this change.  (A deterministic fault injection method
for testing would also be valuable, probably.)

    machi_chain_manager1_converge_demo:t(7, [{private_write_verbose,true}]).

We switched partitions in the simulator like this:

    SET partitions = [{b,f},{c,f},{d,e},{f,e}] (2 of 90252) at {14,37,5}
    ...
    Stable projection at epoch 1429 upi=[b,c,g,a,d],repairing=[]
    ...
    SET partitions = [{b,d},{c,b},{d,c},{f,a}] (3 of 90252) at {14,37,44}

Part of the chain reassembled quickly from the following UPIs: [g], then
[g,e], then [g,e,f] via a series of successful simulated repairs.  For
the first two repairs, all parties (e & f & g) are unanimous about the
projections.  For the final repair, strangely, not all three adopt the
[g,e,f] chain: e says nothing, while f & g use it.

Also weird: g then immediately moves f to repairing!  upi=[g,e],repairing=[f].
Then e also adopts this chain of 2.  From that point forward, f keeps
trying to use upi=[g,e,f],[] and the others try using only upi=[g,e],[f].
There are lots of messages from g saying that it's insane (correctly!)
to try calc=1487:[g,e],[f] -> 1494:[g,e,f],[] without a valid repair
author.

It's worth checking why g dropped from [g,e,f] -> [g,e].  But even
still, this new use for the flapping counter & reset via C103 is
working.  ... Ah, now I understand.  The very occasional undefined
socket bug in machi_flu1_client appears to be the cause: g had a
one-time problem talking with f and so decided f was down long enough to
make the shorter UPI.  The other participants didn't have any such
problem with f and so kept f in the UPI.  This would have been a
deadlock/infinite loop case without someone deciding to reset state.
2015-09-14 15:41:48 +09:00
Scott Lystig Fritchie
23554ffccc Handle timeout/partition failures in C110 2015-09-14 13:54:47 +09:00
Scott Lystig Fritchie
fdf78bdbbc Tweak IsRelevantToMe_p in B10 (more)
Last night we hit a rare case of failed convergence.

f was out of sync with the rest of the world.
f: upi=[b,g,f] repairing=[a,c]
The "rest of the world" used a larger chain at:
*: upi=[c,b,g,a], repairing=[f]

And f refused to join the larger chain because of the way that
IsRelevantToMe_p was being calculated before this commit.

Hrrrm, though, I'm not convinced that this particular problem
is fixed 100% by this patch.  What if the chain lengths were
the same but also UPI incompatible?  e.g. if I remove 'a' from
the "real world (in the partition simulator)" example above:

f: upi=[b,g,f] repairing=[c]
*: upi=[c,b,g], repairing=[f]

Hrmmmmm, I may need to reintroduce the my-recent-adopted-projection-
flapping-like-counter thingie to try to break this kind of
incompatible deadlock.
2015-09-14 13:40:34 +09:00
Scott Lystig Fritchie
62186395ed Hooray! The weekend's CP work hasn't broken AP, I believe. 2015-09-14 00:04:53 +09:00
Scott Lystig Fritchie
f5901c6cd3 Hey, appears to work for CP mode chain len=3, hooray! 2015-09-13 21:51:20 +09:00
Scott Lystig Fritchie
89f57616a8 Avoid some churn when both latest & newprop are none proj 2015-09-13 17:44:23 +09:00
Scott Lystig Fritchie
f3a0ee91cf WIP: thread P_calc_current all the way to C100 for CP mode assist 2015-09-13 15:58:45 +09:00
Scott Lystig Fritchie
0a20417682 Adjustments for CP mode (still slightly experimental) 2015-09-13 14:56:28 +09:00
Scott Lystig Fritchie
32c4d39156 Bugfix: set consistency_mode at set_chain_members 2015-09-13 14:16:02 +09:00
Scott Lystig Fritchie
b3ce9f9ab8 A bit less verbose output 2015-09-11 23:08:47 +09:00
Scott Lystig Fritchie
5efec1b6cd Add upi_unanimous annotation to AP mode 2015-09-11 21:47:05 +09:00
Scott Lystig Fritchie
fe8ff6033d Make better state transition choices in AP mode 2015-09-11 19:14:41 +09:00
Scott Lystig Fritchie
a0c129c16d Bugfix: wow, a chain state transition sanity check bug 2015-09-11 17:32:52 +09:00
Scott Lystig Fritchie
8df7d58365 Add partition simulator support to fitness service 2015-09-11 16:45:29 +09:00
Scott Lystig Fritchie
efe6ce7894 WIP: small refactoring to prepare for fitness server 'use' of partition simulator 2015-09-11 16:03:49 +09:00
Scott Lystig Fritchie
35e8efeb96 Add timer:sleep() to accommodate machi_chain_manager1_converge_demo 2015-09-11 15:56:02 +09:00
Scott Lystig Fritchie
bbf925d132 Add fault injection method via C100 to test C103 admin down cycle 2015-09-10 18:05:55 +09:00
Scott Lystig Fritchie
41737ae62a Add delete_admin_down API implementation, oops! 2015-09-10 18:05:18 +09:00
Scott Lystig Fritchie
d45c249e89 Add admin down status API to fitness server 2015-09-10 17:30:11 +09:00
Scott Lystig Fritchie
c14b9ce50f Minor cleanup, add more partitions to converge demo 2015-09-10 16:39:15 +09:00
Scott Lystig Fritchie
af94d1c1c3 Bugfix: ExpectedUPI error in A40 2015-09-10 02:15:49 +09:00
Scott Lystig Fritchie
daf3a3d65a Remove some verbose debugging cruft 2015-09-10 01:47:46 +09:00
Scott Lystig Fritchie
329a5e0682 Bugfix: damn, no idea how many problems this 5 month old bug caused 2015-09-10 01:33:55 +09:00
Scott Lystig Fritchie
5943494d54 Add ExpectedUPI to A40's AmHosedP clause 2015-09-10 00:43:37 +09:00
Scott Lystig Fritchie
10c655ebfe WIP: fix one source of problems, now shift back to 'TODO this clause needs more review' 2015-09-09 23:59:40 +09:00
Scott Lystig Fritchie
b7aa33c617 Yeah, nearly there. AP fails occasionally in multiple-asymmetric-partition sequence 2015-09-09 23:10:39 +09:00
Scott Lystig Fritchie
72141c8ecb WIP: split A30 into A30/A31 based on AllHosed 2015-09-09 21:06:40 +09:00
Scott Lystig Fritchie
5029911b52 WIP: remove verbose goop 2015-09-09 20:46:52 +09:00
Scott Lystig Fritchie
38ea36fc1c WIP: Stand back, I'm going to try math! ... It works, {redacted}! 2015-09-09 20:45:57 +09:00
Scott Lystig Fritchie
27891bc5e9 WIP: 'broadcast'/spam works! async reminder ticks remain! 2015-09-09 19:14:52 +09:00
Scott Lystig Fritchie
dd095f117f Derp, fix smoke_test() for machi_fitness:map_set() 2015-09-09 16:49:27 +09:00
Scott Lystig Fritchie
21015efcbb WIP: Stand back, I'm going to try CRDTs! 2015-09-08 19:13:03 +09:00
Scott Lystig Fritchie
7af863d840 Add stubs of machi_fitness server 2015-09-08 16:13:07 +09:00
Scott Lystig Fritchie
185c9eb313 WIP: add failing eunit placeholder for spam 2015-09-07 15:38:23 +09:00
Scott Lystig Fritchie
c7684f660c WIP: Friday evening/Monday morning, laying groundwork for spam "broadcast" 2015-09-07 15:20:10 +09:00
Scott Lystig Fritchie
4376ce9ec1 Remove all flap counting and inner projection stuff 2015-09-04 17:17:49 +09:00
Scott Lystig Fritchie
42aeecd9db Fix machi_projection_store_test error 2015-09-04 15:24:16 +09:00
Scott Lystig Fritchie
3c1026da28 WIP: too tired to continue tonight 2015-09-01 22:10:45 +09:00
Scott Lystig Fritchie
4378ef7b54 Bugfix: inner->outer proj @ A30 2015-09-01 00:51:46 +09:00
Scott Lystig Fritchie
e79265228e Bugfix: more correct for inner->outer sanity transition 2015-08-31 22:14:28 +09:00
Scott Lystig Fritchie
1e5d58b22d Bugfix: more to ignore in make_basic_comparison_stable() 2015-08-31 17:57:37 +09:00
Scott Lystig Fritchie
bce225a200 Bugfix: a30_make_inner_projection() ignore newprop down list if none proj 2015-08-31 17:03:12 +09:00
Scott Lystig Fritchie
a095e0cfc3 Bugfix: ignore creation_time in make_comparison_stable() 2015-08-31 15:40:19 +09:00
Scott Lystig Fritchie
c637939cc2 Bugfix: A29 should trigger if EpochID (not Epoch# alone) differs 2015-08-31 15:21:17 +09:00
Scott Lystig Fritchie
5422dc45c2 Bugfix: derp in A29 revival 2015-08-31 14:44:05 +09:00
Scott Lystig Fritchie
004c686c8c WIP: remove make_zerf() from calc_projection(); add make_zerf() to resurrected A29. Status: broken, needs work 2015-08-30 20:39:58 +09:00
Scott Lystig Fritchie
a449025e8b Bugfix: epoch handling around none proj: epoch 0 only at first bootstrap! 2015-08-30 19:53:47 +09:00
Scott Lystig Fritchie
ec2e7b5669 Sunday experiment: all-but-remove A29, feels right but definitely not sure yet 2015-08-30 16:08:14 +09:00
Scott Lystig Fritchie
0dc53274d1 Get more aggressive about AllHosed+down nodes for inner proj 2015-08-30 02:22:59 +09:00
Scott Lystig Fritchie
771164b82f Bugfix: Flapping manifesto, leaving #2: only if not me 2015-08-30 00:50:23 +09:00
Scott Lystig Fritchie
4b83893047 Bugfix: minor flap count bookkeeping error 2015-08-30 00:50:03 +09:00
Scott Lystig Fritchie
a7db3a26c6 Bugfix: a30_make_inner_projection() compatible inner if not none proj 2015-08-30 00:04:13 +09:00
Scott Lystig Fritchie
53d865b247 Bugfix: serious derp fix for A30's inner->outer 2015-08-29 23:42:47 +09:00
Scott Lystig Fritchie
5c8b255da9 Bugfix: first new CP experiments with chain len=5 2015-08-29 22:40:18 +09:00
Scott Lystig Fritchie
94394d3429 Bugfix: allow none proj to re-emerge from flapping (more)
See comments added in this commit at A40.

So far, I've been doing CP mode testing with a handful of (very useful)
network partition combinations using:

    machi_chain_manager1_converge_demo:t(3, [{private_write_verbose,true}, {consistency_mode, cp_mode}, {witnesses, [a]}]).

Next steps:

* Expand number & types of partitions
* Expand to chain lengths of 5 and beyond
2015-08-29 21:36:53 +09:00
Scott Lystig Fritchie
ee19a0856b WIP: justincase 2015-08-29 19:59:46 +09:00
Scott Lystig Fritchie
6b84cd6e6a Reduce poll sleep time when running with partition simulator 2015-08-29 18:30:53 +09:00
Scott Lystig Fritchie
dc5ae4047a Bugfix: react_to_env_A30 inner->norm fix, make_zerf() none proj derp fix 2015-08-29 18:01:13 +09:00
Scott Lystig Fritchie
c9340a662d Bugfix: force stable creation_time on inner none proj 2015-08-29 15:06:57 +09:00
Scott Lystig Fritchie
6d9526b379 Add more ?REACT() 2015-08-29 13:13:31 +09:00
Scott Lystig Fritchie
f21fcdd7be Bugfix: none proj must flap, undo previous commits, which may cause mess later 2015-08-29 13:13:23 +09:00
Scott Lystig Fritchie
af0ade9840 Bugfix: projection checksum fix in A30 2015-08-29 12:33:41 +09:00
Scott Lystig Fritchie
582f9e5eab Bugfix: fix effectively-none-projection transition to C100. Still buggy 2015-08-28 23:08:38 +09:00
Scott Lystig Fritchie
403cb5b7a6 WIP: improvements, but now flapping inner epoch keeps increasing {sigh} 2015-08-28 21:13:54 +09:00
Scott Lystig Fritchie
9edd91f48e Bugfixes for a->b column transition & flap dampening 2015-08-28 20:06:09 +09:00
Scott Lystig Fritchie
18aac6e489 WIP: undo AmFlappingNow_p condition added at commit 3dfe5c2 2015-08-28 18:39:18 +09:00
Scott Lystig Fritchie
3dfe5c2677 WIP: fix annotation history on disk 2015-08-28 18:37:11 +09:00
Scott Lystig Fritchie
8ca1ffdb13 WIP: bugfixes and lots of verbose goop added 2015-08-28 01:55:31 +09:00
Scott Lystig Fritchie
deb2cdee2c Bugfix: correct epoch number checking when inner proj 2015-08-27 22:22:15 +09:00
Scott Lystig Fritchie
93b9b948fc WIP: debugging, uff da 2015-08-27 22:02:23 +09:00
Scott Lystig Fritchie
efb89efb0d Reduce verbosity 2015-08-27 20:27:33 +09:00
Scott Lystig Fritchie
0eaa008810 Change checksum algorithm to exclude 'flap' also 2015-08-27 20:27:24 +09:00
Scott Lystig Fritchie
12b74a52fd WIP: pre-dinner paranoid checkin 2015-08-27 18:45:27 +09:00
Scott Lystig Fritchie
65cd18939c WIP: changes to annotation management 2015-08-27 17:58:43 +09:00
Scott Lystig Fritchie
8a61a85ae0 WIP: rewrite make_zerf() to use new annotation scheme 2015-08-27 16:19:22 +09:00
Scott Lystig Fritchie
28335a1310 Add CP mode unwedge. All eunit tests are passing again. 2015-08-26 18:47:39 +09:00
Scott Lystig Fritchie
9222881689 Oops, bugfixes 2015-08-26 17:51:43 +09:00
Scott Lystig Fritchie
568e165f4f Allow pstore -> FLU unwedge only in ap_mode, machi_cr_client_test broken (uses cp_mode) 2015-08-26 15:51:14 +09:00
Scott Lystig Fritchie
e8f3ab381d Add set_consistency_mode() to projection store API, use it 2015-08-26 14:57:51 +09:00
Scott Lystig Fritchie
c0ee323637 Our new unit test works, yay 2015-08-25 19:42:33 +09:00
Scott Lystig Fritchie
83f49472db WIP: intermediate refactoring 2015-08-25 19:31:05 +09:00
Scott Lystig Fritchie
6dbe887298 Remove old cruft, including hugly HTTP server hack 2015-08-25 18:49:48 +09:00
Scott Lystig Fritchie
1c5a17b708 WIP: adjust throttle of flapping 'shut up' 2015-08-25 17:01:14 +09:00
Scott Lystig Fritchie
9a86453753 WIP: half-baked idea, stopping for the night (more)
So, I'm 50% sure this is a good idea for CP mode: if there's
a later public projection than P_current, then who knows what
we might have missed.  So, call make_zerf() to find out the
absolute latest.  Problem: flapping state appears to be lost,
booo.
2015-08-24 21:54:30 +09:00
Scott Lystig Fritchie
ea61fe78bf Add flap disabler for 3 seconds after up/down change 2015-08-24 20:38:54 +09:00
Scott Lystig Fritchie
2f82fe0487 WIP: cp_mode improvements 2015-08-24 19:04:26 +09:00
Scott Lystig Fritchie
66cafe066e Remove proj_i_history, tweak AllAreFlapping_and_IamBad_and_NotRelevant_p in B10 2015-08-23 20:47:43 +09:00
Scott Lystig Fritchie
70022d11ce Add damper check for flapping of *inner* projections, whee! 2015-08-23 20:00:19 +09:00
Scott Lystig Fritchie
561e60a7ac WIP: start adding support to detect flapping of inner projections (ha!) 2015-08-23 17:50:25 +09:00
Scott Lystig Fritchie
0136fccff7 CP mode fix a30_make_inner_projection 2015-08-23 16:43:15 +09:00
Scott Lystig Fritchie
2d050ff7a6 Fix ?REACT() FSM names: a30->a40 2015-08-23 15:46:57 +09:00
Scott Lystig Fritchie
34d35fab63 Shorten the verbose output of private_write_verbose 2015-08-22 23:30:30 +09:00
Scott Lystig Fritchie
51a06844d5 Fix epoch number reuse bug when transiting C103 2015-08-22 21:40:21 +09:00
Scott Lystig Fritchie
0414da783a Fix repairs when everyone is in stable flapping state 2015-08-22 21:27:01 +09:00
Scott Lystig Fritchie
a0477d62c0 WIP: bugfix for checking latest proj's flap count 2015-08-22 14:50:10 +09:00
Scott Lystig Fritchie
0278d7254b Add A29 state for shouting circuit breaker for long long loops 2015-08-20 23:04:27 +09:00
Scott Lystig Fritchie
b46730eb2c WIP: adjust the flapping manifest: delete clause 3 2015-08-20 21:28:56 +09:00
Scott Lystig Fritchie
71decc5dc0 WIP: AP mode less bad again 2015-08-20 18:47:50 +09:00
Scott Lystig Fritchie
4e7d1f2310 WIP: egadz, a refactoring mess, but finally AP mode not sucky 2015-08-20 17:32:46 +09:00
Scott Lystig Fritchie
a71e9543fe WIP: refactoring inner handling, but ... (more)
There are a couple of weird things in the snippet below (AP mode):

    22:32:58.209 b uses inner: [{epoch,136},{author,c},{mode,ap_mode},{witnesses,[]},{upi,[b,c]},{repair,[]},{down,[a]},{flap,undefined},{d,[d_foo1,{ps,[{a,b}]},{nodes_up,[b,c]}]},{d2,[]}] (outer flap epoch 136: {flap_i,{{{epk,115},{1439,904777,11627}},28},[a,{a,problem_with,b},{b,problem_with,a}],[{a,{{{epk,126},{1439,904777,149865}},16}},{b,{{{epk,115},{1439,904777,11627}},28}},{c,{{{epk,121},{1439,904777,134392}},15}}]}) (my flap {{epk,115},{1439,904777,11627}} 29 [{a,{{{epk,126},{1439,904777,149865}},28}},{b,{{{epk,115},{1439,904777,11627}},29}},{c,{{{epk,121},{1439,904777,134392}},26}}])

    22:32:58.224 c uses inner: [{epoch,136},{author,c},{mode,ap_mode},{witnesses,[]},{upi,[b,c]},{repair,[]},{down,[a]},{flap,undefined},{d,[d_foo1,{ps,[{a,b}]},{nodes_up,[b,c]}]},{d2,[]}] (outer flap epoch 136: {flap_i,{{{epk,115},{1439,904777,11627}},28},[a,{a,problem_with,b},{b,problem_with,a}],[{a,{{{epk,126},{1439,904777,149865}},16}},{b,{{{epk,115},{1439,904777,11627}},28}},{c,{{{epk,121},{1439,904777,134392}},15}}]}) (my flap {{epk,121},{1439,904777,134392}} 28 [{a,{{{epk,126},{1439,904777,149865}},28}},{b,{{{epk,115},{1439,904777,11627}},28}},{c,{{{epk,121},{1439,904777,134392}},28}}])

    CONFIRM by epoch inner 136 <<103,64,252,...>> at [b,c] []

    Priv1 [{a,{{132,<<"Cï|ÿzKX:Á"...>>},[a],[c],[b],[],false}},
           {b,{{127,<<185,139,3,2,96,189,...>>},[b,c],[],[a],[],false}},
           {c,{{133,<<145,71,223,6,177,...>>},[b,c],[a],[],[],false}}] agree false
    Pubs: [{a,136},{b,136},{c,136}]
    DoIt,

1. Both the "uses inner" messages and also the "CONFIRM by epoch inner 136"
   show that B & C are using the same inner projection.

   However, the 'Priv1' output shows b & c on different epochs, 127 & 133.
   Weird.

2. I've added an infinite loop, probably in this commit.  :-(
2015-08-18 22:35:57 +09:00
Scott Lystig Fritchie
9bf0eedb64 WIP: add the flapping manifesto, much is muchmuch better now 2015-08-18 20:49:36 +09:00
Scott Lystig Fritchie
e9268080af Finish/catchup commit from end of last week, silly me 2015-08-17 20:14:29 +09:00
Scott Lystig Fritchie
48e82ac1a4 WIP: use digraph to calculate better AllHosed 2015-08-14 22:29:20 +09:00
Scott Lystig Fritchie
20f2bf4b92 WIP: more ?REACT() tracing 2015-08-14 22:28:50 +09:00
Scott Lystig Fritchie
d2ce8f8447 Fix repair bug that has survived witness additions, oops 2015-08-14 19:30:36 +09:00
Scott Lystig Fritchie
9e02a1ea73 Add more ?REACT() tracing 2015-08-14 19:30:05 +09:00
Scott Lystig Fritchie
5aff775383 WIP: it's ugly, but CP+witnesses is mostly working? 2015-08-14 17:05:16 +09:00
Scott Lystig Fritchie
4e66d7bd91 WIP: keep CMode propagation consistent, but still violating CP transition safety 2015-08-14 00:12:13 +09:00
Scott Lystig Fritchie
14fad2d704 End-to-end chain state checking is still broken (more)
If we use verbose output from:

    machi_chain_manager1_converge_demo:t(3, [{private_write_verbose,true}, {consistency_mode, cp_mode}, {witnesses, [a]}]).

And use:

    tail -f typescript_file | egrep --line-buffered 'SET|attempted|CONFIRM'

... then we can clearly see a chain safety violation when moving from
epoch 81 -> 83.  I need to add more smarts to the safety checking,
both at the individual transition sanity check and at the converge_demo
overall rolling sanity check.

Key to output: CONFIRM by epoch {num} {csum} at {UPI} {Repairing}

    SET # of FLUs = 3 members [a,b,c]).
    CONFIRM by epoch 1 <<96,161,96,...>> at [a,b] [c]
    CONFIRM by epoch 5 <<134,243,175,...>> at [b,c] []
    CONFIRM by epoch 7 <<207,93,225,...>> at [b,c] []
    CONFIRM by epoch 47 <<60,142,248,...>> at [b,c] []
    SET partitions = [{c,b},{c,a}] (1 of 2) at {22,3,34}
    CONFIRM by epoch 81 <<223,58,184,...>> at [a,b] []
    SET partitions = [{b,c},{b,a}] (2 of 2) at {22,3,38}
    CONFIRM by epoch 83 <<33,208,224,...>> at [a,c] []
    SET partitions = []
    CONFIRM by epoch 85 <<173,179,149,...>> at [a,c] [b]
2015-08-13 22:16:28 +09:00
Scott Lystig Fritchie
f7121f8845 Witness + flapping seems to mostly work, yay! 2015-08-13 21:24:56 +09:00
Scott Lystig Fritchie
425b9c8f60 Merge slf/projection-conditional-write branch 2015-08-13 19:10:48 +09:00
Scott Lystig Fritchie
dcbc3b45ff C110: handle proj store private write failure when conditional fails 2015-08-13 18:45:15 +09:00
Scott Lystig Fritchie
9768f3c035 Projection store private write returns bad_arg if max_public_epochid is greater 2015-08-13 18:44:25 +09:00
Scott Lystig Fritchie
58d840ef7e Minor react changes, minor fix for return val of A50 2015-08-13 18:43:41 +09:00
Scott Lystig Fritchie
d4275e5460 WIP: zerf_find_last_common() fix, eunit passes & very basic len=3 converge demo works 2015-08-13 15:41:18 +09:00
Scott Lystig Fritchie
0b8de235a9 WIP: zerf_find_last_common(), but is confused/broken by partial write @ private 2015-08-13 14:21:31 +09:00
Scott Lystig Fritchie
054397d187 WIP: find last common majority epoch 2015-08-12 17:53:39 +09:00
Scott Lystig Fritchie
d340b6a706 WIP: Duh, fix think-o in a40_latest_author_down() 2015-08-12 17:37:45 +09:00
Scott Lystig Fritchie
8e2a688526 WIP: cp_mode code from last Friday 2015-08-11 15:24:26 +09:00
Scott Lystig Fritchie
512251ac55 Adjust flap_limit constant 2015-08-07 12:29:10 +09:00
Scott Lystig Fritchie
3ca0f4491d WIP: always start chain manager with none projection 2015-08-06 19:24:14 +09:00
Scott Lystig Fritchie
0d7f6c8d7e WIP: chain transitions are now fully (?) aware of witness servers 2015-08-06 17:48:31 +09:00
Scott Lystig Fritchie
e9c4e2f98d WIP: rearrange CP mode projection calc 2015-08-06 15:22:04 +09:00
Scott Lystig Fritchie
82b6726261 Revert UPI [] -> [FirstRepairing] to commit 91496c6 2015-08-06 15:21:44 +09:00
Scott Lystig Fritchie
01da7a7046 TODO WTF was I thinking here??.... 2015-08-06 14:13:19 +09:00
Scott Lystig Fritchie
dcf532bafd WIP: Witness test expansion 2015-08-05 18:23:44 +09:00
Scott Lystig Fritchie
0f18ab8d20 Add better (?) timeout handling to machi_cr_client.erl gen_server calls 2015-08-05 17:48:06 +09:00
Scott Lystig Fritchie
e3d9ba2b83 WIP: Witness test expansion 2015-08-05 17:17:25 +09:00
Scott Lystig Fritchie
b21803a6c6 Fix witness calculation projections, part II 2015-08-05 16:05:03 +09:00
Scott Lystig Fritchie
f43a5ca96d Fix witness calculation projections, part I 2015-08-05 15:50:32 +09:00
Scott Lystig Fritchie
91496c656b Oops, fix PB stuff to add witnesses 2015-08-05 12:53:20 +09:00
Scott Lystig Fritchie
3f51357577 WIP: pre-travel code, not sure if good, check in for history 2015-07-30 13:12:08 -07:00
Scott Lystig Fritchie
aa1a31982a Add 'witnesses' to machi_projection:make_summary() 2015-07-30 13:11:43 -07:00
Scott Lystig Fritchie
6e521700bd WIP: Adding witness_smoke_test_ but it's broken (more)
So, the problem is that the chain manager isn't finishing repair
because UPI=[a], a is a witness, and a witness can't do the file listing
and other repair work that repairer FLUs need to do.

The best (?) way forward is to add some smarts to the chain manager
so that it never proposes a UPI made up entirely of witnesses?
2015-07-21 19:05:04 +09:00
Scott Lystig Fritchie
432190435e Add witness_mode to FLU 2015-07-21 17:29:33 +09:00