Commit graph

481 commits

Author SHA1 Message Date
Scott Lystig Fritchie
1c8e436a64 Fix race #3 2015-10-21 15:01:11 +09:00
UENISHI Kota
a43397a7b8 Update to review comments 2015-10-21 10:58:00 +09:00
UENISHI Kota
ebb9bc3f5a Allow reading multiple chunks at once
* When repairing multiple chunks at once, if any single repair
  fails, the whole read request and repair work fail
* Rename read_repair3 and read_repair4 to do_repair_chunks and
  do_repair_chunk in machi_file_proxy
* This pull request changes the return semantics of read_chunk(), which
  now returns any chunk included in the requested range
* The first and last chunks may be cut to fit the requested range
* In machi_file_proxy, unwritten_bytes is removed and replaced by
  machi_csum_table
2015-10-20 17:59:09 +09:00
Scott Lystig Fritchie
6f9814ffb4 Merge ss/deps-for-debugging (with rebar.config conflict fix) 2015-10-19 16:41:03 +09:00
UENISHI Kota
3e975f53b8 Allow read_chunk() to return partial chunks
This is simply a change to the read_chunk() protocol: a response of
read_chunk() becomes a list of written bytes along with the checksum. All
related code, including repair, is changed accordingly. This change only
keeps all tests passing; it does not actually support partial chunks.
2015-10-19 15:37:17 +09:00
Shunichi Shinohara
208c02853f Add cluster_info to deps and small callback module
For debugging from the shell, some functions in machi_cinfo are exported:

- public_projection/1
- private_projection/1
- fitness/1
- chain_manager/1
- flu1/1
2015-10-19 15:36:05 +09:00
UENISHI Kota
cb67764273 Merge pull request #12 from basho/slf/packaging1
Slf/packaging1
2015-10-16 17:56:27 +09:00
Scott Lystig Fritchie
00ac0f4cd3 Reduce compiler warnings and verbose output that clutters eunit test output 2015-10-16 17:41:01 +09:00
Scott Lystig Fritchie
299016cafb FLU startup via app.config 2015-10-16 16:28:46 +09:00
UENISHI Kota
6f790527f5 Follow with missing tests and related fix 2015-10-16 10:10:05 +09:00
UENISHI Kota
e45469b5ce Move checksum file related code to machi_csum_table 2015-10-15 11:28:40 +09:00
Mark Allen
baeffbab0b Merge pull request #6 from basho/mra/write-once-clean
Integrate write once invariant into current FLU implementation
2015-10-14 10:15:57 -05:00
Scott Lystig Fritchie
e344ee42ff Remove stale TODO comment about write-once enforcement 2015-10-14 16:56:51 +09:00
Scott Lystig Fritchie
d6a3180ecd Use pattern matching instead of length() BIF 2015-10-14 16:52:03 +09:00
UENISHI Kota
07ceff095a Fix gen_server style return value 2015-10-14 16:22:11 +09:00
Scott Lystig Fritchie
8eb9cc9700 Fix "HEY, machi_pb_translate:852 got {error,bad_csum}" errors
s/bad_csum/bad_checksum/ as needed in machi_file_proxy.erl
2015-10-14 14:26:46 +09:00
Scott Lystig Fritchie
ed112bfb52 Argument fix for read_chunk() when write_chunk() says 'written' 2015-10-14 14:16:51 +09:00
Scott Lystig Fritchie
6dbf52db6f Remove some debugging verbosity 2015-10-14 12:50:10 +09:00
UENISHI Kota
1b612bd969 Fix typo in comment 2015-10-14 12:40:56 +09:00
Mark Allen
fe71b72494 Add filename parse and validation functions 2015-10-13 21:12:14 -05:00
Mark Allen
f8707c61c0 Choose new filename when epoch changes
The filename manager needs to choose a new file name
for a prefix when the epoch number changes. This helps
ensure safety of file merges across the cluster.
(Prevents conflicts across divergent cluster members.)
2015-10-13 21:09:31 -05:00
Mark Allen
161e6cd9f9 Pass epoch id to append operations
Needed to handle a filename change when epoch changes.
2015-10-13 21:08:48 -05:00
Mark Allen
85e1e5a26d Handle {error, bad_arg} on read 2015-10-13 21:08:24 -05:00
UENISHI Kota
e113f6ffdd Reach the trim stub to CR client 2015-10-13 17:25:59 +09:00
UENISHI Kota
dfe953b7d8 Add surface of trim to scrub 2015-10-13 17:14:44 +09:00
Scott Lystig Fritchie
777909b0f5 TODO MARK todo comment and bugfix for machi_cr_client_test 2015-10-12 15:30:37 +09:00
Mark Allen
289b2bcc7c Debug WIP 2015-10-11 23:04:29 -05:00
Mark Allen
c1b9038447 The return value of ets is generally 'true' 2015-10-08 15:47:11 -05:00
Mark Allen
aca3759e45 Bug fixes found during testing runs 2015-10-08 15:46:40 -05:00
Mark Allen
1ecbb5cffe Fixed order of start_link parameters 2015-10-08 15:45:04 -05:00
Mark Allen
303aad97e9 Use {error, bad_checksum} directly
We previously copied {error, bad_csum} as it was used in the main
FLU code.  The protobufs stuff expects the full atom bad_checksum
though.
2015-10-08 15:43:54 -05:00
Scott Lystig Fritchie
952d2fa508 Change flag_checksum -> flag_no_checksum for consistency 2015-10-08 20:41:59 +09:00
Mark Allen
679046600f Merge remote-tracking branch 'origin/bug/from-bp-request-error' into mra/write-once-clean 2015-10-07 23:02:03 -05:00
Scott Lystig Fritchie
796937fe75 Add LL generic error PB response decoding 2015-10-08 12:33:55 +09:00
Scott Lystig Fritchie
0054445f13 Delete spammy message from fitness servers every 5 seconds 2015-10-07 18:52:24 +09:00
Mark Allen
d627f238bf Cache generated names until disk files are written 2015-10-06 22:44:31 -05:00
Mark Allen
f83b0973f2 Have to call filename mgr with FluName 2015-10-06 22:43:19 -05:00
Mark Allen
7a6999465a Make sure we use '^' as filename separators 2015-10-06 22:02:31 -05:00
Mark Allen
2d0c03ef35 Integration with current FLU implementation 2015-10-05 22:18:29 -05:00
Mark Allen
36c11e7d08 Add a metadata manager supervisor 2015-10-05 16:37:53 -05:00
Mark Allen
d3fe7ee181 Pull write-once files over to clean branch
I am treating the original write-once branch as a prototype
which I am now throwing away. I had too much work interleaved
in there, so I felt like the best thing to do would be to cut
a new clean branch and pull the files over and start over
against a recent-ish master.

We will have to refactor the other things in FLU in a more
piecemeal fashion.
2015-10-02 16:29:09 -05:00
Scott Lystig Fritchie
6d5b61f747 Tweaks to sleep_ranked_order() call in C200 2015-09-21 21:47:25 +09:00
Scott Lystig Fritchie
5eecb2b935 Change to P_current_calc epoch @ C100 2015-09-21 21:44:03 +09:00
Scott Lystig Fritchie
340af05f0f WIP: server-side of CP mode repairing-as-witness 2015-09-21 21:44:03 +09:00
Scott Lystig Fritchie
d9b9397e75 Avoid some projection churn in C100's sanity check 2015-09-21 21:44:03 +09:00
Scott Lystig Fritchie
5010d03677 Call manage_last_down_list() at C220 and C310 2015-09-21 15:36:54 +09:00
Scott Lystig Fritchie
69a304102e Write public proj in all_members order only 2015-09-21 15:09:16 +09:00
Scott Lystig Fritchie
6b4ed1c061 Verbose debugging cruft 2015-09-19 14:25:07 +09:00
Scott Lystig Fritchie
72bfa163ba Small test bugfixes & verbose/debugging cruft 2015-09-19 14:16:54 +09:00
Scott Lystig Fritchie
d695f30e4f Avoid using host/port combo for machi_fitness (ab)use of machi_projection 2015-09-17 16:43:08 +09:00
Scott Lystig Fritchie
09ae2db0ba Bugfix: double-check local private projection write with a read 2015-09-16 16:31:10 +09:00
Scott Lystig Fritchie
79b1d156c4 Add backlog option to gen_tcp:listen 2015-09-16 13:52:36 +09:00
Scott Lystig Fritchie
778bd015ee Bugfix: pattern matching error in C110 2015-09-16 12:41:53 +09:00
Scott Lystig Fritchie
d3b116bd9e Bugfix: CP mode: ignore P_latest if it has UPI or down server in my down list 2015-09-15 17:55:18 +09:00
Scott Lystig Fritchie
75c94420e0 Add test_ets_table to give programmatic slowdown 2015-09-14 22:52:41 +09:00
Scott Lystig Fritchie
7bf1132142 Bugfix: IsRelevantToMe_p adjustment for P_latest.upi == [] 2015-09-14 17:28:50 +09:00
Scott Lystig Fritchie
b4f8bc8058 Add pretty_time(). Add CONFIRM verbose logging for none proj 2015-09-14 17:00:09 +09:00
Scott Lystig Fritchie
4e11cdd50f Bugfix: derp, pattern match for UniqueHistoryTrigger_p 2015-09-14 16:59:58 +09:00
Scott Lystig Fritchie
a036f119a6 Add send_spam_to_everyone(), add 1% chance of using it 2015-09-14 16:01:26 +09:00
Scott Lystig Fritchie
6c543dfc18 Re-use the flapping criteria for a different use (more)
Hooray, very early I ended up with a simulator example which kicked
in and tested this change.  (A deterministic fault injection method
for testing would also be valuable, probably.)

    machi_chain_manager1_converge_demo:t(7, [{private_write_verbose,true}]).

We switched partitions in the simulator like this:

    SET partitions = [{b,f},{c,f},{d,e},{f,e}] (2 of 90252) at {14,37,5}
    ...
    Stable projection at epoch 1429 upi=[b,c,g,a,d],repairing=[]
    ...
    SET partitions = [{b,d},{c,b},{d,c},{f,a}] (3 of 90252) at {14,37,44}

Part of the chain reassembled quickly from the following UPIs: [g], then
[g,e], then [g,e,f] via a series of successful simulated repairs.  For
the first two repairs, all parties (e & f & g) are unanimous about the
projections.  For the final repair, very strange, not all three adopt
[g,e,f] chain: e says nothing, f & g use it.

Also weird, then g immediately moves f!  upi=[g,e],repairing=[f].
Then e also adopts this chain of 2.  From that point forward, f keeps
trying to use upi=[g,e,f],[] and the others try using only upi=[g,e],[f].
There are lots of messages from g saying that it's insane (correctly!)
to try calc=1487:[g,e],[f] -> 1494:[g,e,f],[] without a valid repair
author.

It's worth checking why g dropped from [g,e,f] -> [g,e].  But even
still, this new use for the flapping counter & reset via C103 is
working.  ... Ah, now I understand.  The very occasional undefined
socket bug in machi_flu1_client appears to be the cause: g had a
one-time problem talking with f and so decided f was down long enough to
make the shorter UPI.  The other participants didn't have any such
problem with f and so kept f in the UPI.  This would have been a
deadlock/infinite loop case without someone deciding to reset state.
2015-09-14 15:41:48 +09:00
Scott Lystig Fritchie
23554ffccc Handle timeout/partition failures in C110 2015-09-14 13:54:47 +09:00
Scott Lystig Fritchie
fdf78bdbbc Tweak IsRelevantToMe_p in B10 (more)
Last night we hit a rare case of failed convergence.

f was out of sync with the rest of the world.
f: upi=[b,g,f] repairing=[a,c]
The "rest of the world" used a larger chain at:
*: upi=[c,b,g,a], repairing=[f]

And f refused to join the larger chain because of the way that
IsRelevantToMe_p was being calculated before this commit.

Hrrrm, though, I'm not convinced that this particular problem
is fixed 100% by this patch.  What if the chain lengths were
the same but also UPI incompatible?  e.g. if I remove 'a' from
the "real world (in the partition simulator)" example above:

f: upi=[b,g,f] repairing=[c]
*: upi=[c,b,g], repairing=[f]

Hrmmmmm, I may need to reintroduce the my-recent-adopted-projection-
flapping-like-counter thingie to try to break this kind of
incompatible deadlock.
2015-09-14 13:40:34 +09:00
Scott Lystig Fritchie
62186395ed Hooray! The weekend's CP work hasn't broken AP, I believe. 2015-09-14 00:04:53 +09:00
Scott Lystig Fritchie
f5901c6cd3 Hey, appears to work for CP mode chain len=3, hooray! 2015-09-13 21:51:20 +09:00
Scott Lystig Fritchie
89f57616a8 Avoid some churn when both latest & newprop are none proj 2015-09-13 17:44:23 +09:00
Scott Lystig Fritchie
f3a0ee91cf WIP: thread P_calc_current all the way to C100 for CP mode assist 2015-09-13 15:58:45 +09:00
Scott Lystig Fritchie
0a20417682 Adjustments for CP mode (still slightly experimental) 2015-09-13 14:56:28 +09:00
Scott Lystig Fritchie
32c4d39156 Bugfix: set consistency_mode at set_chain_members 2015-09-13 14:16:02 +09:00
Scott Lystig Fritchie
b3ce9f9ab8 A bit less verbose output 2015-09-11 23:08:47 +09:00
Scott Lystig Fritchie
5efec1b6cd Add upi_unanimous annotation to AP mode 2015-09-11 21:47:05 +09:00
Scott Lystig Fritchie
fe8ff6033d Make better state transition choices in AP mode 2015-09-11 19:14:41 +09:00
Scott Lystig Fritchie
a0c129c16d Bugfix: wow, a chain state transition sanity check bug 2015-09-11 17:32:52 +09:00
Scott Lystig Fritchie
8df7d58365 Add partition simulator support to fitness service 2015-09-11 16:45:29 +09:00
Scott Lystig Fritchie
efe6ce7894 WIP: small refactoring to prepare for fitness server 'use' of partition simulator 2015-09-11 16:03:49 +09:00
Scott Lystig Fritchie
35e8efeb96 Add timer:sleep() to accommodate machi_chain_manager1_converge_demo 2015-09-11 15:56:02 +09:00
Scott Lystig Fritchie
bbf925d132 Add fault injection method via C100 to test C103 admin down cycle 2015-09-10 18:05:55 +09:00
Scott Lystig Fritchie
41737ae62a Add delete_admin_down API implementation, oops! 2015-09-10 18:05:18 +09:00
Scott Lystig Fritchie
d45c249e89 Add admin down status API to fitness server 2015-09-10 17:30:11 +09:00
Scott Lystig Fritchie
c14b9ce50f Minor cleanup, add more partitions to converge demo 2015-09-10 16:39:15 +09:00
Scott Lystig Fritchie
af94d1c1c3 Bugfix: ExpectedUPI error in A40 2015-09-10 02:15:49 +09:00
Scott Lystig Fritchie
daf3a3d65a Remove some verbose debugging cruft 2015-09-10 01:47:46 +09:00
Scott Lystig Fritchie
329a5e0682 Bugfix: damn, no idea how many problems this 5 month old bug caused 2015-09-10 01:33:55 +09:00
Scott Lystig Fritchie
5943494d54 Add ExpectedUPI to A40's AmHosedP clause 2015-09-10 00:43:37 +09:00
Scott Lystig Fritchie
10c655ebfe WIP: fix one source of problems, now shift back to 'TODO this clause needs more review' 2015-09-09 23:59:40 +09:00
Scott Lystig Fritchie
b7aa33c617 Yeah, nearly there. AP fails occasionally in multiple-asymmetric-partition sequence 2015-09-09 23:10:39 +09:00
Scott Lystig Fritchie
72141c8ecb WIP: split A30 into A30/A31 based on AllHosed 2015-09-09 21:06:40 +09:00
Scott Lystig Fritchie
5029911b52 WIP: remove verbose goop 2015-09-09 20:46:52 +09:00
Scott Lystig Fritchie
38ea36fc1c WIP: Stand back, I'm going to try math! ... It works, {redacted}! 2015-09-09 20:45:57 +09:00
Scott Lystig Fritchie
27891bc5e9 WIP: 'broadcast'/spam works! async reminder ticks remain! 2015-09-09 19:14:52 +09:00
Scott Lystig Fritchie
dd095f117f Derp, fix smoke_test() for machi_fitness:map_set() 2015-09-09 16:49:27 +09:00
Scott Lystig Fritchie
21015efcbb WIP: Stand back, I'm going to try CRDTs! 2015-09-08 19:13:03 +09:00
Scott Lystig Fritchie
7af863d840 Add stubs of machi_fitness server 2015-09-08 16:13:07 +09:00
Scott Lystig Fritchie
185c9eb313 WIP: add failing eunit placeholder for spam 2015-09-07 15:38:23 +09:00
Scott Lystig Fritchie
c7684f660c WIP: Friday evening/Monday morning, laying groundwork for spam "broadcast" 2015-09-07 15:20:10 +09:00
Scott Lystig Fritchie
4376ce9ec1 Remove all flap counting and inner projection stuff 2015-09-04 17:17:49 +09:00
Scott Lystig Fritchie
42aeecd9db Fix machi_projection_store_test error 2015-09-04 15:24:16 +09:00
Scott Lystig Fritchie
3c1026da28 WIP: too tired to continue tonight 2015-09-01 22:10:45 +09:00
Scott Lystig Fritchie
4378ef7b54 Bugfix: inner->outer proj @ A30 2015-09-01 00:51:46 +09:00
Scott Lystig Fritchie
e79265228e Bugfix: more correct for inner->outer sanity transition 2015-08-31 22:14:28 +09:00
Scott Lystig Fritchie
1e5d58b22d Bugfix: more to ignore in make_basic_comparison_stable() 2015-08-31 17:57:37 +09:00