To Do list
DONE remove the escript* stuff from machi_util.erl
DONE Add functions to manipulate 1-chain projections
- Add epoch ID = epoch number + checksum of projection! Done via compare() func.
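A minimal sketch of that idea, assuming an epoch ID is the pair {EpochNumber, ProjectionChecksum} (the compare() below is illustrative, not necessarily the real function):

    %% Assumed shape: an epoch ID pairs the epoch number with a checksum
    %% of the projection itself, so two projections with the same number
    %% but different contents still compare as different.
    -type epoch_id() :: {non_neg_integer(), binary()}.

    %% Hypothetical compare(): -1 | 0 | 1, ordered by number, then checksum.
    compare({N1, C1}, {N2, C2}) ->
        if {N1, C1} < {N2, C2} -> -1;
           {N1, C1} > {N2, C2} ->  1;
           true                ->  0
        end.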
DONE Change all protocol ops to add epoch ID
DONE Add projection store to each FLU.
DONE What should the API look like? (borrow from chain mgr PoC?)
Yeah, I think that's pretty complete. Steal it now, worry later.
DONE Choose protocol & TCP port. Share with get/put? Separate?
Hrm, I like the idea of having a single TCP port to talk to any single FLU.
To make the protocol "easy" to hack, how about using the same basic method as append/write, where there's a variable-size blob? We'll format that blob as a term_to_binary(), then dispatch to a single func and pattern match Erlang-style in that func.
DONE Do it.
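A minimal sketch of that blob-dispatch idea (the request tuples and do_* helpers here are hypothetical stand-ins, not Machi's actual wire protocol):

    %% One TCP port, one variable-size blob per request. The blob is a
    %% term_to_binary()-encoded tuple; decode it once, then pattern match
    %% on the request shape in a single dispatch function.
    handle_blob(Blob) when is_binary(Blob) ->
        %% 'safe' avoids creating new atoms from untrusted input.
        dispatch(binary_to_term(Blob, [safe])).

    dispatch({get_latest_epoch, ProjType})       -> do_get_latest_epoch(ProjType);
    dispatch({read_projection, ProjType, Epoch}) -> do_read_projection(ProjType, Epoch);
    dispatch({write_projection, ProjType, Proj}) -> do_write_projection(ProjType, Proj);
    dispatch(_Unknown)                           -> {error, bad_request}.

    %% Stubs standing in for the real projection store operations:
    do_get_latest_epoch(_ProjType)        -> {ok, {0, <<>>}}.
    do_read_projection(_ProjType, _Epoch) -> {error, not_written}.
    do_write_projection(_ProjType, _Proj) -> ok.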
DONE Finish OTP'izing the Chain Manager with FLU & proj store processes
DONE Eliminate the timeout exception for the client: just {error,timeout} ret
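That change amounts to wrapping the call, roughly like this (a sketch, not the actual client code):

    %% Instead of letting gen_server:call/3 exit on timeout, catch the
    %% exit and hand the caller a plain {error, timeout} return value.
    call(Server, Req, TimeoutMs) ->
        try
            gen_server:call(Server, Req, TimeoutMs)
        catch
            exit:{timeout, _} -> {error, timeout}
        end.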
DONE Move prototype/chain-manager code to "top" of source tree
DONE Preserve current test code (leave as-is? tiny changes?)
DONE Make chain manager code flexible enough to run "real world" or "sim"
DONE Add projection wedging logic to each FLU.
DONE Implement real data repair, orchestrated by the chain manager
DONE Change all protocol ops to enforce the epoch ID
- Add no-wedging state to make testing easier?
DONE Adapt the projection-aware, CR-implementing client from demo-day
DONE Add major comment sections to the CR-impl client
DONE Simple basho_bench driver, put some unscientific chalk on the benchtop
TODO Create parallel PULSE test for basic API plus chain manager repair
DONE Add client-side vs. server-side checksum type, expand client API?
TODO Add gproc and get rid of registered name rendezvous
TODO Fixes the atom table leak
TODO Fixes the problem of having an active sequencer for the same prefix
     on two FLUs in the same VM (sketch below)
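A sketch of the gproc idea (the key shape is an assumption): register under a structured term instead of a dynamically created atom, so no atom is interned per prefix, and the key can include the FLU name so two FLUs in one VM don't collide.

    start_sequencer(FluName, Prefix) ->
        Key = {n, l, {sequencer, FluName, Prefix}},  %% n = unique name, l = local
        true = gproc:reg(Key),   %% fails if this key is already registered
        ok.

    lookup_sequencer(FluName, Prefix) ->
        %% returns pid() | undefined
        gproc:where({n, l, {sequencer, FluName, Prefix}}).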
TODO Fix all known bugs/cruft with Chain Manager (list below)
DONE Fix known bugs
DONE Clean up crufty TODO comments and other obvious cruft
TODO Re-add verification step of stable epochs, including inner projections!
TODO Attempt to remove cruft items in flapping_i?
TODO Move the FLU server to gen_server behavior?
DONE Chain manager CP mode, Plan B
SKIP Maybe? Change ch_mgr to use middleworker
DONE Is it worthwhile? Is the parallelism so important? No, probably not.
SKIP Move middleworker func to utility module?
DONE Add new proc to psup group
DONE Name: machi_fitness
DONE ch_mgr keeps its current proc struct: i.e. same 1 proc as today
NO chmgr asks hosed mgr for hosed list @ start of react_to_env
DONE For all hosed, do async: try to read latest proj.
NO If OK, inform hosed mgr: status change will be used by next HC iter.
NO If fail, no change, because that server is already known to be hosed
DONE For all non-hosed, continue as the chain manager code does today
DONE Any new errors are added to UpNodes/DownNodes tracking as used today
DONE At end of react loop, if UpNodes list differs, inform hosed mgr.
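A sketch of that end-of-loop step (the machi_fitness function name and the state fields here are guesses, not the real module's exports):

    -record(state, {last_up=[], fitness}).

    maybe_inform_fitness(UpNodes, #state{last_up=LastUp, fitness=FitSvr}=S) ->
        case lists:sort(UpNodes) =:= lists:sort(LastUp) of
            true  -> S;   %% no change, nothing to report
            false ->
                %% hypothetical API: tell the fitness monitor our new view
                ok = machi_fitness:update_local_view(FitSvr, UpNodes),
                S#state{last_up=UpNodes}
        end.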
DONE fitness_mon, the fitness monitor
DONE Map key & val sketch
Logical sketch:
Map key: ObservingServerName::atom()
Map val: { ObservingServerLastModTime :: now(),
           UnfitList :: list(ServerName :: atom()),
           AdminDownList :: list(ServerName :: atom()),
           Props :: proplist() }
Implementation sketch:
- Use CRDT map.
- If map key is not atom, then atom->string or atom->binary is fine.
- For the map value, could we use a CRDT LWW type? (sketch below)
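A sketch of that layout using riak_dt (field shape and value layout are assumptions, not the final machi_fitness schema): one LWW register per observer, so each observer's whole observation tuple is last-writer-wins.

    new_map() ->
        riak_dt_map:new().

    update_observation(Map, Observer, UnfitList, AdminDownList, Props) ->
        Field = {atom_to_binary(Observer, utf8), riak_dt_lwwreg},  %% atom->binary key
        Val   = {os:timestamp(), UnfitList, AdminDownList, Props},
        Op    = {update, [{update, Field, {assign, Val, now_usec()}}]},
        {ok, NewMap} = riak_dt_map:update(Op, Observer, Map),
        NewMap.

    now_usec() ->   %% wall-clock microseconds: a total order for LWW
        {Mega, Sec, Micro} = os:timestamp(),
        (Mega * 1000000 + Sec) * 1000000 + Micro.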
DONE Investigate riak_dt data structure definition, manipulating, etc.
DONE Add dependency on riak_dt
DONE Update is an entire dict from Observer O
DONE Merge my pending map + update map + my last mod time + my unfit list
DONE if merged /= pending:
DONE Schedule async tick (more)
The tick message contains the list of servers with differing state as of this instant in time… we want to avoid triggering decisions about fitness/unfitness for other servers when we might have waited less than a full time period. (Sketch below.)
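A sketch of the merge-and-tick step (changed_servers/2 and tick_interval/0 are hypothetical helpers):

    -record(state, {pending}).   %% pending :: the riak_dt_map accumulated so far

    handle_update(UpdateMap, #state{pending=Pending}=S) ->
        Merged = riak_dt_map:merge(Pending, UpdateMap),
        case riak_dt_map:equal(Merged, Pending) of
            true  -> S;   %% merged == pending: no change, no tick
            false ->
                %% Name only the servers whose state really differed, so the
                %% tick handler can wait out a full period per server.
                Changed = changed_servers(Pending, Merged),
                erlang:send_after(tick_interval(), self(),
                                  {adjust_fitness, Changed}),
                S#state{pending=Merged}
        end.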