machi/TODO-shortterm.org
* To Do list
** DONE remove the escript* stuff from machi_util.erl
** DONE Add functions to manipulate 1-chain projections
- Add epoch ID = epoch number + checksum of projection!
  Done via the compare() func.
** DONE Change all protocol ops to add epoch ID
** DONE Add projection store to each FLU.
*** DONE What should the API look like? (borrow from chain mgr PoC?)
Yeah, I think that's pretty complete. Steal it now, worry later.
*** DONE Choose protocol & TCP port. Share with get/put? Separate?
Hrm, I like the idea of having a single TCP port to talk to any single
FLU.
To keep the protocol easy to hack on, use the same basic method as
append/write, where there's a variable-size blob, but format that blob
with term_to_binary().  Then dispatch to a single func and pattern match
on the decoded term, Erlang style (see the sketch below).
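For reference, a minimal sketch of that dispatch style, assuming a
hypothetical module name, request tuples, and stub replies rather than
the real Machi projection-store protocol:
#+BEGIN_SRC erlang
-module(proj_blob_dispatch_sketch).   % hypothetical module, not Machi source
-export([handle_blob/1]).

%% Decode the variable-size blob, then dispatch on the decoded term.
handle_blob(Blob) when is_binary(Blob) ->
    dispatch(binary_to_term(Blob)).

%% Pattern match Erlang style; these request tuples and stub replies are
%% illustrative only.
dispatch({get_latest_epoch, ProjType}) ->
    {ok, {ProjType, 0}};
dispatch({read_projection, ProjType, Epoch}) ->
    {ok, {ProjType, Epoch, <<"projection bytes go here">>}};
dispatch(Other) ->
    {error, {unknown_request, Other}}.
#+END_SRC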
*** DONE Do it.
** DONE Finish OTP'izing the Chain Manager with FLU & proj store processes
** DONE Eliminate the timeout exception for the client: just return {error, timeout}
** DONE Move prototype/chain-manager code to "top" of source tree
*** DONE Preserve current test code (leave as-is? tiny changes?)
*** DONE Make chain manager code flexible enough to run "real world" or "sim"
** DONE Add projection wedging logic to each FLU.
** DONE Implement real data repair, orchestrated by the chain manager
** DONE Change all protocol ops to enforce the epoch ID
- Add no-wedging state to make testing easier?
** DONE Adapt the projection-aware, CR-implementing client from demo-day
** DONE Add major comment sections to the CR-impl client
** DONE Simple basho_bench driver, put some unscientific chalk on the benchtop
** TODO Create parallel PULSE test for basic API plus chain manager repair
** DONE Add client-side vs. server-side checksum type, expand client API?
** TODO Add gproc and get rid of registered name rendezvous
*** TODO Fixes the atom table leak
*** TODO Fixes the problem of having an active sequencer for the same prefix on two FLUs in the same VM (see the sketch below)
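A minimal sketch of the gproc idea, assuming a hypothetical key layout
({sequencer, Prefix}) and hypothetical module/function names; only
gproc:reg/1 and gproc:where/1 are real gproc calls:
#+BEGIN_SRC erlang
-module(sequencer_gproc_sketch).             % hypothetical module name
-export([register_sequencer/1, whereis_sequencer/1]).

%% Run inside the sequencer process at startup.  The tuple key avoids
%% creating a new atom per prefix, and gproc:reg/1 fails if another
%% process in this VM already registered the same key, so at most one
%% active sequencer per key can exist in the VM.
register_sequencer(Prefix) ->
    Key = {n, l, {sequencer, Prefix}},
    true = gproc:reg(Key),
    Key.

%% Look up the sequencer pid (or 'undefined') without creating atoms.
whereis_sequencer(Prefix) ->
    gproc:where({n, l, {sequencer, Prefix}}).
#+END_SRC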
** TODO Fix all known bugs/cruft with Chain Manager (list below)
*** DONE Fix known bugs
*** DONE Clean up crufty TODO comments and other obvious cruft
*** TODO Re-add verification step of stable epochs, including inner projections!
*** TODO Attempt to remove cruft items in flapping_i?
** TODO Move the FLU server to gen_server behavior?
* DONE Chain manager CP mode, Plan B
** SKIP Maybe change ch_mgr to use a middleworker?
*** DONE Is it worthwhile? Is the parallelism that important? Probably not.
*** SKIP Move the middleworker func to a utility module?
** DONE Add new proc to psup group
*** DONE Name: machi_fitness
** DONE ch_mgr keeps its current proc struct: i.e. same 1 proc as today
** NO ch_mgr asks hosed mgr for hosed list at start of react_to_env
** DONE For all hosed, do *async*: try to read latest proj.
*** NO If OK, inform hosed mgr: status change will be used by next HC iter.
*** NO If fail, no change, because that server is already known to be hosed
*** DONE For all non-hosed, continue as the chain manager code does today
*** DONE Any new errors are added to UpNodes/DownNodes tracking as used today
*** DONE At end of react loop, if UpNodes list differs, inform hosed mgr.
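A minimal sketch of that end-of-react-loop check, assuming a map-based
state and hypothetical module/function names (the real chain manager
keeps its own state record and its own way of informing the hosed mgr):
#+BEGIN_SRC erlang
-module(ch_mgr_upnodes_sketch).              % hypothetical module name
-export([maybe_inform_fitness/2]).

%% Called once at the end of each react_to_env iteration with the
%% UpNodes list computed during that iteration; only an actual change
%% triggers a message to the fitness/hosed manager.
maybe_inform_fitness(UpNodes, #{last_up_nodes := LastUp} = S) ->
    Sorted = lists:sort(UpNodes),
    case Sorted =:= LastUp of
        true  -> S;                          % no change, stay quiet
        false ->
            ok = inform_fitness_server(Sorted),
            S#{last_up_nodes := Sorted}
    end.

inform_fitness_server(_UpNodes) ->
    ok.                                      % placeholder for the real "inform hosed mgr" call
#+END_SRC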
* DONE fitness_mon, the fitness monitor
** DONE Map key & val sketch
Logical sketch:

  Map key: ObservingServerName::atom()
  Map val: { ObservingServerLastModTime::now(),
             UnfitList::list(ServerName::atom()),
             AdminDownList::list(ServerName::atom()),
             Props::proplist() }

Implementation sketch:

  1. Use a CRDT map.
  2. If the map key cannot be an atom, then atom->string or atom->binary is fine.
  3. For the map value, can we use a CRDT LWW (last-write-wins) type?
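A plain Erlang transcription of the logical sketch above; the module and
type names are illustrative, not taken from the Machi source, and the
CRDT wrapper is omitted:
#+BEGIN_SRC erlang
-module(fitness_map_types_sketch).           % hypothetical module name
-export_type([fitness_map/0, fitness_val/0, server_name/0]).

-type server_name() :: atom().

%% {ObservingServerLastModTime, UnfitList, AdminDownList, Props}
-type fitness_val() :: {erlang:timestamp(),
                        [server_name()],
                        [server_name()],
                        proplists:proplist()}.

%% One entry per ObservingServerName.
-type fitness_map() :: #{server_name() => fitness_val()}.
#+END_SRC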
** DONE Investigate riak_dt data structure definition, manipulating, etc.
** DONE Add dependency on riak_dt
** DONE Update is an entire dict from Observer O
*** DONE Merge my pending map + update map + my last mod time + my unfit list
*** DONE If merged /= pending:
**** DONE Schedule async tick (more)
The tick message contains the list of servers whose state differs as of
this instant; we want to avoid triggering fitness/unfitness decisions
for other servers for which we have waited less than a full time
period.  (A sketch of this whole update/merge/tick flow appears at the
end of this file.)
**** DONE Spam merged map to All_list -- [Me]
**** DONE Set pending <- merged
*** DONE When we receive an async tick
**** DONE Set active map <- pending map for all servers in the tick's list
**** DONE Send ch_mgr a react_to_env tick trigger
*** DONE react_to_env tick trigger actions
**** DONE Filter active map to remove stale entries (i.e. no update in 1 hour)
**** DONE If time since last map spam is too long, spam our *pending* map
**** DONE Proceed with normal react processing, using *active* map for AllHosed!
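A condensed sketch of the update/merge/tick flow above.  It uses plain
Erlang maps with last-write-wins merging on ObservingServerLastModTime
instead of the riak_dt CRDT map, and every module, function, and
state-field name below is illustrative rather than the real
machi_fitness code:
#+BEGIN_SRC erlang
-module(fitness_flow_sketch).                % hypothetical module name
-export([merge_maps/2, handle_update/2, handle_adopt_tick/2]).

-define(ADOPT_DELAY_MS, 2000).               % assumed "one full time period" delay

%% Merge an incoming update map into our pending map; per observing
%% server, the entry with the newest last-mod time wins (LWW).
merge_maps(Pending, Update) ->
    maps:fold(
      fun(Server, {T_new, _, _, _} = NewVal, Acc) ->
              case maps:find(Server, Acc) of
                  {ok, {T_old, _, _, _}} when T_old >= T_new -> Acc;
                  _ -> maps:put(Server, NewVal, Acc)
              end
      end, Pending, Update).

%% "Update is an entire dict from Observer O": merge it; if anything
%% changed, schedule an async adopt tick for the changed servers, spam
%% the merged map to the other servers (All_list -- [Me]), and set
%% pending <- merged.
handle_update(UpdateMap, #{pending := Pending, peers := Peers} = S0) ->
    Merged = merge_maps(Pending, UpdateMap),
    case Merged =:= Pending of
        true  -> S0;
        false ->
            Changed = [Srv || Srv <- maps:keys(Merged),
                              maps:find(Srv, Pending) =/= maps:find(Srv, Merged)],
            erlang:send_after(?ADOPT_DELAY_MS, self(), {adopt_tick, Changed}),
            _ = [Peer ! {fitness_update, Merged} || Peer <- Peers],
            S0#{pending := Merged}
    end.

%% On the adopt tick: copy pending -> active for the servers named in
%% the tick, then poke the chain manager to run react_to_env; the active
%% map is what feeds AllHosed.
handle_adopt_tick(Servers, #{pending := P, active := A0} = S0) ->
    A1 = lists:foldl(fun(Srv, Acc) -> Acc#{Srv => maps:get(Srv, P)} end,
                     A0, Servers),
    trigger_react_to_env(),
    S0#{active := A1}.

trigger_react_to_env() ->
    ok.                                      % placeholder for the real ch_mgr tick trigger
#+END_SRC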