
To Do list

DONE remove the escript* stuff from machi_util.erl

DONE Add functions to manipulate 1-chain projections

  • Add epoch ID = epoch number + checksum of projection! Done via compare() func.

DONE Change all protocol ops to add epoch ID

DONE Add projection store to each FLU.

DONE What should the API look like? (borrow from chain mgr PoC?)

Yeah, I think that's pretty complete. Steal it now, worry later.

DONE Choose protocol & TCP port. Share with get/put? Separate?

Hrm, I like the idea of having a single TCP port to talk to any single FLU.

To make the protocol "easy" to hack, how about using the same basic method as append/write, where there's a variable-size blob? We'll format that blob with term_to_binary(), then dispatch to a single func and pattern match Erlang-style in that func.
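
A rough sketch of that shape (not the actual Machi wire format): it assumes a hypothetical {packet, 4} length-prefixed socket so gen_tcp handles the variable-size framing, and the operation names and module name here are made up.

#+BEGIN_SRC erlang
%% Sketch only: illustrative names, not the real Machi protocol.
-module(proj_proto_sketch).
-export([send_proj_request/2, handle_proj_blob/1]).

%% Client side: encode the whole request as one term_to_binary() blob.
%% Assumes the socket was opened with {packet, 4}, so gen_tcp adds/strips
%% the length prefix, and that it is in passive mode for recv/2.
send_proj_request(Sock, Req) ->
    ok = gen_tcp:send(Sock, term_to_binary(Req)),
    {ok, ReplyBin} = gen_tcp:recv(Sock, 0),
    binary_to_term(ReplyBin).

%% Server side: decode the blob, dispatch to a single function, and pattern
%% match Erlang-style on the request tuple.
handle_proj_blob(Blob) ->
    term_to_binary(handle_proj_request(binary_to_term(Blob))).

handle_proj_request({list_all_projections, ProjType}) ->
    {ok, ProjType, []};                        % placeholder reply
handle_proj_request({read_projection, _ProjType, _Epoch}) ->
    {error, not_written};                      % placeholder reply
handle_proj_request(_Other) ->
    {error, bad_request}.
#+END_SRC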

DONE Do it.

DONE Finish OTP'izing the Chain Manager with FLU & proj store processes

DONE Eliminate the timeout exception for the client: just {error,timeout} ret
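
A minimal sketch of that change, assuming the client wraps a gen_server:call/3 somewhere; the module and function names are made up.

#+BEGIN_SRC erlang
-module(client_timeout_sketch).
-export([call_with_timeout/3]).

%% Catch the exit that gen_server:call/3 raises on timeout and hand the
%% caller a plain {error, timeout} tuple instead of an exception.
call_with_timeout(Pid, Request, TimeoutMs) ->
    try
        gen_server:call(Pid, Request, TimeoutMs)
    catch
        exit:{timeout, _} ->
            {error, timeout}
    end.
#+END_SRC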

DONE Move prototype/chain-manager code to "top" of source tree

DONE Preserve current test code (leave as-is? tiny changes?)

DONE Make chain manager code flexible enough to run "real world" or "sim"

DONE Add projection wedging logic to each FLU.

DONE Implement real data repair, orchestrated by the chain manager

DONE Change all protocol ops to enforce the epoch ID

  • Add no-wedging state to make testing easier?

DONE Adapt the projection-aware, CR-implementing client from demo-day

DONE Add major comment sections to the CR-impl client

DONE Simple basho_bench driver, put some unscientific chalk on the benchtop

TODO Create parallel PULSE test for basic API plus chain manager repair

DONE Add client-side vs. server-side checksum type, expand client API?

TODO Add gproc and get rid of registered name rendezvous

TODO Fixes the atom table leak

TODO Fixes the problem of having an active sequencer for the same prefix on two FLUs in the same VM (see the gproc sketch below)
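
A minimal sketch of what the gproc-based rendezvous could look like; the {sequencer, FluName, Prefix} key shape and the module name are hypothetical.

#+BEGIN_SRC erlang
-module(seq_reg_sketch).
-export([register_sequencer/2, find_sequencer/2]).

%% Register the calling process under a structured gproc key instead of
%% creating a fresh registered-name atom per prefix (which leaks atoms and
%% collides when two FLUs in the same VM serve the same prefix).
register_sequencer(FluName, Prefix) ->
    true = gproc:reg({n, l, {sequencer, FluName, Prefix}}),
    ok.

%% Look up the sequencer for this FLU + prefix; returns a pid or 'undefined'.
find_sequencer(FluName, Prefix) ->
    gproc:where({n, l, {sequencer, FluName, Prefix}}).
#+END_SRC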

TODO Fix all known bugs/cruft with Chain Manager (list below)

DONE Fix known bugs

DONE Clean up crufty TODO comments and other obvious cruft

TODO Re-add verification step of stable epochs, including inner projections!

TODO Attempt to remove cruft items in flapping_i?

TODO Move the FLU server to gen_server behavior?

DONE Chain manager CP mode, Plan B

SKIP Maybe? Change ch_mgr to use middleworker

DONE Is it worthwhile? Is the parallelism so important? No, probably not.
SKIP Move middleworker func to utility module?

DONE Add new proc to psup group

DONE Name: machi_fitness

DONE ch_mgr keeps its current proc struct: i.e. same 1 proc as today

NO ch_mgr asks hosed mgr for hosed list at start of react_to_env

DONE For all hosed, do async: try to read latest proj.

NO If OK, inform hosed mgr: status change will be used by next HC iter.

NO If fail, no change, because that server is already known to be hosed

DONE For all non-hosed, continue as the chain manager code does today

DONE Any new errors are added to UpNodes/DownNodes tracking as used today

DONE At end of react loop, if UpNodes list differs, inform hosed mgr.

DONE fitness_mon, the fitness monitor

DONE Map key & val sketch

Logical sketch:

Map key: ObservingServerName::atom()

Map val: { ObservingServerLastModTime::now(), UnfitList::list(ServerName::atom()), AdminDownList::list(ServerName::atom()), Props::proplist() }
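
The same key/value shape written as Erlang type declarations (illustrative only; these would live in machi_fitness or similar).

#+BEGIN_SRC erlang
-type server_name() :: atom().
-type mod_time()    :: erlang:timestamp().     % now()-style {MegaSecs, Secs, MicroSecs}
-type fitness_val() :: {mod_time(),            % ObservingServerLastModTime
                        [server_name()],       % UnfitList
                        [server_name()],       % AdminDownList
                        proplists:proplist()}. % Props
-type fitness_map() :: #{server_name() => fitness_val()}.  % key = ObservingServerName
#+END_SRC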

Implementation sketch:

  1. Use a CRDT map.
  2. If the map key cannot be an atom, then converting atom->string or atom->binary is fine.
  3. For the map value, can we use a CRDT LWW (last-write-wins) type?
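
A rough sketch of points 1-3, assuming the riak_dt API (riak_dt_map:new/0, update/3, merge/2, value/1, and riak_dt_lwwreg's {assign, Value, Timestamp} op); the module and function names here are made up.

#+BEGIN_SRC erlang
-module(fitness_crdt_sketch).
-export([new/0, store_observation/3, merge_maps/2, to_list/1]).

new() ->
    riak_dt_map:new().

%% Store one observer's value in an LWW register, keyed by the observer's
%% name converted to a binary (point 2: atom->binary for the map key).
store_observation(ObservingServer, Val, Map) when is_atom(ObservingServer) ->
    Key = {atom_to_binary(ObservingServer, utf8), riak_dt_lwwreg},
    Now = erlang:system_time(microsecond),            % LWW timestamp
    Op  = {update, [{update, Key, {assign, Val, Now}}]},
    {ok, Map2} = riak_dt_map:update(Op, ObservingServer, Map),
    Map2.

%% CRDT merge of two maps, e.g. my pending map and an incoming update.
merge_maps(MapA, MapB) ->
    riak_dt_map:merge(MapA, MapB).

to_list(Map) ->
    riak_dt_map:value(Map).
#+END_SRC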

DONE Investigate riak_dt data structure definition, manipulating, etc.

DONE Add dependency on riak_dt

DONE Update is an entire dict from Observer O

DONE Merge my pending map + update map + my last mod time + my unfit list

DONE if merged /= pending:

DONE Schedule async tick (more)

The tick message contains the list of servers whose state differs as of this instant in time. We want to avoid triggering fitness/unfitness decisions about other servers when we might have waited for less than a full time period.

DONE Spam merged map to All_list -- [Me]
DONE Set pending <- merged
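
A condensed sketch of that update-received path, reusing the riak_dt_map idea from the implementation sketch above; the record fields, the 2-second tick delay, and the helper names are all hypothetical, and updating my own entry's last mod time + unfit list before spamming is omitted.

#+BEGIN_SRC erlang
-module(fitness_update_sketch).
-export([handle_incoming_update/2]).

-record(state, {pending, all_list = [], my_name}).

handle_incoming_update(UpdateMap, #state{pending=Pending, all_list=All,
                                         my_name=Me}=S) ->
    %% Merge my pending map with the entire map sent by the observer.
    Merged = riak_dt_map:merge(Pending, UpdateMap),
    case riak_dt_map:equal(Merged, Pending) of
        true ->
            S;                                    % nothing new; keep pending as-is
        false ->
            %% The tick carries only the servers whose entries changed.
            Changed = changed_servers(Pending, Merged),
            _ = erlang:send_after(2000, self(), {tick, Changed}),
            ok = spam_map(All -- [Me], Merged),   % spam merged map to the others
            S#state{pending=Merged}               % pending <- merged
    end.

%% Which observers' entries differ between the old and new map?
changed_servers(Old, New) ->
    OldL = riak_dt_map:value(Old),
    [K || {K, V} <- riak_dt_map:value(New), proplists:get_value(K, OldL) =/= V].

spam_map(_Peers, _Map) ->
    ok.   % transport to the peers' fitness servers omitted in this sketch
#+END_SRC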

DONE When we receive an async tick

DONE set active map <- pending map for all servers in the tick's list
DONE Send ch_mgr a react_to_env tick trigger

DONE react_to_env tick trigger actions

DONE Filter active map to remove stale entries (i.e. no update in 1 hour)
DONE If time since last map spam is too long, spam our pending map
DONE Proceed with normal react processing, using active map for AllHosed!
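
A sketch of those tick steps; plain Erlang maps stand in for the CRDT map, timestamps are plain seconds, the timeout values are arbitrary, and every module/function name here is hypothetical.

#+BEGIN_SRC erlang
-module(fitness_tick_sketch).
-export([handle_tick/2]).

-record(state, {pending   = #{} :: map(),   % last merged observations
                active    = #{} :: map(),   % map actually used for AllHosed
                last_spam = 0   :: integer()}).

-define(STALE_SECS, 3600).        % "no update in 1 hour"
-define(MAX_SPAM_GAP_SECS, 60).   % spam again if we have been quiet this long

handle_tick(ChangedServers, #state{pending=Pending, active=Active0}=S) ->
    %% active map <- pending map, but only for the servers named in the tick
    Active1 = maps:merge(Active0, maps:with(ChangedServers, Pending)),
    %% filter the active map: drop observers with no update in the last hour
    Now = erlang:system_time(second),
    Active2 = maps:filter(fun(_Srv, {LastMod, _Unfit, _AdminDown, _Props}) ->
                                  Now - LastMod =< ?STALE_SECS
                          end, Active1),
    %% if we have not spammed our own map recently, do it again now
    S1 = maybe_spam(Now, S),
    %% proceed with normal react processing: AllHosed is the union of the
    %% unfit lists in the (filtered) active map
    AllHosed = lists:usort(lists:append(
                 [Unfit || {_T, Unfit, _AD, _P} <- maps:values(Active2)])),
    ok = trigger_react_to_env(AllHosed),
    S1#state{active=Active2}.

maybe_spam(Now, #state{pending=Pending, last_spam=Last}=S)
  when Now - Last > ?MAX_SPAM_GAP_SECS ->
    ok = spam_map(Pending),
    S#state{last_spam=Now};
maybe_spam(_Now, S) ->
    S.

spam_map(_Map)               -> ok.  % transport to peers omitted in this sketch
trigger_react_to_env(_Hosed) -> ok.  % would poke the chain manager in real code
#+END_SRC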