machi/TODO-shortterm.org

* To Do list

** DONE remove the escript* stuff from machi_util.erl
** DONE Add functions to manipulate 1-chain projections

- Add epoch ID = epoch number + checksum of projection!
  Done via compare() func.
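
  A minimal sketch of the idea, not the code that was committed: the
  module name, the SHA-1 choice, and compare_epoch_ids/2 are all
  assumptions made for illustration, and the real compare() func may
  behave differently.  The point is only that two projections with the
  same epoch number but different contents get different epoch IDs.

#+BEGIN_SRC erlang
-module(epoch_id_sketch).
-export([epoch_id/2, compare_epoch_ids/2]).

%% Epoch ID = {epoch number, checksum of the projection}.
%% Assumption: the projection is any Erlang term and SHA-1 over its
%% external term format is an acceptable checksum for this sketch.
epoch_id(EpochNumber, Projection) when is_integer(EpochNumber) ->
    CSum = crypto:hash(sha, term_to_binary(Projection)),
    {EpochNumber, CSum}.

%% Hypothetical comparison: order by epoch number first; the same
%% number with a different checksum is reported as a conflict.
compare_epoch_ids({N, CSum}, {N, CSum}) -> eq;
compare_epoch_ids({N, _}, {N, _})       -> conflict;
compare_epoch_ids({A, _}, {B, _}) when A < B -> lt;
compare_epoch_ids({_, _}, {_, _})       -> gt.
#+END_SRC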

** DONE Change all protocol ops to add epoch ID
** DONE Add projection store to each FLU.

*** DONE What should the API look like? (borrow from chain mgr PoC?)

Yeah, I think that's pretty complete.  Steal it now, worry later.

*** DONE Choose protocol & TCP port. Share with get/put? Separate?

Hrm, I like the idea of having a single TCP port to talk to any single
FLU.

To make the protocol "easy" to hack, how about using the same basic
method as append/write where there's a variable size blob.  But we'll
format that blob as a term_to_binary().  Then dispatch to a single
func, and pattern match Erlang style in that func.
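
A rough sketch of what that could look like, under stated assumptions:
the 4-byte length prefix, the module name, the request tuples
({write, Epoch, Proj}, {read, Epoch}), and the map used as the store
are all made up for illustration and are not the wire format or API
that was actually chosen.

#+BEGIN_SRC erlang
-module(proj_proto_sketch).
-export([encode/1, decode/1, dispatch/2]).

%% Frame a request: 4-byte big-endian length, then a term_to_binary() blob.
encode(Term) ->
    Blob = term_to_binary(Term),
    <<(byte_size(Blob)):32/big, Blob/binary>>.

%% Split one complete frame off the front of a receive buffer.
%% The 'safe' option refuses to create new atoms from network input.
decode(<<Len:32/big, Blob:Len/binary, Rest/binary>>) ->
    {ok, binary_to_term(Blob, [safe]), Rest};
decode(_Incomplete) ->
    more.

%% The single dispatch func: pattern match on the decoded request.
%% Store is a map of Epoch => Projection; returns {Reply, NewStore}.
dispatch({write, Epoch, Proj}, Store) when is_integer(Epoch) ->
    case maps:is_key(Epoch, Store) of
        true  -> {{error, written}, Store};
        false -> {ok, Store#{Epoch => Proj}}
    end;
dispatch({read, Epoch}, Store) ->
    case maps:find(Epoch, Store) of
        {ok, Proj} -> {{ok, Proj}, Store};
        error      -> {{error, not_written}, Store}
    end;
dispatch(_Other, Store) ->
    {{error, bad_request}, Store}.
#+END_SRC

Adding a new op is then one more dispatch/2 clause; the framing code
does not change.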

*** DONE Do it.

** DONE Finish OTP'izing the Chain Manager with FLU & proj store processes
** DONE Eliminate the timeout exception for the client: just return {error,timeout}
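
One way to get this behavior, assuming the client talks to a
gen_server-style process (an assumption; the module and function names
below are hypothetical): catch the timeout exit that gen_server:call/3
raises and turn it into a return value.

#+BEGIN_SRC erlang
-module(client_timeout_sketch).
-export([safe_call/3]).

%% gen_server:call/3 exits with {timeout, _} when no reply arrives
%% within Timeout milliseconds.  Convert only that exit into
%% {error, timeout}; any other exit (e.g. noproc) still propagates.
safe_call(Server, Request, Timeout) ->
    try
        gen_server:call(Server, Request, Timeout)
    catch
        exit:{timeout, _} ->
            {error, timeout}
    end.
#+END_SRC
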
** DONE Move prototype/chain-manager code to "top" of source tree
*** DONE Preserve current test code (leave as-is? tiny changes?)
*** DONE Make chain manager code flexible enough to run "real world" or "sim"
** DONE Add projection wedging logic to each FLU.
** DONE Implement real data repair, orchestrated by the chain manager
** DONE Change all protocol ops to enforce the epoch ID

- Add no-wedging state to make testing easier?

** DONE Adapt the projection-aware, CR-implementing client from demo-day
** DONE Add major comment sections to the CR-impl client
** DONE Simple basho_bench driver, put some unscientific chalk on the benchtop
** TODO Create parallel PULSE test for basic API plus chain manager repair
** DONE Add client-side vs. server-side checksum type, expand client API?
** TODO Add gproc and get rid of registered name rendezvous
*** TODO Fixes the atom table leak
*** TODO Fixes the problem of having an active sequencer for the same prefix on two FLUs in the same VM

** TODO Fix all known bugs/cruft with Chain Manager (list below)
*** DONE Fix known bugs
*** DONE Clean up crufty TODO comments and other obvious cruft
*** TODO Re-add verification step of stable epochs, including inner projections!
*** TODO Attempt to remove cruft items in flapping_i?

** TODO Move the FLU server to gen_server behavior?


* DONE Chain manager CP mode, Plan B
** SKIP Maybe? Change ch_mgr to use middleworker
*** DONE Is it worthwhile?  Is the parallelism so important?  No, probably.
*** SKIP Move middleworker func to utility module?
** DONE Add new proc to psup group
*** DONE Name: machi_fitness
** DONE ch_mgr keeps its current proc struct: i.e. same 1 proc as today
** NO ch_mgr asks hosed mgr for hosed list @ start of react_to_env
** DONE For all hosed, do *async*: try to read latest proj.
*** NO If OK, inform hosed mgr: status change will be used by next HC iter.
*** NO If fail, no change, because that server is already known to be hosed
*** DONE For all non-hosed, continue as the chain manager code does today
*** DONE Any new errors are added to UpNodes/DownNodes tracking as used today
*** DONE At end of react loop, if UpNodes list differs, inform hosed mgr.

* DONE fitness_mon, the fitness monitor
** DONE Map key & val sketch

Logical sketch:

Map key: ObservingServerName::atom()

Map val: { ObservingServerLastModTime::now(),
           UnfitList::list(ServerName::atom()),
           AdminDownList::list(ServerName::atom()),
           Props::proplist() }

Implementation sketch:

1. Use a CRDT map.
2. If the map key cannot be an atom, then atom->string or atom->binary is fine.
3. For the map value, is a CRDT LWW (last-writer-wins) type possible?

** DONE Investigate riak_dt data structure definition, manipulating, etc.
** DONE Add dependency on riak_dt
** DONE Update is an entire dict from Observer O
*** DONE Merge my pending map + update map + my last mod time + my unfit list
*** DONE if merged /= pending:
**** DONE Schedule async tick (more)

Tick message contains list of servers with differing state as of this
instant in time... we want to avoid triggering decisions about
fitness/unfitness for other servers where we might have received less
than a full time period's worth of waiting.

**** DONE Spam merged map to All_list -- [Me]
**** DONE Set pending <- merged

*** DONE When we receive an async tick
**** DONE Set active map <- pending map for all servers in the tick's list
**** DONE Send ch_mgr a react_to_env tick trigger
*** DONE react_to_env tick trigger actions
**** DONE Filter active map to remove stale entries (i.e. no update in 1 hour)
**** DONE If time since last map spam is too long, spam our *pending* map
**** DONE Proceed with normal react processing, using *active* map for AllHosed!
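
The update/tick flow in this section can be sketched roughly as below.
This is not the machi_fitness implementation: plain Erlang maps with a
hand-rolled last-writer-wins merge stand in for the riak_dt CRDT map,
timestamps are plain integer seconds rather than now() tuples, and the
module and function names are hypothetical.  Sending the tick, the map
spam, and the react_to_env trigger are left to the caller.

#+BEGIN_SRC erlang
-module(fitness_sketch).
-export([new/0, handle_update/2, handle_tick/2, all_hosed/3]).

%% Map key: observing server name (atom).
%% Map val: {LastModTime, UnfitList, AdminDownList, Props}.
new() ->
    #{pending => #{}, active => #{}}.

%% Last-writer-wins merge per observer: keep the value with the
%% newer LastModTime (first tuple element).
merge(A, B) ->
    maps:fold(
      fun(Obs, ValB, Acc) ->
              case maps:find(Obs, Acc) of
                  {ok, ValA} when element(1, ValA) >= element(1, ValB) ->
                      Acc;
                  _ ->
                      Acc#{Obs => ValB}
              end
      end, A, B).

%% An update is an entire map from some observer.  If merging changes
%% our pending map, return the observers whose state differs so the
%% caller can schedule an async tick and spam the merged map.
handle_update(Update, #{pending := Pending} = State) ->
    Merged = merge(Pending, Update),
    case Merged =:= Pending of
        true ->
            {no_change, State};
        false ->
            Changed = lists:sort(
                        [Obs || {Obs, Val} <- maps:to_list(Merged),
                                maps:find(Obs, Pending) =/= {ok, Val}]),
            {changed, Changed, State#{pending := Merged}}
    end.

%% On tick: copy pending -> active only for the servers named in the tick.
handle_tick(Changed, #{pending := Pending, active := Active} = State) ->
    State#{active := maps:merge(Active, maps:with(Changed, Pending))}.

%% For react_to_env: drop stale entries (older than MaxAge seconds),
%% then collect every server reported unfit or admin-down.
all_hosed(Now, MaxAge, #{active := Active}) ->
    Fresh = maps:filter(
              fun(_Obs, {LastMod, _, _, _}) -> Now - LastMod =< MaxAge end,
              Active),
    lists:usort(lists:append(
                  [Unfit ++ Down
                   || {_LastMod, Unfit, Down, _Props} <- maps:values(Fresh)])).
#+END_SRC

Keeping pending and active separate means a freshly received report
does not influence all_hosed/3 until its tick fires, which is the
"full time period's worth of waiting" described above.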