Witness servers and CP mode clarification needed in design document #4

Open
opened 2015-08-20 18:28:44 +00:00 by cmeiklejohn · 5 comments
cmeiklejohn commented 2015-08-20 18:28:44 +00:00 (Migrated from github.com)

The role of witness servers in Section 11 of the chain manager document should be clarified.

From my initial reading, it seems that the technique of only accepting writes on the majority side of the partition can be presented completely with only "real" servers. "Witness" servers appear to be an optimization that allows continued operation in CP mode when only a minority of real servers are available, which I believe should be presented separately. Am I missing something obvious?

The reasoning for why witness servers should be placed at the front of the chain is only mentioned briefly and not presented clearly enough. Why is this required?

Finally, in Figure 3, given the reasoning in the document, it appears that the following:

```
[W_0, W_1, S_0, S_1, S_2] : cluster
[W_1, W_0, S_1] : majority partition
[S_0, S_2] : minority partition
```

will continue to accept writes on the majority partition side, given the presence of two witnesses and one real node. However, given the witnesses store no data, only metadata regarding the current projection, a single failure of S_1 before the partition heals results in data loss.

Shouldn't you require that a majority of the majority side of the partition be real servers for durability in CP mode? (instead of the requirement for only 1, which feels like it should be the invariant for AP mode, not CP mode)
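To make the concern concrete, here is a small sketch (not Machi's actual code; names like `is_witness` and `accepts_writes` are hypothetical) of the CP-mode availability rule as I read Section 11: a partition may accept writes iff it holds a strict majority of the full cluster and contains at least one real (data-storing) server.

```python
# Hypothetical sketch of the Section 11 CP-mode rule, applied to Figure 3.

def is_witness(server):
    # Witnesses store only projection metadata, no file data.
    return server.startswith("W")

def accepts_writes(partition, cluster_size):
    has_majority = len(partition) > cluster_size // 2
    has_real_server = any(not is_witness(s) for s in partition)
    return has_majority and has_real_server

cluster  = ["W_0", "W_1", "S_0", "S_1", "S_2"]
majority = ["W_1", "W_0", "S_1"]
minority = ["S_0", "S_2"]

assert accepts_writes(majority, len(cluster))      # two witnesses + S_1: writable
assert not accepts_writes(minority, len(cluster))  # minority side: wedged

# The hazard in question: S_1 holds the only data copy on the writable side.
real_copies = sum(not is_witness(s) for s in majority)
assert real_copies == 1
```

The last assertion is exactly the durability worry: one real-server failure on the writable side leaves no data copy.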

slfritchie commented 2015-08-21 05:28:04 +00:00 (Migrated from github.com)

Hi, Chris. Many thanks for putting this part of the doc under a microscope. ^_^ I'll try the questions out-of-order.

[SLF: Chris is referring to the doc at https://github.com/basho/machi/blob/master/doc/high-level-chain-mgr.pdf ]

Upon reflection after writing the "witnesses at the front of the chain" text a while ago, I don't believe that it's a hard requirement. However, it makes dealing with the implementation easier. For example, it isn't necessary to maintain strict ordering between epochs of witness servers in the UPI part of the chain (compared to how correctness can be broken if you reorder real servers in the UPI part of the chain). The current CP mode chain manager code feels a bit easier to work with, knowing that the witnesses are all in front. But perhaps that's ex post facto logic at work, I dunno.
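One way to see why the convention helps (my sketch, not a proof, and `data_path` is a hypothetical helper): only real servers store bytes, so any permutation of an all-witness prefix yields the same data placement, while permuting real servers changes head/tail roles.

```python
# Hypothetical illustration: the data path through the UPI part of the chain
# is determined only by the real-server suffix.

def data_path(upi_chain):
    # Witnesses hold projection metadata only; strip them out.
    return [s for s in upi_chain if not s.startswith("W")]

# Swapping witnesses within the prefix changes nothing about data placement:
assert data_path(["W_0", "W_1", "S_1"]) == data_path(["W_1", "W_0", "S_1"])

# Swapping *real* servers, by contrast, changes their chain roles:
assert data_path(["W_0", "S_0", "S_1"]) != data_path(["W_0", "S_1", "S_0"])
```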

If a CP mode chain manager tried to arrange the minority partition as:

```
[S_0, S_2] : minority partition
```

... the manager won't, because (by definition) a minority-length chain isn't sufficient. So S_0 and S_2 will wedge themselves and wait until they can talk to somebody else.

The other partition:

```
[W_1, W_0, S_1] : majority partition
```

... will be able to function because the chain is long enough (majority) and contains at least one real server, S_1. If we are in this situation, then we're already beyond the point of being able to operate in a 3 or 2 "real" server situation. We're lucky (and presumably happy) that we can operate at all -- the alternative is to be unavailable. Yes, we will be _unavailable_, as you point out, if S_1 fails before the partition heals. "Data loss" perhaps needs more specific definition: temporary unavailability (if S_1 returns to service eventually) or permanent unavailability (S_1 never returns to service with its data intact).

Shouldn't you require that a majority of the majority side of the partition be real servers for durability in CP mode? (instead of the requirement for only 1, which feels like it should be the invariant for AP mode, not CP mode)

Hrm, I feel that the best answer is "it depends". If you want to operate a strongly consistent cluster that can tolerate 2 failures and not lose data permanently, then a scheme of 2 witnesses + 3 real servers is sufficient. In the case (above) of the two failures being S_0 and S_2, yeah, you're flirting with unavailability (temporary or permanent) if you have a third failure ... but if you wanted to tolerate a 3rd failure with the same consistency & availability, then you ought to be running with 3 witnesses + 4 real servers instead.
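A back-of-envelope check of the sizing pattern in the two examples above (my reading of the comment, not a stated Machi invariant): to tolerate f failures without permanent data loss, run f witnesses plus (f + 1) real servers, i.e. 2f + 1 servers total. The `sizing` helper below is hypothetical.

```python
# Sanity-check the f-witnesses + (f + 1)-real-servers sizing pattern.

def sizing(f):
    witnesses, real = f, f + 1
    total = witnesses + real            # 2f + 1 servers overall
    majority = total // 2 + 1           # quorum size is f + 1
    # Worst case: all f failures hit real servers. One data copy survives,
    # and it plus the f witnesses still form a majority of f + 1.
    surviving_real = real - f
    return total, majority, surviving_real

assert sizing(2) == (5, 3, 1)   # 2 witnesses + 3 real, as in the comment
assert sizing(3) == (7, 4, 1)   # 3 witnesses + 4 real, as in the comment
```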

If it's helpful, here's a message sequence diagram for the CORFU-style pattern of chain replication that allows strong consistency updates to the chain despite having only a single real server. If there's an epoch change underway, our two witnesses help form the majority quorum that will guarantee that at least 1 of the 3 will observe the change and thus send a negative response to the client.

```
Client     Witness W_1     Witness W_0     Real S_0
------     -----------     -----------     --------
   In Epoch 5?
 |------------->
      Yes
 <-------------|
            In Epoch 5?
 |---------------------------->
                Yes
 <----------------------------|
           Append Bytes to prefix P in epoch 5?
 |-------------------------------------------->
                  ok!
 <--------------------------------------------|
```
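The client side of that diagram can be sketched roughly as follows (hypothetical API; Machi's real wire protocol differs). The client confirms the expected epoch with each witness, then issues the epoch-stamped append to the real server; any member that has already observed a newer projection answers negatively, aborting the write.

```python
# Sketch of the CORFU-style epoch-checked append from the diagram above.

class Member:
    def __init__(self, name, epoch, witness):
        self.name, self.epoch, self.witness = name, epoch, witness

    def in_epoch(self, e):
        return self.epoch == e

def append(chain, epoch, prefix, data):
    *witnesses, real = chain
    # Step 1: ask each witness whether it is still in the expected epoch.
    for w in witnesses:
        if not w.in_epoch(epoch):
            return "bad_epoch"          # an epoch change is underway
    # Step 2: the append names the epoch too, so the real server also
    # rejects a stale client.
    if not real.in_epoch(epoch):
        return "bad_epoch"
    return "ok"

chain = [Member("W_1", 5, True), Member("W_0", 5, True), Member("S_0", 5, False)]
assert append(chain, 5, "P", b"bytes") == "ok"

chain[0].epoch = 6                      # W_1 has observed a new projection
assert append(chain, 5, "P", b"bytes") == "bad_epoch"
```

With two of the three members being witnesses, any majority-quorum epoch change must touch at least one server the client queries, which is what guarantees the negative response.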
slfritchie commented 2015-08-21 07:04:00 +00:00 (Migrated from github.com)

Oops, I don't think I wrote about the first question, sorry.

From my initial reading, it seems that the technique of only accepting writes on the majority side of the partition can be presented completely with only "real" servers. [...]

Hm, my memory of that section is dim, I'll have to go back and read it. If it's a chocolate + peanut butter description that would be better with the chocolate separated from the p.b., then I'll definitely consider separating them.

slfritchie commented 2015-08-26 11:12:54 +00:00 (Migrated from github.com)

Oops, there was a typo, back two comments ago. I've edited that comment to fix it. The new text is:

Yes, we will be unavailable, as you point out, if S_1 fails [....]

cmeiklejohn commented 2015-08-26 18:35:56 +00:00 (Migrated from github.com)

So, I think we should work on rewriting this section and discuss CP mode without witnesses and then expand it to witnesses in a second section.

slfritchie commented 2015-09-09 04:09:13 +00:00 (Migrated from github.com)

Proposal:

  • Intro
  • Motivation
  • Separate AP section(s)
  • Separate CP section(s)
  • Joint evaluation?

Also, add motivation details & eval details for AP: causal vs. some other looser eventual-consistency flavor. Ditto for CP: linearizable vs. the looser sequential.

Also: clarify the atomicity guarantees/lack-thereof of how appends & writes work, without and with the "extra bytes" option to the append op. (SLF: though this is a bit trickier, because the semantics of the Machi file service don't really belong in the chain manager docs. Hrrrm, needs more thought....)
