Merging #199

opened 2020-08-06 16:56:33 +00:00 by gburd
gburd commented 2020-08-06 16:56:33 +00:00 (Migrated from github.com)

I thought I'd capture some notes around merging. There's more discussion [in the wiki](https://github.com/mozilla/mentat/wiki/Thoughts:-synchronization) and in my paper notebooks!

If we allow users to establish multiple initial timelines — that is, to work offline from scratch on more than one device, and then to sign in later to the same account — we will need to support merges of current state.

That is: given a timeline of _states_ `A -> B -> C` and a timeline of states `X -> Y -> Z`, produce `M(C, Z)`, which in the simplest (totally disjoint) case is equivalent to `C + Z`.

This is not the same thing as attempting to rebase (`X -> Y -> Z + 𝛿A + 𝛿B + 𝛿C`); it's possible for conflicts to occur between intermediate states that would be resolved in aggregate, and it's also not correct in a theoretical sense to imply even an approximate happened-after relationship.

This kind of merge is a three-way merge from the empty shared root. There are other kinds of merges we might consider: e.g., a long-lived divergence from a non-empty shared earlier state.

I sketched out how I think this might work:

  1. Materialize the remote head in a second datoms table. Reusing the fulltext table avoids the need to rewrite those values.
  2. Ensure that the remote schema is current locally. Abort now if the local code is too old. We rely on both sides having the same concept of uniqueness and cardinality.
  3. Renumber locally. This ensures that by default our local data does not accidentally collide with the main timeline (a sketch follows this list).
  4. For all unique `av` pairs, take the remote `e` and `tx` for that `av`, renaming local identifiers; this essentially discards local duplicates. Idents are a special case of this, and need to be done first in order to resolve attributes! Now we have no uniqueness conflicts, and there should be no `e` conflicts (because we renumbered). This is our smushing step (see the sketch after this list). Each time we rewrite an entity that is used elsewhere in the `v` position we need to iterate until we converge, because the new local `av` might yield another new `e`. Once this step is complete we have fully smushed: all of our entities that can be linked via a unique-identity path to a value will have been unified.
  5. For all cardinality-one `ea` pairs, look for a conflict between the two datoms tables and resolve according to rules (a conflict scan is sketched below). Resolving a conflict might mean taking the remote value, which might require further smushing.
  6. Now take all datoms in local and `INSERT OR IGNORE` into remote (sketched below). Synthesize a single merge tx if desired: using a single tx gives us a more formal definition of a merge, but loses history (and either renders tx metadata meaningless or discards it). We could point back to txids from the merge tx. Using multiple txes preserves granularity but obscures ordering, unless we reify multiple transaction logs.
  7. Select from remote datoms on `tx` to populate the remote tx log: that is, we work backwards from the datoms table to the log (sketched below).
  8. Annotate the new tx with merge info.
  9. Optional: compact the parts table.
  10. Upload the new data.
  11. Rebuild the cache and schema.
  12. Notify consumers about renumbering.
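
A minimal sketch of step 3's renumbering, assuming hypothetical `local_datoms`/`remote_datoms` tables with a simplified `(e, a, v, tx)` layout, an assumed `ref_attributes` helper view listing ref-typed attributes, and host code that computes `:offset` from the remote user partition's high-water mark:

```sql
-- Shift every local entid in the user partition above the remote
-- high-water mark so the two sides cannot collide. :offset and
-- :user_partition_start are assumed parameters supplied by the caller.
UPDATE local_datoms SET e = e + :offset WHERE e >= :user_partition_start;

-- Entids also appear in the v position of ref-typed attributes;
-- ref_attributes is an assumed helper view derived from the schema.
UPDATE local_datoms SET v = v + :offset
 WHERE a IN (SELECT a FROM ref_attributes)
   AND v >= :user_partition_start;
```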
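
One pass of step 4's smushing might look like the following, reusing the same assumed tables plus a `unique_attributes` helper; the host code loops this until an iteration changes no rows:

```sql
-- Collect local-to-remote entid renames: any local datom that shares
-- a unique (a, v) with a remote datom adopts the remote entid.
CREATE TEMP TABLE renames AS
SELECT DISTINCT l.e AS local_e, r.e AS remote_e
  FROM local_datoms l
  JOIN remote_datoms r ON l.a = r.a AND l.v = r.v
 WHERE l.a IN (SELECT a FROM unique_attributes);  -- assumed helper view

-- Rewrite the e position...
UPDATE local_datoms
   SET e = (SELECT remote_e FROM renames WHERE local_e = local_datoms.e)
 WHERE e IN (SELECT local_e FROM renames);

-- ...and ref-typed values that point at a renamed entity. This can
-- expose fresh unique (a, v) matches, hence the outer loop.
UPDATE local_datoms
   SET v = (SELECT remote_e FROM renames WHERE local_e = local_datoms.v)
 WHERE a IN (SELECT a FROM ref_attributes)
   AND v IN (SELECT local_e FROM renames);

DROP TABLE renames;
```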
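
Step 5's conflict scan could then be a straight join between the two tables, again leaning on an assumed helper view for the schema lookup; resolution itself (take remote, take latest, and so on) happens in host code:

```sql
-- Each row is a conflict: the same entity/attribute carries different
-- values on the two sides, and the attribute admits only one value.
SELECT l.e, l.a, l.v AS local_v, r.v AS remote_v
  FROM local_datoms l
  JOIN remote_datoms r ON l.e = r.e AND l.a = r.a
 WHERE l.v <> r.v
   AND l.a IN (SELECT a FROM cardinality_one_attributes);  -- assumed helper
```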
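
Step 6's single-merge-tx variant, assuming `remote_datoms` carries a unique index over `(e, a, v)` so duplicates collapse, and a fresh `:merge_tx` allocated by the caller:

```sql
-- Pour the renumbered, smushed local datoms into the remote table,
-- stamping them all with one synthesized merge tx. The multi-tx
-- alternative would carry local_datoms.tx through instead.
INSERT OR IGNORE INTO remote_datoms (e, a, v, tx)
SELECT e, a, v, :merge_tx
  FROM local_datoms;
```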
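
And step 7, working backwards from the merged datoms table to a hypothetical `remote_transactions` log:

```sql
-- Record every datom introduced by the merge tx as an assertion
-- (added = 1) in the log; table and column names are assumptions.
INSERT INTO remote_transactions (e, a, v, tx, added)
SELECT e, a, v, tx, 1
  FROM remote_datoms
 WHERE tx = :merge_tx;
```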