0 Transacting: upsert resolution algorithm
Joe Walker edited this page 2017-02-16 10:28:43 +00:00

Note: This page needs bringing up to date . See issue 313 for details.


The goal of the algorithm is to:

  • Upsert temporary IDs to concrete numeric entids existing in the store where possible.
  • To determine that this is not possible and to allocate new entids.
  • To determine that the input is inconsistent and fail fast.

The metaphor we use is that of evolution. At each evolutionary step, a generation contains a number of different populations, each population containing a number of different transaction entities. Transaction entities in the process of being resolved are evolved into successively simpler populations, until no further evolution is possible.

At the end of the evolutionary process, no transaction entity contains an unresolved temporary ID.

Throughout we use negative integers (-1, -2, etc) to denote temporary IDs that need to be resolved.

Populations

We partition the set of transaction entities into the following populations.

Population Description
upserts-e "Simple upserts" that look like [:db/add -1 a v] where a is :db.unique/identity
upserts-ev "Complex upserts" that look like [:db/add -1 a -2] where a is :db.unique/identity
allocations-ev Entities that look like [:db/add -1 b -2], where b is not :db.unique/identity
allocations-e Entities that look like [:db/add -1 b v], where b is not :db.unique/identity
allocations-v Entities that look like [:db/add e b -2], with no condition on b
inert Entities that are not [:db/add ...], or that do not contain a temporary ID, like [:db/add e a v]

Evolutionary step

The population sets are iteratively refined as follows. Each evolutionary step, we do the following:

  1. Try to upsert upserts-e. For each pair [a v], we search for [e a v] in the store. If exactly one such datom exists, we say entities like [-1 a v] have upserted, and we "resolve" the temporary ID -1 to the entid e. More than one datom can't exist, since a is :db.unique/identity. Next generation, those entities will be in the inert population -- they can evolve no further. If such a datom [e a v] does not exist, we say entities like [-1 a v] require allocation. Next generation, those entities will be in the allocations-e population.

  2. Using the temporary IDs that resolved in the previous step, try to evolve the all populations other than upserts-e further. That is, we might be able to evolve an entity [-1 a -2] in upserts-ev to [-1 a e] (in upserts-e) or to [e a -2] (in allocations-v) or event to [e a v] (in inert) next generation. Similarly, we may be able to evolve:

  • some entities in allocations-ev to allocations-e, allocations-v, or inert;
  • some entities in allocations-e to inert;
  • some entities in allocations-v to inert.

It is possible for a single temp ID to upsert to more than one entid. As soon as we witness such a conflicting upsert, we fail.

Eventually, the population upserts-e will be empty. At this point, we evolve every complex upsert [-1 a -2] in upserts-ev to allocations-ev.

Finally, we allocate new entids for any temporary IDs unresolved in allocations-ev, allocations-e, and allocations-v. We're done!

It should not be difficult to write a formal proof of correctness of the algorithm, but I haven't done so.

As a future optimization, entities that upserted do not need to be inserted in the SQL store: they upserted, so they already exist in the DB. (We still need to verify uniqueness and ensure that no overlapping assertions occur.) Similarly, entities that contain temp IDs that required allocation do not need to be checked for existence before they are inserted into the SQL store, so they can be inserted more efficiently than arbitrary entities.