The metaphor we use is that of "evolution", where each "evolutionary
step" contains a number of different "generations". Entities in the
process of being resolved are increasingly "evolved" into simpler
generations, until no further evolution is possible.
The test would fail because we would have an [a v] pair with a string
value, but we were looking for the fulltext rowid in <avs. Using
all_datoms correctly looks up the string value, at the cost of crippling
the speed of <avs.
This sorts fulltext values inserted in a single transaction, not across
transactions. This makes the rowids assigned in the fulltext_values
table internally consistent, even as the order of entities and datoms
changes (as the transaction applying algorithm evolves over time). The
test changes simply make the fulltext values sort easily.
In theory, these fulltext values could be very large, and sorting might
be very expensive. In practice, we expect values to differ in their
first few characters, so that this is efficient (i.e., proportional to
the number of fulltext values inserted and not their size).
Before, the parts of the lookup-ref test that passed took about ~2100ms
on my device. After, the entire lookup-ref test takes 350ms, a
significant speed-up. I did not try to plot the graphs of the two
approaches as the number of lookup-refs increased, since the new
approach is significantly better in all cases I can think of.
This uses a common table expression and multiple SQL calls rather than a
temporary table, since transactions with huge numbers of distinct
lookup-refs are likely to be very rare.
We mark lookup-refs with `lookup-ref`, which is a little awkward because
binding `(let [[a v] lookup-ref] ...)` doesn't directly work, but avoids
some ambiguity present in Datomic and DataScript around interpreting
lookup-refs as multiple value lists. (Which bit the tests in an earlier
version of this patch!)