This uses a common table expression and multiple SQL calls rather than a
temporary table, since transactions with huge numbers of distinct
lookup-refs are likely to be very rare.
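
As a rough illustration of the CTE approach (not the project's Clojure code), here is a minimal sketch using Python's built-in `sqlite3` module. The `datoms` schema, the function name `resolve_lookup_refs`, and the `chunk_size` parameter are assumptions made for the example; each chunk of lookup-refs is inlined as a `WITH ... VALUES` expression, so many refs simply mean several SQL calls.

```python
import sqlite3

def resolve_lookup_refs(conn, refs, chunk_size=100):
    """Resolve (a, v) lookup-refs to entids, one SQL call per chunk of refs."""
    resolved = {}
    for start in range(0, len(refs), chunk_size):
        chunk = refs[start:start + chunk_size]
        # Inline the chunk as a common table expression instead of
        # populating a temporary table.
        rows = ", ".join(["(?, ?)"] * len(chunk))
        sql = (
            f"WITH lookup(a, v) AS (VALUES {rows}) "
            "SELECT lookup.a, lookup.v, datoms.e FROM lookup "
            "JOIN datoms ON datoms.a = lookup.a AND datoms.v = lookup.v"
        )
        params = [x for ref in chunk for x in ref]
        for a, v, e in conn.execute(sql, params):
            resolved[(a, v)] = e
    return resolved

# Hypothetical schema and data, for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE datoms (e INTEGER, a INTEGER, v, tx INTEGER)")
conn.executemany(
    "INSERT INTO datoms VALUES (?, ?, ?, ?)",
    [(100, 10, "a@example.com", 1), (101, 10, "b@example.com", 1)],
)
refs = [(10, "a@example.com"), (10, "b@example.com"), (10, "missing@example.com")]
resolved = resolve_lookup_refs(conn, refs, chunk_size=2)
```

Lookup-refs that match no datom are absent from the result, so the caller can report them as errors.
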
We mark lookup-refs with `lookup-ref`, which is a little awkward because
binding `(let [[a v] lookup-ref] ...)` doesn't directly work, but avoids
some ambiguity present in Datomic and DataScript around interpreting
lookup-refs as multiple value lists. (Which bit the tests in an earlier
version of this patch!)
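
To make the ambiguity concrete, here is a small Python analogy (a sketch, not the Clojure implementation). The `LookupRef` class and `expand_values` helper are hypothetical names: a bare two-element list could be read either as a lookup-ref or as two values, while an explicitly marked value cannot be misread.

```python
from dataclasses import dataclass
from typing import Any

@dataclass(frozen=True)
class LookupRef:
    """Explicit marker type: an (attribute, value) pair naming one entity."""
    a: Any
    v: Any

def expand_values(value):
    """Expand a transaction value position into a list of individual values."""
    if isinstance(value, LookupRef):
        return [value]          # one reference, never split apart
    if isinstance(value, (list, tuple)):
        return list(value)      # a bare sequence is always a multi-value list
    return [value]

as_list = expand_values([":person/email", "a@example.com"])
as_ref = expand_values(LookupRef(":person/email", "a@example.com"))
```

Like the marked Clojure value, a `LookupRef` here cannot be destructured as a plain sequence; that is the price paid for removing the ambiguity.
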
There's no distinction made for fulltext attributes, since the values
found by the retractAttributes SELECT are already rowids into the
fulltext_values table and therefore need no additional mapping.
These temp files will almost certainly live in memory only, speeding our
test suite evaluation significantly. Before this patch, in a warmed
REPL environment I get:
    Testing datomish.db-test
    Ran 19 tests containing 97 assertions.
    0 failures, 0 errors.
    "Elapsed time: 1408.720681 msecs"
    "Elapsed time: 1343.986464 msecs"
    "Elapsed time: 1338.660762 msecs"
After this patch, in a warmed REPL environment I get:
    Testing datomish.db-test
    Ran 19 tests containing 97 assertions.
    0 failures, 0 errors.
    "Elapsed time: 587.605168 msecs"
    "Elapsed time: 569.522333 msecs"
    "Elapsed time: 589.080282 msecs"
We'd like this to be part of the query syntax itself, but doing so
requires extending DataScript's parser.
Instead we generalize our `args` to `options`, and take `:limit`
and `:order-by-vars`. The former must be an integer or nil, and the
latter is an array of `[var direction]` pairs.
This commit includes descriptive error messages and tests for success
and failure.
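
The kind of validation described above might look roughly like the following Python sketch. This is not the project's code: the function name `validate_options`, the Python-style keys `limit` and `order_by_vars`, the `"asc"`/`"desc"` direction names, and the `?`-prefix check on variables are all assumptions made for illustration.

```python
def validate_options(options):
    """Validate limit and order-by options, raising descriptive errors."""
    limit = options.get("limit")
    # bool is a subclass of int in Python, so reject it explicitly.
    if limit is not None and (isinstance(limit, bool) or not isinstance(limit, int)):
        raise ValueError(f"Invalid limit {limit!r}: expected an integer or None.")

    order_by = options.get("order_by_vars") or []
    checked = []
    for pair in order_by:
        if not (isinstance(pair, (list, tuple)) and len(pair) == 2):
            raise ValueError(f"Invalid order-by entry {pair!r}: expected [var, direction].")
        var, direction = pair
        # Assumed convention: query variables are strings starting with "?".
        if not (isinstance(var, str) and var.startswith("?")):
            raise ValueError(f"Invalid order-by var {var!r}: expected a ?-prefixed name.")
        # Assumed direction names; the real ones may differ.
        if direction not in ("asc", "desc"):
            raise ValueError(f"Invalid direction {direction!r} for {var}: expected asc or desc.")
        checked.append((var, direction))
    return {"limit": limit, "order_by_vars": checked}

ok = validate_options({"limit": 10, "order_by_vars": [["?name", "asc"]]})

try:
    validate_options({"limit": "ten"})
    limit_error = None
except ValueError as e:
    limit_error = str(e)
```
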
This caches a partition map per DB, which is helpful because it exposes
the point-in-time partition state of the DB, but unhelpful because the
partition state can advance underneath the DB cache. The same is true of
the approach in general -- it can happen to the ident/entid maps and to
the datoms themselves -- so we'll roll with it for now.
This reduces the number of SQL UPDATE operations from one per id-literal
used (linear in the size of the transaction) to a number bounded by the
number of known partitions.
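
One way to picture the saving, as a Python `sqlite3` sketch rather than the actual implementation: group id-literals by partition, then advance each partition's counter once by the group size. The `parts` table, its columns, and the function `allocate_entids` are hypothetical.

```python
import sqlite3

def allocate_entids(conn, id_literals):
    """Allocate entids for (name, partition) id-literals, one UPDATE per partition."""
    by_part = {}
    # dict.fromkeys removes duplicate literals while preserving order.
    for lit in dict.fromkeys(id_literals):
        by_part.setdefault(lit[1], []).append(lit)

    allocated = {}
    for part, lits in by_part.items():
        row = conn.execute("SELECT idx FROM parts WHERE part = ?", (part,)).fetchone()
        if row is None:
            raise KeyError(f"Unknown partition {part!r}")
        start = row[0]
        # Reserve the whole block of ids for this partition in one statement.
        conn.execute("UPDATE parts SET idx = idx + ? WHERE part = ?", (len(lits), part))
        for offset, lit in enumerate(lits):
            allocated[lit] = start + offset
    return allocated

# Hypothetical partition table, for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE parts (part TEXT PRIMARY KEY, idx INTEGER NOT NULL)")
conn.executemany(
    "INSERT INTO parts VALUES (?, ?)",
    [(":db.part/user", 65536), (":db.part/tx", 1000)],
)
literals = [
    ("a", ":db.part/user"), ("b", ":db.part/user"),
    ("a", ":db.part/user"), ("t", ":db.part/tx"),
]
entids = allocate_entids(conn, literals)
parts_after = dict(conn.execute("SELECT part, idx FROM parts"))
```

Four id-literal occurrences across two partitions cost two UPDATEs here, however many literals each partition receives.
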
* Alter how clauses are concatenated. They now preserve order more accurately.
* Track mappings between vars and extracted type columns.
* Generate type code constraints.
* Push known types down into :not.
* Push known types down into :or.
* Tests and test fixes.
Note that `go` (and `go-pair`) don't descend into `for` comprehensions
and other situations in which a fn is created. This commit rewrites the
affected code to use nested `loop`s, and also improves use of `<av`.
* Batch up datoms into a smaller number of queries, improving transact speed by about 50%.
* Restore transacting FTS attributes.
* Implement retraction of fulltext datoms.
This is almost complete; it passes the test suite save for retracting
fulltext datoms correctly.
There's a lot to say about this approach, but I don't have time to give
too many details. The broad outline is as follows. We collect datoms
to add and retract in a tx_lookup table. Depending on flags ("search
value" sv and "search value type tag" svalue_type_tag) we "complete" the
tx_lookup table by joining matching datoms. This allows us to find
datoms that are already present (and should not be added as part of the
transaction, or should be retracted as part of the transaction, or
should be replaced as part of the transaction). We complete the
tx_lookup (in place!) in two separate INSERTs to avoid a quadratic
two-table walk (explain the queries to observe that both INSERTs walk
the lookup table once and then use the datoms indexes to complete the
matching values).
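
A much-reduced sketch of the "complete the lookup table in place" step, written with Python's `sqlite3` for illustration only. It covers just the exact-match (eav) search and uses a single INSERT; the real code also handles the ea search, the sv and svalue_type_tag flags, and the added0 bookkeeping. The columns `matched_tx` and `completed`, and the simplified schemas, are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE datoms (e INTEGER, a INTEGER, v, tx INTEGER);
CREATE UNIQUE INDEX idx_datoms_eav ON datoms (e, a, v);
CREATE TABLE tx_lookup (
  e INTEGER, a INTEGER, v,
  added INTEGER,               -- 1 = assert, 0 = retract
  matched_tx INTEGER,          -- tx of the matching datom, if any
  completed INTEGER DEFAULT 0  -- 0 = as submitted, 1 = completed row
);
""")
conn.executemany("INSERT INTO datoms VALUES (?, ?, ?, ?)",
                 [(1, 10, "x", 5), (2, 10, "old", 5)])
conn.executemany(
    "INSERT INTO tx_lookup (e, a, v, added) VALUES (?, ?, ?, ?)",
    [(1, 10, "x", 1),      # already present: nothing to add
     (1, 11, "new", 1),    # genuinely new: add
     (2, 10, "old", 0),    # present: retract
     (3, 10, "gone", 0)],  # absent: nothing to retract
)

# Walk the lookup rows once; each probe into datoms can use the eav index.
conn.execute("""
INSERT INTO tx_lookup (e, a, v, added, matched_tx, completed)
SELECT t.e, t.a, t.v, t.added, d.tx, 1
FROM tx_lookup AS t
LEFT JOIN datoms AS d ON d.e = t.e AND d.a = t.a AND d.v = t.v
WHERE t.completed = 0
""")

rows = conn.execute(
    "SELECT e, a, v, added, matched_tx FROM tx_lookup "
    "WHERE completed = 1 ORDER BY e, a"
).fetchall()
to_add = [(e, a, v) for e, a, v, added, m in rows if added == 1 and m is None]
to_retract = [(e, a, v) for e, a, v, added, m in rows if added == 0 and m is not None]
```

Running `EXPLAIN QUERY PLAN` on the INSERT is a quick way to check that the datoms side is reached through the index rather than by a full scan.
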
We could simplify the code by using multiple lookup tables, both for the
two cases of search parameters (eav vs. ea) and for the incomplete and
completed rows. Right now we differentiate the former with NULL checks,
and the latter by incrementing the added0 column. It performs well
enough, so I haven't tried to understand the performance of separating
these things.
After the tx_lookup table is completed, we build the transaction from
it and update the datoms materialized view table as well. Observe the
careful handling of the "search value" sv parameters to handle replacing
:db.cardinality/one datoms.
Finally, we read the processed transaction back to return it through the
API. This is strictly to match the Datomic API; we might allow skipping
this, since many consumers will not want to stream this over the wire.
Rough timings show the transactor processing a single >50k datom
transaction in about 3.5s, of which less than 0.5s is spent in the
expensive joins. Further, repeating the processing of the same
transaction is only about 3.5s again! That's the worst possible case for
the joins, since every single inserted datom will already be present in
the database, making the most expensive join match every row.
This was a little more tricky than might be expected because the
initialization process uses the transactor to bootstrap the database.
Since Clojure doesn't accept mutually recursive modules, this
necessitated a third module, namely "db-factory", which uses both "db"
and "transact". While I was here, I started an "api" module, to paper
over the potentially complicated internal module structure for external
consumers. In time, this "api" module may also grow CLJS-specific JS
transformations.
This agrees with Datomic. DataScript allows tx values, possibly to
allow reconstructing DBs from Datom streams, but appears to handle
user-provided tx values in the transactor inconsistently.
The implementation of :db/tx is special and may need to change over
time. We add it as a special ident whose value is the current
transaction entity ID, specified per-transaction. This works well right
now but introduces some (internal) ordering requirements that may need
to be loosened.
Internally, we use SQLite's FTS4 to maintain a fulltext_values table of
unique "text" values. Fulltext-indexed datoms have a value v that is the
rowid into fulltext_values. We manually maintain the map between rowid
and value in the transactor.
For convenience, we expose two views interpolating the real text values
into the datoms structure.
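
The arrangement can be sketched with Python's `sqlite3` as follows, assuming the bundled SQLite build has FTS4 enabled. The simplified `datoms` schema, the `index_fulltext` flag column, the view name `fulltext_datoms`, and the helper `intern_text` are assumptions for illustration, and only one view is shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE VIRTUAL TABLE fulltext_values USING fts4(text);
CREATE TABLE datoms (e INTEGER, a INTEGER, v, tx INTEGER,
                     index_fulltext INTEGER DEFAULT 0);
-- View interpolating the real text back into the datoms shape.
CREATE VIEW fulltext_datoms AS
  SELECT d.e, d.a, f.text AS v, d.tx
  FROM datoms AS d JOIN fulltext_values AS f ON d.v = f.rowid
  WHERE d.index_fulltext = 1;
""")

def intern_text(conn, text):
    """Return the rowid for text, inserting it only if it is not yet stored."""
    query = "SELECT rowid FROM fulltext_values WHERE text = ?"
    row = conn.execute(query, (text,)).fetchone()
    if row is None:
        # FTS4 tables cannot carry a UNIQUE constraint, so uniqueness is
        # maintained by hand, as the transactor does.
        conn.execute("INSERT INTO fulltext_values (text) VALUES (?)", (text,))
        row = conn.execute(query, (text,)).fetchone()
    return row[0]

first = intern_text(conn, "the quick brown fox")
second = intern_text(conn, "the quick brown fox")  # reuses the existing row
conn.execute(
    "INSERT INTO datoms (e, a, v, tx, index_fulltext) VALUES (1, 20, ?, 100, 1)",
    (first,),
)

view_rows = conn.execute("SELECT e, a, v FROM fulltext_datoms").fetchall()
hits = [e for (e,) in conn.execute(
    "SELECT e FROM datoms WHERE index_fulltext = 1 "
    "AND v IN (SELECT rowid FROM fulltext_values WHERE text MATCH ?)",
    ("fox",),
)]
```
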
This version includes SQLite-level unique indexes; these should never be
needed. I've included them as a fail-safe while testing; they'll help
us catch errors in the transaction layer above.
In the future, we might add a layer of indirection, hashing values to
avoid duplicating storage, or sorting URLs, or handling fulltext indexed
values differently, or ...
Some of these were just typos, but `with-open` was fatally flawed on
CLJS (we couldn't call `.close` at all), and `deftest-async` was hiding
all failures (due to a typo).
We would prefer to talk about a knowledge base on top of a database, but
all the Datomic and DataScript code (and symbols, like :db/add, etc.)
refer to the "database of datoms", so let's roll with that nomenclature
and try to be specific that the persistent storage layer is SQLite.
This will become clearer when we actually use SQLite's unique
capabilities for text indexing.
This is a well-worn idea: use a `promise-channel` of `[result nil]` or
`[nil error]` pairs. The `go-pair` and `<?` macros handle catching
exceptions (important, given that synchronous CLJ code expects to throw
rather than return an error promise or similar), allowing code like:
```
(go-pair
  (let [result (<? (pair-chan-fn))]
    (when (not result)
      (throw (Exception. "No result!")))
    (transform result)))
```
to be expressed naturally. These are the equivalents of `async` and
`await` in JS.
The implementation is complicated by significant incompatibilities
between CLJ and CLJS. The solution presented here takes care to
separate the macro definitions into CLJ. Sadly, this requires
namespacing the per-environment symbols explicitly; but we hope to
minimize such code in files like this.
The most significant restriction to this approach is that consumers must
require the transitive dependencies of the macro-defining modules. See
the included tests (both CLJ and CLJS) for the appropriate
incantations (for pair-chan, core.async, and test).