Add initial description of how querying works.

Richard Newman 2017-01-13 12:34:49 -08:00
parent 6767687c11
commit 76d758f78e

Querying.md
# Querying
## The highest of high-level overviews
Query evaluation follows this conceptual sequence:
1. EDN parsing.
2. Query parsing: turning EDN into an abstract representation of a valid query.
3. Expansion/planning: interpreting a parsed query into a set of relations and data structures within the context of a database.
4. Query translation: turning the plan into SQL.
5. Projection translation: turning the plan into a set of projectors/accumulators that will turn SQL result rows into Datalog results.
6. (Implicit) preparation: creating views, tables, indices, or even whole databases or connections upon which the query will be run.
7. Execution: running the translated SQL against the database.
8. Projection: running the projectors against the resulting cursor to yield Datalog results.
9. Serialization: encoding results for the consumer (_e.g._, as EDN).
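The flow of data through these stages can be sketched as a single driver function. Everything here is heavily stubbed and none of these names correspond to Mentat's actual API; this only illustrates how each stage's output feeds the next.

```python
import sqlite3

# A heavily stubbed sketch of the nine stages; all names are invented
# for illustration and do not reflect Mentat's real implementation.
def evaluate(query_string, conn):
    edn = query_string                        # 1. EDN parsing (stub)
    parsed = {"find": "scalar", "edn": edn}   # 2. query parsing (stub)
    plan = {"tables": []}                     # 3. expansion/planning (stub)
    sql = "SELECT 1"                          # 4. query translation (stub)
    project = lambda row: row[0]              # 5. projection translation (stub)
    # 6. preparation: nothing to prepare for this trivial query
    cursor = conn.execute(sql)                # 7. execution
    results = [project(r) for r in cursor]    # 8. projection
    return str(results)                       # 9. serialization (stub)

conn = sqlite3.connect(":memory:")
print(evaluate("[:find ?x :where [?x _ _]]", conn))  # → [1]
```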
## Parsing
A query usually arrives as a string. That string is parsed to EDN, and from there parsed to a `Find*` (_e.g._, `FindTuple` represents a query that returns a tuple).
This entire step can take place in advance (indeed, at compile-time): it doesn't depend on the schema or any data bindings.
The second sub-stage of parsing — from EDN to a specific representation — involves some validation, ensuring that every parsed query makes at least basic sense: for example, that the projected variables intersect the variables present in the `:in` and `:where` clauses of the query; that all branches of an `or` work with the same variables; and that forms are the correct lengths. The type system will help to encode and enforce many (but not all) of these constraints.
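One of those validation rules — that every projected variable is bound somewhere in `:in` or `:where` — can be sketched like this. The parsed-query shape and function name are invented for the example:

```python
# Hypothetical illustration of one parse-time validation: every
# projected variable must appear in the :in or :where clauses.
def validate_projection(find_vars, in_vars, where_clauses):
    bound = set(in_vars)
    for clause in where_clauses:
        bound.update(t for t in clause if isinstance(t, str) and t.startswith("?"))
    missing = [v for v in find_vars if v not in bound]
    if missing:
        raise ValueError("projected variables not bound: " + ", ".join(missing))

# Fine: ?title is bound by the pattern.
validate_projection(["?title"], [], [["?page", ":page/title", "?title"]])

# Rejected: ?url never appears in :in or :where.
try:
    validate_projection(["?url"], [], [["?page", ":page/title", "?title"]])
except ValueError as e:
    print(e)  # → projected variables not bound: ?url
```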
## Expansion
This is the point in the process at which the contents of the database — in particular, its schema and ident mappings — are first used.
The parsed query is walked and used to populate a `Context` and a nested set of `ConjoiningClauses`. These collect known types (_e.g._, if `[?x :foo/bar 5]` appears in a query, we know that `?x` must be an entity; if `[(> ?y 5)]` appears, we know that `?y` must be of some numeric type), value constraints, and the beginnings of SQL-oriented planning.
During this processing we might encounter inconsistencies: a query might imply that `?y` must be both an entity and an `instant`. The entire clause is then unsatisfiable, and we can prune as we go. If an entire query is known to return no results prior to execution, we might report this as an error.
Naturally these constraints are tied to the store's schema at the time of query execution: `:foo/bar` might be a unique integer attribute in one store, and a fulltext-indexed string attribute in another. They are also tied to any external bindings, which leaves us an implementation choice: to re-plan each query whenever a binding changes (more expensive, simpler, produces a more efficient query), or to produce a planned query that can accommodate any kind of binding (allows reuse, but leads to runtime limitations).
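The narrowing described above — each clause intersecting the set of types a variable may take, with an empty intersection signalling unsatisfiability — can be sketched as follows. The type names and representation are simplified stand-ins, not Mentat's real ones:

```python
# A sketch of type narrowing during expansion: each clause intersects
# the set of types a variable may have.
def narrow(known, var, types):
    allowed = known.get(var, set(types)) & set(types)
    known[var] = allowed
    return bool(allowed)  # False means the clause is unsatisfiable

known = {}
narrow(known, "?x", {"ref"})              # [?x :foo/bar 5]: ?x is an entity
narrow(known, "?y", {"long", "double"})   # [(> ?y 5)]: ?y is numeric
ok = narrow(known, "?y", {"instant"})     # a clause claiming ?y is an instant
print(ok)  # → False: the conjunction is unsatisfiable and can be pruned
```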
## Query translation
This is a good time to briefly cover how data is stored in SQLite.
PICTURE GOES HERE
There are at present two tables that store the current set of datoms: `datoms` and `fulltext_datoms`. `fulltext_datoms` refers to strings stored in the `fulltext_values` table; a view called `all_datoms` merges these two datom tables together.
Each attribute is either fulltext or not, so if you know an attribute — `[?x :foo/name ?name]` has attribute `:foo/name` — you know which datom table to consult. We abstract this as a `Source` interface: the database implements this, answering questions like "which table or view should I query for this attribute?". When an attribute isn't known in advance, we fall back to `all_datoms`, but this is relatively rare.
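The table-selection question a `Source` answers can be sketched like this; the attribute set and function name are invented for illustration:

```python
# Illustrative only: which datom table to consult for an attribute.
FULLTEXT_ATTRS = {":page/description"}

def table_for_attribute(attr):
    if attr is None:                  # attribute position is unbound:
        return "all_datoms"           # fall back to the merged view
    if attr in FULLTEXT_ATTRS:
        return "fulltext_datoms"
    return "datoms"

print(table_for_attribute(":page/title"))        # → datoms
print(table_for_attribute(":page/description"))  # → fulltext_datoms
print(table_for_attribute(None))                 # → all_datoms
```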
A query with multiple clauses turns into repeated self joins against the appropriate datom tables. For example, given a fulltext attribute `:page/description`, the query:
```
[:find ?title ?url ?description
 :in $
 :where
 [?page :page/title ?title]
 [?page :page/url ?url]
 [?page :page/description ?description]]
```
turns into SQL with three joins on two tables:
```
SELECT DISTINCT datoms1.v AS title,
                datoms2.v AS url,
                fulltext_datoms3.v AS description
FROM datoms AS datoms1,
     datoms AS datoms2,
     fulltext_datoms AS fulltext_datoms3
WHERE datoms1.a = 65540               -- :page/title
  AND datoms2.e = datoms1.e
  AND datoms2.a = 65541               -- :page/url
  AND fulltext_datoms3.e = datoms1.e
  AND fulltext_datoms3.a = 65542;     -- :page/description
```
SQLite will likely implement this as an `avet` index walk over `datoms` binding `?page` and `?title`, coupled with walking or jumping into the `eavt` index on `datoms` to get bindings for `?url`, coupled with much the same on `fulltext_datoms`, but with an additional lookup by ID into the `fulltext_values` FTS virtual table to get the description string.
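The self-join can be exercised against a toy store. The schema below is a simplification — in particular, it stores the description string directly in `fulltext_datoms.v` rather than as an ID into `fulltext_values` — and the entids and page data are invented:

```python
import sqlite3

# A toy reproduction of the three-way self-join above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE datoms (e INTEGER, a INTEGER, v);
CREATE TABLE fulltext_datoms (e INTEGER, a INTEGER, v);
INSERT INTO datoms VALUES (100, 65540, 'Mozilla'),
                          (100, 65541, 'https://mozilla.org');
INSERT INTO fulltext_datoms VALUES (100, 65542, 'Internet for people');
""")
rows = conn.execute("""
SELECT DISTINCT datoms1.v AS title,
                datoms2.v AS url,
                fulltext_datoms3.v AS description
FROM datoms AS datoms1,
     datoms AS datoms2,
     fulltext_datoms AS fulltext_datoms3
WHERE datoms1.a = 65540               -- :page/title
  AND datoms2.e = datoms1.e
  AND datoms2.a = 65541               -- :page/url
  AND fulltext_datoms3.e = datoms1.e
  AND fulltext_datoms3.a = 65542      -- :page/description
""").fetchall()
print(rows)  # → [('Mozilla', 'https://mozilla.org', 'Internet for people')]
```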
A reader with some database experience will notice at this point that two other obvious compositions exist: we could have a separate datoms table for each known attribute, or for each _part_; or we could combine values for related sets of attributes into multiple columns on the same table, producing something that's much closer to a traditional SQL schema. These are relatively straightforward extensions that we plan to explore later, safely hidden behind our API abstractions.
## Projection
A SQL query returns SQL values. But we work with a richer type system — booleans, dates, keywords — and so some translation is necessary. Further, we will support extra-SQL aggregates (simple aggregates are already directly translated into SQL), and other additions like evaluating pull expressions on query results.
These steps are handled by _projectors_.
### A brief diversion into types
A range of operations — projection, filtering, retrieval, sorting, comparison — requires knowledge of a value's type: a SQL `3` might be a timestamp in January 1970, an entity ID, an integer, or a pointer to the third value in the `fulltext_values` table. The query
```
[:find ?e :in $ :where [?e _ 1234]]
```
can't simply be translated to the SQL:
```
SELECT datoms1.e AS e
FROM datoms datoms1
WHERE datoms1.v = 1234;
```
— in addition to returning all entities that have some attribute with the integer value 1234 or the entid 1234 (both acceptable interpretations in Datalog), it would also return entities for which `1234` is a timestamp, or whatever other encodings our storage layer uses.
For this reason we store a type tag in the database alongside each value, and we use it: the query above might instead be
```
SELECT datoms1.e AS e
FROM datoms datoms1
WHERE datoms1.v = 1234
  AND datoms1.value_type_tag IN (0, 5); -- entity or number
```
For projection of a variable bound in value position, we need to retrieve this tag alongside the value:
```
SELECT datoms1.v AS v,
       datoms1.value_type_tag AS v_type_tag
FROM datoms datoms1;
```
and the projector will consume the tag and the value to decide how to translate into Datalog.
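A tag-switching projector can be sketched as follows. Tags `0` (entity/ref) and `5` (number) follow the tag-filter example above; the dispatch shape is illustrative, not Mentat's real projector:

```python
# A sketch of a projector that consumes a value and its type tag to
# produce a typed Datalog value. Tag assignments beyond 0 and 5 are
# invented for illustration.
def project_value(value, tag):
    if tag == 0:
        return ("ref", value)      # an entity ID
    if tag == 5:
        return ("long", value)     # a plain integer
    raise ValueError("unhandled value_type_tag: %d" % tag)

print(project_value(1234, 5))  # → ('long', 1234)
print(project_value(1234, 0))  # → ('ref', 1234)
```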
In practice we rarely need to do any of this. In the case of values retrieved by association to a known attribute, we know the attribute's type in advance: a `:foo/name` value must be a string, and a `:db/ident` value must be a keyword. We don't need to retrieve the tag from the database, and we don't need to switch on the type of the projector at runtime.
Similarly, we know that anything in entity position must be a ref, and can't be any other kind of value. We can deduce that values compared with various kinds of operations must be numeric, or strings, or whatever. We can even propagate type constraints, so:
```
[:find ?e, ?x, ?y :in $
 :where
 ;; ?e must be an entity.
 [?e :some/numeric ?x]    ; Schema: ?x must be numeric.
 [?e ?a ?y]               ; ?a must be an entity.
 [(!= ?a :some/numeric)]  ; !=: ?a must be an entity.
 [(= ?y ?x)]]             ; =: ?y must be numeric.
```
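Propagating a constraint across an equality predicate, as in the last clause above, amounts to sharing the intersection of both variables' allowed types. The representation below is invented for illustration:

```python
# A sketch of propagating type constraints through [(= ?y ?x)].
ALL = {"ref", "long", "double", "instant", "string", "keyword"}

def unify_types(known, a, b):
    """After [(= a b)], both variables share the intersection of
    their allowed type sets."""
    shared = known.get(a, ALL) & known.get(b, ALL)
    known[a] = known[b] = shared
    return shared

known = {"?x": {"long", "double"}}   # schema: ?x must be numeric
unify_types(known, "?y", "?x")
print(sorted(known["?y"]))  # → ['double', 'long']: ?y is now numeric too
```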