4 Querying

The highest of high-level overviews

Query evaluation follows this conceptual sequence:

  1. EDN parsing.
  2. Query parsing: turning EDN into an abstract representation of a valid query.
  3. Algebrizing: interpreting a parsed query into a set of relations and data structures within the context of a database.
  4. Query translation: turning the algebrized query into SQL. This is, in a sense, a simple form of query planning.
  5. Projection translation: turning the algebrized query into a set of projectors/accumulators that will turn SQL result rows into Datalog results.
  6. (Implicit) preparation: creating views, tables, indices, or even whole databases or connections upon which the query will be run.
  7. Execution: running the translated SQL against the database.
  8. Projection: running the projectors against the resulting cursor to yield Datalog results.
  9. Serialization: encoding results for the consumer (e.g., as EDN).

Parsing

A query usually arrives as a string. That string is parsed into EDN, and from there into a Find* representation (e.g., FindTuple represents a query that returns a tuple).

This entire step can take place in advance (indeed, at compile-time): it doesn't depend on the schema or any data bindings.

The second sub-stage of parsing — from EDN to a specific representation — involves some validation, such that all parsed queries make some amount of sense: for example, that the projected variables intersect with the variables present in the :in and :where clauses of the query; that all branches of an or work with the same variables; that forms are the correct lengths. The type system will help to encode and enforce many (but not all) of these constraints.

Algebrizing

This is the point in the process at which the contents of the database — in particular, its schema and ident mappings — are first used.

The parsed query is walked and (in conjunction with the schema and ident map) used to populate a Context and a nested set of ConjoiningClauses. These collect known types (e.g., if [?x :foo/bar 5] appears in a query, we know that ?x must be an entity; if [(> ?y 5)] appears, we know that ?y must be of some numeric type), value constraints, and the beginnings of SQL-oriented planning.

During this algebrizing we might encounter inconsistencies: a query might imply that ?y must be both an entity and an instant. The entire clause is then unsatisfiable, and we can prune as we go. If an entire query is known to return no results prior to execution, we might report this as an error.

Naturally these constraints are tied to the store's schema at the time of query execution: :foo/bar might be a unique integer attribute in one store, and a fulltext-indexed string attribute in another. They are also tied to any external bindings, which leaves us an implementation choice: to re-plan each query whenever a binding changes (more expensive, simpler, produces a more efficient query), or to produce a planned query that can accommodate any kind of binding (allows reuse, but leads to runtime limitations). See #278 for that.
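
As a rough sketch of that tradeoff, using the storage tables described in the next section (the entid here is illustrative, not taken from a real store): re-planning for a specific binding lets us inline the bound attribute and choose the narrower table, while a reusable plan must take the attribute as a runtime parameter and stay generic.

-- Re-planned for a specific binding: the bound attribute is a constant,
-- so we can pick the right datom table and indices up front.
SELECT datoms1.e AS e
  FROM datoms AS datoms1
 WHERE datoms1.a = 65540;

-- Planned once for any binding: the attribute arrives as a parameter at
-- runtime, so we can't specialize and must consult the merged view.
SELECT all_datoms1.e AS e
  FROM all_datoms AS all_datoms1
 WHERE all_datoms1.a = ?;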

Query translation

This is a good time to briefly cover how data is stored in SQLite.

[DB schema diagram]

There are at present two tables that store the current set of datoms: datoms and fulltext_datoms. fulltext_datoms refers to strings stored in the fulltext_values table; a view called all_datoms merges these two datom tables together.

Each attribute is either fulltext or not, so if you know an attribute — [?x :foo/name ?name] has attribute :foo/name — you know which datom table to consult. We abstract this as a Source interface: the database implements this, answering questions like "which table or view should I query for this attribute?". When an attribute isn't known in advance, we fall back to all_datoms, but this is relatively rare.
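
As a rough illustration (not the actual schema definition, which carries more columns and flags, and in which the stored-string column name is an assumption here), the merging view could look something like:

CREATE VIEW all_datoms AS
  SELECT e, a, v FROM datoms
  UNION ALL
  SELECT fd.e, fd.a, fv.text AS v        -- 'text' is an assumed column name for the stored string
    FROM fulltext_datoms AS fd
    JOIN fulltext_values AS fv ON fv.rowid = fd.v;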

A query with multiple clauses turns into repeated self joins against the appropriate datom tables. For example, given a fulltext attribute :page/description, the query:

[:find ?title ?url ?description
 :in $
 :where
 [?page :page/title ?title]
 [?page :page/url ?url]
 [?page :page/description ?description]]

turns into SQL joining three table aliases across the two datom tables:

SELECT DISTINCT datoms1.v AS title,
                datoms2.v AS url,
                fulltext_datoms3.v AS description
FROM datoms AS datoms1,
     datoms AS datoms2,
     fulltext_datoms AS fulltext_datoms3
WHERE datoms1.a = 65540                  -- :page/title
  AND datoms2.e = datoms1.e
  AND datoms2.a = 65541                  -- :page/url
  AND fulltext_datoms3.e = datoms1.e
  AND fulltext_datoms3.a = 65542;        -- :page/description

SQLite will likely implement this as an avet index walk over datoms to bind ?page and ?title, together with walks or lookups into the eavt index on datoms to bind ?url, and much the same against fulltext_datoms, plus an additional lookup by ID into the fulltext_values FTS virtual table to fetch the description string.

A reader with some database experience will notice at this point that two other obvious compositions exist: we could have a separate datoms table for each known attribute, or for each part; or we could combine values for related sets of attributes into multiple columns on the same table, producing something that's much closer to a traditional SQL schema. These are relatively straightforward extensions that we plan to explore later, safely hidden behind our API abstractions.

We attempt to make the produced query predictable: there is a correspondence between the phrasing of the input query (even down to clause order) and the phrasing of the generated SQL, allowing users to iteratively explore query performance. We expect that most queries will be run more often than they will be written, and so at this stage it's more important to allow for predictable tuning than to aim for virtuoso plans.
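
One way to do that exploration is to ask SQLite directly how it intends to execute the generated SQL; for example, for a two-clause subset of the query above (the output depends on the SQLite version and the indices present):

EXPLAIN QUERY PLAN
SELECT DISTINCT datoms1.v AS title,
                datoms2.v AS url
  FROM datoms AS datoms1,
       datoms AS datoms2
 WHERE datoms1.a = 65540          -- :page/title
   AND datoms2.e = datoms1.e
   AND datoms2.a = 65541;         -- :page/url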

Projection

A SQL query returns SQL values. But we work with a richer type system — booleans, dates, keywords — and so some translation is necessary. Further, we will support extra-SQL aggregates (simple aggregates are already directly translated into SQL), and other additions like evaluating pull expressions on query results.

These steps are handled by projectors.

A brief diversion into types

A range of operations — projection, filtering, retrieval, sorting, comparison — requires knowledge of a value's type: a SQL value of 3 might be a timestamp in January 1970, an entity ID, an integer, or a pointer to the third value in the fulltext_values table. The query

[:find ?e :in $ :where [?e _ 1234]]

can't simply be translated to the SQL:

SELECT datoms1.e AS e
  FROM datoms datoms1
 WHERE datoms1.v = 1234;

— rather than returning only the entities that have some attribute with the integer value 1234 or the entid 1234 (both acceptable interpretations in Datalog), it will also return entities related to 1234 interpreted as a timestamp, or whatever other magic we include in our storage layer.

For this reason we store a type tag in the database alongside each value, and we use it: the query above might instead be

SELECT datoms1.e AS e
  FROM datoms datoms1
 WHERE datoms1.v = 1234
   AND datoms1.value_type_tag IN (0, 5);   -- entity or number

For projection of a value whose type isn't known in advance, we retrieve the tag alongside it:

SELECT datoms1.v AS v,
       datoms1.value_type_tag AS v_type_tag
  FROM datoms datoms1;

and the projector will consume the tag and the value to decide how to translate into Datalog.

In practice we rarely need to do any of this. In the case of values retrieved by association to a known attribute, we know the attribute's type in advance: a :foo/name value must be a string, and a :db/ident value must be a keyword. We don't need to retrieve the tag from the database, and we don't need to switch on the type of the projector at runtime.
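
For example, a clause like [?x :foo/name ?name] against a schema in which :foo/name is a string attribute can be projected directly, with no tag column in the result (the entid 65543 is made up for illustration):

SELECT datoms1.v AS name
  FROM datoms AS datoms1
 WHERE datoms1.a = 65543;         -- :foo/name; the schema already tells us v is a string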

Similarly, we know that all entities are entities/refs, and can't be any other kind of value. We can deduce that values compared with various kinds of operations must be numeric, or strings, or whatever. We can even propagate type constraints, so:

[:find ?e, ?x, ?y :in $
 :where
 ; ?e must be an entity.
 [?e :some/numeric ?x]       ; Schema: ?x must be numeric.
 [?e ?a ?y]                  ; ?a must be an entity.
 [(!= ?a :some/numeric)]     ; !=: ?a must be an entity.
 [(= ?y ?x)]]                ; =: ?y must be numeric.