[query] Second algebrizing phase for bound variables in prepared statements #121

New issue

Open

opened 2020-08-06 16:55:16 +00:00 by gburd · 0 comments

gburd commented

2020-08-06 16:55:16 +00:00

(Migrated from github.com)

In the ClojureScript implementation of Mentat, we processed a query from EDN through to SQL results each time it was run.

This isn't ideal for real-world use: we much prefer to use prepared statements, which allow us to parse a query once, plan it infrequently, and run it many times.

Nevertheless, in the ClojureScript implementation we did things right: external inputs (bound variables) were threaded into the mix after algebrizing, at the point of translation from query context to SQL, rather than being substituted before algebrizing.

At present, our Rust algebrizing phase (part-implemented!) deliberately ignores bound variables, just as in CLJS.

That allows algebrizing to be done once for each schema: a query can be 'prepared' once, and only needs to be re-prepared if the set of idents or attributes changes. Values that match any deduced type constraints can be substituted into the same prepared query. (Values that don't match will cause the query to automatically return no results.)

For many uses of bound (:in) variables, this is fine: queries like

[:find ?person ?email
 :in $ ?name
 :where
 [?person :person/name ?name]
 [?person :person/email ?email]]

won't benefit from further analysis at the point prior to execution when ?name is known. But queries like

[:find ?x ?z :in $ ?a :where [?x ?a ?z]]

certainly would — knowing ?a allows us to know both which table to query and the expected type of ?z when projected. A query planned with an unbound ?a must query all_datoms and project ?z's value_type_tag to do runtime type projection, both of which are expensive.

We should do two things:

Implement some heuristic to decide whether a second phase of algebrizing is useful. Perhaps this is as simple as "there are :in variables whose types are unknown", or perhaps "one of the :in variables names an attribute".
Implement a second, cheap phase of algebrizing, and use it prior to query execution as appropriate.

Queries with no bound parameters, or that we're happy not re-algebrizing, can be translated directly into SQLite statements for each connection. Other queries will generate a new SQL query each time (though obviously memoization is possible), and so the potential speedups of a more precise query formulation must be traded off against the loss of preparation. This might be a choice we leave to users, just as SQL databases give the option of keeping prepared statements.

I expect queries with certain kinds of :in parameters will always need a second phase of processing, or need to be entirely processed prior to execution. A query with a source var %foo needs to know which tables provide %foo.

Queries can also take a ?log variable representing the database transaction log. That's used in a tx-ids or tx-data expression. It's not clear to me how we'll handle that yet.

This issue tracks figuring out this work when the first stage of the algebrizer is complete.

In the ClojureScript implementation of Mentat, we processed a query from EDN through to SQL results each time it was run. This isn't ideal for real-world use: we much prefer to use prepared statements, which allow us to parse a query once, plan it infrequently, and run it many times. Nevertheless, in the ClojureScript implementation we did things right: external inputs (bound variables) were threaded into the mix after algebrizing, at the point of translation from query context to SQL, rather than being substituted before algebrizing. At present, our Rust algebrizing phase (part-implemented!) deliberately ignores bound variables, just as in CLJS. That allows algebrizing to be done once for each schema: a query can be 'prepared' once, and only needs to be re-prepared if the set of idents or attributes changes. Values that match any deduced type constraints can be substituted into the same prepared query. (Values that don't match will cause the query to automatically return no results.) For many uses of bound (`:in`) variables, this is fine: queries like ``` [:find ?person ?email :in $ ?name :where [?person :person/name ?name] [?person :person/email ?email]] ``` won't benefit from further analysis at the point prior to execution when `?name` is known. But queries like ``` [:find ?x ?z :in $ ?a :where [?x ?a ?z]] ``` certainly would — knowing `?a` allows us to know both which table to query and the expected type of `?z` when projected. A query planned with an unbound `?a` must query `all_datoms` and project `?z`'s `value_type_tag` to do runtime type projection, both of which are expensive. We should do two things: - Implement some heuristic to decide whether a second phase of algebrizing is useful. Perhaps this is as simple as "there are `:in` variables whose types are unknown", or perhaps "one of the `:in` variables names an attribute". - Implement a second, cheap phase of algebrizing, and use it prior to query execution as appropriate. Queries with no bound parameters, or that we're happy not re-algebrizing, can be translated directly into SQLite statements for each connection. Other queries will generate a new SQL query each time (though obviously memoization is possible), and so the potential speedups of a more precise query formulation must be traded off against the loss of preparation. This might be a choice we leave to users, just as SQL databases give the option of keeping prepared statements. I expect queries with certain kinds of `:in` parameters will always need a second phase of processing, or need to be entirely processed prior to execution. A query with a source var `%foo` needs to know which tables provide `%foo`. Queries can also take a `?log` variable representing the database transaction log. That's used in a `tx-ids` or `tx-data` expression. It's not clear to me how we'll handle that yet. This issue tracks figuring out this work when the first stage of the algebrizer is complete.