[meta] Materialized views and caches #161

Open
opened 2020-08-06 16:55:57 +00:00 by gburd · 0 comments
gburd commented 2020-08-06 16:55:57 +00:00 (Migrated from github.com)

One of the potential value propositions for Mentat is semi-automatic materialization of views and caches, offering applications both flexible storage and fast querying.

I see applications working at three levels:

  1. The most general, and also the slowest, is direct query access to the database. This would be used for development, for non-speed-critical queries, and for queries where speed is willingly traded for space.

  2. The next is materialization of views on disk. This approach suits queries that need to be faster than direct access but that either must interface with other query parts or don't need to live in memory.

    Both simple attribute lookups and complex queries could be materialized; the difference is simply the width of the table and the columns it has.

  3. The final level is in-memory caching. We already do this for schema data, because we need it synchronously to algebrize queries and process transactions. Applications, too, might need fast synchronous access to a subset of data: lookup tables, URLs, scores. This approach works best for lookups that need to be very fast but don't take up too much memory. All three levels are sketched below.
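To make the three levels concrete, here's a minimal Rust sketch. The `Store::open`/`q_once` calls follow the existing query API; the `:page/url` attribute, the view table and its SQL, and the cache type are all hypothetical placeholders, not proposals for concrete names.

```rust
use std::collections::HashMap;

use mentat::{Queryable, Store};

fn main() {
    // Level 1: direct query access -- the most general, and the slowest.
    // An empty path opens an in-memory store; `q_once` is the existing
    // one-shot query entry point. (Assumes :page/url exists in the schema.)
    let store = Store::open("").expect("open in-memory store");
    let _results = store
        .q_once(r#"[:find ?url :where [?page :page/url ?url]]"#, None)
        .expect("direct query");

    // Level 2: a materialized view on disk. A simple attribute lookup
    // materializes as a narrow (entity, value) table; a complex query
    // materializes as a wider table, one column per projected variable.
    // Hypothetical SQL:
    //   CREATE TABLE view_page_url (e INTEGER PRIMARY KEY, url TEXT);

    // Level 3: an in-memory cache over a small, hot subset: fast
    // synchronous lookups for URLs, scores, and the like. Hypothetical type.
    let _cache: HashMap<i64, String> = HashMap::new();
}
```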

The interaction with cached data should explicitly define which kind of cache is expected: write-through, write-back, write-around.
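One way to make that explicit is to pass the cache kind to whatever call registers the cache. A minimal sketch; the enum and the registration call are hypothetical, not existing API:

```rust
/// Hypothetical policies an application could request when registering
/// a cached attribute or query.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum CacheKind {
    /// Writes update the store and the cache synchronously; reads are
    /// always consistent with disk.
    WriteThrough,
    /// Writes land in the cache first and are flushed to the store
    /// later: fastest writes, weakest durability.
    WriteBack,
    /// Writes bypass the cache entirely; the entry is invalidated and
    /// refilled on the next read.
    WriteAround,
}

// Hypothetical registration call:
// store.cache_attribute(":page/url", CacheKind::WriteThrough)?;
```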

Exploring materialized views will involve:

  • Defining a mechanism for describing views, probably as a `find-rel` query. This definition will need to live alongside the data.
  • Implementing a module to update views when a relevant transaction is written. Updates could be incremental or complete, with obvious performance concerns (see the sketch after this list).
  • Extending the transactor to trigger view updates accordingly. This involves formalizing when updates occur: inside a transact? As a transact listener (inconsistent but higher throughput)? Should applications choose?
  • Potentially implementing `q_many` to avoid the overhead of algebrizing on each write.
  • Potentially generating queries that project directly into an SQL `INSERT`/`UPDATE`, rather than round-tripping through in-memory projection.
  • Adjusting the algebrizer to generate queries that use materialized views when appropriate.
  • Ensuring that the algebrizer doesn't use materialized views when generating queries for materialized views!
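As one hedged sketch of how the first few bullets could fit together: persist each view's defining query alongside the data, and have the transactor (or a listener) refresh the affected views after each transact. Every name below is hypothetical:

```rust
/// A hypothetical view description, persisted alongside the data so the
/// view can be rebuilt or incrementally maintained.
pub struct ViewDefinition {
    /// The SQL table that materializes the results.
    pub table: String,
    /// The defining find-rel query, stored as EDN source.
    pub query: String,
}

#[derive(Clone, Copy)]
pub enum RefreshStrategy {
    /// Re-run the whole query and rewrite the table; simple but slow.
    Complete,
    /// Apply only the datoms asserted or retracted by this transaction;
    /// fast, but requires per-query delta analysis.
    Incremental,
}

/// Hypothetical hook, run either inside the transact (consistent) or as
/// a transact listener (higher throughput), which is exactly the open
/// question in the third bullet above.
pub fn refresh_affected_views(
    views: &[ViewDefinition],
    touched_attributes: &[String],
    strategy: RefreshStrategy,
) {
    for view in views {
        // Placeholder relevance check: a real implementation would compare
        // the algebrized query's attributes against the transaction's.
        let relevant = touched_attributes
            .iter()
            .any(|attr| view.query.contains(attr.as_str()));
        if relevant {
            match strategy {
                // Could project straight into an SQL INSERT into
                // `view.table`, skipping in-memory projection.
                RefreshStrategy::Complete => { /* re-run query, rewrite table */ }
                RefreshStrategy::Incremental => { /* apply transaction deltas */ }
            }
        }
    }
}
```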

Exploring caching will involve:

  • Formalizing the relationship between the cache and a materialized view. Is the cache always backed by a view? Can we keep a subset in memory?
  • Deciding whether caches are writable, and how those map to transacts.
  • Extending the transactor to update caches appropriately when their source data changes.
  • Deciding how to share caches between threads (sketched below).
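For the last point, the obvious starting point is the standard reader-writer idiom, with the transactor as the sole writer. A sketch, assuming a cache keyed by entid; the type alias is hypothetical:

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// A hypothetical shared cache: many reader threads, one writer (the
/// transactor) that updates entries when the source data changes.
type SharedCache = Arc<RwLock<HashMap<i64, String>>>;

fn main() {
    let cache: SharedCache = Arc::new(RwLock::new(HashMap::new()));

    // The transactor takes the write lock when a relevant datom changes.
    cache.write().unwrap().insert(42, "https://example.com".into());

    // Reader threads clone the Arc and take a read lock per lookup.
    let reader = Arc::clone(&cache);
    let _url: Option<String> = reader.read().unwrap().get(&42).cloned();
}
```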

Thoughts, crew?
