Thoughts: open questions
Many of these are already tracked as issues, but here's a list of the larger work items and areas still to explore.
This was written 2016-12-05, so it's already a little outdated, but worth capturing for posterity.
- One of Mentat's advantages over something like SQLite is in abstracting away some of the details of how to store and index data. There are more opportunities here: graph-oriented or geospatial indexing, for example. If Mozilla moves into heavy location-based systems, geospatial storage becomes a must. I have some experience in this area, and I have consulting contacts with more.
- Synchronization (harder) and replication (easier). We expose transaction log data to applications; with the aid of the schema it should be possible to build a relatively straightforward generalized synchronization or replication protocol. (A rough sketch of log-based replication appears at the end of this page.)
- Encrypted storage. What are the tradeoffs to using SQLCipher instead of SQLite? It's a trivial change that gives us pervasive encryption of all data, but we don't know the performance implications. If we ATTACH two databases to encrypt only some data (per-attribute encryption), do we over-complicate querying or handling the transaction log, and are we able to transact mixed collections of datoms atomically? (See the ATTACH sketch at the end of this page.)
- Rules. We haven't built a rule engine yet. This is important for some kinds of graph traversal. (One possible compilation target, a recursive CTE, is sketched at the end of this page.)
- User-supplied extension functions are an unknown. There are two layers: those that can be expressed as combinations of SQL functions, and those that can't (the create_function sketch at the end of this page illustrates the latter). I'd rather avoid textual "macros"…
- Improvements to disk storage representation. There are small things we can do to save space on index flags, then larger things: automatically splitting the datoms table by entity or attribute or type; separating the transaction log into a separate attached database to reduce database fragmentation; etc. Measurement is important here. (The nice thing about having an abstraction over storage is that we get to make these kinds of changes!)
- Read concurrency. We already know that we need a second connection to get read/write isolation — do we benefit from a connection pool when we're only dispatching reads, or is it just added complexity? (A reader-pool sketch appears at the end of this page.)
- Excision. This should be relatively straightforward with our SQL representation, but how that plays into synchronization is less clear: there must be coordination primitives (e.g., snapshot barriers) to make this work.
- Querying history. We know in principle how to make this work: anything from directly traversing the transaction table (for occasional use), through to materializing a new temporary datoms table from the log (for a bigger query). The latter is sketched at the end of this page.
- Consumers. All of this is pointless without real data and real users, so some effort should be spent on migrations — which will bring with them a need for tuning and debugging tools, backup, WebExtension APIs, access control logic, developer documentation, and so on.
- There seems to be some appetite for heterogeneous 'dispatched' storage — that is, explain in some way the shape of your data, and have the storage decide whether to use a document store, a key-value store, or whatever. Relatedly, one might instead (or as well) describe the kinds of access one needs — write-back with a synchronous write API and a certain flush interval, perhaps — and expect storage to take care of the rest.
Additionally, when the UAS model is introduced, the realities of asynchronous writes (always present, even in synchronous APIs like Firefox's prefs) become unavoidable. It's tempting to layer locally async or locally sync APIs on top of a write-through/back/etc. cache. It would be interesting to investigate whether this can be generalized. (A toy write-back sketch closes this page.)
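
What follows are a few hedged sketches of the items above. They're illustrations only: they assume a deliberately simplified layout, transactions(e, a, v, tx, added) and datoms(e, a, v, tx), which is not Mentat's actual schema, and they use Python's sqlite3 module purely because it's a convenient way to poke at SQLite.

First, the replication bullet: replaying new log entries onto a replica in transaction order.

```python
import sqlite3

# Hypothetical, simplified log: transactions(e, a, v, tx, added).
# Mentat's real tables carry more columns (value type tags, index flags, ...).
def replicate_since(source: sqlite3.Connection,
                    replica: sqlite3.Connection,
                    last_tx: int) -> int:
    """Copy log entries newer than last_tx to the replica, applying
    assertions (added = 1) and retractions (added = 0) to its datoms."""
    rows = source.execute(
        "SELECT e, a, v, tx, added FROM transactions WHERE tx > ? ORDER BY tx",
        (last_tx,)).fetchall()
    with replica:  # one replica transaction per batch
        for e, a, v, tx, added in rows:
            replica.execute(
                "INSERT INTO transactions (e, a, v, tx, added) VALUES (?, ?, ?, ?, ?)",
                (e, a, v, tx, added))
            if added:
                replica.execute("INSERT INTO datoms (e, a, v, tx) VALUES (?, ?, ?, ?)",
                                (e, a, v, tx))
            else:
                replica.execute("DELETE FROM datoms WHERE e = ? AND a = ? AND v = ?",
                                (e, a, v))
            last_tx = max(last_tx, tx)
    return last_tx
```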
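
The encrypted-storage bullet, in its two-database form. The KEY clause is SQLCipher syntax and needs a SQLCipher-enabled driver; with a stock SQLite build you can drop it and still experiment with the shape of the thing. SQLite documents commits that span attached databases as atomic (except for :memory: main databases and WAL mode), which is exactly the property we'd need to verify for mixed transactions.

```python
import sqlite3

# Plaintext datoms in the main database, sensitive datoms in an attached one.
conn = sqlite3.connect("plaintext.db")
conn.execute("ATTACH DATABASE 'secret.db' AS secret")  # SQLCipher: ... KEY 'passphrase'
conn.execute("CREATE TABLE IF NOT EXISTS main.datoms (e, a, v, tx)")
conn.execute("CREATE TABLE IF NOT EXISTS secret.datoms (e, a, v, tx)")

# A mixed write: one transaction carrying both plaintext and encrypted datoms.
with conn:
    conn.execute("INSERT INTO main.datoms VALUES (1, ':person/name', 'Alice', 100)")
    conn.execute("INSERT INTO secret.datoms VALUES (1, ':person/ssn', '000-00-0000', 100)")
```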
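
The rules bullet. This isn't an existing Mentat feature; it's one guess at a compilation target. A transitive rule over a hypothetical :node/link attribute (written as a string here for readability; in practice attributes are interned ids) could be evaluated as a recursive CTE against the datoms table.

```python
import sqlite3

# A transitive "reachable" rule over a hypothetical :node/link attribute,
# evaluated as a recursive CTE. UNION (not UNION ALL) deduplicates, so the
# traversal terminates even on cyclic graphs.
REACHABLE = """
WITH RECURSIVE reachable(e) AS (
    SELECT v FROM datoms WHERE e = :start AND a = ':node/link'
  UNION
    SELECT d.v FROM datoms d JOIN reachable r ON d.e = r.e
    WHERE d.a = ':node/link'
)
SELECT e FROM reachable
"""

def reachable_from(conn: sqlite3.Connection, start) -> list:
    return [row[0] for row in conn.execute(REACHABLE, {"start": start})]
```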
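
The extension-functions bullet. The second layer (things that can't be expressed as combinations of SQL functions) maps onto SQLite's application-defined functions; Python's sqlite3 exposes those via create_function, which is enough to get a feel for the idea. The function body here is a toy placeholder.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

def name_distance(a: str, b: str) -> int:
    # Toy stand-in for a real fuzzy-match or domain-specific function.
    return abs(len(a) - len(b))

# Register the Python callable as a SQL function usable from any query.
conn.create_function("name_distance", 2, name_distance)
print(conn.execute("SELECT name_distance('mentat', 'datomish')").fetchone())
```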
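
The read-concurrency bullet. A minimal shape for the "pool of read-only connections" experiment: one writer, WAL mode so readers aren't blocked by it, and a handful of read-only connections handed out from a queue. Whether this ever beats a single read connection is the open question; this only shows what we'd be measuring.

```python
import queue
import sqlite3

DB = "mentat.db"

# One writer; WAL lets readers proceed while the writer holds a transaction.
writer = sqlite3.connect(DB)
writer.execute("PRAGMA journal_mode=WAL")
writer.execute("CREATE TABLE IF NOT EXISTS datoms (e, a, v, tx)")

# A small pool of read-only connections, handed out per read.
readers: "queue.Queue[sqlite3.Connection]" = queue.Queue()
for _ in range(4):
    readers.put(sqlite3.connect(f"file:{DB}?mode=ro", uri=True,
                                check_same_thread=False))

def read(sql: str, params=()):
    conn = readers.get()
    try:
        return conn.execute(sql, params).fetchall()
    finally:
        readers.put(conn)
```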
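
The querying-history bullet, at the "materialize a temporary datoms table from the log" end of the spectrum, again against the simplified schema: a datom is live as of tx if it was asserted at or before tx and not retracted between its assertion and tx.

```python
import sqlite3

def datoms_as_of(conn: sqlite3.Connection, tx: int) -> None:
    """Build temp.datoms_as_of: the datoms that were live as of transaction tx."""
    conn.execute("DROP TABLE IF EXISTS temp.datoms_as_of")
    conn.execute("CREATE TEMPORARY TABLE datoms_as_of (e, a, v, tx)")
    conn.execute("""
        INSERT INTO datoms_as_of (e, a, v, tx)
        SELECT t.e, t.a, t.v, t.tx
        FROM transactions t
        WHERE t.added = 1 AND t.tx <= :tx
          AND NOT EXISTS (
              SELECT 1 FROM transactions r
              WHERE r.added = 0
                AND r.e = t.e AND r.a = t.a AND r.v = t.v
                AND r.tx BETWEEN t.tx AND :tx)
        """, {"tx": tx})
```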
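
Finally, the asynchronous-writes paragraph. A toy of the "locally synchronous API over an asynchronous write path" idea: put returns immediately, and a background thread batches dirty entries and hands them to a flush callable, which stands in for whatever actually transacts to storage. None of these names correspond to real Mentat (or UAS) APIs.

```python
import queue
import threading

class WriteBackCache:
    """Synchronous-looking put/get over an asynchronous, batched flush."""

    def __init__(self, flush, idle_interval: float = 0.5):
        self._cache = {}
        self._dirty = queue.Queue()
        self._flush = flush              # e.g. a function that transacts a batch
        self._idle = idle_interval
        threading.Thread(target=self._writer, daemon=True).start()

    def put(self, key, value):
        self._cache[key] = value         # caller sees the write immediately
        self._dirty.put((key, value))    # durable write happens later

    def get(self, key, default=None):
        return self._cache.get(key, default)

    def _writer(self):
        while True:
            batch = [self._dirty.get()]  # block until there's work
            try:
                while True:              # keep batching until we go idle
                    batch.append(self._dirty.get(timeout=self._idle))
            except queue.Empty:
                pass
            self._flush(batch)           # one storage write per batch
```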