greg/mentat

Fork 0

An embedded relational datalog storage engine inspired by Datomic and DataScript. This is a fork of [Mozilla's unmaintained project by the same name](https://github.com/mozilla/mentat).

Find a file

Nick Alexander 15b4195a6e Schema alteration. Fixes #294 and #295 . (#370 ) r=rnewman * Pre: Don't retract :db/ident in test. Datomic (and eventually Mentat) don't allow to retract :db/ident in this way, so this runs afoul of future work to support mutating metadata. * Pre: s/VALUETYPE/VALUE_TYPE/. This is consistent with the capitalization (which is "valueType") and the other identifier. * Pre: Remove some single quotes from error output. * Part 1: Make materialized views be uniform [e a v value_type_tag]. This looks ahead to a time when we could support arbitrary user-defined materialized views. For now, the "idents" materialized view is those datoms of the form [e :db/ident :namespaced/keyword] and the "schema" materialized view is those datoms of the form [e a v] where a is in a particular set of attributes that will become clear in the following commits. This change is not backwards compatible, so I'm removing the open current (really, v2) test. It'll be re-instated when we get to https://github.com/mozilla/mentat/issues/194. * Pre: Map TypedValue::Ref to TypedValue::Keyword in debug output. * Part 3: Separate `schema_to_mutate` from the `schema` used to interpret. This is just to keep track of the expected changes during bootstrapping. I want bootstrap metadata mutations to flow through the same code path as metadata mutations during regular transactions; by differentiating the schema used for interpretation from the schema that will be updated I expect to be able to apply bootstrap metadata mutations to an empty schema and have things like materialized views created (using the regular code paths). This commit has been re-ordered for conceptual clarity, but it won't compile because it references the metadata module. It's possible to make it compile -- the functionality is there in the schema module -- but it's not worth the rebasing effort until after review (and possibly not even then, since we'll squash down to a single commit to land). * Part 2: Maintain entids separately from idents. In order to support historical idents, we need to distinguish the "current" map from entid -> ident from the "complete historical" map ident -> entid. This is what Datomic does; in Datomic, an ident is never retracted (although it can be replaced). This approach is an important part of allowing multiple consumers to share a schema fragment as it migrates forward. This fixes a limitation of the Clojure implementation, which did not handle historical idents across knowledge base close and re-open. The "entids" materialized view is naturally a slice of the "datoms" table. The "idents" materialized view is a slice of the "transactions" table. I hope that representing in this way, and casting the problem in this light, might generalize to future materialized views. * Pre: Add DiffSet. * Part 4: Collect mutations to a `Schema`. I haven't taken your review comment about consuming AttributeBuilder during each fluent function. If you read my response and still want this, I'm happy to do it in review. * Part 5: Handle :db/ident and :db.{install,alter}/attribute. This "loops" the committed datoms out of the SQL store and back through the metadata (schema, but in future also partition map) processor. The metadata processor updates the schema and produces a report of what changed; that report is then used to update the SQL store. That update includes: - the materialized views ("entids", "idents", and "schema"); - if needed, a subset of the datoms themselves (as flags change). I've left a TODO for handling attribute retraction in the cases that it makes sense. I expect that to be straight-forward. * Review comment: Rename DiffSet to AddRetractAlterSet. Also adds a little more commentary and a simple test. * Review comment: Use ToIdent trait. * Review comment: partially revert "Part 2: Maintain entids separately from idents." This reverts commit 23a91df9c35e14398f2ddbd1ba25315821e67401. Following our discussion, this removes the "entids" materialized view. The next commit will remove historical idents from the "idents" materialized view. * Post: Use custom Either rather than std::result::Result. This is not necessary, but it was suggested that we might be paying an overhead creating Err instances while using error_chain. That seems not to be the case, but this change shows that we don't actually use any of the Result helper methods, so there's no reason to overload Result. This change might avoid some future confusion, so I'm going to land it anyway. Signed-off-by: Nick Alexander <nalexander@mozilla.com> * Review comment: Don't preserve historical idents. * Review comment: More prepared statements when updating materialized views. * Post: Test altering :db/cardinality and :db/unique. These tests fail due to a Datomic limitation, namely that the marker flag :db.alter/attribute can only be asserted once for an attribute! That is, [:db.part/db :db.alter/attribute :attribute] will only be transacted at most once. Since older versions of Datomic required the :db.alter/attribute flag, I can only imagine they either never wrote :db.alter/attribute to the store, or they handled it specially. I'll need to remove the marker flag system from Mentat in order to address this fundamental limitation. * Post: Remove some more single quotes from error output. * Post: Add assert_transact! macro to unwrap safely. I was finding it very difficult to track unwrapping errors while making changes, due to an underlying Mac OS X symbolication issue that makes running tests with RUST_BACKTRACE=1 so slow that they all time out. * Post: Don't expect or recognize :db.{install,alter}/attribute. I had this all working... except we will never see a repeated `[:db.part/db :db.alter/attribute :attribute]` assertion in the store! That means my approach would let you alter an attribute at most one time. It's not worth hacking around this; it's better to just stop expecting (and recognizing) the marker flags. (We have all the data to distinguish the various cases that we need without the marker flags.) This brings Mentat in line with the thrust of newer Datomic versions, but isn't compatible with Datomic, because (if I understand correctly) Datomic automatically adds :db.{install,alter}/attribute assertions to transactions. I haven't purged the corresponding :db/ident and schema fragments just yet: - we might want them back - we might want them in order to upgrade v1 and v2 databases to the new on-disk layout we're fleshing out (v3?). * Post: Don't make :db/unique :db.unique/* imply :db/index true. This patch avoids a potential bug with the "schema" materialized view. If :db/unique :db.unique/value implies :db/index true, then what happens when you _retract_ :db.unique/value? I think Datomic defines this in some way, but I really want the "schema" materialized view to be a slice of "datoms" and not have these sort of ambiguities and persistent effects. Therefore, to ensure that we don't retract a schema characteristic and accidentally change more than we intended to, this patch stops having any schema characteristic imply any other schema characteristic(s). To achieve that, I added an Option<Unique::{Value,Identity}> type to Attribute; this helps with this patch, and also looks ahead to when we allow to retract :db/unique attributes. * Post: Allow to retract :db/ident. * Post: Include more details about invalid schema changes. The tests use strings, so they hide the chained errors which do in fact provide more detail. * Review comment: Fix outdated comment. * Review comment: s/_SET/_SQL_LIST/. * Review comment: Use a sub-select for checking cardinality. This might be faster in practice. * Review comment: Put `attribute::Unique` into its own namespace.		2017-03-20 13:18:59 -07:00
.vscode	Add a VSCode test configuration for cargo test --all.	2017-03-13 16:56:23 +00:00
build	Ensure minimum rustc version in a build script. r=nalexander (#326 )	2017-02-17 12:04:45 -08:00
core	Schema alteration. Fixes #294 and #295 . (#370 ) r=rnewman	2017-03-20 13:18:59 -07:00
db	Schema alteration. Fixes #294 and #295 . (#370 ) r=rnewman	2017-03-20 13:18:59 -07:00
edn	Support a limited set of '.'-prefixed non-keyword symbols. (#352 ) r=nalexander	2017-03-06 15:01:19 -08:00
fixtures	Add test databases.	2017-01-10 12:09:00 -08:00
parser-utils	Extract partial storage abstraction; use error-chain throughout. Fixes #328 . r=rnewman (#341 )	2017-02-24 15:33:48 -08:00
query	Implement projection and querying. (#353 ) r=nalexander	2017-03-06 14:40:10 -08:00
query-algebrizer	Schema alteration. Fixes #294 and #295 . (#370 ) r=rnewman	2017-03-20 13:18:59 -07:00
query-parser	Rebased conversion of mentat_query_parser to use error-chain. r=nalexander	2017-02-27 16:16:54 -08:00
query-projector	Use sqlite3_limit instead of hard-coded SQLITE_MAX_VARIABLE_NUMBER (#371 ) (#288 ) r=rnewman	2017-03-13 09:39:19 -07:00
query-sql	(#362 ) Part 4: handle unknown attributes by expanding type codes. r=nalexander	2017-03-08 17:44:27 -08:00
query-translator	(#362 ) Part 4: handle unknown attributes by expanding type codes. r=nalexander	2017-03-08 17:44:27 -08:00
sql	Convert mentat_sql to use error-chain. r=nalexander	2017-02-27 16:16:49 -08:00
src	Schema alteration. Fixes #294 and #295 . (#370 ) r=rnewman	2017-03-20 13:18:59 -07:00
tests	Implement basic query limits. (#361 ) r=nalexander	2017-03-08 17:41:42 -08:00
tx	Convert EDN transaction tests to Rust code. Fixes #271 . (#364 ) r=rnewman	2017-03-20 11:29:17 -07:00
tx-parser	Extract partial storage abstraction; use error-chain throughout. Fixes #328 . r=rnewman (#341 )	2017-02-24 15:33:48 -08:00
.gitignore	Rudimentary printing of EDN values. (#209 ) r=jsantell	2017-01-28 14:18:17 -08:00
.travis.yml	Simplify .travis.yml to use `cargo test --all`.	2017-02-20 11:04:18 -08:00
Cargo.toml	Use sqlite3_limit instead of hard-coded SQLITE_MAX_VARIABLE_NUMBER (#371 ) (#288 ) r=rnewman	2017-03-13 09:39:19 -07:00
CONTRIBUTING.md	More Rust notes.	2017-02-21 10:45:21 -08:00
LICENSE	Change license to Apache. Fixes #74 .	2016-11-22 11:40:37 -08:00
README.md	Tweak testing commands.	2017-03-09 09:00:46 -08:00

README.md

Project Mentat

Project Mentat is a persistent, embedded knowledge base. It draws heavily on DataScript and Datomic.

The first version of Project Mentat, named Datomish, was written in ClojureScript, targeting both Node (on top of promise_sqlite) and Firefox (on top of Sqlite.jsm). It also works in pure Clojure on the JVM on top of jdbc-sqlite. The name was changed to avoid confusion with Datomic.

This branch is for rewriting Mentat in Rust, giving us a smaller compiled output, better performance, more type safety, better tooling, and easier deployment into Firefox and mobile platforms.

Motivation

Mentat is intended to be a flexible relational (not key-value, not document-oriented) store that doesn't leak its storage schema to users, and doesn't make it hard to grow its domain schema and run arbitrary queries.

Our short-term goal for Project Mentat is to build a system that, as the basis for a User Agent Service, can support multiple Tofino UX experiments without having a storage engineer do significant data migration, schema work, or revving of special-purpose endpoints.

By abstracting away the storage schema, and by exposing change listeners outside the database (not via triggers), we hope to allow both the data store itself and embedding applications to use better architectures, meeting performance goals in a way that allows future evolution.

Data storage is hard

We've observed that data storage is a particular area of difficulty for software development teams:

It's hard to define storage schemas well. A developer must:
- Model their domain entities and relationships.
- Encode that model efficiently and correctly using the features available in the database.
- Plan for future extensions and performance tuning.
In a SQL database, the same schema definition defines everything from high-level domain relationships through to numeric field sizes in the same smear of keywords. It's difficult for someone unfamiliar with the domain to determine from such a schema what's a domain fact and what's an implementation concession — are all part numbers always 16 characters long, or are we trying to save space? — or, indeed, whether a missing constraint is deliberate or a bug.

The developer must think about foreign key constraints, compound uniqueness, and nullability. They must consider indexing, synchronizing, and stable identifiers. Most developers simply don't do enough work in SQL to get all of these things right. Storage thus becomes the specialty of a few individuals.

Which one of these is correct?
```
{:db/id          :person/email
  :db/valueType   :db.type/string
  :db/cardinality :db.cardinality/many     ; People can have multiple email addresses.
  :db/unique      :db.unique/identity      ; For our purposes, each email identifies one person.
  :db/index       true}                    ; We want fast lookups by email.         
{:db/id          :person/friend
  :db/valueType   :db.type/ref
  :db/cardinality :db.cardinality/many}    ; People can have many friends.
```
```
CREATE TABLE people (
  id INTEGER PRIMARY KEY,  -- Bug: because of the primary key, each person can have no more than 1 email.
  email VARCHAR(64),       -- Bug?: no NOT NULL, so a person can have no email.
                           -- Bug: nobody will ever have a long email address, right?
);
CREATE TABLE friendships (
  FOREIGN KEY person REFERENCES people(id),  -- Bug?: no indexing, so lookups by friend or person will be slow.
  FOREIGN KEY friend REFERENCES people(id),  -- Bug: no compound uniqueness constraint, so we can have dupe friendships.
);
```
They both have limitations — the Mentat schema allows only for an open world (it's possible to declare friendships with people whose email isn't known), and requires validation code to enforce email string correctness — but we think that even such a tiny SQL example is harder to understand and obscures important domain decisions.
Queries are intimately tied to structural storage choices. That not only hides the declarative domain-level meaning of the query — it's hard to tell what a query is trying to do when it's a 100-line mess of subqueries and LEFT OUTER JOINs — but it also means a simple structural schema change requires auditing every query for correctness.
Developers often capture less event-shaped than they perhaps should, simply because their initial requirements don't warrant it. It's quite common to later want to know when a fact was recorded, or in which order two facts were recorded (particularly for migrations), or on which device an event took place… or even that a fact was ever recorded and then deleted.
Common queries are hard. Storing values only once, upserts, complicated joins, and group-wise maxima are all difficult for non-expert developers to get right.
It's hard to evolve storage schemas. Writing a robust SQL schema migration is hard, particularly if a bad migration has ever escaped into the wild! Teams learn to fear and avoid schema changes, and eventually they ship a table called metadata, with three TEXT columns, so they never have to write a migration again. That decision pushes storage complexity into application code. (Or they start storing unversioned JSON blobs in the database…)
It's hard to share storage with another component, let alone share data with another component. Conway's Law applies: your software system will often grow to have one database per team.
It's hard to build efficient storage and querying architectures. Materialized views require knowledge of triggers, or the implementation of bottleneck APIs. Ad hoc caches are often wrong, are almost never formally designed (do you want a write-back, write-through, or write-around cache? Do you know the difference?), and often aren't reusable. The average developer, faced with a SQL database, has little choice but to build a simple table that tries to meet every need.

Comparison to DataScript

DataScript asks the question: "What if creating a database would be as cheap as creating a Hashmap?"

Mentat is not interested in that. Instead, it's strongly interested in persistence and performance, with very little interest in immutable databases/databases as values or throwaway use.

One might say that Mentat's question is: "What if an SQLite database could store arbitrary relations, for arbitrary consumers, without them having to coordinate an up-front storage-level schema?"

(Note that domain-level schemas are very valuable.)

Another possible question would be: "What if we could bake some of the concepts of CQRS and event sourcing into a persistent relational store, such that the transaction log itself were of value to queries?"

Some thought has been given to how databases as values — long-term references to a snapshot of the store at an instant in time — could work in this model. It's not impossible; it simply has different performance characteristics.

Just like DataScript, Mentat speaks Datalog for querying and takes additions and retractions as input to a transaction. Unlike DataScript, Mentat's API is asynchronous.

Unlike DataScript, Mentat exposes free-text indexing, thanks to SQLite.

Comparison to Datomic

Datomic is a server-side, enterprise-grade data storage system. Datomic has a beautiful conceptual model. It's intended to be backed by a storage cluster, in which it keeps index chunks forever. Index chunks are replicated to peers, allowing it to run queries at the edges. Writes are serialized through a transactor.

Many of these design decisions are inapplicable to deployed desktop software; indeed, the use of multiple JVM processes makes Datomic's use in a small desktop app, or a mobile device, prohibitive.

Mentat is designed for embedding, initially in an Electron app (Tofino). It is less concerned with exposing consistent database states outside transaction boundaries, because that's less important here, and dropping some of these requirements allows us to leverage SQLite itself.

Comparison to SQLite

SQLite is a traditional SQL database in most respects: schemas conflate semantic, structural, and datatype concerns, as described above; the main interface with the database is human-first textual queries; sparse and graph-structured data are 'unnatural', if not always inefficient; experimenting with and evolving data models are error-prone and complicated activities; and so on.

Mentat aims to offer many of the advantages of SQLite — single-file use, embeddability, and good performance — while building a more relaxed and expressive data model on top.

Contributing

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.

See CONTRIBUTING.md for further notes.

This project is very new, so we'll probably revise these guidelines. Please comment on an issue before putting significant effort in if you'd like to contribute.

Building

Right now this code is located on a branch, so you first need to git checkout rust. To build and test the project, we are using Cargo.

To build all of the crates in the project use:

cargo build

To run tests use:

# Run tests for everything.
cargo test --all

# Run tests for just the query-parser folder (specify the crate, not the folder),
# printing debug output.
cargo test -p mentat_query_parser -- --nocapture

To start the server use:

cargo run serve

To pass in custom arguments to the cli through Cargo, you'll need to pass -- after the command to ensure they get passed properly. For example:

cargo run serve -- --help

For most cargo commands you can pass the -p argument to run the command just on that package. So, cargo build -p mentat_query_parser will build just the "query-parser" folder.

License

Project Mentat is currently licensed under the Apache License v2.0. See the LICENSE file for details.

SQLite dependencies

Mentat uses partial indices, which are available in SQLite 3.8.0 and higher.

It also uses FTS4, which is a compile time option.