Add discussion of storage difficulties. r=nalexander (#344)

* Add discussion of storage difficulties.
* Replace mention of MVP with discussion of initial requirements.
Accept :db/id in nested maps. (Fixes #178.) (#206) r=rnewman
2017-02-27 15:50:17 -08:00 · 2017-02-17 11:39:51 -08:00 · 2017-01-15 11:27:06 -08:00 · 2016-12-16 10:56:34 -08:00
5 changed files with 86 additions and 8 deletions
--- a/README.md
+++ b/README.md
@@ -6,15 +6,67 @@ Datomish compiles into a single JavaScript file, and is usable both in Node (on

 There's an example Firefox restartless add-on in the [`addon`](https://github.com/mozilla/datomish/tree/master/addon) directory; build instructions are below.

+We are in the early stages of reimplementing Datomish in [Rust](https://www.rust-lang.org/). You can follow that work in [its long-lived branch](https://github.com/mozilla/datomish/tree/rust) and in issue #133.

 ## Motivation

 Datomish is intended to be a flexible relational (not key-value, not document-oriented) store that doesn't leak its storage schema to users, and doesn't make it hard to grow its domain schema and run arbitrary queries.

-Our short-term goal is to build a system that, as the basis for a User Agent Service, can support multiple [Tofino](https://github.com/mozilla/tofino) UX experiments without having a storage engineer do significant data migration, schema work, or revving of special-purpose endpoints.
+Our short-term goal for Project Mentat is to build a system that, as the basis for a User Agent Service, can support multiple [Tofino](https://github.com/mozilla/tofino) UX experiments without having a storage engineer do significant data migration, schema work, or revving of special-purpose endpoints.

 By abstracting away the storage schema, and by exposing change listeners outside the database (not via triggers), we hope to allow both the data store itself and embedding applications to use better architectures, meeting performance goals in a way that allows future evolution.

+## Data storage is hard
+
+We've observed that data storage is a particular area of difficulty for software development teams:
+
+- It's hard to define storage schemas well. A developer must:
+  - Model their domain entities and relationships.
+  - Encode that model _efficiently_ and _correctly_ using the features available in the database.
+  - Plan for future extensions and performance tuning.
+  
+  In a SQL database, the same schema definition defines everything from high-level domain relationships through to numeric field sizes in the same smear of keywords. It's difficult for someone unfamiliar with the domain to determine from such a schema what's a domain fact and what's an implementation concession — are all part numbers always 16 characters long, or are we trying to save space? — or, indeed, whether a missing constraint is deliberate or a bug.
+  
+  The developer must think about foreign key constraints, compound uniqueness, and nullability. They must consider indexing, synchronizing, and stable identifiers. Most developers simply don't do enough work in SQL to get all of these things right. Storage thus becomes the specialty of a few individuals.
+
+   Which one of these is correct?
+   
+   ```edn
+   {:db/id          :person/email
+     :db/valueType   :db.type/string
+     :db/cardinality :db.cardinality/many     ; People can have multiple email addresses.
+     :db/unique      :db.unique/identity      ; For our purposes, each email identifies one person.
+     :db/index       true}                    ; We want fast lookups by email.         
+   {:db/id          :person/friend
+     :db/valueType   :db.type/ref
+     :db/cardinality :db.cardinality/many}    ; People can have many friends.
+   ```
+   ```sql
+   CREATE TABLE people (
+     id INTEGER PRIMARY KEY,  -- Bug: because of the primary key, each person can have no more than 1 email.
+     email VARCHAR(64),       -- Bug?: no NOT NULL, so a person can have no email.
+                              -- Bug: nobody will ever have a long email address, right?
+   );
+   CREATE TABLE friendships (
+     FOREIGN KEY person REFERENCES people(id),  -- Bug?: no indexing, so lookups by friend or person will be slow.
+     FOREIGN KEY friend REFERENCES people(id),  -- Bug: no compound uniqueness constraint, so we can have dupe friendships.
+   );
+   ```
+   
+   They both have limitations — the Mentat schema allows only for an open world (it's possible to declare friendships with people whose email isn't known), and requires validation code to enforce email string correctness — but we think that even such a tiny SQL example is harder to understand and obscures important domain decisions.
+
+- Queries are intimately tied to structural storage choices. That not only hides the declarative domain-level meaning of the query — it's hard to tell what a query is trying to do when it's a 100-line mess of subqueries and `LEFT OUTER JOIN`s — but it also means a simple structural schema change requires auditing _every query_ for correctness.
+
+- Developers often capture less event-shaped data than they perhaps should, simply because their initial requirements don't warrant it. It's quite common to later want to [know when a fact was recorded](https://bugzilla.mozilla.org/show_bug.cgi?id=1341939), or _in which order_ two facts were recorded (particularly for migrations), or on which device an event took place… or even that a fact was _ever_ recorded and then deleted.
+
+- Common queries are hard. Storing values only once, upserts, complicated joins, and group-wise maxima are all difficult for non-expert developers to get right.
+
+- It's hard to evolve storage schemas. Writing a robust SQL schema migration is hard, particularly if a bad migration has ever escaped into the wild! Teams learn to fear and avoid schema changes, and eventually they ship a table called `metadata`, with three `TEXT` columns, so they never have to write a migration again. That decision pushes storage complexity into application code. (Or they start storing unversioned JSON blobs in the database…)
+
+- It's hard to share storage with another component, let alone share _data_ with another component. Conway's Law applies: your software system will often grow to have one database per team.
+
+- It's hard to build efficient storage and querying architectures. Materialized views require knowledge of triggers, or the implementation of bottleneck APIs. _Ad hoc_ caches are often wrong, are almost never formally designed (do you want a write-back, write-through, or write-around cache? Do you know the difference?), and often aren't reusable. The average developer, faced with a SQL database, has little choice but to build a simple table that tries to meet every need.
+

 ## Comparison to DataScript

@@ -46,7 +98,7 @@ Datomish is designed for embedding, initially in an Electron app ([Tofino](https

 ## Comparison to SQLite

-SQLite is a traditional SQL database in most respects: schemas conflate semantic, structural, and datatype concerns; the main interface with the database is human-first textual queries; sparse and graph-structured data are 'unnatural', if not always inefficient; experimenting with and evolving data models are error-prone and complicated activities; and so on.
+SQLite is a traditional SQL database in most respects: schemas conflate semantic, structural, and datatype concerns, as described above; the main interface with the database is human-first textual queries; sparse and graph-structured data are 'unnatural', if not always inefficient; experimenting with and evolving data models are error-prone and complicated activities; and so on.

 Datomish aims to offer many of the advantages of SQLite — single-file use, embeddability, and good performance — while building a more relaxed and expressive data model on top.

--- a/src/common/datomish/query.cljc
+++ b/src/common/datomish/query.cljc
@@ -110,7 +110,7 @@
  (->
    context
    context->sql-clause
-    (sql/format args :quoting sql-quoting-style)))
+    (sql/format :params args :quoting sql-quoting-style)))

 (defn- validate-with [with]
  (when-not (or (nil? with)
@@ -215,7 +215,7 @@
  [context find args]
  (->
    (find->sql-clause context find)
-    (sql/format args :quoting sql-quoting-style)))
+    (sql/format :params args :quoting sql-quoting-style)))

 (defn parse
  "Parse a Datalog query array into a structured `find` expression."
--- a/src/common/datomish/transact/explode.cljc
+++ b/src/common/datomish/transact/explode.cljc
@@ -57,10 +57,11 @@
           (not (db/id-literal? v)))
      ;; Another entity is given as a nested map.
      (if (ds/ref? (db/schema db) straight-a*)
-        (let [other (assoc v (reverse-ref a) eid
-                           ;; TODO: make the new ID have the same part as the original eid.
-                           ;; TODO: make the new ID not show up in the tempids map.  (Does Datomic exposed the new ID this way?)
-                           :db/id (db/id-literal :db.part/user))]
+        (let [other (-> v
+                        (assoc (reverse-ref a) eid)
+                        ;; TODO: make the new ID have the same part as the original eid.
+                        ;; TODO: make the new ID not show up in the tempids map.  (Does Datomic expose the new ID this way?)
+                        (update :db/id #(or %1 (db/id-literal :db.part/user))))]
          (explode-entity db other))
        (raise "Bad attribute " a ": nested map " v " given but attribute name requires {:db/valueType :db.type/ref} in schema"
               {:error :transact/entity-map-type-ref
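
Review note: the change above swaps an unconditional `assoc` of `:db/id` for an `update` that keeps any value the nested map already carries. A minimal sketch of the difference follows. The helpers `overwrite-id` and `ensure-id` are hypothetical names introduced only for illustration, and the plain keyword `:fresh-tempid` stands in for `(db/id-literal :db.part/user)`.

```clojure
;; Hypothetical helpers for illustration; neither exists in Datomish.

(defn overwrite-id
  "Old behaviour: always replaces :db/id, discarding a caller-supplied one."
  [entity fresh-id]
  (assoc entity :db/id fresh-id))

(defn ensure-id
  "New behaviour: keeps an existing :db/id, falling back to fresh-id only
  when the key is absent (or its value is nil/false, since `or` is used)."
  [entity fresh-id]
  (update entity :db/id #(or % fresh-id)))

;; A nested map without an ID gets the fresh one either way.
(ensure-id {:name "Ivan"} :fresh-tempid)
;; A nested map with an explicit ID keeps it only under ensure-id.
(ensure-id {:db/id 201 :name "Ivan"} :fresh-tempid)
(overwrite-id {:db/id 201 :name "Ivan"} :fresh-tempid)
```

Because `update` passes `nil` to its function when the key is missing, the `or` form covers both the absent-key and present-key cases in a single expression.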
--- a/test/datomish/db_test.cljc
+++ b/test/datomish/db_test.cljc
@@ -506,6 +506,19 @@
            ExceptionInfo #"\{:db/valueType :db.type/ref\}"
            (<? (d/<transact! conn [{:db/id 101 :aka {:name "Petr"}}])))))))

+(deftest-db test-explode-maps-with-db-id conn
+  (let [{tx0 :tx} (<? (d/<transact! conn test-schema))]
+    (testing "recursively nested maps with specified :db/id are accepted"
+      (<? (d/<transact! conn [{:db/id 101 :name "Oleg"}]))
+
+      (<? (d/<transact! conn [{:db/id 101 :friends {:db/id 201 :name "Ivan" :friends {:db/id 301 :name "Petr"}}}]))
+      (is (= (<? (<datoms-after (d/db conn) tx0))
+             #{[101 :name "Oleg"]
+               [101 :friends 201]
+               [201 :name "Ivan"]
+               [201 :friends 301]
+               [301 :name "Petr"]})))))
+
 (deftest-db test-explode-reverse-refs conn
  (let [{tx0 :tx} (<? (d/<transact! conn test-schema))]
    (testing "reverse refs are accepted"
--- a/test/datomish/test/query.cljc
+++ b/test/datomish/test/query.cljc
@@ -925,3 +925,15 @@
                                               {:select ['x]
                                                :from [:def]})}
                                     :foo])})))))
+
+(deftest-db test-sql-quoting conn
+  (testing "ansi sql quoting applied when there are no inputs"
+    (is (= ["SELECT DISTINCT \"datoms0\".\"e\" AS \"order\" FROM \"datoms\" \"datoms0\" WHERE (\"datoms0\".\"a\" = \"order\") "]
+           (datomish.query/find->sql-string
+             (conn->context conn)
+             (datomish.query/parse
+               '[:find ?order
+                 :in $
+                 :where
+                 [?order :item/order]])
+             nil)))))
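
Review note: the test above passes `nil` as the query inputs, which is the case the `:params` change in `query.cljc` addresses. One plausible reading of the fix is that a variadic function which guesses whether its first trailing argument is a params collection or the start of keyword options can lose the options when that argument is `nil`. The sketch below models that ambiguity with a hypothetical `parse-format-args` function. It is a stand-in written for illustration under that assumption, not honeysql's actual implementation.

```clojure
;; Hypothetical stand-in for a variadic formatter's argument handling.
;; This is NOT honeysql's real code; it only models the ambiguity.
(defn parse-format-args
  "Splits trailing arguments into positional params and keyword options."
  [& params-or-opts]
  (let [head (first params-or-opts)]
    (cond
      ;; A leading collection is taken as params; the rest are options.
      (coll? head)    {:params head
                       :opts   (apply hash-map (rest params-or-opts))}
      ;; A leading keyword means everything is key/value options.
      (keyword? head) (let [opts (apply hash-map params-or-opts)]
                        {:params (:params opts)
                         :opts   (dissoc opts :params)})
      ;; Anything else, including nil, matches neither shape,
      ;; so the options that follow are silently dropped.
      :else           {:params nil :opts {}})))

;; Positional params work while they are a collection.
(parse-format-args [1 2] :quoting :ansi)
;; Positional nil loses the :quoting option.
(parse-format-args nil :quoting :ansi)
;; Naming the params keeps the options even when params are nil.
(parse-format-args :params nil :quoting :ansi)
```

Passing the inputs under an explicit `:params` key removes the guesswork, which is consistent with the quoted identifiers the test expects when there are no inputs.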
Author	SHA1	Message	Date
Richard Newman	9a9dfb502a	Add discussion of storage difficulties. r=nalexander (#344) * Add discussion of storage difficulties. * Replace mention of MVP with discussion of initial requirements.	2017-02-27 15:50:17 -08:00
Nick Alexander	74861447e4	Accept :db/id in nested maps. (Fixes #178.) (#206) r=rnewman	2017-02-17 11:39:51 -08:00
Paul	84a80f40f5	Fix SQL quoting when calling honeysql/format (#175). r=rnewman	2017-01-15 11:27:06 -08:00
Richard Newman	a17142673e	Add a note about reimplementing Datomish in Rust.	2016-12-16 10:56:34 -08:00