Write up thoughts on compacting history.
parent 8ac9f3891b
commit b8a7fcf398
1 changed file with 18 additions and 0 deletions
Thoughts:-compacting-rolling-up-history.md
Preserving all history won't always be practical. Consumers with significant write volumes, where some of those writes are of cardinality-one properties, or are retractions, will grow their space consumption _more than linearly_ in the number of extant datoms.
(The worst-case consumer is one that adds and retracts the same datom over and over, adding two rows to the transaction log each time, never growing the datoms table.)
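This worst case can be sketched concretely. The snippet below is a minimal model, assuming a hypothetical log-row shape of `(e, a, v, tx, added)`; it repeatedly asserts and retracts one datom and shows the log growing by two rows per cycle while the datoms table stays flat.

```python
def apply_tx(log, datoms, rows):
    """Append rows to the transaction log and update the current datoms set."""
    for (e, a, v, tx, added) in rows:
        log.append((e, a, v, tx, added))
        if added:
            datoms.add((e, a, v))
        else:
            datoms.discard((e, a, v))

log, datoms = [], set()
# Odd transactions assert the datom, even transactions retract it.
for tx in range(1, 101):
    apply_tx(log, datoms, [(1, ":page/title", "Foo", tx, tx % 2 == 1)])

print(len(log))     # 100: two log rows per add/retract cycle
print(len(datoms))  # 0: the datoms table never grows
```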
Datomic provides one mechanism to address this: `noHistory`. This is fine for some attributes, but some workloads will instead want to keep as much history as is practical within a given space limit or timeframe. Browser history is a great example: Firefox throws away the oldest browsing history when the Places database gets too big, with limits computed at runtime based on the capacity of the device.
We can imagine at least two ways to do this:
- For a given compaction threshold transaction T, find all matching add/retract pairs prior to T, and delete both. (Cardinality-one updates are effected by retracting the old value and adding the new one, so this works.)
Queries of states prior to T will see missing values for any retracted datoms, but preserved original timestamps for non-retracted datoms.
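A sketch of this first strategy, under the same assumed `(e, a, v, tx, added)` row shape with the log sorted by transaction ID (`compact_pairs` and the example attribute names are hypothetical):

```python
def compact_pairs(log, t):
    """Delete matched (add, retract) pairs for the same (e, a, v)
    where both halves occurred before threshold transaction t."""
    keep = [True] * len(log)
    open_adds = {}  # (e, a, v) -> index of an as-yet-unmatched add before t
    for i, (e, a, v, tx, added) in enumerate(log):
        if tx >= t:
            continue
        key = (e, a, v)
        if added:
            open_adds[key] = i
        elif key in open_adds:
            # Matched pair: drop both the add and the retract.
            keep[open_adds.pop(key)] = False
            keep[i] = False
    return [row for i, row in enumerate(log) if keep[i]]

# A cardinality-one update: retract the old value, add the new one.
log = [
    (1, ":page/title", "Foo", 1, True),
    (1, ":page/title", "Foo", 2, False),
    (1, ":page/title", "Bar", 2, True),
]
print(compact_pairs(log, 3))  # [(1, ':page/title', 'Bar', 2, True)]
```

Note how the unmatched add of the current value survives, preserving its original timestamp, while the paired add/retract of the superseded value disappears.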
- For a given snapshot transaction S, truncate all history prior to S, find all extant datoms with a transaction ID less than S, and re-add those datoms to the transaction log as having been added in S. (Equivalently, collapse all old datoms 'up' into S.)
Querying of states prior to S is impossible: that history has been flattened completely. The truncated transaction log could be stored in cold storage if desired.
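The snapshot strategy can be sketched similarly, again assuming `(e, a, v, tx, added)` rows sorted by transaction ID (`roll_up` is a hypothetical name): replay the log up to S to find the extant datoms, re-add them as of S, and return the truncated prefix separately for optional cold storage.

```python
def roll_up(log, s):
    """Collapse all history prior to snapshot transaction s: extant datoms
    with tx < s are re-added as having been asserted in s. Returns the new
    log and the truncated rows (candidates for cold storage)."""
    datoms = set()
    truncated, kept = [], []
    for (e, a, v, tx, added) in log:
        if tx < s:
            truncated.append((e, a, v, tx, added))
            if added:
                datoms.add((e, a, v))
            else:
                datoms.discard((e, a, v))
        else:
            kept.append((e, a, v, tx, added))
    snapshot = [(e, a, v, s, True) for (e, a, v) in sorted(datoms)]
    return snapshot + kept, truncated

log = [
    (1, ":page/title", "Foo", 1, True),
    (1, ":page/title", "Foo", 2, False),
    (1, ":page/title", "Bar", 2, True),
    (2, ":page/url", "https://example.com", 4, True),
]
new_log, cold = roll_up(log, 3)
print(new_log)  # [(1, ':page/title', 'Bar', 3, True),
                #  (2, ':page/url', 'https://example.com', 4, True)]
```

Note the flattening: the surviving title datom is now recorded as added in S (transaction 3), so its original timestamp is lost, which is exactly the trade-off described above.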
Compaction is a little more complex for consumers than `noHistory`: consumers know that `noHistory` attributes are hard to query at points in time, but they won't be expecting non-`noHistory` data to have existed at a point in time before compaction and then not exist at that same point in time afterwards. Careful documentation (and coordination between consumers!) will be needed.
It might make sense to conditionally compact based on transaction metadata, attribute, part, or schema fragment.
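One way this could look is a predicate hook consulted per log row. Everything here is an assumption for illustration: the function name, the example attribute, and the idea of an `ephemeral` flag in transaction metadata are all hypothetical.

```python
def should_compact(row, tx_meta, noisy_attrs=frozenset({":page/visit-count"})):
    """Decide whether a log row is eligible for compaction, based on its
    attribute or on metadata recorded for its transaction."""
    e, a, v, tx, added = row
    return a in noisy_attrs or tx_meta.get(tx, {}).get("ephemeral", False)

# High-churn attribute: always eligible.
print(should_compact((1, ":page/visit-count", 7, 5, True), {}))         # True
# Ordinary attribute, but the transaction was marked ephemeral.
print(should_compact((1, ":page/title", "Foo", 5, True),
                     {5: {"ephemeral": True}}))                         # True
```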