Investigate methods of reducing the on-disk size of Mentat databases #17

Open
opened 2020-08-06 15:04:51 +00:00 by gburd · 5 comments
gburd commented 2020-08-06 15:04:51 +00:00 (Migrated from github.com)

This document highlights many of the issues and concerns: https://docs.google.com/document/d/14ywV4PlBAdOsJrxd7QhMcxbo7pw8cdc1kLAb7I4QhFY/edit?ts=5b7b7f87.

Some work to measure the size of the database when storing history is in https://github.com/mozilla/application-services/pull/191. For my places.sqlite (100k places, 150k visits), it gets around 200MB larger.

It's worth noting that 'disk usage' was one of the primary concerns reported by user research for Fenix, although it's not clear whether this size increase (relative to places) would make a dent compared with things like caches. A very informal poll of some friends of mine found that Fennec typically uses around 500MB of space (app + data); another 200MB isn't a trivial increase, but it doesn't substantially change where we are in terms of app size.

Some bugs which may help (suggested by @rnewman):

- https://github.com/mozilla/mentat/issues/29: Pack the various flag columns into one TINYINT -- appears to give a 1%-2% benefit, which is probably not substantial enough to justify the effort IMO.
- https://github.com/mozilla/mentat/issues/69: Implement something like places' url_hash automatically. If this could keep string values out of the aevt/eavt indices it could have a huge benefit.
- https://github.com/mozilla/mentat/issues/32: Interning keywords. Our test schema didn't use these and I'm not sure what the actual use case for them is over `:db.type/ref` to a `{ :db/ident :my/keyword }`, so this doesn't seem like a high priority.
- https://github.com/mozilla/mentat/issues/33: Store the data canonically in a SQL table instead of in datoms. This is interesting but seems like a lot of work.

I think something like [sqlite's zipvfs extension](https://www.sqlite.org/zipvfs/doc/trunk/www/index.wiki) would likely help (as the databases compress well), but I have not tried it. Implementing it ourselves is likely beyond the scope of this effort (I took a look at the effort required and it wasn't exactly trivial). Additionally, whatever we do would need to somehow integrate with sqlcipher (I also took a look at bolting compression into sqlcipher before the encryption, but the fact that this makes the block output a variable size seemed to make this problematic).

Other notes:

  1. Storing strings as fulltext and using the `compress`/`uncompress` options of FTS4 did not help, since the strings in each column are relatively small. Additionally, the performance overhead here was substantial even for a very fast compressor (LZ4). (A sketch of this setup follows the list.)
  2. Most string data seems to be duplicated ~4 times: in `datoms`, in `timelined_transactions`, and in the indices `idx_datoms_eavt` and `idx_datoms_aevt`.
  3. During RustConf, @rnewman suggested that ultimately mentat will likely not want to use sqlite, and instead want to read datoms chunks directly out of something like RKV. These chunks could be compressed more easily. This seems out of scope, as it would be a massive change to mentat, but is worth writing down.
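
For reference, a minimal sketch of what the FTS4 `compress`/`uncompress` experiment from note 1 might look like, assuming rusqlite (with the `functions` feature) and the `lz4_flex` crate; the table name is made up. As noted above, this didn't pay off for our small per-column strings.

```rust
use rusqlite::{functions::FunctionFlags, Connection, Result};

fn open_with_compressed_fts(path: &str) -> Result<Connection> {
    let db = Connection::open(path)?;

    // FTS4 invokes these scalar functions on every write/read of a column.
    db.create_scalar_function(
        "lz4_compress",
        1,
        FunctionFlags::SQLITE_UTF8 | FunctionFlags::SQLITE_DETERMINISTIC,
        |ctx| {
            let data: Vec<u8> = ctx.get(0)?;
            Ok(lz4_flex::compress_prepend_size(&data))
        },
    )?;
    db.create_scalar_function(
        "lz4_uncompress",
        1,
        FunctionFlags::SQLITE_UTF8 | FunctionFlags::SQLITE_DETERMINISTIC,
        |ctx| {
            let data: Vec<u8> = ctx.get(0)?;
            lz4_flex::decompress_size_prepended(&data)
                .map_err(|e| rusqlite::Error::UserFunctionError(Box::new(e)))
        },
    )?;

    // The compress/uncompress options must name functions registered on this connection.
    db.execute_batch(
        "CREATE VIRTUAL TABLE IF NOT EXISTS fulltext_values
           USING fts4(body, compress=lz4_compress, uncompress=lz4_uncompress);",
    )?;
    Ok(db)
}
```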

Additional concerns exist around the fact that this problem may be exacerbated by materialized views (perhaps #33 will help or prevent this?)

Original by @thomcc

rnewman commented 2020-08-06 15:24:11 +00:00 (Migrated from github.com)

Interning strings is likely to be necessary.

The way AllegroGraph stores (or stored; it's many years since I worked on it) data is probably the way Mentat should eventually go, and I've spent some time thinking about this: using fixed-size UPIs to encode the tag and small data, fixed-size tuples, collected into index chunks (e.g., `spogi` in AllegroGraph, for subject-predicate-object-graph-id, supporting queries like `[:foo/bar :foo/name ?name]`).

The engine can materialize whichever index chunks you need to make raw queries fast in the application, and not bother with the rest. Strings go elsewhere, and store the hash of the string in the UPI. Merge index chunks when appropriate. Use metaindex chunks to find candidates to run queries.
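
None of this is existing Mentat code, but a rough Rust sketch of the shape being described (with invented field widths) might look like the following: a fixed-size UPI carrying a type tag plus either a small inline value or the hash of an externally interned string, and an index chunk that is just a sorted run of fixed-size tuples.

```rust
/// Hypothetical fixed-size "UPI": a type tag plus either an inline small value
/// or the 64-bit hash of a string stored elsewhere. Field widths are invented.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct Upi {
    tag: u32,     // entid, long, instant, string-hash, ...
    payload: u64, // the value itself, or a hash keying into the string store
}

/// One datom as a fixed-size tuple; chunks are just sorted arrays of these.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct Tuple {
    e: Upi,
    a: Upi,
    v: Upi,
    tx: Upi,
}

/// An immutable chunk sorted entity-attribute first, so a lookup like
/// `[:foo/bar :foo/name ?name]` becomes a range scan over an (e, a) prefix.
struct EavtChunk {
    tuples: Vec<Tuple>, // sorted by (e, a, v, tx)
}

impl EavtChunk {
    fn matching(&self, e: Upi, a: Upi) -> impl Iterator<Item = &Tuple> {
        // Linear for the sketch; a real chunk would binary-search the prefix.
        self.tuples
            .iter()
            .skip_while(move |t| (t.e, t.a) < (e, a))
            .take_while(move |t| (t.e, t.a) == (e, a))
    }
}
```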

This would make most of the existing Mentat query implementation moot — we put a lot of effort into translating queries to SQL, managing tables and types and such, and that would instead be replaced by a planner and executor that used index chunks directly. This is a lot closer to how Datomic works, too. If I were starting over today (and I have thought about it!) this is where I would start: drop the up-front requirement for FTS, start with a minimal core, and prove out chunk management, ID rewriting, and syncing as the main spikes.

There will always be overhead with an approach like Mentat's: it stores strictly more data in the form of the log. It would be possible to compress the log, potentially at write-time, via simple run-length encoding — no need to repeat duplicate subjects or transaction IDs of a sequence of datoms in the log.
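
As a hedged illustration only (this isn't Mentat's actual log schema): grouping a transaction's datoms by entity before writing means the tx id is stored once per transaction and each entity id once per run, which is essentially the run-length idea above.

```rust
/// A datom as it arrives at the transactor (value shrunk to an i64 for brevity).
struct Datom { e: i64, a: i64, v: i64 }

/// One transaction in the log: the tx id stored once, datoms grouped by entity
/// so repeated subjects are not written out per row.
struct TxBlock {
    tx: i64,
    by_entity: Vec<(i64, Vec<(i64, i64)>)>, // (e, [(a, v), ...])
}

fn encode_tx(tx: i64, mut datoms: Vec<Datom>) -> TxBlock {
    datoms.sort_by_key(|d| (d.e, d.a));
    let mut by_entity: Vec<(i64, Vec<(i64, i64)>)> = Vec::new();
    for d in datoms {
        let same_entity = by_entity.last().map_or(false, |(e, _)| *e == d.e);
        if same_entity {
            by_entity.last_mut().unwrap().1.push((d.a, d.v));
        } else {
            by_entity.push((d.e, vec![(d.a, d.v)]));
        }
    }
    TxBlock { tx, by_entity }
}
```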

thomcc commented 2020-08-06 18:58:45 +00:00 (Migrated from github.com)

There's some bitrotted code floating around for implementing a compressed VFS. I suspect a lower-level approach…

For something like that I'd strongly consider using SQLite as the blob store, and using incremental BLOB I/O: https://www.sqlite.org/c3ref/blob.html.
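
A sketch of what that incremental BLOB API looks like through rusqlite (feature `blob`); the `chunks` table here is hypothetical. The point is that a large blob can be streamed in and out without materializing it in a single query.

```rust
use rusqlite::{blob::ZeroBlob, Connection, DatabaseName};
use std::error::Error;
use std::io::{Read, Write};

fn write_chunk(db: &Connection, id: i64, bytes: &[u8]) -> Result<(), Box<dyn Error>> {
    db.execute_batch("CREATE TABLE IF NOT EXISTS chunks (id INTEGER PRIMARY KEY, body BLOB)")?;
    // Reserve the blob at its final size, then stream the bytes into it.
    db.execute(
        "INSERT OR REPLACE INTO chunks (id, body) VALUES (?1, ?2)",
        (id, ZeroBlob(bytes.len() as i32)),
    )?;
    let mut blob = db.blob_open(DatabaseName::Main, "chunks", "body", id, false)?;
    blob.write_all(bytes)?;
    Ok(())
}

fn read_chunk(db: &Connection, id: i64) -> Result<Vec<u8>, Box<dyn Error>> {
    let mut blob = db.blob_open(DatabaseName::Main, "chunks", "body", id, true)?;
    let mut out = Vec::with_capacity(blob.size() as usize);
    blob.read_to_end(&mut out)?;
    Ok(out)
}
```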

That said, this is more because of the stability problems `rkv` has encountered with using LMDB in the wild (e.g. resorting to implementing a `SafeMode` backend, which just uses normal file IO to read/write a `bincode`-serialized `HashMap<Box<[u8]>, BTreeSet<Box<[u8]>>>`), rather than it being a good fit for SQLite (it's very much not relational at that point).

Also, it's worth noting that there exist (bitrotted) compressed VFS impls in the wild if you look (one example I have bookmarked is https://github.com/ryanhomer/sqlite3-compression-encryption-vfs/ -- not vouching for its quality). I don't know how far from production these are (probably far, but not so far that you couldn't get a feel for it), but they exist.

That said, this would be redundant if you're using a blob store, since you could just ensure the blobs are compressed when inserting, and decompressed when fetched.

I also expect there to be a lot of approaches you could use to compress the transaction log much better than just RLE (although doing an RLE or delta encoding first could definitely help!). The rows should be the kind of highly structured struct-like data that both arithmetic coding and ANS variants eat up (ANS with less fiddling, admittedly).
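
For illustration (purely a sketch, not a proposed format): delta-encoding a monotonically increasing column such as tx ids and then handing the residuals to a general-purpose compressor gets you most of the way there; zstd's entropy stage (FSE) is itself an ANS variant.

```rust
/// Delta + varint encode a monotonically increasing column (e.g. tx ids).
fn delta_encode(ids: &[i64]) -> Vec<u8> {
    let mut prev = 0i64;
    let mut out = Vec::with_capacity(ids.len() * 2);
    for &id in ids {
        let mut delta = (id - prev) as u64;
        prev = id;
        // LEB128-style varint: the common small deltas take a single byte.
        loop {
            let byte = (delta & 0x7f) as u8;
            delta >>= 7;
            if delta == 0 {
                out.push(byte);
                break;
            }
            out.push(byte | 0x80);
        }
    }
    out
}

/// The residuals are then highly compressible by an off-the-shelf compressor.
fn compress_column(ids: &[i64]) -> std::io::Result<Vec<u8>> {
    zstd::encode_all(std::io::Cursor::new(delta_encode(ids)), 3)
}
```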


---

Edit: Sorry, I accidentally tabbed and hit enter, sending this before I meant to.

rnewman commented 2020-08-06 19:10:53 +00:00 (Migrated from github.com)

My understanding of the LMDB issues in Firefox is that there were something like three crashes, and no indication of a useful root cause — the space in which browsers run is a lot more complicated than most apps, so this could have been any number of things. From talking to the current maintainers I didn't hear anything that made me concerned about using LMDB or rkv in the places that I have used it.

In the context of Mentat a shift to chunk storage is a significant rework, likely a rewrite without API compatibility being maintained: all of query execution, transaction, and opening/file management will be totally different.

I was hanging on to the name "Ghola" for the stuff I've been dreaming about; you heard it here first 😄

thomcc commented 2020-08-06 19:34:55 +00:00 (Migrated from github.com)

> My understanding of the LMDB issues in Firefox is that there were something like three crashes, and no indication of a useful root cause — the space in which browsers run is a lot more complicated than most apps, so this could have been any number of things. From talking to the current maintainers I didn't hear anything that made me concerned about using LMDB or rkv in the places that I have used it.

That's entirely possible. I didn't dig a ton here, just noticed that... we can't use it in Firefox. (Although memory mapping is truly fraught with peril, so it's not really *that* surprising to me that a DB that uses it very aggressively has robustness problems... 😬)

Either way, the chunk store approach (if you go through the effort to implement it) also makes supporting different backends (as datomic does) easy, since it's well within the capabilities of almost any conceivable database.

That said, yeah, it seems like a massive effort would be needed. Probably too much to be worthwhile in the near future... I suspect there are other ways of improving the performance issues (SQLite now natively supports something like materialized views via generated columns, https://sqlite.org/gencol.html, which helps with several of those issues).
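
For context, the generated-column feature that link describes (SQLite 3.31+) looks roughly like this; the table and expression here are invented. A STORED column is computed once per write and can be indexed, which is the closest thing to a per-row materialized view.

```rust
use rusqlite::{Connection, Result};

fn create_visits(db: &Connection) -> Result<()> {
    db.execute_batch(
        "CREATE TABLE IF NOT EXISTS visits (
             url        TEXT NOT NULL,
             visited_at INTEGER NOT NULL,
             -- Computed on write and stored; may be indexed like an ordinary column.
             visit_day  INTEGER GENERATED ALWAYS AS (visited_at / 86400) STORED
         );
         CREATE INDEX IF NOT EXISTS visits_by_day ON visits(visit_day);",
    )
}
```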

Also, if I'm being honest, in retrospect I think the size issues were probably overblown (despite being the one who did a lot of the investigation into them). Fenix made a big deal about them to us, but... ultimately the network cache ends up being far larger than the mentat store was *ever* likely to reach.


---

... If I had to give some reservations about the mentat architecture, they'd be around how the "single DB with everything" model doesn't play well with SQLite's locking and transaction model (which locks entire files...).

The more I use SQLite, the more I think you're best off using several database files that you attach as needed -- basically emulating per-table locking :p. This loses transactionality across databases, but... in practice the solution to the locking issues is frequently "don't use transactions everywhere", a la desktop places... That's only viable for some schemas though -- something like mentat's wouldn't get any benefit from doing this at all, and would likely just cause tons and tons of headaches.
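
A minimal sketch of that attach-as-needed pattern (file and schema names are made up): each family of tables lives in its own file, so a writer in one file doesn't hold the whole-database lock over the others. The comment at the end notes the cross-file atomicity caveat, which is the "loses transactionality" trade-off above.

```rust
use rusqlite::{params, Connection, Result};
use std::path::Path;

fn open_profile(dir: &Path) -> Result<Connection> {
    let db = Connection::open(dir.join("core.sqlite"))?;
    // ATTACH gives each extra file a schema name we can qualify in queries,
    // e.g. `SELECT * FROM history.visits`.
    let history = dir.join("history.sqlite").to_string_lossy().into_owned();
    let cache = dir.join("cache.sqlite").to_string_lossy().into_owned();
    db.execute("ATTACH DATABASE ?1 AS history", params![history])?;
    db.execute("ATTACH DATABASE ?1 AS cache", params![cache])?;
    // Caveat: a transaction spanning attached files is atomic across all of
    // them only in rollback-journal mode, not in WAL mode.
    Ok(db)
}
```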

Still, I genuinely suspect that if there's a bullet Fenix (for example) dodged, it's likely to be this, and I don't really remember it being on our radar (it certainly was far from mine... but I was also mostly an outside observer for mentat)...

That said, perhaps mentat would allow us to make sufficiently fine-grained transactions that it wouldn't be as big of a deal (I genuinely don't know here -- it's possible), and of course the chunk-storage approach seems to allow for a number of creative solutions here (as well as taking advantage of whatever native support the DB has for transaction management...). So ultimately there were paths forward, just... hard ones.

rnewman commented 2020-08-06 22:00:29 +00:00 (Migrated from github.com)

Yeah, once you get to the immutable chunk world, your 'transactions' become the same as Datomic's transacts: the problem of figuring out what goes in the log. Serial writes where the actual application happens later is a pretty smart way to build a DB: high throughput.

In my experience, applications that use data drive towards two polar opposites: one big store that owns the data, or tiny flat files each owned by a single service, each communicating. This is true even when 'tiny' means 50TB DynamoDB tables.

The problem with the former is throughput. The problems with the latter are evolution, management, reinvention of wheels, and the inevitable need to query across and manage references to things in the other services — all the stuff that Mentat was trying to address. The theory is/was that performance problems can be addressed with work, but one simply can't get to Profile in the Cloud or similar data-centric feature sets without changing who owns the data.

On a tangent, big stores either drive towards logic in or near the store that can help you with stuff (which includes relational constraints, triggers, and reified storage abstractions like Mentat), or offloading that problem to the application (e.g., DynamoDB) so that storage can focus on durability and whatever slice of performance they care about. AWS services are built to leverage the latter by enforcing all access via a controlling service, which is similar to how we built Firefox for iOS: it doesn't matter how dumb your storage is if you control the code.

And then you squint and you see the dumb storage inside Mentat, and then you think about how tricky Dynamo apps are to write when client-side encryption comes into play. And around and around we go. 😂
