Investigate methods of reducing the on-disk size of Mentat databases #290

Open
opened 2020-08-06 16:57:55 +00:00 by gburd · 0 comments
gburd commented 2020-08-06 16:57:55 +00:00 (Migrated from github.com)

This document highlights many of the issues and concerns: https://docs.google.com/document/d/14ywV4PlBAdOsJrxd7QhMcxbo7pw8cdc1kLAb7I4QhFY/edit?ts=5b7b7f87.

Some work to measure the size of the database when storing history is in https://github.com/mozilla/application-services/pull/191. For my places.sqlite (100k places, 150k visits), the database gets around 200MB larger.
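
For anyone reproducing the measurement: SQLite's `dbstat` virtual table gives a per-table and per-index size breakdown, which is how growth like this can be attributed to specific tables and indices. A minimal sketch, assuming SQLite was compiled with `SQLITE_ENABLE_DBSTAT_VTAB`:

```sql
-- Per-table and per-index size breakdown of the open database.
-- Requires a SQLite build with SQLITE_ENABLE_DBSTAT_VTAB.
SELECT name,
       SUM(pgsize) / 1024.0 / 1024.0 AS size_mib
FROM dbstat
GROUP BY name
ORDER BY SUM(pgsize) DESC;
```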

It's worth noting that 'disk usage' was one of the primary concerns reported by user research for Fenix, although it's not clear whether this size increase (relative to places) is the kind of thing that would make a dent relative to stuff like caches. A very informal poll of some friends of mine found that Fennec typically uses around 500MB of space (app + data); another 200MB isn't a trivial increase, but it doesn't substantially change where we are in terms of app size.

Some bugs which may help (suggested by @rnewman):

- https://github.com/mozilla/mentat/issues/29: Pack the various flag columns into one TINYINT -- appears to give a 1%-2% benefit, which is probably not substantial enough to justify the effort IMO.
- https://github.com/mozilla/mentat/issues/69: implement something like Places' url_hash automatically. If this could keep string values out of the aevt/eavt indices it could have a huge benefit (see the sketch after this list).
- https://github.com/mozilla/mentat/issues/32: Interning keywords. Our test schema didn't use these, and I'm not sure what the actual use case for them is over a `:db.type/ref` to a `{ :db/ident :my/keyword }`, so this doesn't seem like a high priority.
- https://github.com/mozilla/mentat/issues/33: Store the data canonically in a SQL table instead of in datoms. This is interesting but seems like a lot of work.
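
To illustrate the url_hash idea from mozilla/mentat#69: the point is to index a fixed-size hash of the value rather than the value itself, so long strings stay out of the index b-trees. A rough sketch, not Mentat's actual schema -- `value_hash` is a hypothetical column, and `hash` stands in for an application-registered SQL function (SQLite has no such built-in; Places registers one for its `url_hash` column):

```sql
-- Sketch: index an 8-byte hash of the value instead of the full
-- string. value_hash and hash() are hypothetical; the application
-- must register hash() and keep value_hash populated itself.
CREATE INDEX idx_datoms_av_hash ON datoms (a, value_hash);

-- A lookup by value filters on the cheap hash first, then confirms
-- the exact string to rule out hash collisions:
SELECT e
FROM datoms
WHERE a = :attr
  AND value_hash = hash(:value)
  AND v = :value;
```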

I think something like [sqlite's zipvfs extension](https://www.sqlite.org/zipvfs/doc/trunk/www/index.wiki) would likely help (the databases compress well), but I have not tried it. Implementing something equivalent ourselves is likely beyond the scope of this effort (I took a look at the effort required, and it wasn't exactly trivial). Additionally, whatever we do would need to integrate with sqlcipher somehow (I also took a look at bolting compression into sqlcipher before the encryption step, but compression makes the block output variable-size, which seemed problematic).

Other notes:

  1. Storing strings as fulltext and using the `compress`/`uncompress` options of FTS4 did not help, since the strings in each column are relatively small. Additionally, the performance overhead here was substantial even for a very fast compressor (LZ4). (See the sketch after this list.)
  2. Most string data seems to be duplicated ~4 times: in `datoms`, in `timelined_transactions`, and in the indices `idx_datoms_eavt` and `idx_datoms_aevt`.
  3. During RustConf, @rnewman suggested that ultimately mentat will likely not want to use sqlite, and will instead want to read chunks of datoms directly out of something like RKV. Those chunks could be compressed more easily. This seems out of scope, as it would be a massive change to mentat, but is worth writing down.
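
For reference, the FTS4 experiment in item 1 amounts to something like the following. This is a sketch, not Mentat's actual schema: FTS4's `compress`/`uncompress` options only name SQL functions, and the application must register those functions (here the hypothetical `lz4_compress`/`lz4_decompress`, e.g. wrappers around LZ4) before the table is read or written.

```sql
-- Sketch: an FTS4 table whose stored documents are passed through
-- application-registered compression functions on write and read.
-- lz4_compress/lz4_decompress are NOT built into SQLite; they are
-- placeholder names for functions the application must register.
CREATE VIRTUAL TABLE fulltext_values USING fts4(
  text,
  compress=lz4_compress,
  uncompress=lz4_decompress
);
```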

Additional concerns exist around the fact that this problem may be exacerbated by materialized views (perhaps #33 will help with or prevent this?).
