[db] Optimize bulk insertions using fixed-size statement caches and base expansion #118

Open
opened 2020-08-06 16:55:12 +00:00 by gburd · 0 comments
gburd commented 2020-08-06 16:55:12 +00:00 (Migrated from github.com)

This ticket tracks making bulk insertions (like at https://github.com/ncalexan/mentat/blob/db-tests/db/src/db.rs#L631) more efficient. This is follow-up to https://github.com/mozilla/mentat/pull/214#discussion_r99420334.

I don't know the best way to do this, so there's some experimentation needed. I can think of the following considerations:

  • average transaction size is > 1 assertion (since there's always a `:db/tx :db/txInstant MILLISECONDS` datom)
  • most transactions will be small, say < 5 assertions
  • some transactions will be huge, say 10,000 or 100,000 assertions
  • SQLite will accept only a limited number of bindings per statement handle (usually 1000, but possibly as many as 500,000 on some platforms!)
  • there's a finite store of SQLite statement handles available

So we want to cache a certain number of insertion statements for some number of assertions, and then chunk our input into those insertion statements. If we had unlimited bindings per statement and unlimited statement handles, we might do a base-2 expansion and lazily cache insertions of 1, 2, 4, 8, 16, … assertions. However, given our considerations, it might be better to cache insertions for 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, and (999 / bindings-per-insertion) assertions, effectively dividing the parameter space into "little" and "huge" regimes.
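To make the "little"/"huge" idea concrete, here is a minimal sketch of how the chunking might work, independent of SQLite itself. The function names (`cached_sizes`, `chunk_plan`), the choice of 4 bindings per assertion, and the 999-binding limit are illustrative assumptions, not part of any existing implementation:

```rust
// Hypothetical sketch: cache insertion statements for 1..=10 assertions
// ("little" regime) plus one "huge" size derived from the binding limit,
// then greedily chunk a transaction into those cached statement sizes.

fn cached_sizes(bindings_per_assertion: usize, max_bindings: usize) -> Vec<usize> {
    // e.g. 999 bindings / 4 bindings-per-assertion = 249 assertions per "huge" statement
    let huge = max_bindings / bindings_per_assertion;
    let mut sizes: Vec<usize> = (1..=10).collect();
    if huge > 10 {
        sizes.push(huge);
    }
    sizes
}

fn chunk_plan(mut n: usize, sizes: &[usize]) -> Vec<usize> {
    // Greedily pick the largest cached statement that still fits; size 1 is
    // always cached, so this terminates.
    let mut plan = Vec::new();
    while n > 0 {
        let s = *sizes
            .iter()
            .rev()
            .find(|&&s| s <= n)
            .expect("size 1 is always cached");
        plan.push(s);
        n -= s;
    }
    plan
}

fn main() {
    let sizes = cached_sizes(4, 999);
    // A small transaction uses exactly one cached statement.
    println!("{:?}", chunk_plan(3, &sizes)); // [3]
    // A huge transaction is mostly full-width statements plus a small tail.
    println!("{:?}", chunk_plan(1000, &sizes)); // [249, 249, 249, 249, 4]
}
```

One consequence of the greedy plan is that a mid-sized transaction (say 37 assertions) decomposes into repeated 10-assertion statements plus a cached tail, so no new statement handles are prepared beyond the fixed cache.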

If we wanted to be _really_ fancy, we could determine our caching strategy dynamically at initialization time, or we could generate different strategies at compile time for different hardware profiles (like most FFI implementations do).

This seems to me similar to "Montgomery ladders" for fast multiplication and "base phi expansions" for doing arithmetic in algebraic groups that have non-trivial computable morphisms, with some additional restrictions thrown in.

Reference: greg/mentat#118