Implement a hash-based indexing option for attributes #82

Open
opened 2020-08-06 16:54:36 +00:00 by gburd · 0 comments
gburd commented 2020-08-06 16:54:36 +00:00 (Migrated from github.com)

Our schema language supports :index true, which we interpret as a SQL index on vaet.

For strings, that's inefficient: firstly we get duplication of string content, which is space-expensive, but we're also comparing strings rather than primitive values.

Bug 889561 removed the URL unique index in Places, replacing it with a url_hash column that's queried first. The hash is computed by mfbt's AddToHash, which goes a lil' something like this:

    if (mode.IsEmpty()) {
      // URI-like strings (having a prefix before a colon), are handled specially,
      // as a 48 bit hash, where first 16 bits are the prefix hash, while the
      // other 32 are the string hash.
      // The 16 bits have been decided based on the fact hashing all of the IANA
      // known schemes, plus "places", does not generate collisions.
      nsAString::const_iterator start, tip, end;
      str.BeginReading(tip);
      start = tip;
      str.EndReading(end);
      if (FindInReadable(NS_LITERAL_STRING(":"), tip, end)) {
        const nsDependentSubstring& prefix = Substring(start, tip);
        uint64_t prefixHash = static_cast<uint64_t>(HashString(prefix) & 0x0000FFFF);
        // The second half of the url is more likely to be unique, so we add it.
        uint32_t srcHash = HashString(str);
        uint64_t hash = (prefixHash << 32) + srcHash;
        result->SetAsInt64(hash);
      } else {
        uint32_t hash = HashString(str);
        result->SetAsInt64(hash);
      }
    } else if (mode.Equals(NS_LITERAL_CSTRING("prefix_lo"))) {
      // Keep only 16 bits.
      uint64_t hash = static_cast<uint64_t>(HashString(str) & 0x0000FFFF) << 32;
      result->SetAsInt64(hash);
    } else if (mode.Equals(NS_LITERAL_CSTRING("prefix_hi"))) {
      // Keep only 16 bits.
      uint64_t hash = static_cast<uint64_t>(HashString(str) & 0x0000FFFF) << 32;
      // Make this a prefix upper bound by filling the lowest 32 bits.
      hash +=  0xFFFFFFFF;
      result->SetAsInt64(hash);
    } else {
      return NS_ERROR_FAILURE;
    }

Because we can find ea->v from datoms, we can support indexed lookup from v->eat by defining a separate materialized view, heat (direct index, but collision avoidance needed) or hi (relying on an easy indexed join against datoms), for values that are indexed only for direct lookup and not for range queries.

This is also particularly useful for fulltext_values: the FTS virtual table requires a table scan to look up exact values, which makes insertion slow.

Our schema language supports `:index true`, which we interpret as a SQL index on `vaet`. For strings, that's inefficient: firstly we get duplication of string content, which is space-expensive, but we're also comparing strings rather than primitive values. Bug 889561 removed the URL unique index in Places, replacing it with a `url_hash` column that's queried first. The hash is computed by mfbt's `AddToHash`, which goes a lil' something like this: ``` if (mode.IsEmpty()) { // URI-like strings (having a prefix before a colon), are handled specially, // as a 48 bit hash, where first 16 bits are the prefix hash, while the // other 32 are the string hash. // The 16 bits have been decided based on the fact hashing all of the IANA // known schemes, plus "places", does not generate collisions. nsAString::const_iterator start, tip, end; str.BeginReading(tip); start = tip; str.EndReading(end); if (FindInReadable(NS_LITERAL_STRING(":"), tip, end)) { const nsDependentSubstring& prefix = Substring(start, tip); uint64_t prefixHash = static_cast<uint64_t>(HashString(prefix) & 0x0000FFFF); // The second half of the url is more likely to be unique, so we add it. uint32_t srcHash = HashString(str); uint64_t hash = (prefixHash << 32) + srcHash; result->SetAsInt64(hash); } else { uint32_t hash = HashString(str); result->SetAsInt64(hash); } } else if (mode.Equals(NS_LITERAL_CSTRING("prefix_lo"))) { // Keep only 16 bits. uint64_t hash = static_cast<uint64_t>(HashString(str) & 0x0000FFFF) << 32; result->SetAsInt64(hash); } else if (mode.Equals(NS_LITERAL_CSTRING("prefix_hi"))) { // Keep only 16 bits. uint64_t hash = static_cast<uint64_t>(HashString(str) & 0x0000FFFF) << 32; // Make this a prefix upper bound by filling the lowest 32 bits. hash += 0xFFFFFFFF; result->SetAsInt64(hash); } else { return NS_ERROR_FAILURE; } ``` Because we can find `ea->v` from `datoms`, we can support indexed lookup from `v->eat` by defining a separate materialized view, `heat` (direct index, but collision avoidance needed) or `hi` (relying on an easy indexed join against `datoms`), for values that are indexed only for direct lookup and not for range queries. This is also particularly useful for `fulltext_values`: the FTS virtual table requires a table scan to look up exact values, which makes insertion slow.
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference: greg/mentat#82
No description provided.