greg/sparsemap

This is a C99 implementation of a sparse, compressed bitmap index. In the best case, it can store 2048 bits in just 8 bytes. In the worst case, it stores the 2048 bits uncompressed and requires an additional 8 bytes of overhead.

Greg Burd 4eb1d463b2 WIP		2024-05-14 21:00:02 -04:00
.idea	WIP	2024-05-10 11:27:43 -04:00
.idx	revert split and merge for now	2024-05-13 02:07:45 +00:00
examples	compare against roaring bitmaps	2024-05-03 15:15:39 -04:00
include	WIP	2024-05-10 11:27:43 -04:00
lib	many new API; fixes	2024-05-09 15:50:56 -04:00
src	WIP	2024-05-14 21:00:02 -04:00
tests	WIP	2024-05-14 21:00:02 -04:00
.clang-format	locate first span of length 'n' using rank and select (#3 )	2024-04-07 20:38:57 +00:00
.clang-tidy	revert split and merge for now	2024-05-13 02:07:45 +00:00
.envrc	select/rank for unset as well as set bits (#4 )	2024-04-24 20:32:09 +00:00
.gitignore	compare against roaring bitmaps	2024-05-03 15:15:39 -04:00
flake.lock	select/rank for unset as well as set bits (#4 )	2024-04-24 20:32:09 +00:00
flake.nix	compare against roaring bitmaps	2024-05-03 15:15:39 -04:00
LICENSE	license	2024-04-10 15:49:15 -04:00
Makefile	WIP	2024-05-14 21:00:02 -04:00
README.md	revert split and merge for now	2024-05-13 02:07:45 +00:00

README.md

Sparsemap

Bitsets, also called bitmaps, are commonly used as fast data structures. Unfortunately, they can use too much memory. To compensate, we often use compressed bitmaps.

sparsemap is a sparse, compressed bitmap. In the best case, it can store 2048 bits in just 8 bytes. In the worst case, it stores the 2048 bits uncompressed and requires an additional 8 bytes of overhead.

The "best" case happens when large consecutive sequences of bits are either all set ("1") or all unset ("0"). For example, if your numbers are consecutive 64-bit integers, sparsemap can represent 2048 of them (16 KiB as a raw array) in just 8 bytes.

How does it work?

On the lowest level, sparsemap stores bits in sm_bitvec_t's (each a uint32_t or uint64_t).

Each sm_bitvec_t has an additional 2-bit descriptor. The descriptors of a group of sm_bitvec_t's are packed into a single descriptor word, which is prepended to the group. The descriptor word and the sm_bitvec_t's have the same size. The descriptor of a sm_bitvec_t specifies whether the sm_bitvec_t consists only of set bits ("1"), only of unset bits ("0"), or has a mixed payload. In the first and second cases, the sm_bitvec_t is not stored.
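The 2048-bit and 8-byte figures quoted earlier follow from this layout. Here is a rough sketch of the arithmetic, assuming the 2-bit descriptors of a group share one descriptor word of the same width as a sm_bitvec_t. The helper names are hypothetical and not part of sparsemap's API.

```c
#include <stddef.h>

/* Hypothetical helpers for illustration only; not part of sparsemap's API. */

/* Each descriptor takes 2 bits, so a descriptor word of `word_bits` bits
 * can describe word_bits / 2 bit vectors. */
size_t vectors_per_chunk(size_t word_bits)
{
    return word_bits / 2;
}

/* Number of bits covered by one descriptor word and its bit vectors. */
size_t chunk_capacity_bits(size_t word_bits)
{
    return vectors_per_chunk(word_bits) * word_bits;
}

/* Best case: every vector is all "0" or all "1", so only the
 * descriptor word is stored. */
size_t best_case_bytes(size_t word_bits)
{
    return word_bits / 8;
}

/* Worst case: every vector is mixed, so the descriptor word is stored
 * in addition to all of the vectors. */
size_t worst_case_bytes(size_t word_bits)
{
    return word_bits / 8 + vectors_per_chunk(word_bits) * (word_bits / 8);
}
```

With a 64-bit word this gives 32 vectors per descriptor word, a capacity of 2048 bits, 8 bytes in the best case, and 264 bytes (256 bytes of payload plus 8 bytes of overhead) in the worst case.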

An example shows a sequence of 8 x 16 bits (here, each sm_bitvec_t and the descriptor word has 16 bits):

  Descriptor:
  00 00 00 00 11 00 11 10
  ^^ ^^ ^^ ^^-- sm_bitvec_t #0 - #3 are "0000000000000000"
              ^^-- sm_bitvec_t #4 is "1111111111111111"
                 ^^-- sm_bitvec_t #5 is "0000000000000000"
                    ^^-- sm_bitvec_t #6 is "1111111111111111"
                       ^^-- sm_bitvec_t #7 is "0110010101111001"

Since the first 7 sm_bitvec_t's are either all "1" or all "0", they are not stored. The actual memory sequence looks like this:

  0000000011001110 0110010101111001

Instead of storing 8 words (16 bytes), we only store 2 words (4 bytes): one for the descriptor, and one for the last sm_bitvec_t, #7.
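The example can be reproduced with a small sketch. This is not sparsemap's actual encoder: the function name and the enum are hypothetical, and it assumes that sm_bitvec_t #0 occupies the two most significant descriptor bits, matching the left-to-right order of the diagram.

```c
#include <stddef.h>
#include <stdint.h>

/* Descriptor codes as they appear in the diagram (names are hypothetical). */
enum { DESC_ZEROS = 0x0, DESC_MIXED = 0x2, DESC_ONES = 0x3 };

/*
 * Illustrative sketch only: compress eight 16-bit vectors into out[].
 * out[0] receives the descriptor word, out[1..] the mixed vectors.
 * Returns the number of 16-bit words written.
 * Assumption: vector #0 maps to the two most significant descriptor bits.
 */
size_t encode_chunk16(const uint16_t vec[8], uint16_t out[9])
{
    uint16_t descriptor = 0;
    size_t used = 1; /* out[0] is reserved for the descriptor word */

    for (int i = 0; i < 8; i++) {
        unsigned code;
        if (vec[i] == 0x0000) {
            code = DESC_ZEROS; /* all bits unset: nothing stored */
        } else if (vec[i] == 0xFFFF) {
            code = DESC_ONES;  /* all bits set: nothing stored */
        } else {
            code = DESC_MIXED;
            out[used++] = vec[i]; /* only mixed vectors are stored */
        }
        descriptor = (uint16_t)(descriptor | (code << (14 - 2 * i)));
    }
    out[0] = descriptor;
    return used;
}
```

For the eight vectors of the example, the sketch writes two words, 0x00CE (binary 0000000011001110) and 0x6579 (binary 0110010101111001), which is 4 bytes instead of 16.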

The sparsemap stores a list of chunk maps, and for each chunk map it stores the absolute starting offset. For example, if the user sets bit 0 and bit 10000, and the chunk map capacity is 2048, the sparsemap creates two chunk maps: the first starts at offset 0, the second at offset 8192.
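The offsets in this example are consistent with rounding a bit index down to a multiple of the chunk map capacity. A minimal sketch under that assumption follows; the helper is hypothetical, not part of sparsemap's API, and the library may compute offsets differently.

```c
#include <stddef.h>

/*
 * Hypothetical helper for illustration only: starting offset of the
 * chunk map that would hold bit `idx`, assuming chunk maps start at
 * multiples of `capacity` (which must be non-zero).
 */
size_t chunk_start_offset(size_t idx, size_t capacity)
{
    return idx - (idx % capacity);
}
```

With a capacity of 2048, bit 0 maps to offset 0 and bit 10000 maps to offset 8192, since 4 x 2048 = 8192 is the largest multiple of 2048 that does not exceed 10000.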

Usage instructions

Copy the files src/sparsemap.c and include/sparsemap.h into your project. Review the examples/* and tests/* code.

Final words

This bitmap compresses efficiently when used on long sequences of set (or unset) bits. For example, with a word size of 64 bits and a payload of consecutive numbers without gaps, a payload of 2048 x sizeof(uint64_t) = 16 KiB will occupy only 8 bytes.

However, if the sequence is not consecutive and has gaps, the compression may be inefficient, and the size (in the worst case) is identical to that of an uncompressed bit vector, or slightly higher because of the bytes required for metadata. In such cases, other compression schemes are more efficient (e.g. http://lemire.me/blog/archives/2008/08/20/the-mythical-bitmap-index/). The lib directory includes the amalgamated sources of the well-known Roaring Bitmaps library (git 2dc8070), which the soak test uses to check that our results match theirs.

This library was originally created by Christoph Rupp in C++, then translated to C and further improved by Greg Burd <greg@burd.me> for use in LMDB and OpenLDAP.