# Sparsemap
Bitsets, also called bitmaps, are commonly used as fast data structures.
Unfortunately, they can use too much memory. To compensate, we often use
compressed bitmaps.

`sparsemap` is a sparse, compressed bitmap. In the best case, it can store 2048
bits in just 8 bytes. In the worst case, it stores the 2048 bits uncompressed and
requires an additional 8 bytes of overhead.

The "best" case happens when long consecutive runs of bits are either all set
("1") or all unset ("0"). If your numbers are consecutive 64-bit integers,
sparsemap can compress up to 16 KiB of them (2048 x 8 bytes) into just 8 bytes.

## How does it work?
At the lowest level, sparsemap stores bits in sm_bitvec_t's (each a uint32_t or
uint64_t).

Each sm_bitvec_t has a 2-bit descriptor. The descriptors are packed into a
single descriptor word, which has the same size as an sm_bitvec_t and is
prepended to the sm_bitvec_t's it describes. With 64-bit words, one descriptor
word therefore covers 32 sm_bitvec_t's, or 2048 bits. The descriptor specifies
whether its sm_bitvec_t consists only of set bits ("1"), only of unset bits
("0"), or has a mixed payload. In the first two cases, the sm_bitvec_t is not
stored.
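The size bounds quoted at the top of this README follow from this layout. As a rough sketch, assuming 64-bit words and the 2-bit code "10" for a mixed payload (taken from the example in this README), a chunk occupies one descriptor word plus one word per mixed sm_bitvec_t. That gives 8 bytes when nothing is mixed and 8 + 32 x 8 = 264 bytes when all 32 vectors are mixed. The names `SM_WORD_BYTES`, `SM_VECS_PER_WORD`, `SM_CODE_MIXED`, `sm_count_mixed` and `sm_chunk_bytes` are hypothetical and are not part of the library's API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names for illustration only; not the library's API. */
#define SM_WORD_BYTES    sizeof(uint64_t) /* size of descriptor and sm_bitvec_t */
#define SM_VECS_PER_WORD 32               /* 64 descriptor bits / 2 bits each */
#define SM_CODE_MIXED    0x2u             /* "10", as in the README example */

/* Count the 2-bit descriptors in `desc` that mark a mixed sm_bitvec_t. */
size_t sm_count_mixed(uint64_t desc)
{
    size_t mixed = 0;
    for (int i = 0; i < SM_VECS_PER_WORD; i++) {
        if (((desc >> (2 * i)) & 0x3u) == SM_CODE_MIXED) {
            mixed++;
        }
    }
    return mixed;
}

/* Bytes used by one chunk: the descriptor plus every stored (mixed) vector. */
size_t sm_chunk_bytes(uint64_t desc)
{
    size_t mixed = sm_count_mixed(desc);
    assert(mixed <= SM_VECS_PER_WORD);
    return SM_WORD_BYTES * (1 + mixed);
}
```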
An example shows a sequence of 8 x 16 bits (here, each sm_bitvec_t and the
descriptor word has 16 bits):

    Descriptor:
        00 00 00 00 11 00 11 10
        ^^ ^^ ^^ ^^-- sm_bitvec_t #0 - #3 are "0000000000000000"
                    ^^-- sm_bitvec_t #4 is "1111111111111111"
                       ^^-- sm_bitvec_t #5 is "0000000000000000"
                          ^^-- sm_bitvec_t #6 is "1111111111111111"
                             ^^-- sm_bitvec_t #7 is "0110010101111001"

Since the first 7 sm_bitvec_t's are either all "1" or all "0", they are not
stored. The actual memory sequence looks like this:

    0000000011001110 0110010101111001

Instead of storing 8 words (16 bytes), we store only 2 words (4 bytes): one
for the descriptor and one for the last sm_bitvec_t, #7.
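The example can be replayed in a few lines of C. This is only a sketch with 16-bit words to match the diagram, not the library's code. `decode_state` and `stored_words` are hypothetical helpers, and the sketch assumes that sm_bitvec_t #0 is described by the leftmost (most significant) pair of the descriptor as printed above; the real in-memory bit order may differ. Under those assumptions, `stored_words(0x00CE)` comes to 2: the descriptor word plus sm_bitvec_t #7.

```c
#include <assert.h>
#include <stdint.h>

/* 2-bit codes as they appear in the README example (hypothetical names). */
enum { CODE_ZEROS = 0x0, CODE_MIXED = 0x2, CODE_ONES = 0x3 };

#define VECS_PER_DESC 8 /* 16 descriptor bits / 2 bits per sm_bitvec_t */

/* Return the 2-bit code for vector `i`, reading pairs left to right. */
unsigned decode_state(uint16_t desc, unsigned i)
{
    assert(i < VECS_PER_DESC);
    return (desc >> (2 * (VECS_PER_DESC - 1 - i))) & 0x3u;
}

/* Words actually stored: the descriptor plus one word per mixed vector. */
unsigned stored_words(uint16_t desc)
{
    unsigned words = 1; /* the descriptor word itself */
    for (unsigned i = 0; i < VECS_PER_DESC; i++) {
        if (decode_state(desc, i) == CODE_MIXED) {
            words++;
        }
    }
    return words;
}
```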
The sparsemap stores a list of chunk maps and, for each chunk map, its absolute
starting offset. For example, if the user sets bit 0 and bit 10000 and the chunk
map capacity is 2048, the sparsemap creates two chunk maps: the first starts at
offset 0, the second at offset 8192 (10000 rounded down to a multiple of 2048).
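The starting offsets in that example are the bit index rounded down to a multiple of the chunk map capacity. A minimal sketch of that arithmetic, where `chunk_start` is a hypothetical helper and not a function from `sparsemap.h`:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: first bit index of the chunk map that holds `idx`. */
size_t chunk_start(size_t idx, size_t capacity)
{
    assert(capacity > 0);
    return idx - (idx % capacity);
}
```

With a capacity of 2048, `chunk_start(0, 2048)` is 0 and `chunk_start(10000, 2048)` is 8192, matching the two chunk maps described above.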
## Usage instructions
Copy the files `src/sparsemap.c` and `include/sparsemap.h` into your project.
Review the `examples/*` and `tests/*` code.
## Final words
This bitmap compresses efficiently when used on long sequences of set (or
unset) bits. For example, with a word size of 64 bits and a payload of
consecutive numbers without gaps, a payload of 2048 x sizeof(uint64_t) = 16 KiB
will occupy only 8 bytes!

However, if the sequence is not consecutive and has gaps, the compression may
be inefficient, and the size (in the worst case) is identical to that of an
uncompressed bit vector (sometimes larger, due to the bytes required for
metadata). In such cases, other compression schemes are more efficient (see,
e.g., http://lemire.me/blog/archives/2008/08/20/the-mythical-bitmap-index/). We
include in `lib` an amalgamated copy (git `2dc8070`) of the well-known
[Roaring Bitmaps](https://github.com/RoaringBitmap/CRoaring/tree/master) and
use it in the soak test to check that our results match theirs.

This library was originally created by [Christoph Rupp](https://crupp.de) in
C++ and then translated to C and further improved by Greg Burd <greg@burd.me>
for use in LMDB and OpenLDAP.