# Sparsemap

Bitsets, also called bitmaps, are commonly used as fast data structures.
Unfortunately, they can use too much memory. To compensate, we often use
compressed bitmaps.

`sparsemap` is a sparse, compressed bitmap. In the best case, it can store 2048
bits in just 8 bytes. In the worst case, it stores the 2048 bits uncompressed and
requires an additional 8 bytes of overhead.

The "best" case happens when large consecutive sequences of bits are
either all set ("1") or all unset ("0"). If your numbers are consecutive 64-bit
integers, then sparsemap can compress up to 16 KB in just 8 bytes.

## How does it work?

At the lowest level, sparsemap stores bits in sm_bitvec_t's (a uint32_t or
uint64_t).

Each sm_bitvec_t has an additional 2-bit descriptor. The descriptors are packed
into a single descriptor word that is prepended to the group of sm_bitvec_t's
it describes. The descriptor word and the sm_bitvec_t's have the same size. The
descriptor of a sm_bitvec_t specifies whether the sm_bitvec_t consists only of
set bits ("1"), only of unset bits ("0"), or has a mixed payload. In the first
and second cases, the sm_bitvec_t is not stored.

An example shows a sequence of 8 x 16 bits (here, each sm_bitvec_t and the
descriptor word has 16 bits):

      Descriptor:
      00 00 00 00 11 00 11 10
      ^^ ^^ ^^ ^^-- sm_bitvec_t #0 - #3 are "0000000000000000"
                  ^^-- sm_bitvec_t #4 is "1111111111111111"
                     ^^-- sm_bitvec_t #5 is "0000000000000000"
                        ^^-- sm_bitvec_t #6 is "1111111111111111"
                           ^^-- sm_bitvec_t #7 is "0110010101111001"

Since the first 7 sm_bitvec_t's are either all "1" or "0" they are not stored.
The actual memory sequence looks like this:

      0000000011001110 0110010101111001

Instead of storing 8 words (16 bytes), we only store 2 words (4 bytes): one
for the descriptor, and one for the last sm_bitvec_t, #7.
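
The descriptor layout from this example can be sketched in C. This is a minimal illustration, not the library's implementation: the names `desc_kind_t`, `descriptor_at`, and `stored_words` are hypothetical, the descriptor is fixed at 16 bits as in the example, and slot #0 is assumed to be the leftmost (most significant) pair, as the diagram prints it. The library's real in-memory bit order may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 2-bit descriptor values as they appear in this README's diagrams. */
typedef enum {
  DESC_ZEROS = 0x0, /* 00: all bits unset, payload not stored */
  DESC_RLE   = 0x1, /* 01: run of 1s, length stored in one word */
  DESC_MIXED = 0x2, /* 10: mixed payload, stored verbatim */
  DESC_ONES  = 0x3  /* 11: all bits set, payload not stored */
} desc_kind_t;

#define SLOTS_PER_DESCRIPTOR 8u /* 16-bit descriptor / 2 bits per slot */

/* Return the descriptor of one slot.
 * Assumption: slot 0 is the leftmost (most significant) pair. */
static desc_kind_t descriptor_at(uint16_t descriptor, unsigned slot)
{
  assert(slot < SLOTS_PER_DESCRIPTOR);
  unsigned shift = 2u * (SLOTS_PER_DESCRIPTOR - 1u - slot);
  return (desc_kind_t)((descriptor >> shift) & 0x3u);
}

/* Count the words held in memory: the descriptor word itself plus one
 * word for every slot that needs a payload (mixed or run-length). */
static size_t stored_words(uint16_t descriptor)
{
  size_t words = 1;
  for (unsigned slot = 0; slot < SLOTS_PER_DESCRIPTOR; slot++) {
    desc_kind_t kind = descriptor_at(descriptor, slot);
    if (kind == DESC_MIXED || kind == DESC_RLE)
      words++;
  }
  return words;
}
```

For the descriptor above (`0x00CE`), `stored_words` returns 2: the descriptor word plus the single mixed sm_bitvec_t #7.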

A second example again uses a 16-bit sm_bitvec_t and a 16-bit descriptor word.
Here, a run of adjacent 1s longer than a single sm_bitvec_t (16 bits) follows
16 0s.

      Descriptor:              Vector for descriptor #1 as follows:
      00 01 00 00 00 00 00 00  00 00 00 00 10 00 01 00
      ^^-- sm_bitvec_t #0 is "0000000000000000"
         ^^-- sm_bitvec_t #1 is a run of 132 (0x84 or 0b0000000010000100) 1s

The first 2 bits of the descriptor above (itself a word the size of a
sm_bitvec_t) indicate a run of 16 zeros. The number of zeros is 16 because the
length of a sm_bitvec_t is 16 bits in this somewhat contrived example; commonly
it is either 32 or 64, depending on your system's architecture. The next 2 bits
in the descriptor are '01', indicating a run-length encoded (RLE) set of
adjacent 1s longer than 16 bits (again, because the sm_bitvec_t is 16 bits wide
here). The corresponding sm_bitvec_t contains the actual length, in this case
132 (0x84 or 0b0000000010000100).

The actual memory sequence for this second example looks like this:

      0001000000000000 0000000010000100

Using this method of RLE for adjacent 1s, we can compress (again, in this case
where sm_bitvec_t is 16 bits wide) up to 2^16 or 65536 adjacent 1s.
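
As an illustration of how such an encoding could be read back, here is a small C sketch. It rests on assumptions this README does not confirm: the helper `chunk_bit_is_set` is hypothetical, slots are consumed left to right, a `01` slot covers as many bits as its length word says, later slots continue right after the run, and a mixed payload is read least-significant bit first. The library's actual handling may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define WORD_BITS 16u /* width of one sm_bitvec_t in this example */
#define SLOTS 8u      /* 2-bit descriptors per 16-bit descriptor word */

/* mem[0] is the descriptor word; payload words follow in slot order.
 * Returns whether bit `idx` is set under the assumptions listed above. */
static bool chunk_bit_is_set(const uint16_t *mem, size_t idx)
{
  assert(mem != NULL);
  size_t start = 0;   /* first bit index covered by the current slot */
  size_t payload = 1; /* index into mem of the next payload word */

  for (unsigned slot = 0; slot < SLOTS; slot++) {
    unsigned kind = (mem[0] >> (2u * (SLOTS - 1u - slot))) & 0x3u;
    size_t span = WORD_BITS;
    bool value = false;

    switch (kind) {
    case 0x0: /* 00: WORD_BITS zeros, nothing stored */
      break;
    case 0x3: /* 11: WORD_BITS ones, nothing stored */
      value = true;
      break;
    case 0x1: /* 01: payload word holds the length of a run of 1s */
      span = mem[payload++];
      value = true;
      break;
    default: /* 10: mixed payload stored verbatim (LSB-first is assumed) */
      if (idx >= start && idx < start + WORD_BITS)
        value = (mem[payload] >> (idx - start)) & 1u;
      payload++;
      break;
    }

    if (idx >= start && idx < start + span)
      return value;
    start += span;
  }
  return false; /* beyond the bits described by this descriptor */
}
```

With the two words of the second example (`0x1000`, `0x0084`), bits 16 through 147 read as set and all other bits read as unset.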

The sparsemap stores a list of chunk maps, and for each chunk map, it stores the
absolute address (starting offset). For example, if the user sets bit 0 and bit
10000, and the chunk map capacity is 2048, the sparsemap creates two chunk maps:
the first starts at offset 0, the second starts at offset 8192.
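
The offsets in this example are consistent with each chunk map starting at a multiple of its capacity. A minimal C sketch of that arithmetic follows. The helper name `chunk_start_offset` is hypothetical, and the alignment rule is an assumption inferred from the numbers above, not taken from the library source.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: starting offset of the chunk map that would hold
 * bit `idx`, assuming chunk maps are aligned to their capacity. */
static size_t chunk_start_offset(size_t idx, size_t chunk_capacity)
{
  assert(chunk_capacity > 0);
  return idx - (idx % chunk_capacity);
}
```

With a capacity of 2048, bit 0 maps to offset 0 and bit 10000 maps to offset 8192 (4 x 2048), matching the two chunk maps described above.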

## Usage instructions

Copy the files `src/sparsemap.c` and `include/sparsemap.h` into your project.
Review the `examples/*` and `tests/*` code.

## Final words

This bitmap compresses efficiently when used on long sequences of set (or
unset) bits (e.g. with a word size of 64 bits and a payload of consecutive
numbers without gaps, the payload of 2048 x sizeof(uint64_t) = 16 KB will occupy
only 8 bytes!).

However, if the sequence is not consecutive and has gaps, the compression may
be inefficient, and the size (in the worst case) is identical to an
uncompressed bit vector (or slightly higher, due to the bytes required for
metadata). In such cases, other compression schemes are more efficient (see,
e.g., http://lemire.me/blog/archives/2008/08/20/the-mythical-bitmap-index/). We
include in `lib` the amalgamated (git `2dc8070`) and well-known
[Roaring Bitmaps](https://github.com/RoaringBitmap/CRoaring/tree/master) and
use it in the soak test to ensure our results are as accurate as theirs.

This library was originally created by [Christoph Rupp](https://crupp.de) in
C++ and then translated to C and further improved by Greg Burd <greg@burd.me>
for use in LMDB and OpenLDAP.