diff --git a/.gitignore b/.gitignore
index 1dbba9c..0244d92 100644
--- a/.gitignore
+++ b/.gitignore
@@ -2,7 +2,7 @@
 *.so
 **/*.o
 tests/test
-examples/ex_??
+examples/ex_*
 .cache
 hints.txt
 tmp/
diff --git a/README.md b/README.md
index 4788f12..895f061 100644
--- a/README.md
+++ b/README.md
@@ -1,4 +1,3 @@
-
 `sparsemap` is a sparse, compressed bitmap. In best case, it can store 2048
 bits in just 8 bytes. In worst case, it stores the 2048 bits uncompressed and
 requires additional 8 bytes of overhead.
@@ -9,82 +8,55 @@ integers then sparsemap can compress up to 16kb in just 8 bytes.
 
 ## How does it work?
 
-On the lowest level, bits are stored in BitVectors (a uint32_t or uint64_t).
+On the lowest level stores bits in sm_bitvec_t's (a uint32_t or uint64_t).
 
-Each BitVector has an additional descriptor (2 bits). All descriptors are
-stored in a single Word which is prepended to the BitVectors. (The descriptor
-Word and the BitVectors have the same size.) The descriptor of a BitVector
-specifies whether the BitVector consists only of set bits ("1"), unset
+Each sm_bitvec_t has an additional descriptor (2 bits). A single word prepended
+to each sm_bitvec_t describes its condition. The descriptor word and the
+sm_bitvec_t's have the same size.) The descriptor of a sm_bitvec_t
+specifies whether the sm_bitvec_t consists only of set bits ("1"), unset
 bits ("0") or has a mixed payload. In the first and second case the
-BitVector is not stored.
+sm_bitvec_t is not stored.
 
-An example shows a sequence of 4 x 16 bits (here, each BitVector and the
+An example shows a sequence of 4 x 16 bits (here, each sm_bitvec_t and the
 Descriptor word has 16 bits):
 
     Descriptor:
         00 00 00 00 11 00 11 10
-        ^^ ^^ ^^ ^^-- BitVector #0 - #3 are "0000000000000000"
-                    ^^-- BitVector #4 is "1111111111111111"
-                       ^^-- BitVector #5 is "0000000000000000"
-                          ^^-- BitVector #7 is "1111111111111111"
-                             ^^-- BitVector #7 is "0110010101111001"
+        ^^ ^^ ^^ ^^-- sm_bitvec_t #0 - #3 are "0000000000000000"
+                    ^^-- sm_bitvec_t #4 is "1111111111111111"
+                       ^^-- sm_bitvec_t #5 is "0000000000000000"
+                          ^^-- sm_bitvec_t #7 is "1111111111111111"
+                             ^^-- sm_bitvec_t #7 is "0110010101111001"
 
-Since the first 7 BitVectors are either all "1" or "0" they are not stored.
+Since the first 7 sm_bitvec_t's are either all "1" or "0" they are not stored.
 The actual memory sequence looks like this:
 
     0000000011001110 0110010101111001
 
 Instead of storing 8 Words (16 bytes), we only store 2 Words (2 bytes): one
-for the Descriptor, one for last BitVector #7.
+for the descriptor, one for last sm_bitvec_t #7.
 
-Since such a construct (it's called a MiniMap) has a limited capacity, another
-structure is created on top of it, a `Sparsemap`. The Sparsemap stores a list
-of MiniMaps, and for each MiniMap it stores the absolute address. I.e. if
-the user sets bit 0 and bit 10000, and the MiniMap capacity is 2048, the
-Sparsemap creates two MiniMaps. The first starts at offset 0, the second starts
-at offset 8192.
+The sparsemap stores a list of chunk maps, and for each chunk map it stores the
+absolute address (i.e. if the user sets bit 0 and bit 10000, and the chunk map
+capacity is 2048, the sparsemap creates two chunk maps; the first starts at
+offset 0, the second starts at offset 8192).
 
 # Usage instructions
 
-The file `main.cc` has example code. Here is a small excerpt:
-
-    // we need a buffer to store the SparseMap
-    unsigned char buffer[1024];
-
-    sparsemap::SparseMap sm;
-    sm.create(buffer, sizeof(buffer));
-
-    // after initialization, the used size is just 4 bytes (sizeof(uint32_t))
-    assert(sm.get_size() == sizeof(uint32_t));
-
-    // set the first bit
-    sm.set(0, true);
-
-    // check that the first bit was set
-    assert(sm.is_set(0) == true);
-
-    // unset the first bit
-    assert(sm.is_set(1) == false);
-
-    // check that the first bit is now no longer set
-    sm.set(0, false);
+The file `examples/ex_1.c` has example code.
 
 ## Final words
 
-This bitmap implementation has very efficient compression when using on long
-sequences of set (or unset) bits. I.e. with a word size of 64bit, and a
-payload of consecutive numbers without gaps, the payload of 2048 x
-sizeof(uint64_t) = 16kb can be stored in just 8 bytes!
+This bitmap has efficient compression when used on long sequences of set (or
+unset) bits (i.e. with a word size of 64bit, and a payload of consecutive
+numbers without gaps, the payload of 2048 x sizeof(uint64_t) = 16kb will occupy
+only 8 bytes!
 
 However, if the sequence is not consecutive and has gaps, it's possible that
-the compression is completely inefficient, and the size basically is identical
-to an uncompressed bitvector (even higher because a few bytes are required for
+the compression is inefficient, and the size (in the worst case) is identical
+to an uncompressed bit vector (sometimes higher due to the bytes required for
 metadata). In such cases, other compression schemes are more efficient (i.e.
 http://lemire.me/blog/archives/2008/08/20/the-mythical-bitmap-index/).
 
 This library was originally created for hamsterdb [http://hamsterdb.com] in
-order to compress the key indices. For several technical reasons this turned
-out to be impossible, though. (If you're curious then feel free to drop
-me a mail.) I'm releasing it as open source, hoping that others can make good
-use of it.
-
+C++ and then translated to C99 code by Greg Burd .
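
Note on the descriptor scheme described in the README hunk above: the 16-bit example can be sketched in C99 roughly as follows. This is only an illustration of the README's diagram, not the library's code. The names `ex_pack`, `EX_ZEROS`, `EX_MIXED` and `EX_ONES` are hypothetical, the flag values (00, 10, 11) are read off the example, and placing vector #0 in the leftmost two bits of the descriptor is an assumption that follows the way the diagram is drawn.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag values, read off the 16-bit example in the README. */
enum { EX_ZEROS = 0x0, EX_MIXED = 0x2, EX_ONES = 0x3 };

/*
 * Packs eight 16-bit vectors into out[]: out[0] receives the descriptor
 * word, followed by every vector that is neither all zeros nor all ones.
 * Returns the number of 16-bit words written (between 1 and 9).
 */
static size_t ex_pack(const uint16_t vec[8], uint16_t out[9])
{
    uint16_t desc = 0;
    size_t n = 1; /* out[0] is reserved for the descriptor */

    assert(vec != NULL && out != NULL);

    for (int i = 0; i < 8; i++) {
        unsigned flag;

        if (vec[i] == 0x0000) {
            flag = EX_ZEROS;      /* uniform: not stored */
        } else if (vec[i] == 0xFFFF) {
            flag = EX_ONES;       /* uniform: not stored */
        } else {
            flag = EX_MIXED;      /* mixed payload: stored verbatim */
            out[n++] = vec[i];
        }
        /* Assumption: vector #0 takes the leftmost two bits, as drawn. */
        desc |= (uint16_t)(flag << (14 - 2 * i));
    }
    out[0] = desc;
    return n;
}
```

Fed the eight vectors from the README example (four all-zero, one all-one, one all-zero, one all-one, then `0x6579`), this sketch writes two words: the descriptor `0x00CE` (binary 0000000011001110) and the single mixed vector `0x6579` (binary 0110010101111001), which matches the memory sequence shown in the diagram.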