updates

2024-04-04 20:11:09 -04:00 · 2024-04-04 20:11:09 -04:00 · 6b87651b58
commit 6b87651b58
parent 6ffa748974
2 changed files with 27 additions and 55 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -2,7 +2,7 @@
 *.so
 **/*.o
 tests/test
-examples/ex_??
+examples/ex_*
 .cache
 hints.txt
 tmp/
--- a/README.md
+++ b/README.md
@@ -1,4 +1,3 @@
-
 `sparsemap` is a sparse, compressed bitmap. In the best case, it can store 2048
 bits in just 8 bytes. In the worst case, it stores the 2048 bits uncompressed and
 requires an additional 8 bytes of overhead.
@@ -9,82 +8,55 @@ integers then sparsemap can compress up to 16kb in just 8 bytes.

 ## How does it work?

-On the lowest level, bits are stored in BitVectors (a uint32_t or uint64_t).
+On the lowest level, sparsemap stores bits in sm_bitvec_t's (a uint32_t or uint64_t).

-Each BitVector has an additional descriptor (2 bits). All descriptors are
-stored in a single Word which is prepended to the BitVectors. (The descriptor
-Word and the BitVectors have the same size.) The descriptor of a BitVector
-specifies whether the BitVector consists only of set bits ("1"), unset
+Each sm_bitvec_t has an additional descriptor (2 bits). All descriptors are
+packed into a single word that is prepended to the sm_bitvec_t's. (The
+descriptor word and the sm_bitvec_t's have the same size.) The descriptor of an
+sm_bitvec_t specifies whether it consists only of set bits ("1"), unset
 bits ("0") or has a mixed payload. In the first and second case the
-BitVector is not stored.
+sm_bitvec_t is not stored.

-An example shows a sequence of 4 x 16 bits (here, each BitVector and the
+The following example shows a sequence of 8 x 16 bits (here, each sm_bitvec_t and the
 Descriptor word has 16 bits):

      Descriptor:
      00 00 00 00 11 00 11 10
-      ^^ ^^ ^^ ^^-- BitVector #0 - #3 are "0000000000000000"
-                  ^^-- BitVector #4 is "1111111111111111"
-                     ^^-- BitVector #5 is "0000000000000000"
-                        ^^-- BitVector #7 is "1111111111111111"
-                           ^^-- BitVector #7 is "0110010101111001"
+      ^^ ^^ ^^ ^^-- sm_bitvec_t #0 - #3 are "0000000000000000"
+                  ^^-- sm_bitvec_t #4 is "1111111111111111"
+                     ^^-- sm_bitvec_t #5 is "0000000000000000"
+                        ^^-- sm_bitvec_t #6 is "1111111111111111"
+                           ^^-- sm_bitvec_t #7 is "0110010101111001"

-Since the first 7 BitVectors are either all "1" or "0" they are not stored.
+Since the first 7 sm_bitvec_t's are either all "1" or "0" they are not stored.
 The actual memory sequence looks like this:

      0000000011001110 0110010101111001

 Instead of storing 8 Words (16 bytes), we only store 2 Words (4 bytes): one
-for the Descriptor, one for last BitVector #7.
+for the descriptor, one for the last sm_bitvec_t, #7.

-Since such a construct (it's called a MiniMap) has a limited capacity, another
-structure is created on top of it, a `Sparsemap`. The Sparsemap stores a list
-of MiniMaps, and for each MiniMap it stores the absolute address. I.e. if
-the user sets bit 0 and bit 10000, and the MiniMap capacity is 2048, the
-Sparsemap creates two MiniMaps. The first starts at offset 0, the second starts
-at offset 8192.
+The sparsemap stores a list of chunk maps, and for each chunk map it stores the
+absolute address. For example, if the user sets bit 0 and bit 10000, and the
+chunk map capacity is 2048, the sparsemap creates two chunk maps: the first
+starts at offset 0, the second starts at offset 8192.

 # Usage instructions

-The file `main.cc` has example code. Here is a small excerpt:
-
-    // we need a buffer to store the SparseMap
-    unsigned char buffer[1024];
-
-    sparsemap::SparseMap<uint32_t, uint64_t> sm;
-    sm.create(buffer, sizeof(buffer));
-
-    // after initialization, the used size is just 4 bytes (sizeof(uint32_t))
-    assert(sm.get_size() == sizeof(uint32_t));
-
-    // set the first bit
-    sm.set(0, true);
-
-    // check that the first bit was set
-    assert(sm.is_set(0) == true);
-
-    // unset the first bit
-    assert(sm.is_set(1) == false);
-
-    // check that the first bit is now no longer set
-    sm.set(0, false);
+The file `examples/ex_1.c` has example code.

 ## Final words

-This bitmap implementation has very efficient compression when using on long
-sequences of set (or unset) bits. I.e. with a word size of 64bit, and a
-payload of consecutive numbers without gaps, the payload of 2048 x
-sizeof(uint64_t) = 16kb can be stored in just 8 bytes!
+This bitmap compresses efficiently when used on long sequences of set (or
+unset) bits (e.g. with a word size of 64 bits and a payload of consecutive
+numbers without gaps, the payload of 2048 x sizeof(uint64_t) = 16kb will occupy
+only 8 bytes).

 However, if the sequence is not consecutive and has gaps, it's possible that
-the compression is completely inefficient, and the size basically is identical
-to an uncompressed bitvector (even higher because a few bytes are required for
+the compression is inefficient, and the size (in the worst case) is identical
+to an uncompressed bit vector (sometimes higher due to the bytes required for
 metadata). In such cases, other compression schemes are more efficient (e.g.
 http://lemire.me/blog/archives/2008/08/20/the-mythical-bitmap-index/).

 This library was originally created for hamsterdb [http://hamsterdb.com] in
-order to compress the key indices. For several technical reasons this turned
-out to be impossible, though. (If you're curious then feel free to drop
-me a mail.) I'm releasing it as open source, hoping that others can make good
-use of it.
-
+C++ and then translated to C99 code by Greg Burd <greg@burd.me>.
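
Reviewer note: the descriptor scheme described in the README hunk above can be sketched in C. This is a minimal illustration, not the library's code. The names `encode_chunk`, `chunk_start`, and `DESC_*` are hypothetical, the 2-bit codes (00, 11, 10) are read off the README example, and the bit order follows the README drawing (vector #0 in the two most significant bits), which may differ from the in-memory layout sparsemap actually uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 2-bit descriptor codes, read off the README example:
 * 00 = all bits unset, 11 = all bits set, 10 = mixed payload (stored). */
enum { DESC_ZEROS = 0x0, DESC_MIXED = 0x2, DESC_ONES = 0x3 };

/* Encode up to 8 vectors of 16 bits each (the sizes used in the README
 * example). out must have room for n + 1 words. out[0] receives the
 * descriptor word; only mixed vectors are appended after it.
 * Returns the number of 16-bit words written. */
static size_t encode_chunk(const uint16_t *vecs, size_t n, uint16_t *out)
{
    assert(n <= 8); /* a 16-bit descriptor word holds 8 two-bit entries */

    uint16_t desc = 0;
    size_t used = 1; /* slot 0 is reserved for the descriptor word */

    for (size_t i = 0; i < n; i++) {
        unsigned code;
        if (vecs[i] == 0x0000) {
            code = DESC_ZEROS;        /* not stored */
        } else if (vecs[i] == 0xFFFF) {
            code = DESC_ONES;         /* not stored */
        } else {
            code = DESC_MIXED;
            out[used++] = vecs[i];    /* only mixed payloads take space */
        }
        /* Assumption: vector #0 sits in the two most significant bits,
         * matching the left-to-right drawing in the README. */
        desc = (uint16_t)(desc | (code << (14 - 2 * i)));
    }

    out[0] = desc;
    return used;
}

/* Start offset of the chunk map that would contain a given bit, assuming
 * chunk maps are aligned to multiples of their capacity. */
static size_t chunk_start(size_t bit, size_t capacity)
{
    return bit - (bit % capacity);
}
```

With the eight vectors from the README example, `encode_chunk` writes two words, the descriptor `0x00CE` (`0000000011001110`) and the payload `0x6579` (`0110010101111001`), and `chunk_start(10000, 2048)` gives 8192.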