pcompress/Changelog

== 3.1 Bugfix Release ==
Avoid auto-selection variable chunking for buffer sizes below threshold.
Do not auto-select Global Dedupe for below threshold buffers.
Fix issue #18.
Do not try to generate a target filename in pipe mode.
Scale default compression buffer size for levels > 8.
Fix issue #17.
Use LZ4 and Libbsc extra padding space for the compression buffer in adaptive modes.
Fix issue #16.
Fix issue #15.
Fix issue #14
Updated fix for issue #12
Add check for correct Libarchive version.
Update header checks to abort if not found.

== 3.0 Major Beta Release ==
Basic capability to list contents of an archive without extracting to disk.
Bump version and update command help text.
Update README with additional option details.
Change nftw() to depth-first scan to handle restoring directory permissions correctly.
When sorting cause directories to be sorted after files and in descending order of nesting level.
Take out stray printf().
Detect some DICOM formats and use BSC for DICOM data.
Overhaul documentation part #1
Detect and handle uncompressed PDF files using libbsc.
Force binary/text data detection for tar archives.
Get rid of unnecessary CLI option.
Add full pipeline mode check when archiving.
Update to PackJPG 2.5i.
Fixes crash with malformed Jpeg.
Fix issue #12.
Fix issue #13.
Create output directory with correct mode.
Fix the flow where pathname list is not sorted.
Fix ppmd decompression bug introduced in previous commit.
Reduce compression level for automatic pathname sorting.
Change to extraction directory only after opening archive.
Free PPMD buffer after compression, rather than caching.
Introduce new API in allocator to release buffer to OS.
Release LZMA buffers after use.
Drastically reduce memory consumption of PPMD8 in adaptive mode (Use lower max model order).
Detect AR archives and set the type.
Re-use a less common type code for AR.
Use Dispack generically for all executables and AR archives.
Move MSDOS COM single-byte magic number checks to last in the list.
Move advanced options flag into context structure.
Include dtd files as text type.
Update PackJPG to version 2.5h.
Fix missing bounds checking in Delta2.
Check harder with more strides in Delta2 for extreme compression levels.
Make LibArchive filter process buffer more generic.
Include explicit CLI flags for PackJPG and Dispack.
Avoid auto-selection of filters if advanced options are specified.
Add more robust checks for Jpeg and packJPG format files in filter routine.
Use case-insensitive checks for extension names.
Enable more features based on compression level, when archiving.
Do not use Libbsc for TIFFs. Not all TIFFs compress well with Libbsc.
Fix DEBUG-STATS build for Dispack.
Use adapt2 as default compression in archive mode.
Add more filter auto-selection by compression level in archive mode.
Replace odd stride lengths in Delta2 with standard numeric type lengths and improve performance.
Remove fast path exit to allow compressing headers and zero paddings via LZ4.
Use libbsc for AVI and MP4 files.
Use Libbsc for MP4 and FLAC files.
Change some rare file type codes to indicate some common types.
Use Libbsc for TIFF images.
Workaround for packJPG limitation.
Remove external Libbsc option.
Add forked and optimized copy of LGPL version of Libbsc.
Strip out Sort Transform from Libbsc copy.
Reduce Libbsc memory use.
Avoid redundant adler32 of data block in Libbsc.
Use Libbsc for DNA Sequence data instead of PPMD. Faster, better compression.
Fix pz extension handling for real.
Avoid Delta2,LZP for TIFF files. Negatively impacts compression.
Use libbsc/ppmd for BMP files.
Fix extension based hashing.
Do not append .pz extension to filenames already having it.
Some code formatting changes.
Fix PackJPG library usage. PackJPG interface doc is incomplete, ugh!
Handle the case where PackJPG expands the file rather than compressing.
Fix Dispack decoding.
Add Dispack filter with auto-detection of x86 executables in archive mode.
More elaborate magic header based detection of 32-bit and 64-bit x86 binaries.
Always use fast-mode LZ4 in Adaptive modes.
Optimize preprocessed compression and avoid a bunch of memory copies.
Fix a crash.
Add a few more file types.
More comments.
Fix fd leak.
Fix issues when handling Jpegs where packJPG borks.
Add fast handling of totally incompressible data (like Jpegs) in adaptive modes.
Add function to indicate totally incompressible data when archiving.
Reformat if statements in some places to reduce branching.
Enable auto-filtering of archive entries based on compression level.
Miscellaneous fixes.
Work in progress changes for packJPG encoding and decoding.
Enhance custom LibArchive filter functionlity.
Add basic framework for file type based filters during libarchive stage.
Add packJPG filter for Jpeg files (not active yet).
Directory format changes for clarity.
Add own implementation of archive entry extraction to allow custom filters.
Fix magic number check for endianness.
Structured handling of file types.
Handling of already compressed data based on compression algorithm.
Add a few more extension types.
Leverage file type detection(archiver) to improve compression performance.
Use detected file/data type(archiver) for Adaptive compression modes.
Update type flags and add more extensions.
Add file type detection based on magic values.
Add extension based file type detection and setting segment data type.
Use Bob Jenkins Minimal Perfect Hash to check for known extensions.
Use semaphore signaling and direct buffer copy for extraction.
Clear off private xattrs when extracting.
Enable pathname sorting only for high compression levels.
Replace slow pipe with direct memory copy for archive extraction.
Avoid using pipe during archive creation. Use semaphores and direct memory copy.
Use mmap to read from the pathlist file for performance.
Functionality to sort pathnames based on file extension and size.
More changes for archiving.
Allow multiple filenames on command line when archiving.
Remove unneded small block writes with libarchive.
Make log_msg() add newline by default.
Archiving support using Libarchive.
Change all perror() calls to use logger.
Make the config script a little verbose.
Ability to specify output compressed pathname.
Fix log level handling.
Add a simple log facility.
Refactor all printfs to use log facility.
Cleanup pointer casting in code to use macros.

== 2.4 Bugfix Release ==
Add identifiers to error messages for clarity.
Fix init of dedupe block size.
Tweak free memory detection to include swap and shared memory consideration.
Fix incorrect chunk size initialization from a previous commit.
Remove confusing option with little practical utility.
Update test cases and documentation.
Additional error checks in RLE encoding for bsdiff extra data.
Add a buffer overflow check in RLE encoder.
Avoid calling RLE encoding if extra data length is zero.
Make 2KB block size default for non-global deduplication.
Update test cases for new 2KB block size support.
Truncate password file after zeroing.
Add more example usage.
Avoid calling compression routine when dedupe reduces data size to zero.
Default compression level only when compressing.
Fix issue #11.
Increase default chunk size to 8MB.
Use default compression level of 1 (fast mode) for LZ4.
Support for deduplication using 2KB block size.
Add basic file format documentation.
Reduce memory threshold for switching to Similarity based Deduplication.
Avoid unnecessary re-hashing of 64-bit keys of the segment index.
Update free memory computation to include cached buffers.
Fix a potential rare corner case.

== 2.3 Update Release ==
Fix multiple crashes for some corner cases.
Increase max block size for variable dedup block sizes greater than 16KB.
Update test cases and fix a test script bug.
Use wrapper script to set paths when launching pcompress from build directory.
Use smaller max block size when doing global dedupe.
Fix init of executable name.
Make data partitioning between threads more effective.
Remove unnecessary computation to make Fixed block chunking faster.
Fix Dedupe Mode initialization.
Improve accuracy of the KMV sketch computation and speed it up.
Reduce similarity indicators to reduce memory use with low impact on dedupe ratio.
Add missing init of rabin block size.
Reduce a rollign hash parameter for a slight speedup with no side effect.
Update README with pointers to relevant analysis and documentation.
Remove an rpath entry meant for testing.
Make default symbol visibility to hidden with explicit public visibility specified.
Add missing static scope to a few more places.
Make Pcompress functionality into a library - initial changes.

== 2.2 Bugfix Release ==
Update invalid environment variable handling to actually fail rather than auto-correct.
Fix a crash with invalid PCOMPRESS_CHUNK_HASH_GLOBAL.
Update testcase to correctly detect core files.
Drastic simplification of Min-heap code and resultant Delta speedup.
Correspond segment size to chunk size for Segmented Dedupe for better accuracy.
Add a testcase for issue #10.
Fix issue #10.
Fix issue #8.

== 2.1 Update Release ==
Add more tests covering Segmented Global Dedupe.
Fix some tests.
Switch location of Dedupe context creation to allow correct index memory sizing.
Update README with details of Global Dedupe block hash selection.
Add SSE2 optimizations for Segmented Dedupe.
Fix segment offset sorting.
Get rid of incorrect duplicate checks in index.
Allow SKEIN to be used as a Global Dedupe chunk lookup hash.
Add a qsort variant optimized for integers and use in global dedupe.
Cleanup LZMA CRC64/32 declarations and add a header.
Fix heapq header.
Use openmp parallelism always for chunk hash computation during Global Dedupe.
Use SHA256 for Global Dedupe chunk lookup hash by default.
Allow changing Global Dedupe chunk lookup hash via env variable.
Fix crash with some older GCC versions. Reported in issue #7.
Fix issue #7.
Ensure tempfile cleanup even with error abort.
Fix bugs and improve accuracy in Segmented Dedupe.
Fix segment hashlist size computation.
Remove unnecessary sync of segment hashlist file writes.
Pass correct number of threads to index creation routine.
Add more error checks.
Handle correct positioning of segment hashlist file offset on write error.
Add missing semaphore signaling at dedupe abort points with global dedupe.
Use closer min-values sampling for improved segmented dedupe accuracy.
Update proper checksum info in README.
Fix sizing of similarity hash buffer.
Tweak index size computation.

== 2.0 Major Release ==
Add test cases for Global Deduplication.
Update documentation and code comments.
Remove tempfile pathname after creation to ensure clean removal after process exit.
Simplify segment lookup loop.
Fix assertion.
Change Segmented Dedupe flow to improve parallelism.
Periodically sync writes to segcache file.
Use simple insertion sort for small numbers of elements.
Capability to output data to stdout when compressing.
Always use segmented similarity bases dedupe when using -G option in pipe mode.
Standardize on average 8MB segment size for segmented dedupe.
Fix hashtable sizing.
Some miscellaneous cleanups.
Update README with details of new features.
Optimize index lookup for 8-byte keys.
More cleanups.
More tweaks to slightly improve segment dedupe efficiency.
Some minor cleanps.
Improve segment similarity detection and drastically reduce index size.
Improve duplicate segment match detection.
Tweak percentage intervals computation to improve segmented dedupe ratio.
Avoid repeat processing of already processed segments.
Clean up temp cache dir handling.
Allow temp dir setting via specific env variable to point to fast devices like ramdisk,ssd.
Several bugfixes.
Avoid matching with self during hash lookup.
Several fixes and optimizations.
Many optimizations and changes to Segmented Global Dedupe.
Use chunk hash based similarity matching rather than content based.
Use sorting to order hash buffer rather than min-heap for better accuracy.
Use fast CRC64 for similarity hash for speed and lower memory requirements.
Many optimizations to segmented global dedupe.
Use chunk hash based cumulative similarity matching instead of chunk content.
Update usage text and add minor tweaks.
Properly cleanup global dedupe state.
Complete implementation for Segmented Global Deduplication.
Work in progress changes for Segmented Global Deduplication.
Work in progress changes for Segmented Global Deduplication.
Work in progress changes for scalable segmented global deduplication.
Allow user-specified environment setting to control in-memory index size.
Implement global dedupe in pipe mode.
Update hash index calculations to use upto 75% memavail when file size is not known.
Use little-endian nonce format for Salsa20.
Add global index cleanup function.
Fix location of sem_wait().
More comments.
Add check to disable Delta Compression with Global deduplication for now.
Implement Global Deduplication.

== 1.4.0 Update Release ==
Update couple more test parameters with new crypto options.
Update README and test cases with new crypto options.
Update usage text.
Cleanup more stack parameters after use in various crypto functions.
Fix increment of XSalsa20 192-bit nonce value.
Handle nonce bytes in endian neutral way.
Use 128-bit key length when decompressing older version archives.
Add XSalsa20 encryption algorithm from the NaCL library.
Include 128-bit key support based on the Salsa20 eSTREAM submission.
Allow variable-length nonces.
Use random bytes for initial nonce value.
Increase PBE hash rounds to 50000.
Move Scrypt helper function out of AES module.
Change default encryption key length to 256 bits.
Add optional ability to change key length at runtime via cli option.
Include key length property in archive header.
Fix header HMAC to include salt, nonce and key length properties.
Retain backward compatibility to handle older format archives.
Fix compilation of AES ASM code.
Add AES-NI optimized code derived from latest OpenSSL upstream.
Add AES instruction set detection.
Add Vector Permute AES from OpenSSL 1.0.1e. Remain compatible with older OpenSSL versions.
Add compatibility to decode old-format parallel hashes created with version 1.2.
Bump archive version to 7 as parallel hashes are now merkle style hashes.
Add lookup and insert functionality for global index.
Use PPMd fallback for adapt2 if BSC is not enabled.
Improve XML detection in adaptive mode.
Make global dedupe bits buildable and fix errors.
Rename Adaptive compression type constants to avoid conflict with global constants.
Work in progress global dedupe.
Fixes for issues/warnings reported in issue #4.
Global dedupe work in progress.
Use OpenMP parallelism when computing xxHashes for chunks.
Use 2-stage Merkle Tree hashing for parallel hashes for better crypto properties.
Update xxhash comment.
Reduce dedupe hash table collisions by half.
Improve Deduplication performance by another 95%.
Start sliding window scanning near minimum chunk size boundaries to avoid scanning whole chunk.

== 1.3.0 Major Performance Release ==
Fix return from parallel versions of Keccak_Hash() function.
Add parallel versions of various checksums for single-segment, single-thread compression.
Use BLAKE2 parallel version for single-chunk archives (whole file in one chunk).
Set decompression threads correctly for single-chunk archives.
Update default checksum to BLAKE256.
Add optimized BLAKE2 implementations with runtime detection of CPU capability (SSE/AVX).
Major changes to use Intel's optimized SHA512 code for SHA512 and SHA512/256.
Remove earlier SHA256 code which is slower than SHA512/256 (on 64-bit CPU).
Use HMAC from Alan Saddi's implementation for cleaner, faster code.
Changes for generalized runtime SSE/AVX/XOP detection.
Multi instruction set XXhash build with runtime selection.
Extend CPUID code to detect more instruction sets.
Add options for BLAKE2 hash.
Move GCC builtins into utils header.
Bump file format version number due to extended digest flags.
Add descriptions to digest list.
Rationalize XXHash implementation to deal with 32-byte blocks instead of 16-byte.
Fix XXHash performance degradation for small keys.
Modify a data analysis loop in adaptive compress to make it auto-vectorizable.
Improve Deduplication throughtput by 90%.
Use SSE4 register as sliding window for default 16-byte window size.
Use local variable for sliding window position to avoid spurios memory access in non-SIMD case.
Avoid computing breakpoint check value if processed length < minimum block length.
Improve SSE version detection.
Add SSE4 detection.
Fix setting of some opt flags in Makefile.in.
Many optimization tweaks
Optimize Rabin Deduplication and Bsdiff
Vectorize XXHash using SSE4
Add SSE2 improvements to CTR mode AES.
Add debug print of encryption and HMAC throughput.
Fix error message for invalid option.
Implement algo-specific minimum distance match for Delta Compression.
Fixes and performance improvements for Dedupe Delta Compression
Avoid using fingerprints in minhash computation and fix write amplification
Modify min-heap to use 64bit values
Improve bsdiff performance
Fix pointer comparison in bsdiff
Use 32bit offsets in bsdiff to reduce memory usage
Improve Zero RLE Encoder performance
Add more buffer overflow checks in Zero RLE Decoder
Use SSE3 lddqu in the matchfinder if SSE3 is enabled.
Remove outdated LZP note.

== 1.2.0 Major Stable Release ==
Fix issue #1 and issue #2.
Enable building with older openssl (at least 0.9.8e).
Add variants of missing functions in older openssl versions.
Allow proper linking with libraries in alternate locations and setting RUNPATH.
Increase hash rounds for nonce generation.
Fix calculation of extra scratch space for Dedupe.
Add missing return value check in small buffer Delta2.
Update test failure detection in test driver.
Fix numeric parsing.
Fix dedupe bug introduced in last commit.
Reset valid flag when resetting dedupe context.
Cleanup test suites.
Do not abort test suite on failure of a test case.
Fix Delta2 handling for small buffers.
Fix LZP handling during preprocessing.
Fix type flag handling during preprocessing.
Update test cases to use configurable list of test corpus files.
Update some more int64_t datatypes to uint64_t
Update to latest XXHash version.
Fix Keccak invocation and reset default checksum back to SKEIN 256.
Improve Dedupe performance.
Fix compiler warning in allocator code.
Major enhancements to Delta2 encoding.
Fix the byteswap macros.
Start adding assertions.
Introduce strict compiler flags and fix scores of warnings/issues.
Avoid different optimization flags for Dedupe sources.
Fix liberal mixing of uint64_t and int64_t (should all be uint64_t).
Fix corner case crash when decompressing.
Fix Delta2 decode handling when not compressing.
More debug statements and other minor changes.
Avoid transposing below-threshold spans. Reduces compression ratio.
Use little-endian storage format for numbers to optimize for x86.
Improve embedded table detection.
Reduce Delta2 header sizes.
Fix preprocessing behavior when LZP does not compress but Delta2 works.
Improve Delta2 scanning speed and effectiveness.
Add destination buffer overflow check in Delta2.
Add rough speed computation.
More tweaks to Delta2 implementation.
Ability to run individual test suites from makefile.
Silence more Gcc warnings.
Update testing note.
Improve 64-bit compiler and platform checks.
Ensure core dumps are enabled during testing.
Enable building with alternate Zlib and Bzlib.
Fix correct setting of output size when using Delta2 without LZP.
README formatting.
Make Delta2 encoding independent of LZP.
Tweak Delta2 parameters.
Update README and test cases.
Update README to align with current features/behavior.
Fine tune transpose parameters.
Fix minor nits.
Change confusing structure member name.
Add Matrix Transpose of Dedupe index to compress it better.
Fix handling of Dedupe index compression failure.
Ensure intermediate file cleanup in tests.
Update to latest LZ4.
Get rid of size_t in places where 64-bitness is assumed.
Add support for 64-bit Keccak implementation.
Sanitize error message from tests.
Add more tests.
Improve platform detection in config script.
Add adaptive delta encoding test.
Implement Adaptive Delta Encoding (Delta2).
Work in progress global dedupe config setup.

== 1.1.0 Stable Release ==
Fix building without Libbsc support.
Add more tests for corrupted encrypted files.
Fix thread error reporting.
Update error condition tests to not truncate archive.
Fix issues with error handling.
Add new tests for out of range values and corrupted file.
Avoid CRC32 when decompressing older version archives.
Use HMAC on header and encrypted data, avoid regular digest when encrypting.
Add CRC32 header checksums in non-cryptographic mode.
Zlib optimizations. Use raw deflate streams to avoid unnecessary adler32.
Change some function signatures to improve algo init function behavior.
Fix corner case dedupe bug in error handling flow.
Bump archive version signature.
Work in progress global dedupe config loader.
Use fixed rolling-hash mask for better block size approximation.

== 1.0.0 Stable Release ==
Fix chunk flag setup when compression fails in adaptive mode.
Prevent display of non-fatal errors during compression.
Add buffer overflow check in Ppmd compression routines.
Fix pipe mode encryption check.
Change file difference check in tests.
Fix encryption with adaptive modes.
Add missing zero-out of algorithm data field.
Add a functionality test suite.
Tweak chunking parameters for better block size distribution and dedupe ratio.
Add some more debug mode info.
Minor fix for adapt mode.
Use Libbsc for XML data in adapt2 mode.

== 0.9.1 Minor update Release ==
Portability to Debian based distros.
Enable SSE4/AVX detection for AMD platforms (Bulldozer has both).
Portable long long int print formatting to silence gcc 4.6 warnings.
Add another example invocation.

== 0.9.0 Major Release ==
Implement HMAC verification for chunk header.
Do not ask for decryption passwd twice.
Refactor crypto utility functions in a separate file.
Add HMAC verification for file header.
Add a few extra input validation checks.
Include space in chunk header for HMAC.
Support for chunk-level encryption using AES and Scrypt based PBE.
Couple of minor fixes.
Incorporate SSE/AVX optimized Intel SHA-256 implementation.
Add support for runtime cpuid detection.
Add support for SHA256 and SHA512 digests from OpenSSL library.

== 0.8.6 Update Release ==
Fix polynomial computation.
Fix incorrect block length when doing fixed-block dedupe.
Remove unused structure member.
Switch to multiplicative rolling hash for good

== 0.8.5 Update Release ==
Update adaptive mode heuristic based on algorithms.
Remove incorrect check in PPMd decompression code.
Speed up adaptive modes by using heuristics to select compression algorithm.
Select similarity percentage based on dedupe block size for effectiveness.
Fix check for size reduction in dedupe.
Fix polynomial table computation.
Change hashing and length bias to reduce hashtable bucket collisions.
Add support for user-selectable 60% or 40% similarity for Delta Compression.
Overall slight speedup.
Rewrite core dedupe logic to simplify code and improve performance.
Hashtable based chunk-level deduplication instead of Quicksort.
Fix a corner case bug in Dedupe decompression.
Speed up Hash computation for dedupe blocks.
Add support for Fixed-Block deduplication.
Reduce dedupe loop checks for slight speed edge.
Implement K-min-values Sketch for Similarity detection.

== 0.8.1 Bugfix Release ==
Fix return code handling in LZP pre-compression, crashed adaptive modes.
Allow user-specified minimum Dedupe block size.
Compute similarity sketch only if Delta Compression enabled.
Fix secondary sketch computation, some more accuracy in diff detection.
Add xxHash for Rabin block checksums, slightly faster than CRC64.
Fix missing initialization of character counts table.
Some file reorganization.
Add ASM version of Skein for x64 platforms with auto-detection
Error checking for checksum flag when decompressing
Add support for Skein512 and Skein256 checksums
Import Skein code from NIST CD submission
Make checksum algorithms pluggable
Fix handling of huge buffers (>2GB) in LZP
Cleanup of some buffer sizing code

== 0.8 Beta release ==
Add support for libbsc a high-performance block sorting compressor.
Enable external algorithm threading for single chunk compressed files.
Add config script to generate Makefile based on flags.
Add install target and installation readme file.
Improve memory efficiency when total file size < total size of chunks.
Fix freeing of Zlib structures.
Add LZP Pre-Compression support ported from libbsc.
Add generic pre-processing wrappers for future support of other pre-processors.
Clean up computation of Rabin block sizes.
Compute Rabin scratch space accurately to avoid RAM wastage.
Delay allocation of per-thread chunks for performance and memory efficiency.
Avoid allocating double-buffer for single-chunk files.
Introduce lzmaMt option to indicate multithreaded LZMA.
Add multithreaded LZMA port from p7zip
Compute balanced thread count between chunk threads and algo threads
Generic way to handle querying algorithm parameters
Clean up unnecessary includes
Speed up sort comparator function.
Reduce memory consumption and improve performance in dedupe
Re-introduce crc64 for dedup blocks to avoid wasted memcpy-s
Restructe block array to be an array of pointers allocated on demand
Fix a corner case issue when splitting chunks at a dedup boundary
Improve Rabin computations using an irreducible polynomial
Slight improvement to similarity computation
A simple mechanism to include DEBUG mode stats
Implement secondary sketch based on character counts to refine similarity checksum.
Proper checksum update for last block.

== 0.7 Beta release ==
Bump version to 0.7
Fix handling of compression flags in adaptive mode
Fix error handling when chunk size is too small for dedupe
Fix initialization of adaptive modes.
Compute and compare Mean sketch cksum to improve similarity comparison
Fix optflags settings in Makefile
Small optimization in zero RLE encoder to avoid scanning during lookahead
Some minor fixes
Bias fingerprint value with occurrence counts for a better sketch
Fix latent bug when calling algo deinit in decompression code
Reduce diff threshold for slightly greater delta encoding
Limit similar buffer size difference for less wasted diffing
Change zlib compression wrapper to use faster deflateReset mechanism
Reduce optimization level for Dedupe code, it goes faster
Fix handling of incompressible chunks.
Fix handling of various dedup failures.
Add NULL compression option for dedup only compression.
Remove unneeded checks in qsort comparator.
Fix buffer sizing for LZ4.
Fix exit condition checks in LZ4 decompression wrapper.
Fix buffer size calculation when decompressing LZ4, Zlib and Bzip2 compressed chunks.
Slight SSE optimization in LZ4HC.

== 0.6 Alpha bugfix release ==
Fix crash when algo init function returns error.
Fix LZFX error handling.
More updates to README.
Further improve LZMA compression parameters to utilize all the 14 levels.
Tweak some Rabin parmeters for better reduction with zlib and Bzip2.
Increase the small size slabs a bit.
Fix slab sizing.
Fix buffer size computation when allocating Rabin block array.
Reduce memory usage of Rabin block array.
Add an SSE optimization for bsdiff.