Commit graph

141 commits

Author SHA1 Message Date
Moinak Ghosh
3d7a179a77 Work in progress changes for scalable segmented global deduplication.
Allow user-specified environment setting to control in-memory index size.
2013-04-06 15:15:27 +05:30
Moinak Ghosh
c357452079 Implement global dedupe in pipe mode.
Update hash index calculations to use upto 75% memavail when file size is not known.
Use little-endian nonce format for Salsa20.
2013-03-29 15:18:25 +05:30
Moinak Ghosh
19b304f30c Add global index cleanup function.
Fix location of sem_wait().
More comments.
2013-03-25 21:04:16 +05:30
Moinak Ghosh
fbf4658635 Implement Global Deduplication. 2013-03-24 23:21:17 +05:30
Moinak Ghosh
876796be5c Work in progress changes for global dedupe. 2013-03-21 22:00:38 +05:30
Moinak Ghosh
b7fdeb08bc Work in progress global dedupe changes. 2013-03-20 22:47:03 +05:30
Moinak Ghosh
f2806d4ffa Work in progress global dedupe changes. 2013-03-19 20:13:44 +05:30
Moinak Ghosh
f8f23e5200 Major License text cleanup. 2013-03-07 20:26:48 +05:30
Moinak Ghosh
e16b408061 Move Scrypt helper function out of AES module.
Fix a compiler warning.
2013-03-03 21:55:59 +05:30
Moinak Ghosh
5f6217bb1f Add lookup and insert functionality for global index.
Make global dedupe code buildable.
2013-02-21 23:07:07 +05:30
Moinak Ghosh
6badbcaea7 Make global dedupe bits buildable and fix errors.
Rename Adaptive compression type constants to avoid conflict with global constants.
2013-02-17 21:05:40 +05:30
Moinak Ghosh
7386f82a4f Work in progress global dedupe. 2013-02-16 23:33:06 +05:30
Moinak Ghosh
f89473d29c Fixes for issues/warnings reported in issue #4. 2013-02-15 22:53:17 +05:30
Moinak Ghosh
24d62bfde9 Global dedupe work in progress. 2013-02-14 23:10:53 +05:30
Moinak Ghosh
1eae57c8a2 Starting changes for single-file global dedupe. 2013-02-12 21:53:04 +05:30
Moinak Ghosh
3e1737b4ab Use OpenMP parallelism when computing xxHashes for chunks. 2013-02-02 09:27:58 +05:30
Moinak Ghosh
af4c6e1d84 Reduce dedupe hash table collisions by half. 2013-01-31 00:38:41 +05:30
Moinak Ghosh
3d8f3ada1c Improve Deduplication performance by another 95%.
Start sliding window scanning near minimum chunk size boundaries to avoid scanning whole chunk.
2013-01-30 22:41:13 +05:30
Moinak Ghosh
5c8704c5bb Improve Deduplication throughtput by 90%.
Use SSE4 register as sliding window for default 16-byte window size.
Use local variable for sliding window position to avoid spurios memory access in non-SIMD case.
Avoid computing breakpoint check value if processed length < minimum block length.
2013-01-22 15:54:42 +05:30
Moinak Ghosh
3888c8d316 Many optimization tweaks
Optimize Rabin Deduplication and Bsdiff
Vectorize XXHash using SE4
2013-01-20 22:02:26 +05:30
Moinak Ghosh
455c8107d5 Use pre-increment for shorter instruction length and slight speed. 2013-01-17 22:54:30 +05:30
Moinak Ghosh
39dbc4be43 Implement algo-specific minimum distance match for Delta Compression. 2013-01-14 13:20:07 +05:30
Moinak Ghosh
d49a088eea Fixes and performance improvements for Dedupe Delta Compression
Avoid using fingerprints in minhash computation and fix write amplification
Modify min-heap to use 64bit values
Improve bsdiff performance
Fix pointer comparison in bsdiff
Use 32bit offsets in bsdiff to reduce memory usage
Improve Zero RLE Encoder performance
Add more buffer overflow checks in Zero RLE Decoder
2013-01-13 22:04:59 +05:30
Moinak Ghosh
d9eb82e0e8 Fix numeric parsing.
Fix dedupe bug introduced in last commit.
Reset valid flag when resetting dedupe context.
Cleanup test suites.
Do not abort test suite on failure of a test case.
2013-01-03 00:27:18 +05:30
Moinak Ghosh
13d9378acd Update to latest XXHash version. 2012-12-31 11:53:47 +05:30
Moinak Ghosh
28224d29d3 Improve Dedupe performance.
Add more debug timing stats.
Change default checksum to Keccak 256 (SIMD version 4x faster than Skein).
Fix compiler warning in allocator code.
2012-12-29 23:43:41 +05:30
Moinak Ghosh
26a4f42506 Introduce strict compiler flags and fix scores of warnings/issues.
Avoid different optimization flags for Dedupe sources.
Fix liberal mixing of uint64_t and int64_t (should all be uint64_t).
Fix corner case crash when decompressing.
2012-12-27 23:06:48 +05:30
Moinak Ghosh
a43fdd7d2c Improve Delta2 scanning speed and effectiveness.
Add destination buffer overflow check in Delta2.
Add rough speed computation.
2012-12-23 00:44:56 +05:30
Moinak Ghosh
224fb529e9 Get rid of size_t in places where 64-bitness is assumed. 2012-12-09 10:15:06 +05:30
Moinak Ghosh
6c3173f929 Work in progress global dedupe config setup. 2012-11-29 22:28:50 +05:30
Moinak Ghosh
d250322490 Fix issues with error handling.
Add new tests for out of range values and corrupted file.
2012-11-24 23:53:07 +05:30
Moinak Ghosh
d054e0f713 Zlib optimizations. Use raw deflate streams to avoid unnecessary adler32.
Change some function signatures to improve algo init function behavior.
Fix corner case dedupe bug in error handling flow.
Bump archive version signature.
2012-11-22 21:02:50 +05:30
Moinak Ghosh
3b1d6b55fe Work in progress global dedupe config loader. 2012-11-19 21:41:56 +05:30
Moinak Ghosh
393ced991a A couple of minor cleanups. 2012-11-18 20:20:16 +05:30
Moinak Ghosh
b2cbf0699e Use fixed rolling-hash mask for better block size approximation. 2012-11-11 14:58:39 +05:30
Moinak Ghosh
eacbf207aa Tweak chunking parameters for better block size distribution and dedupe ratio. 2012-11-08 19:41:33 +05:30
Moinak Ghosh
e437390e53 Add some more debug mode info. 2012-11-07 22:55:14 +05:30
Moinak Ghosh
c5ebe1f30a Portability to Debian based distros.
Enable SSE4/AVX detection for AMD platforms (Bulldozer has both).
Portable long long int print formatting to silence gcc 4.6 warnings.
2012-10-21 21:03:07 +05:30
Moinak Ghosh
9475ccc3d6 Fix polynomial computation.
Fix incorrect block length when doing fixed-block dedupe.
2012-10-02 00:20:44 +05:30
Moinak Ghosh
0019efbadb Adjust break pattern check mask for closer approximation to average block size.
Remove unused structure member.
2012-09-29 23:31:45 +05:30
Moinak Ghosh
24e6f4e629 Switch to multiplicative rolling hash for good distribution properties. 2012-09-29 00:09:49 +05:30
Moinak Ghosh
8f8af7ed6b Update adaptive mode heuristic based on algorithms.
Remove incorrect check in PPMd decompression code.
More refactoring of variable names.
2012-09-27 22:29:08 +05:30
Moinak Ghosh
449dc35675 Speed up adaptive modes by using heuristics to select compression algorithm.
Select similarity percentage based on dedupe block size for effectiveness.
2012-09-26 19:47:32 +05:30
Moinak Ghosh
333b7b011e Fix check for size reduction in dedupe.
Tweak debug message for clarity.
2012-09-25 23:24:33 +05:30
Moinak Ghosh
3544a8c708 Fix polynomial table computation.
Change hashing and length bias to reduce hashtable bucket collisions.
Add support for user-selectable 60% or 40% similarity for Delta Compression.
Overall slight speedup.
2012-09-24 22:20:27 +05:30
Moinak Ghosh
8386e72566 Rewrite core dedupe logic to simplify code and improve performance.
Hashtable based chunk-level deduplication instead of Quicksort.
Fix a corner case bug in Dedupe decompression.
2012-09-23 14:57:09 +05:30
Moinak Ghosh
99a8e4cd98 Speed up Hash computation for dedupe blocks.
Add missing initialization of sliding window.
Update help text.
2012-09-19 20:29:44 +05:30
Moinak Ghosh
e3befd9e16 Add support for Fixed-Block deduplication.
More refactoring of symbol names.
2012-09-16 11:12:58 +05:30
Moinak Ghosh
b9355a5dcc Reduce dedupe loop checks for slight speed edge.
Beginnings of Fixed-block dedupe.
Update variable name for clarity.
2012-09-15 11:14:58 +05:30
Moinak Ghosh
f3f472b860 Implement K-min-values Sketch for Similarity detection. 2012-09-11 20:26:36 +05:30
Moinak Ghosh
e6f042aaf8 Allow user-specified minimum Dedupe block size.
Compute similarity sketch only if Delta Compression enabled.
2012-09-05 22:43:54 +05:30
Moinak Ghosh
560fa85aab Fix secondary sketch computation, some more accuracy in diff detection. 2012-09-04 23:28:02 +05:30
Moinak Ghosh
262566b59a Add xxHash for Rabin block checksums, slightly faster than CRC64.
Fix missing initialization of character counts table.
Some file reorganization.
2012-09-02 20:40:32 +05:30
Moinak Ghosh
eda312ce1e Add support for Skein512 and Skein256 checksums
Import Skein code from NIST CD submission
Make checksum algorithms pluggable
Fix handling of huge buffers (>2GB) in LZP
Cleanup of some buffer sizing code
Speed up CRC64 calculation in dedupe chunking
2012-08-31 22:36:06 +05:30
Moinak Ghosh
bf149e880d Add LZP Pre-Compression support ported from libbsc.
Add generic pre-processing wrappers for future support of other pre-processors.
Clean up computation of Rabin block sizes.
Compute Rabin scratch space accurately to avoid RAM wastage.
2012-08-23 22:58:44 +05:30
Moinak Ghosh
023dcae19a Speed up sort comparator function. 2012-08-17 13:18:50 +05:30
Moinak Ghosh
2dadf411fa Reduce memory consumption and improve performance in dedupe
Re-introduce crc64 for dedup blocks to avoid wasted memcpy-s
Restructe block array to be an array of pointers allocated on demand
Fix a corner case issue when splitting chunks at a dedup boundary
2012-08-17 11:03:02 +05:30
Moinak Ghosh
55d0485d34 Improve Rabin computations using an irreducible polynomial
Slight improvement to similarity computation
A simple mechanism to include DEBUG mode stats
Include stdint for common int types
2012-08-15 20:13:40 +05:30
Moinak Ghosh
3150bdbed7 Implement secondary sketch based on character counts to refine similarity checksum.
Proper checksum update for last block.
Update comments.
2012-08-12 13:06:49 +05:30
Moinak Ghosh
bde917c8e9 Fix handling of compression flags in adaptive mode
Fix error handling when chunk size is too small for dedupe
Bump version to 0.6
2012-08-10 10:47:11 +05:30
Moinak Ghosh
f2ffcad2fd Compute and compare Mean sketch cksum to improve similarity comparison
Fix optflags settings in Makefile
Small optimization in zero RLE encoder to avoid scanning during lookahead
Some minor fixes
2012-08-09 23:57:24 +05:30
Moinak Ghosh
400d0bfa72 Bias fingerprint value with occurrence counts for a better sketch
Fix latent bug when calling algo deinit in decompression code
Reduce diff threshold for slightly greater delta encoding
Limit similar buffer size difference for less wasted diffing
Change zlib compression wrapper to use faster deflateReset mechanism
Reduce optimization level for Dedupe code, it goes faster
2012-08-08 22:40:58 +05:30
Moinak Ghosh
927da81562 Remove unneeded checks in qsort comparator. 2012-08-05 18:58:40 +05:30
Moinak Ghosh
203008def9 Further improve LZMA compression parameters to utilize all the 14 levels.
Tweak some Rabin parmeters for better reduction with zlib and Bzip2.
2012-07-30 23:30:13 +05:30
Moinak Ghosh
94563a7ecd Fix buffer size computation when allocating Rabin block array.
Reduce memory usage of Rabin block array.
Add an SSE optimization for bsdiff.
Move integer hashing function to utils file.
More updates to README.
2012-07-28 23:55:24 +05:30
Moinak Ghosh
c7cc7b469c Update chunk size computation to reduce memory usage.
Implement runtime bypass of custom allocator.
Update README.
2012-07-27 22:03:24 +05:30
Moinak Ghosh
53d4311534 Make LZFX Hash size dynamic.
Use smaller min rabin block when using fast compression algos.
Add missing check for algo init function return value.
2012-07-23 21:43:12 +05:30
Moinak Ghosh
8cfd54fe34 Add LZFX Compression support, a very fast lightweight compressor.
Avoid a branch in the rabin loop.
2012-07-23 00:15:08 +05:30
Moinak Ghosh
7e14909ad1 Separate initial rabin boundary detection and block splitting for performance.
Also fix a rare corner case latent bug.
2012-07-22 21:27:44 +05:30
Moinak Ghosh
962a2cae8a Compress Dedup index only if it is at least 90 bytes to avoid expansion.
Some minor cleanup.
2012-07-22 00:00:41 +05:30
Moinak Ghosh
b69dcf4d55 Remove debug statements. 2012-07-20 21:38:39 +05:30
Moinak Ghosh
fd7c7e9a65 Use 4-byte ints for header values instead of 8-byte size_t.
Use RLE on control data if it reduces the size.
Update some comments.
Use scratch space at end of data chunk, if available.
2012-07-20 20:53:46 +05:30
Moinak Ghosh
e788eb43b8 Implement Delta Encoding based on modified bsdiff.
Change to more accurate Sketch value computation approach.
2012-07-19 21:41:07 +05:30
Moinak Ghosh
1da2c40888 Use a rolling checksum based sketch value for a rabin chunk instead of a CRC64 checksum.
Avoids additional table-lookup memory access.
Reduce Rabin window size to avoid overflows in sketch value.
No need to maintain rolling checksum in Rabin context.
A few comment cleanups.
2012-07-13 22:06:55 +05:30
Moinak Ghosh
0091a0da02 Remove debug messages.
Fix last segment detection.
2012-07-10 21:11:31 +05:30
Moinak Ghosh
a873f92e41 Fix crash when decompressing deduped archive.
Ensure correct level is passed to lzma.
Avoid branch when wrapping rabin window position and check for rabin window size to be power of 2.
Update rabin parameters check for adaptive modes.
Add detection of 7-bit text/8-bit binary data for later use.
2012-07-10 20:14:23 +05:30
Moinak Ghosh
db0c9ea9ac Improve LZMA compression parameters at extreme levels.
Fix incorrect thread calculation.
Remove some cruft.
2012-07-09 23:28:11 +05:30
Moinak Ghosh
010f49f412 Implement ability to partition chunks at the last rabin boundary instead of fixed size. 2012-07-08 21:44:08 +05:30
Moinak Ghosh
d3f5287ee5 Update License info to LGPLv3. 2012-07-07 22:18:29 +05:30
Moinak Ghosh
172432698e Adjust Rabin parameters for chunksize > LZMA window size. 2012-07-07 00:01:13 +05:30
Moinak Ghosh
ea923b84f0 Use different min block size and Rabin break pattern depending on compression algo.
Cleanup some cruft.
2012-07-06 23:24:12 +05:30
Moinak Ghosh
f5ce45b16e Techniques to better reduce Rabin Metadata.
Fix wrong chunk sizing with dedup enabled.
2012-07-06 00:16:02 +05:30
Moinak Ghosh
774384c204 Remove debug statement. 2012-07-04 23:39:03 +05:30
Moinak Ghosh
3f0e1952ef Fix compile warnings. 2012-07-04 23:20:10 +05:30
Moinak Ghosh
1eee08040f Use diferent average Rabin block sizes depending on compression algorithm.
Misc cleanups.
2012-07-03 22:47:24 +05:30
Moinak Ghosh
a13c61e926 Change rabin index encoding scheme for better metadata compression. 2012-07-02 22:08:03 +05:30
Moinak Ghosh
a1825a2305 Implement Parallel deduplication support.
Restructure compression functions to take chunk flag as argument.
Add missing error flag printing in LZMA.
Only create enough threads as needed by chunk size and file size.
Minor cleanups and variable name changes.
2012-07-01 21:44:02 +05:30
Moinak Ghosh
f9c3644459 Updates to Rabin based Dedup.
Change command line option.
2012-06-29 23:45:06 +05:30
Moinak Ghosh
cbf9728278 Implement Deduplication based on Rabin Fingerprinting: work in progress.
Fix bug that prevented pipe mode from being used.
Allow building without specialized allocator.
Use basic optimize flag in debuig build.
2012-06-29 18:23:55 +05:30
Moinak Ghosh
8f5f531967 Add license and other minor fixes. 2012-06-21 20:40:43 +05:30
Moinak Ghosh
733923cbf2 Add ability to adjust chunk boundary based on Rabin Fingerprinting to improve compression.
Remove unnecessary checks in compression loop.
2012-06-21 20:27:05 +05:30