Moinak Ghosh
8b73303488
Some minor code cleanup.
2013-07-05 22:22:11 +05:30
Moinak Ghosh
e10a13ad94
Improve accuracy of the KMV sketch computation and speed it up.
2013-07-03 19:24:06 +05:30
Moinak Ghosh
ab1ced942d
Update invalid environment variable handling to actually fail rather than auto-correct.
2013-05-28 21:38:35 +05:30
Moinak Ghosh
e9ce7a5ed2
Fix a crash with invalid PCOMPRESS_CHUNK_HASH_GLOBAL.
...
Update testcase to correctly detect core files.
2013-05-26 23:38:10 +05:30
Moinak Ghosh
ddaa3b6b6d
Drastic simplification of Min-heap code and resultant Delta speedup.
2013-05-25 17:34:38 +05:30
Moinak Ghosh
41b036adac
Fix issue #8.
2013-05-10 19:51:24 +05:30
Moinak Ghosh
c27317d7da
Add SSE2 optimizations for Segmented Dedupe.
2013-05-05 23:34:26 +05:30
Moinak Ghosh
6ecc400571
Fix segment offset sorting.
...
Get rid of incorrect duplicate checks in index.
2013-05-05 18:50:52 +05:30
Moinak Ghosh
c6da2325e3
Allow SKEIN to be used as a Global Dedupe chunk lookup hash.
2013-05-04 15:59:29 +05:30
Moinak Ghosh
0cf94c308a
Add a qsort variant optimized for integers and use in global dedupe.
...
Cleanup LZMA CRC64/32 declarations and add a header.
Fix heapq header.
2013-05-03 22:06:55 +05:30
Moinak Ghosh
c43e99f422
Use openmp parallelism always for chunk hash computation during Global Dedupe.
2013-05-02 23:24:43 +05:30
Moinak Ghosh
120877348c
Use SHA256 for Global Dedupe chunk lookup hash by default.
...
Allow changing Global Dedupe chunk lookup hash via env variable.
2013-05-02 00:05:05 +05:30
Moinak Ghosh
b23b5789fb
Fix bugs and improve accuracy in Segmented Dedupe.
...
Fix segment hashlist size computation.
Remove unnecessary sync of segment hashlist file writes.
Pass correct number of threads to index creation routine.
Add more error checks.
Handle correct positioning of segment hashlist file offset on write error.
Add missing semaphore signaling at dedupe abort points with global dedupe.
Use closer min-values sampling for improved segmented dedupe accuracy.
Update proper checksum info in README.
2013-04-30 19:35:18 +05:30
Moinak Ghosh
074e265f70
Fix sizing of similarity hash buffer.
2013-04-26 22:36:14 +05:30
Moinak Ghosh
75f62d6a36
Simplify segment lookup loop.
...
Fix assertion.
2013-04-26 10:56:29 +05:30
Moinak Ghosh
5bb028fe03
Change Segmented Dedupe flow to improve parallelism.
...
Periodically sync writes to segcache file.
Use simple insertion sort for small numbers of elements.
2013-04-25 23:42:32 +05:30
Moinak Ghosh
79a6e7f770
Capability to output data to stdout when compressing.
...
Always use segmented similarity based dedupe when using -G option in pipe mode.
Standardize on average 8MB segment size for segmented dedupe.
Fix hashtable sizing.
Some miscellaneous cleanups.
Update README with details of new features.
2013-04-24 23:03:58 +05:30
Moinak Ghosh
6c5d8d9e18
Optimize index lookup for 8-byte keys.
...
More cleanups.
2013-04-24 19:49:43 +05:30
Moinak Ghosh
eabd670790
Improve segment similarity detection and drastically reduce index size.
2013-04-23 23:15:32 +05:30
Moinak Ghosh
b32f4b3f9a
Improve duplicate segment match detection.
2013-04-23 20:51:12 +05:30
Moinak Ghosh
6b7d883393
Tweak percentage intervals computation to improve segmented dedupe ratio.
...
Avoid repeat processing of already processed segments.
2013-04-23 18:53:56 +05:30
Moinak Ghosh
2c4024792a
Several bugfixes.
...
Avoid matching with self during hash lookup.
2013-04-22 22:07:07 +05:30
Moinak Ghosh
6b23f6a73a
Several fixes and optimizations.
2013-04-22 19:52:18 +05:30
Moinak Ghosh
c0b4aa0116
Many optimizations and changes to Segmented Global Dedupe.
...
Use chunk hash based similarity matching rather than content based.
Use sorting to order hash buffer rather than min-heap for better accuracy.
Use fast CRC64 for similarity hash for speed and lower memory requirements.
2013-04-21 18:11:16 +05:30
Moinak Ghosh
3b8a5813fd
Many optimizations to segmented global dedupe.
...
Use chunk hash based cumulative similarity matching instead of chunk content.
2013-04-19 22:51:51 +05:30
Moinak Ghosh
426c0d0bf2
Properly cleanup global dedupe state.
2013-04-18 21:36:36 +05:30
Moinak Ghosh
8ae571124d
Complete implementation for Segmented Global Deduplication.
2013-04-18 21:26:24 +05:30
Moinak Ghosh
a22b52cf08
Work in progress changes for Segmented Global Deduplication.
2013-04-14 23:51:54 +05:30
Moinak Ghosh
50251107de
Work in progress changes for Segmented Global Deduplication.
2013-04-09 22:23:51 +05:30
Moinak Ghosh
3d7a179a77
Work in progress changes for scalable segmented global deduplication.
...
Allow user-specified environment setting to control in-memory index size.
2013-04-06 15:15:27 +05:30
Moinak Ghosh
c357452079
Implement global dedupe in pipe mode.
...
Update hash index calculations to use up to 75% memavail when file size is not known.
Use little-endian nonce format for Salsa20.
2013-03-29 15:18:25 +05:30
Moinak Ghosh
19b304f30c
Add global index cleanup function.
...
Fix location of sem_wait().
More comments.
2013-03-25 21:04:16 +05:30
Moinak Ghosh
fbf4658635
Implement Global Deduplication.
2013-03-24 23:21:17 +05:30
Moinak Ghosh
876796be5c
Work in progress changes for global dedupe.
2013-03-21 22:00:38 +05:30
Moinak Ghosh
b7fdeb08bc
Work in progress global dedupe changes.
2013-03-20 22:47:03 +05:30
Moinak Ghosh
f8f23e5200
Major License text cleanup.
2013-03-07 20:26:48 +05:30
Moinak Ghosh
f89473d29c
Fixes for issues/warnings reported in issue #4.
2013-02-15 22:53:17 +05:30
Moinak Ghosh
3e1737b4ab
Use OpenMP parallelism when computing xxHashes for chunks.
2013-02-02 09:27:58 +05:30
Moinak Ghosh
af4c6e1d84
Reduce dedupe hash table collisions by half.
2013-01-31 00:38:41 +05:30
Moinak Ghosh
3d8f3ada1c
Improve Deduplication performance by another 95%.
...
Start sliding window scanning near minimum chunk size boundaries to avoid scanning whole chunk.
2013-01-30 22:41:13 +05:30
Moinak Ghosh
5c8704c5bb
Improve Deduplication throughput by 90%.
...
Use SSE4 register as sliding window for default 16-byte window size.
Use local variable for sliding window position to avoid spurious memory access in non-SIMD case.
Avoid computing breakpoint check value if processed length < minimum block length.
2013-01-22 15:54:42 +05:30
Moinak Ghosh
3888c8d316
Many optimization tweaks
...
Optimize Rabin Deduplication and Bsdiff
Vectorize XXHash using SSE4
2013-01-20 22:02:26 +05:30
Moinak Ghosh
455c8107d5
Use pre-increment for shorter instruction length and slight speed.
2013-01-17 22:54:30 +05:30
Moinak Ghosh
39dbc4be43
Implement algo-specific minimum distance match for Delta Compression.
2013-01-14 13:20:07 +05:30
Moinak Ghosh
d49a088eea
Fixes and performance improvements for Dedupe Delta Compression
...
Avoid using fingerprints in minhash computation and fix write amplification
Modify min-heap to use 64bit values
Improve bsdiff performance
Fix pointer comparison in bsdiff
Use 32bit offsets in bsdiff to reduce memory usage
Improve Zero RLE Encoder performance
Add more buffer overflow checks in Zero RLE Decoder
2013-01-13 22:04:59 +05:30
Moinak Ghosh
d9eb82e0e8
Fix numeric parsing.
...
Fix dedupe bug introduced in last commit.
Reset valid flag when resetting dedupe context.
Cleanup test suites.
Do not abort test suite on failure of a test case.
2013-01-03 00:27:18 +05:30
Moinak Ghosh
13d9378acd
Update to latest XXHash version.
2012-12-31 11:53:47 +05:30
Moinak Ghosh
28224d29d3
Improve Dedupe performance.
...
Add more debug timing stats.
Change default checksum to Keccak 256 (SIMD version 4x faster than Skein).
Fix compiler warning in allocator code.
2012-12-29 23:43:41 +05:30
Moinak Ghosh
26a4f42506
Introduce strict compiler flags and fix scores of warnings/issues.
...
Avoid different optimization flags for Dedupe sources.
Fix liberal mixing of uint64_t and int64_t (should all be uint64_t).
Fix corner case crash when decompressing.
2012-12-27 23:06:48 +05:30
Moinak Ghosh
a43fdd7d2c
Improve Delta2 scanning speed and effectiveness.
...
Add destination buffer overflow check in Delta2.
Add rough speed computation.
2012-12-23 00:44:56 +05:30