Commit graph

152 commits

Author SHA1 Message Date
Moinak Ghosh
f970b41e34 A bunch of improvements and fixes.
- Fix heap corruption in DICT Filter.
- Make the default Dedup block size 8KB.
- Revamp executable file handling: Part#1.
- Developed new E8E9 filter that works better than Dispack on raw data blocks.
- Remove block-based Dispack encoding. File-specific Dispack filter to be added.
- Improve file header based executable file detection.
- Introduce new sorting algorithm for filenames without extension.
2014-12-11 19:15:36 +05:30
Moinak Ghosh
e3c32ed6d6 Remove unneeded archive writing function.
Improve filter scratch buffer handling.
Improve memory accounting.
Remove delayed allocation when compressing. Allows better memory estimation.
Some cstyle fixes.
2014-09-24 21:54:36 +05:30
Moinak Ghosh
3debf1340c Fix missing newline in error message. 2014-09-21 22:08:10 +05:30
Moinak Ghosh
10f40e1c6f Part 1 changes to allow dual licensing to MPLV2.
Make external LGPL code/features disabled in MPLV2 variant.
Nuke some unwanted whitespace (cstyle).
2014-07-24 22:20:30 +05:30
Moinak Ghosh
0433452b37 Miscellaneous refactoring.
Add some headers for OSX.
2014-05-24 23:52:30 +05:30
Moinak Ghosh
a62e1aa5f7 Config script option to disable AVX.
Fix compiler (GCC 4.9) warnings.
2014-05-05 19:40:53 +05:30
Moinak Ghosh
63bef473cc Working Mac OS X port.
Compatibility layer for semaphore handling.
2014-05-04 21:11:31 +05:30
Moinak Ghosh
f2da433188 More portability tweaks.
Handle platform specific yasm parameters.
Resolve namespace conflict on OSX.
Do not build Skein ASM code on OSX.
2014-04-30 22:46:24 +05:30
Moinak Ghosh
935717373b Capability to list the offset and length of each block during deduplication, for external use. 2014-03-30 17:35:21 +05:30
Moinak Ghosh
f8d3ddfe39 Fix issue #15. 2014-01-15 22:42:18 +05:30
Moinak Ghosh
7f81869874 Archiving support using Libarchive: Work in progress changes.
Change all perror() calls to use logger.
Make the config script a little more verbose.
2013-10-20 23:54:27 +05:30
Moinak Ghosh
8c1f4ebe61 Add a simple log facility.
Refactor all printfs to use log facility.
2013-10-02 20:45:33 +05:30
Moinak Ghosh
fa78621cbf Cleanup pointer casting in code to use macros. 2013-09-22 20:11:15 +05:30
Moinak Ghosh
fc65111bae Fix issue #11.
Increase default chunk size to 8MB.
Use default compression level of 1 (fast mode) for LZ4.
2013-08-24 22:58:50 +05:30
Moinak Ghosh
3db5188445 Support for deduplication using 2KB block size. 2013-08-19 13:38:52 +05:30
Moinak Ghosh
ef98422bd4 Add basic file format documentation.
Reduce memory threshold for switching to Similarity based Deduplication.
2013-08-18 20:11:20 +05:30
Moinak Ghosh
58f3113558 Avoid unnecessary re-hashing of 64-bit keys of the segment index. 2013-08-17 22:08:55 +05:30
Moinak Ghosh
d31c6433c2 Update free memory computation to include cached buffers.
Fix a potential rare corner case.
2013-08-17 11:31:44 +05:30
Moinak Ghosh
f35d0ff4ef Fix multiple crashes for some corner cases.
Increase max block size for variable dedup block sizes greater than 16KB.
Update test cases and fix a test script bug.
2013-08-09 21:55:06 +05:30
Moinak Ghosh
fe18afbcf4 Use wrapper script to set paths when launching pcompress from build directory.
Use smaller max block size when doing global dedupe.
Fix init of executable name.
2013-08-07 22:03:52 +05:30
Moinak Ghosh
f34cfb1aa6 Make data partitioning between threads more effective.
Remove unnecessary computation to make Fixed block chunking faster.
2013-07-21 09:31:59 +05:30
Moinak Ghosh
2a218e9da5 Fix Dedupe Mode initialization. 2013-07-12 18:21:49 +05:30
Moinak Ghosh
8b73303488 Some minor code cleanup. 2013-07-05 22:22:11 +05:30
Moinak Ghosh
e10a13ad94 Improve accuracy of the KMV sketch computation and speed it up. 2013-07-03 19:24:06 +05:30
Moinak Ghosh
6b67e98747 Reduce similarity indicators to reduce memory use with low impact on dedupe ratio. 2013-06-30 22:38:05 +05:30
Moinak Ghosh
17db67564d Reduce a rolling hash parameter for a slight speedup with no side effect. 2013-06-24 21:13:32 +05:30
Moinak Ghosh
c0dd0102a5 A few minor fixes. 2013-06-14 22:25:01 +05:30
Moinak Ghosh
ab1ced942d Update invalid environment variable handling to actually fail rather than auto-correct. 2013-05-28 21:38:35 +05:30
Moinak Ghosh
e9ce7a5ed2 Fix a crash with invalid PCOMPRESS_CHUNK_HASH_GLOBAL.
Update testcase to correctly detect core files.
2013-05-26 23:38:10 +05:30
Moinak Ghosh
ddaa3b6b6d Drastic simplification of Min-heap code and resultant Delta speedup. 2013-05-25 17:34:38 +05:30
Moinak Ghosh
0a1e3b39ef Match segment size to chunk size for Segmented Dedupe for better accuracy. 2013-05-15 22:20:45 +05:30
Moinak Ghosh
41b036adac Fix issue #8. 2013-05-10 19:51:24 +05:30
Moinak Ghosh
c27317d7da Add SSE2 optimizations for Segmented Dedupe. 2013-05-05 23:34:26 +05:30
Moinak Ghosh
6ecc400571 Fix segment offset sorting.
Get rid of incorrect duplicate checks in index.
2013-05-05 18:50:52 +05:30
Moinak Ghosh
c6da2325e3 Allow SKEIN to be used as a Global Dedupe chunk lookup hash. 2013-05-04 15:59:29 +05:30
Moinak Ghosh
0cf94c308a Add a qsort variant optimized for integers and use in global dedupe.
Cleanup LZMA CRC64/32 declarations and add a header.
Fix heapq header.
2013-05-03 22:06:55 +05:30
Moinak Ghosh
c43e99f422 Use openmp parallelism always for chunk hash computation during Global Dedupe. 2013-05-02 23:24:43 +05:30
Moinak Ghosh
120877348c Use SHA256 for Global Dedupe chunk lookup hash by default.
Allow changing Global Dedupe chunk lookup hash via env variable.
2013-05-02 00:05:05 +05:30
Moinak Ghosh
eae16b82d3 Fix issue #7.
Ensure tempfile cleanup even with error abort.
2013-05-01 18:01:17 +05:30
Moinak Ghosh
b23b5789fb Fix bugs and improve accuracy in Segmented Dedupe.
Fix segment hashlist size computation.
Remove unnecessary sync of segment hashlist file writes.
Pass correct number of threads to index creation routine.
Add more error checks.
Handle correct positioning of segment hashlist file offset on write error.
Add missing semaphore signaling at dedupe abort points with global dedupe.
Use closer min-values sampling for improved segmented dedupe accuracy.
Update proper checksum info in README.
2013-04-30 19:35:18 +05:30
Moinak Ghosh
074e265f70 Fix sizing of similarity hash buffer. 2013-04-26 22:36:14 +05:30
Moinak Ghosh
2f2fc23771 Tweak index size computation. 2013-04-26 19:21:11 +05:30
Moinak Ghosh
aed69b2d53 Add test cases for Global Deduplication.
Update documentation and code comments.
Remove tempfile pathname after creation to ensure clean removal after process exit.
2013-04-26 18:32:00 +05:30
Moinak Ghosh
75f62d6a36 Simplify segment lookup loop.
Fix assertion.
2013-04-26 10:56:29 +05:30
Moinak Ghosh
5bb028fe03 Change Segmented Dedupe flow to improve parallelism.
Periodically sync writes to segcache file.
Use simple insertion sort for small numbers of elements.
2013-04-25 23:42:32 +05:30
Moinak Ghosh
79a6e7f770 Capability to output data to stdout when compressing.
Always use segmented similarity-based dedupe when using -G option in pipe mode.
Standardize on average 8MB segment size for segmented dedupe.
Fix hashtable sizing.
Some miscellaneous cleanups.
Update README with details of new features.
2013-04-24 23:03:58 +05:30
Moinak Ghosh
6c5d8d9e18 Optimize index lookup for 8-byte keys.
More cleanups.
2013-04-24 19:49:43 +05:30
Moinak Ghosh
5d6ffd969d More tweaks to slightly improve segment dedupe efficiency.
Use on average 8MB segments for all cases.
Some minor cleanups.
2013-04-24 19:13:07 +05:30
Moinak Ghosh
eabd670790 Improve segment similarity detection and drastically reduce index size. 2013-04-23 23:15:32 +05:30
Moinak Ghosh
b32f4b3f9a Improve duplicate segment match detection. 2013-04-23 20:51:12 +05:30