Commit graph

17 commits

Author SHA1 Message Date
Moinak Ghosh
2f2fc23771 Tweak index size computation. 2013-04-26 19:21:11 +05:30
Moinak Ghosh
aed69b2d53 Add test cases for Global Deduplication.
Update documentation and code comments.
Remove tempfile pathname after creation to ensure clean removal after process exit.
2013-04-26 18:32:00 +05:30
Moinak Ghosh
5bb028fe03 Change Segmented Dedupe flow to improve parallelism.
Periodically sync writes to segcache file.
Use simple insertion sort for small numbers of elements.
2013-04-25 23:42:32 +05:30
Moinak Ghosh
79a6e7f770 Capability to output data to stdout when compressing.
Always use segmented similarity bases dedupe when using -G option in pipe mode.
Standardize on average 8MB segment size for segmented dedupe.
Fix hashtable sizing.
Some miscellaneous cleanups.
Update README with details of new features.
2013-04-24 23:03:58 +05:30
Moinak Ghosh
6c5d8d9e18 Optimize index lookup for 8-byte keys.
More cleanups.
2013-04-24 19:49:43 +05:30
Moinak Ghosh
5d6ffd969d More tweaks to slightly improve segment dedupe efficiency.
Use on average 8MB segments for all cases.
Some minor cleanps.
2013-04-24 19:13:07 +05:30
Moinak Ghosh
eabd670790 Improve segment similarity detection and drastically reduce index size. 2013-04-23 23:15:32 +05:30
Moinak Ghosh
6b7d883393 Tweak percentage intervals computation to improve segmented dedupe ratio.
Avoid repeat processing of already processed segments.
2013-04-23 18:53:56 +05:30
Moinak Ghosh
d29f125ca7 Clean up temp cache dir handling.
Allow temp dir setting via specific env variable to point to fast devices like ramdisk,ssd.
2013-04-22 22:57:31 +05:30
Moinak Ghosh
2c4024792a Several bugfixes.
Avoid matching with self during hash lookup.
2013-04-22 22:07:07 +05:30
Moinak Ghosh
6b23f6a73a Several fixes and optimizations. 2013-04-22 19:52:18 +05:30
Moinak Ghosh
c0b4aa0116 Many optimizations and changes to Segmented Global Dedupe.
Use chunk hash based similarity matching rather than content based.
Use sorting to order hash buffer rather than min-heap for better accuracy.
Use fast CRC64 for similarity hash for speed and lower memory requirements.
2013-04-21 18:11:16 +05:30
Moinak Ghosh
3b8a5813fd Many optimizations to segmented global dedupe.
Use chunk hash based cumulative similarity matching instead of chunk content.
2013-04-19 22:51:51 +05:30
Moinak Ghosh
2f6ccca6e5 Update usage text and add minor tweaks. 2013-04-18 22:55:49 +05:30
Moinak Ghosh
8ae571124d Complete implementation for Segmented Global Deduplication. 2013-04-18 21:26:24 +05:30
Moinak Ghosh
a22b52cf08 Work in progress changes for Segmented Global Deduplication. 2013-04-14 23:51:54 +05:30
Moinak Ghosh
50251107de Work in progress changes for Segmented Global Deduplication. 2013-04-09 22:23:51 +05:30
Renamed from rabin/global/db.c (Browse further)