A parallelized data deduplication and command line compression utility.

Moinak Ghosh f83652aa90 Update README.		2012-07-27 22:25:44 +05:30
bsdiff	Use 4-byte ints for header values instead of 8-byte size_t.	2012-07-20 20:53:46 +05:30
lz4	Add support for LZ4 compression including multi-pass LZ4.	2012-07-25 21:07:36 +05:30
lzfx	Add support for LZ4 compression including multi-pass LZ4.	2012-07-25 21:07:36 +05:30
lzma	New capability in allocator to add slab caches with user-specified size.	2012-05-31 13:06:40 +05:30
rabin	Update chunk size computation to reduce memory usage.	2012-07-27 22:03:24 +05:30
adaptive_compress.c	Update License info to LGPLv3.	2012-07-07 22:18:29 +05:30
allocator.c	Update chunk size computation to reduce memory usage.	2012-07-27 22:03:24 +05:30
allocator.h	Make LZFX Hash size dynamic.	2012-07-23 21:43:12 +05:30
bzip2_compress.c	Fix huge chunk handling in zlib compression routines.	2012-07-27 00:11:01 +05:30
COPYING	Update License info to LGPLv3.	2012-07-07 22:18:29 +05:30
COPYING.LESSER	Update License info to LGPLv3.	2012-07-07 22:18:29 +05:30
lz4_compress.c	Add support for LZ4 compression including multi-pass LZ4.	2012-07-25 21:07:36 +05:30
lzfx_compress.c	Add support for LZ4 compression including multi-pass LZ4.	2012-07-25 21:07:36 +05:30
lzma_compress.c	Fix crash when decompressing deduped archive.	2012-07-10 20:14:23 +05:30
main.c	Display release version in usage text.	2012-07-27 22:07:56 +05:30
Makefile	Add support for LZ4 compression including multi-pass LZ4.	2012-07-25 21:07:36 +05:30
pcompress.h	Display release version in usage text.	2012-07-27 22:07:56 +05:30
ppmd_compress.c	Update License info to LGPLv3.	2012-07-07 22:18:29 +05:30
README.md	Update README.	2012-07-27 22:25:44 +05:30
utils.c	Remove debug messages.	2012-07-10 21:11:31 +05:30
utils.h	Add support for LZ4 compression including multi-pass LZ4.	2012-07-25 21:07:36 +05:30
zlib_compress.c	Update chunk size computation to reduce memory usage.	2012-07-27 22:03:24 +05:30

README.md

Pcompress

Pcompress is a utility that compresses and decompresses data in parallel by splitting the input into chunks. It has a modular structure and supports multiple algorithms, including LZMA, Bzip2 and PPMD, with CRC64 chunk checksums. SSE optimizations for the bundled LZMA are included. It also implements chunk-level Content-Aware Deduplication and Delta Compression based on a Semi-Rabin Fingerprinting scheme. Delta Compression is implemented via the widely used bsdiff algorithm. Similarity is detected using a custom hash of the maximal features of a block. When doing chunk-level dedupe, it attempts to merge the index entries of adjacent non-duplicate blocks into a single larger entry to reduce metadata. In addition, it can internally split chunks at Rabin boundaries to help dedupe and compression.
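To make the idea of splitting at content-defined boundaries concrete, here is a minimal Python sketch. It is not the Semi-Rabin scheme used in the `rabin` directory: it uses a plain polynomial rolling hash, and every constant and function name (`WINDOW`, `MASK`, `split_blocks`, `dedupe`) is a hypothetical choice made for illustration. SHA-256 stands in for whatever block hashing the real code uses.

```python
import hashlib

# All constants below are illustrative assumptions, not pcompress values.
WINDOW = 16           # bytes covered by the rolling hash
BASE = 257            # polynomial base
MOD = (1 << 31) - 1   # modulus for the hash
MASK = (1 << 6) - 1   # boundary when the low 6 bits of the hash are zero
MIN_BLOCK = 32        # never cut a block shorter than this


def split_blocks(data):
    """Split data into blocks whose boundaries depend on content, not offsets."""
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window
    blocks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        if i - start >= WINDOW:
            # Slide the window: remove the contribution of the oldest byte.
            h = (h - data[i - WINDOW] * top) % MOD
        h = (h * BASE + byte) % MOD
        if i - start + 1 >= MIN_BLOCK and (h & MASK) == 0:
            blocks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        blocks.append(data[start:])  # trailing block without a boundary
    return blocks


def dedupe(blocks):
    """Exact dedupe: store each distinct block once, plus an index to rebuild."""
    seen, unique, index = {}, [], []
    for block in blocks:
        key = hashlib.sha256(block).digest()
        if key not in seen:
            seen[key] = len(unique)
            unique.append(block)
        index.append(seen[key])
    return unique, index
```

Because a boundary depends only on the last `WINDOW` bytes, inserting data near the start of the input tends to shift block boundaries along with the content, so later blocks can still match their earlier copies. The original sequence is rebuilt with `[unique[i] for i in index]`.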

It has low metadata overhead and overlaps I/O and compression to achieve maximum parallelism. It also bundles a simple slab allocator to speed up repeated allocation of similar chunks. It can work in pipe mode, reading from stdin and writing to stdout. It also provides adaptive compression modes in which multiple algorithms are tried per chunk to determine the best one for that chunk. Finally, it supports 14 compression levels to allow for ultra compression modes in some algorithms.
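The adaptive, per-chunk approach can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the standard library has no PPMd, so `bz2` and `lzma` stand in as the candidate algorithms; `zlib.crc32` stands in for the CRC64 chunk checksum; and the record layout and function names (`compress_chunk`, `compress_stream`, `decompress_stream`) are hypothetical, not the pcompress file format or API.

```python
import bz2
import lzma
import zlib
from concurrent.futures import ThreadPoolExecutor

# Candidate codecs tried on every chunk (stand-ins, see the note above).
CODECS = {
    "bzip2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}


def compress_chunk(chunk):
    """Compress one chunk with every codec and keep the smallest result."""
    name, packed = min(
        ((n, enc(chunk)) for n, (enc, _dec) in CODECS.items()),
        key=lambda item: len(item[1]),
    )
    # CRC32 of the original chunk, standing in for a CRC64 checksum.
    return name, zlib.crc32(chunk), packed


def compress_stream(data, chunk_size, threads=4):
    """Split data into fixed-size chunks and compress them in parallel."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # map() returns results in input order, so chunk order is preserved.
        return list(pool.map(compress_chunk, chunks))


def decompress_stream(records):
    """Decompress each record with its recorded codec and verify its checksum."""
    out = []
    for name, checksum, packed in records:
        chunk = CODECS[name][1](packed)
        if zlib.crc32(chunk) != checksum:
            raise ValueError("chunk checksum mismatch")
        out.append(chunk)
    return b"".join(out)
```

Each chunk is independent, which is what lets the work spread across threads; the cost is that every chunk is compressed once per candidate algorithm, which is why adaptive modes are slower and need more memory.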

Usage

To compress a file:
   pcompress -c <algorithm> [-l <compress_level>] [-s <chunk_size>] <file>
   Where <algorithm> can be the following:
   lzfx   - Very fast and small algorithm based on LZF.
   lz4    - Ultra fast, high-throughput algorithm reaching RAM bandwidth at level 1.
   zlib   - The base Zlib format compression (not Gzip).
   lzma   - The LZMA (Lempel-Ziv Markov) algorithm from 7Zip.
   bzip2  - Bzip2 Algorithm from libbzip2.
   ppmd   - The PPMd algorithm, excellent for textual data. PPMd requires
            at least (64MB x number of CPUs) more memory than the other modes.
   adapt  - Adaptive mode where ppmd or bzip2 will be used per chunk,
            depending on which one produces better compression. This mode
            is obviously fairly slow and requires lots of memory.
   adapt2 - Adaptive mode which includes ppmd and lzma. This requires
            more memory than adapt mode, is slower and potentially gives
            the best compression.
   <chunk_size> - This can be in bytes or can use the following suffixes:
            g - Gigabyte, m - Megabyte, k - Kilobyte.
            Larger chunks produce better compression at the cost of memory.
   <compress_level> - Can be a number from 0 (minimum compression) to 14
            (maximum compression).
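The `<chunk_size>` suffixes and the `<compress_level>` range described above can be sketched as a small Python parser. This is an illustration, not the parsing code in `main.c`: the function names are hypothetical, and it assumes binary multiples (1k = 1024 bytes) and lowercase suffixes only, neither of which the text above states explicitly.

```python
# Assumed binary multiples for the documented suffixes.
UNITS = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def parse_chunk_size(text):
    """Convert a size such as '4096', '8k', '64m' or '1g' to a byte count."""
    text = text.strip()
    multiplier = UNITS.get(text[-1:])          # None when there is no suffix
    digits = text[:-1] if multiplier else text
    if not (digits.isascii() and digits.isdigit()):
        raise ValueError(f"invalid chunk size: {text!r}")
    return int(digits) * (multiplier or 1)


def check_level(level):
    """Accept compression levels in the documented range 0 to 14."""
    if not 0 <= level <= 14:
        raise ValueError(f"compression level out of range: {level}")
    return level
```

With these assumptions, `-s64m` corresponds to `parse_chunk_size("64m")`, which is 64 * 1024 * 1024 bytes.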

To decompress a file compressed using the above command:
   pcompress -d <compressed file> <target file>

To operate as a pipe, read from stdin and write to stdout:
   pcompress -p ...

Attempt Rabin fingerprinting based deduplication on chunks:
   pcompress -D ...
   pcompress -D -r ... - Do NOT split chunks at a rabin boundary. Default is to split.

Perform Delta Encoding in addition to Exact Dedup:
   pcompress -E ... - This also implies '-D'.

The number of threads can optionally be specified: -t <1 - 256 count>
Pass '-M' to display memory allocator statistics.
Pass '-C' to display compression statistics.

Examples

Compress "file.tar" using bzip2 at level 6 with a 64MB chunk size and 4 threads. In addition, perform exact deduplication and delta compression prior to compression.

pcompress -D -E -c bzip2 -l6 -s64m -t4 file.tar

Compress "file.tar" using the extreme compression mode of LZMA and a chunk size of 1GB. Allow pcompress to detect the number of CPU cores and use as many threads.

pcompress -c lzma -l14 -s1g file.tar