Pcompress
=========

Copyright (C) 2012 Moinak Ghosh. All rights reserved.
Use is subject to license terms.

Pcompress is a utility to do compression and decompression in parallel by
splitting input data into chunks. It has a modular structure and includes
support for multiple algorithms like LZMA, Bzip2, PPMD, etc., with CRC64
chunk checksums. SSE optimizations for the bundled LZMA are included. It
also implements chunk-level Content-Aware Deduplication and Delta
Compression features based on a Semi-Rabin Fingerprinting scheme. Delta
Compression is implemented via the widely popular bsdiff algorithm.
Similarity is detected using a custom hashing of maximal features of a
block. When doing chunk-level dedupe it attempts to merge the index entries
of adjacent non-duplicate blocks into a single larger entry to reduce
metadata. In addition, it can internally split chunks at Rabin boundaries
to help dedupe and compression.
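
To illustrate why splitting at content-defined boundaries helps dedupe, here is a minimal sketch of the general technique. It is not the Semi-Rabin scheme used by pcompress: it swaps in a simple polynomial rolling hash, and the names `split_blocks`, `WINDOW`, `MASK`, `BASE` and `MOD`, along with their values, are assumptions made for illustration only.

```python
# Illustrative content-defined chunking with a polynomial rolling hash.
# All constants below are assumed values, not those used by pcompress.
WINDOW = 16            # bytes covered by the rolling hash
MASK = (1 << 11) - 1   # cut when the low 11 bits of the hash are zero
BASE = 257
MOD = (1 << 61) - 1


def split_blocks(data, min_size=256):
    """Split data wherever the hash of the last WINDOW bytes matches a pattern.

    min_size must be at least WINDOW so that every cut decision depends
    only on the bytes inside the window, not on the block start position.
    """
    if min_size < WINDOW:
        raise ValueError("min_size must be at least WINDOW")
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window
    blocks = []
    start = 0
    h = 0
    for i, byte in enumerate(data):
        if i - start >= WINDOW:
            # Drop the oldest byte so the hash covers exactly WINDOW bytes.
            h = (h - data[i - WINDOW] * top) % MOD
        h = (h * BASE + byte) % MOD
        if i + 1 - start >= min_size and (h & MASK) == 0:
            blocks.append(data[start:i + 1])
            start = i + 1
            h = 0
    if start < len(data):
        blocks.append(data[start:])  # trailing partial block
    return blocks
```

Because each cut depends only on nearby bytes, inserting a few bytes near the start of the input typically shifts only the first block or two; later blocks keep the same content and can still be matched as duplicates. Fixed-size splitting would shift every later block.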

It has low metadata overhead and overlaps I/O and compression to achieve
maximum parallelism. It also bundles a simple slab allocator to speed up
repeated allocation of similar chunks. It can work in pipe mode, reading
from stdin and writing to stdout. It also provides some adaptive compression
modes in which multiple algorithms are tried per chunk to determine the best
one for the given chunk. Finally, it supports 14 compression levels to allow
for ultra compression modes in some algorithms.
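
The chunk-parallel approach can be sketched in a few lines. This is an illustration of the idea, not pcompress's implementation: it uses Python's `zlib` for compression and `zlib.crc32` as a stand-in for the CRC64 chunk checksums, and the function names `compress_chunk`, `compress_parallel` and `decompress_all` are hypothetical.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def compress_chunk(chunk, level=6):
    # Checksum the original bytes so each chunk can be verified on its own.
    # CRC32 is a stand-in here; pcompress itself uses CRC64.
    return zlib.crc32(chunk), zlib.compress(chunk, level)


def compress_parallel(data, chunk_size, threads=4):
    # Split the input into fixed-size chunks and compress them concurrently.
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # map() returns results in input order, even if work finishes out of order.
        return list(pool.map(compress_chunk, chunks))


def decompress_all(records):
    # Decompress each chunk, verify its checksum, and reassemble the input.
    out = []
    for crc, payload in records:
        chunk = zlib.decompress(payload)
        if zlib.crc32(chunk) != crc:
            raise ValueError("chunk checksum mismatch")
        out.append(chunk)
    return b"".join(out)
```

Since every chunk is compressed independently, larger chunks give the compressor more context (better ratio) while more, smaller chunks give more units of parallel work.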

Usage
=====

    To compress a file:
       pcompress -c <algorithm> [-l <compress_level>] [-s <chunk_size>] <file>
       Where <algorithm> can be the following:
       lzfx   - Very fast and small algorithm based on LZF.
       lz4    - Ultra fast, high-throughput algorithm reaching RAM B/W at level 1.
       zlib   - The base Zlib format compression (not Gzip).
       lzma   - The LZMA (Lempel-Ziv Markov) algorithm from 7Zip.
       bzip2  - Bzip2 Algorithm from libbzip2.
       ppmd   - The PPMd algorithm, excellent for textual data. PPMd requires
                at least 64MB x (number of CPUs) more memory than the other modes.
       adapt  - Adaptive mode where ppmd or bzip2 will be used per chunk,
                depending on which one produces better compression. This mode
                is obviously fairly slow and requires lots of memory.
       adapt2 - Adaptive mode which includes ppmd and lzma. This requires
                more memory than adapt mode, is slower and potentially gives
                the best compression.
       <chunk_size> - This can be in bytes or can use the following suffixes:
                g - Gigabyte, m - Megabyte, k - Kilobyte.
                Larger chunks produce better compression at the cost of memory.
       <compress_level> - Can be a number from 0 (minimum compression) to 14
                (maximum compression).

    To decompress a file compressed using above command:
       pcompress -d <compressed file> <target file>

    To operate as a pipe, read from stdin and write to stdout:
       pcompress -p ...

    Attempt Rabin fingerprinting based deduplication on chunks:
       pcompress -D ...
       pcompress -D -r ... - Do NOT split chunks at a Rabin boundary. Default
                             is to split.

    Perform Delta Encoding in addition to Exact Dedup:
       pcompress -E ... - This also implies '-D'.

    Number of threads can optionally be specified: -t <count> (from 1 to 256)
    Pass '-M' to display memory allocator statistics
    Pass '-C' to display compression statistics
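
As an illustration of the `<chunk_size>` suffixes described above, the following sketch converts a value such as `64m` into a byte count. It assumes binary multiples (1k = 1024 bytes), which the text above does not state explicitly, and `parse_chunk_size` is a hypothetical helper, not a function taken from pcompress.

```python
# Assumption: suffixes are binary multiples (k = 1024, m = 1024^2, g = 1024^3).
SUFFIXES = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def parse_chunk_size(text):
    """Convert a chunk size such as '4096', '8k', '64m' or '1g' to bytes."""
    text = text.strip()
    if not text:
        raise ValueError("empty chunk size")
    multiplier = SUFFIXES.get(text[-1])
    digits = text[:-1] if multiplier else text
    # Accept plain ASCII digits only; anything else is rejected.
    if not (digits.isascii() and digits.isdigit()):
        raise ValueError("invalid chunk size: %r" % text)
    return int(digits) * (multiplier or 1)
```

Under that assumption, the `-s64m` used in the examples in this document would correspond to `parse_chunk_size("64m")`, that is, 64 * 1024 * 1024 bytes.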

Environment Variables
=====================

Set ALLOCATOR_BYPASS=1 in the environment to avoid using the built-in
allocator. Due to the way it rounds up an allocation request to the nearest
slab, the built-in allocator can allocate extra unused memory.
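
To show how rounding up to a slab can leave memory unused, here is a small sketch. The slab size classes are an assumption: it uses power-of-two slabs starting at 64 bytes, which may not match the classes the built-in allocator actually uses. `slab_size` and `wasted` are hypothetical names.

```python
def slab_size(request, min_slab=64):
    """Round a request up to the next slab size.

    Assumption: slabs are powers of two starting at min_slab bytes.
    """
    if request <= 0:
        raise ValueError("request must be positive")
    size = min_slab
    while size < request:
        size *= 2
    return size


def wasted(request):
    # Bytes allocated but never used by the caller.
    return slab_size(request) - request
```

With these assumed size classes, a request just over a slab boundary wastes the most: `wasted(65)` is 63 bytes, and a request of 5MB would occupy an 8MB slab.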

Examples
========

Compress "file.tar" using bzip2 level 6, 64MB chunk size and use 4 threads. In
addition perform exact deduplication and delta compression prior to compression.

    pcompress -D -E -c bzip2 -l6 -s64m -t4 file.tar

Compress "file.tar" using extreme compression mode of LZMA and a chunk size
of 1GB. Allow pcompress to detect the number of CPU cores and use as many threads.

    pcompress -c lzma -l14 -s1g file.tar

Compression Algorithms
======================

LZFX	- Ultra Fast, average compression. This algorithm is the fastest overall.
	  Levels: 1 - 5
LZ4	- Very Fast, better compression than LZFX.
	  Levels: 1 - 3
Zlib	- Fast, better compression.
	  Levels: 1 - 9
Bzip2	- Slow, much better compression than Zlib.
	  Levels: 1 - 9
LZMA	- Very slow. Extreme compression.
	  Levels: 1 - 14
PPMD	- Slow. Extreme compression for Text, average compression for binary.
	  Levels: 1 - 14

Adapt	- Very slow synthetic mode. Both Bzip2 and PPMD are tried per chunk and
	  the better result is selected.
	  Levels: 1 - 14
Adapt2	- Ultra slow synthetic mode. Both LZMA and PPMD are tried per chunk and
	  the better result is selected. Can give the best compression ratio when
	  splitting a file into multiple chunks.
	  Levels: 1 - 14

It is possible for a single chunk to span the entire file if enough RAM is
available. However, for adaptive modes to be effective for large files, especially
multi-file archives, splitting into chunks is required so that the best compression
algorithm can be selected for textual and binary portions.
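
The per-chunk selection used by the adaptive modes can be sketched as follows. This is only an illustration of the idea: PPMd is not available in the Python standard library, so the sketch tries `bz2` and `lzma` instead, and the one-byte tag format and the names `CODECS`, `compress_adaptive` and `decompress_adaptive` are assumptions, not the pcompress on-disk format.

```python
import bz2
import lzma

# One-byte tag -> (compress, decompress). bz2 and lzma stand in for the
# Bzip2/PPMD/LZMA combinations that the adaptive modes actually try.
CODECS = {
    b"B": (bz2.compress, bz2.decompress),
    b"L": (lzma.compress, lzma.decompress),
}


def compress_adaptive(chunk):
    """Compress the chunk with every codec and keep the smallest result."""
    candidates = [tag + compress(chunk) for tag, (compress, _) in CODECS.items()]
    return min(candidates, key=len)


def decompress_adaptive(record):
    """Read the tag byte to find out which codec produced the payload."""
    _, decompress = CODECS[record[:1]]
    return decompress(record[1:])
```

Because the choice is made per chunk, an archive mixing text and binary files only benefits if it is split into several chunks; with a single chunk spanning the whole file, one algorithm has to serve both kinds of data. Trying every codec on every chunk is also why these modes are slow.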

Caveats
=======
This utility can gobble up RAM depending on compression algorithm,
compression level, and dedupe being enabled. Larger chunk sizes can give
better compression ratio but at the same time use more RAM.

In some cases, for files smaller than a gigabyte, using Delta Compression in
addition to exact Dedupe can have a slight negative impact on the LZMA compression
ratio, especially when using the large-window ultra compression levels above 12.