<p>Pcompress is an attempt to revisit <strong>Data Compression</strong> using unique combinations of existing and some new techniques. Both high compression ratio and performance are key goals, along with the ability to leverage all the cores on a multi-core CPU. It also aims to bring scalable, high-throughput Global <strong>Deduplication</strong> to archival storage. The deduplication capability is also available in the single-file compression modes. Other projects providing some of these features include <a href="http://ck.kolivas.org/apps/lrzip/">Lrzip</a> and <a href="http://www.exdupe.com/">eXdupe</a>. Full archivers providing some similar features include the excellent <a href="http://freearc.org/">FreeArc</a> and <a href="http://peazip.sourceforge.net/">PeaZIP</a>. Pcompress is not an archiver, but it provides a unique combination of features to both maximize compression ratio and deliver high speed.</p>
<ul>
<li>
<strong>Parallel</strong>: Compress and decompress in parallel by splitting the input data into chunks. With Content-Aware Deduplication, chunks are split at content-defined boundaries to improve both deduplication and compression.</li>
<li>
<strong>Scalable</strong>: Chunks are independent and can scale to any number of cores provided enough memory is available.</li>
<li>
<strong>Deduplication</strong>: High-speed, content-aware, chunk-level Deduplication based on Rabin fingerprinting (a simplified sketch of content-defined chunking appears after the feature list). Duplicates are confirmed with exact byte-for-byte comparison, and techniques are used to reduce the Dedupe index size.</li>
<li>
<strong>Delta Compression</strong>: Deduplication also provides Delta Compression of closely matching chunks using <a href="http://www.daemonology.net/bsdiff/">Bsdiff</a>. <a href="http://en.wikipedia.org/wiki/MinHash">Minhashing</a> is used to detect similar chunks.</li>
<li>
<strong>Fixed Block option</strong>: Fixed block Deduplication is also supported and works extremely fast.</li>
<li>
<strong>Metadata Compression</strong>: The Dedupe Index is transformed and compressed.</li>
<li>
<strong>Multiple Algorithms</strong>: Support for multiple compression algorithms such as LZMA, LZMA-Multithreaded, Bzip2, PPMD and LZ4. Adaptive modes allow selecting an algorithm per chunk based on heuristics.</li>
<li>
<strong>Strong Data Integrity</strong>: Strong data integrity verification with the option of using SKEIN, SHA2 or KECCAK. Headers are also checksummed using CRC32.</li>
<li>
<strong>Filters</strong>: Pre-compression filters: LZP, Delta2. These improve compression ratio across the board at a little extra computational cost.</li>
<li>
<strong>LZP</strong>: LZP (Lempel-Ziv Prediction) searches for repeating patterns of bytes.</li>
<li>
<strong>Delta2</strong>: Delta2 Encoding probes for embedded tables of numeric data and Run Length encodes arithmetic sequences at high throughput (a simplified sketch of this detection also appears after the feature list).</li>
<li>
<strong>Matrix Transform</strong>: A form of <a href="http://moinakg.wordpress.com/2012/12/13/linear-algebra-meets-data-compression/">Matrix transpose</a> is used to better compress the Dedupe Index.</li>
<li>
<strong>Encryption</strong>: Support for AES encryption with key generation based on the strong <a href="http://en.wikipedia.org/wiki/Scrypt">Scrypt</a> algorithm. AES is used in CTR mode.</li>
<li>
<strong>Message Authentication</strong>: Encryption mode uses HMAC, Skein MAC or Keccak MAC for Data Integrity and Authentication. The MAC approach from iSCSI is followed for improved security (<a href="http://tonyarcieri.com/all-the-crypto-code-youve-ever-written-is-probably-broken">http://tonyarcieri.com/all-the-crypto-code-youve-ever-written-is-probably-broken</a>).</li>
<li>
<strong>Overlapped processing</strong>: Overlapped computation and I/O to maximize throughput.</li>
<li>
<strong>Streamable</strong>: Ability to work in streaming pipe mode reading from stdin and writing to stdout.</li>
<li>
<strong>Custom Allocator</strong>: Uses an internal mempool allocator to speed up repeated allocation of similarly sized chunks. An option to disable this at runtime is provided.</li>
<li>
<strong>Solid Mode</strong>: Given enough available memory an entire file can be compressed inside a single chunk. This however is mostly a single-threaded operation.</li>
<li>
<strong>Padding</strong>: A compressed archive or file can be zero-padded to round off to a multiple of a block size for certain storage media like Tapes.</li>
</ul><p>Other open-source deduplication software like <a href="http://opendedup.org/">OpenDedup</a> and <a href="http://www.lessfs.com/wordpress/">LessFS</a> uses fixed-block dedupe only. Some software, like <a href="http://backuppc.sourceforge.net/">BackupPC</a>, does file-level dedupe only (single-instance storage). Of course, OpenDedup and LessFS are FUSE-based filesystems doing inline dedupe of primary storage, while Pcompress is only meant for archival storage as of today.</p>
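<p>To make the content-aware deduplication above more concrete, the sketch below shows the general shape of content-defined chunking: a rolling hash is computed over a small sliding window and a chunk boundary is declared wherever the hash matches a fixed bit pattern, so boundaries follow the data rather than fixed offsets. This is only an illustration of the idea, not Pcompress code; the simple multiplicative hash stands in for the actual Rabin fingerprint, and the window size, mask, minimum chunk size and function name are arbitrary choices.</p>
<pre><code>#include <stdint.h>
#include <stddef.h>

#define WINDOW    16u        /* sliding-window size in bytes (illustrative) */
#define MASK      0x1FFFu    /* boundary when the low 13 bits are zero (~8 KB average chunk) */
#define MIN_CHUNK 2048u      /* never emit pathologically small chunks */
#define PRIME     153191u    /* multiplier for the polynomial rolling hash */

/*
 * Return the length of the next content-defined chunk starting at buf.
 * A boundary is declared wherever the rolling hash of the last WINDOW
 * bytes matches MASK, so boundaries track the content and survive
 * insertions or deletions elsewhere in the stream.
 */
size_t next_chunk(const unsigned char *buf, size_t len)
{
    uint64_t hash = 0, pow = 1;
    size_t i;

    for (i = 1; i < WINDOW; i++)
        pow *= PRIME;                        /* PRIME^(WINDOW-1), used to drop the oldest byte */

    for (i = 0; i < len; i++) {
        if (i >= WINDOW)
            hash -= pow * buf[i - WINDOW];   /* slide the window: remove the oldest byte */
        hash = hash * PRIME + buf[i];        /* ...and mix in the newest byte */

        if (i + 1 >= MIN_CHUNK && (hash & MASK) == 0)
            return i + 1;                    /* content-defined boundary */
    }
    return len;                              /* final (possibly short) chunk */
}
</code></pre>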
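<p>The Delta2 filter mentioned in the feature list relies on a similar scan-and-detect idea: treat stretches of the input as packed integers and look for runs where consecutive values differ by a constant, which can then be stored as just a start value, a delta and a count. The sketch below shows only this detection step for 32-bit values in host byte order; the stride, run-length threshold and function names are illustrative assumptions, and the real filter probes several integer widths and actually rewrites the stream.</p>
<pre><code>#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <stddef.h>

/*
 * Scan a buffer as an array of 32-bit integers (host byte order) and
 * report runs that form an arithmetic sequence, i.e. a constant
 * difference between neighbours. Each such run could be stored as
 * (start, delta, count) instead of count * 4 bytes.
 */
static void find_arith_runs(const unsigned char *buf, size_t len)
{
    size_t n = len / 4, i = 0;

    while (i + 1 < n) {
        uint32_t prev, cur;
        memcpy(&prev, buf + i * 4, 4);
        memcpy(&cur,  buf + (i + 1) * 4, 4);

        uint32_t delta = cur - prev;          /* candidate common difference */
        size_t   run = 2, j = i + 2;

        for (; j < n; j++, run++) {           /* extend the run while the delta holds */
            uint32_t next;
            memcpy(&next, buf + j * 4, 4);
            if ((uint32_t)(next - cur) != delta)
                break;
            cur = next;
        }

        if (run >= 4)                         /* long enough to be worth encoding */
            printf("run at word %zu: start=%u delta=%u count=%zu\n",
                   i, (unsigned)prev, (unsigned)delta, run);
        i = (run >= 4) ? j : i + 1;
    }
}

int main(void)
{
    unsigned char buf[64];
    uint32_t k;

    for (k = 0; k < 16; k++) {                /* synthetic table: 100, 103, 106, ... */
        uint32_t v = 100 + 3 * k;
        memcpy(buf + k * 4, &v, 4);
    }
    find_arith_runs(buf, sizeof(buf));        /* prints one run: start=100 delta=3 count=16 */
    return 0;
}
</code></pre>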
<p>Set ALLOCATOR_BYPASS=1 in the environment to avoid using the built-in allocator. Due to the way it rounds up an allocation request to the nearest slab, the built-in allocator can allocate extra unused memory. In addition, you may want to use a different allocator in your environment.</p>
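<p>For example, a run with the built-in allocator bypassed could look like the following; the environment variable is the one described above, while the compression options shown are placeholders and should be checked against the pcompress usage output.</p>
<pre><code># Disable the internal mempool allocator for this invocation only
ALLOCATOR_BYPASS=1 pcompress -c lzma -l8 -s64m file.tar
</code></pre>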
<p>Compress "file.tar" using bzip2 level 6, 64MB chunk size and use 4 threads. In addition perform identity deduplication and delta compression prior to compression.</p>
<p>Compress "file.tar" using extreme compression mode of LZMA and a chunk size of of 1GB. Allow pcompress to detect the number of CPU cores and use as many threads.</p>
<p>It is possible for a single chunk to span the entire file if enough RAM is available. However, for adaptive modes to be effective on large files, especially multi-file archives, splitting into chunks is required so that the best compression algorithm can be selected for textual and binary portions.</p>
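<p>For such mixed archives, an adaptive run over moderately sized chunks could look like the example below. The algorithm name "adapt2" (an adaptive mode that can choose among algorithms such as PPMd and LZMA per chunk) and the 128MB chunk size are assumptions used for illustration; consult the pcompress help text for the adaptive mode names supported by your build.</p>
<pre><code>pcompress -c adapt2 -l14 -s128m file.tar
</code></pre>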