<p>Pcompress is an attempt to revisit <strong>Data Compression</strong> using unique combinations of existing and some new techniques. Both high compression ratio and performance are key goals, along with the ability to leverage all the cores on a multi-core CPU. It also aims to bring to the table scalable, high-throughput Global <strong>Deduplication</strong> of archival storage. The deduplication capability is also available in single-file compression modes, allowing redundancy within a single file to be detected and removed. Other projects providing some of these features include <a href="http://ck.kolivas.org/apps/lrzip/">Lrzip</a> and <a href="http://www.exdupe.com/">eXdupe</a>. Full archivers providing some similar features include the excellent <a href="http://freearc.org/">FreeArc</a> and <a href="http://peazip.sourceforge.net/">PeaZIP</a>. Pcompress is not an archiver but provides a unique combination of features to both maximize compression ratio and provide high speed.</p>
<ul>
<li>
<strong>Parallel</strong>: Compress and Decompress in parallel by splitting input data into chunks. With Content-Aware Deduplication, chunks are split at content-defined boundaries to improve Deduplication and compression.</li>
<li>
<strong>Scalable</strong>: Chunks are independent and can scale to any number of cores provided enough memory is available.</li>
<li>
<strong>Deduplication</strong>: High-speed Content-aware chunk-level Deduplication based on Rabin fingerprinting. Duplicate comparison uses exact byte-for-byte comparison and techniques to reduce Dedupe index size.</li>
<li>
<strong>Global Deduplication</strong>: Scalable, fast Content-aware Deduplication across the entire dataset. Duplicate comparison uses cryptographic hash based matching to detect duplicate blocks. By default SHA256 is used but this can be changed via an environment variable. Global Deduplication can reach petascale using only a relatively tiny in-memory index (see the example invocation after this list).</li>
<li>
<strong>Delta Compression</strong>: Deduplication also provides Delta Compression of closely matching chunks using <a href="http://www.daemonology.net/bsdiff/">Bsdiff</a>. <a href="http://en.wikipedia.org/wiki/MinHash">Minhashing</a> is used to detect similar chunks.</li>
<li>
<strong>Fixed Block option</strong>: Fixed block Deduplication is also supported and works extremely fast.</li>
<li>
<strong>Metadata Compression</strong>: The Dedupe Index is transformed and compressed.</li>
<li>
<strong>Multiple Algorithms</strong>: Support for multiple compression algorithms like LZMA, LZMA-Multithreaded, Bzip2, PPMD, LZ4 etc. Adaptive modes allow selecting an algorithm per chunk based on heuristics.</li>
<li>
<strong>Strong Data Integrity</strong>: Data integrity verification with the option of using BLAKE2, SHA2 or KECCAK. Headers are also checksummed using CRC32.</li>
<li>
<strong>Filters</strong>: Pre-compression filters: LZP and Delta2. These improve compression ratio across the board at a small extra computational cost.</li>
<li>
<strong>LZP</strong>: LZP (Lempel-Ziv Prediction) searches for repeating patterns of bytes.</li>
<li>
<strong>Delta2</strong>: Delta2 Encoding probes for embedded tables of numeric data and Run Length encodes arithmetic sequences at high throughput.</li>
<li>
<strong>Matrix Transform</strong>: A form of <a href="http://moinakg.wordpress.com/2012/12/13/linear-algebra-meets-data-compression/">Matrix transpose</a> is used to better compress the Dedupe Index.</li>
<li>
<strong>Encryption</strong>: Support for AES Encryption using Key generation based on the strong <a href="http://en.wikipedia.org/wiki/Scrypt">Scrypt</a> algorithm. AES is used in CTR mode.</li>
<li>
<strong>Message Authentication</strong>: Encryption mode uses HMAC, Skein MAC or Keccak MAC for Data Integrity and Authentication. The MAC approach from iSCSI is followed for improved security (<a href="http://tonyarcieri.com/all-the-crypto-code-youve-ever-written-is-probably-broken">http://tonyarcieri.com/all-the-crypto-code-youve-ever-written-is-probably-broken</a>).</li>
<li>
<strong>Overlapped processing</strong>: Overlapped computation and I/O to maximize throughput.</li>
<li>
<strong>Streamable</strong>: Ability to work in streaming pipe mode reading from stdin and writing to stdout.</li>
<li>
<strong>Custom Allocator</strong>: Uses an internal mempool allocator to speed up repeated allocation of similarly sized chunks. Option to disable this at runtime is provided.</li>
<li>
<strong>Solid Mode</strong>: Given enough available memory an entire file can be compressed inside a single chunk. This however is mostly a single-threaded operation.</li>
<li>
<strong>Padding</strong>: A compressed archive or file can be zero-padded to round off to a multiple of a block size for certain storage media like Tapes.</li>
</ul>
<p>Other open-source deduplication software like <a href="http://opendedup.org/">OpenDedup</a> and <a href="http://www.lessfs.com/wordpress/">LessFS</a> use fixed-block dedupe only. Some software like <a href="http://backuppc.sourceforge.net/">BackupPC</a> does file-level dedupe only (single-instance storage). Of course OpenDedup and LessFS are FUSE-based filesystems doing inline dedupe of primary storage, while Pcompress is only meant for archival storage as of today.</p>
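<p>As a rough sketch of how the deduplication modes listed above are selected on the command line: the single-letter options below, in particular -G for Global Deduplication, are assumptions based on pcompress's documented usage and should be verified against the usage output of your build.</p>
<pre>
# Content-aware (Rabin) dedupe plus delta compression of similar chunks
# (-D and -E are assumed option letters; verify against the pcompress usage output)
pcompress -D -E -c lzma -l6 -s64m archive.tar

# Global Deduplication across the entire dataset
# (-G is an assumed option letter; verify against the pcompress usage output)
pcompress -G -c lzma -l6 -s64m archive.tar
</pre>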
<ahref="results1.html">Benchmarks Set #1</a>, <ahref="http://moinakg.wordpress.com/2013/05/26/updated-compression-benchmarks/">Analysis</a><br>
<ahref="results2.html">Benchmarks Set #2</a>, <ahref="http://moinakg.wordpress.com/2013/05/27/updated-compression-benchmarks-part-2/">Analysis</a></p>
<p>Set ALLOCATOR_BYPASS=1 in the environment to avoid using the built-in allocator. Because it rounds each allocation request up to the nearest slab size, the built-in allocator can allocate extra unused memory. In addition, you may want to use a different allocator in your environment.</p>
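<p>For example, a run with the allocator bypassed might look like the following; the compression options here are only illustrative:</p>
<pre>
# Disable the built-in mempool allocator for this invocation only
ALLOCATOR_BYPASS=1 pcompress -c lzma -l6 -s64m file.tar
</pre>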
<p>Compress "file.tar" using bzip2 level 6, 64MB chunk size and use 4 threads. In addition perform identity deduplication and delta compression prior to compression.</p>
<p>Compress "file.tar" using extreme compression mode of LZMA and a chunk size of of 1GB. Allow pcompress to detect the number of CPU cores and use as many threads.</p>
<p>It is possible for a single chunk to span the entire file if enough RAM is available. However, for adaptive modes to be effective on large files, especially multi-file archives, splitting into chunks is needed so that the best compression algorithm can be selected for the textual and binary portions.</p>
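<p>For instance, compressing a large multi-file archive with an adaptive mode might look like the sketch below; the adapt2 algorithm name comes from pcompress's documentation, while the 100MB chunk size and level are illustrative choices that keep multiple chunks in play so per-chunk algorithm selection can take effect:</p>
<pre>
# Adaptive mode: each 100MB chunk gets the algorithm the heuristics select for it
pcompress -c adapt2 -l6 -s100m archive.tar
</pre>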