<p>Pcompress is an attempt to revisit <strong>Data Compression</strong> using unique combinations of existing and some new techniques. Both high compression ratio and performance are key goals, along with the ability to leverage all the cores on a multi-core CPU. It also aims to provide scalable, high-throughput Global <strong>Deduplication</strong> of archival storage. The deduplication capability is also available in single-file compression modes. Other projects providing some of these features include <a href="http://ck.kolivas.org/apps/lrzip/">Lrzip</a> and <a href="http://www.exdupe.com/">eXdupe</a>. Full archivers providing some similar features include the excellent <a href="http://freearc.org/">FreeArc</a> and <a href="http://peazip.sourceforge.net/">PeaZIP</a>. Pcompress is not an archiver, but it provides a unique combination of features to both maximize compression ratio and provide high speed.</p>
<p>Pcompress can do both compression and decompression in parallel by splitting input data into chunks. It has a modular structure and includes support for multiple algorithms such as LZMA, Bzip2 and PPMD, with SKEIN/SHA checksums for data integrity. It can also do Lempel-Ziv-Prediction pre-compression (derived from libbsc) to improve compression ratios across the board. SSE optimizations for the bundled LZMA are included. It also implements chunk-level Content-Aware Deduplication and Delta Compression features based on a rolling hash algorithm derived from the Rabin Fingerprinting approach. Other open-source deduplication software such as <a href="http://opendedup.org/">OpenDedup</a> and <a href="http://www.lessfs.com/wordpress/">LessFS</a> use fixed-block dedupe, while <a href="http://backuppc.sourceforge.net/">BackupPC</a> does file-level dedupe only (single-instance storage). OpenDedup and LessFS, however, are FUSE-based filesystems that do inline dedupe of primary storage, while Pcompress is currently meant only for archival storage.</p>
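<p>To illustrate the difference between content-defined and fixed-block dedupe, here is a minimal Python sketch. It is not Pcompress's implementation: it uses a plain polynomial rolling hash (Rabin-Karp style) rather than the Rabin-derived hash in Pcompress, and the window size, multiplier, mask and chunk limits are all assumed values chosen for readability.</p>

```python
import hashlib

# All constants below are illustrative assumptions, not Pcompress defaults.
WINDOW = 16             # bytes covered by the rolling hash
PRIME = 153191          # multiplier of the polynomial hash
MOD = 1 << 32
MASK = (1 << 11) - 1    # cut when the low 11 bits are zero (about 2 KiB between cuts)
MIN_CHUNK = 256
MAX_CHUNK = 16384


def split_chunks(data: bytes):
    """Yield chunks whose boundaries are chosen from the content itself."""
    pow_out = pow(PRIME, WINDOW - 1, MOD)  # weight of the byte leaving the window
    start = 0
    h = 0
    for i, byte in enumerate(data):
        if i - start >= WINDOW:
            # Slide the window: drop the contribution of the oldest byte.
            h = (h - data[i - WINDOW] * pow_out) % MOD
        h = (h * PRIME + byte) % MOD
        length = i - start + 1
        if (length >= MIN_CHUNK and (h & MASK) == 0) or length >= MAX_CHUNK:
            yield data[start:i + 1]
            start = i + 1
            h = 0
    if start < len(data):
        yield data[start:]


def dedupe(chunks):
    """Store each distinct chunk once; return the store and the ordered recipe."""
    store = {}
    recipe = []
    for chunk in chunks:
        digest = hashlib.sha256(chunk).digest()
        store.setdefault(digest, chunk)
        recipe.append(digest)
    return store, recipe


def rebuild(store, recipe) -> bytes:
    """Reassemble the original data from the chunk store and recipe."""
    return b"".join(store[d] for d in recipe)
```

<p>Because a cut depends only on the last few bytes seen, inserting data near the start of a file tends to shift only the nearby boundaries, so later chunks can still match earlier copies. With fixed-size blocks, every block after the insertion point would change.</p>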
<p>Delta Compression is implemented via the widely popular bsdiff algorithm. Chunk similarity is detected using an adaptation of <a href="http://en.wikipedia.org/wiki/MinHash">MinHashing</a>. Pcompress has low metadata overhead and overlaps I/O and compression to achieve maximum parallelism. It also bundles a simple mempool allocator to speed up repeated allocation of similar chunks. It can work in pipe mode, reading from stdin and writing to stdout. It also provides adaptive compression modes in which some simple data heuristics are applied in an attempt to select a good algorithm per chunk.</p>
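<p>The idea behind MinHash-based similarity detection can be sketched as follows. This is textbook MinHash over byte shingles, not the adaptation Pcompress uses; the shingle size, the number of hash functions and the use of salted BLAKE2b are assumptions made for the example.</p>

```python
import hashlib

NUM_HASHES = 32   # signature length (assumed)
SHINGLE = 8       # bytes per shingle (assumed)


def shingles(chunk: bytes, size: int = SHINGLE) -> set:
    """Return the set of overlapping byte substrings of the chunk."""
    return {chunk[i:i + size] for i in range(max(1, len(chunk) - size + 1))}


def minhash_signature(chunk: bytes, num_hashes: int = NUM_HASHES) -> list:
    """For each salted hash function, keep the minimum hash over all shingles."""
    items = shingles(chunk)
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(4, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s, digest_size=8, salt=salt).digest(), "little")
            for s in items))
    return signature


def estimated_similarity(sig_a: list, sig_b: list) -> float:
    """Fraction of matching positions, which estimates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)
```

<p>Two chunks whose signatures mostly agree are likely to share most of their content, which makes them good candidates for delta encoding against each other instead of being compressed independently.</p>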
<p>Pcompress also supports encryption via AES and uses Scrypt from <a href="http://www.tarsnap.com/">Tarsnap</a> for secure password-based key generation.</p>
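<p>As a rough illustration of password-based key generation with scrypt, the sketch below uses Python's standard <code>hashlib.scrypt</code>. The cost parameters (n, r, p), the salt length and the 32-byte key length are assumptions for the example and are not taken from Pcompress. The AES step is omitted because the Python standard library does not provide AES.</p>

```python
import hashlib
import os


def derive_key(password: str, salt: bytes, key_len: int = 32) -> bytes:
    """Derive a fixed-length key from a password using scrypt.

    n, r and p are assumed cost parameters; higher values make
    brute-force guessing more expensive in both time and memory.
    """
    return hashlib.scrypt(
        password.encode("utf-8"),
        salt=salt,
        n=2 ** 14,
        r=8,
        p=1,
        dklen=key_len,
    )


def new_salt(length: int = 16) -> bytes:
    """Generate a random salt to be stored alongside the encrypted data."""
    return os.urandom(length)
```

<p>The salt is not secret, but it must be kept with the archive so that the same key can be derived again at decryption time.</p>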
<p>Set ALLOCATOR_BYPASS=1 in the environment to avoid using the built-in allocator. Because it rounds up each allocation request to the nearest slab, the built-in allocator can allocate extra unused memory. In addition, you may want to use a different allocator in your environment.</p>
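<p>The effect of rounding a request up to a slab can be shown with a small sketch. The power-of-two slab sizes and the 64-byte minimum are assumptions for illustration only; the slab sizes the built-in allocator really uses are not described here.</p>

```python
def round_to_slab(request: int, min_slab: int = 64) -> int:
    """Round a request up to the next slab size (hypothetical power-of-two slabs)."""
    slab = min_slab
    while slab < request:
        slab *= 2
    return slab


def wasted_bytes(request: int) -> int:
    """Bytes allocated beyond what was actually requested."""
    return round_to_slab(request) - request
```

<p>Under these assumptions a request of 1025 bytes is served from a 2048-byte slab, leaving 1023 bytes unused. Overhead of this kind is what bypassing the allocator avoids.</p>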
<p>Compress "file.tar" using bzip2 at level 6 with a 64MB chunk size and 4 threads. In addition, perform identity deduplication and delta compression prior to compression.</p>
<p>Compress "file.tar" using the extreme compression mode of LZMA and a chunk size of 1GB. Allow pcompress to detect the number of CPU cores and use as many threads.</p>
<p>It is possible for a single chunk to span the entire file if enough RAM is available. However, for adaptive modes to be effective on large files, especially multi-file archives, splitting into chunks is required so that the best compression algorithm can be selected separately for textual and binary portions.</p>
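<p>The interaction between chunking and adaptive selection can be sketched in Python. This is a simplified illustration, not Pcompress's heuristics or container format: the "mostly printable bytes" test, the choice of bzip2 for text and LZMA for binary data, and the one-byte codec tag are all assumptions.</p>

```python
import bz2
import lzma
from concurrent.futures import ThreadPoolExecutor

# One-byte tag -> (compress, decompress). The tagging scheme is hypothetical.
CODECS = {
    b"B": (bz2.compress, bz2.decompress),
    b"L": (lzma.compress, lzma.decompress),
}


def looks_like_text(chunk: bytes, sample: int = 4096) -> bool:
    """Crude heuristic: treat the chunk as text if most sampled bytes are printable."""
    head = chunk[:sample]
    if not head:
        return False
    printable = sum(1 for b in head if 32 <= b < 127 or b in (9, 10, 13))
    return printable / len(head) > 0.9


def compress_chunk(chunk: bytes) -> bytes:
    """Pick a codec for this chunk only and prefix the output with its tag."""
    tag = b"B" if looks_like_text(chunk) else b"L"
    return tag + CODECS[tag][0](chunk)


def decompress_chunk(blob: bytes) -> bytes:
    """Read the tag and undo the matching codec."""
    return CODECS[blob[:1]][1](blob[1:])


def compress_stream(data: bytes, chunk_size: int = 1 << 20, threads: int = 4) -> list:
    """Split data into fixed-size chunks and compress them in parallel."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(compress_chunk, chunks))


def decompress_stream(blobs: list) -> bytes:
    """Decompress every chunk and concatenate the results in order."""
    return b"".join(decompress_chunk(b) for b in blobs)
```

<p>With a single chunk spanning the whole file, the heuristic would run once and one codec would be applied to everything. Smaller chunks let text and binary regions of the same archive receive different codecs, at the cost of less context per chunk.</p>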
<p>This utility is not meant for resource-constrained environments. Minimum memory usage (RES/RSS) with barely meaningful settings is around 10MB. This occurs when using the minimal LZFX compression algorithm at level 2 with a 1MB chunk size and 2 threads.</p>
<p>Normally this utility requires a lot of RAM, depending on the compression algorithm, the compression level, and whether dedupe is enabled. Larger chunk sizes can give a better compression ratio but also use more RAM.</p>