pcompress/params.json

{"name":"Pcompress","tagline":"A Parallel Compression and Deduplication utility","body":"Introduction\r\n============\r\nPcompress is an attempt to revisit **Data Compression** using unique combinations of existing and some new techniques. Both high compression ratio and performance are key goals along with the ability to leverage all the cores on a multi-core CPU. It also aims to bring to the table scalable, high-throughput Global **Deduplication** of archival storage. The deduplication capability is also available for single-file compression modes providing very interesting capabilities. Other projects providing some of these features include [Lrzip](http://ck.kolivas.org/apps/lrzip/), [eXdupe](http://www.exdupe.com/). Full archivers providing some of the similar features include the excellent [FreeArc](http://freearc.org/) and [PeaZIP](http://peazip.sourceforge.net/). Pcompress is not an archiver but provides a unique combination of features to both maximize compression ratio and provide high speed.\r\n\r\nFeatures\r\n========\r\n* **Parallel**: Compress and Decompress in parallel by splitting input data into chunks. With Content-Aware Deduplication chunks are split at a content-defined boundary to improve Dedulication and compression.\r\n* **Scalable**: Chunks are independent and can scale to any number of cores provided enough memory is available.\r\n* **Deduplication**: High-speed Content-aware chunk-level Deduplication based on Rabin fingerprinting. Duplicate comparison uses exact byte-for-byte comparison and techniques to reduce Dedupe index size.\r\n* **Delta Compression**: Deduplication also provides Delta Compression of closely matching chunks using [Bsdiff](http://www.daemonology.net/bsdiff/). [Minhashing](http://en.wikipedia.org/wiki/MinHash) is used to detect similar chunks.\r\n* **Fixed Block option**: Fixed block Deduplication is also supported and works extremely fast.\r\n* **Metadata Compression**: The Dedupe Index is transformed and compressed.\r\n* **Multiple Algorithms**: Support for multiple compression algorithms like LZMA, LZMA-Multithreaded, Bzip2, PPMD, LZ4 etc. Adaptive modes allow selecting an algorithm per chunk based on heuristics.\r\n* **Strong Data Integrity**: Strong Data Integrity verification with option of using BLAKE2, SHA2 or KECCAK. Headers are also checksummed using CRC32.\r\n* **Filters**: Pre-compression filters: LZP, Delta2. These improve compression ratio across the board at a little extra computational cost.\r\n* **LZP**: LZP (Lempel-Ziv Prediction) searches for repeating patterns of bytes.\r\n* **Delta2**: Delta2 Encoding probes for embedded tables of numeric data and Run Length encodes arithmetic sequences at high throughput.\r\n* **Matrix Transform**: A form of [Matrix transpose](http://moinakg.wordpress.com/2012/12/13/linear-algebra-meets-data-compression/) is used to better compress the Dedupe Index.\r\n* **Encryption**: Support for AES Encryption using Key generation based on the strong [Scrypt](http://en.wikipedia.org/wiki/Scrypt) algorithm. AES is used in CTR mode.\r\n* **Message Authentication**: Encryption mode uses HMAC, Skein MAC or Keccak MAC for Data Integrity and Authentication. 
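And the second sketch, for the **Delta2** filter: scan a buffer as fixed-width integers and run-length encode spans whose successive differences are constant (arithmetic sequences). The 32-bit width, the run threshold and the names `delta2_scan`, `get_le32` and `put_le32` are assumptions for the demo; the real Delta2 filter probes for embedded tables of numeric data and has its own encoding format.

```c
#include <stdio.h>
#include <stdint.h>

/* Read/write little-endian 32-bit values, independent of host endianness. */
static uint32_t get_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}
static void put_le32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)v;         p[1] = (uint8_t)(v >> 8);
    p[2] = (uint8_t)(v >> 16); p[3] = (uint8_t)(v >> 24);
}

/* Find arithmetic runs of 32-bit values and print (offset, start, delta,
 * count) tuples. A real encoder would emit these tuples in place of the
 * raw bytes and pass everything else through as literals. */
static void delta2_scan(const uint8_t *buf, size_t len)
{
    size_t n = len / 4, i = 0;
    while (i + 1 < n) {
        uint32_t delta = get_le32(buf + 4 * (i + 1)) - get_le32(buf + 4 * i);
        size_t run = 2;
        while (i + run < n &&
               get_le32(buf + 4 * (i + run)) -
               get_le32(buf + 4 * (i + run - 1)) == delta)
            run++;
        if (run >= 4) {  /* only worth encoding when the run is long enough */
            printf("offset %zu: start=%u delta=%u count=%zu\n",
                   4 * i, get_le32(buf + 4 * i), delta, run);
            i += run;
        } else {
            i++;
        }
    }
}

int main(void)
{
    /* Demo buffer: an embedded "table" of 40 numbers stepping by 8,
     * the kind of numeric sequence Delta2 targets. */
    uint8_t buf[40 * 4];
    for (uint32_t v = 0; v < 40; v++)
        put_le32(buf + 4 * v, 1000 + 8 * v);
    delta2_scan(buf, sizeof(buf));
    return 0;
}
```

In this demo the 160-byte table collapses to a single (start, delta, count) tuple, and the point of such a filter is that the transformed output compresses far better in the main compression stage.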