diff --git a/index.html b/index.html
index 5fc0e7f..26a0a20 100644
--- a/index.html
+++ b/index.html
@@ -43,12 +43,50 @@

Pcompress is an attempt to revisit Data Compression using unique combinations of existing and some new techniques. Both high compression ratio and performance are key goals, along with the ability to leverage all the cores on a multi-core CPU. It also aims to provide scalable, high-throughput Global Deduplication for archival storage. The deduplication capability is also available in single-file compression modes. Other projects providing some of these features include Lrzip and eXdupe. Full archivers providing similar features include the excellent FreeArc and PeaZIP. Pcompress is not an archiver but provides a unique combination of features to both maximize compression ratio and provide high speed.

-

Pcompress can do both compression and decompression in parallel by splitting input data into chunks. It has a modular structure and includes support for multiple algorithms like LZMA, Bzip2, PPMD, etc, with SKEIN/SHA checksums for data integrity. It can also do Lempel-Ziv-Prediction pre-compression (derived from libbsc) to improve compression ratios across the board. SSE optimizations for the bundled LZMA are included. It also implements chunk-level Content-Aware Deduplication and Delta Compression features -based on a rolling hash algorithm derived from the Rabin Fingerprinting approach. Other open-source deduplication software like OpenDedup and LessFS use fixed block dedupe while BackupPC does file-level dedupe only (single-instance storage). Of course OpenDedup and LessFS are Fuse based filesystems doing inline dedupe of primary storage while Pcompress is only meant for archival storage as of today.
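The content-aware chunking idea mentioned above can be illustrated with a minimal Python sketch. It uses a plain polynomial rolling hash over a sliding window as a stand-in for the Rabin-derived hash, and every constant (`WINDOW`, `BASE`, `MASK`, `MIN_CHUNK`, `MAX_CHUNK`) as well as the function names are assumptions chosen for illustration, not values or APIs taken from Pcompress.

```python
import hashlib

WINDOW = 16            # bytes in the sliding window (assumed value)
BASE = 257             # polynomial base (assumed value)
MOD = (1 << 61) - 1    # large prime modulus
MASK = (1 << 10) - 1   # cut when the low 10 bits of the hash are zero
MIN_CHUNK = 256        # never cut before this many bytes
MAX_CHUNK = 8192       # always cut at this many bytes


def chunk_boundaries(data: bytes):
    """Yield (start, end) offsets of content-defined chunks."""
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window
    start = 0
    h = 0
    for i, byte in enumerate(data):
        pos = i - start               # offset inside the current chunk
        if pos >= WINDOW:
            # Slide the window: remove the contribution of the oldest byte.
            h = (h - data[i - WINDOW] * top) % MOD
        h = (h * BASE + byte) % MOD
        length = pos + 1
        if (length >= MIN_CHUNK and (h & MASK) == 0) or length >= MAX_CHUNK:
            yield start, i + 1
            start = i + 1
            h = 0
    if start < len(data):
        yield start, len(data)        # trailing partial chunk


def dedupe(data: bytes):
    """Return (store, recipe): unique chunks by digest, and digests in order."""
    store = {}
    recipe = []
    for s, e in chunk_boundaries(data):
        chunk = data[s:e]
        digest = hashlib.sha256(chunk).digest()
        store.setdefault(digest, chunk)   # keep only the first copy
        recipe.append(digest)
    return store, recipe
```

Because a cut depends only on the last `WINDOW` bytes, inserting data early in a stream shifts later boundaries along with the content, which is what lets identical regions still be recognized. A fixed-block scheme loses that alignment after any insertion.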

+

Features

-

Delta Compression is implemented via the widely popular bsdiff algorithm. Chunk Similarity is detected using an adaptation of MinHashing. It has low metadata overhead and overlaps I/O and compression to achieve maximum parallelism. It also bundles a simple mempool allocator to speed repeated allocation of similar chunks. It can work in pipe mode, reading from stdin and writing to stdout. It also provides adaptive compression modes in which some simple data heuristics are applied in an attempt to select a good algorithm per chunk.
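As a rough illustration of how MinHashing can flag similar chunks as candidates for delta compression, here is a small Python sketch. It is a textbook MinHash over byte shingles, not Pcompress's adaptation; the shingle length, the number of hash functions, the use of salted BLAKE2b, and all function names are assumptions made for this example.

```python
import hashlib


def shingles(data: bytes, k: int = 8):
    """Return the set of all k-byte substrings of data."""
    return {data[i:i + k] for i in range(max(len(data) - k + 1, 1))}


def minhash_signature(data: bytes, num_hashes: int = 32, k: int = 8):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    items = shingles(data, k)
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")   # a different salt per hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s, digest_size=8, salt=salt).digest(), "little")
            for s in items))
    return signature


def estimated_similarity(sig_a, sig_b):
    """Fraction of matching positions; estimates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)
```

Two chunks whose signatures mostly agree are likely to share most of their content, so one can be stored as a bsdiff-style delta against the other. Comparing short signatures is far cheaper than comparing the chunks themselves.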

- -

Pcompress also supports encryption via AES and uses Scrypt from Tarsnap for secure Password Based Key generation.
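The password-to-key step can be sketched with the scrypt function in Python's standard library. This only illustrates the general technique: Pcompress uses the Tarsnap scrypt code, and the cost parameters, key length, and the `derive_key` helper below are assumptions for the example. The AES step is omitted because the Python standard library has no AES implementation.

```python
import hashlib
import os


def derive_key(password: str, salt: bytes, key_len: int = 32) -> bytes:
    """Derive a fixed-length key from a password with scrypt.

    n, r and p are scrypt's cost parameters; the values here are
    illustrative assumptions, not Pcompress's settings.
    """
    return hashlib.scrypt(
        password.encode("utf-8"),
        salt=salt,
        n=2**14, r=8, p=1,
        maxmem=64 * 1024 * 1024,   # allow the ~16 MiB this setting needs
        dklen=key_len,
    )


def new_salt(size: int = 16) -> bytes:
    """Random salt to store alongside the encrypted data."""
    return os.urandom(size)
```

The salt must be saved with the output so the same key can be derived again at decryption time. The deliberately expensive derivation is what slows down brute-force password guessing.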

+

Other open-source deduplication software like OpenDedup and LessFS use fixed block dedupe only. Some software like BackupPC does file-level dedupe only (single-instance storage). Of course OpenDedup and LessFS are Fuse based filesystems doing inline dedupe of primary storage while Pcompress is only meant for archival storage as of today.

NOTE: This utility is Not an archiver. It compresses only single files or datastreams. To archive use something else like tar, cpio or pax.

@@ -72,7 +110,7 @@ based on a rolling hash algorithm derived from the Rabin Fingerprinting approach

Release Downloads

-

https://github.com/moinakg/pcompress/downloads

+

http://code.google.com/p/pcompress/downloads/list

Usage

@@ -139,6 +177,14 @@ Other flags:
 '-L' - Enable LZP pre-compression. This improves compression ratio of all algorithms with some extra CPU and very low RAM overhead. Using delta encoding in conjunction with this may not always be beneficial.
+
+ '-P' - Enable Adaptive Delta Encoding. It can improve compression ratio further
+ for data containing tables of numerical values especially if those are
+ in an arithmetic series. In this implementation basic Delta Encoding is
+ combined with Run-Length encoding and Matrix transpose
+ NOTE - Both -L and -P can be used together to give maximum benefit on most
+ datasets.
+
 '-S' <cksum> - Specify chunk checksum to use: CRC64, SKEIN256, SKEIN512, SHA256 and SHA512. Default one is SKEIN256. The implementation actually uses SKEIN
@@ -212,7 +258,10 @@ LZ4 - Very Fast, better compression than LZFX.
 Zlib - Fast, better compression. Levels: 1 - 9
 Bzip2 - Slow, much better compression than Zlib.
- Levels: 1 - 9

+ Levels: 1 - 9
+Libbsc - A new Block-Sorting compressor similar conceptually to Bzip2 but gives
+ much better compression.
+ Levels: 1 - 9

LZMA - Very slow. Extreme compression. Levels: 1 - 14
@@ -222,6 +271,9 @@ Bzip2 - Slow, much better compression than Zlib.
RAM. Use these levels only if you have at the minimum 4GB RAM on your system.

+

LzmaMt - Extreme compression, faster than plain LZMA as it is multithreaded.
+ Compression ratio is only slightly less than plain LZMA.

+

PPMD - Slow. Extreme compression for Text, average compression for binary. In addition PPMD decompression time is also high for large chunks. This requires lots of RAM similar to LZMA.
@@ -240,13 +292,27 @@ Adapt2 - Ultra slow synthetic mode. Both LZMA and PPMD are tried per chunk and
dedupe, it uses up to 3.5GB physical RAM and requires 6GB of virtual memory space.

-

It is possible for a single chunk to span the entire file if enough RAM is available. However for adaptive modes to be effective for large files, especially multi-file archives splitting into chunks is required so that best compression algorithm can be selected for textual and binary portions.

+

It is possible for a single chunk to span the entire file if enough RAM is available. However, for adaptive modes to be effective for large files, especially multi-file archives, splitting into chunks is required so that the best compression algorithm can be selected for textual and binary portions.
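To make the per-chunk selection idea concrete, here is a minimal Python sketch that compresses every chunk with several codecs and keeps the smallest result. It is a brute-force stand-in using only codecs from the Python standard library (zlib, bz2, lzma), since PPMD is not available there; the chunk size and the function names are assumptions for the example and do not reflect Pcompress's heuristics.

```python
import bz2
import lzma
import zlib

# Codec name -> (compress, decompress); all from the standard library.
CODECS = {
    "zlib": (zlib.compress, zlib.decompress),
    "bzip2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}


def compress_adaptive(data: bytes, chunk_size: int = 1 << 20):
    """Split data into chunks and keep the smallest encoding of each chunk."""
    out = []
    for off in range(0, len(data), chunk_size):
        chunk = data[off:off + chunk_size]
        # Try every codec on this chunk and pick the shortest output.
        name, packed = min(
            ((n, comp(chunk)) for n, (comp, _) in CODECS.items()),
            key=lambda item: len(item[1]))
        out.append((name, packed))    # remember which codec won
    return out


def decompress_adaptive(chunks):
    """Reverse compress_adaptive using the codec recorded per chunk."""
    return b"".join(CODECS[name][1](packed) for name, packed in chunks)
```

With a single chunk spanning the whole file, only one codec could be chosen for everything. Splitting lets a textual region and a binary region in the same input end up with different codecs.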

-

Caveats

+

Memory Usage

-

This utility is not meant for resource constrained environments. Minimum memory usage (RES/RSS) with barely meaningful settings is around 10MB. This occurs when using the minimal LZFX compression algorithm at level 2 with a 1MB chunk size and running 2 threads.

+

As can be seen from the above, memory usage can vary greatly based on compression/
+pre-processing algorithms and chunk size. A variety of configurations are possible
+depending on resource availability in the system.

-

Normally this utility requires lots of RAM depending on compression algorithm, compression level, and dedupe being enabled. Larger chunk sizes can give better compression ratio but at the same time use more RAM.

+

The minimum meaningful settings that still give about 50% compression
+ratio and very high speed use the LZFX algorithm with a 1MB chunk size and 2
+threads:

+ +
    pcompress -c lzfx -l2 -s1m -t2 <file>
+
+ +

This uses about 6MB of physical RAM (RSS). Earlier versions of the utility before
+the 0.9 release consumed much more memory. This was improved in the later versions.
+When using Linux the virtual memory consumption may appear to be very high but it
+is just address space usage rather than actual RAM and should be ignored. It is only
+the RSS that matters. This is a result of the memory arena mechanism in Glibc that
+improves malloc() performance for multi-threaded applications.
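One way to compare the two figures on Linux is to read the `VmSize` (virtual address space) and `VmRSS` (resident memory) fields from `/proc/<pid>/status`. The Python sketch below does that; the helper names are hypothetical and the approach assumes a Linux procfs, so it is not something Pcompress itself provides.

```python
def parse_proc_status(text: str) -> dict:
    """Extract VmSize and VmRSS (both in kB) from /proc/<pid>/status text."""
    fields = {}
    for line in text.splitlines():
        key, _, value = line.partition(":")
        if key in ("VmSize", "VmRSS"):
            # The value looks like "   123456 kB"; keep the number only.
            fields[key] = int(value.split()[0])
    return fields


def memory_of(pid="self") -> dict:
    """Read the status file of a process (Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        return parse_proc_status(f.read())
```

Calling `memory_of(pid)` with the process id of a running pcompress would return both numbers side by side. A `VmSize` far above `VmRSS` is the address-space effect described above, not actual RAM use.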