<!doctype html>
<html>
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="chrome=1">
<title>Pcompress by moinakg</title>
<link rel="stylesheet" href="stylesheets/styles.css">
<link rel="stylesheet" href="stylesheets/pygment_trac.css">
<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.7.1/jquery.min.js"></script>
<script src="javascripts/main.js"></script>
<!--[if lt IE 9]>
<script src="//html5shiv.googlecode.com/svn/trunk/html5.js"></script>
<![endif]-->
<meta name="viewport" content="width=device-width, initial-scale=1, user-scalable=no">
</head>

<body>

<header>
<h1>Pcompress</h1>
<p>A Parallel Compression and Deduplication utility</p>
</header>

<div id="banner">
<span id="logo"></span>
<a href="https://github.com/moinakg/pcompress" class="button fork"><strong>View On GitHub</strong></a>
<div class="downloads">
<span>Downloads:</span>
<ul>
<li><a href="https://github.com/moinakg/pcompress/zipball/master" class="button">ZIP</a></li>
<li><a href="https://github.com/moinakg/pcompress/tarball/master" class="button">TAR</a></li>
</ul>
</div>
</div><!-- end banner -->

<div class="wrapper">
<nav>
<ul></ul>
</nav>
<section>
<h1>Introduction</h1>
<p>Pcompress is an attempt to revisit <strong>Data Compression</strong> using unique combinations of existing and some new techniques. Both high compression ratio and performance are key goals, along with the ability to leverage all the cores on a multi-core CPU. It also aims to bring to the table scalable, high-throughput Global <strong>Deduplication</strong> of archival storage. The deduplication capability is also available in single-file compression modes, providing very interesting capabilities. Other projects providing some of these features include <a href="http://ck.kolivas.org/apps/lrzip/">Lrzip</a> and <a href="http://www.exdupe.com/">eXdupe</a>. Full archivers providing some similar features include the excellent <a href="http://freearc.org/">FreeArc</a> and <a href="http://peazip.sourceforge.net/">PeaZIP</a>. Pcompress is not an archiver, but provides a unique combination of features that both maximize compression ratio and deliver high speed.</p>
<h1>Features</h1>

<ul>
<li><strong>Parallel</strong>: Compress and Decompress in parallel by splitting input data into chunks. With Content-Aware Deduplication, chunks are split at a content-defined boundary to improve Deduplication and compression.</li>
<li><strong>Scalable</strong>: Chunks are independent and can scale to any number of cores provided enough memory is available.</li>
<li><strong>Deduplication</strong>: High-speed Content-aware chunk-level Deduplication based on Rabin fingerprinting. Duplicate comparison uses exact byte-for-byte comparison along with techniques to reduce the Dedupe index size.</li>
<li><strong>Delta Compression</strong>: Deduplication also provides Delta Compression of closely matching chunks using <a href="http://www.daemonology.net/bsdiff/">Bsdiff</a>. <a href="http://en.wikipedia.org/wiki/MinHash">Minhashing</a> is used to detect similar chunks.</li>
<li><strong>Fixed Block option</strong>: Fixed-block Deduplication is also supported and works extremely fast.</li>
<li><strong>Metadata Compression</strong>: The Dedupe Index is transformed and compressed.</li>
<li><strong>Multiple Algorithms</strong>: Support for multiple compression algorithms such as LZMA, LZMA-Multithreaded, Bzip2, PPMd and LZ4. Adaptive modes allow selecting an algorithm per chunk based on heuristics.</li>
<li><strong>Strong Data Integrity</strong>: Strong Data Integrity verification with the option of using BLAKE2, SHA2 or KECCAK. Headers are also checksummed using CRC32.</li>
<li><strong>Filters</strong>: Pre-compression filters: LZP and Delta2. These improve compression ratio across the board at a little extra computational cost.</li>
<li><strong>LZP</strong>: LZP (Lempel-Ziv Prediction) searches for repeating patterns of bytes.</li>
<li><strong>Delta2</strong>: Delta2 Encoding probes for embedded tables of numeric data and Run-Length encodes arithmetic sequences at high throughput.</li>
<li><strong>Matrix Transform</strong>: A form of <a href="http://moinakg.wordpress.com/2012/12/13/linear-algebra-meets-data-compression/">Matrix transpose</a> is used to better compress the Dedupe Index.</li>
<li><strong>Encryption</strong>: Support for AES Encryption using Key generation based on the strong <a href="http://en.wikipedia.org/wiki/Scrypt">Scrypt</a> algorithm. AES is used in CTR mode.</li>
<li><strong>Message Authentication</strong>: Encryption mode uses HMAC, Skein MAC or Keccak MAC for Data Integrity and Authentication. The MAC approach from iSCSI is followed for improved security (see <a href="http://tonyarcieri.com/all-the-crypto-code-youve-ever-written-is-probably-broken">http://tonyarcieri.com/all-the-crypto-code-youve-ever-written-is-probably-broken</a>).</li>
<li><strong>Metadata</strong>: Low metadata overhead.</li>
<li><strong>Overlapped processing</strong>: Overlapped computation and I/O to maximize throughput.</li>
<li><strong>Streamable</strong>: Ability to work in streaming pipe mode, reading from stdin and writing to stdout.</li>
<li><strong>Custom Allocator</strong>: Uses an internal mempool allocator to speed up repeated allocation of similarly sized chunks. An option to disable this at runtime is provided.</li>
<li><strong>Solid Mode</strong>: Given enough available memory, an entire file can be compressed inside a single chunk. This is, however, mostly a single-threaded operation.</li>
<li><strong>Padding</strong>: A compressed archive or file can be zero-padded to round it off to a multiple of a block size for certain storage media like Tapes.</li>
</ul>

<p>Other open-source deduplication software like <a href="http://opendedup.org/">OpenDedup</a> and <a href="http://www.lessfs.com/wordpress/">LessFS</a> use fixed-block dedupe only. Some software like <a href="http://backuppc.sourceforge.net/">BackupPC</a> does file-level dedupe only (single-instance storage). Of course, OpenDedup and LessFS are Fuse-based filesystems doing inline dedupe of primary storage, while Pcompress is only meant for archival storage as of today.</p>
<p>NOTE: This utility is not an archiver. It compresses only single files or data streams. To archive, use something else like tar, cpio or pax.</p>
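<p>For example, since Pcompress can operate in streaming pipe mode (the '-p' flag described under Usage), tar can do the archiving while pcompress compresses the stream. This is only a sketch; the directory path, output name and option values are illustrative:</p>

<pre><code># Archive with tar, then compress the stream in pipe mode.
# 'backup.tar.pz' is an arbitrary name for the redirected output stream.
tar cf - /path/to/dir | pcompress -p -c lzma -l6 -s64m > backup.tar.pz
</code></pre>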
<h1>NEWS</h1>

<p>Blog: <a href="https://moinakg.wordpress.com/tag/pcompress/">https://moinakg.wordpress.com/tag/pcompress/</a>.</p>

<p>Releases: <a href="http://freecode.com/projects/pcompress">http://freecode.com/projects/pcompress</a></p>

<h1>Compression Benchmarks</h1>

<p><a href="https://moinakg.wordpress.com/2012/11/01/compression-benchmarks/">Benchmarks Part #1</a></p>

<p><a href="https://moinakg.wordpress.com/2012/11/03/compression-benchmarks-2/">Benchmarks Part #2</a></p>

<h1>Deduplication Chunking Analysis</h1>

<p><a href="https://moinakg.wordpress.com/2012/11/11/inside-content-defined-chunking-in-pcompress/">Content Defined Chunking #1</a></p>

<p><a href="https://moinakg.wordpress.com/2012/11/15/inside-content-defined-chunking-in-pcompress-part-2/">Content Defined Chunking #2</a></p>

<h1>Release Downloads</h1>

<p><a href="http://code.google.com/p/pcompress/downloads/list">http://code.google.com/p/pcompress/downloads/list</a></p>

<h1>Usage</h1>
<pre><code>To compress a file:
   pcompress -c <algorithm> [-l <compress level>] [-s <chunk size>] <file>
   Where <algorithm> can be the following:
   lzfx   - Very fast and small algorithm based on LZF.
   lz4    - Ultra fast, high-throughput algorithm reaching RAM B/W at level 1.
   zlib   - The base Zlib format compression (not Gzip).
   lzma   - The LZMA (Lempel-Ziv Markov) algorithm from 7Zip.
   lzmaMt - Multithreaded version of LZMA. This is a faster version but
            uses more memory for the dictionary. Thread count is balanced
            between chunk processing threads and algorithm threads.
   bzip2  - Bzip2 Algorithm from libbzip2.
   ppmd   - The PPMd algorithm, excellent for textual data. PPMd requires
            at least 64MB X CPUs more memory than the other modes.

   libbsc - A Block Sorting Compressor using the Burrows Wheeler Transform
            like Bzip2, but runs faster and gives better compression than
            Bzip2 (See: libbsc.com).

   adapt  - Adaptive mode where ppmd or bzip2 will be used per chunk,
            depending on heuristics. If at least 50% of the input data is
            7-bit text then PPMd will be used, otherwise Bzip2.
   adapt2 - Adaptive mode which includes ppmd and lzma. If at least 80% of
            the input data is 7-bit text then PPMd will be used, otherwise
            LZMA. It has significantly higher memory usage than adapt.
   none   - No compression. This is only meaningful with -D and -E so Dedupe
            can be done for post-processing with an external utility.
<chunk_size> - This can be in bytes or can use the following suffixes:
            g - Gigabyte, m - Megabyte, k - Kilobyte.
            Larger chunks produce better compression at the cost of memory.
<compress_level> - Can be a number from 0 meaning minimum to 14 meaning
            maximum compression.
</code></pre>
<p>NOTE: The option "libbsc" uses Ilya Grebnov's block sorting compression library
from <a href="http://libbsc.com/">http://libbsc.com/</a>. It is only available if pcompress is built with
that library. See the INSTALL file for details.</p>
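<p>Putting the syntax above together, a minimal invocation looks like the sketch below. The algorithm, level and chunk size are arbitrary choices, and the compressed output is assumed to be written alongside the input file:</p>

<pre><code># Compress 'data.tar' with zlib level 9 using 16MB chunks.
pcompress -c zlib -l9 -s16m data.tar
</code></pre>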
<pre><code>To decompress a file compressed using the above command:
   pcompress -d <compressed file> <target file>

To operate as a pipe, read from stdin and write to stdout:
   pcompress -p ...

Attempt Rabin fingerprinting based deduplication on chunks:
   pcompress -D ...
   pcompress -D -r ... - Do NOT split chunks at a rabin boundary. Default
                         is to split.

Perform Delta Encoding in addition to Identical Dedup:
   pcompress -E ... - This also implies '-D'. This performs Delta Compression
                      between 2 blocks if they are 40% to 60% similar. The
                      similarity percentage is selected based on the dedupe
                      block size to balance performance and effectiveness.
   pcompress -EE .. - This causes Delta Compression to happen if 2 blocks are
                      at least 40% similar regardless of block size. This can
                      effect a greater final compression ratio at the cost of
                      higher processing overhead.

Number of threads can optionally be specified: -t <1 - 256 count>
Other flags:
   '-L' - Enable LZP pre-compression. This improves the compression ratio of
          all algorithms with some extra CPU and very low RAM overhead. Using
          delta encoding in conjunction with this may not always be beneficial.

   '-P' - Enable Adaptive Delta Encoding. It can improve the compression ratio
          further for data containing tables of numerical values, especially
          if those are in an arithmetic series. In this implementation basic
          Delta Encoding is combined with Run-Length encoding and Matrix
          transpose.
   NOTE - Both -L and -P can be used together to give maximum benefit on most
          datasets.

   '-S' <cksum>
        - Specify the chunk checksum to use: CRC64, BLAKE2-256, BLAKE2-512,
          SHA256 or SHA512. The default is BLAKE2-256. This is about 25%
          slower than a simple CRC64 but is many times more robust than CRC64
          in detecting data integrity errors. BLAKE was a finalist in the
          NIST SHA-3 standard selection process and is one of the fastest in
          the group, especially on x86 platforms. BLAKE2 is faster than BLAKE
          and even faster than MD5.
          BLAKE2 512-256 is about 60% faster than SHA 512-256 on x64 platforms.

   '-F' - Perform Fixed Block Deduplication. This is faster than fingerprinting
          based content-aware deduplication in some cases. However it is mostly
          usable for disk dumps, especially virtual machine images. This
          generally gives a lower dedupe ratio than content-aware dedupe (-D)
          and does not support delta compression.
   '-M' - Display memory allocator statistics.
   '-C' - Display compression statistics.
</code></pre>
<p>NOTE: It is recommended not to use '-L' with libbsc compression, since libbsc uses LZP internally as well.</p>
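<p>The deduplication and filter flags compose with any of the compression algorithms. A sketch combining content-aware dedupe with delta encoding and both pre-compression filters (the file name, level, chunk size and thread count are illustrative):</p>

<pre><code># Delta encoding (-E, which implies -D), plus the LZP (-L) and Delta2 (-P)
# filters, followed by LZMA compression on 64MB chunks with 4 threads.
pcompress -E -L -P -c lzma -l10 -s64m -t4 archive.tar
</code></pre>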
<pre><code>Encryption flags:
   '-e <ALGO>'
          Encrypt chunks using the given encryption algorithm. The algo
          parameter can be one of AES or SALSA20. Both are used in CTR stream
          encryption mode.
          The password can be prompted from the user or read from a file.
          Unique keys are generated every time pcompress is run, even when
          giving the same password. Of course enough info is stored in the
          compressed file so that the key used for the file can be re-created
          given the correct password.

          The default key length is 256 bits but can be reduced to 128 bits
          using the '-k' option.

          The Scrypt algorithm from Tarsnap is used
          (See: http://www.tarsnap.com/scrypt.html) for generating keys from
          passwords. The CTR mode AES mechanism from Tarsnap is also utilized.

   '-w <pathname>'
          Provide a file which contains the encryption password. This file
          must be readable and writable since it is zeroed out after the
          password is read.

   '-k <key length>'
          Specify the key length. Can be 16 for 128-bit keys or 32 for
          256-bit keys. The default value is 32, for 256-bit keys.
</code></pre>
<p>NOTE: When using pipe mode via -p, the only way to provide a password is to use '-w'.</p>
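<p>As an illustration, here is one way to encrypt in pipe mode, where '-w' is the only option for supplying the password. The password file path is hypothetical; note that pcompress zeroes the file after reading it, so it must be writable:</p>

<pre><code># Create a throwaway password file (zeroed out by pcompress after use).
echo "my secret passphrase" > /tmp/pc.pass
chmod 600 /tmp/pc.pass

# Compress and encrypt with AES (256-bit key by default).
tar cf - /path/to/dir | pcompress -p -c lzma -l6 -s64m -e AES -w /tmp/pc.pass > backup.tar.pz
</code></pre>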
<h1>Environment Variables</h1>
<p>Set ALLOCATOR_BYPASS=1 in the environment to avoid using the built-in allocator. Due to the way it rounds up an allocation request to the nearest slab, the built-in allocator can allocate extra unused memory. In addition, you may want to use a different allocator in your environment.</p>
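<p>For example, to fall back to the system allocator, or to substitute an alternative such as jemalloc (the library path below is illustrative and varies by distribution):</p>

<pre><code># Bypass the mempool allocator and use plain malloc().
ALLOCATOR_BYPASS=1 pcompress -c lz4 -l3 -s16m file.tar

# Optionally combine with a preloaded alternate allocator.
ALLOCATOR_BYPASS=1 LD_PRELOAD=/usr/lib/libjemalloc.so pcompress -c lz4 -l3 -s16m file.tar
</code></pre>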
<h1>Examples</h1>

<p>Compress "file.tar" using bzip2 level 6, a 64MB chunk size and 4 threads. In addition, perform identity deduplication and delta compression prior to compression.</p>

<pre><code>pcompress -D -E -c bzip2 -l6 -s64m -t4 file.tar
</code></pre>

<p>Compress "file.tar" using the extreme compression mode of LZMA and a chunk size of 1GB. Allow pcompress to detect the number of CPU cores and use as many threads.</p>

<pre><code>pcompress -c lzma -l14 -s1g file.tar
</code></pre>

<p>Compress "file.tar" using lz4 at max compression with LZ-Prediction pre-processing and encryption enabled. Chunk size is 100M:</p>

<pre><code>pcompress -c lz4 -l3 -e -L -s100m file.tar
</code></pre>
<h1>Compression Algorithms</h1>

<p>LZFX - Ultra Fast, average compression. This algorithm is the fastest overall.
       Levels: 1 - 5
LZ4 - Very Fast, better compression than LZFX.
       Levels: 1 - 3
Zlib - Fast, better compression.
       Levels: 1 - 9
Bzip2 - Slow, much better compression than Zlib.
       Levels: 1 - 9
Libbsc - A new Block-Sorting compressor similar conceptually to Bzip2 but gives
       much better compression.
       Levels: 1 - 9</p>

<p>LZMA - Very slow. Extreme compression.
       Levels: 1 - 14
       Up to level 9 standard LZMA parameters are used. Levels 10 - 12 use
       more memory and higher match iterations, so they are slower. Levels
       13 and 14 use larger dictionaries of up to 256MB and really suck up
       RAM. Use these levels only if you have at least 4GB of RAM on
       your system.</p>

<p>LzmaMt - Extreme compression, faster than plain LZMA as it is multithreaded.
       Compression ratio is only slightly less than plain LZMA.</p>

<p>PPMD - Slow. Extreme compression for text, average compression for binary.
       In addition, PPMD decompression time is also high for large chunks.
       This requires lots of RAM, similar to LZMA.
       Levels: 1 - 14.</p>

<p>Adapt - Very slow synthetic mode. Both Bzip2 and PPMD are tried per chunk and
       the better result is selected.
       Levels: 1 - 14
Adapt2 - Ultra slow synthetic mode. Both LZMA and PPMD are tried per chunk and
       the better result is selected. Can give the best compression ratio when
       splitting a file into multiple chunks.
       Levels: 1 - 14
       Since both LZMA and PPMD are used together, memory requirements are
       quite extensive, especially if you are also using extreme levels above
       10. For example with a 64MB chunk, level 14, 2 threads, and with or
       without dedupe, it uses up to 3.5GB of physical RAM and requires 6GB
       of virtual memory space.</p>
<p>It is possible for a single chunk to span the entire file if enough RAM is available. However, for adaptive modes to be effective for large files, especially multi-file archives, splitting into chunks is required so that the best compression algorithm can be selected for textual and binary portions.</p>
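<p>As a sketch of the single-chunk "solid" case described above: choosing a chunk size at least as large as the input keeps the whole file in one chunk (the file name and sizes here are assumptions for illustration):</p>

<pre><code># Check the input size, then pick a chunk size that covers it entirely.
ls -l file.tar
pcompress -c lzma -l14 -s2g file.tar   # a single chunk if file.tar < 2GB
</code></pre>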
<h1>Memory Usage</h1>

<p>As can be seen from the above, memory usage can vary greatly based on compression/
pre-processing algorithms and chunk size. A variety of configurations are possible
depending on resource availability in the system.</p>

<p>The minimum meaningful settings, while still giving about 50% compression
ratio and very high speed, use the LZFX algorithm with a 1MB chunk size and 2
threads:</p>

<pre><code>pcompress -c lzfx -l2 -s1m -t2 <file>
</code></pre>

<p>This uses about 6MB of physical RAM (RSS). Earlier versions of the utility, before
the 0.9 release, consumed much more memory. This was improved in later versions.
When using Linux, the virtual memory consumption may appear to be very high, but it
is just address space usage rather than actual RAM and should be ignored. It is only
the RSS that matters. This is a result of the memory arena mechanism in Glibc that
improves malloc() performance for multi-threaded applications.</p>
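<p>To check the RSS rather than the inflated virtual size on Linux, standard tools suffice; a sketch using GNU time (not a pcompress feature):</p>

<pre><code># /usr/bin/time -v reports "Maximum resident set size" in kilobytes.
/usr/bin/time -v pcompress -c lzfx -l2 -s1m -t2 file.tar 2>&1 | grep resident
</code></pre>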
</section>
<footer>
<p>Project maintained by <a href="https://github.com/moinakg">moinakg</a></p>
<p><small>Hosted on GitHub Pages — Theme by <a href="http://twitter.com/#!/michigangraham">mattgraham</a></small></p>
</footer>
</div>
<!--[if !IE]><script>fixScale(document);</script><![endif]-->
<script type="text/javascript">
var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www.");
document.write(unescape("%3Cscript src='" + gaJsHost + "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E"));
</script>
<script type="text/javascript">
try {
var pageTracker = _gat._getTracker("UA-36422648-1");
pageTracker._trackPageview();
} catch(err) {}
</script>

</body>
</html>