A parallelized data deduplication and command line compression utility.

Find a file

Moinak Ghosh ea345a902a Overhaul documentation part #1 Detect and handle uncompressed PDF files using libbsc. Force binary/text data detection for tar archives. Get rid of unnecessary CLI option. Add full pipeline mode check when archiving.		2013-12-30 23:24:37 +05:30
archive	Overhaul documentation part #1	2013-12-30 23:24:37 +05:30
bsc	Add forked and optimized copy of LGPL version of Libbsc.	2013-11-30 22:13:33 +05:30
bsdiff	Add fast handling of totally incompressible data (like Jpegs) in adaptive modes.	2013-11-15 21:06:23 +05:30
crypto	Add a simple log facility.	2013-10-02 20:45:33 +05:30
filters	Update to PackJPG 2.5i.	2013-12-28 21:56:03 +05:30
lz4	Major License text cleanup.	2013-03-07 20:26:48 +05:30
lzfx	Major License text cleanup.	2013-03-07 20:26:48 +05:30
lzma	Free PPMD buffer after compression, rather than caching.	2013-12-21 23:32:27 +05:30
rabin	Archiving support using Libarchive: Work in progress changes.	2013-10-20 23:54:27 +05:30
test	Remove confusing option with little practical utility.	2013-09-01 15:02:28 +05:30
utils	Overhaul documentation part #1	2013-12-30 23:24:37 +05:30
.gitignore	Remove external Libbsc option.	2013-11-30 22:43:31 +05:30
adaptive_compress.c	Overhaul documentation part #1	2013-12-30 23:24:37 +05:30
allocator.c	Free PPMD buffer after compression, rather than caching.	2013-12-21 23:32:27 +05:30
allocator.h	Free PPMD buffer after compression, rather than caching.	2013-12-21 23:32:27 +05:30
bzip2_compress.c	Structured handling of file types.	2013-11-09 16:46:19 +05:30
Changelog	Update Changelog and tweak free memory detection for 2.4 release.	2013-09-05 21:12:37 +05:30
compressed_file_format.txt	Add fast handling of totally incompressible data (like Jpegs) in adaptive modes.	2013-11-15 21:06:23 +05:30
config	Remove external Libbsc option.	2013-11-30 22:43:31 +05:30
COPYING	Update License info to LGPLv3.	2012-07-07 22:18:29 +05:30
COPYING.LESSER	Update License info to LGPLv3.	2012-07-07 22:18:29 +05:30
INSTALL	Remove external Libbsc option.	2013-11-30 22:43:31 +05:30
libbsc_compress.c	Add forked and optimized copy of LGPL version of Libbsc.	2013-11-30 22:13:33 +05:30
lz4_compress.c	Remove fast path exit to allow compressing headers and zero paddings via LZ4.	2013-12-04 23:05:18 +05:30
lzfx_compress.c	Structured handling of file types.	2013-11-09 16:46:19 +05:30
lzma_compress.c	Free PPMD buffer after compression, rather than caching.	2013-12-21 23:32:27 +05:30
main.c	Add a simple log facility.	2013-10-02 20:45:33 +05:30
Makefile.in	Really fix issue #12 .	2013-12-28 21:49:55 +05:30
none_compress.c	Leverage file type detection(archiver) to improve compression performance.	2013-11-08 23:50:28 +05:30
pcompress.c	Overhaul documentation part #1	2013-12-30 23:24:37 +05:30
pcompress.h	Overhaul documentation part #1	2013-12-30 23:24:37 +05:30
ppmd_compress.c	Free PPMD buffer after compression, rather than caching.	2013-12-21 23:32:27 +05:30
README.md	Overhaul documentation part #1	2013-12-30 23:24:37 +05:30
zlib_compress.c	Structured handling of file types.	2013-11-09 16:46:19 +05:30

README.md

Pcompress

Pcompress is an archiver that also does compression and decompression in parallel by splitting input data into chunks. It has a modular structure and includes support for multiple algorithms like LZMA, Bzip2, PPMD, etc, with SKEIN/SHA checksums for data integrity. Compression algorithms are selected based on the file type to maximize compression gains using a file and data anaylis based adaptive technique. It also includes various data transformation filters to improve compression.

It also implements Variable Block Deduplication and Delta Compression features based on a Polynomial Fingerprinting scheme. Delta Compression is done via the widely popular bsdiff algorithm. Similarity is detected using a technique based on MinHashing. Deduplication metadata is also compressed to reduce overheads. In addition to all these it can internally split chunks at file and rabin boundaries to help Dedupe and compression.

It has low metadata overhead and overlaps I/O and compression to achieve maximum parallelism. It also bundles a simple slab allocator to speed repeated allocation of similar chunks. It can work in pipe mode, reading from stdin and writing to stdout. SIMD vector optimizations using the x86 SSE instruction set are used to speed up various operations. Finally it supports 14 compression levels to allow for ultra compression parameters in some algorithms.

Pcompress also supports encryption via AES, Salsa20 and uses Scrypt from Tarsnap for Password Based Key generation. A unique key is generated per session even if the same password is used and HMAC is used to do authentication.

Links of Interest

Project Home Page: http://moinakg.github.io/pcompress/

http://moinakg.github.io/pcompress/#deduplication-chunking-analysis

http://moinakg.github.io/pcompress/#compression-benchmarks

http://moinakg.wordpress.com/2013/04/26/pcompress-2-0-with-global-deduplication/

http://moinakg.wordpress.com/2013/03/26/coordinated-parallelism-using-semaphores/

http://moinakg.wordpress.com/2013/06/11/architecture-for-a-deduplicated-archival-store-part-1/

http://moinakg.wordpress.com/2013/06/15/architecture-for-a-deduplicated-archival-store-part-2/

Standard Usage

Standard usage consists of a few common options to control basic behavior. A variety of
parameters including global deduplication are automatically set based on the compression
level.

Archiving
---------
   pcompress -a [-v] [-l <compress level>] [-s <chunk size>] [-c <algorithm>]
                [<file1> <directory1> <file2> ...] [-t <number>] [-S <chunk checksum>]
                <archive filename or '-'>

   Archives a given set of files and/or directories into a compressed PAX archive. The
   PAX datastream is encoded into a custom format compressed file that can only be
   handled by Pcompress.

   -a       Enables archive mode where pathnames specified in the command line are
            archived using LibArchive and then compressed.

   -l <compress level>
            Select a compression level from 1 (least compression, fastest) to 14
            (ultra compression, slow). Default: 6

   -s <chunk size>
            Archive data is split into chunks that are processed in parallel. This value
            specifies the maximum chunk size. Blocks may be smaller than this. Values
            can be in bytes or <number><suffix> format where suffix can be k - KB, m - MB,
            g - GB. Default: 8m
            Larger chunks can produce better compression at the cost of memory.

   -c <algorithm>
            Specifies the compression algorithm to use. Default algorithm when archiving
            is adapt2 (Second Adaptive Mode). This is the ideal mode for archiving giving
            best compression gains. However adapt (Adaptive Mode) can be used which is a
            little faster but give lower compression gains.
            Other algorithms can be used if all the files are of the same known type. For
            example ppmd (slow) or libbsc (fast) can be used if all the files only have
            ASCII text. See section "Compression Algorithms" for details.

   -v       Enables verbose mode where each file/directory is printed as it is processed.

   -t <number>
            Sets the number of threads that Pcompress can use. Pcompress automatically
            uses thread count = core count. However with larger chunk size (-s option)
            and/or ultra compression levels, large amounts of memory can be used. In this
            case thread count can be reduced to reduce memory consumption.

   -S <chunk checksum>
            Specify then chunk checksum to use. Default: BLAKE256. The following checksums
            are available:

                 CRC64 - Extremely Fast 64-bit CRC from LZMA SDK.
                SHA256 - SHA512/256 version of Intel's optimized (SSE,AVX) SHA2 for x86.
                SHA512 - SHA512 version of Intel's optimized (SSE,AVX) SHA2 for x86.
             KECCAK256 - Official 256-bit NIST SHA3 optimized implementation.
             KECCAK512 - Official 512-bit NIST SHA3 optimized implementation.
              BLAKE256 - Very fast 256-bit BLAKE2, derived from the NIST SHA3
                         runner-up BLAKE.
              BLAKE512 - Very fast 256-bit BLAKE2, derived from the NIST SHA3
                         runner-up BLAKE.

             The fastest checksum is the BLAKE2 family.

   <archive filename>
            Pathname of the resulting archive. A '.pz' extension is automatically added
            if not already present. This can also be specified as '-' in order to send
            the compressed archive stream to stdout.

Single File Compression
-----------------------
   pcompress -c <algorithm> [-l <compress level>] [-s <chunk size>] [-p] [<file>]
             [-t <number>] [-S <chunk checksum>] [<target file or '-'>]

   Takes a single file as input and produces a compressed file. Archiving is not performed.
   This can also work as compression pipeline.

   -c <algorithm>
            See above. Also see section "Compression Algorithms" for details.

   -l <compress level>
   -s <chunk size>
   -t <number>
   -S <chunk checksum>
            See above.
            Note: In singe file compression mode with adapt2 or adapt algorithm, larger
                  chunks may not produce better compression. Smaller chunks can result
                  in better data analysis here.

   -p       Make Pcompress work in full pipeline mode. Data is ingested via stdin
            compressed and output via stdout. No filenames are used.

   <target file>
            Pathname of the compressed file to be created. This can be '-' to send the
            compressed data to stdout.

Decompression and Archive extraction
------------------------------------
   pcompress -d <compressed file or '-'> [-m] [-K] [<target file or directory>]

   -m        Enable restoring *all* permissions, ACLs, Extended Attributes etc.
             Equivalent to the '-p' option in tar. Ownership is only extracted if run as
             root user.

   -K        Do not overwrite newer files.

   -m and -K are only meaningful if the compressed file is an archive. For single file
   compressed mode these options are ignored.

   <compressed file>
            Specifies the compressed file or archive. This can be '-' to indicate reading
            from stdin while write goes to <target file>

   <target file or directory>
            This can be a filename or a directory depending on how the archive was created.
            If single file compression was used then this can be the name of the target
            file that will hold the uncompressed data.
            If this is omitted then an output file is created by appending '.out' to the
            compressed filename.

            If Archiving was done then this should be the name of a directory into which
            extracted files are restored. The directory is created if it does not exist.
            If this is omitted the files are extracted into the current directory.

Compression Algorithms

lzfx    - Fast, average compression. At high compression levels this can be faster
          than LZ4.
          Effective Levels: 1 - 5
lz4     - Very Fast, sometimes better compression than LZFX.
          Effective Levels: 1 - 3
zlib    - Fast, better compression.
          Effective Levels: 1 - 9
bzip2   - Slow, much better compression than Zlib.
          Effective Levels: 1 - 9

lzma    - Very slow. Extreme compression. Recommended: Use lzmaMt variant mentioned
          below.
          Effective Levels: 1 - 14
          Till level 9 it is standard LZMA parameters. Levels 10 - 12 use
          more memory and higher match iterations so are slower. Levels
          13 and 14 use larger dictionaries upto 256MB and really suck up
          RAM. Use these levels only if you have at the minimum 4GB RAM on
          your system.
lzmaMt  - This is the multithreaded variant of lzma and typically runs faster.
          However in a few cases this can produce slightly lesser compression
          gain.

PPMD    - Slow. Extreme compression for Text, average compression for binary.
          In addition PPMD decompression time is also high for large chunks.
          This requires lots of RAM similar to LZMA. PPMd requires
          at least 64MB X core-count more memory than the other modes.
          Effective Levels: 1 - 14.

Adapt   - Synthetic mode with text/binary detection. For pure text data PPMD is
          used otherwise Bzip2 is selected per chunk.
          Effective Levels: 1 - 14
Adapt2  - Slower synthetic mode. For pure text data PPMD is otherwise LZMA is
          applied. Can give very good compression ratio when splitting file
          into multiple chunks.
          Effective Levels: 1 - 14
          Since both LZMA and PPMD are used together memory requirements are
          large especially if you are also using extreme levels above 10. For
          example with 100MB chunks, Level 14, 2 threads and with or without
          dedupe, it uses upto 2.5GB physical RAM (RSS).

none    - No compression. This is only meaningful with -G or -D. So Dedupe
          can be done for post-processing with an external utility.

Enabled features based on Compression Level

1 to 3  - No features, just compression and archiving, if needed.
4       - Global Deduplication with avg block size of 8KB.
5       - Global Dedup block size 8KB, Adaptive Delta Encoding.
6 to 8  - Global Dedup block size reduced to 4KB, Adaptive Delta Encoding.
9       - Global Dedup block size reduced to 2KB, Adaptive Delta Encoding, Dispack.
10      - Global Dedup block size 2KB, Adaptive Delta Encoding with extra rounds, Dispack,
          LZP Preprocessing
10 - 14 - Global Dedup block size 2KB, Adaptive Delta Encoding with extra rounds, Dispack,
          LZP Preprocessing, PackJPG filter for Jpegs.

Encryption

Pcompress supports encryption and authentication in both archive and single-file
compresion modes. Encryption options are discussed below.

NOTE: When using pipe-mode via -p the only way to provide a password is to use '-w'.
      See below.

   -e <ALGO>
            Encrypt chunks using the given encryption algorithm. The algo parameter
            can be one of AES or SALSA20. Both are used in CTR stream encryption
            mode.
            The password can be prompted from the user or read from a file. Unique
            keys are generated every time pcompress is run even when giving the same
            password. Of course enough info is stored in the compresse file so that
            the key used for the file can be re-created given the correct password.

            Default key length if 256 bits but can be reduced to 128 bits using the
            '-k' option.

            The Scrypt algorithm from Tarsnap is used
            (See: http://www.tarsnap.com/scrypt.html) for generating keys from
            passwords. The CTR mode AES mechanism from Tarsnap is also utilized.

   -w <pathname>
            Provide a file which contains the encryption password. This file must
            be readable and writable since it is zeroed out after the password is
            read.

   -k <key length>
            Specify the key length. Can be 16 for 128 bit keys or 32 for 256 bit
            keys. Default value is 32 for 256 bit keys.

Advanced usage

A variety of advanced options are provided if one wishes fine grained control
as opposed to automatic settings. If advanced options are used then auto-setting
of parameters get disabled. The various advanced options are discussed below.

Attempt Polynomial fingerprinting based deduplication on a per-chunk basis:
   pcompress -D ...

Perform Delta Encoding in addition to Identical Dedup:
   pcompress -E ... - This also implies '-D'. This performs Delta Compression
                      between 2 blocks if they are 40% to 60% similar. The
                      similarity %age is selected based on the dedupe block
                      size to balance performance and effectiveness.
   pcompress -EE .. - This causes Delta Compression to happen if 2 blocks are
                      at least 40% similar regardless of block size. This can
                      effect greater final compression ratio at the cost of
                      higher processing overhead.

   -L       Enable LZP pre-compression. This improves compression ratio of all
            algorithms with some extra CPU and very low RAM overhead. Using
            delta encoding in conjunction with this may not always be beneficial.
            However Adaptive Delta Encoding is beneficial along with this.

   -P       Enable Adaptive Delta Encoding. It can improve compresion ratio further
            for data containing tables of numerical values especially if those are
            in an arithmetic series. In this implementation basic Delta Encoding is
            combined with Run-Length encoding and Matrix transpose
   NOTE -   Both -L and -P can be used together to give maximum benefit on most
            datasets.

   -F       Perform Fixed Block Deduplication. This is faster than fingerprinting
            based content-aware deduplication in some cases. However this is mostly
            usable for disk dumps especially virtual machine images. This generally
            gives lower dedupe ratio than content-aware dedupe (-D) and does not
            support delta compression.

   -B <0..5>
            Specify an average Dedupe block size. 0 - 2K, 1 - 4K, 2 - 8K ... 5 - 64K.
            Default deduplication block size is 4KB for Global Deduplication and 2KB
            otherwise.
   -B 0
            This uses blocks as small as 2KB for deduplication. This option can be
            used for datasets of a few GBs to a few hundred TBs in size depending on
            available RAM.
              
   -M       Display memory allocator statistics.
   -C       Display compression statistics.

Global Deduplication
--------------------
   -G       This flag enables Global Deduplication. This makes pcompress maintain an
            in-memory index to lookup cryptographic block hashes for duplicates. Once
            a duplicate is found it is replaced with a reference to the original block.
            This allows detecting and eliminating duplicate blocks across the entire
            dataset. In contrast using only '-D' or '-F' flags does deduplication only
            within the chunk but uses less memory and is much faster than Global Dedupe.

            The '-G' flag can be combined with either '-D' or '-F' flags to indicate
            rabin chunking or fixed chunking respectively. If these flags are not
            specified then the default is to assume rabin chunking via '-D'.
            All other Dedupe flags have the same meanings in this context.

            Delta Encoding is not supported with Global Deduplication at this time. The
            in-memory hashtable index can use upto 75% of free RAM depending on the size
            of the dataset. In Pipe mode the index will always use 75% of free RAM since
            the dataset size is not known. This is the simple full block index mode. If
            the available RAM is not enough to hold all block checksums then older block
            entries are discarded automatically from the matching hash slots.

            If pipe mode is not used and the given dataset is a file then Pcompress
            checks whether the index size will exceed three times of 75% of the available
            free RAM. In such a case it automatically switches to a Segmented Deduplication
            mode. Here data is first split into blocks as above. Then upto 2048 blocks are
            grouped together to form a larger segment. The individual block hashes for a
            segment are stored on a tempfile on disk. A few min-values hashes are then
            computed from the block hashes of the segment which are then loaded into the
            index. These hashes are used to detect segments that are approximately similar
            to each other. Once found the block hashes of the matching segments are loaded
            from the temp file and actual deduplication is performed. This allows the
            in-memory index size to be approximately 0.0025% of the total dataset size and
            requires very few disk reads for every 2048 blocks processed.
              
            In pipe mode Global Deduplication always uses a segmented similarity based
            index. It allows efficient network transfer of large data.

Environment Variables

Set ALLOCATOR_BYPASS=1 in the environment to avoid using the the built-in
allocator. Due to the the way it rounds up an allocation request to the nearest
slab the built-in allocator can allocate extra unused memory. In addition you
may want to use a different allocator in your environment.

The variable PCOMPRESS_INDEX_MEM can be set to limit memory used by the Global
Deduplication Index. The number specified is in multiples of a megabyte.

The variable PCOMPRESS_CACHE_DIR can point to a directory where some temporary
files relating to the Global Deduplication process can be stored. This for example
can be a directory on a Solid State Drive to speed up Global Deduplication. The
space used in this directory is proportional to the size of the dataset being
processed and is slightly more than 8KB for every 1MB of data.

The default checksum used for block hashes during Global Deduplication is SHA256.
However this can be changed by setting the PCOMPRESS_CHUNK_HASH_GLOBAL environment
variable. The list of allowed checksums for this is:

SHA256   , SHA512
KECCAK256, KECCAK512
BLAKE256 , BLAKE512
SKEIN256 , SKEIN512

Even though SKEIN is not supported as a chunk checksum (not deemed necessary
because BLAKE2 is available) it can be used as a dedupe block checksum. One may
ask why? The reasoning is we depend on hashes to find duplicate blocks. Now SHA256
is the default because it is known to be robust and unbroken till date. Proven as
yet in the field. However one may want a faster alternative so we have choices
from the NIST SHA3 finalists in the form of SKEIN and BLAKE which are neck to
neck with SKEIN getting an edge. SKEIN and BLAKE have seen extensive cryptanalysis
in the intervening years and are unbroken with only marginal theoretical issues
determined. BLAKE2 is a derivative of BLAKE and is tremendously fast but has not
seen much specific cryptanalysis as yet, even though it is not new but just a
performance optimized derivate. So cryptanalysis that applies to BLAKE should
also apply and justify BLAKE2. However the paranoid may well trust SKEIN a bit
more than BLAKE2 and SKEIN while not being as fast as BLAKE2 is still a lot faster
than SHA2.

Examples

Archive contents of directory /usr/include into usr.pz. Default chunk or per-thread segment size is 8MB and default compression level is 6.

pcompress -a /usr/include usr

Archive the given listr of files into file.pz and max compresion level and all features enabled. A maximum chunk size of 20MB is used. Also use verbose mode which lists each file as it is processed.

pcompress -a -v -l14 -s20m file1 file2 file3 file

Simple compress "file.tar" using zlib(gzip) algorithm. Default chunk or per-thread segment size is 8MB and default compression level is 6. Output file created will be file.tar.pz

pcompress -c zlib file.tar

Simple compress "file.tar" using zlib(gzip) algorithm with output file file.compressed.pz

pcompress -c zlib file.tar file.compressed

Compress "file.tar" using Zlib and per-thread chunk or segment size of 10MB and Compression level 9. Compressed output is sent to stdout using '-' which is then redirected to a file.

pcompress -c zlib -l9 -s10m file.tar - > /path/to/compress_file.tar.pz

It is possible for a single chunk to span the entire file if enough RAM is available. However for adaptive modes to be effective for large files, especially multi-file archives splitting into chunks is required so that best compression algorithm can be selected for textual and binary portions.

Pre-Processing Algorithms

As can be seen above a multitude of pre-processing algorithms are available that
provide further compression effectiveness beyond what the usual compression
algorithms can achieve by themselves. These are summarized below:

1) Deduplication     : Per-Chunk (or per-segment) deduplication based on Rabin
                       fingerprinting.

2) Delta Compression : A similarity based (minhash) comparison of Rabin blocks.
                       Two blocks at least 60% similar with each other are diffed
                       using bsdiff.

3) LZP               : LZ Prediction is a variant of LZ77 that replaces repeating
                       runs of text with shorter codes.

4) Adaptive Delta    : This is a simple form of Delta Encoding where arithmetic
                       progressions are detected in the data stream and
                       collapsed via Run-Length encoding.

4) Matrix Transpose  : This is used automatically in Delta Encoding and
                       Deduplication. This attempts to transpose columnar
                       repeating sequences of bytes into row-wise sequences so
                       that compression algorithms can work better.

Memory Usage

As can be seen from above memory usage can vary greatly based on compression/
pre-processing algorithms and chunk size. A variety of configurations are possible
depending on resource availability in the system.

The minimum possible meaningful settings while still giving about 50% compression
ratio and very high speed is with the LZFX algorithm with 1MB chunk size and 2
threads:

    pcompress -c lzfx -l2 -s1m -t2 <file>

This uses about 6MB of physical RAM (RSS). Earlier versions of the utility before
the 0.9 release comsumed much more memory. This was improved in the later versions.
When using Linux the virtual memory consumption may appear to be very high but it
is just address space usage rather than actual RAM and should be ignored. It is only
the RSS that matters. This is a result of the memory arena mechanism in Glibc that
improves malloc() performance for multi-threaded applications.