Update README to align with current features/behavior.
commit b01d255f6c (parent a98778d62f)
1 changed file with 47 additions and 19 deletions

README.md (66 lines changed)
@@ -103,7 +103,9 @@ NOTE: The option "libbsc" uses Ilya Grebnov's block sorting compression library
        However Adaptive Delta Encoding is beneficial along with this.

 '-P' - Enable Adaptive Delta Encoding. This implies '-L' as well. It improves
-       compression ratio further at the cost of more CPU overhead.
+       compression ratio further at the cost of more CPU overhead. Delta
+       Encoding is combined with Run-Length encoding and Matrix transpose
+       of certain kinds of data to improve subsequent compression results.

 '-S' <cksum>
      - Specify chunk checksum to use: CRC64, SKEIN256, SKEIN512, SHA256 and
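For the '-P' and '-S' options above, a hypothetical invocation might look like the following. Only the flags themselves are taken from this README; the algorithm name, level, chunk size and file name are illustrative assumptions:

    pcompress -c lzma -l6 -s16m -P -S SHA256 <file>

Since '-P' implies '-L', this also enables LZP pre-processing.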
@@ -120,6 +122,7 @@ NOTE: The option "libbsc" uses Ilya Grebnov's block sorting compression library
        usable for disk dumps especially virtual machine images. This generally
        gives lower dedupe ratio than content-aware dedupe (-D) and does not
        support delta compression.
+
 '-M' - Display memory allocator statistics
 '-C' - Display compression statistics
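To show how the statistics options in this hunk combine with deduplication, a sketch of a run that enables content-aware dedupe and prints both memory allocator and compression statistics might be (the '-D', '-M' and '-C' flags come from the README text; algorithm, level and chunk size are illustrative assumptions):

    pcompress -c lzma -l6 -s16m -D -M -C <file>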
@@ -197,31 +200,56 @@ PPMD - Slow. Extreme compression for Text, average compression for binary.
        This requires lots of RAM similar to LZMA.
        Levels: 1 - 14.

-Adapt  - Very slow synthetic mode. Both Bzip2 and PPMD are tried per chunk and
-         better result selected.
+Adapt  - Synthetic mode with text/binary detection. For pure text data PPMD is
+         used, otherwise Bzip2 is selected per chunk.
          Levels: 1 - 14
-Adapt2 - Ultra slow synthetic mode. Both LZMA and PPMD are tried per chunk and
-         better result selected. Can give best compression ratio when splitting
-         file into multiple chunks.
+Adapt2 - Slower synthetic mode. For pure text data PPMD is used, otherwise LZMA
+         is applied. Can give very good compression ratio when splitting the file
+         into multiple chunks.
          Levels: 1 - 14
          Since both LZMA and PPMD are used together memory requirements are
-         quite extensive especially if you are also using extreme levels above
-         10. For example with 64MB chunk, Level 14, 2 threads and with or without
-         dedupe, it uses up to 3.5GB physical RAM and requires 6GB of virtual
-         memory space.
+         large, especially if you are also using extreme levels above 10. For
+         example with 100MB chunks, Level 14, 2 threads and with or without
+         dedupe, it uses up to 2.5GB physical RAM (RSS).

 It is possible for a single chunk to span the entire file if enough RAM is
 available. However for adaptive modes to be effective for large files, especially
 multi-file archives, splitting into chunks is required so that the best compression
 algorithm can be selected for textual and binary portions.

-Caveats
-=======
-This utility is not meant for resource constrained environments. Minimum memory
-usage (RES/RSS) with barely meaningful settings is around 10MB. This occurs when
-using the minimal LZFX compression algorithm at level 2 with a 1MB chunk size and
-running 2 threads.
-Normally this utility requires lots of RAM depending on compression algorithm,
-compression level, and dedupe being enabled. Larger chunk sizes can give
-better compression ratio but at the same time use more RAM.
+Pre-Processing Algorithms
+=========================
+As can be seen above, a multitude of pre-processing algorithms are available that provide
+further compression effectiveness beyond what the usual compression algorithms can
+achieve by themselves. These are summarized below:
+
+1) Deduplication    : Per-Chunk (or per-segment) deduplication based on Rabin
+                      fingerprinting.
+2) LZP              : LZ Prediction is a variant of LZ77 that replaces repeating runs of
+                      text with shorter codes.
+3) Adaptive Delta   : A simple form of Delta Encoding where arithmetic progressions
+                      are detected in the data stream and collapsed via Run-Length encoding.
+4) Matrix Transpose : Used automatically in Delta Encoding and Deduplication. It
+                      attempts to transpose columnar repeating sequences of bytes into
+                      row-wise sequences so that compression algorithms can work better.
+
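For reference, the Deduplication, LZP and Adaptive Delta steps listed above correspond to the '-D', '-L' and '-P' flags described earlier in this README, while Matrix Transpose is applied automatically. A hypothetical run enabling them (algorithm name, level and chunk size are illustrative assumptions) could be:

    pcompress -c lzma -l8 -s32m -D -P <file>

Here '-P' implies '-L', so LZP is enabled without listing it explicitly.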
+Memory Usage
+============
+As can be seen from above, memory usage can vary greatly based on compression/
+pre-processing algorithms and chunk size. A variety of configurations are possible
+depending on resource availability in the system.
+
+The minimum meaningful configuration, while still giving about 50% compression
+ratio and very high speed, uses the LZFX algorithm with a 1MB chunk size and 2
+threads:
+
+    pcompress -c lzfx -l2 -s1m -t2 <file>
+
+This uses about 6MB of physical RAM (RSS). Earlier versions of the utility before
+the 0.9 release consumed much more memory; this was improved in later versions.
+When using Linux the virtual memory consumption may appear to be very high, but it
+is just address space usage rather than actual RAM and should be ignored. Only
+the RSS matters. This is a result of the memory arena mechanism in Glibc that
+improves malloc() performance for multi-threaded applications.
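As a footnote to the memory figures quoted in this hunk, one way to observe the resident set size (RSS) of a run is GNU time in verbose mode. The lowercase adapt2 algorithm name, chunk size and thread count below are illustrative assumptions that mirror the Adapt2 scenario described earlier (100MB chunks, Level 14, 2 threads):

    /usr/bin/time -v pcompress -c adapt2 -l14 -s100m -t2 <file>

The "Maximum resident set size" line reported by GNU time corresponds to the physical RAM usage discussed here; a much larger virtual size shown by other tools can be ignored, per the Glibc arena note above.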