It relies on separate external utilities for tasks such as handling multiple files (e.g. tar), encryption, and archive splitting.
bzip2 compresses data in blocks between 100 and 900 kB and uses the Burrows–Wheeler transform to convert frequently recurring character sequences into strings of identical letters.
There have been some modifications to the algorithm, such as pbzip2, which uses multi-threading to improve compression speed on multi-CPU and multi-core computers.
The author of bzip2 has stated that the RLE step was a historical mistake and was only intended to protect the original BWT implementation from pathological cases.
The block is entirely self-contained, with input and output buffers of the same size; in bzip2, the operating limit for this stage is 900 kB.
For the block-sort, a (notional) matrix is created, in which row i contains the whole of the buffer, rotated to start from the i-th symbol.
In practice, it is not necessary to construct the full matrix; rather, the sort is performed using pointers for each position in the buffer.
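As a rough illustration, the sorted-rotations view of the transform can be sketched in Python. This is a naive quadratic version for clarity only; bzip2's actual sorter works with pointers and sophisticated fallbacks rather than materialising rotations:

```python
# Naive Burrows-Wheeler transform via sorted rotations (illustrative only;
# real implementations sort pointers into the buffer, not whole rotations).
def bwt(data: bytes) -> tuple[bytes, int]:
    n = len(data)
    # Row i of the notional matrix is the buffer rotated to start at symbol i;
    # sorting the start indices by the rotation they denote sorts the rows.
    rows = sorted(range(n), key=lambda i: data[i:] + data[:i])
    # The transform output is the last column of the sorted matrix, plus the
    # position of the original (unrotated) row, needed to invert the BWT.
    last_column = bytes(data[(i - 1) % n] for i in rows)
    return last_column, rows.index(0)

print(bwt(b"banana"))  # -> (b"nnbaaa", 3): runs of identical letters emerge
```

Note how the recurring "an" pairs in the input become the run "aaa" in the output, which is what makes the transformed block easier to compress.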
Much "natural" data contains identical symbols that recur within a limited range (text is a good example).
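A minimal move-to-front encoder shows how symbols that recur within a short range map to small index values (a sketch of the general technique, not bzip2's exact implementation):

```python
# Minimal move-to-front encoder: recently seen symbols get small indices,
# so data with locally recurring symbols encodes to runs of low values.
def mtf_encode(data: bytes) -> list[int]:
    alphabet = list(range(256))   # current symbol ranking, most recent first
    output = []
    for symbol in data:
        index = alphabet.index(symbol)
        output.append(index)
        alphabet.pop(index)       # move this symbol to the front
        alphabet.insert(0, symbol)
    return output

print(mtf_encode(b"nnbaaa"))  # -> [110, 0, 99, 99, 0, 0]
```

The repeated symbols encode as zeros, which the later stages compress very cheaply.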
This process replaces fixed-length symbols in the range 0–258 with variable-length codes based on the frequency of use.
This has the advantage of very responsive Huffman dynamics without the need to continuously supply new tables, as would be required in DEFLATE.
As a result of the earlier MTF encoding, code lengths would start at 2–3 bits long (very frequently used codes) and gradually increase, meaning that the delta format is fairly efficient, requiring around 300 bits (38 bytes) per full Huffman table.
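The delta format can be approximated with a simple cost model; the figures used here (a 5-bit initial length, 2 bits per ±1 adjustment, a 1-bit terminator per subsequent symbol) are a simplification for illustration, not the exact bitstream layout:

```python
# Approximate cost of delta-coding a table of Huffman code lengths.
# The bit costs are a simplified model of bzip2's scheme, not the on-disk format.
def delta_table_bits(lengths: list[int]) -> int:
    bits = 5                          # initial code length, sent verbatim
    prev = lengths[0]
    for cur in lengths[1:]:
        bits += 2 * abs(cur - prev)   # each +/-1 adjustment step costs 2 bits
        bits += 1                     # terminator bit for this symbol
        prev = cur
    return bits

# 258 symbols with slowly varying lengths land near the ~300-bit figure:
lengths = [3] * 100 + [4] * 100 + [5] * 58
print(delta_table_bits(lengths))  # -> 266
```

Because adjacent code lengths rarely differ by more than one or two, the per-symbol cost stays close to the single terminator bit, which is why a full table fits in roughly 300 bits.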
Because of the first-stage RLE compression (see above), the maximum length of plaintext that a single 900 kB bzip2 block can contain is around 46 MB (45,899,236 bytes).
This can occur if the whole plaintext consists entirely of repeated values (the resulting .bz2 file in this case is 46 bytes long).
An even smaller file of 40 bytes can be achieved by using an input containing entirely values of 251, an apparent compression ratio of 1147480.9:1.
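This behaviour is easy to reproduce with Python's standard-library bz2 bindings; the exact byte counts above apply to full 900 kB blocks, so this smaller run is only indicative:

```python
import bz2

# One megabyte of a single repeated byte value compresses to a tiny stream:
# the first-stage RLE collapses the runs before the block sort even begins.
data = bytes([251]) * 1_000_000
compressed = bz2.compress(data, compresslevel=9)
print(len(compressed))                     # a few dozen bytes
assert bz2.decompress(compressed) == data  # round-trips losslessly
```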
This means that bzip2 files can be decompressed in parallel, making it a good format for use in big data applications with cluster computing frameworks like Hadoop and Apache Spark.
LZMA is generally more space-efficient than bzip2 at the expense of even slower compression speed, while having faster decompression.
bzip3,[12] a modern compressor that shares a common ancestry and set of algorithms with bzip2, switched back to arithmetic coding.
Motivated by the long time required for compression, a modified version was created in 2003 called pbzip2 that used multi-threading to encode the file in multiple chunks, giving almost linear speedup on multi-CPU and multi-core computers.
The grep-based bzgrep tool allows directly searching through compressed text without needing to uncompress the contents first.