Why BWT works

In issue #13, there was a very enlightening article on the Burrows-Wheeler transformation (BWT), an exceptional algorithm used in data compression. I once checked it out myself and wrote an introduction to it for the comp.compression FAQ. Since the question of why it works was raised but not answered, I'll try to recount the facts in digestible form here.

In data compression, we deal with something called redundancy. Basically, it means predictability of the data stream with respect to some particular model of the stream source. These sources are modelled statistically, an approach that comes from information theory. (Compression is perhaps the best-known application of Claude Shannon's theory.) In information theory, information is defined as something that reduces uncertainty. If we have a stream of constant numbers, it is quite clear it carries no information: if the stream can assume only one value, it can answer no questions. If lots of unpredictable transitions occur, lots of questions can be answered. If we have a set of n yes/no questions, we need n bits to answer them all, provided they are completely independent (i.e. answering one has no bearing on any of the others). Since independent questions mean independent answers and equally probable yes/no alternatives, we see that the most information is relayed when the stream is completely random. This means that, statistically, white noise (equiprobable random numbers) is the equivalent of a perfect lack of redundancy. We aim at similar properties when compressing data. This is why compressed files sound like wide band noise. (For encrypted files the reason is a bit different.)
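
To make this concrete, here is a little sketch (in Python, which the later examples also use) that measures the order-0 entropy of a stream, i.e. the average number of bits per symbol a memoryless statistical model would need. The two sample streams are toy inputs of my own: a constant stream needs about zero bits per symbol, while uniform random bytes need close to the full eight.

    import math
    import random
    from collections import Counter

    def order0_entropy(symbols):
        """Shannon entropy in bits per symbol under a memoryless (order-0) model."""
        counts = Counter(symbols)
        total = sum(counts.values())
        return -sum(c / total * math.log2(c / total) for c in counts.values())

    constant = b"\x00" * 65536                                  # one possible value only
    noise = bytes(random.randrange(256) for _ in range(65536))  # white noise
    print(order0_entropy(constant))   # about 0 bits/symbol: no information
    print(order0_entropy(noise))      # close to 8 bits/symbol: no redundancy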

Now about compression. What if we encounter a data stream which is supposed to carry information, but shows a definite tendency towards some dependency between symbols? Say we have English language as ASCII. We notice that some letters are more common than others and that certain sequences of letters tend to be followed by particular letters more often than others. In this case, receiving some sequence of letters reduces our uncertainty about what follows. For instance, receiving a period makes it almost certain that a space follows. It is said that a correlation exists. Such correlations are what makes compression tick: having received something, we know more about what follows and need fewer bits (questions answered) to uniquely determine the following characters. Huffman compression achieves this by using fewer bits for common symbols and more for rare ones, leading to a lower average number of bits transmitted. Arithmetic coders do the same by assigning a wider interval to a common character, so that the upper and lower coding bounds change less; fewer leading bits of the bounds become fixed, and fewer bits are put on the wire. Lempel-Ziv coders cope by encoding offsets to previously seen data. Here the prediction shows up in that only data seen a short while ago can be used for offsets, and longer offsets take more room. This means that seeing the same data over and over again leads to less space used (we send mostly offsets, not the actual data). We see that if we can predict (at least partially) what is going to happen, we can shave off bits. Prediction is what counts in BWT as well.
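
To illustrate the Huffman case, here is a toy sketch (my own code, not any particular library) that builds a Huffman tree over byte frequencies and reports the resulting code lengths; on English text the common bytes get short codes and the average drops well below a flat eight bits per byte.

    import heapq
    from collections import Counter

    def huffman_code_lengths(data):
        """Return a Huffman code length per byte value: frequent bytes receive
        shorter codes. A sketch; the degenerate one-symbol input is ignored."""
        counts = Counter(data)
        lengths = {byte: 0 for byte in counts}
        # Heap entries: (subtree count, tie-breaker, byte values in the subtree).
        heap = [(count, byte, [byte]) for byte, count in counts.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            c1, _, s1 = heapq.heappop(heap)
            c2, t2, s2 = heapq.heappop(heap)
            for byte in s1 + s2:      # every symbol in the merged subtree gains one bit
                lengths[byte] += 1
            heapq.heappush(heap, (c1 + c2, t2, s1 + s2))
        return lengths

    sample = b"receiving a period makes it almost certain that a space follows"
    lengths = huffman_code_lengths(sample)
    print(sum(lengths[b] for b in sample) / len(sample))   # average bits/symbol, well below 8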

BWT uses a weird strategy. We form the cyclic permutations (rotations) of the input block, sort them, and transmit the last character of each sorted rotation, i.e. the last column of the resulting matrix. What happens is that identical characters tend to group together in that last column, so the output locally consists of a small set of distinct symbols. Why?
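
Before getting to the why, this is roughly what the forward transform looks like in code. The sketch below is deliberately naive (quadratic in the block size; real implementations use suffix sorting) but produces the same output:

    def bwt(block):
        """Naive Burrows-Wheeler transform: form all rotations, sort them, and
        return the last column plus the row index of the original block
        (needed later to invert the transform)."""
        rotations = sorted(block[i:] + block[:i] for i in range(len(block)))
        last_column = bytes(rot[-1] for rot in rotations)
        return last_column, rotations.index(block)

    last, index = bwt(b"the theme of the thesis")
    print(last)   # identical characters cluster together in the last column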

Think about what I said earlier about prediction: if we have correlations in the data (predictability/less information/redundancy), receiving a certain string enables us to partially predict what comes next. Think about some symbol in the input stream (say e). If correlations are present, the same string (for e, probably th) tends to precede it most of the time. The same works backwards as well: some character may tend to precede the same strings over and over again. In the compression step of BWT, every character takes on the role of last symbol on some row of the transform matrix. At the same time, the string which follows that character in the input ends up at the front of the same row. When we sort, similar suffix strings end up on adjacent rows of the sorted matrix, and the correlations make the symbols preceding them land in runs in the last column. In effect, the characters are predicted by their suffixes. This means that as long as a suffix string at least partially tells us what the preceding character was, symbols will tend to group together in the output.
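
To watch this happen, it helps to print the sorted rotation matrix of a short, repetitive string (a toy input of my own): rows starting with similar suffixes sit next to each other, and the characters they predict pile up in the last column.

    text = b"she sells sea shells on the sea shore"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    for row in rotations:
        print(row[:12].decode(), "...", chr(row[-1]))   # start of the row, then its last character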

One easily missed aspect of the process is that an automatic weighting mechanism is present: the characters closest to the predicted one (which sits at the end of a row) are the ones at the beginning of that row, and they receive the most weight since they are the primary keys in the sorting operation. Second, since the strings are always sorted in their entirety, a flat distribution in the primary predictor symbol does not prevent the secondary sorting key (the second character of each row) from influencing the result. In that case, well predicted characters will land in multiple locations, but with long block lengths they are still well localized and behave nicely in the following MTF stage. Consequently, correlations which skip characters (common in record based formats, lists and directories of all kinds) can at least partially be exploited. This is a distinctive feature of higher order compressors, and of BWT in particular.

The quality of the predictions (i.e. the amount of redundancy) determines how strict the final grouping of characters is. If every t in the text is followed by he, all t's will end up in one long run in the last column. If this is not the case, t's will still end up mainly near their best predictor, he. Locally, we get an output with greatly skewed first order statistics (e.g. in any 32 letter stretch we might expect only two or three distinct symbols to appear). This is good, except that the statistics change quite rapidly.
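
A quick way to check this local skew, building on the bwt() sketch above (and again using a toy input of my own), is to slide a window over the transformed block and count the distinct byte values in each window:

    sample = b"the shells she sells on the sea shore are surely sea shells " * 8
    last, _ = bwt(sample)
    counts = [len(set(last[i:i + 32])) for i in range(0, len(last) - 31, 32)]
    print(counts)   # typically only a handful of distinct bytes per 32-byte window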

This is why the move-to-front (MTF) coder is included. MTF encodes each character as the number of distinct symbols seen since its last occurrence. Coding uses a conceptual stack of all possible symbols (256 in the case of 8-bit bytes). When we receive a symbol, its offset in the stack is encoded and the symbol is moved to the top of the stack. This means that if only a small set of symbols is in active use at any time, these symbols will stay at the top of the stack and will continuously be coded as low indices in the output. If the set roams (i.e. the common symbols vary with time), the average output will be higher. (We need to bring symbols to the front of the stack before a low index results.) Only if the input has completely flat statistics (every symbol is equally probable) will the output be flat.
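
A direct transcription of this description into code could look as follows; mtf_decode is included just to show that the step loses no information:

    def mtf_encode(data):
        """Move-to-front: emit each byte's current position in a 256-entry stack,
        then move that byte to the top of the stack."""
        stack = list(range(256))
        out = []
        for byte in data:
            index = stack.index(byte)
            out.append(index)
            del stack[index]
            stack.insert(0, byte)
        return out

    def mtf_decode(indices):
        """Inverse of mtf_encode: look the byte up by index, then move it to the top."""
        stack = list(range(256))
        out = bytearray()
        for index in indices:
            byte = stack.pop(index)
            out.append(byte)
            stack.insert(0, byte)
        return bytes(out)

    assert mtf_decode(mtf_encode(b"abracadabra")) == b"abracadabra"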

Now, preceded by the BWT transform, MTF receives just the right kind of input: long runs of just a few common characters, so consistently low output indices result. What does this mean? It means that BWT/MTF often achieves a prodigious skewing of the first order statistics. Any statistical compression method which exploits common symbols will therefore be effective in compressing the output. Static Huffman, Shannon-Fano, or arithmetic coding methods all work very well. String matching (LZ) algorithms do not work, however, since the shuffling involved breaks intersymbol correlations very efficiently. Basically, the BWT/MTF step takes the complex symbol interdependencies in the input (which first order statistical coders cannot exploit) and turns them into skewed first order statistics with little higher order correlation left; BWT/MTF is a kind of impedance match between simple statistical coding and typical, highly cross-correlated input data.
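
Putting the earlier sketches together shows the effect: measuring the order-0 entropy before and after BWT+MTF on a repetitive toy sample of my own, the bits-per-symbol figure a simple statistical coder would need drops sharply.

    sample = b"she sells sea shells on the sea shore, the shells are surely sea shells. " * 8
    last, index = bwt(sample)
    mtf_output = mtf_encode(last)
    print(order0_entropy(sample))       # entropy of the raw text
    print(order0_entropy(mtf_output))   # much lower: skewed towards small indices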

A few final remarks about BWT:

  1. Over a whole block, BWT in itself does not modify the first order statistics of the data stream; it just reorders the symbols inside the block. This means that even if the higher order statistics do not allow any compression, the first order statistics will not become worse. Combined with MTF, such an effect can occur, though.
  2. BWT, and invertible transformations in general, will not make data more compressible. It just converts data compressible under one model into data compressible under another. It cannot be used to build an infinite compressor (which, by the counting argument, does not exist). In fact, BWT does not admit iteration at all: since it flattens the very higher order statistics it relies on, more than one pass will not make things any better. (A sketch of the inverse transform appears after this list.)
  3. Using MTF effectively kills time variance in the signal, which is exactly what it is supposed to do. Consequently, using an adaptive compression scheme after BWT/MTF is not worth doing. If MTF is dropped, however, any adaptive first order coder will do. The problem is that MTF is nearly optimal for BWT output, since the variance in its first order statistics is quite high. This shows in the fact that trying different update heuristics for the MTF stack does little to improve the resulting compression ratios. Most adaptive compressors will do worse than MTF plus a static coder.
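
To back up the invertibility remark in point 2, here is a matching inverse for the bwt() sketch above. It uses the classic but slow re-sorting construction rather than the linear-time method found in real decoders:

    def inverse_bwt(last_column, primary_index):
        """Rebuild the sorted rotation matrix column by column: repeatedly prepend
        the last column and re-sort, then read off the original row."""
        table = [b""] * len(last_column)
        for _ in range(len(last_column)):
            table = sorted(last_column[i:i + 1] + table[i] for i in range(len(last_column)))
        return table[primary_index]

    assert inverse_bwt(*bwt(b"the theme of the thesis")) == b"the theme of the thesis"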

For people interested in compression, the comp.compression FAQ is a good starting point. It covers the basic theory and analyses most of the common compression algorithms in an intuitive, easy to understand way. The FAQ is maintained by Jean-loup Gailly. Along with Mark Nelson, he is one of the authors of The Data Compression Book, the definitive introductory reference on data compression theory and practice.