APPENDIX: Audio coding/compression

The surge in MP3 activity over the last few years has created a tremendous demand for facts and information about audio compression. As digital processing has become cheaper, the technology has also spread to mainstream audio. Recent examples include Dolby Digital, MiniDisc and DVD. This appendix, then, deals with the essentials of audio compression.

Redundancy in sound. Lossy vs. lossless compression.

To achieve compression of any kind, there needs to be something in the data that can be thrown away without causing damage when the data is used for its intended purpose. Any such excess is called redundancy. Depending on the criteria we use to decide which parts of the data are not significant, there are multiple kinds of redundancy. The kinds we are interested in here are perceptual and statistical. The first refers to those parts of the sound we cannot hear (or more accurately, anything we can change in the signal without causing a perceptible difference from the original), the second to the part which can be inferred from some particular statistical model we are using. Statistical redundancy is what makes general purpose, lossless compression methods (like those used in PkZip, ARJ and so on) possible; the fact that there is a lot of perceptual redundancy left after most statistical redundancy has already been tapped out is the reason lossy sound and image compressors are so successful. Instead of trying to predict exactly the next value the signal should take, we can come at the problem from another angle entirely and ask how, and to what degree, we can make errors in our coding without any perceptible difference in the decoded output. This is the area of psychoacoustical or perceptual modelling; statistical predictors are supplanted by models which tell what in the signal the listener should and should not hear.

In the preceding, the word model came up. Modelling and prediction based on it form an essential part of any coding scheme. To see what is going on, it is helpful to think of sent bits not as numbers but as questions answered. If we have a model to give a clue as to what is most likely to happen next, we can get by with fewer questions answered—in a sense the model does part of the answering. When dealing with signals, prediction takes on another meaning. In this case we can talk about getting close to a desired answer instead of just getting it right or wrong. Lossy compression is based on this—getting close enough is sufficient. If a lossless encoding is desired, we can always compute the difference between our prediction and the true signal we are coding and transmit that as well. When we have a good model, the error (usually called the residue) is small and takes fewer bits to transmit than the original. We can also take the difference approach further and consider what happens when we use a simple (and less efficient) predictor but this time use prediction on the residue as well. And to the residue of that stage, et cetera. Approaches like these lead to hierarchical codecs and recursive refinement algorithms of which discrete wavelets, subband methods and the pyramidal algorithms of image processing are but a few examples.

So modelling is used to discover redundancy, both statistical and perceptual. It’s easily seen that statistical compression leads to results an entire order of magnitude worse than does perceptual coding, though. Why such a large difference? First, our perception is far from perfect. A lot of information is lost when we listen to sounds and there is considerable slack between a perfect reconstruction of a sound and what remains after our audition is done with it. Second, a large part of what is not heard is quite difficult to model statistically. Hiss is an excellent example: it is easy to create and robust to replicate (in the sense that the actual series of samples we output isn’t relevant as long as a number of statistical conditions are satisfied), yet it can be made impossible to model or compress statistically (this is the essence of white noise).

The previous example generalizes to arbitrary data streams, as well. Data streams with lots of stochastic content must, in the absence of out of band regularity assumptions, necessarily imply extremely large, high order models and significant periods of adaptation before proper compression can be achieved. As the data approaches white noise, model size grows, best compression achievable lessens and the time to get within a certain percentage of the best compression becomes considerable.

Time domain coding: deltas, prediction and adaptation

From the above we now know that if something is known in advance about the signals we are going to code, that knowledge can perhaps be exploited to model a signal and that way get some of the information sent for free. The assumptions, coders and models used in text compression (dictionary coders and statistical modellers) are not directly suited to dealing with sound, however. This is because the data contains stochastic components and is generally better described in the spectral domain than in the time domain—in digitized analog signals, the information is generally spread out in the long term correlations between sample values instead of being highly time localised, and the relative phase insensitivity of the ear gives the coder great liberties with regard to the reconstructed time structure of the signals. This fundamental difference in the ideal domain of description renders text compressors practically useless for direct compression of signals, something which is easily seen in the fact that statistical coding never seems to have enough time to properly adapt to the correlations present in the signal while dictionary methods suffer from lookup mismatches caused by the noisy/nonlinear components. Instead we must use coding which deals with the frequency/phase structure of the sound. Linear filtering and spectral transformations are then the essence of audio coding.

The oldest and simplest trick for numerical sound coding is to use deltas (differences) instead of actual sample values. This coding scheme relies on the fact that most musical sound rolls off rather rapidly as we go to higher frequencies: a signal with little high end has quite a limited slope in the time domain, so storing differences between successive samples on average leads to far smaller stored values than storing the actual sample values. Smaller values equal a more restricted part of the numerical range being used and, then, the possibility of losing some of the bits (less range means fewer bits will suffice to code the range). This approach may not work directly (after all, there might be some full range values every now and then), but combined with statistical compression, delta coding can result in two to one effective compression ratios. (Lossless statistical compression employs the fact that even though full range values occur every now and then, the mean values are low and so we can use variable length code words to reduce the average number of bits sent per delta.)
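
As a minimal sketch of the idea in Python (using NumPy; the 440 Hz test tone, the 48 kHz rate and the helper names are purely illustrative):

    import numpy as np

    def delta_encode(x):
        # First sample is sent as-is, the rest as differences to the previous one.
        d = np.empty_like(x)
        d[0] = x[0]
        d[1:] = x[1:] - x[:-1]
        return d

    def delta_decode(d):
        # A running sum undoes the differencing exactly, so the scheme is lossless.
        return np.cumsum(d)

    fs = 48000
    x = (1000 * np.sin(2 * np.pi * 440 * np.arange(fs) / fs)).astype(np.int32)
    d = delta_encode(x)
    assert np.array_equal(delta_decode(d), x)
    print(int(np.abs(x).mean()), int(np.abs(d).mean()))   # deltas are much smaller on average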

A bit more generally, we might store differences between the signal and some arbitrary predicted value. If the prediction constantly hits close to the actual signal, we might be able to reduce the mean differences even further. Delta coding is seen to be a special case of this approach: we use the value of the previous sample as the predictor. Indeed, such predictive codecs are quite effective. We can even generalize a bit from here: if the predictor is a linear filter, it combines with the differencing operation to form yet another linear filter. We might then consider what happens if we, instead of aiming for accurate time domain prediction, try to match to the signal a filter with a frequency response inversely proportional to the spectrum of the signal. In this case, predominant frequencies are attenuated and the prediction residue stays even lower than in the simple delta coding variant. To keep the filter in synchronization with the varying characteristics of the signal, we need to change its coefficients every now and then. Naturally this will either have to be done starting from the decoded data or the coefficients will have to be sent explicitly if we wish the decoder to be able to decipher the coded values correctly. Applying some extra assumptions, we may even be able to dispense with the residue altogether because a set of filter coefficients will uniquely identify the magnitude part of a spectrum. This leads easily to an open loop architecture in which each update to the filter is sent explicitly instead of being calculated based on previously decoded information. This is one way to view the essence of linear predictive coding (LPC) methods.
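
A hedged sketch of the more general predictive idea, here with a fixed second order predictor of the kind found in simple lossless coders (the predictor choice is for illustration only, not that of any particular codec):

    import numpy as np

    def residue(x):
        # Fixed second-order predictor: extrapolate a straight line through the
        # two previous samples, p[n] = 2*x[n-1] - x[n-2], and keep the error.
        r = np.array(x, dtype=np.int64)
        r[2:] = r[2:] - (2 * r[1:-1] - r[:-2])
        return r

    def reconstruct(r):
        x = np.array(r, dtype=np.int64)
        for n in range(2, len(x)):
            # The decoder mirrors the predictor using its own previous outputs.
            x[n] = r[n] + 2 * x[n - 1] - x[n - 2]
        return x

For a smooth, low-end-heavy signal the residue is small on average, and the reconstruction is exact as long as the residue is transmitted losslessly.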

If perfect reconstruction is not a goal, we might also consider whether losing some numerical precision (bits transmitted) from the coded residue, or trying some alternative coding scheme for it, is more advantageous than dropping the same number of bits from the original signal. And indeed we see that this is the case: to reconstruct the signal from the delta code, we need the inverse filter of the one used to encode the signal (this includes both the predictor and the differencing operation). Such a decoding filter will have the greatest gain in the frequencies where the coding filter attenuated the most and vice versa. Now, if the response of the coding filter approximated the spectrum of the signal well, any quantization noise inserted by the bit shaving step will in the decoding filter assume a spectrum proportional to that of the reconstructed signal. This means that considerably more noise can be tolerated than when directly quantizing the signal itself—masking will take care of most of the noise, which is now concentrated on frequencies carrying most of the signal power. Delta coding illustrates the point beautifully: the coding filter is a high pass one and so the inverse filter must be lowpass. Hence any noise inserted after coding will be decoded into a signal with little high end and mostly be drowned by the greater low end content of the actual signal (which was the original assumption).

After calculating deltas or applying a more general filter, we are left with the residue. Often the predictor isn’t very good (high order time variant linear inverse filters are computationally expensive), but we still want to achieve some proper compression. In this case, we will have to do something about the residue. The simplest alternative—simple truncation—was already mentioned, but will work only if the reconstruction filter is of high enough order to keep the noise inserted in areas of the spectrum where it will be masked. This is often not the case. One alternative, then, is to employ a time domain approximation which codes the difference values with a constant, low number of bits and somehow guarantees that the error between the original and reconstructed signals will be kept in check. It is clear that we need something more powerful than sheer truncation because that would produce serious errors when the limited range of the coded word was exceeded. (Remember: the differences between successive samples can reach the same values as the signal samples themselves, plus we have to worry about the maximum positive and negative values as these float depending on the previous sample.) When the deltas stay small, maximum coding quality is achieved by using the truncated values. But when the deltas grow beyond the range covered by the transmitted words, we need to decode the words to bigger jumps to keep up with the waveform. This means that the actual size of the sample difference coded by a given transmitted code will have to adapt. This results in what is called Adaptive Differential Pulse Code Modulation, or ADPCM, and is used in many commercial codecs. The original SoundBlaster Pro cards could decode certain ADPCM variants, and Microsoft’s RIFF WAVE (.wav) sound files can encapsulate some others. The principle is that we code differences with a given set of codewords (typically 2 to 4 bits); the difference coded by a given codeword is driven up and down based on the previous codewords sent; and we always send the codeword that results in the best approximation to the delta between the value decoded from previous outputs and the current incoming signal value. This way the reconstructed waveform will try to follow the coded signal and we get lossy compression. The loss is seen as a variable slew rate limitation of the output and occasional overshoots when high frequency content is present. In practice, ADPCM is computationally cheap but not very good—the compression ratios are at most 4:1 and even then the perceptual quality of the decompressed signal is not very high.
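
A toy ADPCM-style coder, just to show the adaptation loop; the 4 bit codewords and the 1.1/0.9 step adaptation factors are illustrative and do not follow the IMA or Microsoft specifications:

    import numpy as np

    def adpcm_encode(x, bits=4):
        # Quantize the error against the previously decoded value with a step
        # size that grows when large codes are used and shrinks when small ones
        # are, so the decoder can track the step from the codes alone.
        levels = 1 << (bits - 1)                      # 8 for a 4 bit codeword
        step, pred = 1.0, 0.0
        codes = np.empty(len(x), dtype=np.int8)
        for n, s in enumerate(x):
            code = int(np.clip(round((s - pred) / step), -levels, levels - 1))
            codes[n] = code
            pred += code * step                       # what the decoder will see
            step *= 1.1 if abs(code) > levels // 2 else 0.9
            step = min(max(step, 1e-3), 1e4)          # keep the adaptation sane
        return codes

    def adpcm_decode(codes, bits=4):
        levels = 1 << (bits - 1)
        step, pred = 1.0, 0.0
        out = np.empty(len(codes), dtype=float)
        for n, code in enumerate(codes):
            pred += code * step
            out[n] = pred
            step *= 1.1 if abs(code) > levels // 2 else 0.9
            step = min(max(step, 1e-3), 1e4)
        return out

Note that the encoder updates its state only from quantities the decoder can reconstruct, which is what keeps the two in lock step.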

Frequency domain coding: subbands and quantization

Now it was already mentioned above that it is useful to keep any errors (noise) inserted by the codec close to powerful signal components so that the signal can mask the noise. In the above discussion some suitable time variant filter was used to accomplish this. But such filters are far too complex to update in real time and in reality have quite a number of stability problems and so on. Hence, we need a processing model which guarantees any errors inserted by the coding step will be masked by the actual signal. Let’s look at a very simple example first.

Suppose we have a signal which contains frequencies only from a very narrow, well defined band. If we can store the information in this band with the maximum efficiency implied by information theory (i.e. one nth of the total bandwidth will need one nth of the bitrate), we could think about what happens when we insert noise into this representation. Naturally any reasonable reconstruction of the original, bandlimited signal would only produce frequencies in the band of interest. So the noise inserted by the process would also stay there. If we could now further guarantee that the band was narrower than one critical bandwidth and that the total power of the noise inserted would always stay below the masked threshold of hearing in the presence of the signal, it would follow directly from psychoacoustics that we would not be able to hear the noise over the signal. Put in a computer compatible form: for each frequency band we can calculate, over all amplitudes and spectra of the input signal, a fixed ratio between the signal energy and the maximum energy of inserted noise that will not be heard in the presence of a signal with that amplitude and spectrum. This is the signal to mask ratio (SMR) of the band. If we can then guarantee that the energy of any noise inserted stays below the signal energy in the band divided by the SMR, the noise will not be heard. It is then very easy to compute how many bits can be dropped from the representation of the signal in the band before the power of the noise generated creeps over the maximum permissible value—the number of bits is a logarithmic function of the range of the signal, as is the decibel value of the SMR. We get a direct table lookup and a variable width truncation based on its outcome.
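
As a back-of-the-envelope sketch, assuming the usual rule of thumb that each quantizer bit buys roughly 6 dB of signal to noise ratio:

    import math

    def bits_needed(signal_db, mask_db):
        # SMR in dB: how far the quantization noise floor must sit below the signal.
        smr_db = signal_db - mask_db
        if smr_db <= 0:
            return 0                     # the band is entirely masked, send nothing
        return math.ceil(smr_db / 6.02)  # ~6.02 dB of SNR per quantizer bit

    print(bits_needed(70.0, 40.0))       # a 30 dB SMR calls for about 5 bits in this band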

Now we have seen that a perceptually transparent lossy coding of a narrow frequency band is possible and is actually quite straight forward to implement. The next logical step is to divide the whole audible bandwidth into such slices and code them separately. This is done with an array of parallel filters, called a filterbank. Easy, eh? Not really. This is because whenever we build parallel constructs from linear filters, we also have to worry about what happens to the phases of the signals in the adjacent filters. Were the filters ideal, no problem would arise since any frequency passed by one filter would not be passed by any other. But real implementations of linear filters are never perfect and there will be overlap between the passbands of adjacent filters. (After all, we cannot leave any guard bands because then those bands would be greatly attenuated in the total response of the coder‐decoder chain.) If any signal components then combine out of phase in the resynthesis step, we will end up with notches in the frequency response of the codec.

The steps taken here to ensure that noise will not be heard are reminiscent of another audio technology, namely, noise reduction. NR also uses frequency band methods to guarantee that any noise inserted by an analog transport will not be heard. In the context of noise reduction, we usually explain the workings of the method by saying that as the input filter increases the power of the signal and the output filter attenuates it, any noise introduced inbetween will also get attenuated. But this is precisely what we are doing here, too. Only in this case we can actually control the power of the noise (since we’re generating it purposefully) and we do not have to resort to varying the amplitude of the input signal. Keeping the noise below a certain fraction of the signal power attenuates the coding noise whenever the mask is not there, and the only difference from the analog noise reduction circuits is how the signal/noise ratio is kept favorable. All this leads to one perfectly legal way of viewing lossy perceptual compression: the loss occurs because of imprecise coding and takes the form of newly inserted quantization noise which is then reduced to inaudible levels by a high tech digital noise reduction architecture working at a superbly fine frequency and time granularity.

We already saw that building a working filter bank isn’t quite as easy as it would seem. It would be nice to have effective alternatives and indeed there are some. This is because we already know of a different way of transforming signals to the spectral domain: the Fourier transform. More generally, different classes of signals can be encoded most efficiently through the use of different transformed domains. In the case of Fourier transformations, the domain is a simple spectral one. In the case of wavelet type analyses, we have more complex structure. What these useful transformations have in common is that they in some sense approximate a filterbank—indeed, the filterbank based methods, called subband methods, contain the transform methods as a (possibly time variant) subset. Transform methods have the nice property that they can be efficiently implemented and they often have considerable theory behind them. The downside (especially with the simple Fourier type) is that they usually assume that the transformed signal is of finite length and so require splicing of the input for processing. This means we have to deal with the time discontinuities introduced by the chunking operation and, in the case of discrete Fourier methods, the openly cyclic nature of the transformation. DFTs also have the problem that while the frequency resolution of our ears is logarithmic (little resolution in the high frequencies, a lot in the lower ones), the DFT produces equidistant frequency bins and so allocates far too great a proportion of the total frequency resolution to the high end of the spectrum. This necessitates long transformations which then degrade the time resolution of the transformation.

More on models

Above, when we constructed our first approximation to a lossy subband coder, we used our knowledge about masking within critical bands to arrive at a bound on the amount of noise we can insert without doing perceptible damage to the signal. But we do know from psychoacoustics that masking is not so simple. The results for critical bands really are quite limited in applicability since the exact shape of the masking threshold created by a sound (especially a complex one, since the critical band results were originally derived for sinusoids), its time features and any further effects which might cover or bring out noise (e.g. binaural effects) were not considered at all. Next we’ll see how to model the masking curve more completely to arrive at even higher levels of compression.

We saw in the chapter on hearing that the mask elicited by a sinusoid stretches a whole lot higher than a simple critical bandwidth. We also know that composite sounds with complex spectra can have quite involved masking properties with a lot of coverage upwards in the spectrum. This suggests more coding noise can be tolerated if signal content is present below the band currently being coded. The problem is, the shape of the masking curve seems to vary over the audio bandwidth. To get a slightly more manageable representation, we need a change of scale. Namely, we have to use a frequency scale on which the critical bandwidth assumes a constant numerical value. This is the Bark scale and on it, the mask put up by a sinusoid suddenly becomes almost identical in shape regardless of the frequency of the masker. Only the vertical scale varies, but this really represents just a pointwise multiplication. There is not very much variation based on masker amplitude, either, so we seem to have a very good representation to work in—from here on, before we start to calculate the SMRs we switch to the Bark scale.
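
One commonly quoted approximation for the Hz to Bark mapping (a Zwicker style formula; several variants exist in the literature) is easy to put into code:

    import numpy as np

    def hz_to_bark(f):
        # Zwicker-style approximation of the Bark scale (one of several in use).
        f = np.asarray(f, dtype=float)
        return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

    print(hz_to_bark([100, 1000, 4000, 16000]))   # roughly 1, 8.5, 17.3 and 24 Bark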

After we have the mask contribution of each analysis band, we need to combine these to form a plausible estimate of the total masking threshold over the audio band. There are multiple ways to do this, some more conservative than others. Absolutely the safest one would be to calculate in each band the minimum value over all the masking contributions. This way we would end up with an estimate that could never be too low. But very little compression would result. A better way would be to instead calculate the maximum. Here we assume that if some band is masked to some particular degree by multiple separate sounds, combining those sounds will in each band give at least the same masking effect that the strongest separate masker would have given by itself. This is not an unreasonable assumption, although in some very rare cases (like when highly audible beats are formed and the spectrum is line like) it might in fact momentarily fail. But even this is considered overly conservative. In most cases we simply sum all the contributions in each band. In addition to agreeing well with psychoacoustics for strictly steady state spectra, such a transformation is very easy to calculate. This may be a bit surprising, but when we remember that we are in the Bark domain and the masking contributions of all bands are just shifted, scaled versions of one static prototype, we see that the calculation is really a linear convolution of the prototype masking curve with the measured spectrum and can be implemented by linear filtering over the magnitude spectrum. This is a rather funky way to utilize the LTI assumption. Actual implementations use both FIR/IIR approximations and the more heavyweight FFT version (especially when the masking curve needs to change). What follows then is an SMR calculation and a conversion back to the original frequency domain for quantization, in whichever order is more convenient.
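
A sketch of that convolution, with a crude triangular spreading prototype whose slopes are only illustrative (real models make them depend on masker level):

    import numpy as np

    # Crude Bark-domain spreading function: about +25 dB per Bark below the
    # masker and -10 dB per Bark above it; the numbers are placeholders.
    offsets = np.arange(-6, 7)                     # Bark offsets from the masker
    spread_db = np.where(offsets < 0, 25.0 * offsets, -10.0 * offsets)
    prototype = 10.0 ** (spread_db / 10.0)         # back to the power domain

    def spread_masking(band_power):
        # band_power: signal power per Bark band.  Summing shifted, scaled copies
        # of the prototype is exactly a convolution along the Bark axis.
        return np.convolve(band_power, prototype, mode="same")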

Up till now, we have only discussed static calculations within a single analysis block. But we know that masking phenomena do have temporal aspects. To take advantage of these, we will then need to model the onset and offset of masking. If we assume that the mask caused by signal energy at one frequency sets in and recedes precisely in the same way in all the other bands (modulo the frequency dependent constant scale factor we get from the masking threshold curve), these effects can be quite convincingly approximated by passing the calculated threshold for each band through a first order IIR lowpass filter. This way we get a relatively cheap exponential decay which can be in good agreement with the measured time behavior of the masking phenomenon. A second filter, some hysteresis logic and a slight processing delay can also be used to separately accommodate backward masking. On the other hand, the time resolution of a Fourier based analysis will be so poor in the higher frequencies that such processing is of little practical use. With alternative decompositions the situation is of course a bit different.
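
One simple way to realize the exponential decay described above is sketched below, with an instant attack and a per band exponential release; the decay constant is illustrative, and a plain one-pole lowpass over the thresholds (as described in the text) would be an equally valid variant:

    import numpy as np

    def smooth_thresholds(thresholds, decay=0.7):
        # thresholds: array of shape (frames, bands) holding masked-threshold
        # power estimates.  The mask is allowed to linger for a while after its
        # masker has gone, which approximates forward masking.
        thresholds = np.asarray(thresholds, dtype=float)
        out = np.empty_like(thresholds)
        state = np.zeros(thresholds.shape[1])
        for t, frame in enumerate(thresholds):
            state = np.maximum(frame, decay * state)   # instant attack, slow release
            out[t] = state
        return out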

For further improvement in prediction quality, very short transients and extremely sparse spectra will need to be processed separately because they do not obey the usual rules of masking. Very rapid transients, in particular, cause considerable splashing of energy over the whole audio band and so may fool the masking calculation into raising the noise floor too high. This effect is most pronounced when block transformations are used, since whole blocks are affected.

Entropy coding and variable rate codecs. Bit allocation.

After the coding and quantization steps, we are left with a representation of the sound which has little perceptual redundancy. But so far no statistical compression has been implemented. Far from removing statistical redundancy from the signal, quantization actually tends to increase it, and even if no quantization is employed, the preceding filtering step would make the data considerably more amenable to both dictionary based and statistical modelling compressors. This is because most of the long range dependencies in the sound have been transformed into the division of bands and out of band noise is effectively reduced. We now work at a level considerably closer to what we hear, so interesting (musical) sound tends to show quite a lot of structure which can be exploited. As for quantization, it greatly reduces the range of values a signal can take and as the original signal often varies smoothly, there tends to be some repetition in the quantized values. This is precisely what statistical compression needs to function. The reduction in data rate resulting from going from wideband signals to subbands isn’t a bad thing either. All in all, classical lossless compression strategies mesh surprisingly well with the preceding lossy processing. If we decide to go with statistical compression, we most commonly insert a low order Huffman or arithmetic coder, because most of the long range dependencies in the input have already been utilized in the band division step and so the added overhead and reduced fault tolerance of higher order modellers (including dictionary coders) are not justified.

Now, compression also brings its own problems to the actual implementation. This is because statistical coders like the two mentioned above rely on variable length codes. It is difficult to tell beforehand how many bits the compressed signal will take, so maintaining a constant bit rate in the output of the coder becomes quite complicated. After all, we need to remember that in digital audio applications, computer ones are practically the only ones in which variable bit rates can be allowed. Furthermore, the results produced by the perceptual codec vary greatly depending on the masking properties of the signal and we should always try to fill the output pipe as fully as possible so that we get maximum sound quality from the bits available. Combined with the difficulty of estimating just how many bits the statistically compressed bitstream will take, the compressor becomes quite complex. We need to somehow prioritize the incoming, quantized channel coefficients so that when we run out of bits, we can drop the less important ones and when we have extra capacity, we can transmit the most useful information available. Then we need a way to delimit the variable length coefficients without too much overhead and some mechanism which guarantees that the decoder always has enough information to follow the coder’s doings. This implies some extra state in the decoder and the attendant state maintenance machinery, as well as iterative compression if a given initial bit allocation over/underflows the available space in the output frame. Of course, the data required for such iteration cannot all be available to the decoder (since the iteration aims at deciding which coefficients to include in the stream) and so explicit information on what was included in the output must be sent as well.
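
The prioritization can be sketched as a greedy loop which always spends the next bit where the quantization noise is currently the most audible. The 6 dB per bit rule and the cost model (one extra bit per sample in a band) are simplifying assumptions:

    import numpy as np

    def allocate_bits(smr_db, frame_bits, samples_per_band):
        # smr_db: per-band signal-to-mask ratios in dB; frame_bits: bit budget;
        # samples_per_band: cost of adding one bit of resolution to each band.
        smr_db = np.asarray(smr_db, dtype=float)
        bits = np.zeros(len(smr_db), dtype=int)
        budget = frame_bits
        while True:
            nmr_db = smr_db - 6.02 * bits          # current noise-to-mask ratio
            order = np.argsort(-nmr_db)            # worst band first
            for b in order:
                if nmr_db[b] <= 0:
                    return bits                     # everything already inaudible
                if samples_per_band[b] <= budget:
                    bits[b] += 1
                    budget -= samples_per_band[b]
                    break
            else:
                return bits                         # nothing useful fits the budget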

In the preceding it became abundantly clear that there are a lot of things both the encoder and the decoder need to keep track of. This state will often need to be in synch at both ends of the pipe, so synchronous updates will somehow need to be guaranteed. Two general state update strategies exist. In the first approach the coder changes state based on data decoded from its own output and so ensures that the decoder always has the same data available that the coder used. This is called backward adaptation. Its prime benefit is simplicity and efficiency—no separate control data needs to be passed. On the other hand, the scheme is not very error tolerant (errors may cause us to lose synch) and there is little flexibility in the chain. When we pass explicit parameters to control decoder state, we are doing forward adaptation. This is quite flexible as the encoder can always control what the decoder is doing, plus errors cannot accumulate over time. But now we use bandwidth for the control data as well. Most actual coders cross these two basic approaches. An example of backward adaptive processing is the way ADPCM works—we do not send the delta between the current input and the previous input but rather take the difference of the input and what the decoder would see as the current value. Predictive methods in general rely on backward adaptation. Block floating point coding will serve as the example of forward adaptation. This scheme codes fixed length blocks of data by first calculating the maximum amplitude of the signal in the block and sending that explicitly, after which the signal block is normalized to fullscale amplitude, quantized to a fixed bit depth and sent on its way. Here, control data comes in the form of the gain coefficient and adaptation is seen in that the noise floor floats based on signal amplitude. (When the input signal is of small amplitude, it will be greatly amplified and any quantization noise will be greatly attenuated when the decoding step returns the signal to its original amplitude.)
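
A minimal sketch of that forward adaptive scheme; the 256 sample block and 8 bit words are arbitrary choices:

    import numpy as np

    def block_encode(x, block=256, bits=8):
        # Forward adaptation: send one scale factor per block as side information,
        # then quantize the normalized block with a fixed number of bits.
        frames = []
        for i in range(0, len(x), block):
            b = np.asarray(x[i:i + block], dtype=float)
            scale = float(np.max(np.abs(b))) or 1.0
            q = np.round(b / scale * (2 ** (bits - 1) - 1)).astype(np.int16)
            frames.append((scale, q))              # (explicit control data, codes)
        return frames

    def block_decode(frames, bits=8):
        return np.concatenate([scale * q / (2 ** (bits - 1) - 1) for scale, q in frames])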

When some forward adaptive capability is present, we might wish to do a bit more than just stuff each outgoing frame of bits to its fullest. This is because the information content of audio signals tends to vary greatly. Attack transients, for instance, have a great deal of noise content and spectral complexity while a steady state portion of the same sound might well be extremely compressible. We might then wish to vary the bit rate produced within some limits so that when we are processing an easy segment of the signal, we lower the bit rate a bit and when the next difficult part comes, we use the bits spared earlier. Of course, doing this optimally requires some processing delay, some high level optimization on behalf of the coder and a method to tell the decoder (which cannot look ahead in the signal) what the precise number of bits produced is. Buffering is also required at the receiving end so that the incoming stream of bits can be received at a constant rate but read from the buffer at a variable one. The constraint is that at each moment in time, the receiver must know which bits are reserved for which band and that any variations in the transmitted bit rate must never cause buffer over/underruns. The deliberate slack created by the approach is usually referred to as the bit reserve.

Multichannel coding specifics

You may have noticed that directional hearing, multiple channels and surround issues have not cropped up yet. This is very deliberate, since most practical compression centers around coding single channels of sound—sound DSP generally centers on one dimensional signals as they are a lot easier to handle. In fact, the theory of multichannel coding and storage is only now beginning to develop. As a simple example, some codecs like Sony’s ATRAC simply code each channel separately. This trivial approach will not be discussed in much detail later on. Similarly, most sound codecs really throw in multichannel capability more or less as an afterthought. Depending on one’s point of view, this can be a necessary relief or a great shame.

The first nontrivial scheme for coding multiple channels relies on the fact that multichannel audio often has strong correlations between the different channels. Dolby Laboratories’ age old matrix surround (called Dolby Surround or Dolby Stereo) uses precisely this fact to embed more than one audio channel into a stereo recording—since the three front channels (left, centre and right) in a movie soundtrack are almost invariably composed by just panning sound sources around and most sources stay in the middle third of the available space, most of the information is actually carried by the average of the channels. This means that differences between the channels will be quite small and we can effectively hide an extra channel there. As a further benefit, the surround signal we would like to place there should not be readily localised (so phase response is not a problem), it can be band limited (high definition is not an issue either) and some leakage between the channels is tolerable. We then code the surround signal by placing ±90° phase shifted versions of the signal in the main channels and decode by differencing (the total phase difference will be very nearly 180° over the limited band of the surround channel and differencing the main channels reproduces the surround, provided the main channel signals have little content other than the average). The phase shift makes channel overload and coherent localisation of any interchannel leakage much less likely. Modified Dolby B noise reduction is used to further reduce crosstalk and the surround signal is delayed on playback by some 10‐20 milliseconds to let the Haas effect pull any leakage from the main channels to the front. This example shows how we can use assumptions about the acoustical material to derive coding efficiency.
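
A schematic version of such a 4:2 matrix is sketched below. It follows the commonly published form with -3 dB centre and surround coefficients and a 90 degree shift obtained from the analytic signal; it illustrates the principle only, is not the actual Dolby encoder, and omits the band limiting, delay and noise reduction mentioned above:

    import numpy as np
    from scipy.signal import hilbert

    def matrix_encode(L, C, R, S):
        # Centre is split equally between the two totals; the (ideally band
        # limited) surround goes in at opposite phases so that it decodes from
        # the difference Lt - Rt.
        s90 = np.imag(hilbert(S))          # ~90 degree phase-shifted surround
        Lt = L + 0.707 * C + 0.707 * s90
        Rt = R + 0.707 * C - 0.707 * s90
        return Lt, Rt

    def passive_decode(Lt, Rt):
        # Passive (non-steered) decode: centre from the sum, surround from the difference.
        return Lt, (Lt + Rt) / 2, Rt, (Lt - Rt) / 2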

Dolby Surround also gives another example of the ways we can make extra assumptions count. This is exemplified by Pro Logic decoding, which is a specific algorithm to decode an existing, unmodified Surround track so that interchannel leakage is reduced. The general concept is called steering, and it is a form of multichannel, and sometimes multiband, dynamic processing. The central reasoning goes as follows: we perceive leakage when there is a dominant direction from which a sound should emanate but we also hear the same sound diffusing to other directions. Any such dominant direction will largely be reproduced by a single channel of the reproduction array, dominant sounds can be assumed to be the loudest ones, and it should be unlikely that in the presence of multiple dominant sounds coming from different directions we could hear any slight leakage from one of them to an improper direction. We should therefore be able to detect such singular dominant directions from the amplitude balance of the decoded channels and reduce any leakage by simply attenuating the directions which do not carry the dominant sound. (Notice the huge pile of assumptions presented thus far.) So to make for better playback, we decode the matrixed signal as usual (producing four more or less separate channels as far as Dolby hype goes), then calculate the instantaneous power at each channel and form two differences: front‐back and left‐right. These measure the concentration of sound energy in the listening plane and are used to form a dominance vector with the differences as components. When the length of this vector exceeds a given threshold, we know that some direction carries significantly more energy than the others. We then use the direction of the vector to attenuate the outgoing signals to the speakers which do not reside in this dominant direction. To make the system more stable, we include some hysteresis and rigidity so that sudden changes in signal power do not wiggle the soundfield unnervingly, and also some provisions for sideways and front‐back movement so that if the sound field is symmetrical about one of the axes, the other one will be attenuated. This sort of steering is precisely what was used in early quadraphonic systems and is the reason why most quadraphonic music suffers from the excessive use of ping‐pong sound sources—these are the only ones which a matrix quadraphonic system can reliably separate (by steering them to the right speaker(s)). Circle Surround is a newer technology which extends the steering model by applying it separately in separate frequency bands (to keep movement in one band from moving still sound sources in the others, and to match the amount and reaction time of the steering to the varying time constant of our perception over the audio band) and by employing more sophisticated analysis. SRS Labs’ marketing touts CS as a system capable of delivering soundtracks almost indistinguishable from 5.1 discrete channels from a stereo source. It is widely acknowledged that while well thought out steering logic can considerably raise the quality of a theatrical sound system, it is of dubious value with less specialized audio and, especially, recorded music.

Dolby Surround uses a kind of transformation domain and a fair pile of psychoacoustic/application specific assumptions to fit 4 channels into two actual transmitted signals. The coding is then inherently lossy and not suitable for compression, per se. However, the above scheme leads us to wonder whether using a similar transformation to do a one to one mapping (4 channels to 4 channels, for instance) would be beneficial compared to coding the channels separately. Indeed this is the case, especially with heavily produced material where sound sources are balanced between the channels by intensity panning. The most common example of a coding like this is when we transmit a sum and a difference instead of two separate channels. (This sum/difference, or mid/side, representation is used to transmit stereo in FM radio and in MPEG audio codecs to code stereo material.) We notice that in the general case of n‐input/n‐output operation, arbitrary combinations of the kind described above can be expressed as a multiplication of the incoming multichannel signal, thought of as an n‐vector, by a fixed n by n square matrix. Consequently, such operations are called matrixing. Three channel matrixing is used in television and image compression systems (like NTSC, PAL/SECAM, JPEG and MPEG) to transmit color information. In analog television and radio systems, we actually matrix for compatibility reasons (to retrofit unichannel systems like black and white television or FM mono radio for multichannel use while still using the original bandwidth to advantage), but it was seen early on that similar transformations bring out some of the common features in the different channels and aid in compression. A very primitive lossy application is seen in television systems where the color difference channels (chrominance) are transmitted at a resolution lower than the luminance one (they are only given half the bandwidth). In audio coding, the same principle applies. But since audio signals cannot be bandlimited without perceptual degradation, we now have some new problems—lossy audio coders insert broadband noise and it isn’t quite obvious that channels bound for directional sources, and differences between them, obey similar psychoacoustics.
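
The two channel case in code; the toy signals only serve to show where the energy goes when the channels are strongly correlated:

    import numpy as np

    def ms_encode(L, R):
        # Mid/side matrixing: for strongly correlated channels the side signal
        # is small and cheap to code; the transform is trivially invertible.
        return (L + R) / 2, (L - R) / 2

    def ms_decode(M, S):
        return M + S, M - S

    L = np.random.randn(1024)
    R = 0.9 * L + 0.1 * np.random.randn(1024)
    M, S = ms_encode(L, R)
    print(np.var(M), np.var(S))   # nearly all of the energy ends up in the mid channel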

More concretely, how do we know that noise inserted into a difference channel will still be masked when it is decoded? And the answer is, of course, we do not. Running separate conventional masking models on matrixed channels may severely distort the spatial image reproduced. Incidentally, this is one of the better ways to discern MP3s from their PCM masters. The problem is most pronounced when the difference channel carries considerable energy and, instead of simple intensity panning, sound sources affect the channels essentially independently. In a system which treats all the channels equally (i.e. there are no dedicated surround channels) we can analyze the phenomenon through the concepts of directional unmasking and binaural masking release. Masking as a phenomenon strongly depends on the angular separation of masker and masked and also on their absolute angular positions with respect to the pinnæ, and so somewhat on the exact time and phase relationships between them. This directional aspect of masking can in the worst case reduce the masking effect by close to 7dB from the more conventional figures observed. In multichannel environments, interchannel interference can lead to signal cancellation and thus reduce the masking effect. This is what is usually called binaural masking release and can lower the masking threshold by over 20dB! As for surround, I know of no comprehensive theory of masking for weakly directional/nondirectional sound sources. As such it is understandable how a conventional masking model on matrixed material will produce problems. With sum (average) channels the problems are usually less pronounced because the masker and the masked tend to stay together even after decoding and intensity panning is so common. But quantization noise will still leak from one channel to the other after decoding. This is easily seen because signals present in one channel will cause the psychoacoustic model to raise the masking threshold regardless of whether the other channel contains sufficient energy to mask the resultant noise (which is evenly distributed after decoding the sum channel). Matrixing clearly isn’t the best possible scheme for lossy multichannel coding. But even though modern coders rarely resort to explicit matrixing, it is important to know its limitations since audio material with analog matrix surround often needs to be transferred over lossy stereo channels for backwards compatibility.

We have already discussed coding in the spectral domain. The logical continuation of the preceding is to try matrixing subchannel coefficients. But this approach will likely do little good—after all, we went to the frequency domain to achieve entropy compaction. Now, above I mentioned that localised crosscovariances are a good thing. We might ask whether artificially forcing this condition to hold on a per analysis band basis degrades the codec noticeably. And it just so happens that it sometimes doesn’t. This is called joint channel coding or coupling. It is mostly useful in the high frequencies (above some 2kHz) where the ear can no longer accurately follow interaural phase differences but instead relies on an intensity analysis. If we go as far as to require complete locality in the crosscovariance functions, we end up with a single signal scaled on a per channel basis. This permits us to send a single channel plus a set of scaling coefficients instead of actual signals. In practice we send an average of phase normalized versions (to avoid cancellation because of differing phases) of some set of bands and an amplitude vector (to facilitate intensity panning between channels). Of course, we need to consider both time and frequency domain aspects of masking before making this simplification. As concrete examples, both Dolby AC‐3 and MPEG AAC utilize this approach.
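
A sketch of such per band coupling; the polarity alignment below is a crude stand-in for the phase normalization mentioned above, and the whole thing assumes the band lies above the frequency where the ear stops following interaural phase:

    import numpy as np

    def couple(bands):
        # bands: (channels, coefficients) spectral coefficients of one coupled
        # high band.  Transmit a single carrier plus per-channel levels.
        bands = np.asarray(bands, dtype=float)
        scales = np.sqrt(np.mean(bands ** 2, axis=1)) + 1e-12
        signs = np.sign(np.sum(bands * bands[0], axis=1))   # crude polarity alignment
        signs[signs == 0] = 1.0
        carrier = np.mean(signs[:, None] * bands / scales[:, None], axis=0)
        return carrier, scales * signs                      # what gets transmitted

    def decouple(carrier, scales):
        # Reconstruction is pure intensity panning of the shared carrier.
        return scales[:, None] * carrier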

The previous discussion centered on classical, discrete channel sound coding. But there are also such newer audio architectures as Ambisonics and, more recently, wave field synthesis which do not code discrete channels at all but aim at recording some salient aspects of the more comprehensive sound field. These schemes represent entirely new challenges to lossy coding because they generally rely heavily on complex phase relationships between the recorded signals and also aim at maximum statistical separation between the channels. (More so with Ambisonics and its spherical harmonics than WFS with its spatial sampling.) This necessitates heavy time domain analysis and due consideration of directional unmasking phenomena. Consequently all current proposals for sound field based audio architectures list lossless transmission as a prime requirement.

Advanced topics

If we started to build a complete audio application based on what has already been said, we might at best get something like eight to one compression with near perceptual transparency. Problems would mostly occur if we tried to lower the bit rate even further, start with a lower sample rate in the first place or process content rich in transients. To get past ten to one ratios or to achieve true transparency, we need some refinement in the signal chain. Some of the relevant mechanisms for going beyond the basic multiband quantizer are discussed next.

More on masking

As masking is the central enabling factor in audio compression, it is important to realize that as of now, we have barely scratched the surface of masking research. There are lots of special circumstances in which masking works quite differently from the simplified model presented above. The reason for this is that most of the theory presented above is still rooted in studies performed on periodic, nonmusical, single channel sound sources which have very little to do with the things we hear all day long in our natural environment. The masking rules presented so far also suffer slightly from the way we modelled them—as a convolution in the roughly linearized domain of the Bark scale.

 ‐low frequency exceptions to masking rules
  ‐details, anyone?

The first significant difference occurs in the low register. It appears that frequencies under some 500Hz mask higher frequencies progressively less as we go towards the lower limit of hearing. In fact, at 50Hz the effect imposed at 1kHz is already some 20‐30dB down from the value we would expect. So subsonic hums mask hardly anything, even though in some circumstances (like with theatrical special effects sound tracks) it is quite desirable to convey them as well. Given that most masking models do not address the effects caused by varying the SMRs from one transform block to the next, or the spectral smearing (clicks and pops) caused by FFT based transform coders on block boundaries, it is only too easy to produce a coder which overestimates the amount of masking available by virtue of the low end content, splashes energy from the low frequency waves (which span multiple transform blocks) around the unmasked middle band by making adjustments to the respective transform coefficients too rapidly, and does not consider how the harmonic distortion and noise products caused by the quantization step are handled when only very low frequency material is present. (Think this cannot happen? Listen to the characteristic hum which runs behind each and every Star Trek TNG episode…) So we need special exceptions to the masking rules at low middle to very low frequencies. The good news is, temporal processing is rarely an issue here since transform lengths are usually optimized for high middle to high frequencies.

A further source of degraded coding quality is the fact that certain structure and some forms of predictability in a signal to be coded can locally raise the masking threshold beyond what a spectral analysis might lead us to believe. Beats present a fine example: if we analyze a sum of two narrow bands of noise and a sum of two sine waves centered on the same frequencies, the power spectra will be highly similar (especially if windowing and transformations of domain (like linear to Bark) are used). But the combination of sinusoids will break through a masking threshold as much as 10dB higher, because it produces regular beats. So sometimes regular time structure within a single critical band will present problems. A connected problem (which I have never seen anyone mention) is that some highly structured waveforms (like simple low frequency analog synth tones) react pretty violently to the kinds of manipulation done by sound coders (I bumped into this with 192kbps MP3 and digitally generated, filtered square waves). It seems that there can be enough coherent time structure over the entire audio band to somehow diminish the masking effect imposed by the lower partials on the upper ones. I’ve seen some references to this phenomenon in connection with research on speech intelligibility, where higher harmonics of a speech sound are heard even when presented at an amplitude which, in the presence of increasing additional noise, falls unexpectedly fast beneath the masking threshold. I would surmise that this has some connection to the volley theory of pitch perception/hearing since a massive synchronization of neural impulses by such periodic wideband signals could well follow statistics separate from the ones encountered with natural wideband signals, which are usually presented with some extraneous sound sources present. If this is indeed the case, it lends new credibility to my argument that electronic music should be included in ABX tests of audio coders in addition to the more traditional audiophile material of symphonic and jazz music.

 ‐co‐release of masking threshold?

Another notable difference between our current masking model and the way we actually hear is the threshold of hearing. Our model never determines that a given frequency cannot be heard simply because it is at a low amplitude. Only if a strong enough mask is present will the model permit killing a frequency band. Of course, this is a feature of any proportional amplitude audio signal chain: if the user can turn the volume knob, we can never assume that a sound is too quiet to be heard. But in calibrated, absolute amplitude environments (such as in cinemas, which calibrate both the analog and the digital signal chains precisely to predetermined standards; this is in part necessary for the noise reduction to work, as well) we can always calculate some suitable cutoff level below which people simply will not be able to hear a given signal. This can be used to advantage if we really work in such a nice environment. On the other hand, bringing such compressed material to a relative amplitude environment is simply asking for trouble—the results are horrible.
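
In such a calibrated setting, the threshold in quiet can simply be tabulated or approximated. One frequently quoted approximation (Terhardt's formula; the exact constants vary a little between sources) looks like this:

    import numpy as np

    def threshold_in_quiet_db_spl(f_hz):
        # Terhardt-style approximation of the absolute threshold of hearing in
        # dB SPL; only meaningful when playback levels are calibrated.
        f = np.asarray(f_hz, dtype=float) / 1000.0
        return 3.64 * f ** -0.8 - 6.5 * np.exp(-0.6 * (f - 3.3) ** 2) + 1e-3 * f ** 4

    print(threshold_in_quiet_db_spl([50, 1000, 4000, 16000]))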

Temporal processing. Transient response concerns.

Often when we go to the frequency domain for some processing, we completely forget what happens in its dual, time. When using Fourier decompositions, this is an understandable but unfortunate thing—our hearing is well equipped to process both aspects of sound simultaneously and to rapidly switch from a primarily time based interpretation (like resolving a transient into an onset) into a frequency biased one (a short, rapid succession of identical transients may be interpreted in terms of frequency). Fourier analysis also has the further complicating feature that the precise effect of phase relationships is rather counter intuitive and so most of the people building audio systems would happily forget the phase data altogether. This section centers on the consequences of the uncertainty principle for frequency domain audio coding.

Remember, the uncertainty principle states that time and frequency resolution are always inversely proportional, so fixing one limits the other. For filterbanks this means that the more frequency selective they are, the less time resolution each channel has. On the other hand, the more time resolution we want, the wider the transition bands between adjacent filters in the bank. This can be seen in that the bandwidth of a given channel is directly proportional to the bitrate required to carry it (even if the bit depth stays constant)—more bits per second equals more sample periods (action) per time unit and hence better time resolution. This would not have much relevance to our discussion, except that we are inserting an error signal onto each channel in the lossy phase. Steep rolloff in the reconstruction filter will lead to worse time resolution—the filter rings and spreads out any error that we happened to insert. Now, this would not be such a problem if the particular time‐frequency tradeoff utilized by our ears were well matched to the one we have chosen for our filterbank. The problems arise because MDCT and other fixed bandwidth filterbanks divide the total bandwidth into a pile of equally wide subbands (which implies identical time resolution for all frequencies) while human hearing uses a logarithmic division (so that going up an octave also doubles the time resolution). This means that when we try to get adequate frequency resolution in the low register, time blurs up far too much in the higher frequencies—any quantization error in the higher bands will not be localised, there will be ringing and the precise time of occurrence of transients will be blurred. When we optimize for passable time resolution in the high register, we end up with poor frequency resolution in the low end. This in turn leads to poor compression, as the filter can no longer effectively extract long range correlations from the signal, and to poor masking performance, as the reconstruction filters will not be steep enough to keep quantization noise from leaking into adjacent bands.

In the context of FFT and MDCT based codecs, which are usually linear phase, poor time resolution in the high register manifests in a very characteristic and problematic form: any quantization noise inserted in the transformed domain will spread over the whole transform block, including ahead in time. This phenomenon even has a name, preechoing. The name accurately describes the situation in which frequencies present in a transient cause a psychoacoustic model based on a steady state analysis to raise the noise floor on those frequencies and the time inaccuracy of the filter in the high frequencies is reflected in the spreading of the error over the whole transform block. This manifests as an audible, colored noise burst that precedes the onset of the transient, a preecho.

There are many ways to solve these problems. We will first consider a straightforward approach. Instead of trying to alter the balance between frequency and time resolution, we might consider using a high frequency resolution and then just appending a separate correction (residue) signal whenever our psychoacoustic model tells us that the time masking characteristics of the signal are not sufficient to cover some effects of the quantization step. The downside of this approach is that extra data needs to be sent, but the problem is not as great as it seems—the spread out quantization error really needs to be cancelled only up to the occurrence of the first transient (forward masking will take care of the rest) and generally the amplitude of the error will be small if the masking model is correct. An even simpler approach would be to flag a block or two as having transient content and simply not quantize as aggressively in the high frequencies as we would under normal circumstances. This works when transients occur infrequently, as a transient splashes so much energy across the spectrum that momentarily resorting to e.g. prediction for the less time sensitive lower part of the spectrum will not present a problem.

A production quality CODEC then probably needs something a bit more sophisticated. Probably the most elegant solution to joint time‐frequency problems is to go to time‐frequency analyses which typically divide the band logarithmically. But these are not as efficient to implement as Fourier transformations, they do not fit steady state musical data very well, there are intellectual property issues and the subsequent processing may not react very well to such an analysis (there is no frame structure, the complexity of the bases we are implicitly employing may cause complex fluctuation in the output which confuses subsequent prediction steps and statistical coding, et cetera). It is more common, then, to use a constant bandwidth filterbank but choose the number of analysis channels on the fly. In the case of MDCT filterbanks, this can be easily accomplished by halving the length of the analysis block and altering the twiddle and windowing operations slightly. This way we can use longer blocks (less time but more frequency resolution) for steady state content and, when we see transients, switch to the shorter block length (more time resolution but less frequency accuracy). This window halving design is quite common in perceptual codecs. On the negative side, low frequency resolution suffers as well, we need transient detection logic and extra data in the output to indicate which block length is being used. Additionally, any such switching will have an impact on subsequent prediction and entropy coding steps.
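
The transient detection itself can be as simple as looking for a jump in short-time energy within the candidate block. The block lengths and threshold below are arbitrary choices, and real coders also use perceptual entropy style measures; the frame is assumed to hold at least one long block:

    import numpy as np

    def choose_block_length(frame, long_len=2048, short_len=256, ratio=8.0):
        # Chop the candidate long block into short-block sized pieces and look
        # for a sudden jump in energy between neighbouring pieces.
        frame = np.asarray(frame, dtype=float)
        parts = frame[:long_len].reshape(-1, short_len)   # long_len % short_len == 0
        energies = np.sum(parts ** 2, axis=1) + 1e-12
        transient = np.any(energies[1:] / energies[:-1] > ratio)
        return short_len if transient else long_len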

A cheaper solution to the problem is obtained by first aligning the frequency resolution of an equal bandwidth filterbank to that of our hearing at one frequency (say, at 1kHz) and then observing that as we go farther away from that base frequency, one of the two dimensions of resolution begins to degrade with respect to our hearing. Now, preprocessing the signal with another filterbank which has a logarithmic frequency division but only a few bands makes it possible to use a different block length for each of those bands; when the alignment is done in the middle of each band, we get a sufficiently well matched time‐frequency tradeoff over the smaller bands even if there are very few of them. The solution is a kind of cross between constant Q filterbanks and MDCT ones and is distinctly cheaper to implement than the former. This approach has some trouble with the band transitions and the duplication of information between the first step analysis ranges.

In MPEG AAC, a highly interesting and innovative, if somewhat expensive, method is used to solve the problems associated with constant bandwidth filterbanks. This solution is called Temporal Noise Shaping (TNS) and it works through linear prediction in the spectral domain. The basic theory is that when transients are present, the spectrum becomes less noise‐like and starts to oscillate. The more concentrated the transient, the more closely its spectrum will approximate a complex exponential. This happens because the forward and inverse Fourier transforms are essentially the same operation: just as a narrow‐band signal produces a concentrated spectrum, a signal concentrated in time produces a spectrum which is spread out and oscillates across frequency. This property is reflected in the output of an MDCT filterbank, making its output predictable across frequency when transients are present.

When it detects transients in the input signal, TNS switches from plain coding of the filterbank output to linear predictive coding (LPC) across frequency. We remember that LPC separates its input into a convolution of the impulse response of a fairly low order linear IIR filter with a residual. Essentially, the residual is whitened to a degree limited by the order of the filter produced. The properties of the Fourier transform mean that, viewed in the time domain, this separation looks like a multiplication of the respective backwards transformed versions of the filter impulse response and the residue. Essentially, the inverse transform of the impulse response codes a volume envelope by which to weight the inverse transform of the residual to arrive at the original signal. The filter parameters are transmitted as side information to the decoder. As it is whitened, the residue loses its localisation and we no longer have to care about the temporal smearing which arises when it is quantized. Inverting this procedure amounts to a time‐varying amplitude modulation of the inverse transformed residue, which adaptively blanks any audible preechoes. The beauty of TNS is that it is readily usable with any constant bandwidth filterbank (such as the MDCT so common in real life CODECs) and that the coding of the amplitude envelope is quite compact, requiring little side information to be sent.
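
The following sketch shows the core idea of prediction across frequency, not the actual AAC TNS syntax: a textbook Levinson-Durbin recursion fitted to the autocorrelation of the spectral coefficients, with the resulting prediction error filter applied across frequency (the order of 8 and the helper names are illustrative):

    import numpy as np
    from scipy.signal import lfilter

    def levinson(r, order):
        """Levinson-Durbin: prediction-error filter a (a[0] == 1) from the
        autocorrelation sequence r[0..order]."""
        a = np.zeros(order + 1)
        a[0] = 1.0
        err = r[0] + 1e-12
        for i in range(1, order + 1):
            acc = r[i]
            for j in range(1, i):
                acc += a[j] * r[i - j]
            k = -acc / err
            new_a = a.copy()
            for j in range(1, i):
                new_a[j] = a[j] + k * a[i - j]
            new_a[i] = k
            a = new_a
            err *= (1.0 - k * k)
        return a, err

    def tns_like_whiten(spectrum, order=8):
        """Predict MDCT/FFT coefficients across frequency; return the
        whitened residual plus the filter to be sent as side info."""
        x = np.asarray(spectrum, dtype=float)
        r = np.correlate(x, x, mode="full")[len(x) - 1 : len(x) + order]
        a, _ = levinson(r, order)
        residual = lfilter(a, [1.0], x)   # FIR prediction-error filtering
        return residual, a

    # The decoder would run lfilter([1.0], a, residual) to restore the
    # spectral envelope, which in the time domain reimposes the amplitude
    # envelope on the quantization noise.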

Improved prediction

Even after the actual spectral analysis, psychoacoustic modelling and reconstruction chains are complete, there is still the issue of how best to exploit any redundancy remaining in the intermediate representation. We already mentioned that lossless coding can be used, but it is not entirely clear that the reasons which made such coding inefficient for sound itself have been completely eliminated in this new representation. Indeed we see that a second round of predictive analysis can make a difference.

A polyphase filterbank outputs what are basically downsampled versions of loosely bandlimited signals. After we have updated our psychoacoustic model, we also have a good estimate of the amount of distortion which can be added to each band in the coding step. We might then try to use some predictive coder (like delta coding) on the subbands. On the other hand, when filterbanks with some band overlap are used on musical sounds, we often get smooth transitions between adjacent bands, so prediction over frequency instead of time might well produce desirable results. And naturally these methods could be crossed by using both the previous values of some analysis channel and the values of other bands already sent in the predictor. Putting these principles into use leaves us once again with a prediction residue. Typical ways of handling this would include direct entropy coding (for lossless coding) or alternatively configuring the prediction step as an ADPCM loop. But what should we then do with the model? The answer is, we use it to tune the precision of the quantizer in the ADPCM loop. In a setup like this, we could also continuously monitor the variance of the quantized signal and use the running estimate to vary the code range of the coded residue. Some of the problems with an approach like this are the sluggish response to changes in the SMRs caused by the structure of the ADPCM circuit and the processing delay in the filterbank, the difficulty of deducing the optimal bit depth to use (ADPCM, unlike direct quantization, inserts noise with a colored spectrum), and the inherent tension in trying to insert as much noise as possible onto the band when the noise now has an uneven spectrum and its pointwise maximum power must be bounded by the SMR for the band. The slew rate limitation imposed by ADPCM also makes time domain accuracy considerably more difficult to guarantee. In predictive architectures like this, the latter is a significant concern because shifts to alternative analysis lengths might not be possible and, even if used, will usually degrade the performance of an aggressive prediction scheme. Moreover, time‐frequency analyses (which are inherently immune to the time domain problems which prompt switchable block lengths) cannot easily accommodate prediction over anything but time. (There is, after all, no clearly defined frame structure to delineate the spectra for processing.)
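
As a toy illustration of such a loop, here is a bare first-order subband ADPCM coder in which the quantizer step would be set per block from the psychoacoustic model; the predictor and the way the step is chosen are deliberately simplistic and not taken from any particular codec:

    import numpy as np

    def adpcm_encode(band, step):
        """First-order predictive (delta) coder for one subband signal.
        `step` would come from the psychoacoustic model, per block."""
        pred = 0.0
        codes = np.empty(len(band), dtype=int)
        for n, x in enumerate(band):
            q = int(round((x - pred) / step))   # quantize the residue
            codes[n] = q
            pred = pred + q * step              # track the decoder's state
        return codes

    def adpcm_decode(codes, step):
        pred = 0.0
        out = np.empty(len(codes))
        for n, q in enumerate(codes):
            pred = pred + q * step
            out[n] = pred
        return out

    # Usage: the per-sample error stays within +-step/2 (barring overload),
    # so the model can set `step` from the allowed noise power of the band.
    band = np.cumsum(np.random.randn(256)) * 0.01   # a smooth test signal
    rec = adpcm_decode(adpcm_encode(band, step=0.02), step=0.02)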

Nonuniform/vector quantization

Above, when we discussed the lossy phase of the compression pipe, we referred to quantization simply as dropping bits from a given number. But the concept of quantization is much more general. First of all, dropping bits from a binary word amounts to what is called uniform quantization, in which the difference between successive quantized values is constant. The steps of the quantizer are equal, so to speak. This means that the error introduced in different amplitude ranges is similar. But when quantizing vanilla sound signals, for instance, it would be considerably nicer if the quantization error was proportional to the signal instead. This can be accomplished by nonuniform quantization and is used in the public telephone network in the form of A‐law and μ‐law encoding. These encoding schemes drive the signal through a piecewise linear approximation to a logarithm before quantizing and through the corresponding exponential after returning the signal to its original range. This results in quantization steps being bigger near fullscale output and very small close to zero. This way the quantization noise becomes roughly proportional to the signal and the amplitude range in which the signal is likely to mask the noise is greatly expanded. For instance, voice encoded in eight bit A‐law sounds almost as good as uniform quantization at twelve bits. The same principle is useful for storing filter coefficients (since amplitudes naturally live on a logarithmic scale) and coding residues. Actually we can view floating point numbers as a special case of just the kind of piecewise exponential quantization done in A‐law and μ‐law. This is one interesting way to explain why floats are so useful in audio DSP as well.
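
A minimal sketch of this kind of companding, assuming μ = 255 as in the telephone standard and an 8-bit uniform quantizer in the companded domain (the smooth formula is used here instead of the piecewise linear approximation actually deployed):

    import numpy as np

    MU = 255.0

    def mu_compress(x):
        """Map [-1, 1] through the mu-law 'logarithm'."""
        return np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)

    def mu_expand(y):
        """Inverse mapping back to linear amplitude."""
        return np.sign(y) * np.expm1(np.abs(y) * np.log1p(MU)) / MU

    def quantize(y, bits=8):
        levels = 2 ** (bits - 1) - 1            # symmetric signed grid
        return np.round(y * levels) / levels

    x = np.linspace(-1.0, 1.0, 9)
    x_hat = mu_expand(quantize(mu_compress(x)))   # companded 8-bit round trip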

After we go to nonuniform quantization, there is no reason not to think about adaptive quantization as well. In this case, fixed, nonuniform steps are supplanted by steps which vary according to the features of the signal or are driven by some appropriate, external model. A good example of the former approach was seen when we talked about ADPCM, where the residue is encoded in precisely such a fashion. The latter approach is seen in action when a psychoacoustic model is used to control the quantization step of subband ADPCM coders in a perceptual codec. Similarly, Rice coding, in which a parameter reflecting the expected magnitude of the values is chosen ahead of time and transmitted, is a proper forward adaptive variant.

The next step is to go from quantizing one number at a time to quantizing whole series of numbers. This can be formalized by viewing the whole series as a single unit (a vector) to be quantized instead of quantizing each of the numbers separately. We call this vector quantization (VQ). Understanding what this is all about requires the concepts of vector spaces, simply because of what is at the heart of quantization: finding the closest representation for a given numerical object in terms of a given set of other numerical objects necessitates a measure of the distance between the objects. After that, quantization becomes a minimization problem.
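
In its most naive form this minimization is just a brute force nearest neighbour search over the codebook under the Euclidean metric; the random codebook below is purely illustrative:

    import numpy as np

    def vq_encode(vectors, codebook):
        """Return, for each input vector, the index of the closest
        codevector under the Euclidean distance."""
        # Pairwise squared distances, shape (n_vectors, n_codevectors).
        d2 = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        return d2.argmin(axis=1)

    def vq_decode(indices, codebook):
        return codebook[indices]

    rng = np.random.default_rng(0)
    codebook = rng.normal(size=(64, 3))     # 64 codevectors of 3 components
    data = rng.normal(size=(1000, 3))
    rec = vq_decode(vq_encode(data, codebook), codebook)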

The most common way to measure distance between real vectors is the usual Euclidean distance metric. This is just the straight line distance between the points described by the two coordinate vectors. An easy example is a series of just three real numbers. (The general case is a straightforward jump to a dimension higher than three, but is not possible to visualize in a simple fashion.) These can be viewed as the coordinates of a point in space. To quantize each of the numbers (components) separately would mean to round each point in the space to the nearest point representable on a rectangular grid with equal spacing of the points in all three coordinate directions. When the points are distributed in an even manner over the whole space, such quantization is on average the best we can do. But if we have a whole series of such points (real triplets) and the three numbers are not independent, the points will concentrate in some part of the space.

It would seem sensible to concentrate more grid points in those denser regions, then, so that common vectors get smaller quantization error and the mean error goes down. Or, more importantly, we can get the same average error with a smaller number of grid points. This of course means fewer coded values on the whole and thus fewer bits per value sent. The problem is, such concentration breaks the beautiful grid structure. With a rectangular grid, we get all the grid points just by listing the combinations of coordinate values, and rounding to the nearest grid point consists of simply rounding the components separately. Instead, with an arbitrary quantization grid we must have explicit bookkeeping of the grid (where are the grid points?) and quantization cannot be done in component form—we must resort to somehow searching the grid for the closest point to round to. If there are lots of components in the vector (say, a couple of hundred), there will not be any way to explicitly list the grid coordinates (some quadrillions of points will not come anywhere close to a sufficiently accurate grid) and even if there is, simply going through the list of grid points and looking for a minimum distance will be far too slow. This is the big problem of vector quantization, since in other respects it strictly generalizes component per component uniform and nonuniform quantization, both of which are just special cases of it. There is also the issue of sending the codebook (the set of grid point/code mappings)—we’d rather leave it out since the size may well be huge. Sometimes we can build the codebook on the fly, really just one more variant of backward adaptation. A slight further drawback comes from the difficulty of processing data with no block structure. This is what makes vector quantization less than ideal for processing sound signals in their original form. On the other hand, after a filterbank or a block transformation, the data is quite suited for such quantization. The same goes for prediction residues and there is no reason why vector quantization could not be used in a configuration similar to ADPCM, though applications of this kind are limited by the difficulty of utilizing sidechain data from the perceptual modeller. In the context of image coding, a configuration in which the image is blocked and the block to block prediction residues (which are vectors) are vector quantized is called Differential Vector Quantization (DVQ). The lack of per band control in a direct application of VQ to subband coefficients makes us consider controlling quantization noise spectra through multiband companding of the signal—this approach works just like it did in the case of analog noise reduction and is readily applicable even if the quantization scheme does not permit band level control over the accuracy of the process. VQ does not, since it processes all the bands simultaneously as a vector.

The final benefit of vector quantization is that unlike componentwise processing, it allows a great deal of engineering on the structure of the codebook and the distance metric employed. When coding residues, for instance, we might structure the codebook to favor error signals with energy near the beginning of the block. The rationale is that because unmasked errors are generated by time smearing of transients, there will always be a signal likely to generate enough forward masking for the end of block part of the residue and so we might dispense with correcting it. Similarly we might want to use a distance metric which favors often used code vectors to speed up the convergence of a following codebook adaptation algorithm. The variations are endless, but it is easy to see that VQ offers a lot of flexibility compared to the componentwise version.

It was already mentioned that large block sizes produce problems with VQ. Usual ways to solve the problem include at least:

  • Subdividing the block and using multiple narrower quantizers in some suitable transformation domain so that the performance does not suffer too badly.
  • Choosing some special form for the codebook which can be expressed as a simple enough combination of just a few basis vectors. Of course, the scheme must admit calculations in terms of the basis vectors in order to be useful, and even then the special form severely limits the ways in which we can optimize the codebook.
  • Maintaining only a limited codebook but dynamically adapting it and/or sending a coding residue.
  • Using a limited codebook and making the quantization procedure hierarchical by further quantizing the residues.

Then there is the performance bottleneck caused by codebook search. Of course, the usual ways to speed up searches in multidimensional spaces apply here as well. Examples include search trees, which are used to divide the vector space into pieces and make it possible to search for the codebook record of a given vector by recursively choosing branches of the search tree which represent volumes close to the one containing the actual vector. We get a set of volumes whose codevectors can then be searched for the best match to the given vector. As a special case, we get a scheme in which we end up with a single volume and the desired codevector is the only one situated in that volume. Practical examples of space partitioning trees are BSP (Binary Space Partition) trees, in which each internal node represents a hyperplane through the current partition and its two children are the two sides of the plane, and the higher dimensional extensions of quadtrees and octrees. The latter have a far greater fanout than BSP trees—at each stage, the current partition is simultaneously divided in half along all the coordinate axes. In the plane this means that each internal node will have four children and in 3D the number doubles to eight; whence the names. In higher dimensions we might consider reducing this branching factor by dividing only in the direction of a subset of the coordinate axes. Whatever searching scheme we choose, any extra structures will make the search a bit more rigid and difficult to update. This means that adaptive VQ is quite complicated to implement efficiently.
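
For moderate dimensions an off-the-shelf space partitioning structure already helps; the sketch below uses SciPy's k-d tree as a stand-in for the search trees described above (codebook and data are random placeholders). In very high dimensions such trees degrade back toward brute force search, which is one more reason structured codebooks are attractive.

    import numpy as np
    from scipy.spatial import cKDTree

    rng = np.random.default_rng(1)
    codebook = rng.normal(size=(4096, 16))    # 4096 codevectors, 16 components
    vectors = rng.normal(size=(10000, 16))

    tree = cKDTree(codebook)                  # build once per codebook
    _, indices = tree.query(vectors)          # nearest codevector per input
    reconstructed = codebook[indices]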

Building the codebook generally entails a clustering procedure which takes as input a set of vectors representative of the class we wish to quantize later and then tries to choose a set of codebook entries which represents the reference set well. Two variants exist: we can either try to find the smallest codebook which fulfills some limit on the total error made when quantizing the whole set of reference vectors, or we can start with a codebook of fixed size and try to find a set of codevectors which minimizes a similar total error measure. Most often we want to minimize the mean square or maximum absolute error over the reference set. An optimal solution is quite tricky and cannot always be guaranteed without some extra assumptions. The median cut algorithm is one example of an approximate solution: we progressively divide the set of reference vectors by planes parallel to the coordinate axes so that near equal numbers of vectors stay on each side of the plane and the variance of the projection of the set of vectors on the plane is minimized. This tends to leave clusters of nearby vectors in the same partitions while separate clusters will, in each subdivision, tend to be placed on different sides of the dividing plane. Also, the exact position of the plane with respect to the chosen coordinate axis is determined so that a single cluster grouped into a given partition of the space will cause the next iteration to subdivide the cluster approximately in half, in the direction which most effectively reduces the spread of the set. Hence the procedure will tend to give symmetrical clusters of equal sizes (the set is split in half in the direction in which it is most spread out) and the iteration can be stopped once the diameter of the clusters generated is suitably small, at which point a codevector can be assigned to represent each cluster. An extra benefit is that the procedure generates a space partition tree as it goes. We might also stop after a given number of iterations to generate a codebook of fixed size. Another comparable procedure is the octree method: here we have a fixed subdivision of the space and of each generated subspace, and we simply expand the tree so that at each stage of the process the number of codevectors stays comparable across the already subdivided octants of the tree (remember that we are not in dimension three, so the fanout of the tree is substantially higher). Since the blocks generated are always hypercubes and the number of codevectors is kept similar in all the cubes, the final result will be a balanced set of clusters, each of which can easily be assigned a codevector. Again we generate the search tree on the fly. The minus side, compared to median cut, is that the tree is not even approximately balanced by the algorithm and that the fixed points of subdivision constrain, a little, how tightly the algorithm can circle clusters. Analytic approaches to the clustering problem often approximate the probability density of the underlying probability space with a continuously differentiable interpolation of the sample vectors and use numerical search methods like gradient descent to find local maxima in the generated scalar field. Codevectors are then assigned to these protoclusters. After this it may be possible to improve the accuracy of the codebook by iterative local optimisation of closely spaced codevectors.
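
A bare-bones codebook trainer in the k-means/Linde-Buzo-Gray mold is sketched below: fixed codebook size, mean squared error criterion, random initialization and a fixed iteration count, all of which a production design would refine:

    import numpy as np

    def train_codebook(training, size, iters=20, seed=0):
        """Pick `size` codevectors which locally minimize the mean squared
        quantization error over the training vectors."""
        rng = np.random.default_rng(seed)
        codebook = training[rng.choice(len(training), size, replace=False)]
        for _ in range(iters):
            # Assign each training vector to its nearest codevector.
            d2 = ((training[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
            owner = d2.argmin(axis=1)
            # Move each codevector to the centroid of its cluster.
            for k in range(size):
                members = training[owner == k]
                if len(members):
                    codebook[k] = members.mean(axis=0)
        return codebook

    training = np.random.default_rng(2).normal(size=(5000, 8))
    codebook = train_codebook(training, size=256)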

Some recent clustering methods use competitive learning in neural networks for codebook construction (for instance, Kohonen’s Self Organizing Maps (SOMs)). These methods are very interesting, but it is exceedingly difficult to prove error bounds in the context of competitive learning. It is also possible to refine codebook design in other directions. One is to aim at greater entropy (unpredictability) in the produced indices, another at index assignments in which vectors close to each other are quantized to indices which differ in as few bits as possible. When these methods are combined, we get vector quantization which does not need further compression at its output (high entropy means the output will not be compressed further by statistical methods) and has excellent perceptual error performance (isolated bit errors cause the decoder to choose decoded vectors which are close to the original, without any penalty in terms of compression).

Analysis variants

It has already become clear that in coding audio, there is always a tradeoff between time and frequency resolution to contend with. We can trace this complementarity to the analysis phase—it is a well known fact of mathematics that the definition of spectra through Fourier transformations makes all linear transformations to the frequency domain hinge on such a tradeoff. The key ingredient is the uncertainty principle which was already mentioned above. It is precisely what makes time and frequency resolution inversely dependent on each other. Through it, the question of how to achieve the best possible time and frequency accuracy in a codec is naturally transformed into one involving optimization of the joint time‐frequency characteristics with respect to how our ears work. It is easily shown that Fourier based transformations do not achieve optimality in the joint transform domain, as they are ultimately based on a completely time‐free mode of analysis. So we are naturally led to consider other modes of linear analysis. Owing to quantum mechanics, there is nowadays a venerable theory behind time‐frequency transformations, something we will delve into next.

As with Fourier based methods, the more general frameworks for linear spectral analysis can be viewed from two separate angles. The first concerns itself with the time (waveform) and frequency (Fourier transform) characteristics of the filters we use to implement the transform, the second with a functional analysis view of the transformation as a linear operator. The first approach leads us to all the different kinds of filterbanks, the second produces wavelets. It should be no surprise that these two approaches have a lot in common and have, since their inception, been all but unified.

At the heart of both approaches is the concept of hierarchical decomposition. Whereas traditional transform based methods of analysis rely on an invertible, one‐shot multiway decomposition (usually into a number of channels which is a power of two), filterbanks guarantee the existence of an inverse bank by subdividing the problem. Instead of inventing a huge, multichannel transform with the desired properties, we divide the bandwidth into parts using simpler filters and apply the procedure recursively to the remaining parts of the spectrum. The simplest approach is to use a pair of filters, called a quadrature mirror pair, to split the band in two—one of the filters is a highpass, the other a lowpass. What is so special about quadrature mirror filters (QMFs) is that the two filters strictly complement each other, both in frequency and in phase. They are structured so that finding an inverse operation to reconstruct the signal is easy. After this we can apply the subdivision to the lower part of the spectrum to obtain three bands, then to the lower part of this to obtain five, and so on. This makes for a logarithmic band division, as it is always the lower of the resulting bands that is subdivided in two. But if the cutoff point of the filters stays the same, won’t the procedure only work for the first division? Indeed, this would happen if we were to simply reapply the filters. So instead, we need an operation which spreads out the spectrum before the filters are reapplied. This operation is called decimation. Decimating a signal means dropping its sample rate by retaining only every nth sample (in connection with QMFs, usually every other one, which halves the rate). This operation can be viewed as a sample rate conversion which (depending on one’s viewpoint) either restricts the attention to the lower half of the band or stretches the lower half to fill the entire band. After such an operation the original QMF pair will work just fine. Except for one further complication: aliasing. When we implement such a crude resampling operation, alias will creep in (remember, the bandsplitting operation implemented by QMFs is of relatively low order, so there will be lots of out‐of‐band material present before decimation). This is where the deep magic comes in—by combining the QMF with the related decimation and applying the decimator to both the low and the high halves of the original band, we can construct a losslessly invertible operation. Both resulting bands will carry the same amount of information, can be recombined (but not by a straight sum) to get the original signal, and the sample rate of each of the channels will be halved. The intermediate signals we get from the analysis phase will have considerable alias, but this will be cancelled in the reconstruction step. Recursive application of this operation will yield a whole bank of filters (a filterbank) with logarithmic frequency division which splits the original signal into neat, separate bands. Many variations of this basic scheme (with different filters, cutoff points, depths of the eventual subdivision, and the aggression with which we decimate) exist. One property of any filterbank (not just one constructed from QMFs) to consider is the property of perfect reconstruction. A PR filterbank can, when all arithmetic is done with infinite precision, be exactly inverted.
Note that the magic mentioned above only works under a strict set of conditions on the filters and the decimator and that this places limits on the kind of frequency and time response we can expect from a given filterbank. Sometimes we may wish to use filterbanks which actually expand the data, meaning the decimation stages leave some headroom in the resulting signal. We may also wish to relax the restrictions on the filters to achieve better frequency or phase characteristics, even if this means some distortion in the reconstructed output. A second important property of a filterbank is closely tied to such concerns, namely, we might wish a filterbank to have linear phase. The definition of a linear phase filterbank is precisely the same as in the case of simple linear filters, i.e. that each of the outgoing bands retains the phase of the frequencies coded on the band. Linear phase perfect reconstruction filterbanks with no expansion of data are the Holy Grail of filter design. Alas, they are an exceptionally limited subclass of all filterbanks and they often have quite disappointing numerical performance (remember, we haven’t considered the effects of limited precision arithmetic at all, yet).

Needless to say, the reconstruction step employs one further operation, interpolation, which is the direct opposite of decimation. The magic works here as well: the interpolation combines gracefully with the QMFs so that the aliasing introduced in the analysis phase gets cancelled when the two bands are recombined.
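
As a concrete, if crude, example, the Haar pair below splits a signal into two decimated bands and then recombines them exactly; real codec filterbanks use much longer filters, but the analysis/decimation/interpolation/synthesis structure is the same:

    import numpy as np

    S = 1.0 / np.sqrt(2.0)

    def analyze(x):
        """Two-band Haar QMF split with decimation by two."""
        x = np.asarray(x, dtype=float)
        low = S * (x[0::2] + x[1::2])
        high = S * (x[0::2] - x[1::2])
        return low, high

    def synthesize(low, high):
        """Interpolate and recombine; exactly inverts analyze()."""
        x = np.empty(2 * len(low))
        x[0::2] = S * (low + high)
        x[1::2] = S * (low - high)
        return x

    x = np.random.randn(1024)
    low, high = analyze(x)
    assert np.allclose(synthesize(low, high), x)   # perfect reconstruction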

From the above it seems like QMFs are the only way to go. But this is not at all the case. Many filterbanks are known which do not employ such deep hierarchical subdivision as was described in the preceding. Indeed, MDCT is a fine representative of an architecture which does not. These architectures are also desirable from the numerical math point of view—deep recursive subdivision means a highly varying amount of numerical processing (from one filter in the upper bands to possibly tens in the lowest ones) on the signal’s way from filter input to filter output. This equates with error accumulation, which actually limits the usefulness of many wide perfect reconstruction filterbanks somewhat. Some techniques exist which can be used to compensate for the errors (we speak of twiddling the filter stages), but these are far from being suitable for generic use. But if processing is applied to the subband coefficients, we end up with an even more serious problem: alias. Remember that the subbands contain considerable alias which will only be cancelled in the reconstruction filterbank. Now, if we for instance quantize the coefficients, this cancellation will no longer work perfectly. We simply get an upper bound for the alias that will be produced. Sometimes this is tolerable, sometimes it isn’t. Audio codecs suffer gravely from this. Further complicating the situation, the less overlap there is between adjacent channels, the higher the order of the filter a given channel represents and the more there will be ringing. This becomes a serious problem with high coding gain filterbanks, i.e. the more frequency selective ones. A high level sine wave may very well overload a given channel because of resonance in such a high selectivity linear filter. This problem is accentuated in filterbanks because of the many successive stages of processing employed—guaranteeing that no overload will occur even inside the filter while relying on all kinds of cancellation phenomena at the same time can be a real pain in the ass. Twiddling a filterbank usually makes the problem even worse—after that, the filter topology is enormously more complex. That we want to apply a nonlinear process to the filter outputs and after that put the resulting signals through another filterbank (which may behave even worse: inverting a filterbank can impose some pretty funky requirements on the reconstruction stage) does not help either. All these reasons contribute to the popularity of not‐quite‐perfect‐reconstruction filtering in many audio codecs.

The preceding discussion displays how thinking of time‐frequency transforms from the point of view of filtering (with given time and frequency characteristics) leads naturally to filterbanks. But in the signal processing literature another way of viewing signals is nowadays almost as common. We already touched on this view when we spoke of vector quantization. It is based on seeing signals not as functions of time (attached to each instant of discrete time is a signal value, which may be a vector as in 2+ channel transmission) but instead viewing a whole, infinitely long sequence of sample values as a single vector, or a point in an infinite dimensional real vector space. The algebraic side of this viewpoint is easy: constructing an infinite dimensional vector space is almost as simple as constructing a finite dimensional one (like each point in space as a combination of three real numbers). What makes the view difficult to grasp is the topology and the geometry of the resulting space—infinite dimensional rotations (like the FFT, for instance) defy simple visualization. But after we define the space and construct some geometry in it (for instance, by specifying a distance between vectors), we suddenly have a whole slew of deep mathematical black art from basic linear algebra through the theory of general vector spaces to functional analysis, operator algebra and variational calculus to linear and general topology at our disposal. Suddenly many of the complex, serial interactions encountered in linear signal processing can be modelled as simple vector operations and e.g. filterbanks start to seem like nice decompositions of vectors into components belonging to separate subspaces of the vector space we came up with in the first phase.

Wavelets and related transforms (like wavelet packet transforms) are a product of thinking about filterbanks as linear decompositions, as functions from a given vector space of signal sequences to either the subspaces of the original space or to a combination of suitable auxiliary spaces. Usually we aim at having as much structure on the vector space as possible (for instance, instead of just a distance metric between vectors we might want a norm which tells the magnitude of a given vector and possibly even an inner product which lets us calculate with angles in addition to distances and magnitudes). We might also require some (kinda) topological properties, such as completeness, which enables the use of some truly magnificent convergence theorems (needed for infinite sums of vectors). With enough structure the decomposition of signal sequences can sometimes be expressed in terms of (possibly orthogonal or even orthonormal) bases of our vector space. In this setting, decomposition finds a somehow optimal linear combination of a given set of basis vectors which corresponds to the signal vector we are decomposing (reminded of VQ yet?), whereas reconstruction recombines the vectors into the original signal. Expansive processing (overlapping bands) becomes the problem of basis vectors which do not form a proper basis of the vector space; rather, some basis vectors can be expressed as linear combinations of the others. This corresponds to the fact that with overlapping bands the reconstruction phase produces the same results regardless of which band the information comes from, that many different decompositions work with the same reconstruction stage, and to the difficulty of somehow choosing the best (in some sense optimal) decomposition for a given reconstruction step.

Wavelets differ from Fourier based transformations in that they do not have such a strict blocking structure as the Fourier ones. Instead, as time‐frequency analyses they intrinsically cover the infinite dimensional case as well, without relying on periodicity of the signal to be analysed. What makes this possible is the fact that unlike the basis (not a real basis, since the basis vectors themselves do not satisfy the aforementioned summability constraint) offered by periodic, quadrature sinusoids (remember, both cosines and sines, or alternatively complex exponentials, are required), discrete wavelet transforms form true bases. Even more importantly, the basis vectors are time localized, meaning they obviously differ from zero but, unlike infinitely long periodic sinusoids, they rapidly fade to zero away from some given centerpoint. This is what makes them integrable/summable in their own right, leads to the time aspect of the resulting analysis (stuff in the signal which falls on the near zero portion of a given basis wavelet does not appreciably affect the coefficient assigned to that basis function in the decomposition) and also is what delayed the invention of wavelets in the first place. The latter happens because deriving families of functions which form a basis (quite a set of conditions to achieve this, plus the necessity of linear algebra to even consider such a thing—linear algebra has only existed for about a hundred years, now), are time localized (fade to zero/near zero rapidly enough around some point/time interval) and are frequency localized (their Fourier transform fades to zero rapidly enough around some chosen interval/frequency) at first sight seems almost impossible. Only the development of wave mechanics with its concept of frequency‐as‐energy made it obvious that localized functions with localized Fourier transforms must exist. The first wavelets and wavelet packets (a generalized form in which the basis is substituted with an overcomplete family of functions) were developed as aids to the analysis of particle waves, operated on top of continuous time (actually space as well) and were obtained by a recursive construction not unlike the subdivision encountered with filterbanks, giving them a fractal structure (meaning they were not differentiable). The discrete versions came much later, at a point where it was realized that QMF filterbanks, the pyramidal algorithms of optical pattern recognition research and traditional wavelets shared many of their recursive features and so were really just different manifestations of the same underlying mathematical idea. Some wavelet transforms can be implemented as filterbanks. Some others (notably the so called biorthogonal wavelets which are very much in fashion now) require a reconstruction which is very different from a straightforward inversion of the coding stage but can still be implemented through conventional filtering. Still others (these utilize much more irregular codebooks which have nothing to do with bases) require extensive search and optimization strategies in one or both ends of the codec chain. Understandably these are mainly of academic importance (although heavyweight encoding with cheap decoding is just fine if better compression ratios are achieved).
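
Continuing the Haar sketch from above, recursively splitting only the lowpass branch produces a discrete wavelet style octave decomposition; the depth and the trivial filters are again purely illustrative:

    import numpy as np

    S = 1.0 / np.sqrt(2.0)

    def haar_split(x):
        return S * (x[0::2] + x[1::2]), S * (x[0::2] - x[1::2])

    def dwt(x, levels):
        """Octave-band analysis: keep the high band at every level and
        keep recursing on the low band."""
        bands = []
        low = np.asarray(x, dtype=float)
        for _ in range(levels):
            low, high = haar_split(low)
            bands.append(high)          # finest octave first
        bands.append(low)               # the remaining lowpass residue
        return bands

    x = np.random.randn(1024)
    bands = dwt(x, levels=5)            # 5 octaves plus the lowpass residue
    print([len(b) for b in bands])      # 512, 256, 128, 64, 32, 32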

Now that’s a lotta math. How does this affect sound coding/compression? Quite a lot, actually. Non‐equal band divisions are essential to matching the analysis/synthesis phases of a coder to the way we hear sound, and taking time into consideration at such a fundamental level also leads to coders with fewer special cases to consider (like window switching, which disappears because time smearing of coding noise is now inversely related to frequency and the leakage gets masked automagically). A simpler architecture means higher performance and often also less sidechain data in the compressed bitstream (for instance, window switching data and coding residues), so aiding compression. A properly chosen wavelet basis will also have excellent entropy compaction properties, thus potentially boosting the effects of any subsequent entropy coding step. Power of two discrete wavelet transforms can be implemented with high efficiency due to optimizations similar to the ones used to derive the FFT. The fact that each band has its own time resolution can in some cases lead to coding/decoding latency that is lower on average than the fixed latency imposed by blocking Fourier transforms (where all frequencies suffer the same coding latency while waiting for a complete transform block to be gathered). Wavelet packet transforms and related generalized decompositions can more closely approximate the non‐perfect reconstruction properties of the human senses and so lead to better compression while at the same time guaranteeing invertible operation. On the minus side, the nice synchronous block structure of the MDCT vanishes, which makes VQ, frequency domain prediction and continuous maintenance of the psychoacoustic model somewhat more difficult or even impossible. Numerical problems (error accumulation, imperfect quadrature filtering caused by rounding of filter coefficients, imperfect alias cancellation in the decimation/interpolation rounds and so on) can be aggravated to the point of contributing audible noise. So wavelets are not a panacea.

A further, political problem comes from the many intellectual property issues surrounding wavelet based compression—unlike Fourier methods, which arose in an age when sharing of knowledge was still in vogue, compression built on wavelet decomposition is a child of the modern, greedy times. Since multimedia transmission is big business today, any successful technological solutions are likely to be closely guarded. As wavelet decompositions are very effective tools (I’ve heard claims of perceptually transparent full bandwidth audio coders operating at rates as low as 80kbps), the preceding applies in full force.

Audio coding in the studio. Codec cascading.

If we look at where audio compression and coding algorithms, other than plain linear PCM, are used, we quickly see that the area of application is mostly limited to consumer or distribution formats. Audio is still largely edited in an uncompressed form. This is not a coincidence—effective audio coding mandates lossy algorithms and this is not what most musicians and studio technicians feel comfortable with. The traditional idea has been that inside the studio the highest possible quality is retained and only before transmission to the consumer are the poorer quality distribution media accommodated. On the technical side, repeated compression and decompression progressively degrade the quality of a recording. Employing convoluted coding methods also makes direct editing of the underlying sound data impossible—in the hands of a skilled technician, any delay or inflexibility imposed by an audio coding layer quickly translates into irritation or even frustration. These are all very powerful reasons to employ lossy audio coding only on distribution media.

Recently, however, new challenges have been forming on the audio production side. Two of the more significant ones to prompt attention are the introduction of multichannel and film audio and the ever increasing quality requirements for digital audio signals. The quality concerns directly translate into higher bandwidths and storage requirements, as the only way to increase the quality of a sampled signal involves raising the bit depth and conversion rates employed. Multichannel audio places more diversified demands on the producer: similarly increased storage requirements, but also questions about the capability of existing audio hardware and software standards to carry and process the data streams, and the current difficulty of including other kinds of data (like descriptive metadata, processing instructions, copy protection flags and the like) alongside the actual digital sample streams. This means that being able to fit more channels and more perceptually relevant data overall into the existing digital studio infrastructure is of great importance. These requirements naturally translate into a requirement for compressed audio support.

The requirements placed on audio compression in the studio are very different from the criteria imposed by distribution use. First of all, any studio quality coding scheme must be perceptually transparent. Second, such a format needs to be as editable as possible, even without decoding the data to linear PCM. And even if recoding is needed to achieve some effect, at the very least the operation must be localized. That is, it should be possible to recode only the data which will change, not the whole soundtrack. This means that long term predictive coders go out of the window right away. Third, we need to be able to do extensive editing and transformation on the sound after recording. This means that a studio compatible coding scheme must tolerate multiple decode/code cycles without audible degradation of sound quality. This property should hold even if extensive frequency domain operations are performed, so masking should not be relied on much (a lowpass filter might remove most of the masker in a signal, and perceptually based processing usually leads to spectral modifications which a subsequent recoding step cannot cope with). The formats should be flexible and future proof. This way they enable the engineer to use his expertise in making any relevant tradeoffs between sound quality and quantity, and assure the buyer that the (largish) investment will not be in vain. Any existing digital infrastructure in a studio (like digital audio cabling and routing) should be used as much as possible to avoid unnecessary investments. The streams should then be transmittable over AES/EBU, MADI and/or S/PDIF buses and should use a framing architecture which permits manual patch bay routing without losing synchronization. Error tolerance is also a major factor—nobody wants to lose their precious masters to bit rot. These requirements should be contrasted with the consumer ones of high compression, acceptable sound quality, fair error tolerance (the higher the percentage of records lost after sale, the more money the producer will likely make), low cost of implementation and a wealth of media sexy features.

Currently there are two main contenders for serious multichannel studio coding, DTS and Dolby E. While DTS was originally developed as a distribution format, Dolby E was built from the ground up for exclusive production use. Both feature laudable tolerance towards multiple coding cycles (around 5 for DTS and up to 10 for Dolby E; DTS derives its strength from its benign compression ratio and its highly prediction based architecture) and permit an amount of frame based cut‐and‐paste editing of the data without a recoding step. This capability is especially important in joint movie/sound production because of the scene and take based, discontinuous nature of today’s film making, which requires incremental cut‐and‐paste editing over hundreds of cycles before a work is finished. Both formats are carried over standard AES/EBU and S/PDIF audio buses, although the buses are not actually meant to be used for coded audio. (The data is carried in normal digital audio channels, so there is the possibility of blowing a couple of fuses by introducing your stereo to the bitstream.) Dolby E is aggressively targeted at digital television and cinema production by virtue of its framing structure, which has been optimized to match the picture frame rate used in movies. It has considerably more flexibility in terms of channels, compression ratios and auxiliary data than DTS, which instead boasts readily available hardware and software, both for professional and end user needs.

 ‐lossless: DTS (hierarchical spectral+adpcm+residue)
 ‐MLP (prediction+residue)

Codecs to know

There are a whole bunch of notable sound codecs in existence. These are some of the ones I’d deem important to know. Some notable ones have also been left out, for instance Bell Laboratories’ PAC (Perceptual Audio Coder). But the general principles recur over and over again, so not much is likely to be lost if some codecs are left out. It is a question of economic relevance as well—most codecs have not found a successful niche in which to become useful to the consumer and so are never widely deployed.

Microsoft ADPCM

 ‐microsoft: part of RIFF WAVE
 ‐not psychoacoustical
 ‐lossy
 ‐computationally efficient
 ‐less efficient perceptually
 ‐quantized delta
 ‐adaptive nonuniform quantization (thru tables)
 ‐reacts badly to noise/high frequency content

MLP

 ‐time domain prediction+residue method
 ‐prediction through instructions ⇒ strain on the encoder
 ‐only likely to be lossless!
 ‐BTW, where’s the meat on the algo?

The MPEG series audio codecs

 ‐polyphase filterbank+optional further MDCT (32 band?)
 ‐quantization
 ‐temporal processing?

ATRAC

 ‐used in MiniDisc and SDDS
 ‐banding (0‐5.5, 5.5‐11, 11‐22 kHz)
 ‐length adaptive MDCT
 ‐quantization

AC‐3

 ‐also called Dolby Digital, and used in Liquid Audio
 ‐MDCT
 ‐binning
 ‐line segment convolution estimate
 ‐forward/backward hybrid
 ‐envelope delta modulation
 ‐adaptive block length
 ‐joint channel coding
 ‐floating point representation and exponent reuse

CELP

 ‐used in GSM voice coding
 ‐LPC
 ‐PARCOR filtering and coefficient interpolation
 ‐residue calculation
 ‐codebook vector quantization

DTS: Coherent Acoustics

 ‐PR/NPR banks
 ‐fourth order LPC/ADPCM
 ‐transient detection+block zoning+multiple scale factors for ADPCM
 ‐prediction+eradication of predictable content ⇒ only noise transmitted ⇒ tolerates cascading (why?)
 ‐little compression: psychoacoustic modelling difficult to exploit
 ‐ADPCM+filter bank⇒time trouble
 ‐transient handling not very good since filterbanks are equal subdivision and fixed rate
 ‐optional entropy coding
 ‐joint channel coding
 ‐frequency extension mechanism: higher noise!
 ‐the one used in theatres is very different‼

MPEG AAC

 ‐accurate filterbank
 ‐quantization
 ‐backward linear prediction
 ‐temporal noise shaping
 ‐largely backward adaptive
 ‐break‐in mechanism: bit slice arithmetic ⇒ real heavy

TwinVQ

 ‐developed by NTT (marketed by Yamaha as SoundVQ)
 ‐filterbank
 ‐prediction?
 ‐interleaved vector quantization
 ‐fixed codebook lookup ⇒ efficiency in the decoding stage

A technical comparison of perceptual coders

Bottomline: how do they sound?