Oversampling and bitstream methods in audio

Through the relatively short history of digital audio processing, the technology has improved by impressive steps. Nowadays most problems which plagued early digital applications have all but vanished. For most of this we have only two inventions to thank: oversampling and bit reduction. This article gives a short introduction to these important topics. It also presents some of my views on using the resulting bitstream methods in audio transport applications, of which Sony’s Super Audio CD (SACD) is the first and foremost example.

Sampling basics

Oversampling and bit reduction techniques are mostly a matter of implementation—in theory, neither of them is needed to build robust, theoretically sound audio processing applications. On the other hand, both technologies are based largely on the same principles as the more classical incarnations of digital audio. That is why one needs at least a cursory understanding of sampling, reconstruction and the management of noise in audio systems before delving into the specific technologies.

The sampling theorem

The sampling theorem, in its present form, was formulated between the 1920s and the 1940s by the same people who developed information theory—most notably Harry Nyquist and Claude Shannon, both employees of Bell Laboratories. The sampling theorem is the theoretical basis that allows us to process physical signals on discrete, digital computers.

What the sampling theorem says is that, under certain conditions, we can convert a continuous, infinitely accurate (analog) signal into a stream of time-equidistant samples and lose no information in the process. The condition is that the analog signal contain no content at or above half the frequency we are sampling at—the signal must be bandlimited. The conversion consists of taking the instantaneous value of the analog signal at regular intervals, determined by the sampling frequency. Note that nothing is said about the practical method used to achieve such infinitely narrow samples (only the value at a single instant of time affects the resulting number) or about the number of bits in a sample (in the theory, a sample is a real number, i.e. it is infinitely accurate). Something is said about how to reconstruct the analog version from the resulting samples—after all, losslessness means being able to return the signal to its original form exactly.

Perfect reconstruction, as it is logically enough called, is achieved by passing the point samples through a perfect lowpass filter. Such a filter cuts off everything above half the sampling frequency. Of course, this kind of idealized response is not physically achievable, just as on the sampling side we have difficulty obtaining very narrow samples. Seen simply as a lowpass filter, the reconstruction step is not very instructive, either. But seen in another way it makes perfect sense: in essence, we are interpolating between the sample points. The ideal lowpass filter responds to one input sample by emitting a sin(x)/x shaped signal—an oscillating function that dies out relatively slowly as we move away from the time of excitation. More specifically, the prototype sin(x)/x function is zero precisely at integral numbers of sample periods from the origin, unity at zero and symmetrical about it. Scaling this prototype by the sample value, shifting it to center it on the exciting sample and summing the responses we get from the individual samples, we obtain a signal that agrees with the original at the sample times and varies smoothly in between.
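
To make the interpolation view concrete, here is a minimal numerical sketch of it (in Python with numpy; the 8kHz rate, the 440Hz test tone and the 64-sample length are arbitrary choices for illustration, not anything the theory prescribes):

    import numpy as np

    fs = 8000.0                          # illustrative sampling rate (Hz)
    f0 = 440.0                           # test tone, well below fs/2
    n = np.arange(64)                    # sample indices
    x = np.sin(2 * np.pi * f0 * n / fs)  # the point samples of the bandlimited signal

    # Reconstruct on a dense time grid by summing shifted, scaled sin(x)/x responses,
    # one per sample; np.sinc(u) = sin(pi*u)/(pi*u) is zero at every nonzero integer.
    t = np.linspace(10 / fs, 53 / fs, 1000)   # stay away from the ends, where the truncated tails hurt
    y = sum(xk * np.sinc(fs * t - k) for k, xk in zip(n, x))

    # The reconstruction tracks the original continuous sine closely; the residual
    # shrinks as more samples (longer sinc tails) are included in the sum.
    print(np.max(np.abs(y - np.sin(2 * np.pi * f0 * t))))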

To process signals digitally, they need to be bandlimited. If this condition is not fulfilled, aliasing will occur. Since the reconstructed signal can only contain frequencies below half the sampling rate, any content in the input signal above that limit folds back into the admissible band. For instance, at a 40kHz sampling rate a 22kHz sine wave exceeds the 20kHz limit by 2kHz and so folds down to 18kHz. Aliasing does not sound nice and is to be avoided at all costs. This means that we need to guarantee that no inadmissible content is present in the sampled signals. This is achieved by passing the signal through a lowpass filter, the anti-aliasing filter, before sampling. Again the math assumes the filter to be perfect, and this is not physically achievable.
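
The folding in the example above is easy to verify numerically; the following sketch (cosines are used so that the alias matches in phase as well as frequency) shows that the two tones produce literally the same samples:

    import numpy as np

    fs = 40000.0                                   # sampling rate from the example above
    n = np.arange(32)                              # a handful of sample instants t = n/fs

    above = np.cos(2 * np.pi * 22000.0 * n / fs)   # 2 kHz over the 20 kHz limit
    alias = np.cos(2 * np.pi * 18000.0 * n / fs)   # the in-band image it folds onto

    print(np.allclose(above, alias))               # True: indistinguishable after sampling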

Point‐like sampling and S/H

Anti-aliasing and anti-imaging aren’t the only problems that we encounter. Even such a simple operation as taking point samples is surprisingly difficult in practice—the incoming electrical signal keeps changing throughout the sample period. The natural thing to do, then, is not to sample the signal directly but to put a kind of gate circuit in between. These circuits are called sample and hold, or S/H. Such a circuit usually operates by sampling the incoming voltage onto a capacitor and then switching (through a couple of MOSFET transistors) the capacitor over to a high input impedance amplifier (an operational amplifier with a FET input stage) for conversion. This is better, but still not optimal—any change in the capacitor’s voltage calls for a change in charge, and such a change consumes energy. This energy transfer has to take place within the brief sampling interval, so circuit resistance, the capacitor’s capacitance and the finite operating voltage of any physical circuit place a lower limit on the time accuracy of the sample and hold function. We also have to worry about charge leakage, circuit linearity and the considerable noise and heat introduced into the circuit by such rapid current flows.

A dose of S/H helped a lot in the implementation of older A/D converters. On the output side, however, problems remain: there, even correct instantaneous voltages are not enough. The ideal solution requires true impulses which, of course, are not even remotely achievable in physical reality. If we decide to make do with less, distortions creep in: an S/H circuit at the output of the converter will allow the conversion process to settle to the right voltage without rippling the output, but it will also produce a staircase waveform instead of a train of scaled impulses. This is, in effect, a time variant linear filtering operation and produces frequency anomalies (the output becomes the ideal one convolved with a sampling-period-wide pulse, which leads to high frequency attenuation—a pulse has a decaying, rippled spectrum instead of the flat unity of an impulse). This is one reason why the finite operating voltage of converters strictly limits naïve implementations like the one discussed above. We also get the same energy constraints that we had above, so the output will necessarily become a high impedance one—this is obviously bad from a thermal noise point of view.
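
The high frequency attenuation caused by the staircase (zero-order hold) output is easy to quantify under the usual rectangular-pulse model; the second figure below anticipates the oversampling discussion later on: at a 64 times higher rate the droop over the audio band becomes negligible. (A Python sketch; the rates are just the familiar CD-related numbers.)

    import numpy as np

    def zoh_droop_db(f, fs):
        """Zero-order hold (staircase) response at frequency f relative to DC, in dB:
        the magnitude of sin(pi*f/fs) / (pi*f/fs)."""
        return 20 * np.log10(np.abs(np.sinc(f / fs)))

    print(zoh_droop_db(20e3, 44.1e3))        # about -3.2 dB at the top of the audio band
    print(zoh_droop_db(20e3, 64 * 44.1e3))   # well under -0.001 dB at a 64x rate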

Practical anti‐aliasing and anti‐imaging

Given that the theory is based on perfect lowpass filtering, imperfect physical filters would seem to pose a significant problem. Indeed, since we are talking about conversions, all these filtering operations would seemingly have to be implemented in the analog domain. Next we take a look at some of the problems associated with analog filters.

The first challenge is the amplitude response of our filters. From the theory of linear filters we know that ideal brickwall filters can only be approximated by physical ones. This goes for both analog and digital implementations. To get a near approximation, the filters also need to be of high order—in older digital audio systems the order of the analog input and output filters could exceed ten. This automatically leads to noise and stability problems, especially since the best responses require elliptic filters, which are known to be quite sensitive to parameter fluctuations. The cost of implementation is quite high and the knee between passband and stopband will always be quite broad. This means that the upper part of the audio band needs to be sacrificed if correct behaviour in hostile conditions is desired.

Even if our filters now have a perfectly acceptable amplitude response, we are not done yet. When high order analog filters are used (and especially elliptic ones), the phase response becomes exceedingly bad near the cutoff frequency. This means that the filter will ring, i.e. go into damped oscillation on sudden, transient signals. Consequently the time structure of the sound will be somewhat blurred near the cutoff frequency. This is an unavoidable consequence of analog filtering and is usually the reason given for the early CD players’ poor performance. (The bad rep from this era may be why many audiophiles still shun CDs.) Since we still need to limit the incoming audio band, the only real solution would seem to be a higher sampling frequency, so that any phase distortion the filter might cause ends up outside the audible band. This is not very nice, though, because wider bands mean wasted space on storage media and more expensive electronics to implement the system.

Conversion linearity

We’ve already seen that point sampling and anti-imaging/anti-aliasing are easier in theory than in practice. But how about the actual conversion step, the one that takes in voltages and puts out numbers? It should come as no surprise that there are problems here, too.

There are at least three major ways to implement the conversion, none of which is perfect. The most straightforward is flash conversion: we generate a reference voltage for each possible conversion value and compare all of them with the input voltage in parallel. Then we pick the highest reference below the input voltage and output the corresponding number. For D/A, we just output the correct reference voltage. This approach doesn’t scale far beyond 12 bits. The second way utilizes the fact that D/A conversion is generally easier than A/D—working from the most significant bit down, we tentatively set each bit of the output number, compare a D/A’d version of the current number with the input and keep the bit only if the result stays at or below the input value. This is called successive approximation. The method scales well, but is not very fast. We also depend on the accuracy of the D/A step involved. The third way is dual slope conversion—instead of level comparisons with reference voltages, we let the input voltage charge a capacitor for a fixed time and then discharge the capacitor at a constant reference rate. The time the discharge slope takes to complete can be measured very accurately and the process is highly dependable. The problem is, it is also extremely slow and so isn’t suitable for audio sampling rates.
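
As a sketch of the successive approximation idea, here is an idealized, unipolar software model (Python; the 12-bit width, the 1V reference and the perfectly accurate trial D/A step are assumptions made purely for illustration):

    def sar_adc(vin, vref, bits):
        """Successive approximation: decide one bit per iteration, from the MSB down,
        keeping a bit only if the trial D/A output stays at or below the input."""
        code = 0
        for b in range(bits - 1, -1, -1):
            trial = code | (1 << b)
            if trial * vref / (1 << bits) <= vin:   # idealized D/A of the trial code, then compare
                code = trial
        return code

    print(sar_adc(0.637, 1.0, 12))   # the 12-bit code for 0.637 V against a 1 V reference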

Now, from the above we gather that older conversion methods rely on the ability to generate accurate references. The usual way to do this is to use resistor networks and constant current sources. This is also where we get into trouble. Current sources cannot be made infinitely accurate and they always suffer from, e.g., temperature drift. Resistor networks, on the other hand, rely on accurate resistor values (in the best ones, such as the R-2R ladder, on equal values of all the resistors involved) and these are quite difficult to achieve. This means that the reference voltages and D/A conversions obtained through resistor ladders show small variations between the sizes of adjacent conversion steps. Sometimes the steps are not even monotonic—a higher digital input value might produce a lower output voltage, for instance. This is very bad since it destroys the linearity of the converter. And when that happens, there will always be distortion. The step size variations, dubbed differential nonlinearity, are difficult to correct without expensive manufacturing techniques. They also lead to converters which perform worse than their width in bits would suggest: an 18-bit converter with some differential nonlinearity might have the S/N ratio of an ideal 15-bit converter. Not to mention that the errors generated can be strongly correlated with the signal and thus easily discernible in the output. All this gives a good reason to try to avoid multiple independent reference voltages and architectures with narrow manufacturing tolerances.

The predominant solution

The mix of problems described above is nowadays solved with a standard bag of tricks that we look into next. This bag includes oversampling, digital filtering, bitwidth/bandwidth tradeoffs, noise shaping and delta-sigma conversion.

Oversampling and digital filtering

All the problems with the anti‐aliasing and reconstruction filters described above are at some level linked to the fact that the filters are analog. In contrast with analog ones, digital filters can have perfectly linear phase response and arbitrarily high order without significant noise problems. Digital filters do not suffer from thermal drift, either. Given that numeric processing is quite cheap nowadays, we would ideally like to perform our filtering operations in the digital domain. But because we are talking about how to convert from digital to analog and vice versa, this would seem to be impossible.

The way to get around this little dilemma is actually very simple: we share the burden between the digital and analog domains. A low order analog filter can be used to guarantee that no significant content is present above some (rather high) frequency. If our sampling process works at a rate of at least twice this higher limit, we are left with a sampling chain with a relatively poor amplitude response in the higher part of the spectrum. The lower portion, however, can be quite usable. We can now use a digital filter to further limit the band to this usable portion without introducing any analog artifacts like phase distortion. We can actually use the digital filter to partially compensate for the imperfect amplitude response of the analog input filter.

Now, given that the original sampling rate was high enough (typically 64+ times the final sampling rate we wish to use) we are left with a high sample rate digital signal with most bandwidth unused. We can now resample this signal to achieve our (much lower) target sampling rate. We have used oversampling to enable digital processing. Since we often downsample by an integral amount (like 64 times), it is even possible to combine the downsampling process with the filtering step, creating a very efficient computational construct called a decimating filter. For digital to analog conversion, a symmetrical structure is used which interpolates to a higher sampling rate. It is called the interpolating filter.
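
The trick behind the decimating filter is simply that output samples which would be thrown away are never computed in the first place. A small sketch (Python with numpy/scipy; the 8x ratio and the 255-tap filter are arbitrary choices to keep the demonstration short, the idea being the same at 64x):

    import numpy as np
    from scipy import signal

    R = 8                                              # decimation ratio (64x in a real converter)
    fs_in = R * 44100.0                                # the oversampled input rate
    h = signal.firwin(255, cutoff=20000.0, fs=fs_in)   # linear phase lowpass for the target band
    x = np.random.randn(1 << 14)                       # stand-in for the oversampled input
    taps = len(h)

    # Naive version: filter at the high rate, then throw away R-1 out of every R samples.
    y_naive = np.convolve(x, h, mode='valid')[::R]

    # Decimating filter: compute only the output samples we intend to keep.
    y_dec = np.array([np.dot(h[::-1], x[i:i + taps])
                      for i in range(0, len(x) - taps + 1, R)])

    print(np.allclose(y_naive, y_dec))                 # True: same result, roughly 1/R of the work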

There are also further benefits to the oversampling process. First of all, the problems associated with sample and hold are diminished. This happens because we are using a much higher sample rate and thus much shorter S/H periods—essentially the filtering operation imposed by staircase formation will now have a response which is almost constant over the audio band. Furthermore, since the analog lowpass filter guarantees that the signal cannot wobble very much during a single sampling period, it may even be possible to dispense with the S/H step altogether. Secondly, the analog filters used can have a very low order. Not only does this mean that the response will be almost constant over the target band but also that the filter will have an excellent phase response. (The problems will be outside the audible band.)

The traditional way to explain the substantial benefits of oversampling is through the realization that regardless of the sampling rate, the bit depth of a converter determines the total energy of the quantization error we insert. Under the assumption that the error is uncorrelated with the signal, the magnitude of the error signal stays constant while the sample rate varies. If we raise the sample rate, the power of the error signal is spread out over a wider frequency range, and removing part of that bandwidth with a filter lowers the overall power of the error. I find this explanation difficult to understand. The twist is, it seems like this procedure gives us D/A conversion which can be more accurate than the incoming sample stream.
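
For what it is worth, the standard argument can at least be put into numbers. Under the usual model (quantization error white, uncorrelated with the signal, and with a total power fixed by the step size at roughly one twelfth of the step squared), plain oversampling buys about 3dB, or half a bit, of in-band noise reduction per doubling of the rate. A small sketch of the arithmetic:

    import numpy as np

    def inband_noise_gain_db(osr):
        """In-band quantization noise reduction from plain oversampling (no noise
        shaping), assuming white, signal-independent error: 10*log10(1/OSR)."""
        return -10 * np.log10(osr)

    for osr in (2, 4, 64):
        db = inband_noise_gain_db(osr)
        print(f"{osr:3d}x oversampling: {db:6.1f} dB, about {abs(db) / 6.02:.1f} extra bits")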

What really happens, though, is that the staircasing which results from S/H in traditional converters gets diminished—low amplitude signals are produced with more average accuracy over a sample period because of the higher sample rate and the stairs get rounded by the output filters. This means that even a single bit on‐off fluctuation in the PCM data stream gets decoded, correctly, into a sinusoid. In essence, the oversampling makes it possible to reconstruct the signal mainly in the digital domain and the oversampled output with its analog filter is really just a way to deal gracefully with the extra bits produced by the digital anti‐imaging filter. All in all, we get a very low noise floor for conversion error, but the inherent resolution of the sampled PCM stream is in no way surpassed.

With proper dithering in the A/D chain, however, the increased decoding accuracy makes it a lot easier to hear material which is actually below the noise floor of the quantization plus dithering operations. Oversampling combined with digital filtering opens a way to nearly perfect A/D/A conversion and so it is a very important tool to any builder of audio systems.

Bit depth reduction and noise shaping

After oversampling is employed, the filter troubles go away. However, the problems with the conversion step itself become a couple of orders of magnitude worse. This happens because very accurate conversion is even more difficult to achieve at high sampling rates. In fact, at the megahertz rates required for 64 times oversampling architectures, traditional converters of over 12 bits do not really exist. Nonlinearity isn’t going to get any easier to handle, either. This is why we might wish to make do with fewer bits. Digital filtering generates extra bits, too, so it would be very nice to somehow drop the extraneous ones. But this automatically means quantization noise will become intolerable, right?

In general, yes. But remember that we are now talking about an oversampling architecture. Here only the lowest 1/64 of the total bandwidth is of real importance. What happens above that is of little concern since we will not be able to hear it and, more importantly, the analog input/output filters will attenuate those frequencies progressively more in the higher bands. This means that as long as we keep the in-band noise in check, we can increase the out-of-band noise level considerably. This is achieved through what is, appropriately enough, called noise shaping.

The simplest digital noise shaper consists of a quantizer (in the digital domain this just takes a fixed number of the high order bits of a digital word) and a subtraction circuit which subtracts the quantization error introduced into the previous output sample from the current input of the quantizer (in effect subtracting the neglected low order bits of the previous input word from the whole current input word).
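
A software sketch of this error feedback structure may be clearer than the verbal description (Python; the step size, the 1kHz test tone and the 64x rate are arbitrary illustrative choices):

    import numpy as np

    def noise_shape(x, step):
        """First order error feedback: quantize coarsely, remember the error just made
        and subtract it from the next input sample."""
        y = np.empty_like(x)
        err = 0.0
        for i, xi in enumerate(x):
            v = xi - err                       # current input minus the previous quantization error
            y[i] = step * np.round(v / step)   # coarse quantizer: keep only the "high order bits"
            err = y[i] - v                     # error introduced now, fed back on the next sample
        return y

    fs = 64 * 44100
    t = np.arange(1 << 15) / fs
    x = 0.5 * np.sin(2 * np.pi * 1000.0 * t)   # 1 kHz test tone
    y = noise_shape(x, step=2.0 ** -6)         # requantize to a handful of levels

    # The error y - x now has a rising, highpass spectrum: at this 64x rate most of
    # its power sits far above the audio band.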

It is not very easy to see why this circuit does what we described in the previous paragraph. To get a picture of what happens, we must change the configuration described a bit. First of all, we can express the error (which we currently feed back) as a difference between the quantizer input and output values. Then we can separate these two into signals which are subtracted and added, respectively, in the quantizer input. After that it is easy to see that the original configuration corresponds to a very simple IIR filter followed by a quantizer whose output both serves as the final output of the circuit and also feeds back to be subtracted from the input signal.

Now, assuming the quantizer is just an additive source of uncorrelated noise (this is a fairly good approximation over a wide range of operating conditions and amounts to linearizing the circuit), it is quite easy to see why the loop behaves the way we expected: the circuit approximates a closed loop linear filter with an embedded source of white noise. The spectrum of the noise is determined by the closed loop response of this filter, and is easily evaluated. In the simple case, the inner IIR filter has a first order lowpass response, so the closed loop response seen by the noise is a highpass one—the noise is pushed primarily into the higher frequencies. Furthermore, the structure of the circuit as a whole guarantees that at low frequencies the spectral structure of the output will closely match that of the input, with very little quantization noise present.

In general, the inner filter described above can be substituted with a much more complex one—this gives us a way to control the spectrum of the quantization noise. This way we can make sure the resulting noise is adequately low over the audible band and, above that, sufficiently attenuated by the filters employed. Furthermore, we may want the noise remaining in the audible part of the spectrum to be well matched to the threshold of hearing so that it will not be heard as readily.

It is clear that the fewer bits there are to worry about, the easier it is to design a working converter for them. So far so good. But the most striking benefit comes when the process is carried to its logical conclusion to yield one bit processing. A one bit converter, in addition to being extremely simple to implement (a one bit D/A is a switch, a one bit A/D is a comparator), exhibits zero differential nonlinearity—there is only one step, so all the steps are trivially of identical magnitude. Some slew rate distortion and a constant offset (resulting from imperfect supply voltages) are practically the only problems of the single bit converter. The slew rate issue can be handled by careful design, and constant offsets rarely matter in a mostly capacitively coupled audio signal processing environment.

Finally, the above description of noise shaping lends itself to both digital and analog implementation and the method is applicable to A/D conversion as well. These two factlets are all we need to arrive at today’s prevailing audio conversion concept, delta‐sigma conversion, the topic of the next chapter.

Delta‐sigma (ΔΣ) conversion

A beloved child has many names. So does this conversion method: delta-sigma, ΔΣ, sigma-delta, MASH and charge balance conversion are but a few. The basis is the same—we employ a huge oversampling ratio (usually 64 times the target sampling rate) and aggressive noise shaping to bring the converter down to the one bit regime. On the A/D side we implement the noise shaping circuitry in analog form (the subtraction is an opamp based differential amplifier, the A/D converter is a comparator, the filter is a continuous time or switched capacitor analog one and the feedback loop holds a switch to convert back to the analog domain); on the D/A side we mostly employ digital processing (only the final bitstream is converted).
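
A behavioural sketch of the digital side of the idea: a first order, one-bit modulator followed by a lowpass filter standing in for the analog reconstruction filter. Real converters use higher order loops and the analog half looks quite different, so this is only meant to show the principle (Python with numpy/scipy; the filter order and cutoff are arbitrary choices):

    import numpy as np
    from scipy import signal

    def dsm1(x):
        """First order delta-sigma modulator: integrate the difference between the
        input and the fed-back one-bit output, then quantize with a comparator."""
        integ, prev = 0.0, 0.0
        bits = np.empty(len(x))
        for i, xi in enumerate(x):
            integ += xi - prev                    # difference, then (discrete) integration
            prev = 1.0 if integ >= 0 else -1.0    # the one-bit quantizer: a comparator
            bits[i] = prev
        return bits

    fs = 64 * 44100
    t = np.arange(1 << 16) / fs
    x = 0.5 * np.sin(2 * np.pi * 1000.0 * t)      # 1 kHz test tone, well inside the stable range

    bits = dsm1(x)                                # the one-bit stream: nothing but +1s and -1s
    b, a = signal.butter(4, 22050, fs=fs)         # crude stand-in for the analog output lowpass
    rec = signal.lfilter(b, a, bits)              # lowpass the bitstream to recover the audio

    # rec tracks the 1 kHz input; what remains of the quantization noise sits mostly
    # at high frequencies, where the shaping pushed it.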

In addition to the reasons outlined in the previous chapter, delta-sigma conversion has a very persuasive further benefit: it is very cost-effective to implement. This is because the technique does not rely on any precision components (unlike the other methods, which require resistor ladders and precision capacitors), is easy to embed into otherwise digital circuits (using CMOS logic and switched capacitor filters, the design nicely straddles the digital/analog boundary) and is repeatable like no other (the digital filters are always accurate and the few analog flaws can be ironed out through autocalibration). Further, delta-sigma methods are the only ones to reach reliable 20+ bit performance at audio sampling rates, a noteworthy fact in an age when everybody has already got 16 bits on CD.

And now on to the downsides. Delta‐sigma (bitstream) methods are nice, but they’re not without their problems. We will now delve into those.

Nonlinearity, idle tones and dead bands

Above, when we tried to figure out why the noise shaping circuit did what it was supposed to, we resorted to linearizing the circuit. A hint was given that this wasn’t perhaps the best way to go. And so it isn’t—the linearized model does behave nicely and approximates the actual quantizer performance quite well, but there are occasions when the true nonlinear nature of the circuit crops up. And these circumstances arise in practical converters as well. The three major problems are idle tones, dead bands and nonlinear instability, and they tend to plague delta-sigma modulators of higher order. This is unfortunate since the higher the order of the modulator, the higher the potential performance at a given oversampling ratio and converter bit depth.

Idle tones, in system theory dubbed limit cycles, are a mode of nonlinear oscillation. They exemplify the exact opposite of one of the basic properties of linear systems. In the absence of input (i.e. given an input of all zeroes from some point of time), the output of a stable linear system always approaches zero. Of course, the convergence can be slow but it nevertheless happens for all linear systems—it is easy to show that if it doesn’t, the system cannot be stable. But not so for nonlinear ones. Idle tones are one of the consequences. They are stable sequences output by a nonlinear network in the presence of prolonged null input. In delta‐sigma architectures they most often occur at very low amplitudes, just when the output of the modulator should go to zero. Needless to say, our ears can easily pick these sequences up, even if they are below the noise floor. Idle tones are heard as faint, whining noises in converter output. The exact time domain behavior of the modulator after entering a limit cycle depends on the structure of its state space, and most importantly the amount of modulator state. As the latter grows with modulator order, it is not a surprise that higher order modulators are the first to be affected.
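
The mechanism is easy to provoke even in the humble first order modulator sketched earlier: feed it a small, constant input and the output locks into an exactly repeating bit pattern whose spectrum is a comb of discrete tones rather than a smooth noise floor. Higher order modulators misbehave in more complicated ways, but the flavour is the same. (Python sketch; the 1/64 DC input is an arbitrary choice.)

    import numpy as np

    def dsm1_bits(x):
        """The same first order modulator as before, returning the raw bit pattern."""
        integ, prev, out = 0.0, 0.0, []
        for xi in x:
            integ += xi - prev
            prev = 1.0 if integ >= 0 else -1.0
            out.append(prev)
        return np.array(out)

    bits = dsm1_bits(np.full(4096, 1.0 / 64.0))   # a small, constant (DC) input

    # After the initial transient the pattern repeats exactly; the smallest period
    # that makes the whole tail periodic is the length of the limit cycle (128 here).
    tail = bits[2048:]
    period = next(p for p in range(1, 1024)
                  if np.array_equal(tail[:len(tail) - p], tail[p:]))
    print(period)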

Now, the state is usually rather limited and the nonlinearity in this case is (in some intuitive sense) rather regular. That is why the possible modes of oscillation tend to be at least quasi-periodic and mostly have a short period. When the input signal dominates the circuit, the linearized noise source analysis tends to hold, so the problems mainly appear at low amplitudes. Summing up, the resulting tones will have a low amplitude and a definite, usually high pitch. Idle tones are considerably more annoying than mere noise or some differential nonlinearity. This is why they must be avoided at all costs. The problem is, controlling this sort of nonlinearity analytically is exceedingly difficult. The most common way to deal with limit cycles, then, is to insert small amounts of noise into the modulator feedback loop to at least disperse the pitches generated and, in the best case, drive the modulator out of the limit cycle into a zero-convergent region of the state space. Notice that this is definitely different from dither, which is used at the input of the converter to decorrelate the total quantization noise from the signal.

Dead bands are a concept closely related to idle tones. They denote parts of a nonlinear system’s state space (other than all zeroes) which capture the system—if the system enters such a band, it will not leave it in the absence of input. The concept of a dead band is more comprehensive than that of idle tones since it can include nonoscillatory behavior (the output decays to within one unit of zero and stays there) and some gross nonlinear oscillatory modes (like the ones resulting from clipping and, especially, overflow). The concept is really applicable only to circuits which, over some fairly broad set of operating conditions, closely approximate a linear system, and in such settings it is very useful for explaining some of the typical strategies used to bring the system back on track. For instance, the breaking of idle tones by inserting noise can be thought of as an attempt to nudge the system out of a dead band.

That dead bands can arise at high amplitudes as well is rather troubling. Indeed, at least the analog variants of fourth and higher order delta-sigma modulators misbehave when the values output by the loop lowpass filter (integrator) grow large. This is why the whole theoretical operating range of these designs is rarely utilized and some circuitry is often embedded to detect high amplitude instability and consequently reset the modulator. Such an operation seems quite drastic but is rarely needed in a properly amplitude controlled system, and at high inputs the effects on the output usually cease within a single target sample period. Still, the necessity of such emergency measures does not exactly serve as a corroboration of the theory behind delta-sigma conversion.

Why go to higher bit depths?

In addition to concerns of economy, conversion accuracy and bit depth requirements constitute the major drive behind bitstream methods. It is understandable that people continuously strive for more accuracy. After all, not many people would mistake a recording for the original performance under any realistic conditions. But few people question whether going to ever wider converters and higher sampling rates is really the way. Contrary to common audiophile rhetoric, there is quite a bit of reason to believe we are already quite near the limit beyond which increasing bit rates make no difference to the human observer.

Based on what is known about hearing, people do not truly hear anything beyond 25kHz. Even this is quite a conservative estimate, since it primarily holds for isolated cases among young listeners. And even if some people do hear frequencies that high, the information extracted from the ultrasonics is very limited—there is some evidence that everything above some 16kHz is sensed purely on the basis of whether it is there, irrespective of the true spectral content. As for dynamic range, research suggests that 22 bit accuracy should cover the softest as well as the loudest of tones over the entire audio bandwidth.

But these limits are not the end of the story. If we are simply aiming for a good audio distribution format, some extra processing can yield significant benefit. This is because pure, linear PCM storage in no way exploits the peculiarities of human hearing. The dynamic range and lower amplitude limit of our hearing vary considerably over the audio bandwidth. Two known methods of exploiting this variance are noise shaping and pre/de-emphasis. The first uses the noise shaping principles described above to move the quantization noise generated at a given bit depth from sensitive frequency ranges to less sensitive ones, in effect giving more bits to the ranges which most need them.

Noise shaping has the benefit of only being needed on the production side of the signal chain. It shapes the noise floor of the recording and so alters the dynamic range in different parts of the spectrum. Pre/de-emphasis, on the other hand, relies on the fact that the spectrum of acoustical signals is in general far from flat while at the same time the threshold of hearing also varies over the audible spectrum. The first invariably rolls off at high frequencies and the second rises steeply towards the top of the band; the lowest thresholds are found in the midrange, roughly in the 1–4kHz region. This means that it is advantageous to shift the transmitted dynamic range of separate bands with respect to each other. High frequencies, for instance, can be boosted (since acoustical signals leave some headroom there) and then de-emphasized at playback so that any noise inserted by the signal chain is attenuated. In fact, in this range the perceptual noise floor can often be lowered below the threshold of hearing, in effect making the transmission chain perceptually flawless. The problem is, in the crucial mid and high mid ranges the dynamic ranges of both music and hearing are wide and the threshold of hearing is at its minimum. This is why pre/de-emphasis is not a panacea.
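
The classic example is the optional emphasis defined for CD, a first order shelf with 50µs and 15µs time constants giving roughly 10dB of treble boost before storage and the exact inverse cut on playback. Here is a sketch of how such a pair could be realized digitally (Python with scipy; the bilinear transform design below is my own illustration, not anything taken from a standard):

    import numpy as np
    from scipy import signal

    fs = 44100.0
    t1, t2 = 50e-6, 15e-6     # the 50/15 microsecond time constants of CD emphasis

    # Analog pre-emphasis prototype H(s) = (1 + s*t1) / (1 + s*t2): flat at low
    # frequencies, rising to about +10 dB at the top; de-emphasis is the inverse.
    b_pre, a_pre = signal.bilinear([t1, 1.0], [t2, 1.0], fs=fs)
    b_de,  a_de  = signal.bilinear([t2, 1.0], [t1, 1.0], fs=fs)

    w, h = signal.freqz(b_pre, a_pre, worN=2048, fs=fs)
    print(20 * np.log10(np.abs(h[-1])))   # close to +10 dB near the band edge

    # Boost the treble before quantization or storage, cut it back on playback:
    # any noise added in between is attenuated together with the cut.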

Some careful thought enables us to construct a system with both emphasis and noise shaping to yield superior perceptual quality from a given bit depth. At best, 4–6 bits worth of perceptual dynamic range can be added by the procedure. Given that the sample rate is high enough so that all frequencies of interest are transmitted (some suggest 26kHz audio band gives all the headroom we need), optimal use of emphasis and noise shaping can theoretically yield perceptually transparent transmission from a 14 bit system. Since we do not rely on masking, the design yields surprisingly good error resilience in the presence of subsequent signal processing. Proper dithering can then be used to make the signal chain linear for all practical purposes. These results are in stark contrast with the claims of audiophiles and the audio industry of the need for ever higher bit rates in audio transmission.

This section draws heavily on the material available at the Acoustic Renaissance for Audio (ARA) web site. The material presented there is also much more comprehensive and doesn’t resort to proof-by-assertion like this introduction. The ARA activity is one of the major influences that prompted me to write this text.

From the above it appears that with some minor adjustment, the 16 bits and 44.1kHz of CD are almost enough. In practice we do have to be slightly more cautious. It must be acknowledged that some anecdotal evidence exists in favor of greatly increased bit rates. The most important experiments involve the effect of ultrasonics in the presence of frequencies traditionally considered audio band and the effects of ultrasonics on sound localisation and the definition (whatever that means) of sounds. The argument goes, we might not consciously hear isolated ultrasonics but in the presence of other sound material (especially transients) they might serve as additional localisation cues. There is also the age old debate over ultrasonics permeating to the audible band through distortion products generated in the ear. The latter claim has gained some support from experiments involving timbre perception of periodic sounds with and without ultrasonic components. All in all, the effects seem to be minor and it is not entirely clear whether they really exist, at least to a degree which requires attention from the audio designer.

One further, rather persuasive reason to reconsider the need for higher bit rates is the one of sensible resource allocation. If stereo transmission was really the best that could be done, ever higher accuracy could be justified by arguments of the better‐be‐on‐the‐safe‐side type. But there are multitudes of unaddressed issues in digital audio transmission which have nothing to do with the numerical accuracy of the channels employed. The most important ones are the number of channels, the accurate definition of what exactly is encoded by the information (to date the Ambisonic framework is the only one to comprehensively address this concern) and application of signal processing to enhance the signal chain (e.g. room equalisation, speaker linearization and restoration of analog recordings). The rapid evolution of DSP has also brought out new possibilities, like simulation of acoustical environments, which seem far more interesting from the consumer standpoint than laboratory grade signal chains. We should consider whether the future development of and investments in digital audio systems should perhaps be along these (in my mind extremely interesting) lines instead of on making marginal improvements to channel accuracy.

All of the above holds primarily for audio distribution formats. But when subsequent processing is to be expected, wider samples are very useful in preventing error accumulation. It is well known that most DSP operations, including simple filtering, add lots of extra bits to existing signals. To guarantee that rounding and dithering products do not accumulate, even 32-bit formats are sometimes used. On the other hand, no corresponding accumulation appears in the frequency direction, even in considerably long signal chains, so the same argument does not apply to the sampling rate considerations.

Bitstream as a transmission format

Now we get to the original reason for this article. So far bitstream methods have only been described from the point of view of analog/digital/analog conversion. But a slight change in our point of view, and some study of the complete signal chain from analog to digital and back again, leads us to wonder why we are doing the bitstream to linear PCM conversion and its inverse at all. Couldn’t we leave those stages out and simply pass the bitstream resulting from delta-sigma modulation to the playback side? This is quite a natural thought and one that has recently found a concrete application in Sony’s SACD architecture. The subject of this final chapter is bitstream as a channel encoding and an architectural basis for digital audio transmission.

Rationale

The common wisdom in data transmission is that the shorter the signal chain, the more transparent it will be. This is also the most compelling reason for trying to lose the digital filtering and decimation steps from the PCM signal chain. Bitstream advocates feel that since we can do without these steps, they should go. Any processing being a matter of cost, implementations should become cheaper when the digital part is simplified. The idea of passing a simple bitstream is also seen as having a certain elegance. And certainly it has the delightful buzz of any new technology.

On a more serious note, the technique relies on oversampling and noise shaping at a very basic level. The oversampling ratios must be very large in order to get quality playback, so the underlying bandwidth will be huge compared to current PCM systems. The reasoning that led us to consider delta-sigma conversion and the attendant noise shaping techniques is largely based on putting the shaped quantization noise into the headroom provided by oversampling and then killing it with an analog filter. But after removing the digital filters, we can also view the system as a full band one with lots of quantization noise, an anti-alias filter with a ridiculously bad passband response and funky single-ended de-emphasis to reduce the terrible noise figures. Essentially we have an almost conventional digital transmission line in which only the lowest 1/64 or so of the total bandwidth shows hifi performance. Superficially naïve, this is a powerful observation and has some deep consequences.

We have already seen that a full bandwidth PCM format such as CD can be made a lot better by introducing in‐band noise shaping and pre/deemphasis. Now how about applying this reasoning to the above? We get a system in which the lowest 1/64 frequency range (the conventional audio band) is hifi and the rest (possibly up to 32 times the sample rate!) displays a progressively degraded noise figure and maximum output. Now, if we assume that ultrasonic frequencies indeed do contribute to localisation and what not, those frequencies can now, for the most part, be transmitted. The only problem is accuracy, but if we cannot consciously hear the stuff anyway, its presence is a lot more important than the accuracy of transmission. Plus, in the vicinity of the audio band S/N ratios actually stay quite respectable. In effect, the signal chain has a sort of fade‐to‐noise frequency response.

As the above reasoning suggests, a 64 times oversampling delta‐sigma architecture (which is pretty much a standard for 16‐18 bit delta‐sigma converters in PCM applications) already contains some slack compared to the PCM counterpart. This is to be expected since the data stream is a lot fatter (16 bits times the sample rate vs. 1 bit times 64 times the sample rate already shows a four to one expansion). This implies that there is a certain level of flexibility in the system: varying the roll‐off of the analog output filter balances the maximum in‐band decoding accuracy vs. the level of access to slightly off‐band material possibly encoded. At the same time possible future improvements to delta‐sigma modulators give the producer side some choice over greater in‐band accuracy (and possibly even enables the encoder to match the dynamic range to the threshold of hearing) vs. encoding off‐band material. In effect, the boundary between in‐band and off‐band material is diminished and the limit can be set in a relatively independent fashion in both ends of the signal chain. All this lends some credibility to bitstream methods as basis for a complete audio architecture.

Sony’s DSD and SACD

The format that has prompted the whole recent bitstream discussion is Sony’s DSD (Direct Stream Digital), which was originally intended for the stereo digital soundtrack of DVD Video. That effort failed, so Sony incorporated DSD into a new standalone audio format dubbed SACD (Super Audio CD), which has already hit the streets in Sony’s home market.

DSD is a straightforward application of multichannel 64 times oversampling delta-sigma conversion at some two and a half megahertz, followed by direct transmission/storage of the resulting bitstreams and low order analog lowpass filtering for reconstruction on the reproduction side. SACD places this bitstream on a DVD derived high capacity disc. The most important technological contribution of SACD is the introduction of optional double layered and hybrid discs. As with other members of the DVD family, a double layer simply means double the capacity. This could be used for extra playing time or, at a later date, to accommodate multichannel material (currently only stereo SACDs are defined). The hybrid disc is a more interesting concept.

Hybrid SACDs incorporate one high density layer which stores the DSD bitstream and, in addition to this, a Red Book compatible CD layer. The promise goes, hybrid discs will play as CDs in normal CD players in addition to containing the higher fidelity DSD stream. The SACD standard also requires that every SACD player support the CD format. This is easily achieved because the technology is a direct DVD derivative—DVD players commonly employ dual beam pickups and two layer discs and so have the necessary dual focus capability. The only real obstacle on the way to complete Red Book compatibility is ensuring that the resulting hybrid discs fulfill the Red Book requirements for disc refractive index, thickness, depth of the recording layer and the absorption coefficient of the disc material at the 780nm wavelength used to read a CD. Through careful design, Sony has accomplished this goal and created the migration path essential to any new audio format. From the user’s point of view, SACDs currently behave just like conventional CDs. In the future multichannel playback is envisioned, and the standard can accommodate up to 6 channels of DSD encoded audio data.

In addition to the above specification, Sony has tried to make the SACD platform more desirable to content providers by embedding both a visible and an invisible watermark into the disc without which the SACD player will refuse to play the disc. This is done to make piracy more difficult. As a further hindrance to copying, no digital outputs are provided in the first generation SACD players.

Sony has tried to position SACD as an audiophile format and holds that SACD is not a direct competitor to DVD‐A which it claims is more geared towards the ordinary consumer (read: is somehow in the low end). This is very much reflected in Sony marketing rhetoric surrounding SACD, which invokes the audiophile fondness for analog formats and capitalizes on the benefits of a simplified signal chain.

Foreseeable problems

And now on to the meat. I do not agree at all with Sony hype about SACD being what practically amounts to the Second Coming. I also believe I share this worry with the right people—I’m certainly not the first one to think SACD is not a healthy way to go.

Perhaps the most straightforward reason why SACD is a bad idea is that it may not be needed at all. Blind listening tests tell us that the average consumer has a fair bit of difficulty telling 24 bits at 96kHz from properly implemented 16 bits at 44.1kHz. Considering the numerical differences between these formats, the question of whether we really need accuracy beyond the level of CD becomes quite acute. Quite a few people with golden ears agree that the difference is subtle. Now, the effective bit depth of DSD is around 20, and 24/96 already has over an octave of ultrasonic bandwidth. Why is it, then, that by and large the same golden ears find a great difference between CDs and SACDs?

Since SACD is very clearly a distribution format, we can question how close CDs mastered for delivery only (i.e. utilizing aggressive noise shaping, perhaps even driven by a masking model) can come. And theory suggests they come really close. So it might be DSD isn’t the optimal approach to improving the signal chain after all—perhaps we should instead stretch CD a bit. As for ultrasonics (which are possibly the only thing CDs cannot address at the fixed 44.1kHz sample rate), the evidence is not conclusive. Perhaps some content above the CD limit of 22050Hz should be included, but the limits set by DSD seem excessive.

Some proponents of DSD also claim that DSD offers superior time accuracy because of the bitstream approach and the extremely high sampling rate. The argument goes, we get in between the samples of PCM because the bitstream changes more often. But it is well known that the reconstruction step in PCM achieves similar between-the-samples resolution, although a simple minded analysis doesn’t show that right off. Dither also makes the phase resolution of PCM essentially unlimited when we integrate over all of time.

In addition to purely acoustical arguments, there is a host of technology based reasons to reconsider utilizing SACD. The most serious are related to the fact that bitstreams diverge radically from the more traditional PCM representation. Essentially, a bitstream has no number‐like structure, no clearly delimited frames such as the ones defined by PCM samples and the information (contrary to some Sony claims) is not even concentrated in the width of clearly defined pulses (that is to say like in traditional, analog pulse width modulation) but is distributed in a very complicated manner over long bursts of successive bits by the nonlinear noise shaping procedure. Essentially, this puts all current audio processing algorithms in the trash can—these methods require discrete signals which approximate sequences of real numbers. DSD streams do nothing of the sort since every bit in the sequence is inherently bilevel. You cannot even sum DSD streams without running into serious trouble, not to mention the complications with multiplication. And when multiplication and summation go, so does all of today’s signal processing theory.

Now, should we want to compress, convert, mix or edit the stream, we have only two possibilities. The first one is, we convert to PCM. The second is, we build new DSP theory to do the operations in the bitstream domain. The first one immediately goes out the window since the first premise of DSD is that the signal chain should not include any of those harmful filters. We also run into complications with delta-sigma itself—it is difficult to guarantee such conversions will be linear. This is less of a problem when we only do the step once or twice and the conversions are viewed as approximative. But when DSD/PCM/DSD conversions need to be performed multiple times, we run into problems. The format isn’t even specified strictly enough to allow optimal converters to be built—after all, the best converters marry the digital filters to the ones used in the delta-sigma modulator. In DSD the room left for scalability means the specifications aren’t exact, and the architecture inherently separates the modulator and conversion filters from each other.

The second option (new theory) is not very attractive either, because it implies creating a theory of nonlinear audio processing from scratch. The complicated time structure of bitstreams—or rather the lack thereof—complicates any attempt at direct processing even further. Pulling all this together, DSD is not compatible with anything involving calculation. This means it is not suitable for editing and, consequently, mixing or post-production of any kind.

Like any distribution format, DSD faces the usual questions of error resilience, space efficiency and so on. DSD does not fare very well in this department. Error correction codes can be used as usual, but error concealment of the kind employed by CD players will, at least for the present, necessitate PCM processing and so DSD to PCM conversion. Of course, this does not affect DSD processing under normal conditions but requires the conversion to be implemented if error concealment is wished for. Space efficiency is significantly worse than for PCM since a lot of headroom is needed to cater for the distortion and for the relatively broad, mostly unused spectral slot between the audio band and the stop band of the (slow roll-off) analog output filter. Another way to see the genesis of the overhead is to consider how traditional delta modulation fails and then extrapolate to delta-sigma: we need very high sample rates before a single bit increment can successfully approximate a continuous signal. This holds for both delta and delta-sigma modulation methods, although the precise time behavior of the loop filter may give a certain edge to delta-sigma modulation.

A practical implementation of delta-sigma conversion adds one further ingredient to the overhead issue: limitations on the modulator’s integrator (sigma) values. It was mentioned above that to guarantee stability, we might not be able to drive the output of the loop filter near its maximum value. In fact, many high order modulators employed in PCM applications use only some 75% of the theoretical maximum range of the circuit. Of course, a well designed PCM application will be calibrated to produce full scale output at the new limit. But when the bitstream is passed through on an as-is basis, there is not much to be done. A part of the theoretical operating range will be left unused. We then either require more transmission bandwidth to achieve the theoretical maximum accuracy of the original system, or the precision of our implementation will be below what a straightforward noise shaping analysis would suggest.

The problems caused by high amounts of overhead are exacerbated by the difficulties in processing bitstream data. We cannot use compression (lossless or otherwise) to reduce the overhead, especially since the bit rates are very high, no clean symbol structure is present and the high frequency noise inserted by the modulation step renders any reasonable dictionary compression strategies moot. These concerns primarily affect multichannel and multimedia applications because in these, space is usually at a premium. Given that this is the direction most audio applications are seemingly going, the difficulty of compression can be a significant factor. If we employ compression (like DVD Audio does with its packing algorithm) and also are able to use the existing PCM processing techniques, much greater flexibility can be given to the content producer. Unlike SACD, DVD Audio lets the producer make the most suitable tradeoff between bit width, sampling rate and number of channels instead of mandating any fixed combination.

Going more to the theoretical side of things, while SACD is based on a framework of its own, it does rely on some unwritten rules of past audio systems. One of the more important ones is the rule that each stored channel is destined for a single speaker, and that this fact is all that is needed to define what the stored signal means. But this is an assumption whose validity has more than once been questioned. First of all, to get optimal playback, the stored material must be adapted to the particular configuration of playback equipment the listener happens to have. This may mean anything from putting one channel out louder than the others to head position based HRTF filtering for high tech headphone playback. Processing like this also raises the question of what the input data actually is. The traditional channel concept is certainly not enough if HRTF processing or some similarly delicate procedure is to be performed. To date the only framework to convincingly address these questions is Ambisonics, where the channels are defined in terms of a spherical harmonic decomposition of the soundfield. The great benefit of such an approach is that the storage format is abstract, neatly specified, easily processed and completely gear independent. But this doesn’t work when the format does not support extensive signal processing capability. DSD does not. Gear independence also future proofs a format, allowing new converter technology and processing methods to be employed. Again, DSD with its fixed specification of sampling rates, no compression and a very fixed pass-it-through-a-filter configuration does not allow anything of the sort.

In spite of the preceding, Sony claims DSD does have scalability. This scalability mostly comes in the form of varying output filters. But this has its downsides as well. When the production side delta-sigma modulator and the analog output filter are varied separately, there is a very definite danger of letting something inappropriate through. This is a very real concern since there are no steep digital lowpass filters in the signal chain as there are in the traditional PCM one. For instance, the principles of delta-sigma modulation dictate that to raise the effective bit width of the conversion, we need to raise the sampling rate or the order of the modulator. The first cannot be done once there is an installed base of SACD players, but the second can. There is also a very real reason to—DVD Audio scales to 24 bits, while the current effective width of DSD is estimated to be around 20 bits. If such a move is made, the noise generated by the modulation process will suddenly get a lot closer to the useful audio band and filters made today will not be prepared for it. This would suddenly mean a lot of ultrasonic energy in the analog output. In turn, this could lead to problems with interference and nonlinear distortion (i.e. ultrasonics dropping down to the audible range through such intermediate processes), requantization errors (e.g. for MD recording), breaches of electromagnetic interference guidelines, increased jitter sensitivity, unexpected electrical resonances and certain problems with duty cycle modulated amplifiers (the so called digital ones).

Any migration from CD to SACD (or DVD-A) will also be a question of resource allocation. Currently most people would happily admit that the lack of sound field immersion, three dimensional sound and localisation/directionality is a far worse problem for audio systems than their bit accuracy. Even audiophiles are likely to be more worried about other parts of the signal chain, like speakers and the room response, all problems which can at least partially be solved through DSP techniques. It is then highly questionable whether money should be spent on accuracy before the other concerns have been addressed. We might also ask whether PCM techniques would not perform better at DSD-like bit rates and so be a better candidate for any new audio architecture—after all, they cut the inherent overhead of bitstream modulation.

It is a truism of modern audio production that extensive processing is needed. Far from being about to die out, there is a great push to utilize signal processing techniques at both ends of the audio signal chain. Studios use them for synthesis, effects and post-production needs, radio stations need compression and equalisation, and the end user probably wants bass boost and, one day, correction of the room and speaker responses and possibly simulated acoustical environments. It is therefore very doubtful whether investments should be made in an audio system which at the outset makes any such operations difficult. Often SACD advocates suggest precision analog processing as an alternative. This is downright dubious since the prime reason digital systems are used in the first place is their generally higher performance, better error tolerance, cost-effectiveness and the ease of processing the information.

On the content producer side, SACD will not be very easy to master because the traditional studio audio processing model will have to be rethought. If processing in the bitstream domain is desired, new equipment will have to be purchased. Existing sound transports, like DAT and AES/EBU, are not compatible with the DSD signal chain. It is very likely that the equipment needed to work entirely in DSD will be affordable to only the largest of studios, perhaps not even outside Sony itself. Similarly the heavy intellectual property protection, the narrow circulation of the SACD specification and the probable incompatibility with writable DVD technology (especially the probable lack of DSD specific editing and mastering software with burn capability) will in all probability make the format quite hostile to smaller studios and home studio based musicians. This is in stark contrast with DVD Audio, as the latter format can reasonably be expected to be computer writable in the near future. The format may also end up being less than friendly to music of nonacoustic origin. After all, SACD is aimed at the audiophile market, and I do not think this particular audience constitutes the necessary force to drive the development of the software and hardware needed to author electronic music in DSD. The difficulty DSD imposes on digital processing also precludes authoring electronic material entirely in DSD, necessitating PCM processing somewhere along the way. It then becomes highly questionable why DSD should be used at all. In fact, using a PCM medium like DVD Audio will actually shorten the signal chain for electronic material.

Technical details aside, the marketing side of SACD isn't all that bright either. The first, and possibly most disturbing, facet is that SACD seems to be strictly a distribution format. Sony has gone to great lengths to ensure the format cannot be copied and is as incompatible with existing equipment as possible. The format isn't editable. It is rather hostile to smaller studios as well, so it is clear that it is meant to be confined to the traditional record company centric model of the music business. Indeed, SACD could really be seen as part of Sony's overall strategy to secure its position against the changing conditions in the record market. SACD is difficult to convert to MP3s, is well copy protected and is a direct competitor to less secure audio formats like CD and DVD.

On the motives side, Sony of course has a definite financial interest in SACD, even more so when we consider that unlike with DVD, Sony receives a substantial part of the licensing fees for SACD. So SACD is far more clearly proprietary than CD or DVD derivatives can ever be. Sony's licensing policy is quite worrisome, then. So far it seems to have worked (there is substantial industry backing for the SACD format), but there is never a guarantee with a proprietary format.

As for SACD and market acceptance, there is a definite problem. For the ordinary consumer, DVD Audio fulfills the promises of SACD and additionally delivers multimedia and multichannel capability. This means that the SACD installed base will not grow very fast and will probably be dwarfed by DVD Audio. This will mean expensive players and poor media availability (as of now, only a small number of Sony titles are available). SACD does not have the impetus of DVD-V and DVD-ROM/RAM behind it to drive prices down. This is a concern primarily with regard to the media—hybrid discs are important in the CD to SACD migration phase but are also an SACD specialty with little outside application to drive production costs down. (SACD and DVD will largely share drive components.) A further source of extra cost is the PCM incompatibility, which implies duplication of production and mastering resources, thus further limiting who will be able to supply music on SACD. Unlike other companies, Sony has to worry about MiniDisc, CD and DVD Audio in addition to SACD, so the cost of releasing all material in all formats may prove excessive. This means even Sony might have trouble supplying material in its new format.

In all probability, industry support for DVD Audio will be wider than for SACD, simply because DVD Audio capability will routinely accompany DVD Video players. Some notable manufacturers like Matsushita intend to release players which handle both DVD Audio and SACD, but it is not entirely clear whether that support will be comprehensive. It may well be that multichannel SACD capability does not appear in time, while for DVD Audio it will probably be supported from the start. To my mind this is a crucial question for SACD: if multichannel SACD support from multiple manufacturers does not appear soon, the format will suffer serious harm.

Hype control—counters to SACD marketing

In the spirit of fairness and SACD bashing, some of Sony's more ludicrous marketing speech deserves a counter. In the following, some of the claims made in SACD promotional material are put into proper perspective (and hopefully debunked altogether).

The favorite demonstration used to illustrate the extended frequency range of SACD is to display what happens to a 10 kHz square wave when it is recorded on CD and on SACD and then played back. The illustration consists of four oscilloscope shots and shows SACD producing a very close approximation to the original square wave, while the corresponding result for CD is a considerably rounded waveform closer to a sine than a square. The pictures are very convincing and will probably spook quite a number of CD owners. They are accompanied by a brief description which tells how CD loses the harmonics of the test wave from the third up and so is clearly inferior to SACD. What is forgotten is that a square wave contains only odd harmonics, so the first component CD actually drops is the third harmonic at 30 kHz. Even the second harmonic of a generic 10 kHz waveform, at 20 kHz, would already sit at the upper limit of hearing for adolescents, and there is little evidence that people can hear 30 kHz under any reasonable conditions—it's ultrasound. So is the demonstration meant for you or for your dog?
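
The point is easy to verify numerically. The sketch below (illustrative code of my own, with an arbitrary 192 kHz rendering rate chosen only for plotting) sums the Fourier series of a square wave while keeping only the harmonics below a given cutoff; limited to the CD band, a 10 kHz square retains nothing but its 10 kHz fundamental, i.e. a pure sine.

    import numpy as np

    def bandlimited_square(freq, fs, duration, cutoff):
        """Sum the Fourier series of a square wave, keeping only the odd
        harmonics that fall below `cutoff`."""
        t = np.arange(0.0, duration, 1.0 / fs)
        out = np.zeros_like(t)
        k = 1
        while k * freq < cutoff:
            out += (4.0 / np.pi) * np.sin(2.0 * np.pi * k * freq * t) / k
            k += 2                        # square waves contain odd harmonics only
        return t, out

    # Limited to the CD band, only the k = 1 term survives:
    t, cd_version = bandlimited_square(10_000, 192_000, 0.001, 22_050)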

A similarly deceptive illustration displays a scope shot of roughly one cycle of a sine wave with an approximate DSD bitstream below it. It is easy to see that the mean density of the bitstream closely corresponds to the value of the sound wave at each point in time. The text claims that since the stored bitstream is so close to the original wave, the resulting playback quality is superior to what PCM techniques offer. But what this really aims at is convincing people who have reservations about digital audio media and prefer good ol' analog. The fact is, the stored structure of the data does not matter one bit as long as the output voltages closely follow what went in. After all, what is physically stored on an SACD bears little resemblance to the pure DSD stream it carries. What matters is the subsequent processing and the soundness of the theory behind it, as always.
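
For what it is worth, the pulse density picture itself is easy to reproduce with the simplest possible modulator. The sketch below is a first order loop of my own devising, for illustration only (real DSD encoders use much higher order modulators), yet it already produces a +/-1 bitstream whose local average tracks a slow input.

    import numpy as np

    def first_order_dsm(x):
        """Minimal first order delta-sigma modulator: input samples in [-1, 1],
        output a stream of +1/-1 bits whose local density tracks the input."""
        acc = 0.0
        bits = np.empty(len(x))
        for i, sample in enumerate(x):
            bit = 1.0 if acc >= 0.0 else -1.0   # 1-bit quantizer
            acc += sample - bit                 # integrate the quantization error
            bits[i] = bit
        return bits

    # 10 ms of a 1 kHz tone at the DSD rate; averaging the bits over short
    # windows recovers something very close to the input sine.
    fs = 64 * 44_100
    t = np.arange(0, 0.01, 1.0 / fs)
    bits = first_order_dsm(0.5 * np.sin(2 * np.pi * 1000 * t))

That the picture is so easy to generate is precisely the point: the visual resemblance between bitstream and waveform says nothing about how well the whole conversion chain performs.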

Illustrations of typical PCM and DSD signal chains also serve a role in the campaign. Since DSD is obtained by dropping the digital filtering stages from a PCM signal chain and passing the raw bitstream directly to the receiver, it is easy to claim that the resulting signal chain now lacks some distortion. To some degree this claim might even be true. What is forgotten, though, are the practical consequences of the new architecture. In a well engineered PCM chain, the digital filters can work with far more flexibility and accuracy than the bitstream side allows, and they operate at a precision well beyond that imposed by the chosen PCM format. The PCM philosophy does not call for reconstructing in the D/A stage the exact bitstream produced by the A/D stage, only the signal that the PCM data represents. Any rounding operations can use precisely the same noise shaping logic that supposedly gives DSD its superior performance, and proper dithering further helps make the quantization steps (including whatever the filters do) perceptually transparent. It is also entirely forgotten that using DSD not only cuts stages from the signal chain but prevents anything from being inserted there when it is really needed. This includes studio apparatus, which for the time being will operate variously in DSD, analog and PCM domains—clearly a far worse thing for the quality of the resulting DSD stream than passing once through a high precision digital filter and thereafter staying entirely in the digital (PCM) domain.
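
As an illustration of the dithering and noise shaping point, here is a minimal sketch of requantizing high precision samples to 16 bits with TPDF dither and a first order error feedback shaper; production tools use more elaborate shaping filters, and all names and parameters here are mine.

    import numpy as np

    def requantize(x, bits=16, rng=None):
        """Requantize floating point samples in [-1, 1) to `bits` bits using
        TPDF dither and first order error feedback noise shaping."""
        rng = np.random.default_rng(0) if rng is None else rng
        q = 2.0 ** (1 - bits)                            # quantization step
        err = 0.0
        out = np.empty_like(x)
        for i, sample in enumerate(x):
            shaped = sample - err                        # feed previous error back in
            dither = (rng.random() - rng.random()) * q   # TPDF dither, +/- 1 LSB
            out[i] = np.round((shaped + dither) / q) * q
            err = out[i] - shaped                        # error to be shaped next time
        return out

Nothing about this requires abandoning PCM: the shaper above is the same trick a delta-sigma modulator plays, applied at the rounding step of an otherwise ordinary multibit chain.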

Some DSD groundwork material also suggests that the architecture is made cheaper by the elimination of some digital processing steps. This is downright funny considering the price of the first SACD implementations and the intended target audience of audiophiles, some of whom are ready to pay tens of dollars per meter for speaker cable. Similarly, market acceptance issues, the time it takes to update production and playback facilities and the money that goes into prolonged multiformat support will mean higher prices for the media as well as the players. Furthermore, fixed function digital signal processing is very cheap nowadays—most of the price of a CD player comes from sources other than the D/A converters, which are quite economical to manufacture in large quantities. Remember, cost effectiveness was one of the reasons for moving to delta-sigma converters in the first place.