Audio transport

Probably the most visible area of audio and recording technology is the transports, physical media as well as conceptual protocols, over which sound is carried. After all, this is the side the consumer sees most prominently when buying or using audio technology. Given that realism is often an important goal in sound reproduction, and that even the most high‐end frameworks for audio transmission and storage cannot truly represent even the simplest of acoustic spaces, it is no wonder that there is a terrific drive towards new and improved standards for audio transmission. This section deals with the most important existing audio formats used in broadcasting and sound storage, with a specific emphasis on digital media.

The signal chain

Before going into the details of the multitude of existing audio media, a couple of words must be said about the different areas of application of these technologies and their relative placement in the signal chain.

The areas of application of an audio storage/transmission format can be many. The most typical ones are the commercial, widescale distribution of prerecorded sound on tangible media (music available in stores, sound of film), the various broadcast environments in which music is distributed (radio, television and streaming Web content), individual use for timeshifting and archival purposes and professional use as editing intermediates. Each of these environments imposes its own specific requirements on an audio coding and presupposes some predefined signal chain into which the audio conduit must fit.

In physical distribution of records, price and market acceptance are the decisive factors for the design of an audio format. Large scale replication at a low price with little deterioration of sound quality must be possible. Formats need to have proper features to sell them and to justify the costs of building entire delivery chains from scratch. The media need to be rugged and it must be easy and cost efficient to build playback equipment. Neither recording facilities nor the quality of sound are as essential as one would think: consumer markets always work in relative terms instead of relying on absolute measures of quality. Intellectual property issues are a great concern, since music distribution is a highly industrialized business. The format is eventually placed at the interface of a (traditionally) very centralized, upscale and well‐off production chain and a relatively lowend and unpredictable consumer facility. Hence there are few limits to what the producer can be expected to do, whereas not much can be expected from the consumer. We can also expect a relatively unsophisticated and short signal chain on the consumer side, whereas the producer will likely have something very different. The market is such that we often need to make incremental improvements to existing standards instead of establishing completely new ones. Together the last two points often imply hierarchical, backwards compatible standards.

Broadcast use calls for a flexible, nearly instantaneous coding strategy and an error tolerant receiver. The junction between producer and consumer is similar to the one in the above distribution chain. In broadcast use, the costs of establishing a new distribution chain are much slower to recoup, so incremental evolution of standards is even more of an issue than in the above. Also, broadcast applications usually share a common channel (a cable, a common Internet connection, the limited radio spectrum etc.) with other data streams so that multiplexing, in one form or another, is a must. In a shared channel situation, we get capacity limitations as well—few transmission channels have enough bandwidth to spare. Indeed, the same considerations apply in tangible distribution of audio when other data (e.g. video or CD‐ROM supplements) need to be stored as well or when the amount of bandwidth used by the audio application soars high enough relative to the available technology (e.g. in multichannel and multirepresentation audio).

In private use, the possibility of easily and cheaply producing/recording in a given format is imperative. The requirements for sound quality vary greatly, from barely intelligible in dictation use to as good as at all possible in audiophile storage of music. Since any home recordable format can be used to copy existing copyrighted material, there is great pressure from the music industry to limit the attainable replication quality and/or to recover some of the (assumed) losses incurred. Sometimes this means blanket compensation for recordable media (such as the 1.5% of the price of empty C cassettes going to the music industry), technical protection measures (covered separately in a later section) or even purposeful degradation of the output quality attainable from compatible devices (e.g. DVD manufacturers must commit to not providing digital (more specifically FireWire) interfaces in players). Blank media should be competitively priced and compatible with playback equipment meant for prerecorded material. Sometimes this calls for tradeoffs, like nonrewritable media (as is the case with CD Recordables). For archival purposes, shelf life is often a concern, especially with analog magnetic media (which can, almost by definition, have little in the way of error correction). The degree of editing functionality needed varies greatly from one‐off recording of entire media (common with optical media) through linear overwrite editing (with analog tape storage) to complete nonlinear nondestructive editing (as with MiniDisc). The same goes for versatility and metadata capability, although the paradigm of user recordable media is nowadays pretty much settled. Consumer recording devices are usually interfaced through standard analog means to fit into the available home stereo setups, the signal sources (not counting playback of commercially distributed sound) are of relatively low quality and so little attention is given to absolute reproduction quality.

In professional applications, the needs are very different from the consumer counterpart. Long signal chains are to be expected and generational decay is therefore always an issue. For this reason, and because of the highly variable quality attainable from different delivery channels, absolute, repeatable accuracy is paramount. Editing facilities, versatility, storage of archival metadata, long shelf lives and standard low loss interfaces to other formats are needed as well. Price is mostly not an issue, be it with respect to media, creation or playback.

Nowadays, after the digital revolution, it is highly fashionable to classify audio formats into digital and analog ones and tout the first pretty much as the equivalent of the second coming. We follow this classification in the following, but I’d like to remind the reader that digital isn’t quite synonymous with better. In fact, the highest quality production facilities still rely primarily on analog technology. This is partly because of inertia, but also because the driving force behind the current trend of digitalization is not quality but price: at a given level of performance, analog’s reliance on quality components places it at a disadvantage compared to digital. On the other hand, precision analog equipment is remarkably good, especially with signals which still mostly originate in the acoustic/analog domain (e.g. sounds captured by microphone). In this rather different arena, the difficulty of doing accurate enough analog to digital conversion has long impeded the development of completely digital production facilities. Only now is this about to change. What digital excels in is error tolerance, controlled degradation, ease of transformation and editing flexibility.

Because of the different application areas, the signal chain from studio microphone to a user’s home or car stereo often consists of a veritable slew of audio formats. Since we aim at the greatest possible quality of reproduction, at no point in this chain should we go from a lower quality format to a higher one. Otherwise the latter will not be fully utilized. Second, conversions from format to format should be as close to lossless as possible. Mostly this is not the case. The reason is mostly in interfacing, though incompatible formats and politics get in the way as well. For instance, it is hideously difficult to duplicate perceptually encoded sound without a decode/re‐encode cycle. It is not a given that all signal chains aim at pure transmission, either. The most notorious example is the use of heavy compression in broadcast applications—recording from such a source certainly isn’t the optimum way to catch the original quality of, say, classical music.

Channel configurations and signal interpretation

The primary defining factor of an audio signal chain is the semantics it attaches to the information transmitted. However, it is also the one facet of audio transmission which is consistently forgotten, even by the most proficient practitioners in the field. Since what is transmitted is almost without exception a set of time coincident signals (analog or digital), and sensible people agree that each of these should be transmitted with as little modification (linear or otherwise) as possible, the principal misunderstandings have to do with what exactly the relationship between the transmitted channels is and what exactly the stored channels represent. It is obvious that what moves down the wires in an optimal scenario is a linear combination of linearly filtered signals proportional to some measurable aspect of a coherent soundfield, measured in some particular way. But in the case of popular stereo sound (as in FM radio and television), it is not untypical that precisely this vagueness is left to reign. Thus far this is the stance that we too have been taking, and hence the situation must be remedied.

Uniform channel configurations

The most common configuration used is one in which we transmit one or more channels, each destined for a single speaker with more or less no intermediate processing. The setup is symmetric in that the channels are completely interchangeable and it is purely a matter of convention which channel carries which speaker signal. This view of sound transmission has its roots in the early monophonic recordings from which phonography evolved—in essence, the multichannel versions are just straightforward extensions of the basic monophonic conduit. The reasoning is eventually based on the principle that more is better.

As for the definition of what is stored, there are rival views. In the case of mono transmission, the situation is simple: the usual definition is that a monophonic signal represents the recorded sound pressure at a point. This would be what is captured by an ideal omnidirectional microphone set at the point. The ideal playback equipment would be anything which reproduces this pressure variation at the sweet spot. Understandably this is quite impossible for an audience of more than one, which highlights the futility of defining a transmission format in terms of the playback equipment.

However, in the case of stereo and beyond, we have at least two equally widespread conventions. The first is that the different channels represent separate point sources placed around the listener. For stereo the normal placement is symmetrically to the front of the listener with a 60° angle between the speakers. For three channels, the third channel is placed between the main stereo pair, directly to the front of the listener. The speakers should be equidistant. With four channels and beyond, the placement typically consists of a regular polygon with the listener at the center, oriented so that if there is an even number of channels, one of the edges of the polygon is aligned with the front stage (where we placed the stereo pair earlier), or if there is an odd number of channels, so that one of the vertices of the array is to the front. This view of the stored signal reinforces the intensity panning, or pairwise mixing, methodology of multichannel audio production—the placement of sound sources is achieved by mixing in‐phase, amplitude adjusted versions of a mono signal into at most two adjacent channels. This is the current paradigm embraced by the American recording industry, and it also represents the industry’s old school views on miking technique, favoring production methods which aim at large listening areas and a relatively spacious stereo sound.
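The pairwise mixing idea can be sketched in a few lines. This is a hedged illustration rather than a mandated algorithm: the constant-power sine/cosine pan law used here is just one common choice, and the helper name and default speaker angles are assumptions for the example.

```python
import math

def pan_pairwise(sample, angle_deg, left_deg=30.0, right_deg=-30.0):
    # Constant-power intensity pan of a mono sample between the two
    # speakers of the standard 60-degree stereo pair.
    # (Hypothetical helper; the sine/cosine law is one common choice.)
    pos = (left_deg - angle_deg) / (left_deg - right_deg)  # 0 = hard left, 1 = hard right
    pos = min(max(pos, 0.0), 1.0)
    theta = pos * math.pi / 2.0
    return sample * math.cos(theta), sample * math.sin(theta)

# A centre source excites both channels equally and in phase, and the
# summed power stays constant across the pan.
l, r = pan_pairwise(1.0, 0.0)
```

In a multichannel polygon the same two-speaker law is simply applied to whichever adjacent pair brackets the source direction, which is exactly the "at most two adjacent channels" rule above.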

In the case of acoustical material, we often try to regain the lost spaciousness by using spaced omni recording, where instead of trying to accurately capture the sound at the ideal listener’s seat, we create artificial envelopment by recording the stereo channels with omnidirectional microphones spaced far apart. This will decrease the inter‐channel correlation in the critical low frequencies, making spacing a workable spatialization technique; however, acoustical theory predicts that the coincident pair technique, described next, ought to produce more dependable and realistic results. Hence many fans of coincident pair theory consider this practice dubious and indicative of the failure of speaker feed based recording technique.

In the 1930s, Blumlein’s extensive research in stereo sound production led to another paradigm which is in some ways incompatible with the American school of thought. Blumlein’s theory was based on the concept of coincident pairs of microphones. The idea is to capture the stereo sound at a single point in the form of two perpendicular components. Usually this form of capture is realized by what is called an M/S (mid‐side) microphone—a front pointing hypercardioid combined with a sideways dipole, with the signals matrixed to emulate a coincident pair of ideal directional mics aimed 45° off the center front stage. This provides better mono compatibility than is achievable by a straight realization via two hypercardioids. The playback would then be accomplished by an array of two speakers placed at 45° angles from the center front. Now, given the 60° setup described above, the stereo image produced is narrower than with spaced omni recording techniques, but Blumlein’s theory predicts that it should be possible to recover the effects caused by frontal sound sources perfectly at the sweet spot. This methodology is guided more by considerations pertaining to the transmitted representation while the one described in the preceding paragraph concentrates on utilizing a given output array. In effect, these formats describe the sink and the source, respectively. The incompatibility shows when we use the 60° setup, since this will flatten the resulting sound field considerably. Another common problem with coincident pair techniques is that they produce a very limited sweet spot, as the imaging relies somewhat more on phase coherence than methods structured around spaced omnis and speaker feeds.
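The M/S matrixing mentioned above is, at its core, a simple sum/difference operation. A minimal sketch (the 1/2 normalisation chosen so the round trip is an identity is a convention, not part of any standard):

```python
def ms_to_lr(mid, side):
    # Mid/side to left/right: sum and difference, halved so that a
    # round trip through both matrices returns the original pair.
    return (mid + side) / 2.0, (mid - side) / 2.0

def lr_to_ms(left, right):
    # Left/right back to mid/side. Note that the mid channel alone is
    # the mono mix, which is where the mono compatibility comes from.
    return left + right, left - right

left, right = ms_to_lr(1.0, 0.25)
mid, side = lr_to_ms(left, right)   # recovers (1.0, 0.25) exactly
```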

Beyond two channels, homogeneous channel setups are governed by a view of channels as destined for equidistant point sound sources at equal angles around the listener. However, the history of such transmission formats is not long and is filled with commercial failures. The seventies’ experiments in quadraphonic sound systems all ended up as utter disasters. This was mainly due to expensive equipment combined with a transmission infrastructure which only had two channels. This led to matrixing being used and, thus, inferior quality of sound and an unstable soundstage. These experiments left behind a considerable pile of knowhow on steering methods and matrixing which is nowadays brought to bear on surround sound and stereo enhancement. The premier example is Circle Surround (CS).


The nonadherence of prevailing audio transmission standards to any strict interpretation of the channel data becomes a problem when truly accurate spatial audio is needed. It is unrealistic to assume that every listener will have precisely the same, optimal playback rig. Not surprisingly, then, a given recording will never be optimal on every playback setup.

Ambisonic is a framework of scalable spatial audio production which aims to alleviate this problem. It is the brainchild of Michael Gerzon, the late audio guru mostly responsible for the development of today’s audio applications of dithering and noise shaping. The original ambisonic signal chain aims at capturing a complete description of a soundfield at a point. The theory is an extension of coincident pair techniques to three dimensions, with a careful eye towards the physics of sound. Blumlein on steroids, so to speak. This implies recording one pressure (W) and three velocity (X, Y and Z) components. Hence, the ambisonic B‐format is a four channel one with a strict interpretation attached to it. Playback equipment is required to decode this four channel signal into the actual speaker feeds. The attendant processing depends on the speaker array in use and makes ambisonic truly device independent.
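For a synthetic source, panning a mono signal into B-format follows directly from this interpretation: W carries pressure, while X, Y and Z pick up the direction cosines. A sketch under the usual free-field convention (the 1/√2 weighting on W is the traditional choice; normalisation conventions vary, and the function name is mine):

```python
import math

def encode_bformat(sample, azimuth, elevation):
    # First-order ambisonic (B-format) panning of a mono sample.
    # Angles in radians; azimuth 0 is straight ahead.
    w = sample / math.sqrt(2.0)                        # pressure, traditionally attenuated
    x = sample * math.cos(azimuth) * math.cos(elevation)
    y = sample * math.sin(azimuth) * math.cos(elevation)
    z = sample * math.sin(elevation)
    return w, x, y, z

# A frontal source puts everything into W and X:
w, x, y, z = encode_bformat(1.0, 0.0, 0.0)
```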

The original theory behind ambisonic deals with the representation of plane waves and shows that a B‐format signal can carry a complete representation of an arbitrary superposition of plane waves passing through the recording point. The theory treats the three velocity components as being equal to the pressure component multiplied by the direction cosines of a passing plane wave. For single plane waves in an open space this is true and lets us recover arbitrarily complex speaker feeds by a static matrix multiplication of the incoming 4‐vector. The accuracy for plane waves also gives the reason for ambisonic’s remarkably concise and workable encoding—sound sources which are located at a considerable conceptual distance from the playback array produce approximately planelike wavefronts which ambisonic then handles perfectly. But recent experiments in closed spaces and with nearby sources, and their theoretical interpretation in light of the new sound intensity theory of Stanzial et al. (carried out mainly by Angelo Farina), show that in the general situation there need not be any such relationship between the pressure and velocity fields of a complete soundfield. So instead, current developments in the ambisonic field have actually concentrated on the idea of spherical harmonic decompositions of the sound pressure field around a point.

Spherical harmonic functions are used in many branches of science, including physics and physical chemistry. Consequently, three dimensional polar plots of these functions can be found in any book on physical chemistry and are also available on the net. The basis functions are defined on a sphere and their magnitude represents the sensitivity of a given channel in a given direction. The spherical harmonic component of zeroth order is constant over the sphere, representing an omnidirectional directionality pattern—equal averaging over all directions. Hence, this channel (called the W channel) carries the sound pressure at the measurement point. The three components of the first order are precisely equivalent to the classical free field interpretation of the ambisonic velocity components as a cosine weighted local pressure: their directionality patterns display two coordinate axis aligned lobes of opposite polarity and a magnitude which varies as the cosine of the angle (the lobes show as spheres in a 3D polar plot).

Now, this new view on the channels has enabled even higher orders of spherical harmonics to be used. In the process we lose the ability to encode the velocity field. Instead, under the free field condition, we gain the possibility of approximating the directional derivatives of the sound pressure field around the point of measurement up to a desired angular accuracy. Considering the direction patterns represented by the magnitude of the spherical harmonic basis functions, and the fact that optimal decoding for a given channel layout amounts to little more than a weighted average of each speaker’s direction pattern at the sweet spot with the spherical harmonic functions as the weights, we easily see that the higher the order of the decomposition used, the more precisely we can localize a given sound source on the sphere. In a zeroth order setup, all sound will emanate equally from all directions. In an ideal first order system the sound will always come from the correct hemisphere, with a cosine directional weighting. With higher order systems, we can approximate point sources, which is what we need in order to present a stable spatial image to a larger audience. This implies a broadened sweet spot even in a second order ambisonic setup, although the level of realism even in a first order system is amazing. The number of channels stored naturally soars pretty high: nine full bandwidth channels are needed for second order alone, against the four used for conventional B‐format transmission.
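The channel counts quoted here fall straight out of counting spherical harmonics: a full-sphere decomposition of order n has (n+1)² components, a horizontal-only one 2n+1. A one-liner makes the scaling explicit (the function name is mine):

```python
def ambisonic_channels(order, horizontal_only=False):
    # (n+1)^2 spherical harmonics up to order n on the full sphere;
    # only 2n+1 of them survive when height is dropped.
    return 2 * order + 1 if horizontal_only else (order + 1) ** 2

assert ambisonic_channels(1) == 4   # classic B-format: W, X, Y, Z
assert ambisonic_channels(2) == 9   # the nine channels needed for second order
```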

Ambisonic is a remarkably concise and complete framework and has a firm following. With the emergence of high bandwidth audio media, such as DVD and the current digital cinematic audio systems, there have been high hopes among the ambisonic folk that one of the formats would allow ambisonic encoding as an option. Indeed, the lossless packing and framing format (Meridian Lossless Packing (MLP)) used in DVD Audio has such an option. Sadly enough, content is not yet available.

Since the beginning of the ambisonic movement in the seventies there has been a shortage of channels which can be used to reliably transmit the four discrete high quality channels needed for ambisonic B‐format. Hence a number of compatibility formats have been developed, going under the banner of UHJ. Basically, these are hierarchical matrix based encodings for the B‐format signal which admit varying levels of compatibility with the full 3D model of the ambisonic framework. Three channels suffice for 360° horizontal surround, two channels do it for a 180° frontal soundstage and 2.5 channels (one half bandwidth channel in addition to a genuine stereo pair) provide a degree of perceptual improvement over the bichannel setup. These compatibility encodings have all the advantages of a precisely defined transmission format and are quite compatible with conventional stereo transmission (stereo material decodes nicely on an ambisonic rig and UHJ played over a conventional stereo setup constitutes just a wider than usual stereo sound). The newest addition to the family of compatibility encodings is the G and G+2 family of encodings intended for transmission over the prevailing cinema sound distribution formats. These utilize 5.1 channels, a number common in digital surround setups, to encode a three channel UHJ signal. The coding makes the signal enjoyable over a standard 5.1 setup but still enables recovery of the three UHJ signals for an orthodox horizontal ambisonic decoding. Adding a further discrete channel (dropping, for instance, the LFE channel and using a full bandwidth extension instead) we directly arrive at a signal playable over both 5.1 channel surround and, with proper decoding, arbitrary B‐format capable equipment. The +2 part substitutes a two channel UHJ encoding for the stereo mix often transmitted alongside a 5.1 channel surround signal (e.g. on DVD and LaserDisc).

The downsides of ambisonic include trademark hassles (Ambisonics is a registered trademark which cannot be used without permission), the wealth of channels needed for transmission, the cost of decoding equipment and speaker arrays (typically upwards of eight speakers) and the cost of production. Currently ambisonic is a true niche market, even though it has shown great promise since its birth. As probably the only current audio coding framework with both a theoretical and a practical foundation, I truly hope it gains some impetus—building heavyweight processing on top of a well defined, equipment independent encoding is far more productive than guessing what current 5.1 signals are supposed to mean.
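To make the device independence concrete, here is a deliberately naïve projection ("sampling") decode of horizontal B-format onto a regular speaker ring. This is a sketch only: real decoders add frequency-dependent psychoacoustic weighting (Gerzon's shelf filters), which is omitted here, and the function name and 1/n scaling are assumptions for illustration.

```python
import math

def decode_ring(w, x, y, n_speakers):
    # Project the horizontal B-format components onto each speaker
    # direction in a regular ring; the 1/n normalisation keeps the
    # overall level roughly independent of array size.
    feeds = []
    for i in range(n_speakers):
        az = 2.0 * math.pi * i / n_speakers
        feeds.append((w * math.sqrt(2.0) + x * math.cos(az) + y * math.sin(az)) / n_speakers)
    return feeds

# A frontal source (W = 1/sqrt(2), X = 1, Y = 0) on an eight-speaker
# ring: the frontal speaker gets the strongest feed, the rear one none.
feeds = decode_ring(1.0 / math.sqrt(2.0), 1.0, 0.0, 8)
```

Swapping the ring for a different array only changes the decode stage, never the transmitted four channels, which is precisely the point of the format.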

Wavefield Synthesis

In contrast with the extreme space efficiency and approximative nature of ambisonic, wavefield synthesis (WFS) tackles the problem of spatial sound by force. Designed for arbitrarily large listening areas and no compromise horizontal surround, WFS attempts a complete restoration of the original soundfield, at least in two dimensions. The goal is achieved by a method similar to the old idea of a curtain of microphones. This is based on the observation that substituting the curtain of a concert hall with an array of thousands of microphones and using the signals from these to drive an equivalent array of loudspeakers in the listening space should actually recreate the sonic experience of sitting in the concert hall perfectly (absent room acoustics). This is a basic consequence of Huygens’ principle for the construction of wavefronts from point sources.

Thankfully, less suffices in WFS. First of all, WFS setups are usually designed either for two dimensional playback (no illusion of height) or, alternatively, a narrow sweet spot. We note that in these conditions (a single horizontal plane and possibly an intersecting vertical one in which the sound field is to be recreated; a bandlimited signal travelling in a homogeneous medium) we get a strict lower bound on the wavelength of the signals which we may wish to reconstruct. Now, the sampling theorem tells us that by spatial sampling of the pressure field (taking space‐equidistant samples from it) we can build a perfect, invertible representation of the field. Spatial sampling is usually implemented with regular arrays of wide range microphones with an approximately cosine weighted response over one hemisphere and a null response over the other. The construction of such arrays is quite nontrivial. When we apply it in a plane or on two intersecting planes and remember that the maximum distance of two adjacent microphones is precisely half the wavelength of the highest frequency we will be capturing, we arrive at the surprising conclusion that not that many channels are needed, after all. For full range elements we arrive at some 120 elements per metre to achieve full spatial resolution up to 20kHz. In fact, we can subdivide the bandwidth we are interested in, and as we do so, we note that the number of spatial samples needed (channels transmitted) is proportional to the highest frequency present in a given band. This way we can use small elements for the higher frequencies, and we see that it is actually almost possible to achieve the full resolution we might wish.
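The 120-elements-per-metre figure falls straight out of the spatial sampling theorem: adjacent elements may be at most half a wavelength of the highest captured frequency apart. A quick check, assuming a speed of sound of about 343 m/s (the function name is mine):

```python
def wfs_element_spacing(f_max, c=343.0):
    # Maximum alias-free spacing between adjacent array elements:
    # half the shortest wavelength of interest.
    return c / (2.0 * f_max)

spacing = wfs_element_spacing(20000.0)   # about 8.6 mm
per_metre = 1.0 / spacing                # about 117, i.e. the ~120 quoted above
```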

WFS reconstructs the soundfield by playing the stored channels (spatial samples) through an equidistant array of well matched speakers, much akin to a curtain of microphones. In this case, the array needs to be a full bandwidth one with almost perfectly equal phase and frequency responses over the individual elements. This is not easily achieved. Usually we start with multiple rows of elements, using piezoelectric patches or magnetic strips for the highest frequencies and the usual cone electrodynamic drivers for the midrange. Low frequencies are commonly left out and instead reproduced mainly by a couple of conventional speakers because of the poor behavior of long wavelengths in closed spaces. Then heavy duty structural acoustics modelling is used to arrive at an array with a flat, equal directional response from each of the elements and balanced properties otherwise. Finally, digital signal processing is used to correct the remaining anomalies in the transfer functions of the individual elements and equalize delays over the whole array. This optimization is done in the intended plane of reproduction since stable 3D performance is a bit much to ask.

Of course storing, say, a couple of thousand full bandwidth channels is no mean feat. This is why WFS is currently an experimental method, and it also explains why no WFS setup to date actually implements the 120 elements per metre requirement for perfect spatial resolution. Most experimentation is also confined to creating synthetic soundscapes instead of reproduction—precisely because nobody has the gigabytes per second of storage/conversion bandwidth needed for full WFS just now. Let’s explore what this means for actual delivery. First, lowering the spatial resolution of the system risks spatial aliasing. The phenomenon is best explained through the sampling theorem, which guarantees perfect reconstruction only for wavelengths longer than twice the sample spacing. Just as with temporal aliasing, wavelengths shorter than this fold back into the representable band.

If we look at the kinds of sound sources which could theoretically produce such high spatial frequencies, we see only two obvious candidates: high frequency sources very near the recording/playback array, and high frequency sources at high angles relative to the normal of the playback surface. (The array constitutes a kind of window between the recording and playback universes.) This is because any free field source far away from the array whose projection on the playback surface falls within the lateral confines of the array will mimic a plane source which will actually excite all the channels equally at each moment in time. High frequencies emanating from close by sources and sources outside the window will, however, cause progressively sharper differences between adjacent channels as the angle between the normal of the array surface and the direction of the sound source gets larger. If and when spatial folding starts to happen, the perceptual effect is difficult to predict. The sonic hologram becomes blurred and destabilized—moving in the field no longer evokes the sense of envelopment which is the primary objective of WFS.

A continuous sinusoid source gives us an easy reference. Close to the source, the array sees it as a plane source with a zero spatial frequency (locally, the array receives approximately equal excitation on each channel). Far away from the source along the line set by the array, on the other hand, the source seems more and more as if it were actually on the line, and the spatial frequency agrees with the inverse of the wavelength (wavefronts from the source propagate along the playback array instead of colliding with it at a right angle as above). In fact, the local spatial frequency of a free running sinusoid source will be proportional to the sine of the angle between the direction of the source and the normal of the array. From this we see that the situation is helped as much by a bigger array as it is by higher resolution.
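The sine relationship gives a simple aliasing test for a far-field sinusoid: its spatial frequency along the array is (f/c)·sin θ, to be compared against the array's spatial Nyquist limit of 1/(2·spacing). A sketch (function names are mine; c ≈ 343 m/s assumed):

```python
import math

def local_spatial_frequency(f, incidence_deg, c=343.0):
    # Spatial frequency (cycles per metre) seen along the array for a
    # far-field sinusoid of frequency f arriving at the given angle
    # from the array normal.
    return f / c * math.sin(math.radians(incidence_deg))

def folds(f, incidence_deg, spacing, c=343.0):
    # True if this component exceeds the array's spatial Nyquist
    # limit of one cycle per two element spacings.
    return local_spatial_frequency(f, incidence_deg, c) > 1.0 / (2.0 * spacing)

# Normal incidence never folds; grazing incidence at 20 kHz folds on a
# 2 cm grid but not on an 8.5 mm one.
```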

Used as a storage format, WFS is notoriously wasteful of space. Hundreds of channels are needed for quality playback and the feed is, in contrast with ambisonic, destined for a given speaker configuration. On the other hand, compression based on spatial prediction might be quite successful at least with synthetic WFS material. As the concept of sampling (spatial or not) is well established and unambiguous, the format is quite strictly defined. Reproduction on a different playback array is a matter of spatial sample rate conversion, a conceptually simple though gigaflop hungry operation.

Nonuniform channel configurations

Owing mainly to the special needs of theatrical sound, there are now multiple inhomogeneous channel configurations in use. The most well known are the so called surround ones. The premise in these is that in addition to accurate, directional reproduction of frontal sound sources (such as sounds emanating from the stage or the big screen), we need a separate facility for ambience and other nondirectional sound material. The general idea may take many forms but the most common variant allocates one to two surround channels with a number of speakers to this function. Most of the sound emanates from behind or to the sides of the listener. Obviously the surround channel(s) are interpreted quite separately from the frontal (main) channels. The diffuse soundfield associated with surround setups is usually created by a bank of loudspeakers placed around the listener, wired in parallel or alternatively connected out of phase through inverted wiring or allpass filters. In a home environment, dipole speakers, digital allpass filtering (mutual whitening of the surround speaker feeds) and wall reflections from weirdly aimed speakers are used to achieve a similar, diffuse soundscape.

In surround setups, the main channels are mostly treated as they are in conventional ones. Understandably, however, the interpretation of the surround content is considerably hazier. Mostly it is used as a less than accurate, less than directional scrap heap into which funky hum is occasionally dumped. The surround sound of today isn't exactly a science.

A further concession to the movie theatre, what with all the extravagant special effects, is the addition of a bass boost (low frequency effects, LFE) channel. It is used to momentarily add extraneous low frequency (or even infrasonic) energy to explosion scenes and the like. Only one such channel is needed because, in closed spaces, bass frequencies are poorly localised. In a modern, well equipped theatre it is however questionable whether such a specialized bass channel is needed at all: when digital transmission is used, any one of the channel data streams can carry even infrasonic energy flawlessly. Such frequencies need only rarely be cut in the production phase and could be separated into a playback system of their own if the primary speaker array cannot carry the load.

The special needs which lead us to employ a nonuniform rig also give plenty of chances for other tuneups. The most common of these are associated with matrix coding and steering decoders, both of which benefit from extra assumptions related to the final destination of the channel data. For instance, masking and precedence effects are used to advantage in Dolby’s matrix surround, as well as in all the current digital cinematic sound systems (DD, DTS and SDDS).
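As an illustration of the matrix idea, here is a much simplified passive 4-2-4 encode/decode in the spirit of Dolby's matrix surround. This is only a sketch: a real encoder phase shifts the surround channel by ±90° and a real decoder adds steering logic, both omitted here.

```python
import math

K = 1 / math.sqrt(2)  # the usual -3 dB mixing coefficient

def matrix_encode(left, right, center, surround):
    """Fold four channels into a stereo-compatible Lt/Rt pair.
    (A real encoder phase-shifts the surround by +/-90 degrees.)"""
    lt = left + K * center - K * surround
    rt = right + K * center + K * surround
    return lt, rt

def matrix_decode(lt, rt):
    """Passive decode: the channel sum recovers center,
    the channel difference recovers surround."""
    center = K * (lt + rt)
    surround = K * (rt - lt)
    return lt, rt, center, surround

# Surround-only material ends up out of phase between Lt and Rt,
# so the difference-based decode recovers it.
lt, rt = matrix_encode(0.0, 0.0, 0.0, 1.0)
print(matrix_decode(lt, rt))
```

Running the decode on front-only material shows the well known weakness of the passive matrix: a front left signal leaks into the decoded surround at only 3 dB down, which is exactly why masking, precedence and active steering are leaned on so heavily.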

Newer ambisonic configurations sometimes incorporate one to three frontal, conventional speaker feeds which are used to enhance the directional accuracy of the setup for the ever important stage sources. This can be considered another example of an existing, inhomogeneous channel configuration, driven by practical considerations rather than theoretical elegance.


 ‐voltage vs. current driven transmission
 ‐high and low impedance connections

Probably the most rudimentary way of moving sound from place A to place B is by wire. This is by far the most common audio transport, taking both analog and digital forms. The digital variety is the only audio transport capable of anything resembling, err, sufficient capacity.

Unbalanced analog

When most people think about connecting two pieces of audio equipment together, they think about unbalanced analog lines. These take many forms, of which the variety connected by RCA connectors is now in vogue; phone plugs are a typical option as well.

Unbalanced transmission uses a high impedance receiver and voltage based transmission. The voltage driven onto the signal wire carries the audio and is compared against the reference plane set by another wire. Typical nominal levels are +4dBu and −10dBV, meaning an RMS voltage a given number of decibels above or below a reference level (0.775V for dBu, 1V for dBV).
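To make the reference levels concrete, here is a small sketch converting the common nominal levels to RMS voltages, assuming the usual 0.775 V reference for dBu and 1 V for dBV:

```python
def dbu_to_volts(level_dbu):
    """dBu is referenced to 0.775 V RMS (1 mW into 600 ohms)."""
    return 0.775 * 10 ** (level_dbu / 20)

def dbv_to_volts(level_dbv):
    """dBV is referenced to 1 V RMS."""
    return 1.0 * 10 ** (level_dbv / 20)

print(dbu_to_volts(4))    # professional nominal level, about 1.23 V RMS
print(dbv_to_volts(-10))  # consumer nominal level, about 0.316 V RMS
```

The roughly 12 dB gap between the two nominal levels is one reason mixing consumer and professional gear on the same lines tends to end in either noise or clipping.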

Unbalanced transmission is cheap and does not require specialized transmission circuitry on either side of the connection. The cables used are typically coaxial, with the reference plane carried by the outer shield and the signal by the inner conductor. As audio does not require large bandwidths, these cables are typically manufactured to loose tolerances. They are also prepackaged in pairs for stereo use, with connectors soldered on. Cost efficiency and ease of use are the main reasons for the wide popularity of unbalanced audio transmission.

The two primary problems with unbalanced analog are poor resistance to interference and trouble with earth loops. The first is a result of the unbalanced mode of transmission, which lets stray magnetic fields from power supplies and electric appliances couple into the wires easily; indeed, the primary benefit of balanced transmission is the cancellation of such interference. Earthing becomes a problem because unbalanced transmission requires a reference plane (usually connected to the outer shield of the cable and the earth planes of the connected pieces of equipment), which ties the equipment together and causes loops to form through the various lines going in and out of the machinery. These loops then gather interference and, especially when some of the lines are power supply lines with high voltages, other coupled signals. Sometimes the coupled signals are of high enough voltage to break delicate electronic equipment.

Balanced analog

Balanced transmission is most commonly used in the studio environment. It uses two or three wires: two signal wires employed as a differential pair, and an optional shield, typically grounded at just one end of the line. Ideally the signal wires form a twisted pair to combat induction. XLR is the de facto connector standard, present in just about every piece of professional audio equipment out there.

Differential transmission is what makes balanced transmission so much better than unbalanced: the signal is carried not as a level or current on a single wire, but as the difference between the signals on two wires. A well implemented balanced line requires some extra circuitry at both ends of the line, a driver to derive the differential signal and a receiver to convert back to the internal levels of the equipment. Neither of the lines should be connected to equipment ground; ideally both should be totally isolated, i.e. float.

First and foremost, differential transmission combats interference. The twisted pair of signal wires gathers less interference to begin with because of its advantageous inductive properties, and any interference that is picked up is picked up by both wires, so it cannot affect the difference between the signals carried: it cancels out in the receiver. The optional shield does not carry a voltage, so it cannot form a ground loop; it just protects against higher frequency electromagnetic interference. The price we pay is higher ohmic loss and reactance because of the wound pair. Noise is kept down by employing higher signal levels to drown it out; such levels are made possible by the inclusion of separate transmission circuitry, which has to be capable of higher intensity drive anyway because of the higher loss in balanced cables.
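The cancellation principle is simple enough to demonstrate with a toy simulation; the figures below are illustrative and the coupling is taken to be perfectly equal on both wires, the ideal case:

```python
def drive_balanced(signal):
    """The line driver puts the signal on the hot wire and its
    inverse on the cold wire of the differential pair."""
    return [(s, -s) for s in signal]

def receive_balanced(pair, interference):
    """Interference couples (ideally) equally into both wires of the
    twisted pair; the receiver takes the difference, cancelling it
    and yielding twice the original signal."""
    return [(hot + n) - (cold + n) for (hot, cold), n in zip(pair, interference)]

signal = [0.5, -0.25, 1.0, 0.0]
hum = [3.7, -2.2, 0.9, 4.4]  # strong common-mode hum picked up en route

print(receive_balanced(drive_balanced(signal), hum))  # 2x signal, hum gone
```

In practice the coupling is never perfectly equal, which is why receivers are rated by their common mode rejection ratio rather than assumed to cancel everything.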

The downside of balanced wiring is higher cost in all parts of the system. Extra circuitry and power supply requirements are placed on the equipment, well made STP cabling is not cheap, and the higher interference standards necessitate more elaborate connectors as well. As a result, few people use balanced lines outside the studio, even where they would seriously cut down on outside interference.

 ‐voltage or current based transmission? the benefits?

Stereo digital: AES/EBU and S/PDIF

Multichannel digital: AES10 (MADI) and Dolby E

Proprietary digital buses

AC‐3 and DTS on 1.44Mbps S/PDIF

 ‐AC‐3 RF interface?

Analog media

C(ompact) cassettes


Open reel tape

Sound on VCR and laserdisc

HIFI video

Digital media









Broadcast formats

 ‐analog: mainly about orthogonal/frequency division multiplexing

Analog radio and television

Analog public radio transmissions come in three main flavors: mono AM, mono FM and stereo FM. Of these, stereo FM is of the highest quality and has consequently become the most popular alternative today. Since these transmission formats have a surprising amount in common, we will cover them in order.

 ‐mono AM

While amplitude modulation uses the instantaneous amplitude of the radio carrier to carry the payload audio signal, frequency modulation employs the instantaneous frequency instead. The prime benefit is that the carrier amplitude stays at a constant high level and can be used as a very stable reference in the demodulation process. The sender also packs maximum energy into the radio frequency signal because the amplitude stays at its highest value all the time. This means that FM radio has wider coverage than AM at the same average transmitted power. We also get easier, more stable gain control: at the edge of the reach of an AM sender we get sporadic fluctuation of the received signal, whereas with FM only the noise level rises. The negative side of FM is its low spectral efficiency: FM uses approximately ten times the bandwidth of AM and can thus only be used in the higher frequency bands, in the vicinity of 100MHz or so. This heavily limits the maximum reach achievable, since the higher frequencies are progressively more attenuated by physical obstacles on the transmission path.

Mono FM employs garden variety frequency modulation. The baseband audio signal is limited to a 15kHz bandwidth and modulated directly onto the carrier with a maximum frequency deviation of 75kHz, i.e. a modulation index of about 5.

Achievable signal to noise ratios are on the order of 80dB near a transmitter, while the bandwidth of a single channel after modulation is approximately 200kHz.
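The core of frequency modulation, the baseband steering the derivative of the carrier phase, can be sketched in a few lines. The sample rate, carrier and deviation below are toy values chosen only so the figures fit a short simulation:

```python
import math

def fm_modulate(baseband, carrier_hz, deviation_hz, sample_rate):
    """Frequency modulate: the instantaneous frequency is the carrier
    frequency plus the deviation scaled by the baseband (in [-1, 1]);
    the phase is the running integral of that frequency."""
    phase = 0.0
    out = []
    for x in baseband:
        phase += 2 * math.pi * (carrier_hz + deviation_hz * x) / sample_rate
        out.append(math.cos(phase))
    return out

# A 1 kHz tone on a (toy) 100 kHz carrier with 75 kHz peak deviation.
# Note the constant envelope: every output sample lies in [-1, 1],
# which is what gives FM its stable demodulation reference.
rate = 1_000_000
tone = [math.sin(2 * math.pi * 1_000 * n / rate) for n in range(1_000)]
rf = fm_modulate(tone, 100_000, 75_000, rate)
```

The constant envelope is also why the sender radiates at full power all the time, as noted above.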

 ‐NOTE‼ Lots more of radio technology stuff

Stereo FM builds on the older mono arrangement. We encode the sum of the stereo pair just as in mono FM, but then insert a pilot tone, a constant 19kHz marker telling the receiver that the FM demodulated baseband signal contains more information. This information is placed on an auxiliary subcarrier at 38kHz, using double sideband suppressed carrier modulation (a variant of AM in which the carrier itself is removed and only the sidebands are transmitted). What is modulated onto the subcarrier is the difference channel of the stereo pair, again limited to about 15kHz bandwidth. So in essence we have to do two separate demodulation runs to get the whole stereo signal: first we FM demodulate the radio frequency signal (carrier in the 100MHz range) and then we demodulate the subcarrier to get the difference signal (the subcarrier frequency is precisely twice that of the pilot to facilitate the use of a phase locked loop to synthesize the subcarrier from the pilot tone). Stereo FM uses some 600kHz for the complete signal after RF modulation, and achieves quality comparable to that of mono FM.
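A toy model of the multiplexing may clarify the two-stage structure. This is a hedged sketch only: double sideband, suppressed carrier on the 38 kHz subcarrier, no bandlimiting filters, and illustrative amplitudes throughout:

```python
import math

RATE = 192_000  # toy sample rate, comfortably above the ~53 kHz multiplex

def stereo_multiplex(left, right):
    """Compose a simplified FM stereo baseband: the sum channel, a 19 kHz
    pilot and the difference channel on a 38 kHz subcarrier."""
    out = []
    for n, (l, r) in enumerate(zip(left, right)):
        t = n / RATE
        pilot = 0.1 * math.sin(2 * math.pi * 19_000 * t)
        subcarrier = math.sin(2 * math.pi * 38_000 * t)
        out.append(0.5 * (l + r) + pilot + 0.5 * (l - r) * subcarrier)
    return out

def demux_difference(mpx):
    """Remodulate with a locally generated 38 kHz carrier; a real receiver
    synthesizes it from the pilot via a phase locked loop, then low pass
    filters the product (here, simple averaging serves as the filter)."""
    return [2 * s * math.sin(2 * math.pi * 38_000 * (n / RATE))
            for n, s in enumerate(mpx)]

# Hard left "signal": the recovered difference averages to (L - R) / 2.
left, right = [1.0] * 19200, [0.0] * 19200
d = demux_difference(stereo_multiplex(left, right))
print(sum(d) / len(d))  # about 0.5
```

The sum channel rides through untouched, so a mono receiver simply low pass filters the multiplex and ignores everything above 15 kHz: that is the downwards compatibility trick the whole arrangement is built on.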

An interesting thing to note about stereo FM is that the difference channel is considerably more prone to bad radio engineering (though not to environmental noise) than the sum channel: the modulation method chosen, the use of a subcarrier and the relatively tight positioning of the sum, pilot and difference bands in the multiplexed baseband cause nonlinear distortion to contribute mainly to error in the difference channel. This is most easily heard on commercial channels which press their modulation indices (i.e. volume) so high that distortion results. The nonlinearity causes sidebands to form in the demodulated RF signal which reach right into the lower tail of the subcarrier band and cause annoying noise in the difference channel. Usually this is inaudible, but when a Dolby Surround or comparable matrix decoder is used on the output, the difference is again separated from the stereo pair and the noise becomes apparent.

 ‐NOTE‼ TV???


 ‐block companded digital sound at 13 bits
 ‐channel data rate max. 768kbps
 ‐not all is used even in stereo


 ‐digital perceptually compressed

Streaming Net audio

 ‐liquid: Dolby AC‐3 in stereo mode
 ‐Real: LPC ⇒ not for music
 ‐Shout: layer 3?

Sound on film

Analog optical

 ‐low S/N


 ‐even more expensive

Synchronized external audio

 ‐early experiments: France?
 ‐Terminator (CDS?)!

On‐film digital audio

 ‐low capacity
 ‐de facto standards

Replication and media

 ‐differences for

Copy protection: SCMS, CSS, SDMI, SCPS

 ‐DMCA/WIPO treaties et cetera
 ‐Content Protection Working Group stuff
 ‐SDMI: watermarking
 ‐watermarking overall