Sound synthesis

After considering the prominent methods of analysing and visualising sound signals, we are ready to take a brief look at sound synthesis. This is a fairly broad subject, so only an introduction can be given. Relevant aspects here include our goals in synthesis (e.g. do we want to mimic existing instruments, electronic or acoustic, or create something completely original), the algorithms we use to reach those goals (they exist in quite a variety), the way we implement our algorithms (great care must be taken here—a bad or inefficient implementation will ruin the result) and the control structures, parameters and user interface we choose (these dictate how usable our synth becomes).

First, some of the more general aspects of synthesis will be discussed. After that, different synthesis methods are treated in a roughly decreasing order of prominence. Some analysis of the internals and the relative strengths and weaknesses of each algorithm is presented. Finally, implementation and interfacing issues are briefly touched upon.

Some general points about good synthesis

When synthesizing sound, we are faced with all the classical engineering decisions, and then some. We need to see what we are trying to achieve, i.e. do we want to mimic existing instruments (classical or electronic) or produce original sounds, do we want to use the sounds as Foley effects, musical instruments or to create a sonic atmosphere. We need to know what applicability we wish from our synthesis engine—is it meant to be a one-size-fits-all solution for music composition or do we expect to use it only to create classical guitar sounds for a single project. If we wish to have commercial applications, we need to consider price and target audience. If we need realtime performance, implementation issues and computational facilities are often a limiting factor. Whatever the scope, ease and intuitiveness of use are always paramount to good synthesis—if few people can grasp the logic behind our engine, it is unlikely that a lot of useful synthesis work will ever get done. Original analysis needs to be weighed against work already done and earlier solutions readily available. Rich features must be balanced against cost and ease of use. All these and much more contribute to good synthesis, and there are few absolute rules.

Some points should be noted, though. First of all, most synthesis techniques are not original. The best methods are generally based on tried and true principles. This is most clearly seen when looking at commercial synthesizers: only seldom do original synthesis methods surface and make it in the market. Most successful products are based on a combination of sampling, filtering, rudimentary effects and a good user interface, put in an easy-to-use, cheap package. Further, even though a good synth must have facilities to do what it was designed for, extra features often just confuse users. As an example, even though such profoundly versatile synths as the E-mu Morpheus and Kawai’s K5000 have surfaced, they never did as well on the market as their much cheaper and simpler counterparts, such as the Orbit series dance synths—they are just too damn complex. Even though versatile synthesis means more possibilities for original work, too many parameters with little intuitive meaning often swamp the usability of a product. Synthesis products are invariably used by people with little technical background, and even with such knowledge, people lack the interest to tinker with obscure settings for hours to produce the instrument they want. Do not ruin the creative effort with too much technology.

Another important point is that most synthesis products are not exactly general purpose. The real hit synths on the market have been directed at fairly specialized needs (with the exception of instrumental synth/samplers, such as the Kurzweils), witness the popular dance music synthesizers of today or the classic ’303. This doesn’t mean that there shouldn’t be variety in the kinds of samples and algorithms your synth is capable of, it just means that putting 8MB worth of classical piano on your dance synth isn’t a good idea. Specialize. Use your 8MB for something a bit more original. Remember that factory settings sell a synth, not its technical sophistication. Use a couple of megabytes for original settings and you have a killer.

When using a synthesizer in a performance situation, realtime control facilities are paramount. Only the features that are readily accessible will be used. Including modulation control, easy and intuitive settings recall and proper MIDI control will help a lot. Hiding something behind four obscure button pushes will render it useless. The user also needs to see what is going on. Korg sold its Trinity largely with the big touch screen.

If aiming for the software market, see to the outward appearance of your product. In many cases the way a program looks will to a large extent determine its success. Steinberg’s Cubase is just one example. In addition to basic user interface design, proper use of colors, audio signals and intuitive visualisation will help keep track of the functions going on. Relying too much on standard user interface elements provided by the operating system can make interaction with the program tacky and dull. With software, especially if non-realtime synthesis is the aim, there is a tendency to include too much functionality. This is not necessarily bad, as non-realtime synths aim at a different market from their realtime counterparts. However, it is important that the user is insulated from too many settings and parameters. Make it possible to use your engine without setting everything up from scratch—only relevant parameters should be required.

Synthesis methods

As was said, sound synthesis methods exist in great numbers. That is because they all have different reasons for their original inception and different applicability to actual synthesis needs. Some of the more important aspects of an algorithm are the parameters (especially the number and nature of these) an algorithm requires to work, the computational load an algorithm places on the hardware (e.g. FM is very easy whereas serious physical modelling can kill any machine), the kinds of sounds the algorithm can produce without requiring extensive modification and/or tuning, the ease of controlling the algorithm (i.e. how intuitive are the parameters?) and the ease of implementation (for example, Karplus–Strong can be very easy to implement, while the same certainly does not hold for full-blown physical modelling algorithms). These considerations usually give rise to myriads of mutually incompatible requirements which must, as usual in engineering, be weighed against each other to reach a workable synthesis platform. The aim, here, is to give an intuitive understanding of the workings of the different algorithms and to aid in understanding their commercial implementations.

Sampling based synthesis and wavetable lookup

Sampling and wavetable synthesis, although once slightly different in scope, have come to stand for the same thing: playing back stored sound through some processing. Originally wavetables were fixed size looped waveforms and sampling utilized variable length, sometimes one-shot sounds. Nowadays wavetables are not used in their pure form, because wavetables can be seen as a subtype of samples. Traditional wavetables still crop up, however, but not in the context of sampling synthesis—namely, they are used in additive and subtractive synths to provide the basic waveforms. In fact, some variants of additive synthesis come pretty close to wavetables. The difference is, wavetables are meant to be crossfaded and to be loaded as part of the patching process. Additive synths depend not so much on the content of the basic waveform as on the number of basic waveforms combined. In additive synths the normal case is to have just a sine as the basic waveform.

On top of the basic playback algorithm, the possibility of mixing multiple sounds, pitch shifting and interpolation thereof plus some additional filtering and DSP effects are often added. The idea behind sampling is to use existing sound material (which is often very difficult to synthesize exactly) as the starting point, and so to create very convincing simulations of acoustical instruments. Of course, when crude processing is applied, instead of subtle coloration, entirely new sounds emerge. A good example is the way modern dance music abuses its drum tracks and beats; musique concrete and industrial are further examples: instead of sampling real instruments, they sample almost any sound source in existence. Because of this great generality, sampling is one of the most versatile synthesis methods in existence and is widely deployed in commercial synthesizers.

Under the hood, even such a naïve technique as sample based synthesis can be quite tricky. When implementing wavetable playback, one needs to start with a sampled sound and play it back at a different speed, while still producing a data stream that is constant in rate.

The standard way to accomplish this is to keep a counter which points to the single sound sample now being played, and an increment which tells how much to add to the counter to reach the next sample. Both the counter and the increment need to include a fractional part, because otherwise one would have only a very limited set of pitches that could be used. From the fractional parts the next issue arises: what to do when you have to actually output a sample from a fractional wavetable offset? One solution would be to just neglect the fractional part (truncate the address to a whole number) and use the previous true sample. This is bad—what it amounts to is severe (by musical standards) nonlinear distortion; in effect, noise. A better way would be to round the offset, but even this isn’t suitable for high quality synthesis. Better yet, we can interpolate. This means taking a couple of samples on both sides of the fractional location and using them and the counter value to create a suitable sample in between. The most common method encountered is linear interpolation: here one conceptually draws a line between the sample values—basically taking a weighted average of the two nearest samples where the weighting coefficient is the fractional part of the address. This is already quite good. As long as the signal is properly bandlimited (no components approaching the Nyquist limit), no problem arises.
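
As a rough illustration, a minimal sketch of the counter-and-increment scheme with linear interpolation might look like the following; the table, frequency and sample rate are placeholders, and no looping or enveloping is done yet. Truncation and rounding correspond to using table[i] or the nearest sample directly instead of the weighted average.

    import numpy as np

    def wavetable_osc(table, freq, sr, n_samples):
        """Play a single-cycle wavetable at an arbitrary pitch using a
        fractional phase counter and linear interpolation."""
        out = np.empty(n_samples)
        phase = 0.0
        incr = freq * len(table) / sr        # table positions to advance per output sample
        for n in range(n_samples):
            i = int(phase)                   # integer part: the previous true sample
            frac = phase - i                 # fractional part: position in between
            nxt = (i + 1) % len(table)       # wrap around the single-cycle table
            out[n] = (1.0 - frac) * table[i] + frac * table[nxt]
            phase = (phase + incr) % len(table)
        return out

    # usage: a 440 Hz tone from a 2048-point sine table at 44.1 kHz
    sine_table = np.sin(2 * np.pi * np.arange(2048) / 2048)
    tone = wavetable_osc(sine_table, 440.0, 44100, 44100)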

However, linear interpolation doesn’t behave very well in the presence of very high frequencies. To correct for this, one might raise the order of interpolation—fitting splines, for example. But even this is not optimal. The reason is that straightforward lookup and interpolation leads to aliasing, if large enough increments are employed, and, in addition to that, polynomial interpolation is not, theoretically, the right way to go. (When the order of the polynomial increases, so does the wiggle between sample values—the result is poorly behaved and leads to high frequency distortion.) Also, because one is in effect down-sampling the signal, sooner or later some of the higher frequency components will fold and sound bad. The optimal solution to the problem would be to do true band-limited interpolation. (Meaning reconstruction by a sufficiently accurate representation of the perfect low-pass filter and sampling the result at specified intervals.) However, this is not suitable for real-time operation and usually wouldn’t even be worthwhile. (Some oversampling headroom and linear interpolation will do the trick in most musical applications.)
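
For reference, here is a sketch of such band-limited reading, approximated with a Hann-windowed sinc kernel rather than the ideal low-pass; as written it only covers increments at or below one, since down-sampling would additionally require lowering the kernel's cutoff.

    import numpy as np

    def sinc_read(x, t, half_width=8):
        """Read signal x at fractional index t through a Hann-windowed sinc,
        a truncated stand-in for the ideal reconstruction low-pass."""
        n0 = int(np.floor(t))
        n = np.arange(n0 - half_width + 1, n0 + half_width + 1)
        d = t - n                                         # tap distances from t
        w = 0.5 * (1.0 + np.cos(np.pi * d / half_width))  # Hann window over the taps
        taps = np.clip(n, 0, len(x) - 1)                  # crude edge handling
        return float(np.dot(x[taps], np.sinc(d) * w))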

In addition to the basic resampling process, some other features are also needed to make a workable synthesis method. The first is looping. When using a sampled sound, a lot of space is required to hold the wavetable. Often this requirement creates a need to somehow compress the sound. The most obvious way is to make it loop. This means wrapping the counter over some specific point in the wavetable back to some earlier address. This is useful because most instrumental sounds start with an attack transient and decay to a quasi-periodic waveform that is often amenable to looping. When the sound reaches the loop, it also sticks—depending on the length of the loop, slow variations in the sound disappear or become cyclic. This can be good or bad, depending on the goals of the musician. On top of the looping resampler, volume enveloping and low-order low-pass filtering are usually applied. The need for the former is obvious: to create discernible notes. Filtering, on the other hand, can be used in different ways. The original reason (and the reason why most such filters are low-pass) is that when one plays acoustical instruments louder, the resulting sound is richer in high partials than when playing soft. This is an effect easily emulated by a properly controlled time-variable low-pass filter. But since the time of the first samplers, filtering has become an expressive tool as well (as in the analog synthesizers) and thus more complex filter designs have proliferated in samplers as well.
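
A sketch of the looping resampler with a simple decay envelope and a one-pole low-pass might run as follows; the loop points, decay rate and filter coefficient are arbitrary illustrative values.

    import numpy as np

    def looped_voice(sample, loop_start, loop_end, incr, n_samples,
                     decay=0.9999, cutoff_coef=0.2):
        """Looping playback with an exponential volume envelope and a one-pole
        low-pass; assumes loop_end < len(sample) - 1."""
        out = np.empty(n_samples)
        phase, env, lp = 0.0, 1.0, 0.0
        for n in range(n_samples):
            if phase >= loop_end:                # wrap back into the sustain loop
                phase -= loop_end - loop_start
            i = int(phase)
            frac = phase - i
            s = (1.0 - frac) * sample[i] + frac * sample[i + 1]
            lp += cutoff_coef * (s - lp)         # one-pole low-pass
            out[n] = env * lp
            env *= decay                         # exponential decay envelope
            phase += incr
        return out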

What are the pros and cons of sampling synthesis, then? On the plus side, sampling is extremely versatile: you can sample most anything with a good Akai. It is also quite easy and efficient to implement, leading to low-cost designs. Further, when beefed up by some additional processing, very convincing acoustical instrument simulations can be created (witness the K2500). Drums in particular are almost made for sampling. And, because of the inherent quality of sampling, a true industry of sound distribution and reuse has developed around sampling—instruments and samples are easy to obtain and to create. On the minus side, sampling isn’t very good at creating time variable timbres, per se. Original synthesis is also very difficult on a basic sampler. A lot of memory is required, and since layering (using multiple separately controlled sampled sounds to create one instrument) is widely employed, processing requirements are not as low as one might think. Sampling doesn’t have many perceptually significant parameters when thought of as a synthesis method, and it also has a tendency to distort sounds in a non-natural way when significant pitch shifts are used.

Beefing the idea up with more sophisticated filtering, layering, modulation (of amplitude, pan position, pitch, filter parameters etc.), multisampling (multiple samples of the same instrument for different parts of the scale), per timbre effects, wavetable interpolation (fading multiple wavetables in and out in series) and so on makes the method very usable, but sampling is still quite limited if truly original synthesis or complex modulation and performance control are needed. There are also quite a number of synthesis methods which use sample playback but emphasize different ways to extend it. Vector synthesis (also wave stacking) is probably the best known of these. It layers multiple samples over each other but instead of controlling them individually, it mixes them under the control of an envelope or the user. If fixed size wavetables are used, this comes close to group additive synthesis. Another typical approach to extended sampling is wave sequencing, which fades through multiple wavetables under list guidance. Wave sequencing is good for pulsating, rhythmic sounds and sample-subtractive hybrid instruments. For a physical realization, Korg’s Wavestation combines wave stacking and sequencing to create what many people consider to be one of the best platforms for works of texturology and ambience.

One fairly interesting recent variant of wavetable synthesis is scanning synthesis, which employs slowly evolving wavetables of constant size. The idea is to utilize something akin to the waveguide methods described further down to generate the wavetable. Basically, we think of the wavetable itself as a waveguide in which we can set up vibratory patterns. If we make sure that the evolution of these patterns is slow enough (with periods on the order of 0.1 s to 10 s), rapidly scanning the wavetable will result in smoothly varying quasiperiodic signals. Setting up the dynamics is usually done by hand—essentially the user gets to push points on the wavetable around in real time and the resulting ripples develop according to the characteristic modes of the waveguide. Many of the complications of true physical modelling are often left out (e.g. scattering and nonlinearity), leading to an efficient and interesting-sounding implementation of digital synthesis. On the downside, the method is already patented and it relies quite heavily on realtime human control—getting predictable results out of a scanned synthesizer (and especially duplicating physical instrumental sounds) is difficult.

Subtractive synthesis

Subtractive synthesis is the prominent method in analog synthesizers, so it’s only natural there would be an all-digital implementation. In subtractive synthesis we start with a basis waveform, usually a periodic sound rich in partials. To achieve a usable timbre, we filter, i.e. subtract, frequencies from the basis. To get anything interesting, the filters we use need to be time-variable. On the analog side, the archetypal filter is a multimode (lowpass/highpass/bandpass/notch) filter with a 12dB or 24dB per octave slope. Digital implementations typically use discrete filters from second to fourth order. When higher orders are used (i.e. when the processing power is available), some rather wild effects can be achieved, like speech-like timbres. When the orders grow, we may end up near the regime occupied by linear prediction. This enables semiautomatic analysis of instruments to produce patches. Mid-order filters give rise to methods such as E-mu’s Z-plane synthesis—natural, intuitive yet very powerful synthesis especially suited for original patching work. We must remember, however, that most subtractive synths use the traditional, low order architecture, often with only a few basis waveforms to choose from. This leads to a low number of configurable parameters and can lead to a dull, clichéd sound. Since filtering takes a lot of CPU cycles, a lot of effort has been put into extending the subtractive algorithm without raising filter orders. Some of the approaches employed are using configurable wavetables or samples as the basis waveforms, permitting more than one oscillator to feed the filter, adding extensive modulation possibilities (envelopes, LFOs and arpeggiation), letting the user cascade oscillators to achieve FM and AM effects before filtering, and waveshaping/subtractive hybrids.
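
A bare-bones digital example of the subtractive idea, assuming a naive (non-bandlimited) sawtooth and a Chamberlin state-variable filter swept downward over the note; all numbers are placeholders.

    import numpy as np

    sr = 44100
    t = np.arange(sr) / sr
    saw = 2.0 * ((110.0 * t) % 1.0) - 1.0            # naive 110 Hz sawtooth

    def svf_lowpass(x, cutoff_hz, damp=0.4, sr=44100):
        """Chamberlin state-variable filter, low-pass output. cutoff_hz may be
        an array for a time-varying sweep; smaller damp means more resonance.
        The cutoff is capped well below the filter's stability limit."""
        f = 2.0 * np.sin(np.pi * np.minimum(cutoff_hz, sr / 8) / sr)
        f = np.broadcast_to(np.atleast_1d(f), x.shape)
        low = band = 0.0
        out = np.empty_like(x)
        for n in range(len(x)):
            low += f[n] * band
            high = x[n] - low - damp * band
            band += f[n] * high
            out[n] = low
        return out

    # classic filter-envelope gesture: sweep the cutoff from 4 kHz down to 200 Hz
    sweep = np.geomspace(4000.0, 200.0, len(saw))
    note = svf_lowpass(saw, sweep)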

The basis waveforms form an important part of subtractive synthesis. In analog (and analog emulating) implementations, the starting point is usually a simple-to-produce, harmonically rich waveform, such as a triangle, sawtooth, rectangular, pulse or noise signal. Often these have modulatable parameters. The prominent example here is pulse width. Often some means of interlocking multiple oscillators is also implemented. The classical way is to let the zero crossings or pulse edges of one waveform reset (restart) another oscillator. This is called (hard) sync and makes the triggered oscillator take on the fundamental pitch of the triggering oscillator. The sounds produced are hard and crisp and can be used to simulate various nonlinear phenomena in instruments (like the stick-slip process in string instruments). On the analog side, the exact implementation of the filter is also an important characteristic of the instrument. (As good examples one could give the Moog ladder filter, the distinctive Prophet sound and, of course, the crappy-but-oh-so-wonderful 303 filter section.) In the analog synthesis community, many filter designs have really become institutions and so are the single most sought-after feature of some synths. The digital counterparts of these filters are often much more accurate, which some people think leads to a certain lack of depth in digital synthesis. Digital implementations must also cope with some peculiar problems of their own, like the inherent problems in producing correctly anti-aliased basis waveforms. (This is especially so with wavetable and sample based methods because closed form, provable formulæ for variable-pitch oscillators aren’t common knowledge. For more, see approximation problems over disjoint sets and nonuniform sampling.)
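
Hard sync itself is a one-line affair once two phase counters exist; a sketch with two naive sawtooth oscillators (aliasing again ignored for brevity):

    import numpy as np

    def hard_sync_saw(master_hz, slave_hz, sr=44100, n=44100):
        """Hard sync: reset the slave phase whenever the master wraps, so the
        output takes the master's pitch but keeps the slave's waveform shape."""
        out = np.empty(n)
        mp = sp = 0.0                        # phases in [0, 1)
        for i in range(n):
            out[i] = 2.0 * sp - 1.0          # naive sawtooth from the slave phase
            mp += master_hz / sr
            sp += slave_hz / sr
            if mp >= 1.0:                    # master wrap: restart the slave
                mp -= 1.0
                sp = 0.0
            elif sp >= 1.0:
                sp -= 1.0
        return out

    buzz = hard_sync_saw(110.0, 287.0)       # non-integer ratio gives the crisp sync timbre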

Subtractive synthesis is a very workable method. Because low-order filtering is very intuitive, subtractive synthesis is easy and rewarding to use. Most of its parameters also have proper psychoacoustical semantics—timbre is created by taking a proper starting waveform and shaping its spectrum with filters. Modulation is then used to shape the sound into an instrument, and a little tweaking makes the sound more lively and organic (this is mostly about detuning, feedback, sync etc.). Our ears are quite used to sounds generated this way, as we know from the discussion on source-excitation models in the previous sections. On the negative side, accurate instrument simulations are surprisingly difficult to create because of the simplicity of the synthesis engine. Digital implementations are often quite problematic and do not sound very good without extensive modification and addition of features.

Additive synthesis

Additive synthesis reflects the mental opposite of subtractive synthesis. Whereas subtractive synthesis takes a top-down, simplify-from-complex attitude, additive synthesis works from the bottom up, combining simple sounds to form more complex ones. The basic prototype has its roots in Fourier theory: any sound can be created by combining multiple sine waves at different frequencies, phase angles and amplitudes. Additive synthesis, then, strives to create instruments by decomposition and reconstruction.

Implementation of additive synthesis is quite straightforward—one only needs a way to create a lot of sine waves. Why so many? Because most instrumental sounds include rapidly varying and stochastic components that arise from nonlinear interactions in the instrument. This easily leads to hundreds of partials, all time variant and often multiply interconnected (e.g. very small changes to the original instrument sound require very large-scale modification of additive synthesis parameters to achieve accurate reproduction). Thus, although theoretically perfect, additive synthesis is not very well matched to the actual production of sound in physical instruments. The great number of partials makes true additive synthesis less than efficient to implement. Some simplifications can make the method more usable, though. The first one is not to allow arbitrary sine waves but to group them into bundles of mutually harmonic partials. This allows the use of the Fast Fourier Transform to generate each group efficiently. If the further simplification of disallowing separate envelopes inside such a harmonic group is made, group additive synthesis results—here each group can be recreated by wavetable lookup and amplitude scaling, which is very efficient. Indeed, this comes close to original wavetable synthesis and synchronous granular methods. Excellent quality of reproduction is retained. A completely different—but in certain situations even more powerful—optimization is based on discrete summation formulæ (DSF). These are mathematical equivalences based on trigonometric identities. They make it possible to calculate values for some special classes of functions (most often finite sums of sinusoids, whence the name) very efficiently by simplifying them through trigonometric manipulation. For instance, there is an efficient closed form equivalent for a trigonometric polynomial composed of the first n harmonics, where each sinusoid is present at an amplitude of some constant times the amplitude of the preceding one (i.e. the spectrum decays exponentially). DSFs of this kind can be used to implement bandlimited pulse and buzz waveforms without any oversampling or filtering—a major speedup on general purpose hardware.
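
As a sketch of the DSF idea (the symbols below follow the usual textbook form rather than any particular source), the sum of N exponentially decaying harmonics collapses into a handful of sine evaluations:

    import numpy as np

    def dsf(theta, beta, a, N):
        """Closed-form evaluation of sum_{k=0}^{N-1} a**k * sin(theta + k*beta):
        N partials spaced beta apart, each a times weaker than the last."""
        num = (np.sin(theta) - a * np.sin(theta - beta)
               - a**N * (np.sin(theta + N * beta)
                         - a * np.sin(theta + (N - 1) * beta)))
        return num / (1.0 + a * a - 2.0 * a * np.cos(beta))

    # a bandlimited harmonic "buzz": fundamental 220 Hz, partials kept below Nyquist
    sr, f0, a = 44100, 220.0, 0.9
    t = np.arange(sr) / sr
    N = int((sr / 2) // f0)             # highest harmonic still below the Nyquist limit
    w = 2 * np.pi * f0 * t
    buzz = dsf(w, w, a, N)              # theta = beta gives harmonics 1..N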

Additive synthesis is probably the most versatile synthesis method in existence. Any sound can be represented accurately by it. It is also capable of creating new timbres from scratch, in addition to being amenable to analysis-resynthesis techniques. It is also one of the few synthesis methods for which automatic transcription of instrumental sounds is fairly well developed. As a general synthesis method, it is also unusually accurate—even the slightest nuances can be captured by it. Additive synthesis is, however, computationally expensive and nearly impossible to implement in analog form. (Due to the high number of partials, noise levels shoot through the roof.) It also requires immense amounts of control data, even in reduced form. (Amplitude and frequency envelopes for each of the partials in nonreduced form.) Tweaking is possible, but due to the frequency envelope sensitivity of human hearing, large scale modifications to synthesis parameters are necessary to produce a natural sounding modification of a timbre. Thus the psychoacoustical significance of a single parameter is quite limited. The difficulty of automating large scale spectral edits further compounds the problem. This is the reason additive synthesis easily leads to thin sounding instruments if operated manually—the complex harmonic structure of a sound is easy to destroy. Additive synthesis also behaves rather badly in the presence of stochastic components and highly transient signals.

Additive synthesis also takes a lot of programming time and is difficult to master; consequently it is not widely used. As a recent example, the Kawai K5000 employs additive synthesis with six parts of 64 harmonic partials (either the lowest or highest 64 of 128), almost certainly implemented by FFT.

Phase modulation synthesis—frequency modulation (FM), phase modulation and phase distortion

Phase modulation is a synthesis technique with a long history. The first forms of phase modulation can be found in, where else, radio technology. There it is commonly employed in FM radio. Phase modulation was also used with analog synthesizers, but the limited accuracy of analog oscillators and the difficulty of building oscillators with negative frequency support hindered the analog implementations. The true break-through of FM technology came with John Chowning’s work and the subsequent patenting of the method for sound synthesis by Yamaha. The result was the DX7, probably the most successful single synthesizer in existence. More recent derivatives include the OPL2-4 synth chips (ADLIB etc.) and Yamaha’s more mature version of the DX7 synthesis principles, the SY series.

The idea behind FM synthesis is that quite rich and deliciously time-variable timbres can be created by modulating the frequency of a carrier sine oscillator with another, the modulator. When the modulator frequency stays below 20 Hz or so, only a more or less rapid vibrato results. But when the modulation frequency rises to the audio band, the characteristic sidebands resulting from the modulation process can be heard. The sidebands are generally not harmonic, except in special cases. These come about when the frequencies of the carrier and the modulator form a simple ratio. The method has few variable parameters, these being the amplitudes of the two oscillators (the modulator level affects timbre, not loudness) and their frequencies. The basic configuration is two oscillators cascaded, as described above, plus envelope generators to control the amplitudes. Often more oscillators are used as well, since interesting (and complex) inharmonic spectra are thus easily produced, allowing quite realistic bell and brass sounds to be generated. The most characteristic FM sound is a slow sweep of the modulator volume, while keeping the carrier-modulator frequency ratio constant. This produces the well-known ADLIB timbre. Common modifications include several two-oscillator complexes in parallel (allowing for a form of additive synthesis), multiple oscillators in series (allowing for extremely inharmonic and noise-like spectrum formation), non-sine components (they produce a richer sound and, when added in parallel, result in modified group additive synthesis that complements the capabilities of the base FM system), layering, feedback (for noise and weird sounds and for adding long term development to the sound), addition of filters (since FM can produce most of the basis waveforms for subtractive synthesis, this also complements the capabilities of the synthesizer) and several combinations of the preceding. Some specific modifications include a limited form of FM, called formant FM, which is capable of producing voice-like timbres and formant peaks and has an associated analysis procedure which makes instrument design considerably easier, and a couple of other academic projects with no presence in the commercial music business. For example, Yamaha’s FS1R uses a combination of formant FM and filtering to create its timbres.
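
A two-operator sketch of the basic configuration (implemented here, as in the DX machines, as phase modulation of a sine carrier), with a decaying modulation index standing in for the modulator envelope; the ratio and index values are arbitrary.

    import numpy as np

    def fm_note(fc, ratio, peak_index, sr=44100, dur=1.0):
        """Two-operator FM: a sine carrier phase-modulated by a sine modulator.
        The index decays over the note, so the sidebands (and brightness) fade."""
        t = np.arange(int(sr * dur)) / sr
        fm = fc * ratio                          # simple ratios give harmonic spectra
        index = peak_index * np.exp(-3.0 * t)    # decaying modulation index
        return np.sin(2 * np.pi * fc * t + index * np.sin(2 * np.pi * fm * t))

    note = fm_note(fc=220.0, ratio=2.0, peak_index=5.0)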

The implementation of FM synthesis is very easy, the only problem being aliasing which results from high modulator frequencies. The computational cost of the algorithm is very low due to its simplicity—nothing more than a couple of table lookups is needed to produce a sample by FM. The cost increases when more oscillators, options and enhancements are added, but especially if implemented in hardware (which is also quite easy and has resulted in the commercial synthesizers of the DX and SY series), the method can achieve very high polyphony with minimal control data, reasonable sound quality and very low cost.

The other two forms of phase modulation, general phase modulation and phase distortion modulation, are less used. Phase modulation, in general, means modulating the instantaneous phase of a carrier. This is very similar to FM, except for the fact that arbitrary phase curves are allowed. The advantage is minimal. Phase distortion modulation was originally used by Casio in its CZ‐series synthesizers as a way to circumvent Yamaha patents. The idea, here, is to vary the reading speed of the carrier wavetable during a single cycle of sound production. The modulator function is essentially a saw‐tooth wave, with the form and frequency depending on the carrier frequency. It’s sort of like hard synced phase modulation. The CZ series synths are very nice for beeps and buzzes, something that is quite hot in the techno scene, nowadays. For real instrument simulation, phase distortion is practically useless. (Although the CZ’s do a remarkable job, considering what’s under the hood.)
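
A sketch of the phase distortion idea, reading a cosine through a piecewise-linear phase map whose knee acts as the timbre control; the actual Casio curves are not reproduced here.

    import numpy as np

    def phase_distortion(f0, knee, sr=44100, dur=1.0):
        """Phase distortion: read one cosine cycle with a bent phase map. The
        first half of the cosine has been read by the time 'knee' of the cycle
        has passed; knee = 0.5 gives a plain cosine, small knees a bright,
        saw-like timbre."""
        phase = (f0 * np.arange(int(sr * dur)) / sr) % 1.0
        warped = np.where(phase < knee,
                          0.5 * phase / knee,                       # fast read before the knee
                          0.5 + 0.5 * (phase - knee) / (1.0 - knee))  # slow read after it
        return np.cos(2 * np.pi * warped)

    tone = phase_distortion(110.0, knee=0.1)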

So, the pros and the cons. FM synthesis is cost-efficient and easy to implement. Additionally, the parameters are, in a sense, acoustically significant and quite easy to predict because a firm mathematical theory exists (in terms of Bessel functions of the first kind) for the formation of the sidebands. Also, since the prime function of the modulation process is to spread the carrier frequency into multiple, symmetric sidebands around the original carrier, the method can be used to create rough approximations of formants. Because most FM realisations include many options and enhancements, they are well suited for original synthesis—many unheard-of sounds can easily be produced. However, since the synthesis procedure bears absolutely no resemblance to the formation of sound in nature, the method is poorly suited to general simulation of acoustical instruments. The method has its own distinctive sound which can be extremely annoying in the long run. Further, the method has been the intellectual property of the Yamaha corporation for so long that it has not gained lasting acceptance outside the academic community.

(Nonlinear) waveshaping

Waveshaping is just what the name says—it takes a simpler wave and shapes it until it sounds right. The simplest form takes a sine wave (only one frequency) and passes it through a carefully crafted function (usually implemented by a lookup table) that adds sidebands to it, based on the amplitude of the original signal. The theory behind the method is that if the function is a suitable Chebyshev polynomial, any combination of upper harmonic partials can be produced from a steady-state sine wave. When one then varies the sine’s amplitude, the larger the amplitude, the more harmonic content there will be in the resulting waveshaped signal. Thus descending volume produces descending harmonic content—something that is characteristic of most instrumental sounds. Thus we have hope of producing believable timbres by feeding waves of varying amplitude through a waveshaper. Indeed, for some instruments this works quite well. The problem is, although the theory is well developed, one waveshaper almost never suffices. This is because most instrumental sounds include elements (transients, stochastic components, inharmonic partials and partials with little mutual correlation) which make it impossible to synthesize the sound with a single waveshaper. More sophisticated versions exist, including waveshaping of non-sine input signals (harmonic or inharmonic, of which the latter is more complex to analyse), combinations with filtering and cascading multiple waveshapers, either in series or in parallel. Some research into using multiparameter functions as waveshapers has also been done. In this case the method is dubbed wave terrain synthesis. The problem is, none of these really has sufficient theory behind them to make them easily applicable, let alone to allow instrument design to be automated.
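
A sketch of the Chebyshev recipe: since T_n(cos x) = cos(n x), a weighted sum of Chebyshev polynomials maps a full-scale cosine to exactly that mix of harmonics, and lower input levels give a duller version of it. The harmonic weights below are arbitrary.

    import numpy as np

    # desired amplitudes of harmonics 1..4 for a full-scale input
    weights = [1.0, 0.5, 0.25, 0.125]

    # shaping function as a weighted sum of Chebyshev polynomials T_1..T_4
    shaper = np.polynomial.chebyshev.Chebyshev([0.0] + weights)

    sr = 44100
    t = np.arange(sr) / sr
    envelope = np.linspace(1.0, 0.0, sr)             # input amplitude fading out
    x = envelope * np.cos(2 * np.pi * 220.0 * t)
    y = shaper(x)                                    # brightness falls with the envelope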

In theory, implementation of waveshaping is easy: all you need is a lookup table (with interpolation, probably) and a simple oscillator with amplitude control. You use the output of the oscillator to look up values from the table. In reality, nothing is this simple. The problem is, once again, aliasing. As the process is nonlinear, it adds to the frequency content of the input signal. In particular, it widens the signal’s bandwidth. (The higher the degree of the shaping polynomial, the more marked the effect. For example, raising to the second power (squaring) doubles the bandwidth.) The result is that with high input frequencies and/or insufficiently smooth shaping functions, significant aliasing may occur. Aliasing aside, the low computational cost of the algorithm sometimes makes it worthwhile on its own, as well as useful as a building block for more sophisticated hybrid synthesis methods.

The better side of this algorithm is its simplicity and the solid theoretical foundation on which it is built. Also, some instruments are fairly well modelled with variants of the base waveshaping algorithm. When combined with filters, some quite usable synthesis methods can be built. (See the Korg M1, for instance.) However, since the algorithm has few parameters (aside from the input waveform and lookup table contents, which are difficult to modify systematically on the fly), it allows little in the way of modulation effects and long-term development in the sound. By itself, the output power is difficult to predict or to control. To this end, separate amplitude feedback circuits are often employed, but with the more complex variants, these might not react in a sufficiently prompt manner.

Decomposing time—granular synthesis, FOF and VOSIM

Granular synthesis has its roots in the areas of quantum physics and wavelet analysis. The basic premise here is that signals can be decomposed in bases different from the classical Fourier one. (Well, to tell the truth, the decompositions used are not always bases in the strict sense, but more general frames, which need not meet proper orthogonality requirements.) In particular, we might wish for a decomposition in which local changes to the signal being analysed only result in local changes to the analysis result. This means we want time localization. Classical Fourier integral transforms have no such thing: add a local bump and the whole frequency spectrum changes, add a jump discontinuity and you get nonuniform convergence/Gibbs’ phenomenon. The root of the problem is the result quantum mechanics people call the uncertainty principle (it was formulated in quantum mechanics by Werner Heisenberg). What it says is, basically, that no matter what decomposition you have, you always have strict bounds on the time resolution of the analysis in terms of the frequency resolution and vice versa. What does this mean? It means that since Fourier analysis has infinite frequency resolution (the Fourier integral transform gives you the exact frequencies required to synthesize the signal), it necessarily has no time localization. At the other end of the scale we have analyses that have no frequency resolution (decomposition into an integral transform of delta distributions) but have perfect time resolution (they give indefinitely accurate times of occurrence for all the deltas). All this seems a bit odd, since our ears can certainly pinpoint sounds in both frequency (or sounds wouldn’t have a pitch) and time (or we wouldn’t need the concept of notes). So there should exist a form in between that behaves similarly to our hearing organ. Such a form could also be very useful.

Such forms do exist. They are standard material in wavelet analysis. The basic idea is to trade frequency resolution for time localization and the other way around, depending on your needs. What results are transforms which have both good time and frequency resolution. Analyses of this kind permit decomposition of sound signals in ways that slightly resemble the way our ears decompose sound. The inverse of this procedure leads to, or at least resembles, granular synthesis, which has its theory rooted in the writings of Dennis Gabor and, later, on the musical applications side, of Iannis Xenakis.

The basic premise is that we can create rich sound textures by superimposing large numbers of small sound grains—little pieces of sound that have little distinctive flavor on their own, but when used in large numbers, coalesce into a coherent sonic matte. Usually these grains are windowed sine waves (often using a truncated Gaussian or raised‐cosine window, since these yield good frequency localization without sacrificing compact support, i.e. finite length, of the grain) or something very close to them. These little sound bites are then combined stochastically, with parameters such as density (grains per second), mean frequency, mean length, variance and envelopes of the previous controlling the overall sonic experience. What results is an extremely powerful, general and easily adapted technique of sound generation that yields very rich timbres and lends itself as well to automated design as to creation of entirely new instruments. Also, by using more sophisticated forms of control (such as statistical distributions and/or grain by grain control) and by substituting richer grain material (non‐sine waves, different windows, chopped natural sound etc.), the method scales almost indefinitely.

This all may sound like a lot of semi‐scientific mumbo‐jumbo, but in the end it is very easy to see what is going on, if some thought is put into it. Think about a sine wave. Its frequency content is simple: it’s just a delta‐spike at the frequency of the wave. (Let’s not burden ourselves with the fact that in the normal sense of the word, these are not functions and the Fourier integral doesn’t converge.) So we have only one frequency. Now let’s take a Gaussian. We know that a Fourier integral transform of a Gaussian is another Gaussian. So we have a clear peak in the spectrum. Now, sample by sample, multiply these two together. What results is something like a windowed sine, except it doesn’t vanish anywhere, but, instead, only decays rapidly towards zero. What is the spectrum of this new signal? It is the convolution of the spectra of the original two signals, i.e. a Gaussian with a higher center frequency. And the time domain representation decays quickly, so we can take it to be time‐localized around the peak amplitude at the center of the original Gaussian. So we have a signal that is both time‐ and frequency localized. (See the illustration below to get a sense of what is going on.) Now we can add these together to add specific frequencies at specified times (approximately), something we certainly cannot do with inverse Fourier transformations unless we use the discrete version and window the results—something that is really just a naïve version of the grain approach. Using stochastic control and great enough grain densities produces sounds that have no recognizable structure aside from the desired timbre.
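
A sketch of the above: Gaussian-windowed sine grains scattered asynchronously into an output buffer. The density, grain lengths and frequency spread are arbitrary, and a real implementation would schedule grains incrementally rather than in one batch.

    import numpy as np

    sr = 44100
    rng = np.random.default_rng(0)

    def grain(freq, dur, sr=sr):
        """One grain: a sinusoid under a (truncated) Gaussian window."""
        n = int(dur * sr)
        t = np.arange(n) / sr
        window = np.exp(-0.5 * ((t - dur / 2) / (dur / 6)) ** 2)
        return window * np.sin(2 * np.pi * freq * t)

    # asynchronous cloud: roughly 200 grains per second, 30-60 ms long, around 500 Hz
    out = np.zeros(2 * sr)
    for onset in rng.uniform(0.0, 1.8, size=400):
        g = grain(rng.normal(500.0, 50.0), rng.uniform(0.03, 0.06))
        start = int(onset * sr)
        out[start:start + len(g)] += 0.1 * g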

[Figure 1: Gaussian sine grain illustration]

Figure 1 From left to right, a segment of a sinusoid, its spectrum, a Gaussian, its spectrum and finally the sinusoid multiplied pointwise (windowed) by the Gaussian plus the resulting spectrum. The sinusoid has a line spectrum whereas the Gaussian is invariant under the Fourier transformation. (To be precise, sinusoidal signals are not square integrable, so they do not have a proper Fourier transformation in the traditional sense. Basically, the line in the spectrum is a delta distribution.) The multiplication (modulation) raises the center frequency of the Gaussian, but does not alter the shape of its spectrum. The pictures also illustrate the kind of frequency modulated functions Dennis Gabor suggested for harmonic expansions. (Later the idea in its original form received considerable critique—such expansions have considerable problems despite being complete.) Negative frequencies are only shown in the spectrum of the Gaussian, and the pictures have not been drawn to scale.

VOSIM is actually a method that is completely independent of the grain based synthesis principles. But it shares some common ground with them, nevertheless. The idea of VOSIM is based around the source-excitation model of speech production. Here, speech is viewed as being produced by a linear filter (the vocal tract) driven by a series of pulses with a wide, relatively constant spectrum (glottal pulses). In VOSIM (which was originally developed as a side product of research into speech), one first looks at the spectral response of this conceptual filter. More often than not one can find the distinctive formant peaks characterizing the instantaneous quality of the sound. One then models the waveform by adding together carefully crafted signals composed of periodic, decaying trains of raised cosine pulses. The point behind the procedure is that the pulse trains form controllable formants: the decay factor controls the width of the formant lobe, the repetition rate of the basic raised cosine pulse sets the center frequency of the formant and the repetition rate of the whole pulse train is the frequency of the glottal excitation function.

[Figure 2: VOSIM waveform illustration]

Figure 2 From left to right, a single cycle of a sinusoid, half the cycle raised to the second power, a string of raised cosine pulses and its parameters and, finally, a complete, periodic VOSIM waveform are shown.

All in all, a second-order all-pole filter with the aforementioned properties (i.e. resonance frequency and Q-value), driven with a periodic pulse train, is (rather crudely) approximated. If we make the assumption that speech can be modelled as an all-pole, pulse-excited filter, we can decompose it into parallel second order filter sections which, in turn, can be modelled by VOSIM generators. Very rich timbral envelopes can be modelled as a combination of additive VOSIM elements. The advantage over a pulse excited filter bank is that VOSIM is computationally cheap—one generator requires only a single multiplication per raised cosine cycle, a table lookup for the waveform and a counter to count out the length of the larger cycle.
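
A sketch of one VOSIM generator along these lines; the parameter names follow the description above, the values are arbitrary.

    import numpy as np

    def vosim_cycle(pulse_hz, n_pulses, decay, gap_s, sr=44100):
        """One period of a VOSIM waveform: n_pulses raised-cosine (sin^2)
        pulses, each 'decay' times weaker than the previous one, followed by
        a silent gap. pulse_hz sets the formant centre; the total period
        length sets the fundamental."""
        t = np.arange(int(sr / pulse_hz)) / (sr / pulse_hz)
        pulse = np.sin(np.pi * t) ** 2                     # one raised-cosine pulse
        train = np.concatenate([decay ** k * pulse for k in range(n_pulses)])
        return np.concatenate([train, np.zeros(int(gap_s * sr))])

    cycle = vosim_cycle(pulse_hz=800.0, n_pulses=4, decay=0.7, gap_s=0.003)
    tone = np.tile(cycle, 200)                             # repeat for a pitched tone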

FOF, the brain‐child of Xavier Rodet of IRCAM, is very similar in spirit to the VOSIM method, but is designed more for music and singing than for speech sounds. It’s in fact closer to the granular methods, since it employs a bank of what resembles grain oscillators to produce unconventionally windowed sine waves. But the ideology behind the algorithm is closer to VOSIM—construction of speech and/or chant by methods derived from the source‐excitation paradigm. FOF has been included in the influential MUSIC V and CSOUND synthesis languages, which makes for its widespread use inside the academic community.

Controlling granular synthesis, in all its variants, differs markedly from other synthesis methods. The most important difference is that unlike with other algorithms, in granular approaches the basic building blocks do not extend very widely in time. Because of this, sustained sounds demand considerable numbers of overlapping grains, and control necessarily transcends the level of individual grains. This means that parameter generation for an individual grain must be automated and the parameters one usually associates with synthesis methods (like volume, amplitude, timbre and so on) do not map into grain parameters in a straightforward manner. Of special concern in the research literature is the problem of grain instantiation—when to emit a new grain. Many strategies have been developed but three deserve special mention: synchronous, asynchronous and analysis-resynthesis. In synchronous granular synthesis (SGS), one emits grains strictly periodically. This comes close to the ideas put forth by windowing function synthesis (WF), which substitutes commonly used window functions (like Kaiser, raised cosine and Blackman-Harris windows) for more general grains, and group additive synthesis (described above). Spectra are matched by altering grain contents and mixing multiple streams of grains with separate parameters, the parameters usually controlled by normal continuous methods. VOSIM can be seen as a particular subtype of synchronous granular synthesis. Asynchronous granular synthesis takes the opposite view: to create interesting sonic clouds, statistical methods are used to produce rapid, controlled variation in grain parameters. Averages, deviations and other statistical measures replace continuous controller data as input parameters. While the synchronous approach excels at spectral matching and static, playable sounds, asynchronous granular synthesis is best suited to simulations of natural sounds with lots of internal movement. Examples include combustion, wind, rattle and so on. A further subtype of the asynchronous variety should be mentioned: quasi-synchronous granular synthesis (QSGS) is based on stochastically varying the time interval between successive grains around some predetermined value. (The other possibility is to use a direct statistical measure of how likely a grain is to appear in a given interval of time.) QSGS is more suited to instrumental simulations and includes SGS as the special case of zero standard deviation in the intergrain time delay. Granular analysis-resynthesis departs from the AGS and SGS methods in giving more weight to the actual contents of a single grain than to the statistical properties of large clouds of grains. In these applications large scale statistics rarely change, but each grain is different—typically the grains are sliced off from existing sound, either in realtime or from sample memory. Typically one then rearranges, pitch shifts, repeats and overlaps the grains resulting from the analysis stage and adjusts their windowing functions. This way interestingly blurred, pulsating, buzz-filled and metallic timbres can result. For reference, one of the more common time stretching algorithms employs a similar construct. The long, vocal dreads of drum’n’bass are produced by abusing this procedure; unlike in effects use, in analysis-resynthesis we do not aim at transparency.
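
As a small illustration of the difference between the scheduling disciplines, here is a sketch of quasi-synchronous onset-time generation, which collapses to the strictly synchronous case when the jitter is zero; the numbers are arbitrary.

    import numpy as np

    def qsgs_onsets(mean_period, jitter_std, total_time,
                    rng=np.random.default_rng(0)):
        """Quasi-synchronous grain onsets: each inter-grain delay is the nominal
        period plus Gaussian jitter; jitter_std = 0 reduces to synchronous GS."""
        onsets, t = [], 0.0
        while t < total_time:
            onsets.append(t)
            t += max(1e-4, mean_period + rng.normal(0.0, jitter_std))
        return np.array(onsets)

    onsets = qsgs_onsets(mean_period=0.01, jitter_std=0.002, total_time=2.0)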

Physical modelling—waveguides and controlled nonlinearity

Physical modelling is a bundle of methods which all aim at a common goal—modelling some of the relevant parts of sound production in real, physical instruments. In computer jargon, at emulating instruments. There are many different ways to do this, including waveguides, filter banks, the finite element method, Karplus–Strong type algorithms, and then some. What is common to all of these is that they implement different large scale theories of sound production in different types of physical objects, usually rather more complex than what has classically been analysed.

Waveguides are the prominent technology at the moment. They are based on an abstraction of sound transmission in instrument bodies and cavities as linear transmission of waves in a one-dimensional tube. The argument for woodwinds goes like this: since the inner tube of these instruments is rather thin compared to the wavelength of the sounds they emit, they can be abstracted with high precision as one-dimensional transmission lines and, by linearity, the incremental losses incurred in each infinitely short piece of the tube can be accumulated to yield a combination of linear elements. The tube is modelled as multiple two-way segments, with each direction put together from a delay line, a dispersive linear filter and a lossy linear filter. In between the sections we have scattering junctions, which represent points reflecting sound, and any modification imposed upon the sound by modulatable elements. To get life into the pipe, we put a driver into one end to take the place of the physical reed. We call this the excitation source. The reed sends pulses to the delay lines to produce sound. The pulses are either generated directly (e.g. by table lookup) or, more in tune with the idea of modelling, by letting the pressure at the end of the pipe (remember, two ways: something comes back as well) affect the amount of pressure inserted by the reed into the tube. The latter method works the way physical reeds work: by letting the excitation and the tube, as a whole, go into nonlinear oscillation. Then the delay lines are tapped at appropriate places (mainly at the end of the tube, sometimes in the middle to model directional radiation and valve leakage) for sound transmission out of the system. All this is computationally heavy (a lot of delay memory and processing power for the filters are needed), but extremely high realism can be achieved.
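
A toy plucked-string waveguide along these lines (one pair of opposite-travelling delay lines, an inverting, low-passed reflection at the "bridge" and a plain inversion at the "nut"); instead of a reed model, the excitation here is simply an initial displacement shape, so this is a sketch of the structure rather than of a full wind-instrument model.

    import numpy as np
    from collections import deque

    def waveguide_pluck(f0, sr=44100, dur=1.0, loss=0.996):
        """Two delay rails carrying right- and left-going waves; the string is
        'plucked' by loading both rails with half of a triangular displacement."""
        N = max(2, int(sr / (2 * f0)))                  # each rail is half the loop
        shape = np.interp(np.arange(N), [0, N // 3, N - 1], [0.0, 0.5, 0.0])
        right, left = deque(shape), deque(shape)
        out = np.empty(int(sr * dur))
        prev = 0.0
        for n in range(len(out)):
            r_bridge, l_nut = right[-1], left[0]        # waves arriving at the ends
            out[n] = right[-1] + left[-1]               # pickup near the bridge
            refl = -loss * 0.5 * (r_bridge + prev)      # bridge: invert, average (low-pass), damp
            prev = r_bridge
            right.pop(); right.appendleft(-l_nut)       # nut reflection enters the right rail
            left.popleft(); left.append(refl)           # bridge reflection enters the left rail
        return out

    string = waveguide_pluck(196.0)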

Some new problems arise when the one-dimensional abstraction is not as valid as in the preceding case. Good examples are such instruments as the violin (where we can, however, model the strings as being one-dimensional) and drums (where the assumption collapses completely). In these cases more accurate simulations can be achieved by creating a two or three dimensional mesh of delay lines. Now the problem becomes one of combinatorics. In dimensions from two up, the computational load grows with simulation size first as the square, then as the cube and beyond. Realism quickly becomes impossible to achieve in real time. In the case of string instruments, the resonant cavity can sometimes be modelled sufficiently accurately as a linear filter, possibly by linear prediction. The strings can mostly be fit into the one dimensional framework, except for the placing of the excitation (moving, and in the middle of the string). But the strings still produce complications, since they have multimode behavior with nonlinear coupling between the modes. (Longitudinal waves and twisting couple with the usual modes, especially at high playing volumes and when using the bow.) There is also the question of bidirectional coupling between the strings and the instrument body. When dealing with multidimensional meshes, nonlinear coupling between modes (which can depend on direction—anisotropy) complicates matters appreciably.

Waveguides are not the only way to PM wonderland, however, so a brief rundown is in order. The finite element method is based on completely different principles from the other methods and is only mentioned for completeness. It is heavy enough to be totally unusable as a real synthesis method, plus totally unmusical. Basically it is a generic method used to solve partial differential equations numerically. Cannondale uses it to design their bikes to withstand tensile stress, for instance. But as wave transmission is a phenomenon which is mathematically described by partial differential equations, such numerical solutions actually are a way to synthesize sound. FEM is only used in theoretical studies, though, since it hogs mind-boggling amounts of computing power. More in line with the application oriented note of this text, the Karplus–Strong algorithm can be thought of as a greatly simplified version of the waveguide model, in effect one with only a very simple (often first-order) filter, a one-way delay line with feedback and a single random driving waveform. The basic method works by filling the delay line with random numbers and then iteratively feeding back the average of the last two samples of the output end to the input. This creates surprisingly convincing string sounds. Modifications include inversion of certain samples in the delay line (AM, if you wish), higher order filters, fractional delay line lengths (with various kinds of interpolation to achieve the desired effect) and signals added to the delay line at specific points during the cycle. All in all, this is a very well known synthesis method and a predecessor of most of the waveguide methods. Finally, filter based methods of physical modelling rely on a more classical analysis of sound and attempt to model the response of approximately linear resonators by certain kinds of filters. One approach, appropriately named modal synthesis, handles the problem by subdividing it: the instrument is divided into parts whose characteristics are known and, as vibration analysis data is readily available in the engineering literature, the differential equations describing these parts are just looked up. After that all that needs to be done is to glue the parts together and numerically solve the resulting equations—this is often done just by creating difference equations to approximate the original ones and running the resulting algorithms against our known excitation functions. Of course, finding efficient and sufficiently accurate ways to approximate the original equations can be quite tricky indeed.
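
The basic Karplus–Strong loop just described fits in a few lines; a sketch:

    import numpy as np

    def karplus_strong(f0, sr=44100, dur=1.0, rng=np.random.default_rng(0)):
        """Fill a delay line with random numbers, then keep feeding back the
        average of the two oldest samples: a decaying plucked-string tone."""
        N = int(sr / f0)                             # delay length sets the pitch
        line = list(rng.uniform(-1.0, 1.0, N))       # the random initial burst
        out = np.empty(int(sr * dur))
        for n in range(len(out)):
            out[n] = line[0]
            line.append(0.5 * (line[0] + line[1]))   # two-point average = low-pass loss
            line.pop(0)
        return out

    pluck = karplus_strong(220.0)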

As classical instruments are quite complicated in the mathematical sense, it is an enormously time consuming task to create accurate, efficient models of them. This means that automatic analysis, or at least some good analytical tools to aid in the process, would be nice. However, the factors that originally made the instruments difficult to analyze (nonlinearities and complex physical properties) also make completely automated analysis impossible. Tools are available, of course, but most of these are more along the lines of classical spectral and statistical analysis rather than being especially suited for the task at hand. Currently this means that each instrument has to be modelled separately, from first principles, but some recent discoveries have eased the burden a little. The most important is called higher order spectral analysis (HOS). It was conceived to help in the analysis of general, nonlinear differential equations and systems and is thus quite a handy tool for the synthesist as well. The idea, here, is to track the complex dependencies between different vibratory motions appearing in a signal so that nonlinear interactions can be tracked down and isolated. This helps greatly in designing excitation sources and their coupling to the other parts of the instrument being modelled.

All in all, physical modelling is an extremely good choice for synthesis of many classical instruments, especially those of the woodwind and brass families. Its parameters directly reflect those of the real instrument and excellent emulations can be produced. Original synthesis is fairly easy on PM platforms. The downside is that serious processing power is needed, something that limits the polyphony of current PM implementations. In addition, instrument design can be very time consuming. Some types of instruments are more difficult to model as well, especially instruments with significant effects in two or more dimensions. These include e.g. drums and plates, and, to some extent, string instruments. Sometimes these problems can be solved by using modelling alongside other synthesis methods, by expanding our models to include samples as excitation, or by allowing traditional sound processing methods (effects, filtering etc.) to be applied within our instrument. Sometimes not. Progress is fast, techniques are developing constantly and the field will certainly get even more attention as time goes by and serious commercial applications continue to appear.

Time‐domain and graphical synthesis

As computers have pervaded the music industry and academia, direct manipulation and trial‑and‑error methods (as opposed to careful top‑down classical planning/composition and batch synthesis) have taken hold as a way of composing. With waveforms as the basic building blocks, sampling and digital processing have had a huge impact on how we see sound. This is the basis on which many a strange synthesis method has been built. The same goes for direct manipulation of graphical representations (mostly spectra) of sound.

Common to the methods discussed here is that they are all influenced by the view of sound as a stream of numbers, a discrete signal. Since such signals are the natural representation of audio on computers, one might ask whether this view suggests original synthesis methods. And indeed, there are synthesis methods based on Boolean and other purely numeric manipulation of discretized signals. Examples include SAWDUST, SSP, waveform interpolation synthesis and instruction synthesis. The basic premise is that since sounds are byte streams, one should treat them as such and apply methods designed for number streams or other digital data (such as computer instructions) to them. Other influences include serial composition, deconstructionist ideology and granular synthesis methods, which all suggest that it might be beneficial to adopt a truly bottom‑up view of composition. Namely, one starts from individual samples, builds series, mutates by bit‑wise, logical and numerical operations, splices and glues, mixes and transforms, interpolates for smooth transitions and then iterates, reiterates and rereiterates. Sometimes we even go as far as to treat the control data of our synthesis architecture in a similar manner. A less puristic implementation might include some of the more traditional signal tools, like filtering, as well, but these are often seen as mere concessions to a more conservative view of music and hence omitted entirely. The sonic results are different indeed: the output can be quite indistinguishable from digital noise, something you’d get if you converted a program image to sound. On the other hand, it can even be melodic, though, unlike with more traditional synthesis methods, this usually doesn’t happen by accident.
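
To give a feel for this kind of number-stream thinking, the following toy sketch (Python, purely hypothetical and not any of the systems named above) derives a sample stream from nothing but the running sample index and a handful of bit-wise and arithmetic operations, in the spirit later popularized as "bytebeat".

    def bitstream_voice(length=44100):
        # Treat the running sample index as raw data and mutate it with
        # bit-wise and arithmetic operations.
        out = []
        for t in range(length):
            value = (t * (t >> 5 | t >> 8)) & 0xFF   # arbitrary bit-twiddling recipe
            out.append(value / 127.5 - 1.0)          # map 0..255 onto -1..1
        return out

Small changes to the constants and operators produce wildly different results, which is precisely the trial-and-error appeal of the approach.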

A more recent addition to the sample based paradigm is the use of stochastic and fractal (iterative substitution) processes to shape the sample stream. A prime example is Iannis Xenakis’s dynamic stochastic synthesis, implemented in GENDY (GENeration DYnamique). The method mixes, cascades and intermodulates parametrizable stochastic processes with each other to arrive at continuously developing signals of wide variety. To impose a level of determinism, the system revolves around the concept of elastic variables and their bounds, which take the form of complicated waveshaping of the stochastic data with feedback. The process paradigm makes methods like this quite high level and theoretical. Emulating traditional musical idioms is almost impossible, but as producers of interesting new sounds and/or control data for more traditional synthesis methods, these algorithms have their place. They are also orders of magnitude faster to use than pure sample level manipulation, so as a tool on the sound designer’s workbench they work quite well.
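
A drastically simplified sketch of the random-walk principle behind such methods might look like the following; this is not Xenakis’s actual algorithm, only the general idea of waveform breakpoints wandering within elastic bounds (all names and constants are illustrative).

    import random

    def stochastic_wave(n_breakpoints=12, cycles=200, step=0.05, sample_rate=44100):
        # Each breakpoint of one waveform cycle performs a bounded random walk
        # in amplitude and (relative) duration; cycles are rendered back to back.
        amps = [random.uniform(-0.5, 0.5) for _ in range(n_breakpoints)]
        durs = [random.uniform(0.5, 1.5) for _ in range(n_breakpoints)]
        out = [0.0]
        for _ in range(cycles):
            for i in range(n_breakpoints):
                # Random walk, clipped at the "elastic" bounds.
                amps[i] = max(-1.0, min(1.0, amps[i] + random.uniform(-step, step)))
                durs[i] = max(0.2, min(2.0, durs[i] + random.uniform(-step, step)))
                n = max(2, int(durs[i] * sample_rate / 1000))   # ~1 ms per duration unit
                start = out[-1]
                for k in range(1, n + 1):                       # linear ramp to the new breakpoint
                    out.append(start + (amps[i] - start) * k / n)
        return out

Changing the step size and the clipping bounds moves the result between a relatively stable, pitched drone and something close to pure noise.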

Graphical synthesis methods are another, though somewhat more conventional, child of the computer era. They were originally driven by contemporary composers in academia, accustomed to common music notation but fighting against its inherent limitations. In part through the joint efforts of graphical artists and composers, the era of the twelve‑tone and serialist composers produced a whole slew of augmented and completely reinvented musical notations. When computers came into the picture, a desire naturally arose to employ similar graphical notations to control digital synthesis. The first systems read graphical notation on paper; later variants employed true graphical user interfaces. Unlike CMN, these new notations attempted to capture the timbre of the music as well. This leads to some difficulty, because there has never been an established vocabulary for describing such things, let alone a formal notation. This is why most graphical notations turned to spectral representations for help—modern composers and scientists readily understand the notation, and both additive and subtractive synthesis can be used to play the resulting score. Nowadays there is a multitude of tools out there which employ some sort of spectral representation of sound and allow the user to actually paint in sound. We even have some æsthetics built around the graphical paradigm of composition, mostly derived by applying the existing æsthetic of painting directly to the notation. This rather far‑fetched idea also lives on in the community developing commercial tools for graphical sound processing: some vendors have added fully fledged image manipulation capability to their products.

In practice, the spectral representation makes graphical synthesis relatively workable, especially if analysis‑resynthesis is also possible. The amount of data is enormous, as it is with sample based methods, but the massive, intuitive operations enabled by the representation ease the editing task considerably. Traditional soundscapes are profoundly difficult to create, but fresher genres of music benefit from the renewed framework; lush ambient sounds that would be unintuitive to build by other means, for instance, are usually quite easy to create. It would also seem possible to mix the bulk operations of the spectral domain with the time‑domain nitty‑gritty of sample‑level editing to advantage.
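
As a sketch of what the resynthesis end of such a tool might do, assume the "painting" has produced a plain magnitude spectrogram as a 2-D array; one crude way to turn it into sound is an inverse FFT per frame with arbitrary phases and overlap-add. The numpy-based sketch below is only an illustration of the principle and all names are hypothetical.

    import numpy as np

    def paint_to_sound(magnitudes, hop=256):
        # magnitudes: 2-D array of shape (frames, bins) -- the "painted" image.
        n_frames, n_bins = magnitudes.shape
        frame_len = 2 * (n_bins - 1)                 # frame size matching the bin count
        window = np.hanning(frame_len)
        out = np.zeros(hop * n_frames + frame_len)
        for i in range(n_frames):
            phases = np.exp(2j * np.pi * np.random.rand(n_bins))   # arbitrary phases
            frame = np.fft.irfft(magnitudes[i] * phases, n=frame_len)
            out[i * hop:i * hop + frame_len] += window * frame     # overlap-add
        return out / (np.max(np.abs(out)) + 1e-12)                 # normalize

    # A horizontal stroke in the image becomes a steady partial:
    # spec = np.zeros((200, 513)); spec[:, 40] = 1.0; audio = paint_to_sound(spec)

Real tools are far more careful about phase, but even this crude version shows why a horizontal stroke in the image ends up as a steady partial and a vertical one as a click.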

On the positive side, all of the bottom‑up methods described above are maximally precise: one can hardly expect more control than one has at the level of individual samples or Fourier coefficients. Such methods routinely produce new, unexplored sounds, nicely supporting trial‑and‑error composing practices. But the utter lack of perceptual significance of most of the operations and the truly ad hoc nature of the algorithms mean that their interest remains primarily academic. Meager results can be expected from such methods alone. However, in combination with other synthesis algorithms, such innovations can be useful. For example, many of the resulting digital timbres are excellent raw material for carefully crafted grains or attack transients for more conventional sounds. As just about all of the operations described above are transformative in nature, manipulation of imported sound data is also possible. This should please the electronica community, as the number of weird distortions enabled by low‑level modification of sample streams is virtually unlimited.

Analysis–resynthesis

Analysis‑resynthesis techniques differ from the other methods described here in that they are not stand‑alone algorithms for sound synthesis—they always require some starting material for sound construction. Here we first take a sound, analyse it, modify the analysis data and then resynthesize to create more or less similar sounds. The technique was already hinted at in the additive synthesis paragraph, because additive synthesis is the most straightforward synthesis end for most analysis algorithms. Also, the amount of control data required by additive synthesis can realistically be produced only by automated analysis of existing instrumental sounds, followed, perhaps, by some hand‑tuning for specific impressions. Good examples include such sound processing methods as vocoding (more on that in the effects section), linear prediction based synthesis of vocal/instrument hybrids and generation of whole instrument families by automatic transformation of a single member of the family (used by the additive synthesis community).

Analysis‑resynthesis is good in that it is often quite an intuitive method. It also saves drastic amounts of time when used in combination with additive synthesis, compared with raw additive work. Furthermore, its different forms may allow for extensive modification and intuitive control of existing sound parameters, making it suitable for original synthesis, transformation, mutation and automated conversion alike. The downside is that original material is required, the analysis quality is often far from perfect and great amounts of analysis data can result from processing rather simple sounds. Further, as the amount of data increases, the perceptual significance of a single parameter decreases—this results in the need for complex processing environments and extensive know‑how to manage the resulting intermediate data. Sounds from instruments with stochastic and/or nonlinear interactions often present the greatest challenge for additive analysis‑resynthesis techniques, because an immense number of low amplitude sine waves are needed to account for the highly irregular and time‑variable spectra involved. Problems of this kind are alleviated by combination with other modelling techniques, notably subtractive synthesis. A good example of this approach is spectral modelling synthesis (SMS), in which the dominant partials are taken care of by decomposition into sinusoids and the residual signal is modelled as time‑variant colored noise. Hybrids of this kind are often more viable than pure additive methods since they slice off the difficult‑to‑model parts of the signal and leave the harmonic analysis with more coherent data to work on. The result: less intermediate data with more significant parameters—an obvious win‑win situation.
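
As a rough, single-frame caricature of the sinusoids-plus-residual idea, consider the following numpy sketch; real SMS tracks partials across frames and fits a spectral envelope to the residual, neither of which is attempted here, and all names are hypothetical.

    import numpy as np

    def sinusoids_plus_residual(frame, sample_rate=44100, n_peaks=20):
        # One analysis frame: strongest FFT bins -> sinusoids, the rest -> residual.
        n = len(frame)
        window = np.hanning(n)
        spectrum = np.fft.rfft(frame * window)
        mags = np.abs(spectrum)
        peaks = np.argsort(mags)[-n_peaks:]          # crude "peak" picking by magnitude
        t = np.arange(n) / sample_rate
        sines = np.zeros(n)
        for k in peaks:
            freq = k * sample_rate / n
            amp = 2.0 * mags[k] / np.sum(window)     # rough amplitude estimate
            sines += amp * np.cos(2 * np.pi * freq * t + np.angle(spectrum[k]))
        residual = frame - sines                     # whatever the sinusoids missed
        return sines, residual

Whatever energy the chosen peaks fail to explain ends up in the residual, which is exactly the part SMS goes on to model as time-variant colored noise.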

All in all, analysis‐resynthesis methods really reside somewhere between synthesis, generic sound transformation paradigms and effects algorithms. Considering that, they are a great addition to our bag’o’tricks.

Hybrid methods and derivatives, modelling in general

As indicated above on many occasions, most synthesis methods do not perform well alone. Many of the basic algorithms do one thing well but may fail miserably when something else is desired. An excellent example is FM synthesis: certain inharmonic sounds, such as bells and tubes, are reproduced amazingly well, as are completely new synthetic sounds. But when string or woodwind sounds are needed, the method reveals its limits. Physical modelling can take care of those, but bells and tubes do not reproduce well because of the limitations of the one‑dimensional waveguide abstraction. That is why most commercial implementations of the different algorithms are hybrids: most samplers have filters, most subtractive synths have multiple waveforms and often some kind of waveform playback, many physical modelling synthesizers include at the very least a sample‑based drum kit, and greatly modified FM algorithms are favored over pure FM. Furthermore, most electronic sound generation methods of today are enhanced by the addition of a selected assortment of digital effects.

There are also many less benign reasons for this trend towards greater complexity. One of them is the nature of commerce—one has to have a distinctive product to make it to the stores. Another is the need for cost‑efficiency. Although some synthesis methods are capable of unbelievable generality (e.g. additive and physical modelling synthesis), their cost is so great that they cannot be incorporated into a mass produced synthesizer. It is cheaper to pack a few tens of megabytes of sample memory or a dozen different, computationally cheaper algorithms into a module than to design a custom ASIC to handle a sufficient number of physical modelling voices. Then there are patent and intellectual property issues—one often needs to circumvent these by adding to one’s repertoire of algorithms. Also, people want more power at their fingertips every day, especially since timbre has only now begun to play a truly important part in the fabric of modern popular music.

One final reason for the conception of highly hybrid synthesizer designs is the need to model existing instruments—in a sense, to guard the heritage. Unlike in the early days of the synthesizer industry, replication of acoustic instrumental sounds is not necessarily the main goal of synth design anymore; now one also has to be able to model the electronic instruments of the past. To this end, a multitude of analog emulation synthesizers have come to the fore. They employ a number of different techniques to achieve their goal: small scale physical modelling, digitized (sampled) versions of analog oscillators and filters (which are often extremely difficult to reproduce faithfully in discrete form; witness the 303 and the Moog ladder, the latter of which gives rise to a delay‑free feedback loop when naïvely discretized), samples of actual analog instrument sounds and ground‑up rebuilds of analog instruments as digital‑analog hybrids. How well this breed of synthesizers succeeds depends heavily on the original sound they attempt to replicate: the weirder the original instrument, the harder the job of the architect. Analog instruments often get their distinctive feel from design flaws, component weaknesses and the generally weaker stability of analog designs—all things that are difficult to spot when analysing an analog design and even more difficult to model effectively and efficiently.
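
The delay-free loop problem can be made concrete with a small sketch: a naïve digital version of a four-pole ladder filter has to delay the resonance feedback by one sample simply to be computable, and that compromise is one reason such emulations drift away from the analog original. The Python below is purely illustrative and not any particular product’s algorithm; the coefficient formula is a standard one-pole approximation.

    import math

    class NaiveLadder:
        # Naive discretization of a four-pole lowpass ladder with resonance feedback.
        # The analog feedback is instantaneous; here it is delayed by one sample,
        # which detunes the cutoff and resonance compared to the real circuit.
        def __init__(self, cutoff_hz, resonance, sample_rate=44100):
            self.g = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / sample_rate)
            self.k = resonance            # feedback amount, roughly 0..4
            self.stages = [0.0, 0.0, 0.0, 0.0]
            self.feedback = 0.0           # the extra unit delay lives here

        def process(self, x):
            u = x - self.k * self.feedback        # feedback from the *previous* sample
            for i in range(4):                    # four cascaded one-pole lowpass stages
                u = self.stages[i] + self.g * (u - self.stages[i])
                self.stages[i] = u
            self.feedback = u                     # stored for the next sample
            return u

So-called zero-delay feedback designs solve the loop equation analytically instead, at some extra cost per sample.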

Polyphony. Multitimbrality.

In describing synthesis algorithms, not much thought is usually given to their actual use or implementation details. One of the aspects usually neglected in brief treatments (such as this one) is polyphony and, with it, multitimbrality. Knowing how synthesis works is fine, but one cannot make much music before multiple voices and separate timbres can be combined. Polyphonic (as opposed to monophonic) is the word used for instruments which can generate multiple instrumental sounds at once; multitimbral describes an instrument capable of generating multiple separate timbres at once (i.e. one in which separate voices can use separate parameters and/or algorithms).

Today’s synthesizers and computer sound cards, even the low‑cost ones, are usually both polyphonic and multitimbral, which makes many people think this is the only way of doing things. In the past, however, there were many instruments which were either monophonic or monotimbral, or which had only limited multitimbral capabilities.

The usual reason for limiting these capabilities is implementation complexity: it is surprisingly difficult to build cost‑efficient hardware that allows for the complex setups and signal routings required by full multitimbrality. Sometimes one can also drastically optimize the implementation if timbrality is restricted: wavetables can be reused, fewer translation tables need to be kept for interpreting modulator data, effect routings can be simpler and so on. As for polyphony, some synthesis algorithms are so complex that available/affordable hardware cannot support more than monophonic operation. Full‑blown physical modelling comes very close: Yamaha’s original VL‑1 was only duophonic (i.e. it had two voices) and two‑part multitimbral. For the same reason, the more computationally intensive algorithms often limit the available polyphony.

How is polyphonic/multitimbral operation implemented, then? One of the more common ways is to use a signal processor (possibly augmented with some special purpose hardware for the routine calculations involved), divide its computing capacity into equal parts, use these time slots to implement voices and use a separate microprocessor for control (e.g. enveloping, modulation, MIDI, user interface,…). Almost all commercially available synthesizers use this approach, varying only in the type of processors, software and auxiliary chips employed. With this approach, full multitimbrality more or less comes for free—since each voice is separate from the others, it can use its own local copy of the synthesis parameters and so produce any timbre desired. This way, one gets a bank of voices onto which a musical performance can then be mapped. And if full multitimbral operation is available, this mapping can be quite complicated indeed—a single logical note can map onto multiple simultaneous notes (layering), with more than one individually controlled subpart (vector synthesis), each with many component events (wave sequencing), sometimes with quite a bit of control data, parameters and even multiple separate synthesis algorithms flying around. Such mapping is the second reason for limited polyphony in current instruments—the resources are there, but patches tend to use more of them. It is no wonder, then, that to avoid missing musically significant events, we need to systematically weed out resource allocations of little sonic importance. This is the subject of the next section on voice allocation.
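
In software the voice bank and the note mapping can be sketched very compactly; the point of the toy Python below (all names hypothetical, synthesis itself omitted) is simply that each voice carries its own local copy of the patch, so multitimbrality and layering fall out of the structure for free.

    class Voice:
        # One physical voice slot with its own local copy of the patch parameters.
        def __init__(self):
            self.active = False
            self.note = None
            self.params = {}

        def start(self, note, patch):
            self.active = True
            self.note = note
            self.params = dict(patch)     # local copy: other voices keep their own timbre


    def note_on(note, layers, voice_pool):
        # Map one logical note onto several physical voices (layering); each
        # layer gets its own patch, so one key press may consume many voices.
        used = []
        for patch in layers:
            free = next((v for v in voice_pool if not v.active), None)
            if free is None:
                break                     # out of voices; a real synth would steal one
            free.start(note, patch)
            used.append(free)
        return used


    # A two-layer piano+strings patch eats two voices per key press:
    pool = [Voice() for _ in range(16)]
    note_on(60, [{"name": "piano"}, {"name": "strings"}], pool)

A two-layer patch consuming two physical voices per key press is exactly how elaborate mappings end up straining the polyphony budget.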

Voice allocation

There are basically two ways to map logical instruments onto physical voices. The first is fixed mapping: instrument x always uses voice y. This is the approach used by monophonic instruments and older tracker type composition software: one instrument per channel, one channel per voice. The second way is dynamic allocation: a new musical event is mapped to a free physical output voice at the time of its creation. Most instruments use the latter approach, because it brings with it a useful abstraction: from the user’s perspective, the instrument behaves as if it had practically infinite polyphony, while from the implementation’s perspective, the hardware realizes only a fixed number of physical voices and tries to allocate them to capture the most significant logical events. This might seem quite abstract, but it decouples the logical and physical sides—one can implement the same instrument with varying degrees of polyphony. The implementation approximates the perfect ∞‑phonic instrument to an implementation dependent degree, and the same songs will play, albeit with varying levels of sonic accuracy, on all synthesizers of a series.

This is good, but embodies a problem—how is the logical‑to‑physical mapping best done? And no slight problem that is: it is extremely difficult to determine algorithmically which of a multitude of competing events should be realized and which—if any—can be discarded. Two crude heuristics are commonly employed to solve the dilemma: the first discards the oldest note still sounding, the other throws away the quietest. Both give similar results, since in Western music notes tend to die away rather quickly. (I.e., Western music tends to have notes…) Sometimes, to aid in simulating ensembles of independent instruments, voices can be divided into banks (say, a minimum of six voices for a guitar) and the allocation algorithm run only within a bank.
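
Both heuristics are easy to sketch. The toy Python below uses hypothetical structures rather than any particular instrument’s firmware: it picks a free voice when one exists and otherwise steals either the oldest or the quietest one, optionally within a reserved bank only.

    def allocate_voice(voices, policy="oldest"):
        # Use a free voice if there is one; otherwise steal according to a heuristic.
        free = [v for v in voices if not v["active"]]
        if free:
            return free[0]
        if policy == "oldest":
            return max(voices, key=lambda v: v["age"])       # discard the oldest note
        return min(voices, key=lambda v: v["amplitude"])     # discard the quietest note


    # A pool of eight physical voices, each just a bundle of state here.
    voices = [{"active": False, "age": 0, "amplitude": 0.0} for _ in range(8)]

    # Bank reservation: keep the first six voices for the "guitar" and run the
    # allocator only inside that sub-pool.
    guitar_bank = voices[:6]
    chosen = allocate_voice(guitar_bank, policy="quietest")
    chosen.update(active=True, age=0, amplitude=1.0)

Reserving a sub-pool per logical instrument keeps, say, a sustained guitar chord from being eaten by a busy drum part.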