Sound analysis and visualisation

The first thing to know, before trying to generate or modify sound signals, is how existing sound is visualised and how some established parameters can be extracted from it. This matters because it builds an intuitive picture of what is happening and offers a chance to learn more about common sound sources and, more generally, about the way we hear sound.

Waveform visualisation

From the mathematical standpoint, this is the most straightforward way of visualising sound: since we model sound as a signal, i.e. a function, the easiest way to represent it is to plot the function. What results is just a plot of sound pressure variations against time at some point in space. The scales may vary (sometimes it is convenient to draw the pressure axis logarithmically), but the most common form is linear pressure against linear time. Since such visualisation is usually done by computers, natural signals (which are, to great accuracy, continuous and differentiable) are drawn in time-sampled form, leading to some choppiness: as time is discrete in this case, the signal certainly does not appear differentiable, or even continuous, unless we generate a continuous curve between adjacent sample times. This choppiness is often reduced by interpolating between sample values. Most often linear interpolation is used, i.e. we draw a straight line between adjacent sample values, although scaling the signal considerably (zooming in) creates the need for more sophisticated interpolation (often cubic B-splines or equivalent) to give a coherent picture of how the signal should look.
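
To make the interpolation point concrete, here is a minimal Python sketch (assuming numpy, scipy and matplotlib are available; all names and parameter values are illustrative) that renders a zoomed-in view of a sampled sine with both linear and cubic interpolation between the sample points:

    import numpy as np
    from scipy.interpolate import CubicSpline
    import matplotlib.pyplot as plt

    fs = 8000.0                            # sample rate in Hz (arbitrary)
    n = np.arange(32)                      # 32 samples of a 440 Hz sine
    x = np.sin(2 * np.pi * 440.0 * n / fs)

    t_fine = np.linspace(0, n[-1], 1000)   # dense time axis for the zoomed view
    x_lin = np.interp(t_fine, n, x)        # straight lines between samples
    x_cub = CubicSpline(n, x)(t_fine)      # smooth cubic interpolation

    plt.plot(n, x, "o", label="samples")
    plt.plot(t_fine, x_lin, label="linear interpolation")
    plt.plot(t_fine, x_cub, label="cubic spline")
    plt.legend()
    plt.show()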

Usually waveform plots can be produced at different scales so that we can see different levels of detail in a sample. When scaling, aliasing of course becomes a problem if we down-sample (fit a longer waveform segment on the screen at once): we no longer visualise the original signal but an aliased version of it, which can be quite ugly indeed. The simplest response is to ignore the problem. This leads to workable visualisation if accurate editing is not a requirement, but we miss some of the detail in the signal (for example, a signal rich in high-frequency components gets aliased so that it is quite difficult to spot amplitude envelopes and other detail). Another way would be to use an anti-alias filter, but then we would discard the high-frequency part entirely (some editing software does this, and it works if we are interested in highly composite signals where we can safely assume that most of the significant information is in the low frequencies). All this happens because we cannot down-sample without losing information. So how about encoding at least some of the lost information in another way? (This is possible since we now have a two-dimensional, often colorable surface on which we can fit much more data than in a 1-D signal.) This leads to a simple solution which preserves amplitude envelopes quite well: for each sample period of the down-sampled signal, find the largest and smallest values the original signal takes during the period and draw the period as a vertical bar between these values, possibly with some smoothing between adjacent sample periods. This approach, and quite a few variations thereof, is used in practical applications and seems to be the best for audio editing purposes where the overall amplitude envelope of the signal is a key factor.
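
A sketch of this min/max reduction in Python with numpy (the function name and the column count are illustrative; real editors add smoothing and incremental updates on top of this idea):

    import numpy as np

    def minmax_reduce(signal, samples_per_pixel):
        """Return (mins, maxs), one pair per display column."""
        n_cols = len(signal) // samples_per_pixel
        cols = signal[:n_cols * samples_per_pixel].reshape(n_cols, samples_per_pixel)
        return cols.min(axis=1), cols.max(axis=1)

    # Example: one second of a decaying tone shown in about 800 pixel columns.
    fs = 44100
    t = np.arange(fs) / fs
    x = np.sin(2 * np.pi * 440 * t) * np.exp(-3 * t)
    mins, maxs = minmax_reduce(x, len(x) // 800)
    # A display would draw, for column i, a vertical bar from mins[i] to maxs[i].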

Amplitude envelope extraction—VU meters and envelope followers

Envelope extraction was already touched upon above. The idea here is to give an indication of the overall amplitude of the signal at a given moment in time. The problem is a bit more subtle than it would seem at first: how, exactly, do we define amplitude at some specific time? Is it the actual amplitude of the signal (meaning it can contain arbitrarily high frequency components) or a somehow less time-localized version? Should all frequencies be handled equally? In the case of non-localized representations, what aspect of the signal do we want to use: a smoothed version, some nonlinear function of the original signal (e.g. the absolute maximum value), or some combination?

Usually one of two main paths is taken; both result from taking a nonlinear function of the signal over some time window. The first yields an RMS (Root Mean Square) amplitude envelope, obtained by low-pass filtering (averaging) a squared version of the signal over some time period (usually from 10 to 100 milliseconds) and taking a square root. Through some Fourier analysis (essentially Parseval's relation), the result is roughly proportional to the aggregate power present in the signal over the whole frequency spectrum. The second approach takes the maximum absolute value of the signal over the time period we are interested in, producing the so-called peak amplitude envelope. Both methods can be implemented rather easily in analog circuitry and have been used since the inception of recording technology. The difference between them is that even if two signals have exactly the same frequency content, phase relationships between the components can lead to widely varying peak amplitudes. Since human hearing is much more sensitive to frequency variations than to phase relationships, the perceived loudness of a signal correlates better with the RMS envelope. However, electrical circuits are time-domain, not frequency-domain devices, so peak amplitude envelopes are more relevant to them; for example, clipping occurs (especially in digital equipment) if the signal has large peak amplitudes. This means that when matching a program source to a digital transmission channel, amplitude adjustments are usually made based on peak amplitudes, not RMS.
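
Both extractors are easy to sketch in Python with numpy (the 30 ms window is an arbitrary choice, and the moving average stands in for a generic low-pass filter):

    import numpy as np

    def rms_envelope(signal, fs, window_s=0.030):
        # Low-pass (moving average) the squared signal, then take the root.
        win = int(fs * window_s)
        kernel = np.ones(win) / win
        return np.sqrt(np.convolve(signal ** 2, kernel, mode="same"))

    def peak_envelope(signal, fs, window_s=0.030):
        # Maximum absolute value over a sliding window (naive, O(n * win)).
        win = int(fs * window_s)
        padded = np.pad(np.abs(signal), (win // 2, win // 2))
        return np.array([padded[i:i + win].max() for i in range(len(signal))])

    fs = 44100
    t = np.arange(fs) / fs
    x = np.sin(2 * np.pi * 440 * t) * (t < 0.5)   # a tone that stops halfway
    rms = rms_envelope(x, fs)
    peak = peak_envelope(x[:4410], fs)            # keep the naive loop short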

After extracting an amplitude parameter, all that needs to be done is to visualise it somehow. One way is to interpret it as a time-varying quantity, i.e. a signal, and plot it as a function, but this is usually more detail than we need. VU meters provide an alternative: here we present the time-varying amplitude as a stack of lights whose height depends on the momentary amplitude. Color coding the lights (usually LEDs) lets us see immediately if the signal is too loud for our equipment to handle or if there is no signal at all (a cable has come loose, etc.). In some applications, only an indication of some binary condition is needed. Here we can simply threshold the amplitude and give a visible signal (e.g. turn on a red LED) if the condition holds. Such indicators are seen everywhere in studio equipment; examples include the little red lights indicating analog input overshoot and the little green ones signaling a present signal.
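
Such indicator logic might be sketched as follows (the thresholds, the dB range and the LED count are illustrative assumptions, not standardised values):

    import numpy as np

    def indicators(amplitude, clip_level=0.99, signal_floor=1e-4, n_leds=12):
        clip = amplitude >= clip_level           # drive the red overload LED
        present = amplitude > signal_floor       # drive the green signal LED
        # Map amplitude onto a -60..0 dB scale spread across the LED stack.
        db = 20 * np.log10(max(amplitude, 1e-12))
        lit = int(np.clip((db + 60) / 60 * n_leds, 0, n_leds))
        return clip, present, lit

    clip, present, lit = indicators(0.5)
    print(f"clip={clip} present={present} LEDs lit={lit}/12")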

Of course, now that we have extracted an abstract parameter from the signal, nobody says we actually have to visualise it—we can do anything with it. This is the idea of envelope followers—they extract instantaneous amplitude information from signals and pass it on for processing. Most effects, for example, can be controlled by amplitude envelopes extracted from the input signal. This is how automatic wah‐wah pedals and other such devices work.
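
For instance, a minimal envelope follower with separate attack and release time constants might look like this (a sketch; the time constants and the cutoff mapping are illustrative):

    import numpy as np

    def envelope_follower(signal, fs, attack_s=0.005, release_s=0.100):
        a_att = np.exp(-1.0 / (fs * attack_s))    # per-sample attack coefficient
        a_rel = np.exp(-1.0 / (fs * release_s))   # per-sample release coefficient
        env = np.zeros(len(signal))
        level = 0.0
        for i, s in enumerate(np.abs(signal)):
            coef = a_att if s > level else a_rel  # rise quickly, fall slowly
            level = coef * level + (1.0 - coef) * s
            env[i] = level
        return env

    # An auto-wah would then map the envelope onto a filter frequency, e.g.
    #   cutoff = 300.0 + 2000.0 * env            # Hz, purely illustrative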

Spectral analysis

 ‐multiple bandpass
 ‐FFT (see the sketch after this list)
 ‐MDCT
 ‐ARMA
 ‐wavelets
 ‐Prony
 ‐McAulay‐Quatieri (MQ)
 ‐autocorrelative/autoregressive
 ‐Wigner distributions
 ‐statistical maximum likelihood methods
 ‐line/sine decompositions (as in MPEG-4 low-rate voice codecs!)
 ‐LPC and formant extraction (interpolation and peak estimation)
 ‐multiresolution variants of conventional spectral decompositions
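
As a concrete instance of the FFT item above, the windowed magnitude spectrum of a single frame can be computed with numpy alone (frame length, window and test signal are arbitrary choices):

    import numpy as np

    fs = 44100
    t = np.arange(2048) / fs
    frame = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 880 * t)

    windowed = frame * np.hanning(len(frame))     # taper to reduce leakage
    spectrum = np.fft.rfft(windowed)
    freqs = np.fft.rfftfreq(len(frame), 1 / fs)
    magnitude_db = 20 * np.log10(np.abs(spectrum) + 1e-12)

    peak = freqs[np.argmax(magnitude_db)]
    print(f"strongest component near {peak:.1f} Hz")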

Energy waterfalls

For overcomplete time/frequency analysis (overlapping, redundant frames) ⇒ the display always looks ’smooth’!
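
A sketch of computing such a waterfall with an overlapping short-time Fourier transform (the 75 % overlap makes adjacent slices similar, which is where the smoothness comes from; all sizes are illustrative):

    import numpy as np

    def stft_waterfall(signal, fs, frame=1024, hop=256):
        # 75 % overlapping frames: an overcomplete analysis, hence the
        # smooth appearance of adjacent slices in the display.
        window = np.hanning(frame)
        n_frames = (len(signal) - frame) // hop + 1
        rows = [np.abs(np.fft.rfft(signal[i * hop:i * hop + frame] * window)) ** 2
                for i in range(n_frames)]
        return np.array(rows)                     # shape: (time, frequency)

    fs = 8000
    t = np.arange(fs) / fs
    chirp = np.sin(2 * np.pi * (200 + 600 * t) * t)   # rising tone
    energy = stft_waterfall(chirp, fs)
    print(energy.shape)                           # one row per waterfall slice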

Pitch tracking

 ‐zero crossings
 ‐autocorrelation analysis (see the sketch after this list)
 ‐FFT binning and interpolation
 ‐cepstrum based analysis (not dependent on the presence of the fundamental; problems with nonlinearity and with harmonic sounds that have little content besides the fundamental)
 ‐autocovariance
 ‐cochleagram and/or gammatone filterbank based functional descriptions of the ear
 ‐Eric Scheirer’s work on modelless cybernetics principles
 ‐the old+new heuristic
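
Of the approaches listed, autocorrelation is perhaps the simplest to sketch (the lag search range and the test tone are illustrative assumptions):

    import numpy as np

    def pitch_autocorr(frame, fs, f_lo=60.0, f_hi=800.0):
        frame = frame - frame.mean()
        ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
        lo, hi = int(fs / f_hi), int(fs / f_lo)   # lag range to search
        lag = lo + np.argmax(ac[lo:hi])
        return fs / lag

    fs = 16000
    t = np.arange(1024) / fs
    frame = np.sin(2 * np.pi * 220 * t)           # 220 Hz test tone
    print(f"estimated pitch: {pitch_autocorr(frame, fs):.1f} Hz")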

Tempo tracking and rhythm extraction

 ‐amplitude/power extraction/thresholding
 ‐autocorrelation (a sketch combining both items follows)
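
A sketch combining the two items: extract a coarse power envelope, then autocorrelate it and pick the strongest lag within a plausible tempo range (the synthetic click track and all parameters are illustrative):

    import numpy as np

    fs = 8000
    duration, bpm = 8.0, 120.0
    x = np.zeros(int(fs * duration))
    x[::int(fs * 60.0 / bpm)] = 1.0               # impulses on each beat

    # Coarse power envelope: mean of |x| over 10 ms blocks.
    block = fs // 100
    env = np.abs(x[:len(x) // block * block]).reshape(-1, block).mean(axis=1)

    # Autocorrelate the envelope; pick the strongest lag in the 40-240 BPM range.
    env0 = env - env.mean()
    ac = np.correlate(env0, env0, mode="full")[len(env0) - 1:]
    env_rate = fs / block                         # envelope samples per second
    lo, hi = int(env_rate * 60 / 240), int(env_rate * 60 / 40)
    lag = lo + np.argmax(ac[lo:hi])
    print(f"estimated tempo: {60.0 * env_rate / lag:.1f} BPM")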