Hearing, physiological and psychological aspects of

Before going further into actual sound system issues and mathematics, it is important to know the significance of human sound perception. The aim in this section is to shed some light on the physiological, neuropsychological and cognitive mechanisms which take part in our hearing of sound. Mostly, it is comprised of a brief treatment of the field of psychoacoustics, although an amount of physics, human anatomy and relevant psychology is explained as well—all these have an important role in explaining what sound is to us. This section is a long one, the reason being that not many things in sound and music are interesting to humans outside the context of how we hear and interpret what was heard. Were it not for our sense of hearing and all the cognitive processing that goes along with it, the physical phenomenon of sound would be only just that—a physical phenomenon. Not many would have any interest in such a thing.

Knowledge of how we hear and why is thus paramount to understanding the relevance of the many algorithms, mathematical constructs and the general discipline of audio signal processing we encounter further on. This understanding helps explain why some synthesis methods are preferred over others, what it is that separates pleasant, harmonious music from horrifying noise, what it is that comprises the pitch, timbre and loudness of an instrument, what makes some sounds especially natural or fat, where the characteristic sound of some particular brand of equipment comes from and what assumptions and simplifications can be made in storing, producing and modifying sound signals. Basic knowledge of psychoacoustics can also help avoid some of the common pitfalls in composition and sound processing and suggest genuine extensions to one’s palette of musical expression.

What is psychoacoustics all about?

To put it shortly, psychoacoustics is the field of science which studies how we perceive sound and extract meaningful data from acoustical signals. It concerns itself primarily with low level functions of the auditory system and thus doesn’t much overlap with the study of music or æsthetics. Basic psychoacoustical research is mainly directed toward such topics as directional hearing, pitch, timbre and loudness perception, auditory scene analysis (the separation of sound sources and acoustical parameters from sound signals) and related lower functions, such as the workings of our ears, neural coding of auditory signals, the mechanisms of interaction between multiple simultaneously heard sound sources, neural pathways from ears to the auditory cortex, their development and the role of evolution in the developement of hearing. Psychoacoustical research has resulted in an enormous amount of data which can readily be applied to sound compression, representation, production and processing, musicology, machine hearing, speech recognition and composition, to give just a few examples. The reason why such a long part of this document is devoted to psychoacoustics, is that although one can understand sound synthesis and effects fairly well just by grasping the relevant mathematics, one cannot truly get a hold onto their underlying principles, shortcomings and precise mechanisms of action before considering how the resulting sound is heard. Human auditory perception is a rather quirky and complicated beast—it often happens that sheer intuition simply doesn’t cut it.

The structure and function of the ear

There are three main parts in the human ear: outer, middle and inner ear. The outer ears include ear lobes (pinnæ) and the ear canals. Between outer and middle ear, the tympanic membrane (or eardrum) resides. The middle ear is a cavity in which three bones (called malleus, incus and stapes) reside. Malleus is attached to the tympanic membrane, stapes to the oval window which separates the inner and middle ears. Incus connects these two. The three bones (collectively called ossicles) form a kind of lever which transmits vibrations from the tympanic membrane to the fluid filled inner ear, providing an impedance match between the easily compressed air in the outer ear and the noncompressible fluid in the inner. To these bones the smallest muscles in our body, the middle ear muscles, are attached. They serve to dampen the vibration of the ossicles when high sound pressures are met, thereby protecting the inner ear from excessive vibration. The inner ear is composed of the cochlea, a little, bony, seashell shaped apparatus which eventually senses sound waves, and the vestibular apparatus. All these structures are incredibly small—the ear canal measures about 3 centimeters in length and about one half in diameter, the middle ear is about 2 cubic centimeters in volume and the cochlea is, when rolled straight, about 35 millimeters in length and 2 millimeters in diameter.

In the cochlea, we find an even smaller level of detail. The cochlea is divided into three longitudinal compartments, scala vestibuli, scala tympani and scala media. The first two are connected through the apex in the far end of the cochlea, the middle one is separate from the others. The vibrations from the middle ear reach the cochlea through the oval window which resides in the outer end of scala vestibuli. In the outer end of scala tympani, the round window connects the cochlea to the middle ear for the second time. Vibrations originate from the oval window, set the intermediate membranes (Reissner’s membrane between scala vestibuli and scala media, the basilar membrane between scala media and scala tympani) in movement and get damped upon reaching the round window. On the floor of scala media, under the tectorial membrane, lies the organ of Corti. This is where neural impulses associated with sound reception get generated. From the bottom of the organ of Corti, the auditory nerve emanates, headed for the auditory nuclei of the brain stem.

The organ of Corti is the focal point of attention in many basic psychoacoustical studies. It is a complex organ, so we will have to simplify its operation a bit. For a more complete description, see Kan01. That book is also a good general reference on neural structures. On top of the Corti organ, two lines of hair cells, covered with stereocilia (small hairs of actin filaments which sense movement) stand. On the outside of the cochlea is the triple line of outer hair cells, on the inside the single line of inner hair cells. The other ends of the stereocilia are embedded in the overlying tectorial membrane. This arrangement means that whenever the basilar membrane twitches, the stereocilia get bent between it and the tectorial membrane. Obviously, the pressure changes in scala vestibuli result in just such action, which means that sound results in bent stereocilia. This in case leads to neural impulses being generated. These are lead by the afferent auditory nerves towards the brain. The inner and outer hair cells are innervated rather differentely—it seems that the inner ones are mainly associated with louder and the outer with quieter sounds (see Gol01). Also, some efferent innervation reaches the outer hair cells, so it is conceivable that the ear may adapt under neural control, possibly to aid in selective attention (Kan01).

From air to brain—spectral analysis, tonotopic organization and time domain coding

By now, the basic function of the ear should be quite clear. However, nothing has been said about how the ear codes the signals. It is well known that neurons cannot fire at rates exceeding 500‐1000Hz. Neurons also primarily operate on binary pulses (action potentials)—there either is a pulse or there is not. Direct encoding of waveforms does not come into question, then. And how about amplitude? To answer these questions, something more has to be said about the structure of the cochlea.

When we look more closely at the large scale structure of the organ of Corti, we see a few interesting things. First, the width of the basilar membrane varies over the length of the cochlea. Near the windows, the membrane is quite narrow whereas near the apex, the membrane is quite a bit wider. Similarly, the thickness and stiffness vary—near the windows they are considerable whereas near the apex they are much less. And the same kind of variation is repeated in the hair cells and their stereocilia—near the apex, longer, more flexible hairs prevail over the stiffer, shorter stereocilia of the base of the cochlea. All this has a serious impact on the vibrations caused in the Corti organ by sound—vibrations of higher frequency tend to cause response mainly near the windows where the characteristic vibrational frequency of the basilar membrane is higher, lower frequencies primarily excite the hair cells near the apex. This means that the organ of Corti performs physical frequency separation on the sound. This separation is further amplified by the varying electrical properties of the hair cells, which seem to make the cells more prone to excitation at specific frequencies. All in all, the ear has an impressive apparatus for filter bank analysis. From the inner ear, this frequency decomposed information is moved by the auditory nerve. The nerve fibers are also sorted by frequency, a pattern repeated all over the following neural structures. This is called tonotopic organization.

When sound amplitude varies, the information should be coded somehow. This is an area of study which is still going on strong. This means that a complete description cannot be given here, but most relevant points are mentioned, anyhow. One mechanism of coding involves the relative firing frequencies and amplitudes of the individual neurons—the more excitation there is, the more there is neural activity in the relevant auditory neurons. Since frequency information is carried mainly by tonotopic mapping of the neurons, this doesn’t pose a problem of data integrity. A second mechanism which seems to augment the transmission is based on the fact that as louder sounds impinge upon the ear, the width of resonance on the basilar membrane increases. This may cloud the perception of nearby frequencies but can also be used to deduce the amplitude of the dominant component. The efferent innervation to the outer hair cells and afferent axons from the inner hair cells also seem to play a part in loudness perception—there is evidence suggesting that the ear can adapt to loud sounds, keeping the ranges in check.

So we now we have a rough picture of how amplitude and frequency content are carried over the auditory nerve. How about time information, then? Considering the high time accuracy of our hearing (in the order of milliseconds, at best), mere large scale time variation in neural activity (governed by the many resonating structures on the signal path and the inherent limitation on neuron firing rate) does not seem to explain everything. When investigating this mystery, researchers ran into an interesting phenomenon, namely, phase locking, which has also served as a complementary explanation to high frequency pitch perception. It seems that hair cells, in addition to firing more often when heavily excited, tend to fire at specific points of the vibratory motion. This means that the firing of multiple neurons, although mutually asynchronous, concentrate at a specific point of the vibratory motion. This phenomenon has been experimentally demonstrated for frequencies as high as 8kHz. It is conceivable, then, that many neurons working in conjunction could directly carry significantly higher frequencies than their maximum firing rate would at first sight suggest. This result has been experimentally confirmed as well as its role in conveying accurate phase information to the brain (this is important in measuring interaural phase differences and, consequently, plays a big part in directional hearing). It also serves as a basis to modern theories of pitch perception through what is called periodicity pitch, pitch determination through the period of a sound signal. The idea of concerted action of phase locked neurons carrying frequency information is called the volley principle and augments the frequency analysis (place principle) interpretation introduced above. This time domain coding is extremely important because it seems that accurate frequency discrimination cannot be explained without it—place codings display a serious lack of frequency selectivity, even after considerable neural processing and enhancement.

Furthermore, it is now known that the hair cells of the inner ear only react to vibrations in one direction, i.e. during one halfwave of a periodic vibration. This is an intrinsically nonlinear mechanism. When more than one frequency is present, this mode of response offers a highly elegant explanation to the phenomenon of difference tones, hybridization products which arise during the detection process of complex sounds. Earlier these were simply explained as resulting from the action of nonlinearities in the middle ear at loud volumes. Now it seems that that explanation does not really cover everything. Further, problems with the so called missing fundamental type tones (periodic sounds which take on the pitch of the first harmonic, even if the first (few) harmonic(s) is (are) not actually present) seem to benefit greatly from analyses which take into account this type of nonlinearity in the ear. See Wal01 and section 4.10 on pitch perception.

The auditory pathway: nerves, nuclei and their roles in auditory analysis

Now the function of the auditory system has been described upto the auditory nerve. What about after that? The eighth cranial nerve, most of which is an extension of the innervation of the inner ear (the rest being mainly concerned with the sense of balance), carries the auditory traffic to the brain stem. Here the auditory nerve passes through the cochlear nuclei, which start the neural processing and feature extraction process. Upon entering the cochlear nucleus, the auditory nerve is divided in two. The upper branch goes to the upper back quarter of the nucleus while the lower branch innervates the lower back quarter and the front half. The cochlear nuclei display clear tonotopic organization, with high frequencies mapped to the centre and lower frequencies mapped to the surface. The ventral (back) side of the nucleus is made up from kinds of cells, bushy and stellate (starlike). Stellate cells respond to single neural input pulses by a series of evenly spaced action potentials of a cell dependent frequency (this is called a chopper response). The stellate cells have long, rather simple dendrites. This suggests that the stellate cells gather pulses from many lower level neurons, and extract precision frequency information from their asynchronous outputs. Their presence supports one of the theories of frequency discrimination, which speculates on the presence of a timing reference in the brain. The bushy cells, on the other hand, have a fairly compact array of highly branched dendrites (whence the name) and respond to depolarization with a single output pulse. This suggests they are probably more concerned with time‐domain processing. It seems bushy cells extract and signal the onset time of different frequencies in a sound stimulus. There are also cells, called pausers, which react to stimuli by first chopping a while, then stopping, and after a while starting again. These may have something to do with estimating time intervals and/or offset detection.

Following the cochlear nuclei, the auditory pathway is divided in three. The dorsal (front side) acoustic stria crosses the medulla, along with the intermediate acoustic stria. The most important, however, is the trapezoid body, which is destined to the next important processing centre, the superior olivary nucleus. The olives are a prime ingredient in directional hearing. Both nuclei receive axons from both the ipsilateral (same side) and contralateral (opposite side) cochlear nuclei. The medial (closer to the centre of the body) and lateral (closer to the sides of the body) portions of the nuclei serve different functions: the medial part is concerned with measuring interaural time differences while the lateral half processes interaural intensity information. Time differences are measured by neurons which integrate the information arriving from both ears—since propagation in the preceding neurons is not instantaneous and the signals from the ears tend to travel in opposite directions along the pathways, this system works as a kind of correlator. The coincidence detector is arranged so that neurons closer to the opposite side of the sound source tend to respond to it. Similarly, the intensities are processed—contralateral signals excite and ipsilateral signals inhibit the response of the intensity detector. These functions are carried out separately for different frequency bands and are duplicated in both superior olivary nuclei, although the dynamics of the detection process mainly place the response on the opposite side of the signal source.

After leaving the olives, the axons rejoin their crossed and uncrossed friends from the cochlear nuclei. They then progress upwards—this time the bundle of axons is called the lateral lemniscus. The lemniscus ascends first through pons where an amount of crossing between the lateral pathways is observed. This happens through Probst’s commissure which mainly contains axons from the nuclei of the lateral lemniscus. From here, the lane continues upward to the midbrain (more specifically to the inferior colliculus) where all the axons finally synapse. This time there seems not to be any extensive crossing. It would appear that the inferior colliculus has something to do with orientation and sound‐sight coordination—the superior colliculus deals with eye sight and there are some important connections to be observed. Also, there is good evidence that topographic organization according to the spatial location of the sound is present in the inferior colliculi. It is noticeable that while we trace the afferent auditory pathway through to the lateral lemniscus and the inferior colliculus, the firing pattern of the neurons changes from flow‐like excitation to an onset/offset oriented kind. More on this can be found in the sections on transients and time processing. The pathway is then extended upwards to the medial geniculate nuclei just below the forebrain which then, finally, projects to the primary auditory cortex on the cerebrum.

One special thing to note about the geniculate nuclei is that they, too, are divided into parts with apparently different duties. The ventral portion displays tonotopic organization, whereas the dorsal and medial (magnocellular) parts do not. The ventral part projects to tonotopically organized areas of the cortex, the dorsal part nontonotopic ones and the medial part to both. In addition, the magnocellular medial geniculate nuclei display a certain degree of lability/plasticity. This means it may have considerable part in how learning affects our hearing. A noteworthy fact is that the nontonotopically organized parts of the geniculate nuclei and the cortex are considerably less well known than their tonotopic counterparts—complex, musically relevant mappings might be found there, in the future. Throughout the journey, connections to and from the reticular formation (which deals with sensomotoric integration, controls motivation and maintains the arousal and alertness in the rest of the central nervous system) are observed. Finally, the auditory cortex is located on the surface of the temporal lobes. And just to add to the fun, there is extensive crossing here, as well. This time it takes place through corpus callosum, the highway between the right and left cerebral hemispheres.

In the way to the auditory cortex, extensive mangling of information has already taken place. It is seen, for example, that although the tonotopic organization has survived all the way through the complex pathways, it has been multiplied, so that there are now not one but several frequency maps present on the auditory cortex. The structural organization is more complex, here, also. Like most of the cortex, the auditory cortex is both organized into six neuronal layers (which mainly contain neuronal cell bodies) and into columns (which reach through the layers). The layers show their usual pattern of external connections: layer IV receives the input, layer V projects back towards the medial geniculate body and layer VI to the inferior colliculus. The columns, on the other hand, serve more specialized functions and the different types are largely interspersed among one another. Binaural columns, for instance, show an alternating pattern of suppression and addition columns—columns which differentiate between interaural features and and those which do not, respectively. Zoning of callosally connected and nonconnected areas is also observed. Further, one must not forget that there exist areas in the brain which are mainly concerned with speech production and reception (the areas of Wernicke and Broca, respectively). They are specific to humans although some similar formations are present in the brain of other animals, especially if they are highly dependent on auditory processing (bats and dolphins are examples with their echo location and communication capabilities).

All in all, the functional apparatus of the brain concerned with auditory analysis is of considerable size and complexity. One of the distinctive features of this apparatus is the extensive crossing between the two processing chains—one of the most peculiar aspects of hearing is that while the usual rule of processing on the wrong side is generally observed, the crossing distributes the processing load so that even quite severe lesions and extensive damage to the cortex need not greatly disturb auditory functions.

Steady‐state vs. transient sounds. The attack transient. Vowels and consonants.

In the last section, it became apparent that the brain has an extensive apparatus for extracting both time and frequency information from sounds. In fact, there are two separate pathways for information: one for frequency domain and the other for time domain data. This has far reaching consequences for how we hear sound signals. First of all, it means that any perceptually significant analysis or classification of sound must include both time and frequency. This is often forgotten in the traditional Fourier analysis based reasoning on sound characteristics. Second, it draws a kind of dividing line between sound signals whose main content to us is in one of the domains. Here, this division is used to give meaning to the often encountered terms transient and steady‐state; we take the first to mean time oriented, and the second frequency oriented. Another (rather more rigorous) definition of steady‐state is based on statistics. In this context, a signal is called steady‐state if it is stationary in the short term and transient if it is not.

The root of this terminology lies in linear system analysis. There steady‐state means that a clean, often periodic or almost constant excitation pattern has been present long enough so that Fourier based analysis gives proper results. Formally, when exposed to one‐sided inputs (non‐zero only if time is positive), linear systems exhibit output which can be decomposed into two additive parts: a sum of exponentially decaying components which depends on the system and a sustained part which depends on both the excitation and the system. The former is the transient part, the latter steady‐state. Intuitively, transients are responses which arise from changes of state—from one constant input or excitation function to another. They are problematic, since they often correspond to unexpected or rare events; it is often desired that the system spend most of its time in its easiest to predict state, a steady‐state. Because transients are heavily time‐localized, they defy the usefulness of traditional Fourier based methods.

In acoustics and music, the situation is similar in that frequency oriented methods tend to fail when transients are present. Moreover, in music, transients often correspond to excitatory motions on behalf of the performer (plucking a bow, striking a piano key, tonguing the reed while playing an oboe etc.), and so involve

Significant nonlinear interactions (instruments behave exceedingly nonlinearly)
Stochastic or chaotic phenomena (often from turbulence, as when sibilant sounds are produced in the singing voice)
Unsteady vibratory patterns (the onset of almost any note)
Partials with rapidly changing amplitudes and frequencies (as a result of the above)

All these together mean that pure frequency domain analyses do not explain complex sounds clearly enough—they do not take into account the time‐variant, stochastic or nonlinear aspects of the event. From an analytical point of view, a time‐frequency analysis is needed. Some of these are mentioned in the math section. The fourth item in the list above deserves special attention because it is characteristic of vocal sounds—consonants are primarily recognized the trajectories (starting points, relative amplitudes and speed of movement) of the partials present in the following phoneme Dow01. Usually consonants consists of a brief noisy period followed by the partials of the next phoneme sliding into place, beginning from positions characteristic to the consonant. This happens because consonants are mostly based on restricting air passage through the vocal tract (this and the following release produce the noise), because the following phoneme usually exhibits different formant frequencies (causing a slide from the configuration of the consonant) and, finally, because consonants are mostly very short compared to vowels.

What is the perceptual significance of our transient vs. steady classification, then? To see this, we must consider speech, first. In the spoken language, two general categories of sounds are recognized: vowels and consonants. They are characterized by vowels being voiced, often quite long and often having a more or less clear pitch as opposed to consonants being short, sometimes noiselike (such as the pronunciation of the letter s) and mostly unpitched. Vowels arise from nicely defined vibratory patterns in the vocal tract which are excited by a relatively steady pulse train from the vocal chords when consonants mostly arise from constrictions of the vocal tract and the attendant turbulence, impulsive release (like when pronouncing a p, or one of the other plosives) or nonlinear vibration (like the letter r). Now, a clear pattern shows here. Consonants tend to be transient in nature, while vowels are mostly steady‐state. This is very important because most of the higher audio processing in humans has been shaped by the need to understand speech. This connection between vowel/consonant and steady/transient classification has also been demonstrated in a more formal setting: in listening experiments, people generally tend to hear periodic and quasi‐periodic sounds as being vowel‐like while noises, inharmonic waveforms and nonlinear phenomena tend to be heard as consonants. Some composers have also created convincing illusions such as speech music by proper orchestration—when suitable portions of transient and steady‐state material is present in the music in some semi‐logical order, people tend to hear a faint speech like quality in the result. The current generation of commercial synthesizers also demonstrates the point—today, modulatory possibilities and time evolution of sounds often outweighs in importance the basic synthesis method and as a buying criterion. The music of the day relies greatly on evolving, complex sounds instead of the traditional one‐time note event structure.

It is kind of funny how little attention time information has received in the classical study, although one of the classic experiments in psychoacoustics tells us what importance brief, transient behavior of sound signals has. In the experiment, we record instrumental sounds. We then cut out the beginning of the sound (the portion before the sound has stabilized into a quasi‐periodic waveform). In listening experiments, samples brutalized this way are quite difficult to recognize as being from the original instrument. Furthermore, if we splice together the end of one sample and the beginning of another, the compound sound is mostly recognized as being from the instrument of the beginning part. In a musical context, the brief transient in the beginning of almost all notes is called the attack, then. For a long time, it eluded any closer inspection and even nowadays, it is exceedingly difficult to synthesize if anything but a throrough physical model of the instrument is available.

This kind of high importance of transient characteristics in sound is best understood through two complementary explanations. First, from an evolutionary point of view, time information is essential to survival—if it makes a sudden loud noise, it may be coming to eat you or falling on you.

This is where the startle and orientation reflexes come in: sudden noises or movement tend to cause a rapid fight or flight reaction and even weaker, unexpected stimuli cause one to locate the sound source by turning the head towards it. Since unexpected, sudden features in the heard sound tend to cause such effects and generally arouse the central nervous system, it can be conjectured that notes may well have some very deep seated physiological meaning to people—they do tend to start with transients and cause fixing of attention.

You need rapid classification as to what the source of the sound is and where it is at. From a physical point of view, there may also be considerably more information in transient sound events than in steady‐state (and especially periodic) sound—since high frequency signals are generated in nature by vibrational modes in bodies which have higher energies, they tend to occur only briefly and die out quickly. In addition to that, most natural objects tend to emit quasi‐periodic sound after a while has passed since the initial excitation. These two facts together mean that, first, upper frequencies and highly inharmonic content tend to concentrate on the transient part of a sound and, second, the following steady‐state portion often becomes rather nondistinctive.

So the steady‐state part is certainly not the best part to look at if source classification is the issue. The other part of the equation are the neural excitation patterns generated by different kinds of signals—transients tend to generate excitation in greater quantities and more unpredictably. Since unpredictability equals entropy equals information, transients tend to have a significant role in conveying useful data. This is seen in another way by observing that periodic sounds leave the timing pathway of the brain practically dead—only spectral information is carried and, as is explained in following sections, spectra are not sensed very precisely by humans. Kind of like watching photos vs. watching a movie. In addition to that, such effects as masking and the inherent normalisation with regard to the surrounding acoustic space greatly limit the precision of spectral reception.

Aside from their important role in classifying sound sources, transient features also serve a complementary role in sound localization. This is most clearly seen in auditory physiology: our brain processes interaural time differences instead of phase differences and has separate circuitry for detecting the onset of sonic events. This means that transient sounds are the easiest to locate. Experiments back this claim: the uncertainty in sound localization is greatest when steady‐state, periodic sounds are used as stimuli.

Critical bands and masking

Until now, we have tacitly assumed that the ear performs like a measuring instrument—if some features are present in a sound, we hear them. In reality, this is hardly the case. As everybody knows, it is often quite difficult to hear and understand speech in a noisy environment. The main source of such blurring is masking, a phenomenon in which energy present in some range of frequencies lessens or even abolishes the sensation of energy in some other range. Masking is a complex phenomenon—it works both ipsilaterally and contralaterally, and maskings effects extend both forwards and backwards in time. It is highly relevant to both practical applications (e.g. perceptual compression) and psychoacoustic theory (for instance, in models of consonance and amplitude perception). This also means that masking has been quite thoroughly investigated over the years. The bulk of research into masking involves experiments with sinusoids or narrow‐band noise masking a single sinusoid. Significant amounts of data are available on forward and backward masking as well. It seems most forms of masking can be explained at an extremely low (almost physical) level by considering the time dynamics of the organ of Corti under sonic excitation. This is not the case for contralateral masking, though, and it seems this form of masking stands separately from the others. Currently it is thought that contralateral masking is mediated through the olivo‐cochlear descending pathway by means of direct inhibition of the cochlea in the opposite ear. (Masking like this is called central, whereas normal masking by sound conducted through bone across the scull is called transcranial.)

Masking is a rather straight forward mechanism, which can be studied with relative ease by presenting test signals of different amplitudes and frequencies to test subjects in the presence of a fixed masking signal. The standard way to give the result of such an experiment is to divide the frequency‐amplitude plane into parts according to the effect produced by a test signal with the respective attributes while the mask stays constant. The main feature is the masking threshold which determines the limit below which the masked signal is not heard at all. This curve has a characteristic shape with steep roll‐off below the mask frequency and a much slower, uneven descent above. This means that masking mostly reaches upwards with only little effect on frequencies below that of the mask. At each multiple of the mask frequency we see some dipping because of beating effects with harmonic distortion components of the mask. Above the threshold we see areas of perfect separation, roughness, difference tones and beating.

A lot is known about masking when both the mask and the masked are simple, well‐behaved signals devoid of any time information. But how about sound in general? First we must consider what happens with arbitrary static spectra. In this case one proper—and indeed lot used—way is to take masking to be additive. That is, the mask contribution of each frequency is added together to obtain the amount of masking imposed on some fixed frequency.

This leads to the masking threshold over the whole audio bandwidth being the convolution of the (frequency variable) masking curve with our mask spectrum. Further, if we warp the frequency domain representation of the mask sound to produce a scale (the bark scale) with equal critical bandwidth over the frequency axis, we see that masking curves corresponding to different mask frequencies are made nearly identical in shape. Performing the convolution in this new domain is the normal shift‐invariant convolution operation and can be very nearly approximated by low cost linear shift‐invariant filtering. Extremely useful in masking calculations like when performing perceptual compression.

So additivity is nice. But does it hold in general? Not quite. Since the ear is not exactly linear, some additional frequencies always arise. These are not included in our masking computation and can sometimes make a difference. Also, in the areas where our hearing begins to roll off (very low and very high frequencies), some exceptions to the additivity must be made. Since masking mainly stretches upwards, this is mostly relevant in the low end of the audio spectrum—low pitched sounds do not mask higher ones quite as well as we would expect. Further, beating between different partials of the masking and the masked can sometimes cause additivity to be too strict an assumption. This is why practical calculations sometimes err on the safe side and take maximums instead of sums. This works because removing all content other than the frequency (band) whose masking effect was the greatest will still leave the signal masked. It is proper to expect that putting the rest of the mask back in will not reduce the total masking effect.

The above discussion concerns steady spectra. In contrast, people hear time features in sounds as well. So there is still the question of how the masking effect of a particular sound develops in time. When we study masking effects with brief tone bursts, we find that masking extends some tens of milliseconds (often quoted as 50ms) backwards and one to two hundred milliseconds forward in time. The effect drops approximately exponentially as the temporal separation of the mask and the masked increases. These results too can be explained by considering what happens in the basilar membrane of the ear when sonic excitation is applied—it seems backward and forward masking, as these are respectively called, are the result of the basilar membrane’s inherently resonant nature. The damped vibrations set of by sound waves do not set in or die out abruptly, but instead some temporal integration is always observed. This same integration is what causes the loudness of very short sounds proportional to their total energy instead of the absolute amplitude—since it takes some time for the vibration (and, especially, the characteristic vibrational envelope) to set in, the ear can only measure the total amount of vibration taking place, and ends up measuring energy across a wide band of frequencies. Similarly, any variation in the amplitudes of sound frequencies are smoothed out, leading to the the ear having a kind of time constant which limits its temporal accuracy.

Closely tied to masking (and, indeed, many other aspects of human hearing) are the concepts of critical bandwidth and critical bands. The critical bandwidth is defined as that width of a noise band beyond which increasing the bandwidth does not increase the masking effect imposed by the noise signal upon a sinusoid placed at the center frequency of the band. The critical bandwidth varies across the spectrum, being approximately one third of an octave in size, except below 500Hz, where the width is more or less constant at 100Hz. This concept has many uses and interpretations because in a way, it measures the spectral accuracy of our ear. Logically enough, a critical band is a frequency band with the width of one critical bandwidth. Through some analysis of auditory physiology we find that a critical band roughly corresponds to a constant number of hair cells in the organ of Corti. In some expositions, critical bands are thought of as having a fixed center frequency and bandwidth. Although such a view is appealing from an application standpoint, no physiological evidence of direct banking of any kind is found in the inner ear or the auditory pathway, it seems that this way of thinking is somewhat erroneous. Instead, we should think of critical bands as giving the size and shape of a minimum discernible spectral unit of kind—in measuring the loudness of a particular sound, the amplitude for each frequency is always averaged over the critical band corresponding to the frequency. (This amounts to lowpass filtering (i.e. smoothing) of the perceived spectral envelope.) This effect is illustrated by the fact that people can rarely discern fluctuations in the spectral envelope of a sound which are less than one critical band in width.

One counter‐example is found in the perception of speech: the inherent unstability in the pitch of a vowel makes many static features of the spectral envelope more easily heard, even if these features are extremely narrow. More generally, if we pass a periodic sound with a strong harmonic spectrum (like a pulse train) through a filter, applying some slow fluctuation to the base frequency makes the harmonics roam through the peaks and dents in the filter’s response and so turns the static response of the filter into relatively simple amplitude modulation of the partials. Our ear is tuned to detecting these and can so go below the critical bandwidth in resolving the overall spectral envelope. Naturally the trick won’t work with extremely low base frequencies (the harmonics are too close to each other), fast fluctuations (the ear integrates in time), spectra which are not nearly harmonic, or flat/continuous spectra (like noise).

Amplitude to loudness

Considering the complexity of the analysis taking place in the auditory pathway, it is no wonder that few parameters of sound signals are translated directly into perceptually significant measures. This is the case with amplitude too—the nearest perceptual equivalent, loudness, consists of much more than a simple translation of signal amplitude. First, the sensitivity of the human ear is greatly frequency dependent—pronounced sensitivity is found at the frequencies which are utilized by speech. This is mainly due to physiological reasons (i.e. the ear canal has its prominent resonance on these frequencies and the transduction and detection mechanisms cause uneven frequency response and limit the total range of frequency perception). There are also significant psychoacoustic phenomena involved. Especially, humans tend to normalize sounds. This means that parameters of the acoustic environment we listen to a sound in is separated from the properties of the sound source. This means, for instance, that we tend to hear sounds with similar energies as being of unequal loudness if our brain concludes that they are coming from differing distances. Further, such phenomena as masking can cause significant parts of a sound to be shielded from us, effectively reducing the perceived loudness. We also follow very subtle clues in sounds to deduce the source parameters. One example is the fact that a sound with significant high frequency content usually has a higher perceived loudness than a sound with similar amplitude and energy but less high end. This is a learned association—we know that usually objects emit higher frequencies if they are excited more vigorously. Phenomena such as these are of great value to synthesists since they allow us to use simple mathematical constructs (such as low order lowpass filters) to create perceptually plausible synthesized instruments. On the other hand, they tend to greatly complicate analysis.

If we take a typical, simple and single sound and look at its loudness, we can often neglect most of the complicated mechanisms of perception and look directly at the physical parameters of the sound. Especially, this is the case with sinusoids since they have no spectral content apart from their frequency. Thus most of the theory of loudness perception is formulated in terms of pure sine waves at different frequencies. It is mostly this theory that I will outline in the remainder of this section.

[Figure 1: Equal loudness curves in a free field experiment]

Figure 1 Equiphon contours for the range of human hearing in a free field experiment, according to Robinson and Dadson. At 1kHz, the phon contours correspond to the decibels. All sinusoids on the same contour (identified by sound pressure level and frequency) appear to have identical loudness to a human listener. It is seen that the dynamic range and threshold of hearing are worst in the low frequency end of the spectrum. Also, it is quite evident that at high sound pressure levels, less dependency on frequency is observed (i.e. the upper contours are flatter than the lower ones).

Decibels are nice but they have two problems: they do not take the frequency of the signal into account and also show poor correspondence with perceived loudness at low SPLs. The puzzle is solved in two steps. First, we construct a scale where the frequency dependency is taken into account. This is done by picking a reference frequency (1kHz, since this is where we defined the zero level for SPLs) and then examining how intense sounds at different frequencies need to be to achieve loudness similar to their 1kHz counterparts. After that we connect sounds with similar loudnesses across frequencies. The resulting curves are called equiphon contours and are shown in the graph from Robinson and Dadson. We get a new unit, the phon, which tells loudness in terms of SPL at 1kHz. That 1kHz is the reference point shows in that there the resulting decibel to phon mapping is an identity. Elsewhere we see the frequency dependency of hearing: following the 60 phon contour, we see that to get the same loudness which results from presenting a 1kHz, 60dB SPL sine wave, we must use a 90dB SPL sine wave at 30Hz or 55dB sine wave at 4kHz. We also see that the higher the sound pressure level, the less loudness depends on frequency (the isophon contours are straighter in the upper portion of the picture).

The phon is not an absolute unit: it presents loudness relative to the loudness at 1kHz. Knowing the phons, we cannot say one sound is twice as loud as another one—this would be like saying that a five star hotel is five times better than a one star motel, i.e. senseless. Instead, we would wish an absolute perceptual unit. All that remains to be done is to get the phons at some frequency (preferably at 1kHz since the SPL‐to‐phon mapping is simplest there) to match our perception. This is done by defining yet another unit, the sone. When this is accomplished, we can first use the equiphon contours to map any SPL to its equivalent loudness in phons at 1kHz and then the mapping to sones to get a measure of absolute loudness. The other way around, if we want a certain amount of sones, we first get the amount of phons at 1kHz and then move along the equiphon contours to get the amount of decibels (SPL) at the desired frequency. Experimentally we get a power law between sones and phons—at 1kHz, the mapping from sones to phons obeys a power function with an exponent of 0.6, 40 phons being equal to 1 sone. (0 phons, that is 0dB, becomes 0 sones of course.) This way at high SPLs the sone scale is nearly the same as the phon/decibel one, while at low levels, small changes in sones correspond to significantly higher differences in phons. In effect, at low levels a perceptually uniform volume slider works real fast while at higher levels, it’s just exponential.

All the previous development assumes that the sounds are steady and of considerable duration. If we experiment with exceedingly short stimuli, different results emerge. Namely, we observe considerable clouding of accuracy in the percepts and signs of temporal integration. This means that as we go to very short sounds and finally impulses which approach or are below the temporal resolution of the organ of Corti, the total energy in the sound becomes the dominant measure of loudness. At the same time, loudness resolution degrades so that only few separate levels of loudness can be distinguished. Similarly, when dealing with sound stimuli, the presence of transients becomes an important factor in determining the lower threshold of hearing—transient content (e.g. rapid onset of sinusoidal inputs and fluctuation in the amplitude envelopes) tends to lower the threshold while at the same time clouding the reception of steady‐state loudness.

Finally, a few words must be said about the loudness of complex sounds. As was explained in the previous section, sinusoidal sounds close to each other tend to mask one another. If the sounds are far enough from one another (more than one critical bandwidth apart) and the higher is sufficiently loud, they are heard as separate and contribute separately to loudness. In this case sones are roughly added. Since masking is most pronounced in the upward direction, a sound affects the perception of lower frequencies considerably less than higher ones—in a sufficiently rapidly decaying spectrum, the lower partials dominate loudness perception. Also, sinusoids closer than the critical bandwidth are merged by hearing so their contribution to loudness is less than the sum of their separate contributions. The same applies for narrow‐band (bandwidth less than the critical bandwidth) noise. If beating is produced, it may, depending on its frequency, increase, decrease or blur perceived loudness. Similarly harmonics (whether actually present or born in the ear) of low frequency tones and the presence of transients may aid in the perception of the fundamental, thus affecting the audibility of real life musical tones as compared to the sine waves used in the construction of the above equiphon graph.

For signals with continuous spectra (such as wideband noise), models of loudness perception are almost always heavily computational—they usually utilize a filterbank analysis followed by conversion into the bark scale, a masking simulation and averaging. Wideband signals also have the problem of not exactly following the conversion rules for decibels, phons and sones—white noise, for instance, tends to be heard as relatively too loud if its SPL is small and too silent if its SPL is high.

There are also more subtle complications in trying to generalize the notion of loudness to cover complex sounds—it is quite possible the same concepts are not completely applicable to both simple and composite sounds. Fusion and streaming may separate composite sounds so that asking for a measure of loudness to be assigned to the composite becomes meaningless. On top of that, loudness may have very different meanings depending on the context. There is a fascinating discussion on what makes a movie too loud on the Dolby Laboratories web site. It clearly demonstrates the difference between estimating the loudness of instantaneous/long term, music/noise and expected/unexpected sounds and the contribution of dynamic range to the perceived strength of a sound. In other words it struggles with the cognitive aspects of loudness perception. All in all, we have bumped for the first time into one of the prime problems in psychometrics—people display modes of behavior (such as categorical perception) which tend to defy easy measurement. Not all things in our perceptual world have dimensions or permit proper units.

Temporal processing. Amplitude and frequency modulation.

From looking at the structure of our auditory system, it seems like quite considerable machinery is assigned to temporal processing. Furthermore, it seems like time plays an important role in every aspect of auditory perception—even more so than in the context of the other senses. This is to be expected, of course: sound as we perceive it has few degrees of freedom in addition to time.

The importance of time processing shows in the fact that it starts at an extremely early stage in the auditory pathway, namely, in the cochlear nuclei. The bushy cells mentioned earlier seem to be responsible for detecting the onset of sounds at different frequency ranges. Excitation of the bushy cells elicits a phasic response (onset produces a response, continued excitation does not) as opposed to the tonic (continued excitation produces a continued response) pattern most often observed higher up the auditory track. This way, the higher stages of auditory processing receive a more event centric view of sound as opposed to the flow‐like, tonic patterns of the lower auditory pathway. The pauser cells may be responsible for detecting sound offsets. This way sound energy in different frequency bands is segregated into time‐limited events. This time information is what drives most of our auditory reflexes, such as startle, orientation and protective reflexes. As such, it hardly comes as a surprise that heavy connections to the reticular formation (which controls arousal and motivation, amongst other things) are observed throughout the auditory pathway.

In other animals, and especially those which rely heavily on hearing to survive (e.g. bats, whales and owls), specialized cells which extract certain temporal features from sound stimuli have been found. For instance, in bats certain speed ranges of frequency sweeps are mapped laterally on the auditory cortex Kan01. This makes it possible for the bat to use Doppler shifts to correctly echolocate approaching obstacles and possible prey. These cells are very selective—they respond best to sounds which nearly approximate the squeals sent by the bat, excepting the frequency shift. This leads to good noise immunity. Cells similarly sensitive to certain modulation effects have been found in almost all mammals and there is some evidence people are no exception. For instance, amplitude modulation in the range ?? to ??Hz displays high affinity for a group of cells in the ???????. TEMP!!! Also, the nonlinearity of the organ of Corti makes AM appear in the neural input to the cochlear nuclei as‐is. Mechanisms like these may be what makes it possible for us to follow rapid melodies, rhythmic lines and the prosody of speech without difficulty. It is also probable that they serve a role in helping separate phonemes from each other when they follow in rapid succession. Without such detection mechanisms it is quite difficult to see how consonants are so clearly perceived by the starting frequencies and the relative motion of the partials present. These mechanisms may even be lent to the interpretation of formant envelopes (and, thus, the discrimination of vowels) through the minute amplitude fluctuations in the partials of a given speech sound. (As was discussed in section 4.6, such flutter is caused by involuntary random vibrato in the period of the glottal excitation pulse train.)

The importance of time features has been heavily stressed, above. However, we have yet to discuss quantitatively the sensitivity of our ears to nonstationary signals. One reason for deferring the issue until now is that it is not entirely clear what we mean by it. We would like some objective measure of the time sensitivity of the ear, in a sense, a time constant. Some of the more important temporal measures are the time required to detect a gap in some sound signal, the time taken before two overlapped sonic events can be heard as being separate, the mean repetition rate at which a recurring sonic event fuses into a coherent, single whole and the rate at which a masking effect at certain frequency builds up when the mask is applied or evaporates after the mask is gone. The first hints at a discrimination test, the second is clearly a matter of categorical perception and multidimensional study and from third on we walk in the regime of continuous temporal integration.

The time required to hear a sonic gap varies somewhat over the audio bandwidth. For a first order approximation, we might say that to effect a discontinuous percept, we need some constant number of wavelengths of silence. But looking a bit closer, this number also depends on the amplitude, timbral composition and timbral composition of the sound. Voice band sinusoids are probably the easiest, complex sounds with lots of noise content and expected behavior the most complicated. In the context of rich spectra, temporal smearing of over 50ms can occur. With a 1kHz 80dB sinusoid, a gap of 3‐4 cycles is enough to effect a discontinuous percept. Often sounds overlaid with expectations (such as a continuously ascending sinusoid) lend themselves to a sort of perceptual extrapolation—even if the percept is broken by wideband noise, our auditory system tries to fill the gap and we may well hear the sound continue through the pause. The effect is even more pronounced when a sound is masked by another one. This will be discussed further down, in connection with the pattern recognition aspects of hearing.

The minimum length of audible gaps is one but only one measure of the ear’s time resolution. In fact, it is a very simplistic one. Another common way to describe the resolution is to model our time perception through a kind of lowpass filtering (integration or blurring) operation. In this case, we try to determine the time constant of the ear. The time constant of this conceptual filter is then used to predict whether two time adjacent phenomena are heard as separate or if they are fused into one. The first thing we notice is that the time constant varies for different frequency ranges.

 ‐so what’s the value?

When we look at simple stimuli, we get some nice, consistent measures of the temporal behavior of our hearing. But as always, when higher level phenomena are considered as well, things become complicated. It seems that the so called psychological time is a very complicated beast. For instance, it has been shown that dichotic hearing can precipitate significant difference in perceived time spans as compared to listening to the same material monophonically. In an experiment in which a series of equidistant pulse sounds were presented at different speeds and relative amplitudes via two headphones, it was possible to fool test subjects to estimate the tempo of the pulse train to be anywhere between the actual tempo and its duplicate, on a continuous scale. This means that the phenomenon isn’t so much a question of locking onto the sound in a particular fashion (hearing every other pulse, for instance) but rather a genuine phenomenon of our time perception. This experiment has a partial explanation in the theory of auditory perception, which states that the processing of segregated streams of sound (in this case, the trains of clicks in the two ears) are mostly independent but that the degree of separation depends on the strength of separation of the streams. This disjoint processing can give rise to some rather unintuitive effects. First of all, time no longer has the easy, linear structure the Western world attributes to it—segregated streams all more or less have their own, linear time. The implication is that time phenomena which are strongly segregated are largely incommeasurate. This is demonstrated by the fact that a short gap within a sentence played to test subjects is surprisingly difficult to place within the sentence afterwards: the subjects know that there was a gap (and even what alternative material was possibly played in the gap) but cannot place the gap with any certainty (as in it was after the word is). In effect, the order of time events has gone from total (a common linear scale one which everything can be compared) to partial (there are incommeasurate events which cannot be placed with respect to each other).

Further complicating the equation, we know that to some degree our perception of rhythm and time is relative. The traditional point of comparison is the individual’s heart beat but the relative state of arousal (i.e. whether we are just about to go to sleep or hyperaroused by a fight‐or‐flight reaction) probably has an even more pronounced effect. This may in part explain why certain genres of music are mostly listened to at certain times of the day. A fun experiment in relative time perception is to listen to some pitched music with a regular time while yawning, dosing off or…getting high. All these should cause profound distortions in both time and pitch quite like they do with the general state of arousal of a person.

There is an interesting concept which has been floating around the community of music psychology for quite some time. I figure it’s a noteworthy one when talking about hearing and time. That’s the concept of the perceptual now, the psychological equivalent of the present time. What is noteworthy about it is that it extends over a variable time span and in a multiresolution manner. Depending on what sort of sonic events we are looking at, the psychological now varies from milliseconds to entire seconds, even while the different sounds overlap. Of course, any two events heard in the now mentally coincide.

 ‐Vesa Valimaki’s work on time masking etc.

Pitch perception

 ‐volley theory (especially in the low register)
 ‐virtual vs. real pitch
 ‐nonlinearity/missing fundamental problem
 ‐spectral pitch (place theory interpretation for acute tones)
 ‐formants/spectral envelope

Directional hearing, externalization and localisation

 ‐phase difference (grave)
 ‐amplitude gradient (acute)
 ‐indetermination in between registers
 ‐amplitude envelopes important (acute)
  ‐connection to the concept of group delay
 ‐relative reverb/early reflections as size/distance cues
 ‐the poor performance of generic computational models as proof of the
    acuity of these processes

Auditory perception as a pattern recognition task: stream segregation and fusion

 ‐common features
 ‐occlusion
  ‐the old+new heuristic
 ‐layers: neurological and cognitive
 ‐attention: effects on both layers/selection
 ‐orientation (reflexes+attention)
 ‐pattern recognition
  ‐what’s here?
 ‐perceptual time
  ‐e.g. dichotic clicks seem slower than the same sequence when presented
   monaurally
  ‐relation to state of arousal, heartbeat and other natural
   timekeepers
 ‐vertical vs. horizontal integration
  ‐competition between integration and segregation
  ‐this is a typical application of the Gestalt type field rules

Timbre?

 ‐formants
 ‐spectral envelopes
 ‐temporal processing
  ‐e.g. the genesis of granular textures
 ‐connections to fusion; relevance of vibrato/dynamic envelopes for fusion
  ‐e.g. fusion of separately introduced sinusoids upon the introduction of a
   common frequency/amplitude modulator, and its converse when the commonality
   no longer holds
 ‐the importance of attacks and transients
  ‐spectral spashing
  ‐information carrying capacity of transients (no steady‐state
   vibration…)
 ‐indetermination in periodic timbre
  ‐ergo, place theory/formant perception et cetera is not very accurate,
   whereas volley theory/temporal processing seems to be
 ‐multidimensionality
  ‐i.e. it is very difficult to characterize/measure timbre
  ‐there have been attempts
   ‐for instance, for steady‐state spectra with origin in orchestral
    instruments, we seem to get three dimensions via
    PCA/FA
   ‐most of these attempts do not concern temporal phenomena (the
    overemphasis of on Fourier, mentioned earlier)
   ‐this sort of theory is based on extremely simplified sounds and
    test setups
 ‐connection to masking (especially in composite signals)
 ‐phase has little effect
  ‐except in higher partials and granular/percussive stuff
  ‐i.e. steady‐state is again overemphasized in traditional expositions
 ‐timbre is not well defined (Bregman: wastebasket)

Sensory integration

 ‐head turning as a localisation cue
  ‐we continuously extract spatial information based not only on an
   open‐loop interpretation of what is heard, but on a closed‐loop one of what
   happens when we change the acoustic conditions
 ‐the McGurk effect
  ‐that is, seeing someone talk can change the interpretation of the same
   auditory input

Cognitive aspects of hearing. Evolutionary perspectives.

 ‐what can be learned?
  ‐apparently a lot!
 ‐lateralization implies invariance/hardwiring?
  ‐or just that there is a typical dynamic balance arising from the common
   underlying circuitry?
 ‐is plasticity the norm?
 ‐what features in sound prompt specific invariant organizational patterns?
 ‐evolution and development of audition
 ‐population variations
  ‐esp. the Japanese peculiarities in lateralization!

Frontiers

 ‐hearing under the noise floor
 ‐the effects of ultrasonic content on directional hearing
  ‐esp. sursound discussions on transient localization
   ‐the idea that bandlimitation (and the inherent ringing, and especially
    pre‐echos) it produces fool our time resolution circuitry
    ‐this is a nasty idea, because we cannot hear ultrasonic content, per se
    ‐it implies that spatial hearing is inherently non‐linear
    ‐it does not imply that all ultrasonic content has to be stored
     ‐instead it would mean that we might have to consider some nonlinear
      storage format, which only helps store transients more accurately
  ‐thoughts on why dither might not help this situation, even if it makes
   the average temporal resolution of an audio system approach
   infinite
 ‐overcomplete analysis and superresolution of sounds (Michael
  Gerzon’s unpublished work?)
 ‐inherent nonlinearity in hearing (computational models of microcilia!)
  ‐used to explain difference tones, perception of harmonic/near‐harmonic
   spectra, missing fundamentals(, what else?)
 ‐levels of pattern recognition (learned vs. intrinsic)
 ‐comodulation of masking release and profile perception as signs of cross
  frequency band processing at a low level
 ‐the consequent refutation of strictly tonotopic place theories of pitch
  etc.
 ‐envelopment and externalization through decorrelation
  ‐frequency ranges?