The Handbook of Speech Perception - Group of authors - Page 18

The plausibility of the generic account of perceptual organization

A brief review of the acoustic properties of speech

One challenge of perceptual organization facing a listener is simple to state: to find and follow a speech stream. This would be an easy matter were the acoustic constituents of a speech signal or their auditory sensory correlates unique to speech, if the speech signal were more or less stationary in its spectrum, or if the acoustic elements and the auditory impressions they evoke were similar moment by moment. None of these is true, however, which inherently undermines the plausibility of any attempt to formalize the perceptual organization of speech as a task of determining successive or simultaneous similarities in auditory experience. First, the acoustic effects of speech are distributed across six octaves of audibility. The sensory contour of an utterance is widely distributed across frequency. Second, none of the multitude of naturally produced vocal sounds composing a speech signal is unique to speech. Arguably, the physical models of speech production succeed so well because they exploit an analogy between vocal sound and acoustic resonance (Fant, 1960; Stevens & House, 1961). Third, one signature aspect of speech is the presence of multiple acoustic maxima and minima in the spectrum, and the variation over time in the frequencies at which the acoustic energy is concentrated (Stevens & Blumstein, 1981). This frequency variation of the formant centers is interrupted at stop closures, creating an acoustic spectrum that is both nonstationary and discontinuous. Fourth, the complex pattern of articulation by which talkers produce consonant holds and approximations creates heterogeneous acoustic effects consisting of hisses, whistles, clicks, buzzes, and hums (Stevens, 1998). The resulting acoustic pattern of speech consists of a nonstationary, discontinuous series of periodic and aperiodic elements, none of which in detail is unique to a vocal source.

The diversity of acoustic constituents of speech is readily resolved as a coherent stream perceptually, though the means by which this occurs challenges the potential of the generic auditory account. Although some computational implementations of gestalt grouping have disentangled spoken sources of simple nonstationary spectra (Parsons, 1976; Summerfield, 1992), these successes have involved signals free of discontinuities, as in the production of sustained, slowly changing vowels. Slow and sustained change in the spectrum, though, is hardly typical of ordinary speech, which is characterized by consonant closures that impose rapid spectral changes and episodes of silence of varying duration. To resolve a signal despite silent discontinuities requires grouping by closure to extrapolate across brief silent gaps. To invoke generic auditory properties in providing this function would oppose present evidence, though. For example, in an empirical attempt to discover the standard for grouping by closure (Neff, Jesteadt, & Brown, 1982), the temporal threshold for gap detection was found to diverge from the tolerance of discontinuity in grouping. On such evidence, it is unlikely that a generic mechanism of extrapolation across gaps is responsible for the establishment of perceptual continuity, whether in auditory form or in the perception of speech.

Evidence from tests of auditory form suggests that harmonic relations and amplitude comodulation promote grouping, albeit weakly (Bregman, Levitan, & Liao, 1990). That is, sharing a fundamental frequency or pulsing at a common rate promotes auditory integration. These two characteristics are manifested by oral and nasal resonances and by voiced frication. This may be the most promising principle to explain the coherence of voiced speech by generic auditory means, for an appeal to similarity in frequency variation between the formants is unlikely to explain their coherence. Indeed, the pattern of frequency variation of the first formant typically differs from that of the second, and neither the first nor the second resembles the third, due to the different articulatory origins of each (Fant, 1960). To greatly simplify a complex relation, the center frequency of the first formant often varies with the opening and closing of the jaw, while the frequency of the second formant varies with the advancement and retraction of the tongue, and the frequency of the third formant alternates in its articulatory correlate. Accordingly, different patterns of frequency variation are observed in each resonance due to the relative independence of the control of these articulators (see Figure 1.2). Even were generic auditory functions to bind the comodulated formants into a single stream, without additional principles of perceptual organization a generic gestalt-derived parsing mechanism that aims to compose perceptual streams of similar auditory elements would fail; indeed, it would fracture the acoustically diverse components of a single speech signal into streams of similar elements, one of hisses, another of buzzes, a third of clicks, and so on, deriving an incoherent profusion of streams despite the common origin of the acoustic elements in phonologically governed sound production (Lackner & Goldstein, 1974; Darwin & Gardner, 1986; Remez et al., 1994).
Apart from this consideration in principle, a small empirical literature exists on which to base an adequate account of the perceptual organization of speech.
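
The consideration in principle can be made concrete with a toy sketch. It assumes, purely for illustration, that each acoustic element of an utterance carries a single categorical label (hiss, buzz, click, hum) and that a similarity-based parser groups elements only by that label. The labels, the sequence, and the function name `group_by_similarity` are hypothetical and do not come from any of the cited studies.

```python
from collections import defaultdict

# Hypothetical sequence of acoustic elements from ONE talker, in temporal order.
# Each entry is (kind_of_element, position_in_utterance); values are illustrative.
utterance = [
    ("hiss", 0), ("buzz", 1), ("hum", 2),
    ("click", 3), ("buzz", 4), ("hiss", 5),
]

def group_by_similarity(elements):
    """Place each element in a stream that holds only elements of the same kind."""
    streams = defaultdict(list)
    for kind, position in elements:
        streams[kind].append(position)
    return dict(streams)

streams = group_by_similarity(utterance)

# One source has been split into as many streams as there are kinds of element.
for kind, positions in streams.items():
    print(kind, positions)
```

Under these assumptions a single utterance yields four separate streams, and the temporal order that binds the elements into one spoken message is lost within each of them. This is the fracturing that a purely similarity-based account would predict.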

Figure 1.2 A comparison of natural and sinewave versions of the sentence “The steady drip is worse than a drenching rain”: (A) natural speech; (B) sinewave replica.
