The Handbook of Speech Perception

The perceptual organization of speech: Characteristics of the perceptual coherence of speech

While much remains to be discovered about perceptual organization that depends on sensitivity to complex coordinate variation, research on the psychoacoustics and perception of speech from a variety of laboratories permits a rough sketch of the parameters. The portrait of perceptual organization offered here gathers evidence from different research programs that aimed to address a range of perceptual questions, for there is no unified attempt at present to understand the organization of perceptual streams that approach the acoustic variety and distributed frequency breadth of speech. Overall, these results expose the perceptual organization of speech as fast, unlearned, nonsymbolic, keyed to complex patterns of sensory variation, indifferent to sensory quality, and requiring attention whether elicited or exerted.

The evidence that perceptual organization of speech is fast rests on long‐established findings that an auditory trace fades rapidly. Although estimates vary with the task used to calibrate the durability of unelaborated auditory sensation, all of the measures reflect the urgency with which the fading trace is recoded into a more stable phonetic form (Howell & Darwin, 1977; Pisoni & Tash, 1974). It is unlikely that much of the auditory form of speech persists beyond a tenth of a second, and it has decayed beyond recurrent access by 400 ms. The sensory integration required for perceptual organization is tied to this pace. Contrary to this notion of perceptual organization as exceedingly rapid, an extended version of auditory scene analysis (Bregman, 1990) proposes a resort to a cognitive mechanism occurring well after primitive grouping takes place, to function as a supplement to the gestalt‐based mechanism. Such knowledge‐based mechanisms also feature as a method to resolve difficult grouping in artifactual approaches to perceptual organization (e.g. Cooke & Ellis, 2001). However, the formal or practical advantages that this method achieves come at a clear cost, namely, to reject boundary conditions that subscribe to the natural auditory limits of perceptual organization.

The propensity to organize an auditory pattern by virtue of complex coordinate variation is apparently unlearned, or nearly so. In tests with infant listeners, 14‐week‐old subjects exhibited the pattern of adult sensitivity to dichotically arrayed components of synthetic syllables (Eimas & Miller, 1992; cf. Whalen & Liberman, 1987; Vouloumanos & Werker, 2007; Rosen & Iverson, 2007). In this case, the pattern of perceptual effects evident in infants was contingent on the integration of sensory elements despite detailed failures of auditory similarity on which gestalt grouping depends. Perhaps it is an exaggeration to claim that this organizational function is strictly unlearned, for even the youngest subject in the sample had been encountering airborne sound for three months, and undeniably had the opportunity to refine their sensitivity through this exposure. However, the development of sensitivity to complex auditory patterns cannot plausibly result from a history of meticulous trial and error in listeners of such a tender age, nor is it likely to reflect specific knowledge of the auditory effects that typify American English phonetic expression. It is far likelier that this sensitivity represents the emergence of an organizational component of listening that must be present for speech perception to develop (Houston & Bergeson, 2014), and 14‐week‐old infants still have several months ahead of them before the phonetic properties of speech become conspicuous (Jusczyk, 1997).

Research on sinewave replicas of speech has shown that the perceptual organization of speech is nonsymbolic and keyed to patterns of sensory variation. The evidence is provided by tests (Remez et al., 1994; Remez, 2001; Roberts, Summers, & Bailey, 2010) that used tone analogs of sentences in which a sinewave replicating the second formant was presented to one ear while tone analogs of the first, third, and fricative formants were presented to the other ear. In such conditions, much as Broadbent and Ladefoged had found, perceptual fusion readily occurs despite the spatial dissimilarity of the components and the absence of other attributes to promote gestalt‐based grouping. To sharpen the test, an intrusive tone was presented in the same ear with the tone analogs of the first, third, and fricative formants. This single tone presented by itself does not evoke phonetic impressions, and is perceived as an auditory form without symbolic properties: it merely changes in pitch and loudness. In order to resolve the speech stream under such conditions, a listener must reject the intrusive tone despite its spatial similarity to the first, third, and fricative tones of the sentence, and appropriate the tone analog of the second formant to form the speech stream despite its spatial displacement from the tones with which it combines. Control tests established that a tone analog of the second formant alone failed to evoke an impression of phonetic properties. Performance of listeners in a transcription task, a rough estimate of phonetic coherence, was good if the intrusive tone did not vary in a speechlike manner. That is, an intrusive tone of constant frequency or of alternating frequency had no effect on the perceptual organization of speech. When the intrusive tone exhibited the tempo and range of frequency variation appropriate for a second formant, without supplying the proper variation that would combine with other tones to form an intelligible stream, performance suffered. It was as if the criterion for integration of a tone was specific to its frequency variation under conditions in which it was nonetheless unintelligible.
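The dichotic arrangement described above can be made concrete with a small synthesis sketch. It is an illustration only, not the procedure of the cited studies: the function names, the sample rate, the frame duration, the choice of ear for each component, and every frequency and amplitude value are assumptions invented for the example. Each tone is a single sinusoid whose frequency and amplitude are held constant within a frame, a simplification of a smoothly varying formant track.

```python
import math

SAMPLE_RATE = 16000  # Hz; assumed for illustration
FRAME_S = 0.01       # seconds per frame; assumed for illustration


def tone_from_track(freqs_hz, amps, sample_rate=SAMPLE_RATE, frame_s=FRAME_S):
    """Synthesize one sinusoid from per-frame frequency and amplitude values.

    Phase is accumulated sample by sample, so a change of frequency at a
    frame boundary does not introduce a phase discontinuity.
    """
    samples_per_frame = int(round(sample_rate * frame_s))
    phase = 0.0
    out = []
    for freq, amp in zip(freqs_hz, amps):
        step = 2.0 * math.pi * freq / sample_rate
        for _ in range(samples_per_frame):
            out.append(amp * math.sin(phase))
            phase += step
    return out


def mix(*tones):
    """Sum equal-length tones sample by sample."""
    return [sum(values) for values in zip(*tones)]


def dichotic_stimulus(tracks, intruder=None):
    """Return (left, right) channels of a dichotic sinewave pattern.

    Left ear: tone analogs of the first, third, and fricative formants,
    plus the optional intrusive tone. Right ear: the second-formant analog
    alone. Which ear receives which component is an arbitrary choice here.
    """
    same_ear = [tone_from_track(*tracks[name]) for name in ("f1", "f3", "fric")]
    if intruder is not None:
        same_ear.append(tone_from_track(*intruder))
    left = mix(*same_ear)
    right = tone_from_track(*tracks["f2"])
    return left, right


# Hypothetical five-frame tracks (frequencies in Hz, amplitudes 0..1).
# These are made-up values, not measurements from any utterance.
tracks = {
    "f1": ([300, 400, 500, 450, 350], [0.5] * 5),
    "f2": ([1200, 1500, 1800, 1600, 1300], [0.3] * 5),
    "f3": ([2500, 2550, 2600, 2550, 2500], [0.2] * 5),
    "fric": ([3500] * 5, [0.0, 0.0, 0.1, 0.1, 0.0]),
}

# An intrusive tone of constant frequency, the kind reported to leave
# perceptual organization unaffected.
constant_intruder = ([1500] * 5, [0.3] * 5)

left, right = dichotic_stimulus(tracks, intruder=constant_intruder)
```

Replacing `constant_intruder` with a track that has the tempo and frequency range of a second formant, but not the variation that fits the other tones, would give the kind of intruder reported to impair transcription.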

Since the advent of the telephone, it has been obvious that a listener’s ability to find and follow a speech stream is indifferent to distortion of natural auditory quality. The lack of spectral fidelity in early forms of speech technology made speech sound phony, literally, yet it was readily recognized that this lapse of natural quality did not compromise the usefulness of speech as a communication channel (Fletcher, 1929). This fact indicates clearly that the functions of perceptual organization hardly aim to collect aspects of sensory stimulation that have the precise auditory quality of natural speech. Indeed, Liberman and Cooper (1972) argued that early synthesis techniques evoked phonetic perception because the perceiver cheerfully forgave departures from natural quality that were often extreme. In techniques such as speech chimeras (Smith, Delgutte, & Oxenham, 2002) and sinewave replication, the acoustic properties of intelligible signals lie beyond the productive capability of a human vocal tract, and the impossibility of such spectra as vocal sound does not evidently block the perceptual organization of the sound as speech. The variation of a spectral envelope can be taken by listeners to be speechlike despite acoustic details that give rise to impressions of gross unnaturalness. Findings of this sort contribute a powerful argument against psychoacoustic explanations of speech perception generally (e.g. Holt, 2005; Lotto & Kluender, 1998; Lotto, Kluender, & Holt, 1997; Toscano & McMurray, 2010), and perceptual organization specifically.

Ordinary subjective experience of speech suggests that perceptual organization is unbidden, for speech seems to pop right out of a nearby commotion. Yet studies reveal that sensory contours, whether simple or complex, form only with attention. In speech, as with simpler contours, the primitive segregation of figure and ground is at stake. Attention permits perceptual analysis to apply to a broadband contour of heterogeneous acoustic composition. Consistent with this axiom, that sensory contours require attention to form, findings with sinewave replicas of utterances show that the perceptual organization of speech requires attention and is not an automatic consequence of a class of sensory effects. This feature differs from the automatically engaged process proposed in strict modular terms by Liberman and Mattingly (1985). With sinewave signals, most subjects fail to notice that concurrent tones can cohere unless they are asked specifically to listen for speech (Remez et al., 1981; also see Liebenthal et al., 2003), indicating that the auditory forms alone do not evoke speech perception. Critically, a listener who is asked to attend to arbitrary tone patterns as if listening to speech fails to report phonetic impressions (Remez et al., 1981), indicating that signal structure as well as phonetic attention are required for the organization and analysis of speech. A neural population code representing the speech spectrum without attention cannot be responsible for both the stable albeit unintegrated auditory form of sinewave speech and the stable integrated coherent contour that is susceptible to phonetic analysis (cf. Engineer et al., 2008). In this regard, general auditory perceptual organization is similar to speech perception in requiring attention for auditory figures to form (e.g. Carlyon et al., 2001). Of course, a natural vocal signal exhibits the phenomenal quality of speech, and this is evidently sufficient to elicit a productive form of attention for perceptual organization to ensue. This premise cautions against the use of passive listening procedures to identify supposed automatic functions of linguistic analysis of speech (e.g. Zevin et al., 2010). Such studies merely fail to control attention. A listener whose attention is free to wander cannot be considered inattentive to the sounds delivered without instruction. In such conditions, performance arguably reflects a mix of cognitive states evoked with attention and vegetative excitation evoked without attention.
