The Handbook of Speech Perception, multiple authors (page 20)

Organization by coordinate variation

A classic understanding of the perception of speech derives from study of the acoustic correlates of phonetic contrasts and the physical and articulatory means by which they are produced (reviewed by Raphael, Chapter 22; also, see Fant, 1960; Liberman et al., 1959; Stevens & House, 1961). In addition to calibrating the perceptual response to natural samples of speech, researchers also used acoustic signals produced synthetically in detailed psychoacoustic studies of phonetic identification and differentiation. In typical terminal analog speech synthesis, the short‐term spectra characteristic of the natural samples are preserved, lending the synthesis a combination of natural vocal timbre and intelligibility (Stevens, 1998). Acoustic analysis of speech, and synthesis that allows for parametric variation of speech acoustics, have been important for understanding the normative aspects of perception, that is, the relation between the typical or likely auditory form of speech sounds encountered by listeners and the perceptual analysis of phonetic properties (Diehl, Molis & Castleman, 2001; Lindblom, 1996; Massaro, 1994).

However, a singular focus on statistical distributions of natural samples and on synthetic idealizations of natural speech discounts the adaptability and versatility of speech perception, and deflects scientific attention away from the properties of speech that are potentially relevant to understanding perceptual organization. Because grossly distorted speech remains intelligible (e.g. Miller, 1946; Licklider, 1946) when many of the typical acoustic correlates are absent, it is difficult to sustain the hypothesis that finding and following a speech stream crucially depends on meticulous registration of the brief and numerous acoustic correlates of phonetic contrasts described in classic studies. But, if the natural acoustic products of vocalization do not determine the perceptual organization and analysis of speech, what does?

An alternative to this conceptualization was prompted by the empirical use of a technique that combines digital analysis of speech spectra and digital synthesis of time‐varying sinusoids (Remez et al., 1981). This research has revealed the perceptual effectiveness of acoustic patterns that exhibit the gross spectro‐temporal characteristics of speech without incorporating the fine acoustic structure of vocally produced sound. Perceptual research with these acoustic materials and their relatives (noise‐band vocoded speech: Shannon et al., 1995; acoustic chimeras: Smith, Delgutte, & Oxenham, 2002; Remez, 2008) has permitted an estimate of a listener’s sensitivity to the time‐varying patterns of speech spectra independent of the sensory elements of which they are composed.

The premise of sinewave replication is simple, though in practice it is as laborious as other forms of copy synthesis (Remez et al., 2011). Three or four tones, each approximating the center frequency and amplitude of an oral, nasal, or fricative resonance, are created to imitate the coarse‐grain attributes of a speech sample. Lacking the momentary aperiodicities, harmonic spectra, broadband formants, and regular pulsing of natural and most synthetic speech, a sinewave replica of an utterance differs acoustically and qualitatively from speech while remaining intelligible. A spectrogram of a sinewave sentence is shown in the bottom panel of Figure 1.2; a comparison of short‐term spectra of natural speech and both synthetic and sinewave imitations is shown in Figure 1.3.
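The core of the procedure just described, summing a few tones whose frequency and amplitude follow estimated resonance tracks, can be sketched in Python with NumPy. This is a minimal illustration under stated assumptions, not the synthesis procedure used in the cited studies: the function name `sinewave_replica`, the frame rate, the sample rate, and the example track values are all hypothetical, and the formant tracks are assumed to come from a separate analysis of a natural utterance.

```python
import numpy as np

def sinewave_replica(freq_tracks, amp_tracks, frame_rate=100.0, sample_rate=16000):
    """Sum time-varying sinusoids that follow frame-wise tracks.

    freq_tracks, amp_tracks: arrays of shape (n_tones, n_frames), with
    frequencies in Hz and linear amplitudes. Both are assumed to be the
    output of an earlier formant analysis (not shown here).
    """
    freq_tracks = np.asarray(freq_tracks, dtype=float)
    amp_tracks = np.asarray(amp_tracks, dtype=float)
    n_tones, n_frames = freq_tracks.shape

    # Time stamps of the analysis frames and of the output samples.
    frame_times = np.arange(n_frames) / frame_rate
    n_samples = int(round((n_frames - 1) / frame_rate * sample_rate)) + 1
    sample_times = np.arange(n_samples) / sample_rate

    signal = np.zeros(n_samples)
    for freq, amp in zip(freq_tracks, amp_tracks):
        # Interpolate the coarse tracks up to the audio sample rate.
        f = np.interp(sample_times, frame_times, freq)
        a = np.interp(sample_times, frame_times, amp)
        # Integrate instantaneous frequency to obtain a continuous phase,
        # so that frequency glides do not introduce clicks.
        phase = 2.0 * np.pi * np.cumsum(f) / sample_rate
        signal += a * np.sin(phase)
    return signal

# Hypothetical three-tone pattern: 0.5 s of gliding "formant" frequencies.
frames = 51
tracks_hz = np.vstack([
    np.linspace(300.0, 700.0, frames),    # tone following a first resonance
    np.linspace(2200.0, 1200.0, frames),  # tone following a second resonance
    np.linspace(2900.0, 2500.0, frames),  # tone following a third resonance
])
amps = np.vstack([np.full(frames, 0.5), np.full(frames, 0.3), np.full(frames, 0.1)])
replica = sinewave_replica(tracks_hz, amps)
```

Because each component is a single sinusoid, the result has none of the harmonic structure, broadband formants, or glottal pulsing of natural speech; only the coarse time-varying pattern of the tracks is preserved.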

It is significant that three or four tones reproducing a natural formant pattern evoke an experience in a naive listener of several concurrent whistles changing in pitch and loudness, and do not automatically elicit an impression of speech. The listener’s attention is free to follow the course of the auditory form of each component tone. Certainly, this aspect of a sinewave pattern is salient auditorily, and little of the raw quality prompts attention to the tones as a single compound contour. Studies show that listeners are well able to attend to individual tone components and to focus on the pattern of pitch changes each evokes over the run of a few seconds (Remez & Rubin, 1984, 1993). In other words, the immediate experience of the listener is accurately predicted by a generic auditory account, because acoustic elements that change frequency at different rates to different extents, onsetting and offsetting at different moments in different frequency ranges, are dissimilar along many dimensions that specify separate perceptual streams according to gestalt principles.

Once instructed that the tones compose synthetic speech, a listener readily reports linguistic properties as if hearing the original natural utterance on which the sinewave replica was modeled. If attention to a complex, broadband contour is characteristic of the perceptual organization of speech, its sufficient condition is met in the absence of natural acoustic vocal products. Performance levels reported with this kind of copy synthesis have varied with the proficiency of the synthesis, although it has often been possible to achieve very good intelligibility, rivalling natural speech (for instance, Remez et al., 2008). Within this range of performance levels, these acoustic conditions pose a crucial test of a gestalt‐derived account of perceptual organization, for a perceiver must integrate the tones in order to compose a single sensory contour segregated from the background, ready to analyze for the linguistic properties borne on the pattern of the signal. Several tests support this claim of true integration preliminary to analysis.

In direct assessments, the intelligibility of sinewave replicas of speech exceeded intelligibility predicted from the presentation of individual tones (Remez et al., 1981, 1987, 1994). This superadditive performance is evidence of integration, and it persisted even when the tones came from separate spatial sources, violating similarity in location (Remez et al., 1994; see also Broadbent & Ladefoged, 1957). In combining the individual tones into a single time‐varying coherent stream, however, this complex organization, which is necessary for phonetic analysis, does not exclude an auditory organization as independently resolvable streams of tones (Remez & Rubin, 1984, 1993; Roberts, Summers, & Bailey, 2015). In fact, the perceiver’s resolution of the pitch contour associated with the frequency pattern of tonal constituents is acute whether or not the fusion of the tones supporting phonetic perception occurs (Remez et al., 2001). On this evidence rests the claim that sinewave replicas are bistable, exhibiting two simultaneous and exclusive organizations.


Figure 1.3 A comparison of the short‐term spectrum of natural speech (top); terminal analog synthetic speech (middle); and sinewave replica (below). Note the broadband resonances and harmonic spectra in natural and synthetic speech, in contrast to the sparse, nonharmonic spectrum of the three tones.

Even if the sensory causes of these perceptual impressions were strictly parallel, the bistable occurrence of auditory and phonetic perceptual organization is not amenable to further simplification. A sinewave replica of speech allows two organizations, much as celebrated cases of visual bistability do: the duck–rabbit figure, Woodworth’s equivocal staircase, Rubin’s vase, and Necker’s cube. Unlike the visual cases of alternating stability, the bistability that occurs in the perception of sinewave speech is simultaneous. A conservative description of these findings is that an organization of the auditory properties of sinewave signals occurs according to gestalt‐derived principles that promote segregation of the tones into separate contours. Phonetic perceptual analysis fails to apply or to succeed under that organization. However, the concurrent variation of the tones also satisfies a non‐gestalt principle of coordinate auditory variation despite local dissimilarities, and this promotes integration of the components into a single broadband stream. This organization, binding diverse components into a single complex sensory contour, is susceptible to phonetic analysis.
