Feature representations: Articulatory or acoustic

The motor theory of speech perception

We turn now to an unresolved question: What is the nature of feature representations? The problem turns on variability in the phonetic input. As indicated at the beginning of this chapter, many sources of variability shape the speech input that the listener ultimately receives. The question is whether, despite this variability, there are patterns (articulatory or acoustic) that provide a stable mapping from acoustic input to features and ultimately to phonetic categories. At this point, no one has solved this invariance problem; that is, no one has explained how a variable input is transformed into a constant feature or phonetic category representation. Even if one were to assume that lexical representations are episodic, containing fine‐grained acoustic detail that the listener uses, as has been proposed by Goldinger (1998) and others, such a view still sidesteps the question. It does not elucidate the nature of the mapping from input to sublexical or lexical representations, and thus fails to explain how the listener knows that a given stimulus belongs to one phonetic category and not another; that is, what property of the signal tells the listener that the input maps onto the lexical representation of pear and not bear, or that the initial consonant is a variant of [p] and not [b].

The pioneering research at Haskins Laboratories in the 1950s attempted to solve the invariance problem. It is important to understand the historical context in which this research was conducted. At that time, state‐of‐the‐art speech technology consisted of the sound spectrograph and the pattern playback (see Cooper, 1955; Koenig, Dunn, & Lacy, 1946). The sound spectrograph provided a visual graph of the Fourier transform of the speech input, with time represented on the abscissa, frequency on the ordinate, and amplitude by the darkness of the various frequency bands. The pattern playback converted a visual representation of the sound spectrogram back into an auditory output (see Studdert‐Kennedy & Whalen, 1999, for a review). Thus, by examining the patterns of speech derived from sound spectrograms, researchers could hypothesize that particular portions of the signal, or cues, corresponded to particular features of sounds or segments (phonetic categories). Using the pattern playback, these potential cues were then systematically varied and presented to listeners for perception. Results reported in the seminal paper by Liberman et al. (1967) showed clearly that phonetic segments occur in context and cannot be defined as separate “beads on a string.” Indeed, the context ultimately influences the acoustic manifestation of a particular phonetic segment, resulting in acoustic differences for the same features of sound. For example, sound spectrograms of stop consonants show a burst and formant transitions, which potentially serve as cues to place of articulation in stop consonants. Varying the onset frequency of the burst or of the second formant transition and presenting the resulting stimuli to listeners provided a means of systematically assessing the perceptual role these cues played. Results showed that there was no systematic relation between a particular burst frequency or second formant transition onset and place of articulation in stop consonants (Liberman, Delattre, & Cooper, 1952). For example, there was no constant burst frequency or formant transition onset that signaled [d] in the syllables [di] and [du]. Rather, the acoustic manifestation of sound segments (and of the features that underlie them) is influenced by the acoustic parameters of the phonetic contexts in which they occur.
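To make the analysis–synthesis logic concrete, the following is a minimal sketch, not from the chapter and far simpler than the Haskins pattern playback: it synthesizes a two‐formant syllable continuum in which only the onset frequency of the second formant (F2) is varied, then displays a spectrogram of one step with time on the abscissa, frequency on the ordinate, and amplitude as darkness. All frequencies, durations, and step counts are illustrative assumptions, and the frequency‐modulated sinusoids merely stand in for true formant resonances.

```python
# Illustrative sketch of the pattern-playback logic: vary one acoustic
# cue (F2 onset) in steps while holding everything else constant.
import numpy as np
from scipy.signal import spectrogram
import matplotlib.pyplot as plt

fs = 16000                                   # sampling rate (Hz), assumed
t = np.arange(int(fs * 0.3)) / fs            # 300 ms syllable

def formant_track(onset_hz, steady_hz, transition_s=0.05):
    """Linear transition from an onset frequency to a steady state."""
    f = np.full_like(t, steady_hz)
    n = int(fs * transition_s)
    f[:n] = np.linspace(onset_hz, steady_hz, n)
    return f

def synthesize(f2_onset_hz):
    """Two frequency-modulated sinusoids standing in for F1 and F2."""
    f1 = formant_track(400.0, 700.0)         # F1 held constant
    f2 = formant_track(f2_onset_hz, 1200.0)  # F2 onset is the varied cue
    phase1 = 2 * np.pi * np.cumsum(f1) / fs  # integrate frequency -> phase
    phase2 = 2 * np.pi * np.cumsum(f2) / fs
    return np.sin(phase1) + 0.5 * np.sin(phase2)

# An eight-step continuum of F2 onset frequencies
continuum = [synthesize(hz) for hz in np.linspace(900, 1800, 8)]

# Spectrogram of one step, in the spirit of the sound spectrograph:
# time on the abscissa, frequency on the ordinate, amplitude as darkness
f, tt, S = spectrogram(continuum[0], fs=fs,
                       nperseg=int(0.025 * fs),   # 25 ms windows
                       noverlap=int(0.015 * fs))  # 10 ms hop
plt.pcolormesh(tt, f, 10 * np.log10(S + 1e-12), cmap="gray_r")
plt.xlabel("Time (s)")
plt.ylabel("Frequency (Hz)")
plt.show()
```

Presenting each step of such a continuum to listeners and recording their identifications is, in outline, the method by which the playback experiments assessed the perceptual role of a candidate cue.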

Liberman et al. (1967) recognized that listener judgments were nonetheless consistent. What, then, allowed the various acoustic patterns to be perceived as the same consonant? They proposed the motor theory of speech perception, hypothesizing that what provided the stability in the variable acoustic input was the production of the sounds, that is, the articulatory gestures giving rise to them (for reviews see Liberman et al., 1967; Fowler, 1986; Galantucci, Fowler, & Turvey, 2006; Fowler, Shankweiler, & Studdert‐Kennedy, 2016). In this view, despite their acoustic variability, constant articulatory gestures provided phonetic category stability – [p] and [b] are both produced with the stop closure at the lips, [t] and [d] with the closure at the alveolar ridge, and [k] and [g] with the closure at the velum.

It is worth noting that even the motor theory fails to specify the nature of the mapping from the variable acoustic input to a particular articulatory gesture. That is, the theory does not specify what in the acoustic signal allows the transformation of the input into a particular motor pattern. In this sense, the motor theory of speech perception did not solve the invariance problem. That said, there are many proponents of the motor (gesture) theory of speech perception (see Fowler, Shankweiler, & Studdert‐Kennedy, 2016, for a review), and recently evidence from cognitive neuroscience has been used to provide support (see D’Ausilio, Craighero, & Fadiga, 2012, for a review). In particular, a number of studies have shown that the perception of speech not only activates auditory areas of the brain (temporal structures) but also, under some circumstances, activates motor areas involved in speech production. For example, using fMRI, activation has been shown in motor areas during passive listening to syllables, the same areas activated when producing those syllables (Wilson et al., 2004), and greater activation has been shown in these areas for nonnative speech sounds than for native speech sounds (Wilson & Iacoboni, 2006). Transcranial magnetic stimulation (TMS) studies showed a change in the perception of labial stimuli near the phonetic boundary of a labial–alveolar continuum after stimulation of motor areas involving the lips; no perceptual changes occurred for continua not involving labial stimuli, for example, alveolar–velar continua (Möttönen & Watkins, 2009; Fadiga et al., 2002). Nonetheless, activation of motor areas during speech perception in both the fMRI and TMS studies appears to occur under challenging listening conditions, such as when the acoustic stimuli are of poor quality, when sounds are not easily mapped to a native‐language inventory, or during the perception of boundary stimuli, but not when the stimuli are good exemplars. These findings raise the possibility that frontal areas are recruited when additional neural resources are necessary, and thus are not core areas recruited in the perception of speech (see Schomers & Pulvermüller, 2016, for a contrasting view).

It would not be surprising to see activation of motor areas during the perception of speech, as listeners are also speakers, and speakers perceive the acoustic realization of their own productions. A neural circuit bridging temporal and motor areas would therefore be expected (see Hickok & Poeppel, 2007). However, what needs to be shown in support of the motor (gesture) theory of speech is that the representations underlying speech perception are motoric or gestural. It is, of course, possible that there are gestural as well as acoustic representations corresponding to the features of speech. At a minimum, however, to support the motor theory of speech, gestures need to be identified that provide a perceptual standard for mapping from auditory input to phonetic feature. As we will see shortly, the evidence to date does not support such a view (for a broad discussion challenging the motor theory of speech perception, see Lotto, Hickok, & Holt, 2009).
