
The acoustic theory of speech perception


Despite the variability in the speech input, it is possible that more generalized acoustic patterns can be derived that are common to the features of sounds, patterns that override the fine acoustic detail obtained from the analysis of individual components of the signal, such as burst frequency or the frequency of the onset of formant transitions. The question is where in the signal such properties might reside and how they can be identified.

One hypothesis that became the focus of the renewed search for invariant acoustic cues was that more generalized patterns could be derived at points where there are rapid changes in the spectrum. These landmarks serve as points of stability between transitions from one articulatory state to another (Stevens, 2002). Once the landmarks were located, the next step was to identify the acoustic parameters that provided stable patterns associated with features and, ultimately, phonetic categories. To this end, research focused on the spectral patterns that emerged from integrating amplitude and frequency parameters within a window of analysis, rather than on portions of the speech signal that had been identified on the sound spectrogram and treated as distinct acoustic events.
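To make the landmark idea concrete, the sketch below flags frames where the spectrum changes rapidly from one analysis frame to the next. It is a minimal illustration, not the detection procedure of Stevens (2002); the frame size, the spectral-flux measure, and the median-based threshold are all assumptions chosen for simplicity.

```python
import numpy as np
from scipy.signal import stft

def spectral_landmarks(signal, sr, frame_ms=10, threshold=2.0):
    """Flag frames of rapid spectral change as candidate landmarks.

    Illustrative sketch only: computes a short-time log-magnitude
    spectrogram, measures frame-to-frame spectral flux, and marks
    frames whose flux exceeds `threshold` times the median flux.
    """
    nperseg = int(sr * frame_ms / 1000)
    _, _, Z = stft(signal, fs=sr, nperseg=nperseg)
    logmag = np.log(np.abs(Z) + 1e-10)
    # Euclidean distance between successive spectral frames.
    flux = np.sqrt((np.diff(logmag, axis=1) ** 2).sum(axis=0))
    return np.where(flux > threshold * np.median(flux))[0]
```

Windows anchored at each flagged frame would then supply the regions within which stable feature-related patterns are sought.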

The first feature examined in this way was place of articulation in stop consonants, the feature that had failed to show invariance in the Haskins research. In a series of papers, Stevens and Blumstein explored whether the shape of the spectrum in the 25‐odd ms at consonant release could independently characterize labial, alveolar, and velar stop consonants across speakers and vowel contexts. Labial consonants were defined in terms of a flat or falling spectral shape, alveolar consonants in terms of a rising spectral shape, and velar consonants in terms of a compact spectral shape with one peak dominating the spectrum (Stevens & Blumstein, 1978). Acoustic analysis of the consonants [p t k b d g], produced by six speakers in the context of the vowels [i e a o u], classified the place of articulation of the stimuli with 85 percent accuracy (Blumstein & Stevens, 1979). Follow‐up perceptual experiments showed that listeners could identify place of articulation (as well as the following vowel) when presented with only the first 20 ms from the onset of the burst, indicating that they were sensitive to the spectral shape at stop consonant onset (Blumstein & Stevens, 1980; see also Chang & Blumstein, 1981).
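The spirit of this template-matching analysis can be conveyed with a toy classifier. The sketch below reduces the onset spectrum to a linear tilt and a peak-dominance measure; these measures and their thresholds are hypothetical simplifications for illustration, not the actual templates of Stevens and Blumstein (1978).

```python
import numpy as np

def classify_place(burst, sr, window_ms=25):
    """Toy place-of-articulation guess from gross onset spectral shape.

    Sketch under simplifying assumptions: 'compact' is approximated by
    one peak dominating the spectrum, 'rising' vs. 'flat/falling' by
    the sign of a linear fit to the log spectrum. The thresholds are
    illustrative, not the published template values.
    """
    n = int(sr * window_ms / 1000)
    spectrum = np.abs(np.fft.rfft(burst[:n]))
    freqs = np.fft.rfftfreq(n, d=1 / sr)
    log_spec = 20 * np.log10(spectrum + 1e-10)

    tilt = np.polyfit(freqs, log_spec, 1)[0] * 1000   # dB per kHz
    dominance = spectrum.max() / (spectrum.mean() + 1e-10)

    if dominance > 10:          # compact: one dominant peak -> velar
        return "velar"
    return "alveolar" if tilt > 0 else "labial"  # rising vs. flat/falling
```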

Invariant properties were identified for additional phonetic features, giving rise to a theory of acoustic invariance hypothesizing that, despite the variability in the acoustic input, more generalized patterns provide the listener with a stable framework for perceiving the phonetic features of language (Blumstein & Stevens, 1981; Stevens & Blumstein, 1981; see also Kewley‐Port, 1983; Nossair & Zahorian, 1991). These features include those signifying manner of articulation for [stops], [glides], [nasals], and [fricatives] (Kurowski & Blumstein, 1984; Mack & Blumstein, 1983; Shinn & Blumstein, 1984; Stevens & Blumstein, 1981). Additionally, research has shown that if the auditory speech input is normalized for speaker and vowel context, generalized patterns can be identified for both stop (Johnson, Reidy, & Edwards, 2018) and fricative place of articulation (McMurray & Jongman, 2011).
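A minimal sketch of the normalization step these studies describe might look as follows; the z-scoring-within-speaker scheme is an assumption chosen for illustration, not the specific procedure of McMurray and Jongman (2011) or Johnson, Reidy, and Edwards (2018).

```python
import numpy as np

def normalize_by_speaker(features, speaker_ids):
    """Z-score acoustic measurements within each speaker.

    Expressing each measurement relative to that speaker's own mean
    and spread removes much of the between-speaker variability before
    generalized patterns are extracted. Illustrative sketch only.
    """
    features = np.asarray(features, dtype=float)
    speaker_ids = np.asarray(speaker_ids)
    out = np.empty_like(features)
    for spk in np.unique(speaker_ids):
        idx = speaker_ids == spk
        mu = features[idx].mean(axis=0)
        sd = features[idx].std(axis=0) + 1e-10
        out[idx] = (features[idx] - mu) / sd
    return out
```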

A new approach to the question of invariance provides perhaps the strongest support for the notion that listeners extract global invariant acoustic properties in processing the phonetic categories of speech. Pioneering work from the lab of Eddie Chang has examined neural responses to speech using electrocorticography (ECoG). Here, intracranial electrophysiological recordings are made in patients with intractable seizures, with the goal of identifying the site of seizure activity. A grid of electrodes is placed on the surface of the brain and neural activity is recorded directly, with good spatial and temporal resolution. In a recent study (Mesgarani et al., 2014), six participants listened to 500 natural speech sentences produced by 400 speakers. The sentences were segmented into sequences of phonemes. Results showed, not surprisingly, responses to speech in the posterior and mid‐superior temporal gyrus, consistent with fMRI studies showing that the perception of speech recruits temporal neural structures adjacent to the primary auditory areas (for reviews see Price, 2012; Scott & Johnsrude, 2003). Critically important were the patterns of activity that emerged. In particular, Mesgarani et al. (2014) showed selective responses of individual electrodes to features defining natural classes in English. That is, selective responses occurred for stop consonants including [p t k b d g], fricative consonants [s z f š θ], and nasals [m n ŋ]. That these patterns emerged across speakers, vowels, and phonetic contexts indicates that the inherent variability in the speech stream was essentially averaged out, leaving generalized patterns common to the features representing manner of articulation (see also Arsenault & Buchsbaum, 2015). It is unclear whether the patterns extracted are the same as those identified in the Stevens and Blumstein studies described above. What is clear, however, is that the basic representational units corresponding to these features are acoustic in nature.
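The logic of this feature-selective clustering can be illustrated schematically. The sketch below assumes a matrix of trial-averaged electrode responses (electrodes by phonemes) and clusters the phonemes by the similarity of their response profiles; it is a cartoon of the analysis, not the published pipeline of Mesgarani et al. (2014), and the random matrix stands in for real ECoG data.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Hypothetical input: each column holds the trial-averaged neural
# response of every electrode to one phoneme.
phonemes = ["p", "t", "k", "b", "d", "g", "s", "z", "f", "m", "n"]
responses = np.random.rand(64, len(phonemes))  # placeholder for real data

# Cluster phonemes by response-profile similarity. If manner features
# drive the responses, stops, fricatives, and nasals should separate.
Z = linkage(responses.T, method="ward")
labels = fcluster(Z, t=3, criterion="maxclust")
for c in sorted(set(labels)):
    print(f"cluster {c}:", [p for p, l in zip(phonemes, labels) if l == c])
```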

That responses in the temporal lobe are acoustic in nature is not surprising. A more interesting question is: what are the patterns of response to speech in frontal areas? As discussed earlier, some fMRI and TMS studies showed frontal activation during the perception of speech. What is not clear, however, is what those neural responses reflect: sensitivity to the acoustic parameters of the signal, or to the articulatory gestures giving rise to the acoustic patterns?

In another notable study from the Chang lab, Cheung and colleagues (2016) used ECoG to examine neural responses to speech perception in superior temporal gyrus sites, as in Mesgarani et al. (2014). Critically, they also examined neural responses to both speech perception and speech production in frontal areas, in particular in the motor cortex, the ventral half of the lateral sensorimotor cortex (vSMC). Nine participants listened to and produced the consonant–vowel (CV) syllables [pa ta ka ba da ga sa ša] in separate tasks and, in a third task, passively listened to portions of a natural speech corpus (TIMIT) consisting of 499 sentences spoken by a total of 400 male and female speakers. As expected, for production, responses in the vSMC reflected the somatotopic organization of the motor cortex, with distinct clustering as a function of place of articulation; that is, separate clusters emerged reflecting the different motor gestures used to produce labial, alveolar, and velar consonants.

Results of the passive listening task replicated Mesgarani et al.’s (2014) findings, showing selective responses in the superior temporal gyrus (STG) as a function of manner of articulation; that is, the stop consonants clustered together and the fricative consonants clustered together. Importantly, a similar pattern emerged in the vSMC: neural activity clustered in terms of manner of articulation, although, interestingly, the consonants within each cluster did not group as closely as they did in the STG. Thus, frontal areas are indeed activated in speech perception; however, this activation appears to correspond to the acoustic representation extracted from the auditory input rather than to a transformation of that input into articulatory, motor, or gestural representations. These findings, while preliminary, are provocative: they suggest that the perceptual representation of features, even in motor areas, is acoustic or auditory in nature, not articulatory or motor. Additional research is required to examine neural responses in frontal areas to auditory speech input covering the full consonant inventory across vowel contexts, phonetic positions, and speakers. The question is: when consonant, vowel, or speaker variability in the auditory input is increased, will neural responses in frontal areas pattern with spectral and temporal features or with gestural features?
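The observation that vSMC clusters were looser than STG clusters could be quantified with a standard cluster-quality measure. The silhouette comparison below is a hypothetical illustration of that comparison, not Cheung et al.'s (2016) analysis; the random matrices stand in for real per-region responses.

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Manner labels for the eight CV syllables [pa ta ka ba da ga sa ša]:
# 0 = stop, 1 = fricative.
manner = np.array([0, 0, 0, 0, 0, 0, 1, 1])
stg = np.random.rand(8, 64)   # placeholder: syllable x electrode, STG
vsmc = np.random.rand(8, 64)  # placeholder: syllable x electrode, vSMC

# A higher silhouette score means tighter, better-separated manner
# clusters; the finding above would correspond to STG > vSMC here.
print("STG :", silhouette_score(stg, manner))
print("vSMC:", silhouette_score(vsmc, manner))
```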

