
Auditory phonetic representations in the sensorimotor cortex


From the STG, we turn now to a second cortical area. The ventral sensorimotor cortex (vSMC) is better known for its role in speech production than in speech comprehension (Bouchard et al., 2013). This part of the cortex, near the ventral end of the SMC (see Figure 3.6), contains the primary motor and somatosensory areas, which send motor commands to, and receive touch and proprioceptive information from, the face, lips, jaw, tongue, velum, and pharynx. The vSMC plays a key role in controlling the muscles associated with these articulators, and is further involved in monitoring feedback from the sensory nerves in these areas when we speak. Less widely known is that the vSMC also plays a role in speech perception. We know, for example, that a network including frontal areas becomes more active when the conditions for perceiving speech become more difficult (Davis & Johnsrude, 2003), such as when there is background noise or the speech of multiple talkers overlaps, compared with easy listening conditions in which such distractions are absent. This context‐specific recruitment of speech‐production areas may signal that they play an auxiliary role in speech perception, providing additional computational resources when the STG is overburdened. We might then ask how the vSMC, an auxiliary auditory system primarily dedicated to coordinating the articulation of speech, represents heard speech. Does the vSMC represent overt and heard speech similarly or differently? And is the representation of heard speech in the vSMC similar to or different from that in the STG?

ECoG studies of speech production (Bouchard et al., 2013; Cheung et al., 2016) suggest that place‐of‐articulation features take primacy over manner‐of‐articulation features in the vSMC, which is the reverse of what we described for the STG (Mesgarani et al., 2014). Given that the vSMC contains a map of body parts such as the lips and tongue, it makes sense that this region would represent speech in terms of place‐of‐articulation features rather than manner‐of‐articulation features. But does this representation in the vSMC hold during both speech production and comprehension? Our starting hypothesis might be that, yes, the feature representations in the vSMC will be the same regardless of task. There is even some theory to back this up. For example, there have been proposals, such as the motor theory of speech perception (Liberman et al., 1967; Liberman & Mattingly, 1985) and analysis‐by‐synthesis theory (Stevens, 1960), that view speech perception as an active rather than a passive process. Analysis by synthesis holds that speech perception involves matching what you hear to what your own mouth, and other articulators, would have needed to do to produce that sound. Speech comprehension would therefore involve an active process of covert speech production. Following this line of thought, we might suppose that what the vSMC does when it is engaged in deciphering what your friend is asking you at a noisy cocktail party is, in some sense, the same as what it does when it is used to articulate your reply. Because we know that place‐of‐articulation features take priority over manner‐of‐articulation features in the vSMC during a speech‐production task (i.e. reading consonant–vowel syllables aloud), we might hypothesize that place‐of‐articulation features will similarly take primacy during passive listening. Interestingly, despite this theoretical backing, the prediction turns out to be wrong.
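As a purely illustrative aside, the toy sketch below is one way to express the analysis‐by‐synthesis idea in code; it is not a model taken from the chapter, and the forward model, vectors, and names are all hypothetical placeholders. The listener maintains candidate articulatory plans, runs each through an internal "synthesis" model that predicts its acoustic consequences, and selects the plan whose prediction best matches the incoming sound.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical articulatory plans for three candidate syllables, each encoded
# as a 16-dimensional vector (placeholder values).
plans = {"ba": rng.normal(size=16),
         "da": rng.normal(size=16),
         "ga": rng.normal(size=16)}

# Placeholder forward ("synthesis") model: maps an articulatory plan to a
# predicted 32-dimensional acoustic pattern via a fixed random linear map.
W = rng.normal(size=(16, 32))

def synthesize(plan):
    return plan @ W

def recognize(heard):
    """Analysis by synthesis: return the candidate whose synthesized
    acoustics are closest (in Euclidean distance) to the heard input."""
    errors = {syllable: np.linalg.norm(synthesize(plan) - heard)
              for syllable, plan in plans.items()}
    return min(errors, key=errors.get)

# Simulate hearing a noisy "da" and recovering it by synthesize-and-compare.
heard = synthesize(plans["da"]) + rng.normal(scale=0.1, size=32)
print(recognize(heard))  # prints "da"
```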

When Cheung et al. (2016) examined neural response patterns in the vSMC while subjects listened to recordings of speech, they found that, as in the STG, manner‐of‐articulation features took precedence. In other words, representations in the vSMC were conditioned by task: during speech production the vSMC favored place‐of‐articulation features (Bouchard et al., 2013; Cheung et al., 2016), but during speech comprehension it favored manner‐of‐articulation features (Cheung et al., 2016). As we discussed earlier, the STG is also organized according to manner‐of‐articulation features when subjects listen to speech (Mesgarani et al., 2014). The two areas, STG and vSMC, therefore appear to use a similar type of code when representing heard speech.

To be more concrete, Cheung et al. (2016) recorded ECoG from the STG and vSMC of subjects performing two tasks. One task involved reading aloud from a list of consonant–vowel syllables (e.g. ‘ba,’ ‘da,’ ‘ga’), while the other involved listening to recordings of people producing these syllables. Instead of using hierarchical clustering, as Mesgarani et al. (2014) did in their study of the STG, Cheung et al. (2016) used a dimensionality‐reduction technique called multidimensional scaling (MDS), with the similar goal of describing the structure of phoneme representations in the brain during each task (Figure 3.8). For the speaking task, the dimensionality‐reduced vSMC representations of eight sounds could be linearly separated into three place‐of‐articulation groups: labial /p b/, alveolar /t d s ʃ/, and velar /k g/ (see Figure 3.8, panel D). The same phonemes could not be linearly separated by place of articulation in the listening task (Figure 3.8, panel E); however, they could be linearly separated into another set of groups (Figure 3.8, panel G): voiced plosives /d g b/, voiceless plosives /k t p/, and fricatives /ʃ s/. These are the same manner‐of‐articulation and voicing features that characterize the neural responses in the STG to heard speech (Figure 3.8, panel F). Again, the implication is that the vSMC has two codes for representing speech, suggesting either that there are two distinct but anatomically intermingled neural populations in the vSMC, or that the same population of neurons is capable of operating in two very different representational modes. Unfortunately, the spatial resolution of ECoG electrodes is still too coarse to resolve this ambiguity, so other experimental techniques will be needed. For now, we can only say that during speech production the vSMC uses a feature analysis that emphasizes place‐of‐articulation features, whereas during speech comprehension it uses a feature analysis that instead emphasizes manner and voicing features. An intriguing possibility is that the existence of similar representations for heard speech in the STG and the vSMC plays an important role in the communication, or connectivity, between distinct cortical regions – a topic we touch on in the next section.
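To make the general shape of such an analysis concrete, the minimal sketch below is not Cheung et al.'s (2016) pipeline but an illustration of the idea: per‐phoneme neural response patterns (here random placeholder data standing in for electrode responses) are embedded in two dimensions with MDS, and a linear classifier is then used to ask whether a given grouping of phonemes, by place or by manner and voicing, can be separated by straight lines in that space. The array sizes, labels, and scoring choice are all illustrative assumptions.

```python
import numpy as np
from sklearn.manifold import MDS
from sklearn.svm import LinearSVC

# Eight consonants, matching the groupings discussed in the text.
phonemes = ["p", "b", "t", "d", "s", "sh", "k", "g"]

# Two competing feature groupings for the same phonemes.
place = ["labial", "labial", "alveolar", "alveolar",
         "alveolar", "alveolar", "velar", "velar"]
manner_voicing = ["voiceless_plosive", "voiced_plosive", "voiceless_plosive",
                  "voiced_plosive", "fricative", "fricative",
                  "voiceless_plosive", "voiced_plosive"]

# Placeholder data: one response pattern per phoneme across 64 hypothetical
# electrodes (in a real analysis these would be averaged neural responses).
rng = np.random.default_rng(0)
responses = rng.normal(size=(len(phonemes), 64))

# Project the phoneme-by-electrode patterns into two dimensions with MDS,
# so that nearby points correspond to similarly represented phonemes.
embedding = MDS(n_components=2, random_state=0).fit_transform(responses)

def separability(labels):
    """Training accuracy of a linear classifier on the 2-D embedding:
    a score of 1.0 means the groups can be split by straight lines."""
    clf = LinearSVC().fit(embedding, labels)
    return clf.score(embedding, labels)

print("place grouping:         ", separability(place))
print("manner/voicing grouping:", separability(manner_voicing))
```

With real data, the comparison of these two scores across tasks is what distinguishes a place‐based code (speaking) from a manner‐ and voicing‐based code (listening).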


Figure 3.8 Feature‐based representations in the human sensorimotor cortex. (a) and (b) show the most significant electrodes (gray dots) for listening and speaking tasks. (c) presents a feature analysis of the consonant phonemes used in the experiments. The left phoneme in each pair is unvoiced and the right phoneme is voiced (e.g. /p/ is unvoiced and /b/ is voiced). (d–g) are discussed in the main text; each panel shows a low‐dimensional projection of the neural data where distance between phoneme representations is meaningful (i.e. phonemes that are close to each other are represented similarly in the neural data). The dotted lines show how groups of phonemes can be linearly separated (or not) according to place of articulation, manner of articulation, and voicing features.

Source: Cheung et al., 2016. Licensed under CC BY 4.0.

