The Handbook of Speech Perception
General examples of supramodal information
While some progress has been made in identifying the detailed ways in which information takes the same specific form across modalities, more progress has been made in establishing the general ways in which the informational forms are similar. In the previous version of this chapter, it was argued that both auditory and visual speech show an important primacy of time‐varying information (Rosenblum, 2005; see also Rosenblum, 2008). At the time that chapter was written, many descriptions of visual speech information were based on static facial feature information, and still images were often used as stimuli. Since then, almost all methodological and conceptual interpretations of visual speech information have incorporated a critical dynamic component (e.g. Jesse & Bartoli, 2018; Jiang et al., 2007).
This contemporary emphasis on time‐varying information exists in both the behavioral and the neurophysiological research. A number of studies have examined how dynamic facial dimensions are extracted and stored for purposes of both phonetic and indexical perception (for a review, see Jesse & Bartoli, 2018). Other studies have shown that moment‐to‐moment visibility of articulator movements (as conveyed through discrete facial points) is highly predictive of lip‐reading performance (e.g. Jiang et al., 2007). These findings suggest that kinematic dimensions provide highly salient information for lip‐reading (Jiang et al., 2007). Other research has examined the neural mechanisms activated when perceiving dynamic speech information. For example, there is evidence that the mechanisms involved during perception of speech from isolated kinematic (point‐light) displays differ from those involved in recognizing speech from static faces (e.g. Santi et al., 2003). At the same time, brain reactivity to the isolated motion of point‐light speech does not qualitatively differ from reactivity to normal (fully illuminated) speaking faces (Bernstein et al., 2011). These neurophysiological findings are consistent with the primacy of time‐varying visible speech dimensions, which, in turn, is analogous to the same primacy in audible speech (Rosenblum, 2005).
A second general way in which auditory and visual speech information takes a similar form is in how it interacts with – and informs about – indexical properties. As discussed in the earlier chapter, there is substantial research showing that both auditory and visual speech functions make use of talker information to facilitate phonetic perception (for reviews, see Nygaard, 2005; Rosenblum, 2005). It is easier to understand speech from familiar speakers (e.g. Borrie et al., 2013; Nygaard, 2005), and easier to lip‐read from familiar faces, even for observers who have no formal lip‐reading experience (e.g. Lander & Davies, 2008; Schweinberger & Soukup, 1998; Yakel, Rosenblum, & Fortier, 2000).
In these talker‐facilitation effects, it could be that an observer’s phonetic perception is facilitated by their familiarity with the separate vocal and facial characteristics provided by each modality. However, research conducted in our lab suggests that perceivers may also gain experience with the deeper, supramodal talker dimensions available across modalities (Rosenblum, Miller, & Sanchez, 2007; Sanchez, Dias, & Rosenblum, 2013). Our research shows that the talker experience gained through one modality can be shared across modalities to facilitate phonetic perception in the other. For example, becoming familiar with a talker by lip‐reading them (without sound) for one hour allows a perceiver to then better understand that talker’s auditory speech (Rosenblum, Miller, & Sanchez, 2007). Conversely, listening to the speech of a talker for one hour allows a perceiver to better lip‐read from that talker (Sanchez, Dias, & Rosenblum, 2013). Interestingly, this crossmodal talker facilitation works for both old words (perceived during familiarization) and new words, suggesting that the familiarity is not contained in specific lexical representations (Sanchez, Dias, & Rosenblum, 2013). Instead, the learned supramodal dimensions may be based on talker‐specific phonetic information contained in the idiolect of the perceived talker (e.g. Remez, Fellowes, & Rubin, 1997; Rosenblum et al., 2002).
This interpretation can also explain our finding that learning to identify talkers can be shared across modalities (Simmons et al., 2015). In this demonstration, idiolectic information was isolated visually through a point‐light technique, and audibly through sinewave resynthesis (e.g. Remez et al., 1997; Rosenblum, Yakel, & Baseer, 2002). With these methods we observed that experience of learning to recognize talkers through point‐light displays transfers to allow better recognition of the same speakers heard in sinewave sentences. No doubt, our findings are related to past observations that perceivers can match a talker’s voice and speaking face, even when both signals are rendered as isolated phonetic information (sinewave speech and point‐light speech; Lachs & Pisoni, 2004). In all of these examples, perceivers may be learning the particular idiolectic properties of talkers’ articulation, which can be informed by both auditory and visual speech information.
We have termed this interpretation of these findings the supramodal learning hypothesis. The hypothesis simply argues that part of what the speech function learns through experience is the supramodal properties related to a talker’s articulation. Because these articulatory properties are distal in nature, experience with learning in one modality can be shared across modalities to support crossmodal talker facilitation, learning, and matching.
We further argue that the supramodal learning hypothesis helps explain bimodal training benefits recently reported in the literature. Bimodal training benefits occur when an observer is able to better understand degraded auditory speech after first being trained with congruent visual speech added to the degraded signal (Bernstein et al., 2013; Bernstein, Eberhardt, & Auer, 2014; Eberhardt, Auer, & Bernstein, 2014; Kawase et al., 2009; Lidestam et al., 2014; Moradi et al., 2019; Pilling & Thomas, 2011; but see Wayne & Johnsrude, 2012). For example, vocoded auditory speech is easier to understand on its own if a perceiver is first trained to listen to vocoded speech while seeing congruent visual speech. Bimodal training effects are also known to facilitate: (1) talker recognition from auditory speech (Schall & von Kriegstein, 2014; Schelinski, Riedel, & von Kriegstein, 2014; Sheffert et al., 2002; von Kriegstein et al., 2008; von Kriegstein & Giraud, 2006); (2) talker‐familiarity effects; (3) language development (Teinonen et al., 2008); and (4) second‐language learning (Hardison, 2005; Hazan et al., 2005). (There are also many examples of bimodal training benefits outside of speech perception suggesting that it may be a general learning strategy of the brain (e.g. Shams et al., 2011).)
Importantly, neurophysiological correlates to many of these effects have revealed mechanisms that can be modulated with bimodal learning. It has long been known that the pSTS responds when observers are asked to report the speech they either see or hear. More recent research suggests that this activation is enhanced when a perceiver sees or hears a talker with whom they have some audiovisual experience (von Kriegstein & Giraud, 2006; von Kriegstein et al., 2005). If observers are tasked, instead, with identifying the voice of a talker, they show activation in an area associated with face recognition (fusiform face area; von Kriegstein et al., 2008; von Kriegstein & Giraud, 2006). This activation will also be enhanced by prior audiovisual exposure to the talker (von Kriegstein & Giraud, 2006). These findings are consistent with the possibility that observers are learning talker‐specific articulatory properties, as the supramodal learning hypothesis suggests.
Other theories have arisen to explain these bimodal training benefits including (1) tacit recruitment of the associated face dimensions when later listening to the auditory‐alone speech (Riedel et al., 2015; Schelinski, Riedel, & von Kriegstein, 2014); and (2) greater access to auditory primitives based on previously experienced associations with the visual speech component (Bernstein et al., 2013; Bernstein, Eberhardt, & Auer, 2014). However, both of these theories are based on a mechanism that requires experiencing associations between concurrent audio and visual streams in order to improve subsequent audio‐alone speech perception. Recall, however, that our crossmodal talker‐facilitation findings (Rosenblum et al., 2007) show that such bimodal experience is not necessary for later auditory speech facilitation. Accordingly, we argue that at least some component of bimodal training benefits is based on both modalities providing common talker‐specific information. To examine this possibility, our laboratory is currently testing whether a bimodal training benefit can, in fact, occur without the associations afforded by concurrent audio and visual speech information, as the supramodal learning hypothesis would predict.
Interestingly, because the supramodal learning hypothesis suggests that perceptual experience is of articulatory properties regardless of modality, another surprising prediction can be made. Observers should be able to show a bimodal training benefit using a modality they have rarely, if ever, used before: haptic speech. We have recently shown that, by listening to distorted auditory speech while touching the face of a speaker, observers are later able to understand the distorted speech on its own better than control subjects who touched a still face while listening (Dorsi et al., 2016). These results, together with our crossmodal talker facilitation findings (Rosenblum et al., 2007; Sanchez et al., 2013), suggest that the experiential basis of bimodal training benefits requires neither long‐term experience with the involved modalities nor concurrent presentation of the streams. What is required for a bimodal training benefit is access to some lawful auditory/visual/haptic information for articulatory actions and their indexical properties.
In sum, as we argued in our 2005 chapter, both auditory and visual speech share the general informational commonalities of being composed of time‐varying information which is intimately tied to indexical information. However, since 2005, another category of informational commonality can be added to this list: information in both streams can act to guide the indexical details of a production response. It is well known that during live conversation each participant’s productions are influenced by the indexical details of the speech they have just heard (e.g. Pardo, 2006; Pardo et al., 2013; for a review, see Pardo et al., 2017). This phonetic convergence shows that interlocutors’ utterances often subtly mimic aspects of the utterances of the person with whom they are speaking. This phenomenon occurs not only during live interaction, but also when subjects are asked to listen to recorded words and to say each word out loud. There have been many explanations for this phenomenon, including that it helps facilitate the interaction socially (e.g. Pardo et al., 2012). Phonetic convergence may also reveal the tacit connection between speech perception and production, as if the two functions share a “common currency” (e.g. Fowler, 2004).
Importantly, recent research from our lab and others suggests that phonetic convergence is not an alignment toward an interlocutor’s sound of speech as much as toward their articulatory style – conveyed supramodally. We have shown that, despite having no formal lip‐reading experience, perceivers will produce words containing the indexical properties of words they have just lip‐read (Miller, Sanchez, & Rosenblum, 2010). Further, the degree to which talkers converge toward lip‐read words is comparable to that observed for convergence to heard words. Other research from our lab shows that, during live interactions, seeing an interlocutor increases the degree of convergence over simply hearing them (Dias & Rosenblum, 2011), and that this increase is based on the availability of visible speech articulation (Dias & Rosenblum, 2016). Finally, it seems that the visual information for articulatory features (voice‐onset time) can integrate with auditory information to shape convergence (Sanchez, Miller, & Rosenblum, 2010). This finding also suggests that the streams are merged by the time they influence a spontaneous production response.
This evidence for multimodal influences on phonetic convergence is consistent with neurophysiological research showing visual speech modulation of speech motor areas. As has been shown for auditory speech, visual speech can induce speech motor system (cortical) activity during lip‐reading of syllables, words, and sentences (e.g. Callan et al., 2003, 2004; Hall, Fussell, & Summerfield, 2005; Nishitani & Hari, 2002; Olson, Gatenby, & Gore, 2002; Paulesu et al., 2003). This motor system activity also occurs when a subject is attending to another task and passively perceives visual speech (Turner et al., 2009). Other research shows an increase in motor system activity when visual information is added to auditory speech (e.g. Callan, Jones, & Callan, 2014; Irwin et al., 2011; Miller & D’Esposito, 2005; Swaminathan et al., 2013; Skipper, Nusbaum, & Small, 2005; Skipper et al., 2007; Uno et al., 2015; Venezia, Fillmore, et al., 2016; but see Matchin, Groulx, & Hickok, 2014). This increase is proportionate to the relative visibility of the particular segments present in the stimuli (Skipper, Nusbaum, & Small, 2005). Relatedly, with McGurk‐effect types of stimuli (audio /pa/ + video /ka/), segment‐specific reactivity in the motor cortex follows the integrated perceived syllable (/ta/; Skipper et al., 2007). This finding is consistent with other research showing that with transcranial magnetic stimulation (TMS) priming of the motor cortex, electromyographic (EMG) activity in the articulatory muscles follows the integrated segment (Sundara, Namasivayam, & Chen, 2001; but see Sato et al., 2010). These findings are also consistent with our own evidence that phonetic convergence in production responses is based on the integration of audio and visual channels (Sanchez, Miller, & Rosenblum, 2010).
There is currently a debate on whether the involvement of motor areas is necessary for audiovisual integration and for speech perception, in general (for a review, see Rosenblum, Dorsi, & Dias, 2016). But it is clear that the speech system treats auditory and visual speech information similarly for priming phonetic convergence in production responses. Thus, phonetic convergence joins the characteristics of critical time‐varying and indexical dimensions as an example of general informational commonality across audio and video streams. In this sense, the recent phonetic convergence research supports a supramodal perspective.