Ubiquity and automaticity of multisensory speech

Since 2005, evidence has continued to grow that speech is an inherently multisensory function. It has long been known that visual speech is used to enhance challenging auditory speech, whether that speech is degraded by noise or accent, or simply contains complicated material (e.g. Arnold & Hill, 2001; Bernstein, Auer, & Takayanagi, 2004; Reisberg, McLean, & Goldfield, 1987; Sumby & Pollack, 1954; Zheng & Samuel, 2019). Visual speech information helps us acquire our first language (e.g. Teinonen et al., 2008; for a review, see Danielson et al., 2017) and our second languages (Hardison, 2005; Hazan et al., 2005; Navarra & Soto‐Faraco, 2007). The importance of visual speech in language acquisition is also evidenced in research on congenitally blind individuals. Blind children show small delays in learning to perceive and produce segments that are acoustically more ambiguous but visually distinct (e.g. the /m/–/n/ distinction). Recent research shows that these idiosyncratic differences carry through to adulthood: congenitally blind adults show subtle differences in speech perception and production (e.g. Delvaux et al., 2018; Ménard, Leclerc, & Tiede, 2014; Ménard et al., 2009, 2013, 2015).

The inherently multimodal nature of speech is also demonstrated by perceivers using and integrating information from a modality that they rarely, if ever, use for speech: touch. It has long been known that deaf‐blind individuals can learn to touch the lips, jaw, and neck of a speaker to perceive speech (the Tadoma technique). However, recent research shows just how automatic this process can be for even novice users (e.g. Treille et al., 2014). Novice perceivers (with normal sight and hearing) can readily use felt speech to (1) enhance comprehension of noisy auditory speech (Gick et al., 2008; Sato, Cavé, et al., 2010); (2) enhance lip‐reading (Gick et al., 2008); and (3) influence perception of discrepant auditory speech, as in a McGurk effect (Fowler & Dekle, 1991). Consistent with these findings, neurophysiological research shows that touching an articulating face can speed auditory cortex reactions to congruent auditory speech, in the same way as is known to occur with visual speech (Treille et al., 2014; Treille, Vilain, & Sato, 2014; see also Auer et al., 2007). Other research shows that the speech function can work effectively with very sparse haptic information. Receiving light puffs of air on the skin in synchrony with hearing voiced consonants (e.g. b) can make those consonants sound voiceless (p; Derrick & Gick, 2013; Gick & Derrick, 2009). In a related example, if a listener’s cheeks are gently pulled down in synchrony with hearing a word that they had previously identified as “head,” they will be more likely to now hear that word as “had” (Ito, Tiede, & Ostry, 2009). The opposite effect occurs if the listener’s cheeks are instead pulled to the side.

These haptic speech demonstrations are important for multiple reasons. First, they demonstrate how readily the speech system can make use of – and integrate – even the most novel type of articulatory information. Very few normally sighted and hearing individuals have intentionally used touch information for purposes of speech perception. Despite the odd and often limited nature of haptic speech information, it is readily usable, showing that the speech brain is sensitive to articulation regardless of the modality through which it is conveyed. Second, the fact that this information can be used spontaneously despite its novelty may be problematic for integration accounts based on associative learning between the modalities. Both classic auditory accounts of speech perception (Diehl & Kluender, 1989; Hickok, 2009; Magnotti & Beauchamp, 2017) and Bayesian accounts of multisensory integration (Altieri, Pisoni, & Townsend, 2011; Ma et al., 2009; Shams et al., 2011; van Wassenhove, 2013) assume that the senses are effectively bound and integrated on the basis of the associations gained through a lifetime of experience simultaneously seeing and hearing speech utterances. However, if multisensory speech perception were based only on associative experience, it is unclear how haptic speech would be so readily used and integrated by the speech function. In this sense, the haptic speech findings pose an important challenge to associative accounts (see also Rosenblum, Dorsi, & Dias, 2016).
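To make concrete what such Bayesian accounts assume, the sketch below gives a minimal reliability-weighted cue-fusion example in Python. It is an illustration only, not any specific model from the cited literature: the function name, the numerical values, and the premise that the cue variances are learned from prior audiovisual experience are assumptions introduced here, intended simply to show where associative experience enters this class of model and why a never-before-felt haptic cue sits awkwardly within it.

```python
# Minimal sketch of reliability-weighted (Bayesian) cue fusion for two
# speech cues, e.g. an auditory and a visual estimate of the same
# articulatory feature. All names and values are illustrative only.

def fuse_cues(audio_est, audio_var, visual_est, visual_var):
    """Combine two noisy estimates of the same feature.

    Each cue is weighted by its reliability (inverse variance). On an
    associative account, these variances are presumed to be learned from
    a lifetime of paired audiovisual experience -- the very component a
    novel haptic cue would lack.
    """
    w_a = 1.0 / audio_var
    w_v = 1.0 / visual_var
    fused_est = (w_a * audio_est + w_v * visual_est) / (w_a + w_v)
    fused_var = 1.0 / (w_a + w_v)  # fused estimate is more reliable than either cue
    return fused_est, fused_var

# Example: a relatively clear visual cue (low variance) pulls the fused
# estimate toward the visually specified value, as when vision enhances
# speech heard in noise.
print(fuse_cues(audio_est=0.2, audio_var=1.0, visual_est=0.8, visual_var=0.25))
```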

Certainly, the most well‐known and studied demonstration of multisensory speech is the McGurk effect (McGurk & MacDonald, 1976; for recent reviews, see Alsius, Paré, & Munhall, 2017; Rosenblum, 2019; Tiippana, 2014). The effect typically involves a video of one type of syllable (e.g. ga) being synchronously dubbed onto an audio recording of a different syllable (ba) to induce a “heard” percept (da) that is strongly influenced by the visual component. The McGurk effect is considered to occur whenever the heard percept differs from the auditory component, whether a subject hears a compromise between the audio and visual components (auditory ba + visual ga = heard da) or hears a syllable dominated by the visual component (auditory ba + visual va = heard va). The effect has been demonstrated in multiple contexts, including with segments and speakers of different languages (e.g. Fuster‐Duran, 1996; Massaro et al., 1993; Sams et al., 1998; Sekiyama & Tohkura, 1991, 1993); across development (e.g. Burnham & Dodd, 2004; Desjardins & Werker, 2004; Jerger et al., 2014; Rosenblum, Schmuckler, & Johnson, 1997); with degraded audio and visual signals (Andersen et al., 2009; Rosenblum & Saldaña, 1996; Thomas & Jordan, 2002); and regardless of awareness of the audiovisual discrepancy (Bertelson & De Gelder, 2004; Bertelson et al., 1994; Colin et al., 2002; Green et al., 1991; Massaro, 1987; Soto‐Faraco & Alsius, 2007, 2009; Summerfield & McGrath, 1984). These characteristics have been interpreted as evidence that multisensory speech integration is automatic and impenetrable to outside influences (Rosenblum, 2005).
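As an illustration of how such integration is often formalized, the toy sketch below follows the general multiplicative form associated with the Fuzzy Logical Model of Perception (Massaro, 1987), which is cited above: support from each modality for each candidate syllable is multiplied and renormalized, so an auditory ba paired with a visual ga can leave a fused da as the best-supported response. The support values are invented for illustration and are not taken from any study.

```python
# Toy multiplicative integration in the spirit of the FLMP (Massaro, 1987):
# per-modality support for each candidate response is multiplied, then
# normalized. The support values below are hypothetical, chosen only to
# show how a fused "da" can win although neither modality favors it most.

candidates = ["ba", "da", "ga"]

# Hypothetical degrees of support (0..1) from each modality.
auditory_support = {"ba": 0.70, "da": 0.25, "ga": 0.05}   # audio specifies "ba"
visual_support   = {"ba": 0.05, "da": 0.40, "ga": 0.55}   # video specifies "ga"

products = {c: auditory_support[c] * visual_support[c] for c in candidates}
total = sum(products.values())
response_probs = {c: products[c] / total for c in candidates}

print(response_probs)  # "da" receives the highest integrated support
```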

However, some recent research has challenged this interpretation of integration (for a review, see Rosenblum, 2019). For example, a number of studies have been construed as showing that attention can influence whether integration occurs in the McGurk effect (for reviews, see Mitterer & Reinisch, 2017; Rosenblum, 2019). Adding a distractor to the visual, auditory, or even tactile channel seems to significantly reduce the strength of the effect (e.g. Alsius et al., 2005; Alsius, Navarra, & Soto‐Faraco, 2007; Mitterer & Reinisch, 2017; Tiippana, Andersen, & Sams, 2004; see also Munhall et al., 2009). Unfortunately, relatively few of these studies have also tested unimodal conditions to determine whether the distractors might simply reduce detection of the requisite unimodal information. If, for example, less visual information can be extracted under distraction (of any type), then a reduced McGurk effect would likely follow. In the few studies that have examined distraction in unimodal visual (lipreading) conditions, the tests are unlikely to have been sufficiently sensitive, given the especially low baseline performance of straight lipreading (Alsius et al., 2005; Alsius, Navarra, & Soto‐Faraco, 2007; for a review of this argument, see Rosenblum, 2019). Thus, to date, it is unclear whether attention can truly penetrate the speech integration function or instead simply interferes with extraction of the visual information needed for a McGurk effect. Moreover, the McGurk effect itself may not constitute a thorough test of speech integration.
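One way to see the force of this sensitivity concern is a toy power calculation. The simulation below uses entirely hypothetical accuracy values and trial counts (none are taken from the cited studies) to illustrate that, when visual-only lipreading accuracy sits near floor, a modest distractor-induced drop will often go statistically undetected, so a null result in the unimodal control condition says little about whether visual extraction was impaired.

```python
# Toy Monte Carlo illustration with hypothetical numbers: a near-floor
# lipreading baseline leaves little room to detect a distractor-induced
# drop in a visual-only control condition.
import numpy as np

rng = np.random.default_rng(0)

n = 200                  # visual-only trials per condition (hypothetical)
p_baseline = 0.15        # near-floor lipreading accuracy without a distractor
p_distract = 0.10        # modest drop caused by the distractor

def significant_drop(k1, k2, n, z_crit=1.96):
    """Two-proportion z-test (pooled), one simple sensitivity check."""
    p1, p2 = k1 / n, k2 / n
    p_pool = (k1 + k2) / (2 * n)
    se = np.sqrt(p_pool * (1 - p_pool) * (2 / n))
    return se > 0 and abs(p1 - p2) / se > z_crit  # two-tailed alpha = .05

n_sims = 5000
detections = sum(
    significant_drop(rng.binomial(n, p_baseline), rng.binomial(n, p_distract), n)
    for _ in range(n_sims)
)

print(f"Estimated power to detect the drop: {detections / n_sims:.2f}")
# With these hypothetical numbers, power falls well below the conventional .80,
# so a null unimodal result is weak evidence that visual extraction was intact.
```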
