The Handbook of Speech Perception (multiple authors), page 32
The double‐edged sword of the McGurk effect
As stated, the McGurk effect is considered the quintessential demonstration of multisensory integration and has been a key factor motivating our new understanding of the multisensory brain. It has also become a litmus test for investigating how integration is affected by myriad factors. Methodologically, the McGurk effect may provide some advantages over other tests of multisensory speech such as visual enhancement of speech in noise. By asking subjects to report on what they are hearing, the effect uses a more implicit measure of visual influence. This increases the likelihood that the method is measuring true perceptual rather than post‐perceptual/decision‐making processes (e.g. Rosenblum, Yakel, & Green, 2000). Additionally, the McGurk method has the advantage of being composed of very short and simple syllable stimuli. Such stimuli allow the effect to be tested in time‐constrained imaging contexts, as well as in linguistic contexts for which it is important to limit lexical and semantic influences. Finally, while the effect has been shown to occur in myriad conditions, its strength and frequency can be variable, which makes it a useful dependent measure.
Consequently, the effect has become a method for establishing under which conditions integration occurs. Measurements of the effect’s strength have been used to determine how multisensory speech perception is affected by: individual differences (see Strand et al., 2014, for a review); attention; and generalized face processing (e.g. Eskelund, MacDonald, & Andersen, 2015; Rosenblum, Yakel, & Green, 2000). The effect has also been used to determine where in the perceptual and neurophysiological process integration occurs and whether integration is complete (for discussions of these topics, see Brancazio & Miller, 2005).
However, a number of researchers have recently questioned whether the McGurk effect should be used as a primary test of multisensory integration (Alsius, Paré, & Munhall, 2017; Remez, Beltrone, & Willimetz, 2017; Rosenblum, 2019; Irwin & DiBlasi, 2017; Brown et al., 2018). There are multiple reasons for these concerns. First, there is wide variability in most aspects of McGurk methodology (for a review, see Alsius, Paré, & Munhall, 2017). Most obviously, the specific talkers used to create the stimuli usually vary from project to project. The dubbing procedure (specifically, how the audio and visual components are aligned) also varies across laboratories. Studies also vary as to which syllables are used, as well as the type of McGurk effect tested (fusion; visual dominance). Procedurally, the tasks (e.g. open response vs. forced choice), stimulus ordering (fully randomized vs. blocked by modality), and the control condition chosen (e.g. audio‐alone vs. audiovisually congruent syllables) vary across studies (Alsius, Paré, & Munhall, 2017). This extreme methodological variability may account for the wide range of McGurk effect strengths reported across the literature. Finding evidence of the effect under such different conditions does speak to its durability. However, the methodological variability makes it difficult to know whether influences on the effect’s strength are attributable to the variable in question (e.g. facial inversion), or to some superfluous characteristic of idiosyncratic stimuli and/or tasks.
Another concern about the McGurk effect is whether it is truly representative of natural (nonillusory) multisensory perception (Alsius, Paré, & Munhall, 2017; Remez, Beltrone, & Willimetz, 2017). It could very well be that different perceptual and neurophysiological resources are recruited when integrating discrepant rather than congruent audiovisual components. In fact, it has long been known that McGurk‐effect syllables (e.g. audio ba + visual va = va) are less compelling and take longer to identify (Brancazio, 2004; Brancazio, Best, & Fowler, 2006; Green & Kuhl, 1991; Jerger et al., 2017; Massaro & Ferguson, 1993; Rosenblum & Saldaña, 1992) than analogous audiovisual congruent syllables (audio va + visual va = va). This is true even when McGurk syllables are identified with comparable frequency (98 percent va; Rosenblum & Saldaña, 1992) to the congruent syllables. Relatedly, there is evidence that, when spatial and temporal offsets are applied to the audio and visual components, McGurk stimuli are more readily perceived as separate components than are audiovisually congruent syllables (e.g. Bishop & Miller, 2011; van Wassenhove, Grant, & Poeppel, 2007).
There are also differences in neurophysiological responses to McGurk compared to congruent syllables (for a review, see Alsius, Paré, & Munhall, 2017), even when these are identified as the same segment, and with the same frequency. For example, there is more involvement of the superior temporal sulcus (STS) when perceiving McGurk compared to audiovisually congruent stimuli (e.g. Beauchamp, Nath, & Pasalar, 2010; Nath & Beauchamp, 2012; Münte et al., 2012; but see Baum et al., 2012; Baum & Beauchamp, 2014). McGurk stimuli also induce different cortical temporal reactions and neural synchrony patterns relative to analogous audiovisually congruent syllables (Fingelkurts et al., 2003; Hessler et al., 2013).
Additional evidence that the McGurk effect may not be representative of normal integration comes from intersubject differences. It turns out that there is little evidence for a correlation between a subject’s likelihood to display a McGurk effect and their benefit in using visual speech to enhance noisy auditory speech (at least in normal hearing subjects; e.g. Van Engen, Xie, & Chandrasekaran, 2016; but see Grant & Seitz, 1998). Relatedly, the relationship between straight lip‐reading skill and susceptibility to the McGurk effect is weak at best (Cienkowski & Carney, 2002; Strand et al., 2014; Wilson et al., 2016; Massaro et al., 1986).
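The intersubject comparison described above can be sketched in a few lines of Python. This is an illustration only, not the analysis used in any of the cited studies. It assumes a McGurk score defined as the proportion of incongruent trials on which the report differs from the auditory segment (one of several possible scoring rules), and a normalized visual-enhancement score, (AV − A) / (1 − A), computed from proportions correct in noise. The function names and the scoring choices are hypothetical.

```python
from math import sqrt


def mcgurk_rate(responses, auditory_segment):
    """Proportion of incongruent trials whose report differs from the audio.

    Assumption: any non-auditory report counts as a McGurk response.
    Studies differ on this (e.g. counting only fusion responses).
    """
    if not responses:
        raise ValueError("at least one trial is required")
    return sum(r != auditory_segment for r in responses) / len(responses)


def visual_enhancement(av_correct, a_correct):
    """Normalized gain from adding visual speech: (AV - A) / (1 - A).

    Both arguments are proportions correct in noise. The normalization is
    an assumed choice; it is undefined when audio-alone is at ceiling.
    """
    if a_correct >= 1.0:
        raise ValueError("audio-alone performance is at ceiling")
    return (av_correct - a_correct) / (1.0 - a_correct)


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant scores")
    return sxy / sqrt(sxx * syy)
```

Given one McGurk score and one enhancement score per subject, `pearson_r(mcgurk_scores, enhancement_scores)` yields the kind of intersubject correlation at issue here. A value near zero would be consistent with the weak relationship reported in the literature, although a real analysis would also need to consider the reliability of each measure.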
A particularly troubling concern regarding the McGurk effect is evidence that its failure does not mean integration has not occurred (Alsius, Paré, & Munhall, 2017; Rosenblum, 2019). Multiple studies have shown that when the McGurk effect seems to fail and a subject reports hearing just the auditory segment (e.g. auditory /b/ + visual /g/ = perceived /b/), the influences of the visual, and perhaps integrated, segment are present in the gestural nuances of the subject’s spoken response (Gentilucci & Cattaneo, 2005; Sato et al., 2010; see Rosenblum, 2019, for further discussion). In another example, Brancazio and Miller (2005) showed that in instances when a visual /ti/ failed to change identification of an audible /pi/, a simultaneous manipulation of the speaking rate of the visible /ti/ did influence the voice‐onset time perceived in the /pi/ (see also Green & Miller, 1985). Thus, information for voice‐onset time was integrated across the visual and audible syllables even when the McGurk effect failed to change the identification of the /pi/.
It is unclear why featural integration can still occur in the face of a failed McGurk effect (Rosenblum, 2019; Alsius, Paré, & Munhall, 2017). It could be that standard audiovisual segment integration does occur in these instances, but the resultant segment does not change enough to be categorized differently. As stated, perceived segments based on McGurk stimuli are less robust than audiovisually congruent (or audio‐alone) perceived segments. It could be that some integration almost always occurs for McGurk segments, but the less canonical integrated segment sometimes induces a phonetic categorization that is the same as the auditory‐alone segment. Regardless, the fact that audiovisual integration of some type can occur when the McGurk effect appears to fail forces a reconsideration of the effect as a primary test of integration.
For all of these reasons, a number of authors, including ourselves, have suggested that less weight be placed on the McGurk effect in evaluating multisensory integration. Evaluation of integration may be better served with measures of the perceptual super‐additivity of visual and audio (e.g. in noise) streams (e.g. Alsius, Paré, & Munhall, 2017; Irwin & DiBlasi, 2017; Remez, Beltrone, & Willimetz, 2017); influences on speech‐production responses (Gentilucci & Cattaneo, 2005; and see Sato et al., 2010); and neurophysiological responses (e.g. Skipper et al., 2007). Such methods may very well be more stable, valid, and representative indexes of integration than the McGurk effect.