Multimodal speech is integrated at the earliest observable stage

The question of where in the speech function the modal streams integrate (merge) continues to be one of the most studied in the multisensory literature. Since 2005, much of this research has used neurophysiological methods. After the aforementioned fMRI report by Calvert and her colleagues (1997; see also Pekkola et al., 2005), numerous studies have shown visual speech activation of the auditory cortex using other technologies, including functional near‐infrared spectroscopy (fNIR; van de Rijt et al., 2016), electroencephalography (EEG; Callan et al., 2001; Besle et al., 2004), intracranial EEG (ECoG; e.g. Besle et al., 2008), and magnetoencephalography (MEG; Arnal et al., 2009; for a review, see Rosenblum, Dorsi, & Dias, 2016). More recent evidence shows that visual speech can modulate mechanisms considered to be even further upstream, including the auditory brainstem (Musacchia et al., 2006), one of the earliest locations at which direct visual modulation could occur. There is even evidence of visual speech modulation of cochlear functioning (otoacoustic emissions; Namasivayam et al., 2015). While visual influences on such peripheral auditory mechanisms are likely based on feedback from downstream areas, the fact that they occur at all indicates the importance of visual input to the speech function.

Other neurophysiological findings suggest that integration of the streams also happens early. A very recent EEG study revealed that N1 auditory‐evoked potentials (known to reflect primary auditory cortex activity) for visually induced (McGurk) fa and ba syllables (auditory ba + visual fa, and auditory fa + visual ba, respectively) resembled the N1 responses for the corresponding auditory‐alone syllables (Shahin et al., 2018; and see van Wassenhove, Grant, & Poeppel, 2005). The degree of resemblance was larger for individuals whose identification responses showed greater visual influence, suggesting that this modulated auditory cortex activity (reflected in N1) corresponds to an integrated perceived segment. This finding is less consistent with the alternative model in which separate unimodal analyses are first conducted at the primary cortices, with their outcomes then combined at a multisensory integrator such as the posterior STS (pSTS; e.g. Beauchamp et al., 2004).
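To make the subject-level relationship concrete, the following sketch (with made-up data; the waveforms, window length, and variable names are illustrative assumptions, not Shahin et al.'s published analysis) treats N1 "resemblance" as the correlation between a subject's McGurk N1 waveform and the corresponding auditory-alone N1 waveform, and then asks whether resemblance increases with that subject's rate of visually influenced identifications.

```python
# A minimal sketch (hypothetical data; not the published analysis): quantify
# per-subject N1 "resemblance" and relate it to behavioral visual influence.
import numpy as np

rng = np.random.default_rng(0)
n_subjects, n_samples = 20, 200            # e.g. a 200-sample window around N1

# Hypothetical ERP waveforms (subjects x time) at an auditory electrode
n1_mcgurk = rng.standard_normal((n_subjects, n_samples))       # audio ba + visual fa
n1_audio_alone = rng.standard_normal((n_subjects, n_samples))  # audio fa alone

# Hypothetical proportion of trials identified as the visually influenced syllable
mcgurk_rate = rng.uniform(0, 1, n_subjects)

# Resemblance = correlation between each subject's two N1 waveforms
resemblance = np.array([np.corrcoef(n1_mcgurk[s], n1_audio_alone[s])[0, 1]
                        for s in range(n_subjects)])

# Across subjects: does N1 resemblance grow with behavioral visual influence?
# (The pattern reported by Shahin et al. would correspond to a positive r here.)
r = np.corrcoef(resemblance, mcgurk_rate)[0, 1]
print(f"resemblance-by-McGurk-rate correlation: r = {r:.2f}")
```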

Other findings suggest that visual modulation of the auditory cortex (as it responds to sound) happens too quickly for an additional integrative step to be part of the process (for a review, see Besle et al., 2004). In fact, there is evidence that adding congruent visual speech to auditory speech input speeds up ERP and MEG responses in the auditory cortex (van Wassenhove, Grant, & Poeppel, 2005; Hertrich et al., 2009). This facilitation could result from the visible articulatory information for a segment often being available before the corresponding auditory information (for a review, see Venezia, Thurman, et al., 2016). Visual speech could thus serve a sort of priming function, or establish a cortical preparedness, that speeds auditory processing of speech (e.g. Campbell, 2011; Hertrich et al., 2009). Regardless, it is clear that, as neuroscientific technology improves, it continues to reveal crossmodal influences as early as they can be observed. This pattern of results is analogous to recent nonspeech findings that similarly demonstrate early audiovisual integration (e.g. Shams et al., 2005; Watkins et al., 2006; for a review, see Rosenblum et al., 2016).

The behavioral research also continues to show evidence of early crossmodal influences (for a review, see Rosenblum, Dorsi, & Dias, 2016). Evidence suggests that visual influences likely occur before auditory feature extraction (e.g. Brancazio, Miller, & Paré, 2003; Fowler, Brown, & Mann, 2000; Green & Gerdeman, 1995; Green & Kuhl, 1989; Green & Miller, 1985; Green & Norrix, 2001; Schwartz, Berthommier, & Savariaux, 2004). Other research shows that information in one modality can facilitate perception in the other even before that information is usable, and sometimes even detectable, on its own (e.g. Plass et al., 2014). For example, Plass and his colleagues (2014) used flash suppression to render visually presented articulating faces (consciously) undetectable. Yet when these undetected faces were presented with auditory speech that was consistent and synchronized with the visible articulation, subjects were faster at recognizing that auditory speech. This suggests that useful crossmodal influences can occur even without awareness of the information in one of the modalities.

Other examples of the extreme super‐additive nature of speech integration have been shown in the context of auditory speech detection (Grant & Seitz, 2000; Grant, 2001; Kim & Davis, 2004; Palmer & Ramsey, 2012) and identification (Schwartz, Berthommier, & Savariaux, 2004), as well as audiovisual speech identification (Eskelund, Tuomainen, & Andersen, 2011; Rosen, Fourcin, & Moore, 1981). Much of this research has been interpreted to suggest that, even without supporting a clear (conscious) phonetic determination on its own, each modality can help the perceiver attend to critical information in the other modality through analogous patterns of temporal change in the two signals. These crossmodal correspondences are thought to be influential at an especially early stage (before feature extraction), serving as a “bimodal coherence‐masking protection” against everyday signal degradation (e.g. Grant & Seitz, 2000; Kim & Davis, 2004; Schwartz, Berthommier, & Savariaux, 2004; see also Gordon, 1997). The impressive utility of these crossmodal correspondences will also help motivate the theoretical position proposed later in this chapter.
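The idea of "analogous patterns of temporal change" can be made concrete with a simple signal analysis. The sketch below uses synthetic signals (the sampling rate, the 4 Hz syllabic rhythm, and the 40 ms audio lag are illustrative assumptions); in work such as Grant and Seitz (2000), the inputs would instead be a measured acoustic amplitude envelope and a measure of visible mouth opening. The point is only that the two signals share a rhythm that a lagged correlation can capture, which is the kind of coherence thought to provide masking protection.

```python
# A minimal sketch with synthetic signals: crossmodal correspondence as a
# lagged correlation between the acoustic envelope and visible lip aperture.
import numpy as np

rng = np.random.default_rng(1)
fps = 100                              # assume both signals resampled to 100 Hz
t = np.arange(0, 3, 1 / fps)           # three seconds of "speech"

# A shared ~4 Hz syllabic rhythm plus modality-specific noise
syllabic = 0.5 * (1 + np.sin(2 * np.pi * 4 * t))
lip_aperture = syllabic + 0.3 * rng.standard_normal(t.size)
audio_envelope = np.roll(syllabic, 4) + 0.3 * rng.standard_normal(t.size)  # audio delayed ~40 ms

def lagged_corr(x, y, max_lag):
    """Correlation of x with y delayed by -max_lag..max_lag frames (wrap-around ignored)."""
    return [(lag, np.corrcoef(x, np.roll(y, lag))[0, 1])
            for lag in range(-max_lag, max_lag + 1)]

# Peak coherence occurs when the lip signal is delayed to match the audio,
# i.e. the visible articulation leads the acoustic envelope.
best_lag, best_r = max(lagged_corr(audio_envelope, lip_aperture, 10),
                       key=lambda lr: lr[1])
print(f"peak audiovisual coherence r = {best_r:.2f}, visual leading by ~{best_lag * 1000 // fps} ms")
```

In this toy case the correlation peaks when the lip signal is shifted by roughly the imposed 40 ms, mirroring the empirical observation that visible articulation tends to lead the acoustics it predicts.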

However, other recent results have been interpreted as suggesting that additional linguistic analyses are conducted on the individual streams before, or concurrently with, integration. For example, a literature has emerged showing that the McGurk effect can be influenced by lexicality and semantic (sentence) context (e.g. Brancazio, 2004; Barutchu et al., 2008; but see Sams et al., 1998; Windmann, 2004, 2007). In one example, audio /ba/ paired with visual /va/ is perceived more often as va when presented in the context of the word valve than in the context of the nonword vatch (Brancazio, 2004). This could mean that the analysis of each individual stream proceeds for some time before influencing the likelihood of audiovisual integration.

However, other interpretations of these results have been offered that are consistent with early integration (Brancazio, 2004; Rosenblum, 2008). It may be that lexicality and sentence context do not bear on the likelihood of integration, but instead on how the post‐integrated segment is categorized. As stated, syllables perceived from conflicting audiovisual information are likely less canonical than those based on congruent (or audio‐alone) information. This likely makes those syllables less robust, even when they are identified as visually influenced segments. Thus, despite incongruent segments being fully integrated, the resultant perceived segment may be more susceptible to contextual (e.g. lexical) influences than audiovisually congruent (and auditory‐alone) segments. This is certainly known to be the case for less canonical, more ambiguous audio‐alone segments, as demonstrated by the Ganong effect: an ambiguous segment heard equally often as k or g in isolation will be heard as the former when placed in front of the syllable iss (yielding kiss), but as the latter in front of ift (yielding gift) (Connine & Clifton, 1987; Ganong, 1980). If the same is true of incongruent audiovisual segments, then lexical context may not bear on audiovisual integration as such, but on the categorization of the post‐integrated (and less canonical) segment (e.g. Brancazio, 2004).
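The logic of this interpretation can be illustrated with a toy categorization model (a sketch under assumed parameter values, not a model drawn from the literature): lexical context enters as a bias on how the post-integrated segment is categorized, and that bias matters most when the segment is ambiguous, that is, less canonical.

```python
# A toy sketch (arbitrary parameter values, not a model from the literature):
# lexical context as a bias on categorizing the post-integrated segment.
import math

def p_report_g(evidence_g: float, lexical_bias_g: float) -> float:
    """Probability of reporting /g/, given bottom-up evidence (+ = g-like,
    - = k-like, 0 = fully ambiguous) and a lexical bias (+ favors /g/)."""
    return 1 / (1 + math.exp(-(evidence_g + lexical_bias_g)))

for label, evidence in [("canonical /k/", -4.0),
                        ("ambiguous (McGurk-like)", 0.0),
                        ("canonical /g/", +4.0)]:
    in_ift = p_report_g(evidence, lexical_bias_g=+1.5)  # "_ift": gift is a word, favors /g/
    in_iss = p_report_g(evidence, lexical_bias_g=-1.5)  # "_iss": kiss is a word, favors /k/
    print(f"{label:24s} p(g | _ift) = {in_ift:.2f}   p(g | _iss) = {in_iss:.2f}")
```

In this parameterization the lexical bias shifts the ambiguous case substantially (from roughly .18 to .82) while barely moving the canonical cases, which is the sense in which a less canonical post-integrated segment would be more open to lexical influence without integration itself being affected.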

Still, other recent evidence has been interpreted as showing that a semantic analysis is conducted on the individual streams before integration is fully complete (see also Bernstein, Auer, & Moore, 2004). Ostrand and her colleagues (2016) present data showing that, even when a McGurk word is perceived as visually influenced (e.g. audio bait + visual date = heard date), the auditory component of the stimulus provides the stronger priming of semantically related auditory words (audio bait + visual date primes worm more strongly than it primes calendar). This finding could suggest that the auditory component undergoes a semantic analysis before it is merged with the visual component, and so provides stronger priming than the visible word. If this contention were true, it would mean that the channels are not fully integrated until a substantial amount of processing has occurred on the individual channels.

A more recent test of this question has provided very different results, however (Dorsi, Rosenblum, & Ostrand, 2017). For this purpose, our laboratory used word combinations that begin with consonants known to produce a very strong McGurk effect (e.g. audio boat + visual vote = heard vote). Using these stimuli, we found that it was the visually influenced word that provided the stronger priming, rather than the word comprising the auditory component (audio boat + visual vote primes election more strongly than it primes dock). Follow‐up analyses of both our own and Ostrand et al.’s (2016) original stimuli suggest that the more an audiovisual word is actually identified as visually influenced (i.e. the more it shows the McGurk effect), the more likely it is to show greater priming from the visually influenced word. This suggests that Ostrand et al.’s original findings may have been based on a stimulus set that did not induce many McGurk percepts. It also suggests that the streams are functionally integrated by the time semantic analysis occurs.

In sum, much of the new evidence from the behavioral, and especially the neurophysiological, research suggests that the audio and visual streams are merged as early as can currently be observed (but see Bernstein, Auer, & Moore, 2004). In the previous version of this chapter we argued that this fact, along with the ubiquity and automaticity of multisensory speech, suggests that the speech function is designed around multisensory input (Rosenblum, 2005). We further argued that the function may make use of the fact that there is a common informational form across the modalities. This contention will be addressed in the final section of this chapter.
