
Specific examples of supramodal information


Summerfield (1987) was the first to suggest that the informational form for certain articulatory actions can be construed as the same across vision and audition. As an intuitive example, he suggested that the higher‐order information for a repetitive syllable would be the same in sound and light. Consider a speaker repetitively articulating the syllable /ma/. For hearing, a repetitive oscillation of the amplitude and spectral structure of the acoustic signal would be lawfully linked to the repetitive movements of the lips, jaw, and tongue. For sight, a repetitive restructuring of the light reflecting from the face would also be lawfully linked to the same movements. While the energetic details of the information differ across modalities, the more abstract repetitive informational restructuring occurs in both modalities in the same oscillatory manner, with the same time course, so as to be specific to the articulatory movements. Thus, repetitive informational restructuring could be considered supramodal information – available in both the light and the sound – that acts to specify a speech event of repetitive articulation. A speech mechanism sensitive to this form of supramodal information would function without regard to the sensory details specific to each modality: the relevant form of information exists in the same way (abstractly defined) in both modalities. In this sense, a speech function that could pick up on this abstract form of information in multiple modalities would not require integration or translation of the information across modalities.
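As a concrete illustration of this idea, the following sketch (a toy example in Python with NumPy/SciPy, not drawn from Summerfield's work) builds a synthetic "acoustic" signal whose amplitude is modulated at a syllable-like rate, together with a hypothetical lip-aperture trace oscillating at the same rate, and then compares the acoustic amplitude envelope with the lip trace. The raw signals are entirely different in kind, yet the shared oscillatory restructuring shows up as a high correlation.

import numpy as np
from scipy.signal import hilbert, butter, filtfilt

fs = 16000                          # audio sampling rate (Hz)
t = np.arange(0, 2.0, 1 / fs)       # two seconds of a /ma/-like repetition at ~3 Hz
syllable_rate = 3.0

# Toy "acoustic" signal: a noise carrier whose amplitude is modulated at the syllable rate
carrier = np.random.randn(t.size)
amplitude = 0.5 * (1 + np.sin(2 * np.pi * syllable_rate * t))
audio = carrier * amplitude

# Hypothetical "visible" signal: lip aperture opening and closing at the same rate (60 Hz video)
fs_video = 60
tv = np.arange(0, 2.0, 1 / fs_video)
lip_aperture = 0.5 * (1 + np.sin(2 * np.pi * syllable_rate * tv)) + 0.05 * np.random.randn(tv.size)

# Acoustic amplitude envelope: rectify via the Hilbert transform, keep only slow modulations,
# then resample to the video frame times so the two signals can be compared directly
envelope = np.abs(hilbert(audio))
b, a = butter(2, 10 / (fs / 2))
envelope = filtfilt(b, a, envelope)
envelope_at_video = np.interp(tv, t, envelope)

# The modality-independent oscillatory structure appears as a high correlation
r = np.corrcoef(envelope_at_video, lip_aperture)[0, 1]
print(f"correlation between acoustic envelope and lip aperture: r = {r:.2f}")

Nothing in this comparison depends on the acoustic or optical details themselves; only the shared temporal restructuring matters.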

Summerfield (1987) offered other examples of supramodal information, such as how quantal changes in articulation (e.g. from bilabial contact to no contact) and reversals in articulation (e.g. during production of a consonant–vowel–consonant such as /wew/) would be accompanied by corresponding quantal and reversal changes in the acoustic and optic structure.

More formal examinations of supramodal information have been provided by Vatikiotis‐Bateson and his colleagues (Munhall & Vatikiotis‐Bateson, 2004; Yehia, Kuratate, & Vatikiotis‐Bateson, 2002; Yehia, Rubin, & Vatikiotis‐Bateson, 1998). These researchers have shown high correlations between amplitude/spectral changes in the acoustic signal, kinematic changes in optical structure (mouth movements measured from video), and changing vocal tract configurations (measured with a magnetometer). They report that the information visible on the face captures between 70 and 85 percent of the variance contained in the acoustic signal. Vatikiotis‐Bateson and his colleagues also found a close relationship between subtle nodding motions of the head and fundamental frequency (F0), which is potentially informative about prosodic dimensions (Yehia, Kuratate, & Vatikiotis‐Bateson, 2002). Other researchers have shown similarly close relationships between articulatory motions, spectral changes, and visible movements across a wide variety of talkers and speech materials (e.g. Barker & Berthommier, 1999; Jiang et al., 2002). These strikingly strong moment‐to‐moment correspondences between the acoustic and visual signals suggest that the two streams can take a common form.
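The variance figures above come from multivariate regressions between measured facial kinematics and acoustic parameters. The sketch below (Python/NumPy, with random placeholder arrays standing in for the measured marker positions and spectral parameters used in the cited studies) shows the general shape of such an analysis: fit a linear map from face motion to acoustic parameters and report the proportion of acoustic variance it captures.

import numpy as np

rng = np.random.default_rng(0)
n_frames = 500

# Placeholder data: e.g. 3-D positions of four facial markers (12 values per video frame)
face = rng.standard_normal((n_frames, 12))
# Placeholder acoustic parameters (e.g. eight spectral coefficients per frame),
# constructed here to depend linearly on face motion plus noise
acoustic = face @ rng.standard_normal((12, 8)) + 0.3 * rng.standard_normal((n_frames, 8))

# Least-squares linear map from facial kinematics (plus intercept) to acoustic parameters
X = np.column_stack([face, np.ones(n_frames)])
W, *_ = np.linalg.lstsq(X, acoustic, rcond=None)
predicted = X @ W

# Proportion of acoustic variance captured by face motion (R^2 pooled across parameters)
ss_res = np.sum((acoustic - predicted) ** 2)
ss_tot = np.sum((acoustic - acoustic.mean(axis=0)) ** 2)
print(f"acoustic variance captured by face motion: {1 - ss_res / ss_tot:.0%}")

With measured marker and spectral data in place of the placeholders, an analysis of this general form yields the kind of variance figures reported above.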

Other recent research has determined that some of the strongest correlations between audible and visible signals lie in the acoustic range of 2–3 kHz (Chandrasekaran et al., 2009). This may seem unintuitive, because it is within this range that the presumably less visible articulatory movements of the tongue and pharynx play their largest role in sculpting the sound. However, the configurations of these articulators were shown to systematically influence subtle visible mouth movements. This suggests that there is a class of visible information that strongly correlates with the acoustic information shaped by internal articulators. In fact, visual speech research has shown that presumably “hidden” articulatory dimensions (e.g. lexical tone, intraoral pressure) are actually visible in corresponding changes on the facial surface, and can be used as speech information (Burnham et al., 2000; Han et al., 2018; Munhall & Vatikiotis‐Bateson, 2004). That visible mouth movements can inform about internal articulation may explain a striking recent finding: when observers are shown cross‐sectional ultrasound displays of internal tongue movements, they can readily integrate these novel displays with synchronized auditory speech information (D’Ausilio et al., 2014; see also Katz & Mehta, 2015).
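A rough sketch of the kind of band‐by‐band analysis behind such results is given below (Python/SciPy; the audio and mouth‐area traces are random placeholders for real recordings, so the printed correlation is illustrative only): the acoustic signal is band‐limited to 2–3 kHz, its energy envelope is extracted, and that envelope is correlated with a visible mouth‐area signal.

import numpy as np
from scipy.signal import butter, filtfilt, hilbert

fs = 16000
n = int(fs * 5.0)                 # five seconds of signal
audio = np.random.randn(n)        # placeholder for a recorded speech waveform
mouth_area = np.random.randn(n)   # placeholder for mouth area, already upsampled to audio rate

def band_envelope(x, low_hz, high_hz, fs, smooth_hz=10.0):
    # Energy envelope of x restricted to the band [low_hz, high_hz]
    b, a = butter(4, [low_hz / (fs / 2), high_hz / (fs / 2)], btype="band")
    banded = filtfilt(b, a, x)
    env = np.abs(hilbert(banded))
    b_lp, a_lp = butter(2, smooth_hz / (fs / 2))
    return filtfilt(b_lp, a_lp, env)

env_2to3k = band_envelope(audio, 2000.0, 3000.0, fs)
r = np.corrcoef(env_2to3k, mouth_area)[0, 1]
print(f"mouth area vs. 2-3 kHz acoustic envelope: r = {r:.2f}")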

The strong correspondences between auditory and visual speech information have allowed auditory speech to be synthesized from tracked kinematic dimensions available on the face (e.g. Barker & Berthommier, 1999; Yehia, Kuratate, & Vatikiotis‐Bateson, 2002). Conversely, the correspondences have allowed facial animation to be created effectively from parameters of the acoustic signal (e.g. Yamamoto, Nakamura, & Shikano, 1998). There is also evidence of surprisingly close correspondences between audible and visible macaque calls, which macaques can easily perceive as corresponding (Ghazanfar et al., 2005). This finding may suggest a traceable phylogeny of the supramodal basis for multisensory communication.

Importantly, there is evidence that perceivers make use of these crossmodal informational correspondences. While the supramodal thesis proposes that the relevant speech information takes a supramodal, higher‐order form, the degree to which this information is simultaneously available in both modalities depends on a number of factors (e.g. visibility, audibility). The evidence shows that, in contexts where the information is simultaneously available, perceivers take advantage of the correspondence (e.g. Grant & Seitz, 2000; Grant, 2001; Kim & Davis, 2004; Palmer & Ramsey, 2012; Schwartz, Berthommier, & Savariaux, 2004; Eskelund, Tuomainen, & Andersen, 2011; Rosen, Fourcin, & Moore, 1981). Research shows that the availability of segment‐to‐segment correspondence between the modalities’ information strongly predicts how well one modality will enhance the other (Grant & Seitz, 2000, 2001; Kim & Davis, 2004). Functionally, this finding supports the aforementioned “bimodal coherence‐masking protection,” in that the informational correspondence across modalities allows one modality to boost the usability of the other (e.g. in the face of everyday masking degradation). In this sense, the supramodal thesis is consistent with the evidence supporting the bimodal coherence‐masking protection concept discussed earlier (Grant & Seitz, 2000; Grant, 2001; Kim & Davis, 2004). However, the supramodal thesis goes further in suggesting that (1) the crossmodal correspondences are much more common and complex, and (2) the abstract form of information that can support such correspondences is the primary type of information the speech mechanism uses (regardless of the degree of moment‐to‐moment correspondence or of the specific availability of information in a modality).
