Theoretical Foundations of Multimodal Interfaces and Systems
Sharon Oviatt
This chapter discusses the theoretical foundations of multisensory perception and multimodal communication. It provides a basis for understanding the performance advantages of multimodal interfaces, as well as how to design them to reap these advantages. Historically, the major theories that have influenced contemporary views of multimodal interaction and interface design include Gestalt theory, Working Memory theory, and Activity theory. These span perception-action dynamic theories as well as limited resource theories that focus on constraints involving attention and short-term memory. This chapter emphasizes these theories in part because they are heavily supported by neuroscience findings. Their predictions also have been corroborated by studies on multimodal human-computer interaction. In addition to summarizing these three main theories and their impact, several related theoretical frameworks that have influenced multimodal interface design will be described, including Multiple Resource theory, Cognitive Load theory, Embodied Cognition, Communication Accommodation theory, and Affordance theory.
The large and multidisciplinary body of research on multisensory perception, production, and multimodal interaction confirms many Gestalt, Working Memory, and Activity theory predictions that will be discussed in this chapter. These theories provide conceptual anchors. They create a path for understanding how to design more powerful systems, so we can gain better control over our own future. In spite of this, it is surprising how many systems are developed from a sophisticated engineering perspective, yet in a complete theoretical vacuum that Leonardo da Vinci would have ridiculed:
Those who fall in love with practice without science are like a sailor who enters a ship without helm or compass, and who never can be certain whither he is going. Richter and Wells [2008]
This chapter aims to provide a stronger basis for motivating and accelerating future multimodal system design, and for improving the quality of its impact on human users.
For a definition of highlighted terms in this chapter, see the Glossary. For other related terms and concepts, also see the textbook on multimodal interfaces by Oviatt and Cohen [2015]. Focus Questions to aid comprehension are available at the end of this chapter.
1.1 Gestalt Theory: Understanding Multimodal Coherence, Stability, and Robustness
In cognitive neuroscience and experimental psychology, a rapidly growing literature during the past three decades has revealed that brain processing fundamentally involves multisensory perception and integration [Calvert et al. 2004, Stein 2012], which cannot be accounted for by studying the senses in isolation. Multisensory perception and communication are supported by multimodal neurons and multisensory convergence regions, which are a basic design feature of the human brain. As outlined in Section 1.2, it now is understood that multisensory integration of information exerts extensive control over human perception, attention, language, memory, learning, and other behaviors [Calvert et al. 2004, Schroeder and Foxe 2004, Stein and Meredith 1993]. This relatively recent shift from a unimodal to multisensory view of human perception reflects movement away from reductionism toward a perspective compatible with Gestalt theory, which originated in the late 1800s and early 1900s [for intellectual history, see Smith 1988]. Gestalt theory presents a holistic systems-level view of perception, which emphasizes self-organization of perceptual experience into meaningful wholes, rather than analyzing discrete elements as isolates. It asserts the principle of totality, or that the whole is a qualitatively different entity than the sum of its parts. A second overarching belief of Gestalt theory is the principle of psychophysical isomorphism, which states that conscious perceptual experience corresponds with underlying neural activity. These Gestalt views substantially predate Stein and Meredith's [1993] pioneering research on multisensory integration and the neurophysiology of the superior colliculus.
In terms of the principle of totality, a central tenet of Gestalt theory is that when elements (e.g., lines) are combined into a whole percept (e.g., human figure), emergent properties arise that transform a perceptual experience qualitatively. Multisensory processing research has demonstrated and investigated many unexpected perceptual phenomena. Reports abound of perceptual “illusions” once thought to represent exceptions to unimodal perceptual laws. A classic example is the case of Wertheimer’s demonstration in 1912 that two lines flashed successively at optimal intervals appear to move together, an illusion related to human perception of motion pictures [Koffka 1935]. In these cases, it is the whole percept that is apprehended first, not the elements composing it. That is, the whole is considered to have experiential primacy. These integrated percepts typically do not involve equal weighting of individual stimuli or simple additive functions [Calvert et al. 2004]. Although Gestalt theory’s main contributions historically involved perception of visual-spatial phenomena, as in the classic Wertheimer example, its laws also have been applied to the perception of acoustic, haptic, and other sensory input [Bregman 1990]. They likewise have been applied to the production of multimodal communications, and to human-computer interface design [Oviatt et al. 2003], as will be described further in this chapter.
Gestalt theory describes different laws or principles for perceptual grouping of information into a coherent whole, including the laws of proximity, symmetry, area, similarity, closure, continuity, common fate, and others [Koffka 1935, Kohler 1929, Wertheimer 1938]. With respect to perceptual processing, Gestalt theory claims that the elements of a percept first are grouped rapidly according to its main principles. In addition, more than one principle can operate at the same time. Gestalt laws, and the principles they describe, maintain that we organize experience in a way that is rapid, economical, symmetrical, continuous, and orderly. This is viewed as economizing mental resources, which permits a person’s focal attention to be allocated to a primary task.
The Gestalt law of proximity states that spatial or temporal proximity causes unisensory elements to be perceived as related. This principle has become the backbone for explaining current multisensory integration of whole percepts, which is formulated as two related rules. The spatial rule states that the likelihood and strength of multisensory integration depends on how closely located two unisensory stimuli are to one another [Stein and Meredith 1993]. In parallel, the temporal rule claims that the likelihood and strength of multisensory integration depends on the degree of close timing of two unisensory stimuli, which must occur within a certain window of time [Stein and Meredith 1993]. Multisensory integration research has confirmed and elaborated the role of these two principles at both the behavioral and neuroscience level. For example, it is now known that there is a relatively wide temporal window for perceiving simultaneity between signals during audiovisual perception [Dixon and Spitz 1980, Spence and Squire 2003]. A wide temporal window also has been demonstrated at the cellular level for simple stimuli in the superior colliculus [King and Palmer 1985, Meredith et al. 1987].
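To make the spatial and temporal rules concrete, here is a minimal Python sketch of a binding decision for a pair of unisensory events. The function name, the 200 ms temporal window, and the 30-degree spatial window are illustrative assumptions, not parameters reported in the studies cited above.

```python
from dataclasses import dataclass

@dataclass
class UnisensoryEvent:
    modality: str        # e.g., "auditory" or "visual"
    onset_s: float       # perceived onset time in seconds
    azimuth_deg: float   # perceived source direction in degrees

def likely_to_bind(a: UnisensoryEvent, b: UnisensoryEvent,
                   temporal_window_s: float = 0.2,
                   spatial_window_deg: float = 30.0) -> bool:
    """Temporal rule: the two signals must fall within a window of time.
    Spatial rule: they must also be close in space. Both thresholds here
    are placeholder parameters, not empirically derived values."""
    close_in_time = abs(a.onset_s - b.onset_s) <= temporal_window_s
    close_in_space = abs(a.azimuth_deg - b.azimuth_deg) <= spatial_window_deg
    return close_in_time and close_in_space

# Example: a flash and a beep 80 ms apart from roughly the same direction.
flash = UnisensoryEvent("visual", onset_s=0.00, azimuth_deg=10.0)
beep = UnisensoryEvent("auditory", onset_s=0.08, azimuth_deg=14.0)
print(likely_to_bind(flash, beep))  # True under these placeholder thresholds
```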
Glossary
Affordances are perceptually based expectations about actions that can be performed on objects in the world, which derive from people’s beliefs about their properties. Affordances invite and constrain people to interact with objects, including computer interfaces, in specific ways. They establish behavioral attunements that transparently prime people’s use of objects, including their physical and communicative actions, and they can lead to exploratory learning. Affordances can be analyzed at the biological, physical, perceptual, and symbolic/cognitive level, and they are influenced by cultural conventions. The way people use different computer input tools is influenced heavily by their affordances, which in turn has a substantial impact on human cognition and performance [Oviatt et al. 2012].
Disequilibrium refers to the Gestalt concept that people are driven to create a balanced, stable, and meaningful whole perceptual form. When this goal is not achieved or it is disrupted, then a state of tension or disequilibrium arises that alters human behavior. For example, if a user encounters a high rate of system errors during interaction, this will create a state of disequilibrium. Furthermore, human behavioral adaptation during disequilibrium will differ qualitatively depending on whether a system is multimodal or unimodal, which is consistent with the Gestalt principle of totality (see separate entries on multimodal hypertiming and hyperarticulation). During disequilibrium, behavioral adaptation observed in the user aims to fortify the organizational principles described in Gestalt laws in order to restore a coherent whole percept.
Extraneous cognitive load refers to the level of working memory load that a person experiences due to the properties of materials or computer interfaces they are using. High levels of extraneous cognitive load can undermine a user’s primary task performance. Extraneous load is distinguished from (1) intrinsic cognitive load, or the inherent difficulty level and related working memory load associated with a user’s primary task, and (2) germane cognitive load, or the level of a student’s effort and activity compatible with mastering new domain content, which may either be supported or undermined by interface design (e.g., due to inappropriate automation).
Hyperarticulation involves a stylized and clarified adaptation of a user’s typical unimodal speech, which she will shift into during disequilibrium—for example, when accommodating “at risk” listeners (e.g., hearing impaired), adverse communication environments (e.g., noisy), or interactions involving frequent miscommunication (e.g., error-prone spoken language systems). A user’s hyperarticulate speech to an error-prone speech system primarily involves a lengthier and more clearly articulated speech signal, as summarized in the CHAM model [Oviatt et al. 1998]. This type of hyper-clear unimodal speech adaptation is distinct from that observed when speech is combined multimodally, which involves multimodal hypertiming. In general, speakers hyperarticulate whenever they expect or experience a communication failure with their listener, which occurs during both interpersonal and human-computer exchanges. When interacting with spoken dialogue systems, it is a major cause of system recognition failure, although it can be avoided by designing a multimodal interface.
Limited resource theories focus on cognitive constraints, especially ones involving attention and working memory, that can act as bottlenecks limiting human processing. Examples of limited resource theories include Working Memory theory, Multiple Resource theory, and Cognitive Load theory. These and similar theories address how people adaptively conserve energy and mental resources, while striving to optimize performance on a task. They have been well supported by both behavioral and neuroscience data. Currently, Working Memory theory is most actively being researched and refined.
Maximum-likelihood estimation (MLE) principle applies Bayes rule during multisensory fusion to determine the variance associated with individual input signals, asymmetry in signal variance, and the degree to which one sensory signal dominates another in terms of the final multimodal percept. The MLE model also estimates variance associated with the combined multimodal percept, and the magnitude of any super-additivity observed in the final multimodal percept (see separate entry on super-additivity).
Multimodal hypertiming involves adaptation of a user’s typical multimodal construction, for example when using speech and writing, which she will shift into during disequilibrium. For example, when interacting with an error-prone multimodal system, a user’s input will adapt to accentuate or fortify their habitual pattern of signal co-timing. Since there is a bimodal distribution of users who either demonstrate a simultaneous or sequential pattern of multimodal signal co-timing, this means that (1) simultaneous integrators, whose input signals overlap temporally, will increase their total signal overlap, but (2) sequential integrators, who complete one signal piece before starting another with a lag in between, will instead increase the total lag between signals. This multimodal hypertiming represents a form of entrenchment, or hyper-clear communication, that is distinct from that observed when communicating unimodally (see separate entry on hyperarticulation).
Perception-action dynamic theories assert that perception, action, and consciousness are dynamically interrelated. They provide a holistic systems-level view of interaction between humans and their environment, including feedback processes as part of a dynamic loop. Examples of anti-reductionistic perception-action dynamic theories include Activity meta-theories, Embodied Cognition theory, Communication Accommodation theory, and Affordance theory. These theories claim that action may be either physical or communicative. In some cases, such as Communication Accommodation theory, they involve socially-situated theories. Perception-action dynamic theories have been well supported by both behavioral and neuroscience data, including research on mirror and echo neurons. Currently, Embodied Cognition theory is most actively being researched and refined, often in the context of human learning or neuroscience research. It asserts that representations involve activating neural processes that recreate a related action-perception experiential loop, which is based on multisensory perceptual and multimodal motor neural circuits in the brain [Nakamura et al. 2012]. During this feedback loop, perception of an action (e.g., writing a letter shape) primes motor neurons (e.g., corresponding finger movements) in the observer’s brain, which facilitates related comprehension (e.g., letter recognition and reading).
Super-additivity refers to multisensory enhancement of the neural firing pattern when two sensory signals (e.g., auditory and visual) are both activated during a perceptual event. This can produce a total response larger than the sum of the two sources of modality-specific input, which improves the reliability of the fused signal. Closer spatial or temporal proximity can increase super-additivity, and the magnitude of super-additivity increases in adverse conditions (e.g., noise, darkness). The maximum-likelihood estimation (MLE) principle has been applied to estimate the degree and pattern of super-additivity. One objective of multimodal system design is to support maximum super-additivity.
Research on multisensory integration has clarified that there are asymmetries during fusion in what type of signal input dominates a perceptual interpretation, and the degree to which it is weighted more heavily. In the temporal ventriloquism effect, asynchronous auditory and visual input can be fused by effectively binding an earlier visual stimulus into temporal alignment with a subsequent auditory one, as long as they occur within a given window of time [Morein-Zamir et al. 2003]. In this case, visual perception is influenced by auditory cues. In contrast, in the spatial ventriloquism effect the perceived location of a sound can be shifted toward a corresponding visual cue [Bertelson and deGelder 2004]. The maximum-likelihood estimation (MLE) principle of multisensory fusion, based on Bayes rule, has been used to estimate the degree to which one modality dominates another during signal fusion. This principle describes how signals are integrated in the brain to minimize variance in their interpretation, which maximizes the accuracy of the final multimodal interpretation. For example, during visual-haptic fusion, visual dominance occurs when the variance associated with visual estimation is lower than that for haptic estimation [Ernst and Banks 2002]. For further details, see Section 1.3.
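As a worked illustration of the MLE principle (a standard Gaussian formulation, not equations reproduced from the studies cited above), two unbiased unisensory estimates $\hat{s}_V$ and $\hat{s}_H$ with variances $\sigma_V^2$ and $\sigma_H^2$ are combined with weights inversely proportional to their variances:

\[
\hat{s}_{VH} = w_V\,\hat{s}_V + w_H\,\hat{s}_H,
\qquad
w_V = \frac{1/\sigma_V^2}{1/\sigma_V^2 + 1/\sigma_H^2},
\qquad
w_H = \frac{1/\sigma_H^2}{1/\sigma_V^2 + 1/\sigma_H^2},
\]
\[
\sigma_{VH}^2 = \frac{\sigma_V^2\,\sigma_H^2}{\sigma_V^2 + \sigma_H^2} \;\le\; \min(\sigma_V^2,\, \sigma_H^2).
\]

The weights make the dominance pattern explicit: the modality with the lower variance receives the larger weight, and the fused estimate is at least as reliable as the better unisensory estimate.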
Multisensory integration research also has elaborated our understanding of how spatial and temporal proximity influence the salience of a multisensory percept. Neurons in the deep superior colliculus now are well known to exhibit multisensory enhancement in their firing patterns, or super-additivity [Anastasio and Patton 2004]. This can produce responses larger than the sum of the two modality-specific sources of input [Bernstein and Benoit 1996, Anastasio and Patton 2004]. Closer proximity of related signals can produce greater super-additivity. This phenomenon functions to improve the speed and accuracy of human responsiveness to objects and events, especially in adverse conditions such as noise or darkness [Calvert et al. 2004, Oviatt 2000, Oviatt 2012]. From an evolutionary perspective, these behavioral adaptations have directly supported human survival in many situations.
In addition to promoting better understanding of multisensory perception, Gestalt theoretical principles have advanced research on users’ production of multimodal constructions during human-computer interaction. For example, studies of users’ multimodal spoken and written constructions confirm that integrated multimodal constructions are qualitatively distinct from their unimodal parts. In addition, Gestalt principles accurately predict the organizational cues that bind this type of multimodal construction [Oviatt et al. 2003]. In a pen-voice multimodal interface, a user’s speech input is an acoustic modality that is structured temporally. In contrast, her pen input is structured both temporally and spatially. Gestalt theory predicts that the common temporal dimension will provide organizational cues for binding these modalities during multimodal communication. That is, modality co-timing will serve to indicate and solidify their relatedness [Oviatt et al. 2003]. Consistent with this prediction, research has confirmed the following:
• Users adopt consistent co-timing of individual signals in their multimodal constructions, and their habitual pattern is resistant to change.
• When system errors or problem difficulty increase, users adapt the co-timing of their individual signals to fortify the whole multimodal construction.
Figure 1.1 Model of average temporal integration pattern for simultaneous and sequential integrators’ typical multimodal constructions. (From Oviatt et al. [2005])
From a neuroscience perspective, the general importance of modality co-timing is highlighted by previous findings showing that greater temporal binding, or synchrony of neuronal oscillations involving different sources of sensory input, is associated with improved task success. For example, correctly recognizing people depends on greater neural binding between multisensory regions that represent their appearance and voice [Hummel and Gerloff 2005].
Studies with over 100 users—children through seniors—have shown that users adopt one of two types of temporal organizational pattern when forming multimodal constructions. They either present simultaneous constructions in which speech and pen signals are overlapped temporally, or sequential ones in which one signal ends before the second begins and there is a lag between them [Xiao et al. 2002, 2003]. Figure 1.1 illustrates these two types of temporal integration pattern. A user’s dominant integration pattern is identifiable almost immediately, typically on the very first multimodal construction during an interaction. Furthermore, her habitual temporal integration pattern remains highly consistent (i.e., 88–93%), and it is resistant to change even after instruction and training.
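As a minimal sketch of how such a pattern could be identified from signal timing (hypothetical function names and data format, not the measurement procedure used in the studies above), a construction can be labeled by whether its speech and pen signals overlap:

```python
def integration_pattern(speech_onset: float, speech_offset: float,
                        pen_onset: float, pen_offset: float) -> str:
    """Label one construction 'simultaneous' when the speech and pen signals
    overlap in time, or 'sequential' when one signal ends before the other
    begins, leaving a lag between them."""
    overlap = min(speech_offset, pen_offset) - max(speech_onset, pen_onset)
    return "simultaneous" if overlap > 0 else "sequential"

def dominant_pattern(constructions: list[tuple[float, float, float, float]]) -> str:
    """Majority label across a user's constructions; in the studies above the
    very first construction is usually already predictive of the dominant pattern."""
    labels = [integration_pattern(*c) for c in constructions]
    return max(set(labels), key=labels.count)

# Example: the pen stroke starts 0.3 s before speech and the two overlap.
print(integration_pattern(speech_onset=0.3, speech_offset=1.8,
                          pen_onset=0.0, pen_offset=1.2))  # simultaneous
```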
A second Gestalt law, the principle of area, states that people will tend to group elements to form the smallest visible figure or briefest temporal interval. In the context of the above multimodal construction co-timing patterns, this principle predicts that most people will deliver their signal input simultaneously. Empirical research has confirmed that 70% of people across the lifespan are indeed simultaneous signal integrators, whereas 30% are sequential integrators [Oviatt et al. 2003].
Figure 1.2 Average increased signal overlap for simultaneous integrators in seconds (left), but increased lag for sequential integrators (right), as they handle an increased rate of system errors. (From Oviatt and Cohen [2015])
An important meta-principle underlying all Gestalt tendencies is the creation of a balanced and stable perceptual form that can maintain its equilibrium, just as the interplay of internal and external physical forces shapes an oil drop [Koffka 1935, Kohler 1929]. Gestalt theory states that any factors that threaten a person’s ability to achieve a goal create a state of tension, or disequilibrium. Under these circumstances, it predicts that people will fortify basic organizational phenomena associated with a percept to restore balance [Koffka 1935, Kohler 1929]. As an example, if a person interacts with a multimodal system and it makes a recognition error so she is not understood, then this creates a state of disequilibrium. When this occurs, research has confirmed that users fortify, or further accentuate, their usual pattern of multimodal signal co-timing (i.e., either simultaneous or sequential) by approximately 50%. This phenomenon is known as multimodal hypertiming [Oviatt et al. 2003]. Figure 1.2 illustrates increased multimodal signal overlap in simultaneous integrators, but increased signal lag in sequential integrators as they experience more system errors. Multimodal hypertiming also has been demonstrated in users’ constructions when problem difficulty level increases [Oviatt et al. 2003].
From a Gestalt viewpoint, this behavior aims to re-establish equilibrium by fortifying multimodal signal co-timing, the basic organizational principle of such constructions, which results in a more coherent multimodal percept under duress. This multimodal hypertiming contributes to hyper-clear communication that increases the speed and accuracy of perceptual processing by a listener. This manifestation of hyper-clear multimodal communication is qualitatively distinct from the hyper-clear adaptations observed in unimodal components. For example, in a unimodal spoken construction users increase their speech signal’s total length and degree of articulatory control as part of hyperarticulation when system errors occur. However, this unimodal adaptation diminishes or disappears altogether when speech is part of a multimodal construction [Oviatt et al. 2003].
The Gestalt law of symmetry states that people have a tendency to perceive symmetrical elements as part of the same whole. They view objects as symmetrical, formed around a center point. During multimodal human-computer interaction involving speech and pen constructions, Gestalt theory would predict that more symmetrical organization entails closer temporal correspondence or co-timing between the two signal pieces, and a closer matching of their proportional length. This would be especially evident in significantly increased co-timing of the component signals’ onsets and offsets. Research on multimodal interaction involving speech and pen constructions has confirmed that users increase the co-timing of their signal onsets and offsets during disequilibrium, such as when system errors increase [Oviatt et al. 2003].
In summary, the Gestalt principles outlined above have provided a valuable framework for understanding how people perceive and organize multisensory information, as well as multimodal input to a computer interface. These principles have been used to establish new requirements for multimodal speech and pen interface design [Oviatt et al. 2003]. They also have supported computational analysis of other types of multimodal system, for example involving pen and image content [Saund et al. 2003].
One implication of these results is that time-sensitive multimodal systems need to accurately model users’ multimodal integration patterns, including adaptations in signal timing that occur during different circumstances. In particular, user-adaptive multimodal processing is a fertile direction for system development. One example is the development of new strategies for adapting temporal thresholds in time-sensitive multimodal architectures during the fusion process, which could yield substantial improvements in system response speed, robustness, and overall usability [Huang and Oviatt 2005, Huang et al. 2006].
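The following is a minimal sketch of this kind of user-adaptive temporal thresholding (the class name and padding value are illustrative assumptions; it is not the architecture described by Huang and Oviatt [2005]): the fusion engine tracks a user's observed inter-signal lags and derives how long to wait for a possible second signal before interpreting input unimodally.

```python
class AdaptiveFusionThreshold:
    """Track a user's observed inter-signal lags and derive the temporal
    threshold a fusion engine waits before interpreting input unimodally."""

    def __init__(self, default_threshold_s: float = 1.0):
        self.lags: list[float] = []
        self.default_threshold_s = default_threshold_s

    def observe_lag(self, lag_s: float) -> None:
        # Lag between the end of the first signal and the start of the second;
        # negative values indicate temporal overlap (a simultaneous integrator).
        self.lags.append(lag_s)

    def threshold_s(self, padding_s: float = 0.25) -> float:
        if not self.lags:
            return self.default_threshold_s
        # Wait slightly longer than the largest lag seen so far, so a
        # sequential integrator is not prematurely interpreted unimodally.
        return max(0.0, max(self.lags)) + padding_s

fusion = AdaptiveFusionThreshold()
for lag in [0.4, 0.6, 0.5]:   # a sequential integrator's typical lags
    fusion.observe_lag(lag)
print(round(fusion.threshold_s(), 2))  # 0.85 s instead of the 1.0 s default
```

For a simultaneous integrator, whose lags are negative, the threshold shrinks toward the padding value, so the system can respond sooner.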
An additional implication is that Gestalt principles, and the multisensory research findings that have further elaborated them, potentially can provide useful guidance for designing “well integrated” multimodal interfaces [Reeves et al. 2004]. Researchers have long been interested in defining what it means to be a well-integrated multimodal interface, including the circumstances under which super-additivity effects can be expected rather than interference between modalities. One particularly salient strategy is to integrate maximally complementary input modes in a multimodal interface, or ones that produce a highly synergistic blend in which the strengths of each mode can be capitalized upon and used to overcome weaknesses in the other [Cohen et al. 1989, Oviatt and Cohen 2015]. Complementarity can aim to minimize variance in estimations of individual signal interpretation, which maximizes the accuracy of the final multimodal interpretation, as discussed previously. Alternatively, it can aim to expand the functional utility of an interface for a user.
Further research could leverage theory and multidisciplinary research findings to determine what it means to be a well-integrated multimodal interface beyond simply selecting the modalities for inclusion. For further discussion of this topic, see Section 8.4, “Principles for Strategizing Multimodal Integration,” in Oviatt and Cohen [2015].
1.2 Working Memory Theory: Performance Advantages of Distributing Multimodal Processing
In comparison with Gestalt theory, a major theme of Working Memory theory is that attention and working memory are bottlenecks constricting information processing during cognitive activities. Working memory is a limited-capacity system that is critical for basic cognitive functions, including planning, problem solving, inferential reasoning, language comprehension, written composition, and others. It focuses on goal-oriented task processing, and is susceptible to distraction, especially from simultaneous processes and closely related information. Working Memory theory, Multiple Resource theory, and Cognitive Load theory all are limited resource theories that address the fundamental issue in multimodal interface design of how to manage input and output modalities in a way that alleviates this bottleneck in order to optimize human performance. See Kopp et al.’s Chapter 6 in this volume for a related description of recent extensions of Working Memory theory, and Zhou and colleagues’ chapter in Volume 2 [Zhou et al. 2017] for current approaches to real-time assessment of cognitive load based on different modalities and sensors.
Working memory refers to the ability to store information temporarily in mind, usually for a matter of seconds without external aids, before it is consolidated into long-term memory. To be consolidated into long-term memory, information in working memory requires continual rehearsal or else it becomes unavailable. Loss of information from working memory is influenced by cognitive load, which can be due to task difficulty, dual tasking, interface complexity, and similar factors. It also can occur when the content of distractors interferes with to-be-remembered information [Waugh and Norman 1965].
Miller and colleagues originally introduced the term “working memory” over 50 years ago [Miller et al. 1960]. They described the span of working memory as limited to approximately seven elements or “chunks,” which could involve different types of content such as digits or words [Miller 1956]. Expansion of this limit can be achieved under some circumstances, for example when information content involves different modalities that are processed in different brain areas. The development of domain expertise also can effectively expand working memory limits, because it enables a person to perceive and group isolated units of information into larger organized wholes. As a result, domain experts do not need to retain and retrieve as many units of information from working memory when completing a task, which frees up memory reserves for focusing on other or more difficult tasks.
Baddeley and Hitch [1974] proposed a particularly consequential theory of working memory in the 1970s, which preceded modern neuroscience findings on multisensory-multimodal brain processing. According to Baddeley’s theory, working memory consists of multiple semi-independent processors associated with different modalities [Baddeley 1986, 2003]. A visual-spatial “sketch pad” processes visual materials such as pictures and diagrams, whereas a separate “phonological loop” stores auditory-verbal information in a different brain area. These lower-level modality-specific processing systems are viewed as functioning largely independently. They are responsible for constructing and maintaining information in mind through rehearsal activities. In addition, Baddeley describes a higher-level “central executive” component that plans actions, directs attention to relevant information while suppressing irrelevant information, manages integration of information from the lower-level modality stores, coordinates processing when two tasks are performed at once, initiates retrieval of long-term memories, and manages overall decision-making processes [Baddeley 1986, 2003].
It is the semi-independence of lower-level modality-specific processing that enables people to use multiple modalities during a task in a way that circumvents short-term memory limitations, effectively expanding the size of working memory. For example, during dual tasking it is easier to maintain digits in mind while working on a spatial task than another numeric one [Maehara and Saito 2007]. Likewise, it is easier to simultaneously process information presented auditorily and visually than two auditory tasks. The “expansion” of working memory reserves that occurs is especially important as tasks become more difficult, because under these circumstances more elements of information typically must be integrated to solve a problem. Two key implications of these theoretical contributions for interface design are the following:
• Human performance improves when a computer interface combines different modalities that can support complementary information processing in separate brain regions conducted simultaneously. An advantage can accrue whether simultaneous information processing involves two input streams, an input and output stream, or two output streams.
• Flexible multimodal interfaces that support these processing advantages are essential as tasks become more difficult, or whenever users’ processing abilities are limited.
Multiple Resource theory, which is related to Working Memory theory, directly addresses the above processing advantages due to modality complementarity [Wickens et al. 1983, Wickens 2002]. It states that modalities can compete for resources during a task, so the attention and processing required for input and output yield better human performance when information is distributed across complementary modalities. For example, verbal input is more compatible with simultaneous visual than auditory output. In short, this theory holds that cross-modal time-sharing is more effective than intra-modal time-sharing. The implication of both Working Memory and Multiple Resource theories is that multimodal interface design that permits distributing processing across different modality-specific brain regions can minimize interference and cognitive load, improving performance.
Working memory is a theoretical concept that is actively being researched in both cognitive psychology and neuroscience. During the past few decades, research on the neural basis of memory function has advanced especially rapidly [D’Esposito 2008]. This work has confirmed and elaborated our understanding of modality-specific brain regions, the process of multisensory fusion, and the circumstances under which interference occurs during consolidation of information in memory. Neurological evidence has confirmed that working memory is lateralized, with the right prefrontal cortex more engaged in visual-spatial working memory, and the left more active during verbal-auditory tasks [Owen et al. 2005, Daffner and Searl 2008]. Working Memory theory is well aligned with Activity theory (see Section 1.3) in emphasizing the dynamic processes that construct and actively suppress memories, which are a byproduct of neural activation and inhibition. For example, active forgetting is now understood to be an inhibitory process at the neural level that is under conscious control [Anderson and Green 2001].
Cognitive Load theory, introduced by John Sweller and colleagues, applies working memory concepts to learning theory [Sweller 1988]. It maintains that during the learning process, students can acquire new schemas and automate them more easily if instructional methods or computer interfaces minimize demands on students’ attention and working memory, thereby reducing extraneous cognitive load [Baddeley 1986, Mousavi et al. 1995, Oviatt 2006, Paas et al. 2003, van Merrienboer and Sweller 2005]. Cognitive load researchers assess the extraneous complexity associated with instructional methods and tools separately from the intrinsic complexity and load of a student’s main learning task. Assessments typically compare performance indices of cognitive load as students use different curriculum materials or computer interfaces. Educational researchers then focus on evidence-based redesign of these materials and tools to decrease students’ extraneous cognitive load, so their learning progress can be enhanced.
Numerous learning studies have shown that a multimodal presentation format supports students’ learning more successfully than does unimodal presentation. For example, presentation of educational information that includes diagrams and audiotapes improves students’ ability to solve geometry problems, compared with visual-only presentation of comparable information content [Mousavi et al. 1995]. When using the multimodal format, larger performance advantages have been demonstrated on more difficult tasks, compared with simpler ones [Tindall-Ford et al. 1997]. These performance advantages of a multimodal presentation format have been replicated in different content domains, with different types of instructional materials (e.g., computer-based multimedia animations), and using different dependent measures [Mayer and Moreno 1998, Tindall-Ford et al. 1997]. These research findings based on educational activities are consistent with the general literature on multimodal processing advantages.
In recent years, Cognitive Load theory has been applied more broadly to computer interface design [Oviatt 2006]. It has supported the development of multimodal interfaces for education, and adaptive interface design tailored to a learner’s level of domain knowledge. Empirical studies have demonstrated that flexible multimodal interfaces are effective partly because they support students’ ability to self-manage their own working memory in a way that reduces cognitive load [Oviatt et al. 2004a]. For example, students prefer to interact unimodally when working on easy problems. However, they will upshift to interacting multimodally on harder problems in order to distribute processing, minimize cognitive load, and improve their performance [Oviatt et al. 2004a].
One implication of these research findings is that flexible multimodal interfaces are especially well suited for applications like education, which typically involve higher levels of load associated with mastering new content. In fact, all applications that require extended thinking and reasoning potentially could be improved by implementing a flexible and expressively powerful multimodal interface. In addition, Working Memory theory has direct implications for designing well-integrated multimodal interfaces that can combine complementary modalities to process specific types of content with minimal interference effects.
1.3 Activity Theory, Embodied Cognition, and Multisensory-Multimodal Facilitation of Cognition
Activity Theory is a meta-theory with numerous branches. It makes the central claim that activity and consciousness are dynamically interrelated. Vygotskian Activity theory states that physical and communicative activity play a major role in mediating, guiding, and refining mental activities [Luria 1961, Vygotsky 1962, 1978, 1987].
In Vygotsky’s view, the most powerful tools for semiotic mediation are symbolic representational ones such as language. Vygotsky was especially interested in speech, which he believed serves dual purposes: (1) social communication and (2) self-regulation during physical and mental activities. He described self-regulatory language, also known as “self talk” or “private speech,” as a think-aloud process in which individuals verbalize poorly understood aspects of difficult tasks to assist in guiding their thought [Berk 1994, Duncan and Cheyne 2002, Luria 1961]. In fact, during human-computer interaction the highest rates of self talk occur during more difficult tasks [Xiao et al. 2003]. For example, when using a multimodal interface during a map task, people typically have the most difficulty with relative directional information. They may subvocalize, “East, no, west of …” when thinking about where to place a landmark on a digital map. As a map task increases in difficulty, users’ self talk progressively increases, which has been shown to improve their performance [Xiao et al. 2003].
Since Vygotsky’s original work on the role of speech in self-regulation, further research has confirmed that activity in all communication modalities mediates thought, and plays a self-regulatory role in improving performance [Luria 1961, Vygotsky 1962, 1987]. As tasks become more difficult, speech, gesture, and writing all increase in frequency, reducing cognitive load and improving performance [Comblain 1994, Goldin-Meadow et al. 2001, Oviatt et al. 2007, Xiao et al. 2003]. For example, manual gesturing reduces cognitive load and improves memory during math tasks, with increased benefit on more difficult tasks [Goldin-Meadow et al. 2001]. When writing, students also diagram more as math problems become harder, which can improve correct solutions by 30–40% [Oviatt 2006, 2007]. In summary, research across modalities is compatible with Vygotsky’s theoretical view that communicative activity mediates thought and improves performance [Luria 1961, Vygotsky 1962].
Activity theory is well supported by neuroscience results on activity- and experience-dependent neural plasticity. Activity-dependent plasticity adapts the brain according to the frequency of an activity. Activities have a profound impact on human brain structure and processing, including changes in the number and strength of synapses, dendritic branching, myelination, the presence of neurotransmitters, and changes in cell responsivity, which are associated with learning and memory [Markham and Greenough 2004, Sale et al. 2009]. Recent neuroscience data indicate that physical activity can generate change within minutes in neocortical dendritic spine growth, and the extent of dendritic spine remodeling correlates with success of learning [Yang et al. 2009]. Other research has shown that the experience of using a tool can change the properties of multisensory neurons involved in their control [Ishibashi et al. 2004].
A major theme uncovered by neuroscience research related to Activity theory is the following:
• Neural adaptations are most responsive to direct physical activity, rather than passive viewing or vicarious experience [Ferchmin and Bennett 1975].
One major implication of this finding is that the design of computer input tools is particularly consequential for eliciting actions that directly stimulate cognition. This contrasts with the predominant engineering focus on developing system output capabilities.
In addition, neuroscience findings emphasize the following:
• Physical activity that involves novel or complex actions is most effective at stimulating synaptogenesis, or neural adaptations compatible with learning and memory. In contrast, familiar and simple actions do not have the same impact [Black et al. 1990, Kleim et al. 1997].
A further theme revealed by neuroscience research, which focuses specifically on activity theory and multimodality, is:
• Multisensory and multimodal activity involve more total neural activity across a range of modalities, more intense bursts of neural activity, more widely distributed activity across the brain’s neurological substrates, and longer distance connections.
Since multimodal interfaces elicit more extensive neural activity across many dimensions, compared with unimodal interfaces, they can have a greater impact on stimulating cognition. In particular, they produce deeper and more elaborated learning, improve long-term memory, and result in higher performance levels during human-computer interaction [Oviatt 2013].
Embodied Cognition theory, which is related to Activity theory and Situated Cognition theory, asserts that thought is directly shaped by actions in context as part of an action-perception loop [Beilock et al. 2008, Shapiro 2014, Varela et al. 1991]. For example, specific gestures or hand movements during problem solving can facilitate an understanding of proportional equivalence and other mathematical concepts [Goldin-Meadow and Beilock 2010, Howison et al. 2011]. Representations and meaning are created and interpreted within activity, rather than being stored as past knowledge structures. More specifically, representation involves activating neural processes that recreate a related action-perception experiential loop. Two key findings in the embodied cognition literature are that:
• The action-perception loop is based on multisensory perceptual and multimodal motor neural circuits in the brain [Nakamura et al. 2012].
• Complex multisensory or multimodal actions, compared with unimodal or simpler actions, can have a substantial and broad facilitatory effect on cognition [James 2010, Kersey and James 2013, Oviatt 2013].
As an example, writing complex letter shapes creates a long-term sensory-motor memory, which is part of an integrated multisensory-multimodal “reading neural circuit” [Nakamura et al. 2012]. The multisensory experience of writing includes a combination of haptic, auditory, and visual feedback. In both children and adults, actively writing letters has been shown in fMRI studies to increase brain activation to a greater extent than passively viewing, naming, or typing them [James 2010, James and Engelhardt 2012, Kersey and James 2013, Longcamp et al. 2005, 2008]. Compared with simple tapping on keys during typing, constructing letter shapes also improves the accuracy of subsequent letter recognition, a prerequisite for successful comprehension and reading. Writing letters basically leads to a more elaborated and durable ability to recognize letter shapes over time. Research by Berninger and colleagues [Berninger et al. 2009, Hayes and Berninger 2010] has further documented that the multisensory-multimodal experience of writing letter shapes, compared with typing, facilitates spelling, written composition, and the content of ideas expressed in a composition. This extensive body of neuroscience and behavioral findings has direct implications for the broad cognitive advantages of pen-based and multimodal interfaces.
Figure 1.3 Embodied cognition view of the perception-action loop during multisensory integration, which utilizes the Maximum Likelihood Estimation (MLE) model and combines prior knowledge with multisensory sources of information. (From Ernst and Bulthoff [2004])
In research on multisensory integration, Embodied Cognition theory also has provided a foundation for understanding human interaction with the environment from a systems perspective. Figure 1.3 illustrates how multisensory signals from the environment are combined with prior knowledge to form more accurate percepts [Ernst and Bulthoff 2004]. Ernst and colleagues describe multisensory integration with the Maximum Likelihood Estimation (MLE) model, which is based on Bayes’ rule. As introduced earlier, MLE integrates sensory signal input to minimize variance in the final estimate under different circumstances. It determines the degree to which information from one modality will dominate over another [Ernst and Banks 2002, Ernst and Bulthoff 2004]. For example, the MLE rule predicts that visual capture will occur whenever the visual stimulus is relatively noise-free and its estimate of a property has less variance than the haptic estimate. Conversely, haptic capture will prevail when the visual stimulus is noisier.
Empirical research has shown that the human nervous system’s multisensory perceptual integration process is very similar to the MLE integrator model. Ernst and Banks [2002] demonstrated this in a visual and haptic task. The net effect is that the final estimate has lower variance than either the visual or the haptic estimator alone. To support decision-making, prior knowledge is incorporated into the sensory integration model to further disambiguate sensory information. As depicted in Figure 1.3, this embodied perception-action process provides a basis for deciding what goal-oriented action to pursue. Selective action may in turn recruit further sensory information, alter the environment that is experienced, or change people’s understanding of their multisensory experience. See James and colleagues’ Chapter 2 in this volume [James et al. 2017] for an extensive discussion and empirical evidence supporting Embodied Cognition Theory.
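Below is a minimal numeric sketch of MLE fusion under Gaussian assumptions, in the spirit of the Ernst and Banks [2002] visual-haptic experiments but with made-up estimates and variances:

```python
def mle_fuse(est_visual: float, var_visual: float,
             est_haptic: float, var_haptic: float) -> tuple[float, float]:
    """Fuse two unbiased estimates by weighting each inversely to its variance;
    the fused variance is lower than either unisensory variance."""
    w_visual = (1 / var_visual) / (1 / var_visual + 1 / var_haptic)
    w_haptic = 1.0 - w_visual
    fused_est = w_visual * est_visual + w_haptic * est_haptic
    fused_var = (var_visual * var_haptic) / (var_visual + var_haptic)
    return fused_est, fused_var

# Made-up example: a clean visual size estimate dominates a noisier haptic one.
est, var = mle_fuse(est_visual=5.0, var_visual=0.2, est_haptic=6.0, var_haptic=0.8)
print(round(est, 2), round(var, 2))  # 5.2 0.16
```

With these numbers the visual weight is 0.8, so the fused estimate is pulled toward the visual one (visual capture), and the fused variance (0.16) is lower than either unisensory variance.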
Communication Accommodation theory presents a socially situated perspective on embodied cognition. It has shown that interactive human dialogue involves extensive co-adaptation of communication patterns between interlocutors. Interpersonal conversation is a dynamic adaptive exchange in which speakers’ lexical, syntactic, and speech signal features all are tailored in a moment-by-moment manner to their conversational partner. In most cases, children and adults adapt all aspects of their communicative behavior to converge with those of their partner, including speech amplitude, pitch, rate of articulation, pause structure, response latency, phonological features, gesturing, drawing, body posture, and other aspects [Burgoon et al. 1995, Fay et al. 2010, Giles et al. 1987, Welkowitz et al. 1976]. The impact of these communicative adaptations is to enhance the intelligibility, predictability, and efficiency of interpersonal communication [Burgoon et al. 1995, Giles et al. 1987, Welkowitz et al. 1976]. For example, if one speaker uses a particular lexical term, then their partner has a higher likelihood of adopting it as well. This mutual shaping of lexical choice facilitates language learning, and also the comprehension of newly introduced ideas between people.
Communication accommodation occurs not only in interpersonal dialogue, but also during human-computer interaction [Oviatt et al. 2004b, Zolton-Ford 1991]. These mutual adaptations also occur across different modalities (e.g., handwriting, manual signing), not just speech. For example, when drawing, interlocutors typically shift from initially sketching a careful likeness of an object to converging with their partner’s simpler drawing [Fay et al. 2010]. A similar convergence of signed gestures has been documented between deaf communicators. Within a community of previously isolated deaf Nicaraguans who were brought together in a school for the deaf, a novel sign language became established rapidly and spontaneously. This new sign language and its lexicon most likely emerged through convergence of the signed gestures, which then became widely produced among community members as they formed a new language [Kegl et al. 1999, Goldin-Meadow 2003].
At the level of neurological processing, convergent communication patterns are controlled by the mirror and echo neuron systems [Kohler et al. 2002, Rizzolatti and Craighero 2004]. Mirror and echo neurons provide the multimodal neurological substrate for action understanding, both at the level of physical and communicative actions. Observation of an action in another person primes an individual to prepare for action, and also to comprehend the observed action. For example, when participating in a dialogue during a cooking class, one student may observe another’s facial expressions and pointing gesture when she says, “I cut my finger.” In this context, the listener is primed multimodally to act, comprehend, and perhaps reply verbally. The listener experiences neurological priming, or activation of their own brain region and musculature associated with fingers. This prepares the listener to act, which may involve imitating the observed retraction with their own fingers. The same neurological priming enables the listener to comprehend the speaker’s physical experience and emotional state. This socially situated perception-action loop provides the evolutionary basis for imitation learning, language learning, and mutual comprehension of ideas.
This theory and the related literature on convergence of multimodal communication patterns have been applied to designing more effective conversational software personas and social robots. One direct implication of this work is that the design of a system’s multimodal output can be used to transparently guide users to provide input that is more compatible with a system’s processing repertoire, which improves system reliability and performance [Oviatt et al. 2004b]. As examples, users interacting with a computer have been shown to adopt a more processable volume, rate, and lexicon [Oviatt et al. 2004b, Zolton-Ford 1991].
Affordance theory presents a systems-theoretic view closely related to Gestalt theory. It also is a complement to Activity theory, because it specifies the type of activity that users are most likely to engage in when using different types of computer interface. It states that people have perceptually based expectations about objects, including computer interfaces, which involve different constraints on how one can act on them to achieve goals. These affordances of objects establish behavioral attunements that transparently but powerfully prime the likelihood that people will act in specific ways [Gibson 1977, 1979]. Affordance theory has been widely applied to human interface design, especially the design of input devices [Gaver 1991, Norman 1988].
Since object perception is multisensory, people are influenced by an array of object affordances (e.g., auditory, tactile), not just their visual properties [Gaver 1991, Norman 1988]. For example, the acoustic qualities of an animated computer persona’s voice can influence a user’s engagement and the content of their dialogue contributions. In one study, when an animated persona sounded like a master teacher by speaking with higher amplitude and wider pitch excursions, children asked more questions about science [Oviatt et al. 2004b]. This example not only illustrates that affordances can be auditory, but also that they affect the nature of communicative actions as well as physical ones [Greeno 1994, Oviatt et al. 2012]. Furthermore, this impact on communication patterns involves all modalities, not just spoken language [Oviatt et al. 2012].
Recent interpretations of Affordance theory, especially as applied to computer interface design, specify that it is human perception of interface affordances that elicits specific types of activity, not just the presence of specific physical attributes. Affordances can be described at different levels, including biological, physical, perceptual, and symbolic/cognitive [Zhang and Patel 2006]. They are distributed representations that are the by-product of external representations of an object (e.g., streetlight color) and internal mental representations that a person maintains about their action potential (e.g., cultural knowledge that “red” means stop), which determines the person’s physical response. This example of an internal representation involves a cognitive affordance, which originates in cultural conventions mediated by symbolic language (i.e., “red”) that are specific to a person and her cultural/linguistic group.
Affordance theory emphasizes that interfaces should be designed to facilitate easy discoverability of the actions they are intended to support. It is important to note that the behavioral attunements that arise from object affordances depend on perceived action possibilities that are distinct from specific learned patterns. As such, they are potentially capable of stimulating human activity in a way that facilitates learning in contexts never encountered before. For this reason, if interface affordances are well matched with a task domain, they can increase human activity patterns that stimulate exploratory learning, cognition, and overall performance.
Motivated by both Affordance theory and Activity theory, research on human-computer interaction has shown that more expressively powerful interfaces can substantially stimulate human communicative activity and corresponding cognition. An expressively powerful computer interface is one that can convey information involving multiple modalities, representations, or linguistic codes [Oviatt 2013]. Recent research has shown that different input capabilities, such as a keyboard vs. digital pen, have affordances that prime qualitatively different types of communicative content. In one study, students expressed 44% more nonlinguistic representational content (e.g., numbers, symbols, diagrams) when using a pen interface. In contrast, when the same students worked on the same type of problems with keyboard input, they switched to expressing 36% more linguistic content (e.g., words, abbreviations) [Oviatt et al. 2012].
These differences in communication pattern corresponded with striking changes in students’ cognition. In particular, when students used a pen interface and wrote more nonlinguistic content, they also generated 36% more appropriate biology hypotheses. A regression analysis revealed that knowledge of individual students’ level of nonlinguistic fluency accounted for a substantial 72% of all the variance in their ability to produce appropriate science ideas (see Figure 1.4, left; Oviatt et al. 2012). However, when the same students used the keyboard interface and communicated more linguistic content, a regression now indicated a substantial decline in science ideation (see Figure 1.4, right). In this case, knowledge of students’ level of linguistic communication had a negative predictive relation with their ability to produce appropriate science ideas. That is, it accounted for 62% of the variation in students’ inability to produce biology hypotheses.
Figure 1.4 Regression analysis showing positive relation between nonlinguistic communicative fluency and ideational fluency (left). Regression showing negative relation between linguistic communicative fluency and ideational fluency (right). (From Oviatt et al. [2012])
From an Activity theory perspective, neuroscience, behavioral, and human-computer interface research all consistently confirm that engaging in more complex and multisensory-multimodal physical actions, such as writing letter shapes, can stimulate human cognition more effectively than passive viewing, naming, or tapping on a keyboard. Keyboard interfaces never were designed to be a thinking tool. They constrict the representations, modalities, and linguistic codes that can be communicated when using computers, and therefore fail to provide comparable expressive power [Oviatt 2013].
In addition, research related to Activity theory has highlighted the importance of communication as a type of activity that directly stimulates and shapes human cognition. This cognitive facilitation has been demonstrated in a variety of communication modalities. In summary, multimodal interface design is a fertile direction for supporting computer applications involving extended thinking and reasoning.
All of the theories presented in this chapter have limitations in scope, but collectively they provide converging perspectives on multisensory perception, multimodal communication, and the design of multimodal interfaces that effectively blend information sources. The focus of this chapter has been to summarize the strengths of each theory, and to describe how these theories have been applied to date in the design of multimodal interfaces. In this regard, the present chapter is by no means exhaustive. Rather, it highlights examples of how theory has influenced past multimodal interface design, often in rudimentary ways. In the future, new and more refined theories will be needed that can predict and coherently explain multimodal research findings, and shed light on how to design truly well-integrated multimodal interfaces and systems.
Focus Questions
1.1. Describe the two main types of theory that have provided a basis for understanding multisensory perception and multimodal communication.
1.2. What neuroscience findings currently support Gestalt theory, Working Memory theory, Activity theory, Embodied Cognition theory, and Communication Accommodation theory?
1.3. What human-computer interaction research findings support these theories?
1.4. What Gestalt laws have become especially important in recent research on multisensory integration? And how has the field of multisensory perception substantially expanded our understanding of multisensory fusion beyond these initial Gestalt concepts?
1.5. What Working Memory theory concept has been central to understanding the performance advantages of multimodal interfaces, as well as how to design them?
1.6. Activity theory and related research assert that communicative activity in all modalities mediates thought, and plays a direct role in guiding and improving human performance. What is the evidence for this in human-computer interaction studies? And what are the key implications for designing multimodal-multisensor interfaces?
1.7. How is the action-perception loop, described by Embodied Cognition theory, relevant to multisensory perception and multimodal actions? Give one or more specific examples.
1.8. How do multisensory and multimodal activity patterns influence the brain and its neurological substrates, compared with unimodal activity? What are the implications for multimodal-multisensor interface design?
1.9. What is the principle of complementarity, and how does it relate to designing “well-integrated” multimodal-multisensor systems? What are the various ways that modality complementarity can be defined, as well as measured?
1.10. The field’s understanding of how to design well-integrated multimodal interfaces remains rudimentary, in particular focused on what modalities to include in a system. What other more fine-grained questions should be asked regarding how to design a well-integrated system? And how could future human-computer interaction research and theory be organized to refine our understanding of this topic?
References
T. Anastasio and P. Patton. 2004. Analysis and modeling of multisensory enhancement in the deep superior colliculus. In G. Calvert, C. Spence, and B. Stein, editors. The Handbook of Multisensory Processing. pp. 265–283. MIT Press, Cambridge, MA. 25
M. Anderson and C. Green. 2001. Suppressing unwanted memories by executive control. Nature, 410:366–369. DOI: 10.1038/35066572. 32
A. Baddeley. 1986. Working Memory. Oxford University Press, New York. 30, 32
A. Baddeley. 2003. Working memory: Looking back and looking forward. Nature Reviews., 4:829–839. DOI: 10.1038/nrn1201. 30
A. D. Baddeley and G.J. Hitch. 1974. Working memory. In G. H. Bower, editor. The Psychology of Learning and Motivation: Advances in Research and Theory, vol. 8, pp. 47–89. Academic, New York. 30
S. L. Beilock, I. M. Lyons, A. Mattarella-Micke, H. C. Nusbaum, and S. L. Small. 2008. Sports experience changes the neural processing of action language. In Proceedings of the National Academy of Sciences, vol. 105, pages 13269–13273. DOI: 10.1073/pnas.0803424105. 35
L. E. Berk. 1994. Why children talk to themselves. Scientific American, 71(5):78–83. 33
V. Berninger, R. Abbott, A. Augsburger, and N. Garcia. 2009. Comparison of pen and keyboard transcription modes in children with and without learning disabilities. Learning Disability Quarterly 32:123–141. DOI: 10.2307/27740364. 35
L. Bernstein and C. Benoit. 1996. For speech perception by humans or machines, three senses are better than one. In Proceedings of the International Conference on Spoken Language Processing, vol. 3, pages 1477–1480. DOI: 10.1.1.16.5972. 25
P. Bertelson and B. deGelder. 2004. The psychology of multimodal perception. In C. Spence and J. Driver, editors. Crossmodal Space and Crossmodal Attention, pp. 141–177. Oxford University Press, Oxford, UK. DOI: 10.1093/acprof:oso/9780198524861.003.0007. 24
J. Black, K. Isaacs, B. Anderson, A. Alcantara, and W. Greenough. 1990. Learning causes synaptogenesis, whereas motor activity causes angiogenesis in cerebellar cortex of adult rats. In Proceedings of the National Academy of Sciences, vol. 87, pp. 5568–5572. 34
A. S. Bregman. 1990. Auditory Scene Analysis. MIT Press, Cambridge, MA. 21
J. Burgoon, L. Stern, and L. Dillman. 1995. Interpersonal Adaptation: Dyadic Interaction Patterns. Cambridge University Press, Cambridge, UK. 37
G. Calvert, C. Spence, and B. E. Stein, editors. 2004. The Handbook of Multisensory Processing. MIT Press, Cambridge, MA. 20, 21, 25
P. R. Cohen, M. Dalrymple, D. B. Moran, F. C. N. Pereira, J. W. Sullivan, R. A. Gargan, J. L. Schlossberg, and S. W. Tyler. 1989. Synergistic use of direct manipulation and natural language. In Proceedings of the Conference on Human Factors in Computing Systems [CHI’89], pp. 227–234. ACM Press, New York. Reprinted in M. T. Maybury and W. Wahlster, editors. 1998. Readings in Intelligent User Interfaces, pp. 29–37. Morgan Kaufmann, San Francisco. DOI: 10.1145/67450.67494. 29
A. Comblain. 1994. Working memory in Down’s Syndrome: Training the rehearsal strategy. Down’s Syndrome Research and Practice, 2(3):123–126. DOI: 10.3104/reports.42. 33
K. Daffner and M. Searl. 2008. The dysexecutive syndromes. In G. Goldenberg and B. Miller, editors. Handbook of Clinical Neurology, vol. 88, ch. 12, pp. 249–267. Elsevier B.V. DOI: 10.1016/S0072-9752(07)88012-2. 31
M. D’Esposito. 2008. Working memory, the dysexecutive syndrome. In G. Goldenberg and B. Miller, editors. Handbook of Clinical Neurology, vol. 88, ch. 11, pp. 237–248. Elsevier B.V. 31
N. F. Dixon and L. Spitz. 1980. The detection of auditory visual desynchrony. Perception, 9:719–721. 24
R. M. Duncan and J. A. Cheyne. 2002. Private speech in young adults: Task difficulty, self-regulation, and psychological predication. Cognitive Development, 16:889–906. DOI: 10.1016/S0885-2014(01)00069-7. 33
M. Ernst and M. Banks. 2002. Humans integrate visual and haptic information in a statistically optimal fashion. Nature, 415:429–433. DOI: 10.1038/415429a. 25, 36, 37
M. Ernst and H. Bulthoff. 2004. Merging the senses into a robust percept. Trends in Cognitive Sciences, 8(4):162–169. DOI: 10.1016/j.tics.2004.02.002. 36
N. Fay, S. Garrod, L. Roberts, and N. Swoboda. 2010. The interactive evolution of human communication systems. Cognitive Science, 34:351–386. DOI: 10.1111/j.1551-6709.2009.01090.x. 37
P. Ferchmin and E. Bennett. 1975. Direct contact with enriched environment is required to alter cerebral weight in rats. Journal of Comparative and Physiological Psychology, 88:360–367. 34
W. Gaver. 1991. Technology affordances. In Proceedings of the CHI Conference, pp. 79–84. ACM Press, New York. DOI: 10.1145/108844.108856. 38, 39
J. Gibson. 1977. The theory of affordances. In R. Shaw and J. Bransford, editors. Perceiving, Acting and Knowing. vol. 3, pp. 67–82. Erlbaum, Hillsdale, NJ. 38
J. Gibson. 1979. The Ecological Approach to Visual Perception. Houghton Mifflin, Boston. 38
H. Giles, A. Mulac, J. Bradac, and P. Johnson. 1987. Speech accommodation theory: The first decade and beyond. In M. L. McLaughlin, editor. Communication Yearbook 10, pp. 13–48. Sage Publications, London. DOI: 10.1080/23808985.1987.11678638. 37
S. Goldin-Meadow. 2003. The Resilience of Language: What Gesture Creation in Deaf Children Can Tell Us About How Children Learn Language. Psychology Press, New York. 38
S. Goldin-Meadow and S. Beilock. 2010. Action’s influence on thought: The case of gesture. Perspectives on Psychological Science, 5(6):664–674. DOI: 10.1177/1745691610388764. 35
S. Goldin-Meadow, H. Nusbaum, S. J. Kelly, and S. Wagner. 2001. Explaining math: Gesturing lightens the load. Psychological Science, 12(6):516–522. 33
J. Greeno. 1994. Gibson’s affordances. Psychological Review, 101(2):336–342. 39
J. Hayes and V. Berninger. 2010. Relationships between idea generation and transcription: How the act of writing shapes what children write. In C. Bazerman, R. Krut, K. Lunsford, S. McLeod, S. Null, P. Rogers, and A. Stansell, editors. Traditions of Writing Research, pp. 116–180. Routledge, New York. 35
M. Howison, D. Trninic, D. Reinholz, and D. Abrahamson. 2011. The mathematical imagery trainer: From embodied interaction to conceptual learning. In Proceedings of the CHI Conference, pp. 1989–1998. ACM Press, New York. DOI: 10.1145/1978942.1979230. 35
X. Huang and S. Oviatt. 2005. Toward adaptive information fusion in multimodal systems. In Second Joint Workshop on Multimodal Interaction and Related Machine Learning Algorithms [MIML’05]. Springer-Verlag, Edinburgh, UK. 28
X. Huang, S. L. Oviatt, and R. Lunsford. 2006. Combining user modeling and machine learning to predict users’ multimodal integration patterns. In S. Renals and S. Bengio, editors. Third Joint Workshop on Multimodal Interaction and Related Machine Learning Algorithms [MIML’06], Springer Lecture Notes in Computer Science. Springer-Verlag GmbH. DOI: 10.1007/11965152_5. 28
F. Hummel and C. Gerloff. 2005. Larger interregional synchrony is associated with greater behavioral success in a complex sensory integration task in humans. Cerebral Cortex, 15:670–678. DOI: 10.1093/cercor/bhh170. 26
H. Ishibashi, S. Obayashi, and A. Iriki. 2004. Cortical mechanisms of tool use subserved by multisensory integration. In G. Calvert, C. Spence, and B. E. Stein, editors. The Handbook of Multisensory Processing. pp. 453–462. MIT Press, Cambridge, MA. 34
K. James. 2010. Sensori-motor experience leads to changes in visual processing in the developing brain. Developmental Science 13:279–288. DOI: 10.1111/j.1467-7687.2009.00883.x. 35
K. James and L. Engelhardt. 2012. The effects of handwriting experience on functional brain development in pre-literate children. Trends in Neuroscience Education, 1:32–42. DOI: 10.1016/j.tine.2012.08.001. 35
K. James and S. Swain. 2010. Only self-generated actions create sensori-motor systems in the developing brain. Developmental Science, 1–6. DOI: 10.1111/j.1467-7687.2010.01011.x.
K. James, S. Vinci-Booher, and F. Munoz-Rubke. 2017. The impact of multimodal-multisensory learning on human performance and brain activation patterns. In S. Oviatt, B. Schuller, P. Cohen, D. Sonntag, G. Potamianos, and A. Krüger, editors. The Handbook of Multimodal-Multisensor Interfaces, Volume 1: Foundations, User Modeling, and Common Modality Combinations. Morgan & Claypool Publishers, San Rafael, CA. 37
J. Kegl, A. Senghas, and M. Coppola. 1999. Creation through contact: Sign language emergence and sign language change in Nicaragua. In M. DeGraff, editor. Language Creation and Language Change: Creolization, Diachrony and Development, pp. 179–237. MIT Press, Cambridge MA. 38
A. Kersey and K. James. 2013. Brain activation patterns resulting from learning letter forms through active self-production and passive observation in young children. Frontiers in Psychology, 4(567):1–15. DOI: 10.3389/fpsyg.2013.00567. 35
A. J. King and A. R. Palmer. 1985. Integration of visual and auditory information in bimodal neurons in the guinea-pig superior colliculus. Experimental Brain Research, 60:492–500. 24
J. Kleim, K. Vij, J. Kelly, D. Ballard, and W. Greenough. 1997. Learning-dependent synaptic modifications in the cerebellar cortex of the adult rat persist for at least 4 weeks. Journal of Neuroscience, 17:717–721. 34
K. Koffka. 1935. Principles of Gestalt Psychology. Harcourt, Brace and Company, New York. 21, 27
W. Kohler. 1929. Dynamics in Psychology. Liveright, New York. 21, 27
E. Kohler, C. Keysers, M. Umilta, L. Fogassi, V. Gallese, and G. Rizzolatti. 2002. Hearing sounds, understanding actions: Action representation in mirror neurons. Science, 297:846–848. DOI: 10.1126/science.1070311. 38
M. Longcamp, C. Boucard, J-C. Gilhodes, J.-L. Anton, M. Roth, B. Nazarian, and J-L. Velay. 2008. Learning through hand- or typewriting influences visual recognition of new graphic shapes: Behavioral and functional imaging evidence. Journal of Cognitive Neuroscience, 20(5):802–815. DOI: 10.1162/jocn.2008.20504. 35
M. Longcamp, M.-T. Zerbato-Poudou, and J.-L. Velay. 2005. The influence of writing practice on letter recognition in preschool children: A comparison of handwriting and typing. Acta Psychologica, 119:67–79. DOI: 10.1016/j.actpsy.2004.10.019. 35
A. R. Luria. 1961. The Role of Speech in the Regulation of Normal and Abnormal Behavior. Liveright, Oxford. 33, 34
Y. Maehara and S. Saito. 2007. The relationship between processing and storage in working memory span: Not two sides of the same coin. Journal of Memory and Language, 56(2):212–228. DOI: 10.1016/j.jml.2006.07.009. 30
J. Markham and W. Greenough. 2004. Experience-driven brain plasticity: beyond the synapse. Neuron Glia Biology, 1(4):351–363. DOI: 10.1017/s1740925x05000219. 34
R. Mayer and R. Moreno. 1998. A split-attention effect in multimedia learning: Evidence for dual processing systems in working memory. Journal of Educational Psychology, 90:312–320. DOI: 10.1037/0022-0663.90.2.312.
M. A. Meredith, J. W. Nemitz, and B. E. Stein. 1987. Determinants of multisensory integration in superior colliculus neurons. I. Temporal factors. Journal of Neuroscience, 7:3215–3229. 24
G. Miller. 1956. The magical number seven plus or minus two: some limits on our capacity for processing information. Psychological Review, 63(2):81–97. 30
G. A. Miller, E. Galanter, and K. H. Pribram. 1960. Plans and the Structure of Behavior. Holt, Rinehart and Winston, New York. 30
S. Morein-Zamir, S. Soto-Faraco, and A. Kingstone. 2003. Auditory capture of vision: Examining temporal ventriloquism. Cognitive Brain Research, 17(1):154–163. 24
S. Mousavi, R. Low, and J. Sweller. 1995. Reducing cognitive load by mixing auditory and visual presentation modes. Journal of Educational Psychology 87(2):319–334. DOI: 10.1037/0022-0663.87.2.319. 32
K. Nakamura, W.-J. Kuo, F. Pegado, L. Cohen, O. Tzeng, and S. Dehaene. 2012. Universal brain systems for recognizing word shapes and handwriting gestures during reading. In Proceedings of the National Academy of Science, 109(50):20762–20767. DOI: 10.1073/pnas.1217749109. 24, 35, 626
D. Norman. 1988. The Design of Everyday Things. Basic Books, New York. 38, 39
S. L. Oviatt. 2000. Multimodal signal processing in naturalistic noisy environments. In B. Yuan, T. Huang, and X. Tang, editors, Proceedings of the International Conference on Spoken Language Processing [ICSLP’2000], pp. 696–699, vol. 2. Chinese Friendship Publishers, Beijing. 25
S. L. Oviatt. 2006. Human-centered design meets cognitive load theory: Designing interfaces that help people think. In Proceedings of the Conference on ACM Multimedia, pp. 871–880. ACM, New York. DOI: 10.1145/1180639.1180831. 32, 34
S. L. Oviatt. 2012. Multimodal interfaces. In J. Jacko, editor, The Human-Computer Interaction Handbook: Fundamentals, Evolving Technologies and Emerging Applications [revised 3rd edition], pp. 405–430, ch. 18. CRC Press, Boca Raton, FL. 25
S. Oviatt. 2013. The Design of Future Educational Interfaces. Routledge Press, New York. DOI: 10.4324/9780203366202. 35, 39, 41
S. Oviatt, A. Arthur, and J. Cohen. 2006. Quiet interfaces that help students think. In Proceedings of the Conference on User Interface Software Technology, pp. 191–200. ACM Press, New York. DOI: 10.1145/1166253.1166284.
S. Oviatt, A. Arthur, Y. Brock, and J. Cohen. 2007. Expressive pen-based interfaces for math education. Proceedings of the Conference on Computer-Supported Collaborative Learning, International Society of the Learning Sciences. DOI: 10.3115/1599600.1599708. 33, 34
S. Oviatt, A. Cohen, A. Miller, K. Hodge, and A. Mann. 2012. The impact of interface affordances on human ideation, problem solving and inferential reasoning. In Transactions on Computer-Human Interaction, ACM Press, New York. DOI: 10.1145/2362364.2362370. 22, 39, 40, 609
S. Oviatt and P. Cohen. 2015. The Paradigm Shift to Multimodality in Contemporary Computer Interfaces. Morgan Claypool Synthesis Series. Morgan & Claypool Publishers, San Rafael, CA. DOI: 10.2200/S00636ED1V01Y201503HCI030. 20, 27, 29
S. L. Oviatt, R. Coulston, S. Shriver, B. Xiao, R. Wesson, R. Lunsford, and L. Carmichael. 2003. Toward a theory of organized multimodal integration patterns during human-computer interaction. In Proceedings of the International Conference on Multimodal Interfaces [ICMI’03], pp. 44–51. ACM Press, New York. DOI: 10.1145/958432.958443. 21, 25, 27, 28
S. L. Oviatt, R. Coulston, and R. Lunsford. 2004a. When do we interact multimodally? Cognitive load and multimodal communication patterns. In Proceedings of the International Conference on Multimodal Interaction [ICMI’04]. ACM Press, New York. DOI: 10.1145/1027933.1027957. 32
S. L. Oviatt, C. Darves, and R. Coulston. 2004b. Toward adaptive conversational interfaces: Modeling speech convergence with animated personas. Transactions on Computer Human Interaction [TOCHI] 11(3):300–328. DOI: 10.1145/1017494.1017498. 37, 38, 39
S. L. Oviatt, R. Lunsford, and R. Coulston. 2005. Individual differences in multimodal integration patterns: What are they and why do they exist? In Proceedings of the Conference on Human Factors in Computing Systems [CHI’05], CHI Letters. pp. 241–249. ACM Press, New York. DOI: 10.1145/1054972.1055006. 26
S. L. Oviatt, M. MacEachern, and G. Levow. 1998. Predicting hyperarticulate speech during human-computer error resolution. Speech Commun., 24(2):1–23. DOI: 10.1016/S0167-6393(98)00005-3. 22, 618
A. Owen, K. McMillan, A. Laird and E. Bullmore. 2005. N-back working memory paradigm: A meta-analysis of normative functional neuroimaging studies. Human Brain Mapping, 25:46–59. DOI: 10.1002/hbm.20131. 31
F. Paas, J. Tuovinen, H. Tabbers, and P. Van Gerven. 2003. Cognitive load measurement as a means to advance cognitive load theory. Educational Psychologist, 38(1):63–71. DOI: 10.1207/S15326985EP3801_8. 32
L. Reeves, J. Lai, J. Larson, S. Oviatt, T. Balaji, S. Buisine, P. Collings, P. Cohen, B. Kraal, J.-C. Martin, M. McTear, T. V. Raman, K. Stanney, H. Su, and Q. Wang. 2004. Guidelines for multimodal user interface design. Communications of the ACM, 47(1):57–59. DOI: 10.1145/962081.962106. 28
I. A. Richter and T. Wells, (eds.) 2008. Leonardo da Vinci Notebooks, Oxford World’s Classics (2nd edition), Oxford University Press. 20
G. Rizzolatti and L. Craighero. 2004. The mirror-neuron system. Annual Review of Neuroscience, 27:169–192. DOI: 10.1146/annurev.neuro.27.070203.144230. 38
A. Sale, N. Berardi, and L. Maffei. 2009. Enrich the environment to empower the brain. Trends in Neuroscience, 32:233–239. DOI: 10.1016/j.tins.2008.12.004. 34
E. Saund, D. Fleet, D. Larner, and J. Mahoney. 2003. Perceptually-supported image editing of text and graphics. In Proceedings of the 16th Annual ACM Symposium on User Interface Software Technology [UIST’2003], pp. 183–192. ACM Press, New York. DOI: 10.1145/964696.964717. 28
C. Schroeder and J. Foxe. 2004. Multisensory convergence in early cortical processing. In G. Calvert, C. Spence, and B. Stein, editors, The Handbook of Multisensory Processing, pp. 295–309. MIT Press, Cambridge, MA. DOI: 10.1007/s10339-004-0020-4. 20
L. Shapiro, editor. 2014. The Routledge Handbook of Embodied Cognition. Routledge Press, New York. DOI: 10.4324/9781315775845. 35
B. Smith, editor. 1988. Foundations of Gestalt Theory. Philosophia Verlag, Munich and Vienna. 20
C. Spence and S. Squire. 2003. Multisensory integration: Maintaining the perception of synchrony. Current Biology, 13:R519–R521. DOI: 10.1016/S0960-9822(03)00445-7. 24
B. E. Stein, editor. 2012. The New Handbook of Multisensory Processing, 2nd ed. MIT Press, Cambridge, MA. DOI: 10.2174/2213385203999150305104442. 20
B. E. Stein and M. Meredith. 1993. The Merging of the Senses. MIT Press, Cambridge, MA. 20, 21
J. Sweller. 1988. Cognitive load during problem solving: Effects on learning. Cognitive Science, 12(2):257–285. 32
S. Tindall-Ford, P. Chandler, and J. Sweller. 1997. When two sensory modes are better than one. Journal of Experimental Psychology: Applied, 3(4):257–287. DOI: 10.1037/1076-898X.3.4.257. 32
J. van Merrienboer and J. Sweller. 2005. Cognitive load theory and complex learning: Recent developments and future directions. Educational Psychology Review, 17(2):147–177. DOI: 10.1007/s10648-005-3951-0. 32
F. Varela, E. Thompson, and E. Rosch. 1991. The Embodied Mind: Cognitive Science and Human Experience. The MIT Press, Cambridge, MA. 35
L. Vygotsky. 1962. Thought and Language. MIT Press, Cambridge, MA (Translated by E. Hanfmann and G. Vakar from 1934 original). 33, 34
L. Vygotsky. 1978. Mind in Society: The Development of Higher Psychological Processes. M. Cole, V. John-Steiner, S. Scribner, and E. Souberman, editors. Harvard University Press, Cambridge, MA. 33
L. Vygotsky. 1987. The Collected Works of L. S. Vygotsky, Volume I: Problems of General Psychology, Edited and translated by N. Minick. Plenum, New York. 33
N. Waugh and D. Norman. 1965. Primary memory. Psychological Review, 72:89–104. 30
J. Welkowitz, G. Cariffe, and S. Feldstein. 1976. Conversational congruence as a criterion of socialization in children. Child Development 47:269–272. 37
M. Wertheimer. 1938. Laws of organization of perceptual forms. In W. Ellis, editor, translation published in A Sourcebook of Gestalt Psychology. pp. 71–88, Routledge and Kegan Paul, London. 21
C. Wickens, D. Sandry, and M. Vidulich. 1983. Compatibility and resource competition between modalities of input, central processing, and output. Human Factors, 25(2):227–248. 31
C. Wickens. 2002. Multiple resources and performance prediction. Theoretical Issues in Ergonomics Science, 3(2):159–177. DOI: 10.1518/001872008X288394. 31
B. Xiao, C. Girand, and S. L. Oviatt. 2002. Multimodal integration patterns in children. In Proceedings of the International Conference on Spoken Language Processing, pp. 629–632. DOI: 10.1145/958432.958480. 26
B. Xiao, R. Lunsford, R. Coulston, M. Wesson, and S. L. Oviatt. 2003. Modeling multimodal integration patterns and performance in seniors: Toward adaptive processing of individual differences. Fifth International Conference on Multimodal Interfaces [ICMI], ACM, Vancouver. DOI: 10.1145/958432.958480. 33
G. Yang, F. Pan, and W. B. Gan. 2009. Stably maintained dendritic spines are associated with lifelong memories. Nature, 462:920–924. DOI: 10.1038/nature08577. 34
J. Zhang and V. Patel. 2006. Distributed cognition, representation, and affordance. In I. Dror and S. Harnad, editors. Cognition Distributed: How Cognitive Technology Extends Our Mind, pp. 137–144. John Benjamins, Amsterdam. DOI: 10.1075/pc.14.2.12zha. 39
J. Zhou, K. Yu, F. Chen, Y. Wang, and S. Arshad. 2017. Multimodal behavioral and physiological signals as indicators of cognitive load. In S. Oviatt, B. Schuller, P. Cohen, D. Sonntag, G. Potamianos, and A. Krüger, editors. The Handbook of Multimodal-Multisensor Interfaces, Volume 2: Signal Processing, Architectures, and Detection of Emotion and Cognition. Morgan & Claypool Publishers, San Rafael, CA. 29
E. Zoltan-Ford. 1991. How to get people to say and type what computers can understand. International Journal of Man-Machine Studies, 34:527–547. DOI: 10.1016/0020-7373(91)90034-5. 37, 38
1. A lag of approximately 250 ms between speech and corresponding lip movements is required before asynchrony is perceived.