
Introduction: Scope, Trends, and Paradigm Shift in the Field of Computer Interfaces

During the past decade, multimodal-multisensor interfaces have become the dominant computer interface worldwide. They have proliferated especially rapidly in support of increasingly small mobile devices (for history, see Oviatt and Cohen [2015]). In that regard, they have contributed to the development of smartphones and other mobile devices, as well as their rapidly expanding ecosystem of applications. Business projections estimate that by 2020 smartphones with mobile broadband will increase in number from two to six billion, resulting in two to three times more smartphones in use than PCs, along with an explosion of related applications [Evans 2014]. At a deeper level, the co-evolution of mobile devices and the multimodal-multisensor interfaces that enable using them is transforming the entire technology industry [Evans 2014].

Why Multimodal-Multisensor Interfaces Have Become Dominant

One major reason why multimodal-multisensor interfaces have become dominant on mobile devices is their flexibility. They support users’ ability to select a suitable input mode, or to shift among modalities as needed during the changing physical contexts and demands of continuous mobile use. Beyond that, individual mobile devices like smartphones now require interface support for a large and growing array of applications. In this regard as well, the flexibility of multimodal-multisensor interfaces has successfully supported extremely multifunctional use. These advantages of multimodal interfaces have been well known for over 15 years:

In the area of mobile computing, multimodal interfaces will promote … the multi-functionality of small devices, in part due to the portability and expressive power of input modes. [Oviatt and Cohen 2000, p.52]

Multimodal-multisensor interfaces likewise are ideal for supporting individual differences and universal access among users. With the global expansion of smartphones in developing countries, this aspect of interface flexibility has contributed to the adoption of mobile devices by users representing different native languages, skill levels, ages, and sensory and cognitive impairments. All of the above flexible attributes have stimulated the paradigm shift toward multimodal-multisensor interfaces on computers today, which often is further enhanced by either multimodal output or multimedia output. See the Glossary for defined terms.

Flexible Multiple-Component Tools as a Catalyst for Performance

The transition to multimodal-multisensor interfaces has been a particularly seminal one in the design of digital tools. The single keyboard input tool has finally given way to a variety of input options, which now can be matched more aptly with different usage needs. Given human adeptness at developing and using a wide variety of physical tools, it is surprising that keyboard input (a throwback to the typewriter) has prevailed for so many decades as a single input option on computers.

Let’s consider for a moment how our present transition in digital input tools parallels the evolution of multi-component physical tools, which occurred approximately 200,000–400,000 years ago in Homo sapiens. The emergence of multi-component physical tools is considered a major landmark in human cognitive evolution, which co-occurred with a spurt in brain-to-body ratio and shaped our modern cognitive abilities [Epstein 2002, Wynn 2002]. During this earlier renaissance in the design of physical tools made of stone, bone, wood, and skins, the emergence of flexible multi-component tools enabled us to adapt tool design (1) for a variety of different specific purposes, (2) to substantially improve their performance, and (3) to improve their ease of use [Masters and Maxwell 2002]. This proliferation in the design of multi-component physical tools led Homo sapiens to experience their differential impact, and to begin to recognize how specific design features contribute to achieving desired effects. For example, a lightweight wooden handle attached to a pointed stone hand-axe could be thrown a long distance for spearing large game. In this regard, the proliferation of tools stimulated a new awareness of the advantages of specific design features, and of the principles required to achieve a particular impact [Commons and Miller 2002].

Glossary

Multimedia output refers to system output involving two or more types of information received as feedback by a user during human-computer interaction, which may involve different types of technical media within one modality like vision—still images, virtual reality, video images—or it may involve multimodal output such as visual, auditory, and tactile feedback to the user.

Multimodal input involves user input and processing of two or more modalities—such as speech, pen, touch and multi-touch, gestures, gaze, head and body movements, and virtual keyboard. These input modalities may coexist on an interface, but be used either simultaneously or alternately [Oviatt and Cohen 2015]. The input may involve recognition-based technologies (e.g., speech, gesture), simpler discrete input (e.g., keyboard, touch), or sensor-based information (e.g., acceleration, pressure). Some modalities may be capable of expressing semantically rich information and creating new content (e.g., speech, writing, keyboard), while others are limited to making discrete selections and controlling the system display (e.g., touching a URL to open it, pinching gesture to shrink a visual display). These interfaces aim to support processing of naturally occurring human communication and activity patterns. They are far more flexible and expressively powerful than past keyboard-and-mouse interfaces, which are limited to discrete input.

Multimodal interfaces support multimodal input, and they may also include sensor-based controls. In many cases they may also support either multimodal or multimedia output.

Multimodal-multisensor interfaces combine one or more user input modalities with sensor information (e.g., location, acceleration, proximity, tilt). Sensor-based cues may be used to interpret a user’s physical state, health status, mental status, current context, engagement in activities, and many other types of information. Users may engage in intentional actions when deploying sensor controls, such as tilting a screen to change its orientation. Sensors also can serve as “background” controls, to which the interface automatically adapts without any intentional user engagement (e.g., dimming the phone screen after lack of use). Sensor input aims to transparently facilitate user-system interaction and adaptation to users’ needs. The type and number of sensors incorporated into multimodal interfaces have been expanding rapidly, resulting in explosive growth of multimodal-multisensor interfaces. There are numerous types of multimodal-multisensor interfaces with different characteristics, as will be discussed in this handbook.

Multimodal output involves system output from two or more modalities—such as a visual display combined with auditory or haptic feedback—that is provided to the user. This output is processed by separate human sensory systems and brain areas.

As we now embark upon designing new multimodal-multisensor digital tools with multiple input and output components, we can likewise expect a major transition in the functionality and performance of computer interfaces. With experience, users will learn the advantages of different types of input available on multimodal-multisensor interfaces, which will enable them to develop better control over their own performance. For professionals designing new multimodal-multisensor interfaces, it is sobering to realize the profound impact that our work potentially could have on further specialization of the human brain and cognitive abilities [Oviatt 2013].

More Expressively Powerful Tools Are Capable of Stimulating Cognition

This paradigm shift reflects the evolution of more expressively powerful computer input, which can substantially improve support for human cognition and performance. Recent findings have revealed that more expressively powerful interfaces (e.g., digital pen, multimodal) can stimulate cognition beyond the level supported by either keyboard interfaces or analogous non-digital tools. For example, the same students working on the same science problems generate substantially more domain-appropriate ideas, solve more problems correctly, and engage in more accurate inferential reasoning when using a digital pen, compared with keyboard input. In different studies, the magnitude of improvement has ranged from approximately 10–40% [Oviatt 2013]. These results have generalized widely across different user populations (e.g., ages, ability levels), content domains (e.g., science, math, everyday reasoning), types of thinking and reasoning (problem solving, inference, idea generation), computer hardware, and evaluation metrics. From a communications perspective, results have demonstrated that more expressively powerful input modalities, and multimodal combinations of them, can directly facilitate our ability to think clearly and perform well on tasks.

Multimodal interfaces also support improved cognition and performance because they enable users to self-manage and minimize their own cognitive load. Working memory load is reduced when people express themselves using multimodal input, for example by combining speech and writing, because their average utterance length is reduced when spatial information is conveyed by pointing and gesturing. When speaking and writing together, people avoid speaking location descriptions, because spoken spatial descriptions are error-prone and increase mental load. Instead, they use written input (e.g., pointing, encircling) to indicate such content [Oviatt and Cohen 2015]. In Chapter 13, experts discuss neuroscience and human-computer interaction findings on how and why new multimodal-multisensor interfaces can more effectively stimulate human cognition and learning than previous computer interfaces.

One Example of How Multimodal-Multisensor Interfaces Are Changing Today

One of the most rapidly changing areas in multimodal-multisensor interface design today is the incorporation of a wide variety of new sensors. This is part of the long-term trend toward expanding the number and type of information sources available in multimodal-multisensor interfaces, which has been especially noteworthy on current smartphones. These changes have been coupled with experimentation on how to use different sensors for potentially valuable functionality, and also how to design a whole multimodal-multisensor interface in a synergistic and effective manner. Designers are beginning to grasp the many versatile ways that sensors and input modalities can be coupled within an interface—including that either can be used intentionally in the “foreground,” or they can serve in the “background” for transparent adaptation that minimizes interface complexity and users’ cognitive load (see Chapter 4). The separate research communities that historically have focused on multimodal versus ubiquitous sensor-based interfaces have begun to engage in collaborative cross talk. One outcome will be improved training of future students, who will be able to design better integrated multimodal-multisensor interfaces.

As will be discussed in this volume, one goal of multimodal-multisensor interfaces is to facilitate user-system interaction that is more human-centered, adaptive, and transparent. Sensor-based information sources are beginning to interpret a user’s physical state (e.g., walking), health status (e.g., heart rate), emotional status (e.g., frustrated, happy), cognitive status (e.g., cognitive load, expertise), current context (e.g., driving a car), engagement in activities (e.g., picking up a cell phone), and many other types of information. As these capabilities improve in reliability, systems will begin to adapt by supporting users’ goal-oriented behavior. One critical role for sensor input on mobile devices is to transparently preserve users’ focus of attention on important primary tasks, such as driving, by minimizing distraction and cognitive load. However, mechanical sensors are not unique avenues for accomplishing these advances. Paralinguistic information from input modalities like speech and writing (e.g., volume, rate, pitch, pausing) is becoming increasingly reliable at predicting many aspects of users’ mental status, as will be detailed in other handbook chapters [Burzo et al. 2017, Cohn et al. 2017, Oviatt et al. 2017, Zhou et al. 2017].
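As a concrete illustration of the kind of paralinguistic cues mentioned above, the following Python sketch computes a few simple volume, pausing, and speaking-rate features from a mono waveform. It is a minimal example rather than a method from the handbook: the frame sizes, silence threshold, and feature names are illustrative assumptions, and real systems would add pitch, voice-quality, and many other prosodic measures feeding learned models.

```python
import numpy as np

def paralinguistic_features(samples, sr=16000, n_words=None,
                            frame_ms=25, hop_ms=10, silence_db=-35.0):
    """Crude volume, pausing, and rate features from a mono waveform.

    Illustrative only: real systems use far richer prosodic features
    (e.g., pitch contours and voice quality) and learned models.
    """
    samples = np.asarray(samples, dtype=np.float64)
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    n_frames = max(1, 1 + (len(samples) - frame) // hop)

    # Frame-level RMS energy as a rough proxy for loudness (volume).
    rms = np.array([
        np.sqrt(np.mean(samples[i * hop:i * hop + frame] ** 2) + 1e-12)
        for i in range(n_frames)
    ])
    db = 20.0 * np.log10(rms / (np.max(rms) + 1e-12) + 1e-12)
    silent = db < silence_db          # quiet frames are treated as pauses
    duration_s = len(samples) / sr

    features = {
        "mean_volume_db": float(np.mean(db[~silent])) if np.any(~silent) else float("nan"),
        "pause_ratio": float(np.mean(silent)),      # fraction of time spent pausing
        "duration_s": duration_s,
    }
    if n_words is not None and duration_s > 0:
        features["speech_rate_wps"] = n_words / duration_s   # words per second
    return features

# Usage sketch: paralinguistic_features(waveform, sr=16000, n_words=42)
```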

Insights in the Chapters Ahead

This handbook presents chapters that summarize basic research and development of multimodal-multisensor systems, including their status today and rapidly growing future directions. This initial volume introduces relevant theory and neuroscience foundations, approaches to design and user modeling, and an in-depth look at some common modality combinations. The second volume [Oviatt et al. 2017a] summarizes multimodal-multisensor system signal processing, architectures, and the emerging use of these systems for detecting emotional and cognitive states. The third volume [Oviatt et al. 2017b] presents multimodal language and dialogue processing, software tools and platforms, commercialization of applications, and emerging technology trends and societal implications. Collectively, these handbook chapters address a comprehensive range of central issues in this rapidly changing field. In addition, each volume includes selected challenge topics, in which an international panel of experts exchanges their views on some especially consequential, timely, and controversial problem in the field that is in need of insightful resolution. We hope these challenge topics will stimulate talented students to tackle these important societal problems, and motivate the rest of us to envision and plan for our technology future.

Information presented in the handbook is intended to provide a comprehensive state-of-the-art resource for professionals, business strategists and technology funders, and interested lay readers, as well as for training advanced undergraduate and graduate students in this multidisciplinary computational field. To enhance its pedagogical value to readers, many chapters include valuable digital resources such as pointers to open-source tools, databases, video demonstrations, and case study walkthroughs to assist in designing, building, and evaluating multimodal-multisensor systems. Each handbook chapter defines the basic technical terms required to understand its topic. Educational resources, such as focus questions, are included to support readers in mastering these newly presented materials.

Theoretical and Neuroscience Foundations

The initial chapters in this volume address foundational issues in multimodal-multisensor interface design, including theoretical and neuroscience foundations, user modeling, and the design of interfaces involving rich input modalities and sensors. In Chapter 1, Oviatt discusses the theoretical foundation of multisensory perception and multimodal communication, which provides a basis for understanding the performance advantages of multimodal interfaces and how to design them to reap these advantages. This chapter describes the major theories that have influenced contemporary views of multimodal interaction and interface design, including Gestalt theory, Working Memory theory, and Activity theory—which subsume perception-action dynamic theories, and also limited-resource theories with a focus on attention and short-term memory constraints. Other theoretical perspectives covered in this chapter that have influenced multimodal interface design include Multiple Resource theory, Cognitive Load theory, Embodied Cognition, Communication Accommodation theory, and Affordance theory. These theories are emphasized in part because they are supported heavily by neuroscience findings.

In Chapter 2, James et al. discuss the human brain as an inherently multimodal-multisensory dynamic learning system. Although each sensory modality processes different signals from the environment in qualitatively different ways (e.g., sound waves, light waves, pressure, etc.), these signals ultimately are transduced into a common language and unified percept in the brain. From an Embodied Cognition viewpoint, humans also act on the world multimodally through hand movements, locomotion, speech, gestures, etc., and these physical actions directly shape the multisensory input we perceive. Given recent findings in neuroscience, this chapter discusses the multisensory-multimodal brain structures (e.g., multisensory neurons, multisensory-multimodal brain circuits) and processes (e.g., convergence, integration, multisensory enhancement, and depression) that produce human learning, and how multimodal learning affects brain plasticity in adults and children. Findings on this topic have direct implications for understanding how multimodal-multisensor technologies influence us at both the brain and behavioral levels. The final section of this chapter discusses implications for multimodal-multisensor interface design, which is considered further in Chapter 13 in the exchange among experts on the challenge topic “Perspectives on Learning with Multimodal Technology.”

Approaches to Design and User Modeling

In Chapter 3, MacLean et al. discuss multisensory haptic interfaces broadly as anything a user touches or is touched by to control, experience, or receive information from a computational device—including a wide range of both energetically passive (e.g., touchscreen input) and energetically active (e.g., vibrotactile feedback) interface techniques. This chapter delves into both conceptual and pragmatic issues required for designing optimally enriched haptic experiences—especially ones involving energetically active haptic interfaces, for which technological advances from materials to robotics have opened up many new frontiers. In approaching this topic, MacLean and colleagues describe humans’ distributed and multi-parameter range of haptic sensory capabilities (e.g., temperature, texture, forces), and the individual differences associated with designing for them. They walk through scenarios illustrating multimodal interaction goals, and explain the many roles the haptic component can play. For example, this may involve notifying and then guiding a mobile user about an upcoming turn using haptic information that complements visual feedback. This chapter includes hands-on information about design techniques and software tools that will be valuable for students and professionals in the field.

From the perspective of multimodal output, in Chapter 7 Freeman et al. discuss a wide range of alternatives to visual displays, including: (1) haptic output such as vibrotactile, thermal, force, and deformable feedback; (2) non-speech auditory icons, earcons, musicons, sonification, and spatial audio output; and (3) combined multimodal feedback—which they argue is indispensable in mobile contexts, including in-vehicle ones, and for users with sensory limitations. Their chapter describes the relevant mechanical devices, and research findings on how these different forms of feedback and their combinations affect users’ performance during tasks. In many cases, non-visual feedback provides background interface information that minimizes users’ cognitive load—for example, during a mobile navigation task. In other cases, combined multimodal feedback supports more rapid learning, more accurate navigation, more efficient information extraction, and more effective warning systems—for example, during hand-over of control to the driver in an autonomous car system. This chapter is richly illustrated with digital demonstrations so readers can concretely experience the haptic and non-speech audio examples that are discussed.

In the related Chapter 4, Hinckley considers different design perspectives on how we can combine modalities and sensors synergistically within an interface to achieve innovative results. The dominant theme is using individual modalities and sensors flexibly in foreground vs. background interaction roles. Hinckley provides detailed illustrations of the many ways touch can be combined with sensor input about proximity, orientation, acceleration, pressure, grip, etc., during common action patterns—for example, lifting a cell phone to your ear. Such background information can provide context for predicting and facilitating next steps during an interaction, like automatically activating your phone before making a call. This context can guide system processing without necessarily requiring the user’s attention, which is especially valuable in mobile situations. It also can be used to reduce technological complexity. However, a major challenge is to achieve reliable activity recognition without false positives, or false alarms, that cause unintended system activation (i.e., the Midas Touch problem)—such as inadvertently activating URLs when you scroll through news stories on your cell phone. In the final sections of this chapter, Hinckley discusses bimanual manipulation that coordinates touch with pen input, and also leverages a sensor-augmented stylus/tablet combination with inertial, grip, and other sensors for refined palm rejection. This type of pen and touch multimodal-multisensor interface is now commercially available on numerous pen-centric systems.
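As a rough sketch of this background-sensing idea, the following Python example detects a hypothetical "lift to ear" pattern from proximity and motion readings, using a short confirmation window to guard against the Midas Touch false activations mentioned above. The sensor fields, thresholds, and class names are invented for illustration; this is not Hinckley's implementation or any platform's actual sensor API.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class SensorFrame:
    proximity_near: bool   # proximity sensor reports an object close to the screen
    tilt_deg: float        # angle of the device relative to vertical (hypothetical)
    accel_g: float         # magnitude of acceleration in g

class LiftToEarDetector:
    """Background recognizer for a 'phone raised to ear' action pattern.

    Rule-based sketch only: a real system would use learned models and the
    platform's sensor framework rather than these made-up thresholds.
    """
    def __init__(self, window=5, min_hits=4):
        self.history = deque(maxlen=window)   # confirmation window against false alarms
        self.min_hits = min_hits

    def update(self, frame: SensorFrame) -> bool:
        candidate = (
            frame.proximity_near                  # something (presumably a head) is close
            and 30.0 <= frame.tilt_deg <= 90.0    # device held roughly upright/sideways
            and frame.accel_g < 1.5               # no longer being swung upward
        )
        self.history.append(candidate)
        # Require most recent frames to agree before activating (debouncing).
        return sum(self.history) >= self.min_hits

# Usage sketch: feed periodic sensor frames; switch to the earpiece on True.
detector = LiftToEarDetector()
for f in [SensorFrame(True, 60.0, 1.0)] * 5:
    if detector.update(f):
        print("activate earpiece / answer call")
        break
```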

In Chapter 5, Jameson and Kristensson discuss what it means to give users freedom to use different input modalities in a flexible way. They argue that modality choice will vary as a function of user characteristics (e.g., abilities, preferences), the task (e.g., main multimodal task, any competing tasks), multimodal system characteristics (e.g., recognition accuracy, ease of learning and use), the context of use (e.g., social and physical environment), and the consequences of selecting certain modalities for interaction (e.g., expressive adequacy, errors, speed, potential interference with other modalities, potential for repetitive strain injury (RSI), social acceptability and privacy). Their chapter considers what is known from the existing literature and psychological models about modality choice, including the degree to which users’ multimodal interaction actually represents a conscious choice versus automatic behavior. The aim of the chapter is to understand the limitations of existing research on modality choice, and to develop system design strategies that can guide users in selecting input modalities more optimally. In support of the second objective, Jameson and Kristensson introduce two related models, ASPECT and ARCADE, which provide conceptual tools that summarize (1) common reasons for users’ choice patterns, and (2) different concrete strategies for promoting better modality choices (e.g., based on trial-and-error, consequences, social feedback). In summary, this chapter raises unsolved issues at the very heart of how to design better integrated multimodal-multisensor interfaces.

In Chapter 6, Kopp and Bergmann adopt a simulation-based cognitive modeling approach, with a focus on how users’ working memory load influences their speech, gesturing, and multimodal patterns during linguistic constructions involving spatial descriptions. Their approach involves computational modeling of communication, informed by substantial cognitive science and linguistics research, which also leverages data-driven processing. They provide a detailed walkthrough of their multimodal speech and gesture production model, which is based on activation spreading within dynamically shaped multimodal memories. They argue that semantic coordination across modalities arises from the interplay of modality-specific representations for speech and gestures under given cognitive resources. Results from preliminary simulation experiments predict the likelihood that multimodal constructions will involve gestures that are redundant vs. complementary with co-occurring speech, which varies when cognitive resources are less vs. more constrained, respectively. Kopp and Bergmann’s chapter provides a thoughtful discussion of the role and value of cognitive modeling in developing multimodal systems, as well as the specific use of multimodal speech and gesture production models for developing applications like virtual characters and social robotics.

Multimodal interfaces are well known to be the preferred direction for supporting individual differences and universal access to computing. In Chapter 8, Munteanu and Salah challenge us to understand the needs of one of the most rapidly growing and underserved populations in the world—seniors over 65 years. As a starting point, this chapter summarizes Maslow’s hierarchy of human needs in order to understand and design valuable technology for seniors. This includes designing for their basic physical needs (e.g., self-feeding, medications), safety and social-emotional needs (e.g., preventing falls, physical isolation, and loneliness), and esteem and self-actualization needs (e.g., independence, growth, and mastery experiences). Among the challenges of designing for this group are the substantial individual differences they exhibit (i.e., from healthy and mobile to physically and cognitively disabled), and their frequently changing status as they age. Munteanu and Salah describe examples of especially active application areas, such as socially assistive robotics, communication technologies for connecting and sharing activities within families, technologies for accessing digital information, personal assistant technologies, and ambient assistive living smart-home technologies. They highlight design methods and strategies that are especially valuable for this population, such as participatory design, adaptive multimodal interfaces that can accommodate seniors’ individual differences, balanced multimodal-multisensor interfaces that preserve seniors’ sense of control and dignity (i.e., vs. simply monitoring them), and easy-to-use interfaces based on rudimentary speech, touch/haptics, and activity tracking input.

Common Modality Combinations

Several chapters discussed above already have illustrated common modality combinations in multimodal-multisensor interfaces—for example, commercially available touch and pen input (Chapter 4), and multimodal output incorporating haptic and non-speech audio (Chapter 7) and speech and manual gesturing (Chapter 6). The chapters that follow examine common modality combinations in greater technical detail, with an emphasis on four different types of speech-centric multimodal input interfaces—incorporating user gaze, pen input, gestures, and visible speech movements. These additional modalities exhibit significant differences among them, most importantly in the sensors used, approach to information extraction and representation, fusion and integration of the second input modality with the speech signal, and specific application scenarios. These chapters address the main challenges posed by each of these modality combinations, and the most prevalent and successful techniques for building related systems.

In Chapter 9, Qvarfordt outlines the properties of human gaze, its importance in human communication, methods for capturing and processing it automatically, and its incorporation in multimodal interfaces. In particular, she reviews basic human eye movements, and discusses how eye-tracking devices capture gaze information. This discussion emphasizes gaze signal processing and visualization, but also practical limitations of the technology. She then provides an overview of the role that gaze plays when combined with other modalities such as pointing, touch, and spoken conversation during interaction and communication. As a concrete example, this discussion details a study on the utility of gaze in multi-party conversation over shared visual information. In the final section of this chapter, Qvarfordt discusses practical systems that include gaze. She presents a design-space taxonomy of gaze-informed multimodal systems, with two axes representing gaze as active vs. passive input, and in stationary vs. mobile usage scenarios. A rich overview then is presented of gaze-based multimodal systems for selection, detecting user activity and interest, supporting conversational interaction, and other applications.

In Chapter 10, Cohen and Oviatt motivate why writing provides a synergistic combination with spoken language. Based on the complementarity principle of multimodal interface design, these input modes have opposite communication strengths and weaknesses: whereas spoken language excels at describing objects, time, and events in the past or future, writing is uniquely able to render precise spatial information including diagrams, symbols, and information in a specific spatial context. Since error patterns of the component recognizers also differ, multimodal systems that combine speech and writing can support mutual disambiguation that yields improved robustness and stability of performance. In this chapter, the authors describe the main multimodal system components, language processing techniques, and architectural approaches for successfully processing users’ speech and writing. In addition, examples are provided of both research and commercially deployed multimodal systems, with rich illustrations of the scenarios they are capable of handling. Finally, the performance characteristics of multimodal pen/voice systems are compared with their unimodal counterparts.
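The mutual-disambiguation idea can be sketched as joint rescoring of the two recognizers' n-best lists, where pairs of speech and pen hypotheses that are semantically compatible receive a score bonus, so a lower-ranked unimodal hypothesis can still win once the other modality's evidence is considered. The n-best lists, scores, and compatibility test below are toy values for illustration, not the algorithm of the systems described in the chapter.

```python
from itertools import product

# Hypothetical n-best lists: (interpretation, normalized confidence score).
speech_nbest = [("move the tent here", 0.48), ("move the tanker", 0.40)]
pen_nbest    = [("point: campsite A", 0.55), ("circle: lake", 0.35)]

def compatible(speech_hyp: str, pen_hyp: str) -> bool:
    """Toy semantic check: deictic speech ('here') needs a pen point or circle."""
    return "here" in speech_hyp and pen_hyp.startswith(("point", "circle"))

def fuse(speech_nbest, pen_nbest, bonus=0.3):
    """Rank joint hypotheses; compatible pairs get a multiplicative bonus.

    With this scoring, a lower-ranked unimodal hypothesis can overtake the
    unimodal winner once the other modality's evidence is taken into account
    (a simple stand-in for mutual disambiguation).
    """
    ranked = []
    for (s, ps), (p, pp) in product(speech_nbest, pen_nbest):
        score = ps * pp * (1.0 + (bonus if compatible(s, p) else 0.0))
        ranked.append((score, s, p))
    return sorted(ranked, reverse=True)

best_score, best_speech, best_pen = fuse(speech_nbest, pen_nbest)[0]
print(best_speech, "+", best_pen, f"(score={best_score:.3f})")
```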

The remaining two chapters present more of a signal-processing perspective on multimodal system development, a topic that will be elaborated in greater detail in [Oviatt et al. 2017a]. In Chapter 11, Katsamanis et al. discuss the ubiquity of multimodal speech and gesturing, which co-occur in approximately 90% of communications across cultures. They describe different types of gestures, segmental phases in their formation, and the function of gestures during spoken communication. In the second part of the chapter, the authors shift to presenting an overview of state-of-the-art multimodal gesture and speech recognition, in particular temporal modeling and detailed architectures for fusing these loosely-synchronized modalities (e.g., Hidden Markov Models, Deep Neural Nets). In order to facilitate readers’ practical understanding of how multimodal speech and gesture systems function and perform, the authors present a walk-through example of their recently developed system, including its methods for capturing data on the bimodal input streams (i.e., using RGB-D sensors like Kinect), feature extraction (i.e., based on skeletal, hand shape, and audio features), and two-pass multimodal fusion. They provide a detailed illustration of the system’s multimodal recognition of a word-gesture sequence, which shows how errors during audio-only and single-pass processing can be overcome during two-pass multimodal fusion. The comparative performance accuracy of this multimodal system also is summarized, based on the well-known ChaLearn dataset and community challenge.
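In the spirit of the fusion step described above, the sketch below shows a simple decision-level (late) fusion in which per-modality log-likelihoods are combined with fixed weights to re-rank hypotheses, allowing gesture evidence to overturn an audio-only error. The hypotheses, probabilities, and weights are placeholders, not values or architecture details from Katsamanis et al.'s system.

```python
import math

# Hypothetical hypotheses with per-modality log-likelihoods.
# An audio-only pass would pick "stop"; the gesture stream disagrees.
hypotheses = {
    "stop":       {"audio": math.log(0.50), "gesture": math.log(0.10)},
    "swipe left": {"audio": math.log(0.30), "gesture": math.log(0.70)},
    "zoom in":    {"audio": math.log(0.20), "gesture": math.log(0.20)},
}

def late_fusion(hypotheses, weights={"audio": 0.6, "gesture": 0.4}):
    """Weighted log-linear combination of modality scores (decision-level fusion)."""
    fused = {
        label: sum(weights[m] * ll for m, ll in scores.items())
        for label, scores in hypotheses.items()
    }
    return max(fused, key=fused.get), fused

best, fused_scores = late_fusion(hypotheses)
print("audio-only best:", max(hypotheses, key=lambda h: hypotheses[h]["audio"]))
print("fused best:     ", best)   # gesture evidence overturns the audio-only error
```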

In Chapter 12, Potamianos et al. focus on systems that incorporate visual speech information from the speaker’s mouth region into the traditional speech-processing pipeline. They motivate this approach based on the inherently audiovisual nature of human speech production and perception, and also by providing an overview of typical scenarios in which these modalities complement one another to enhance robust recognition of articulated speech (e.g., during noisy conditions). In the main part of their chapter, the authors offer a detailed review of the basic sensory devices, corpora, and techniques used to develop bimodal speech recognition systems. They specifically discuss visual feature extraction (e.g., based on facial landmarks, regions of interest), and audio-visual fusion that leverages the tight coupling between visible and audible speech. Since many of the algorithmic approaches presented are not limited to automatic speech recognition, the authors provide an overview of additional speech processing topics (e.g., speech activity detection) that can benefit from co-processing of the visual modality. Examples of different systems are illustrated, which showcase superior multimodal performance over audio-only speech systems. The authors note that recent advances in deep learning, and the availability of multimodal corpora and open-source tools, now have the potential to advance “in the wild” field applications that previously were viewed as extremely challenging.
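For contrast with the decision-level fusion sketched earlier, here is a minimal example of feature-level (early) audio-visual fusion: per-frame audio features and mouth-region visual features are time-aligned and concatenated before being passed to a single recognizer. The frame rates, feature dimensions, and nearest-frame alignment are illustrative assumptions rather than details from the chapter; modern systems typically learn the fusion with deep networks.

```python
import numpy as np

def early_av_fusion(audio_feats: np.ndarray, visual_feats: np.ndarray) -> np.ndarray:
    """Concatenate time-aligned audio and visual feature streams.

    audio_feats : (T_a, D_a), e.g., 100 fps MFCC frames (assumed)
    visual_feats: (T_v, D_v), e.g., 30 fps mouth-ROI features (assumed)
    Returns a (T_a, D_a + D_v) matrix after nearest-frame upsampling of the
    slower visual stream -- a simple, common alignment strategy.
    """
    t_a = audio_feats.shape[0]
    # Map each audio frame index to the nearest earlier visual frame index.
    idx = np.minimum(
        np.arange(t_a) * visual_feats.shape[0] // t_a, visual_feats.shape[0] - 1
    )
    visual_upsampled = visual_feats[idx]
    return np.hstack([audio_feats, visual_upsampled])

# Usage sketch with random stand-in features for 3 s of speech.
audio = np.random.randn(300, 13)    # 100 fps, 13-dim MFCCs (assumed)
video = np.random.randn(90, 32)     # 30 fps, 32-dim ROI embedding (assumed)
print(early_av_fusion(audio, video).shape)   # (300, 45)
```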

Expert Exchange on Multidisciplinary Challenge Topic

Chapter 13 presents a multidisciplinary challenge topic, with a discussion among experts that focuses on how humans learn, and what the implications are for designing more effective educational technologies. Expert discussants include Karin James (cognitive neuroscience basis of learning), Dan Schwartz (learning sciences and educational technologies), Katie Cheng (learning sciences and educational technologies), James Lester (HCI, AI, and adaptive learning technology), and Sharon Oviatt (multimodal-multisensor interfaces, educational technologies). The discussants identify promising new techniques and future research that is needed to advance work on this topic. This exchange reveals what experts in the field believe are the main problems, what is needed to solve them, and what steps they envision for pursuing technology research and development in the near future.

Based on Chapter 2, which summarizes how complex multimodal action patterns stimulate multisensory comprehension of related content, James begins by highlighting that multimodal technologies could facilitate learning of more complex content. She then qualifies this by saying that, to be effective, these systems should not violate people’s previously learned action-perception contingencies. From an interface design viewpoint, Oviatt emphasizes that multimodal-multisensor input capabilities need to support rich content creation in order to help students master complex domain content. Both discussants summarize that a body of research (i.e., behavioral and brain science studies) has confirmed that optimal learning cannot be achieved with keyboard-based tools alone. Students need to expend effort producing complex action patterns, such as manipulating objects or writing symbols with a pen.

Lester and Oviatt address the issue of what role automation could play in designing adaptive multimodal-multisensor educational technologies to support maximum learning, rather than undermining it. They point out that multimodal technologies should not be designed to minimize students’ effort expenditure. Rather, Lester envisions adaptive multimodal interfaces that introduce and exercise new sub-skills, followed by an incremental decrease in the level of automation by fading multimodal scaffolding as students learn. Oviatt notes that adaptive multimodal-multisensor interfaces could be designed to better focus students’ attention by minimizing the many extraneous interface features that distract them (e.g., formatting tools). She adds that emerging computational methods for predicting students’ mental state (e.g., cognitive load, domain expertise) could be used in future multimodal-multisensor systems to tailor what an individual learns and how they learn it [Oviatt et al. 2017, Zhou et al. 2017]. Given these views, automation in future educational technologies would conduct temporary adaptations that support students’ current activities and mental state in order to facilitate learning sub-goals, preserve limited working memory resources, and similar objectives that facilitate integrating new information.

In a final section of Chapter 13, all of the participants discuss emerging computational, neuroimaging, and modeling techniques that are just now becoming available to examine multimodal interaction patterns during learning, including predicting students’ motivational and cognitive state. For future research, James and Oviatt encourage investigating how the process of learning unfolds across levels, for example by examining the correspondence between behavioral data (e.g., fine-grained manual actions during writing) and brain activation patterns. Schwartz and Cheng distinguish students’ perceptual-motor learning from their ability to provide explanations. They advocate using new techniques to conduct research on students’ ability to learn at the meta-cognitive level, and also to probe how students learn multimodally. Given recent advances, Lester suggests that we may well be on the verge of a golden era of technology-rich learning, in which multimodal-multisensor technologies play an invaluable role in both facilitating and assessing learning.

References

M. Burzo, M. Abouelenien, V. Perez-Rosas, and R. Mihalcea. 2017. Multimodal deception detection. In S. Oviatt, B. Schuller, P. Cohen, D. Sonntag, G. Potamianos, and A. Krüger, editors, The Handbook of Multimodal-Multisensor Interfaces, Volume 2: Signal Processing, Architectures, and Detection of Emotion and Cognition. Morgan & Claypool Publishers, San Rafael, CA.

J. Cohn, N. Cummins, J. Epps, R. Goecke, J. Joshi, and S. Scherer. 2017. Multimodal assessment of depression and related disorders based on behavioural signals. In S. Oviatt, B. Schuller, P. Cohen, D. Sonntag, G. Potamianos, and A. Krüger, editors, The Handbook of Multimodal-Multisensor Interfaces, Volume 2: Signal Processing, Architectures, and Detection of Emotion and Cognition. Morgan & Claypool Publishers, San Rafael, CA.

M. Commons and P. Miller. 2002. A complete theory of human evolution of intelligence must consider stage changes. Behavioral and Brain Sciences, 25: 404–405. DOI: 10.1017/S0140525X02240078.

H. Epstein. 2002. Evolution of the reasoning hominid brain. Behavioral and Brain Sciences, 25: 408–409. DOI: 10.1017/S0140525X02270077.

B. Evans. 2014. Mobile is eating the world, Tech Summit, October 28, 2014. http://a16z.com/2014/10/28/mobile-is-eating-the-world/ (retrieved January 7, 2015).

R. Masters and J. Maxwell. 2002. Was early man caught knapping during the cognitive (r)evolution? Behavioral and Brain Sciences, 25: 413. DOI: 10.1017/S0140525X02320077.

S. Oviatt. 2013. The Design of Future Educational Interfaces. Routledge Press. DOI: 10.4324/9780203366202.

S. Oviatt and P. R. Cohen. 2000. Multimodal systems that process what comes naturally. Communications of the ACM, 43(3): 45–53. DOI: 10.1145/330534.330538.

S. Oviatt and P. R. Cohen. 2015. The Paradigm Shift to Multimodality in Contemporary Computer Interfaces. Human-Centered Interfaces Synthesis series (ed. Jack Carroll). Morgan & Claypool Publishers, San Rafael, CA. DOI: 10.2200/S00636ED1V01Y201503HCI030.

S. Oviatt, J. Grafsgaard, L. Chen, and X. Ochoa. 2017. Multimodal learning analytics: Assessing learners’ mental state during the process of learning. In S. Oviatt, B. Schuller, P. Cohen, D. Sonntag, G. Potamianos, and A. Krüger, editors, The Handbook of Multimodal-Multisensor Interfaces, Volume 2: Signal Processing, Architectures, and Detection of Emotion and Cognition. Morgan & Claypool Publishers, San Rafael, CA.

S. Oviatt, B. Schuller, P. Cohen, D. Sonntag, G. Potamianos, and A. Krüger, editors. 2017a. The Handbook of Multimodal-Multisensor Interfaces, Volume 2: Signal Processing, Architectures, and Detection of Emotion and Cognition. Morgan & Claypool Publishers, San Rafael, CA.

S. Oviatt, B. Schuller, P. Cohen, D. Sonntag, G. Potamianos, and A. Krüger, editors. 2017b. The Handbook of Multimodal-Multisensor Interfaces, Volume 3: Multimodal Language Processing, Software, Tools, Commercial Applications, and Emerging Directions. Morgan & Claypool Publishers, San Rafael, CA.

T. Wynn. 2002. Archaeology and cognitive evolution. Behavioral and Brain Sciences, 25: 389–438. DOI: 10.1017/S0140525X02000079.

J. Zhou, K. Yu, F. Chen, Y. Wang, and S. Arshad. 2017. Multimodal behavioral and physiological signals as indicators of cognitive load. In S. Oviatt, B. Schuller, P. Cohen, D. Sonntag, G. Potamianos, and A. Krüger, editors, The Handbook of Multimodal-Multisensor Interfaces, Volume 2: Signal Processing, Architectures, and Detection of Emotion and Cognition. Morgan & Claypool Publishers, San Rafael, CA.
