Robot Learning from Human Teachers - Sonia Chernova


CHAPTER 3

Modes of Interaction with a Teacher

With insights from human social learning in mind, in this chapter we turn to a central design choice for every Learning from Demonstration (LfD) system: how to solicit demonstrations from the human teacher. As highlighted in Figure 3.1, this chapter forms the introduction to the technical portion of the book, laying the foundation for the discussion of both high-level and low-level learning methods. We do not entirely ignore the issues of usability and social interaction; after all, the choice of interaction method will impact not only the type of data available for policy learning, but also many of the topics discussed in the previous chapter (e.g., transparency, question asking, directing attention). However, these topics will remain in the background until Chapters 6 and 7, in which we discuss policy refinement and user study evaluation, respectively.

Figure 3.1: In this chapter, we discuss a wide range of techniques for collecting demonstration input for LfD algorithms.

In this chapter, we first introduce readers to the correspondence problem, which pertains to the differences in the capabilities and physical embodiment between the robot and user. We then characterize demonstration techniques under three general modes of interaction, which enable a robot to learn through doing, through observation, and from critique.

3.1 THE CORRESPONDENCE PROBLEM

An LfD dataset is typically composed of state-action pairs recorded during teacher executions of the desired behavior, sometimes supplemented with additional information. Exactly how demonstrations are recorded, and what the teacher uses as a platform for the execution, varies greatly across approaches. Examples range from sensors on the robot learner recording its own actions as it is passively teleoperated by the teacher, to a camera recording a human teacher as she executes the behavior with her own body. Some techniques have also examined the use of robotic teachers, hand-written control policies and simulated planners for demonstration.
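Regardless of how the demonstrations are recorded, the resulting dataset has a common shape: a sequence of state-action pairs per teacher execution. A minimal sketch of such a container, with hypothetical names and toy states, might look like this:

```python
from dataclasses import dataclass, field

@dataclass
class Demonstration:
    """One teacher execution, recorded as a sequence of state-action pairs."""
    states: list = field(default_factory=list)   # e.g., joint angles, object poses
    actions: list = field(default_factory=list)  # e.g., motor or navigation commands

    def record(self, state, action):
        """Append one sample; called once per control cycle during recording."""
        self.states.append(state)
        self.actions.append(action)

    def pairs(self):
        """Return the dataset as (state, action) tuples for policy learning."""
        return list(zip(self.states, self.actions))

# A teleoperation session would append one pair per control cycle:
demo = Demonstration()
demo.record(state=(0.0, 0.1), action="forward")
demo.record(state=(0.2, 0.1), action="turn_left")
assert len(demo.pairs()) == 2
```

Supplementary information (speech cues, critique labels) can be attached alongside these pairs, as discussed later in the chapter.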


Figure 3.2: The correspondence problem arises due to the differences in the sensing abilities and physical embodiment between the human and robot, making it more challenging to accurately map between their respective state and action representations [49].

For LfD to be successful, the states and actions in the learning dataset must be usable by the learner. In the most straightforward setup, the states and actions recorded during the demonstrations map directly to the sensing and movement capabilities of the robot. In other cases, however, a direct mapping does not exist between the teacher and learner due to differences in sensing ability, body structure or mechanics. For example, a robot learner’s camera will not detect state changes in the same manner as a human teacher’s eyes, nor will its gripper apply force in the same manner as a human hand. The challenges which arise from these differences are referred to broadly as the correspondence problem [186]. Specifically, the issue of correspondence deals with the identification of a mapping between the teacher and the learner that allows the transfer of information from one to the other.
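In the simplest case, solving correspondence amounts to defining an explicit mapping from the teacher's representation to the learner's. As an illustrative sketch (not any specific system's method), a single human joint angle can be linearly retargeted onto a robot joint with narrower limits, clamping values the hardware cannot reach:

```python
def retarget_joint(human_angle, human_range, robot_range):
    """Linearly map a human joint angle onto the robot's joint limits,
    clamping so the result is always executable by the hardware."""
    h_lo, h_hi = human_range
    r_lo, r_hi = robot_range
    t = (human_angle - h_lo) / (h_hi - h_lo)   # normalize to [0, 1]
    t = min(max(t, 0.0), 1.0)                  # clamp out-of-range input
    return r_lo + t * (r_hi - r_lo)

# A human elbow flexes roughly 0-150 degrees; suppose the robot's elbow
# (hypothetical) only covers 0-120 degrees:
assert retarget_joint(75.0, (0.0, 150.0), (0.0, 120.0)) == 60.0
assert retarget_joint(200.0, (0.0, 150.0), (0.0, 120.0)) == 120.0
```

Real correspondence mappings are rarely this simple (differing kinematic chains, sensing modalities, and dynamics all intervene), which is precisely why errors in this mapping function are a significant concern for learning.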

The correspondence problem lies at the heart of Learning from Demonstration, and is intertwined with the choice of both the human-robot interaction method and the computational technique used for learning. Using a direct demonstration technique that does not require correspondence simplifies the learning process significantly, as it removes one source of possible error: the mapping function that translates human capabilities to those of the robot. As discussed below, several demonstration techniques directly map between the actions of the teacher and those of the student, the primary examples of which are teleoperation of the robot through kinesthetic teaching [51] or a controller such as a joystick or computer interface [1, 237]. However, not all systems are amenable to teleoperation. For example, low-level motion demonstrations are difficult on systems with complex motor control, such as high degree of freedom humanoids. Furthermore, physically controlling the robot may not be natural, or even possible, in a given situation. Instead, the teacher may find it more effective to perform the task with their own body while the robot watches. Enabling the robot to learn from observations of the teacher requires a solution to the correspondence problem: the states and actions of the teacher during the execution must be inferred and mapped onto the abilities of the robot. Learning in such settings depends heavily upon the accuracy of this mapping. Finally, the teacher may not demonstrate the task at all, and instead observe the robot and provide critique or corrections to the current behavior. In the following sections we discuss techniques for enabling the robot to learn from its own experiences, observation of the teacher, and the teacher's critiques. We conclude the chapter with a discussion of the tradeoffs and implications that the choice of interaction mode has on the design of the overall robot learning system.

Figure 3.3: (a) Kinesthetic teaching with the iCub robot [13]. (b) User controlling the full-body motions of an Aldebaran Nao robot using the Xsens MVN inertial motion capture suit [141].

3.2 LEARNING BY DOING

Teleoperation provides the most direct method for information transfer within demonstration learning. During teleoperation, the robot is operated by the teacher while recording from its own sensors. Demonstrations recorded through human teleoperation via a joystick have been used in a variety of applications, including flying a robotic helicopter [1], soccer kicking motions [40], robotic arm assembly tasks [64], and obstacle avoidance and navigation [118, 237]. Teleoperation has also been applied to a wide variety of simulated domains, such as mazes [70, 214], driving [3, 66], and soccer [7], among many other applications. Teleoperation interfaces vary in complexity from hand-held controllers to teleoperation suits [159]. Hand-written controllers have also been used to teleoperate the robot in place of a human teacher [11, 102, 221, 237].

Kinesthetic teaching offers another variant of teleoperation. In this method, the robot is not actively controlled; rather, its passive joints are moved through the desired motions while the robot records the trajectory [51]. Figure 3.3(a) shows a person teaching a humanoid robot to manipulate an object. This technique has been extensively used in motion trajectory learning, and many complementary computational methods are discussed in Chapter 4. A key benefit of teaching through this method of interaction is that it ensures that the demonstrations are constrained to actions within the robot's abilities, and the correspondence problem is largely eliminated. Additionally, the user is able to directly experience the limitations of the robot's movements, and thus gain a greater understanding of the robot's abilities.
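The record-and-replay structure of kinesthetic teaching can be sketched in a few lines. The function names and the two-joint arm below are hypothetical; real systems sample at a fixed control rate with gravity compensation active:

```python
def record_kinesthetic(read_joints, num_samples):
    """While the arm is passive (gravity-compensated), sample its joint
    positions as the teacher physically guides it through the motion."""
    return [read_joints(t) for t in range(num_samples)]

def replay(trajectory, command_joints):
    """Reproduce the demonstrated motion by commanding the recorded positions."""
    for q in trajectory:
        command_joints(q)

# Simulated 2-joint arm being guided along a simple motion:
traj = record_kinesthetic(lambda t: (0.1 * t, -0.05 * t), num_samples=3)
executed = []
replay(traj, executed.append)
assert executed == [(0.0, 0.0), (0.1, -0.05), (0.2, -0.1)]
```

Because every recorded sample was physically realized by the robot's own joints, the replayed trajectory is guaranteed to be within the platform's capabilities, which is the property the text above highlights.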

Another alternative to direct teleoperation is shadowing, in which the robot mimics the teacher’s demonstrated motions while recording from its own sensors. In comparison to teleoperation, shadowing requires an extra algorithmic component which enables the robot to track and actively shadow (rather than be passively moved by) the teacher. Body sensors are often used to track the teacher’s movement with a high degree of accuracy. Figure 3.3(b) shows an example setup used by [141], in which the Xsens MVN inertial motion capture suit worn by the user is used to control the robot’s pose. This example demonstrates tightly coupled interaction between the user and the robot, since almost every teacher movement is detected by the sensors.

Shadowing also allows for loosely coupled interactions, and has even been applied to robotic teachers. Hayes and Demiris [109] perform shadowing with a robot teacher whose platform is identical to the robot learner; the learner follows behind the teacher as it navigates through a maze. Nehmzow et al. [187] present an algorithm for robot motion control in which the robot first records the human teacher’s execution of the desired navigation trajectory, and then shadows this execution. While repeating the teacher’s trajectory, the robot records data about its environment using its onboard sensors. The action and sensor data are then combined into a feedback controller that is used to reproduce future instances of the demonstrated task.
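The Nehmzow et al. approach pairs the sensor readings gathered while shadowing with the actions taken at each moment, and turns that log into a feedback controller. One simple way to realize such a controller (a nearest-neighbor sketch under assumed toy sensor data, not the authors' exact algorithm) is:

```python
def build_controller(sensor_log, action_log):
    """Pair each sensor reading captured while shadowing with the action
    taken at that moment, yielding a nearest-neighbor feedback controller."""
    def controller(reading):
        # Choose the action whose logged sensor context is closest (squared
        # Euclidean distance) to the current reading.
        i = min(range(len(sensor_log)),
                key=lambda j: sum((a - b) ** 2
                                  for a, b in zip(reading, sensor_log[j])))
        return action_log[i]
    return controller

# Hypothetical range-sensor readings (left wall, right wall) logged while
# the robot shadowed the teacher through a corridor:
sensors = [(0.5, 0.5), (0.2, 0.8), (0.8, 0.2)]
actions = ["straight", "turn_right", "turn_left"]
ctrl = build_controller(sensors, actions)
assert ctrl((0.25, 0.75)) == "turn_right"
```

The resulting controller generalizes beyond the single shadowed trajectory: any future sensor reading is handled by recalling the most similar demonstrated situation.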

Trajectory information collected through teleoperation, kinesthetic teaching or shadowing can be combined with other input modalities, such as speech. Nicolescu and Mataric [190] present an approach in which a robot learns by shadowing a robotic or human teacher. In addition to trajectory information, their technique enables the teacher to use simple voice cues to frame the learning (“here,” “take,” “drop,” “stop”), to provide informational cues about the relevance or irrelevance of observation inputs and indications of the desired behavioral output. In Rybski et al. [225], demonstration of the desired task is also performed through shadowing combined with dialog in which the robot is told specifically what actions to execute in various states. Meriçli et al. [175] present a similarly motivated approach which additionally supports repetitions (cycles) in the task representation and enables the user to modify and correct an existing task. Breazeal et al. [36] also explore this form of demonstration, enabling a robot to learn a symbolic high-level task within a social dialog.

Finally, some learning methods pay attention only to the state sequences, without recording any actions. This makes it possible to communicate the task objective function to the learner without traditional action demonstrations. For example, by drawing a path through a 2-D representation of the physical world, Ratliff et al. provide high-level path planning demonstrations to a rugged outdoor robot [215] and a small quadruped robot [143, 216]. Human-controlled teleoperation demonstrations are also utilized with the same outdoor robot for lower-level obstacle avoidance [216]. Since actions are not provided in the demonstration data, at run time a learned state-action mapping does not exist to provide guidance for action selection. Instead, actions are selected by employing low level motion planners and controllers [215, 216], and provided transition models [143].
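When demonstrations contain only states, action selection at run time must come from elsewhere, such as a planner or a provided transition model. A minimal sketch of the latter, in a hypothetical 1-D world, chooses whichever action the transition model predicts will land closest to the next demonstrated state:

```python
def select_action(current, target, actions, transition):
    """With only a demonstrated state sequence (no actions), pick the action
    whose predicted successor, via a provided transition model, is closest
    to the next demonstrated state."""
    return min(actions, key=lambda a: abs(transition(current, a) - target))

# Hypothetical 1-D world: position changes by -1, 0, or +1 per step.
transition = lambda s, a: s + a
demo_states = [0, 1, 2, 2, 1]          # state-only demonstration
s, replayed = 0, [0]
for target in demo_states[1:]:
    a = select_action(s, target, actions=[-1, 0, 1], transition=transition)
    s = transition(s, a)
    replayed.append(s)
assert replayed == demo_states
```

This illustrates the division of labor described above: the demonstration communicates *where* to go, while the transition model (or a low-level planner) determines *how* to get there.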

Figure 3.4: (a) User teaching a forehand swing motion to a humanoid robot using the Sarcos Sen-Suit [115]. (b) Humanoid robot learning to play air hockey from observation of opponent player [25].

3.3 LEARNING FROM OBSERVATION

In many situations, it is more effective or natural for the teacher to perform the task demonstration using their own body instead of controlling the robot directly. As discussed above, this form of demonstration introduces a correspondence problem with respect to the mapping between the teacher's and robot's states and actions. As a result, this technique is commonly used with humanoid or anthropomorphic robots, since the robot's resemblance to a human results in a simpler and more intuitive mapping, though learning with other robot embodiments is also possible. Unlike in shadowing, the robot does not simultaneously mimic the teacher's actions during the observation.

Accurately sensing the teacher's actions is critical for the success of this approach. Traditionally, many techniques have relied on instrumenting the teacher's body with sensors, including the use of motion capture systems and inertial sensors. Ijspeert et al. [114, 115] use a Sarcos Sen-Suit worn by the user to simultaneously record 35-DoF motion. The recorded joint angles were used to teach a 30-DoF humanoid to drum, reach, draw patterns, and perform tennis swings (Figure 3.4(a)). This work is extended in [184] to walking patterns. The same device, supplemented with Hall sensors, is used by Billard et al. to teach a humanoid robot to manipulate boxes in sequence [29]. In later work, Calinon and Billard combine demonstrations executed by a human teacher via wearable motion sensors with kinesthetic teaching [50].

Wearable sensors, and other forms of specialized recording devices, provide a high degree of accuracy in the observations. However, their use restricts the adoption of such learning methods beyond research laboratories and niche applications. A number of approaches have been designed to use only camera data. One of the earliest works in this area was the 1994 paper by Kuniyoshi et al. [152], in which a robot extracts the action sequence and infers and executes a task plan based on observations of a human hand demonstrating a blocks assembly task. Another example of this demonstration approach is the work of Bentivegna et al. [25], in which a 37-DoF humanoid learns to play air hockey by tracking the position of the human opponent's paddle (Figure 3.4(b)). Visual markers are also often used to improve the quality of visual information, such as in [30], where reaching patterns are taught to a simulated humanoid. Markers are similarly used to optically track human motion in [122, 123, 259] and to teach manipulation [209] and motion sequences [10]. In recent years, the availability of low-cost depth sensors (e.g., Microsoft Kinect) and their associated body pose tracking methods has made depth sensing a rich source of input data for LfD methods that rely on external observations of the teacher (e.g., [79]).

Related to the learning by observation problem, several works focus exclusively on the perceptual-motor mapping problem of LfD, where, in order to imitate, the robot must map a sensed experience to a corresponding motor output. Often this is treated as a supervised learning problem, where the robot is given several sensory observations of a particular motor action. Demiris and Hayes use forward models as the mechanism to solve the dual task of recognition and generation of action [80]. Mataric and Jenkins suggest behavior primitives as a useful action representation mechanism for imitation [122]. In their work on facial imitation, Breazeal et al. use an imitation game to facilitate learning the sensory-motor mapping of facial features tracked with a camera to robot facial motors. In a turn-taking interaction, the human first imitates the robot as it performs a series of its primitive actions, teaching it the mapping; then the robot is able to imitate [37].
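The forward-model idea can be sketched concretely. In the spirit of Demiris and Hayes' dual-task framing (this is an illustrative toy, not their implementation), recognition reuses the generative machinery: each candidate action's forward model is run from the current state, and the action whose prediction best matches the observed outcome is selected:

```python
def recognize(observed_next, state, forward_models):
    """Run each action's forward model from the current state and pick the
    action whose predicted successor best matches what was observed, so
    recognition reuses the same models used for action generation."""
    return min(forward_models,
               key=lambda a: abs(forward_models[a](state) - observed_next))

# Hypothetical 1-D forward models for two primitive actions:
models = {"advance": lambda s: s + 1.0, "retreat": lambda s: s - 1.0}
assert recognize(observed_next=2.1, state=1.0, forward_models=models) == "advance"
```

Once the observed behavior is recognized as one of the robot's own primitives, imitation reduces to executing that primitive, which is what makes the recognition/generation duality attractive.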

Finally, observations can also focus on the effects of the teacher’s actions instead of the action movements themselves. Tracking the trajectories of the objects being manipulated by the teacher, as in [249], can enable the robot to infer the desired task model and to generate a plan that imitates the observed behavior.

3.4 LEARNING FROM CRITIQUE

The approaches described in the above sections capture demonstrations in the form of state-action pairs, relying on the human's ability to directly perform the task through one of the many possible interaction methods. While this is one of the most common demonstration techniques, other forms of input can be used in addition to, or in place of, such methods.

In learning from critique or shaping, the robot practices the task, often selecting actions through exploration, while the teacher provides feedback to indicate the desirability of the exhibited behavior. The idea of shaping is borrowed from psychology, in which behavioral shaping is defined as a training procedure that uses reinforcement to condition the desired behavior in a human or animal [234]. During training, the reward signal is initially used to reinforce any tendency towards the correct behavior, but is gradually changed to reward successively more difficult elements of the task.


Figure 3.5: A robot learning from critique provided by the user through a hand-held remote [138].

Shaping methods with human-controlled rewards have been successfully demonstrated in a variety of software agent applications [33, 135, 252] as well as robots [129, 138, 242]. Most of the developed techniques extend traditional Reinforcement Learning (RL) frameworks [245]. A common approach is to let the human directly control the reward signal to the agent [91, 119, 138, 241]. For example, in Figure 3.5, the human trainer provides positive and negative reward feedback via a hand-held remote in order to train the robot to perform the desired behavior [138].
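The core of such approaches is that the scalar reward in an otherwise standard RL update comes from the teacher (e.g., a +1/-1 button press) rather than from the environment. A minimal, stateless sketch of one such update (hypothetical names; real systems extend full RL frameworks with states and value propagation):

```python
def shaped_update(q, state, action, human_reward, alpha=0.5):
    """One learning step in which the reward is supplied by the human
    teacher, e.g., a +1 / -1 press on a hand-held remote, instead of
    being computed from the environment."""
    old = q.get((state, action), 0.0)
    # Move the value estimate toward the teacher's feedback.
    q[(state, action)] = old + alpha * (human_reward - old)
    return q

# The robot explores two actions; the teacher rewards only "wave":
q = {}
for _ in range(10):
    shaped_update(q, "idle", "wave", human_reward=+1.0)
    shaped_update(q, "idle", "spin", human_reward=-1.0)
assert q[("idle", "wave")] > q[("idle", "spin")]
```

Consistent with the definition of shaping above, the teacher can start by rewarding any tendency toward the target behavior and gradually reserve positive feedback for more complete executions.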

