
CHAPTER 2

Multi-Objective Decision Problems

In this chapter, we introduce the concept of a multi-objective decision problem. Then we describe two concrete classes of multi-objective decision problems that we use throughout the book: multi-objective coordination graphs and multi-objective Markov decision processes. However, before introducing the concrete multi-objective decision problems, we first introduce their single-objective counterparts.

2.1 MULTIPLE OBJECTIVES

In this book, we focus on different (cooperative) multi-objective decision problems. Multi-objective decision problems contrast single-objective decision problems, which are more common in the literature. In their most general form, single-objective decision problems can be defined as a set of policies and a value function that we wish to maximize:

Definition 2.1 A cooperative single-objective decision problem (SODP) consists of:

• a set of allowed (joint) policies Π,

• a value function that assigns a real-valued scalar, Vπ ∈ ℝ, to each joint policy π ∈ Π, corresponding to the desirability, i.e., the utility, of the policy.

Definition 2.2 In a cooperative multi-objective decision problem (MODP), Π is the same as in an SODP, but

• there are d ≥ 1 objectives, and

• the value function assigns a value vector, Vπ ∈ ℝd, to each joint policy π ∈ Π, corresponding to the desirability of the policy with respect to each objective.

We denote the value of policy π in the i-th objective as $V^\pi_i$.

Both Vπ and Π may have particular forms that affect the way they should be computed, as we discuss in Chapter 3. For example, there may be multiple agents, each with its own action space. In such settings, a solution consists of a joint policy containing a local policy for each agent. We assume that Π is known and that it is in principle possible to compute the value of each (joint) policy. Furthermore, we assume that, if there are multiple agents, these agents are cooperative.

Definition 2.3 A multi-agent MODP is cooperative if and only if all agents get the same (team) value, Vπ, for executing a joint policy π ∈ Π, i.e., there are no individual rewards. A single-agent MODP is cooperative by default.

This definition of cooperative is common in the field of decision theory, e.g., in multi-agent MDPs [Boutilier, 1996, Scharpff et al., 2016] and Dec-POMDPs [Oliehoek and Amato, 2016]. However, the term “cooperative” is used differently in cooperative game theory [Chalkiadakis et al., 2011, Igarashi and Roijers, 2017], in which agents form coalitions on the basis of their individual utilities. In this book, we consider only decision problems that are cooperative according to Definition 2.3.

In an SODP, the value function provides a complete ordering on the joint policies, i.e., for each pair of policies π and π′, Vπ must be greater than, equal to, or less than Vπ′. By contrast, in an MODP, the presence of multiple objectives means that the value function Vπ is a vector rather than a scalar. Such value functions supply only a partial ordering. For example, it is possible that $V^\pi_i > V^{\pi'}_i$ for one objective i, but $V^\pi_j < V^{\pi'}_j$ for another objective j. Consequently, unlike in an SODP, we can no longer determine which values are optimal without additional information about how to prioritize the objectives, i.e., about what the utility of the user is for different trade-offs between the objectives.
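As a concrete illustration of such incomparability (with made-up numbers), consider two policies π and π′ with two-dimensional value vectors

$$\mathbf{V}^\pi = (3, 1), \qquad \mathbf{V}^{\pi'} = (1, 3).$$

Neither vector is at least as good as the other in both objectives, so without knowing how the user trades off the objectives, we cannot say which policy is preferable.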

In the unknown weights and decision support scenarios (Figure 1.1), the parameters of the scalarization function w, or even f itself, are unknown during the planning or learning phases. Therefore, an algorithm for solving an MODP should return a set of policies that contains an optimal policy for each possible w. Given such a solution set, the user can pick the policy that maximizes her utility in the selection phase. We want the solution set to contain at least one optimal policy for every possible scalarization (in order to guarantee optimality), but we also want the solution set to be as small as possible, in order to make the selection phase as efficient as possible. In Chapter 3, we discuss which solution sets are optimal, and how this can be derived from different assumptions about the scalarization function f (Definition 1.1) and the set of permitted policies Π in the MODP. In the rest of this chapter, we introduce two different MODP problem classes.

2.2 MULTI-OBJECTIVE COORDINATION

The first class of MODPs that we treat is the multi-objective coordination graph (MO-CoG). In a MO-CoG, multiple agents need to coordinate their actions in order to be effective. For example, in the mining problem of Figure 1.2, each agent represents a van with workers from a single village. Each of these vans can go to different mines within reasonable traveling distance, leading to a set of different possible actions for each agent. Each mine yields a different expected amount of gold (the first objective) and silver (the second objective). Because mining can be done more efficiently when more workers are present at a mine, it is vitally important that the different agents (i.e., vans) coordinate which mines they go to.

Other examples of problems that can be modeled as a MO-CoG are: risk-sensitive combinatorial auctions, in which we want to maximize the total revenue, while minimizing the risk for the auctioneer [Marinescu, 2011], and maintenance scheduling for offices in which the energy consumption, costs, and overtime for the maintenance staff must all be minimized [Marinescu, 2011].

2.2.1 SINGLE-OBJECTIVE COORDINATION GRAPHS

Before we formally define MO-CoGs, we first define the corresponding single-objective problem, i.e., coordination graphs (CoGs) [Guestrin et al., 2002, Kok and Vlassis, 2004]. In the context of coordination graphs, the notion of reward is typically referred to as payoff in the literature. Payoff is usually denoted u (for utility). We adopt this terminology and notation.

Definition 2.4 A coordination graph (CoG) [Guestrin et al., 2002, Kok and Vlassis, 2004] is a tuple 〈D, A, U〉, where

• D = {1,…, n} is the set of n agents,

• A = A1 × … × An is the joint action space: the Cartesian product of the finite action spaces of all agents. A joint action is thus a tuple containing an action for each agent: a = 〈a1,…, an〉, and

• U = {u1,…, uρ} is the set of ρ scalar local payoff functions, each of which has limited scope, i.e., it depends on only a subset of the agents. The total team payoff is the sum of the local payoffs: $u(\mathbf{a}) = \sum_{e=1}^{\rho} u^e(\mathbf{a}_e)$.

In order to ensure that the coordination graph is fully cooperative, all agents share the payoff function u(a). We abuse the notation e both to index a local payoff function ue and to denote the subset of agents in its scope; ae is thus a local joint action, i.e., a joint action of this subset of agents. The decomposition of u(a) into local payoff functions can be represented as a factor graph [Bishop, 2006] (Figure 2.1): a bipartite graph containing two types of vertices: agents (variables) and local payoff functions (factors), with edges connecting local payoff functions to the agents in their scope.

The main challenge in a CoG is that the size of the joint action space A grows exponentially with the number of agents. It thus quickly becomes intractable to enumerate all joint actions and their associated payoffs. Key to solving CoGs is therefore to exploit loose couplings between agents, i.e., each agent’s behavior directly affects only a subset of the other agents.
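To make this concrete, here is a minimal sketch (in Python, with invented payoff values and our own data-structure choices, not the book's) of a CoG as a list of local payoff functions, together with the brute-force maximization that quickly becomes intractable as the number of agents grows:

```python
from itertools import product

# Each local payoff function is a (scope, table) pair: the scope lists the
# agents it depends on, and the table maps their local joint action to a
# scalar payoff. The values below are placeholders, not from the book.
local_payoffs = [
    ((0, 1), {(0, 0): 3.0, (0, 1): 1.0, (1, 0): 0.0, (1, 1): 2.0}),  # u^1(a1, a2)
    ((1, 2), {(0, 0): 2.0, (0, 1): 0.0, (1, 0): 1.0, (1, 1): 4.0}),  # u^2(a2, a3)
]

def team_payoff(joint_action):
    """u(a) = sum_e u^e(a_e): each local payoff reads only its own scope."""
    return sum(table[tuple(joint_action[i] for i in scope)]
               for scope, table in local_payoffs)

# Naive solution: enumerate the full joint action space, which has
# |A_i|^n entries and is therefore only feasible for tiny problems.
n_agents, n_actions = 3, 2
best = max(product(range(n_actions), repeat=n_agents), key=team_payoff)
print(best, team_payoff(best))
```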

Figure 2.1: A CoG with three agents and two local payoff functions. The factor graph illustrates the loose couplings that result from the decomposition into local payoff functions. In particular, each agent’s choice of action directly depends only on those of its immediate neighbors, e.g., once agent 1 knows agent 2’s action, it can choose its own action without considering agent 3.

Figure 2.1 shows the factor graph of an example CoG in which the team payoff function decomposes into two local payoff functions, each with two agents in scope:

$$u(\mathbf{a}) = u^1(a_1, a_2) + u^2(a_2, a_3).$$
The local payoff functions are defined in Table 2.1. We use this CoG as a running example throughout this book. The local payoff functions, with their limited scopes, encode the loose couplings: each agent can only directly affect another agent when they share, i.e., are both in the scope of, a local payoff function. For example, if we fix the action of agent 2, then agents 1 and 3 can decide upon their optimal actions independently, as they do not directly affect each other.
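The following sketch illustrates how this conditional independence can be exploited computationally: conditioning on agent 2's action decouples the two factors, so agents 1 and 3 can each be optimized locally. The payoff numbers are placeholders, not the actual entries of Table 2.1.

```python
# Placeholder payoff tables for u^1(a1, a2) and u^2(a2, a3); actions are 0/1.
u1 = {(0, 0): 4.0, (0, 1): 1.0, (1, 0): 0.0, (1, 1): 3.0}
u2 = {(0, 0): 2.0, (0, 1): 0.0, (1, 0): 1.0, (1, 1): 5.0}

best_value, best_joint = float("-inf"), None
for a2 in (0, 1):
    # Given a2, the two local payoff functions share no other agents,
    # so a1 and a3 can be chosen independently of each other.
    a1 = max((0, 1), key=lambda a: u1[(a, a2)])
    a3 = max((0, 1), key=lambda a: u2[(a2, a)])
    value = u1[(a1, a2)] + u2[(a2, a3)]
    if value > best_value:
        best_value, best_joint = value, (a1, a2, a3)
print(best_joint, best_value)
```

This per-factor reasoning is the idea that variable-elimination-style methods generalize to larger factor graphs.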

Table 2.1: The payoff matrices for u1(a1,a2) (left) and u2(a2,a3) (right). There are two possible actions per agent, denoted by a dot ($\dot{a}_i$) and a bar ($\bar{a}_i$).


2.2.2 MULTI-OBJECTIVE COORDINATION GRAPHS

We now consider the multi-objective setting:

Definition 2.5 A multi-objective coordination graph (MO-CoG) [Roijers et al., 2015b] is a tuple 〈D, A, U〉 where:

• D and A are the same as in a CoG, but,

• U = {u1,…, uρ} is now the set of ρ d-dimensional local payoff functions. The total team payoff is the sum of the local vector-valued payoffs: $\mathbf{u}(\mathbf{a}) = \sum_{e=1}^{\rho} \mathbf{u}^e(\mathbf{a}_e)$.

For example, payoffs for a MO-CoG structured as in Figure 2.1, i.e.,

$$\mathbf{u}(\mathbf{a}) = \mathbf{u}^1(a_1, a_2) + \mathbf{u}^2(a_2, a_3),$$

are provided in Table 2.2.

Table 2.2: The two-dimensional payoff matrices for u1(a1,a2) (left) and u2(a2,a3) (right)


The only difference between a MO-CoG and a CoG is the dimensionality of the payoff functions. A CoG is thus a special case of a MO-CoG, i.e., a MO-CoG in which d = 1. Furthermore, when preferences are known, it may be possible to scalarize a MO-CoG and thus transform it into a CoG. For example, if we know that the scalarization function is linear, i.e., f(u, w) = w · u, and that its parameter vector is w = (0.75, 0.25), then we can scalarize the multi-objective payoff functions of Table 2.2 to the single-objective payoff functions of Table 2.1 before planning.
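As a sketch of this transformation (with placeholder payoff vectors rather than the actual entries of Table 2.2), applying w to every entry of each local payoff table yields an ordinary scalar CoG that any single-objective solver can then handle:

```python
import numpy as np

w = np.array([0.75, 0.25])  # assumed linear scalarization weights

# Placeholder vector-valued local payoff table for u^1(a1, a2).
u1_vec = {(0, 0): np.array([4.0, 2.0]), (0, 1): np.array([1.0, 3.0]),
          (1, 0): np.array([0.0, 1.0]), (1, 1): np.array([3.0, 0.0])}

# f(u, w) = w . u applied entry-wise turns the vector table into a scalar one.
u1_scalar = {a: float(w @ u) for a, u in u1_vec.items()}
print(u1_scalar)
```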

In a deterministic policy for a MO-CoG, the agents pick one joint action. In a stochastic policy, the agents draw a joint action from a probability distribution. Note that such a probability distribution is in general defined over joint actions, and there can thus still be coordination between the agents when the policy is stochastic.

When f and/or w are unknown in the planning or learning phase—as is the case in the unknown weights and decision support scenarios discussed in Section 1.1—we need to produce a set of policies that contains at least one optimal policy for each possible f and w. The solution of a MO-CoG is thus a coverage set of policies. In general, this can contain both deterministic and stochastic policies. We explain why this is important for MO-CoGs (but not for CoGs) in Chapter 3.
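Chapter 3 defines the appropriate coverage sets formally; as a simple illustration of the idea of discarding policies that can never be preferred, the sketch below (our own helper, not from the book) removes value vectors that are Pareto-dominated, i.e., no better in every objective and strictly worse in at least one:

```python
import numpy as np

def remove_pareto_dominated(values):
    """Keep only value vectors not Pareto-dominated by another vector."""
    keep = []
    for i, v in enumerate(values):
        dominated = any(np.all(u >= v) and np.any(u > v)
                        for j, u in enumerate(values) if j != i)
        if not dominated:
            keep.append(v)
    return keep

vals = [np.array(x, dtype=float) for x in [(3, 1), (1, 3), (2, 2), (1, 1)]]
print(remove_pareto_dominated(vals))  # (1, 1) is dominated by (2, 2) and removed
```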

2.3 MULTI-OBJECTIVE MARKOV DECISION PROCESSES

The second class of MODPs that we treat is the multi-objective Markov decision process (MOMDP), in which a single agent needs to perform a sequence of actions over time. This sequence of actions typically takes place in an environment that is affected by these actions. Therefore, the agent has to consider not only its immediate reward, but also the reward it will be able to obtain later in the states it visits in the future.

Consider the robot (shown as a Pacman symbol) in Figure 2.2 that needs to navigate in a socially appropriate way to reach the blue target zone in the upper right corner. We want the robot to reach the target as soon as possible, i.e., minimize the time to reach the target, but also minimize the hindrance that the robot causes to the other person by avoiding her personal space (indicated in red) along the way. By driving through the personal space of the person, it can obtain a higher value with respect to the first objective but a lower value in the second objective. Which policy is optimal thus depends on how much we value the first objective relative to the second objective.

Figure 2.2: A social robot navigation problem: a robot (indicated by the Pacman symbol) needs to reach the target zone in the upper right corner (objective 1). However, the robot needs to avoid the personal space of the person standing between the robot's start position and the target zone (objective 2).

2.3.1 SINGLE-OBJECTIVE MARKOV DECISION PROCESSES

The single-objective version of an MOMDP is an MDP:

Definition 2.6 A (single-objective) Markov decision process (MDP) [Bellman, 1957b] is a tuple 〈S, A, T, R, μ0〉 where,

• S is the state space, i.e., the set of possible states the environment can be in,

• A is the action space, i.e., the set of actions the agent can take,

• T : S × A × S → [0,1] is the transition function, giving the probability of a next state given an action and a current state,

• R : S × A × S → ℝ is the reward function, specifying the immediate expected scalar reward corresponding to a transition, and

• μ0 is the distribution over initial states s0.

At each timestep t, the agent observes the current state of the environment s ∈ S. When the agent takes an action a ∈ A, the environment transitions to a new state s′.

The state in an MDP is Markovian, i.e., the current state s of the environment and the current action of the agent a are a sufficient statistic for predicting the next transition probabilities T(s′|s, a) and the associated expected immediate reward. The agent’s history, i.e., the states and actions that led to the current state, does not provide additional information in that respect.
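To make the model concrete, here is a minimal tabular-MDP sketch (states, actions, transition probabilities, rewards, and μ0 are all invented for illustration) in which the next state depends only on the current state and action, as the Markov property requires:

```python
import random

# A tiny tabular MDP; every element below is made up for illustration.
S = ["s0", "s1"]
A = ["left", "right"]
T = {("s0", "left"):  {"s0": 0.9, "s1": 0.1},
     ("s0", "right"): {"s0": 0.2, "s1": 0.8},
     ("s1", "left"):  {"s0": 0.8, "s1": 0.2},
     ("s1", "right"): {"s0": 0.1, "s1": 0.9}}
R = {("s0", "right", "s1"): 1.0}  # all other transitions yield reward 0
mu0 = {"s0": 1.0}

def step(s, a):
    """Sample s' from T(s'|s, a) and return it with the immediate reward."""
    dist = T[(s, a)]
    s_next = random.choices(list(dist), weights=dist.values())[0]
    return s_next, R.get((s, a, s_next), 0.0)

print(step("s0", "right"))
```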

The agent’s goal in an MDP is to find a policy π that maximizes the return Rt, which is some function of the rewards received from timestep t and onward. In the broadest sense, a policy can condition on everything that is known to the agent.

A state-independent value function Vπ specifies the expected return when following π from the initial state distribution μ0:

$$V^\pi = \mathbb{E}\left[\, R_0 \mid \pi, \mu_0 \,\right]. \qquad (2.1)$$
We further specify first the return, Rt, and then the policy, π.

Typically, the return is additive [Boutilier et al., 1999], i.e., it is a sum of the rewards attained at each timestep. When the returns are additive, Vπ becomes an expectation over this sum. In a finite-horizon setting there is a limited number of timesteps, h, and the sum is typically undiscounted:

$$R_t = \sum_{k=0}^{h-t} r_{t+k},$$

where $r_{t+k}$ denotes the scalar reward received at timestep $t + k$.
In a discounted infinite-horizon setting, the number of timesteps is not limited, but there is a discount factor, 0 ≤ γ < 1, that specifies the relative importance of future rewards with respect to immediate rewards:

$$R_t = \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k}.$$
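For example, a short helper (our own, for illustration) computes this discounted return from a finite prefix of the reward sequence:

```python
def discounted_return(rewards, gamma=0.9):
    """R_t = sum_k gamma^k * r_{t+k}, truncated to the rewards provided."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

print(discounted_return([1.0, 0.0, 2.0]))  # 1.0 + 0.0 + 0.81 * 2.0 = 2.62
```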
Which action at is chosen by the agent at each timestep t depends on its policy π. If the policy is stationary, i.e., it conditions only on the current state, then it can be formalized as π : S × A → [0, 1]: it specifies, for each state and action, the probability of taking that action in that state. We can then specify the state value function of a policy π:

$$V^\pi(s) = \mathbb{E}\left[\, R_t \mid \pi, s_t = s \,\right]$$
for all t when st = s. The Bellman equation restates this expectation recursively for stationary policies:

$$V^\pi(s) = \sum_{a \in A} \pi(s, a) \sum_{s' \in S} T(s, a, s') \left[ R(s, a, s') + \gamma\, V^\pi(s') \right].$$
Note that the Bellman equation, which forms the heart of most standard solution algorithms such as dynamic programming [Bellman, 1957a] and temporal difference methods [Sutton and Barto, 1998], explicitly relies on the assumption of additive returns. This is important because nonlinear scalarization functions f can interfere with this additivity property, making planning and learning methods that rely on the Bellman equation not directly applicable, as we discuss in Section 3.2.3.
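As an illustration of how the Bellman equation is used in practice, the sketch below performs iterative policy evaluation on a tabular MDP, repeatedly applying the Bellman equation as an update rule until the values stop changing. It reuses the dictionary conventions of the earlier sketch; all names are ours.

```python
def policy_evaluation(S, A, T, R, pi, gamma=0.9, tol=1e-8):
    """Evaluate a stationary stochastic policy pi[s][a] by repeated
    application of the Bellman equation (dynamic programming)."""
    V = {s: 0.0 for s in S}
    while True:
        delta = 0.0
        for s in S:
            v_new = sum(pi[s][a] *
                        sum(p * (R.get((s, a, s2), 0.0) + gamma * V[s2])
                            for s2, p in T[(s, a)].items())
                        for a in A)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V
```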

State value functions induce a partial ordering over policies, i.e., π is better than or equal to π′ if and only if its value is at least as high for all states:

$$\pi \succeq \pi' \iff \forall s \in S:\; V^\pi(s) \ge V^{\pi'}(s).$$
A special case of a stationary policy is a deterministic stationary policy, in which one action is chosen with probability 1 for every state. A deterministic stationary policy can be seen as a mapping from states to actions: π : S → A. For single-objective MDPs, there is always at least one optimal policy π*, i.e., a policy such that $\forall \pi\; \forall s:\; V^{\pi^*}(s) \ge V^{\pi}(s)$, that is stationary and deterministic.

Theorem 2.7 For any additive infinite-horizon single-objective MDP, there exists a deterministic stationary optimal policy [Boutilier et al., 1999, Howard, 1960].

If more than one optimal policy exists, they share the same value function, known as the optimal value function V*(s) = maxπ Vπ(s). The Bellman optimality equation defines the optimal value function recursively:

$$V^*(s) = \max_{a \in A} \sum_{s' \in S} T(s, a, s') \left[ R(s, a, s') + \gamma\, V^*(s') \right].$$
Note that, because it maximizes over actions, this equation makes use of the fact that there is an optimal deterministic stationary policy. Because an optimal policy maximizes the value for every state, such a policy is optimal regardless of the initial state distribution μ0. However, the state-independent value (Equation 2.1) can be different for different initial state distributions. Using μ0, the state value function can be translated back into the state-independent value function (Equation 2.1):

$$V^\pi = \sum_{s \in S} \mu_0(s)\, V^\pi(s).$$
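Correspondingly, the Bellman optimality equation can be applied as an update rule, which gives value iteration. The sketch below (same assumed tabular representation as before) also shows how μ0 turns the resulting state values into a single state-independent value:

```python
def value_iteration(S, A, T, R, gamma=0.9, tol=1e-8):
    """Sketch: iterate the Bellman optimality equation until convergence."""
    V = {s: 0.0 for s in S}
    while True:
        delta = 0.0
        for s in S:
            v_new = max(sum(p * (R.get((s, a, s2), 0.0) + gamma * V[s2])
                            for s2, p in T[(s, a)].items())
                        for a in A)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V

def state_independent_value(V, mu0):
    """V = sum_s mu0(s) V(s): weight state values by the initial distribution."""
    return sum(mu0[s] * V[s] for s in mu0)
```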
2.3.2 MULTI-OBJECTIVE MARKOV DECISION PROCESSES

In many decision problems, such as the social robot in Figure 2.2, it is impossible, undesirable, or infeasible to define a scalar reward function, and we need a vector-valued reward function, leading to an MOMDP.

Definition 2.8 A multi-objective Markov decision process (MOMDP) [Roijers et al., 2013a] is a tuple 〈S, A, T, R, μ0〉 where,

• S, A, T, and μ0 are the same as in an MDP, but,

• R : S × A × S → ℝd is now a d-dimensional reward function, specifying the expected immediate vector-valued reward corresponding to a transition.

MOMDPs have recently been applied to many real-world decision problems, including: water reservoir control [Castelletti et al., 2013, 2008, Giuliani et al., 2015], where policies for releasing water from a dam must be found while balancing multiple uses of the reservoir, including hydroelectric production and flood mitigation; office building environmental control [Kwak et al., 2012], in which energy consumption must be minimized while maximizing the comfort of the building’s occupants; and medical treatment planning [Lizotte et al., 2010, 2012], in which the effectiveness of the treatment must be maximized, while minimizing the severity of the side effects.

In an MOMDP, when an agent executes a policy π, its value, Vπ, is vector-valued, as it is an expectation of the sum over vector-valued rewards, i.e.,

$$\mathbf{V}^\pi = \mathbb{E}\left[\, \sum_{t=0}^{h} \mathbf{r}_t \;\middle|\; \pi, \mu_0 \right]$$

in the finite-horizon setting, and,

$$\mathbf{V}^\pi = \mathbb{E}\left[\, \sum_{t=0}^{\infty} \gamma^{t}\, \mathbf{r}_t \;\middle|\; \pi, \mu_0 \right]$$

in the infinite-horizon setting, where $\mathbf{r}_t$ is the vector-valued reward received at timestep t.
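Since, for a fixed policy π, the expectation is linear, each component of this value vector is simply the scalar value $V^\pi_i$ in one objective. The sketch below (using the same assumed tabular conventions as earlier, with rewards as NumPy arrays) therefore applies the scalar Bellman backup componentwise to evaluate a policy in an MOMDP:

```python
import numpy as np

def mo_policy_evaluation(S, A, T, R_vec, pi, d, gamma=0.9, tol=1e-8):
    """Evaluate a stationary policy in an MOMDP: the same Bellman backup as
    in the scalar case, applied to d-dimensional reward vectors. R_vec maps
    (s, a, s') to a length-d array; missing entries count as zero vectors."""
    zero = np.zeros(d)
    V = {s: np.zeros(d) for s in S}
    while True:
        delta = 0.0
        for s in S:
            v_new = sum(pi[s][a] *
                        sum(p * (R_vec.get((s, a, s2), zero) + gamma * V[s2])
                            for s2, p in T[(s, a)].items())
                        for a in A)
            delta = max(delta, float(np.max(np.abs(v_new - V[s]))))
            V[s] = v_new
        if delta < tol:
            return V
```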
