Читать книгу Active Learning - Burr Settles - Страница 12
ОглавлениеCHAPTER 2
Uncertainty Sampling
“Information is the resolution of uncertainty.”
— Claude Shannon
2.1 PUSHING THE BOUNDARIES
Let us revisit the alien fruits problem from Section 1.1, and use it as a running example. Recall that we want to efficiently test fruits for ⊕ safe
vs. ⊖ noxious
to eat. One solution is to lay out all the fruits in a line from most round to most irregular, and use a binary search algorithm to actively select which fruits to test. This approach is fine for our simple thought experiment, but we want a more general search algorithm for arbitrary problems with many input dimensions, potentially many choices for output (e.g., multiple class labels or output structures), and probably even noisy training labels. As a starting point, recall that we can use supervised learning to select a threshold parameter θ that lies somewhere in the transition from one label to the other.
One reasonable way to choose this threshold value is to be as non-committal as possible: set θ to be halfway between the known ⊕ and ⊖ fruits which are closest together (this is what so-called max-margin learning algorithms do). Such a classifier should be fairly confident about its predictions for fruits which are far away from the thresholded classification boundary, but as x (the fruit’s irregularity measure) approaches θ the model becomes much less certain. Intuitively, the instances that are least certain would offer the most information about the problem, since the more confident classifications are probably correct. What if we adopt a simple active learning strategy that queries the instance closest to the decision boundary? In fact, we would recover the binary search algorithm from Section 1.1, as illustrated by Figure 2.1.
This type of active learning strategy is commonly known as uncertainty sampling (Lewis and Catlett, 1994). The basic premise is that the learner can avoid querying the instances it is already confident about, and focus its attention instead on the unlabeled instances it finds confusing. The example in Figure 2.1 quantifies uncertainty by |θ − x|, the distance of instance x from the boundary θ, which is a fine measure for hypothesis classes that provide such a distance measure. However, we are interested in generalizing active learning to more complex problems that go beyond binary classification in a relatively noise-free environment like this one. An elegant way to extend the approach is to use a probabilistic classifier which can output a posterior distribution Pθ(Y|x) over the label variable Y given the input and learned model parameters. Under this interpretation, we would want to query the instance for which Pθ(ŷ|x)—where ŷ refers to the classifier’s most likely prediction for x—is closest to a uniform distribution (0.5 in the case of binary classification). While a probabilistic interpretation is not strictly necessary, there has been significant work in machine learning on probabilistic classifiers, and graphical models in particular (for a thorough overview, see Koller and Friedman, 2009). By framing our discussion of uncertainty sampling in the language of probability, we can easily generalize the techniques in this chapter to a variety of interesting cases, including problems with many input features, multiple output labels, and even structured prediction tasks, which we will discuss later in this chapter.
Figure 2.1: The binary search from Figure 1.3, re-interpreted as an uncertainty sampling approach. The best instance to query is deemed to be the one closest to the threshold θ.
2.2 AN EXAMPLE
To visualize the way in which uncertainty sampling generalizes to a noisy, two-dimensional classification problem, consider Figure 2.2. Figure 2.2(a) shows a toy data set constructed from two Gaussians centered at (-2,0) and (2,0) with standard deviation σ = 1. There are 400 instances total, 200 drawn from each class distribution. In a real-world setting, these instances may be available but their labels would not. Figure 2.2(b) illustrates the traditional supervised learning approach of randomly selecting instances for labeling. The line shows the linear decision boundary of a logistic regression model (i.e., where the posterior label probability equals 0.5) trained using 30 points. Notice that most of the labeled instances in this training set are far from zero on the horizontal axis, which is where the Bayes optimal decision boundary should be. As a result, this classifier only achieves 70% accuracy on the remaining unlabeled data. Figure 2.2(c) tells a different story, however: the active learner uses uncertainty sampling to focus on the instances closest to its decision boundary, assuming it can adequately explain the data in other parts of the input space. As a result, it avoids requesting labels for redundant or irrelevant instances, and achieves 90% accuracy using the same budget of 30 labeled instances.
Figure 2.2: Uncertainty sampling with a toy data set. (a) 400 instances, evenly sampled from two class Gaussians. Instances are represented as points in a 2D input space. (b) A logistic regression model trained with 30 labeled instances randomly drawn from the problem domain. The line represents the decision boundary of the classifier. (c) A logistic regression model trained with 30 actively queried instances using uncertainty sampling.
Figure 2.3: Generic pool-based uncertainty sampling algorithm.
2.3 MEASURES OF UNCERTAINTY
A general active learning algorithm is presented in Figure 2.3. The key component of the algorithm with respect to designing an active learning system is line 5, and we need a way to measure the uncertainty of candidate queries in the pool. For binary classification, the “closest to the decision boundary (probability ≈ 0.5)” heuristic will suffice. But when we deal with problems and models that have posterior distributions over three or more labels—or even multiple output structures—we need a more general measure of uncertainty or information content. From this point on, let denote the best instance that the utility measure A would select for querying.
Least Confident. A basic uncertainty sampling strategy is to query the instance whose predicted output is the least confident:
where ŷ = argmaxy Pθ(y|x), the prediction with the highest posterior probability under the model θ. In other words, this strategy prefers the instance whose most likely labeling is actually the least likely among the unlabeled instances available for querying. One way to interpret this uncertainty measure is the expected 0/1-loss, i.e., the model’s belief that it has mislabeled x. A drawback of this strategy is that it only considers information about the best prediction. Thus, it effectively throws away information about the rest of the posterior distribution.
Margin. A different active learning strategy is based on the output margin:
where ŷ1 and ŷ2 are the first and second most likely predictions under the model, respectively. Margin sampling addresses a shortcoming of the least confident strategy by incorporating the second best labeling in its assessment. Intuitively, instances with large margins are easy, since the learner has little doubt in how to differentiate between the two most probable alternatives. Instances with small margins are more ambiguous, though, and knowing the true labeling should help the model discriminate more effectively between them. However, for problems with a very large number of alternatives, the margin approach still ignores much of the output distribution.
Entropy. Perhaps the most general (and most common) uncertainty sampling strategy uses entropy (Shannon, 1948), usually denoted by H, as the utility measure:
where y ranges over all possible labelings of x. Entropy is a measure of a variable’s average information content. As such, it is often thought of as an uncertainty or impurity measure in machine learning. One interpretation of this strategy is the expected log-loss, i.e., the expected number of bits it will take to “encode” the model’s posterior label distribution.
Figure 2.4: Illustrations of various uncertainty measures. (a–c) Binary classification tasks. Each plot shows the utility score as a function of Pθ(⊕|x), which is the posterior probability of the positive class. (d–f) Ternary classification tasks (three labels). Heatmap corners represent posterior distributions where one label is very likely, with the opposite edge plotting the probability range between the other two classes (when that label has low probability). The center of each heatmap is the uniform distribution.
Figure 2.4 visualizes the implicit relationship among these uncertainty measures. For binary classification (top row), all three strategies are monotonic functions of one another. They are symmetric with a peak about Pθ(⊕|x) = 0.5. In effect, they all reduce to querying the instance that is closest to the decision boundary. For a three-label classification task (bottom row), the relationship begins to change. For all three measures, the most informative instance lies at the center of the triangular simplex, because this represents where the posterior label distribution is most uniform (and therefore most uncertain under the model). Similarly, the least informative instances are at the three corners, where one of the classes has extremely high probability (and thus little model uncertainty). The main differences lie in the rest of the probability space. For example, the entropy measure does not particularly favor instances for which only one of the labels is highly unlikely (i.e., along the outer side edges of the simplex), because the model is fairly certain that it is not the true label. The least confident and margin measures, on the other hand, consider such instances to be useful if the model has trouble distinguishing between the remaining two classes (i.e., at the midpoint of an outer edge). Intuitively, entropy seems appropriate if the objective function is to minimize log-loss, while the other two (particularly margin) are more appropriate if we aim to reduce classification error, since they prefer instances that would help the model better discriminate among specific classes.
2.4 BEYOND CLASSIFICATION
Not all machine learning is classification. In particular, we might want to use active learning to reduce the cost of training a model that predicts structured output, such as label sequences or trees. For example, Figure 2.5 illustrates how information extraction—the task of automatically extracting structured information like database entries from unstructured text—is framed as a sequence-labeling task. Let x = 〈x1, …, xT〉 be an observation sequence of length T with a corresponding label sequence y = 〈y1, …, yT〉. Words in a sentence correspond to tokens in x, which are mapped to labels in y.
Figure 2.5: An information extraction example viewed as a sequence labeling task. (a) A sample input sequence x and corresponding label sequence y. (b) A probabilistic sequence model represented as a finite state machine, illustrating the path of 〈x, y〉 through the model.
Figure 2.5(a) presents an example 〈x, y〉 pair. The labels indicate whether a given word belongs to an entity class of interest (org
and loc
in this case, for “organization” and “location,” respectively) or null
otherwise. Unlike simple classification, x is not represented by a single feature vector, but rather a sequence of feature vectors: one for each token (i.e., word). One approach is to treat each token as an instance, and train a classifier that scans through the input sequence, assigning output labels to tokens independently. However, the word “Madison,” devoid of context, might refer to an location (city), organization (university), or even a person. For tasks such as this, sequence models based on probabilistic finite state machines, such as hidden Markov models or linear-chain conditional random fields, are considered the state of the art. An example sequence model is shown in Figure 2.5(b). Such models can produce a probability distribution for every possible label sequence y, the number of which can grow exponentially in the sequence length T.
Fortunately, uncertainty sampling generalizes fairly easily to probabilistic structured prediction models. For example, the least confident strategy is popular for information extraction using sequences (Culotta and McCallum, 2005; Settles and Craven, 2008), because the most likely output sequence ŷ and the associated Pθ(ŷ|x) can be efficiently computed with dynamic programming. Selecting the best query is generally no more complicated or expensive than the standard inference procedure. The Viterbi algorithm (Corman et al., 1992) requires O(TM) time, for example, where T is the sequence length and M is the number of label states. It is often possible to perform “N-best” inference using a beam search as well (Schwartz and Chow, 1990), which finds the N most likely output structures under the model. This makes it simple to compute the necessary probabilities for ŷ1 and ŷ2 in the margin strategy, and comes at little extra computational expense: the complexity is O(TMN) for sequences, which for N = 2 merely doubles the runtime compared to the least confident strategy. Dynamic programs have also been developed to compute the entropy over all possible sequences (Mann and McCallum, 2007) or trees (Hwa, 2004), although this approach is significantly more expensive. The fastest entropy algorithm for sequence models requires O(TM2) time, which can be very slow when the number of label states is large. Furthermore, some structured models are so complex that they require approximate inference techniques, such as loopy belief propagation or Markov chain Monte Carlo (Koller and Friedman, 2009). In such cases, the least confident strategy is still straightforward since only the “best” prediction needs to be evaluated. However, the margin and entropy heuristics cease to be tractable and exact for these more complex models.
So far we have only discussed problems with discrete outputs—classification and structured prediction. Uncertainty sampling is also applicable to regression, i.e., learning tasks with continuous output variables. In this setting, the learner can simply query the unlabeled instance for which the learner has the highest output variance in its prediction. Under a Gaussian assumption, the entropy of a random variable is a monotonic function of its variance, so this approach is much in same the spirit as entropy-based uncertainty sampling. Another interpretation of variance is the expected squared-loss of the model’s prediction. Closed-form estimates of variance can be computed for a variety of model classes, although they can require complex and expensive computations.
Figure 2.6 illustrates variance-based uncertainty sampling using an artificial neural network. The target function is a Gaussian in the range [-10,10], shown by the solid red line in the top row of plots. The network in this example has one hidden layer of three logistic units, and one linear output unit. The network is initially trained with two labeled instances drawn at random (upper left plot), and its variance estimate (lower left plot) is used to select the first query. This process repeats for a few iterations, and after three queries the network can approximate the target function fairly well. In general, though, estimating the output variance is nontrivial and will depend on the type of model being used. This is in contrast to the utility measures in Section 2.3 for discrete outputs, which only require that the learner produce probability estimates. In Section 3.4, we will discuss active learning approaches using ensembles as a simpler way to estimate output variance. Active learning for regression has a long history in the statistics literature, generally referred to as optimal experimental design (Federov, 1972). However, the statistics community generally eschews uncertainty sampling in lieu of more sophisticated strategies, which we will explore further in Chapter 4.
Figure 2.6: Variance-based uncertainty sampling for a toy 1D regression task. Each column represents an iteration of active learning. In the top row, solid lines show the target function to be learned, while dashed lines show a neural network approximation based on available training data (black dots). The bottom row plots the network’s output variance across the input range, which is used to select the query for the next iteration.
2.5 DISCUSSION
Uncertainty sampling is possibly the most popular active learning strategy in practice. Perhaps this is because of its intuitive appeal combined with the ease of implementation. Most of the common uncertainty-based utility measures do not require significant engineering overhead to use. In fact, as long as the learner can provide a confidence or probability score along with its predictions, any of the measures in Section 2.3 can be employed with the learner as a “black box.” Standard classification or inference procedures can be used, leaving the choice of learning algorithm fairly modular. This is not necessarily the case for all active learning approaches. Furthermore, if inference is fast and tractable, then querying should also be fast and tractable.
Our discussion of uncertainty sampling has, thus far, been limited to the pool-based setting where the learner selects the “best” query from the set of unlabeled data U. Uncertainty sampling can also be employed in the stream-based selective sampling setting, where unlabeled instances are drawn x ∽ DX one at a time from an input distribution, and the learner decides on the spot whether to query or discard it. The simplest way to implement uncertainty-based selective sampling is to set a threshold on the uncertainty measure and use this to define a region of uncertainty. An instance is queried if it falls within the region of uncertainty, and discarded otherwise. The learner is re-trained after each new query instance (or batch of instances) is labeled and added to L.