CHAPTER 1
Automating Inquiry
“Computers are useless. They can only give you answers.”
— Pablo Picasso (attributed)
1.1 A THOUGHT EXPERIMENT
Imagine that you are the leader of a colonial expedition from Earth to an extrasolar planet. Luckily, this planet is habitable and has a fair amount of vegetation suitable for feeding your group. Importantly, the most abundant source of food comes from a plant whose fruits are sometimes smooth and round, but sometimes bumpy and irregular. See Figure 1.1 for some examples.
Figure 1.1: Several alien fruits, which vary in shape from round to irregular.
Almost immediately, physicians on the expedition notice that colonists who eat the smooth fruits find them delicious, while those who eat the irregular fruits tend to fall ill. Naturally, you want to be able to distinguish between which fruits are safe to eat and which ones are harmful. If you can accurately “classify” these fruits, it will be much easier to keep the colony healthy and well-fed while finding important alternative uses for the noxious fruits (such as processing them into dyes and fuels). The physicians assure you that the shape of a fruit is the only feature that seems related to its safety. The problem, though, is that a wide variety of fruit shapes from these plants exist: almost a continuous range from round to irregular. Since the colony has essential uses for both safe and noxious fruits, you want to be able to classify them as accurately as possible.
Supposing we have a way to quantify the “irregularity” of a fruit’s shape, we can formalize this classification task using a simple function. Let X be a range of data instances, e.g., fruits that have been harvested, where each x ∈ ℝ represents the irregularity measure of a particular fruit. Each data instance has a hidden label y, which in this example comes from the finite set Y = {safe, noxious}. Our classifier, then, is a function mapping h : X → Y, parameterized by a threshold θ:

h(x; θ) = safe if x < θ, and noxious otherwise.

The remaining challenge is to figure out what this threshold really is.
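As a concrete sketch, the threshold rule can be written in a few lines of code; the function name and the example threshold value 0.5 are illustrative choices, not values from the text.

```python
def classify(x, theta):
    """Label a fruit by its irregularity measure x: 'safe' below the
    threshold theta, 'noxious' at or above it."""
    return "safe" if x < theta else "noxious"

print(classify(0.3, 0.5))  # a round-ish fruit -> 'safe'
print(classify(0.8, 0.5))  # an irregular fruit -> 'noxious'
```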
One way to pick the threshold is by supervised learning techniques, where we estimate θ from a set of labeled data instances 〈x, y〉. Here, 〈x, y〉 denotes a fruit x which has been tested and assigned a label y, e.g., by having someone eat it and then observing whether or not he or she becomes ill. A simple learning algorithm in this case is to harvest a large number of fruits, arrange them in order from round to irregular in shape, and have them all tested. The point at which the fruits switch from being safe to noxious is the best threshold θ* (assuming for the moment that everyone responds to a fruit in the same way). Figure 1.2 illustrates this method for a handful of labeled instances.
Figure 1.2: Supervised learning for the alien fruits example. Given a set of 〈x, y〉 instance-label pairs, we want to choose the threshold θ* that classifies them most accurately.
According to the PAC learning1 framework (Valiant, 1984), if the underlying data distribution can be perfectly classified by some hypothesis h in the hypothesis class H (in this case, the set of all values for θ), then it is enough to test O(1/∊) randomly selected instances, where ∊ is the maximum desired error rate. This means that if we want to achieve 99% accuracy in our “fruit safety” classifier, we might need to harvest and test on the order of hundreds of fruits using this approach. That is a lot of fruits to test, though, and many of your family and friends might get sick in the process!
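The back-of-the-envelope arithmetic behind that figure is simple; since the O(1/∊) bound omits constants, this gives only an order-of-magnitude estimate.

```python
# PAC-style sample estimate for the fruit classifier (constants omitted).
eps = 0.01                 # maximum desired error rate, i.e., 99% accuracy
labels_needed = 1 / eps    # O(1/eps) randomly selected fruits to test
print(int(labels_needed))  # on the order of 100 fruits
```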
Can we do better? While we certainly want a threshold that classifies these fruits as accurately as possible, we also want to discover this threshold as quickly and as economically as possible, by requiring fewer tests while achieving the same results (ideally). Note some key properties of this problem:
• the fruits themselves are plentiful, easily harvested, and easily measured;
• testing for a fruit’s safety is what mainly incurs a cost: the fruit must be consumed, and the person who eats it risks falling ill.
Figure 1.3: A binary search through the set of ordered, untested alien fruits. By only testing this subset of fruits, we can exponentially reduce costs while achieving the same result. The labels shown in light blue can be inferred, and therefore do not need to be tested.
As it turns out, we can do better by performing a directed search through the set of fruit shapes. Taking the two observations above into account, we can augment our supervised learning strategy in the following way. First, let us gather an arbitrary number of unlabeled instances for free (or very inexpensively) by simply harvesting a lot of fruits without testing them. Next, we can arrange these fruits in a sequence from round to irregular. Our goal is similar to the previous one: to discover where in this ordered series of fruits the transition from safe to noxious occurs, but by conducting as few tests as possible this time. If we execute a simple binary search through this ordered set of fruits, we only need to test ⌈log2 U⌉ items, where U is the number of unlabeled fruits, instead of testing them all as we did before.
The process is illustrated in Figure 1.3. Given a set of unlabeled instances, this sequential algorithm would first test x = 5, then x = 7, and finally x = 6 before arriving at the exact same parameter value for θ*, alleviating the need to test the other six fruits (two of which happen to be harmful). This algorithmic speedup means that a classifier with error ∊ or less can be found with a mere O(log2 1/∊) tests, since all the other labels can be inferred. To put this in perspective, assume that 100 fruits were needed to obtain 99% accuracy in the earlier supervised setting. Then we would still need to harvest 100 fruits for our new binary search algorithm, but we would only need to test 6 or 7 of them to get the same guarantee. This is an exponential reduction in the number of tests necessary, which drastically reduces the number of people whose health is at risk, and helps the colony to make use of both safe and noxious fruits as quickly as possible.
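This binary search can be sketched as runnable code. The oracle function, the true threshold 0.55, and the 100 evenly spaced fruit measurements below are illustrative stand-ins, not values from the text.

```python
def oracle(x, true_theta=0.55):
    """Stand-in for a costly safety test (someone eats the fruit)."""
    return "safe" if x < true_theta else "noxious"

def binary_search_threshold(xs):
    """Binary-search an x-sorted list of untested fruits for the
    safe-to-noxious transition; returns (threshold, number of tests)."""
    lo, hi, tests = 0, len(xs) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        tests += 1
        if oracle(xs[mid]) == "safe":
            lo = mid + 1          # transition lies to the right of mid
        else:
            hi = mid - 1          # transition is at mid or to the left
    # 'lo' now indexes the first noxious fruit
    return (xs[lo] if lo < len(xs) else None), tests

xs = sorted(i / 100 for i in range(100))  # 100 harvested, untested fruits
theta, tests = binary_search_threshold(xs)
print(theta, tests)  # 0.55 found with only 7 tests instead of 100
```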
1.2 ACTIVE LEARNING
The alien fruits example is a simple illustration of active learning. Active learning is a subfield of artificial intelligence and machine learning: the study of computer systems that improve with experience and training. Traditional “passive” learning systems induce a hypothesis to explain whatever training data happens to be available (e.g., a collection of labeled instances). By contrast, the hallmark of an active learning system is that it eagerly develops and tests new hypotheses as part of a continuing, interactive learning process. Another way to think about it is that active learners develop a “line of inquiry,” much in the way a scientist would design a series of experiments to help him or her draw conclusions as efficiently as possible. For example, the binary search algorithm in Figure 1.3 selects the next fruit to test—or query—after it has obtained an answer for the previous query from some oracle (or labeling source). For this reason, active learning is sometimes called “query learning” in the computational learning theory literature, and is closely related to work in “optimal experimental design” or “sequential design” in statistics.
Of course, the alien fruits example is a bit contrived and overly simplistic. Yet it illustrates some of the basic properties of many real-world problems, and shows how much can be gained from having the learner ask questions or be more involved in its own training process. In today’s information-drenched society, unlabeled data are often abundant, but the labels required to do supervised learning from these instances are difficult, time-consuming, or expensive to obtain. Consider a few examples:
• Classification and filtering. Learning to classify documents (e.g., articles or web pages) or any other kind of media (e.g., image, audio, and video files) usually requires people to annotate each item with particular labels, such as relevant or not-relevant. Unlabeled instances abound from electronic resources like the Internet, but annotating thousands of these instances can be tedious and even redundant.
• Speech recognition. Accurate labeling of speech utterances is extremely time consuming and requires trained linguists. While unannotated audio data is easily obtained from recording devices, Zhu and Goldberg (2009) have reported that annotation at the word level can take ten times longer than the actual audio (e.g., one minute of speech takes ten minutes to label), and annotating phonemes can take 400 times as long (e.g., nearly seven hours). The problem is compounded for rare languages or dialects.
• Information extraction. Systems that extract factual knowledge from text must be trained with detailed annotations of documents. Users highlight entities or relations of interest, such as person and organization names, or whether a person works for a particular organization. Locating entities and relations can take a half-hour or more for even simple newswire stories (Settles et al., 2008a). Annotations for specific knowledge domains may require additional expertise, e.g., annotating gene and disease mentions in biomedical text usually requires PhD-level biologists.
• Computational Biology. Increasingly, machine learning is used to interpret and make predictions about data from the natural sciences, particularly biology. For example, biochemists can induce models that help explain and predict enzyme activity from hundreds of synthesized peptide chains (Smith et al., 2011). However, there are 20^n possible peptides of length n, which for 8-mers yields 20^8 ≈ 2.6 × 10^10 possibilities to synthesize and test. In practice, scientists might resort to random sequences, or cherry-picking subsequences of possibly interesting proteins, with no guarantee that either will provide much information about the chemical activity in question.
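The combinatorial explosion in the last example is easy to verify directly: with 20 possible amino acids at each of n positions, the space of candidates grows exponentially.

```python
# Number of distinct peptides of length n, with 20 amino acids per position.
n = 8
print(20 ** n)  # 25600000000, i.e., roughly 2.6 x 10^10 possible 8-mers
```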
In all of these examples, data collection (for traditional supervised learning methods) comes with a hefty price tag, in terms of human effort and/or laboratory materials. If an active learning system is allowed to be part of the data collection process—to be “curious” about the task, if you will—the goal is to learn the task better with less training.
While the binary search strategy described in the previous section is a useful introduction to active learning, it is not directly applicable to most problems. For example, what if fruit safety is related not only to shape, but to size, color, and texture as well? Now we have four features to describe the input space instead of just one, and the simple binary search mechanism no longer works in these higher-dimensional spaces. Also consider that the bodies of different people might respond slightly differently to the same fruit, which introduces ambiguity or noise into the observations we use as labels. Most interesting real-world applications, like the ones in the list above, involve learning with hundreds or thousands of features (input dimensions), and the labels are often not 100% reliable.
The rest of this book is about the various ways we can apply the principles of active learning to machine learning problems in general. We focus primarily on classification, but touch on methods that apply to regression and structured prediction as well. Chapters 2–5 present, in detail, several query selection frameworks, or utility measures that can be used to decide which query the learner should ask next. Chapter 6 presents a unified view of these different query frameworks, and briefly touches on some theoretical guarantees for active learning. Chapter 7 summarizes the strengths and weaknesses of different active learning methods, as well as some practical considerations and a survey of more recent developments, with an eye toward the future of the field.
1.3 SCENARIOS FOR ACTIVE LEARNING
Before diving into query selection algorithms, it is worth discussing scenarios in which active learning may (or may not) be appropriate, and the different ways in which queries might be generated. In some applications, instance labels come at little or no cost, such as the “spam” flag you mark on unwanted email messages, or the five-star rating you might give to films on a social networking website. Learning systems use these flags and ratings to better filter your junk email and suggest new movies you might enjoy.
In cases like this, you probably have other incentives for providing these labels—like keeping your inbox or online movie library organized—so you provide many such labels “for free.” Deploying an active learning system to carefully select queries in these cases may require significant engineering overhead, with little or no gains in predictive accuracy. Also, when only a relatively small number (e.g., tens or hundreds) of labeled instances are needed to train an accurate model, it may not be appropriate to use active learning. The expense of implementing the query framework might be greater than merely collecting a handful of labeled instances, which might be sufficient.
Active learning is most appropriate when the (unlabeled) data instances themselves are numerous, can be easily collected or synthesized, and you anticipate having to label many of them to train an accurate system. It is also generally assumed that the oracle answers queries about instance labels, and that the appropriate hypothesis class for the problem is more or less already decided upon (naive Bayes, decision trees, neural networks, etc.). These last two assumptions do not always hold, but for now let us assume that queries take the form of unlabeled instances, and that the hypothesis class is known and fixed2. Given that active learning is appropriate, there are several different specific ways in which the learner may be able to ask queries. The main scenarios that have been considered in the literature are (1) query synthesis, (2) stream-based selective sampling, and (3) pool-based sampling.
Query Synthesis. One of the first active learning scenarios to be investigated is learning with membership queries (Angluin, 1988). In this setting, the learner may request the label of any unlabeled data instance in the input space, including queries that the learner synthesizes de novo. The only assumption is that the learner has a definition of the input space (i.e., the feature dimensions and ranges) available to it. Figure 1.4(a) illustrates the active learning cycle for the query synthesis scenario. Query synthesis is often tractable and efficient for finite problem domains (Angluin, 2001). The idea of synthesizing queries has also been extended to regression learning tasks, such as learning to predict the absolute coordinates of a robot hand given the joint angles of its mechanical arm as inputs (Cohn et al., 1996). Here the robot decides which joint configuration to test next, and executes a sequence of movements to reach that configuration, obtaining resulting coordinates that can be used as a training signal.
Query synthesis is reasonable for some problems, but labeling such arbitrary instances can be awkward and sometimes problematic. For example, Lang and Baum (1992) employed membership query learning with human oracles to train a neural network to classify handwritten characters. They encountered an unexpected problem: many of the query images generated by the learner contained no recognizable symbols; these were merely artificial hybrid characters with little or no natural semantic meaning. See Figure 1.4(b) for a few examples: is the image in the upper-right hand corner a 5, an 8, or a 9? It stands to reason that this ambiguous image could help the learner discriminate among the different characters, if people were able to discriminate among them as well. Similarly, one could imagine query synthesis for natural language tasks creating streams of text or audio speech that amount to gibberish. The problem is that the data-generating distribution is not necessarily taken into account (and may not even be known), so an active learner runs the risk of querying arbitrary instances devoid of meaning. The stream-based and pool-based scenarios (described shortly) have been proposed to address these limitations.
Figure 1.4: (a) An active learner might synthesize query instances de novo. (b) Query synthesis can result in awkward and uninterpretable queries, such as these images generated by a neural network attempting to learn how to recognize handwritten digits. Source: Lang and Baum (1992), reprinted with kind permission of the authors.
Nevertheless, King et al. (2004, 2009) found a promising real-world application of query synthesis. They employ a “robot scientist” which executes autonomous biological experiments to discover metabolic pathways in the yeast Saccharomyces cerevisiae. Here, an instance corresponds to a mixture of chemical solutions that constitute a growth medium crossed with a particular yeast mutant. A label, then, is whether or not the mutant thrived in the growth medium. Experiments are chosen using an active learning approach based on inductive logic programming, and physically performed using a laboratory robot. The active approach results in a three-fold decrease in the cost of experimental materials compared to naively running the least expensive experiment, and a 100-fold decrease in cost compared to randomly chosen experiments. In this task, all query instances correspond to well-defined and meaningful experiments for the robot (or even a human) to perform. In other situations, arbitrary queries may be meaningless and nonsensical and thus difficult for an oracle to make judgements about.
Stream-Based Selective Sampling. An alternative to synthesizing queries is selective sampling (Atlas et al., 1990; Cohn et al., 1994). The key assumption is that obtaining an unlabeled instance is free (or inexpensive), so it can first be sampled from the actual distribution, and then the learner can decide whether or not to request its label. The approach is illustrated in Figure 1.5(a). This is sometimes called stream-based active learning, as each unlabeled instance is typically drawn one at a time from the input source, and the learner must decide whether to query or discard it. If the input distribution is uniform, selective sampling may not offer any particular advantages over query synthesis. However, if the distribution is non-uniform and (more importantly) unknown, we are guaranteed that queries will still be sensible, since they come from a real underlying distribution.
Figure 1.5: (a) In selective sampling, the learner decides whether to query or discard items from a stream of input instances. (b) Pool-based active learning queries the most informative instance from a large pool of available unlabeled data U.
The decision whether to query or discard an instance can be framed several ways. One approach is to define a measure of utility or information content (several such measures are discussed in Chapters 2–5) and make a biased random decision, such that instances with higher utility are more likely to be queried (cf. Dagan and Engelson, 1995). Another approach is to compute an explicit region of uncertainty (Cohn et al., 1994), i.e., the part of the instance space that is still ambiguous to the learner, and only query instances that fall within it. A naive way of doing this is to set a minimum threshold on a utility measure which defines the region; instances whose evaluation is above this threshold are then queried. A somewhat more principled approach is to define the region that is still unknown to the overall model class, i.e., to the set of hypotheses consistent with the current labeled training set, called the version space (Mitchell, 1982). In other words, if any two models of the same model class (but different parameter settings) agree on all the labeled data, but disagree on an unlabeled instance sampled from the data source, then that instance lies within the region of uncertainty. Calculating this region completely and explicitly is computationally expensive, however, and it must be maintained after each new query. As a result, approximations are used in practice (more details in Chapter 3).
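For the one-dimensional fruit task, the region of uncertainty is just an interval, which makes the stream-based idea easy to sketch. Everything here, from the function names to the threshold 0.55, is an illustrative assumption rather than an algorithm from the text.

```python
import random

def selective_sampling(stream, oracle, lo=0.0, hi=1.0):
    """Query only instances inside the current uncertainty interval
    [lo, hi]; discard the rest, whose labels are already implied."""
    queries = 0
    for x in stream:
        if lo <= x <= hi:          # still ambiguous: pay to query it
            queries += 1
            if oracle(x) == "safe":
                lo = x             # threshold must lie above x
            else:
                hi = x             # threshold must lie at or below x
    return (lo + hi) / 2, queries

random.seed(0)
stream = [random.random() for _ in range(1000)]
oracle = lambda x: "safe" if x < 0.55 else "noxious"
theta_hat, n = selective_sampling(stream, oracle)
print(n)  # only a handful of the 1,000 streamed instances are queried
```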
The stream-based scenario has been studied in several real-world tasks, including part-of-speech tagging (Dagan and Engelson, 1995), sensor scheduling (Krishnamurthy, 2002), and learning ranking functions for information retrieval (Yu, 2005). Fujii et al. (1998) employed selective sampling for active learning in word sense disambiguation, e.g., determining if the word “bank” means land alongside a river or a financial institution in a given context (their study focused on Japanese words). The approach not only reduces annotation effort, but also limits the size of the database used in nearest-neighbor learning, which in turn expedites the classification algorithm.
It is worth noting that some authors (e.g., Moskovitch et al., 2007; Thompson et al., 1999) use the term “selective sampling” to refer to the pool-based scenario described next. Under this interpretation, the term merely signifies that queries are made with a selected subset of instances sampled from a real data distribution. However, in most of the literature selective sampling refers to the stream-based scenario described here.
Pool-Based Sampling. For many real-world learning problems, large collections of unlabeled data can be gathered at once. This motivates pool-based sampling (Lewis and Gale, 1994), which assumes that there is a small set of labeled data L and a large pool of unlabeled data U available. The approach is illustrated in Figure 1.5(b). Queries are selected from the pool, which is usually assumed to be closed (i.e., static or non-changing), although this is not strictly necessary. Queries are typically chosen in a greedy fashion, according to a utility measure used to evaluate all instances in the pool (or, perhaps if U is very large, a subsample thereof). The binary search algorithm for the alien fruits example in Section 1.1 is a pool-based active learning algorithm.
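A minimal pool-based loop for the same threshold task might look as follows: each round, greedily query the unlabeled instance closest to the current boundary estimate (a simple uncertainty utility). The names, the utility measure, and the threshold 0.55 are all illustrative assumptions for the sketch.

```python
def pool_based_active_learning(pool, oracle, rounds=10):
    """Greedily query the pool instance nearest the current boundary."""
    lo, hi = 0.0, 1.0                 # interval still consistent with labels
    unlabeled = set(pool)             # the pool U of unlabeled instances
    for _ in range(rounds):
        theta_hat = (lo + hi) / 2
        # utility = proximity to the current decision boundary estimate
        x = min(unlabeled, key=lambda u: abs(u - theta_hat))
        unlabeled.discard(x)
        if oracle(x) == "safe":
            lo = max(lo, x)
        else:
            hi = min(hi, x)
    return (lo + hi) / 2

pool = [i / 200 for i in range(200)]  # 200 harvested, unlabeled fruits
oracle = lambda x: "safe" if x < 0.55 else "noxious"
theta_hat = pool_based_active_learning(pool, oracle)
print(theta_hat)  # close to the true threshold after only 10 queries
```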
The pool-based scenario has been studied for many real-world problem domains in machine learning, such as text classification (Hoi et al., 2006a; Lewis and Gale, 1994; McCallum and Nigam, 1998; Tong and Koller, 2000), information extraction (Settles and Craven, 2008; Thompson et al., 1999), image classification and retrieval (Tong and Chang, 2001; Zhang and Chen, 2002), video classification and retrieval (Hauptmann et al., 2006; Yan et al., 2003), speech recognition (Tür et al., 2005), and cancer diagnosis (Liu, 2004), to name only a few. In fact, pool-based sampling appears to be the most popular scenario for applied research in active learning, whereas query synthesis and stream-based selective sampling are more common in the theoretical literature.
The main difference between stream-based and pool-based active learning is that the former obtains one instance at a time, sequentially from some streaming data source (or by scanning through the data), and makes each query decision individually. Pool-based active learning, on the other hand, evaluates and ranks the entire collection of unlabeled data before selecting the best query. While the pool-based scenario appears to be much more common among application papers, one can imagine settings where the stream-based approach is more appropriate: for example, when memory or processing power is limited, as with mobile and embedded devices, or when the data set is too large to load into memory and must be scanned sequentially from disk. Unless otherwise noted, however, we will assume a pool-based scenario for the algorithms discussed in the remainder of this book.
1. More detail on PAC learning and active learning will be discussed in Chapter 6.
2. Relaxations of these and other assumptions are discussed in Chapter 7.