• the fruits themselves are plentiful, easily harvested, and easily measured;
• testing for a fruit’s safety is what mainly incurs a cost: the fruit must be consumed, and the person who eats it risks falling ill.
Figure 1.3: A binary search through the set of ordered, untested alien fruits. By only testing this subset of fruits, we can exponentially reduce costs while achieving the same result. The labels shown in light blue can be inferred, and therefore do not need to be tested.
As it turns out, we can do better by performing a directed search through the set of fruit shapes. Taking the two observations above into account, we can augment our supervised learning strategy in the following way. First, let us gather an arbitrary number of unlabeled instances
for free (or very inexpensively) by simply harvesting a lot of fruits without testing them. Next, we can arrange these fruits in a sequence from round to irregular. Our goal is similar to the previous one: to discover where in this ordered series of fruits the transition from safe to noxious occurs, but by conducting as few tests as possible this time. If we execute a simple binary search through this ordered set of fruits, we only need to test ⌈log₂ U⌉ items, where U is the number of unlabeled fruits we harvested, instead of testing them all as we did before.
The process is illustrated in Figure 1.3. Given a set of unlabeled instances, this sequential algorithm would first test x = 5, then x = 7, and finally x = 6 before arriving at the exact same parameter value for θ*, alleviating the need to test the other six fruits (two of which happen to be harmful). This algorithmic speedup means that a classifier with error ε or less can be found with a mere O(log₂ 1/ε) tests, since all the other labels can be inferred. To put this in perspective, assume that 100 fruits were needed to obtain 99% accuracy in the earlier supervised setting. Then we would still need to harvest 100 fruits for our new binary search algorithm, but we would only need to test 6 or 7 of them to get the same guarantee. This is an exponential reduction in the number of tests necessary, which drastically reduces the number of people whose health is at risk, and helps the colony to make use of both safe and noxious fruits as quickly as possible.
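To make the procedure concrete, here is a minimal sketch in Python. It assumes noise-free labels, a single feature along which the fruits are already sorted, and that every fruit below the (unknown) threshold is safe while every fruit at or above it is noxious; the test_fruit oracle and the example values are hypothetical stand-ins, not part of the original illustration.

```python
def binary_search_labels(instances, test_fruit):
    """Find the safe-to-noxious threshold with at most ceil(log2 U) queries.

    `instances` must be sorted along the single feature (round -> irregular),
    and labels are assumed noise-free and monotonic: all fruits before the
    threshold are safe, all fruits from the threshold onward are noxious.
    `test_fruit(x)` is the costly oracle query (someone must eat fruit x).
    """
    lo, hi = 0, len(instances)          # the threshold index lies in [lo, hi]
    queries = 0
    while lo < hi:
        mid = (lo + hi) // 2
        queries += 1
        if test_fruit(instances[mid]) == "safe":
            lo = mid + 1                # transition must be to the right
        else:
            hi = mid                    # transition is at mid or to the left
    # Every remaining label can now be inferred from the threshold position.
    labels = ["safe"] * lo + ["noxious"] * (len(instances) - lo)
    return lo, labels, queries


# Hypothetical usage: nine fruits ordered from roundest to most irregular,
# with a threshold (unknown to the learner) after the sixth fruit.
if __name__ == "__main__":
    fruits = list(range(1, 10))                        # stand-in feature values
    oracle = lambda x: "safe" if x <= 6 else "noxious"
    threshold, labels, n_queries = binary_search_labels(fruits, oracle)
    print(threshold, labels, n_queries)                # 4 tests instead of 9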
1.2 ACTIVE LEARNING
The alien fruits example is a simple illustration of active learning. Active learning is a subfield of artificial intelligence and machine learning: the study of computer systems that improve with experience and training. Traditional “passive” learning systems induce a hypothesis to explain whatever training data happens to be available (e.g., a collection of labeled instances). By contrast, the hallmark of an active learning system is that it eagerly develops and tests new hypotheses as part of a continuing, interactive learning process. Another way to think about it is that active learners develop a “line of inquiry,” much in the way a scientist would design a series of experiments to help him or her draw conclusions as efficiently as possible. For example, the binary search algorithm in Figure 1.3 selects the next fruit to test—or query—after it has obtained an answer for the previous query from some oracle (or labeling source). For this reason, active learning is sometimes called “query learning” in the computational learning theory literature, and is closely related to work in “optimal experimental design” or “sequential design” in statistics.
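Stated abstractly, the interactive process just described reduces to a loop: the learner selects the instance it expects to be most informative, asks the oracle for its label, and re-trains before choosing the next query. The sketch below is only a generic illustration of that loop; the model, utility measure, oracle, and budget are hypothetical placeholders rather than any particular method from the literature.

```python
def active_learning_loop(pool, oracle, model, utility, budget):
    """Generic pool-based active learning loop (illustrative placeholder names)."""
    labeled = []                                   # (instance, label) pairs so far
    for _ in range(budget):                        # each oracle call has a cost
        if not pool:
            break
        # Query the instance the current model considers most informative.
        query = max(pool, key=lambda x: utility(model, x))
        pool.remove(query)
        labeled.append((query, oracle(query)))     # the costly labeling step
        model = model.fit(labeled)                 # induce a new hypothesis
        # (assumes a fit() method that returns the updated model)
    return model
```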
Of course, the alien fruits example is a bit contrived and overly simplistic. Yet it illustrates some of the basic properties of many real-world problems, and shows how much can be gained from having the learner ask questions or be more involved in its own training process. In today’s information-drenched society, unlabeled data are often abundant, but the labels required to do supervised learning from these instances are difficult, time-consuming, or expensive to obtain. Consider a few examples:
• Classification and filtering. Learning to classify documents (e.g., articles or web pages) or any other kind of media (e.g., image, audio, and video files) usually requires people to annotate each item with particular labels, such as relevant or not-relevant. Unlabeled instances abound from electronic resources like the Internet, but annotating thousands of these instances can be tedious and even redundant.
• Speech recognition. Accurate labeling of speech utterances is extremely time-consuming and requires trained linguists. While unannotated audio data is easily obtained from recording devices, Zhu and Goldberg (2009) have reported that annotation at the word level can take ten times longer than the actual audio (e.g., one minute of speech takes ten minutes to label), and annotating phonemes can take 400 times as long (e.g., nearly seven hours). The problem is compounded for rare languages or dialects.
• Information extraction. Systems that extract factual knowledge from text must be trained with detailed annotations of documents. Users highlight entities or relations of interest, such as person and organization names, or whether a person works for a particular organization. Locating entities and relations can take a half-hour or more for even simple newswire stories (Settles et al., 2008a). Annotations for specific knowledge domains may require additional expertise, e.g., annotating gene and disease mentions in biomedical text usually requires PhD-level biologists.
• Computational biology. Increasingly, machine learning is used to interpret and make predictions about data from the natural sciences, particularly biology. For example, biochemists can induce models that help explain and predict enzyme activity from hundreds of synthesized peptide chains (Smith et al., 2011). However, there are 20ⁿ possible peptides of length n, which for 8-mers yields 20⁸ ≈ 25.6 billion possibilities to synthesize and test. In practice, scientists might resort to random sequences, or cherry-picking subsequences of possibly interesting proteins, with no guarantee that either will provide much information about the chemical activity in question.
In all of these examples, data collection (for traditional supervised learning methods) comes with a hefty price tag, in terms of human effort and/or laboratory materials. If an active learning system is allowed to be part of the data collection process—to be “curious” about the task, if you will—the goal is to learn the task better with less training.
While the binary search strategy described in the previous section is a useful introduction to active learning, it is not directly applicable to most problems. For example, what if fruit safety is related not only to shape, but to size, color, and texture as well? Now we have four features to describe the input space instead of just one, and the simple binary search mechanism no longer works in these higher-dimensional spaces. Also consider that the bodies of different people might respond slightly differently to the same fruit, which introduces ambiguity or noise into the observations we use as labels. Most interesting real-world applications, like the ones in the list above, involve learning with hundreds or thousands of features (input dimensions), and the labels are often not 100% reliable.
The rest of this book is about the various ways we can apply the principles of active learning to machine learning problems in general. We focus primarily on classification, but touch on methods that apply to regression and structured prediction as well. Chapters 2–5 present, in detail, several query selection frameworks, or utility measures that can be used to decide which query the learner should ask next. Chapter 6 presents a unified view of these different query frameworks, and briefly touches on some theoretical guarantees for active learning. Chapter 7 summarizes the strengths and weaknesses of different active learning methods, as well as some practical considerations and a survey of more recent developments, with an eye toward the future of the field.
1.3 SCENARIOS FOR ACTIVE LEARNING
Before