Support Vector Machine
Active Learning with Applications to Text Classification
Simon Tong
Daphne Koller
STANFORD UNIVERSITY

Motivation
Training a classifier for Sports web pages
Need labeled training data
Pay a student to label a fixed number of randomly sampled pages
Can we get more mileage from our money by picking “informative” pages to be labeled?
We show that this is possible

Pool Based Active Learning
Given a pool of unlabeled data
Repeatedly query the label of one pool instance
Finally, output a classifier
We shall show how to do this with SVMs
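The loop above can be sketched on a toy 1-D problem (data and names here are hypothetical, purely for illustration): the learner keeps a threshold classifier, and the query function asks for the label of the pool point nearest the current boundary.

```python
# Pool-based active learning on a toy 1-D problem. The true concept is
# "label +1 iff x > 5"; the learner's classifier is a threshold placed
# midway between the innermost labeled - and + points (the 1-D analogue
# of a maximal-margin boundary).

def fit(labeled):
    neg = [x for x, y in labeled if y == -1]
    pos = [x for x, y in labeled if y == 1]
    return (max(neg) + min(pos)) / 2.0

def oracle(x):                      # the human labeler we pay per query
    return 1 if x > 5 else -1

pool = [1.0, 2.0, 3.0, 4.0, 4.6, 5.2, 6.0, 7.0, 8.0, 9.0]
labeled = [(0.0, -1), (10.0, 1)]    # one seed example per class

for _ in range(4):                  # query budget
    boundary = fit(labeled)
    x = min(pool, key=lambda p: abs(p - boundary))  # most "informative"
    pool.remove(x)
    labeled.append((x, oracle(x)))  # pay for exactly this label

boundary = fit(labeled)             # ends up between 4.6 and 5.2
```

Four targeted queries localize the boundary to (4.6, 5.2); random sampling would need far more labels to land points that close to the true threshold.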

The Querying Function
The extra component in an active learner:
Q(L ∪ U) = u ∈ U
Tells us which unlabeled instance to query next
Use Q to build our labeled training set
Give the training set to a standard learner

Support Vector Machines
Maximal margin hyperplane
Highly effective for text classification
Can we do even better with active learning?
Assumption:
The true labeling of the entire pool is linearly separable
Remark: assumption can be relaxed
We will give querying functions for SVMs

Version Space
Set of all hyperplanes consistent with the labeled data seen so far

Choosing the Query
Goal: get close to h*, the SVM trained on the labels of the entire pool
h* is always in the version space
It separates the whole pool, so it is consistent with any labeled subset
Receiving the label of a pool instance shrinks the version space
Idea: choose the pool instance that shrinks the version space as fast as possible

Choosing the Query: Approximation I
The instance closest to the current hyperplane comes closest to bisecting the version space

Summary of Methods
Simple: choose unlabeled instance closest to current SVM
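The Simple method can be sketched with scikit-learn's SVC as the base learner (an assumption of this sketch; any SVM implementation works) on synthetic linearly separable data:

```python
# Simple-margin querying: repeatedly train an SVM on the labeled set and
# query the pool point closest to the current hyperplane, i.e. the one
# with the smallest |decision_function| value.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_pool = rng.normal(size=(200, 2))
y_pool = np.where(X_pool[:, 0] + X_pool[:, 1] > 0, 1, -1)  # separable

# Seed with one instance of each class.
labeled = [int(np.argmax(y_pool)), int(np.argmin(y_pool))]

clf = SVC(kernel="linear", C=1e6)          # large C ~ hard margin
for _ in range(10):                        # query budget
    clf.fit(X_pool[labeled], y_pool[labeled])
    dist = np.abs(clf.decision_function(X_pool))
    dist[labeled] = np.inf                 # never re-query a labeled point
    labeled.append(int(np.argmin(dist)))   # Simple: closest to hyperplane

clf.fit(X_pool[labeled], y_pool[labeled])
accuracy = clf.score(X_pool, y_pool)
```

Only one SVM is trained per query, which is what makes Simple cheap.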

Choosing the Query: Approximation II
For each candidate, train one SVM with it labeled + and one with it labeled -
The resulting margins m+ and m- estimate the sizes of the two version-space halves

Summary of Methods
Simple: choose unlabeled instance closest to current SVM
Ratio: choose unlabeled instance for which m+ and m- are similar in size
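A sketch of the Ratio query, again assuming scikit-learn's SVC as the SVM (the data in the demo is hypothetical). For a hard-margin linear SVM the geometric margin is 1/||w||, so m+ and m- can be read off the trained weight vectors:

```python
# Ratio query: for each pool candidate, train one SVM with the candidate
# labeled +1 and one with it labeled -1; margins m+ = 1/||w+|| and
# m- = 1/||w-|| estimate the two version-space halves. Pick the candidate
# whose halves are most balanced.
import numpy as np
from sklearn.svm import SVC

def margin(X, y):
    """Geometric margin of a (nearly) hard-margin linear SVM: 1/||w||."""
    clf = SVC(kernel="linear", C=1e6).fit(X, y)
    return 1.0 / (np.linalg.norm(clf.coef_) + 1e-12)

def ratio_query(X_lab, y_lab, X_pool):
    """Return the pool index whose +/- margins are most balanced."""
    best_i, best_score = 0, -1.0
    for i, x in enumerate(X_pool):
        Xc = np.vstack([X_lab, x])
        m_plus = margin(Xc, np.append(y_lab, 1))
        m_minus = margin(Xc, np.append(y_lab, -1))
        score = min(m_plus / m_minus, m_minus / m_plus)  # balance in (0, 1]
        if score > best_score:
            best_i, best_score = i, score
    return best_i

# Toy demo: two labeled points on the x-axis and two candidates; the one
# sitting on the current boundary should be chosen.
X_lab = np.array([[-2.0, 0.0], [2.0, 0.0]])
y_lab = np.array([-1, 1])
pool = np.array([[0.0, 0.0], [3.0, 0.0]])
q = ratio_query(X_lab, y_lab, pool)
```

This is where the expense comes from: every query requires training two SVMs for every pool instance.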

Simple vs Ratio
Simple
cheap (only need to train one SVM per query)
rougher approximation
Ratio
expensive (need to train two SVMs for each pool instance)
only becomes much more expensive than Simple after several queries
better approximation

Summary of Methods
Simple: choose unlabeled instance closest to current SVM
Ratio: choose unlabeled instance for which m+ and m- are similar in size
Hybrid: do Ratio for first few queries then do Simple for the rest
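The Hybrid schedule is just a dispatcher: Ratio for the first k queries, when the version space is large and Simple can be unstable, then the cheap Simple query. A minimal sketch, where the two query functions passed in are hypothetical placeholders rather than the real implementations:

```python
# Hybrid schedule: dispatch to the Ratio query for the first k queries,
# then to the Simple query for the rest.

def make_hybrid(ratio_query, simple_query, k):
    """Return a query function that uses Ratio first, Simple after."""
    state = {"queries": 0}
    def query(labeled, pool):
        state["queries"] += 1
        pick = ratio_query if state["queries"] <= k else simple_query
        return pick(labeled, pool)
    return query

# Toy placeholders: "Ratio" takes the median pool element, "Simple" the
# first, so the switchover after k=2 queries is visible in the picks.
hybrid = make_hybrid(lambda L, P: sorted(P)[len(P) // 2],
                     lambda L, P: P[0],
                     k=2)
picks = [hybrid([], [3, 1, 2]) for _ in range(4)]
```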

Experiments
Reuters newswire stories
Used ten most frequent topics
Pool of 1000 unlabeled documents
Initialized with 1 positive, 1 negative instance
Compared Active with Passive (random)

Reuters Newswire
Average Accuracy over Ten Topics

Reuters Newswire
Topic      Simple       Ratio        Equiv. random
           (after 10)   (after 10)   sample size
Earn         86.1         89.0          12
Acq          54.2         57.3          12
Money-fx     35.6         38.4          52
Grain        50.3         60.3          51
Crude        58.2         58.4          55
Trade        50.7         50.6          85
Interest     40.6         43.7          60
Ship         53.9         53.8         >100
Wheat        64.1         66.6         >100
Corn         49.5         46.3         >100

Comp.* Newsgroups
Average Accuracy over All Five Topics

Newsgroups: IBM and Misc

Reuters: Transduction

Conclusions
Three ways to do active learning with SVMs
Active learning can significantly reduce the need for labeled instances
in a domain where SVMs already perform well
Simple = cheap but unstable in some cases
Ratio = expensive but more stable
Hybrid = combines benefits of both methods

Reuters: Different Pool Sizes