Support Vector Machine
Active Learning with Applications to Text Classification
Simon Tong
Daphne Koller
STANFORD UNIVERSITY

Motivation
Training a classifier for Sports web pages
Need labeled training data
Pay a student to label a fixed number of randomly sampled pages
Can we get more mileage from our money by picking “informative” pages to be labeled?
We show that this is possible

Pool Based Active Learning
Given a pool of unlabeled data
Repeatedly query the label of one pool instance
Finally, output a classifier
We shall show how to do this with SVMs
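The loop above can be sketched on a toy 1-D problem (data and names here are hypothetical, purely for illustration): the learner keeps a threshold classifier, and the query function asks for the label of the pool point nearest the current boundary.

```python
# Pool-based active learning on a toy 1-D problem. The true concept is
# "label +1 iff x > 5"; the learner's classifier is a threshold placed
# midway between the innermost labeled - and + points (the 1-D analogue
# of a maximal-margin boundary).

def fit(labeled):
    neg = [x for x, y in labeled if y == -1]
    pos = [x for x, y in labeled if y == 1]
    return (max(neg) + min(pos)) / 2.0

def oracle(x):                      # the human labeler we pay per query
    return 1 if x > 5 else -1

pool = [1.0, 2.0, 3.0, 4.0, 4.6, 5.2, 6.0, 7.0, 8.0, 9.0]
labeled = [(0.0, -1), (10.0, 1)]    # one seed example per class

for _ in range(4):                  # query budget
    boundary = fit(labeled)
    x = min(pool, key=lambda p: abs(p - boundary))  # most "informative"
    pool.remove(x)
    labeled.append((x, oracle(x)))  # pay for exactly this label

boundary = fit(labeled)             # ends up between 4.6 and 5.2
```

Four targeted queries localize the boundary to (4.6, 5.2); random sampling would need far more labels to land points that close to the true threshold.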

The Querying Function
The extra component in an active learner:
Q(L ∪ U) = u ∈ U
Tells us which unlabeled instance to query next
Use Q to build our labeled training set
Give the training set to a standard learner

Support Vector Machines
Maximal margin hyperplane
Highly effective for text classification
Can we do even better with active learning?
Assumption:
The true labeling of the entire pool is linearly separable
Remark: assumption can be relaxed
We will give querying functions for SVMs

Version Space
Set of all hyperplanes consistent with the labeled data seen so far

Choosing the Query
Goal: get close to h*, the SVM trained on the labels of the entire pool
h* is always in the version space
It separates the whole pool, so it is consistent with any labeled subset
Receiving the label of a pool instance shrinks the version space
Idea: choose the pool instance that shrinks the version space as fast as possible

Choosing the Query: Approximation I
The instance closest to the current hyperplane comes closest to bisecting the version space

Summary of Methods
Simple: choose unlabeled instance closest to current SVM
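The Simple method can be sketched with scikit-learn's SVC as the base learner (an assumption of this sketch; any SVM implementation works) on synthetic linearly separable data:

```python
# Simple-margin querying: repeatedly train an SVM on the labeled set and
# query the pool point closest to the current hyperplane, i.e. the one
# with the smallest |decision_function| value.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_pool = rng.normal(size=(200, 2))
y_pool = np.where(X_pool[:, 0] + X_pool[:, 1] > 0, 1, -1)  # separable

# Seed with one instance of each class.
labeled = [int(np.argmax(y_pool)), int(np.argmin(y_pool))]

clf = SVC(kernel="linear", C=1e6)          # large C ~ hard margin
for _ in range(10):                        # query budget
    clf.fit(X_pool[labeled], y_pool[labeled])
    dist = np.abs(clf.decision_function(X_pool))
    dist[labeled] = np.inf                 # never re-query a labeled point
    labeled.append(int(np.argmin(dist)))   # Simple: closest to hyperplane

clf.fit(X_pool[labeled], y_pool[labeled])
accuracy = clf.score(X_pool, y_pool)
```

Only one SVM is trained per query, which is what makes Simple cheap.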

Choosing the Query: Approximation II
For each candidate, train one SVM with it labeled + and one with it labeled -
The resulting margins m+ and m- estimate the sizes of the two version-space halves

Summary of Methods
Simple: choose unlabeled instance closest to current SVM
Ratio: choose unlabeled instance for which m+ and m- are similar in size
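A sketch of the Ratio query, again assuming scikit-learn's SVC as the SVM (the data in the demo is hypothetical). For a hard-margin linear SVM the geometric margin is 1/||w||, so m+ and m- can be read off the trained weight vectors:

```python
# Ratio query: for each pool candidate, train one SVM with the candidate
# labeled +1 and one with it labeled -1; margins m+ = 1/||w+|| and
# m- = 1/||w-|| estimate the two version-space halves. Pick the candidate
# whose halves are most balanced.
import numpy as np
from sklearn.svm import SVC

def margin(X, y):
    """Geometric margin of a (nearly) hard-margin linear SVM: 1/||w||."""
    clf = SVC(kernel="linear", C=1e6).fit(X, y)
    return 1.0 / (np.linalg.norm(clf.coef_) + 1e-12)

def ratio_query(X_lab, y_lab, X_pool):
    """Return the pool index whose +/- margins are most balanced."""
    best_i, best_score = 0, -1.0
    for i, x in enumerate(X_pool):
        Xc = np.vstack([X_lab, x])
        m_plus = margin(Xc, np.append(y_lab, 1))
        m_minus = margin(Xc, np.append(y_lab, -1))
        score = min(m_plus / m_minus, m_minus / m_plus)  # balance in (0, 1]
        if score > best_score:
            best_i, best_score = i, score
    return best_i

# Toy demo: two labeled points on the x-axis and two candidates; the one
# sitting on the current boundary should be chosen.
X_lab = np.array([[-2.0, 0.0], [2.0, 0.0]])
y_lab = np.array([-1, 1])
pool = np.array([[0.0, 0.0], [3.0, 0.0]])
q = ratio_query(X_lab, y_lab, pool)
```

This is where the expense comes from: every query requires training two SVMs for every pool instance.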

Simple vs Ratio
Simple
cheap (only need to train one SVM per query)
rougher approximation
Ratio
expensive (need to train two SVMs for each pool instance)
only becomes much more expensive than Simple after several queries
better approximation

Summary of Methods
Simple: choose unlabeled instance closest to current SVM
Ratio: choose unlabeled instance for which m+ and m- are similar in size
Hybrid: do Ratio for first few queries then do Simple for the rest
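The Hybrid schedule is just a dispatcher: Ratio for the first k queries, when the version space is large and Simple can be unstable, then the cheap Simple query. A minimal sketch, where the two query functions passed in are hypothetical placeholders rather than the real implementations:

```python
# Hybrid schedule: dispatch to the Ratio query for the first k queries,
# then to the Simple query for the rest.

def make_hybrid(ratio_query, simple_query, k):
    """Return a query function that uses Ratio first, Simple after."""
    state = {"queries": 0}
    def query(labeled, pool):
        state["queries"] += 1
        pick = ratio_query if state["queries"] <= k else simple_query
        return pick(labeled, pool)
    return query

# Toy placeholders: "Ratio" takes the median pool element, "Simple" the
# first, so the switchover after k=2 queries is visible in the picks.
hybrid = make_hybrid(lambda L, P: sorted(P)[len(P) // 2],
                     lambda L, P: P[0],
                     k=2)
picks = [hybrid([], [3, 1, 2]) for _ in range(4)]
```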

Experiments
Reuters newswire stories
Used ten most frequent topics
Pool of 1000 unlabeled documents
Initialized with 1 positive, 1 negative instance
Compared Active with Passive (random)

Reuters Newswire
Average Accuracy over Ten Topics

Reuters Newswire
Topic      Simple       Ratio        Equiv. random
           (after 10)   (after 10)   sample size
Earn         86.1         89.0          12
Acq          54.2         57.3          12
Money-fx     35.6         38.4          52
Grain        50.3         60.3          51
Crude        58.2         58.4          55
Trade        50.7         50.6          85
Interest     40.6         43.7          60
Ship         53.9         53.8         >100
Wheat        64.1         66.6         >100
Corn         49.5         46.3         >100

Comp.* Newsgroups
Average Accuracy over All Five Topics

Newsgroups: IBM and Misc

Reuters: Transduction

Conclusions
Three ways to do active learning with SVMs
Active learning can significantly reduce the need for labeled instances
in a domain where SVMs already perform well
Simple = cheap but unstable in some cases
Ratio = expensive but more stable
Hybrid = combines benefits of both methods

Reuters: Different Pool Sizes