|
|
|
Simon Tong |
|
Daphne Koller |
|
|
|
STANFORD UNIVERSITY |
|
|
|
|
Training a classifier for Sports web pages |
|
|
|
Need labeled training data |
|
|
|
Pay student to label a fixed number of randomly
sampled pages |
|
|
|
Can we get more mileage from our money by picking
“informative” pages to be labeled?
|
|
|
We show that this is possible |
|
|
|
|
Given pool of unlabeled data |
|
Query the label of one pool instance |
|
Repeat |
|
Output a classifier |
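The loop above can be sketched in Python. Here `query_fn`, `train`, and `oracle_label` are hypothetical placeholders for the querying function, the base learner, and the human labeler:

```python
def active_learn(pool, query_fn, train, oracle_label, n_queries):
    """Pool-based active learning: repeatedly query labels for
    informative pool instances, then output a classifier."""
    labeled = []                  # (instance, label) pairs gathered so far
    unlabeled = list(pool)        # remaining unlabeled pool
    for _ in range(n_queries):
        x = query_fn(labeled, unlabeled)   # pick next instance to label
        unlabeled.remove(x)
        labeled.append((x, oracle_label(x)))
    return train(labeled)         # output a classifier
```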
|
|
|
|
|
We shall show how to do this with SVMs |
|
|
|
|
|
|
The extra component in an Active learner: |
|
Q(L ∪ U) = u ∈ U
|
|
|
Tells us the unlabeled instance to query next |
|
|
|
|
|
|
|
|
|
|
|
|
|
Use Q to get our labeled training set |
|
Give training set to a standard learner |
|
|
|
|
|
Maximal margin hyperplane |
|
Highly effective for text classification |
|
Can we do even better with active learning? |
|
|
|
Assumption: |
|
Actual labeling of the entire pool is linearly
separable |
|
Remark: assumption can be relaxed |
|
|
|
We will give querying functions for SVMs |
|
|
|
|
|
|
|
|
|
Want to be close to SVM h* trained on all labels |
|
|
|
h* always in version space |
|
It separates all of the labels so it separates
any subset |
|
|
|
Receiving label of a pool instance reduces
version space |
|
|
|
Idea: choose pool instance that will reduce
version space as fast as possible |
|
|
|
|
|
|
Simple: choose unlabeled instance closest to
current SVM |
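A minimal NumPy sketch of the Simple criterion, assuming the current SVM's weight vector `w` and bias `b` are available from training on the labels gathered so far:

```python
import numpy as np

def simple_query(pool, w, b):
    """Return the index of the pool instance closest to the current
    SVM hyperplane w.x + b = 0 (smallest distance to the boundary)."""
    dists = np.abs(pool @ w + b) / np.linalg.norm(w)
    return int(np.argmin(dists))
```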
|
|
|
|
|
|
|
|
|
|
|
Simple: choose unlabeled instance closest to
current SVM |
|
|
|
Ratio: choose unlabeled instance for which the margins m+
(instance labeled +) and m- (instance labeled -) are similar in size
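A sketch of the Ratio criterion, here using scikit-learn's linear SVM (an assumption; any SVM trainer exposing the weight vector would do). For each candidate we train one SVM with the candidate labeled + and one with it labeled -, then pick the candidate whose resulting margins m+ and m- are most similar:

```python
import numpy as np
from sklearn.svm import SVC

def margin(X, y):
    """Margin 1/||w|| of a (near) hard-margin linear SVM trained on (X, y)."""
    clf = SVC(kernel="linear", C=1000.0).fit(X, y)
    return 1.0 / np.linalg.norm(clf.coef_)

def ratio_query(X_lab, y_lab, X_pool):
    """Index of the pool instance whose margins m+ and m- are most similar."""
    best_i, best_score = 0, -1.0
    for i, x in enumerate(X_pool):
        X_cand = np.vstack([X_lab, x])
        m_plus = margin(X_cand, np.append(y_lab, 1))    # candidate labeled +
        m_minus = margin(X_cand, np.append(y_lab, -1))  # candidate labeled -
        score = min(m_plus / m_minus, m_minus / m_plus)  # 1 when equal
        if score > best_score:
            best_i, best_score = i, score
    return best_i
```

This makes the expense concrete: Ratio trains two SVMs for every pool instance at every query.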
|
|
|
|
|
|
|
Simple |
|
cheap (only need to train one SVM per query) |
|
rougher approximation |
|
|
|
Ratio |
|
expensive (need two SVMs for each pool instance) |
|
only becomes much more expensive after several queries
(training is cheap while the labeled set is small)
|
better approximation |
|
|
|
|
|
Simple: choose unlabeled instance closest to
current SVM |
|
|
|
Ratio: choose unlabeled instance for which m+
and m- are similar in size |
|
|
|
Hybrid: do Ratio for first few queries then do Simple
for the rest |
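The Hybrid schedule is just a switch on the query count. In this sketch, `k`, `ratio_fn`, and `simple_fn` are hypothetical stand-ins for the switch-over point and the two querying criteria:

```python
def hybrid_query(queries_so_far, k, ratio_fn, simple_fn):
    """Use the Ratio criterion for the first k queries (while it is
    still cheap and Simple is a rough approximation), then Simple."""
    return ratio_fn() if queries_so_far < k else simple_fn()
```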
|
|
|
|
Reuters newswire stories |
|
|
|
Used ten most frequent topics |
|
|
|
Pool of 1000 unlabeled documents |
|
|
|
Initialized with 1 positive, 1 negative instance |
|
|
|
Compared Active with Passive (random) |
|
|
|
|
|
|
Average Accuracy over Ten Topics |
|
|
|
|
Topic       Simple      Ratio       Equiv. Random
            After 10    After 10    Size
---------   --------    --------    -------------
Earn          86.1        89.0         12
Acq           54.2        57.3         12
Money-fx      35.6        38.4         52
Grain         50.3        60.3         51
Crude         58.2        58.4         55
Trade         50.7        50.6         85
Interest      40.6        43.7         60
Ship          53.9        53.8       >100
Wheat         64.1        66.6       >100
Corn          49.5        46.3       >100
|
|
|
|
Average Accuracy over All Five Topics |
|
|
|
|
|
|
|
Three ways to do active learning with SVMs |
|
|
|
Active learning can significantly reduce the
need for labeled instances |
|
in a domain where SVMs already perform well |
|
|
|
Simple = cheap but unstable in some cases |
|
Ratio = expensive but more stable |
|
Hybrid = combines benefits of both methods |
|
|