Active Learning for Parameter Estimation in Bayesian Networks
Simon Tong
Daphne Koller
Stanford University

Motivation
Conducting a lung cancer study
Limited resources – can examine just a few people
Have a list of people’s ages and whether they smoke
Which people should we choose to examine?
Goal: to learn as much about the distribution as possible
Question: can we do better than random sampling?

Active Learning in Bayes Nets

Selective versus Interventional
Two ways of fixing controllable nodes:
Intervene to force node to take a certain value
e.g., perform an experiment
Select data instances whose values match our desired template
e.g., our lung cancer example

Parameter estimation
Given a Bayesian Network structure G
Want to estimate the CPD parameters θ
We will be using discrete variables with multinomial table CPDs
And Dirichlet densities over parameters
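As a concrete illustration (not the authors' code), the parameter density can be represented as one Dirichlet per node and parent instantiation; the toy two-node structure and names below are assumptions for illustration only.

import itertools
import numpy as np

# Toy structure: child -> list of parents (illustrative, not the Asia/Alarm networks)
structure = {"Smoking": [], "Cancer": ["Smoking"]}
cards = {"Smoking": 2, "Cancer": 2}          # number of discrete values per variable

def make_dirichlets(structure, cards, prior=1.0):
    """One Dirichlet hyperparameter vector per (node, parent instantiation)."""
    dirichlets = {}
    for node, parents in structure.items():
        for u in itertools.product(*(range(cards[p]) for p in parents)):
            dirichlets[(node, u)] = np.full(cards[node], prior)
    return dirichlets

p_density = make_dirichlets(structure, cards)
# Three families: ('Smoking', ()), ('Cancer', (0,)), ('Cancer', (1,)), each Dirichlet([1, 1])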

Big Picture
Obtain measure of quality of our current model
Choose query that most improves quality
Update model
Need notion of model quality
Need to specify how to update CPDs
Data are no longer i.i.d.
Selective and Interventional scenarios

Value of Information
Actions: queries Q
Utility: model quality (a Maximal Expected Utility formulation)
Value of Information of query Q = expected increase in model quality given query Q
But…
Continuous action space
Very high dimensional action space
Still need notion of quality and how to update CPDs

Updating parameter density
Do not update A since we are fixing it
If we select A then do not update B
Sampling from P(B | A=a) ≠ P(B)
If we force A then we can update B
Sampling from P(B|A:=a) = P(B)*
Update all other nodes as usual
Obtain the new density via the standard conjugate Dirichlet count update for each updated node
*Pearl 2000
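Continuing the representation sketched under "Parameter estimation", here is one way the update could look for a single complete observation; the rule that a selectional query also leaves the query nodes' ancestors untouched is my reading of the A/B example above, not a quote of the paper.

def ancestors(structure, node):
    """All ancestors of `node` in a DAG given as child -> list of parents."""
    found, frontier = set(), list(structure[node])
    while frontier:
        parent = frontier.pop()
        if parent not in found:
            found.add(parent)
            frontier.extend(structure[parent])
    return found

def update(dirichlets, structure, x, query_nodes, interventional):
    """Increment Dirichlet counts for one observed instance x (dict: node -> value)."""
    skip = set(query_nodes)                  # fixed nodes carry no information about their CPDs
    if not interventional:
        for q in query_nodes:                # selecting on A=a biases A's ancestors:
            skip |= ancestors(structure, q)  # sampling from P(B | A=a) != P(B)
    for node, parents in structure.items():
        if node in skip:
            continue
        u = tuple(x[parent] for parent in parents)
        dirichlets[(node, u)][x[node]] += 1  # standard conjugate Dirichlet count update

# e.g., an interventional query forcing Smoking := 1, with observed response Cancer = 0:
# update(p_density, structure, {"Smoking": 1, "Cancer": 0}, {"Smoking"}, interventional=True)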

Bayesian point estimation I
Goal: a single estimate θ
instead of a distribution p over θ

Bayesian point estimation II
We do not know the true θ
Density p represents our current beliefs over θ
Choose the point estimate θ̂ that minimizes the expected loss:
θ̂ = argmin_θ′ E_{θ ~ p(θ)} [ Loss(θ, θ′) ]
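As a tiny numeric sketch, assuming the loss is KL divergence (the loss used in the rest of the talk), the minimizer of the expected loss for a single Dirichlet is simply its mean:

import numpy as np

alpha = np.array([3.0, 1.0, 6.0])   # illustrative Dirichlet hyperparameters, not from the talk
theta_hat = alpha / alpha.sum()     # argmin of E_p[KL(theta || theta')] is the Dirichlet mean
print(theta_hat)                    # [0.3 0.1 0.6]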

Bayesian estimation
Standard Bayesian approach:
Maintain a density p(θ) over CPD parameters
Given data we update p

Quality of p(q)
Use the expected loss of the Bayesian point estimate as a measure of the quality of p(θ):
Risk(p) = E_{θ ~ p(θ)} [ Loss(θ, θ̂) ]
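For a single Dirichlet this risk has a closed form involving the digamma function (the same ingredients as the lemma later in the talk); the sketch below is my own derivation from the definition above, assuming KL loss, and is not copied from the paper:

import numpy as np
from scipy.special import digamma

def dirichlet_risk(alpha):
    """E_{theta ~ Dir(alpha)}[ KL(theta || theta_hat) ] with theta_hat the Dirichlet mean."""
    alpha = np.asarray(alpha, dtype=float)
    a_star = alpha.sum()
    mean = alpha / a_star
    # E[theta_j log theta_j] = mean_j * (digamma(alpha_j + 1) - digamma(a_star + 1))
    e_theta_log_theta = mean * (digamma(alpha + 1.0) - digamma(a_star + 1.0))
    return float(e_theta_log_theta.sum() - (mean * np.log(mean)).sum())

print(dirichlet_risk([1.0, 1.0]))    # broad prior: large risk (~0.19)
print(dirichlet_risk([50.0, 50.0]))  # after many observations: small risk (~0.005)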

The Querying component
Set the controllable variables so as to minimize the expected posterior risk:
   ExPRisk(p | Q=q) = Σ_x P(x | Q=q) · Risk(p′), where p′ is p updated with response x to query q

Intuition
Choosing queries that reduce the expected loss of our estimate as much as possible
But… are we converging to the “correct” distribution?
Theorem: Our method is consistent
θ̂ converges to the true θ that is generating the data
 Reducing loss as fast as possible.
And converging to the right solution.

Obtaining an Algorithm
Consider a simple case first:

Proof Outline
Fairly complex. First prove a lemma:
Let Γ(α) be the gamma function, Ψ(α) = Γ′(α)/Γ(α) be the digamma function, H be the entropy function, and α* = Σ_j α_j
Define:
δ(α_1, …, α_r)
Then:
Now use properties of Ψ to derive the final result

Intuition
Suppose q1 is 100 times less likely than q2
Suppose we have:
According to ΔRisk, choosing q1 is worth more than choosing q2
Random sampling does not pay enough attention to the rarer cases

Scaling up
KL divergence decomposes with the network:
KL(θ ‖ θ′) = Σ_i Σ_{u_i} P_θ(u_i) · KL( θ_{X_i | u_i} ‖ θ′_{X_i | u_i} )
Leads to ΔRisk decomposing with the network
Need a small approximation: compute P(u_i | q) using the current point estimate, i.e., P_θ(u_i | q) ≈ P_θ̂(u_i | q)

The Algorithm
For each potential query q
Compute ΔRisk(X | q)
Choose q for which ΔRisk(X | q) is greatest
Cost of computing ΔRisk(X | q):
Cost of Bayesian network inference
Complexity: O(|Q| × cost of inference)
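A rough sketch of this loop, reusing the per-family risk from the "Quality of p(θ)" sketch. `candidate_queries` and `p_parents_given_query` are hypothetical stand-ins: the latter is the Bayesian-network inference step P(u_i | q) that dominates the cost, and the per-family risk-reduction term is my reading of the decomposition slides, not the paper's exact expression.

import numpy as np
from scipy.special import digamma

def dirichlet_risk(alpha):
    """Risk of one Dirichlet under KL loss (as in the earlier sketch)."""
    alpha = np.asarray(alpha, dtype=float)
    mean = alpha / alpha.sum()
    return float(np.sum(mean * (digamma(alpha + 1.0) - digamma(alpha.sum() + 1.0))
                        - mean * np.log(mean)))

def one_step_risk_reduction(alpha):
    """Risk(alpha) minus the predictive-weighted risk after one more observation."""
    alpha = np.asarray(alpha, dtype=float)
    pred = alpha / alpha.sum()                         # predictive P(X = j | u)
    exp_posterior = sum(pred[j] * dirichlet_risk(alpha + np.eye(len(alpha))[j])
                        for j in range(len(alpha)))
    return dirichlet_risk(alpha) - exp_posterior

def choose_query(candidate_queries, dirichlets, p_parents_given_query):
    """Pick the query q with the largest approximate, decomposed risk reduction."""
    best_q, best_gain = None, -np.inf
    for q in candidate_queries:
        gain = 0.0
        for (node, u), alpha in dirichlets.items():
            w = p_parents_given_query(node, u, q)      # P(u_i | q): one BN inference call
            gain += w * one_step_risk_reduction(alpha)
        if gain > best_gain:
            best_q, best_gain = q, gain
    return best_q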

Experiments I
Asia Network
8 nodes
18 parameters
2 controllable nodes
Setting the priors:
Assume we have fairly good background knowledge
Simulated this by first sampling a few hundred examples

Experiments II
Controllable nodes were root nodes
Values for controllable nodes include “Not set”
Measured KL divergence to true network
Run time: a few seconds to generate next query
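As an aside on the metric, assuming the learned and true networks share the structure G, the KL divergence factors family by family (as on the "Scaling up" slide); `p_parents_true` below is a hypothetical stand-in for inference in the true network.

import numpy as np

def kl_row(p_row, q_row):
    """KL divergence between two discrete distributions over the same values."""
    p_row, q_row = np.asarray(p_row, float), np.asarray(q_row, float)
    mask = p_row > 0
    return float(np.sum(p_row[mask] * np.log(p_row[mask] / q_row[mask])))

def kl_to_true(true_cpds, learned_cpds, p_parents_true):
    """KL(true || learned): both are dicts (node, parent_config) -> probability vector."""
    return sum(p_parents_true(node, u) * kl_row(row, learned_cpds[(node, u)])
               for (node, u), row in true_cpds.items())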

Asia Network

Alarm Network

Asia Network (Uniform Prior)

Alarm Network (Uniform Prior)

Conclusions
Formal framework for Active Learning in BNs
Efficient algorithm
Theoretical gains supported by experimental results
Particularly useful when data is expensive
Most beneficial when domain consists of distinctive common and rare cases
Random sampling does not sample rare cases often enough
In medical domains, rare cases are often even more important
Can adapt loss function to place even greater emphasis on them

Future Work
Cope with missing values
Hard to do in closed form in Bayesian framework
Approximations
Multiple queries
Currently myopic: we compute only the single next best query
Active learning for network structure
Need to maintain distribution over structures: expensive
Attempt to obtain a closed-form solution for a fixed ordering
Then do MCMC over orderings