Active Learning for Parameter Estimation in Bayesian Networks

Simon Tong
Daphne Koller

Stanford University
Motivation

- Conducting a lung cancer study
  - Limited resources: we can examine only a few people
  - We have a list of people's ages and whether they smoke
  - Which people should we choose to examine?
- Goal: learn as much about the distribution as possible
- Question: can we do better than random sampling?
Active Learning in Bayes Nets

Selective versus Interventional

- Two types of fixing of controllable nodes:
  - Intervene to force a node to take a certain value
    - e.g., perform an experiment
  - Select data instances whose values match our desired template
    - e.g., our lung cancer example
Parameter estimation

- Given a Bayesian network structure G
- Want to estimate the CPD parameters θ
- We use discrete variables with multinomial table CPDs
- And Dirichlet densities over the parameters
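One way to picture this setup (a minimal sketch, not the authors' code): a table CPD keeps one Dirichlet per parent instantiation, and conjugate updating just increments the matching count.

```python
from dataclasses import dataclass, field

# Illustrative sketch: a multinomial table CPD with a Dirichlet density
# per parent instantiation, as in the setting the slides describe.
@dataclass
class DirichletCPD:
    n_values: int                              # arity of the variable
    rows: dict = field(default_factory=dict)   # parent instantiation -> alphas

    def alphas(self, parents):
        # Start each row from a uniform Dirichlet(1, ..., 1) prior.
        return self.rows.setdefault(parents, [1.0] * self.n_values)

    def observe(self, parents, value):
        # Conjugate update: increment the count of the observed value.
        self.alphas(parents)[value] += 1.0

    def point_estimate(self, parents):
        # Posterior mean of the Dirichlet: alpha_j / alpha_total.
        a = self.alphas(parents)
        total = sum(a)
        return [aj / total for aj in a]

cpd = DirichletCPD(n_values=2)
cpd.observe(parents=("smoker",), value=1)
print(cpd.point_estimate(("smoker",)))   # posterior mean, about [1/3, 2/3]
```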
Big Picture

- Loop:
  - Obtain a measure of the quality of our current model
  - Choose the query that most improves quality
  - Update the model
- Need a notion of model quality
- Need to specify how to update the CPDs
  - Data are no longer i.i.d.
  - Selective and interventional scenarios
Value of Information

- Actions: queries Q
- Maximal Expected Utility, where utility = model quality
- Value of Information of query Q: the expected increase in model quality given query Q
- But…
  - Continuous action space
  - Very high-dimensional action space
  - Still need a notion of quality, and how to update the CPDs
Updating parameter density

- Do not update A, since we are fixing it
- If we select A, then do not update B
  - We are sampling from P(B | A = a) ≠ P(B)
- If we force A, then we can update B
  - We are sampling from P(B | A := a) = P(B)*
- Update all other nodes as usual
- Obtain a new density

* Pearl 2000
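The rules above can be condensed into a tiny decision helper for an assumed two-node network B -> A with A controllable (hypothetical names; this only restates the slide's rules, it is not the paper's code):

```python
# Sketch: which node's Dirichlet counts may be updated after a query on
# the controllable node A, in a toy network where B is a parent of A.
def nodes_to_update(query_type):
    updates = set()
    # A itself is never updated: we fixed its value, so the instance
    # carries no information about P(A).
    if query_type == "intervene":
        # Forcing A := a (Pearl's do-operator) leaves B sampled from its
        # undisturbed marginal, P(B | A := a) = P(B), so B can be updated.
        updates.add("B")
    elif query_type == "select":
        # Selecting instances with A = a biases B's sample towards
        # P(B | A = a) != P(B), so B must not be updated.
        pass
    return updates

print(nodes_to_update("intervene"))   # {'B'}
print(nodes_to_update("select"))      # set()
```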
Bayesian point estimation I

- Goal: a single estimate θ̃, instead of a distribution p over θ
Bayesian point estimation II

- We do not know the true θ
- The density p represents our current beliefs over θ
- Choose the θ̃ that minimizes the expected loss:

  θ̃ = argmin_θ′ E_{θ∼p} [ Loss(θ, θ′) ]
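Under KL loss, this minimizer for a multinomial with Dirichlet parameters works out to the posterior mean α_j / α* (a standard fact). A quick stdlib-only Monte Carlo check of that claim:

```python
import random
from math import log

random.seed(0)

def sample_dirichlet(alphas):
    # Standard trick: normalized independent Gamma draws are Dirichlet.
    g = [random.gammavariate(a, 1.0) for a in alphas]
    s = sum(g)
    return [x / s for x in g]

def expected_kl(alphas, candidate, n=20000):
    # Monte Carlo estimate of E_{theta ~ Dir(alphas)}[KL(theta || candidate)].
    total = 0.0
    for _ in range(n):
        theta = sample_dirichlet(alphas)
        total += sum(t * log(t / c) for t, c in zip(theta, candidate))
    return total / n

alphas = [3.0, 1.0]
mean = [a / sum(alphas) for a in alphas]   # posterior mean: [0.75, 0.25]
# The posterior mean achieves a lower expected KL loss than any other
# point estimate, e.g. the uniform candidate:
assert expected_kl(alphas, mean) < expected_kl(alphas, [0.5, 0.5])
```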
Bayesian estimation

- Standard Bayesian approach:
  - Maintain a density p(θ) over the CPD parameters
  - Given data, we update p
Quality of p(θ)

- Use the expected loss of the Bayesian point estimate as the measure of quality of p(θ):

  Risk(p) = E_{θ∼p} [ Loss(θ, θ̃) ]
The Querying component

- Set the controllable variables so as to minimize the expected posterior risk:

  ExPRisk(p | Q = q) = Σ_x P(x | Q = q) · Risk(p | Q = q, x)
Intuition

- We choose queries that reduce the expected loss of our estimate as much as possible
- But… are we converging to the “correct” distribution?
- Theorem: our method is consistent
  - θ̃ converges to the true θ that is generating the data
- So we reduce the loss as fast as possible, and we converge to the right solution
Obtaining an Algorithm

- Consider a simple case first
Proof Outline

- The proof is fairly complex. First prove a lemma:
  - Let Γ(α) be the gamma function, ψ(α) = Γ′(α)/Γ(α) the digamma function, H the entropy function, and α* = Σ_j α_j
  - Define δ(α_1, …, α_r)
- Then use properties of ψ to derive the final result
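For a single multinomial with Dirichlet parameters, the KL-loss risk defined earlier has a closed form in exactly these functions, via the standard identity E[θ_j log θ_j] = (α_j/α*)(ψ(α_j + 1) − ψ(α* + 1)). A stdlib-only sketch (the digamma series and its cutoff are implementation choices, not from the slides):

```python
from math import log

def digamma(x):
    # Push x into the asymptotic region via psi(x) = psi(x + 1) - 1/x,
    # then apply the standard asymptotic series in 1/x.
    r = 0.0
    while x < 6.0:
        r -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return r + log(x) - 0.5 / x - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))

def dirichlet_kl_risk(alphas):
    # Risk = E_{theta ~ Dir(alpha)}[KL(theta || mean)]
    #      = sum_j m_j (psi(alpha_j + 1) - psi(alpha* + 1)) + H(m),
    # where m_j = alpha_j / alpha* is the posterior-mean point estimate.
    a_star = sum(alphas)
    m = [a / a_star for a in alphas]
    expected_term = sum(mj * (digamma(aj + 1.0) - digamma(a_star + 1.0))
                        for mj, aj in zip(m, alphas))
    entropy = -sum(mj * log(mj) for mj in m if mj > 0.0)
    return expected_term + entropy

print(dirichlet_kl_risk([1.0, 1.0]))   # log(2) - 1/2, about 0.1931
```

As expected, the risk shrinks as the Dirichlet counts grow, since the density concentrates around its mean.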
Intuition

- Suppose q1 is 100 times less likely than q2
- According to ΔRisk, choosing q1 can still be worth more than choosing q2
- Random sampling does not pay enough attention to the rarer cases
Scaling up

- KL divergence decomposes with the network structure
- This leads to ΔRisk decomposing with the network
- Needs one small approximation*: P_θ(u_i | q) ≈ P_θ̃(u_i | q)
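The decomposition can be checked numerically on a toy two-node network A -> B (the probability tables below are made-up illustrations): the joint KL divergence between two networks with the same structure equals a sum of local KL terms, one per family, each weighted by the first network's parent marginal.

```python
from math import log

def kl(p, q):
    # KL divergence between two discrete distributions.
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q))

# Two networks over A -> B: P(A) and P(B | A) tables for each.
pA1, pA2 = [0.3, 0.7], [0.4, 0.6]
pB1 = {0: [0.9, 0.1], 1: [0.2, 0.8]}
pB2 = {0: [0.7, 0.3], 1: [0.25, 0.75]}

# Joint KL, computed directly over all (a, b) assignments.
joint = sum(pA1[a] * pB1[a][b] * log((pA1[a] * pB1[a][b]) / (pA2[a] * pB2[a][b]))
            for a in (0, 1) for b in (0, 1))

# Decomposed form: one local KL term per family.
decomposed = kl(pA1, pA2) + sum(pA1[a] * kl(pB1[a], pB2[a]) for a in (0, 1))

assert abs(joint - decomposed) < 1e-12
```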
The Algorithm

- For each potential query q:
  - Compute ΔRisk(X | q)
- Choose the q for which ΔRisk(X | q) is greatest
- Cost of computing ΔRisk(X | q): the cost of Bayesian network inference
- Complexity: O(|Q| · cost of inference)
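The loop above can be sketched end-to-end on a toy model (the names and the Monte Carlo risk proxy are illustrative stand-ins, not the paper's algorithm): one binary variable X whose Dirichlet depends on the value of a single controllable query node Q. The under-explored query value should win.

```python
import random
from math import log

random.seed(1)

def sample_dirichlet(alphas):
    g = [random.gammavariate(a, 1.0) for a in alphas]
    s = sum(g)
    return [x / s for x in g]

def risk_of(alphas, n=4000):
    # Monte Carlo proxy for E[KL(theta || mean)] under Dir(alphas).
    m = [a / sum(alphas) for a in alphas]
    acc = 0.0
    for _ in range(n):
        t = sample_dirichlet(alphas)
        acc += sum(ti * log(ti / mi) for ti, mi in zip(t, m))
    return acc / n

def expected_posterior_risk(alphas):
    # Average the posterior risk over the possible responses, weighted
    # by their Dirichlet-multinomial predictive probabilities m_j.
    m = [a / sum(alphas) for a in alphas]
    total = 0.0
    for j, mj in enumerate(m):
        post = list(alphas)
        post[j] += 1.0
        total += mj * risk_of(post)
    return total

# One Dirichlet over P(X | Q = q) per query value; q=1 is barely explored.
params = {0: [50.0, 50.0], 1: [1.0, 1.0]}

def best_query():
    # Greedy step: pick the query with the largest expected risk reduction.
    gains = {q: risk_of(a) - expected_posterior_risk(a) for q, a in params.items()}
    return max(gains, key=gains.get)

print(best_query())   # the under-explored query value 1 wins
```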
Experiments I

- Asia network
  - 8 nodes, 18 parameters, 2 controllable nodes
- Setting the priors:
  - Assume we have fairly good background knowledge
  - Simulated this by first sampling a few hundred examples
Experiments II

- Controllable nodes were root nodes
- Values for controllable nodes include “Not set”
- Measured KL divergence to the true network
- Run time: a few seconds to generate the next query
Results plots: Asia Network; Alarm Network; Asia Network (Uniform Prior); Alarm Network (Uniform Prior)
Conclusions

- A formal framework for active learning in Bayesian networks
- An efficient algorithm
- Theoretical gains supported by experimental results
- Particularly useful when data is expensive
- Most beneficial when the domain has distinct common and rare cases
  - Random sampling does not look at rare cases enough
  - In medical domains, rare cases are often even more important
  - The loss function can be adapted to place even greater emphasis on them
Future Work

- Cope with missing values
  - Hard to do in closed form in a Bayesian framework
  - Approximations
- Multiple queries
  - Currently myopic: we only find the single next best query
- Active learning for network structure
  - Need to maintain a distribution over structures: expensive
  - Attempt to obtain a closed-form solution for a fixed ordering
  - Then do MCMC over orderings