|
|
|
Simon Tong |
|
Daphne Koller |
|
|
|
Stanford University |
|
|
|
|
|
Conducting lung cancer study |
|
Limited resources – can examine just a few
people |
|
Have a list of people’s ages and whether they
smoke |
|
Which people should we choose to examine? |
|
|
|
Goal: to learn as much about the distribution as
possible |
|
|
|
Question: can we do better than random sampling? |
|
|
|
|
|
|
Two ways of fixing controllable nodes:
|
|
|
Intervene to force node to take a certain value |
|
e.g., perform an experiment |
|
|
|
Select data instances whose values match our
desired template |
|
e.g., our lung cancer example |
|
|
|
|
|
|
|
Given a Bayesian Network structure G |
|
|
|
Want to estimate the CPD parameters θ
|
|
|
We will be using discrete variables
with multinomial table CPDs |
|
And Dirichlet densities over parameters |
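As a concrete sketch of this setup (the network fragment, variable names, and counts below are illustrative, not from the talk): with multinomial table CPDs, each parent configuration gets its own Dirichlet density, and its hyperparameters behave like pseudo-counts that complete data simply increments.

```python
# Illustrative table CPD P(Cancer | Smoker), one Dirichlet per
# parent configuration; hyperparameters act as pseudo-counts.
alpha = {
    "smoker=yes": {"cancer=yes": 1.0, "cancer=no": 1.0},  # uniform priors
    "smoker=no":  {"cancer=yes": 1.0, "cancer=no": 1.0},
}

# Complete-data Bayesian update: add one count per observed instance.
data = [
    ("smoker=yes", "cancer=yes"),
    ("smoker=yes", "cancer=no"),
    ("smoker=no",  "cancer=no"),
    ("smoker=no",  "cancer=no"),
]
for u, x in data:
    alpha[u][x] += 1.0

def cpd_mean(row):
    """Posterior-mean point estimate of one CPD row: normalized counts."""
    total = sum(row.values())
    return {x: a / total for x, a in row.items()}

print(cpd_mean(alpha["smoker=no"]))   # {'cancer=yes': 0.25, 'cancer=no': 0.75}
```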
|
|
|
|
|
|
|
Obtain measure of quality of our current model |
|
Choose query that most improves quality |
|
Update model |
|
|
|
Need notion of model quality |
|
|
|
Need to specify how to update CPDs |
|
Data are no longer i.i.d.
|
Selective and Interventional scenarios |
|
|
|
|
|
|
|
Actions: Q |
|
Maximal Expected Utility |
|
model quality |
|
Value of Information of query Q |
|
expected increase in model quality given query Q |
|
|
|
But… |
|
|
|
Continuous action space |
|
Very high dimensional action space |
|
Still need notion of quality and how to update
CPDs |
|
|
|
|
|
Do not update A since we are fixing it |
|
|
|
If we select A then do not update B |
|
Sampling from P(B|A=a) ≠ P(B)
|
|
|
If we force A then we can update B |
|
Sampling from P(B|A:=a) = P(B)* |
|
|
|
Update all other nodes as usual |
|
|
|
Obtain new density: |
|
*Pearl 2000 |
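A minimal sketch of these two update rules, on a hypothetical two-node network B → A where B is a parent of the controlled node A (the counts and names are mine, not the slides'):

```python
# Hypothetical network B -> A: B is a root, A the controllable child.
# Dirichlet pseudo-counts for P(B); A's own CPD is never updated here,
# since we fix A's value ourselves in both scenarios.
alpha_B = {"b0": 1.0, "b1": 1.0}

def update_after_query(b_observed, mode):
    """Update B's counts after a query that fixed A = a.

    mode == "intervene": A was forced (A := a), so B was still drawn
        from P(B) -- safe to update B's CPD (Pearl 2000).
    mode == "select": instances matching A = a were selected, so B was
        effectively drawn from P(B | A=a) != P(B) -- do NOT update B.
    All other, unbiased nodes would be updated as usual.
    """
    if mode == "intervene":
        alpha_B[b_observed] += 1.0

update_after_query("b1", mode="intervene")   # count for b1 grows
update_after_query("b1", mode="select")      # counts unchanged
```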
|
|
|
|
|
Goal: a single estimate θ̂

instead of a distribution p over θ
|
|
|
|
We do not know the true θ

Density p represents our current beliefs over θ

Choose the estimate θ̂ that minimizes the expected loss:
|
|
|
θ̂ = argminθ′ Ep(θ)[ Loss(θ ‖ θ′) ]
|
|
|
|
|
|
|
|
|
Standard Bayesian approach: |
|
Maintain a density p(θ) over CPD parameters
|
Given data we update p |
|
|
|
|
Use the expected loss of the Bayesian point
estimate as a measure of quality of p(θ):
|
|
|
Risk(p) = Ep(θ)[ Loss(θ ‖ θ̂) ]
|
|
|
|
|
|
|
|
Set the controllable variables so as to minimize
the expected posterior risk: |
|
|
|
ExPRisk(p | Q=q) = ∑x P(x | Q=q) Risk(p | Q=q, x)
|
|
|
|
|
Choosing queries that reduce the expected loss
of our estimate as much as possible |
|
|
|
But… are we converging to the “correct”
distribution? |
|
|
|
Theorem: Our method is consistent |
|
θ̂ converges to the true θ that is generating the data
|
|
|
Reducing
loss as fast as possible. |
|
And converging to the right solution. |
|
|
|
|
|
|
Consider a simple case first: |
|
|
|
|
|
|
|
Fairly complex. First prove lemma: |
|
|
|
Let Γ(α) be the gamma function, Ψ(α) the digamma function Γ′(α)/Γ(α), H the entropy function, and α* = ∑j αj
|
|
|
Define:

δ(α1, … , αr) = ∑j (αj/α*) [Ψ(αj+1) − Ψ(α*+1)] + H(α1/α*, … , αr/α*)

Then:

Risk(Dirichlet(α1, … , αr)) = δ(α1, … , αr)

|
Now use properties of the digamma function Ψ to derive the final
result
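The lemma's quantity can be checked numerically. A dependency-free sketch in my own notation (using a finite-difference digamma rather than scipy.special.digamma): the expected KL loss of a Dirichlet with respect to its posterior mean, compared against a Monte Carlo estimate.

```python
import math
import random

def digamma(x):
    # Psi(x) = d/dx ln Gamma(x), via a central difference of math.lgamma
    # (demo-grade; real code would use scipy.special.digamma).
    h = 1e-5
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2 * h)

def dirichlet_risk(alphas):
    """E[KL(theta || theta_hat)] for Dirichlet(alphas), theta_hat = mean.

    Uses the standard identity
    E[theta_j ln theta_j] = (a_j/a*)(Psi(a_j+1) - Psi(a*+1)),
    plus the entropy of the mean distribution.
    """
    a_star = sum(alphas)
    mean = [a / a_star for a in alphas]
    e_theta_log = sum((a / a_star) * (digamma(a + 1) - digamma(a_star + 1))
                      for a in alphas)
    entropy = -sum(m * math.log(m) for m in mean)
    return e_theta_log + entropy

def mc_risk(alphas, n=100_000, seed=0):
    """Monte Carlo check: average KL(theta || mean) over Dirichlet samples."""
    rng = random.Random(seed)
    mean = [a / sum(alphas) for a in alphas]
    total = 0.0
    for _ in range(n):
        g = [rng.gammavariate(a, 1.0) for a in alphas]
        s = sum(g)
        total += sum((x / s) * math.log((x / s) / m)
                     for x, m in zip(g, mean) if x > 0)
    return total / n

print(dirichlet_risk([1.0, 1.0]))   # = ln 2 - 1/2 ~ 0.1931 for Beta(1,1)
print(mc_risk([1.0, 1.0]))          # should agree to a few decimal places
```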
|
|
|
|
Suppose query q1 is 100 times less likely than q2
|
Suppose we have: |
|
|
|
|
|
According to DRisk, choosing q1 is worth more than choosing q2
|
|
|
Random sampling does not pay enough attention to
the rarer cases |
|
|
|
|
|
KL divergence decomposes with the network: |
|
|
|
|
|
|
|
Leads to DRisk decomposing with the network |
|
Need small approximation*: compute P(ui | q) using the current
parameters θ rather than the updated parameters θq,x
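The decomposition can be verified on a toy case. A sketch with my own numbers (hypothetical two-node network A → B): the joint KL divergence between two networks equals the sum of per-CPD KL terms, each weighted by the true probability of its parent configuration.

```python
import math

def kl(p, q):
    """KL divergence between two discrete distributions (natural log)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical network A -> B (both binary): P is "true", Q an estimate.
P_A, Q_A = [0.3, 0.7], [0.4, 0.6]
P_B = {0: [0.9, 0.1], 1: [0.2, 0.8]}   # P(B | A=a)
Q_B = {0: [0.8, 0.2], 1: [0.3, 0.7]}

# Joint KL computed directly over all assignments...
joint = 0.0
for a in (0, 1):
    for b in (0, 1):
        p = P_A[a] * P_B[a][b]
        q = Q_A[a] * Q_B[a][b]
        joint += p * math.log(p / q)

# ...equals the sum of local CPD terms, weighted by the true
# parent distribution: KL(P_A||Q_A) + sum_a P(a) KL(P(B|a)||Q(B|a))
local = kl(P_A, Q_A) + sum(P_A[a] * kl(P_B[a], Q_B[a]) for a in (0, 1))

assert abs(joint - local) < 1e-12
```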
|
|
|
|
|
|
|
|
|
For each potential query q |
|
Compute DRisk(X|q) |
|
Choose q for which DRisk(X|q) is greatest |
|
|
|
Cost of computing DRisk(X|q): |
|
Cost of Bayesian network inference |
|
|
|
Complexity: O(|Q| × cost of inference)
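A sketch of this greedy loop on a one-CPD toy problem (controllable C, target X; the setup below is entirely my own, not the talk's): for each candidate query we average the updated risk over the predictive distribution of the answer, and pick the query with the largest expected risk reduction.

```python
import math

def digamma(x):
    # Numerical Psi(x) via central difference of math.lgamma (demo-grade).
    h = 1e-5
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2 * h)

def risk(alphas):
    # Expected KL loss of one Dirichlet row w.r.t. its mean.
    a_star = sum(alphas)
    total = 0.0
    for a in alphas:
        m = a / a_star
        total += m * (digamma(a + 1) - digamma(a_star + 1)) - m * math.log(m)
    return total

# Toy table CPD P(X | C): row c0 is already well determined,
# row c1 has hardly been observed.
alpha = {"c0": [20.0, 20.0], "c1": [1.0, 2.0]}

def total_risk(alpha):
    return sum(risk(row) for row in alpha.values())

def exp_posterior_risk(alpha, c):
    """Expected total risk after forcing C := c and observing X once."""
    row = alpha[c]
    a_star = sum(row)
    expected = 0.0
    for j, a in enumerate(row):
        p_x = a / a_star                           # predictive P(X=j | C=c)
        updated = row[:j] + [a + 1.0] + row[j + 1:]
        expected += p_x * risk(updated)
    # rows for the other parent configurations are untouched by this query
    return expected + sum(risk(r) for cc, r in alpha.items() if cc != c)

# DRisk(q) = current risk - expected posterior risk; query the maximizer.
drisk = {c: total_risk(alpha) - exp_posterior_risk(alpha, c) for c in alpha}
best = max(drisk, key=drisk.get)
print(best, drisk)   # the uncertain row c1 wins
```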
|
|
|
|
|
|
|
|
|
Asia Network |
|
8 nodes |
|
18 parameters |
|
2 controllable nodes |
|
|
|
Setting the priors: |
|
Assume have a fairly good background knowledge |
|
Simulated this by first sampling a few hundred
examples
|
|
|
|
|
Controllable nodes were root nodes |
|
|
|
Values for controllable nodes include “Not set” |
|
|
|
Measured KL divergence to true network |
|
|
|
Run time: a few seconds to generate next query |
|
|
|
|
|
|
|
|
|
|
Formal framework for Active Learning in BNs |
|
Efficient algorithm |
|
Theoretical gains supported by experimental
results |
|
|
|
Particularly useful when data is expensive |
|
|
|
Most beneficial when the domain contains
distinct common and rare cases
|
Random sampling does not look at rare cases
enough |
|
In medical domains rare cases often even more
important |
|
Can adapt loss function to place even greater
emphasis on them |
|
|
|
|
|
|
Cope with missing values |
|
Hard to do in closed form in Bayesian framework |
|
Approximations |
|
|
|
Multiple queries |
|
Currently myopic: figuring out the single next
best query |
|
|
|
Active learning for network structure |
|
Need to maintain distribution over structures:
expensive |
|
Attempt to obtain closed form solution for fixed
ordering |
|
Then do MCMC over orderings |
|