Active Learning for
Parameter Estimation in Bayesian Networks

Simon Tong
Daphne Koller

Stanford University
Motivation




Conducting lung cancer study 

Limited resources – can examine just a
few people 

Have a list of people’s ages and
whether they smoke 

Which people should we choose to
examine? 



Goal: to learn as much about the
distribution as possible 



Question: can we do better than random
sampling? 
Active Learning in Bayes
Nets
Selective versus
Interventional




Two ways of fixing controllable
nodes:



Intervene to force node to take a
certain value 

e.g., perform an experiment 



Select data instances whose values
match our desired template 

e.g., our lung cancer example 
Parameter estimation






Given a Bayesian Network structure G 



Want to estimate the CPD parameters θ



We will be using discrete variables
with multinomial table CPDs 

And Dirichlet densities over parameters 
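To make these modeling choices concrete, here is a minimal sketch of a multinomial table CPD with one Dirichlet density per parent configuration; the class and field names are illustrative, not from the paper:

```python
# Minimal sketch (illustrative names): a table CPD for a discrete node,
# with a vector of Dirichlet pseudo-counts per parent configuration.
from dataclasses import dataclass

@dataclass
class DirichletCPD:
    # parent config (tuple) -> list of pseudo-counts, one per value of the node
    alpha: dict

    def mean(self, u):
        """Posterior-mean estimate of P(X | U=u) under the Dirichlet."""
        total = sum(self.alpha[u])
        return [a / total for a in self.alpha[u]]

    def observe(self, u, x):
        """Conjugate Bayesian update: increment the matching pseudo-count."""
        self.alpha[u][x] += 1

# Smoking -> Cancer, both binary; priors encode mild background knowledge.
cpd = DirichletCPD(alpha={(0,): [9, 1], (1,): [7, 3]})
cpd.observe((1,), 1)                 # saw a smoker with cancer
print(cpd.mean((1,)))                # [7/11, 4/11] ~ [0.636, 0.364]
```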


Big Picture




Obtain measure of quality of our
current model 

Choose query that most improves quality 

Update model 



Need notion of model quality 



Need to specify how to update CPDs 

Data is no longer i.i.d.

Selective and Interventional scenarios 


Value of Information




Actions: queries Q

Maximal Expected Utility, where utility = model quality

Value of Information of query Q:
expected increase in model quality
given query Q



But… 



Continuous action space 

Very high dimensional action space 

Still need notion of quality and how to
update CPDs 
Updating parameter
density




Do not update A since we are fixing it

If we select A then do not update B

Sampling from P(B | A=a) ≠ P(B)

If we force A then we can update B

Sampling from P(B | A:=a) = P(B)*

Update all other nodes as usual



Obtain new density: 

*Pearl 2000 
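A sketch of these update rules as I read them, assuming the network is given as a {node: [parents]} dict and the pseudo-counts as nested dicts (both illustrative structures, not the paper's):

```python
# Illustrative sketch of the slide's update rules for a fully observed
# instance, given a query that fixed `query_node` to its observed value.

def ancestors(graph, node):
    """All ancestors of `node` in a DAG given as {node: [parents]}."""
    seen, stack = set(), list(graph[node])
    while stack:
        p = stack.pop()
        if p not in seen:
            seen.add(p)
            stack.extend(graph[p])
    return seen

def update(alpha, graph, instance, query_node, interventional):
    """Increment pseudo-counts alpha[node][parent_config][value].

    - The queried node itself is never updated: we fixed its value.
    - Under selection, its ancestors are also skipped, since they were
      sampled from P(. | A=a) != P(.), which would bias the update.
    - Under intervention, P(. | A:=a) = P(.) for the remaining nodes
      (Pearl 2000), so they update as usual.
    """
    skip = {query_node}
    if not interventional:
        skip |= ancestors(graph, query_node)
    for node, parents in graph.items():
        if node in skip:
            continue
        u = tuple(instance[p] for p in parents)
        alpha[node][u][instance[node]] += 1

graph = {'S': [], 'C': ['S']}                         # Smoking -> Cancer
alpha = {'S': {(): [1, 1]}, 'C': {(0,): [1, 1], (1,): [1, 1]}}

# Intervening on C: S was still drawn from P(S), so its counts update.
update(alpha, graph, {'S': 1, 'C': 0}, 'C', interventional=True)
print(alpha['S'][()])  # [1, 2]
```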
Bayesian point estimation
I




Goal: a single estimate θ*

instead of a distribution p over θ
Bayesian point estimation
II



We do not know the true θ

Density p represents our current beliefs over θ

Choose the θ* that minimizes the expected loss:

θ* = argmin_{θ'} E_{θ∼p}[Loss(θ ‖ θ')]
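For KL loss, this minimizer is the mean of the density. A quick numeric sanity check of my own (not the paper's derivation) on a Beta(3, 2), by grid integration:

```python
# Sanity check (not from the slides): for p = Beta(3, 2), the point
# estimate minimizing E_p[KL(theta || theta_hat)] is the mean 3/5.
import math

a, b = 3.0, 2.0
B = math.gamma(a) * math.gamma(b) / math.gamma(a + b)  # Beta normalizer

def expected_kl(theta_hat, n=2000):
    """Riemann-sum approximation of E_p[KL(theta || theta_hat)]."""
    total = 0.0
    for i in range(1, n):
        t = i / n
        density = t ** (a - 1) * (1 - t) ** (b - 1) / B
        kl = t * math.log(t / theta_hat) + (1 - t) * math.log((1 - t) / (1 - theta_hat))
        total += density * kl / n
    return total

best = min(range(1, 100), key=lambda k: expected_kl(k / 100))
print(best / 100)  # close to the mean a / (a + b) = 0.6
```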




Bayesian estimation




Standard Bayesian approach: 

Maintain a density p(θ) over the CPD parameters

Given data we update p 
Quality of p(θ)



Use the expected loss of the Bayesian
point estimate as a measure of the quality of p(θ):

Risk(p) = E_{θ∼p}[Loss(θ ‖ θ*)]


The Querying component





Set the controllable variables so as to
minimize the expected posterior risk:

ExPRisk(p | Q=q): the risk of the updated density, averaged over the possible query responses
Intuition




Choose queries that reduce the
expected loss of our estimate as much as possible



But… are we converging to the “correct”
distribution? 



Theorem: our method is consistent

θ* converges to the true θ that is generating the data



Reducing loss as fast as possible. 

And converging to the right
solution. 


Obtaining an Algorithm



Consider a simple case first: a single node with one Dirichlet density over its parameters

Proof Outline




Fairly complex. First prove a lemma:

Let Γ(α) be the gamma function, Ψ(α) = Γ'(α)/Γ(α) be the digamma function, H be the entropy function, and α_* = Σ_j α_j

Define:

δ(α_1, …, α_r) = Σ_j (α_j/α_*) [Ψ(α_j + 1) − Ψ(α_* + 1)] + H(α_1/α_*, …, α_r/α_*)

Then: the KL-risk of a Dirichlet(α_1, …, α_r) density is exactly δ(α_1, …, α_r)

Now use properties of Ψ to derive the final
result
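The lemma can be exercised numerically. Below is a sketch of my own computing δ for integer hyperparameters, using the identity Ψ(n) = −γ + Σ_{k<n} 1/k so only the standard library is needed (for non-integer α one would use scipy.special.digamma):

```python
# Sketch: KL-risk delta(alpha_1, ..., alpha_r) of a Dirichlet density,
# restricted to integer pseudo-counts so digamma has a closed form.
import math

EULER_GAMMA = 0.5772156649015329

def digamma_int(n):
    """psi(n) for a positive integer n: -gamma + sum of 1/k for k < n."""
    return -EULER_GAMMA + sum(1.0 / k for k in range(1, n))

def risk(alphas):
    a_star = sum(alphas)
    theta = [a / a_star for a in alphas]
    expected = sum(t * (digamma_int(a + 1) - digamma_int(a_star + 1))
                   for t, a in zip(theta, alphas))
    entropy = -sum(t * math.log(t) for t in theta)
    return expected + entropy

print(risk([1, 1]))    # ~0.193: broad prior, high risk
print(risk([50, 50]))  # ~0.005: same point estimate, far more data
```

The risk falls as the counts grow even when the point estimate is unchanged, which is exactly what the querying criterion exploits.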
Intuition



Suppose q_1 is 100 times less
likely than q_2

Suppose we have:

According to ΔRisk, choosing q_1 is worth more than choosing q_2



Random sampling does not pay enough
attention to the rarer cases 
Scaling up




KL divergence decomposes with the
network:

KL(P_θ ‖ P_θ') = Σ_i Σ_{u_i} P_θ(u_i) · KL(θ_{X_i|u_i} ‖ θ'_{X_i|u_i})

Leads to ΔRisk decomposing with the network

Need a small approximation*: P_θ(u_i | q) ≈ P_{θ*}(u_i | q)
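A numeric check of this decomposition (my own, with arbitrary parameters) on the smallest interesting network A → B: the KL over the full joint equals the local KL at A plus the P(A)-weighted local KLs at B.

```python
# Verify: KL between two A -> B networks decomposes into per-node terms
# weighted by the true network's parent-configuration probabilities.
import math

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

pA, qA = [0.3, 0.7], [0.4, 0.6]                 # P(A) and its estimate
pB = {0: [0.9, 0.1], 1: [0.2, 0.8]}             # P(B | A)
qB = {0: [0.7, 0.3], 1: [0.25, 0.75]}           # estimate of P(B | A)

joint_kl = sum(pA[a] * pB[a][b] * math.log((pA[a] * pB[a][b]) / (qA[a] * qB[a][b]))
               for a in (0, 1) for b in (0, 1))

decomposed = kl(pA, qA) + sum(pA[a] * kl(pB[a], qB[a]) for a in (0, 1))

print(abs(joint_kl - decomposed) < 1e-12)  # True
```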


The Algorithm






For each potential query q

Compute ΔRisk(X | q)

Choose the q for which ΔRisk(X | q) is greatest

Cost of computing ΔRisk(X | q):

Cost of Bayesian network inference

Complexity: O(|Q| · cost of inference)
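A sketch of this loop in the simplest setting: a single CPD P(B | A) whose parent A is directly controllable, so a query just picks a row to sample. It reuses the Dirichlet risk δ from the lemma; the data structures and helper names are illustrative:

```python
# Greedy query selection for one CPD P(B | A) with controllable A:
# pick the row whose single extra observation shrinks expected risk most.
import math

EULER_GAMMA = 0.5772156649015329

def digamma_int(n):
    return -EULER_GAMMA + sum(1.0 / k for k in range(1, n))

def risk(alphas):
    """KL-risk of a Dirichlet with integer pseudo-counts (the lemma's delta)."""
    a_star = sum(alphas)
    theta = [a / a_star for a in alphas]
    expected = sum(t * (digamma_int(a + 1) - digamma_int(a_star + 1))
                   for t, a in zip(theta, alphas))
    return expected - sum(t * math.log(t) for t in theta)

def delta_risk(alphas):
    """Risk now minus expected posterior risk after one observation here."""
    theta = [a / sum(alphas) for a in alphas]
    ex_posterior = sum(t * risk([a + (j == i) for j, a in enumerate(alphas)])
                       for i, t in enumerate(theta))
    return risk(alphas) - ex_posterior

# Row A=0 is well sampled, row A=1 barely seen: query the rare row.
rows = {0: [40, 10], 1: [2, 1]}
best = max(rows, key=lambda a: delta_risk(rows[a]))
print(best)  # 1
```

In the full algorithm the analogous per-row quantity is weighted by P_{θ*}(u_i | q) and summed over all CPDs, with those probabilities obtained by Bayesian network inference.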




Experiments I




Asia Network 

8 nodes 

18 parameters 

2 controllable nodes 



Setting the priors:

Assume we have fairly good background
knowledge

Simulated this by first sampling a few
hundred examples
Experiments II




Controllable nodes were root nodes 



Values for controllable nodes include
“Not set” 



Measured KL divergence to true network 



Run time: a few seconds to generate the
next query
Asia Network
Alarm Network
Asia Network (Uniform
Prior)
Alarm Network (Uniform
Prior)
Conclusions





Formal framework for Active Learning in
BNs 

Efficient algorithm 

Theoretical gains supported by experimental
results 



Particularly useful when data is expensive 



Most beneficial when the domain contains
distinct common and rare cases

Random sampling does not look at rare
cases enough

In medical domains, rare cases are often
even more important

Can adapt loss function to place even
greater emphasis on them 
Future Work





Cope with missing values 

Hard to do in closed form in the Bayesian
framework

Approximations 



Multiple queries 

Currently myopic: figuring out the
single next best query 



Active learning for network structure 

Need to maintain a distribution over
structures: expensive

Attempt to obtain a closed-form solution
for a fixed ordering

Then do MCMC over orderings 