Scaling up
- KL divergence decomposes with the network:
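The decomposition can be checked numerically. The sketch below is illustrative and not taken from the slides: it assumes a hypothetical two-node network A -> B over binary variables, with made-up probability tables for two models that share this structure. It compares the KL divergence between the two joint distributions with the sum of per-node terms, where each conditional KL is weighted by the probability of the parent value under the first model.

```python
import math


def kl(p, q):
    """KL divergence between two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def joint(p_a, p_b_given_a):
    """Joint distribution over (A, B), flattened in the order (a, b)."""
    return [p_a[a] * p_b_given_a[a][b] for a in range(2) for b in range(2)]


# Hypothetical parameters (assumptions for illustration only).
# Each model is P(A) plus one row P(B | A=a) per value of a.
p_a, p_b_given_a = [0.3, 0.7], [[0.9, 0.1], [0.4, 0.6]]
q_a, q_b_given_a = [0.5, 0.5], [[0.8, 0.2], [0.3, 0.7]]

# KL divergence computed directly on the joint distributions.
kl_joint = kl(joint(p_a, p_b_given_a), joint(q_a, q_b_given_a))

# Decomposed form: one term per node. The term for B averages the
# conditional KL over the parent values, weighted by the first model.
kl_a = kl(p_a, q_a)
kl_b = sum(p_a[a] * kl(p_b_given_a[a], q_b_given_a[a]) for a in range(2))
kl_decomposed = kl_a + kl_b

print(kl_joint, kl_decomposed)
```

The two printed values agree up to floating-point error, so the divergence can be accumulated one node at a time instead of enumerating the full joint, which is what makes the computation scale with network size.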
- Leads to ΔRisk decomposing with the network
- Need small approximation*: $P_{\theta}(u_i \mid q) = P_{\theta \dagger x}(u_i \mid q)$