# Structured Probabilistic Models
###### tags: `shared` `technical`
[TOC]
## Examples
+ What is a "structured probabilistic model"?
+ A way of describing a probability distribution
+ Typically represented by a graph whose nodes are random variables and whose edges encode interactions among them
+ Examples: restricted Boltzmann machine (RBM), deep Boltzmann machine (DBM), deep belief network (DBN)
## Notations
### Conditional variational autoencoders
+ Given data $x$ and condition $c$, we want to predict $y$
+ This mechanism may be governed by a set of latent variables $z$
- $z$ is generated by a prior $p_{\theta}(z)$
- Then this $z$ is used to recover $x$ by $p_{\theta}(x\vert z)$
- Want to use $q_\phi (z\vert x)$ to approximate intractable posterior $p_\theta (z\vert x)$
- To figure out which $z$ is most likely to have generated the data point $x$
\begin{align}
\max_{z} \quad p_{\theta}(z\vert x)
\end{align}
+ An encoder-decoder architecture (a minimal code sketch follows after this list)
- $p_\theta (z\vert x) \simeq q_{\phi}(z\vert x)$
- Encoder: $q_\phi(z\vert x)$ (recognition)
- Decoder: $p_\theta(x\vert z)$ (generation)
- $\phi$ and $\theta$ are learned together
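A minimal sketch of this parameterization, assuming PyTorch, a diagonal-Gaussian encoder, and a Bernoulli decoder (layer sizes are illustrative, not prescribed by these notes): the recognition network outputs the parameters of $q_{\phi}(z\vert x)$ and the generation network outputs the parameters of $p_{\theta}(x\vert z)$.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Recognition model q_phi(z|x): outputs mean and log-variance of a diagonal Gaussian."""
    def __init__(self, x_dim=784, h_dim=256, z_dim=20):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.mu = nn.Linear(h_dim, z_dim)        # mean of q_phi(z|x)
        self.log_var = nn.Linear(h_dim, z_dim)   # log sigma^2 of q_phi(z|x)

    def forward(self, x):
        h = self.hidden(x)
        return self.mu(h), self.log_var(h)

class Decoder(nn.Module):
    """Generation model p_theta(x|z): outputs Bernoulli means for each dimension of x."""
    def __init__(self, z_dim=20, h_dim=256, x_dim=784):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim, h_dim), nn.ReLU(),
            nn.Linear(h_dim, x_dim), nn.Sigmoid(),   # per-dimension probabilities of p_theta(x|z)
        )

    def forward(self, z):
        return self.net(z)
```
Here $\phi$ is the set of encoder weights and $\theta$ the decoder weights; both are trained jointly by maximizing the ELBO derived below.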
## Evidence Lower BOund (ELBO)
### Formulation
+ Sometimes also called the negative variational free energy
+ $\log p_{\theta}(x)$ is independent of $z$ and fixed once the data are given; by Bayes' rule, for any $z$,
\begin{align}
p_{\theta}(x) & = \frac{p_{\theta}(x,z)}{p_{\theta}(z\vert x)}
\end{align}
+ Expanding the likelihood by multiplying and dividing by $q_{\phi}(z\vert x)$
\begin{align}
p_{\theta}(x) & = \frac{q_{\phi}(z\vert x)}{p_{\theta}(z\vert x)}\frac{p_{\theta}(x,z)}{q_{\phi}(z\vert x)} \\
\end{align}
+ Using the normalization property of $q_{\phi}(z\vert x)$ (i.e., $\int_z q_{\phi}(z\vert x)\,dz = 1$) and plugging the expansion above into the logarithm
\begin{align}
\log p_\theta (x) & = \int_{z} q_{\phi}(z\vert x) \log p_\theta (x) dz \\
& = D_{KL}(q_{\phi}(z\vert x)\Vert p_\theta(z\vert x)) + \mathbb{E}_{q_{\phi}(z\vert x)}[\log p_{\theta}(x,z) - \log q_{\phi}(z\vert x)]
\end{align}
+ The second term in the equation above, $\mathcal{L} (\theta,\phi;x)$, is the Evidence Lower BOund (ELBO)
\begin{align}
\mathcal{L} (\theta,\phi;x) & = \log p_{\theta}(x) - D_{KL}(q_{\phi}(z\vert x)\Vert p_\theta(z\vert x)) \\
& = \log p_{\theta}(x) - \mathbb{E}_{q_{\phi}(z\vert x)}[\log q_{\phi}(z\vert x) - \log p_{\theta}(x,z) + \log p_{\theta}(x)] \\
& = - \mathbb{E}_{q_{\phi}(z\vert x)}[\log q_{\phi}(z\vert x) - \log p_{\theta}(x,z)] \quad \text{(since $\log p_{\theta}(x)$ does not depend on $z$, $\mathbb{E}_{q_{\phi}(z\vert x)}[\log p_{\theta}(x)] = \log p_{\theta}(x)$)} \\
& = \mathbb{E}_{q_{\phi}(z\vert x)}[\log p_{\theta}(x,z)] + H(q(z\vert x))
\end{align}
- Note that if $q_{\phi}(z\vert x) = p_{\theta}(z\vert x)$, we have
\begin{equation}
\mathcal{L} (\theta,\phi;x) = \log p_{\theta} (x)
\end{equation}
### Alternative formulation
+ Expanding the joint, $\log p_{\theta}(x,z) = \log p_{\theta}(x\vert z) + \log p_{\theta}(z)$, and folding the prior term into the entropy term yields a KL divergence against the prior (a numerical check follows after the derivation)
\begin{align}
\mathcal{L} (\theta,\phi;x) & = \mathbb{E}_{q_{\phi}(z\vert x)}[\log p_{\theta}(x,z)] + H(q(z\vert x)) \\
& = \mathbb{E}_{q_{\phi}(z\vert x)}[\log p_{\theta}(x\vert z)] - D_{KL}(q_{\phi}(z\vert x)\Vert p_\theta(z)) \\
& \leq \log p_{\theta}(x)
\end{align}
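A quick numerical check of these identities on a toy discrete model (NumPy assumed; the distributions below are arbitrary, made up purely for illustration): the two ELBO forms agree, $\log p_{\theta}(x)$ equals the ELBO plus $D_{KL}(q_{\phi}(z\vert x)\Vert p_{\theta}(z\vert x))$, and the ELBO never exceeds $\log p_{\theta}(x)$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy discrete model: z takes K values, x is a single binary observation (we observe x = 1).
K = 4
p_z = rng.dirichlet(np.ones(K))              # prior p(z)
p_x_given_z = rng.uniform(0.05, 0.95, K)     # p(x=1|z) for each value of z

p_xz = p_z * p_x_given_z                     # joint p(x=1, z)
p_x = p_xz.sum()                             # evidence p(x=1)
p_z_given_x = p_xz / p_x                     # true posterior p(z|x=1)

q = rng.dirichlet(np.ones(K))                # an arbitrary approximate posterior q(z|x)

kl_q_post = np.sum(q * (np.log(q) - np.log(p_z_given_x)))            # D_KL(q || p(z|x))
elbo_joint = np.sum(q * (np.log(p_xz) - np.log(q)))                  # E_q[log p(x,z)] + H(q)
elbo_recon = (np.sum(q * np.log(p_x_given_z))                        # E_q[log p(x|z)]
              - np.sum(q * (np.log(q) - np.log(p_z))))               # - D_KL(q || p(z))

assert np.isclose(elbo_joint, elbo_recon)                # both ELBO forms agree
assert np.isclose(np.log(p_x), elbo_joint + kl_q_post)   # log p(x) = ELBO + KL(q || posterior)
assert elbo_joint <= np.log(p_x)                         # ELBO is a lower bound
print(f"log p(x) = {np.log(p_x):.4f}, ELBO = {elbo_joint:.4f}, KL = {kl_q_post:.4f}")
```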
### Objectives
+ $D_{KL}(\bullet)$ is always non-negative and $\log p_\theta (x)$ is fixed
+ We want the approximation $q_{\phi}(z\vert x)$ to be close to the true posterior $p_{\theta}(z\vert x)$
+ The true posterior $p_{\theta}(z\vert x)$ is intractable, so this KL divergence cannot be minimized directly
+ Instead we maximize the ELBO, which (since $\log p_\theta (x)$ is fixed) drives $D_{KL}(q_{\phi}(z\vert x)\Vert p_{\theta}(z\vert x))\rightarrow 0$
## Expectation Maximization (EM) Algorithm
## Mean-field approximation/approach
+ Variational learning/inference
+ Construct a lower bound with an approximate $q$ such that maximizing the lower bound minimizes the KL divergence between $q$ and the true posterior $p$
+ Classic strategy - EM algorithm (but not general enough)
+ Want to develop a more general approach to variational learning
+ Basic ideas
+ The lower bound should be maximized over a restricted family of distribution of q
+ This family of q should make it easy to compute $\mathbb{E}_q\log p(\mathbf{h},\mathbf{v})$ $\Rightarrow$ the typical way is to introduce assumptions about **how q factorizes**
+ Common approach - **factorial distribution**
+ $q(\mathbf{h}\vert\mathbf{v}) = \prod_i q(h_i\vert\mathbf{v})$
+ Called **mean field approach**
+ We do NOT need to specify a specific parametric form for q
+ The optimization problem determines the optimal probability distribution within that factorized family (a coordinate-ascent sketch follows below)
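A minimal coordinate-ascent sketch of the mean-field idea on a toy model, assuming NumPy and a made-up table for $\log p(\mathbf{h},\mathbf{v})$ over two discrete hidden variables with $\mathbf{v}$ fixed: each factor is updated as $q_i(h_i)\propto\exp\mathbb{E}_{q_{-i}}[\log p(\mathbf{h},\mathbf{v})]$, and no parametric form for $q$ is specified in advance.

```python
import numpy as np

rng = np.random.default_rng(1)

# log p(h1, h2, v) for the observed v, stored as a table over (h1, h2); values are illustrative.
K1, K2 = 3, 4
log_p = np.log(rng.dirichlet(np.ones(K1 * K2)).reshape(K1, K2))

# Mean-field factors q(h1) and q(h2), initialized uniformly.
q1 = np.full(K1, 1.0 / K1)
q2 = np.full(K2, 1.0 / K2)

def lower_bound(q1, q2):
    """E_q[log p(h, v)] + H(q) with the factorized q(h) = q(h1) q(h2)."""
    q = np.outer(q1, q2)
    return np.sum(q * log_p) - np.sum(q * np.log(q))

for it in range(50):
    # q1(h1) proportional to exp E_{q2}[log p(h1, h2, v)]
    s1 = log_p @ q2
    q1 = np.exp(s1 - s1.max()); q1 /= q1.sum()
    # q2(h2) proportional to exp E_{q1}[log p(h1, h2, v)]
    s2 = q1 @ log_p
    q2 = np.exp(s2 - s2.max()); q2 /= q2.sum()

print("lower bound after CAVI:", lower_bound(q1, q2))
```
Each update is the optimal factor given the other, so the lower bound is non-decreasing across iterations.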
## Reparameterization trick
+ Also known as stochastic back-propagation or perturbation analysis
+ Often applied in denoising autoencoders, networks with dropout, and conditional variational autoencoders
+ We use $q_{\phi}(z\vert x)$ to approximate the true posterior $p_{\theta}(z\vert x)$
+ Sample $z \sim q_{\phi}(z\vert x)$ and express it as $z=g_{\phi}(\epsilon,x)$, where $\epsilon$ is the source of randomness
+ The marginal likelihood can then be approximated as the decoder probability averaged over $z$ drawn from the recognition model
\begin{align}
p_{\theta}(x) & \simeq \mathbb{E}_{z\sim q_{\phi}(z\vert x)}[p_{\theta}(x\vert z)]
\end{align}
- Following similar steps, we arrive at the same bound as derived above
### Example: drawing samples from Gaussian distribution
+ In CVAE, we sample $z \sim q_{\phi}(z\vert x)$
+ Express $z$ as a deterministic variable $z=g_{\phi}(\epsilon,x)$, where $\epsilon$ is an auxiliary variable with independent marginal $p(\epsilon)$
+ Let $\epsilon \sim \mathcal{N}(0,1)$
+ $\mu$ and $\sigma^2$ are learned by the CVAE, since the generated $z$ is fed into the following network (the decoder)
+ So that we have $q_{\phi}(z\vert x) = \mathcal{N}(z;\mu,\sigma^2)$ and $z = \mu + \sigma\epsilon$ (a code sketch follows after this list)
+ Why do these outputs represent those parameters? WE SPECIFY THEM TO BE SO (a modeling choice)
+ Final note: use t-SNE on $\mu$ and you will see clear separation if the model is well-trained
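A minimal sketch of this Gaussian reparameterization, assuming PyTorch and made-up encoder outputs: the sample is a deterministic function of $(\mu,\log\sigma^2)$ and external noise $\epsilon$, so gradients of any downstream loss reach the encoder parameters.

```python
import torch

mu = torch.tensor([0.5, -1.0], requires_grad=True)        # encoder output: mean of q_phi(z|x)
log_var = torch.tensor([0.1, 0.3], requires_grad=True)     # encoder output: log sigma^2

eps = torch.randn_like(mu)                  # epsilon ~ N(0, I), independent of phi
z = mu + torch.exp(0.5 * log_var) * eps     # z = g_phi(eps, x) = mu + sigma * eps

# Any downstream (decoder) loss is now differentiable w.r.t. mu and log_var.
loss = (z ** 2).sum()
loss.backward()
print(mu.grad, log_var.grad)                # gradients exist because sampling was reparameterized
```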
## Variational autoencoder (VAE)
### Stochastic Gradient Variational Bayes (SGVB) estimator
+ Notation for samples and data points
- $L$ is the number of latent samples drawn per data point
- $M$ is the number of data points per minibatch
- $N$ is the number of data points in the dataset
\begin{align}
\tilde{\mathcal{L}}(\theta,\phi;x^{(i)}) & = - D_{KL}(q_{\phi}(z\vert x^{(i)})\Vert p_\theta(z)) + \mathbb{E}_{q_{\phi}(z\vert x^{(i)})}[\log p_{\theta}(x^{(i)}\vert z)] \\
\tilde{\mathcal{L}}(\theta,\phi;x^{(i)}) & \simeq - D_{KL}(q_{\phi}(z\vert x^{(i)})\Vert p_\theta(z)) + \frac{1}{L}\sum_{l=1}^L \log p_{\theta}(x^{(i)}\vert z^{(i,l)}) \\
\mathcal{L}(\theta,\phi;X) & \simeq \frac{N}{M}\sum_{i=1}^M \tilde{\mathcal{L}} (\theta,\phi;x^{(i)})\\
\text{where} \quad z^{(i,l)} & = g_{\phi}(\epsilon^{(i,l)},x^{(i)}),\quad \epsilon^{(i,l)}\sim p(\epsilon)
\end{align}
+ Neural networks are used as probabilistic encoders and decoders
+ Note that the second term in the SGVB estimator is the negative reconstruction "loss" (e.g., cross-entropy); a sketch of the minibatch estimator follows below
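A sketch of the estimator above, assuming PyTorch; `analytic_kl` and `log_px_given_z` stand in for the model-specific KL and decoder log-likelihood terms (the values in the usage line are random placeholders, only the shapes matter). The reconstruction term is averaged over $L$ samples per data point, and the minibatch bound is rescaled by $N/M$.

```python
import torch

def sgvb_minibatch_bound(analytic_kl, log_px_given_z, N):
    """Estimate the full-dataset ELBO from one minibatch.

    analytic_kl:    tensor of shape (M,)   -- D_KL(q(z|x_i) || p(z)) per data point
    log_px_given_z: tensor of shape (M, L) -- log p(x_i | z_{i,l}) for L reparameterized samples
    N:              total number of data points in the dataset
    """
    M, L = log_px_given_z.shape
    recon = log_px_given_z.mean(dim=1)            # (1/L) sum_l log p(x_i | z_{i,l})
    per_datapoint = -analytic_kl + recon          # per-datapoint bound L~(theta, phi; x_i)
    return (N / M) * per_datapoint.sum()          # minibatch estimate of L(theta, phi; X)

# Illustrative shapes only: M = 8 data points, L = 2 samples each, N = 1000.
bound = sgvb_minibatch_bound(torch.rand(8), -torch.rand(8, 2), N=1000)
print(bound)
```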
### Gaussian encoder/decoder
+ $\log p_{\theta}(x\vert z) = \log \mathcal{N}(x;\mu,\sigma^2I)$ where $\mu$ and $\log \sigma^2$ are outputs of the decoder network (extracted from activations of the final layer)
+ It is also possible to use a cross-entropy loss for the decoder (e.g., a Bernoulli decoder for binary data)
### $D_{KL}$ for Gaussian case
+ KL term will be integrated into evidence/variational lower bound that will be maximized
+ Let $J$ be the dimensionality of $z$ and prior $p(z) = \mathcal{N}(0,I)$
+ We will utilize the normalization property and definition of variance of Gaussian r.v. throughout the derivation
+ $\log$ here denotes natural logarithm
+ Gaussian differential entropy: $h_a(X) = \frac{1}{2}\log_a (2\pi e \sigma^2)$ and take $a=e$
#### Expected log-prior term
Recall that $q_{\phi}(z\vert x) = \mathcal{N}(z;\mu,\sigma^2)$ is the recognition model approximating $p_{\theta}(z\vert x)$; first compute the expectation of $\log p(z)$ under $q_{\phi}$
\begin{align}
\int_z q_{\phi}(z\vert x)\log p(z) dz & = \int_z \mathcal{N}(z;\mu,\sigma^2)\log \mathcal{N}(z;0,I)dz \\
& = -\frac{J}{2}\log(2\pi)-\frac{1}{2}\sum_{j=1}^J (\mu_j^2+\sigma_j^2)
\end{align}
#### Negative entropy term
Next, the expectation of $\log q_{\phi}(z\vert x)$ under itself (the negative entropy of the encoder distribution)
\begin{align}
\int_z q_{\phi}(z\vert x)\log q_{\phi}(z\vert x) dz & = \int_z \mathcal{N}(z;\mu,\sigma^2)\log \mathcal{N}(z;\mu,\sigma^2)dz \\
& = -\frac{J}{2}\log(2\pi)-\frac{1}{2}\sum_{j=1}^J (1+\log \sigma_j^2)
\end{align}
#### KL divergence
\begin{align}
-D_{KL}(q_{\phi}(z\vert x)\Vert p_{\theta}(z)) & = \int_z q_{\phi}(z\vert x)(\log p(z)-\log q_{\phi}(z\vert x))dz \\
& = \frac{1}{2}\sum_{j=1}^J \left(1+\log( \sigma_j^2)-\mu_j^2-\sigma_j^2 \right)
\end{align}
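A quick numerical check of this closed form, assuming NumPy and arbitrary values for $\mu$ and $\sigma$: compare it against a Monte Carlo estimate of $\mathbb{E}_{q_{\phi}(z\vert x)}[\log q_{\phi}(z\vert x) - \log p(z)]$.

```python
import numpy as np

rng = np.random.default_rng(0)
J = 5
mu = rng.normal(size=J)
sigma = rng.uniform(0.5, 2.0, size=J)

# Closed form: D_KL = -0.5 * sum_j (1 + log sigma_j^2 - mu_j^2 - sigma_j^2)
kl_closed = -0.5 * np.sum(1 + np.log(sigma**2) - mu**2 - sigma**2)

# Monte Carlo: E_q[log q(z) - log p(z)] with z ~ N(mu, diag(sigma^2)) and p(z) = N(0, I)
S = 200_000
z = mu + sigma * rng.normal(size=(S, J))
log_q = -0.5 * np.sum(np.log(2 * np.pi * sigma**2) + ((z - mu) / sigma) ** 2, axis=1)
log_p = -0.5 * np.sum(np.log(2 * np.pi) + z**2, axis=1)
kl_mc = np.mean(log_q - log_p)

print(f"closed form: {kl_closed:.4f}   Monte Carlo: {kl_mc:.4f}")
```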
### Putting it together
+ We model the approximate posterior as $q_{\phi}(z\vert x) = \mathcal{N}(z;\mu,\sigma^2I)$
+ For the $i^{th}$ data point and the $l^{th}$ generated sample
\begin{align}
z^{(i,l)} & \sim q_{\phi}(z\vert x^{(i)}) \\
z^{(i,l)} & = g_{\phi}(x^{(i)},\epsilon^{(l)}) = \mu^{(i)} + \sigma^{(i)}\odot\epsilon^{(l)}
\end{align}
where $\epsilon^{(l)}\sim\mathcal{N}(0,I)$
+ The resulting SGVB estimator for this model is then
\begin{align}
\tilde{\mathcal{L}}(\theta,\phi;x^{(i)})\simeq\frac{1}{2}\sum_{j=1}^J \left(1+\log\big((\sigma_j^{(i)})^2\big)-(\mu_j^{(i)})^2-(\sigma_j^{(i)})^2 \right) + \frac{1}{L}\sum_{l=1}^L\log p_{\theta}(x^{(i)}\vert z^{(i,l)})
\end{align}
+ Gradient descent can then be applied to jointly train $\theta$ and $\phi$ by maximizing this lower bound (a training sketch follows below)
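A minimal training sketch that puts the estimator into code, assuming PyTorch, $L=1$ sample per data point, a Bernoulli decoder, and illustrative layer sizes (the minibatch below is random filler data): the negative of $\tilde{\mathcal{L}}$ is used as the loss so that a standard optimizer maximizes the bound jointly over $\theta$ and $\phi$.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, x_dim=784, h_dim=256, z_dim=20):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.enc_mu = nn.Linear(h_dim, z_dim)
        self.enc_log_var = nn.Linear(h_dim, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(), nn.Linear(h_dim, x_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, log_var = self.enc_mu(h), self.enc_log_var(h)
        eps = torch.randn_like(mu)                        # reparameterization with L = 1
        z = mu + torch.exp(0.5 * log_var) * eps
        logits = self.dec(z)                              # Bernoulli logits of p_theta(x|z)
        return logits, mu, log_var

def negative_elbo(x, logits, mu, log_var):
    # -E_q[log p(x|z)]: Bernoulli cross-entropy, summed over dimensions, averaged over the batch
    recon = F.binary_cross_entropy_with_logits(logits, x, reduction="none").sum(dim=1)
    # D_KL(q_phi(z|x) || N(0, I)) in closed form, per data point
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp(), dim=1)
    # The constant N/M factor from the minibatch estimator only rescales gradients and is omitted.
    return (recon + kl).mean()

model = VAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(32, 784)                                   # stand-in minibatch of data in [0, 1]
logits, mu, log_var = model(x)
loss = negative_elbo(x, logits, mu, log_var)
opt.zero_grad()
loss.backward()
opt.step()
print(float(loss))
```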
## Conditional Variational Autoencoder (CVAE)
+ Adding conditions to VAE
+ The condition can be added to both the input and the latent representation (before sampling) by concatenation (see the sketch after this list)
+ The condition is not also included on the output side, because the conditional probability would not make sense that way: the condition is given, so the decoder should not be asked to produce a probability over it
+ If the model is well-trained, it should have "separable" clusters for each class when projecting $\mu$ onto a 2D-plane using t-SNE
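A sketch of one common way to wire in the conditioning, assuming PyTorch, a one-hot condition `c`, and illustrative sizes (names and data below are made up): `c` is concatenated with `x` at the encoder input and with `z` at the decoder input.

```python
import torch
import torch.nn as nn

class CVAE(nn.Module):
    def __init__(self, x_dim=784, c_dim=10, h_dim=256, z_dim=20):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim + c_dim, h_dim), nn.ReLU())
        self.enc_mu = nn.Linear(h_dim, z_dim)
        self.enc_log_var = nn.Linear(h_dim, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim + c_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim), nn.Sigmoid())

    def forward(self, x, c):
        h = self.enc(torch.cat([x, c], dim=1))            # condition enters the encoder q(z|x, c)
        mu, log_var = self.enc_mu(h), self.enc_log_var(h)
        z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)   # reparameterized sample
        x_hat = self.dec(torch.cat([z, c], dim=1))        # condition enters the decoder p(x|z, c)
        return x_hat, mu, log_var

# Usage: x is a batch of inputs, c is a one-hot condition (e.g. a class label).
x = torch.rand(32, 784)
c = torch.eye(10)[torch.randint(0, 10, (32,))]
x_hat, mu, log_var = CVAE()(x, c)
```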
## Semi-supervised learning using CVAE (M1/M2 models)
## Recurrent CGM