# Structured Probabilistic Models

###### tags: `shared` `technical`

[TOC]

## Examples

+ What is a "structured probabilistic model"?
    - A way of describing a probability distribution
    - Typically with a graph representing the interactions among random variables (nodes)
    - Examples: restricted Boltzmann machine (RBM), deep Boltzmann machine (DBM), deep belief network (DBN)

## Notations

### Conditional variational autoencoders

+ Given data $x$ and condition $c$, we want to predict $y$
+ This mechanism may be governed by a set of latent variables $z$
    - $z$ is generated from a prior $p_{\theta}(z)$
    - This $z$ is then used to recover $x$ through $p_{\theta}(x\vert z)$
    - We want to use $q_\phi (z\vert x)$ to approximate the intractable posterior $p_\theta (z\vert x)$
    - That is, to figure out which $z$ most probably generated the data point $x$
\begin{align}
\max_{z} \quad p_{\theta}(z\vert x)
\end{align}
+ An encoder-decoder architecture
    - $p_\theta (z\vert x) \simeq q_{\phi}(z\vert x)$
    - Encoder: $q_\phi(z\vert x)$ (recognition)
    - Decoder: $p_\theta(x\vert z)$ (generation)
    - $\phi$ and $\theta$ are learned together

## Evidence Lower BOund (ELBO)

### Formulation

+ Sometimes also called the "variational free energy"
+ $\log p_{\theta}(x)$ is independent of $z$ and fixed once the data is given, so its expectation under any distribution over $z$ is itself
\begin{align}
\log p_{\theta}(x) & = \mathbb{E}_{q_{\phi}(z\vert x)}[\log p_{\theta}(x)]
\end{align}
+ Expanding the data likelihood using $p_{\theta}(x) = p_{\theta}(x,z)/p_{\theta}(z\vert x)$
\begin{align}
p_{\theta}(x) & = \frac{q_{\phi}(z\vert x)}{p_{\theta}(z\vert x)}\frac{p_{\theta}(x,z)}{q_{\phi}(z\vert x)}
\end{align}
+ Using the normalization property of $q_{\phi}(z\vert x)$ (i.e., $\int_z q_{\phi}(z\vert x)\,dz = 1$) and plugging the expansion above into the log-likelihood
\begin{align}
\log p_\theta (x) & = \int_{z} q_{\phi}(z\vert x) \log p_\theta (x) dz \\
& = D_{KL}(q_{\phi}(z\vert x)\Vert p_\theta(z\vert x)) + \mathbb{E}_{q_{\phi}(z\vert x)}[\log p_{\theta}(x,z) - \log q_{\phi}(z\vert x)]
\end{align}
+ The second term in the equation above is the Evidence Lower BOund (ELBO) $\mathcal{L} (\theta,\phi;x)$
\begin{align}
\mathcal{L} (\theta,\phi;x) & = \log p_{\theta}(x) - D_{KL}(q_{\phi}(z\vert x)\Vert p_\theta(z\vert x)) \\
& = \log p_{\theta}(x) - \mathbb{E}_{q_{\phi}(z\vert x)}[\log q_{\phi}(z\vert x) - \log p_{\theta}(x,z) + \log p_{\theta}(x)] \\
& = - \mathbb{E}_{q_{\phi}(z\vert x)}[\log q_{\phi}(z\vert x) - \log p_{\theta}(x,z)] \quad \text{($\log p_{\theta}(x)$ does not depend on $z$, so it passes through the expectation)} \\
& = \mathbb{E}_{q_{\phi}(z\vert x)}[\log p_{\theta}(x,z)] + H(q_{\phi}(z\vert x))
\end{align}
    - Note that if $q_{\phi}(z\vert x) = p_{\theta}(z\vert x)$, the bound is tight:
\begin{equation}
\mathcal{L} (\theta,\phi;x) = \log p_{\theta} (x)
\end{equation}

### Alternative formulation

+ By expanding $\log p_{\theta}(x,z) = \log p_{\theta}(x\vert z) + \log p_{\theta}(z)$ and combining the prior term with the entropy
\begin{align}
\mathcal{L} (\theta,\phi;x) & = \mathbb{E}_{q_{\phi}(z\vert x)}[\log p_{\theta}(x,z)] + H(q_{\phi}(z\vert x)) \\
& = \mathbb{E}_{q_{\phi}(z\vert x)}[\log p_{\theta}(x\vert z)] - D_{KL}(q_{\phi}(z\vert x)\Vert p_\theta(z)) \\
& \leq \log p_{\theta}(x)
\end{align}

### Objectives

+ $D_{KL}(\bullet)$ is always non-negative and $\log p_\theta (x)$ is fixed
+ We want the approximation $q_{\phi}(z\vert x)$ to be close to the true posterior $p_{\theta}(z\vert x)$, but the true posterior is intractable
+ Instead we maximize the ELBO, which drives $D_{KL}(q_{\phi}(z\vert x)\Vert p_\theta(z\vert x))\rightarrow 0$
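To make the bound concrete, here is a minimal Monte Carlo sketch of the ELBO for a toy one-dimensional conjugate-Gaussian model where $\log p_{\theta}(x)$ is available in closed form, so the bound and its tightness at $q_{\phi}(z\vert x) = p_{\theta}(z\vert x)$ can be checked numerically. It assumes NumPy/SciPy; the model, the value of `x`, and all variable names are illustrative choices, not from this note.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Toy model (illustrative): prior z ~ N(0, 1), likelihood x | z ~ N(z, 1).
# Exact marginal: p(x) = N(0, 2); exact posterior: p(z | x) = N(x/2, 1/2).
x = 1.5

def elbo(mu_q, sigma_q, num_samples=100_000):
    """Monte Carlo estimate of E_q[log p(x, z) - log q(z | x)]."""
    z = mu_q + sigma_q * rng.standard_normal(num_samples)          # z ~ q(z | x)
    log_joint = norm.logpdf(x, loc=z, scale=1.0) + norm.logpdf(z)  # log p(x | z) + log p(z)
    log_q = norm.logpdf(z, loc=mu_q, scale=sigma_q)
    return np.mean(log_joint - log_q)

log_px = norm.logpdf(x, loc=0.0, scale=np.sqrt(2.0))      # exact log-evidence
print("log p(x)                :", log_px)
print("ELBO with q = N(0, 1)   :", elbo(0.0, 1.0))             # strictly below log p(x)
print("ELBO with q = posterior :", elbo(x / 2, np.sqrt(0.5)))  # matches log p(x)
```

When `q` equals the true posterior, every sample gives exactly $\log p_{\theta}(x,z) - \log q_{\phi}(z\vert x) = \log p_{\theta}(x)$, so the gap $D_{KL}(q_{\phi}(z\vert x)\Vert p_{\theta}(z\vert x))$ vanishes, which is precisely the objective described above.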
## Expectation Maximization (EM) Algorithm

## Mean-field approximation/approach

+ Variational learning/inference
    - Construct a lower bound with an approximate $q$ such that maximizing the lower bound also minimizes the KL-divergence between $q$ and $p$
    - Classic strategy - the EM algorithm (but not general enough)
    - We want to develop a more general approach to variational learning
+ Basic ideas
    - The lower bound should be maximized over a restricted family of distributions for $q$
    - This family of $q$ should make it easy to compute $\mathbb{E}_q\log p(\mathbf{h},\mathbf{v})$ $\Rightarrow$ the typical way is to introduce assumptions about **how q factorizes**
+ Common approach - **factorial distribution**
    - $q(\mathbf{h}\vert\mathbf{v}) = \prod_i q(h_i\vert\mathbf{v})$
    - This is called the **mean field approach**
    - We do NOT need to specify a specific parametric form for $q$
    - The optimization problem determines the optimal probability distribution within that factorization

## Reparameterization trick

+ Also known as stochastic back-propagation or perturbation analysis
+ Often applied in denoising autoencoders, networks with dropout, and conditional variational autoencoders
+ We use $q_{\phi}(z\vert x)$ to approximate the true posterior $p_{\theta}(z\vert x)$
+ Instead of sampling $z \sim q_{\phi}(z\vert x)$ directly, write $z = g_{\phi}(\epsilon,x)$, where $\epsilon$ is the source of randomness
+ The marginal likelihood is the reconstruction probability averaged over the latent code
\begin{align}
p_{\theta}(x) & = \mathbb{E}_{z\sim p_{\theta}(z)}[p_{\theta}(x\vert z)]
\end{align}
    - In practice the expectation is taken over samples from $q_{\phi}(z\vert x)$; following a procedure similar to the one above, this leads to the same ELBO

### Example: drawing samples from Gaussian distribution

+ In a CVAE we sample $z \sim q_{\phi}(z\vert x)$
+ Express $z$ as a deterministic variable $z = g_{\phi}(\epsilon,x)$, where $\epsilon$ is an auxiliary variable with an independent marginal $p(\epsilon)$
+ Let $\epsilon \sim \mathcal{N}(0,1)$ and set $z = \mu + \sigma\epsilon$, so that $q_{\phi}(z\vert x) = \mathcal{N}(z;\mu,\sigma^2)$
+ $\mu$ and $\sigma^2$ are learned jointly with the rest of the CVAE, since the generated $z$ is fed into the following network (the decoder)
+ Why represent the parameters this way? WE SPECIFY THEM TO BE SO
+ Final note: run t-SNE on $\mu$ and you will see clear separation between classes if the model is well trained

## Variational autoencoder (VAE)

### Stochastic Gradient Variational Bayes (SGVB) estimator

+ Bookkeeping for samples and data points
    - $L$ is the number of latent samples per data point
    - $M$ is the number of data points per minibatch
    - $N$ is the number of data points in the dataset
\begin{align}
\tilde{\mathcal{L}} & = - D_{KL}(q_{\phi}(z\vert x^{(i)})\Vert p_\theta(z)) + \mathbb{E}_{q_{\phi}(z\vert x^{(i)})}[\log p_{\theta}(x^{(i)}\vert z)] \\
\tilde{\mathcal{L}} & \simeq - D_{KL}(q_{\phi}(z\vert x^{(i)})\Vert p_\theta(z)) + \frac{1}{L}\sum_{l=1}^L \log p_{\theta}(x^{(i)}\vert z^{(i,l)}) \\
\mathcal{L} & \simeq \frac{N}{M}\sum_{i=1}^M \tilde{\mathcal{L}} (\theta,\phi;x^{(i)})\\
\text{where} \quad z^{(i,l)} & = g_{\phi}(\epsilon^{(i,l)},x^{(i)}), \quad \epsilon^{(i,l)}\sim p(\epsilon)
\end{align}
+ Neural networks are used as probabilistic encoders and decoders
+ Note that the second term in the SGVB estimator is a negative "loss" (cross-entropy/reconstruction term)

### Gaussian encoder/decoder

+ $\log p_{\theta}(x\vert z) = \log \mathcal{N}(x;\mu,\sigma^2I)$, where $\mu$ and $\log \sigma^2$ are learned (read off from the activations of the decoder's final layer)
+ It is also possible to use a cross-entropy loss for the decoder (Bernoulli outputs)

### $D_{KL}$ for Gaussian case

+ The KL term is integrated into the evidence/variational lower bound that will be maximized
+ Let $J$ be the dimensionality of $z$ and the prior $p_{\theta}(z) = \mathcal{N}(0,I)$
+ We approximate $p_{\theta}(z\vert x)$ with $q_{\phi}(z\vert x) = \mathcal{N}(z;\mu,\sigma^2)$ and compute the KL term by term
+ We will use the normalization property and the definition of the variance of a Gaussian r.v. throughout the derivation
+ $\log$ here denotes the natural logarithm
+ Gaussian differential entropy: $h_a(X) = \frac{1}{2}\log_a (2\pi e \sigma^2)$, with $a = e$

#### Expected log-prior term

\begin{align}
\int_z q_{\phi}(z\vert x)\log p_{\theta}(z) dz & = \int_z \mathcal{N}(z;\mu,\sigma^2)\log \mathcal{N}(z;0,I)dz \\
& = -\frac{J}{2}\log(2\pi)-\frac{1}{2}\sum_{j=1}^J (\mu_j^2+\sigma_j^2)
\end{align}

#### Negative encoder entropy term

\begin{align}
\int_z q_{\phi}(z\vert x)\log q_{\phi}(z\vert x) dz & = \int_z \mathcal{N}(z;\mu,\sigma^2)\log \mathcal{N}(z;\mu,\sigma^2)dz \\
& = -\frac{J}{2}\log(2\pi)-\frac{1}{2}\sum_{j=1}^J (1+\log \sigma_j^2)
\end{align}

#### KL divergence

\begin{align}
-D_{KL}(q_{\phi}(z\vert x)\Vert p_{\theta}(z)) & = \int_z q_{\phi}(z\vert x)\left(\log p_{\theta}(z)-\log q_{\phi}(z\vert x)\right)dz \\
& = \frac{1}{2}\sum_{j=1}^J \left(1+\log \sigma_j^2-\mu_j^2-\sigma_j^2 \right)
\end{align}

### Put it together

+ We take the approximate posterior to be Gaussian, $q_{\phi}(z\vert x^{(i)}) = \mathcal{N}(z;\mu^{(i)},(\sigma^{(i)})^2 I)$
+ For the $i^{th}$ data point and the $l^{th}$ generated sample
\begin{align}
z^{(i,l)} & \sim q_{\phi}(z\vert x^{(i)}) \\
z^{(i,l)} & = g_{\phi}(x^{(i)},\epsilon^{(l)}) = \mu^{(i)} + \sigma^{(i)}\odot\epsilon^{(l)}
\end{align}
where $\epsilon^{(l)}\sim\mathcal{N}(0,I)$
+ The resulting SGVB estimator for this model is then
\begin{align}
\tilde{\mathcal{L}}(\theta,\phi;x^{(i)})\simeq\frac{1}{2}\sum_{j=1}^J \left(1+\log (\sigma_j^{(i)})^2-(\mu_j^{(i)})^2-(\sigma_j^{(i)})^2 \right) + \frac{1}{L}\sum_{l=1}^L\log p_{\theta}(x^{(i)}\vert z^{(i,l)})
\end{align}
+ Gradient ascent (equivalently, gradient descent on the negative bound) can then be applied to jointly train $\theta$ and $\phi$ by maximizing this lower bound; a code sketch follows the CVAE section below

## Conditional Variational Autoencoder (CVAE)

+ A VAE with conditions added
+ The condition can be added by concatenation, both at the input and at the latent representation (before sampling)
+ The condition is not also generated at the output, because the conditional probability would not make sense that way (the condition is given, so the model should not produce a probability over it)
+ If the model is well trained, projecting $\mu$ onto a 2D plane with t-SNE should show separable clusters for each class
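Tying together the pieces above (the reparameterization trick, the closed-form Gaussian KL, and the SGVB estimator with $L=1$), here is a minimal VAE sketch. It assumes PyTorch; the layer sizes, variable names, and the choice of a Bernoulli (binary cross-entropy) decoder are illustrative assumptions, not prescribed by this note.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    """Minimal VAE: Gaussian encoder q_phi(z|x), Bernoulli decoder p_theta(x|z)."""

    def __init__(self, x_dim=784, h_dim=256, z_dim=20):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.enc_mu = nn.Linear(h_dim, z_dim)       # mu of q_phi(z|x)
        self.enc_logvar = nn.Linear(h_dim, z_dim)   # log sigma^2 of q_phi(z|x)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))  # logits of p_theta(x|z)

    def reparameterize(self, mu, logvar):
        # z = mu + sigma * eps with eps ~ N(0, I): the randomness lives in eps,
        # so gradients flow to mu and logvar (the reparameterization trick).
        eps = torch.randn_like(mu)
        return mu + torch.exp(0.5 * logvar) * eps

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        z = self.reparameterize(mu, logvar)
        return self.dec(z), mu, logvar

def negative_elbo(x, logits, mu, logvar):
    # Reconstruction term: -log p_theta(x|z) with a single latent sample (L = 1).
    recon = F.binary_cross_entropy_with_logits(logits, x, reduction="sum")
    # Closed-form KL(q_phi(z|x) || N(0, I)) = -1/2 * sum(1 + log sigma^2 - mu^2 - sigma^2).
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl   # minimizing this maximizes the SGVB estimate of the ELBO
```

A training step on a minibatch `x` with entries in $[0,1]$ would compute `logits, mu, logvar = model(x)` and then minimize `negative_elbo(x, logits, mu, logvar)` with any optimizer; the $N/M$ scaling from the SGVB estimator is absorbed into the learning rate here.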
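Extending the same sketch to a CVAE: as described above, the condition $c$ (e.g., a one-hot class label) is concatenated to the encoder input and, in the common variant sketched here, to the sampled $z$ before decoding. Layer sizes and names are again illustrative.

```python
import torch
import torch.nn as nn

class CVAE(nn.Module):
    """Minimal CVAE: the condition c is concatenated to the encoder input
    and to the latent code before decoding (illustrative sizes)."""

    def __init__(self, x_dim=784, c_dim=10, h_dim=256, z_dim=20):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim + c_dim, h_dim), nn.ReLU())
        self.enc_mu = nn.Linear(h_dim, z_dim)
        self.enc_logvar = nn.Linear(h_dim, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim + c_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))

    def forward(self, x, c):
        h = self.enc(torch.cat([x, c], dim=-1))                   # condition joins the input
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterized sample
        logits = self.dec(torch.cat([z, c], dim=-1))              # condition joins the latent code
        return logits, mu, logvar

    def generate(self, c):
        # At test time, draw z from the prior N(0, I) and decode it under the desired condition.
        z = torch.randn(c.shape[0], self.enc_mu.out_features, device=c.device)
        return torch.sigmoid(self.dec(torch.cat([z, c], dim=-1)))
```

The training objective is the same `negative_elbo` as in the VAE sketch; only the encoder and decoder inputs carry the condition. If the model trains well, running t-SNE on the `mu` outputs should show the class-wise separation mentioned above.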
## Semi-supervised learning using CVAE (M1/M2 models)

## Recurrent CGM