# Optimization Techniques for Parameters of Linear Effect Model Edges in a Bayesian Network

---

## Definitions

Suppose we have the following Markov chain as the ground truth model:

![](https://i.imgur.com/PnvQAX3.jpg)

$X \sim \mathcal{N}(0, 1)$

$Y \sim \mathcal{GP}(\beta_XX, K_Y + \sigma_n^2I)$

$Z \sim \mathcal{GP}(\beta_YY, K_Z + \sigma_n^2I)$

$\sigma_n^2 = 1$

We want to optimize for $\beta_X, \beta_Y$ (weights of the linear mean functions) and $\theta$ (kernel hyperparameters).

## Overview

The general framework for fitting parameters in the presence of latent variables is the following:

1. **Marginalization:** We marginalize out the latent node $Y$ from the joint distribution of the observed nodes.
   $$
   p(X_i, Z_i) = \int p(X_i, Y_i, Z_i) dY_i
   $$
   If we are lucky, $\int p(X_i, Y_i, Z_i) dY_i$ has a closed form that doesn't depend on $Y$. If the integral is difficult to compute, we have to use approximation techniques.

2. **Maximization:** After marginalizing out the latent variables, our objective is:
   $$
   \beta^*, \theta^* = \arg \max \sum_{i = 1}^N \log p(X_i, Z_i) = \arg \max \sum_{i = 1}^N \log \int p(X_i, Y_i, Z_i) dY_i
   $$
   We optimize using our favorite gradient-based technique.

## Methods

### Closed Form Marginalization (Probabilistic Principal Component Analysis, GP-LVM)

Refs: [Tipping and Bishop, 1999], [[Lawrence, 2005]](https://www.jmlr.org/papers/volume6/lawrence05a/lawrence05a.pdf)

Sometimes we get lucky and integrating over the latent variables gives a nice closed form. For example:

Let $y_n = Wx_n + \epsilon_n$, where $\epsilon_n \sim \mathcal{N}(0, \sigma_n^2)$, $y_n \in \mathbb{R}$, $x_n \in \mathbb{R}^D$, and $W \in \mathbb{R}^{1 \times D}$, with $x_n$ unobserved (latent) for all $n$. Suppose we want to optimize over the $x_n$ directly, so $W$ is the latent variable we marginalize out.

**Marginalization:** We assume a prior on $W$ s.t. $p(W) = \mathcal{MVN}_{D}(0, I)$, and marginalize out the weights $W$:
$$
p(y_n|x_n, \sigma_n^2) = \int p(y_n|x_n, W, \sigma_n^2)p(W)dW
$$
Since the integrand is a product of two Gaussian densities and $y_n$ is linear in $W$, the marginal likelihood is a closed-form Gaussian. Collecting the data into $Y \in \mathbb{R}^N$ and $X \in \mathbb{R}^{N \times D}$:
$$
p(Y|X, \sigma_n^2) = \mathcal{N}(0, XX^T + \sigma_n^2I)
$$

**Maximization:** We are also lucky that the likelihood is simply a Gaussian density, so maximization just involves finding a stationary point of the gradient:
$$
\frac{\partial \mathcal{L}}{\partial X} = K^{-1}YY^TK^{-1}X-K^{-1}X
$$
where $\mathcal{L} = \log p(Y|X, \sigma_n^2)$ and $K = XX^T + \sigma_n^2I$.

This is the derivation for Probabilistic Principal Component Analysis. To obtain the GP Latent Variable Model, we simply replace $x_n$ with $\phi(x_n)$, where $\phi$ is a feature map associated with some kernel $K$.

**Marginalization:**
$$
p(Y|X, \sigma_n^2) = \mathcal{N}(0, K), \quad K := \Phi\Phi^T + \sigma_n^2I
$$

**Maximization:** Since the feature mapping adds non-linearity to the objective, there is no longer a closed-form solution, so we have to use gradient-based methods.
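As a quick illustration of the maximization step, here is a minimal numpy sketch of the linear-kernel (dual PPCA) case, using the analytic gradient above; the toy data, step size, and initialization are made up for illustration. For the GP-LVM, $K$ and $\partial K / \partial X$ would instead come from the kernel, so in practice one would rely on automatic differentiation or an existing GP library.

```python
import numpy as np

def log_marginal_likelihood(X, Y, sigma_n2):
    """log N(Y | 0, K) with K = X X^T + sigma_n2 * I (one-dimensional output)."""
    N = X.shape[0]
    K = X @ X.T + sigma_n2 * np.eye(N)
    _, logdet = np.linalg.slogdet(K)
    alpha = np.linalg.solve(K, Y)                  # K^{-1} Y
    return -0.5 * (logdet + (Y.T @ alpha).item() + N * np.log(2 * np.pi))

def grad_X(X, Y, sigma_n2):
    """dL/dX = K^{-1} Y Y^T K^{-1} X - K^{-1} X (the gradient above)."""
    N = X.shape[0]
    K = X @ X.T + sigma_n2 * np.eye(N)
    Kinv = np.linalg.inv(K)
    alpha = Kinv @ Y                               # K^{-1} Y, shape (N, 1)
    return (alpha @ alpha.T) @ X - Kinv @ X

# Toy data and plain gradient ascent on the latent X (step size is arbitrary).
rng = np.random.default_rng(0)
N, D, sigma_n2, lr = 50, 2, 1.0, 1e-3
Y = rng.normal(size=(N, 1))
X = 0.1 * rng.normal(size=(N, D))
for _ in range(2000):
    X += lr * grad_X(X, Y, sigma_n2)
print(log_marginal_likelihood(X, Y, sigma_n2))
```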
### Expectation Maximization

Refs: [[Neal and Hinton, 1998]](https://www.cs.toronto.edu/~hinton/absps/emk.pdf)

Recall our objective function:
$$
\beta^*, \theta^* = \arg \max \sum_{i = 1}^N \log \int p(X_i, Y_i, Z_i) dY_i
$$
Suppose the integral is difficult to compute. We introduce an auxiliary distribution $q(Y_i)$:
$$
\arg \max \sum_{i = 1}^N \log \int p(X_i, Y_i, Z_i) dY_i
$$
$$
= \arg \max \sum_{i = 1}^N \log \int \frac{p(X_i, Y_i, Z_i)}{q(Y_i)} \, q(Y_i) dY_i
$$
$$
= \arg \max \sum_{i = 1}^N \log \mathbb{E}_{q(Y_i)}\bigg[\frac{p(X_i, Y_i, Z_i)}{q(Y_i)}\bigg] \geq \arg \max \sum_{i = 1}^N \mathbb{E}_{q(Y_i)}\bigg[\log \frac{p(X_i, Y_i, Z_i)}{q(Y_i)}\bigg]
$$
where the inequality is Jensen's inequality. For fixed $q$, the entropy term $-\mathbb{E}_{q(Y_i)}[\log q(Y_i)]$ doesn't depend on $\beta, \theta$, so maximizing the lower bound over the parameters reduces to
$$
\arg \max \sum_{i = 1}^N \mathbb{E}_{q(Y_i)}\bigg[\log p(X_i, Y_i, Z_i)\bigg]
$$
EM iterates between:

**E-Step:** We fix the parameters and compute the posterior, which maximizes the lower bound over $q$:
$$
q^*(Y_i) = p_{\theta, \beta}(Y_i|X_i, Z_i) = \underset{q}{\arg \max} \sum_{i = 1}^N \mathbb{E}_{q(Y_i)}\bigg[\log \frac{p(X_i, Y_i, Z_i)}{q(Y_i)}\bigg]
$$

**M-Step:** We use the computed posterior $q^* = p_{\theta, \beta}(Y_i|X_i, Z_i)$ to maximize the expectation:
$$
\beta^*, \theta^* = \arg \max \sum_{i = 1}^N \mathbb{E}_{q^*(Y_i)}\bigg[\log p(X_i, Y_i, Z_i)\bigg]
$$

Sometimes the E-Step is difficult because the posterior is hard to calculate, so we can instead use an MM approach and do coordinate ascent. Note: MM is just EM where we set $q$ to a delta distribution centered at the MAP value of $Y$, so the expectation disappears.

**Step 1:** We fix the parameters and optimize:
$$
Y^* = \underset{Y}{\arg \max} \sum_{i = 1}^N \log p(X_i, Y_i, Z_i)
$$

**Step 2:** We fix the imputed data points $Y^*$ and optimize for $\beta^*, \theta^*$:
$$
\beta^*, \theta^* = \arg \max \sum_{i = 1}^N \log p(X_i, Y_i^*, Z_i)
$$
(A code sketch of this coordinate ascent for the $X \to Y \to Z$ chain is given at the end of this note.)

### Reparameterization Trick

Recall our objective function:
$$
\beta^*, \theta^* = \arg \max \log \int p(X, Y, Z) dY
$$
where we now let $X, Y, Z$ be vectors of all $N$ datapoints. Suppose the integral is difficult to compute; we use a sampling-based method (Monte Carlo estimation) to approximate it.

**Marginalization:** We first decompose the integrand using the dependence relationships and rearrange terms to create an expectation:
$$\int p(X, Y, Z) dY = \int p(X) p(Y|X) p(Z|Y) dY$$
$$= p(X) \int p(Z|Y) p(Y|X) dY$$
$$= p(X) \mathbb{E}_{Y|X}[p(Z|Y)]$$
Notice that $Y \sim \mathcal{GP}(\beta_XX, K_Y(X, X) + \sigma_n^2I) \implies Y | X \sim \mathcal{MVN}_N(\beta_XX, K_Y + \sigma_n^2I)$.

By a property of the multivariate normal, we can decompose $Y$ as
$$Y | X = A_{\theta}B + \beta_XX$$
where $B \sim \mathcal{MVN}_N(0, I)$ and $A_{\theta}$ satisfies $K_Y(X, X) + \sigma_n^2I = A_{\theta}A_{\theta}^T$ (e.g. the Cholesky factor).

Our procedure now, on every iteration, draws $M$ samples $B \sim \mathcal{MVN}_N(0, I)$, computes the corresponding $M$ values of $Y | X$, and uses them to approximate:
$$
\mathbb{E}_{Y|X}[p(Z|Y)] \approx \frac{1}{M}\sum_{m = 1}^M p(Z|Y_m)
$$

**Maximization:**
$$\beta^*, \theta^* = \arg \max \log \bigg[\frac{1}{M}\sum_{m = 1}^M p(Z|Y_m)\bigg]$$
Note that we were able to drop $p(X)$ because it doesn't depend on the parameters we are maximizing.
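Here is a minimal numpy/scipy sketch of this Monte Carlo estimator for the $X \to Y \to Z$ chain, assuming scalar inputs and an RBF kernel for $K_Y$ and $K_Z$; the parameter names are just placeholders. In practice the gradients with respect to $(\beta, \theta)$ would be obtained by automatic differentiation through the fixed samples $B$ (e.g. with JAX or PyTorch); only the estimator itself is shown.

```python
import numpy as np
from scipy.stats import multivariate_normal

def rbf_kernel(a, b, lengthscale, variance):
    """Squared-exponential kernel on 1-D inputs."""
    return variance * np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / lengthscale ** 2)

def mc_log_likelihood(params, X, Z, M=64, sigma_n2=1.0, rng=None):
    """log[(1/M) * sum_m p(Z | Y_m)] with Y_m = beta_x * X + A_theta @ B_m."""
    beta_x, beta_y, ls_y, var_y, ls_z, var_z = params
    if rng is None:
        rng = np.random.default_rng(0)
    N = X.shape[0]
    K_Y = rbf_kernel(X, X, ls_y, var_y) + sigma_n2 * np.eye(N)
    A_theta = np.linalg.cholesky(K_Y)              # K_Y + sigma_n2 * I = A A^T
    log_pZ = np.empty(M)
    for m in range(M):
        B = rng.standard_normal(N)
        Y_m = beta_x * X + A_theta @ B             # reparameterized draw of Y | X
        K_Z = rbf_kernel(Y_m, Y_m, ls_z, var_z) + sigma_n2 * np.eye(N)
        log_pZ[m] = multivariate_normal.logpdf(Z, mean=beta_y * Y_m, cov=K_Z)
    # log of the Monte Carlo average, computed stably in log space
    return np.logaddexp.reduce(log_pZ) - np.log(M)
```

This estimate could then be maximized over `params` with an autodiff- or gradient-free optimizer; re-drawing $B$ at every evaluation makes the objective noisy, which is exactly the issue raised in the Pros and Cons below.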
### Deep GPs

TBD - Still chugging along and reading papers T__T

#### 2-Layer GP-LVM Model (MAP Solution)

[[Lawrence, 2007]](https://icml.cc/imls/conferences/2007/proceedings/papers/408.pdf)

## Pros and Cons

* Closed Form Marginalization: Simple and straightforward, but the integrals are a pain to compute and most models don't have a closed form.
* EM/MM: MM is slow and EM is the standard approach, but the posterior in the E-Step can be difficult to compute and may need to be approximated (Variational EM).
* Reparameterization Trick: Great and straightforward, but it's not clear whether it works with more complex graphs. The randomness in drawing the normal samples $B$ means the Monte Carlo likelihood estimate changes from iteration to iteration, so the objective value can be inaccurate while we're optimizing. Empirically, this problem seems to go away as the number of datapoints increases.
* Deep GPs: Anna is still reading *sweat smile*
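Finally, here is the coordinate-ascent (MM) sketch referenced in the EM section above, for the $X \to Y \to Z$ chain. It is a minimal version that, for brevity, holds the kernel hyperparameters fixed and only re-fits $\beta_X, \beta_Y$ in Step 2 (the full scheme would also update $\theta$), and it leans on scipy's generic optimizer; it is meant to show the structure of the two alternating steps, not an efficient implementation.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import multivariate_normal

def rbf_kernel(a, b, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel on 1-D inputs (hyperparameters held fixed here)."""
    return variance * np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / lengthscale ** 2)

def log_joint(Y, beta_x, beta_y, X, Z, sigma_n2=1.0):
    """log p(Y | X) + log p(Z | Y); log p(X) is constant in Y, beta, theta."""
    N = X.shape[0]
    log_pY = multivariate_normal.logpdf(
        Y, mean=beta_x * X, cov=rbf_kernel(X, X) + sigma_n2 * np.eye(N))
    log_pZ = multivariate_normal.logpdf(
        Z, mean=beta_y * Y, cov=rbf_kernel(Y, Y) + sigma_n2 * np.eye(N))
    return log_pY + log_pZ

def mm_fit(X, Z, n_outer=20):
    """Alternate between imputing Y (Step 1) and re-fitting the betas (Step 2)."""
    Y, beta = X.copy(), np.array([1.0, 1.0])       # arbitrary initial guesses
    for _ in range(n_outer):
        # Step 1: Y* = argmax_Y log p(X, Y, Z) with parameters fixed
        Y = minimize(lambda y: -log_joint(y, *beta, X, Z), Y).x
        # Step 2: beta* = argmax_beta log p(X, Y*, Z) with Y* fixed
        beta = minimize(lambda b: -log_joint(Y, *b, X, Z), beta).x
    return Y, beta
```

Even on small $N$ this makes the "MM is slow" point above concrete: Step 1 is an $N$-dimensional optimization repeated at every outer iteration.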