# Lecture #14 Summary: Review of Latent Variables, Compression and Clustering
### CS181
### Spring 2023

![](https://i.imgur.com/xDR9VQd.png)

### Example: Gaussian Mixture Models (GMMs)
In a ***Gaussian Mixture Model (GMM)***, we posit that the observed data $Y$ is generated by a mixture, $\pi=[\pi_1, \ldots, \pi_K]$, of $K$ Gaussians with means $\mu = [\mu_1, \ldots, \mu_K]$ and covariances $\Sigma = [\Sigma_1, \ldots, \Sigma_K]$. For each observation $Y_n$, the class of the observation, $Z_n$, is a latent variable that indicates which of the $K$ Gaussians is responsible for generating $Y_n$:
$$
\begin{aligned}
Z_n &\sim \mathrm{Cat}(\pi),\\
Y_n | Z_n &\sim \mathcal{N}(\mu_{Z_n}, \Sigma_{Z_n}),
\end{aligned}
$$
where $n=1, \ldots, N$ and $\sum_{k=1}^K \pi_k = 1$.

GMMs are examples of ***model-based clustering***: breaking up a data set into natural clusters based on a statistical model fitted to the data.

### Example: Item-Response Models
In ***item-response models***, we measure a real-valued unobserved trait $Z$ of a subject by performing a series of experiments with binary observable outcomes, $Y$:
$$
\begin{aligned}
Z_n &\sim \mathcal{N}(\mu, \sigma^2),\\
\theta_n &= g(Z_n),\\
Y_n | Z_n &\sim \mathrm{Ber}(\theta_n),
\end{aligned}
$$
where $n=1, \ldots, N$ and $g$ is some fixed function of $Z_n$.

#### Applications
Item-response models are used to model the way "underlying intelligence" $Z$ relates to scores $Y$ on IQ tests. They can also be used to model the way "suicidality" $Z$ relates to answers on mental health surveys. Building a good model may help us infer when a patient is at psychiatric risk based on intake surveys at points of care throughout the health-care system.

### Example: Factor Analysis Models
In ***factor analysis models***, we posit that the observed data $Y$, consisting of many measurements, is generated by a small set of unobserved factors $Z$:
$$
\begin{aligned}
Z_n &\sim \mathcal{N}(0, I),\\
Y_n | Z_n &\sim \mathcal{N}(\mu + \Lambda Z_n, \Phi),
\end{aligned}
$$
where $n=1, \ldots, N$, $Z_n\in \mathbb{R}^{D'}$, and $Y_n\in \mathbb{R}^{D}$. We typically assume that $D'$ is much smaller than $D$.

#### Applications
Factor analysis models are useful for biomedical data, where we typically measure a large number of characteristics of a patient (e.g. blood pressure, heart rate, etc.), but these characteristics are all generated by a small list of health factors (e.g. diabetes, cancer, hypertension, etc.). Building a good model means we may be able to infer the list of health factors of a patient from their observed measurements.
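To make the generative story concrete, here is a minimal sketch (in Python/NumPy) of sampling from a factor analysis model. The dimensions, parameter values, and the diagonal choice of $\Phi$ are made up for illustration.

```python
import numpy as np

# A made-up factor analysis setup: D' = 2 latent factors generate
# D = 5 observed measurements for each of N subjects.
rng = np.random.default_rng(0)
N, D_latent, D_obs = 1000, 2, 5

mu = rng.normal(size=D_obs)                        # observation means
Lam = rng.normal(size=(D_obs, D_latent))           # factor loadings (Lambda)
Phi = np.diag(rng.uniform(0.1, 0.5, size=D_obs))   # noise covariance (taken diagonal here)

Z = rng.normal(size=(N, D_latent))                 # Z_n ~ N(0, I)
noise = rng.multivariate_normal(np.zeros(D_obs), Phi, size=N)
Y = mu + Z @ Lam.T + noise                         # Y_n | Z_n ~ N(mu + Lambda Z_n, Phi)
print(Y.shape)                                     # (1000, 5)
```

Each row of `Y` is a vector of $D=5$ correlated measurements, even though only $D'=2$ latent factors drive the variation; fitting the model amounts to inverting this process to infer $Z$ (and $\Lambda$) from $Y$ alone.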
## PCA Versus Probabilistic PCA (pPCA)
Suppose that our data $\mathbf{X}$ is centered (if the data is not centered, then reasoning about PCA as a projection that minimizes reconstruction loss requires uglier formulae). Let's also assume that $z_n$ is one-dimensional and $\mathbf{x}_n$ is two-dimensional.

| | PCA | pPCA |
| -------- | -------- | -------- |
|*Model* | $z_n = v^\top \mathbf{x}_n \in \mathbb{R}$ *(Projection)*, $\hat{\mathbf{x}}_n = vz_n \in \mathbb{R}^2$ *(Reconstruction)* | $z_n\sim \mathcal{N}(0, 1)$, $\mathbf{x}_n\| z_n \sim \mathcal{N}(vz_n, \sigma^2 I)$ |
|*Goal*|Minimize Reconstruction Loss| Maximize Observed Log-likelihood|
|*Training Objective* | $\mathrm{min}_v \sum_{n=1}^N\\| \mathbf{x}_n - vz_n\\|^2_2$, subject to $\\|v\\|_2 =1$ | $\mathrm{max}_{v, q({z}_n)}\; \sum_{n=1}^N\mathrm{ELBO}(v, q({z}_n))$ (Optional: $\\|v\\|_2=1$)|
|*Good For*|Compression: $\mathbf{x}_n \mapsto z_n$| Modeling the data distribution: $\mathbf{x}_n \sim p(\mathbf{x}_n\| v)$|
||| Generating new data: $z^\mathrm{new}_n\sim \mathcal{N}(0, 1)$, $\mathbf{x}^\mathrm{new}_n\| z^\mathrm{new}_n \sim \mathcal{N}(vz^\mathrm{new}_n, \sigma^2 I)$|
|*How to "Compress"*| Compute $z_n = v^\top \mathbf{x}_n$| Infer $p(z_n \| \mathbf{x}_n, v)$ |
|*Training Algorithm*|SVD (find the top eigenvector of the empirical covariance)|Expectation Maximization|
|*Deep Model Analogue*|Autoencoder|Variational Autoencoder|

### What to Know About Expectation Maximization

#### Motivation: Why Can't We Autodiff the Log-Likelihood and Run Gradient Descent?
1. We can't directly maximize the conditional likelihood,
    $$\max_v \sum_{n=1}^N\log p(\mathbf{x}_n | z_n, v),$$
    because the optimum $v^*$ will be expressed in terms of the unobserved variables $z_n$. This makes $v^*$ useless.
2. We can't directly maximize the observed-data likelihood,
    $$\max_v \sum_{n=1}^N\log p(\mathbf{x}_n | v) = \max_v \sum_{n=1}^N\log \mathbb{E}_{p(z_n)} [p(\mathbf{x}_n | z_n, v)],$$
    because either:
    - we can't compute $\mathbb{E}_{p(z_n)} [p(\mathbf{x}_n | z_n, v)]$ in closed form (the integral is too hard), and/or we cannot push the gradient inside the expectation, or
    - we can compute $\mathbb{E}_{p(z_n)} [p(\mathbf{x}_n | z_n, v)]$ in closed form (the integral is easy), but gradient descent on $\log p(\mathbf{x}_n | v)$ struggles to converge.

#### Our Solution
1. We optimize a lower bound on the observed-data log-likelihood that is also easier for us to work with:
    $$ \sum_{n=1}^{N}\mathrm{ELBO}(v, q(z_n)) = \sum_{n=1}^{N} \mathbb{E}_{q(z_n)}\left[ \log \frac{p(\mathbf{x}_n|z_n, v)\,p(z_n)}{q(z_n)}\right]. $$
2. The ELBO is a more tractable objective to work with because:
    - the expectation is with respect to a distribution we choose, $q(z_n)$, and thus we can choose $q(z_n)$ to not depend on $v$. So the gradient $\nabla_v$ can be pushed inside the expectation.
    - we can optimize the ELBO via *coordinate-wise* optimization, taking turns optimizing over $q$ and $v$ while holding the other constant (see the sketch below):
        - **(E-Step)** When optimizing the ELBO with respect to $q$, a lot of algebra shows that we just need to set $q(z_n) = p(z_n| \mathbf{x}_n, v^\mathrm{old})$. ***Note:*** this is "easy" only if we can find the posterior $p(z_n| \mathbf{x}_n, v^\mathrm{old})$ in closed form. When we don't know the posterior in closed form, we typically perform *variational inference* in the E-step.
        - **(M-Step)** When optimizing the ELBO with respect to $v$, a lot of algebra shows that the ELBO greatly simplifies:
          $$ v^\mathrm{new}=\mathrm{argmax}_v \sum_{n=1}^{N}\mathbb{E}_{p(z_n| \mathbf{x}_n, v^\mathrm{old})} \left[ \log p(\mathbf{x}_n|z_n, v) + \log p(z_n)\right]. $$
          **Note:** this optimization is typically done analytically, but you can also optimize in the M-step using gradient descent.
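To make the E-step and M-step concrete, here is a minimal sketch of EM for the one-dimensional-latent pPCA model from the table above, with $\sigma^2$ assumed known and held fixed to keep the code short. The function name and toy data are made up for illustration; for this model both steps happen to be available in closed form.

```python
import numpy as np

def em_ppca_1d(X, sigma2, n_iters=50, seed=0):
    """EM sketch for z_n ~ N(0, 1), x_n | z_n ~ N(v z_n, sigma2 * I),
    with sigma2 assumed known and held fixed (a simplifying assumption)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    v = rng.normal(size=D)                   # random initial loading vector

    for _ in range(n_iters):
        # E-step: q(z_n) = p(z_n | x_n, v_old) is Gaussian with
        #   variance s2  = sigma2 / (sigma2 + v^T v)
        #   mean     m_n = v^T x_n / (sigma2 + v^T v)
        s2 = sigma2 / (sigma2 + v @ v)
        m = (X @ v) / (sigma2 + v @ v)       # posterior means, shape (N,)

        # M-step: argmax_v sum_n E_q[log p(x_n | z_n, v)] has the closed form
        #   v_new = (sum_n E[z_n] x_n) / (sum_n E[z_n^2]),
        # where E[z_n] = m_n and E[z_n^2] = m_n^2 + s2.
        v = (X.T @ m) / np.sum(m**2 + s2)
    return v

# Toy usage: sample data from the model, then recover v (up to sign).
rng = np.random.default_rng(1)
v_true = np.array([2.0, 1.0])
z = rng.normal(size=500)
X = z[:, None] * v_true + rng.normal(scale=0.3, size=(500, 2))
print(em_ppca_1d(X, sigma2=0.3**2))          # approximately +/- v_true
```

Replacing the closed-form M-step update with a gradient step on the same expected log-likelihood gives the gradient-based variant mentioned in the note above.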
## Non-Probabilistic Clustering Versus Probabilistic Clustering

| | Non-Probabilistic | Probabilistic (Mixture Models) |
| -------- | -------- | -------- |
| *Model* | Uses some notion of similarity to make clusters | $z_n\sim \mathrm{Cat}(\mathbf{\pi})$, $\mathbf{x}_n\| z_n \sim \mathcal{N}(\mathbf{\mu}_{z_n}, \Sigma_{z_n})$|
|*Goal*|Maximize cluster "quality"|Maximize Observed Log-likelihood|
|*Training Objective* | Minimize some loss | $\mathrm{max}_{\pi, \mu, \Sigma, q({z}_n)}\; \sum_{n=1}^N\mathrm{ELBO}(\pi, \mu, \Sigma, q({z}_n))$, with $\\|\pi\\|_1=1$|
|*Good For*|Producing hard cluster assignments|Producing soft cluster assignments (uncertainty)|
|||Generating new data: (1) sample $z_n = k$, (2) sample $\mathbf{x}_n$ from the $k$-th Gaussian|
|*Training Algorithm*|Various|Expectation Maximization|
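As a small illustration of the hard-versus-soft distinction in the table, here is a sketch comparing K-Means labels with the posterior cluster probabilities of a Gaussian mixture fit by EM. It assumes scikit-learn is available; the two-blob data set is made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# Made-up data: two well-separated Gaussian blobs in 2-D.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=-2.0, size=(100, 2)),
               rng.normal(loc=+2.0, size=(100, 2))])

# Non-probabilistic clustering: each point gets exactly one label.
hard_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Probabilistic clustering (a mixture model trained with EM): each point
# gets a posterior p(z_n = k | x_n); taking the argmax recovers hard labels.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
soft_assignments = gmm.predict_proba(X)      # shape (200, 2), rows sum to 1

# The fitted mixture is also a generative model: sample z_n, then x_n.
x_new, z_new = gmm.sample(5)
```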