# Lecture #14 Summary: Review of Latent Variables, Compression and Clustering
### CS181
### Spring 2023

### Example: Gaussian Mixture Models (GMMs)
In a ***Gaussian Mixture Model (GMM)***, we posit that the observed data $Y$ is generated by a mixture of $K$ Gaussians with mixture weights $\pi=[\pi_1, \ldots, \pi_K]$, means $\mu = [\mu_1, \ldots, \mu_K]$, and covariances $\Sigma = [\Sigma_1, \ldots, \Sigma_K]$. For each observation $Y_n$, the class $Z_n$ is a latent variable that indicates which of the $K$ Gaussians is responsible for generating $Y_n$:
\begin{aligned}
Z_n &\sim \mathrm{Cat}(\pi),\\
Y_n | Z_n&\sim \mathcal{N}(\mu_{Z_n}, \Sigma_{Z_n}),
\end{aligned}
where $n=1, \ldots, N$ and $\sum_{k=1}^K \pi_k = 1$.
GMMs are examples of ***model-based clustering***: breaking a data set into natural clusters based on a statistical model fitted to the data.
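To make the generative story concrete, here is a minimal sketch of sampling from a GMM, assuming NumPy and made-up values for $\pi$, $\mu$, and $\Sigma$ (the parameter values are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up mixture parameters for K = 2 Gaussians in 2D.
pi = np.array([0.3, 0.7])                        # mixture weights, sum to 1
mus = np.array([[0.0, 0.0], [5.0, 5.0]])         # means mu_k
Sigmas = np.array([np.eye(2), 2.0 * np.eye(2)])  # covariances Sigma_k

N = 100
Z = rng.choice(len(pi), size=N, p=pi)            # Z_n ~ Cat(pi)
Y = np.array([rng.multivariate_normal(mus[k], Sigmas[k]) for k in Z])  # Y_n | Z_n
```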
### Example: Item-Response Models
In ***item-response models***, we measure a real-valued unobserved trait $Z$ of a subject by performing a series of experiments with binary observable outcomes, $Y$:
\begin{aligned}
Z_n &\sim \mathcal{N}(\mu, \sigma^2),\\
\theta_n &= g(Z_n),\\
Y_n|Z_n &\sim \mathrm{Ber}(\theta_n),
\end{aligned}
where $n=1, \ldots, N$ and $g$ is some fixed function mapping $Z_n$ to a probability $\theta_n \in [0, 1]$ (e.g. the sigmoid).
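A minimal sketch of this generative process, assuming NumPy, taking $g$ to be the sigmoid (one common choice, not fixed by the model above), and using made-up values for $\mu$ and $\sigma$:

```python
import numpy as np

rng = np.random.default_rng(0)

mu, sigma = 0.0, 1.0               # made-up prior parameters for the trait Z
N = 100

Z = rng.normal(mu, sigma, size=N)  # Z_n ~ N(mu, sigma^2)
theta = 1.0 / (1.0 + np.exp(-Z))   # theta_n = g(Z_n), here g = sigmoid
Y = rng.binomial(1, theta)         # Y_n | Z_n ~ Ber(theta_n)
```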
#### Applications
Item response models are used to model the way "underlying intelligence" $Z$ relates to scores $Y$ on IQ tests.
Item response models can also be used to model the way "suicidality" $Z$ relates to answers on mental health surveys. Building a good model may help to infer when a patient is at psychiatric risk based on intake surveys at points of care throughout the health care system.
### Example: Factor Analysis Models
In ***factor analysis models***, we posit that the observed data $Y$ with many measurements is generated by a small set of unobserved factors $Z$:
\begin{aligned}
Z_n &\sim \mathcal{N}(0, I),\\
Y_n|Z_n &\sim \mathcal{N}(\mu + \Lambda Z_n, \Phi),
\end{aligned}
where $n=1, \ldots, N$, $Z_n\in \mathbb{R}^{D'}$ and $Y_n\in \mathbb{R}^{D}$. We typically assume that $D'$ is much smaller than $D$.
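A minimal sketch of the factor-analysis generative process, assuming NumPy and made-up dimensions ($D'=2$, $D=10$) with randomly drawn $\mu$, $\Lambda$, and a diagonal $\Phi$:

```python
import numpy as np

rng = np.random.default_rng(0)

D_prime, D, N = 2, 10, 100            # latent dim, observed dim, number of observations
mu = rng.normal(size=D)               # made-up mean vector
Lam = rng.normal(size=(D, D_prime))   # made-up loading matrix Lambda
Phi = np.diag(rng.uniform(0.1, 1.0, size=D))  # diagonal noise covariance Phi

Z = rng.normal(size=(N, D_prime))                      # Z_n ~ N(0, I)
noise = rng.multivariate_normal(np.zeros(D), Phi, size=N)
Y = mu + Z @ Lam.T + noise                             # Y_n | Z_n ~ N(mu + Lambda Z_n, Phi)
```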
#### Applications
Factor analysis models are useful for biomedical data, where we typically measure a large number of characteristics of a patient (e.g. blood pressure, heart rate, etc.), but these characteristics are all generated by a small list of health factors (e.g. diabetes, cancer, hypertension, etc.). Building a good model means we may be able to infer the list of health factors of a patient from their observed measurements.
## PCA Versus Probabilistic PCA (pPCA)
Suppose that our data $\mathbf{X}$ is centered (if the data is not centered, the formulae for reasoning about PCA as a projection that minimizes reconstruction loss get uglier). Let's also assume that $z_n$ is one-dimensional and $\mathbf{x}_n$ is two-dimensional. A code sketch of the PCA column follows the table.
| | PCA | pPCA |
| -------- | -------- | -------- |
|*Model* | $z_n = v^\top \mathbf{x}_n \in \mathbb{R}$ *(Projection)*, $\hat{\mathbf{x}}_n = vz_n \in \mathbb{R}^2$ *(Reconstruction)* | $z_n\sim \mathcal{N}(0, 1)$, $\mathbf{x}_n\| z_n \sim \mathcal{N}(vz_n, \sigma^2 I)$ |
|*Goal*|Minimize Reconstruction Loss| Maximize Observed Log-likelihood|
|*Training Objective* | $\mathrm{min}_v \sum_{n=1}^N\\| \mathbf{x}_n - vz_n\\|^2_2$, $\\|v\\|_2 =1$ | $\mathrm{max}_{v, q({z}_n)}\; \sum_{n=1}^N\mathrm{ELBO}(v, q({z}_n))$ (Optional: $\\|v\\|_2=1$)|
|*Good For*|Compression: $\mathbf{x}_n \mapsto z_n$| Modeling data distribution: $\mathbf{x}_n \sim p(\mathbf{x}_n\| v)$|
||| Generating New Data: $z^\mathrm{new}_n\sim \mathcal{N}(0, 1)$, $\mathbf{x}^\mathrm{new}_n\| z^\mathrm{new}_n \sim \mathcal{N}(vz^\mathrm{new}_n, \sigma^2 I)$|
|*How to "Compress"*| Compute $z_n = v^\top \mathbf{x}_n$| Infer $p(z_n \| \mathbf{x}_n, v)$ |
|*Training Algorithm*|SVD (equivalently, eigendecomposition of the empirical covariance)|Expectation Maximization|
|*Deep Model Analogue*|Autoencoder|Variational Autoencoder|
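To make the PCA column concrete, here is a minimal sketch assuming NumPy and synthetic centered 2D data: the top right-singular vector of $\mathbf{X}$ gives $v$, and compression and reconstruction are just inner products with $v$.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[3.0, 2.0], [2.0, 2.0]], size=200)
X = X - X.mean(axis=0)                 # center the data

# PCA via SVD: the top right-singular vector of X is the direction v.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
v = Vt[0]                              # unit-norm principal direction, ||v||_2 = 1

z = X @ v                              # compression: z_n = v^T x_n
X_hat = np.outer(z, v)                 # reconstruction: x_hat_n = v z_n
reconstruction_loss = np.sum((X - X_hat) ** 2)
```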
### What to Know About Expectation Maximization
#### Motivation: Why Can't We Autodiff Log-likelihood and Gradient Descent?
1. Can't directly maximize conditional likelihood:
$$\max_v \sum_{n=1}^N\log p(\mathbf{x}_n | z_n, v),$$ because the maximizer $v^*$ will be expressed in terms of the unobserved variables $z_n$, which makes $v^*$ useless.
2. Can't directly maximize observed data likelihood:
$$\max_v \sum_{n=1}^N\log p(\mathbf{x}_n | v) = \max_v \sum_{n=1}^N\log \mathbb{E}_{p(z_n)} [p(\mathbf{x}_n | z_n, v)],$$
because we either:
- can't compute $\mathbb{E}_{p(z_n)} [p(\mathbf{x}_n | z_n, v)]$ in closed form (the integral is too hard), and/or we cannot push the gradient inside the expectation
or alternatively
- we can compute $\mathbb{E}_{p(z_n)} [p(\mathbf{x}_n | z_n, v)]$ in closed form (the integral is easy), but gradient descent on $\log p(\mathbf{x}_n | v)$ struggles to converge, as in the GMM example below
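For example, in the GMM from earlier the expectation over $z_n$ is a finite sum that we *can* compute in closed form, but direct gradient descent on the resulting log-of-a-sum objective tends to struggle (the second situation above):
$$
\log p(\mathbf{x}_n \mid \pi, \mu, \Sigma) = \log \sum_{k=1}^{K} \pi_k\, \mathcal{N}(\mathbf{x}_n \mid \mu_k, \Sigma_k).
$$
For models with continuous latent variables, the expectation becomes an integral, which may not have a closed form at all (the first situation).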
#### Our Solution
1. We optimize a lower bound on the observed data log-likelihood, the ***Evidence Lower Bound (ELBO)***, which is easier for us to work with (it is a lower bound by Jensen's inequality, since $\log$ is concave):
$$
\sum_{n=1}^{N}\mathrm{ELBO}(v, q(\mathbf{z}_n)) = \sum_{n=1}^{N} \mathbb{E}_{q(z_n)}\left[ \log \frac{p(\mathbf{x}_n|z_n, v)p(z_n)}{q(z_n)}\right]
$$
2. The ELBO is a more tractable objective to work with because:
- the expectation is with respect to a distribution we choose, $q(z_n)$, and thus we can choose $q(z_n)$ to not depend on $v$. So the gradient $\nabla_v$ can be pushed inside the expectation.
- we can optimize the ELBO via *coordinate-wise* optimization, taking turns optimizing over each $q$ and $v$, while holding the other constant:
- **(E-Step)** when optimizing the ELBO with respect to $q$, a lot of algebra shows us that we just need to set $q(z_n) = p(z_n| \mathbf{x}_n, v^\mathrm{old})$.
***Note:*** this is "easy" only if we can find the posterior $p(z_n| \mathbf{x}_n, v^\mathrm{old})$ in closed form. When we don't know the posterior in closed form, in the E-step we typically perform *variational inference*.
- **(M-Step)** when optimizing the ELBO with respect to $v$, a lot of algebra shows us that the ELBO greatly simplifies:
$$
v^\mathrm{new}=\mathrm{argmax}_v \sum_{n=1}^{N}\mathbb{E}_{p(z_n| \mathbf{x}_n, v^\mathrm{old})} \left[ \log p(\mathbf{x}_n|z_n, v) + \log p(z_n)\right]
$$
**Note:** this optimization is typically done analytically, but you can also use gradient descent in the M-step. A concrete EM iteration for the GMM is sketched below.
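To make the E-step and M-step concrete, here is a minimal sketch of one EM iteration for the GMM described at the start of these notes, assuming NumPy and SciPy; for the GMM the posterior $p(z_n \mid \mathbf{x}_n)$ is available in closed form, so the E-step just computes the posterior probabilities ("responsibilities") of each component.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(Y, pi, mus, Sigmas):
    """One EM iteration for a GMM. Y: (N, D); pi: (K,); mus: (K, D); Sigmas: (K, D, D)."""
    N, K = Y.shape[0], len(pi)

    # E-step: set q(z_n) to the posterior p(z_n = k | y_n) (the "responsibilities").
    resp = np.stack(
        [pi[k] * multivariate_normal.pdf(Y, mus[k], Sigmas[k]) for k in range(K)], axis=1
    )
    resp /= resp.sum(axis=1, keepdims=True)             # shape (N, K)

    # M-step: maximize the expected complete-data log-likelihood in closed form.
    Nk = resp.sum(axis=0)                               # effective counts per component
    pi_new = Nk / N
    mus_new = (resp.T @ Y) / Nk[:, None]
    Sigmas_new = np.stack([
        (resp[:, k, None] * (Y - mus_new[k])).T @ (Y - mus_new[k]) / Nk[k]
        for k in range(K)
    ])
    return pi_new, mus_new, Sigmas_new
```

Iterating `em_step` until the observed-data log-likelihood stops improving gives the usual EM training loop.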
## Non-Probabilistic Clustering Versus Probabilistic Clustering
| | Non-Probabilistic | Probabilistic (Mixture Models) |
| -------- | -------- | -------- |
| *Model* | Uses some notion of similarity to make clusters | $z_n\sim \mathrm{Cat}(\pi)$, $\mathbf{x}_n\| z_n \sim \mathcal{N}(\mu_{z_n}, \Sigma_{z_n})$|
|*Goal*|Maximize cluster "quality"|Maximize Observed Log-likelihood|
|*Training Objective* | Minimize some loss | $\mathrm{max}_{\pi, \mu, \Sigma, q({z}_n)}\; \sum_{n=1}^N\mathrm{ELBO}(\pi, \mu, \Sigma, q({z}_n))$, with $\sum_{k=1}^K \pi_k=1$|
|*Good for*|Producing hard cluster assignments|Producing soft cluster assignments (uncertainty)|
|||Generating new data: (1) sample $z_n = k$, (2) sample $\mathbf{x}_n$ from the $k$-th Gaussian|
|*Training Algorithm*|Various|Expectation Maximization|
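To connect the two columns: given fitted mixture parameters (e.g. from the EM sketch above), the soft assignments are just the posterior responsibilities $p(z_n = k \mid \mathbf{x}_n)$, and a hard assignment can be read off with an argmax. A minimal sketch assuming NumPy and SciPy:

```python
import numpy as np
from scipy.stats import multivariate_normal

def cluster_assignments(Y, pi, mus, Sigmas):
    """Soft and hard cluster assignments under a fitted GMM (mirrors the E-step above)."""
    K = len(pi)
    resp = np.stack(
        [pi[k] * multivariate_normal.pdf(Y, mus[k], Sigmas[k]) for k in range(K)], axis=1
    )
    soft = resp / resp.sum(axis=1, keepdims=True)   # p(z_n = k | x_n): soft assignments
    hard = soft.argmax(axis=1)                      # hard assignments (argmax over clusters)
    return soft, hard
```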