# Naive Bayes Classifier as a Gaussian Mixture Model

### CS181

### Spring 2023

![](https://i.imgur.com/xDR9VQd.png)

Today we will relate the Gaussian Mixture Model to classification and introduce a paradigm for classification very different from the ones we've seen thus far in class!

## Gaussian Mixture Model

Let's say that we have a ***Gaussian Mixture Model (GMM)*** with two components. Let's suppose that the observed data $\mathbf{x} \in \mathbb{R}^D$ is generated by a mixture, $\pi=[\pi_1, \pi_2]$, of two Gaussians with means $\mu = [\mu_1, \mu_2]$ and covariances $\Sigma = [\Sigma_1, \Sigma_2]$. For each observation $\mathbf{x}_n$, there is a latent variable $y_n$ that indicates which of the two Gaussians is responsible for generating $\mathbf{x}_n$ -- that is, $y_n$ is the *binary label* on $\mathbf{x}_n$ telling us to which cluster $\mathbf{x}_n$ belongs:

$$
\begin{aligned}
y_n &\sim \mathrm{Cat}(\pi),\\
\mathbf{x}_n | y_n &\sim \mathcal{N}(\mu_{y_n}, \Sigma_{y_n}),
\end{aligned}
$$

where $n=1, \ldots, N$ and $\pi_1 + \pi_2 = 1$.

![](https://i.imgur.com/3pHm4g6.png)

We've studied the problem of inferring $\mu^\mathrm{MLE}$, $\Sigma^\mathrm{MLE}$, and $\pi^\mathrm{MLE}$ even when we don't observe the binary labels $y_n$. That is, we use EM to infer $\mu^\mathrm{MLE}$, $\Sigma^\mathrm{MLE}$, and $\pi^\mathrm{MLE}$; we can then infer the label $y_n$ using the posterior $p(y_n|\mathbf{x}_n, \mu^\mathrm{MLE}, \Sigma^\mathrm{MLE}, \pi^\mathrm{MLE})$.

But what is the point of inferring $\pi$ and $y_n$? It turns out that if we are able to infer $\pi$ and $y_n$, or if we are given $\pi$ and $y_n$ as part of our data, we can use our model to make predictions about the label $y^\mathrm{new}$ for a new point $\mathbf{x}^\mathrm{new}$.

## From Gaussian Mixture Model to Classification

Suppose we have all the parameters (either given or inferred) for a two-component Gaussian Mixture Model, with the components now labeled 0 and 1:

$$
\begin{aligned}
\pi &= [\pi_0, \pi_1]\\
\mu &= [\mu_0, \mu_1]\\
\Sigma &= [\Sigma_0, \Sigma_1]\\
p(y=1) &= \pi_1\\
p(y=0) &= \pi_0\\
p(\mathbf{x}_n | y_n=1, \mu, \pi, \Sigma) &= \mathcal{N}(\mu_{1}, \Sigma_{1})\\
p(\mathbf{x}_n | y_n=0, \mu, \pi, \Sigma) &= \mathcal{N}(\mu_{0}, \Sigma_{0})
\end{aligned}
$$

We can treat the information given by a two-component GMM as the data generating process for a binary classification problem: there are two classes, 0 and 1, and each data point $\mathbf{x}_n$ is given a class label $y_n$.

Now suppose we are given a new point $\mathbf{x}^\mathrm{new}$, and we want to infer the corresponding class label $y^\mathrm{new}$. Instead of training a classifier like logistic regression on the labeled data, *we can use the GMM model itself* to infer a likely value for $y^\mathrm{new}$! For example, we can find the conditional distributions $p(y^\mathrm{new}=1|\mathbf{x}^\mathrm{new}, \mu, \pi, \Sigma)$ and $p(y^\mathrm{new}=0|\mathbf{x}^\mathrm{new}, \mu, \pi, \Sigma)$; then we label $\mathbf{x}^\mathrm{new}$ class $1$ if the conditional probability of $y^\mathrm{new}=1$ is larger than that of $y^\mathrm{new}=0$ (and we label the point class $0$ if the opposite is true). This way of using the information in a GMM to perform classification is called *Naive Bayes Classification*.

## Naive Bayes Classification

Typically, when building the Naive Bayes Classifier, we assume that we are given a labeled data set $\mathcal{D} = \{(\mathbf{x}_n, y_n) \}_{n=1}^N$. *Note:* assuming that the class labels $y_n$ are given simplifies our inference.
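To make this data generating process concrete, here is a minimal sketch (using `numpy`) that samples a labeled data set $\{(\mathbf{x}_n, y_n)\}_{n=1}^N$ from a two-component GMM. The specific means, covariances, and mixture weights below are made-up illustrative values, not parameters from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up illustrative parameters for a two-component GMM in 2D.
pi = np.array([0.4, 0.6])                                # mixture weights: p(y=0), p(y=1)
mus = [np.array([-2.0, 0.0]), np.array([2.0, 1.0])]      # class means
Sigmas = [np.array([[1.0, 0.3], [0.3, 1.0]]),            # class covariances
          np.array([[0.5, 0.0], [0.0, 2.0]])]

N = 500
y = rng.choice([0, 1], size=N, p=pi)                     # y_n ~ Cat(pi)
X = np.stack([rng.multivariate_normal(mus[k], Sigmas[k]) # x_n | y_n ~ N(mu_{y_n}, Sigma_{y_n})
              for k in y])
```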
![](https://i.imgur.com/FZdpfrG.png)

Then we suppose that

$$
\begin{aligned}
\mathbf{x}_n &\sim \mathcal{N}(\mu_1, \Sigma_1), \text{ if $y_n = 1$}\\
\mathbf{x}_n &\sim \mathcal{N}(\mu_0, \Sigma_0), \text{ if $y_n = 0$}
\end{aligned}
$$

That is, the set of points in each class is distributed like a Gaussian. In this case, because we observe the class labels $y_n$, we can find the MLE parameters of the Gaussian corresponding to each class:

$$
\begin{aligned}
\hat{\mu}_1, \hat{\Sigma}_1 &= \text{empirical mean and covariance of all $\mathbf{x}_n$ labeled 1}\\
\hat{\mu}_0, \hat{\Sigma}_0 &= \text{empirical mean and covariance of all $\mathbf{x}_n$ labeled 0}
\end{aligned}
$$

as well as the MLE mixture parameters

$$
\begin{aligned}
\hat{\pi}_0 &= \frac{\# (y_n = 0)}{N}\\
\hat{\pi}_1 &= \frac{\# (y_n = 1)}{N}
\end{aligned}
$$

![](https://i.imgur.com/ziDJkLr.png)

Now, given a new point $\mathbf{x}^\mathrm{new}$, we compute the posterior probability of labeling this point class $1$ using Bayes' Rule:

$$
\begin{aligned}
p(y^\mathrm{new}=1|\mathbf{x}^\mathrm{new}, \hat{\mu}, \hat{\pi}, \hat{\Sigma})&= \frac{p(\mathbf{x}^\mathrm{new} | y^\mathrm{new}=1, \hat{\mu}, \hat{\pi}, \hat{\Sigma})\,p(y^\mathrm{new}=1)}{p(\mathbf{x}^\mathrm{new}| \hat{\mu}, \hat{\pi}, \hat{\Sigma})}\\
&= \frac{\mathcal{N}(\mathbf{x}^\mathrm{new}; \hat{\mu}_1, \hat{\Sigma}_1)\,\hat{\pi}_1}{p(\mathbf{x}^\mathrm{new}| \hat{\mu}, \hat{\pi}, \hat{\Sigma})}
\end{aligned}
$$

Note that the numerator of this fraction is easy to compute (for a computer): we just plug $\mathbf{x}^\mathrm{new}$ into the pdf of $\mathcal{N}(\hat{\mu}_1, \hat{\Sigma}_1)$ and then multiply the result by the fraction of total labels that are class 1. Similarly, we compute the posterior probability of labeling this point class $0$:

$$
\begin{aligned}
p(y^\mathrm{new}=0|\mathbf{x}^\mathrm{new}, \hat{\mu}, \hat{\pi}, \hat{\Sigma})&= \frac{p(\mathbf{x}^\mathrm{new} | y^\mathrm{new}=0, \hat{\mu}, \hat{\pi}, \hat{\Sigma})\,p(y^\mathrm{new}=0)}{p(\mathbf{x}^\mathrm{new}| \hat{\mu}, \hat{\pi}, \hat{\Sigma})}\\
&= \frac{\mathcal{N}(\mathbf{x}^\mathrm{new}; \hat{\mu}_0, \hat{\Sigma}_0)\,\hat{\pi}_0}{p(\mathbf{x}^\mathrm{new}| \hat{\mu}, \hat{\pi}, \hat{\Sigma})}
\end{aligned}
$$

Again, we can easily compute the numerator of the above fraction. What about the denominator $p(\mathbf{x}^\mathrm{new}| \hat{\mu}, \hat{\pi}, \hat{\Sigma})$?! Note that whatever this value happens to be, both $p(y^\mathrm{new}=0|\mathbf{x}^\mathrm{new}, \hat{\mu}, \hat{\pi}, \hat{\Sigma})$ and $p(y^\mathrm{new}=1|\mathbf{x}^\mathrm{new}, \hat{\mu}, \hat{\pi}, \hat{\Sigma})$ have the ***same*** denominator! Therefore, when comparing $p(y^\mathrm{new}=0|\mathbf{x}^\mathrm{new}, \hat{\mu}, \hat{\pi}, \hat{\Sigma})$ and $p(y^\mathrm{new}=1|\mathbf{x}^\mathrm{new}, \hat{\mu}, \hat{\pi}, \hat{\Sigma})$, we only need to compare the numerators of the two fractions.
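Here is a minimal sketch of these two steps, assuming the simulated arrays `X` and `y` from the sampling sketch above: fitting the per-class MLE parameters, and evaluating the two numerators $\mathcal{N}(\mathbf{x}^\mathrm{new}; \hat{\mu}_k, \hat{\Sigma}_k)\hat{\pi}_k$ with `scipy.stats.multivariate_normal`. The helper names `fit_class_conditional_gaussians` and `class_numerators` are our own, not from the lecture.

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_class_conditional_gaussians(X, y):
    """MLE parameters: per-class empirical mean and covariance, plus mixture proportions."""
    params = {}
    for k in (0, 1):
        X_k = X[y == k]
        params[k] = {
            "mu": X_k.mean(axis=0),
            "Sigma": np.cov(X_k, rowvar=False, bias=True),  # bias=True gives the MLE (1/N_k) covariance
            "pi": len(X_k) / len(X),                        # fraction of labels equal to k
        }
    return params

def class_numerators(x_new, params):
    """The Bayes-rule numerators N(x_new; mu_k, Sigma_k) * pi_k for each class k."""
    return {k: multivariate_normal.pdf(x_new, mean=p["mu"], cov=p["Sigma"]) * p["pi"]
            for k, p in params.items()}
```

For example, `class_numerators(np.array([2.0, 1.0]), fit_class_conditional_gaussians(X, y))` returns the two numerators, and the larger of the two determines the predicted class.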
Formally, we conclude that

$$p(y^\mathrm{new}=1|\mathbf{x}^\mathrm{new}, \hat{\mu}, \hat{\pi}, \hat{\Sigma}) > p(y^\mathrm{new}=0|\mathbf{x}^\mathrm{new}, \hat{\mu}, \hat{\pi}, \hat{\Sigma})$$

if and only if

$$\mathcal{N}(\mathbf{x}^\mathrm{new}; \hat{\mu}_1, \hat{\Sigma}_1)\,\hat{\pi}_1 > \mathcal{N}(\mathbf{x}^\mathrm{new}; \hat{\mu}_0, \hat{\Sigma}_0)\,\hat{\pi}_0.$$

## Generative Versus Discriminative Classifiers and OOD

Because Naive Bayes is built on Gaussian Mixture Models, and GMMs are generative models, Naive Bayes Classifiers are called *generative classifiers*. In contrast, logistic regression and the other classifiers we've studied previously are called *discriminative*: given $\mathbf{x}$, they can only discriminate one class from another, but they cannot generate new data from scratch.

So why do we need yet another classifier? It is not the case that generative classifiers are more accurate than discriminative ones, or vice versa. But remember that accuracy is far from the only thing we care about! Because a generative model takes the time to model the *distribution of data* in each class, it has a natural notion of how similar a new observation $\mathbf{x}^\mathrm{new}$ is to the observed data, via $\mathcal{N}(\mathbf{x}^\mathrm{new}; \hat{\mu}_0, \hat{\Sigma}_0)$ and $\mathcal{N}(\mathbf{x}^\mathrm{new}; \hat{\mu}_1, \hat{\Sigma}_1)$. If $\mathbf{x}^\mathrm{new}$ is unlikely both under the distribution of class 1 and under the distribution of class 0, then we suspect that $\mathbf{x}^\mathrm{new}$ is out-of-distribution (OOD) and refrain from making a classification with our model!
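Continuing the sketch above (reusing `fit_class_conditional_gaussians`, `multivariate_normal`, and the simulated `X`, `y`), here is one minimal way to implement the decision rule together with a simple OOD check; the density threshold below is an arbitrary illustrative choice, not a value from the lecture, and in practice it would be calibrated against the class-conditional densities of the training data.

```python
def classify_or_flag_ood(x_new, params, density_threshold=1e-6):
    """Pick the class with the larger numerator N(x; mu_k, Sigma_k) * pi_k,
    but refuse to classify if x_new is unlikely under BOTH class distributions."""
    # Class-conditional densities N(x_new; mu_k, Sigma_k).
    densities = {k: multivariate_normal.pdf(x_new, mean=p["mu"], cov=p["Sigma"])
                 for k, p in params.items()}
    # OOD check: unlikely under both classes, so refrain from classifying.
    if all(d < density_threshold for d in densities.values()):
        return "OOD"
    # Otherwise compare the numerators (posterior is proportional to density * pi).
    numerators = {k: densities[k] * params[k]["pi"] for k in params}
    return max(numerators, key=numerators.get)

params = fit_class_conditional_gaussians(X, y)
print(classify_or_flag_ood(np.array([2.0, 1.0]), params))    # near the class-1 mean: returns a class label
print(classify_or_flag_ood(np.array([50.0, 50.0]), params))  # far from both classes: flagged as "OOD"
```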