# Naive Bayes
## Abstract
The Naïve Bayes classifier is a popular probabilistic algorithm for classification problems that is straightforward but effective. Its computational efficiency comes from assuming conditional independence among features, which allows joint likelihoods to factor into simple one-dimensional terms. This paper presents the mathematical underpinnings of the Naïve Bayes classifier, explains how a model is fitted and used for prediction, and uses a small plant-health dataset to demonstrate the process. Connections to linear algebra are highlighted, showing how log-likelihoods can be expressed as linear functionals in feature space.
## 1. Introduction
Naïve Bayes (NB) is a supervised learning algorithm based on Bayes' theorem. It is commonly employed in text classification, medical diagnosis, spam filtering, and other tasks where probabilistic modeling is advantageous. NB is particularly effective when the independence assumption approximately holds or when the dimensionality of the data is large relative to the amount of training data.
In this paper, we describe the algorithm mathematically and demonstrate it through a simple plant health classification task using discrete (binary) features.
## 2. Mathematical Foundations of Naïve Bayes
### 2.1 Bayes’ Theorem
Bayes’ theorem expresses how conditional probabilities relate observed data to an underlying hypothesis. It states that the probability of a class $C$ given features $X$ depends on how likely $X$ is under $C$, weighted by the prior probability of $C$.
$$P(C | X) = \frac{P(X | C) \cdot P(C)}{P(X)}$$
* $P(C∣X)$: the **posterior probability**, meaning the probability that a data point is in class $C$ after observing the features $X$.
* $P(X∣C)$: the **likelihood**, the probability of observing the features $X$ if the data point is known to belong to class $C$.
* $P(C)$: the **prior probability**, the probability that a data point belongs to class $C$ before looking at any features.
* $P(X)$: the **evidence**, the overall probability of observing the features $X$, used to normalize the posterior.
This framework allows us to update beliefs about $C$ once $X$ is observed.
### 2.2 The Naïve Conditional Independence Assumption
Under Naïve Bayes, we assume features are conditionally independent given the class:
$$ P(\mathbf{x} | C_k) = P(x_1, x_2, ..., x_n | C_k) = \prod_{i=1}^{n} P(x_i | C_k)$$
In practice, implementations work with logarithms of these probabilities, which prevents numerical underflow and converts the product into a sum.
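As a small illustration with assumed values, suppose a class $C$ has $P(x_1 = 1 | C) = 0.8$ and $P(x_2 = 1 | C) = 0.5$. The independence assumption then gives
$$P(x_1 = 1, x_2 = 1 | C) = 0.8 \times 0.5 = 0.4, \qquad \log P(x_1 = 1, x_2 = 1 | C) = \log 0.8 + \log 0.5 .$$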
## 3. Fitting a Naïve Bayes Model
### 3.1 Estimating Class Priors
$$P(Y=k) = \frac{\text{number of training instances in class } k}{\text{total number of training instances}}$$
### 3.2 Estimating Likelihoods (Binary Features)
$$P(\text{Feature Value} \,|\, \text{Class}) = \frac{\text{count of Class samples with Feature Value}}{\text{total count of Class samples}}$$
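As a minimal sketch of these two estimation steps, assuming binary (0/1) features held in NumPy arrays (the function name and data layout are illustrative, not taken from any particular library):

```python
import numpy as np

def fit_naive_bayes(X, y):
    """Estimate class priors (Section 3.1) and Bernoulli likelihoods (Section 3.2).

    X : (n_samples, n_features) array of 0/1 feature values
    y : (n_samples,) array of class labels
    Returns two dicts: class -> prior, and class -> vector of P(x_i = 1 | class).
    """
    classes = np.unique(y)
    priors = {c: float(np.mean(y == c)) for c in classes}        # relative class frequency
    likelihoods = {c: X[y == c].mean(axis=0) for c in classes}   # fraction of class samples with x_i = 1
    return priors, likelihoods
```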
### 3.3 Linear Algebra View
The classifier selects the class with the largest log posterior:
$$y^{*}=\arg\max_{y}\left(\log P(X|Y=y)+\log P(Y=y)\right)$$
Under the conditional independence assumption,
$$P(X|Y=y)= \prod_{j=1}^{d} P(X_{j}|Y=y)$$
Taking the logarithm, this becomes
$$\log P(X|Y=y)=\sum_{j=1}^{d}\log P(X_{j}|Y=y)$$
Substituting this back into the maximization problem:
$$y^{*}=\arg\max_{y}\left(\sum_{j=1}^{d}\log P(X_{j}|Y=y)+\log P(Y=y)\right)$$
This shows that Naïve Bayes can be represented as a linear model in log-likelihood feature space.
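To make this concrete for binary features, note that the Bernoulli log-likelihood $\sum_{i} \left[x_i \log p_i + (1 - x_i)\log(1 - p_i)\right]$ is an inner product $\mathbf{w} \cdot \mathbf{x} + b$ in feature space. The sketch below assumes the `priors` and `likelihoods` dictionaries from the sketch above, with every probability strictly between 0 and 1 (e.g. after the smoothing discussed in Section 6):

```python
import numpy as np

def log_scores(x, priors, likelihoods):
    """Log-posterior scores (up to the shared log P(X) term) for one binary sample x."""
    scores = {}
    for c, p in likelihoods.items():
        # log P(x | c) = sum_i [ x_i * log p_i + (1 - x_i) * log(1 - p_i) ]
        w = np.log(p) - np.log(1.0 - p)                  # per-feature weight vector
        b = np.sum(np.log(1.0 - p)) + np.log(priors[c])  # class-dependent bias
        scores[c] = float(x @ w + b)                     # a linear functional of x
    return scores

# Prediction: the class with the largest log score, e.g.
# y_star = max(scores, key=scores.get)
```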
## 4. Naive Bayes Classifier: Applications and Use Cases
* **Real-time classification**: because training and prediction are extremely fast compared to most other classification models, Naive Bayes is used in applications that need quick classification responses on small to medium-sized datasets.
* **Spam filtering**: the use case most often associated with this classifier; it is widely used to decide whether an email is spam.
* **Text classification**: the Naive Bayes classifier performs well on many text classification tasks.
The Naive Bayes classifier also handles multi-class classification well, and despite its naive independence assumption it is often competitive with more sophisticated methods.
## 5. Plant Health Example
Let us walk through a problem statement and how Naive Bayes is used to solve it.
We consider predicting whether a plant is Healthy or Sick based on four binary features:
1. $x_1$: Leaves brown? (1=yes, 0=no)
2. $x_2$: Soil dry? (1=yes, 0=no)
3. $x_3$: Wilting? (1=yes, 0=no)
4. $x_4$: Spots on leaves? (1=yes, 0=no)
### 5.1 Training Data
Consider the data below, with $N=5$ observations:
|#| $x_1$ | $x_2$ | $x_3$ | $x_4$ | $Y$ |
| ------| -------- | -------- | -------- | -------- | -------- |
|1| 0 | 0 | 0 | 0 | Healthy |
|2| 0 | 1 | 0 | 0 | Healthy |
|3| 1 | 1 | 1 | 1 | Sick |
|4| 1 | 0 | 1 | 0 | Sick |
|5| 0 | 1 | 1 | 0 | Sick |
Class counts:
* **Healthy**: 2
* **Sick**: 3
### 5.2 Compute Class Priors
$$P(\text{Healthy})=\frac{2}{5}, \qquad P(\text{Sick})=\frac{3}{5}$$
### 5.3 Compute Feature Likelihoods
$\textbf{Healthy Class}$
From rows 1 and 2:
$x_1$ : 0 out of 2 $\to \ P(x_1 = 1|H)=0$
$x_2$ : 1 out of 2 $\to \ P(x_2 = 1|H)=\frac{1}{2}$
$x_3$ : 0 out of 2 $\to \ P(x_3 = 1|H)=0$
$x_4$ : 0 out of 2 $\to \ P(x_4 = 1|H)=0$
$\textbf{Sick Class}$
From rows 3 to 5:
$x_1$ : 2 out of 3 $\to \ P(x_1 = 1|S)=\frac{2}{3}$
$x_2$ : 2 out of 3 $\to \ P(x_2 = 1|S)=\frac{2}{3}$
$x_3$ : 3 out of 3 $\to \ P(x_3 = 1|S)=1$
$x_4$ : 1 out of 3 $\to \ P(x_4 = 1|S)=\frac{1}{3}$
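These hand counts can be verified with a few lines of NumPy. This is an illustrative sketch, assuming nothing beyond NumPy; the arrays simply transcribe the training table from Section 5.1.

```python
import numpy as np

# Rows 1-5 of the training table: columns are x1, x2, x3, x4.
X = np.array([[0, 0, 0, 0],
              [0, 1, 0, 0],
              [1, 1, 1, 1],
              [1, 0, 1, 0],
              [0, 1, 1, 0]])
y = np.array(["Healthy", "Healthy", "Sick", "Sick", "Sick"])

for c in ["Healthy", "Sick"]:
    rows = X[y == c]
    # Prior = class frequency; likelihoods = fraction of class rows with x_i = 1.
    print(c, "prior:", len(rows) / len(X), "P(x_i = 1 | class):", rows.mean(axis=0))
# Healthy -> prior 0.4, likelihoods (0, 1/2, 0, 0)
# Sick    -> prior 0.6, likelihoods (2/3, 2/3, 1, 1/3)
```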

## 6. Classifying a New Plant
Suppose a new plant has features:
$$x\ = \ (x_1,x_2,x_3,x_4)=(1,1,0,0)$$
The score is calculated using $$ Score(C_k) = P(C_k) \times \prod_{i=1}^{n} P(x_i | C_k)$$
Multiplying many small probabilities can produce extremely small results, so in practice we work with log probabilities.
$$LogScore(C_k) = \log(P(C_k)) + \sum_{i=1}^{n} \log(P(x_i | C_k))$$
Using this, we calculate the Healthy score and the Sick score.
$\textbf{Healthy Score}$
$L_H = \log P(H) + \log P(x_1 = 1| H)+ \log P(x_2 = 1| H)+ \log P(x_3 = 0| H)+ \log P(x_4 = 0| H)$
Since $P(x_1 = 1 | H) = 0$ (no Healthy plant had brown leaves), this yields
$L_H = -\infty$
$\textbf{Sick Score}$
$L_S = \log P(S) + \log P(x_1 = 1| S)+ \log P(x_2 = 1| S)+ \log P(x_3 = 0| S)+ \log P(x_4 = 0| S)$
We compute each term:
$P(x_1 =1|S) = \frac{2}{3}$
$P(x_2 =1|S) = \frac{2}{3}$
$P(x_3 =0|S) = 1-1=0$
$P(x_4 =0|S) = 1-\frac{1}{3}=\frac{2}{3}$
$L_S = \log\frac{3}{5} + \log\frac{2}{3}+ \log\frac{2}{3}+ \log 0+ \log\frac{2}{3}= -\infty$
Both class scores collapse to $-\infty$: the Healthy score because no Healthy plant had $x_1 = 1$, and the Sick score because no Sick plant had $x_3 = 0$.
Such zero counts are common in practice, so **Laplace smoothing** is normally used.
***Using Laplace smoothing***
Let's say we are working with the plant health example where $x_1$ = "Leaves brown," and we want to compute
$P(x_1 = 1|Y= Healthy)$.
Since $x_1 = 1$ never occurs for class "Healthy" in the training data, the raw probability estimate would be zero.
Using Laplace smoothing, each count is incremented by 1 and the denominator by the number of possible feature values (2 for a binary feature):
$$P(x_i = v | Y = k) = \frac{\text{count}_{k}(x_i = v) + 1}{N_k + 2}$$
where $N_k$ is the number of training samples in class $k$. So the smoothed likelihood is:
$$P(x_1 = 1|Y= \text{Healthy}) = \frac{0+1}{2+2} = \frac{1}{4}$$
Now, applying the same smoothing to every likelihood, we recompute both log scores:
$$L_S = \log\frac{3}{5} + \log\frac{3}{5}+ \log\frac{3}{5}+ \log\frac{1}{5} + \log\frac{3}{5} = \log(0.02592)$$
$$L_H = \log\frac{2}{5} + \log\frac{1}{4}+ \log\frac{2}{4}+ \log\frac{3}{4} + \log\frac{3}{4} = \log(0.028125)$$
(Each Healthy term is smoothed in the same way, e.g. $P(x_2=1|H)=\frac{1+1}{2+2}=\frac{2}{4}$ and $P(x_3=0|H)=\frac{2+1}{2+2}=\frac{3}{4}$.)
Both scores are now finite, and the comparison is close: the brown leaves and dry soil favor Sick, while the absence of wilting strongly favors Healthy, since every Sick training plant wilted. Because $0.028125 > 0.02592$, we have $L_H > L_S$, and Naïve Bayes predicts:
$$y=\text{Healthy}$$
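As a cross-check on the smoothed calculation, the same example can be run through scikit-learn's `BernoulliNB`, which implements Bernoulli Naïve Bayes with Laplace smoothing when `alpha=1.0`. This is a minimal sketch, assuming NumPy and scikit-learn are installed; the arrays simply transcribe the training table and the query plant from above.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Training table from Section 5.1: columns x1..x4, labels Healthy/Sick.
X = np.array([[0, 0, 0, 0],
              [0, 1, 0, 0],
              [1, 1, 1, 1],
              [1, 0, 1, 0],
              [0, 1, 1, 0]])
y = np.array(["Healthy", "Healthy", "Sick", "Sick", "Sick"])

# alpha=1.0 yields the (count + 1) / (class count + 2) estimates used above.
clf = BernoulliNB(alpha=1.0)
clf.fit(X, y)

x_new = np.array([[1, 1, 0, 0]])   # the new plant from Section 6
print(clf.predict(x_new))          # predicted class under smoothing
print(clf.predict_proba(x_new))    # posterior probabilities, columns (Healthy, Sick)
```

The posteriors come out at roughly 0.52 for Healthy and 0.48 for Sick, matching the hand-computed scores: the two classes are close, and Healthy wins narrowly.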
## 7. Strengths and Weaknesses
#### Strengths
1. **Computationally Efficient**:
* Training only requires counting feature occurrences for each class.
* Even with large datasets or many features, training is quick.
2. **Effective for Small Datasets**:
* It works well with limited data because it estimates simple probabilities rather than complex parameters.
3. **Robust to Irrelevant Features**
* Features that are independent of the class do not considerably impair performance.
4. **Handles High-Dimensional Data**
* Frequently used in text classification (many words/features), the model is tractable due to the independence assumption.
5. **Probabilistic Output**
* Gives class probabilities that can be combined with other models or used for risk assessment.
#### Weaknesses
1. **Strong Independence Assumption**
* Assumes all features are conditionally independent given the class.
* Often unrealistic in real-world datasets, which can reduce accuracy.
2. **Zero Probability Problem**
* The probability drops to zero, preventing prediction, if a feature value never occurs with a class during training.
* Typically solved with Laplace smoothing.
3. **Limited Expressiveness**
* Incapable of capturing intricate feature interactions.
* Less flexible than tree-based or neural network models.
4. **Sensitive to Imbalanced Data**
* If one class is rare, priors dominate and can bias predictions toward the majority class.
5. **Feature Engineering Is Frequently Required**
* Performance can improve dramatically with proper feature selection or discretization of continuous features.
## 8. Conclusion
The Naïve Bayes classifier offers a simple and mathematically transparent approach to classification. Its independence assumption enables fast training, closed-form parameter estimates, and efficient prediction. Through the expanded plant health example, we illustrated how the algorithm computes priors, likelihoods, and posterior class probabilities, and how Laplace smoothing resolves zero-probability issues.
The method’s linear-algebraic formulation further reveals how Naïve Bayes acts as a linear classifier in log-likelihood space.