*Sai Kiran Reddy M*
# Naive Bayes Classification
### **Introduction:**
Naive Bayes is a fundamental machine learning algorithm known for its simplicity and effectiveness. Despite its "naive" assumption, it remains one of the most useful classification techniques, especially for spam filtering and text classification.
It is a probabilistic classifier based on Bayes' Theorem, and its fundamental premise is that, given the class label, the features in a dataset are mutually independent. Even though this assumption rarely holds in real-world data, the method works very well in practice.
### Mathematical Foundation
**Bayes' Theorem:**
The algorithm relies on Bayes' Theorem, which calculates the probability of a class $C_k$ given the data $X$:
$$P(C_k|X)=\frac{P(X|C_k)\cdot P(C_k)}{P(X)}$$ Where:
- $P(C_k|X)$ is the **Posterior probability**: the probability of class $C_k$ given the features $X$.
- $P(X|C_k)$ is the **Likelihood**: the probability of observing $X$ given class $C_k$.
- $P(C_k)$ is the **Prior probability**: the probability of class $C_k$ occurring in the dataset before observing the features.
- $P(X)$ is the **Evidence**, or marginal probability, of the features $X$ across all classes.
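As a quick numeric illustration, the sketch below plugs hypothetical values into the formula for a toy spam filter; the prior and the likelihoods are made up purely to show the arithmetic.
```python
# Bayes' Theorem with hypothetical numbers: P(Spam | "free" appears in the email).
p_spam = 0.3                     # prior P(C_k): 30% of emails are spam (assumed)
p_free_given_spam = 0.8          # likelihood P(X | C_k) (assumed)
p_free_given_not_spam = 0.1      # likelihood under the other class (assumed)

# Evidence P(X): total probability of seeing "free" across both classes.
p_free = p_free_given_spam * p_spam + p_free_given_not_spam * (1 - p_spam)

posterior = p_free_given_spam * p_spam / p_free   # P(C_k | X)
print(round(posterior, 3))                        # ~0.774
```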
### Feature Independence Assumption
Naive Bayes makes a strong assumption that features are conditionally independent given the class label. This is known as the feature independence assumption. Mathematically, the joint likelihood of all features $X=\{x_1,x_2,\dots,x_n\}$ given class $C_k$ is simplified to the product of the individual feature likelihoods:
$$P(X|C_k)=\prod_{i=1}^{n}P(x_i|C_k)$$ This assumption significantly simplifies the calculation of the likelihood term $P(X|C_k)$ in Bayes' Theorem. However, it is rarely true in real-world data, where features often exhibit some degree of dependency.
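Under this assumption, computing the likelihood of a whole feature vector reduces to multiplying per-feature probabilities, as in the minimal sketch below (the individual likelihoods are hypothetical).
```python
import numpy as np

# Hypothetical per-feature likelihoods P(x_i | C_k) for one class.
feature_likelihoods = np.array([0.8, 0.6, 0.3, 0.9])

# Under the independence assumption, the joint likelihood is simply their product.
joint_likelihood = np.prod(feature_likelihoods)      # 0.1296

# In practice, implementations typically sum log-probabilities instead,
# because products of many small numbers quickly underflow to zero.
log_joint = np.sum(np.log(feature_likelihoods))
print(joint_likelihood, np.exp(log_joint))           # both ~0.1296
```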
### Scenarios Where the Independence Assumption May Not Hold True:
#### **Highly Correlated Features**
In many real-world datasets, features are often correlated. For example, in a medical dataset, features like blood pressure and cholesterol levels may be correlated because both are related to heart disease. Naive Bayes, however, treats these features as independent, which can lead to suboptimal performance when these correlations are strong.
#### **Text Classification (Word Dependencies)**
In text classification, where features represent the presence or absence of words (e.g., “free”, “offer”, “buy”), words often co-occur in similar contexts. For instance, the words "buy" and "now" often appear together in phrases like “buy now”. Naive Bayes assumes that these words are independent of each other, but in reality, they are often contextually dependent, which can affect classification accuracy.
#### **Image Classification**
In image classification, where features represent pixel intensities or features derived
from image data (e.g., edges or textures), pixels are rarely independent. Adjacent pixels in an image are often highly correlated, but Naive Bayes would treat them as independent, leading to a loss of valuable spatial information.
#### **Weather Prediction**
Consider weather data where features include temperature, humidity, and wind speed. These features are often interdependent. For instance, high temperatures are likely to coincide with low humidity in many climates. Naive Bayes assumes independence, which may not hold true when features are correlated.
### Addressing the Independence Assumption
While the feature independence assumption is a simplification, it often leads to good performance in practice, particularly in high-dimensional spaces. However, there are several strategies to handle correlated features:
#### **Feature Engineering**: One approach is to reduce feature dependencies by creating new features that capture the relationships between the original features.
#### **Laplace Smoothing**: In cases where certain feature-class combinations are never observed in the training data, Laplace smoothing adds a small constant (typically 1) to all feature counts, ensuring no probability is ever zero. This adjustment helps the model handle unseen feature values by smoothing the probabilities: $$P(x_i|C_k)=\frac{count(x_i,C_k)+1}{count(C_k)+V}$$ Where $count(x_i, C_k)$ is the frequency of feature value $x_i$ in class $C_k$, $count(C_k)$ is the number of samples in class $C_k$, and $V$ is the number of distinct values the feature can take (the vocabulary size in text classification).
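A minimal sketch of the smoothed estimate, assuming categorical features; the counts below are hypothetical and simply mirror the formula above.
```python
def laplace_smoothed_likelihood(count_xi_ck, count_ck, num_values, alpha=1):
    """Smoothed P(x_i | C_k) = (count(x_i, C_k) + alpha) / (count(C_k) + alpha * V).

    count_xi_ck : how often value x_i occurs with class C_k in the training data
    count_ck    : number of training samples in class C_k
    num_values  : number of distinct values the feature can take (V)
    alpha       : smoothing constant, typically 1
    """
    return (count_xi_ck + alpha) / (count_ck + alpha * num_values)

# Hypothetical case: a feature value never seen with a class (0 of 2 samples),
# for a binary Yes/No feature (V = 2).
print(laplace_smoothed_likelihood(0, 2, 2))   # 0.25 instead of 0.0
```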
#### **Alternative Models**: If the assumption of independence is severely violated, alternative models like Decision Trees, Random Forests, or Support Vector Machines may better capture the dependencies between features.
### Types of Naive Bayes Classifiers:
#### 1. **Bernoulli Naive Bayes**
This classifier is suited for binary features, where each feature is represented as a binary value (0 or 1). It is commonly used for text classification tasks where the presence or absence of specific words (rather than their frequency) is important, such as in spam detection.
#### 2. **Multinomial Naive Bayes**
Ideal for handling discrete count data, this variant is frequently used in document classification tasks. It models features as word frequencies or counts, making it effective for text classification where the frequency of word occurrence plays a crucial role, such as categorizing news articles.
#### 3. **Gaussian Naive Bayes**
This version is designed for continuous data and assumes that the features follow a normal (Gaussian) distribution. It is typically used when the dataset consists of numerical values, like predicting medical conditions based on continuous measurements (e.g., age, blood pressure, etc.).
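To make the three variants concrete, the sketch below fits each scikit-learn implementation on tiny synthetic arrays; the data values are invented purely to match the input type each model expects.
```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB

y = np.array([0, 0, 1, 1])   # two classes, toy labels

# Bernoulli NB: binary presence/absence features (e.g. "word appears or not").
X_binary = np.array([[1, 0, 1], [0, 0, 1], [1, 1, 0], [0, 1, 0]])
print(BernoulliNB().fit(X_binary, y).predict([[1, 1, 1]]))

# Multinomial NB: non-negative counts (e.g. word frequencies per document).
X_counts = np.array([[3, 0, 1], [2, 0, 2], [0, 4, 1], [1, 3, 0]])
print(MultinomialNB().fit(X_counts, y).predict([[0, 2, 1]]))

# Gaussian NB: continuous features assumed normally distributed within each class.
X_cont = np.array([[1.2, 3.1], [0.9, 2.8], [3.4, 0.7], [3.1, 1.1]])
print(GaussianNB().fit(X_cont, y).predict([[3.0, 1.0]]))
```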
--------------
### Steps to Fit and Predict with Naive Bayes
#### **Fitting the Model**
1. **Calculate Prior Probabilities**
For each class $C_k$, calculate: $$P(C_k)=\frac{Number\:of\:samples\:in\:C_k}{Total\:number\:of\:samples}$$
2. **Calculate Likelihoods**
For each feature $x_i$ and class $C_k$, compute:
- **Categorical Data**: the relative frequency of each category of $x_i$ within class $C_k$.
<img src="https://hackmd.io/_uploads/HJyiA-hm1x.png" alt="image" style="border: 2px solid black; padding: 5px;">
- **Continuous Data**: Use the Gaussian likelihood:
$$P(x_i|C_k)=\frac{1}{\sqrt{2\pi\sigma_k^2}}\exp\left(-\frac{(x_i-\mu_k)^2}{2\sigma^2_k}\right)$$ where $\mu_k$ and $\sigma_k^2$ are the mean and variance of feature $x_i$ in class $C_k$.
<img src="https://hackmd.io/_uploads/r1JKXEY41e.png" alt="image" style="border: 2px solid black; padding: 5px;">
#### **Making Predictions**
1. For a new data point $X=\{x_1,x_2,\dots,x_n\}$, compute the posterior probability for each class $C_k$:
$$P(C_k|X)\propto P(C_k)\cdot\prod_{i=1}^nP(x_i|C_k)$$
2. Assign the class label with the highest posterior probability: $$\hat{C}=\arg\max_{C_k}P(C_k|X)$$
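The following is a minimal from-scratch sketch of these fit and predict steps for continuous features, assuming Gaussian likelihoods; the tiny dataset and variable names are illustrative only.
```python
import numpy as np

X = np.array([[1.0, 2.1], [1.2, 1.9], [3.0, 4.2], [3.3, 3.8]])  # toy features
y = np.array([0, 0, 1, 1])                                       # toy labels

classes = np.unique(y)
priors, means, variances = {}, {}, {}

# Fitting: class priors plus per-class mean and variance of each feature.
for c in classes:
    Xc = X[y == c]
    priors[c] = len(Xc) / len(X)           # P(C_k)
    means[c] = Xc.mean(axis=0)             # mu_k for each feature
    variances[c] = Xc.var(axis=0) + 1e-9   # sigma_k^2 (epsilon for stability)

def gaussian_likelihood(x, mu, var):
    # Gaussian P(x_i | C_k), evaluated element-wise for every feature.
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def predict(x_new):
    # Posterior up to the constant P(X) for each class, then argmax.
    posteriors = {
        c: priors[c] * np.prod(gaussian_likelihood(x_new, means[c], variances[c]))
        for c in classes
    }
    return max(posteriors, key=posteriors.get)

print(predict(np.array([1.1, 2.0])))   # expected: class 0
print(predict(np.array([3.2, 4.0])))   # expected: class 1
```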
**************
### Demonstration of Naive Bayes Algorithm:
**Email Classification (Spam vs. Not Spam)**
Consider a dataset where each email is represented by three features:
- Feature 1: Contains “Free” (Yes/No).
- Feature 2: Contains “Money” (Yes/No).
- Feature 3: Contains "Discount" (Yes/No).
- Here, $C_k$={Spam, Not Spam}
|Email ID | Free | Money | Discount | Class (Spam/Not Spam) |
| -------- | ---- | --- | -------- | --------------------- |
| 1 | Yes | No | Yes | Spam |
| 2 | No | No | Yes | Not Spam |
| 3 | Yes | Yes | No | Spam |
| 4 | No | Yes | No | Not Spam |
- The table indicates that if an email contains the word **Free** (alone or together with **Money**) in its subject, it is marked as spam, whereas an email containing only **Discount** or only **Money** is treated as not spam.
**Step 1**: **Compute Priors**: $$P(Spam)=\frac{2}{4}=\frac{1}{2},\:P(Not\:Spam)=\frac{2}{4}=\frac{1}{2}$$ **Step 2:** **Compute Likelihoods** for each feature.
For **Spam**:
- $P(Free = Yes | Spam) =\frac{2}{2}=1$
- $P(Discount = Yes | Spam) = \frac{1}{2} = 0.5$
- $P(Money = Yes | Spam) = \frac{1}{2} = 0.5$
For **Not Spam**:
- $P(Free = Yes | Not Spam) = \frac{0}{2} = 0$
- $P(Discount = Yes | Not Spam) = \frac{1}{2} = 0.5$
- $P(Money = Yes | Not Spam) = \frac{1}{2} = 0.5$
**Step 3: Classify a New Email**
The new email has the features **Free = Yes**, **Money = Yes**, and **Discount = Yes**. We calculate the posterior probability for each class using Bayes' Theorem.
#### **For Spam**:
$$P(Spam | Free = Yes, Money = Yes, Discount = Yes)\propto P(Spam)\cdot P(Free = Yes | Spam)\cdot P(Money = Yes | Spam)\cdot P(Discount = Yes | Spam)$$ $$=0.5\times1\times0.5\times0.5=0.125$$
#### **For Not Spam:**
$$P(Not\:Spam | Free = Yes, Money = Yes, Discount = Yes)\propto P(Not\:Spam)\cdot P(Free = Yes | Not\:Spam)\cdot P(Money = Yes | Not\:Spam)\cdot P(Discount = Yes | Not\:Spam)$$ $$=0.5\times0\times0.5\times0.5=0$$
#### Step 4: Make a Decision
Since $P(Spam | Free = Yes, Money = Yes, Discount = Yes)>P(Not\:Spam | Free = Yes, Money = Yes, Discount = Yes)$, the new email is classified as **Spam**.
<img src="https://hackmd.io/_uploads/HkO-yG2X1l.png" alt="image" style="border: 2px solid black; padding: 5px;">
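For completeness, the numbers above can be reproduced in a few lines. The sketch below hand-computes the unnormalized posteriors from the table, without smoothing, so the zero for the Not Spam class carries through exactly as in Step 3.
```python
# Reproducing the worked example by hand (no smoothing), using the
# likelihoods estimated from the four-email table above.
p_spam, p_not_spam = 0.5, 0.5

likelihoods_spam     = {"Free": 1.0, "Money": 0.5, "Discount": 0.5}
likelihoods_not_spam = {"Free": 0.0, "Money": 0.5, "Discount": 0.5}

# New email: Free = Yes, Money = Yes, Discount = Yes.
score_spam, score_not_spam = p_spam, p_not_spam
for word in ("Free", "Money", "Discount"):
    score_spam *= likelihoods_spam[word]
    score_not_spam *= likelihoods_not_spam[word]

print(score_spam, score_not_spam)   # 0.125 vs 0.0 -> classified as Spam
```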
***********
### Applications of Naive Bayes Classifier:
- Spam detection, Sentiment analysis (positive/negative reviews)
- Topic categorization, Language detection
- Disease prediction, Gene classification
- Fake news detection, User behavior analysis
- Credit card fraud detection, Insurance fraud detection
### **Strengths:**
- Efficient and fast for both training and prediction.
- Works well even with small datasets.
- Handles high-dimensional data well.
### **Weaknesses:**
- The assumption of feature independence often does not hold in real-world data.
- Performs poorly when data contains highly correlated features.
- Zero-frequency problem: If a feature value doesn’t appear in the training data for a given class, it assigns zero probability to that class.
- Ties: if two or more classes end up with equal posterior probabilities, the classifier must either assign a class arbitrarily or rely on a predefined tie-breaking rule.
***********
### Conclusion
Naive Bayes is an essential tool in the machine learning toolkit. Despite its simplicity and assumptions, it delivers reliable results for a variety of tasks. Its practical use cases demonstrate its robustness, especially in scenarios with limited data and high feature dimensions.