## Introduction:
Naive Bayes is a classification technique based on probability theory, specifically Bayes' Theorem. It is called "naive" because it makes the simplifying assumption that each feature contributes independently to the final classification. This assumption rarely holds in real-world scenarios, because features are seldom truly independent of each other. In fact, ignoring multicollinearity in this way can impact classification accuracy, depending on the type of classification problem.

- *Figure: Bayes Theorem for Naive Bayes Classifier*
### Assumption Explained:

- *Figure: Good and Bad features for Naive Bayes Assumption in real world scenario*
Naive Bayes classification is based on the assumption that each feature contributes to the final classification independently.
Let's take the example of classifying emails as regular email vs. spam email. Here, the features of an email could be the occurrence of words like "alert", "discount", "free", and "hurry" that might be used to decide whether an incoming email is fraudulent or genuine. The naive assumption in this case means that each feature (word) can be evaluated independently to decide whether the incoming email is regular or spam.
The drawback of this assumption is that it does not hold in many real-world scenarios. For instance, if symptoms like fever, cough, and red eyes are treated as features that independently determine the type of disease, it becomes very difficult to classify the disease effectively, since symptoms are interrelated. A patient with the flu can have red eyes and a fever, while a patient with an allergy can also have red eyes and a fever.
Despite this, Naive Bayes classifiers exhibit remarkably high performance in many applications, especially text classification, spam filtering, and recommendation systems.
## Types of Data:
Naive Bayes can work with various types of data by using the different classifier variants listed below; a short code sketch after the three variants shows each one in use.
### 1. Gaussian Naive Bayes:
This classifier works with **continuous** data. It assumes that the continuous features follow a normal distribution (Gaussian distribution).
- For instance, age (25, 30, 40, ...), height (5.5, 6.0, 7.1 feet, ...), and weight (150, 200, 250 lbs, ...) are real-valued, continuous features that can be classified using Gaussian Naive Bayes.

- *Figure: Normal distribution for Gaussian Naive Bayes*
### 2. Multinomial Naive Bayes:
This classifier works with **discrete** data such as word counts in text classification.
- For instance, the word "but" might appear 4 times in a document and the word "okay" 10 times. Count data of this kind can be classified using Multinomial Naive Bayes.
### 3. Bernoulli Naive Bayes:
This classifier works best with **binary** (boolean) data. It is also suitable for text classification where each feature indicates whether a word is present or absent.
- For instance, "a bird can fly" is $1$, "a fish can swim" is $1$, and "a human can fly" is $0$. This type of data can be classified using Bernoulli Naive Bayes.
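As a quick illustration, here is a minimal sketch using scikit-learn's implementations of the three variants. The tiny arrays are made-up placeholder data, chosen only to show which kind of input each classifier expects.
```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

# Gaussian NB: continuous features (e.g. age in years, height in feet, weight in lbs)
X_cont = np.array([[25, 5.5, 150], [30, 6.0, 200], [40, 7.1, 250], [22, 5.8, 160]])
y_cont = np.array([0, 0, 1, 1])
print(GaussianNB().fit(X_cont, y_cont).predict([[28, 5.9, 170]]))

# Multinomial NB: discrete count features (e.g. how often each word appears)
X_counts = np.array([[4, 10, 0], [1, 2, 7], [0, 1, 9], [5, 8, 1]])
y_counts = np.array([0, 1, 1, 0])
print(MultinomialNB().fit(X_counts, y_counts).predict([[3, 6, 0]]))

# Bernoulli NB: binary features (e.g. a word is present = 1 or absent = 0)
X_bin = np.array([[1, 0, 1], [0, 1, 1], [1, 1, 0], [0, 0, 1]])
y_bin = np.array([1, 0, 1, 0])
print(BernoulliNB().fit(X_bin, y_bin).predict([[1, 0, 0]]))
```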
## Strengths and Weaknesses:
There are both advantages and disadvantages to using the Naive Bayes classifier for classification problems. Some of its strengths and weaknesses are listed below.
### Strengths:
* It is simple and easy to understand and interpret.
* It can be implemented quickly due to its low computational cost.
* It is effective for both large and limited datasets.
* Since it is probabilistic in nature, it is relatively robust to irrelevant features in the dataset.
### Weaknesses:
* Its assumption that features are independent of each other rarely holds, which is a limitation for many real-world problems.
* If a feature value never appears in the training set for a class, its estimated probability is zero, which can push the whole score for that class to zero and distort the classification.
* It is often not suitable for continuous features without modification (for example, the Gaussian assumption or discretization).
## Mathematics behind Naive Bayes:

- *Figure: Flowchart of Naive Bayes Mathematics*
### 1. Bayes Theorem:
Naive Bayes classification relies on Bayes' Theorem, which states:
- $P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}$
where,
- $P(A|B)$ : Posterior probability of class $A$ given features $B$.
- $P(B|A)$ : Likelihood of observing features $B$ given that $A$ is the true class.
- $P(A)$ : Prior probability of class $A$.
- $P(B)$ : Probability of observing features $B$ across all classes (the evidence).
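To make the formula concrete, here is a tiny worked example in Python. The prior, likelihood, and evidence values are made-up numbers chosen only for illustration.
```python
# Made-up numbers: one class A (e.g. "spam") and one feature B (e.g. the word "free" appears)
prior = 0.3        # P(A): prior probability of the class
likelihood = 0.5   # P(B|A): probability of the feature given the class
evidence = 0.2     # P(B): probability of the feature across all classes

posterior = likelihood * prior / evidence  # Bayes' Theorem: P(A|B)
print(posterior)   # ≈ 0.75
```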
### 2. Independent Features:
Now, let's say $x_1, x_2, x_3, \dots, x_n$ are the different features.
Substituting these features into Bayes' Theorem, we get:
- $P(A | x_1, x_2, x_3, \dots, x_n) = \frac{P(x_1, x_2, x_3, \dots, x_n | A) \cdot P(A)}{P(x_1, x_2, x_3, \dots, x_n)}$
Since each feature is **conditionally independent** of the others under the Naive Bayes assumption, we can simplify the **likelihood** term by expressing it as the product of individual conditional probabilities:
- $P(x_1, \dots, x_n | A) = \prod_{i=1}^n P(x_i | A)$
where $\prod_{i=1}^n$ is the product (**Pi**) notation; it condenses the multiplication of the terms from $1$ to $n$ into a single symbol.
### 3. Simplified Bayes Theorem:
Substituting the independence assumption into Bayes' Theorem, we get the following equation.
- $P(A | x_1, x_2, x_3, \dots, x_n) = \frac{P(A) \prod_{i=1}^n P(x_i | A)}{P(x_1, x_2, x_3, \dots, x_n)}$
The denominator $P(x_1, x_2, x_3, \dots, x_n)$ is the same for every class, so it does not affect which class scores highest. We can therefore drop this term, which gives the following proportionality:
- $P(A | x_1, x_2, x_3, \dots, x_n) \propto P(A) \prod_{i=1}^n P(x_i | A)$
### 4. Classification Rule:
Now, in order to assign a class label based on the features, we need to find the class $A$ that maximizes the posterior probability.
- $\hat{A} = \arg \max_A P(A) \prod_{i=1}^n P(x_i | A)$
where $\arg \max_A$ chooses the input value that yields the highest output value. In our case, it evaluates the score of each class $A$ given the features and selects the class with the highest score as the predicted class $\hat{A}$.
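The whole derivation above can be condensed into a few lines of code. The sketch below is a from-scratch illustration with made-up priors and likelihood tables; it also works in log space, a common practical trick (not required by the math above) that avoids multiplying many small numbers.
```python
import math

# Made-up priors and per-feature likelihoods P(x_i | class), for illustration only
priors = {"A": 0.6, "B": 0.4}
likelihoods = {
    "A": {"x1": 0.2, "x2": 0.7},
    "B": {"x1": 0.5, "x2": 0.1},
}

def predict(features):
    # Score each class: log P(class) + sum of log P(x_i | class), then take the arg max
    scores = {
        cls: math.log(priors[cls]) + sum(math.log(likelihoods[cls][x]) for x in features)
        for cls in priors
    }
    return max(scores, key=scores.get)

print(predict(["x1", "x2"]))  # "A", since 0.6 * 0.2 * 0.7 > 0.4 * 0.5 * 0.1
```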

- *Figure: Naive Bayes Classifier Plot*
## Fitting and Predicting using Naive Bayes:
There are several steps a computer takes to fit a Naive Bayes model and then make predictions on new data. The steps for fitting and predicting with Naive Bayes classification are explained in detail below.

- *Figure: Naive Bayes Classifier Model for Spam Email Filtration*
### Step 1: Organize Data
There may be numerous features, such as the words "buy", "alert", "free", and "hurry", that could be used to detect spam emails. These words are organized as features $x_1, x_2, x_3, \dots, x_n$, and every email is labeled as class $A$ (regular email) or class $A'$ (spam email).
#### Table Summary:
| Feature ($x$) | Word | Occurrences in Class $A$ | Occurrences in Class $A'$ |
| ------------- |:----- |:------------------------ | ------------------------- |
| $x_1$ | "buy" | occurrence count | occurrence count |
| $x_2$ | "alert" | occurrence count | occurrence count |
| $x_3$ | "free" | occurrence count | occurrence count |
| $x_4$ | "hurry" | occurrence count | occurrence count |
Let's assume we have $n = 10$ emails with features $x_1, x_2, x_3$, and $x_4$. Each of the $n$ emails is labeled as either class $A$ or class $A'$.
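As a minimal sketch of this organization step (the toy emails and labels below are made up), each email can be turned into a vector of counts of the feature words, paired with its class label:
```python
# Made-up toy corpus: each email is paired with its label ("regular" or "spam")
emails = [
    ("buy now and get it free", "regular"),
    ("hurry alert free free discount", "spam"),
    ("alert please buy the report", "regular"),
]
feature_words = ["buy", "alert", "free", "hurry"]  # x1, x2, x3, x4

# Turn each email into a vector of counts of the feature words
X = [[text.split().count(w) for w in feature_words] for text, _ in emails]
y = [label for _, label in emails]

print(X)  # [[1, 0, 1, 0], [0, 1, 2, 1], [1, 1, 0, 0]]
print(y)  # ['regular', 'spam', 'regular']
```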
### Step 2: Calculate Prior Probability
Since we have $n = 10$ samples, we calculate the prior probability of each class using the following formula:
- $P(A) = \frac{\text{number of samples in class } A}{\text{total number of samples}}$
Let's assume $7$ of the emails are found to be regular emails and the remaining $3$ are spam.
#### Table Summary:
| Class | Samples | Prior Probability |
| ---------------------- | ------- |:--------------------- |
| Regular Email $(A)$ | 7 | 0.7 |
| Spam Email ($A{\prime})$ | 3 | 0.3 |
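In code, the prior is simply the fraction of training samples that belong to each class. A minimal sketch using the assumed counts from the table (7 regular, 3 spam):
```python
from collections import Counter

labels = ["regular"] * 7 + ["spam"] * 3   # the 10 assumed training labels
counts = Counter(labels)

priors = {cls: n / len(labels) for cls, n in counts.items()}
print(priors)  # {'regular': 0.7, 'spam': 0.3}
```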
### Step 3: Calculate Likelihood
As we know from the Naive Bayes assumption, the features are conditionally independent of each other. This lets us calculate the likelihood of each feature $x_1, x_2, x_3$, and $x_4$ separately for each class.
Since word counts are discrete data, we can use the following equation to calculate the likelihood of each feature word given class $A$ and class $A'$:
- $P(x_i | A) = \frac{\text{count of word } x_i \text{ in class } A}{\text{total count of feature words in class } A}$
where $x_i$ is a specific feature word in the email, such as "buy".
- Let's say a total of $8$ feature-word occurrences were counted across the regular emails $(n = 7)$ and $16$ across the spam emails $(n = 3)$.
#### Table Summary:
| Feature ($x$) | $P(x_i \mid A)$ | $P(x_i \mid A')$ |
| -------------- | -------------------- | --------------------------- |
| $x_1$: "buy" | $\frac{4}{8} = 0.5$ | $\frac{1}{16} = 0.063$ |
| $x_2$: "alert" | $\frac{1}{8} = 0.13$ | $\frac{8}{16} = 0.5$ |
| $x_3$: "free" | $\frac{3}{8} = 0.38$ | $\frac{3}{16} = 0.19$ |
| $x_4$: "hurry" | $\frac{2}{8} = 0.25$ | $\frac{4}{16} = 0.25$ |
#### Laplace Smoothing:
Generally, one weakness of Naive Bayes classification is that a feature that never occurs in a class gets a zero probability, which makes the entire product for that class zero. To minimize this issue, **Laplace smoothing** is applied: a small constant $\alpha$ is added to each feature count in the class, and the likelihood is calculated afterwards.
- $P(x_i|A) = \frac{\text{count}(x_i, A) + \alpha}{\text{count}(A) + \alpha \cdot n}$
where $n$ is the number of possible values (distinct feature words) that $x_i$ can take.
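A minimal sketch of the smoothed likelihood estimate is shown below. The counts are the assumed numbers from the tables above, and `alpha = 1` is a common default choice:
```python
def smoothed_likelihood(word_count, total_count, vocab_size, alpha=1.0):
    """P(x_i | class) with Laplace smoothing."""
    return (word_count + alpha) / (total_count + alpha * vocab_size)

# Class A (regular email) from the table: 8 feature-word occurrences, 4 possible words
print(smoothed_likelihood(4, 8, 4))  # "buy": (4 + 1) / (8 + 4) ≈ 0.417
print(smoothed_likelihood(0, 8, 4))  # unseen word: (0 + 1) / (8 + 4) ≈ 0.083, not zero
```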
### Step 4: Fit the Model
Once the prior probabilities $P(A)$ and $P(A')$ and the likelihoods $P(x_i|A)$ and $P(x_i|A')$ have been estimated from the $n$ labeled samples with features $x_1, x_2, x_3$, and $x_4$, the model is fitted and ready to predict the class of a new email (spam vs. regular) from its features.
### Step 5: Predict New Data
Predicting new data means finding the class with the highest probability given the observed features $x_1, x_2, x_3, \dots, x_m$, so as to determine whether the email is spam or regular.
This is done using the following formula, where the prior probability of each class is multiplied by the likelihoods of the observed features, giving a combined score for each class given the features.
- $P(A | x_1, x_2, \dots, x_m) \propto P(A) \prod_{i=1}^m P(x_i | A)$
For instance, suppose an incoming email contains the words "alert" and "hurry". We want to use our trained model to predict whether this email is regular or spam.
**Regular Email (Class $A$):**
- $P(A | \text{alert}, \text{hurry}) \propto P(A) \cdot P(\text{alert} | A) \cdot P(\text{hurry} | A)$
**Spam Email (Class $A'$):**
- $P(A' | \text{alert}, \text{hurry}) \propto P(A') \cdot P(\text{alert} | A') \cdot P(\text{hurry} | A')$
#### Table Summary:
| Class | Probability Calculation | Final Score |
| ------------------------ | --------------------------- |:----------- |
| Regular Email Class $(A)$ | $0.7 \cdot 0.13 \cdot 0.25$ | $0.023$ |
| Spam Email Class $(A{\prime})$ | $0.3 \cdot 0.5 \cdot 0.25$ | $0.038$ |
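The same arithmetic can be checked in a couple of lines of Python, using the prior and likelihood values from the tables above:
```python
# Scores for an email containing "alert" and "hurry", using the table values above
score_regular = 0.7 * 0.13 * 0.25   # P(A)  * P(alert | A)  * P(hurry | A)  ≈ 0.023
score_spam    = 0.3 * 0.50 * 0.25   # P(A') * P(alert | A') * P(hurry | A') ≈ 0.038

print("spam" if score_spam > score_regular else "regular")  # spam
```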
### Step 6: Select Highest Probability
The following equation is then used to select the class with the highest combined score.
- $\hat{A} = \arg \max_A \left( P(A) \prod_{i=1}^m P(x_i | A) \right)$
- $\hat{A} = A'$, with the highest score of $0.038$
Since the highest score, $0.038$, belongs to the spam email class $(A')$, the incoming email containing "alert" and "hurry" is classified as spam. In this way, any incoming email can be passed through the trained model to obtain its classification.
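For comparison, here is a minimal end-to-end sketch using scikit-learn's `MultinomialNB`. The tiny training matrix is made up, so the exact probabilities will differ from the hand computation above, but the fit-and-predict workflow is the same:
```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Columns: counts of "buy", "alert", "free", "hurry"; labels: 0 = regular, 1 = spam
X_train = np.array([[2, 0, 1, 0],
                    [1, 1, 0, 1],
                    [0, 3, 1, 2],
                    [0, 2, 1, 1]])
y_train = np.array([0, 0, 1, 1])

model = MultinomialNB(alpha=1.0)      # alpha is the Laplace smoothing constant
model.fit(X_train, y_train)

new_email = np.array([[0, 1, 0, 1]])  # contains "alert" and "hurry" once each
print(model.predict(new_email))       # [1] -> spam for this toy data
print(model.predict_proba(new_email)) # probability of each class
```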
## Summary:
Naive Bayes classification is a simple yet powerful technique that can efficiently handle classification tasks, particularly those involving text-based data. Even though it does not take the correlation of features into consideration, treating features independently keeps the method fast and reliable in many cases. The different variants of Naive Bayes classifiers, such as Gaussian, Multinomial, and Bernoulli, extend the method to different types of data and, with modifications such as Laplace smoothing, address its main drawbacks. Hence, Naive Bayes classification remains a robust technique for a wide range of classification problems.
## References:
1. Wikipedia contributors. (n.d.). Naive Bayes classifier. In Wikipedia. Retrieved December 1, 2024, from https://en.wikipedia.org/wiki/Naive_Bayes_classifier
2. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., … & Duchesnay, É. (2011). Scikit-learn: Machine Learning in Python. Retrieved December 1, 2024, from https://scikit-learn.org/1.5/modules/naive_bayes.html
3. IBM. (n.d.). What is Naive Bayes?. Retrieved December 1, 2024, from https://www.ibm.com/topics/naive-bayes
4. Starmer, J. (n.d.). Naive Bayes, Clearly Explained!!! [Video]. YouTube. Retrieved December 1, 2024, from https://www.youtube.com/watch?v=O2L2Uv9pdDA