**Naive Bayes Classifier Overview:**
Naive Bayes is a probabilistic classifier based on Bayes' theorem, combined with the "naive" assumption that features are conditionally independent given the class. It's particularly effective for text classification tasks such as spam detection and sentiment analysis: it computes the probability that a data point belongs to each class given its features and predicts the most probable class.
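Concretely, the classifier applies Bayes' theorem under that independence assumption and predicts the class that maximizes the posterior (the standard formulation, shown here for reference):
```latex
P(y \mid x_1, \dots, x_n) \;\propto\; P(y) \prod_{i=1}^{n} P(x_i \mid y),
\qquad
\hat{y} = \arg\max_{y}\, P(y) \prod_{i=1}^{n} P(x_i \mid y)
```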
We'll use a text classification example for this implementation.
**Example Using a Text Classification Dataset:**
**Step 1: Import Libraries**
```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report
```
**Step 2: Prepare the Text Data**
```python
# Sample text data
text_data = ["This is a positive message.",
"Negative sentiment in this text.",
"A positive outlook is important.",
"This doesn't look good."]
# Corresponding labels (0 for negative, 1 for positive)
labels = np.array([1, 0, 1, 0])
```
**Step 3: Vectorize the Text Data**
```python
# Create a CountVectorizer to convert text into a numerical format
vectorizer = CountVectorizer()
# Fit and transform the text data into a document-term matrix
X = vectorizer.fit_transform(text_data)
```
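If you want to inspect the result, the snippet below prints the learned vocabulary and a dense view of the matrix. This is just a quick sanity check; `get_feature_names_out` assumes scikit-learn 1.0 or newer.
```python
# Each column of X corresponds to one vocabulary word
print(vectorizer.get_feature_names_out())
# Dense view of the document-term matrix (fine for tiny data;
# avoid .toarray() on large corpora, since X is stored sparse)
print(X.toarray())
```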
**Step 4: Split the Data into Training and Testing Sets**
```python
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2, random_state=42)
```
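With only four documents, this split leaves a single test sample, which is worth verifying (purely a sanity check on this toy dataset):
```python
# On this toy dataset: 3 training documents, 1 test document
print("Train shape:", X_train.shape)
print("Test shape:", X_test.shape)
```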
**Step 5: Create and Train the Naive Bayes Classifier Model**
```python
# Create a Naive Bayes classifier model with customizable parameter
naive_bayes_model = MultinomialNB(alpha=1.0)
# Train the model on the training data
naive_bayes_model.fit(X_train, y_train)
```
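After fitting, the model exposes its learned statistics; the attributes below are part of `MultinomialNB`'s public API and are shown here just as a peek under the hood:
```python
# Log prior probability of each class, log P(y)
print(naive_bayes_model.class_log_prior_)
# Smoothed log likelihood of each word given each class, log P(x_i | y)
# (shape: n_classes x n_vocabulary_words)
print(naive_bayes_model.feature_log_prob_.shape)
```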
**Parameters That Can Be Changed**
1. **alpha** (default=1.0):
- Additive (Laplace/Lidstone) smoothing parameter. `alpha=1.0` is Laplace smoothing, values between 0 and 1 give Lidstone smoothing, and `alpha=0` means no smoothing. Smoothing prevents zero probabilities for words that never co-occur with a given class in the training data, which improves the model's handling of rare words; a sketch of tuning this value follows below.
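In practice you would choose `alpha` on held-out data. The loop below is a minimal sketch of such a sweep, reusing the variables from the steps above; the candidate values are illustrative, and scores on a one-sample test set are not meaningful.
```python
# Illustrative sweep over smoothing strengths; the values are arbitrary
for alpha in (0.01, 0.1, 0.5, 1.0):
    model = MultinomialNB(alpha=alpha)
    model.fit(X_train, y_train)
    print(f"alpha={alpha}: test accuracy = {model.score(X_test, y_test):.2f}")
```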
**Step 6: Make Predictions**
```python
# Make predictions on the test data
y_pred = naive_bayes_model.predict(X_test)
```
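`MultinomialNB` can also return per-class probabilities rather than hard labels, which is useful for thresholding or ranking predictions:
```python
# Probability of each class (columns ordered as in naive_bayes_model.classes_)
y_proba = naive_bayes_model.predict_proba(X_test)
print(y_proba)
```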
**Step 7: Evaluate the Model**
```python
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")
# Generate a classification report
# Pass labels explicitly: with a one-sample test set, y_test may contain
# only one class, and target_names would otherwise mismatch
classification_rep = classification_report(y_test, y_pred, labels=[0, 1],
                                           target_names=["Negative", "Positive"],
                                           zero_division=0)
print("Classification Report:")
print(classification_rep)
```
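To classify unseen text, transform it with the already-fitted vectorizer rather than fitting a new one (the example sentence below is made up):
```python
# Hypothetical new message; transform (not fit_transform) reuses the vocabulary
new_message = ["What a great and positive result!"]
new_X = vectorizer.transform(new_message)
print(naive_bayes_model.predict(new_X))  # 1 = positive, 0 = negative
```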
**Explanation:**
1. We import the necessary libraries: NumPy for numerical operations and scikit-learn for text vectorization (`CountVectorizer`), the Naive Bayes classifier (`MultinomialNB`), data splitting, and evaluation metrics.
2. We prepare a sample text dataset with corresponding labels. In this example, 0 represents negative sentiment, and 1 represents positive sentiment.
3. We use CountVectorizer to convert the text data into a numerical format, creating a document-term matrix where each row represents a document (text) and each column represents a unique word (feature).
4. The data is split into training and testing sets, with 80% used for training and 20% for testing; on this four-document toy dataset, that means three training documents and a single test document.
5. We create a Naive Bayes classifier model using `MultinomialNB`, which is suitable for text data.
6. The `alpha` parameter controls the amount of additive smoothing applied to the probability estimates. `alpha=1.0` corresponds to Laplace smoothing (values between 0 and 1 give Lidstone smoothing), which prevents zero probabilities and can improve the model's performance on rare words.
7. The model is trained on the training data using `fit`.
8. We use the trained model to make predictions on the test data.
9. We evaluate the model's performance using accuracy and generate a classification report that includes precision, recall, F1-score, and support for each class.