# CS 410/1411 Homework 5: Gradient Descent & Naive Bayes
**Due Date: 10/16/2024 at 12pm**
**Need help?** Remember to check out [Edstem](https://edstem.org/us/courses/61309) and our website for TA assistance.
## Assignment Overview
This assignment comprises two distinct parts:
<!-- 1. A few **ungraded** written exercises, designed to reinforce your knowledge of basic matrices and matrix operations (multiplication, inverses, etc.), Bayes' rule, and computing gradients. -->
1. A `numpy` primer, intended to introduce you to the `numpy` library, which facilitates coding with multidimensional objects (arrays, matrices, etc.). Here you implement gradient descent using `autograd`'s built-in automatic differentiation function, `grad`.
2. A supervised learning task, in which your goal is to build a binary classifier that labels email as "ham" or "spam". The algorithm you will be using is called Naive Bayes. (It is a fairly sophisticated algorithm, in spite of its disparaging name.)
### Learning Objectives
<!--
What you will know:
- that matrix multiplication corresponds to linear transformations
- how to compute probabilities manually via Bayes' rule
-->
What you will know:
- how gradient descent works
- how a Naive Bayes classifier works
What you will be able to do:
- translate mathematical formulas into code
- analytically and automatically differentiate a function
- use `numpy` to manipulate multidimensional objects in Python
## Introduction
Steve is being bombarded with spam, and it's cluttering his Inbox just as he's preparing for an exciting journey to a rare Mushroom biome. Before he can embark on this adventure, he needs to get his spam problem under control. Using `numpy` and machine learning techniques like gradient descent and Naive Bayes, you can help Steve build a spam filter to keep his Inbox clean. With a streamlined email system, he'll be able to focus on exploring new biomes without distractions!
## Part 1: A `Numpy` Primer
In other popular programming languages, like Java, C, and C++, the most common way to store data is in arrays. In Python, the most common data structure is a *list*, which is akin to an [ArrayList](https://www.w3schools.com/java/java_arraylist.asp) in Java or a [Vector](https://en.cppreference.com/w/cpp/container/vector) in C++. The primary difference between an array and a list is that a list can grow without a predeclared bound, while the maximum size of an array is declared in advance.
A second key difference between lists and arrays in Python is that arrays are **monomorphic**, meaning all data stored in an array must be of the same type. (**N.B.** Programming with monomorphic lists is good programming practice regardless, even though this practice is not mandated by Python. Why? Because it makes debugging--where you spend most of your time--much easier!)
Why does Python mandate that arrays be monomorphic, but not lists? One key reason comes down to the question of memory allocation: lists are stored as something like linked lists of objects (of potentially varying sizes). In contrast, the requisite *total* amount of space in memory can be allocated contiguously to arrays of finite size and monomorphic types.
Python's version of arrays is provided by the `numpy` library. `Numpy`'s array operations are *much* faster than Python's list operations because their efficient allocation of space facilitates parallel computation and specialized CPU instructions. To take advantage of these efficiencies, it is necessary to adopt a [vectorization](https://www.practicaldatascience.org/notebooks/class_2/week_4/11_vectorization.html) world view, that is, to reimagine your programs as single operations on vectors, rather than as multiple operations performed iteratively while traversing data in a collection. Vectorized code is not only faster; it is often easier to read and more intuitive to write.
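As a small illustration (values and sizes chosen arbitrarily), compare a Python loop to a single vectorized `numpy` call for the same computation:
```
import time
import numpy as np

x = np.random.randn(1_000_000)

# Looped version: square and accumulate one entry at a time.
start = time.perf_counter()
total = 0.0
for value in x:
    total += value ** 2
print("loop:      ", time.perf_counter() - start)

# Vectorized version: a single operation on the whole array.
start = time.perf_counter()
total = np.sum(x ** 2)
print("vectorized:", time.perf_counter() - start)
```
On a typical laptop, the vectorized version usually runs one to two orders of magnitude faster, though the exact ratio depends on your machine.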
:::warning
**Matrix operations**
As the name suggests, `numpy` provides simple syntax for numerical Python and vectorized operations on vectors. As matrices can be viewed as a concatenation of multiple vectors, it also provides simple syntax for vectorized operations on matrices, e.g., matrix multiplication, transposes, inverses.
:::
Here is a [quickstart guide](https://numpy.org/devdocs/user/quickstart.html) for learning the basic syntax and attributes of `numpy` arrays, and how to operate on them.
:::spoiler **Hint:** `numpy` library
You can install `numpy` by running `pip install numpy`.
:::
:::info
**Practice (ungraded) Task**
Implement `is_valid_matrix_multiplication`: Takes as input an ordered list of matrices and returns a `bool` indicating whether or not the matrices can be multiplied in the given order.
For example, for two matrices $M_1\in \mathbb{R}^{3 \times 5}$ and $M_2 \in \mathbb{R}^{5 \times 2}$, the function should return `True`, since $M_1 \times M_2 \in \mathbb{R}^{3 \times 2}$.
But, for two matrices $M_3\in \mathbb{R}^{5\times 4}$ and $M_4 \in \mathbb{R}^{3\times 2}$, the function should return `False`, since $M_3\times M_4$ is not a legal matrix multiplication. (One possible sketch appears after the hint and warning below.)
:::
:::spoiler **Hint:** `.shape` property
Use the `.shape` property of `numpy` arrays to find the dimensions of the input matrices.
:::
:::warning
**Warning**
Do not try multiplying the matrices!
:::
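Since this task is ungraded, here is one possible sketch. It assumes the input is a list of 2-D `numpy` arrays, relies only on the `.shape` property, and never performs an actual multiplication:
```
import numpy as np

def is_valid_matrix_multiplication(matrices):
    """Return True iff the matrices can be multiplied in the given order."""
    for left, right in zip(matrices, matrices[1:]):
        # For (m x n) times (n x p), the inner dimensions must agree.
        if left.shape[1] != right.shape[0]:
            return False
    return True

# Example: a (3 x 5) matrix times a (5 x 2) matrix is legal; (5 x 4) times (3 x 2) is not.
print(is_valid_matrix_multiplication([np.zeros((3, 5)), np.zeros((5, 2))]))  # True
print(is_valid_matrix_multiplication([np.zeros((5, 4)), np.zeros((3, 2))]))  # False
```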
### Part 1a: Fun with Vectors!
:::info
**Task**
Implement the following four functions in `numpy_primer.py`.
:::
1. `mean_squared_error`: Given two sets of input labels $y_{\mathrm{true}} \in \mathbb{R}^n$ and $y_{\mathrm{pred}} \in \mathbb{R}^n$, return their **mean squared error** (MSE). The MSE is defined as: $$\mathrm{MSE}=\frac1n \sum_{i=1}^n (y_{\mathrm{true}}^i - y_{\mathrm{pred}}^i)^2$$
where $y_{\mathrm{pred}}^i$ is the $i$th component of $y_{\mathrm{pred}}$. Your implementation should be *vectorized*; it should not use a loop. In fact, it should be a single line of code.
2. `vectorized_gradient_descent_step`: We have provided an implementation of `looped_gradient_descent_step`, which loops over the vector of weights $w$, updating each $w_i$ in turn according to the gradient descent update step:
$$w_i \leftarrow w_i - \alpha \frac{\partial f}{\partial w_i}$$
Implement a vectorized version of this function in `vectorized_gradient_descent_step`, which does not use a loop, but rather updates vectors directly:
$$\vec{w} \leftarrow \vec{w} -\alpha \nabla f$$
3. `test_gradient_descent_step`: Write a function that generates `num_trials` random vectors and gradients, both of size `w_size`, and then confirms that your `vectorized_gradient_descent_step` function returns *essentially* the same results as our `looped_gradient_descent_step` function on these random vectors.
:::spoiler **Hint**
Be sure *not* to check equality among `float`s using `==`. Instead, you can check that floats are within some tolerance of one another using the `isclose` function in Python's `math` library.
Similarly, `numpy` has an `isclose` function that you can use to check whether two `numpy` arrays are component-wise within some tolerance of one another. This function returns a `numpy` array of booleans.
For simplicity, you can use `numpy`'s `allclose` function, which returns a single boolean indicating whether two `numpy` arrays are component-wise close.
:::
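Here is a small illustration (with arbitrary values) of the difference between exact equality and these tolerance-based checks:
```
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = a + 1e-9                 # differs from a only by floating-point noise

print(a == b)                # [False False False]: exact equality fails
print(np.isclose(a, b))      # [ True  True  True]: component-wise tolerance check
print(np.allclose(a, b))     # True: a single boolean for the whole array
```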
:::warning
**Warning**
Do not attempt the next task until you have confirmed that your implementation of gradient descent functions equivalently to ours.
:::
4. `compare_gradient_descent_speed`: Design and implement a function to test the relative speed of `vectorized_gradient_descent_step` as compared to `looped_gradient_descent_step`. How much faster is your `numpy` version than our iterative version? Describe your test and your results in your README.
### Part 1b: Differentiation
In this part of the assignment, you will practice differentiating. You will first differentiate a given function by hand, and then you will differentiate it automatically. After differentiating it by hand, you will code up your derivative, so that you can confirm that your manual derivative matches Python's automatic one.
#### Part I: Analytic Differentiation
Consider the following function:
$$
f(x) = e^{x^2} \sin(x) + \frac{\ln(x+1)}{x^2 + 1}
$$
:::info
**Task**
Differentiate $f(x)$ with respect to $x$ by hand. Then code your derivative in the function `df_manual` in `diff.py`. This function should simply return the value of the derivative on an input `x`.
:::
:::spoiler **Hint:** Calculus review
Use the sum rule, the product rule, the quotient rule, etc.
:::
#### Part II: Automatic Differentiation
Next, you'll be computing this same derivative using the `autograd` library to automatically differentiate $f$. To do so, you must install `autograd` by running `pip install autograd`.
:::info
**Task**
Implement $f(x)$ as the function `f` in `diff.py` using the `autograd` library.
:::
:::spoiler **Hint:** `autograd.numpy`
Use the `autograd.numpy` `exp` function, the `autograd.numpy` `sin` function, the `autograd.numpy` `log` function, etc.
:::
:::info
**Task**
Use `autograd` to automatically differentiate `f` in the `df_autograd` function.
:::
:::spoiler **Hint:** `grad` function
Use the `grad` function in the `autograd` library.
:::
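The general `autograd` pattern looks like the following sketch, shown on a toy function (not the homework's $f$):
```
import autograd.numpy as np   # autograd's drop-in wrapper around numpy
from autograd import grad

def g(x):
    # A toy function used only to illustrate the pattern.
    return np.sin(x) * np.exp(x)

dg = grad(g)      # dg is a new function that evaluates g'(x)
print(dg(1.0))    # pass a float (e.g., 1.0), not an int
```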
:::info
**Task**
Calculate the derivative of $f$ for various values of $x$ (of your choosing), and confirm that `df_manual` and `df_autograd` yield nearly the same results.
:::
#### Part III: `matplotlib`
Finally, you'll practice plotting functions using `matplotlib`.
:::spoiler **Hint:** `matplotlib` library
- You can install `matplotlib` by running `pip install matplotlib`.
- By convention, you should then insert the following line of code at the top of your file:
```
from matplotlib import pyplot as plt
```
- You can then use `plt.plot` to access the `plot` function in the `pyplot` module.
- Feel free to reference the matplotlib [documentation](https://matplotlib.org/stable/users/index.html).
:::
:::info
**Task**
Write a function to plot $f(x)$ and its derivative $f'(x)$ over the interval $x \in [0.5, 5]$. Add a title and axis labels to your plot. Save your function in `diff.py`.
:::
:::spoiler **Hint:** `np.linspace` function
Use the `np.linspace` function to generate an evenly distributed set of $x$ values.
:::
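For reference, here is a minimal plotting sketch that uses stand-in functions (substitute your own `f` and `df`):
```
import numpy as np
from matplotlib import pyplot as plt

xs = np.linspace(0.5, 5, 200)                 # 200 evenly spaced x values

plt.plot(xs, np.sin(xs), label="function")    # replace with your f values
plt.plot(xs, np.cos(xs), label="derivative")  # replace with your f' values
plt.title("A function and its derivative")
plt.xlabel("x")
plt.ylabel("value")
plt.legend()
plt.show()
```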
### Part 1c: Gradient Descent
At this point, you have written a function to run a single step of gradient descent (i.e., `vectorized_gradient_descent_step`), and you know how to compute a gradient. So you can put it all together to optimize continuous functions using gradient descent!
Here is pseudocode for the gradient descent algorithm:
```
def gradient_descent(f: function, alpha: learning rate, epsilon: convergence threshold,
                     w: initial vector, max_steps: iteration cap): vector
    loop at most max_steps times
        w_new = w - alpha * grad(f)(w)
        if ||w_new - w|| < epsilon    # epsilon-convergence
            return w_new
        w = w_new
    return w
```
Observe that convergence is determined by a tolerance parameter $\epsilon$, which bounds the difference between the vector $\vec{w}$ at one iteration and the next. Additionally, a second parameter caps the number of iterations, so that gradient descent is guaranteed to terminate even if the convergence threshold is never reached (there is no guarantee that it will be).
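One possible way to implement the $\epsilon$-convergence test (an assumption on our part; other stopping criteria, such as the norm of the gradient, also work) is to compare successive iterates:
```
import numpy as np

def has_converged(old_w, new_w, epsilon):
    # Stop when successive iterates differ by less than epsilon (in Euclidean norm).
    return np.linalg.norm(new_w - old_w) < epsilon
```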
:::info
**Task**
Implement `gradient_descent` in `numpy_primer.py`.
:::
:::success
**Signature**
`gradient_descent` takes as input a `Callable` function `f`, an initial vector `initial_w`, a learning rate `alpha`, a tolerance `epsilon`, and a maximum number of steps `max_steps`. The `Callable` function `f` is the function to minimize, and should take as input a vector `w` and return a scalar.
:::
:::info
**Just for fun!**
Uncomment the indicated lines in `main` in the `numpy_primer` file, and then run it, and you will see gradient descent in action!
:::
:::info
**Task**
Implement `diy_function`. That is, create your own function using an `np` call that was not already used in this homework. Feel free to explore interesting functions, including functions with multiple minima!
:::
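For instance, here is one hypothetical sketch (the stencil determines the actual signature, and you should double-check that whatever `np` call you pick is not already used elsewhere in your solution):
```
import autograd.numpy as np   # use autograd's numpy so grad can differentiate it

def diy_function(w):
    # Hypothetical example: a bowl with ripples, so there are several local minima.
    return np.cosh(w[0] / 3.0) + 0.5 * np.cos(3.0 * w[0])
```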
:::info
**Task**
Call `animate_descent` on your `diy_function`. What happens? Try out three different values of `alpha`: one large-ish, one small, and one very small. Explain the behavior of `gradient_descent` on your function as you vary `alpha` in your README.
:::
## Part 2: Naive Bayes
In this part of the assignment, you will build a Naive Bayes classifier. Naive Bayes is a **probabilistic generative model**, meaning that once it is trained (i.e., once it has learned the relevant probabilities), it can be used to generate data according to those learned probabilities--much like large language models, such as ChatGPT.
You will learn about Naive Bayes by classifying email as either *ham* (i.e., not spam) or *spam*. In particular, the classifier you will build will learn a set of probabilities for ham emails, and another for spam emails. By Bayes' rule, these same probabilities can instead be used to compute the probability of ham and spam, given an email. By the **maximum *a posteriori* principle (MAP)**, such class probabilities can be interpreted as a binary classifier: simply label the email according to the class with the higher probability.
### Data
In this part of the assignment, you will be building two Naive Bayes classifiers, one by hand, and a second, in Python.
In Python, the data set you will be working with is the [SMS Spam Collection dataset](https://archive.ics.uci.edu/dataset/228/sms+spam+collection) from the UCI Machine Learning Repository. This dataset contains 5,574 SMS messages, each one labeled as ham or spam.
A key part of the implementation task will be to compute probabilities. These probabilities will be estimated via counting: e.g., after 1000 flips of a coin on which you see heads 490 times, you might estimate the probability of heads to be $0.49$. Similarly, the relevant probabilities in your Naive Bayes classifiers depend on the number of times various features (in this assignment, words) appear in ham and spam emails. For the manual part of this assignment, we present you with the relevant counts from the get-go; your only task will be to compute the requisite probabilities.
### Algorithm
Naive Bayes classifiers rely on Bayes' theorem to compute the **conditional** probability $P(A \mid B)$ of one random variable $A$, given another, $B$:
$$P(A \mid B) = \frac{P(B \mid A)P(A)}{P(B)}$$
In the language of Bayes' theorem, $P(A)$ is called the **prior** probability, $P(B \mid A)$ is the **likelihood function**, $P(B)$ is a normalization constant, and $P(A \mid B)$ is the **posterior** probability. The random variable $A$ represents the various class labels, in our domain ham and spam, while $B$ represents the **features** in the data, which are assumed to be causally related to the labels. For example, in our sample application domain, an email labelled as spam might be more likely than one labelled as ham to include the feature (i.e., the word) "money".
<!--
:::spoiler **Details**
In probablistic generative models, the normalization constant can be very expensive to compute, as it is the total probability of *all* features. Naive Bayes, however, like most classifiers based on probablistic generative models, computes the posterior probability of all labels given an assignment of feature values (i.e., the probability of spam, given an email, and the probability of ham, given that same email), and then chooses a label that maximizes the posterior probability. In other words, Naive Bayes chooses a label $a^*$ *s.t.*:
$$a^* \in \arg \max_{a \in A} P(A = a \mid B)$$
Consequently, it is not necessary to compute the normalization constant, as
$$
\begin{align}
& \arg \max_{a \in A} P(A = a \mid B) \\
&= \arg \max_{a \in A} \frac{P(B \mid A = a) P(A = a)}{P(B)} \\
&= \arg \max_{a \in A} P(B \mid A = a) P(A = a)
\end{align}
$$
:::
-->
In natural language (i.e., text) processing, a probabilistic generative model represents a probability distribution over documents (e.g., emails), where documents are sequences of features, usually **tokens**, or data units in text, such as words, subwords, or characters. In this assignment, we are assuming a **bag-of-words** model, which means that we do not consider documents as sequences, but rather make the simplifying assumption that they are *sets* of tokens (i.e., order does not matter). As a result, the probability distribution of interest in this assignment is a joint probability distribution over all tokens (e.g., words) in the vocabulary and all class labels (ham or spam).
As the number of tokens can be very large (e.g., a typical English speaker might use 20,000 words regularly, and recognize as many as 40,000), it would be infeasible to represent this joint distribution in its full generality--the number of parameters required would be exponential in 20,000, or perhaps 40,000. Instead, to build a Naive Bayes classifier, we make the simplifying (albeit incorrect) assumption that all tokens are *conditionally independent* of one another, given their class label. In other words, while the probabilities of "money" and "prize" are not assumed to be independent of each other, they are assumed independent given the class label `spam`. In this way, the number of parameters becomes on the order of the number of tokens, rather than exponential in the number of tokens: i.e., the probability of an email, given a class, is equal to the product of the probabilities of each token in the email, given that class. "Naive" Bayes is named for this naive (i.e., simplifying) assumption.
To build (i.e., train) a binary classifier based on a probabilistic generative model, it is necessary to estimate two probabilities per class: the prior probability of that class, and the likelihood function, which is the joint probability over feature values for that class. These probabilities are generally computed via **maximum likelihood estimation**, which at the end of the day often amounts to simple counting. For example, we can estimate the prior probability of spam as the number of spam emails divided by the total number of emails; and likewise, for ham. The likelihood function is similarly estimated by counting. As already noted, Naive Bayes makes the simplifying assumption that all feature probabilities are conditionally independent of one another, given a class, so that the probability of an email, given a class, is the product of the probabilities of each token in the email, given that class. The probability of a token given a class is estimated as the number of times that token appears in the emails with that label divided by the total number of tokens in all emails with that label.
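In symbols, one way to write the counting estimates just described is:
$$\hat{P}(\mathrm{spam}) = \frac{\#\,\text{spam emails}}{\#\,\text{emails}}, \qquad \hat{P}(\text{token} \mid \mathrm{spam}) = \frac{\#\,\text{occurrences of the token in spam emails}}{\#\,\text{tokens in spam emails}}$$
and likewise for ham.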
Wow! That was a lot of words! And we had the gall to use the word "simple" multiple times. Below you can find the mathematical formulas corresponding to this wordy description. They may be easier to parse:
:::warning
**Training**
- $P$ (class): The prior probability of each class, estimated based on the frequency of each class label in the training data.
- $P$(features $\mid$ class): The likelihood of a document, represented by its feature values, given the class. Once again, in a bag-of-words model, a document is represented as a set (not a sequence) of tokens (i.e., words).
**Classifying**
- $P$ (class $\mid$ features): The posterior probability of each class, given a document (as above, represented by its feature values) using Bayes' Theorem:
$$P (\mathrm{class} \mid \mathrm{features}) \propto P(\mathrm{features} \mid \mathrm{class}) \times P (\mathrm{class})$$
By the Naive Bayes independence assumption, this formula simplifies as follows:
$$P (\mathrm{class} \mid \mathrm{features}) \propto \prod_{i=1}^{n} P(\mathrm{feature}_i \mid \mathrm{class}) \times P (\mathrm{class})$$
Finally, by the maximum *a posteriori* principle (MAP), a datum is assigned the class label of a class with the highest probability.
:::
Presently, you will be implementing Naive Bayes, both by hand and in Python, to make the above description concrete--and, we hope, to convince you that Naive Bayes is indeed simple!
### Part 1: Written Exercises
The table below reports counts of the words `Dear`, `Friend`, `Lunch`, and `Money`, as they appear in a small dataset of `ham` and `spam` emails.
| | Dear | Friend | Lunch | Money | Total |
| ------------- | ------ | -------- | ------- | ------- | ----- |
| Ham | 8 | 5 | 3 | 1 | 17 |
| Spam | 2 | 1 | 0 | 4 | 7 |
| Total | 10 | 6 | 3 | 5 | 24 |
Let $A$ be the random variable denoting an email's class label, while $h$ is the event that an email is ham, and $s$ is the event that an email is spam.
1. Estimate the prior probability of $A$: i.e., compute $P(A = h)$ and $P(A = s)$.
:::spoiler **Hint:** Think *maximum likelihood estimation*!
Estimate these probabilities using counts.
:::
2. Estimate the conditional probabilities of the features `Dear`, `Friend`, `Lunch`, and `Money`, given the labels `ham` and `spam`.
<!--
1. Imagine that you just received an email that reads "Dear Friend". Use the Naive Bayes algorithm to classify the email as either `ham` or `spam`.
:::success
\begin{align}
& P (A = h \mid \mathrm{Dear Friend}) \\
&\propto P (\mathrm{Dear} \mid A = h) P (\mathrm{Friend} \mid A = h) P (A = h) \\
&\approx 0.47 \times 0.29 \times 0.71 \\
&= 0.097
\end{align}
\begin{align}
& P (A = s \mid \mathrm{Dear Friend}) \\
&\propto P (\mathrm{Dear} \mid A = s) P (\mathrm{Friend} \mid A = s) P (A = s) \\
&\approx 0.29 \times 0.14 \times 0.29 \\
&= 0.012
\end{align}
Since $P (A = h \mid \mathrm{Dear Friend}) > P (A = s \mid \mathrm{Dear Friend})$, "Dear Friend" is classified as `ham`!
:::
-->
3. Imagine that you just received an email that reads "Money Money Money Money". Use the Naive Bayes algorithm to classify this email as either `ham` or `spam`.
4. Now imagine that you just received an email that reads "Lunch Money Money Money Money". Classify this email as either `ham` or `spam` using Naive Bayes.
:::spoiler **Hint:** This problem is easy!
You need not perform any calculations to classify this email. Why not?
:::
Uh oh! Something went wrong! The standard fix to this problem is to initialize all counts to 1, so that probabilities are never 0.
:::spoiler **Laplacian smoothing**
This fix is called **Laplacian smoothing**, named for the 18th-century French mathematician Laplace, who used it to calculate the probability that the sun will rise tomorrow.
:::
5. Reclassify the email "Lunch Money Money Money Money" using Laplacian smoothing.
### Part 2: Implementation
:::info
**Task 1**
Implement `calculate_prior_probabilities`. That is, calculate the prior probability of each class.
:::
:::info
**Task 2**
Implement `calculate_likelihood`. That is, calculate the conditional probability of each feature value, given each class. Use Laplace smoothing.
:::
:::spoiler **Hint:** What is $\alpha$?
$\alpha$ is the Laplace smoothing parameter! By setting it to a value strictly greater than 0, you can ensure that your probabilities are never 0.
:::
<!--
$$P (\text{word} \mid \text{class}) = \frac{\text{Count of word in class} + \alpha}{\text{Total number of words in class} + \alpha \times \text{Number of unique words}}$$
-->
:::info
**Task 3**
Implement `predict` to predict the class (`ham` or `spam`), given an email. Base your predictions on the posterior probabilities of each class.
:::
:::warning
**Why am I getting underflow errors?**
Multiplying many small probabilities can produce values too small to represent as floating-point numbers. To overcome potential underflow errors, you can apply a logarithmic transformation to all the Naive Bayes formulas.
Recall that Naive Bayes makes its predictions using the MAP principle, which assigns the class label with the highest posterior probability. Because the logarithm is a monotonic function, assigning a class based on the highest *log* probability does not change this classification rule. Therefore, we can maximize *log* probabilities, rather than the probabilities themselves. Moreover, since Naive Bayes computes probabilities as a product, after transforming to log probabilities, we instead compute a sum:
\begin{align}
c^* &\in \arg \max_{c \in C} P (\text{features} \mid c)\, P(c) \\
&= \arg \max_{c \in C} \left[ \prod_{i=1}^n P (\text{feature}_i \mid c) \right] P(c) \\
&= \arg \max_{c \in C} \log \left( \left[ \prod_{i=1}^n P (\text{feature}_i \mid c) \right] P(c) \right) \\
&= \arg \max_{c \in C} \sum_{i=1}^n \log P (\text{feature}_i \mid c) + \log P(c)
\end{align}
:::
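To see why this matters, here is a tiny illustration (with made-up numbers): the product of many small probabilities underflows to zero, while the sum of their logs remains perfectly usable.
```
import numpy as np

probs = np.full(1000, 1e-5)      # 1000 hypothetical token probabilities

print(np.prod(probs))            # 0.0 -- the product underflows
print(np.sum(np.log(probs)))     # about -11512.9 -- the log-sum is fine
```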
<!-- Use logarithms in your code when computing class priors and likelihoods to:
- **Turn multiplication into addition**, thereby helping prevent numerical instabilities.
- **Ensure accuracy**, by avoiding rounding errors caused by too many significant digits.
- **Maintain stability** in probability calculations, especially with large datasets. -->
:::info
**Task 4**
Run `main` to train your model on the training set and evaluate its accuracy on the test set.
:::
Accuracy is a measure of how often a classifier's predicted labels match the true labels. A **confusion matrix** provides more insight into a binary classifier's performance than accuracy alone, by summarizing the types of errors a classifier makes. Specifically, a confusion matrix comprises true and false positives and negatives, defined as follows in the email domain:
| | **Predicted Spam** | **Predicted Ham** |
|------------|--------------------|-------------------|
| **Actual Spam** | TP | FN |
| **Actual Ham** | FP | TN |
- **True Positives (TP):** Spam correctly predicted as spam.
$$ \mathrm{TP} = \sum (\text{Predicted Spam} \land \text{Actual Spam}) $$
- **False Negatives (FN):** Spam incorrectly predicted as ham.
$$ \mathrm{FN} = \sum (\text{Predicted Ham} \land \text{Actual Spam}) $$
- **False Positives (FP):** Ham incorrectly predicted as spam.
$$ \mathrm{FP} = \sum (\text{Predicted Spam} \land \text{Actual Ham}) $$
- **True Negatives (TN):** Ham correctly predicted as ham.
$$ \mathrm{TN} = \sum (\text{Predicted Ham} \land \text{Actual Ham}) $$
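In `numpy`, these formulas translate almost directly into boolean masking and counting. The sketch below assumes (hypothetically) that the predicted and true labels are `numpy` arrays of the strings `"spam"` and `"ham"`; adapt it to whatever representation the stencil actually uses.
```
import numpy as np

def confusion_counts(predicted, actual):
    """Sketch only: count TP, FN, FP, and TN for string-labeled arrays."""
    tp = int(np.sum((predicted == "spam") & (actual == "spam")))
    fn = int(np.sum((predicted == "ham") & (actual == "spam")))
    fp = int(np.sum((predicted == "spam") & (actual == "ham")))
    tn = int(np.sum((predicted == "ham") & (actual == "ham")))
    return tp, fn, fp, tn
```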
:::info
**Task 5**
Write `compute_confusion_matrix` to calculate the true and false positives and negatives from the predicted and true labels.
:::
## Downloads
Please click [here](https://classroom.github.com/a/KutkqMRE) to download the assignment code.
### Support Code
* `SMSSpamCollection.tsv`: This is the SMS message data set, with labels `ham` and `spam`.
### Stencil Code
* `numpy_primer.py`: This is the file in which you will implement gradient descent and all related functions (Parts 1a and 1c).
* `diff.py`: This is where you can differentiate and plot the function $f(x)$ (Part 1b).
* `naive_bayes.py`: This is the file in which you will implement your Naive Bayes classifier.
## Submission
### Handin
Your handin should contain:
- all modified files, including comments describing the logic of your algorithmic modifications
- a README, containing a brief overview of your implementation
### Gradescope
Submit your assignment via Gradescope.
To submit through GitHub, follow these commands:
1. `git add -A`
2. `git commit -m "commit message"`
3. `git push`
Now, you are ready to upload your repo to Gradescope.
*Tip*: If you are having difficulties submitting through GitHub, you may submit by zipping up your hw folder.
### Rubric
| Component | Points | Notes |
|-------------------------|--------|-------------------------------------------------------------------------|
| `numpy` Exercises | 10 points | Points awarded for correctly implementing mean squared error, vectorized gradient descent, reporting the results of your speed comparison, and plotting a function using `matplotlib`. |
| Differentiation | 20 points | Points awarded for correct manual differentiation of $f(x)$, implementing $f(x)$ in Python using `autograd`, automatically differentiating $f(x)$ using `grad`, and plotting $f(x)$ and $f'(x)$ over the interval using `matplotlib`. |
| Gradient Descent | 20 points | Points awarded for correctly implementing gradient descent, and for correctly testing your implementation on your `diy_function`. |
| Naive Bayes Calculations | 10 points | Points awarded for correctly computing probabilities by hand, and then correctly classifying emails about "Lunch" and "Money". |
| Naive Bayes Implementation | 20 points | Points awarded for correctly computing prior probabilities, likelihood functions, and a Naive Bayes binary classifier. |
| Confusion Matrix | 10 points | Points awarded for correctly calculating true and false positives and negatives, and outputting a confusion matrix. |
| README | 10 points | Points awarded for a well-organized and easy-to-read README file. |
:::success
Congrats on submitting your homework; Steve is proud of you!!


:::
<!--
## Naive Bayes
To address the challenge of distinguishing between legitimate emails and spam, various techniques have been developed, the first of which was known as a Naive Bayes classifier. Naive Bayes is a probabilistic algorithm commonly used in machine learning for classification tasks, including spam detection. It leverages the probability of observing certain features (such as words or phrases) in different classes (e.g., normal messages versus spam) to make predictions about the class of new instances.
Naive Bayes is a simple probabilistic classifier based on Bayes' theorem with strong independence assumptions between the features. Here's a quick explanation of how it works using the provided equation for the case with multiple features (our case):
$$P (A | B_1, B_2, \ldots, B_n) = \frac{ \prod_{i=1}^n P (B_i|A) \cdot \mathrm{P}(A)}{ \prod_{i=1}^n P (B_i)} \propto \prod_{i=1}^n P (B_i|A) \cdot \mathrm{P}(A)$$
Naive Bayes predicts the class of a given data point by calculating the probabilities of each class given the features and selecting the class with the highest probability. Despite its simplicity and strong assumptions, Naive Bayes often performs well in practice, especially for text classification and spam filtering tasks.
-->