--- title: "Homework 3: Part 2" label: homework layout: post geometry: margin=2cm tags: homework --- # CS 100 Homework #3, Part II ##### Due: December 5, 2022 at 10 pm ##### **N.B.** This assignment is optional. If completed, this grade may be substituted for one of Homework's 0, 1, 2, or 3 Part I. ### Instructions This assignment is extra credit. It will replace your lowest grade among the first four homeworks (if this grade is higher than that grade). Please submit to Gradescope your R Markdown (.Rmd) file. Please also knit your R markdown file, and submit the resulting PDF file as well. Be sure to follow the CS100 course collaboration policy as you work on this and all CS100 assignments. ### Overview The topic of this homework assignment is supervised learning. The first part is concerned with linear regression, and this, the second part, classification. ### Classification In class, you learned (or will be learning) about four different classification methods: $k$-NN, decision trees, naive Bayes, and logistic regression. Likewise, in studio you (will) have explored some of these algorithms in the context of *binary* classification, where data are classified in only one of two ways (e.g., Clinton supporters, or Trump supporters---in the 2020 presidential election). In this assignment, you're going to explore *multiclass* classification, where more than two classes may apply (e.g., in addition to the above two classes, Jill Stein supporters, or Gary Johnson supporters---again, in the 2020 presidential election). All the binary classification algorithms you learned about except logistic regression generalize to multiple classes. Thus, in this assignment, you will be building classifiers using $k$-NN, decision trees, and naive Bayes. #### Libraries Before you can begin working on this assignment, you must install, and then load, the necessary libraries: ```{r} library(class) # k-NN library(rpart) # Decision trees library(klaR) # Naive Bayes library(caret) # Cross-validation ``` This assignment is relatively free form compared to other CS 100 assignments, in the sense that there are no exact answers. We’ll tell you what accuracy ranges we expect your classifiers to achieve so that you can gauge your performance. If you can achieve those accuracies, you will have convinced us that you possess the necessary skills to perform basic classification tasks. At the end, you'll also be asked a few questions about both the data and the classification methods. ### Data The data for this assignment comprise nutritional measurements of several (260, to be exact) items on the McDonald's menu, such as an Egg McMuffin and a Big Mac. The variables include serving size, calories, sodium content, and various other features (24 in total) of McDonald’s menu items. #### Features 1. Category: Breakfast, Beef & Pork, Chicken & Fish, Salads, Snacks & Sides, Desserts, Beverages, Coffee & Tea, and Smoothies & Shakes. 2. Item: Name of the item. 3. Serving Size: Amount of food in one serving. Solid foods are described by grams, while liquids are described by milliliters. 4. Calories 5. Calories from Fat 6. Total Fat 7. Total Fat (% Daily Value) 8. Saturated Fat 9. Saturated Fat (% Daily Value) 10. Trans Fat 11. Cholesterol 12. Cholesterol (% Daily Value) 13. Sodium 14. Sodium (% Daily Value) 15. Carbohydrates 16. Carbohydrates (% Daily Value) 17. Dietary Fiber 18. Dietary Fiber (% Daily Value) 19. Sugars 20. Protein 21. Vitamin A (% Daily Value) 22. Vitamin C (% Daily Value) 23. Calcium (% Daily Value) 24. 
### Accuracy

Spoiler alert: A naive Bayes or decision tree model that predicts `Category` using all the features in this data set can achieve greater than 90% accuracy. Regardless, we're going to work on building models with fewer features, so that they are easier to interpret. We'll begin with a single-feature model, and work our way up to four.

We also want you to build a model with the exact number of features specified (1, 2, 3, or 4). So if you happen to achieve the next section's accuracy target with fewer features (e.g., hitting the three-feature milestone with a two-feature model), you should still add at least one new feature.

### Grading

You will be graded on two things: the accuracy of your models and your writeup. For each problem, we will specify two accuracy milestones, one that will yield you 3/4 credit, and a second for full credit. You will not get *any* credit for accuracy below what we specify. However, only one of your models has to achieve this accuracy, not all of them. That said, the rest of your models will probably achieve this same accuracy $\pm$ 10%.

For each model, you should explain which features you chose and why.

*Note:* To explain "why" will probably require that you produce descriptive statistics (including visualizations) of a feature's relationship to `Category`, as in the sketch below.
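For example, a side-by-side boxplot is one quick way to see how a numeric feature varies across categories. Here, `Calories` is purely an illustrative choice (and, as above, an assumed column name); substitute whichever features you are actually considering:

```{r}
# How does Calories vary across menu categories?
boxplot(Calories ~ Category, data = mcdonalds,
        las = 2,  # Rotate the axis labels so the category names are legible
        main = "Calories by Category")
```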
### Training and Test Data

We have divided the data into training and test sets. You can find the training data [here](https://cs.brown.edu/courses/cs100/homeworks/data/3/mcdonalds_train.csv), and the test data [here](https://cs.brown.edu/courses/cs100/homeworks/data/3/mcdonalds_test.csv).

Remember to evaluate accuracy on the *test* data, not the training data. Otherwise, you will think you are getting a much higher accuracy than you really are. Analogously, you will think you are getting a much higher grade than you really are.

### Learning Algorithms

Many learning algorithms have what are called **hyperparameters**: parameters that configure the learning algorithm itself, rather than being learned from the data. For example, the $k$ in $k$-NN is a hyperparameter. Likewise, the depth of a decision tree is a hyperparameter. (Naive Bayes does not have any hyperparameters, beyond the choice of variables.)

The `rpart` library allows you to control many hyperparameters of a decision tree, including:

- `maxdepth`: the maximum depth of a tree, meaning the maximum number of levels it can have
- `minsplit`: the minimum number of observations that a node must contain for another split to be attempted
- `minbucket`: the minimum number of observations that a terminal bucket (a leaf) must contain (i.e., further splits are not considered on features that would spawn children with too few observations)

To control these hyperparameters, you should call the `rpart` function with all the usual suspects (described below, for completeness), as well as a fourth argument called `control`:

```{r}
# Decision tree
library(rpart.plot)  # Install the rpart.plot package to visualize your decision tree models

tree <- rpart(y ~ x, data = my_data, method = 'class')
rpart.plot(tree)

pruned_tree <- rpart(y ~ x, data = my_data, method = 'class',
                     control = rpart.control(maxdepth = 4, minsplit = 10))
rpart.plot(pruned_tree)
```

The "usual suspects" (i.e., the basic arguments to `rpart`) are:

- `formula` is the formula to use for the tree, in the form `label ~ feature_1 + feature_2 + ...`
- `data` specifies the data frame
- `method` is either `class` for a classification tree or `anova` for a regression tree

### Model Building

For each section (i.e., number of features), you are tasked with building three models: $k$-NN, decision trees, and naive Bayes.

To measure prediction accuracy for $k$-NN, you should proceed more or less as you did in studio.

```{r}
# For k-NN
my_predictions <- knn( ... )
mean(my_predictions == mcdonalds_test$Category)  # Testing accuracy
```

For decision trees and naive Bayes, you should run the `predict` function on your model and your test data, as shown below. Then you can check whether the predictions output by your model match the `Category` values in the test data using:

```{r}
# For decision trees
my_dt <- rpart( ... )
dt_predictions <- predict(my_dt, mcdonalds_test, type = "class")
mean(dt_predictions == mcdonalds_test$Category)  # Testing accuracy

# For naive Bayes
my_nb <- NaiveBayes( ... )
nb_predictions <- predict(my_nb, mcdonalds_test)$class
mean(nb_predictions == mcdonalds_test$Category)  # Testing accuracy
```

Feel free to make use of an accuracy function to save on typing, and at the same time limit opportunities for bugs!

```{r}
accuracy <- function(x) mean(x == mcdonalds_test$Category)
```

#### 1-feature model

Start off simply, building one-feature models. Generate descriptive statistics to compare the relationships between various features of interest and `Category`. Use your intuition to guide your search. The percent of daily recommended intake of vitamin A may not significantly impact an item's category. On the other hand, the serving size or the amount of saturated fat may have a significant impact.

For 3/4 credit, achieve 61% accuracy; for full credit, achieve 67% or higher.

#### 2-feature model

Building off of your one-feature model, add a new feature. For some models, it may actually be better to use a new combination of features, but you should still be able to add just one new feature and increase your accuracy.

For 3/4 credit, achieve 63% accuracy; for full credit, achieve 69% or higher.

#### 3-feature model

Next, building off your two-feature model or not (i.e., feel free to start over from scratch), build a three-feature model.

For 3/4 credit, achieve 67% accuracy; for full credit, achieve 71% or higher.

#### 4-feature model

Finish up by building a model using four features.

For 3/4 credit, achieve 69% accuracy; for full credit, achieve 73% or higher.

#### Accuracy and explanatory power

At this point, we'll turn you loose (temporarily). Build the best model you can that trades off accuracy for explanatory power. That is, using a reasonable number of features (possibly more than four, but definitely fewer than 24), build a model that achieves relatively high accuracy, but at the same time is easy to explain. Report the accuracy of your model, and explain how/why it works.

#### Cross-validation

One shortcoming of the above analysis of your various models is that you evaluated them on only one partition of the data into training and test sets. We asked you to do this only so that we could standardize grading. In reality, however, it is good practice to cross-validate your models on multiple partitions of the data.

The final step in this homework is to complete this cross-validation. You can do so using the `train` function, just as you did in studio, with the `method` argument set equal to `"knn"`, `"rpart"`, or `"nb"`. Do you expect your accuracies to go up or down, and why?

*Note:* To test the accuracy of predictions using a model built by the `train` function, the `type` argument to the `predict` function should be set equal to `"raw"`, not `"class"`.
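To make this concrete, here is a hypothetical sketch that cross-validates a decision tree with `caret`'s `train` function. It assumes you have loaded the training set as `mcdonalds_train`, and that `Category` has been encoded as a factor, as in the earlier sketch; the two-feature formula (`Calories + Sodium` are assumed column names) and the 10-fold setup are illustrative choices, not required ones:

```{r}
# Strip incomplete cases so that the number of predictions matches
# the number of test labels
train_complete <- mcdonalds_train[complete.cases(mcdonalds_train), ]
test_complete <- mcdonalds_test[complete.cases(mcdonalds_test), ]

# 10-fold cross-validation of a two-feature decision tree
cv_model <- train(Category ~ Calories + Sodium,
                  data = train_complete,
                  method = "rpart",
                  trControl = trainControl(method = "cv", number = 10))
print(cv_model)  # Reports the cross-validated accuracy

# For models built by train, predict takes type = "raw", not type = "class"
cv_predictions <- predict(cv_model, test_complete, type = "raw")
mean(cv_predictions == test_complete$Category)  # Testing accuracy
```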
#### Follow Up

Now that you've finished classifying the category of McDonald's food items, let's talk about the classifiers. For each algorithm ($k$-NN, decision trees, and naive Bayes), list their pros and cons, and provide examples of where and when you would recommend using each of them.

(Feel free to surf the web for help with this discussion question. But please cite all your sources.)