---
title: 'Studio 9: Classification'
tags: 'studio'
layout: 'post'
geometry: margin=2cm
---
# CS 100: Studio 9
### Classification
##### November 9, 2022
### Instructions
During today’s studio, you’ll be building several binary classifiers in R to classify the passengers on the Titanic as survivors or not. Specifically, you will be using the $k$-nearest neighbors algorithm, for various values of $k$, as well as decision trees.
Upon completion of all tasks, a TA will give you credit for today's studio. If you do not manage to complete all the assigned work during the studio period, do not worry. You can continue to work on this assignment until Sunday, November 13 at 7 PM. Come by TA hours any time before then to show us your completed work and get credit for today's studio.
### Objectives
To understand supervised learning algorithms, namely $k$-nearest neighbors and decision trees, and to use cross-validation to optimize the hyperparameter ($k$) of $k$-nearest neighbors.
### Libraries
In this assignment, you'll be relying on some special R libraries for building classifiers.
Open RStudio, and then run the following commands in the console to install the necessary packages.
```{r}
# Classification packages
install.packages("class") # kNN and other classifiers
install.packages("rpart") # Decision Trees
install.packages("rpart.plot") # Visualizing Decision Trees
install.packages("caret", dependencies = TRUE) # Cross-validation
install.packages("GGally") # An extension of ggplot2
```
If a dialog box pops up, click `No`.
Next, open a new R markdown file, and then enter and run the following setup code:
~~~
```{r setup, include = FALSE}
library(dplyr)
library(ggplot2)
library(rpart)
library(rpart.plot)
library(class)
library(caret)
library(GGally)
```
~~~
### Data
The [RMS Titanic](http://www.history.com/topics/titanic) was a luxury steamship that sank in 1912 off the coast of Newfoundland in the North Atlantic during its very first voyage. Of the 2,240 passengers and crew on board, more than 1,500 lost their lives. The people on board were of different ages, social classes, genders, and occupations. In this studio, you will be investigating the use of machine learning to predict whether a person would survive this disaster.
The [data](https://cs.brown.edu/courses/cs100/studios/data/9/titanic.csv) describe various features of the people who were aboard the Titanic, such as their name, sex, age, etc. This [file](https://cs.brown.edu/courses/cs100/studios/data/9/titanic.txt) contains a detailed description of all the features.
Additionally, the data contain a binary variable (i.e., a label) indicating whether or not each passenger survived the sinking of the ship; 1 means the passenger survived, and 0 means they did not. In this assignment, your goal is to build several classifiers to predict who survived the sinking.
Use the following line of code to read in the data. Insert it towards the end of your setup chunk.
```{r}
titanic <- read.csv('https://cs.brown.edu/courses/cs100/studios/data/9/titanic.csv')
```
Spend a few minutes getting a feel for the data set. Explore it with the usual functions (`str`, `summary`), and try out `glimpse` as well. How many passengers survived? *Hint:* You can use the `summary` function to answer this question: `summary(titanic$survived)`.
Check the types of the features. You can use the functions `as.numeric`, `as.character`, etc. to convert any features whose types are not as you expect them to be.
Note that the `class` library expects the class labels to be a factor, so you should be sure to convert `titanic$survived` as follows:
```{r}
# Convert the labels to a factor
titanic$survived <- as.factor(titanic$survived)
```
Finally, let's see if the data are clean. How many observations are incomplete? (*Hint:* Use `summary`.)
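Alternatively, here is a quick one-liner that counts the incomplete rows directly (a small sketch using base R's `complete.cases`):
```{r}
# Number of observations with at least one missing value
sum(!complete.cases(titanic))
```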
Some supervised learning algorithms (like $k$-NN) cannot handle missing data, so let’s also remove all the observations with missing values. (If we had time to conduct a more sophisticated analysis, we might try to fill in missing ages, for example, using regression.)
```{r}
titanic <- na.omit(titanic)
```
Explore the data again. How many passengers survived in this pruned data set? How many are female? How many are male?
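One quick way to answer these questions is to tabulate the relevant columns (a small sketch):
```{r}
# Counts of survivors/non-survivors and of each sex in the pruned data
table(titanic$survived)
table(titanic$sex)
```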
### Feature Selection
The first step in classification is to decide which features to include in your model. Spend some time discussing with your partner which features might best discriminate between passengers who survived and those who did not.
One tool you can use to guide your search for appropriate features is the scatter plot. For example, you can create a scatter plot depicting the relationship between any two numeric features (e.g., `age` vs. `fare`), and then color the points according to whether or not the passenger survived. You can do the same for one numeric and one categorical feature (e.g., `age` vs. `sex`).
```{r}
# Sample plots
attach(titanic)
qplot(age, fare, col = survived)
qplot(age, sex, col = survived)
```
When plotting two categorical variables, you can adjust the size of the points to reflect the number of observations that fall into the various categories. Alternatively, you can jitter the data to see the individual data points.
```{r}
# Sample plots
qplot(pclass, sex, col = survived, geom = "jitter")
qplot(pclass, age, col = survived, geom = "jitter", size = age)
qplot(pclass, age, col = survived, geom = "jitter", size = sibsp)
detach(titanic)
```
*Note*: `qplot` is a plotting function in the `ggplot2` library. Here is a link to some more details about how to use [`qplot`](http://ggplot2.org/book/qplot.pdf).
The GGally library (an extension of ggplot2) provides a function called `ggpairs`, which creates a matrix of scatterplots of one variable vs. another, including categorical variables, like `sex` and `survived`. Use it to create a scatterplot matrix for a few variables, like this:
```{r}
abridged <- titanic %>% select(age, fare, sex, survived)
ggpairs(abridged)
```
Now that you have explored the relationships among a few of the variables in the titanic data, choose two or three features for your analysis. You will use these features throughout this studio to build your classifiers. We will refer to these features as `feature_1`, `feature_2`, and `feature_3`.
*Hint*: If you assign these features to variables, say `feature_i`, and do your analysis on these variables, you will find it easier to redo your analysis later on a different choice of features, because all you will have to do then is reassign your feature variables to new features, and the rest of your code should run more or less as is.
```{r}
feature_1 <- titanic$age
feature_2 <- ...
feature_3 <- ...
```
### Model building
Now that we have selected our features, let's build some classifiers!
#### \\(k\\)-Nearest Neighbors
For starters, let’s use R's implementation of $k$-NN to classify passengers on the Titanic. As you learned in lecture, $k$-NN does not build a model, but instead classifies on the spot, the moment you call the function.
We'll perform $k$-NN classification using the `knn` function in the `class` package. This function takes as input both training data *and* testing data, as well as ground truth: i.e., labels corresponding to the training data. Unsurprisingly, `knn` also takes as input `k`, the number of neighbors. You can view all the available additional/optional parameters by searching for `knn` in the `Help` window, or by entering `?knn` in the console.
Create a training set `knn_train` using `select` to build a data frame that consists of only your preferred features: e.g., age and fare. *Important*: As $k$-nearest neighbors depends on a distance function, it works best on quantitative features, not categorical ones (though it does work for categorical data that are encoded quantitatively: e.g., `gender` as `0`, `1`, `2`, etc.). Be sure to “clean” your preferred features, so they are all numerical, not categorical.
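For example, here is one possible way to encode `sex` numerically, assuming it takes the values `"female"` and `"male"` (the variable name `sex_numeric` is just an illustrative choice):
```{r}
# Hypothetical numeric encoding of sex: 1 for "female", 0 for "male"
sex_numeric <- ifelse(titanic$sex == "female", 1, 0)
```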
```{r}
# Training data
knn_train <- data.frame(feature_1, feature_2)
```
For now, let’s use our training data as test data as well:
```{r}
# Test data
knn_test <- knn_train
```
Next, create a vector of labels. Note that this vector must be of type `factor`.
```{r}
# Training labels
knn_label <- titanic$survived
```
You are now ready to use `knn` to classify. Try `k = 3`:
```{r}
# k-NN
knn3_class <- knn(knn_train, knn_test, knn_label, k = 3)
```
The output of `knn` saved to `knn3_class` is a vector of predictions for all data in the test set.
A **confusion (or error) matrix** is a tabular visualization of classification error. The diagonal entries correspond to observations that are correctly classified, while the off-diagonal entries correspond to errors. In the case of binary classification there are two types of errors: a **false positive** is an observation that is incorrectly labelled by a classifier as 1 (e.g., survived), when it should be labelled as 0 (e.g., did not survive), while a **false negative** is an observation that is incorrectly labelled by a classifier as 0 (e.g., did not survive), when it should be labelled as 1 (e.g., survived).
Use the `table` function to see a summary of your model's classification error.
```{r}
table(knn_label, knn3_class)
```
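If you would like a single number summarizing performance, you can compare the predictions to the labels directly (a quick sketch):
```{r}
# Overall accuracy: the proportion of predictions that match the true labels
mean(knn3_class == knn_label)
```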
Repeat this exercise for `k = 1`, `k = 5`, and `k = 15`, creating models `knn1_class`, `knn5_class`, and `knn15_class`, respectively, and tabling their classification errors.
##### Scaling (Time Permitting)
The $k$-NN algorithm relies on a distance metric to find neighbors. The usual distance metric is [Euclidean distance](https://en.wikipedia.org/wiki/Euclidean_distance). The trouble with this (or any numeric) metric is that features with large numeric ranges dominate features with small ones. For example, age, which spans a much wider range of values, dominates travel class (i.e., first, second, or third), so classifications that take both into account are barely influenced by the latter. The data must be made comparable, which can be achieved either through normalization or standardization.
Write a `normalize` function that takes as input a vector `x`, and then subtracts the minimum value of `x` from all entries and divides by the difference between the maximum and the minimum values. Alternatively, write a `standardize` function that takes as input a vector `x`, and then subtracts the mean value of `x` from all entries and divides by the standard deviation. Preprocessing your features with either function before learning should improve your accuracies. Confirm that it does.
*Note:* You can read more [here](http://sebastianraschka.com/Articles/2014_about_feature_scaling.html) about whether to prefer standardization or normalization as described (so-called min-max scaling).
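If you would like a starting point, here is a minimal sketch of what such functions might look like (treat it as one possible implementation, not the required solution):
```{r}
# Min-max scaling: map x onto the interval [0, 1]
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Standardization: center x at 0 with standard deviation 1
standardize <- function(x) {
  (x - mean(x)) / sd(x)
}
```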
#### Decision Trees
Next, let's build some decision trees. To do so, we will use the `rpart` function in the `rpart` package. This function takes several key arguments: `rpart(formula, data, method)`, where
- `formula` is the formula to use for the tree in the form of `label ~ feature_1 + feature_2 + ...`.
- `data` specifies the data frame.
- `method` is either `class` for a classification tree or `anova` for regression. We'll be using `class` today.
As with `knn`, you can view all the available additional/optional parameters by searching for `rpart` in the `Help` window, or by entering `?rpart` in the console.
Begin by building a simple decision tree using two of the variables you selected.
```{r}
# Decision tree
tree_classifier <- rpart(survived ~ feature_1 + feature_2, data = titanic, method = 'class')
```
You can use the `print` function to view the resulting tree in a text-based format:
```{r}
print(tree_classifier)
```
You can also summarize the tree:
```{r}
summary(tree_classifier)
```
Additionally, the `rpart.plot` package includes a function that makes it very easy to visualize the decision trees it builds:
```{r}
rpart.plot(tree_classifier)
```
Add a new feature (`feature_3`), then build a new decision tree using three features, and visualize it:
```{r}
# Decision tree
new_tree <- rpart(survived ~ feature_1 + feature_2 + feature_3, data = titanic, method = 'class')
rpart.plot(new_tree)
```
In addition to the arguments already mentioned, `rpart` takes an optional argument called `control`, whose value is typically built with the helper function `rpart.control`; it controls various aspects of tree growth. Here are some of its options:
- `maxdepth` sets the maximum depth of a tree, meaning the maximum number of levels it can have
- `minsplit` sets the minimum number of observations that a node must contain for another split to be attempted
- `minbucket` sets the minimum number of observations that a terminal bucket (a leaf) must contain (i.e., further splits are not considered on features that would spawn children with too few observations)
To use these options, add a fourth argument to your calls to `rpart`:
```{r}
# Decision tree
pruned_tree <- rpart(survived ~ feature_1 + feature_2 + feature_3, data = titanic, method = 'class', control = rpart.control(maxdepth = 4, minsplit = 10))
rpart.plot(pruned_tree)
```
Experiment with various settings of these parameters, and visualize the ensuing trees. Do the visualizations offer you any insights into who survived the sinking of the Titanic?
Store at least two of your favorite trees as `tree_class1`, `tree_class2`, etc.
Generating predictions was trivial with `knn`, as its output is precisely a vector of predictions. Other classifiers require a bit more work to extract predictions.
R provides a generic function `predict` that outputs predictions from a fitted classifier ($k$-NN is the exception, since `knn` outputs its predictions immediately). The `predict` function takes as input a classifier, such as a decision tree, a set of data to make predictions on, and what `type` of output we would like. Every classifier has a different `type` that it can output, but all of them are at least able to output classes. You can use `help(classifier)` (e.g., `help(rpart)`) or `help(predict.rpart)` to find what types a classifier can output.
Run `predict` with each of your classifiers using `titanic` as your data source.
```{r}
tree1_predictions <- predict(tree_class1, titanic, type = 'class')
tree2_predictions <- predict(tree_class2, titanic, type = 'class')
# Remember, k-NN works a bit differently ...
knn3_predictions <- knn3_class
knn5_predictions <- knn5_class
```
As you did earlier for your `knn` predictions, `table` these prediction vectors: e.g.,
```{r}
table(titanic$survived, tree1_predictions)
table(titanic$survived, tree2_predictions)
```
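Accuracy can also be read directly off a confusion matrix, since the correct predictions lie on its diagonal (a quick sketch, using the first tree as an example):
```{r}
# Proportion of observations on the diagonal of the confusion matrix
tree1_table <- table(titanic$survived, tree1_predictions)
sum(diag(tree1_table)) / sum(tree1_table)
```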
What do you notice about the accuracy of your classifiers?
### Evaluating model accuracy
By now, you have built several classifiers. Congratulations! You have also evaluated their so-called **training error**: i.e., the error on the training (or in-sample) data. But what we are really interested in is how well they will perform "in the wild": i.e., on unseen (so-called out-of-sample) data. To try to get a handle on this question, you should evaluate your classifiers on **test data** instead. Where do we get such test data? By *holding out*, or setting aside, some of the training data for testing purposes.
#### Test Sets
Before partitioning your data into training and test sets, it's best to first shuffle them, to avoid any potential biases in their ordering. The `dplyr` library provides the `sample_n` function that allows you to sample `n` rows from a data frame. Conveniently, you can also use this function to shuffle your data. In particular, if you run `sample_n(mydata, nrow(mydata))`, you will select all rows from a data frame in random order. Do that now on `titanic`.
```{r}
shuffled <- sample_n(titanic, nrow(titanic))
```
Next, let's partition `shuffled` into training and test sets. What’s the best way to go about this? A good rule of thumb is to set aside 20% of the data for testing, and to use the remaining 80% for training. Using this rule of thumb, let's calculate our split sizes:
```{r}
# Use 80% of the rows for training; round down to a whole number
split <- floor(0.8 * nrow(shuffled))
```
We can now partition into training and testing sets:
```{r}
training <- shuffled[1:split, ]
test <- shuffled[(split + 1):nrow(shuffled), ]
```
*Note:* A simpler way of splitting the data into training and test data is to use the `createDataPartition` function in the `caret` package. We walked you through the process manually just now so that you would understand what is going on behind the scenes, but feel free to use `createDataPartition` going forward. Here is an example; go to R’s Help window for explanation of the parameters.
```{r}
split <- 0.80
index <- createDataPartition(knn_label, p = split, list = FALSE)
training <- titanic[index, ]
test <- titanic[-index, ]
```
Now that you have partitioned the data (possibly twice!) into training and test sets, you can retrain all of your classifiers on `training` and test their accuracy on `test`. For example, to build a new $k$-NN classifier, you should choose your features from the training and test sets and then proceed as follows (after defining `feature_1_training`, `feature_1_test`, etc., of course!):
```{r}
knn_train <- data.frame(feature_1_training, feature_2_training, feature_3_training)
knn_test <- data.frame(feature_1_test, feature_2_test, feature_3_test)
knn_label <- training$survived
```
*Hint:* For best results, be sure to normalize or standardize your data!
Build a few $k$-NN classifiers on this training and test set for a few values of $k$, and then `table` the results of their predictions, as you did earlier. Be sure to compare their predictions to `test$survived`!
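For instance, for `k = 5`, reusing the objects defined in the previous chunk (a quick sketch):
```{r}
# Train on the training features, predict on the test features, and compare to the test labels
knn5_class <- knn(knn_train, knn_test, knn_label, k = 5)
table(test$survived, knn5_class)
```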
Time permitting, do the same for a few decision trees.
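For example, a tree trained only on `training` and evaluated on `test` might look like this (the formula `survived ~ age + fare + sex` is just an illustrative choice of features):
```{r}
# Fit a tree on the training set, then generate predictions for the held-out test set
tree_holdout <- rpart(survived ~ age + fare + sex, data = training, method = 'class')
holdout_predictions <- predict(tree_holdout, test, type = 'class')
table(test$survived, holdout_predictions)
```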
Note the accuracies of your models now. Are they higher or lower than before? Do these new accuracies cause you to re-assess how well your various models work? Which classifiers experience the greatest decrease in accuracy?
#### Cross-validation (Time Permitting)
Beyond partitioning our data *once* into training and test sets, on which we might happen to see unusually high or unusually low accuracies, we can better estimate the accuracy of our models by partitioning our data into many training and test sets. This process, whereby we carry out the above procedure (train on training data; test on testing data) multiple times for multiple partitions of the data, and then average the accuracy across all partitions, is called **cross-validation**.
Next, we will estimate the accuracies of our classifiers using cross-validation.
For the $k$-NN models, we can use the `knn.cv` function, which performs leave-one-out cross-validation. For example:
```{r}
knn3_cv <- knn.cv(knn_train, knn_label, k = 3)
# Proportion of cross-validated predictions that match the true labels
mean(knn3_cv == knn_label)
```
Repeat this for `k = 1`, `k = 5`, and `k = 15`. Which value of $k$ is most accurate? Try a few more values of $k$. Around what value of $k$ does accuracy peak?
Talk with your partner about how the accuracy of your models has changed compared to using only a single test set. Was there a large change? Does this make intuitive sense? What does this make you think about the usefulness of cross-validation?
#### `caret` and `train`
Before you complete this studio, we want to draw your attention to a very powerful function within the `caret` library, namely `train`. Enter `?train` into the console to read a little bit about `train`.
The `train` function is an interface for calling all sorts of learning algorithms. That is, we need not call `lm` or `knn` specifically. Instead, we can call `train`, and pass it our preferred learning algorithm as an argument. For example:
```{r}
knn_model <- train(survived ~ age + fare + sex, data = training, method = "knn",
                   trControl = trainControl(method = "cv"),
                   preProcess = c("center", "scale"), tuneLength = 20)
```
This call to `train` invokes `knn` on the training data using the features specified. Moreover, it carries out cross-validation (`trainControl(method = "cv")`), and it preprocesses the data by standardizing it (i.e., "center" subtracts the mean of a predictor from the predictor’s values, and "scale" divides the results by the standard deviation). The parameter `tuneLength` specifies how many values of the tuning parameter (in this case, $k$) to try.
The results of the `knn_model` are stored in `knn_model$results`. Enter `knn_model$results` to see them. With these results in hand, you can plot, for example, the standard deviation of the accuracy vs. value of $k$, as follows:
```{r}
# Column 1 of the results is k; column 4 is the standard deviation of the accuracy
qplot(knn_model$results[1][[1]], knn_model$results[4][[1]])
```
Do you observe a tradeoff between bias and variance? Where is the bias greatest, and where is the variance greatest? If the graph is not quite what you expected, retrain your `knn_model` a few times and replot the results to look for a pattern.
*Note:* Time permitting (e.g., after studio), you might try writing a loop that iterates over (odd) values of $k$ in the range of, say, 1 to 25, and plots their accuracies. You can use this plot to determine the best value of $k$.
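Here is one way such a loop might look (a sketch that assumes `knn_train` and `knn_label` are defined as above):
```{r}
# Leave-one-out accuracy for odd values of k from 1 to 25
ks <- seq(1, 25, by = 2)
accuracies <- sapply(ks, function(k) mean(knn.cv(knn_train, knn_label, k = k) == knn_label))
qplot(ks, accuracies, geom = c("point", "line"))
```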
### Analysis Questions
Discuss the following questions with your partner, and jot down your answers (in your R markdown file, if you like!) so that you can have a brief discussion about them with one of the TAs.
- Which of the models best fit the data?
- Which variables seem to have the most influence? (If you don’t know the answer to this question offhand, just discuss with your partner how you might find out.)
- Should we ever evaluate classifier performance on the same data it was trained on? What if we don't have very much data? What might we do instead?
- How does the Titanic sinking compare to modern natural disasters such as Hurricane Katrina? Briefly look at [this study](http://www.dhh.louisiana.gov/assets/docs/katrina/deceasedreports/KatrinaDeaths_082008.pdf) published by the American Medical Association on fatality rates across different demographics during Hurricane Katrina. Which groups suffered the highest fatality rates?
[Here](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.537.3677&rep=rep1&type=pdf) is another paper on Hurricane Katrina from the Journal of Social Science Research. The data analysis section starts on page 302.
- Are there similarities between the factors that affected Titanic fatality rates and Hurricane Katrina fatality rates? What factors are different between the two disasters?
### End of Studio
When you are done, please call over a TA to review your work and check you off for this studio. If you do not finish within the two-hour studio period, remember to come to TA office hours to get checked off.