--- title: "Studio 10: Clustering" layout: post tags: studio geometry: margin=2cm --- # CS 100: Studio 10 ### Clustering ##### November 16, 2022 ### Instructions In this week’s studio, you will be clustering colleges using the [College Scorecard Data](https://collegescorecard.ed.gov/data/), first collected under President Obama. Upon completion of all tasks, a TA will give you credit for today's studio. If you do not manage to complete all the assigned work during the studio period, do not worry. You can continue to work on this assignment until Sunday, November 20 at 7 PM. Come by TA hours any time before then to show us your completed work and get credit for today's studio. ### Objectives By the end of this studio, you will know: - How to identify clusterings in data - How to group similar items By the end of this studio, you will be able to: - Apply $k$-means clustering to a dataset - Evaluate, and ultimately select, models to use for analysis ### Data The [College Scorecard Data](https://collegescorecard.ed.gov/data/) are collected by the U.S. Department of Education, and are intended to offer insights into undergraduate degree-granting institutions. We’ve extracted data about 35 select colleges, which roughly speaking, can be considered peers of Brown. The CSV file, [colleges-35.csv](https://docs.google.com/spreadsheets/d/1wxvwddLkvbrbDxTtaIHOEmo4HF6hozN4BhrHyctYFRQ/edit?usp=sharing)<!--[colleges-35.csv](http://cs.brown.edu/courses/cs100/studios/data/10/colleges-35.csv)-->, is named using a convention that is standard within the data science community. The name references the original data, and the number denotes the sample size. To begin, load the CSV file into Google Sheets. You’ll probably notice immediately that the column headers don’t make much sense. This is often the case with publicly available data—usually, for the sake of brevity, column headers are abbreviated. [Here](https://collegescorecard.ed.gov/assets/CollegeScorecardDataDictionary.xlsx) is the documentation for the College Scorecard data set. Open this file, and go to the `data_dictionary tab`, which explains what the headers mean, and how the information is stored (as an integer, a string, or something else). Figure out the meaning of all the features in the small subset of the College Scorecard Data we extracted for this studio. Feel free to ask the TAs for clarification as necessary. Spend some time exploring the data we provided. For example, you might sort the schools by SAT score, or by median income ten years after graduation. ### Intuition Your goal in this studio is to cluster these data; that is, to cluster universities according to their features. For example, "4 year" vs. "2 year" might be an obvious dimension along which to cluster colleges. Your first task is to cluster the colleges manually into three clusters, say, 1, 2, and 3. (Note that the labels 1, 2, and 3 are not meant to be meaningful; you could just as well label the schools "red", "green", and "blue".) Use what you and your partner know about these schools to cluster the data according to any metric you like, so that colleges are more similar to colleges in their cluster (from your subjective point of view) than to colleges in the others. Create a new variable/column in the spreadsheet, called `Cluster`, and save your clustering in a new file called `college-35-clustered.csv`. ### $k$-Means: Theory One of the simplest, and most common, clustering algorithms is called $k$-means. 
### $k$-Means: Practice

Now load the College Scorecard data set we provided into an R data frame. You should set the college names as the row names, because $k$-means only works on numeric variables. You should also read in `NULL` values as `NA`s. To accomplish each of these things, respectively, provide the `read.csv` function with the arguments `row.names = 1` and `na.strings = "NULL"`.

Familiarize yourself with the $k$-means function in R, `kmeans`, by running `help(kmeans)`. The function takes as input two principal arguments: a data frame `x` of observations to be clustered (like colleges!), and a number of clusters, or `centers`.

### Feature Selection via Visualization

To start, try running `kmeans` on pairs of features, with `k = 3` (or similar), to see if you can produce a clustering that aligns well with your own intuitive clustering. Your code should look something like this, with `variable1` and `variable2` replaced by your own choices:

```{r}
colleges_subset <- colleges %>% select(variable1, variable2)
my_clustering <- kmeans(colleges_subset, centers = 3)
```

To visualize your clustering in these two dimensions, run the R commands below, again with `variable1` and `variable2` replaced by your own choices:

```{r}
my_clustering$cluster <- as.factor(my_clustering$cluster)
ggplot(colleges, aes(variable1, variable2, color = my_clustering$cluster)) +
  geom_point()
```

What do you notice about how these two variables affect the clusters?

Repeat this exercise a few times with different pairs of variables. When $k = 3$, do the results resemble your manual clustering? Perhaps not. If not, or even if so, what necessary step should you take to improve your $k$-means clusterings? Take this necessary step, and repeat your analysis. Record the pairs of variables for which you observed the $k$-means algorithm clustering these data well.

### Feature Selection via Error Minimization

A $k$-means model consists of several components. The `cluster` variable associates each observation (i.e., college) with a cluster. The `centers` variable stores the center (*a.k.a.* the centroid) of each cluster, and `size`, the number of observations in each cluster.

The variables with `ss` in their names are mostly measures of error, computed in terms of the (squared) distances between data points. Assuming an $m$-dimensional data set $D$ of size $n$, we can define the (squared) distance between two points $p, q \in D$ as follows:

$\displaystyle d(p, q) = \sum_{j = 1}^{m} (p_j - q_j)^2$

The `totss` variable stands for “total sum of squares”, meaning the sum of the distances between all pairs of points, scaled by the sample size: i.e.,

$\displaystyle \frac{1}{n} \sum_{\{p, q\} \subseteq D} d(p, q)$

Equivalently, `totss` is the sum of the distances from each point to the grand centroid $\bar{\mu}$ of the entire data set: $\sum_{p \in D} d(p, \bar{\mu})$.

The `withinss` variable is an intra-cluster measure. It is a vector of length $k$, which for each cluster stores the sum of all the distances from each point in that cluster to its centroid:

$\displaystyle \sum_{p \in C_i} d(p, \mu_i)$

Here, $C_i$ is the $i^{th}$ cluster and $\mu_i$ is the centroid of the $i^{th}$ cluster. The value of `tot.withinss` is `sum(withinss)`, where the sum is across all clusters. The $k$-means algorithm seeks a clustering that minimizes `tot.withinss`.

The `betweenss` variable is an inter-cluster measure. It is the difference between `totss` and `tot.withinss`, which amounts to a weighted sum of the distances from each cluster's centroid to the grand centroid: i.e.,

$\displaystyle \sum_{i = 1}^{k} \left| C_i \right| \, d(\mu_i, \bar{\mu})$

As usual, you can access any of these components using `$`. For example, to see which college is in which cluster, use:

```
sort(my_clustering$cluster)
```

Review the clusterings you constructed for different pairs of variables, this time from the point of view of error minimization. Which of these clusterings seems best from this alternative point of view? Is it the same one that you deemed best based on visual inspection? Does it resemble your manual clustering?

### Model Selection

Although it can make the results more difficult to interpret, in principle $k$-means can cluster using all features, rather than just a select few: e.g.,

```{r}
k3 <- kmeans(x = scale(colleges), centers = 3)
```

Write a `for` loop that loops over this line of code, trying out every value of $k$ between 1 and 19, and storing the error associated with each model (`tot.withinss`). Then, plot the errors as a function of $k$.
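If you get stuck structuring the loop, here is one possible sketch. It assumes your data frame is named `colleges` and contains only numeric columns with no `NA`s (drop or impute missing values first); the variable name `errors` is our own choice.

```{r}
set.seed(100)  # kmeans starts from random centroids, so fix the seed
errors <- numeric(19)
for (k in 1:19) {
  errors[k] <- kmeans(x = scale(colleges), centers = k)$tot.withinss
}
plot(1:19, errors, type = "b", xlab = "k", ylab = "tot.withinss")
```

Because $k$-means is sensitive to its random initialization, you might also pass `kmeans` the `nstart` argument (e.g., `nstart = 25`), which keeps the best of several random starts.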
What do you observe about the error as $k$ increases from 1 to 19? Why might the minimum-error-producing $k$ not be the most desirable $k$?

You can visualize a clustering computed in many dimensions using the `fviz_cluster` function, which plots a clustering in its two principal dimensions. This function is part of the `factoextra` package. Install this package, and then load the library. Now, to use `fviz_cluster`, pass it a clustering and the data: e.g.,

```{r}
fviz_cluster(my_clustering, colleges)
```

The `fviz_cluster` function produces a `ggplot` object, so you can add titles, etc., as usual: e.g.,

```{r}
fviz_cluster(my_clustering, colleges) + ggtitle("My Clustering")
```

Visualize your minimum-error clustering. Are you satisfied with what you have discovered? If not, feel free to tweak things (e.g., $k$ or the variables you selected) until you are satisfied.

*Hint:* The `fviz_nbclust` function plots the `tot.withinss` error as a function of $k$ (without the need to execute your `for` loop). Feel free to use this function to speed up any explorations you undertake with alternative variables.

```{r}
fviz_nbclust(colleges, kmeans, method = "wss", k.max = 19)
```

How does your final clustering compare to your initial (manual) clustering? Did the $k$-means algorithm help you discover additional structure beyond that which you were able to identify using only your intuition? Is your final clustering still intuitive?

### End of Studio

When you are done, please call over a TA to review your work and check you off for this studio. If you do not finish within the two-hour studio period, remember to come to TA office hours to get checked off.