Homework 4 - HackMD

--- title: 'Homework 4' label: homework layout: post geometry: margin=2cm tags: homework --- # CS 100 Homework #4 ### Movie Recommendations ##### Due: December 2, 2022 at 10 pm ### Instructions Please submit to Gradescope your R Markdown (.Rmd) file. Please also knit your R markdown file, and submit the resulting PDF file as well. Be sure to follow the CS100 course collaboration policy as you work on this and all CS100 assignments. ### Objectives To practice clustering with a variety of similarity metrics, and to learn about content and collaborative filtering. ### Overview This assignment has three parts. You must do the first part, but then you can choose to do either the second or the third (or both, if you'd like extra credit). The first part is merely data wrangling, always a necessity. In the second part, you’ll explore movie data by clustering them in various ways. In the third part, you’ll use a technique called [collaborative filtering](https://en.wikipedia.org/wiki/Collaborative_filtering) to recommend movies to users (including yourself!), similar to how Netflix makes movie recommendations. More specifically, in the second part, you will be clustering movies based on their content; you won’t be making any recommendations. You should note, however, that if you can cluster movies effectively this way, then you could recommend new movies to users with content that is similar to other movies they are known to have liked. In the third part, you’ll use ollaborative filtering to go one step farther, and recommend movies to users. Collaborative filtering lies at the heart of Netflix's movie recommendation algorithm. ### Data The movie data found [here](https://cs.brown.edu/courses/cs100/homeworks/data/4/movies.csv) contains movies and some information about their content, namely their genres. The ratings data found [here](https://cs.brown.edu/courses/cs100/homeworks/data/4/ratings.csv) contains users' ratings of movies. ### Part 1: Data Wrangling Load the movie data provided into an R data frame called `movies`. Be sure to use `stringsAsFactors = FALSE`; otherwise, every single movie title will be assigned its own level. Take a `glimpse` of the data. You’ll notice that the "genre" variable concatenates all genres into a single string. You can/should split this string up into a list using `strsplit(movies$genres, "[|]")`. Next, use the `duplicated` predicate to find duplicate movie titles: e.g., `movies[duplicated(movies$title), ]`. *Note:* Movies produced and then reproduced (with a different cast, etc.) again later should not be treated as duplicates! *Hint:* You can use the `match` function to find the index of a particular title: e.g., `match("Hoop Dreams (1994)", movies$title)`. *Hint:* You can use the `which` function to find multiple row indices corresponding a particular title: e.g., `which(movies$title == "Hoop Dream (1994)"`. For all duplicate titles, choose one of the two observations to retain, deleting the other from the data frame. Delete the one that seems to you to be a less accurate description of the movie’s genres. Feel free to watch a trailer to help you decide! *Hint:* There are two duplicate titles (same title; same year), and in both cases, their corresponding genres differ. Your next goal is to build a matrix of movies by genres for eventual content filtering by genre. This matrix should have `nrow(movies)` rows, and the number of columns should be the number of genres (including no genre, in case some movies have no genres associated with them). Here’s a command to extract all the unique genres, and hence the number of them: ```{r} genres <- unique(unlist(movies$genres)) ``` Note the use of the `unlist` function. It is necessary because, for each movie in the data frame, the `genres` variable contains a *list* of that movie’s genres. With this information in hand, you can now create a matrix of the required size: ```{r} mat <- matrix(0, nrow(movies), length(genres)) ``` The first parameter, `0`, is the default value for entries in this matrix, which we will take to mean that a given movie is *not* of a given genre. The next two parameters describe its dimensions. So far so good! The next step is to populate this matrix; that is, to assign 1s rather than 0s to all rows (movies) and columns (genres), when a movie is indeed of that genre. To do so, you should write two *nested* for loops. The outer loop should iterate over all movies. The inner loop (inside the outer one), should loop over all genres associated with each movie. Inside these loops, you should set the entries in the matrix corresponding to that movie-genre combination to 1 rather than 0. *Hint:* At iteration $i$ of the outer loop (i.e., when working on the $i$th movie), the inner loop should iterate over all genres in `movies[i, "genres"]`: i.e., all of the $i$th movie’s genres. *Hint:* Consider using the `match` function again, matching each of a movie’s genres to the `genres` vector to find the columns in the matrix that should be set to 1 for that movie/row. You're almost there. Next, use `as.data.frame(mat)` to convert your matrix to a data frame. And finally, use the the `row.names` function to set the row names to be the movies’ titles, and the `names` function to name the variables as the appropriate genres, which, conveniently, you have already stored in the vector `genres`. At this point, you should have successfully created a data frame of movies by genres, where each observation represents a movie and its corresponding genres. How many movies are classified as romance? How many movies are classified as Westerns? How many movies are classified as romance *and* Westerns? *Hint:* You can use the `table` function to answer these questions. ### Part 2: Clustering ##### Part 2a: Similarity Metrics A **similarity measure** measures how alike two observations are. In data science, we sometimes describe similarity in terms of distance; if distance is small, then there is a high degree of similarity, whereas if distance is large, then there is a low degree of similarity. Popular similarity metrics include Euclidean distance, cosine similarity, Pearson correlation, and Jaccard similarity. Refer to the class slides for a refresher on these different metrics. Install the following package, which contains a variety of similarity metrics, including the ones mentioned above. ```{r, eval = FALSE} install.packages('proxy') library(proxy) ``` You can now use the `dist(x, method = "..." )` function in R to calculate distances, passing in `x`, the data frame of observations that you want to compare, and the name of a distance/similarity metric (e.g., "euclidean", "jaccard", etc.). *Note:* The `proxy` package includes a `simil` function as well as a `dist` function. All measures can be computed either as similarities or distances using one of these two functions. It is a simple matter to transform one to the other; check out [this documentation](https://www.rdocumentation.org/packages/proxy/versions/0.4-19/topics/dist) for more details. Report the similarity between the movies "Star Wars: Episode IV - A New Hope (1977)" and "Stargate (1994)" using the four metrics mentioned above. Similarly, report the similarity between "Toy Story (1995)" and "Die Hard (1988)" using the four metrics. Do the different metrics behave similarly? Which ones seem most to work best on these examples? ##### Part 2b: Hierarchical Clustering Recall the steps in (agglomerative) hierarchical clustering: 1. Initialize the clustering with all data points in their own cluster 2. Compute distances between all clusters 3. Identify the two closest clusters, and combine them 4. Repeat until only one cluster remains Choose 15 to 30 movies of interest to you in the dataset, and create a data frame that contains only these movies. Then calculate the distances between these movies using any two similarity metrics that seem to you to work best. (*Hint:* Be sure to remove the movie names before attempting to calculate similarities, or R will choke.) Create a hierarchical clustering of your choice movies using `hclust`. Save this clustering as, say, `hc`, and then truncate the result so that it contains no more than ten clusters using the `cutree` function: e.g., `cutree(hc, k = 7)`.  The output of `hclust` is a vector of clusters: e.g., 1 1 2 1 2 3 3 2 3 1. In this sample output, cluster 1 consists of the first, second, fourth, and ten movies; cluster 2 consists of the third, fifth, and eighth movies; and cluster 3 consists of the sixth, seventh, and ninth movies. That’s all good and well, but it would be nice to know the names of these movies! The function `which` can be used to extract the `indices` corresponding to each cluster from the output of `hclust`. You can then find the names of the movies in each cluster by looking up these indices in the `name` column of your data frame (say, `my_movies`) with this line of code: `my_movies$name[indices]`. To extract the indices corresponding to the $i$th cluster from `hc`, the output of `hclust`, you can use `which(hc %in% rep(i, num_movies))`. Here, `num_movies` is the number of movies in your data frame, `rep(i, num_movies)` is a vector of all $i$’s of length `num_movies`, and the `which` function returns the indices of `hc` whose value matches $i$. Extract the movie names corresponding to the output of `hclust`. Does `hclust` do a good job of clustering movies by content? Which similarity metric(s) seems to work best? What happens when you set $k$ too high or too low? ### Part 3: Recommender Systems A recommender system is a program that recommends products—ranging from movies to books to clothing to news articles to search queries—to users. Prominent examples of companies that use recommender systems are Spotify, to find you songs like the songs you often listen to, and Amazon, to find you books (and many other products!) that you might like. Netflix is famous for the efforts they spent developing and refining their recommender system. From 2006 to 2009, Netflix "crowdsourced" an algorithm to boost their sales and improve customer satisfaction. What this means is that they invited anyone (without any connection to Netflix) to submit code that would predict user ratings for movies, and they offered a large monetary prize ($1 million!), to any team that could best Netflix’s own algorithm by more than 10%. It was a high bar, but [BellKor's Pragmatic Chaos](https://www.wired.com/2009/09/bellkors-pragmatic-chaos-wins-1-million-netflix-prize/) team succeeded. Most recommendation systems follow one of two models: either collaborative filtering or content filtering. **Collaborative filtering** collects a large amount of user data – behavior, preferences, etc. – and analyzes it, predicting what users will like based on the behavior of other similar users. **Content filtering**, on the other hand, assigns keywords or descriptive tags to items, and then recommends items to users based on item similarity. Netflix uses a hybrid recommender system, based on both collaborative filtering and filtering. It can first use collaborative filtering to determine that Rutvik and Julia have similar taste in movies. Then, through content filtering, it can discover that "Toy Story," which both of them liked, is categorized under the same set of genres as "Inside Out," which neither of them has seen. If it then goes on to recommend "Inside Out" to both of them and they both love it, everyone wins. ### Collaborative Filtering Your goal in this problem is to create a similarity matrix between movies, based not on their content (e.g., genres) as above, but instead based on their user ratings. To do so, load the ratings data into an R data frame, say `ratings`. This file is very large, so we recommend truncating your `ratings` data frame. You can set a hard limit using `ratings <- head(ratings, n)`, but a better way to do this would be to only include users that have seen sufficiently many (e.g., 500) movies. In either case, the larger your data frame, the better your recommendations will be, but the slower your code will run. From the truncated `ratings` data frame, create a matrix with users as rows, and movies as columns. Call it `mat`. The entries in this matrix should be mostly 0s, since most users have not seen most movies. But if the user has rated a movie, the entry in the matrix corresponding to that user (row) and movie (column) should be the user’s rating of that movie. After creating this ratings matrix, you can then use the `cosine` function in the `lsa` package to create a similarity matrix based on [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) as follows: `cosine(mat)`. Note that the `cosine` function calculates similarities among *columns* in a matrix, not rows. The similarity matrix is not that informative without the movie titles, so set the row and column names to be the movie titles. *Hint:* Use the `match` function to find the indices of `ratings$movieId` in `movies$movieId`. Use the `View` command to view the similarity matrix. Clicking on a column sorts the data by that column, and clicking twice sorts the data by that column *in descending order*. Click on three columns of your choice, and for each, report the five movies that are most similarly rated by users. Do you agree with these recommendations? ##### Recommendations for you! If we transpose `mat`, the rows (movies) become the columns, and the columns (users), the rows. Do this, using the `t` function applied to `mat` (i.e., `tmat <- t(mat)`), and then create a similarity matrix among users (`cosine(tmat)`), rather than movies. As above, use the `View` command to view the similarity matrix. This matrix might tell you, for example, that user 7 is most similar to user 4. This information is interesting/useful, because when user 7 asks for a movie recommendation, the system can recommend movies that user 4 rated highly, and that user 7 has not yet seen. Manually add a new row to the user-movie matrix that contains your own movie ratings. The more movies you rate, the better your recommendations will be, so try to rate at least 5, but preferably 10. Repeat the steps above to calculate a new similarity matrix among users, including yourself. You can now see which users rate films most similarly to you! Finally, ask the system to recommend movies for you based on the ratings of the user (or users) who is (are) most similar to you. ##### Final Analysis 1. Name three other websites or apps or services (besides Netflix, Amazon, and Spotify) that use recommendation systems and briefly describe what they use them for. 1. If Twitter were to suggest accounts for a particular user to follow based on the accounts they already follow, which kind of filtering would that be—content or collaborative? 1. List some pros and cons of content filtering and collaborative filtering. Why might a particular website/app/service choose one over the other in their recommendation system?