# CS410 Homework 10: Unsupervised Learning
> **Due Date: 11/20/2024**
> **Need help?** Remember to check out Edstem and our website for TA assistance.
## Assignment Overview
Unsupervised learning involves finding hidden patterns or intrinsic structures in data. Unlike in supervised learning, there are no labeled responses. In this assignment, you will use foundational unsupervised learning techniques to develop an AI movie recommendation system. This assignment covers Principal Component Analysis (PCA), K-means clustering, and collaborative filtering, the heart of Netflix's movie recommendation algorithm. At the end of the assignment, you will compare collaborative filtering with and without clustering to see which method yields better recommendations.
## Introduction
Last week, Steve introduced his friend Alex to the movie *The Lion King*. Alex loved it and is now eager for more recommendations. Steve decides to analyze his friends' watch data to recommend another movie Alex will love. In this assignment, you will help Steve achieve this by clustering similar movies together and recommending them based on other villagers' preferences. So grab your pickaxe, and let's dig into the data!
## Getting Started
### Stencil
Please click [here](https://github.com) to get the stencil code. It should contain these files: `preprocess_data.py`, `kmeans.py`, `collaborative_filtering.py`, `movies.csv`, and `ratings.csv`.
### Environment
You will need to use the virtual environment that you made in Homework 0 to run code in this assignment, which you can activate by using `conda activate csci410`.
## MovieLens Dataset
The MovieLens Dataset consists of 2 files: `movies.csv` and `ratings.csv`, listing 100,000 ratings across 1,700 movies and 1,000 users.
* `movies.csv`: Stores movieID, title, genres.
* `ratings.csv`: Stores userID, movieID, rating, and timestamp. Ratings are made on a 5-star scale, with half-star increments (0.5 stars - 5.0 stars). Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.
<!-- ## Stencil Code -->
## Tasks
### Step 1: Data Preprocessing
First, in `preprocess_data.py`, load the movie data from `movies.csv` into a pandas DataFrame. To inspect the initial structure, take a quick look at the first few rows of the DataFrame.
Notice that the "genres" column contains concatenated genre strings, which can be split into lists.
:::info
TODO: Split the genre string into a list using the `str.split('|')` method.
:::
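A minimal sketch of these first two steps, assuming `movies.csv` sits next to `preprocess_data.py` and uses the `movieId`, `title`, and `genres` columns described above (check your file's header, as the exact names may differ):
```python
import pandas as pd

# Load the movie data and inspect the initial structure.
movies = pd.read_csv("movies.csv")
print(movies.head())

# Turn "Adventure|Animation|Children" into ["Adventure", "Animation", "Children"].
movies["genres"] = movies["genres"].str.split("|")
```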
For each duplicate title, retain one of the two observations and delete the other from the DataFrame. Delete the one whose genre list seems to you to be a less accurate description of the movie. Feel free to watch a trailer to help you decide!
:::info
TODO: Remove duplicate titles.
:::
:::warning
**Note:** Movies produced and then reproduced (with a different cast, etc.) again later should not be treated as duplicates!
:::
:::spoiler Hint
Hint: There are two duplicate titles (same title; same year), and in both cases, their corresponding genres differ.
:::
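One way to locate the duplicates, assuming the column names above (which row to keep for each title is still your call):
```python
# List rows that share a title so you can compare their genre lists side by side.
duplicate_rows = movies[movies.duplicated(subset="title", keep=False)]
print(duplicate_rows[["movieId", "title", "genres"]])

# After choosing which observation to keep for each title, drop the others, e.g.
# movies = movies.drop(index=[...])  # fill in the indices of the rows you rejected
```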
Your next goal is to build a matrix of movies by genres for eventual content filtering by genre. This matrix should have the dimensions (number of movies $\times$ number of genres). Don't forget, the number of genres should include "no genre" in case any movies have no genres associated with them.
:::info
TODO: Create a DataFrame of movies by genres, where each observation represents a movie and its corresponding genres. Start by creating a set of all unique genres, then initialize a DataFrame with zeros. Populate this by setting the appropriate entries to 1 based on the genres of each movie.
:::
:::spoiler Hint
Keep in mind that for each movie in the DataFrame, the genres variable contains a list of that movie’s genres.
:::
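A minimal sketch of one way to build this matrix, assuming `movies["genres"]` already holds lists of genre strings (movies without a genre carry a placeholder genre token, so they get a column automatically):
```python
# One-hot encode genres into a movies-by-genres DataFrame of 0s and 1s.
all_genres = sorted({genre for genre_list in movies["genres"] for genre in genre_list})
genre_matrix = pd.DataFrame(0, index=movies.index, columns=all_genres, dtype=int)

for idx, genre_list in movies["genres"].items():
    genre_matrix.loc[idx, genre_list] = 1  # set this movie's genre columns to 1
```
The counts in the next paragraph can then be checked with expressions like `genre_matrix["Romance"].sum()`.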
To check your work, calculate and print the number of movies classified as Romance (1545), Western (168), and both Romance and Western (20), to analyze the genre distribution in the dataset.
### Step 2: Apply PCA
Because the user-rating matrix is high-dimensional and mostly empty at the start, computing similarities directly is difficult, so we use a PCA-based dimensionality reduction step. PCA is one of the most successful feature extraction techniques and is widely used in collaborative filtering systems.
The main idea of PCA is to project the original data onto a new coordinate space whose axes are the principal components of the data, ordered by eigenvalue from highest to lowest. The first principal component carries the most significant information; components of lesser significance are generally discarded, yielding a space with fewer dimensions than the original one.
When it comes time to make your recommender system, you will need to build a similarity matrix based on user ratings. The `ratings.csv` file is very large, so you will need to truncate your ratings DataFrame. In theory, you could set a hard limit, keeping only $n$ rows or only users that have seen sufficiently many (e.g., 500) movies. Either way, the larger your DataFrame, the better your recommendations will be, but the slower your code will run.
Rather than arbitrarily truncating your `ratings` DataFrame, let's use PCA!
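A minimal sketch of the reduction step using scikit-learn, assuming `ratings_matrix` is the users-by-movies matrix you build in Step 4; keeping roughly 95% of the variance is an illustrative choice, not a required setting:
```python
# Project the sparse users-by-movies matrix onto its top principal components
# instead of truncating rows by hand.
from sklearn.decomposition import PCA

pca = PCA(n_components=0.95)                 # keep enough components for ~95% of the variance
reduced = pca.fit_transform(ratings_matrix.values)
print(ratings_matrix.shape, "->", reduced.shape)
```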
### Step 3: K-Means Clustering
The aim of clustering is to divide users into groups of “like-minded” (nearest) neighbors so that recommendations can search within a group instead of the whole user space, which can dramatically improve the system's scalability. In `kmeans.py`, implement K-means clustering using Euclidean distance. You can run the `visualize` method to see a graph of the centroids and clusters when you run K-means clustering on the dataset. You can find the pseudocode and TODOs in the Python file.
:::spoiler Note: Use NumPy operations whenever possible.
K-Means is much faster if you write the update functions using operations on NumPy arrays, instead of manually looping over the arrays and updating the values yourself.
:::
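As a point of reference, here is a minimal vectorized sketch of the two K-means update steps; the stencil's own function and variable names may differ.
```python
import numpy as np

def assign_clusters(data, centroids):
    # data: (n_points, n_features), centroids: (k, n_features)
    # Squared Euclidean distance from every point to every centroid, via broadcasting.
    dists = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)  # index of the closest centroid for each point

def update_centroids(data, labels, k):
    # Each new centroid is the mean of the points assigned to it
    # (empty-cluster handling is omitted here for brevity).
    return np.array([data[labels == j].mean(axis=0) for j in range(k)])
```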
### Step 4: Collaborative Filtering Recommender
A recommender system is a program that recommends products—ranging from movies to books to clothing to news articles to search queries—to users. Prominent examples of companies that use recommender systems are Spotify, to find you songs like the songs you often listen to, and Amazon, to find you books (and many other products!) that you might like.
Most recommendation systems follow one of two models: either **collaborative filtering** or **content filtering**. Collaborative filtering collects a large amount of user data – behavior, preferences, etc. – and analyzes it, predicting what users will like based on the behavior of other similar users. Content filtering, on the other hand, assigns keywords or descriptive tags to items, and then recommends items to users based on item similarity.
Generally, collaborative filtering works like this: look at what products the current user has used or liked, find other users that have used or liked similar products, and then recommend other products that those users have used or liked.
First, let's understand how user-based collaborative filtering works.
**User-based collaborative filtering** makes recommendations based on user-product interactions in the past. The assumption behind the algorithm is that similar users like similar products.
The user-based collaborative filtering algorithm usually has the following steps (the weighted-average scoring in step 3 is sketched after this list):
1. Find similar users based on interactions with common items.
2. Identify items rated highly by similar users that the active user of interest has not yet seen.
3. Calculate the weighted average score for each item.
4. Rank items by score and pick the top $n$ items to recommend.
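A minimal sketch of the weighted-average scoring step, assuming `sims` holds the active user's similarities to the other users and `item_ratings` holds those users' ratings for one candidate item, with 0 meaning unrated; both names are illustrative placeholders:
```python
import numpy as np

def weighted_score(sims, item_ratings):
    # Weight each neighbor's rating by how similar that neighbor is to the active user.
    rated = item_ratings > 0
    if not rated.any():
        return 0.0
    return (sims[rated] * item_ratings[rated]).sum() / np.abs(sims[rated]).sum()
```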
To implement collaborative filtering, you should first create a similarity matrix between movies, based not on their content (e.g., genres) as above, but on their user ratings.
:::info
From the truncated `ratings` DataFrame, create a matrix with users as rows and movies as columns.
:::
:::spoiler Hint
The entries in this matrix should be mostly 0s, since most users have not seen most movies. But if the user has rated a movie, the entry in the matrix corresponding to that user (row) and movie (column) should be the user’s rating of that movie.
:::
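A minimal sketch of this pivot, assuming the truncated DataFrame keeps the `userId`, `movieId`, and `rating` columns from `ratings.csv`:
```python
# Users as rows, movies as columns, 0 wherever a user has not rated a movie.
ratings_matrix = ratings.pivot_table(
    index="userId", columns="movieId", values="rating", fill_value=0
)
```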
After creating this `ratings` matrix, you can then create a similarity matrix based on [cosine similarity](https://www.geeksforgeeks.org/cosine-similarity/). Note that the cosine function calculates similarities among columns in a matrix, not rows.
:::info
TODO: Using cosine similarity, create a similarity matrix. Add the movie titles to the matrix to make it more informative.
:::
:::spoiler Hint
For tips on implementing cosine similarity, read [here](https://datastax.medium.com/how-to-implement-cosine-similarity-in-python-505e8ec1d823).
:::
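One possible route (not necessarily the exact function the stencil expects) is scikit-learn's `cosine_similarity`, which compares rows, so the users-by-movies matrix is transposed first:
```python
# Movie-to-movie cosine similarity computed from the ratings matrix.
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

movie_vectors = ratings_matrix.T                       # movies become rows
similarity = cosine_similarity(movie_vectors.values)   # shape: (n_movies, n_movies)
similarity_df = pd.DataFrame(
    similarity, index=movie_vectors.index, columns=movie_vectors.index
)
# Optionally map each movieId to its title (from movies.csv) to label the axes.
```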
:::spoiler Testing it out
You can sort the data by a column to view it in descending order. To test your recommender, generate recommendations for a few sample users.
:::
In movie recommendation, clustering is widely used to make algorithms more scalable.
:::info
TODO: Try collaborative filtering within generated clusters to see if it yields better recommendations.
:::
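A rough sketch of the comparison, assuming `labels` assigns each user (each row of `ratings_matrix`) to a K-means cluster; `labels` and the per-cluster rebuild below are placeholders for your own Step 3 and Step 4 code:
```python
import numpy as np

for cluster_id in np.unique(labels):
    cluster_users = ratings_matrix.index[labels == cluster_id]
    cluster_ratings = ratings_matrix.loc[cluster_users]
    # Rebuild the similarity matrix and generate recommendations from this
    # cluster's ratings only, then compare them with the recommendations you
    # got from the full matrix.
```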
## Submission
### Grading
### Hand-In
Submit the assignment via Gradescope under the corresponding project assignment by **zipping up your hw folder** or through **GitHub** (recommended).
To submit through GitHub, follow these commands:
1. `git add -A`
2. `git commit -m "commit message"`
3. `git push`
Now, you are ready to upload the repo to Gradescope.
:::success
Congrats on submitting your homework; Steve is proud of you!!

:::