Project 1 - HackMD

--- title: Project 1 tags: Projects --- # Project 1: Song Analysis ## Due Date * Song Analysis is due February 20th @ 11:59pm * Handin on Gradescope ## Summary The 2010s sure had some #lit jams! Artists across every genre contributed some masterpieces (or masterpiece in the case of Rihanna) over the last ten years. As we move on to a new decade, we want a measure of similarity for new projects. Since we don't know what music will look like in the future, we can look for trends amongst songs of the past to help us generalize a pattern to apply to future songs. To do this, we decide to use term frequency - inverse document frequency and k - nearest neighbors. In order to test our algorithms, we first need to train them extensively on a data set which can be found [here](https://drive.google.com/drive/folders/1ppiO4WDvuql6lnltsE_vlSx4GDiuLHws?usp=sharing). This way we can begin to identify trends of which lyrics correspond to which genres. Once our algorithm can detect patterns between words and the genre that they belong to, we can begin using our program on new songs and see how they are classifed. This will make cataloging the songs of the coming decade so much easier! ## Project Learning Goals In this project, students will work with data sets of varying sizes, get an introduction to machine learning, and practice implementing the tf-idf algorithm and the k-NN algorithm. ## Design Check For the design check, groups will be expected to handin `song_analysis.py` with block comments answering the questions asked below. For example, if we asked you to set up data structures for a zoo, your answer might say "the zoo is a list of animals, where animals are a dataclass", placed in the area where you would create the zoo dataclass. ### Design Questions: 1. After reading through the handout, describe your understanding of tf-idf. In compute_tf and compute_idf, write out in words the steps you would take to program these functions. 2. The K-nearest neighbor algorithm plays an important part in finding the best matching genre for a song. Describe the importance of how K-nn interacts with the tf-idf. Once you have described how they interact, in nearest_neighbor, write in words how you plan to implement this function. 3. Being able to test your program is an important step in deciding how accurate your genre classifier is. After reading the handout, consider ways that you can test the program and its accuracy, then list a few ideas of how you can make your program more accurate at classifying genres. Consider thinking about your corpus and how you are calculating k-nn. ## TF-IDF tf-idf, or Term Frequency-Inverse Document Frequency, is a metric of computing a term's importance within the corpus or collection of documents. tf-idf can be a useful tool and is the basis of search algorithms (like Google!), keyword extraction, information retrieval, and natural language processing. ### Term Frequency (tf) In our case, Term Frequency is described as the number of times a term *t* appears in a certain document *d*. Given the document: "Tik Tok on the clock, the party don't stop". The term frequencies would be as follows: | Tik | Tok | on | the | clock | party | don't | stop | | ---- | --- | ---| --- | --- | --- | ---- | ---- | | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | ### Inverse Document Frequency (idf) Inverse Document Frequency is used to scale up the importance of words that occur with less frequency in the corpus. To compute Inverse Document Frequency, we first need to get a fraction of the total number of documents in the corpus *n* over the number of times a term appears in the corpus *n**t*. Once this fraction is found, idf can be computed by taking the logarithm of the fraction. ![](https://i.imgur.com/LcGAamu.png) Consider the corpus: | Doc 1 | Doc 2 | Doc 3 | Doc 4 | Doc 5 | Doc 6 | Doc 7 | | ---- | --- | --- | --- | --- | --- | ---- | ---- | | Apple | Banana | Watermelon | Mango | Strawberry | Kiwi Strawberry | Strawberry Banana | To calculate the _idf_ score of "strawberry," we would first identify _n_, the total number of documents (in this case 7). Then, we would divide by the number of documents "strawberry" appears in, *n**t*. To find _idf_, we take the natural logarithm of _n_/*n**t*, so _ln_ (7/3) = .37. ### tf-idf For our purposes, we want to use both term frequency and inverse document frequency in order to understand how important a single term is in the context of the corpus. The tf-idf value is represented by the product of the term frequency and inverse document frequency. Modeled by the following formula: ![](https://i.imgur.com/gIWLYUS.png) Where tfij represents the term frequency and idfi represents the inverse document frequency ## K-NN Now that we have a way to vectorize the song lyrics, we need to find a way to compare and classify all of our new hits with the oldies. For this, we can use k-NN or k nearest neighbors. Refer to the lecture notes for more details on the structure of k-NN. For this project, we will be picking the singular song which best matches based on cosine simialirity i.e. k = 1. ## Requirements 0. **Find a partner**. In this class, all projects will be in a group format, where you will have one or more partners to collaborate with, depending on the project. For this project, you will have one group partner to work through this project with. Please check campuswire for a partner-pairing post to find a partner for the project! Once you have a partner, sign up for a design check appointment [here](https://forms.gle/ZQUU1dkmKX8Eu1DVA). **Make sure to invite your project partner to the event!** (we need this information for grading purposes). 1. **Setup**: Install the project stencil by running the command ```cs0112_install song_analysis```. This will install a song_analysis directory in your ~/course/ directory on your department machine. 2. **Design Check**: Answer the design check questions and begin thinking about the return types of the functions that you will create and how they may work with your intended program organization. 3. **Data/Coding**: (1) Figure out which data structures you need to store the song data from the csv file, as well as for the computed tf-idf vectors for each song. These data structures should be efficient and useful for querying similarity between two songs! (2) With this representation of how you will store the song data from the csv in mind, write the `create_corpus` function (in `song_analysis.py`). 4. **Style and Design**: Be sure to follow all of the style and design guidelines outlined in the cs0112 python style guide. 5. **Testing**: You will also be evaluated based on the testing of your implementation. Make sure that your code works for a variety of cases and consider where your implementation may fail. ## The Implementation Phase In working through this assignment, there are several components the algorithms can be broken into: TF, IDF, TF-IDF, and Nearest-Neighbor. These components are used in sequential order and may build on each other. We recommend that you complete the project in the following order: * *create_corpus* * *compute_tf* * *compute_idf* * *compute_tf_idf* * *compute_corpus_tf_idf* * *nearest_neighbor* ### Details: Create_Corpus This function sets up the entire underlying structure for the project. It takes in no arguments and returns nothing. Since it is a void function, it populates the corpus with song objects parsed from the csv file. ### Details: TF In order to calculate the Term Frequency, you need to fill out the *compute_tf()* function in the stencil code. This function is designed to calculate how many times a term appears in a specifc song. It takes in a list of strings representing the lyrics for a song and produces a dictionary containing the term frequency where the key is the term and the value is the frequency for that term. ### Details: IDF In order to calculate the Inverse Document Frequency for each song in your corpus, you need to fill out the *compute_idf()* function in the stencil code. Like the title of the suggests, this function is responsible for creating the idf for the corpus. This void function takes in no arguments and outputs nothing. ### Details: TF-IDF In order to calculate the Term Frequency - Inverse Document Frequency, you need to fill out the *compute_tf_idf()* function in the stencil code. This function takes in a list of strings representing song lyrics and outputs a dictionary where the key is a term from the song and the value is the tf-idf for that term. ### Details: Cosine_Similarity This function is provided in the stencil code and is used to find a measure of similarity. It takes in two dictionaries that are the result of tf-idf calculations (effectively the lyrics have been vectorized to have a magnitude and direction so that each song could be plotted on a 2d plane now). We can find the similarity of two vectors *(defined by the proximity of the vectors in 2d space)* by dividing the dot product of the two vectors by the product of the magnitude of the two vectors. ### Details: Nearest-Neighbor In order to calculate the nearest neighbor of a novel song, you need to fill out the *nearest_neighbors()* function in the stencil code. This function takes in a String representing song lyrics and outputs the *song object* that is most similar to the lyrics input. ## Testing ### Rihanna makes what? Now that you have built your genre classifier, its time to test some of your favorite lyrics and see what genre they fall under! Rihanna reached out to us to see if we will classify some of her songs to see what genre of songs she makes in order to prepare to release an appropriate album for the new decade. Look up a few of your favorite Rihanna, or any of your favorite artists lyrics, using the function nearest neighbor. For this project, we would like you to put all of your tests in ```test_song_analysis.py```. You can test your program there, testing different edge cases that could be passed to the classifier, along with individually testing each of your functions individually. We recommend creating a smaller corpus csv file and using it to more accurately test your program. You should test all functions that do not require reading input from a file. > ***NOTE:*** When testing with the full data set, it will take a few minutes to compute your corpus of tf_idf vectors! Be patient when running your program! ### Example Queries In order to check what a few queries might produce, we will provide a set of songs lyrics and the nearest songs they produce. This will be provided after design checks have concluded. ## Handin You may submit as many times as you want. Only your latest submission will be graded. This means that if you submit after the deadline, you will be using a late day – so do NOT submit after the deadline unless you plan on using late days. The README template can be found [here](https://drive.google.com/file/d/1JBglGM0B-6g7RLgB3sVTLFsXFtIE17pI/view). Please follow the [design and clarity guide](https://cs.brown.edu/courses/csci0111/fall2019/docs/Python-Testing-Design-and-Clarity-Guide.html)--part of your grade will be for code style and clarity. After completing the homework, you will submit: - `README.txt` - `song_analysis.py` - `test_song_analysis.py` Because all we need are the files, you do not need to submit the whole project folder. As long as the file you create has the same name as what needs to be submitted, you’re good to go! If you are using late days, make sure to make a note of that in your README. Remember, you may only use a maximum of 3 late days per assignment. If the assignment is late (and you do NOT have anymore late days) no credit will be given. Please don't put your name anywhere in any of the handin files--we grade assigments anonymously! You can follow this step-by-step guide for submitting assignments through gradescope [here](https://docs.google.com/document/d/1T5IH8_WnNVKcq9N_QnfzlCluwnNaXBEbwHV1bqQMeDM/edit).

Syntax	Example	Reference
# Header	Header	基本排版
- Unordered List	Unordered List
1. Ordered List	Ordered List
- [ ] Todo List	Todo List
> Blockquote	Blockquote
Bold font	Bold font
Italics font	Italics font
~~Strikethrough~~	~~Strikethrough~~
19^th^	19^th
H~2~O	H₂O
++Inserted text++	Inserted text
==Marked text==	Marked text
[link text](https:// "title")	Link
![image alt](https:// "title")	Image
`Code`	`Code`	在筆記中貼入程式碼
```javascript var i = 0; ```	`var i = 0;`
:smile:		Emoji list
{%youtube youtube_id %}	Externals
$L^aT_eX$	L^aT_eX
:::info This is a alert area. :::	This is a alert area.