cs0112
      • Sharing URL Link copied
      • /edit
      • View mode
        • Edit mode
        • View mode
        • Book mode
        • Slide mode
        Edit mode View mode Book mode Slide mode
      • Customize slides
      • Note Permission
      • Read
        • Owners
        • Signed-in users
        • Everyone
        Owners Signed-in users Everyone
      • Write
        • Owners
        • Signed-in users
        • Everyone
        Owners Signed-in users Everyone
      • Engagement control Commenting, Suggest edit, Emoji Reply
    • Invite by email
      Invitee

      This note has no invitees

    • Publish Note

      Share your work with the world Congratulations! 🎉 Your note is out in the world Publish Note

      Your note will be visible on your profile and discoverable by anyone.
      Your note is now live.
      This note is visible on your profile and discoverable online.
      Everyone on the web can find and read all notes of this public team.
      See published notes
      Unpublish note
      Please check the box to agree to the Community Guidelines.
      View profile
    • Commenting
      Permission
      Disabled Forbidden Owners Signed-in users Everyone
    • Enable
    • Permission
      • Forbidden
      • Owners
      • Signed-in users
      • Everyone
    • Suggest edit
      Permission
      Disabled Forbidden Owners Signed-in users Everyone
    • Enable
    • Permission
      • Forbidden
      • Owners
      • Signed-in users
    • Emoji Reply
    • Enable
    • Versions and GitHub Sync
    • Note settings
    • Note Insights New
    • Engagement control
    • Make a copy
    • Transfer ownership
    • Delete this note
    • Insert from template
    • Import from
      • Dropbox
      • Google Drive
      • Gist
      • Clipboard
    • Export to
      • Dropbox
      • Google Drive
      • Gist
    • Download
      • Markdown
      • HTML
      • Raw HTML
Menu Note settings Note Insights Versions and GitHub Sync Sharing URL Help
Menu
Options
Engagement control Make a copy Transfer ownership Delete this note
Import from
Dropbox Google Drive Gist Clipboard
Export to
Dropbox Google Drive Gist
Download
Markdown HTML Raw HTML
Back
Sharing URL Link copied
/edit
View mode
  • Edit mode
  • View mode
  • Book mode
  • Slide mode
Edit mode View mode Book mode Slide mode
Customize slides
Note Permission
Read
Owners
  • Owners
  • Signed-in users
  • Everyone
Owners Signed-in users Everyone
Write
Owners
  • Owners
  • Signed-in users
  • Everyone
Owners Signed-in users Everyone
Engagement control Commenting, Suggest edit, Emoji Reply
  • Invite by email
    Invitee

    This note has no invitees

  • Publish Note

    Share your work with the world Congratulations! 🎉 Your note is out in the world Publish Note

    Your note will be visible on your profile and discoverable by anyone.
    Your note is now live.
    This note is visible on your profile and discoverable online.
    Everyone on the web can find and read all notes of this public team.
    See published notes
    Unpublish note
    Please check the box to agree to the Community Guidelines.
    View profile
    Engagement control
    Commenting
    Permission
    Disabled Forbidden Owners Signed-in users Everyone
    Enable
    Permission
    • Forbidden
    • Owners
    • Signed-in users
    • Everyone
    Suggest edit
    Permission
    Disabled Forbidden Owners Signed-in users Everyone
    Enable
    Permission
    • Forbidden
    • Owners
    • Signed-in users
    Emoji Reply
    Enable
    Import from Dropbox Google Drive Gist Clipboard
       Owned this note    Owned this note      
    Published Linked with GitHub
    • Any changes
      Be notified of any changes
    • Mention me
      Be notified of mention me
    • Unsubscribe
    --- title: Project 1 tags: Projects --- # Project 1: Song Analysis ## Due Date * Song Analysis is due February 20th @ 11:59pm * Handin on Gradescope ## Summary The 2010s sure had some #lit jams! Artists across every genre contributed some masterpieces (or masterpiece in the case of Rihanna) over the last ten years. As we move on to a new decade, we want a measure of similarity for new projects. Since we don't know what music will look like in the future, we can look for trends amongst songs of the past to help us generalize a pattern to apply to future songs. To do this, we decide to use term frequency - inverse document frequency and k - nearest neighbors. In order to test our algorithms, we first need to train them extensively on a data set which can be found [here](https://drive.google.com/drive/folders/1ppiO4WDvuql6lnltsE_vlSx4GDiuLHws?usp=sharing). This way we can begin to identify trends of which lyrics correspond to which genres. Once our algorithm can detect patterns between words and the genre that they belong to, we can begin using our program on new songs and see how they are classifed. This will make cataloging the songs of the coming decade so much easier! ## Project Learning Goals In this project, students will work with data sets of varying sizes, get an introduction to machine learning, and practice implementing the tf-idf algorithm and the k-NN algorithm. ## Design Check For the design check, groups will be expected to handin `song_analysis.py` with block comments answering the questions asked below. For example, if we asked you to set up data structures for a zoo, your answer might say "the zoo is a list of animals, where animals are a dataclass", placed in the area where you would create the zoo dataclass. ### Design Questions: 1. After reading through the handout, describe your understanding of tf-idf. In compute_tf and compute_idf, write out in words the steps you would take to program these functions. 2. The K-nearest neighbor algorithm plays an important part in finding the best matching genre for a song. Describe the importance of how K-nn interacts with the tf-idf. Once you have described how they interact, in nearest_neighbor, write in words how you plan to implement this function. 3. Being able to test your program is an important step in deciding how accurate your genre classifier is. After reading the handout, consider ways that you can test the program and its accuracy, then list a few ideas of how you can make your program more accurate at classifying genres. Consider thinking about your corpus and how you are calculating k-nn. ## TF-IDF tf-idf, or Term Frequency-Inverse Document Frequency, is a metric of computing a term's importance within the corpus or collection of documents. tf-idf can be a useful tool and is the basis of search algorithms (like Google!), keyword extraction, information retrieval, and natural language processing. ### Term Frequency (tf) In our case, Term Frequency is described as the number of times a term *t* appears in a certain document *d*. Given the document: "Tik Tok on the clock, the party don't stop". The term frequencies would be as follows: | Tik | Tok | on | the | clock | party | don't | stop | | ---- | --- | ---| --- | --- | --- | ---- | ---- | | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | ### Inverse Document Frequency (idf) Inverse Document Frequency is used to scale up the importance of words that occur with less frequency in the corpus. To compute Inverse Document Frequency, we first need to get a fraction of the total number of documents in the corpus *n* over the number of times a term appears in the corpus *n*<sub>*t*</sub>. Once this fraction is found, idf can be computed by taking the logarithm of the fraction. ![](https://i.imgur.com/LcGAamu.png) Consider the corpus: | Doc 1 | Doc 2 | Doc 3 | Doc 4 | Doc 5 | Doc 6 | Doc 7 | | ---- | --- | --- | --- | --- | --- | ---- | ---- | | Apple | Banana | Watermelon | Mango | Strawberry | Kiwi Strawberry | Strawberry Banana | To calculate the _idf_ score of "strawberry," we would first identify _n_, the total number of documents (in this case 7). Then, we would divide by the number of documents "strawberry" appears in, *n*<sub>*t*</sub>. To find _idf_, we take the natural logarithm of _n_/*n*<sub>*t*</sub>, so _ln_ (7/3) = .37. ### tf-idf For our purposes, we want to use both term frequency and inverse document frequency in order to understand how important a single term is in the context of the corpus. The tf-idf value is represented by the product of the term frequency and inverse document frequency. Modeled by the following formula: ![](https://i.imgur.com/gIWLYUS.png) Where tf<sub>ij</sub> represents the term frequency and idf<sub>i</sub> represents the inverse document frequency ## K-NN Now that we have a way to vectorize the song lyrics, we need to find a way to compare and classify all of our new hits with the oldies. For this, we can use k-NN or k nearest neighbors. Refer to the lecture notes for more details on the structure of k-NN. For this project, we will be picking the singular song which best matches based on cosine simialirity i.e. k = 1. ## Requirements 0. **Find a partner**. In this class, all projects will be in a group format, where you will have one or more partners to collaborate with, depending on the project. For this project, you will have one group partner to work through this project with. Please check campuswire for a partner-pairing post to find a partner for the project! Once you have a partner, sign up for a design check appointment [here](https://forms.gle/ZQUU1dkmKX8Eu1DVA). **Make sure to invite your project partner to the event!** (we need this information for grading purposes). 1. **Setup**: Install the project stencil by running the command ```cs0112_install song_analysis```. This will install a song_analysis directory in your ~/course/ directory on your department machine. 2. **Design Check**: Answer the design check questions and begin thinking about the return types of the functions that you will create and how they may work with your intended program organization. 3. **Data/Coding**: (1) Figure out which data structures you need to store the song data from the csv file, as well as for the computed tf-idf vectors for each song. These data structures should be efficient and useful for querying similarity between two songs! (2) With this representation of how you will store the song data from the csv in mind, write the `create_corpus` function (in `song_analysis.py`). 4. **Style and Design**: Be sure to follow all of the style and design guidelines outlined in the cs0112 python style guide. 5. **Testing**: You will also be evaluated based on the testing of your implementation. Make sure that your code works for a variety of cases and consider where your implementation may fail. ## The Implementation Phase In working through this assignment, there are several components the algorithms can be broken into: TF, IDF, TF-IDF, and Nearest-Neighbor. These components are used in sequential order and may build on each other. We recommend that you complete the project in the following order: * *create_corpus* * *compute_tf* * *compute_idf* * *compute_tf_idf* * *compute_corpus_tf_idf* * *nearest_neighbor* ### Details: Create_Corpus This function sets up the entire underlying structure for the project. It takes in no arguments and returns nothing. Since it is a void function, it populates the corpus with song objects parsed from the csv file. ### Details: TF In order to calculate the Term Frequency, you need to fill out the *compute_tf()* function in the stencil code. This function is designed to calculate how many times a term appears in a specifc song. It takes in a list of strings representing the lyrics for a song and produces a dictionary containing the term frequency where the key is the term and the value is the frequency for that term. ### Details: IDF In order to calculate the Inverse Document Frequency for each song in your corpus, you need to fill out the *compute_idf()* function in the stencil code. Like the title of the suggests, this function is responsible for creating the idf for the corpus. This void function takes in no arguments and outputs nothing. ### Details: TF-IDF In order to calculate the Term Frequency - Inverse Document Frequency, you need to fill out the *compute_tf_idf()* function in the stencil code. This function takes in a list of strings representing song lyrics and outputs a dictionary where the key is a term from the song and the value is the tf-idf for that term. ### Details: Cosine_Similarity This function is provided in the stencil code and is used to find a measure of similarity. It takes in two dictionaries that are the result of tf-idf calculations (effectively the lyrics have been vectorized to have a magnitude and direction so that each song could be plotted on a 2d plane now). We can find the similarity of two vectors *(defined by the proximity of the vectors in 2d space)* by dividing the dot product of the two vectors by the product of the magnitude of the two vectors. ### Details: Nearest-Neighbor In order to calculate the nearest neighbor of a novel song, you need to fill out the *nearest_neighbors()* function in the stencil code. This function takes in a String representing song lyrics and outputs the *song object* that is most similar to the lyrics input. ## Testing ### Rihanna makes what? Now that you have built your genre classifier, its time to test some of your favorite lyrics and see what genre they fall under! Rihanna reached out to us to see if we will classify some of her songs to see what genre of songs she makes in order to prepare to release an appropriate album for the new decade. Look up a few of your favorite Rihanna, or any of your favorite artists lyrics, using the function nearest neighbor. For this project, we would like you to put all of your tests in ```test_song_analysis.py```. You can test your program there, testing different edge cases that could be passed to the classifier, along with individually testing each of your functions individually. We recommend creating a smaller corpus csv file and using it to more accurately test your program. You should test all functions that do not require reading input from a file. > ***NOTE:*** When testing with the full data set, it will take a few minutes to compute your corpus of tf_idf vectors! Be patient when running your program! ### Example Queries In order to check what a few queries might produce, we will provide a set of songs lyrics and the nearest songs they produce. This will be provided after design checks have concluded. ## Handin You may submit as many times as you want. Only your latest submission will be graded. This means that if you submit after the deadline, you will be using a late day – so do NOT submit after the deadline unless you plan on using late days. The README template can be found [here](https://drive.google.com/file/d/1JBglGM0B-6g7RLgB3sVTLFsXFtIE17pI/view). Please follow the [design and clarity guide](https://cs.brown.edu/courses/csci0111/fall2019/docs/Python-Testing-Design-and-Clarity-Guide.html)--part of your grade will be for code style and clarity. After completing the homework, you will submit: - `README.txt` - `song_analysis.py` - `test_song_analysis.py` Because all we need are the files, you do not need to submit the whole project folder. As long as the file you create has the same name as what needs to be submitted, you’re good to go! If you are using late days, make sure to make a note of that in your README. Remember, you may only use a maximum of 3 late days per assignment. If the assignment is late (and you do NOT have anymore late days) no credit will be given. Please don't put your name anywhere in any of the handin files--we grade assigments anonymously! You can follow this step-by-step guide for submitting assignments through gradescope [here](https://docs.google.com/document/d/1T5IH8_WnNVKcq9N_QnfzlCluwnNaXBEbwHV1bqQMeDM/edit).

    Import from clipboard

    Paste your markdown or webpage here...

    Advanced permission required

    Your current role can only read. Ask the system administrator to acquire write and comment permission.

    This team is disabled

    Sorry, this team is disabled. You can't edit this note.

    This note is locked

    Sorry, only owner can edit this note.

    Reach the limit

    Sorry, you've reached the max length this note can be.
    Please reduce the content or divide it to more notes, thank you!

    Import from Gist

    Import from Snippet

    or

    Export to Snippet

    Are you sure?

    Do you really want to delete this note?
    All users will lose their connection.

    Create a note from template

    Create a note from template

    Oops...
    This template has been removed or transferred.
    Upgrade
    All
    • All
    • Team
    No template.

    Create a template

    Upgrade

    Delete template

    Do you really want to delete this template?
    Turn this template into a regular note and keep its content, versions, and comments.

    This page need refresh

    You have an incompatible client version.
    Refresh to update.
    New version available!
    See releases notes here
    Refresh to enjoy new features.
    Your user state has changed.
    Refresh to load new user state.

    Sign in

    Forgot password

    or

    By clicking below, you agree to our terms of service.

    Sign in via Facebook Sign in via Twitter Sign in via GitHub Sign in via Dropbox Sign in with Wallet
    Wallet ( )
    Connect another wallet

    New to HackMD? Sign up

    Help

    • English
    • 中文
    • Français
    • Deutsch
    • 日本語
    • Español
    • Català
    • Ελληνικά
    • Português
    • italiano
    • Türkçe
    • Русский
    • Nederlands
    • hrvatski jezik
    • język polski
    • Українська
    • हिन्दी
    • svenska
    • Esperanto
    • dansk

    Documents

    Help & Tutorial

    How to use Book mode

    Slide Example

    API Docs

    Edit in VSCode

    Install browser extension

    Contacts

    Feedback

    Discord

    Send us email

    Resources

    Releases

    Pricing

    Blog

    Policy

    Terms

    Privacy

    Cheatsheet

    Syntax Example Reference
    # Header Header 基本排版
    - Unordered List
    • Unordered List
    1. Ordered List
    1. Ordered List
    - [ ] Todo List
    • Todo List
    > Blockquote
    Blockquote
    **Bold font** Bold font
    *Italics font* Italics font
    ~~Strikethrough~~ Strikethrough
    19^th^ 19th
    H~2~O H2O
    ++Inserted text++ Inserted text
    ==Marked text== Marked text
    [link text](https:// "title") Link
    ![image alt](https:// "title") Image
    `Code` Code 在筆記中貼入程式碼
    ```javascript
    var i = 0;
    ```
    var i = 0;
    :smile: :smile: Emoji list
    {%youtube youtube_id %} Externals
    $L^aT_eX$ LaTeX
    :::info
    This is a alert area.
    :::

    This is a alert area.

    Versions and GitHub Sync
    Get Full History Access

    • Edit version name
    • Delete

    revision author avatar     named on  

    More Less

    Note content is identical to the latest version.
    Compare
      Choose a version
      No search result
      Version not found
    Sign in to link this note to GitHub
    Learn more
    This note is not linked with GitHub
     

    Feedback

    Submission failed, please try again

    Thanks for your support.

    On a scale of 0-10, how likely is it that you would recommend HackMD to your friends, family or business associates?

    Please give us some advice and help us improve HackMD.

     

    Thanks for your feedback

    Remove version name

    Do you want to remove this version name and description?

    Transfer ownership

    Transfer to
      Warning: is a public team. If you transfer note to this team, everyone on the web can find and read this note.

        Link with GitHub

        Please authorize HackMD on GitHub
        • Please sign in to GitHub and install the HackMD app on your GitHub repo.
        • HackMD links with GitHub through a GitHub App. You can choose which repo to install our App.
        Learn more  Sign in to GitHub

        Push the note to GitHub Push to GitHub Pull a file from GitHub

          Authorize again
         

        Choose which file to push to

        Select repo
        Refresh Authorize more repos
        Select branch
        Select file
        Select branch
        Choose version(s) to push
        • Save a new version and push
        • Choose from existing versions
        Include title and tags
        Available push count

        Pull from GitHub

         
        File from GitHub
        File from HackMD

        GitHub Link Settings

        File linked

        Linked by
        File path
        Last synced branch
        Available push count

        Danger Zone

        Unlink
        You will no longer receive notification when GitHub file changes after unlink.

        Syncing

        Push failed

        Push successfully