# <center> Music Genre Classification with CNN </center>
Keith Ng 1003515
Li Yuxuan 1003607
Ng Jia Yi 1003696
## Topic
Our project seeks to classify music into its respective genre by passing its spectrogram through a convolutional neural network (CNN). The need for automatic music genre classification (AMGC) stems from the fact that artists now distribute their songs directly on their own websites, making music database management a necessity. Another recent trend is consuming music via streaming, which has raised the popularity of online radio stations that play similar songs based on a genre preference. In addition, browsing and searching by genre on the web, and generating smart personalised playlists that pick specific tunes out of gigabytes of songs on portable audio players, are important tasks that genre classification facilitates.
## Architecture

### Inputs and Outputs
Our model will take in a `wav` audio file and output a label specifying the song's genre.
### Training & Validation
For training and validation, we will use the [GTZAN](https://www.kaggle.com/andradaolteanu/gtzan-dataset-music-genre-classification) dataset from Kaggle. This dataset provides 1000 songs across 10 genres (100 songs per genre). 80% of the **GTZAN** dataset will be used as training data and 20% as validation data. The train/validation split will be done uniformly across each genre. Each song will first be processed with `librosa` to obtain its spectrogram. The spectrograms will then be passed into a CNN model for training, and the model will learn to classify each spectrogram into a genre.
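As a sketch, the uniform per-genre 80/20 split described above could look like the following (file names follow GTZAN's `genre.00000.wav` convention; the helper name is our own):

```python
import random

GENRES = ["blues", "classical", "country", "disco", "hiphop",
          "jazz", "metal", "pop", "reggae", "rock"]

def split_per_genre(files_by_genre, train_frac=0.8, seed=42):
    """Split each genre's file list uniformly into train/validation sets."""
    rng = random.Random(seed)
    train, val = [], []
    for genre, files in files_by_genre.items():
        files = sorted(files)          # deterministic order before shuffling
        rng.shuffle(files)
        cut = int(len(files) * train_frac)
        train += [(f, genre) for f in files[:cut]]
        val += [(f, genre) for f in files[cut:]]
    return train, val

# GTZAN provides 100 clips per genre, e.g. "blues.00000.wav" ... "blues.00099.wav".
files = {g: [f"{g}.{i:05d}.wav" for i in range(100)] for g in GENRES}
train, val = split_per_genre(files)
print(len(train), len(val))  # 800 200
```

Splitting per genre rather than over the pooled file list keeps both sets balanced at exactly 80/20 for every genre.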
### Testing
For testing, we will source our own `wav` audio files. We plan to prepare 5 songs for each genre, giving 50 songs in total. The `wav` files will be converted into spectrograms with the Python `librosa` package, and the model will then predict the genre from each spectrogram. As we intend to select the songs from Spotify, we will use the Spotify-assigned genres as the true labels.
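Once the trained model emits one score per genre for each spectrogram, evaluating the 50-song test set reduces to an argmax and an accuracy count. A minimal sketch (the score matrix here is random placeholder data standing in for real CNN outputs):

```python
import numpy as np

GENRES = ["blues", "classical", "country", "disco", "hiphop",
          "jazz", "metal", "pop", "reggae", "rock"]

def predict_labels(scores):
    """Map an (n_songs, n_genres) score matrix to genre names via argmax."""
    return [GENRES[i] for i in scores.argmax(axis=1)]

def accuracy(predicted, true):
    """Fraction of songs whose predicted genre matches its true label."""
    return sum(p == t for p, t in zip(predicted, true)) / len(true)

# Placeholder: 50 test songs (5 per genre), random scores instead of CNN output.
rng = np.random.default_rng(0)
true_labels = [g for g in GENRES for _ in range(5)]
scores = rng.random((50, 10))
print(f"test accuracy: {accuracy(predict_labels(scores), true_labels):.2f}")
```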
## State of the art
First, we would like to compare against a Stanford CS229 project report on music genre classification: http://cs229.stanford.edu/proj2018/report/21.pdf. It evaluates multiple classification methods, including an **RBF-kernel SVM**, **KNN**, a **feed-forward network**, and a **CNN**. The authors tested both raw amplitude and mel-spectrograms as input data, and found that **converting raw audio into mel-spectrograms produced better results on all the models**, with the CNN surpassing human accuracy.
They obtained their data from the public GTZAN dataset, which has 1000 30-second audio clips. They take the **discrete cosine transform of the result**, a step common in signal processing, to get the final mel-spectrogram output. They used the **Librosa library** (which we intend to use as well), choosing 64 mel bins and a window length of 512 samples.
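In practice a single `librosa.feature.melspectrogram` call produces this representation; the numpy sketch below spells out the two underlying steps — a windowed STFT followed by a triangular mel filterbank — using the report's choice of 64 mel bins and 512-sample windows. The function names are our own illustrations, not librosa's API, and the details (Hann window, hop size, filter placement) are simplified relative to librosa's defaults.

```python
import numpy as np

def stft_magnitude(signal, n_fft=512, hop=256):
    """Magnitude spectrogram: Hann-windowed frames -> |rFFT|."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))   # (n_frames, n_fft//2 + 1)

def mel_filterbank(n_mels=64, n_fft=512, sr=22050):
    """Triangular filters spaced evenly on the mel scale."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(0.0, hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, centre, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, centre):             # rising slope
            fb[m - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):            # falling slope
            fb[m - 1, k] = (right - k) / max(right - centre, 1)
    return fb

def log_mel_spectrogram(signal, sr=22050, n_fft=512, n_mels=64):
    power = stft_magnitude(signal, n_fft) ** 2         # (frames, freq_bins)
    mel = power @ mel_filterbank(n_mels, n_fft, sr).T  # (frames, n_mels)
    return np.log(mel + 1e-10)                         # log compresses dynamics

# One second of a 440 Hz tone as a stand-in for a GTZAN clip.
sr = 22050
t = np.arange(sr) / sr
spec = log_mel_spectrogram(np.sin(2 * np.pi * 440 * t), sr)
print(spec.shape)  # (85, 64)
```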

The CNN gave them the best result. Their CNN model uses 3 convolution layers, each with its own max-pooling and regularisation, feeding into 3 fully connected layers with ReLU activations, a softmax output, and cross-entropy loss.
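A PyTorch sketch of that architecture — three conv/max-pool blocks feeding three fully connected layers — might look as follows. The layer widths, dropout rate, and the 64×128 input size are our own illustrative choices; the report does not pin them down for us.

```python
import torch
import torch.nn as nn

class GenreCNN(nn.Module):
    """3 conv + max-pool blocks, then 3 fully connected layers with ReLU.

    Softmax is implied by nn.CrossEntropyLoss, which takes raw logits.
    """
    def __init__(self, n_genres=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Dropout(0.25),  # regularisation, as in the report
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            # 64x128 input halved three times -> 8x16 feature maps, 64 channels.
            nn.Linear(64 * 8 * 16, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, n_genres),  # logits; softmax applied inside the loss
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# One batch of 4 single-channel 64x128 log-mel spectrograms.
model = GenreCNN()
logits = model(torch.randn(4, 1, 64, 128))
print(logits.shape)  # torch.Size([4, 10])
loss = nn.CrossEntropyLoss()(logits, torch.tensor([0, 1, 2, 3]))
```

Keeping the output as raw logits and letting `nn.CrossEntropyLoss` fold in the softmax is the standard, numerically stabler way to express the softmax/cross-entropy pairing described above.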

Frequency-based mel-spectrograms produce higher accuracy because amplitude alone only provides information on intensity, whereas the frequency distribution over time captures the content of the sound. Mel-spectrograms are also visual representations, and CNNs work better with images.
We also found an article that uses an RNN-LSTM model to tackle audio genre classification. The input audio is processed with the `librosa` Python library to obtain audio features in the form of **Mel-frequency cepstral coefficients** (MFCCs); these features are then passed into an LSTM model for training.

*(Mel-frequency cepstral graph, from https://medium.com/@premtibadiya/music-genre-classification-using-rnn-lstm-1c212ba21e06)*
Using this approach, the author achieved **59.70%** average accuracy, and attributed the low figure to the small size of the dataset.
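MFCCs, mentioned above, are obtained by taking a discrete cosine transform of the log-mel spectrum and keeping only the first few coefficients; `librosa.feature.mfcc` does this directly. A numpy sketch of just the DCT step, starting from an already-computed log-mel spectrogram (the input here is random placeholder data, and the function name is our own):

```python
import numpy as np

def mfcc_from_log_mel(log_mel, n_mfcc=13):
    """DCT-II over the mel axis; keep the first n_mfcc coefficients.

    log_mel: (n_frames, n_mels) log-mel spectrogram.
    Returns: (n_frames, n_mfcc) MFCC matrix.
    """
    n_mels = log_mel.shape[1]
    n = np.arange(n_mels)
    # DCT-II basis: cos(pi * k * (n + 0.5) / N) for coefficient k.
    basis = np.cos(np.pi * np.outer(np.arange(n_mfcc), (n + 0.5) / n_mels))
    return log_mel @ basis.T

# Placeholder log-mel input: 85 frames x 64 mel bins of random values.
rng = np.random.default_rng(0)
mfcc = mfcc_from_log_mel(rng.standard_normal((85, 64)))
print(mfcc.shape)  # (85, 13)
```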
In conclusion, based on the findings above, both of which used the same 1000-sample GTZAN dataset, a **CNN** appears to be the better approach for this problem.
## Deliverables
We will provide the **source code** for our model. In addition, we intend to build a **simple GUI** that allows users to upload a song in `wav` format and then displays the predicted genre.
## References
1. Music Genre Classification using RNN-LSTM https://medium.com/@premtibadiya/music-genre-classification-using-rnn-lstm-1c212ba21e06
2. Music Genre Classification http://cs229.stanford.edu/proj2018/report/21.pdf