---
title: 'MGC 2022'
disqus: hackmd
---

###### tags: `TEEP 2022`

## Music Genre Classification

[TOC]

## Researcher Intro

This is Eric. I'm new to HackMD and music analysis. Looking forward to learning more and sharing the results!

GTZAN genre dataset (in progress)
---

Discuss the GTZAN dataset here.

MFCC extraction with librosa (in progress)
---

Discuss MFCCs and how they were extracted here.

Multi-Layer Perceptron Model
---

**FEB 16, 2022:** I was able to get roughly 55% validation accuracy (90:10 train:test split of 1000 images) out of a simple fully connected model. I also took this opportunity to work through a tutorial notebook and learn about the various initialization and optimization functions available.

~Model:~ Linear Fully Connected
~Optimizer:~ RMSprop
~Epochs:~ 669
~Validation Accuracy:~ 53.7%

![](https://i.imgur.com/KD1Z97U.png)

The generalization gap -- the difference between training and validation accuracies -- is quite large. Given that the data is in the form of images, a convolutional network is probably necessary.

LSTM First Attempt
---

**FEB 18, 2022:** I tried plugging the GTZAN 13-MFCC images into a PyTorch LSTM template, but I keep getting a dimension error. The MFCC images are tensors with dimensions (3, 70, 70) (height and width cropped, which can be optimized later). The 3 is the channel dimension (red, green, and blue for images; converting to grayscale would not solve the problem, because the tensor would still have three dimensions). PyTorch forums suggest running the data through a CNN first to extract image features. Apparently, the structure of an LSTM does not allow for higher-dimensional inputs. I will take this opportunity to test a CNN-FC model without LSTM first. Later, we can try adding LSTM for comparison.

**FEB 21, 2022:** Upon further reading, LSTM seems better suited for sequential/time-series data such as text and musical notes in a song. This reinforces my novice finding that MFCC images are not ideal inputs to an LSTM network. Therefore, I plan to explore LSTM performance with some sort of sequential input. Drop MFCCs and feed in pitch vs. time arrays? No, this has the same problem -- multiple values (12 pitches) per timestep. The data has to be 1-dimensional, as far as I understand.

ResNet18 Experiment
---

**FEB 20, 2022:** After spending a few hours pointlessly trying to fit the previous initialization/optimization notebook functions to a CNN, I gave up and started from scratch with ResNet18. Borrowing from a standard PyTorch train/test script (https://www.pluralsight.com/guides/introduction-to-resnet), I fed in my 90:10 training data set and played around with the hyperparameters. I also changed the optimization function to Adam.

~Model:~ ResNet18
~Optimizer:~ Adam
~Epochs:~ 30
~Learning_Rate:~ 1e-5
~Training Accuracy:~ 98%
~Validation Accuracy:~ 70%

![](https://i.imgur.com/AF1GkJi.png)

Better than the fully connected experiment, but there is clearly still room for improvement. As I was reading about train-validation splitting, I realized that PyTorch has a tool to do this randomly and automatically. The code would look something like:

~~~python
train_set, val_set = torch.utils.data.random_split(dataset, [900, 100])
~~~

Since I did not randomize these subsets when I generated the MFCC images, I have re-consolidated them into a single dataset folder and shuffled both the training and validation subsets using torch.utils.data.random_split().
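For context, here is a minimal sketch of how the reshuffled dataset and ResNet18 fit together. The folder name, image size, and batch size are placeholders rather than the exact values I used.

~~~python
import torch
import torch.nn as nn
from torchvision import datasets, transforms, models

# Placeholder layout: one subfolder of MFCC images per genre
transform = transforms.Compose([
    transforms.Resize((224, 224)),  # ResNet18 expects 3-channel images; size is illustrative
    transforms.ToTensor(),
])
dataset = datasets.ImageFolder("gtzan_mfcc_images", transform=transform)

# Shuffle into 90:10 train/validation subsets
train_set, val_set = torch.utils.data.random_split(dataset, [900, 100])
train_loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)
val_loader = torch.utils.data.DataLoader(val_set, batch_size=32)

# ResNet18 with the final layer swapped out for the 10 GTZAN genres
model = models.resnet18(pretrained=False)
model.fc = nn.Linear(model.fc.in_features, 10)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()
~~~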
**FEB 21, 2022:**

~Model:~ ResNet18
~Optimizer:~ Adam
~Epochs:~ 50
~Learning_Rate:~ 1e-4
~Training Accuracy:~ 100%
~Validation Accuracy:~ 80%

![](https://i.imgur.com/kfX9T8W.png)

https://github.com/ericodle/Genre-Classification-Using-LSTM/blob/main/resnet18_init_and_opt.ipynb

With a shuffled dataset and some tuning of the hyperparameters, I was able to get the validation accuracy up to 80%. ResNet18 is set up for 3-channel (color) images, but grayscale inputs may be worth exploring later.

**FEB 24, 2022:** I have re-analyzed the GTZAN clips using librosa's chroma_stft function, which converts the music stream into its dominant notes vs. time. This approach still generates an image, so it's back to ResNet18. Moreover, it seems to diminish the importance of rhythm.

Another alteration to the GTZAN analysis was splitting each 30-second clip into three approximately 10-second clips. This not only increases the training volume, but arguably the variety as well. The benefit is admittedly genre-dependent: whereas a 30-second clip of a classical song can vary widely in tempo, style, and mood, hip-hop and pop songs tend to be more homogeneous. At worst, some training images will be essentially duplicated. To offset this, I will change the train:validation split to 70:30.

Ultimately, chroma didn't seem to offer any advantage over MFCC-based classification.

MFCCs as Python dictionary in .json
---

**FEB 27, 2022:** Generating .png image files for each song in the GTZAN dataset was slow (on Colab), and the approach did not lend itself well to this task. The images are not scenes with discrete objects such as cars or cats, but a bunch of squares in an array. There may be some image network that can effectively learn these, and I may come back to that approach later. For now, I have re-generated the MFCCs (13 per song) following Valerio Velardo's YouTube tutorial on the subject. Valerio echoes the thought I had earlier about separating each 30-second audio clip into multiple train/test elements, and I followed his [script](https://github.com/musikalkemist/DeepLearningForAudioWithPython/tree/master/12-%20Music%20genre%20classification:%20Preparing%20the%20dataset/code) to do so.

Next, I decided to follow along and try a simple multi-layer perceptron model with four fully connected linear layers headed by a flatten layer. The tutorial was written in Keras, so I had to adapt it to PyTorch. More involved was shifting from PyTorch's user-friendly image data loader to writing my own custom dataloader for the .json file. It was good practice, and I learned that the granular DIY approach is what sets PyTorch apart from Keras. Ideally, this newfound control over the training dataset will enable me to better experiment with LSTM.

**FEB 28, 2022:** A 4-layer CNN on top of a fully connected (FC) classification layer yielded results lower than those achieved by Ceylan et al.

~Model:~ CNN
~Optimizer:~ Adam
~Epochs:~ 154
~Learning_Rate:~ 0.001 (CyclicLR)
~Accuracy:~ 83.6%

![](https://i.imgur.com/BBgDGgO.png)

The next step is to try adapting Velardo's LSTM model to PyTorch. As an aside, the way Keras and PyTorch denote 2D convolutional layers, as well as the way activation functions are expressed in relation to a particular layer, are different. This Keras-PyTorch business is annoying, but there are many helpful message board posts for most issues.

RNN experiments
---

**MAR 1, 2022:** After lots of troubleshooting, I achieved a working LSTM model in PyTorch for the GTZAN MFCC dataset.
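A minimal sketch of the kind of LSTM classifier this ended up being (layer sizes are illustrative, not the exact hyperparameters I used). The input is a batch of MFCC sequences shaped (batch, time_steps, 13):

~~~python
import torch
import torch.nn as nn

class GenreLSTM(nn.Module):
    def __init__(self, n_mfcc=13, hidden_size=64, num_layers=2, n_genres=10):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_mfcc, hidden_size=hidden_size,
                            num_layers=num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, n_genres)

    def forward(self, x):
        # x: (batch, time_steps, n_mfcc)
        out, _ = self.lstm(x)           # out: (batch, time_steps, hidden_size)
        return self.fc(out[:, -1, :])   # classify from the last timestep

# Quick shape check with dummy data (130 timesteps is just an example)
logits = GenreLSTM()(torch.randn(8, 130, 13))  # -> (8, 10)
~~~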
With a 70:30 train/val split, I was able to train an LSTM model that surpassed the Lin Lab training results to date (~60%).

~Model:~ LSTM
~Optimizer:~ RMSprop
~Epochs:~ 70
~Learning_Rate:~ 0.001 (CyclicLR)
~Accuracy:~ 89.5%

![](https://i.imgur.com/wBi0vp5.png)

**MAR 3, 2022:** Today, I tested multiple optimization functions, learning rates, and LR schedulers on a bidirectional version of the LSTM model used on MAR 1. The conversion was easy -- the model required a doubling of the h0 and c0 dimensions in the forward function, as well as a doubling of the fully connected hidden dimension. Oh, and don't forget to set bidirectional=True.

~Model:~ Bidirectional LSTM
~Optimizer:~ RMSprop
~Epochs:~ 90
~Learning_Rate:~ 0.001 (lr_scheduler.CyclicLR)
~Accuracy:~ 89.6%

![](https://i.imgur.com/bNNkBTe.png)

The bidirectional LSTM model didn't improve training results compared to the normal LSTM. The accuracies for both the train and val sets show a hill-like pattern in the latter half of training, whereas the unidirectional LSTM only saw sharp dips around epochs 125 and 170. The hilly pattern is likely due to a change I made in the LR scheduler.

It's been difficult to push the training accuracy into the 90s. Ceylan et al. (2021) claim their CNN achieved a training accuracy of ~93% on an 80:20 split of 13-MFCC GTZAN images. My 70:30 non-image approach has achieved accuracy in the high 80s, which may still produce good results on a naive song classification test. I checked whether changing my train/val split to 80:20 could achieve a higher accuracy, and got an increase of roughly 1%. The 70:30 split feels more robust to me, so I'll keep it.

**MAR 5, 2022:** A GRU (gated recurrent unit) model showed the best training results yet.

~Model:~ GRU
~Optimizer:~ RMSprop
~Epochs:~ 68
~Learning_Rate:~ 0.001 (CyclicLR)
~Accuracy:~ 91.6%

![](https://i.imgur.com/928gH5V.png)

Testing
---

**MAR 6, 2022** (on an 80/10/10 train/val/test split)

(Note: the X and Y axis labels should be switched. Sorry.)

GRU model: 90.7% testing accuracy
![](https://i.imgur.com/NBZ68uo.png)

BiLSTM model: 86.0% testing accuracy
![](https://i.imgur.com/3GkWlpE.png)

LSTM model: 88.8% testing accuracy
![](https://i.imgur.com/RYfpnLv.png)

CNN model: 81.3% testing accuracy
![](https://i.imgur.com/1utfhnv.png)

MLP model: 57.1% testing accuracy
![](https://i.imgur.com/CjHda3q.png)
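For reference, a rough sketch of how the test accuracies and confusion matrices above can be computed, assuming a test_loader built from the held-out 10% split. sklearn's confusion_matrix is one convenient option, not necessarily what was used for the plots above.

~~~python
import torch
from sklearn.metrics import confusion_matrix

def evaluate(model, test_loader, device="cpu"):
    """Return test accuracy and a genre-by-genre confusion matrix."""
    model.eval()
    all_preds, all_labels = [], []
    with torch.no_grad():
        for mfccs, labels in test_loader:
            logits = model(mfccs.to(device))
            preds = logits.argmax(dim=1).cpu()
            all_preds.extend(preds.tolist())
            all_labels.extend(labels.tolist())
    accuracy = sum(p == l for p, l in zip(all_preds, all_labels)) / len(all_labels)
    return accuracy, confusion_matrix(all_labels, all_preds)
~~~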