# MLE - Music Genre Classification
## **Summary**
This project reimplements audio genre classification from the [Kaggle GTZAN dataset](https://www.kaggle.com/andradaolteanu/gtzan-dataset-music-genre-classification).
## **Music Classification**
| Criteria | Description |
| -------- | -------- |
| Problem | Given a collection of WAV files labelled by genre, classify a new input song into one of those genres |
| Solution | Train several models on extracted audio features to predict the genre of a song |
| Data | [Dataset 1](https://www.kaggle.com/andradaolteanu/gtzan-dataset-music-genre-classification) |
| Preprocess | Learn [Audio signal processing](https://www.youtube.com/watch?v=iCwMQJnKk2c&list=PL-wATfeyAMNqIee7cH3q1bh4QJFAaeNv0&index=1) |
| Model | Multiple models |
| Research | MFCC and Spectrogram |
### **Notice about data collection:**
* ***genres original*** - A collection of 10 genres with 100 audio files each, all having a length of 30 seconds (the famous GTZAN dataset, the MNIST of sounds)
* ***images original*** - A visual representation for each audio file. One way to classify data is through neural networks. Because NNs (like the CNN we will be using) usually take in some sort of image representation, the audio files were converted to Mel Spectrograms to make this possible (more on this later).
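
A minimal sketch of how one of those Mel Spectrogram images can be produced with librosa (the file path and figure settings are hypothetical, not the dataset's exact layout):
```
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

# Load one GTZAN clip (hypothetical path) and compute its Mel Spectrogram
y, sr = librosa.load('genres_original/blues/blues.00000.wav', duration=30)
S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
S_db = librosa.power_to_db(S, ref=np.max)   # power -> decibels

# Render and save it as an image, like the files in "images original"
librosa.display.specshow(S_db, sr=sr, x_axis='time', y_axis='mel')
plt.colorbar(format='%+2.0f dB')
plt.savefig('blues.00000.png')
```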
## **Time Bucket**
2.5 hours per day, 7 days a week
4 weeks = 70 hours
# **Daily time log**
## **WEEK 1**
**Day 1:**
* Working with **MFCC** to extract features:
  * https://musicinformationretrieval.com/genre_recognition.html
  * https://mikesmales.medium.com/sound-classification-using-deep-learning-8bc2aa1990b7
  * https://musicinformationretrieval.com/mfcc.html
  * https://towardsdatascience.com/extract-features-of-music-75a3f9bc265d

**Spectrogram:**
* A **spectrogram** is a useful way of visualising the spectrum of frequencies of a sound and how it varies over time.

**MFCC** - **Mel-Frequency Cepstral Coefficients:**
* **Detail**: on [Wiki](https://en.wikipedia.org/wiki/Mel-frequency_cepstrum)
* The MFCCs of a signal are a small set of features (usually about 10-20) which concisely describe the overall shape of its spectral envelope. In MIR, they are often used to describe timbre.

**Difference**: a **spectrogram** uses a linearly spaced frequency scale (each frequency bin is an equal number of Hertz apart), whereas **MFCCs** use a quasi-logarithmic (mel) frequency scale, which is closer to how the human auditory system processes sound (see the sketch after the library list below).

**Some Python libraries:**
* [python_speech_features](https://python-speech-features.readthedocs.io/en/latest/)
* [librosa.feature.mfcc](https://musicinformationretrieval.com/mfcc.html)
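
A small librosa sketch that makes the difference concrete: the STFT spectrogram keeps every linearly spaced frequency bin, while the MFCC step compresses each frame to about 20 coefficients on the mel scale (the file path and parameters are illustrative):
```
import numpy as np
import librosa

# Hypothetical input file; any 30-second clip works here
y, sr = librosa.load('some_song.wav', duration=30)

# Linear-frequency spectrogram: one row per equally spaced frequency bin
D = np.abs(librosa.stft(y, n_fft=2048, hop_length=512))
print(D.shape)     # (1025, n_frames) - 1025 linearly spaced bins

# MFCCs: ~20 coefficients per frame, summarising the spectral envelope on the mel scale
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
print(mfcc.shape)  # (20, n_frames)
```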
**Day 2:**
**Try extracting features** and feeding them into a simple model
**Feature Extraction** list:
* [Zero Crossing Rate](https://towardsdatascience.com/extract-features-of-music-75a3f9bc265d)
* Spectral Centroid
* Spectral Rolloff
* MFCC — Mel-Frequency Cepstral Coefficients
```
import librosa

# Load 30 seconds of audio and compute frame-level features
y, sr = librosa.load(songname, mono=True, duration=30)
chroma_stft = librosa.feature.chroma_stft(y=y, sr=sr)
rmse = librosa.feature.rms(y=y)   # librosa.feature.rmse was renamed to rms in newer librosa
spec_cent = librosa.feature.spectral_centroid(y=y, sr=sr)
spec_bw = librosa.feature.spectral_bandwidth(y=y, sr=sr)
rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr)
zcr = librosa.feature.zero_crossing_rate(y)
mfcc = librosa.feature.mfcc(y=y, sr=sr)
```
**Result:**

The loss remained high, so the model often failed to predict the correct genre.
=> Split each file into slices and export more features per slice:
```
def GetFeature_enhance(g, filename, y, sr, start_sample, end_sample, s):
    # Compute mean features over one slice of the signal
    y_slice = y[start_sample:end_sample]
    chroma_stft = librosa.feature.chroma_stft(y=y_slice, sr=sr)
    rmse = librosa.feature.rms(y=y_slice)
    spec_cent = librosa.feature.spectral_centroid(y=y_slice, sr=sr)
    spec_bw = librosa.feature.spectral_bandwidth(y=y_slice, sr=sr)
    rolloff = librosa.feature.spectral_rolloff(y=y_slice, sr=sr)
    zcr = librosa.feature.zero_crossing_rate(y=y_slice)
    mfcc = librosa.feature.mfcc(y=y_slice, sr=sr)
    # One row: slice name, mean of each feature, the 20 MFCC means, then the genre label
    filename = filename + '_' + str(s)
    to_append = f'{filename} {np.mean(chroma_stft)} {np.mean(rmse)} {np.mean(spec_cent)} {np.mean(spec_bw)} {np.mean(rolloff)} {np.mean(zcr)}'
    for e in mfcc:
        to_append += f' {np.mean(e)}'
    to_append += f' {g}'
    return to_append
```
Each 29-second file is split into 10 slices where possible.
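
The slicing constants are not shown in the snippet below; a minimal sketch of one possible definition, assuming librosa's default 22050 Hz sample rate and 10 slices over the first 29 seconds:
```
SAMPLE_RATE = 22050                                        # librosa's default sample rate (assumption)
DURATION = 29                                              # seconds loaded from each track
NUM_SLICES = 10                                            # slices per track
SAMPLES_PER_SLICE = SAMPLE_RATE * DURATION // NUM_SLICES   # ~2.9 s of samples per slice
```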
```
# Write one CSV row per audio slice; open the output file once and stream rows into it
with open(os.path.join(class_xls_path, 'data1.csv'), 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(header)
    genres, nb_train_samples = GetGenre(genre_music_path)
    for g in genres:
        for filename in os.listdir('{0}/{1}'.format(genre_music_path, g)):
            songname = '{0}/{1}/{2}'.format(genre_music_path, g, filename)
            if FileCheck(songname) != 1:   # skip files that cannot be read
                continue
            y, sr = librosa.load(songname, mono=True, duration=29)
            for s in range(NUM_SLICES):
                start_sample = SAMPLES_PER_SLICE * s
                end_sample = start_sample + SAMPLES_PER_SLICE
                to_append = GetFeature_enhance(g, filename, y, sr, start_sample, end_sample, s)
                writer.writerow(to_append.split())
```

**Fourier Transform**
https://www.youtube.com/watch?v=spUNpyF58BY
**Suggested next research:** recurrent network model
**Day 3:** (too busy)
Increase the feature list to 48 features (a sketch of the new harmony/perceptr/tempo features follows the list):
* Spectral Centroid
* Spectral Rolloff
* chroma
* rmse
* rolloff
* zcr
* harmony
* perceptr
* tempo
* MFCC (20) — Mel-Frequency Cepstral Coefficients
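
The harmony, perceptr and tempo columns are not produced by the earlier snippet; a hedged sketch of how they could be computed with librosa (using harmonic/percussive separation and beat tracking is an assumption based on the feature names):
```
import numpy as np
import librosa

y, sr = librosa.load(songname, mono=True, duration=30)

# Harmonic / percussive components (candidates for the "harmony" and "perceptr" columns)
y_harm, y_perc = librosa.effects.hpss(y)
harmony_mean, perceptr_mean = np.mean(y_harm), np.mean(y_perc)

# Global tempo estimate (the "tempo" column)
tempo, _ = librosa.beat.beat_track(y=y, sr=sr)
```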
Try classical machine-learning models on all sound slices (a sketch of this comparison follows the accuracy list):
* Accuracy Naive Bayes : 0.51952
* Accuracy Stochastic Gradient Descent : 0.65532
* Accuracy KNN : 0.80581
* Accuracy Decision Trees : 0.64631
* Accuracy Random Forest : 0.80881
* Accuracy Support Vector Machine : 0.75475
* Accuracy Logistic Regression : 0.6977
* Accuracy Neural Nets : 0.6303
* Accuracy Cross Gradient Booster : 0.90224
* Accuracy Cross Gradient Booster (Random Forest) : 0.74808
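
A hedged sketch of this comparison, assuming the features were exported to data1.csv as above; the column names (filename, label) and the hyper-parameters are illustrative, not the notebook's exact settings:
```
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

# Load the per-slice feature table exported earlier (column names are assumptions)
data = pd.read_csv('data1.csv')
X = MinMaxScaler().fit_transform(data.drop(columns=['filename', 'label']))
y = data['label'].astype('category').cat.codes

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    'KNN': KNeighborsClassifier(n_neighbors=5),
    'Random Forest': RandomForestClassifier(n_estimators=500),
    'Cross Gradient Booster': XGBClassifier(n_estimators=500, learning_rate=0.05),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, 'accuracy:', model.score(X_test, y_test))
```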

Next, the model needs to predict on trimmed single-sound-event audio clips ***=>***
**Suggested next research:** recurrent network model, time-section prediction
https://towardsdatascience.com/real-time-sound-event-classification-83e892cf187e
https://towardsdatascience.com/audio-deep-learning-made-simple-sound-classification-step-by-step-cebc936bbe5
https://towardsdatascience.com/how-to-predict-a-time-series-part-1-6d7eb182b540
**Day 4:** Structure the project code
**Day 8:** Try an LSTM model and XGBoost
[LSTM](https://github.com/alo360/MusicClassification/blob/main/test05.ipynb),
Accuracy: <50%
[XGBoost](https://github.com/alo360/MusicClassification/blob/main/test04_b.ipynb)
Accuracy: 0.64411
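
A minimal Keras sketch of the LSTM idea, treating each slice as a sequence of MFCC frames; the input shape and layer sizes are assumptions, not the notebook's exact configuration:
```
from tensorflow import keras
from tensorflow.keras import layers

# Input: a sequence of MFCC frames per slice, e.g. 130 frames x 13 coefficients (assumption)
model = keras.Sequential([
    layers.Input(shape=(130, 13)),
    layers.LSTM(64, return_sequences=True),   # first recurrent layer keeps the time axis
    layers.LSTM(64),                          # second recurrent layer summarises the sequence
    layers.Dense(64, activation='relu'),
    layers.Dropout(0.3),
    layers.Dense(10, activation='softmax'),   # 10 genres
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
```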
**Day 9:** Try a CNN model on the current MFCC features:
[CNN](https://github.com/alo360/MusicClassification/blob/main/test06_b.ipynb)

Better performance with **fewer features** and a **simpler model**.
=> When the dataset is very small but has many features, the model does not fit well.
**Model: "CNN"**
|Layer (type) | Output Shape | Param #|
| -------- | -------- | -------|
|conv2d (Conv2D) | (None, 128, 11, 64) | 640 |
|max_pooling2d (MaxPooling2D) | (None, 64, 6, 64) | 0 |
|batch_normalization (BatchNormalization) | (None, 64, 6, 64) | 256 |
|conv2d_1 (Conv2D) | (None, 62, 4, 32) | 18464 |
|max_pooling2d_1 (MaxPooling2D) | (None, 31, 2, 32) | 0 |
|batch_normalization_1 (BatchNormalization) | (None, 31, 2, 32) | 128 |
|conv2d_2 (Conv2D) | (None, 30, 1, 32) | 4128 |
|max_pooling2d_2 (MaxPooling2D) | (None, 15, 1, 32) | 0 |
|batch_normalization_2 (BatchNormalization) | (None, 15, 1, 32) | 128 |
|conv2d_3 (Conv2D) | (None, 15, 1, 16) | 528 |
|max_pooling2d_3 (MaxPooling2D) | (None, 8, 1, 16) | 0 |
|batch_normalization_3 (BatchNormalization) | (None, 8, 1, 16) | 64 |
|flatten_2 (Flatten) | (None, 128) | 0 |
|dense_9 (Dense) | (None, 64) | 8256 |
|dropout_3 (Dropout) | (None, 64) | 0 |
|dense_10 (Dense) | (None, 10) | 650 |
|Total params | 33,242 | |
|Trainable params | 32,954 | |
|Non-trainable params | 288 | |
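
A hedged Keras reconstruction of that summary; the input shape (130 MFCC frames x 13 coefficients), kernel sizes, pooling padding, activations and dropout rate are inferred or assumed from the output shapes and parameter counts, not copied from the notebook:
```
from tensorflow import keras
from tensorflow.keras import layers

# Input: one slice as 130 MFCC frames x 13 coefficients x 1 channel (assumption)
model = keras.Sequential([
    layers.Input(shape=(130, 13, 1)),
    layers.Conv2D(64, (3, 3), activation='relu'),    # -> (128, 11, 64), 640 params
    layers.MaxPooling2D((2, 2), padding='same'),     # -> (64, 6, 64)
    layers.BatchNormalization(),
    layers.Conv2D(32, (3, 3), activation='relu'),    # -> (62, 4, 32), 18464 params
    layers.MaxPooling2D((2, 2), padding='same'),     # -> (31, 2, 32)
    layers.BatchNormalization(),
    layers.Conv2D(32, (2, 2), activation='relu'),    # -> (30, 1, 32), 4128 params
    layers.MaxPooling2D((2, 2), padding='same'),     # -> (15, 1, 32)
    layers.BatchNormalization(),
    layers.Conv2D(16, (1, 1), activation='relu'),    # -> (15, 1, 16), 528 params
    layers.MaxPooling2D((2, 2), padding='same'),     # -> (8, 1, 16)
    layers.BatchNormalization(),
    layers.Flatten(),                                # -> 128
    layers.Dense(64, activation='relu'),             # 8256 params
    layers.Dropout(0.3),                             # rate is an assumption
    layers.Dense(10, activation='softmax'),          # 10 GTZAN genres
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
```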
## **All research pages**
**Source code**
* https://github.com/thisiseshan/music-gan/blob/master/MUSIC_GAN.ipynb
* https://github.com/teomotun/Music-Generation-with-GAN
* https://github.com/korneelvdbroek/mp3net

**Science pages**
* https://www.kaggle.com/andradaolteanu/work-w-audio-data-visualise-classify-recommend
* https://towardsdatascience.com/learning-from-audio-the-mel-scale-mel-spectrograms-and-mel-frequency-cepstral-coefficients-f5752b6324a8

**Recommended info**
* https://projectcocoon.org/projects/65/bts-voice-classification
* https://musicinformationretrieval.com/index.html
* https://www.coursera.org/learn/audio-signal-processing

**librosa parameters**
* https://librosa.org/doc/0.6.3/glossary.html

**Classification**
* https://musicinformationretrieval.com/genre_recognition.html

**Slicing a file**
* https://www.kaggle.com/danaelisanicolas/cnn-part-1-create-subslices-for-each-sound
* https://www.kaggle.com/tanulsingh077/audio-albumentations-transform-your-audio#Cut-Out

**Special information**
* https://www.youtube.com/watch?v=spUNpyF58BY
* https://github.com/mljar/mljar-supervised