# MLE - Music Genre Classification
## **Summary**
This project reimplements audio genre classification from the [Kaggle GTZAN dataset](https://www.kaggle.com/andradaolteanu/gtzan-dataset-music-genre-classification).
## **Music Classification**
| Criteria | Description |
| -------- | -------- |
| Problem | Given a collection of WAV files labelled by genre, classify a new input song into one of those genres |
| Solution | Train several models on extracted audio features to predict the genre of a song |
| Data | [Dataset 1](https://www.kaggle.com/andradaolteanu/gtzan-dataset-music-genre-classification) |
| Preprocess | Learn [Audio signal processing](https://www.youtube.com/watch?v=iCwMQJnKk2c&list=PL-wATfeyAMNqIee7cH3q1bh4QJFAaeNv0&index=1) |
| Model | Multiple models |
| Research | MFCC and Spectrogram |
### **Notice about data collection:**
* ***genres original*** - A collection of 10 genres with 100 audio files each, all having a length of 30 seconds (the famous GTZAN dataset, the MNIST of sounds)
* ***images original*** - A visual representation for each audio file. One way to classify data is through neural networks. Because NNs (like the CNN we will be using) usually take in some sort of image representation, the audio files were converted to Mel Spectrograms to make this possible (more on this later).
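
A minimal sketch of how one of those Mel Spectrogram images can be produced with librosa (the file path and figure settings are hypothetical, not the dataset's exact layout):
```
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

# Load one GTZAN clip (hypothetical path) and compute its Mel Spectrogram
y, sr = librosa.load('genres_original/blues/blues.00000.wav', duration=30)
S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
S_db = librosa.power_to_db(S, ref=np.max)   # power -> decibels

# Render and save it as an image, like the files in "images original"
librosa.display.specshow(S_db, sr=sr, x_axis='time', y_axis='mel')
plt.colorbar(format='%+2.0f dB')
plt.savefig('blues.00000.png')
```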
## **Time Bucket**
2.5 hours per day, 7 days a week
4 weeks = 70 hours
# **Daily time log**
## **WEEK 1**
**Day 1:**
* Working with **MFCC** to extract features:
  * https://musicinformationretrieval.com/genre_recognition.html
  * https://mikesmales.medium.com/sound-classification-using-deep-learning-8bc2aa1990b7
  * https://musicinformationretrieval.com/mfcc.html
  * https://towardsdatascience.com/extract-features-of-music-75a3f9bc265d

**Spectrogram:**
* A **spectrogram** is a useful way of visualising the spectrum of frequencies of a sound and how it varies over time.

**MFCC** - **Mel-Frequency Cepstral Coefficients:**
* **Detail**: on [Wiki](https://en.wikipedia.org/wiki/Mel-frequency_cepstrum)
* The MFCCs of a signal are a small set of features (usually about 10-20) which concisely describe the overall shape of its spectral envelope. In MIR, they are often used to describe timbre.

**Difference**: a **spectrogram** uses a linearly spaced frequency scale (each frequency bin is an equal number of Hertz apart), whereas **MFCCs** use a quasi-logarithmic (mel) frequency scale, which is closer to how the human auditory system processes sound (see the sketch after the library list below).

**Some Python libraries:**
* [python_speech_features](https://python-speech-features.readthedocs.io/en/latest/)
* [librosa.feature.mfcc](https://musicinformationretrieval.com/mfcc.html)
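
A small librosa sketch that makes the difference concrete: the STFT spectrogram keeps every linearly spaced frequency bin, while the MFCC step compresses each frame to about 20 coefficients on the mel scale (the file path and parameters are illustrative):
```
import numpy as np
import librosa

# Hypothetical input file; any 30-second clip works here
y, sr = librosa.load('some_song.wav', duration=30)

# Linear-frequency spectrogram: one row per equally spaced frequency bin
D = np.abs(librosa.stft(y, n_fft=2048, hop_length=512))
print(D.shape)     # (1025, n_frames) - 1025 linearly spaced bins

# MFCCs: ~20 coefficients per frame, summarising the spectral envelope on the mel scale
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
print(mfcc.shape)  # (20, n_frames)
```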
**Day 2:**
**Try extracting features** and feeding them into a simple model
**Feature Extraction** list:
* [Zero Crossing Rate](https://towardsdatascience.com/extract-features-of-music-75a3f9bc265d)
* Spectral Centroid
* Spectral Rolloff
* MFCC — Mel-Frequency Cepstral Coefficients
```
import librosa

# Load 30 seconds of audio and compute frame-level features
y, sr = librosa.load(songname, mono=True, duration=30)
chroma_stft = librosa.feature.chroma_stft(y=y, sr=sr)
rmse = librosa.feature.rms(y=y)   # librosa.feature.rmse was renamed to rms in newer librosa
spec_cent = librosa.feature.spectral_centroid(y=y, sr=sr)
spec_bw = librosa.feature.spectral_bandwidth(y=y, sr=sr)
rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr)
zcr = librosa.feature.zero_crossing_rate(y)
mfcc = librosa.feature.mfcc(y=y, sr=sr)
```
**Result:**

The loss remained high, so the model often failed to predict the correct genre.
=> Split each file into slices and export more features per slice:
```
def GetFeature_enhance(g, filename, y, sr, start_sample, end_sample, s):
    # Compute mean features over one slice of the signal
    y_slice = y[start_sample:end_sample]
    chroma_stft = librosa.feature.chroma_stft(y=y_slice, sr=sr)
    rmse = librosa.feature.rms(y=y_slice)
    spec_cent = librosa.feature.spectral_centroid(y=y_slice, sr=sr)
    spec_bw = librosa.feature.spectral_bandwidth(y=y_slice, sr=sr)
    rolloff = librosa.feature.spectral_rolloff(y=y_slice, sr=sr)
    zcr = librosa.feature.zero_crossing_rate(y=y_slice)
    mfcc = librosa.feature.mfcc(y=y_slice, sr=sr)
    # One row: slice name, mean of each feature, the 20 MFCC means, then the genre label
    filename = filename + '_' + str(s)
    to_append = f'{filename} {np.mean(chroma_stft)} {np.mean(rmse)} {np.mean(spec_cent)} {np.mean(spec_bw)} {np.mean(rolloff)} {np.mean(zcr)}'
    for e in mfcc:
        to_append += f' {np.mean(e)}'
    to_append += f' {g}'
    return to_append
```
Each 29-second file is split into 10 slices where possible.
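
The slicing constants are not shown in the snippet below; a minimal sketch of one possible definition, assuming librosa's default 22050 Hz sample rate and 10 slices over the first 29 seconds:
```
SAMPLE_RATE = 22050                                        # librosa's default sample rate (assumption)
DURATION = 29                                              # seconds loaded from each track
NUM_SLICES = 10                                            # slices per track
SAMPLES_PER_SLICE = SAMPLE_RATE * DURATION // NUM_SLICES   # ~2.9 s of samples per slice
```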
```
# Write one CSV row per audio slice; open the output file once and stream rows into it
with open(os.path.join(class_xls_path, 'data1.csv'), 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(header)
    genres, nb_train_samples = GetGenre(genre_music_path)
    for g in genres:
        for filename in os.listdir('{0}/{1}'.format(genre_music_path, g)):
            songname = '{0}/{1}/{2}'.format(genre_music_path, g, filename)
            if FileCheck(songname) != 1:   # skip files that cannot be read
                continue
            y, sr = librosa.load(songname, mono=True, duration=29)
            for s in range(NUM_SLICES):
                start_sample = SAMPLES_PER_SLICE * s
                end_sample = start_sample + SAMPLES_PER_SLICE
                to_append = GetFeature_enhance(g, filename, y, sr, start_sample, end_sample, s)
                writer.writerow(to_append.split())
```

**Fourier Transform**
https://www.youtube.com/watch?v=spUNpyF58BY
**Suggested next research:** recurrent network model
**Day 3:** (too busy)
Increase the feature list to 48 features (a sketch of the new harmony/perceptr/tempo features follows the list):
* Spectral Centroid
* Spectral Rolloff
* chroma
* rmse
* rolloff
* zcr
* harmony
* perceptr
* tempo
* MFCC (20) — Mel-Frequency Cepstral Coefficients
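
The harmony, perceptr and tempo columns are not produced by the earlier snippet; a hedged sketch of how they could be computed with librosa (using harmonic/percussive separation and beat tracking is an assumption based on the feature names):
```
import numpy as np
import librosa

y, sr = librosa.load(songname, mono=True, duration=30)

# Harmonic / percussive components (candidates for the "harmony" and "perceptr" columns)
y_harm, y_perc = librosa.effects.hpss(y)
harmony_mean, perceptr_mean = np.mean(y_harm), np.mean(y_perc)

# Global tempo estimate (the "tempo" column)
tempo, _ = librosa.beat.beat_track(y=y, sr=sr)
```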
Try classical machine-learning models on all sound slices (a sketch of this comparison follows the accuracy list):
* Accuracy Naive Bayes : 0.51952
* Accuracy Stochastic Gradient Descent : 0.65532
* Accuracy KNN : 0.80581
* Accuracy Decision Trees : 0.64631
* Accuracy Random Forest : 0.80881
* Accuracy Support Vector Machine : 0.75475
* Accuracy Logistic Regression : 0.6977
* Accuracy Neural Nets : 0.6303
* Accuracy Cross Gradient Booster : 0.90224
* Accuracy Cross Gradient Booster (Random Forest) : 0.74808
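
A hedged sketch of this comparison, assuming the features were exported to data1.csv as above; the column names (filename, label) and the hyper-parameters are illustrative, not the notebook's exact settings:
```
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

# Load the per-slice feature table exported earlier (column names are assumptions)
data = pd.read_csv('data1.csv')
X = MinMaxScaler().fit_transform(data.drop(columns=['filename', 'label']))
y = data['label'].astype('category').cat.codes

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    'KNN': KNeighborsClassifier(n_neighbors=5),
    'Random Forest': RandomForestClassifier(n_estimators=500),
    'Cross Gradient Booster': XGBClassifier(n_estimators=500, learning_rate=0.05),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, 'accuracy:', model.score(X_test, y_test))
```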

Next, the model needs to predict on trimmed single-sound-event audio clips ***=>***
**Suggested next research:** recurrent network model, time-section prediction
https://towardsdatascience.com/real-time-sound-event-classification-83e892cf187e
https://towardsdatascience.com/audio-deep-learning-made-simple-sound-classification-step-by-step-cebc936bbe5
https://towardsdatascience.com/how-to-predict-a-time-series-part-1-6d7eb182b540
**Day 4:** Structure the project code
**Day 8:** Try an LSTM model and XGBoost
[LSTM](https://github.com/alo360/MusicClassification/blob/main/test05.ipynb),
Accuracy: <50%
[XGBoost](https://github.com/alo360/MusicClassification/blob/main/test04_b.ipynb)
Accuracy: 0.64411
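
A minimal Keras sketch of the LSTM idea, treating each slice as a sequence of MFCC frames; the input shape and layer sizes are assumptions, not the notebook's exact configuration:
```
from tensorflow import keras
from tensorflow.keras import layers

# Input: a sequence of MFCC frames per slice, e.g. 130 frames x 13 coefficients (assumption)
model = keras.Sequential([
    layers.Input(shape=(130, 13)),
    layers.LSTM(64, return_sequences=True),   # first recurrent layer keeps the time axis
    layers.LSTM(64),                          # second recurrent layer summarises the sequence
    layers.Dense(64, activation='relu'),
    layers.Dropout(0.3),
    layers.Dense(10, activation='softmax'),   # 10 genres
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
```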
**Day 9:** Try a CNN model on the current MFCC features:
[CNN](https://github.com/alo360/MusicClassification/blob/main/test06_b.ipynb)

Better performance with **fewer features** and a **simpler model**.
=> When the dataset is very small but has many features, the model does not fit well.
**Model: "CNN"**
|Layer (type) | Output Shape | Param #|
| -------- | -------- | -------|
|conv2d (Conv2D) | (None, 128, 11, 64) | 640 |
|max_pooling2d (MaxPooling2D) | (None, 64, 6, 64) | 0 |
|batch_normalization (BatchNormalization) | (None, 64, 6, 64) | 256 |
|conv2d_1 (Conv2D) | (None, 62, 4, 32) | 18464 |
|max_pooling2d_1 (MaxPooling2D) | (None, 31, 2, 32) | 0 |
|batch_normalization_1 (BatchNormalization) | (None, 31, 2, 32) | 128 |
|conv2d_2 (Conv2D) | (None, 30, 1, 32) | 4128 |
|max_pooling2d_2 (MaxPooling2D) | (None, 15, 1, 32) | 0 |
|batch_normalization_2 (BatchNormalization) | (None, 15, 1, 32) | 128 |
|conv2d_3 (Conv2D) | (None, 15, 1, 16) | 528 |
|max_pooling2d_3 (MaxPooling2D) | (None, 8, 1, 16) | 0 |
|batch_normalization_3 (BatchNormalization) | (None, 8, 1, 16) | 64 |
|flatten_2 (Flatten) | (None, 128) | 0 |
|dense_9 (Dense) | (None, 64) | 8256 |
|dropout_3 (Dropout) | (None, 64) | 0 |
|dense_10 (Dense) | (None, 10) | 650 |
|Total params | 33,242 | |
|Trainable params | 32,954 | |
|Non-trainable params | 288 | |
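
A hedged Keras reconstruction of that summary; the input shape (130 MFCC frames x 13 coefficients), kernel sizes, pooling padding, activations and dropout rate are inferred or assumed from the output shapes and parameter counts, not copied from the notebook:
```
from tensorflow import keras
from tensorflow.keras import layers

# Input: one slice as 130 MFCC frames x 13 coefficients x 1 channel (assumption)
model = keras.Sequential([
    layers.Input(shape=(130, 13, 1)),
    layers.Conv2D(64, (3, 3), activation='relu'),    # -> (128, 11, 64), 640 params
    layers.MaxPooling2D((2, 2), padding='same'),     # -> (64, 6, 64)
    layers.BatchNormalization(),
    layers.Conv2D(32, (3, 3), activation='relu'),    # -> (62, 4, 32), 18464 params
    layers.MaxPooling2D((2, 2), padding='same'),     # -> (31, 2, 32)
    layers.BatchNormalization(),
    layers.Conv2D(32, (2, 2), activation='relu'),    # -> (30, 1, 32), 4128 params
    layers.MaxPooling2D((2, 2), padding='same'),     # -> (15, 1, 32)
    layers.BatchNormalization(),
    layers.Conv2D(16, (1, 1), activation='relu'),    # -> (15, 1, 16), 528 params
    layers.MaxPooling2D((2, 2), padding='same'),     # -> (8, 1, 16)
    layers.BatchNormalization(),
    layers.Flatten(),                                # -> 128
    layers.Dense(64, activation='relu'),             # 8256 params
    layers.Dropout(0.3),                             # rate is an assumption
    layers.Dense(10, activation='softmax'),          # 10 GTZAN genres
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
```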
## **All research pages**
**Source code**
* https://github.com/thisiseshan/music-gan/blob/master/MUSIC_GAN.ipynb
* https://github.com/teomotun/Music-Generation-with-GAN
* https://github.com/korneelvdbroek/mp3net

**Science pages**
* https://www.kaggle.com/andradaolteanu/work-w-audio-data-visualise-classify-recommend
* https://towardsdatascience.com/learning-from-audio-the-mel-scale-mel-spectrograms-and-mel-frequency-cepstral-coefficients-f5752b6324a8

**Recommended info**
* https://projectcocoon.org/projects/65/bts-voice-classification
* https://musicinformationretrieval.com/index.html
* https://www.coursera.org/learn/audio-signal-processing

**librosa parameters**
* https://librosa.org/doc/0.6.3/glossary.html

**Classification**
* https://musicinformationretrieval.com/genre_recognition.html

**Slicing a file**
* https://www.kaggle.com/danaelisanicolas/cnn-part-1-create-subslices-for-each-sound
* https://www.kaggle.com/tanulsingh077/audio-albumentations-transform-your-audio#Cut-Out

**Special information**
* https://www.youtube.com/watch?v=spUNpyF58BY
* https://github.com/mljar/mljar-supervised