# Comparative Analysis of the Novel Audio Spectrogram Transformer (AST) for FMA Genre Classification
Group 41
* Tim Rood - 4695968
* Jim Vos - 4922905
* Andrei Ionescu - 5002575
## Introduction
Through its diverse range of sounds, music is an art form that unites people from different backgrounds, allowing all of us to experience and enjoy its beauty. The value we attach to music shows in the rise of streaming services like Spotify, which grow bigger every day, gaining around 60,000 new tracks a day [[1]](https://toneisland.com/spotify-statistics/#:~:text=Spotify%20uploads%2060%2C000%20new%20tracks%20every%20day.,-Spotify%20confirms%20through&text=It%20also%20means%20that%20a,reach%20the%20100%2Dmillion%20milestone). This calls for efficient, automated categorization methods, since labelling these streams of audio manually gets incredibly expensive.
With the rise of deep learning, Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) took on that role and are now common architectures for genre classification. The recent boom of Transformer-based models, in particular the Vision Transformer (ViT) [[7]](https://arxiv.org/abs/2010.11929v2), has slowly but surely pushed the CNN off its throne as the go-to architecture for vision tasks and has given rise to a broad range of ViT variants for an equally broad range of tasks.
One of those is the Audio Spectrogram Transformer (AST) [[5]](https://huggingface.co/docs/transformers/v4.29.1/en/model_doc/audio-spectrogram-transformer#transformers.ASTConfig), which achieved state-of-the-art results on AudioSet, ESC-50 and Speech Commands v2, benchmarks for audio and speech classification. It is relatively new, but in line with recent transformer successes: is it ready to outperform CNNs out of the box?
In this blog post, we will investigate the following:
1. How the number of training samples influences the AST's convergence and performance
2. How Vision Transformers compare to CNNs and RNNs
3. What the best achievable performance of the AST is on a small music genre dataset
We'll first provide some background theory before describing our methodology, showing our findings and drawing a conclusion.
## Theory
### Spectrograms
Audio is stored in many formats like .mp3, .wav or .aac, most of which compress the audio signal for efficient storage. The underlying signal is one-dimensional sequential data: every timepoint holds one value (the amplitude). A common way to make this one-dimensional data two-dimensional is to convert it to a spectrogram. This conversion applies (short-time) Fourier Transforms to determine the intensity of each frequency over time. Adding this second dimension usually represents the characteristics of music better and may be useful for deep learning tasks.

*Figure 1: Two representations of audio: standard waveform (top, one-dimensional) and mel-spectrogram (bottom, two-dimensional)*
Since human hearing has evolved to better distinguish variations at low frequencies, spectrograms are often converted to (log-)mel spectrograms by mapping the frequency axis to the roughly logarithmic mel scale.
Another representation of audio is the mel-frequency cepstral coefficients (MFCCs). They are derived from the mel spectrogram (via a discrete cosine transform of the log-mel energies) and aim to mimic the human auditory system's perception of sound by accounting for the non-linear characteristics of human hearing.

*Figure 2: Visual representation of MFCC*
One is not superior to the other: both MFCC and (log-)mel spectrograms have their place in audio handling and audio classification in particular.
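To make this concrete, here is a minimal sketch of how a log-mel spectrogram and MFCCs could be computed with torchaudio; the file path and the STFT/mel parameters (`n_fft`, `hop_length`, `n_mels`, `n_mfcc`) are illustrative assumptions rather than the exact values from our pipeline.
```python
import torchaudio
import torchaudio.transforms as T

# Load a track (placeholder path) and mix it down to mono.
waveform, sr = torchaudio.load("track.mp3")      # waveform: (channels, time)
waveform = waveform.mean(dim=0, keepdim=True)

# Log-mel spectrogram: STFT -> mel filter bank -> decibel scale.
mel = T.MelSpectrogram(sample_rate=sr, n_fft=1024, hop_length=512, n_mels=128)
log_mel = T.AmplitudeToDB()(mel(waveform))       # (1, n_mels, time_frames)

# MFCCs: a DCT over the log-mel energies.
mfcc = T.MFCC(sample_rate=sr, n_mfcc=20,
              melkwargs={"n_fft": 1024, "hop_length": 512, "n_mels": 128})(waveform)
```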
### Vision transformers (ViT)
A Vision Transformer (ViT) applies the popular self-attention mechanism, initially developed for language tasks, to vision tasks. Simply put, the image is split into patches that are embedded as tokens; multiple attention heads then learn the relationships between those patches on multiple levels, and the resulting representations are used for downstream tasks such as classification. For more on ViTs, we refer to the original paper introducing them [[7]](https://arxiv.org/abs/2010.11929).

*Figure 3: Visualisation of a ViT from the proposing paper [[7]](https://arxiv.org/abs/2010.11929v2)*
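To make the patch-and-attend idea concrete, below is a minimal PyTorch sketch of a ViT-style encoder applied to a spectrogram "image". The sizes are illustrative, and the class token and positional embeddings of a full ViT are omitted for brevity.
```python
import torch
import torch.nn as nn

patch_size, embed_dim = 16, 768

# Patch embedding: a strided convolution cuts the input into 16x16 patches
# and projects each patch to a token of size embed_dim.
patch_embed = nn.Conv2d(1, embed_dim, kernel_size=patch_size, stride=patch_size)

# A stack of Transformer encoder layers applies self-attention over all patches.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=embed_dim, nhead=12, batch_first=True),
    num_layers=12)

spec = torch.randn(1, 1, 128, 512)                       # (batch, channel, mel bins, time)
tokens = patch_embed(spec).flatten(2).transpose(1, 2)    # (batch, n_patches, embed_dim)
encoded = encoder(tokens)                                # contextualised patch representations
```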
## Dataset
For our research, we used FMA [[2]](https://github.com/mdeff/fma), a freely accessible music library. In total FMA contains 106,574 tracks, labelled with 161 hierarchically ordered (sub-)genres. As our time and resources were limited, we opted for the smaller subset of FMA, which contains 8 root genres as shown in Figure 4. We also used a mini subset, shown in Figure 5, as it has a similar size to GTZAN [[3]](https://www.kaggle.com/datasets/andradaolteanu/gtzan-dataset-music-genre-classification), another commonly used dataset. From each track we extracted a random 5-second sequence to keep the input size within manageable proportions. We expect that most genre information is not in the slow progression of a song (clear information on song structure, which could certainly be useful, requires much longer snippets), but rather in the timbre of the instruments, the rhythm and the low-level structure. Hence, the added value of feeding the AST full 30-second clips was deemed negligible compared to the increased computational load.

*Figure 4: The distribution of sample labels in FMA small*

*Figure 5: The distribution of sample labels per genre in FMA mini*
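A rough sketch of how such a random 5-second window can be cut from a longer waveform (the 44.1 kHz sample rate and the zero-padding of short tracks are assumptions):
```python
import torch
import torch.nn.functional as F

def random_crop(waveform: torch.Tensor, sr: int = 44100, seconds: int = 5) -> torch.Tensor:
    """Return a random `seconds`-long slice of a (channels, time) waveform."""
    crop_len = sr * seconds
    if waveform.shape[-1] <= crop_len:                   # pad tracks that are too short
        return F.pad(waveform, (0, crop_len - waveform.shape[-1]))
    start = torch.randint(0, waveform.shape[-1] - crop_len, (1,)).item()
    return waveform[..., start:start + crop_len]
```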
## Methodology
Literature on music genre classification is rather sparse, so there is a lot of variety in similar papers, particularly in the number of samples used. There is also no consensus on the duration of the samples: some works extract six 5-second datapoints from one track [[6]](https://ieeexplore.ieee.org/document/9573568), while others extract 18 sequences that together predict its genre [[4]](https://doi.org/10.1145/3408127.3408137).
We'll compare our results with two papers and try to explain the differences we find:
1. A paper testing a CNN, an RNN-LSTM and a ViT on FMA small with 48k samples (8k tracks, each divided into six sub-samples), using MFCCs [[4]](https://doi.org/10.1145/3408127.3408137)
2. A paper proposing a self-built ViT evaluated on GTZAN with 1k samples [[6]](https://ieeexplore.ieee.org/document/9573568)
Additionally, we train a simple [CNN](https://colab.research.google.com/drive/1o0lF_5AOrZBZLA0gclSYZmiMN8nkDol8) ourselves to use as a baseline for comparison.
The CNN consists of five convolutional layers that progressively learn more intricate patterns and nuances from the input data. Batch normalization layers stabilize the training process by normalizing activations, while ReLU activation functions introduce non-linearity, capturing complex relationships in the data.
On top of that, max pooling reduces spatial dimensions, preserving essential information. The fully connected layer makes genre predictions, aided by dropout regularization to prevent overfitting. Softmax activation assigns music genre probabilities.
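A minimal sketch of a CNN along these lines is given below; the channel widths, the global pooling before the fully connected layer and the dropout rate are assumptions rather than the exact values from our notebook.
```python
import torch.nn as nn

def conv_block(c_in, c_out):
    # Conv -> BatchNorm -> ReLU -> MaxPool: one of the five feature-extraction stages.
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
        nn.BatchNorm2d(c_out),
        nn.ReLU(),
        nn.MaxPool2d(2))

class GenreCNN(nn.Module):
    def __init__(self, n_classes: int = 8):
        super().__init__()
        self.features = nn.Sequential(                   # five convolutional blocks
            conv_block(1, 16), conv_block(16, 32), conv_block(32, 64),
            conv_block(64, 128), conv_block(128, 128))
        self.classifier = nn.Sequential(                 # dropout + fully connected head
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Dropout(0.5), nn.Linear(128, n_classes))

    def forward(self, x):                                # x: (batch, 1, mel bins, time)
        return self.classifier(self.features(x))         # logits; softmax yields genre probabilities
```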
Given the differences in set-up, we'll emphasize absolute performance, but we will also lightly relate our results to these papers.
## Implementation
HuggingFace comes with an off-the-shelf Audio Spectrogram Transformer (AST) [[5]](https://huggingface.co/docs/transformers/model_doc/audio-spectrogram-transformer). We created a [Google Colab repository](https://colab.research.google.com/drive/1Z-d72hqU5k5DTwqc6avlgvvjYKUjc1Ga?usp=sharing) in which we developed a simple model environment using PyTorch and TensorFlow. We condense the metadata to just the songs present in the small FMA subset and combine it with the audio files to create a `dataset` class, which can write or read labelled tensors. The dumped files are quite large: the 8000-sample, medium-resolution set is about 2GB in size. The folder with our data can be found [here](https://drive.google.com/drive/folders/1bFENgwgQtGu793-5HZ3sDjY9lVN-9Ghl?usp=drive_link).
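The idea behind the `dataset` class is roughly the sketch below: pair every log-mel tensor with its genre label, dump the whole set once with `torch.save`, and read the cached file back on later runs. The class name and file path here are hypothetical.
```python
import torch
from torch.utils.data import Dataset

class FMATensorDataset(Dataset):
    """Reads a cached (features, labels) pair written earlier with torch.save."""

    def __init__(self, tensor_path: str):
        # features: (N, 1, mel bins, time steps), labels: (N,) genre indices
        self.features, self.labels = torch.load(tensor_path)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return self.features[idx], self.labels[idx]

# Writing the cache once:
# torch.save((features, labels), "fma_small_logmel.pt")
```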
Once the data is loaded, we proceed with a very standard training procedure in PyTorch and TensorFlow.
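As a condensed sketch, that loop could look as follows, assuming a model (for instance the CNN sketched earlier, or the AST configured in the next section) and a `DataLoader` over the cached dataset above; the batch size, learning rate and weight decay match the hyperparameter table in the next section.
```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

train_loader = DataLoader(FMATensorDataset("fma_small_logmel.pt"),
                          batch_size=64, shuffle=True)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = GenreCNN().to(device)                    # or the AST model from the next section
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
criterion = nn.CrossEntropyLoss()

for epoch in range(50):
    model.train()
    for specs, labels in train_loader:
        specs, labels = specs.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(specs), labels)   # specs: (batch, 1, mel bins, time)
        loss.backward()
        optimizer.step()
```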
### Setup and hyperparameters
We searched for the best combination of hyperparameters by trial and error and tracked every run. An overview of all results can be found in this [Excel sheet](https://docs.google.com/spreadsheets/d/1TbU7wgJCPCRPNGcuqquSKQrKSfQGUH38OIYpVmOacQU/edit?usp=sharing). The last two rows represent the reported runs for FMA mini and small, respectively.
We trained on an Nvidia T4 Tensor Core GPU within the Colab environment for varying numbers of epochs, which, depending on the hyperparameters (especially the number of attention heads), took between one and twenty minutes for FMA mini and between ten minutes and two hours for FMA small.
Some notable findings on training the AST:
1. The model had a remarkable tendency to overfit, especially on the small, GTZAN-like set, FMA mini. Stronger regularization measures such as weight decay, dropout and reductions in model complexity struggled to contain that tendency without limiting the model too much.
2. The sample features were downscaled by a factor of four in both the frequency dimension (mel bins) and the time steps (see the sketch after this list). This did not seem to impact performance much, but it considerably reduced the input size and the training time.
3. Normalizing the input makes a large difference in performance, as also noted by the original developers of the AST we used.
4. Most models converged fairly quickly, within 15 to 50 epochs.
5. The difference between overfitting on the mini and small datasets was large: more on that later.
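As referenced in points 2 and 3, here is a sketch of the downscaling and normalization steps; the dataset-wide mean and standard deviation are assumed to have been computed beforehand, and the scaling to a standard deviation of 0.5 follows the recommendation of the AST authors.
```python
import torch
import torch.nn.functional as F

def downscale_and_normalize(log_mel: torch.Tensor, mean: float, std: float) -> torch.Tensor:
    """log_mel: (mel bins, time) -> average-pooled by 4x in both dimensions, then normalized."""
    small = F.avg_pool2d(log_mel[None, None], kernel_size=4)   # (1, 1, mel/4, time/4)
    return (small[0, 0] - mean) / (2 * std)                    # zero mean, std of 0.5
```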
The best hyperparameters for this task, to our knowledge, are:
|Training parameter | Value |Hyperparameter | Value |
|--------- |--------- |--------- |--------- |
|Batch size | 64 |Patch size | 16 |
|Learning rate | 0.001 |Intermediate size| 1536 |
|Weight decay | 0.01 |Hidden size| 768 |
|||Attention heads| 6 |
|||Hidden layers| 12 |
|||Dropout probabilities| 0.1|
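The table above maps onto HuggingFace's `ASTConfig` roughly as sketched below; `num_mel_bins`, `max_length` and the choice to train `ASTForAudioClassification` from a fresh configuration (rather than fine-tuning a pretrained checkpoint) are assumptions about the exact setup.
```python
import torch
from transformers import ASTConfig, ASTForAudioClassification

config = ASTConfig(
    hidden_size=768, intermediate_size=1536,
    num_hidden_layers=12, num_attention_heads=6,
    patch_size=16,
    hidden_dropout_prob=0.1, attention_probs_dropout_prob=0.1,
    num_mel_bins=32, max_length=128,   # downscaled frequency/time resolution (assumed)
    num_labels=8)                      # the eight FMA root genres

model = ASTForAudioClassification(config)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
```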
## Evaluation
### Results
|Model| Spectrograms | Dataset | Samples | Best accuracy |
|--------- |--------- | -------- | -------- | -------- |
|AST (ours) | Log-mel | FMA mini | 1000 | 0.2650 |
|CNN (ours) | Log-mel | FMA mini | 1000 | 0.2187 |
|AST (ours) | Log-mel | FMA small | 8000 | 0.4075 |
|CNN (ours) | Log-mel | FMA small | 8000 | 0.5424 |
|RNN-LSTM [4]| MFCC | FMA small* | 48000 |0.5713 |
|CNN [4]| MFCC | FMA small* | 48000 | 0.5159 |
|ViT [4] | MFCC | FMA small* | 48000 | 0.5685 |
|ViT [6] | Log-mel | GTZAN** | 1000 | 0.7800 |
*\* All tracks in FMA small were divided into six short tracks of 5 seconds, whereas our models were trained on only 5 seconds per track.*
*\*\* An important remark about this paper is that its methodology is not very transparent, so there is no way to tell whether these results are valid.*
As the table shows, our models are clearly outperformed by the existing models for genre classification. The ViT models introduced in [[4]](https://doi.org/10.1145/3408127.3408137) and [[6]](https://ieeexplore.ieee.org/document/9573568) are specifically designed for music genre classification: they use the transformer architecture with modifications to handle audio data effectively. Our AST, in contrast, still borrows many aspects from a speech-oriented system, including its feature extractor, making it less optimised for this task. As our time and computational resources were limited, we also had to test a smaller, less optimised Transformer network. These architectural differences could contribute to the gap in performance.
On top of that, the performance of deep learning models depends heavily on the quality and characteristics of the dataset used for training and evaluation, and the FMA mini and small datasets are used differently across studies. [[4]](https://doi.org/10.1145/3408127.3408137) preprocesses each track into six datapoints, meaning snippets from the same track could end up in both the train and test set; such similarities between train and test data could explain part of the improved performance. In [[6]](https://ieeexplore.ieee.org/document/9573568) the data was likewise split into six 5-second sequences, which were bundled together for a single prediction. Figure 6, which plots accuracy against the number of samples, suggests that our models still have potential on bigger datasets.

*Figure 6: Accuracies relative to the number of samples retrieved from FMA small*
### Model comparison
Our training progress is shown in Figure 7 below. After about 50 epochs, the AST starts overfitting; the CNN shows similar behaviour. We observe that the CNN generally outperforms the AST on both datasets. Although the AST achieved lower accuracies than the CNN, the improvement between the mini and small datasets indicates its ability to benefit from a larger dataset and learn more representative features. We believe a strength of CNNs is that they perform better straight out of the box than the more complex ViT-style models. Figure 8 shows that both models reach a similar loss, suggesting a similar capacity to learn. Figure 7 also shows that the AST's train accuracy peaks at only 57%, while the gap between its train and test accuracy is smaller than the CNN's. We therefore hypothesise that with better tuning and feature extraction, an increase in train accuracy could translate directly into a higher test accuracy. With more research and optimisation of ASTs, we think there is still potential to close the gap with CNNs and maybe even exceed them.

*Figure 7: CNN and AST learning curves on FMA small*

*Figure 8: CNN and AST loss on FMA small*
It should be noted that the CNN results for FMA mini depend strongly on the train/test shuffle: between two subsequent runs, the test accuracy can differ by up to a factor of two (e.g. 0.2 versus 0.4 on differently shuffled train and test sets).
The code for the CNN can be found in [Google Colab repository CNN](https://colab.research.google.com/drive/1o0lF_5AOrZBZLA0gclSYZmiMN8nkDol8).
## Discussion and limitations
As discussed, the lack of research on this dataset makes drawing conclusions challenging. Since we had trouble finding an established benchmark for genre classification, we emphasize choosing a dataset with as large a volume of published academic results as possible.
Moreover, we suggest extending this experiment to a much larger dataset, such as FMA full (106,574 samples) reduced to its root genres (the eight genres used in our research), or to a set that simply contains more similar samples, to see whether the observed relationship holds and whether the AST can truly outperform a CNN reliably.
## Conclusion
The AST shows mediocre performance given the number of samples it is fed. With 1k samples it overfits quickly and doesn't perform much better than a random guesser (26.5% vs. 12.5%, respectively). With 8k samples, its performance is much better (40.8%) and it shows far less tendency to overfit. This relationship extends to the case with 48k ('augmented') samples, where performance is even higher (56.9%), although that may be due to the way the samples are augmented: a 30-second clip split into six smaller clips yields highly similar snippets, so models may implicitly overfit to training data that contains samples extremely similar to the test data.
Given this, we see a clear trend between the amount of training data and the performance of the AST, and of ViTs in general. Whether the AST can outperform CNNs at larger scales remains an open question. It is well known that transformers require a lot of training data to work effectively, as confirmed by the original ViT paper's experiments on ImageNet [[7]](https://arxiv.org/pdf/2010.11929.pdf). As such, we suspect that our experiments were run on datasets too small to draw a clear conclusion.
No music genre datasets of comparable scale are available: the largest high-quality music genre dataset we know of is FMA.
Perhaps classifying genres is also simply harder than classifying other types of sound: the paper proposing the AST scored high even on ESC-50, a 2000-sample dataset of environmental sounds, so the scale argument does not *always* hold. The proposition that classifying music genres is intrinsically hard is supported by the fact that, with little data, the CNN's performance depended strongly on the train/test split it was dealt.
For now, we can only conclude that while the AST definitely showed learning capability in our relatively tiny setting, it was not capable of outperforming a CNN at classifying genres on FMA small. As datasets become larger, we expect the AST to close the gap with the CNN and eventually outperform it, as our comparisons seem to suggest.
## References
[1] Tone Island, Spotify statistics. https://toneisland.com/spotify-statistics/#:~:text=Spotify%20uploads%2060%2C000%20new%20tracks%20every%20day.,-Spotify%20confirms%20through&text=It%20also%20means%20that%20a,reach%20the%20100%2Dmillion%20milestone
[2] M. Defferrard, K. Benzi, P. Vandergheynst and X. Bresson. 2017. FMA: A Dataset for Music Analysis. In Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR). https://github.com/mdeff/fma
[3] GTZAN Dataset - Music Genre Classification, Kaggle. https://www.kaggle.com/datasets/andradaolteanu/gtzan-dataset-music-genre-classification
[4] Yingying Zhuang, Yuezhang Chen, and Jie Zheng. 2020. Music Genre Classification with Transformer Classifier. In Proceedings of the 2020 4th International Conference on Digital Signal Processing (ICDSP 2020). Association for Computing Machinery, New York, NY, USA, 155–159. https://doi.org/10.1145/3408127.3408137
[5] HuggingFace Transformers documentation: Audio Spectrogram Transformer. https://huggingface.co/docs/transformers/model_doc/audio-spectrogram-transformer
[6] Y. Khasgiwala and J. Tailor, "Vision Transformer for Music Genre Classification using Mel-frequency Cepstrum Coefficient," 2021 IEEE 4th International Conference on Computing, Power and Communication Technologies (GUCON), Kuala Lumpur, Malaysia, 2021, pp. 1-5, doi: 10.1109/GUCON50781.2021.9573568. https://ieeexplore.ieee.org/document/9573568
[7] A. Dosovitskiy et al., "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale," ICLR 2021. https://arxiv.org/abs/2010.11929