# Emotion Recognition - Model Evaluation Report

*This report describes the model evaluation process, with comparisons to existing scientific research published on a standard benchmark dataset for emotion recognition.* We compare **Venus Engine**, the Emotion AI classification technology from **Wonder Technologies**, against peer-reviewed scientific literature from leading academic journals, using a standard emotion recognition dataset as a benchmark.

## Introduction to the dataset

We use **IEMOCAP** ([Interactive Emotional Dyadic Motion Capture](https://sail.usc.edu/iemocap/)), developed by the University of Southern California. IEMOCAP is the most commonly used dataset for speech emotion recognition and contains ~12 hours of audiovisual data, including video, speech, facial motion capture, and text transcriptions.

The dataset is made of **5 sessions**, each consisting of multiple dialogs between **two actors**: one male and one female (10 distinct speakers in total). Each dialog is broken down at the **utterance level** (roughly one sentence), and **each utterance is annotated** with one of 10 emotion labels: anger, happiness, excitement, sadness, frustration, fear, surprise, disgust, other, and neutral state.

Each utterance is annotated by at least 3 human annotators, and **we only use data points for which at least 2 annotators agree**, which is the case for 75% of the data.

**It is standard practice to restrict evaluation to only 4 emotion classes (neutral, happy, sad, and angry)**, because the lack of data for some emotion classes can make results unreliable, and we do so for our model as well. We also **merge the happiness and excitement classes** to remain consistent with other research.

### Overall Distribution of Data

| Emotion | Number of Utterances |
| ------------- | -------------------- |
| Neutral | 1,708 (30.9%) |
| Happy+Excited | 1,636 (29.6%) |
| Sad | 1,084 (19.6%) |
| Angry | 1,103 (19.9%) |
| **Total** | **5,531 (100%)** |

### Top results published on the IEMOCAP dataset

![](https://i.imgur.com/Ts5mqyO.png =1000x)

https://docs.google.com/spreadsheets/d/1aMbzGV21qNTgy1PKhSfBDHDMO3bQ5ZIsvlHbc6jksUk/edit?usp=sharing

## Evaluation Procedure

![](https://i.imgur.com/rgBy1nr.png =500x)

We evaluate our model using **speaker-independent cross-validation**, a standard and rigorous evaluation method and the most commonly used technique when evaluating models on the IEMOCAP dataset.

We divide the entire dataset into **10 folds, one per speaker** (meaning that one fold contains all the utterances from a single speaker). We then **train and evaluate our model 10 times**, based on 10 different "splits" of the data. For each split, 8 folds/speakers are used for training, one for validation, and the last one for testing. We report our model's accuracy as the **average of the test-set accuracy over these 10 splits.** Training the model multiple times like this gives a more reliable estimate of the model's performance and its generalization error.

### Audio Features

We combine two types of audio features to pass to our model.

* **Mel Spectrogram:** Represents the power of each frequency band over time. It is obtained by applying a Fast Fourier Transform to the signal and mapping the result to the Mel scale.

![](https://i.imgur.com/WxeRFbN.png)

* **Mel-Frequency Cepstral Coefficients (MFCC):** Represent the shape of the human vocal tract, which is responsible for sound generation. They are obtained through a mathematical transformation applied to the Mel spectrogram.

![](https://i.imgur.com/Dt5fFkS.png)

To compute these features for the entire audio file, we **first break down the signal into short windows of 250 milliseconds each** (with a stride of 64 ms). The Mel spectrograms and MFCCs are then computed for each individual window, resulting in a **time-series representation**. At each time step we compute the Mel spectrogram over **128 frequency bins**, as well as **40 Mel-Frequency Cepstral Coefficients**. The final representation of a given audio file is therefore a total of **168 features per time step.**
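For illustration, the sketch below reproduces this windowing and feature-extraction step with the open-source `librosa` library. The library choice, function names, and file path are assumptions made for the example; only the 250 ms window, 64 ms stride, 128 Mel bins, and 40 MFCCs come from the procedure described above.

```python
import numpy as np
import librosa


def extract_features(path, sr=16000):
    """Illustrative sketch: 128 log-Mel bands + 40 MFCCs per 250 ms window (64 ms stride).

    librosa is an assumed tooling choice, not necessarily the pipeline used by Venus Engine.
    """
    y, sr = librosa.load(path, sr=sr)
    n_fft = int(0.250 * sr)   # 250 ms analysis window
    hop = int(0.064 * sr)     # 64 ms stride

    # 128-bin Mel spectrogram (power), converted to dB
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop, n_mels=128)
    log_mel = librosa.power_to_db(mel)

    # 40 MFCCs derived from the same log-Mel spectrogram
    mfcc = librosa.feature.mfcc(S=log_mel, sr=sr, n_mfcc=40)

    # Concatenate along the feature axis -> shape (time_steps, 168)
    return np.concatenate([log_mel, mfcc], axis=0).T


features = extract_features("utterance.wav")  # hypothetical file path
print(features.shape)                         # (num_windows, 168)
```

Computing the MFCCs directly from the log-Mel spectrogram avoids running a second short-time Fourier transform, and concatenating the 128 Mel bands with the 40 coefficients yields the 168-dimensional vector per time step described above.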
## Multimodal Model

In our multimodal deep learning model, we combine audio features with textual ones. It has been shown in the literature that complementing acoustic features with linguistic ones can greatly improve model accuracy.

![](https://i.imgur.com/4EIb2GC.png)

### Textual Features

We use pretrained [GloVe Embeddings](https://nlp.stanford.edu/projects/glove/) to represent individual words in a vector space. The embeddings were pretrained by a research team at Stanford on a large corpus of text gathered from the Internet (mainly in English). Although GloVe has a total vocabulary of ~1.9M words, we only use the 20,000 most frequent words for computational efficiency.

**Important Notes:** For our evaluation on the IEMOCAP dataset, we use the official transcriptions provided by the authors. The performance of the system is lower in real situations, because transcriptions will come from an external ASR (Automatic Speech Recognition) system. The resulting drop in performance is estimated to be around 3% of accuracy.

### Accuracy

Weighted accuracy is the fraction of instances predicted correctly (i.e., total correct predictions divided by total instances). Unweighted accuracy is computed by taking the average, over all classes, of the fraction of correct predictions in each class (i.e., the number of correctly predicted instances in that class, divided by the total number of instances in that class).

The distinction between these two measures is especially useful when some classes are under-represented. Unweighted accuracy gives the same weight to each class, regardless of how many samples of that class the dataset contains, while weighted accuracy weighs each class according to the number of samples that belong to that class in the dataset.

Here, recall refers to the true positive rate, $\frac{\text{true positives}}{\text{true positives} + \text{false negatives}}$, while precision refers to $\frac{\text{true positives}}{\text{true positives} + \text{false positives}}$.

| Metric | Average over 10 folds | Standard deviation |
|---------|--------------------| ------ |
| Weighted Accuracy | **75.26%** | 3.1% |
| Unweighted Accuracy | **76.88%** | 3.5% |

### Confusion Matrix

![](https://i.imgur.com/IGSCAy0.png)

## Preprocessing Algorithms

The Emotion AI is deployed in an **AI-based meditation application**: users say how they are feeling, the system recognizes their emotional state, and it serves them specific content to help them relax, such as virtual meditation and breathing exercises.

![Imgur](https://i.imgur.com/dodN8C2.png)

The Venus Engine uses a state-of-the-art voice activity detection (VAD) algorithm to preprocess the data, so the system can tell the difference between human voice and other sounds (silence, random background music, etc.).
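As a minimal illustration of this preprocessing step (not the Venus Engine's actual algorithm), the sketch below uses the open-source WebRTC VAD (`py-webrtcvad`) to keep only the voiced frames of a recording; the library, aggressiveness setting, and file name are assumptions made for the example.

```python
# pip install webrtcvad   (assumed tooling; illustrative only)
import wave

import webrtcvad


def speech_frames(wav_path, aggressiveness=2, frame_ms=30):
    """Yield (timestamp_seconds, is_speech) flags for 30 ms frames of a 16-bit mono WAV file."""
    with wave.open(wav_path, "rb") as wf:
        assert wf.getnchannels() == 1 and wf.getsampwidth() == 2
        sample_rate = wf.getframerate()  # WebRTC VAD requires 8, 16, 32 or 48 kHz
        pcm = wf.readframes(wf.getnframes())

    vad = webrtcvad.Vad(aggressiveness)  # 0 = least aggressive filtering, 3 = most
    bytes_per_frame = int(sample_rate * frame_ms / 1000) * 2  # 16-bit samples

    for offset in range(0, len(pcm) - bytes_per_frame + 1, bytes_per_frame):
        frame = pcm[offset:offset + bytes_per_frame]
        yield offset / (2 * sample_rate), vad.is_speech(frame, sample_rate)


# Keep only the voiced portions before passing the audio to the emotion model.
voiced_timestamps = [t for t, is_speech in speech_frames("user_recording.wav") if is_speech]
```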