# MSP-Lab SER Hackathon (Fighting!!!)
> [name=David, Lucas, Seong-Gyun, Luz, Ali, Abinay, Jeri]
>
> [time=Sat, Jun 18, 2022 12:19 PM]

# [Github](https://github.com/sgleem/SS_for_SER/tree/development)
# [UTD BOX](https://utdallas.box.com/s/epmdxo2fm8qorntezrbmz9tr5sulnqzn)
# [Results Table](https://docs.google.com/spreadsheets/d/1yuQ106ZgJ2dZ7TEpuq9F16O9tTOsO4m2cm1F6vKF1nY/edit?usp=sharing)
# [Overleaf](https://www.overleaf.com/7187996787gtxmgpjnsmfk)

# Experiments to Run
* USC-IEMOCAP:
    - David's Rule/Multi-label/ALL_secondary/Hard-label/All_partitions
    - Plurality rule/All experiments
* CREMA-D
    - Majority voting/Hard-label&Soft-label&Distribution-label/ALL_primary/Partition5
* MSP-IMPROV
    - Majority voting/Soft-label/four_primary/Partition4
    - Plurality rule/Hard-label&Soft-label&Distribution-label/four_primary/Partition4
    - Plurality rule/Soft-label/ALL_secondary/Partition6

# Computational Resources
## Taiwan Computational Cloud (80 GPUs)

------------------------------------------

# To-do list
- [x] Code
    - [ ] Clean code
    - [x] Learn how to use GitHub (git)
    - [ ] Learn how to use Docker
    - [ ] Modify code for categorical emotion classification tasks
- [ ] Pre-trained model
    - [ ] Know the differences between pre-trained models
- [ ] SER Model
    - [ ] Replace pooling layer with Winston's chunk-level mechanism ![](https://i.imgur.com/oefpQuA.png)
    - [ ] How to save models
- [ ] Experiments
    - [ ] List all accessible computation resources
        * Each pre-trained model needs a different amount of GPU memory
    - [ ] List all experiments
        * Where to save trained model weights
- [ ] Tasks
    - [ ] Is it possible to adapt the models for continuous-level emotion recognition?
- [ ] Objective function design
    * How to define the loss functions for models to learn distribution labels
        - [ ] Penalize predictions that contain emotions the ground truth does not have
        - [ ] Kullback–Leibler divergence lets models learn well when the ground truth has specific emotions, but it puts no penalty on predicted emotions that the ground truth lacks
- [x] Dataset usage
    * Make sure each dataset is well-set
        - [x] Format
        - [x] Audios
        - [x] Labels

------------------------------------------

# Goal: [AAAI 2023](https://aaai.org/Conferences/AAAI-23/)
* **August 1, 2022: Submit to Dr. Busso**
* **August 8, 2022: Abstracts due at 11:59 PM UTC-12**
* **August 15, 2022: Full papers due at 11:59 PM UTC-12**

# [Opponent product](https://twitter.com/audeering/status/1539973140570198016?s=21&t=UIM5jm1_VBtEFr33HaWsaQ)

------------------------------------------

# Outlines
> [TOC]

------------------------------------------

# Progress
## [Install HuggingFace](https://huggingface.co/docs/transformers/installation)
## [Re-implement Code of Dawn of the transformer era in SER](https://cometmail-my.sharepoint.com/:u:/g/personal/hdc210001_utdallas_edu/EdwIMZKJ3dpDgssPm24EHnYB_fhG1ymKG5S5XOJV44VqXA)
## Experiments (Audios/Labels)
* MSP-PODCAST v1.7
    - [x] Attributes **by Seong-Gyun**
    - [ ] Categories
* MSP-PODCAST v1.10
    - [x] Attributes **by Seong-Gyun**
    - [x] Categories
* USC-IEMOCAP
    - [x] Attributes
    - [x] Categories
    - [ ] Cross-corpus (MSP-Podcast) vs. within-corpus performance **by Lucas**
* MSP-IMPROV
    - [x] Attributes **by Lucas**
    - [x] Categories
    - [ ] Cross-corpus (MSP-Podcast) vs. within-corpus performance **by Lucas**
* CREMA-D
    - [x] Categories **Only**
* (Optional) NTHU-NNIME (In Chinese)
* Modify model code for two different tasks
    - [x] Attributes **by Seong-Gyun**
    - [ ] Categories **by Seong-Gyun**
* Provide labels of all databases
    - [x] Attributes **by David**
    - [ ] Categories **by David**

------------------------------------------

# Research Questions
## Two tasks
- [ ] Efficiency vs Performance
    - [ ] Does a chunk-based model affect performance?
    - [ ] Model size vs Speed
- [ ] Is it better to combine with openSMILE LLD features?
- [ ] How does normalization affect SER performance when the pre-trained models are used as upstream?
    - [ ] Normalization by each utterance
    - [ ] Normalization by training set
    - [ ] Normalization by speaker
- [ ] Does curriculum learning improve SER performance when the pre-trained models are used as upstream?
    - [ ] Training models starting from higher-agreement utterances
- [ ] Do the SOTA SER systems need calibration of their predictions?

## Attributes (Arousal, Dominance, Valence)
- [ ] Can we reproduce the results of the paper ([DAWN OF THE TRANSFORMER ERA IN SPEECH EMOTION RECOGNITION: CLOSING THE VALENCE GAP](https://arxiv.org/pdf/2203.07378.pdf))?
- [ ] How do the other pre-trained models on [the SUPERB benchmark leaderboard](https://superbbenchmark.org/leaderboard), such as WavLM, data2vec, or wav2vec, perform?
- [ ] What is the cross-corpus performance without fine-tuning (e.g., train on MSP-PODCAST v1.10 and test on USC-IEMOCAP)?
- [ ] How robust is the SOTA model under noisy conditions (using MSP-PODCAST v1.8)?
- [ ] Is the perception of emotional attributes universal (cross-language emotion recognition without fine-tuning)?
(might need to use multilingual pre-trained models, such as [XLSR-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) or XLM)

## Categories (Anger, Sadness, Happiness, Neutral, to name a few)
- [ ] How does the SOTA model perform with various pre-trained models on emotion classification tasks?
- [ ] Is wav2vec still the most effective SSL feature for emotion classification across multiple datasets (4-class emotion classification), as reported in the analyses of the [IS2021 paper](https://www.isca-speech.org/archive/pdfs/interspeech_2021/keesing21_interspeech.pdf)?
- [ ] How do the SOTA models perform when trained with different label learning methods?
    - [ ] Hard-label learning
    - [ ] Soft-label learning
    - [ ] Multi-label learning
    - [ ] Distribution learning
- [ ] What is the cross-corpus performance without fine-tuning (e.g., train on MSP-PODCAST v1.10 and test on USC-IEMOCAP)?
- [ ] Is the perception of categorical emotions universal (cross-language emotion recognition without fine-tuning)? (might need to use multilingual pre-trained models, such as [XLSR-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) or XLM)
- [ ] How to deal with long-tail (very unbalanced) emotion recognition?

------------------------------------------

# Installation Environment
## (Recommended) [Github](https://github.com/sgleem/SS_for_SER/tree/development)
## For Python v3.7.13 & CUDA v11.0 & PyTorch v1.7.1
0. Download the requirement file, [`huggin-face_env.txt`](https://utdallas.box.com/s/s36yt73h6mhyzoaa806lc0bi885spsqk)
1. (base) `conda create --name HuggingFace --file huggin-face_env.txt`
2. (base) `conda activate HuggingFace`
3. (HuggingFace) `pip install transformers` **(install HuggingFace)**
4. (HuggingFace) `python -c "from transformers import pipeline; print(pipeline('sentiment-analysis')('I love you'))"`

## For Python v3.9 & CUDA v11.2 & PyTorch v1.11.0
0. Download the requirement file, [`hugging-face_python_3-9`](https://utdallas.box.com/s/5np8cy0gpusfu0zcg5o5kba8zb8o415a)
1.
   (base) `conda create --name HuggingFace --file hugging-face_python_3-9`
2. (base) `conda activate HuggingFace`
3. (HuggingFace) `pip install transformers` **(install HuggingFace)**
4. (HuggingFace) `python -c "from transformers import pipeline; print(pipeline('sentiment-analysis')('I love you'))"`

**PS: if you use this version, do the following:** ***change line 34 of train.py if you are using the code provided by Seong-Gyun***
* Replace `torch.set_deterministic(True)` with `torch.use_deterministic_algorithms(True)`

------------------------------------------

# Code
## Usage
### Model type (`--model_type`)
* wav2vec2-base, wav2vec2-large, **wav2vec2-large-robust**
* hubert-base, **hubert-large**
* wavlm-base, wavlm-base-plus, **wavlm-large**
* data2vec-base, **data2vec-large**

### Default model type
* wav2vec2: **wav2vec2-large-robust**
* hubert: **hubert-large**
* wavlm: **wavlm-large**
* data2vec: **data2vec-large**

### Model GPU estimated memory requirements

| Model | Batch size | GPU memory | Estimated time per epoch (hr) |
| --- | --- | --- | --- |
| wav2vec2-base | 32 | 17GB | 1 |
| wav2vec2-large | 32 | 50GB | 1 |
| wav2vec2-large-robust | 32 | 40GB | 0.85 |
| hubert-base | 32 | 30GB | 0.50 |
| hubert-large | 32 | 50GB | 1 |
| wavlm-base | 32 | 30GB | 0.6 |
| wavlm-base-plus | 32 | 30GB | 0.6 |
| wavlm-large | 32 | 50GB | 1.35 |
| data2vec-base | 32 | 33GB | 0.5 |
| data2vec-large | 32 | 50GB | 1.35 |

### Computational Resources
* Cost: $1.5 USD per hour per GPU
* Each GPU has 32 GB of memory
* Maximum number of GPUs in one virtual environment: 8
* Maximum total GPU memory: 256 GB

### Corpus Estimated Cost
* MSP-PODCAST v1.10

| Model | Number of GPUs | GPU memory | Number of epochs | Estimated time per epoch | Cost |
| --- | --- | --- | --- | --- | --- |
| wav2vec2-base | 1 | 12GB | | | |
| wav2vec2-large | 2 | 50GB | | | |
| wav2vec2-large-robust | 2 | 40GB | | | |
| hubert-base | 1 | 30GB | | | |
| hubert-large | 2 | 50GB | | | |
| wavlm-base | 1 | 30GB | | | |
| wavlm-base-plus | 1 | 30GB | | | |
| wavlm-large | 2 | 50GB | | | |
| data2vec-base | 2 | 33GB | | | |
| data2vec-large | 2 | 50GB | | | |

## Definition
* categorical
    1. Primary emotions: every rater **can choose only one emotion** from the options pool
    2. Secondary emotions: every rater can choose **more than one emotion** from the options pool
    3. Single-label task: every data sample has only one emotion as ground truth
    4. Multi-label task: every data sample can have co-occurring emotions (one or more) as ground truth
* label learning:
    * Example: (anger, sadness, neutral, happiness); there are 2 anger and 3 neutral votes
        * Soft-label: **(0.4, 0.0, 0.6, 0.0)**
        * Soft-label with alpha=0.05 label smoothing: **(0.39, 0.016667, 0.576667, 0.016667)**
        * The smoothed values above follow $(1-\alpha)\,y_i + \frac{\alpha}{k-1}(1-y_i)$ with $k=4$ classes, e.g. $0.95 \times 0.4 + 0.016667 \times 0.6 = 0.39$
* Single-label task
    * Hard-label learning (cross-entropy)
        * Hard-label of the example: **(0,0,1,0)**
        * With label smoothing, the label of the example becomes: **(0.016667, 0.016667, 0.95, 0.016667)**
    * Soft-label learning (cross-entropy)
        * Soft-label: **(0.4, 0.0, 0.6, 0.0)**
        * With label smoothing, the label of the example becomes: **(0.39, 0.016667, 0.576667, 0.016667)**
* Multi-label task
    * Multi-label learning (binary cross-entropy): every emotion that received at least one annotation counts, no matter how many annotations it received
        * The example with label: **(1,0,1,0)**
        * Output activation layer: **Sigmoid**
        * The probabilities can be binarized with the threshold **0.5**
    * Distribution-label learning (binary cross-entropy): learn the distribution similarity between ground truths and predictions
        * The example will be the same as the soft-label **(0.39, 0.016667, 0.576667, 0.016667)**
        * The distribution can also be binarized with the threshold **1/k** (where k is the number of classes); for 4-class emotion classification, the threshold is $1/4=0.25$. Therefore, the multi-hot output vector is **(1,0,1,0)**

## Arguments
0. Database
    1.
       MSP-IMPROV
        * Example
            1. Partioned_data_secondary
                1. labels_consensus_EmoS_4class_P
                    1. labels_consensus_1.csv
            * D: David's rule
            * P: Plurality
            * M: Majority
            * different datasets have different splits
            * IMPROV has 6 splits
    2. MSP-Podcast
        * one partition
    3. USC-IEMOCAP
        * five partitions
    4. CREMA-D
        * five partitions
    5. NNIME
        * five partitions
2. Model (sorted by the results on [SUPERB leaderboard](https://superbbenchmark.org/leaderboard))
    1. WavLM
    2. data2vec
    3. HuBERT
    4. wav2vec2
    5. wav2vec **(Hugging Face doesn't support wav2vec, so make it lower priority)**
3. pre-processing
    1. Normalization
        1. Feature Normalization
            1. utterance norm (U-norm)
            2. training norm (T-norm)
            3. speaker norm (S-norm)
            4. no norm (N-norm)
        2. Label Normalization for Emotional Attribute
            1. MSP-Podcast (ranged from 1 to 7)
                - $(label - 1) / (7 - 1)$
            2. MSP-IMPROV (ranged from 1 to 5)
                - Flip arousal label (from "1 to 5" to "5 to 1") (**Fixed**)
                    - 6 - aro
                - $(label - 1) / (5 - 1)$
            3. USC-IEMOCAP (ranged from 1 to 5)
                - $(label - 1) / (5 - 1)$
            4. NTHU-NNIME (ranged from 1 to 5)
                - $(label - 1) / (5 - 1)$
            5. CREMA-D: (no emotional attribute)
    2. Label selection
        * categorical
            1. majority vote
                1. Hard label
                2. Soft label
            2. plurality rule
                1. Hard label
                2. Soft label
            3. multi-label
                1. Multiple-hot label
            4. David's rule
                1. Distribution label
4. multi-task vs single-task learning
    * Single-task
        * categorical
            1. single-label task
                * 4-class
            2. multi-label task
                * 4-class
                * primary emotions
                * secondary emotions
5. loss
    * categorical
        1. **cross-entropy** for hard/soft label
        2. **binary cross-entropy** for multiple-hot label
        3. **Kullback–Leibler divergence** for distribution label
    * dimensional (VAD)
        1. (1 - CCC) *(CCC, concordance correlation coefficient)*
6. classes
    1. categorical
        1. 4-class for **all datasets** *(Neutral, Anger, Sadness, Happiness)*
        2. 8-class **primary** emotions for MSP-PODCAST (v1.8/v1.9/v1.10)
        3. 6-class **primary** emotions for CREMA-D
        4.
           16-class secondary emotions for MSP-PODCAST (v1.8/v1.9/v1.10)
        5. 10-class secondary emotions for MSP-IMPROV
        6. 9-class secondary emotions for USC-IEMOCAP
        7. 11-class secondary emotions for NTHU-NNIME
    2. dimensional (VAD)
7. Hyper-parameters
    1. Epochs
    2. Batch size
    3. sequence length?
    4. dropout?
    5. depth?
8. Evaluation metrics
    1. categorical
        * Single-label task
            1. UAR
            2. UAP
            3. ACC
            4. Macro-F1
            5. Micro-F1
            6. Weighted-F1
        * Multi-label task
            1. Hamming loss
            2. Ranking loss
            3. ...
        * Distribution-label
            1. cosine similarity
            2. KLD
            3. RMSE
    2. dimensional (VAD)
        * CCC

------------------------------------------

# Task
## Categories
* Majority Vote (more than 50% of the raters agree on the label)
* Plurality Rule (the class that gets the most votes)
* David's Rule (set a threshold, such as 1/k, where k is the number of emotion classes)

### Objective Functions
* Hard-label learning
    * Objective function: cross-entropy
    * Output layer activation function: softmax
* Soft-label learning
    * Objective function: cross-entropy
    * Output layer activation function: softmax
* Multi-label learning
    * Objective function: binary cross-entropy
    * Output layer activation function: **sigmoid**
* Distribution learning
    * Objective function: Kullback–Leibler divergence (KLD)
    * Output layer activation function: softmax

## Attributes
* Average all raters' answers

### Objective Functions
* 1 - CCC (concordance correlation coefficient)

------------------------------------------

# Resources
## Emotion Databases Summary
* All datasets have 4 emotions ($angry, sad, neutral, happy$)
> Choice means how many emotions annotators can provide

| **Dataset** | **Choice** | **Class** | **Processed** | angry | sad | neutral | happy | other | frustrated | annoyed | disappointed | disgust | depressed | contempt | confused | concerned | fear | surprise | amused | excited | joy | relaxed |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **MSP-IMPROV** (**Primary**) | **Single** | 5 | **V** | **V** | **V** | **V** | **V** | **V** | | | | | | | | | | | | | | |
| **MSP-PODCAST** (**Primary**) | **Single** | 9 | **V** | **V** | **V** | **V** | **V** | **V** | | | | **V** | | **V** | | | **V** | **V** | | | | |
| **CREMA-D** | **Single** | 6 | **X (not split)** | **V** | **V** | **V** | **V** | | | | | **V** | | | | | **V** | | | | | |
| **MSP-IMPROV** (**Secondary**) | **Multiple** | 11 | **V** | **V** | **V** | **V** | **V** | **V** | **V** | | | **V** | **V** | | | | **V** | **V** | | **V** | | |
| **MSP-PODCAST** (**Secondary**) | **Multiple** | 17 | **V** | **V** | **V** | **V** | **V** | **V** | **V** | **V** | **V** | **V** | **V** | **V** | **V** | **V** | **V** | **V** | **V** | **V** | | |
| **USC-IEMOCAP** | **Multiple** | 10 | **V** | **V** | **V** | **V** | **V** | **V** | **V** | | | **V** | | | | | **V** | **V** | | **V** | | |
| **NNIME** | **Multiple** | 12 | | **V** | **V** | **V** | **V** | **V** | **V** | | **V** | | | | | | **V** | **V** | | **V** | **V** | **V** |

## MSP-PODCAST v1.10
### Categories (**Primary Emotion**)
### Categories (**Secondary Emotion**)
### 4class Categories (**Primary Emotion**)
* 4class = $angry, sad, neutral, happy$

| Method | Class | Task | Discard | Used |
| --- | --- | --- | --- | --- |
| Majority Vote | 4 | Single-label | 26.08% (27,190) | 73.92% (77,077) |
| Plurality Rule | 4 | Single-label | 19.44% (20,270) | 80.56% (83,997) |
| David's Rule | 4 | Multi-label | 2.48% (2,589) | 97.52% (101,678) |

* Majority Vote ![](https://i.imgur.com/WRvpxxP.png)
* Plurality Rule ![](https://i.imgur.com/3zNOqHl.png)
* David's Rule ![](https://i.imgur.com/2Y4DB7S.png)

### 4class Categories (**Secondary Emotion**)
* 4class = $angry, sad,
neutral, happy$

| Method | Class | Task | Discard | Used |
| --- | --- | --- | --- | --- |
| Majority Vote | 4 | Single-label | 32.79% (34,186) | 67.21% (70,081) |
| Plurality Rule | 4 | Single-label | 17.80% (18,564) | 82.20% (85,703) |
| David's Rule | 4 | Multi-label | 0.40% (418) | 99.60% (103,849) |

* Majority Vote ![](https://i.imgur.com/kYTpeRH.png)
* Plurality Rule ![](https://i.imgur.com/jiqvFlE.png)
* David's Rule ![](https://i.imgur.com/BZNnkUM.png)

### 8class Categories (**Primary Emotion**)
* 8class = $angry, sad, disgust, contempt, fear, neutral, surprise, happy$

| Method | Class | Task | Discard | Used |
| --- | --- | --- | --- | --- |
| Majority Vote | 8 | Single-label | 48.65% (50,721) | 51.35% (53,546) |
| Plurality Rule | 8 | Single-label | 18.91% (19,718) | 81.09% (84,549) |
| David's Rule | 8 | Multi-label | 0.00% (1) | 99.99% (104,266) |

### 16class Categories (**Secondary Emotion**)
* 16class = $angry, frustrated, annoyed, disappointed, sad, disgust, depressed, contempt, confused, concerned, fear, neutral, surprise, amused, excited, happy$

| Method | Class | Task | Discard | Used |
| --- | --- | --- | --- | --- |
| Majority Vote | 16 | Single-label | 88.68% (92,467) | 11.32% (11,800) |
| Plurality Rule | 16 | Single-label | 29.31% (30,562) | 70.69% (73,705) |
| David's Rule | 16 | Multi-label | 0.00% (0) | 100% (104,267) |

## MSP-PODCAST v1.9
### Categories (**Primary Emotion**)
### Categories (**Secondary Emotion**)
### 4class Categories (**Primary Emotion**)
* 4class = $angry, sad, neutral, happy$

| Method | Class | Task | Discard | Used |
| --- | --- | --- | --- | --- |
| Majority Vote | 4 | Single-label | 25.23% (21,792) | 74.77% (64,597) |
| Plurality Rule | 4 | Single-label | 19.13% (16,525) | 80.87% (69,864) |
| David's Rule | 4 | Multi-label | 3.47% (2,997) | 96.53% (83,392) |

* Majority Vote ![](https://i.imgur.com/7OIzqiR.png)
* Plurality Rule ![](https://i.imgur.com/28gYzu0.png)
* David's Rule ![](https://i.imgur.com/abH3jzh.png)

### 4class Categories (**Secondary Emotion**)
* 4class = $angry, sad, neutral, happy$

| Method | Class | Task | Discard | Used |
| --- | --- | --- | --- | --- |
| Majority Vote | 4 | Single-label | 31.38% (27,110) | 68.62% (59,279) |
| Plurality Rule | 4 | Single-label | 00000 | 000000 |
| David's Rule | 4 | Multi-label | 0.52% (451) | 99.48% (85,938) |

* Majority Vote ![](https://i.imgur.com/pEchJWs.png)
* Plurality Rule ![](https://i.imgur.com/kBkX6Kv.png)
* David's Rule ![](https://i.imgur.com/QlyLOiA.png)

### 8class Categories (**Primary Emotion**)
* 8class = $angry, sad, disgust, contempt, fear, neutral, surprise, happy$

| Method | Class | Task | Discard | Used |
| --- | --- | --- | --- | --- |
| Majority Vote | 8 | Single-label | 45.90% (39,654) | 54.10% (46,735) |
| Plurality Rule | 8 | Single-label | 17.60% (15,206) | 82.40% (71,183) |
| David's Rule | 8 | Multi-label | 0.00% (1) | 99.99% (86,388) |

### 16class Categories (**Secondary Emotion**)
* 16class = $angry, frustrated, annoyed, disappointed, sad, disgust, depressed, contempt, confused, concerned, fear, neutral, surprise, amused, excited, happy$

| Method | Class | Task | Discard | Used |
| --- | --- | --- | --- | --- |
| Majority Vote | 16 | Single-label | 86.29% (74,542) | 13.71% (11,847) |
| Plurality Rule | 16 | Single-label | 27.54% (23,794) | 72.46% (62,595) |
| David's Rule | 16 | Multi-label | 0.00% (0) | 100% (86,389) |

## IEMOCAP
### Categories
![](https://i.imgur.com/Lo9us4F.png)

| Session | Total | frustrated | angry | sad | disgust | excited | fear | neutral | surprise | happy | other |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| All | 34,367 | 7,994 | 4,734 | 4,016 | 264 | 4,598 | 415 | 7,007 | 811 | 3,461 | 1,067 |

### 4class Categories
* 4class = $angry, sad, neutral, happy$

| Method | Class | Task | Discard | Used |
| --- | --- | --- | --- | --- |
| Majority Vote (\*) | 4 | Single-label | 44.90% (4,508) | 55.10% (5,531) |
| Majority Vote | 4 | Single-label | 25.17% (2,527) | 74.83% (7,512) |
| Plurality Rule | 4 | Single-label | 24.94% (2,504) | 75.06% (7,535) |
| David's Rule | 4 | Multi-label | 10.41% (1,045) | 89.59% (8,994) |

> \*: Previous works only take one annotation from each annotator, but annotators might actually provide more than one emotion label.

### 9class Categories (All emotions)
* 9class = $frustrated, angry, sad, disgust, excited, fear, neutral, surprise, happy$

| Method | Class | Task | Discard | Used |
| --- | --- | --- | --- | --- |
| Majority Vote | 9 | Single-label | 31.37% (3,149) | 68.63% (6,890) |
| Plurality Rule | 9 | Single-label | 25.32% (2,542) | 74.68% (7,497) |
| David's Rule | 9 | Multi-label | 0% (0) | 100% (10,039) |

## MSP-IMPROV
### Categories (**Primary Emotion**)

| Session | Total | angry | sad | neutral | happy | other |
| --- | --- | --- | --- | --- | --- | --- |
| All | 61,702 | 7,960 | 8,632 | 24,393 | 17,558 | 3,159 |

### Categories (**Secondary Emotion**)

| Session | Total | depressed | frustrated | angry | sad | disgust | excited | fear | neutral | surprise | happy | other |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| All | 105,587 | 4,174 | 10,552 | 8,955 | 10,017 | 4,214 | 9,418 | 1,775 | 26,758 | 6,199 | 19,057 | 4,468 |

### 4class Categories (**Primary Emotion**)
* 4class = $angry, sad, neutral, happy$

| Method | Class | Task | Discard | Used |
| --- | --- | --- | --- | --- |
| Majority Vote | 4 | Single-label | 9.20% (776) | 90.80% (7,662) |
| Plurality Rule | 4 | Single-label | 5.63% (475) | 94.37% (7,963) |
| David's Rule | 4 | Multi-label | 0.01% (1) | 99.99% (8,437) |

* Majority Vote ![](https://i.imgur.com/QTRLFrq.png)
* Plurality Rule ![](https://i.imgur.com/i3qXI96.png)
* David's Rule ![](https://i.imgur.com/lnxxFnu.png)

### 4class Categories (**Secondary Emotion**)
* 4class = $angry, sad, neutral, happy$

| Method | Class | Task | Discard | Used |
| --- | --- | --- | --- | --- |
| Majority Vote | 4 | Single-label | 12.42% (1,048) | 87.58% (7,390) |
| Plurality Rule | 4 | Single-label | 6.28% (530) | 93.72% (7,908) |
| David's Rule | 4 | Multi-label | 0.01% (1) | 99.99% (8,437) |

* Majority Vote ![](https://i.imgur.com/drHhrGT.png)
* Plurality Rule ![](https://i.imgur.com/yBn9jzX.png)
* David's Rule ![](https://i.imgur.com/I9OvLqR.png)

### 10class Categories (**Secondary Emotion**)
* 10class = $depressed, frustrated, angry, sad, disgust, excited, fear, neutral, surprise, happy$

| Method | Class | Task | Discard | Used |
| --- | --- | --- | --- | --- |
| Majority Vote | 10 | Single-label | 54.17% (4,571) | 45.83% (3,867) |
| Plurality Rule | 10 | Single-label | 12.34% (1,041) | 87.66% (7,397) |
| David's Rule | 10 | Multi-label | 0.01% (1) | 99.99% (8,437) |

### Attributes

| Method | Task | Discard | Used |
| --- | --- | --- | --- |
| Average | Dom. | 0.62% (52/8,438) | 99.38% (8,386/8,438) |

## [CREMA-D](https://github.com/CheyneyComputerScience/CREMA-D)
### Focus on the audio alone
![](https://i.imgur.com/JWPEgXA.png)

| Session | Total | angry | sad | disgust | fear | neutral | happy |
| --- | --- | --- | --- | --- | --- | --- | --- |
| All | 73,254 | 10,376 | 7,404 | 9,588 | 8,831 | 32,145 | 4,910 |

### 6class Categories
* 6class = $angry, sad, disgust, fear, neutral, happy$

| Method | Class | Task | Discard | Used |
| --- | --- | --- | --- | --- |
| Majority Vote | 6 | Single-label | 35.80% (2,664) | 64.20% (4,778) |
| Plurality Rule | 6 | Single-label | 8.55% (636) | 91.45% (6,806) |
| David's Rule | 6 | Multi-label | 0.00% (0) | 100.00% (7,442) |

### 4class Categories
* 4class = $angry, sad, neutral, happy$

| Method | Class | Task | Discard | Used |
| --- | --- | --- | --- | --- |
| Majority Vote | 4 | Single-label | 13.76% (1,024) | 86.24% (6,418) |
| Plurality Rule | 4 | Single-label | 7.65% (569) | 92.35% (6,873) |
| David's Rule | 4 | Multi-label | 5.03% (374) | 94.97% (7,068) |

* Majority Vote ![](https://i.imgur.com/UIVa2cB.png)
* Plurality Rule ![](https://i.imgur.com/oTwnpNw.png)
* David's Rule ![](https://i.imgur.com/KECxoxe.png)

## NTHU-NNIME
![](https://i.imgur.com/2Sx1OIi.png)

| Session | Total | angry | frustrated | disappointed | sad | fear | neutral | surprise | excited | happy | relax | joy | other |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| All | 18,631 | 2,161 | 797 | 415 | 1,041 | 580 | 8,776 | 1,294 | 1,519 | 304 | 674 | 767 | 303 |

### 4class Categories
* 4class = $angry, sad, neutral, happy$
* **happy includes joy and happy**

| Method | Class | Task | Discard | Used |
| --- | --- | --- | --- | --- |
| Majority Vote | 4 | Single-label | 27.61% (1,545) | 72.39% (4,051) |
| Plurality Rule | 4 | Single-label | 27.50% (1,539) | 72.50% (4,057) |
| David's Rule | 4 | Multi-label | 21.46% (1,201) | 78.54% (4,395) |

* Majority Vote ![](https://i.imgur.com/FGwNKVj.png)
* Plurality Rule ![](https://i.imgur.com/7hoxIea.png)
* David's Rule ![](https://i.imgur.com/w5yfB90.png)

### 11class Categories
* 11class = $angry, frustrated, disappointed, sad, fear, neutral, surprise, excited, happy, relax, joy$

| Method | Class | Task | Discard | Used |
| --- | --- | --- | --- | --- |
| Majority Vote | 11 | Single-label | 32.34% (1,810) | 67.66% (3,786) |
| Plurality Rule | 11 | Single-label | 25.04% (1,401) | 74.96% (4,195) |
| David's Rule | 11 | Multi-label | 9.29% (520) | 90.71% (5,076) |

------------------------------------------

# Experimental Setup
## Features: SOTA SSL Speech Representations ([SUPERB Benchmark Leaderboard](https://superbbenchmark.org/leaderboard))
* [WavLM Large](https://arxiv.org/pdf/2110.13900)
    * [Github](https://github.com/microsoft/unilm/tree/master/wavlm)
    * [HuggingFace](https://huggingface.co/models?other=wavlm)
    * [Speaker Verification Demo on HuggingFace](https://huggingface.co/spaces/microsoft/wavlm-speaker-verification)
* [HuBERT Large](https://arxiv.org/abs/2106.07447)
    * [Github based on fairseq (Meta platform)](https://github.com/facebookresearch/fairseq/blob/main/examples/hubert/README.md)
    * [HuggingFace](https://huggingface.co/docs/transformers/model_doc/hubert)
* [data2vec Large](https://arxiv.org/pdf/2202.03555.pdf)
    * [Github](https://github.com/facebookresearch/fairseq/tree/main/examples/data2vec)
    * [HuggingFace](https://huggingface.co/docs/transformers/model_doc/data2vec)
* [wav2vec 2.0 Large](https://arxiv.org/abs/2006.11477)
    * [Github](https://github.com/facebookresearch/fairseq/blob/main/examples/wav2vec/README.md)
    * [HuggingFace](https://huggingface.co/facebook/wav2vec2-large-960h-lv60-self)

## Hyperparameter

| Hyperparameter | Value |
| --- | --- |
| Batch size | 16 |
| Epoch | 10 |

## Validation
## Scripts
### Arguments that you can change for train.py
* *model_type*: wav2vec2, hubert, data2vec, or wavlm; selects the model type that you want to run
* *seed*: changes the initial model weights and the minibatch order during training
* *batch_size, epochs, lr*: batch size, total number of epochs, and learning rate for training
> It will save the model that shows the best performance on the development set within the maximum number of epochs
* *hidden_dim, num_layers*: number of nodes and number of hidden layers for the classification head
* *model_path*: the directory where you want to save the model

------------------------------------------

# Results
## Attributes
### MSP-PODCAST v1.7

| Upstream | Normalization | Arousal | Dominance | Valence |
| --- | --- | --- | --- | --- |
| wav2vec2 | | 0.7051 | 0.6345 | 0.574 |
| WavLM | | 0.7108 | 0.6160 | 0.455 |
| HuBERT | | 0.7140 | 0.6550 | 0.452 |

| Upstream | Chunking | Normalization | Arousal | Dominance | Valence |
| --- | --- | --- | --- | --- | --- |
| Wav2vec2-large-robust | Average Pooling | T-norm | 0.7051 | 0.6345 | 0.574 |
| Wav2vec2-large-robust | LSTM-RNNAttenVec | T-norm | 0.6288 | 0.5384 | 0.029 |

### MSP-PODCAST v1.8
* **Clean**

| Upstream | Arousal | Dominance | Valence |
| --- | --- | --- | --- |
| wav2vec2 | | | |
| HuBERT | | | |
| Text | Text | Text | |

* **Noise (SNR: xxx)**

| Upstream | Arousal | Dominance | Valence |
| --- | --- | --- | --- |
| wav2vec2 | | | |
| HuBERT | | | |
| Text | Text | Text | |

* **Noise (SNR: xxx)**

| Upstream | Arousal | Dominance | Valence |
| --- | --- | --- | --- |
| wav2vec2 | | | |
| HuBERT | | | |
| Text | Text | Text | |

* **Noise (SNR: xxx)**

| Upstream | Arousal | Dominance | Valence |
| --- | --- | --- | --- |
| wav2vec2 | | | |
| HuBERT | | | |
| Text | Text | Text | |

### MSP-PODCAST v1.10

| Upstream | Normalization | Arousal | Dominance | Valence |
| --- | --- | --- | --- | --- |
| wav2vec2 | | 0.5846 | 0.4548 | 0.4586 |
| HuBERT | | 0.5448 | 0.4179 | 0.3306 |
| Text | | Text | Text | |

### USC-IEMOCAP
* Used pretrained model: **wav2vec**
* Normalization: **T-norm**

| Partition | Train | Development | Test | Arousal | Dominance | Valence |
| --- | --- | --- | --- | --- | --- | --- |
| 01 | Ses01,Ses02,Ses03 | Ses04 | Ses05 | 0.7164 | 0.5435 | 0.5671 |
| 02 | Ses02,Ses03,Ses04 | Ses05 | Ses01 | 0.6829 | 0.5839 | 0.5755 |
| 03 | Ses03,Ses04,Ses05 | Ses01 | Ses02 | 0.6706 | 0.4357 | 0.6484 |
| 04 | Ses04,Ses05,Ses01 | Ses02 | Ses03 | 0.7130 | 0.4270 | 0.5726 |
| 05 | Ses05,Ses01,Ses02 | Ses03 | Ses04 | 0.7232 | 0.4396 | 0.6067 |
| Average | - | - | - | 0.70122 | 0.48594 | 0.59406 |

### MSP-IMPROV
* Used pretrained model: **wav2vec**
* Normalization: T-norm

| Partition | Train | Development | Test | Arousal | Dominance | Valence |
| --- | --- | --- | --- | --- | --- | --- |
| 01 | Ses01,Ses02,Ses03,Ses04 | Ses05 | Ses06 | 0.6386 | 0.4837 | 0.6414 |
| 02 | Ses06,Ses01,Ses02,Ses03 | Ses04 | Ses05 | 0.6642 | 0.4577 | 0.6571 |
| 03 | Ses05,Ses06,Ses01,Ses02 | Ses03 | Ses04 | 0.5734 | 0.3834 | 0.5632 |
| 04 | Ses04,Ses05,Ses06,Ses01 | Ses02 | Ses03 | 0.7122 | 0.5499 | 0.4837 |
| 05 | Ses03,Ses04,Ses05,Ses06 | Ses01 | Ses02 | 0.6594 | 0.4404 | 0.4640 |
| 06 | Ses02,Ses03,Ses04,Ses05 | Ses06 | Ses01 | 0.5448 | 0.4179 | 0.3306 |
| Average | - | - | - | 0.6321 | 0.4555 | 0.5233 |

### NTHU-NNIME
## Categories
### MSP-PODCAST v1.7
### MSP-PODCAST v1.8
### MSP-PODCAST v1.9
### MSP-PODCAST v1.10
### USC-IEMOCAP
* **Majority Vote (Hard label; Single-label task)**

| Partition | Train | Development | Test | macroF1 | microF1 | weightedF1 |
| --- | --- | --- | --- | --- | --- | --- |
| 01 | Ses01,Ses02,Ses03 | Ses04 | Ses05 | | | |
| 02 | Ses02,Ses03,Ses04 | Ses05 | Ses01 | | | |
| 03 | Ses03,Ses04,Ses05 | Ses01 | Ses02 | | | |
| 04 | Ses04,Ses05,Ses01 | Ses02 | Ses03 | | | |
| 05 | Ses05,Ses01,Ses02 | Ses03 | Ses04 | | | |
| Average | - | - | - | | | |

* **Majority Vote (Soft label; Single-label task)**

| Partition | Train | Development | Test | macroF1 | microF1 | weightedF1 |
| --- | --- | --- | --- | --- | --- | --- |
| 01 | Ses01,Ses02,Ses03 | Ses04 | Ses05 | | | |
| 02 | Ses02,Ses03,Ses04 | Ses05 | Ses01 | | | |
| 03 | Ses03,Ses04,Ses05 | Ses01 | Ses02 | | | |
| 04 | Ses04,Ses05,Ses01 | Ses02 | Ses03 | | | |
| 05 | Ses05,Ses01,Ses02 | Ses03 | Ses04 | | | |
| Average | - | - | - | | | |

* **David's Rule (Soft label; Multi-label task)**

| Partition | Train | Development | Test | macroF1 | microF1 | weightedF1 |
| --- | --- | --- | --- | --- | --- | --- |
| 01 | Ses01,Ses02,Ses03 | Ses04 | Ses05 | | | |
| 02 | Ses02,Ses03,Ses04 | Ses05 | Ses01 | | | |
| 03 | Ses03,Ses04,Ses05 | Ses01 | Ses02 | | | |
| 04 | Ses04,Ses05,Ses01 | Ses02 | Ses03 | | | |
| 05 | Ses05,Ses01,Ses02 | Ses03 | Ses04 | | | |
| Average | - | - | - | | | |

### MSP-IMPROV
* **Majority Vote (Hard label; Single-label task)**

| Partition | Train | Development | Test | macroF1 | microF1 | weightedF1 |
| --- | --- | --- | --- | --- | --- | --- |
| 01 | Ses01,Ses02,Ses03,Ses04 | Ses05 | Ses06 | | | |
| 02 | Ses06,Ses01,Ses02,Ses03 | Ses04 | Ses05 | | | |
| 03 | Ses05,Ses06,Ses01,Ses02 | Ses03 | Ses04 | | | |
| 04 | Ses04,Ses05,Ses06,Ses01 | Ses02 | Ses03 | | | |
| 05 | Ses03,Ses04,Ses05,Ses06 | Ses01 | Ses02 | | | |
| 06 | Ses02,Ses03,Ses04,Ses05 | Ses06 | Ses01 | | | |
| Average | - | - | - | | | |

* **Majority Vote (Soft label; Single-label task)**

| Partition | Train | Development | Test | macroF1 | microF1 | weightedF1 |
| --- | --- | --- | --- | --- | --- | --- |
| 01 | Ses01,Ses02,Ses03,Ses04 | Ses05 | Ses06 | | | |
| 02 | Ses06,Ses01,Ses02,Ses03 | Ses04 | Ses05 | | | |
| 03 | Ses05,Ses06,Ses01,Ses02 | Ses03 | Ses04 | | | |
| 04 | Ses04,Ses05,Ses06,Ses01 | Ses02 | Ses03 | | | |
| 05 | Ses03,Ses04,Ses05,Ses06 | Ses01 | Ses02 | | | |
| 06 | Ses02,Ses03,Ses04,Ses05 | Ses06 | Ses01 | | | |
| Average | - | - | - | | | |

* **Plurality Rule (Hard label; Single-label task)**

| Partition | Train | Development | Test | macroF1 | microF1 | weightedF1 |
| --- | --- | --- | --- | --- | --- | --- |
| 01 | Ses01,Ses02,Ses03,Ses04 | Ses05 | Ses06 | | | |
| 02 | Ses06,Ses01,Ses02,Ses03 | Ses04 | Ses05 | | | |
| 03 | Ses05,Ses06,Ses01,Ses02 | Ses03 | Ses04 | | | |
| 04 | Ses04,Ses05,Ses06,Ses01 | Ses02 | Ses03 | | | |
| 05 | Ses03,Ses04,Ses05,Ses06 | Ses01 | Ses02 | | | |
| 06 | Ses02,Ses03,Ses04,Ses05 | Ses06 | Ses01 | | | |
| Average | - | - | - | | | |

* **Plurality Rule (Soft label; Single-label task)**

| Partition | Train | Development | Test | macroF1 | microF1 | weightedF1 |
| --- | --- | --- | --- | --- | --- | --- |
| 01 | Ses01,Ses02,Ses03,Ses04 | Ses05 | Ses06 | | | |
| 02 | Ses06,Ses01,Ses02,Ses03 | Ses04 | Ses05 | | | |
| 03 | Ses05,Ses06,Ses01,Ses02 | Ses03 | Ses04 | | | |
| 04 | Ses04,Ses05,Ses06,Ses01 | Ses02 | Ses03 | | | |
| 05 | Ses03,Ses04,Ses05,Ses06 | Ses01 | Ses02 | | | |
| 06 | Ses02,Ses03,Ses04,Ses05 | Ses06 | Ses01 | | | |
| Average | - | - | - | | | |

* **David's Rule (Soft label; Multi-label task)**

| Partition | Train | Development | Test | macroF1 | microF1 | weightedF1 |
| --- | --- | --- | --- | --- | --- | --- |
| 01 | Ses01,Ses02,Ses03,Ses04 | Ses05 | Ses06 | | | |
| 02 | Ses06,Ses01,Ses02,Ses03 | Ses04 
| Ses05 | | | | | 03 | Ses05,Ses06,Ses01,Ses02 | Ses03 | Ses04 | | | | | 04 | Ses04,Ses05,Ses06,Ses01 | Ses02 | Ses03 | | | | | 05 | Ses03,Ses04,Ses05,Ses06 | Ses01 | Ses02 | | | | | 06 | Ses02,Ses03,Ses04,Ses05 | Ses06 | Ses01 | | | | | Average | - | - | - | | | | ### [CREMA-D](https://github.com/CheyneyComputerScience/CREMA-D) ### NTHU-NNIME # Reference ## Pytorch Extention Tools * [pytorchlightning (Trainers)](https://www.pytorchlightning.ai/) * [NeptuneLogger in pytorchlightning](https://pytorch-lightning.readthedocs.io/en/stable/api/pytorch_lightning.loggers.neptune.html) * [HuggingFace Installation Doc.](https://huggingface.co/docs/transformers/installation) * [Mertics](https://github.com/Lightning-AI/metrics) * [Macro-F1](https://torchmetrics.readthedocs.io/en/stable/classification/f1_score.html) ## Develop Envirment * [Docker Desktop for Linux user manual](https://docs.docker.com/desktop/linux/) ## Other Papers Worth Reading * [M-SENA: An Integrated Platform for Multimodal Sentiment Analysis (ACL2020 Demo Track)](https://aclanthology.org/2022.acl-demo.20.pdf ) ## Paper ### Emotion Recognition * [Multi-modal Emotion Estimation for in-the-wild Videos](https://arxiv.org/pdf/2203.13032) * Findings * **Combination of wav2vec and ComParE got the best!** * [Dawn of the transformer era in speech emotion recognition: closing the valence gap](https://arxiv.org/pdf/2203.07378.pdf) * Same authers' paper * [Probing Speech Emotion Recognition Transformers for Linguistic Knowledge (INTERSPEECH 2022)](https://arxiv.org/pdf/2204.00400.pdf) * Findings: * **Fine-tuning the transformer layers is necessary.** * **Starting from a pre-trained model reduces the number of epochs needed to converge and improves performance stability across training runs with different seeds.** * **A reduction of training samples without loss in performance is only possible for arousal and dominance. 
With respect to valence, there is no sweet point in our data.** * Code: * [HuggingFace](https://github.com/audeering/w2v2-how-to) * [Seong-Gyun Code](https://cometmail-my.sharepoint.com/:u:/g/personal/hdc210001_utdallas_edu/EdwIMZKJ3dpDgssPm24EHnYB_fhG1ymKG5S5XOJV44VqXA) * Results: * Within-corpus result (Fine-tuned and tested with same corpus) U-Norm | | Arousal |Dominance| Valence | |-----------------|---------|---------|---------| |Original model |**0.744**|**0.655**|**0.638**| |MSP-Podcast v1.7 | 0.671 | 0.502 | 0.587 | |MSP-Podcast v1.10| 0.577 | 0.432 | 0.445 | |MSP-IMPROV (pt-1)| 0.655 | 0.467 | 0.613 | |MSP-IMPROV (pt-2)| 0.670 | 0.508 | 0.634 | * Cross-corpus result (finetuned with MSP-Podcast v1.10, tested with different corpus) | | Arousal |Dominance| Valence | |-----------------|----------|---------|---------| |MSP-Podcast v1.10| 0.577 | 0.432 | 0.445 | |MSP-IMPROV | 0.536 | 0.423 | 0.339 |
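
The attribute tables in this note do not name their metric. The "Dawn of the transformer era" paper reports the concordance correlation coefficient (CCC) for arousal, dominance, and valence, so the sketch below assumes the scores here are CCC values as well. The function name `ccc` and the plain-Python implementation are illustrative only; they are not taken from train.py or the repository.

```python
from math import fsum


def ccc(predictions, labels):
    """Concordance correlation coefficient of two equal-length sequences.

    Assumption: this is the metric behind the Arousal/Dominance/Valence
    columns; the note itself does not state it.
    """
    if len(predictions) != len(labels) or len(predictions) < 2:
        raise ValueError("need two sequences of equal length with at least 2 items")

    n = len(predictions)
    mean_p = fsum(predictions) / n
    mean_l = fsum(labels) / n

    # Population (divide-by-n) variance and covariance.
    var_p = fsum((p - mean_p) ** 2 for p in predictions) / n
    var_l = fsum((y - mean_l) ** 2 for y in labels) / n
    cov = fsum((p - mean_p) * (y - mean_l) for p, y in zip(predictions, labels)) / n

    denom = var_p + var_l + (mean_p - mean_l) ** 2
    if denom == 0.0:
        # Both sequences are constant and equal; CCC is undefined here,
        # and this sketch simply returns 0.0.
        return 0.0
    return 2.0 * cov / denom


if __name__ == "__main__":
    # Toy values, not taken from any corpus.
    print(ccc([0.2, 0.5, 0.9, 0.4], [0.1, 0.6, 0.8, 0.5]))
```

Unlike Pearson correlation, CCC also penalizes a shift in mean or scale between predictions and labels, which is one reason the choice of label normalization (T-norm, U-Norm) can change the reported numbers.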
