---
title: 'Asyrofi Project in WIDM Lab'
disqus: hackmd
---
\begin{equation}
\newcommand{\argmin}{\mathop{\mathrm{arg\,min}}}
\end{equation}
Asyrofi Project Progress in Web Intelligence & Data Mining (WIDM) Lab
===
These are several notes and reference papers that I have worked on and read so far. I hope I can explain them fluently with this note.
## Table of Contents
[TOC]
## :notebook: Several Notes from Resources
These are several notes from courses and other resources. I need to learn how to develop code and run several experiments, drawing on the following:
1. [ML 2021 Spring Website](https://speech.ee.ntu.edu.tw/~hylee/ml/2021-spring.php) by Lee, Hung-Yi
2. [Deep learning Course](https://www.youtube.com/playlist?list=PLVHZ98Gy86j4A7WkBbktu-gJE_gW3UqQC) by Prof. Sun, Min-Te
3. [Stanford CS224U: NLU Spring 2019](https://youtube.com/playlist?list=PLoROMvodv4rObpMCir6rNNUlFAn56Js20) by Stanford Online
4. [Stanford CS224N: NLP Winter 2021](https://youtube.com/playlist?list=PLoROMvodv4rOSH4v6133s9LFPRHjEmbmJ) by Stanford Online
5. [Stanford CS224U: NLU Spring 2021](https://youtube.com/playlist?list=PLoROMvodv4rPt5D0zs3YhbWSZA8Q_DyiJ) by Stanford Online
6. [Stanford CS229M: ML Theory Fall 2021](https://www.youtube.com/playlist?list=PLoROMvodv4rP8nAmISxFINlGKSK4rbLKh) by Stanford Online
7. [Stanford CS224N: NLP Winter 2019](https://youtube.com/playlist?list=PLoROMvodv4rOhcuXMZkNm7j3fVwBBY42z) by Stanford Online
8. [Hugging Face Course](https://www.youtube.com/playlist?list=PLo2EIpI_JMQvWfQndUesu0nPBAtZ9gP1o) from Youtube
### Lesson Timeline Machine Learning by Lee, Hung-Yi
This is the lesson timeline; it summarizes the work covered in each lecture. For example:
```mermaid
gantt
title Learning Schedule
section Time
Introduction of ML :2023-01-01, 2023-01-22
Introduction of DL :2023-01-01, 2023-01-22
Roadmap of Improving Model :2023-01-01, 2023-01-22
Analyzing Critical Points :2023-01-01, 2023-01-22
Batch and Momentum :2023-01-01, 2023-01-22
Error Surface is Rugged :2023-01-01, 2023-01-23
Classification :2023-01-01, 2023-01-23
Batch Normalization :2023-01-01, 2023-01-23
Convolutional Neural Networks :2023-01-01, 2023-01-23
Self Attention :2023-01-01, 2023-01-23
Transformers :2023-01-01, 2023-01-23
Generative Adversarial Network :2023-01-01, 2023-01-27
```
This is the explanation of [Machine Learning (2021)](https://speech.ee.ntu.edu.tw/~hylee/ml/2021-spring.php) by Lee, Hung-Yi, covering the lectures I have watched and reviewed:
| Lesson | Video | Slide | Date |
| --------- | ----- | ------ | ------|
| Introduction of ML | [video](#) :clock9: | [Slide](https://speech.ee.ntu.edu.tw/~hylee/ml/ml2021-course-data/regression%20(v16).pdf) :heavy_check_mark:| 2023-01-22 |
| Introduction of DL | [video](#) :clock9: | [Slide](https://speech.ee.ntu.edu.tw/~hylee/ml/ml2021-course-data/regression%20(v16).pdf) :heavy_check_mark:| 2023-01-22 |
| Roadmap of Improving Model | [video](#) :clock9: | [Slide](https://speech.ee.ntu.edu.tw/~hylee/ml/ml2021-course-data/overfit-v6.pdf) :heavy_check_mark:| 2023-01-22 |
| Analyzing Critical Points | [video](#) :clock9: | [Slide](https://speech.ee.ntu.edu.tw/~hylee/ml/ml2021-course-data/small-gradient-v7.pdf) :heavy_check_mark:| 2023-01-22 |
| Batch and Momentum | [video](#) :clock9: | [Slide](https://speech.ee.ntu.edu.tw/~hylee/ml/ml2021-course-data/small-gradient-v7.pdf) :heavy_check_mark:| 2023-01-22 |
| Error Surface is Rugged | [video](#) :clock9: | [Slide](https://speech.ee.ntu.edu.tw/~hylee/ml/ml2021-course-data/optimizer_v4.pdf) :heavy_check_mark:| 2023-01-22 |
| Classification | [video](#) :clock9: | [Slide](https://speech.ee.ntu.edu.tw/~hylee/ml/ml2021-course-data/classification_v2.pdf) :heavy_check_mark:| 2023-01-23 |
| Batch Normalization | [video](#) :clock9: | [Slide](https://speech.ee.ntu.edu.tw/~hylee/ml/ml2021-course-data/normalization_v4.pdf) :heavy_check_mark:| 2023-01-23 |
| Convolutional Neural Networks | [video](#) :clock9: | [Slide](https://speech.ee.ntu.edu.tw/~hylee/ml/ml2021-course-data/cnn_v4.pdf) :heavy_check_mark:| 2023-01-23 |
| Self Attention | [video](#) :clock9: | [Slide](https://speech.ee.ntu.edu.tw/~hylee/ml/ml2021-course-data/self_v7.pdf) :heavy_check_mark:| 2023-01-23 |
| Transformers | [video](#) :clock9: | [Slide](https://speech.ee.ntu.edu.tw/~hylee/ml/ml2021-course-data/seq2seq_v9.pdf) :heavy_check_mark:| 2023-01-23 |
| Generative Adversarial Network | [video](#) :clock9: | [Slide](https://speech.ee.ntu.edu.tw/~hylee/ml/ml2021-course-data/gan_v10.pdf) :heavy_check_mark:| 2023-01-24 |
#### Explanation of Machine/ Deep Learning
1. Introduction of Machine/Deep Learning
* Machine Learning = Looking for function
- Speech Recognition: f(audio) = text transcription
- Image Recognition: f(image) = class label
- Playing Go: f(board position) = next move
* Different types of functions
- Regression: the function outputs a scalar
ex. predict PM 2.5
- Classification: Given options (classes), the function outputs the correct one
ex. spam filtering and playing go
2. Study Case: Youtube Channel
* Function with Unknown Parameters: y = f(no. of views)
- Model: $y = b + wx_{1}$
- Feature:
y: no. of views on 2/26
$x_{1}$: no. of views on 2/25
w & b are unknown parameters (learned from data)
- weight: w
- bias: b
* Define Loss from Training Data
- Loss is a function of parameters: L(b, w)
- Loss: how good a set of values is
e.g. $(b, w) = (0.5\text{k}, 1)$: $y = b + wx_{1} \rightarrow y = 0.5\text{k} + 1 \cdot x_{1}$; how good is it?
$e_{n} = |y - \hat y|$
- Loss: $L = \frac{1}{N} \sum_{n} e_{n}$
- $e = |y - \hat y|$ L is mean absolute error (MAE)
- $e = (y - \hat y)^2$ L is mean square error (MSE)
- if y and $\hat y$ are both probability distributions -> cross-entropy
* Optimization
$w^* = \argmin_w L$
- Gradient Descent
1. (Randomly) Pick an initial value $w^0$
2. compute $\frac{\partial L}{\partial w}|_{w=w^0}$
3. $w^1 \longleftarrow w^0 - \eta \frac{\partial L}{\partial w} |_{w=w^0}$
4. $\eta$: learning rate
5. update w iteratively; do local minima truly cause the problem?
$w^*, b^* = \argmin_{w,b} L$
- Gradient Descent
1. (Randomly) Pick Initial values $w^0, b^0$
2. Compute -> can be done in one line in most deep learning frameworks
$\frac{\partial L}{\partial w}\big|_{w=w^0, b=b^0}$
$\frac{\partial L}{\partial b}\big|_{w=w^0, b=b^0}$
$w^1 \longleftarrow w^0 - \eta \frac{\partial L}{\partial w} \big|_{w=w^0, b=b^0}$
$b^1 \longleftarrow b^0 - \eta \frac{\partial L}{\partial b} \big|_{w=w^0, b=b^0}$
3. Update w and b iteratively
4. model $y = b + wx_{1}$
$w^*, b^* = \argmin_{w,b} L$
2. Machine Learning is so simple
* Step 1: function with unknown
$y= b + wx_{1}$
* Step 2: define loss from training data
* Step 3: optimization
$w^* = 0.97, b^* = 0.1k$
$L(w^*, b^*) = 0.48k$
* from step 1 to step 3 is called Training
$y = 0.1k + 0.97x_{1}$ achieves the smallest loss L = 0.48k on data of 2017 - 2020 (training data), how about data of 2021 (unseen during training)?
3. Linear models are too simple; we need more sophisticated models.
* Linear models have severe limitations.
* Model bias: we need a more flexible model.
* Sigmoid function: $y = c \frac{1}{1+e^{-(b+wx_{1})}}$
$y = c \, \mathrm{sigmoid}(b + wx_{1})$
- different w, change slopes
- different b, shift
- different c, change height
* New model: more features
* $y = b + wx_{1}$
$y = b + \sum_{i} c_{i} \, \mathrm{sigmoid}(b_{i} + w_{i}x_{1})$
* $y = b + \sum_{j} w_{j}x_{j}$
$y = b + \sum_{i} c_{i} \, \mathrm{sigmoid}(b_{i} + \sum_{j} w_{ij}x_{j})$
* $j: 1, 2, 3$ no. of features
* $i: 1, 2, 3$ no.of sigmoid
* $r_{1} = b_{1} + w_{11}x_{1} + w_{12}x_{2} + w_{13}x_{3}$
$r_{2} = b_{2} + w_{21}x_{1} + w_{22}x_{2} + w_{23}x_{3}$
$r_{3} = b_{3} + w_{31}x_{1} + w_{32}x_{2} + w_{33}x_{3}$
* $w_{ij}$: weight for $x_j$ for i-th sigmoid
* $r = b + Wx$
* $\alpha_{1} = \mathrm{sigmoid}(r_{1}) = \frac{1}{1+e^{-r_{1}}}$
$\alpha_{2} = \mathrm{sigmoid}(r_{2}) = \frac{1}{1+e^{-r_{2}}}$
$\alpha_{3} = \mathrm{sigmoid}(r_{3}) = \frac{1}{1+e^{-r_{3}}}$
* $\alpha = \sigma (r)$
* $y = b + c^T \alpha$
* Function with unknown parameters:
$y = b + c^T \sigma (b + Wx)$
- $x$: feature
- $W, b, c^T, b$: unknown parameters
4. Back to ML Frameworks
* step 1: function with unknown parameters
$y = b + c^T \sigma (b + Wx)$
* step2: define loss from training data
* step3: optimization
* More variety of models
* Sigmoid -> ReLU
sigmoid: $y = b + \sum_{i} c_{i} \, \mathrm{sigmoid}(b_{i} + \sum_{j} w_{ij}x_{j})$
ReLU: $y = b + \sum_{2i} c_{i} \max(0, b_{i} + \sum_{j} w_{ij}x_{j})$
* Many layers means Deep -> Deep Learning
* Deep = Many Hidden Layers
* Why don't we go deeper?
* Loss for multiple hidden layers
* 100 ReLU for each layer
* Input features are the no. of views in the past 56 days. (better on training data, worse on unseen data)
* If we want to select a model for predicting no. of views today, which one will you use?
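To make the three-step recipe above concrete, here is a minimal NumPy sketch of training the single-feature linear model $y = b + wx_{1}$ by gradient descent; the toy data, learning rate, and number of iterations are made-up values for illustration, not the lecture's actual setup.
```python
import numpy as np

# Toy data: x1 = yesterday's view count, y_hat = today's view count (made-up numbers, in thousands).
x1 = np.array([4.8, 5.2, 4.9, 5.5, 6.0])
y_hat = np.array([5.0, 5.1, 5.4, 5.9, 6.1])

w, b = 0.0, 0.0           # step 1: function with unknown parameters y = b + w * x1
eta = 0.01                # learning rate

for step in range(1000):  # step 3: optimization by gradient descent
    y = b + w * x1                     # model prediction
    e = y - y_hat                      # per-sample error
    L = np.mean(e ** 2)                # step 2: MSE loss
    grad_w = np.mean(2 * e * x1)       # dL/dw
    grad_b = np.mean(2 * e)            # dL/db
    w -= eta * grad_w                  # update w and b iteratively
    b -= eta * grad_b

print(f"w* = {w:.3f}, b* = {b:.3f}, loss = {L:.4f}")
```
The same loop extends to the sigmoid/ReLU models above by swapping in the richer function and its gradients (or letting a framework compute them automatically).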
#### General Guidance
1. Framework of ML
* Training Data: $\{(x^1, \hat y^1), (x^2, \hat y^2), \dots, (x^N, \hat y^N)\}$
* Testing Data: $\{x^{N+1}, x^{N+2}, \dots, x^{N+M}\}$
* Speech Recognition x(sound) -> $\hat y$: phoneme
* Speaker Recognition x(sound) -> $\hat y$: John (speaker)
* Image Recognition x(image) -> $\hat y$: soup
* Machine Translation x(text) -> $\hat y$: text_output
* use $y = f_{\theta^*}(x)$ to label the testing data
* output data = $\{y^{N+1}, y^{N+2}, \dots, y^{N+M}\}$ -> upload to Kaggle
2. General Guide
* loss on training data
* large
* model bias -> make your model complex
* optimization
* small
* loss on testing data
* large
* overfitting -> more training data, data augmentation, make your model simpler
* mismatch
* small
* trade-off: split your training data into training set and validation set for model selection
* model bias
* the model is too simple: like finding a needle in a haystack, but there is no needle.
* Solution: redesign your model to make it more flexible.
$y = b + wx_{1}$ -> more features:
$y = b + \sum_{j=1}^{56} w_{j}x_{j}$
$y = b + \sum_{i} c_{i} \, \mathrm{sigmoid}(b_{i} + \sum_{j} w_{ij}x_{j})$
* optimization issue
* Large loss does not always imply model bias. There is another possibility: the needle is in the haystack, but we just cannot find it.
* Gaining the insights from comparison
* Start from shallower networks (or other models), which are easier to optimize.
* if deeper networks do not obtain smaller loss on training data, then there is an optimization issue.
* Solution: more powerful optimization technology
* Overfitting
* small loss on training data, large loss on testing data. why?
* An extreme example
* Training data: $\{(x^1, \hat y^1), (x^2, \hat y^2), \dots, (x^N, \hat y^N)\}$
$$
f(x)=
\begin{cases}
\hat y^i & \quad \text{$\exists x^i = x$}\\
random & \quad \text{otherwise}
\end{cases}
$$
* This function obtains zero training loss, but large testing loss.
* fewer parameters, sharing parameters
* fewer features
* early stopping
* regularization
* dropout
* Bias-Complexity Trade-Off
* Model becomes complex (e.g. more features, more parameters)
* The extreme example again
$$
f(x)=
\begin{cases}
\hat y^i & \quad \text{$\exists x^i = x$} \quad \text{k= 1- 1 million}\\
random & \quad \text{otherwise}
\end{cases}
$$
It is possible that $f_{56789}(x)$ happens to get good performance on the public testing set,
so you select $f_{56789}(x)$, but it behaves randomly on the private testing set.
* Cross Validation
* How to split? Split the training set into a training set and a validation set.
* If you use the results on the public testing data to select your model, you are making the public set look better than the private set.
* N-Fold Cross Validation
* The training set is split into N folds; each fold takes a turn as the validation set, and the average validation error (e.g. MSE) across folds is used to pick the best model, which is then applied to the testing set (public/private).
* Mismatch
* Your training and testing data have different distributions. Be aware of how data is generated.
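As a concrete illustration of the N-fold cross-validation described above, here is a small NumPy sketch; the candidate model (`np.polyfit`) and the scoring function are placeholders for whatever models you are comparing.
```python
import numpy as np

def n_fold_cv(x, y, fit, evaluate, n_folds=3):
    """Split the training set into n folds; each fold takes a turn as the validation set."""
    idx = np.random.permutation(len(x))
    folds = np.array_split(idx, n_folds)
    scores = []
    for k in range(n_folds):
        val_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        model = fit(x[train_idx], y[train_idx])
        scores.append(evaluate(model, x[val_idx], y[val_idx]))
    return np.mean(scores)  # average validation error used for model selection

# Example usage with a degree-1 polynomial as a placeholder candidate model:
x = np.linspace(0, 1, 30)
y = 2 * x + 0.1 * np.random.randn(30)
fit = lambda xs, ys: np.polyfit(xs, ys, 1)
mse = lambda coef, xs, ys: np.mean((np.polyval(coef, xs) - ys) ** 2)
print(n_fold_cv(x, y, fit, mse))
```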
#### When gradient is small
1. Optimization fails because (training loss does not decrease)
* local minima: no way to go
* saddle point: it is still possible to escape
* the gradient is close to zero at a critical point
* or the loss is simply not small enough even when the updates stop
3. Taylor Series Approximation
* $L(\theta)$ around $\theta$ = $\theta'$ can be approximated below:
$L(\theta) \approx L(\theta') + (\theta - \theta')^{T} g + \frac{1}{2}(\theta - \theta')^{T} H (\theta - \theta')$
- gradient g is a vector
$g = \nabla L(\theta')$
$g_{i} = \frac{\partial L(\theta')}{\partial \theta_{i}}$
- Hessian H is a matrix
$H_{ij} = \frac{\partial^2}{\partial \theta_{i} \partial \theta_{j}} L(\theta')$
5. Hessian
* $L(\theta)$ around $\theta$ = $\theta'$ can be approximated below:
$L(\theta) \approx L(\theta') + \frac{1}{2}(\theta - \theta')^{T} H (\theta - \theta')$
* for all v
$v^T H v > 0$ -> around $\theta'$: $L(\theta) > L(\theta')$ -> local minimum
H is positive definite = all eigenvalues are positive
$v^T H v < 0$ -> around $\theta'$: $L(\theta) < L(\theta')$ -> local maximum
H is negative definite = all eigenvalues are negative
Sometimes $v^T H v > 0$, sometimes $v^T H v < 0$ -> saddle point; some eigenvalues are positive and some are negative.
6. Saddle Point vs Local Minima
* When you have lots of parameters, perhaps local minima is rare?
* Empirical study:
- Train a network until it converges to a critical point
- The critical points found are mostly "like" local minima (high minimum ratio), but
- a real "local minimum" (minimum ratio = 1) is almost never reached
- $\text{minimum ratio} = \frac{\text{number of positive eigenvalues}}{\text{number of eigenvalues}}$
* Small Gradient
- very slow at plateau
- stuck at saddle point
- stuck at local minima
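A minimal sketch of the Hessian analysis above: given the Hessian at a critical point, the signs of its eigenvalues tell whether the point is a local minimum, a local maximum, or a saddle point, and the minimum ratio is simply the fraction of positive eigenvalues. The example Hessian is a made-up 2x2 matrix.
```python
import numpy as np

def classify_critical_point(H):
    """Classify a critical point (gradient ~ 0) from the Hessian's eigenvalues."""
    eigvals = np.linalg.eigvalsh(H)        # H is symmetric, so use eigvalsh
    min_ratio = np.mean(eigvals > 0)       # fraction of positive eigenvalues
    if np.all(eigvals > 0):
        kind = "local minimum"             # H positive definite
    elif np.all(eigvals < 0):
        kind = "local maximum"             # H negative definite
    else:
        kind = "saddle point"              # mixed signs
    return kind, min_ratio

H = np.array([[2.0, 0.5], [0.5, -1.0]])    # made-up Hessian for illustration
print(classify_critical_point(H))          # -> ('saddle point', 0.5)
```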
#### Tips for training: Batch and Momentum
1. Batch
* Review: Optimization with Batch
$\theta^* = \argmin_{\theta} L$
- (randomly) pick initial values $\theta^0$
- compute the gradient $g^{n} = \nabla L^{n}(\theta^{n})$ on the $n$-th batch only
- update $\theta^{n+1} \longleftarrow \theta^{n} - \eta g^{n}$
- 1 epoch = see all the batches once -> shuffle after each epoch
* Small Batch vs Large Batch
- larger batch size does not require longer time to compute gradient (unless batch size is too large)
- smaller batch requires longer time for one epoch (longer time for seeing all data once)
- Smaller batch size has better performance
- what's wrong with large batch size? optimization fails.
- "noisy" update is better for training
- Small batch is better on testing data?
Comparison table; batch size is a hyperparameter you have to decide.
| Variable | Small | Large |
| -------- | ----- | ----- |
| Speed for one update (no parallel) | Faster | Slower |
| Speed for one update (w/ parallel) | Same | Same (not too large) |
| Time for one epoch | Slower | Faster |
| Gradient | Noisy | Stable |
| Optimization | Better | Worse |
| Generalization | Better | Worse |
* Small Batch vs Large Batch
- [Large Batch Optimization for Deep Learning: Training BERT in 76 minutes](https://arxiv.org/abs/1904.00962)
- [Extremely Large Minibatch SGD: Training ResNet-50 on ImageNet in 15 minutes](https://arxiv.org/abs/1711.04325)
- [Stochastic Weight Averaging in Parallel: Large-Batch Training that Generalizes well](https://arxiv.org/abs/2001.02312)
- [Large Batch Training of Convolutional Networks](https://arxiv.org/abs/1708.03888)
- [Accurate, large Minibatch SGD: Training Imagenet in 1 hour](https://arxiv.org/abs/1706.02677)
2. Momentum
* Small gradient -> how about putting this phenomenon (momentum, as in the physical world) into gradient descent?
* (Vanilla) Gradient Descent
- Starting at $\theta^0$
- Compute Gradient $g^0$
- Move to $\theta^1 = \theta^0 - \eta g^0$
- compute gradient $g^1$
- move to $\theta^2 = \theta^1 - \eta g^1$
* Gradient Descent + Momentum
Movement: Movement of last Step minus gradient at present.
Movement not just based on gradient, but previous movement.
- starting at $\theta^0$
- movement $m^0 = 0$
- compute gradient $g^0$
- movement $m^1 = \lambda m^0 - \eta g^0$
- move to $\theta^1 = \theta^0 + m^1$
- compute gradient $g^1$
- movement $m^2 = \lambda m^1 - \eta g^1$
- move to $\theta^2 = \theta^1 + m^2$
- $m^i$ is the weighted sum of all the previous gradient: $g^0,g^1, .., g^{i-1}$
$m^0 = 0$
$m^1 = -\eta g^0$
$m^2 = -\lambda \eta g^0 - \eta g^1$
- movement = negative of $\frac {\partial L}{\partial w}$ + Last Movement
* Concluding Remarks
- Critical points have zero gradient
- Critical points can be either saddle points or local minima
- It can be determined by the Hessian Matrix
- It is possible to escape saddle points along the direction of eigenvectors of the Hessian Matrix.
- Local Minima may be rare.
- Smaller batch size and momentum help escape critical points.
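A minimal sketch of gradient descent with momentum exactly as written above, on a toy convex loss; the loss, $\eta$, and $\lambda$ are illustrative choices.
```python
def grad(theta):                 # toy convex loss L(theta) = (theta - 1)**2; only its gradient is needed
    return 2.0 * (theta - 1.0)

theta, m = 5.0, 0.0              # start at theta^0, movement m^0 = 0
eta, lam = 0.1, 0.9              # learning rate eta and momentum coefficient lambda

for step in range(100):
    g = grad(theta)              # compute gradient g^t
    m = lam * m - eta * g        # movement m^{t+1} = lambda * m^t - eta * g^t
    theta = theta + m            # move to theta^{t+1} = theta^t + m^{t+1}

print(theta)                     # ends close to the minimum at theta = 1
```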
#### Error Surface is Rugged - Tips for Training: Adaptive Learning Rate
1. Training stuck $\neq$ Small gradient
* People believe training gets stuck because the parameters are around a critical point.
* Training can be difficult even without critical points.
- This error surface is convex
- Learning rate cannot be one-size-fits-all.
* Different parameters need different learning rates
- formulation for one parameter:
- $\theta_{i}^{t+1} \longleftarrow \theta_{i}^{t} - \eta g_{i}^{t}$
- $g_{i}^{t} = \frac{\partial L}{\partial \theta_{i}}\big|_{\theta = \theta^t}$
- $\theta_{i}^{t+1} \longleftarrow \theta_{i}^{t} - \frac{\eta}{\sigma_{i}^{t}} g_{i}^{t}$ (parameter-dependent learning rate)
- Root Mean Square:
- $\sigma_{i}^{t} = \sqrt{\frac{1}{t+1} \sum_{k=0}^{t} (g_{i}^{k})^2}$, as used in Adagrad
- Learning Rate adapts dynamically.
- RMSProp:
- $\theta_{i}^{t+1} \longleftarrow \theta_{i}^{t} - \frac{\eta}{\sigma_{i}^{t}} g_{i}^{t}$
- the recent gradients have larger influence, and the past gradients have less influence:
$\sigma_{i}^{t} = \sqrt{\alpha (\sigma_{i}^{t-1})^2 + (1-\alpha)(g_{i}^{t})^2}$
- small $\sigma_{i}^{t}$ -> larger step
- when gradients increase, $\sigma_{i}^{t}$ increases -> smaller step
- when gradients decrease, $\sigma_{i}^{t}$ decreases -> larger step
- Adam: RMSProp + Momentum
- without adaptive learning rate
- Learning rate scheduling
$\theta_{i}^{t+1} \longleftarrow \theta_{i}^{t} - \frac{\eta^t}{\sigma_{i}^{t}} g_{i}^{t}$
- learning rate decay: as the training goes, we are closer to the destination, so we reduce the learning rate.
- Warm Up: increase and then decrease?
At the beginning, the estimate of $\sigma_{i}^{t}$ has large variance.
- Summary of Optimization
- (vanilla) Gradient Descent
$\theta_{i}^{t+1} \longleftarrow \theta_{i}^{t} - \eta g_{i}^{t}$
- Various Improvements
$\theta_{i}^{t+1} \longleftarrow \theta_{i}^{t} - \frac{\eta^t}{\sigma_{i}^{t}} m_{i}^{t}$
$\eta^t$: learning rate scheduling
$m_{i}^{t}$: momentum, a weighted sum of the previous gradients that considers direction
$\sigma_{i}^{t}$: root mean square of the gradients, which considers only magnitude
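To check my understanding of the adaptive learning rate, here is a small NumPy sketch of an RMSProp-style update for a single parameter; the toy loss and the hyperparameter values are assumptions for illustration.
```python
import numpy as np

def rmsprop_step(theta, g, sigma, eta=0.01, alpha=0.9, eps=1e-8):
    """One RMSProp-style update for a single parameter, as sketched above.

    sigma is the running root mean square of past gradients; recent gradients
    get weight (1 - alpha), so they have the larger influence.
    """
    sigma = np.sqrt(alpha * sigma**2 + (1 - alpha) * g**2)
    theta = theta - eta / (sigma + eps) * g   # small sigma -> larger step, large sigma -> smaller step
    return theta, sigma

theta, sigma = 2.0, 0.0
for t in range(200):
    g = 2.0 * theta                           # gradient of the toy loss L(theta) = theta**2
    theta, sigma = rmsprop_step(theta, g, sigma)
print(theta)                                  # approaches the minimum at 0
```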
#### Classification (Short Version)
1. Classification as Regression?
* Regression
x -> model -> y $\longleftrightarrow$ $\hat y$
* Classification as regression?
x -> model -> y $\longleftrightarrow$ $\hat y$ class
different? similar?
2. Class as one-hot vector
1. only output one value
2. How to output multiple values?
3. Regression vs Classification
* Regression
$\hat y \longleftrightarrow y = b + c^T \sigma(b + Wx)$
* Classification
$y = b' + W'\sigma(b + Wx)$
$\hat y \longleftrightarrow y' = \mathrm{softmax}(y)$, which makes all values fall between 0 and 1.
* softmax
$y'_{i} = \frac{\exp(y_{i})}{\sum_{j} \exp(y_{j})}$
$1 > y'_{i} > 0$
$\sum_{i} y'_{i} = 1$
* Loss of Classification
$x \longrightarrow \text{network} \longrightarrow y \overset{e}{\longleftrightarrow} \hat y$
$L = \frac{1}{N} \sum_{n} e_{n}$
Mean Square Error (MSE) $e = \sum_{i} (\hat y_{i} -y'_{i})^2$
Cross-Entropy $e = - \sum_{i} \hat y_{i} ln y'_{i}$
Minimizing cross-entropy is equivalent to maximizing likelihood
Changing the loss function can change the difficulty of optimization
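A minimal NumPy sketch of the softmax and cross-entropy computations above, with a one-hot label; the raw outputs are made-up numbers.
```python
import numpy as np

def softmax(y):
    """Normalize raw outputs so each y'_i is in (0, 1) and they sum to 1."""
    z = np.exp(y - np.max(y))           # subtract max for numerical stability
    return z / np.sum(z)

def cross_entropy(y_hat, y_prime):
    """e = -sum_i y_hat_i * ln(y'_i) for a one-hot target y_hat."""
    return -np.sum(y_hat * np.log(y_prime + 1e-12))

y = np.array([3.0, 1.0, -2.0])          # raw network outputs
y_hat = np.array([1.0, 0.0, 0.0])       # one-hot class label
y_prime = softmax(y)
print(y_prime, cross_entropy(y_hat, y_prime))
```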
#### Quick Introduction of Batch Normalization
1. Changing Landscape
* $\hat y \overset{e}{\longleftrightarrow} y = b + (w_{n} + \Delta w_{n}) x_{n}$: the effect of a small change $\Delta w_{n}$ on the loss depends on the scale of the corresponding input $x_{n}$.
* $L = \sum_{n} e_{n}$
3. Feature Normalization
* $\tilde x_{i}^r \longleftarrow \frac{x_{i}^r - m_{i}}{\sigma_{i}}$
* For each dimension i;
* mean: $m_{i}$
* standard deviation: $\sigma_{i}$
* In general, feature normalization makes gradient descent converge faster
4. Considering Deep Learning
* $\tilde x^{i} \longrightarrow W^{1} \longrightarrow z^{i} \longrightarrow \mathrm{sigmoid} \longrightarrow a^{i} \longrightarrow W^{2}$
$\tilde x^{i}$: features after feature normalization
$z^{i}$: different dims have different ranges and also need normalization
$W^{2}$: otherwise also difficult to optimize
* $\tilde x^{i} \longrightarrow W^{1} \longrightarrow z^{i}$, and from the $z^{i}$ in the batch compute $\mu$ and $\sigma$:
$\mu = \frac{1}{3} \sum_{i=1}^{3} z^{i}$
$\sigma = \sqrt{\frac{1}{3} \sum_{i=1}^{3} (z^{i}-\mu)^2}$
* $\tilde x^{i} \longrightarrow W^{1} \longrightarrow z^{i} \longrightarrow \tilde z^{i} \longrightarrow \mathrm{sigmoid} \longrightarrow a^{i}$, where
$\tilde z^{i} = \frac{z^{i} - \mu}{\sigma}$
$\mu$ and $\sigma$ depend on all the $z^{i}$ within a batch; this is why it is called batch normalization.
* $\tilde x^{i} \longrightarrow W^{1} \longrightarrow z^{i} \longrightarrow \tilde z^{i} \longrightarrow \hat z^{i}$, with learnable $\gamma$ and $\beta$:
$\tilde z^{i} = \frac{z^{i} - \mu}{\sigma}$
$\hat z^{i} = \gamma \odot \tilde z^{i} + \beta$
$\mu$ and $\sigma$ still depend on the $z^{i}$ within the batch.
4. Batch Normalization - Testing
* We do not always have batch at testing stage
* computing the moving average of $\mu$ and $\sigma$ of the batches during training
* $\bar \mu \longleftarrow p \bar \mu + (1-p) \mu^t$
* [original paper](https://arxiv.org/abs/1502.03167)
5. Internal Covariate Shift?
* How does batch normalization help optimization?
* The "internal covariate shift" explanation claims batch normalization makes $a$ and $a'$ have similar statistics.
Experimental results do not support this idea.
* [This paper](https://arxiv.org/abs/1805.11604) presents experimental results (and theoretical analysis) supporting that batch normalization changes the landscape of the error surface.
* To learn more, you can check out several papers: [batch renormalization](https://arxiv.org/abs/1702.03275), [layer normalization](https://arxiv.org/abs/1607.06450), [instance normalization](https://arxiv.org/abs/1607.08022), [group normalization](https://arxiv.org/abs/1803.08494), [weight normalization](https://arxiv.org/abs/1602.07868), and [spectral normalization](https://arxiv.org/abs/1705.10941).
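A minimal NumPy sketch of batch normalization as described above: batch statistics (plus learnable $\gamma$, $\beta$) at training time, and the moving averages at testing time. This is a toy illustration, not a framework implementation.
```python
import numpy as np

class BatchNorm1D:
    """Minimal batch normalization for a (batch, features) array."""
    def __init__(self, dim, p=0.9, eps=1e-5):
        self.gamma, self.beta = np.ones(dim), np.zeros(dim)          # learnable scale and shift
        self.mu_bar, self.sigma_bar = np.zeros(dim), np.ones(dim)    # moving averages for testing
        self.p, self.eps = p, eps

    def forward(self, z, training=True):
        if training:
            mu, sigma = z.mean(axis=0), z.std(axis=0)                # statistics of the current batch
            self.mu_bar = self.p * self.mu_bar + (1 - self.p) * mu
            self.sigma_bar = self.p * self.sigma_bar + (1 - self.p) * sigma
        else:
            mu, sigma = self.mu_bar, self.sigma_bar                  # no batch at testing stage
        z_tilde = (z - mu) / (sigma + self.eps)
        return self.gamma * z_tilde + self.beta                      # z_hat = gamma * z_tilde + beta

bn = BatchNorm1D(dim=3)
z = np.random.randn(8, 3) * 5 + 2                                    # toy pre-activations
print(bn.forward(z, training=True).std(axis=0))                      # roughly 1 per dimension
```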
#### Convolutional Neural Network (CNN)
Network Architecture designed for Image
1. Image Classification
* All the images to be classified have the same size
$(100 \times 100) \longrightarrow \text{model} \longrightarrow y' \longleftrightarrow \hat y$
A $100 \times 100$ image is a 3-D tensor with 3 channels (RGB); each value represents intensity.
* Fully connected network
do we really need "fully connected" in image processing?
2. Observation 1:
* identifying some critical patterns is enough;
humans also identify birds in a similar way.
* Need to see the whole image? some patterns are much smaller than the whole image.
* A neuron does not have to see the whole image (basic detector and advanced detector).
3. Simplification 1:
* Can different neurons have different sizes of receptive field?
* Cover only some channels? can be overlapped
* Not square receptive field? the same receptive field
* Each receptive field has a set of neurons (e.g. 64 neurons)
* The receptive fields cover the whole image.
4. Observation 2:
* The same patterns appear in different regions
* Each receptive field needs a "beak" detector? I detect "beak" in my receptive field.
5. Simplification 2:
* Two neurons with same receptive field would not share parameters
$x_{n} \sigma(w_{n}x_{n} + w_{n+1}x_{n+1} + ...)$
$x'_{n} \sigma(w_{n}x'_{n} + w_{n+1}x'_{n+1} + ...)$
* Typical setting
Each receptive field has a set of neurons (e.g. 64 Neurons)
Each receptive field has the neurons with the same set of parameters.
6. Benefit of Convolutional Layer
* Fully Connected Layer -> jack of all trades, master of none
* Convolutional Layer -> larger model bias (which is beneficial for images)
* Some patterns are much smaller than the whole image.
* The same patterns appear in different regions.
Comparison between the neuron version story and the filter version story:
| Neuron Version Story | Filter Version Story |
| -------- | -------- |
| Each neuron only considers a receptive field | There are a set of filters detecting small patterns|
| The neurons with different receptive fields share the parameters | Each filter convolves over the input image. |
7. Observation 3:
* Subsampling the pixels will not change the object.
* Convolutional Layers + Pooling
several $\text{convolution} \longrightarrow \text{pooling} \longrightarrow \text{flatten}$, followed by fully connected layers.
5. More Application
* [Speech](https://dl.acm.org/doi/10.110)
* [Natural Language Processing](https://www.aclweb.org/anth)
* CNN is not invariant to scaling and rotation (we need data augmentation)
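A minimal PyTorch sketch of the convolution -> pooling -> flatten -> fully-connected pipeline above; the filter counts and layer sizes are illustrative choices, not the lecture's.
```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, padding=1),   # 64 shared filters scanning 3x3 receptive fields
            nn.ReLU(),
            nn.MaxPool2d(2),                              # pooling: subsampling does not change the object
            nn.Conv2d(64, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(64 * 25 * 25, num_classes)  # flatten -> fully connected

    def forward(self, x):                                 # x: (batch, 3, 100, 100) image tensor
        x = self.features(x)
        return self.classifier(x.flatten(1))

model = TinyCNN()
print(model(torch.randn(2, 3, 100, 100)).shape)           # torch.Size([2, 10])
```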
#### Self-attention
Detail Application
1. Sophisticated Input
* Input is a vector
$input \longrightarrow model \longrightarrow ScalarOrClass$
* Input is a set of vectors
$input \longrightarrow model \longrightarrow ScalarsOrClasses$
2. Vector set as input
* One-hot Encoding
* Word Embedding
* Graph is also a set of vectors (consider each node as a vector)
3. What is the output?
* Each vector has a label
$word \longrightarrow model \longrightarrow wordLabel$
example: POS Tagging
* The whole sequence has a label
$word \longrightarrow model \longrightarrow oneLabel$
example: Sentiment Analysis
* Model decides the number of labels itself.
$word N \longrightarrow model \longrightarrow word N'$
4. Sequence Labeling
* Is it possible to consider the context? Fully-Connected (FC) can consider the neighbor
* How to consider the whole sequence?
* A window covers the whole sequence?
5. Self-Attention
* The input $a$ can be either the raw input or a hidden layer.
$a^{i} \longrightarrow b^{i}$
* find the relevant vectors in a sequence
$\alpha'_{1,i} = \frac{\exp(\alpha_{1,i})}{\sum_{j} \exp(\alpha_{1,j})}$
* Extract Information based on attention scores
$b^1 = \sum_{i} \alpha'_{1,i} v^i$
* Attention Matrix
$Q = W^q I$
$K = W^k I$
$V = W^v I$
$A = K^T Q$, then $A' = \mathrm{softmax}(A)$
$O = V A'$
6. Positional Encoding
* No position information in self-attention
* Each position has a unique positional vector $e^i$
* Each column represents a positional vector $e^i$
* hand-crafted
* learned from data
$$
e^i + a^i
\begin{cases}
q^i \\
k^i \\
v^i
\end{cases}
$$
* [Comparing position representation methods](https://arxiv.org/abs/)
* Many Applications such as [Transformers](https://arxiv.org/abs/1706.03762) and [BERT](https://arxiv.org/abs/1810.04805)
7. Self-attention for Image
* An image can also be considered as a vector set as reference [here](https://www.researchgate.net/figure/Color-image-representation-and-RGB-matrix_fig15_282798184)
* [Self-Attention GAN](https://arxiv.org/abs/1805.08318)
* Detection Transformer ([DETR](https://arxiv.org/abs/2005.12872))
9. Self-attention vs CNN
* CNN: self-attention that can only attend within a receptive field;
CNN is a simplified version of self-attention
* Self-attention: a CNN with a learnable receptive field;
self-attention is the complex version of CNN.
* On the relationship between Self-Attention & Convolutional Layers as reference [here](https://arxiv.org/abs/1911.03584)
* An image is worth 16x16 words: transformers for image recognition at scale as reference [here](https://arxiv.org/pdf/2010.11929.pdf)
11. Self-attention vs RNN
* Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention as reference [here](https://arxiv.org/abs/2006.16236)
10. Self-attention for Graph
* Consider edge: only attention to connected nodes
* This is one type of graph neural network (GNN)
11. To learn more about self-attention, you can check out the references below:
* [Long Range Arena: A Benchmark for Efficient Transformers](https://arxiv.org/abs/2011.04006)
* [Efficient Transformers: A Survey](https://arxiv.org/abs/2009.06732)
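A minimal NumPy sketch of single-head self-attention using the column-vector convention from the formulas above ($Q = W^q I$, $K = W^k I$, $V = W^v I$, $A = K^T Q$, $A' = \mathrm{softmax}(A)$, $O = V A'$); the dimensions and random weights are placeholders.
```python
import numpy as np

def self_attention(I, Wq, Wk, Wv):
    """Single-head self-attention; I is (d, N) with one input vector a^i per column."""
    Q, K, V = Wq @ I, Wk @ I, Wv @ I       # query, key and value matrices
    A = K.T @ Q                             # attention scores: A[i, j] = k^i . q^j
    A_prime = np.exp(A) / np.exp(A).sum(axis=0, keepdims=True)  # softmax over keys for each query
    return V @ A_prime                      # each output column b^j is a weighted sum of the v^i

d, N = 4, 3                                 # toy dimensions
rng = np.random.default_rng(0)
I = rng.normal(size=(d, N))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
print(self_attention(I, Wq, Wk, Wv).shape)  # (4, 3): one output vector b^i per input vector
```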
#### Transformer
Transformers Application
1. Sequence to Sequence (Seq2Seq)
* Input a sequence, output a sequence
* The output length is determined by model
* Speech recognition: sound of length $T$ $\longrightarrow$ text of length $N$
* Machine translation: $N$ words $\longrightarrow$ $N'$ words
* Speech translation: speech in one language $\longrightarrow$ text in another (useful for languages without a written form)
2. Most Natural Language Processing Applications
$question, context \longrightarrow seq2seq \longrightarrow answer$
* QA can be done by seq2seq as reference [here](https://arxiv.org/abs/1806.08730) and [here](https://arxiv.org/abs/1909.03329)
* Deep Learning for Human Language Processing as reference [here](https://speech.ee.ntu.edu.tw/~hylee/dlhlp/2020-spring.html)
* Seq2seq for Multi-label Classification as reference [here](https://arxiv.org/abs/1909.03434) and [here](https://arxiv.org/abs/1707.05495)
* Seq2seq for object detection as reference [here](https://arxiv.org/abs/2005.12872)
* Sequence to Sequence Learning with Neural Networks as reference [here](https://arxiv.org/abs/1409.3215)
* Transformers as reference [here](https://arxiv.org/abs/1706.03762)
3. To learn more about transformers
* On layer normalization in transformer architecture as reference [here](https://arxiv.org/abs/2002.047)
* PowerNorm: Rethinking Batch Normalization in Transformers as reference [here](https://arxiv.org/abs/2003.078)
* Copy Mechanism as reference [here](https://arxiv.org/abs/1704.04368)
* Incorporating Copying Mechanism in sequence-to-sequence learning as reference [here](https://arxiv.org/abs/1603.06393)
9. Scheduled Sampling
* [original scheduled sampling](https://arxiv.org/abs/1506.03099)
* [Scheduled sampling for transformer](https://arxiv.org/abs/1906.07651)
* [Parallel Scheduled Sampling](https://arxiv.org/abs/1906.04331)
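Since the output length of a seq2seq model is determined by the model itself, decoding is usually an autoregressive loop that stops at an end-of-sequence token. Here is a minimal sketch with a dummy stand-in for the decoder; the token ids and the dummy model are purely illustrative.
```python
import numpy as np

def greedy_decode(next_token_logits, bos_id=1, eos_id=2, max_len=20):
    """Autoregressive greedy decoding: keep emitting tokens until <eos>,
    so the output length is decided by the model itself."""
    output = [bos_id]
    for _ in range(max_len):
        token = int(np.argmax(next_token_logits(output)))  # pick the most likely next token
        output.append(token)
        if token == eos_id:
            break
    return output

# Dummy "decoder": emits token 5 three times, then <eos> (id 2); purely illustrative.
dummy = lambda prefix: np.eye(10)[5 if len(prefix) < 4 else 2]
print(greedy_decode(dummy))   # [1, 5, 5, 5, 2]
```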
#### Generation
Network as Generator
1. Why distribution?
* video prediction as [reference](https://github.com/dyelax/Adversarial_Video_Generation)
* Especially for the task needs "creativity"
* Drawing and Chatbot
2. Generative Adversarial Network (GAN)
* read more [reference](https://github.com/hindupuravinash/the-gan-zoo) for GAN
* Anime Face Generation
- unconditional generation
- discriminator
* basic idea of GAN
- this is where the term "adversarial" comes from
- Algorithm:
* Initialize generator & discriminator
* in each training iteration:
1. Step 1: fix generator G, and update discriminator D
2. Step 2: fix discriminator D, and update generator G. Generator learns to "fool" the discriminator (a minimal training-loop sketch follows this section).
- anime face generation [references](https://zhuanlan.zhihu.com/p/24767059)
* Progressive GAN as reference [here](https://arxiv.org/abs/1710.10196)
* The first GAN as reference [here](https://arxiv.org/abs/1406.2661)
* BigGAN as reference [here](https://arxiv.org/abs/1809.11096)
3. Theory behind GAN
4. Tips for GAN
* Tips from Soumith as reference [here](https://github.com/soumith/ganhacks)
* Tips in DCGAN: Guideline for Network Architecture design for Image Generation as reference [here](https://arxiv.org/abs/1511.06434)
* Improved Techniques for training GANs as reference [here](https://arxiv.org/abs/1606.03498)
* Tips from BigGAN as reference [here](https://arxiv.org/abs/1809.11096)
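A minimal PyTorch sketch of the two-step GAN training algorithm above (step 1: fix G, update D; step 2: fix D, update G); the tiny MLPs and the 2-D "real" data are illustrative stand-ins, not an actual image GAN.
```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))                  # generator
D = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())    # discriminator
opt_G = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_D = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

for it in range(100):
    real = torch.randn(32, 2) + 3.0                 # "real" samples from a shifted Gaussian
    z = torch.randn(32, 8)                          # latent noise

    # Step 1: fix G, update D (classify real vs. generated samples)
    fake = G(z).detach()
    loss_D = bce(D(real), torch.ones(32, 1)) + bce(D(fake), torch.zeros(32, 1))
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # Step 2: fix D, update G (generator learns to "fool" the discriminator)
    loss_G = bce(D(G(z)), torch.ones(32, 1))
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()

print(float(loss_D), float(loss_G))
```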
### Lesson Timeline by Prof. Min-Te, Sun
This is the lesson timeline; it summarizes the work covered in each lecture. For example:
```mermaid
gantt
title Learning Schedule
section Time
Deep Learning Getting Started:2023-01-01, 14d
ANN w/ Keras: 14d
Training Deep Neural Nets: 14d
Vision using CNN: 14d
Sequence using RNN & CNN: 14d
Representation Learning Generative using AutoEncoders GANs: 14d
```
This is the explanation of the [Deep learning Course](https://www.youtube.com/playlist?list=PLVHZ98Gy86j4A7WkBbktu-gJE_gW3UqQC) by Prof. Sun, Min-Te, covering the lectures I have watched and reviewed:
| Lesson | Slide | Date |
| --------- | ----- | ---- |
| Deep Learning Getting Started | [slide](https://drive.google.com/file/d/11HwnijaT_-omOVSiVmAldetElnvOvtBz/view?usp=share_link) :clock9: | ---- |
| Artificial Neural Networks with Keras | [slide](https://docs.google.com/presentation/d/15DHkTzBjJ1WPqgYx_ccQl0jzVEyvGmCQ/edit?usp=share_link&rtpof=true&sd=true) :clock9: | ---- |
| Training Deep Neural Nets | [slide](https://drive.google.com/file/d/1jOo3UR7pg1Pg1nr68oqLypmn75RcUbyM/view?usp=share_link) :clock9: | ---- |
| Vision using Convolutional Neural Networks | [slide](https://drive.google.com/file/d/1QOJGuYLVcanJRD4kGUbOdvkQPiTqzpZc/view?usp=share_link) :clock9: | ---- |
| Sequence using RNN & CNN | [slide](https://drive.google.com/file/d/1wwNL_JeY0Gg7a7CuHgbynuZNJIKZ5o2W/view?usp=share_link) :clock9: | ---- |
| Representation Learning & Generative Learning using AutoEncoders and GANs | [slide](https://drive.google.com/file/d/1dGTzBltcgIpwGBAQdNKAP2dhxlnfirka/view?usp=share_link) :clock9: | ---- |
### Lesson Timeline from HuggingFace
This is the lesson timeline; it summarizes the work covered in each lecture. For example:
```mermaid
gantt
title Learning Schedule
section Time
The Pipeline Function:Today, 14d
What happens inside the pipeline:14d
Instantiate a Transformers model (PyTorch & TensorFlow):14d
```
This is the explanation of the [Hugging Face Course](https://www.youtube.com/playlist?list=PLo2EIpI_JMQvWfQndUesu0nPBAtZ9gP1o) from YouTube, covering the lectures I have watched and reviewed:
| Lesson | Video | Date |
| --------- | ----- | ------ |
| The Pipeline Function | [video](#) :clock9: | --- |
| What happens inside the pipeline function | [video](#) :clock9: | --- |
| Instantiate a Transformers model (PyTorch & TensorFlow) | [video](#) :clock9: | --- |
| The Tokenizer Pipeline | [video](#) :clock9: | --- |
| Batching Inputs Together (PyTorch & TensorFlow) | [video](#) :clock9: | --- |
## Several Instruction for Code
This is an explanation of how the code works. The following procedures need to be carried out:
1. First, connect to the remote SSH server, e.g. `ssh widm@140.115.54.59`.
2. Prepare the repository that we want to clone. Because we are using the Web Intelligence and Data Mining (WIDM) Lab server, we must create a folder first and clone the repository from GitHub into it.
3. After cloning the repository to the server, we need to build a virtual container using Docker. Docker provides an isolated, reproducible environment so the project can run easily regardless of the host platform. There are several steps to using Docker.
4. After the container is installed on the server, we need to set up a virtual environment so we can customize the environment on the server: create and activate the virtual environment (venv) and install the packages from the requirements.txt file.
5. After that, run the procedure described in the README file of the repository.
## Short Development Explanation
In the Web Intelligence and Data Mining (WIDM) Lab, I was given a task by my advisor about developing several areas of domain knowledge [here](https://sites.google.com/site/nculab/).
| Name | Paper | Code | Status |
| --- | ----- | ---- | --------- |
| CEM | [paper](https://aclanthology.org/2022.acl-long.475.pdf) | [code](https://github.com/Sahandfer/CEM) | tested :heavy_check_mark: |
| CoG-BART | [paper](https://arxiv.org/abs/2112.11202) | [code](https://github.com/whatissimondoing/CoG-BART) | tested :heavy_check_mark: |
| CQG | [paper](https://aclanthology.org/2022.acl-long.475/) | [code](https://github.com/sion-zcfei/CQG) | tested :heavy_check_mark: |
| QCStepByStep | [paper](https://aclanthology.org/2021.acl-long.465/) | [code](https://tinyurl.com/19esunzz) | Progress :clock9: |
| KEMP | [paper](https://arxiv.org/pdf/2009.09708.pdf) | [code](https://github.com/atselousov/transformer_chatbot) | tested :heavy_check_mark: |
In the WIDM Lab there are several research topics, but I am focused on the "Story Chatbot" domain, which consists of Emotion Recognition and Response Generation, so I study the papers below:
1. **CEM**: Commonsense-Aware Empathetic Response Generation
2. **CoG-BART**: Contrast and Generation Make BART a Good Dialogue Emotion Recognizer
3. **CQG**: A Simple and Effective Controlled Generation Framework for Question Generation through Step-by-Step Rewriting
4. **QCStepbyStep** - Guiding the Growth: Difficulty-Controllable Question Generation through Step-by-Step Rewriting
5. **KEMP**: Knowledge Bridging for Empathetic Dialogue Generation
There are several papers that use the EmpatheticDialogues dataset, so I focus on the papers that take it as input:
1. CEM: Commonsense-Aware Empathetic Response Generation
2. KEMP: Knowledge Bridging for Empathetic Dialogue Generation
3. What makes a conversation satisfying and engaging? - An Analysis on Reddit Distress Dialogues
4. Conditional Variational Autoencoders for Emotionally - Aware Chatbot based on Transformers
5. Building Empathetic Transformers-Based Chatbot: Deepening and Widening the Chatting Topic
6. Exploring Role of Interjections in Human Dialogs
7. Question Types and Intents in Human Dialogues
8. Emotion Classification on Empathetic Dialogues using BERT-based Models
9. A Dialogue Dataset containing Emotional Support for People in Distress
CEM: Commonsense-Aware Empathetic Response Generation
---
<a href= "https://aclanthology.org/2022.acl-long.475.pdf">
<img src="http://img.shields.io/badge/Paper-PDF-red.svg" alt="venue"/>
</a>
<a href= "https://github.com/Sahandfer/CEM">
<img src="http://img.shields.io/badge/Code-Github-orange.svg" alt="venue"/>
</a>
A key trait of daily conversation between individuals is the ability to express empathy towards others, and exploring ways to implement empathy is a crucial step towards human-like dialogue systems.
I explain several key points:
1. Previous approaches on this topic mainly focus on detecting and utilizing the user's emotion for generating empathetic responses.
2. Since empathy includes both aspects of affection and cognition, they argue that in addition to identifying the user's emotion, cognitive understanding of the user's situation should also be considered.
3. They propose a novel approach for empathetic response generation, which leverages commonsense to draw more information about the user's situation and uses this additional information to further enhance the empathy expression in generated responses.
4. They evaluate their approach on EmpatheticDialogues, which is a widely-used benchmark dataset for empathetic response generation.
5. Empirical results demonstrate that their approach outperforms the baseline models in both automatic and human evaluation, and can generate more informative and empathetic responses.
Their contributions are summarized as follows:
1. They propose to leverage commonsense to improve the understanding of interlocutors' situations and feelings, which is an important part of cognitive empathy.
2. They introduce CEM, a novel approach that uses various types of commonsense reasoning to enhance empathetic response generation.
3. Automatic and manual evaluation demonstrate that with the addition of commonsense, CEM is able to generate more informative and empathetic responses compared with the previous methods.
### Preliminaries of This Module
1. Empathetic Dialogue Generation
* However, these works usually focus on detecting the context emotion and do not pay enough attention to the cognitive aspect of empathy.
3. Commonsense and Empathy
* They use ATOMIC (Sap et al., 2019) as their commonsense knowledge base.
* ATOMIC is a collection of commonsense reasoning inferences about everyday if-then events.
* For each event, ATOMIC infers six commonsense relations for the person involved in the event:
* The effect of the event on the person (xEffect)
* Their reaction to the event (xReact)
* Their intent before the event (xIntent)
* What they need in order for the event to happen (xNeed)
* What they would want after the event (xWant)
* An inferred attribute of the person's characteristics (xAttr)
4. Task Formulation
* They conduct their experiments on EMPATHETICDIALOGUES (Rashkin et al., 2019), a large-scale multi-turn dataset containing 25k empathetic conversations between crowdsourcing workers.
* The dataset also provides an emotion label for each conversation from a total of 32 available emotions.
* In this dataset, each conversation is between a speaker and a listener.
* The task requires a dialogue model to play the role of the listener and generate empathetic responses.
* $D = [U_{1}, U_{2}, U_{3}, \dots, U_{k-1}]$
* $U_{i} = [w^{i}_{1}, w^{i}_{2}, w^{i}_{3}, \dots, w^{i}_{M_{i}}]$
* Their goal is generating the listener's next utterance $U_{k}$ which is coherent to the context, informative, and empathetic to the speaker's situation and feelings.
### Methodology

Figure 2 illustrates the process of CEM, which is mainly divided into five stages: context encoding, knowledge acquisition, context refinement, knowledge selection, and response generation.
1. Context Encoding
* $H_{CTX} = ENC_{CTX}(E_{C})$
3. Knowledge Acquisition
* $H_{xReact} = Enc_{Aff}(E_{CS_{xReact}})$
* $H_{r} = Enc_{Cog}(E_{CS_{r}})$
* $h_{xReact} = Average(H_{xReact})$
* $h_{r} = H_{r}[0]$
5. Context Refinement
* $H_{Aff} = Enc_{CTX-Aff}(U_{xReact})$
* $H_{Cog,r} = Enc_{CTX - Cog}(U_{r})$
* Emotion Classification
* $h_{Aff} = H_{Aff}[0]$
* $P_{emo} = Softmax(W_{e}h_{Aff})$
* $L_{emo} = -log(P_{emo}(e^*))$
6. Knowledge Selection
* $H_{Cog}[i] = \oplus_{r} H_{Cog, r}[i], \quad r \in \{\text{xWant, xNeed, xIntent, xEffect}\}$
* $H_{Refine}[i] = H_{Aff}[i] \oplus H_{Cog}$
* $\tilde{H}_{CTX} = MLP(\sigma(H_{Refine}) \odot H_{Refine})$
8. Response Generation
* $P(y_{t}|y_{<t}, C) = Dec(E_{y_{<t}}, \tilde{H}_{CTX})$
10. Training Objectives
* $L_{nll} = - \displaystyle\sum^T_{t=1} \log P(y_{t} | C, y_{<t})$
* Response Diversity
* $RF_{i} = \frac{freq(c_{i})}{\displaystyle\sum^V_{j=1} freq(c_{j})}$
* $w_{i} = \alpha \times RF_{i} + 1$
* $L_{div} = -\displaystyle\sum^T_{t=1}\displaystyle\sum^V_{i=1} w_{i}\,\delta(c_{i})\log P(c_{i}|y_{<t}, C)$
* $L = \gamma_{1} L_{nll} + \gamma_{2} L_{emo} + \gamma_{3} L_{div}$
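As a rough sketch of how the three training objectives above could be combined, here is a hedged PyTorch illustration; the $L_{div}$ term is simplified to a frequency-weighted NLL, and all shapes, weights, and $\gamma$ values are assumptions rather than the authors' implementation.
```python
import torch
import torch.nn.functional as F

def cem_total_loss(decoder_logits, target_ids, emo_logits, emo_label,
                   token_weights, gammas=(1.0, 1.0, 1.5)):
    """Weighted sum L = g1*L_nll + g2*L_emo + g3*L_div; values are placeholders."""
    # L_nll: negative log-likelihood of the reference response tokens
    l_nll = F.cross_entropy(decoder_logits.view(-1, decoder_logits.size(-1)),
                            target_ids.view(-1))
    # L_emo: emotion classification loss, -log P_emo(e*)
    l_emo = F.cross_entropy(emo_logits, emo_label)
    # L_div (simplified): frequency-weighted NLL that penalizes generic, high-frequency tokens
    log_p = F.log_softmax(decoder_logits, dim=-1)
    picked = log_p.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)
    l_div = -(token_weights[target_ids] * picked).mean()
    g1, g2, g3 = gammas
    return g1 * l_nll + g2 * l_emo + g3 * l_div

# Toy shapes: batch 2, sequence length 5, vocab 100, 32 emotion classes.
logits = torch.randn(2, 5, 100)
targets = torch.randint(0, 100, (2, 5))
emo_logits, emo_label = torch.randn(2, 32), torch.randint(0, 32, (2,))
weights = torch.ones(100)
print(cem_total_loss(logits, targets, emo_logits, emo_label, weights))
```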
### Experiments
They selected the following baseline models for comparison:
1. Baselines
* [Transformers](https://arxiv.org/pdf/2109.05739.pdf#page=9&zoom=100,55,471)
* [Multi-Task Transformer (Multi-TRS)](https://arxiv.org/pdf/2109.05739.pdf#page=9&zoom=100,55,241)
* [MoEL](https://arxiv.org/pdf/2109.05739.pdf#page=8&zoom=100,409,421)
* [MIME](https://arxiv.org/pdf/2109.05739.pdf#page=8&zoom=100,409,772)
* [EmpDG](https://arxiv.org/pdf/2109.05739.pdf#page=8&zoom=100,409,282)
2. Implementation Details
* They implemented all the models using PyTorch and used 300-dimensional pre-trained GloVe vectors (Pennington et al., 2014) to initialize the word embeddings, which were shared between the encoders and the decoders.
* The hidden dimension for all corresponding components was set to 300.
* The Adam optimizer with $\beta_{1} = 0.9$ and $\beta_{2} = 0.98$ was used for training.
* The initial learning rate was set to 0.0001 and varied during training according to Vaswani et al. (2017), as sketched below.
* All the models were trained on a single TITAN Xp GPU using a batch size of 16 and early stopping.
* They used a batch size of 1 and a maximum of 30 decoding steps during testing and inference.
* They used the same 8:1:1 train/valid/test split as provided by Rashkin et al. (2019).
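A small sketch of the optimizer setup described above, assuming a Vaswani-style warmup/inverse-square-root schedule applied relative to the base learning rate of 0.0001; the stand-in model and the warmup length are assumptions, not the authors' code.
```python
import torch

model = torch.nn.Linear(300, 300)                       # stand-in for the real encoder/decoder
opt = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.98))

def warmup_then_decay(step, warmup=4000):
    """Relative multiplier: linear warmup, then inverse-square-root decay (Vaswani-style)."""
    step = max(step, 1)
    return min(step / warmup, (warmup / step) ** 0.5)

sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda=warmup_then_decay)

for step in range(3):                                   # dummy training steps
    loss = model(torch.randn(16, 300)).pow(2).mean()
    opt.zero_grad(); loss.backward(); opt.step(); sched.step()
    print(sched.get_last_lr())                          # learning rate ramping up during warmup
```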
3. Automatic Evaluation
Table 2. Results of Automatic Evaluation
| Models | PPL | Dist-1 | Dist-2 | Acc(%) |
| ------ | --- | ------ | ------ | ------ |
| Transformer | 37.62 | 0.45 | 2.02 | - |
| Multi-Trs | 37.75 | 0.41 | 1.67 | 33.57 |
| MoEL | 36.93 | 0.44 | 2.10 | 30.62 |
| MIME | 37.09 | 0.47 | 1.90 | 31.36 |
| EmpDG | 37.29 | 0.46 | 2.02 | 30.41 |
| CEM | 36.11 | 0.66 | 2.99 | 39.11 |
| w/o Aff | 36.49 | 0.56 | 2.52 | 33.76 |
| w/o Cog | 36.63 | 0.56 | 2.47 | 36.42 |
| w/o Div | 35.60 | 0.48 | 1.96 | 38.82 |
Table 2 shows the automatic evaluation results. CEM achieves the lowest perplexity, which suggests that the overall quality of its generated responses is higher than the baselines. In addition, the model also considerably outperforms the baselines in terms of Dist-n, which highlights the importance of the diversity loss.
In terms of emotion classification, CEM had a much higher accuracy compared to the baselines, which suggests the addition of commonsense knowledge is also beneficial for detecting the user's emotion.
4. Human Evaluation
This evaluation was conducted via two tasks:
* First, crowdsourcing workers were asked to assign a score from 1 to 5 to the generated responses based on the aspects of fluency, relevance, and empathy.
* Second, they were required to choose the better response between two models within the same context.
Table 3. Human Evaluation Comparison
| Comparisons | Aspects | Win | Loss | K |
| ----------- | ------- | --- | ---- | ---- |
| CEM vs. MoEL | Coh.</br> Emp.</br> Inf.</br>| 53.6</br>52.0</br>61.0</br> | 37.6</br>38.0</br>30.6 | 0.57</br>0.57</br>0.51 |
| CEM vs. MIME | Coh.</br> Emp.</br> Inf.</br>| 52.0</br>50.3</br>48.6</br> | 42.3</br>41.6</br>45.0 | 0.44</br>0.57</br>0.51 |
| CEM vs. EmpDG | Coh.</br> Emp.</br> Inf.</br>| 46.3</br>54.3</br>47.6</br> | 42.6</br>33.3</br>43.3 | 0.52</br>0.51</br>0.41 |
As shown in Table 3, CEM outperforms the baselines in all three aspects. In particular, with the enhancement of commonsense knowledge, the model was able to produce responses that conveyed more specific and informative content and thus were more empathetic.
They also note several observations:
* CEM did not significantly outperform MIME in informativeness.
* We realized that on average, MIME tends to generate longer responses (12.8 words/ response) compared to CEM (9.6 words/ response).
* It is possibly due to some annotators considering these responses as more informative since they included more words.
* However, as shown by the results of the automatic evaluation (Table 2), we can observe that MIME has the second-lowest Dist-2 score, which suggests that its generated responses may follow similar patterns and have less diversity.
5. Ablation Studies
* w/o Aff: the affective & affection-refined encoders are removed
* w/o Cog: the cognitive and cognition-refined encoders are removed.
* w/o Div: the diversity loss is removed from the training objectives
6. Case Study
* The first case
The baselines fail to realize the meaning behind "ready for a puppy", which implies that the user wants to buy or adopt a puppy. MoEL dismisses this implication, while the other two baselines mistake the phrase for being ready for an event or exam, which would make the user proud of themselves. By accessing external knowledge, CEM better acknowledges the user's situation and implied feelings and generates an empathetic response that covers both aspects of empathy: by detecting that the user might be excited and may want to get a dog, it responds with both an affective ("that is great") and a cognitive ("did you get a good dog?") statement.
* The second case
Unlike the baselines, CEM successfully detects that the user is being nostalgic, happy, and sad, where the latter two emotions are likely implied by the word "bittersweet". In addition, CEM realizes that the user's intent behind looking through photos of their children was to reminisce, which suggests that the user enjoys having those memories.
* The third case
This case shows CEM's ability to express both affective and cognitive empathy in multi-turn dialogue. As shown, all the baselines dismiss the user's statement "I was not hit", which implies they are fine and no harm was done. In contrast, CEM correctly recognizes that no harm was done to the user and, despite detecting that the user might feel remorse and guilt, it chooses to focus on the more important part of the situation, which is the user's health and safety.
### Conclusion
These are several statements from the experiments so far:
1. They proposed the Commonsense-aware Empathetic Chatting Machine (CEM) to demonstrate how leveraging commonsense knowledge can benefit the understanding of the user's situation and feelings, which leads to more informative and empathetic responses.
2. Their empirical automatic and manual evaluations indicated the effectiveness of their approach for empathetic response generation.
3. In the future, their work can inspire other approaches to leverage commonsense knowledge for empathetic response generation and similarly promising tasks (e.g., providing emotional support, Liu et al., 2021).
For several weekly reports, you can check it [here](https://docs.google.com/document/d/1k4vFv-BP-KVNSHenKyPCaDA-jreR1GVwPTrOC6Trs_0/edit?usp=share_link).
CoG-BART: Contrast and Generation Make BART a Good Dialogue Emotion Recognizer
---
<a href= "https://arxiv.org/abs/2112.11202">
<img src="http://img.shields.io/badge/Paper-PDF-red.svg" alt="venue"/>
</a>
<a href= "https://github.com/whatissimondoing/CoG-BART">
<img src="http://img.shields.io/badge/Code-Github-orange.svg" alt="venue"/>
</a>
In dialogue systems, utterances with similar semantics may carry distinct emotions in different contexts. Therefore, modeling long-range contextual emotional relationships with speaker dependency plays a crucial part in dialogue emotion recognition. Meanwhile, distinguishing the different emotion categories is non-trivial since they usually have semantically similar sentiments.
I explain several key points:
1. They adopt supervised contrastive learning to make different emotions mutually exclusive, so that similar emotions can be identified better.
2. They utilize an auxiliary response generation task to enhance the model's ability to handle context information, thereby forcing the model to recognize emotions with similar semantics in diverse contexts.
3. They use the pre-trained encoder-decoder model BART as their backbone model since it is well suited to both understanding and generation tasks.
4. The experiments on four datasets demonstrate that their proposed model obtains significantly more favorable results than the state-of-the-art model in dialogue emotion recognition.
5. The ablation study further demonstrates the effectiveness of the supervised contrastive loss and the generative loss.
To summarize, their main contributions can be concluded as follows:
1. To the best of their knowledge, they utilize supervised contrastive learning for the first time in ERC and significantly improve the model's ability to distinguish different sentiments.
2. By incorporating response generation as an auxiliary task, the performance of ERC is improved.
3. Their model is easy to implement since it does not depend on external resources, unlike graph-based methods.
CQG: A Simple and Effective Controlled Generation Framework Question Generation through step by step rewriting
---
<a href= "https://aclanthology.org/2022.acl-long.475/">
<img src="http://img.shields.io/badge/Paper-PDF-red.svg" alt="venue"/>
</a>
<a href= "https://github.com/sion-zcfei/CQG">
<img src="http://img.shields.io/badge/Code-Github-orange.svg" alt="venue"/>
</a>
Multi-hop question generation focuses on generating complex questions that require reasoning over multiple pieces of information in the input passage.
I explain several key points:
1. To address this challenge, they propose CQG, which is a simple and effective controlled framework.
2. CQG employs a simple method to generate multi-hop questions that contain key entities in multi-hop reasoning chains, which ensures the complexity and quality of the questions.
3. They introduce a novel controlled transformer-based decoder to guarantee that key entities appear in the question.
4. Experiment results show that the model greatly improves performance and outperforms the state-of-the-art model by about 25% (5 BLEU points) on HotpotQA.
Guiding the Growth: Difficulty-Controllable Question Generation through Step-by-Step Rewriting
---
<a href= "https://aclanthology.org/2021.acl-long.465/">
<img src="http://img.shields.io/badge/Paper-PDF-red.svg" alt="venue"/>
</a>
<a href= "https://tinyurl.com/19esunzz">
<img src="http://img.shields.io/badge/Code-Github-orange.svg" alt="venue"/>
</a>
This paper explores the task of Difficulty-Controllable Question Generation (DCQG), which aims at generating questions with required difficulty levels.
I explain several key points:
1. In this work, they redefine question difficulty as the number of inference steps required to answer it and argue that Question Generation (QG) systems should have stronger control over the logic of the generated question.
2. To this end, they propose a novel framework that progressively increases question difficulty through step-by-step rewriting under the guidance of an extracted reasoning chain.
3. A dataset is automatically constructed to facilitate the research, on which extensive experiments are conducted to test the performance of the method.
In summary, their contributions are as follows:
1. To the best of our knowledge, this is the first work of difficulty-controllable question generation, with question difficulty defined as the inference steps to answer it.
2. They propose a novel framework that achieves DCQG through step-by-step rewriting under the guidance of an extracted reasoning chain.
3. They build a dataset that can facilitate training of rewriting questions into more complex ones, paired with constructed context graphs and the underlying reasoning chains of the questions.
### Related Work
1. Deep Question Generation
2. Difficulty-Controllable Question Generation
3. Question Rewriting
### Method
1. Context Graph Construction
2. Reasoning Chain Selection
3. Step-by-step Question Generation
> Algorithm 1: Procedure of Our DCQG Framework.
```gherkin=
# Input and Output
Input: context C, difficulty level d
Output: (Q, A)
# Function
Gcg: BuildCG(C)
No: SampleAnswerNode(Gcg)
Gl: MaxTree(Gcg, No)
Gt: Prune(Gl, d)
# Looping Function
for Ni in PreorderTraversal(Gt) do
if i = 0 then continue
Np(i) = Parent(Ni)
Si = ContextSentences(C, Ni, Np(i))
Ri <- Bridge if Ni= FirstChild(Np(i))
<- Intersection else
Qi <- QGinitial(Ni, Np(i), Si) if i = 1
<- QGRewrite(Qi-1, Ni, Np(i), Si, Ri) else
end for
return (Qd, No)
```
> Algorithm 2: Procedure of Data Construction
```gherkin=
# Input and Output
Input: context C= {P1, p2}, QA pair(Q2, A2), supporting facts F
Output: R1, (Q1, A1), S1, S2, {N0, E1, N1, E2, N2}
# Function
R1: TypeClassify(Q2)
# if else condition
if R1 not in {Bridge, Intersection} then return
subq1, subq2 <- DecompQ(Q2)
suba1, suba2 <- QA(subq1), QA(subq2)
Q1, A1 <- subq2, suba2 if A2 = suba2
<- subq1, suba1 else
Q1, A1 <- F intersection P1, F intersection P2 if Q1 concerns P1
<- F intersection P2, F intersection P1 else
N2 <- FindNode(A2)
No, E1, N1, E2 <- (subq1, subq2)
```
5. Automatic Dataset Construction
### Experiments
1. Experimental Setup
a. Datasets
b. Baselines
c. Implementation Details
2. Evaluation of Question Quality
a. Automatic Evaluation
b. Human Evaluation
3. Controllability Analysis
a. Human Evaluation of Controllability
b. Difficulty Assessment with QA Systems
4. Boosting Multi-hop QA Performance
5. More-hop Question Generation
### Conclusion
1. We explored the task of difficulty-controllable question generation, with question difficulty redefined as the number of inference steps required to answer it.
2. A step-by-step generation framework was proposed to accomplish this objective, with an input sampler to extract the reasoning chain, a question generator to produce a simple question, and a question rewriter to further adapt it into a more complex one.
3. A dataset was automatically constructed based on HotpotQA to facilitate the research.
4. Extensive evaluations demonstrated that our method can effectively control the difficulty of the generated questions while keeping high question quality.
Knowledge Bridging for Empathetic Dialogue Generation
---
Lack of external knowledge makes empathetic dialogue systems difficult to perceive implicit emotions and learn emotional interactions from limited dialogue history.

Figure 1. Model Architecture KEMP
I explain several key points:
1. They propose to leverage external knowledge, including commonsense knowledge and emotional lexical knowledge, to explicitly understand and express emotions in empathetic dialogue generation.
2. They first enrich the dialogue history by jointly interacting with external knowledge and construct an emotional context graph.
3. They learn emotional context representations from the knowledge-enriched emotional context graph and distill emotional signals, which are the prerequisites for predicting the emotions expressed in responses.
4. They propose an emotional cross-attention mechanism to learn the emotional dependencies from the emotional context graph.
5. They find the performance of their method can be further improved by integrating with a pre-trained model that works orthogonally.
In summary, their contributions are as follows:
1. They propose KEMP, which is able to accurately perceive and appropriately express implicit emotions (by leveraging external knowledge to enhance empathetic dialogue generation).
2. They design an emotional context encoder and an emotion-dependency decoder to learn the emotional dependencies between the emotion-enhanced representation of the dialogue history and the target response.
3. They conduct extensive experiments and analyses to demonstrate the effectiveness of KEMP.
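To make points 2–3 more concrete for myself, here is a toy illustration of the knowledge-enrichment idea, assuming a ConceptNet-style concept lookup and a VAD-style emotion lexicon passed in as plain dictionaries. The default VAD value and the intensity heuristic are my own assumptions, not KEMP's actual formulas.
```python
# Toy illustration of a knowledge-enriched emotional context graph (my own sketch, not KEMP's code).
from dataclasses import dataclass, field

@dataclass
class GraphNode:
    word: str
    vad: tuple                                   # (valence, arousal, dominance) from an emotion lexicon
    neighbors: list = field(default_factory=list)

def build_emotional_context_graph(tokens, concept_lookup, vad_lexicon):
    """Link each dialogue token to emotion-related concepts, forming a small graph."""
    nodes = {tok: GraphNode(tok, vad_lexicon.get(tok, (0.5, 0.5, 0.5))) for tok in tokens}
    for tok in tokens:
        for concept in concept_lookup.get(tok, []):
            node = nodes.setdefault(concept, GraphNode(concept, vad_lexicon.get(concept, (0.5, 0.5, 0.5))))
            nodes[tok].neighbors.append(node)    # edge: dialogue token -> related concept
    return nodes

def emotional_intensity(node):
    """Distill a scalar emotion signal from VAD values (simple heuristic, not the paper's formula)."""
    valence, arousal, _ = node.vad
    return abs(valence - 0.5) + abs(arousal - 0.5)
```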
An Investigation of Suitability of Pre-Trained Language Models for Dialogue Generation - Avoiding Discrepancies
---
[Paper](https://aclanthology.org/2021.findings-acl.393/)
[Code: transformer_chatbot](https://github.com/atselousov/transformer_chatbot)
[Code: chat_corpus](https://github.com/Marsan-Ma-zz/chat_corpus)
[Code: THRED](https://github.com/nouhadziri/THRED)
[Code: ubuntu-ranking-dataset-creator](https://github.com/rkadlec/ubuntu-ranking-dataset-creator)
[Code: unilm](https://github.com/microsoft/unilm/)
[Code: nlg-eval](https://github.com/Maluuba/nlg-eval)
Pre-trained language models have been widely used in response generation for open-domain dialogue. These approaches are built within four frameworks: Transformer-ED, Transformer-Dec, Transformer-MLM, and Transformer-AR.
I summarize several key points:
1. They experimentally compare them using both large- and small-scale data.
2. This reveals that the decoder-only architecture is better than the stacked encoder-decoder, and both left-to-right and bi-directional attention have their own advantages (see the input-format sketch after the contribution list).
3. They further define two concepts of model discrepancy, which provide a new explanation of model performance.
4. They propose two solutions to reduce them, which successfully improve the model performance.
The contributions of this work are as follows:
1. They compare the four commonly used frameworks that utilize pre-trained language models for open-domain dialogue generation on three public datasets, each in both large and small scale. They analyze each framework based on the experimental results.
2. They introduce the concept of pretrain-finetune discrepancy and finetune-generation discrepancy, and they examine the discrepancies of each framework.
3. They propose two methods to reduce the discrepancies, yielding improved performance. This is the first investigation that explicitly shows the phenomenon of model discrepancy and its impact on performance.
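To visualize the difference between the encoder-decoder and decoder-only setups compared above, here is my own toy illustration of how a context–response pair might be formatted in each case. The special tokens ([SEP], [BOS], [EOS]) are placeholders, not taken from the paper.
```python
# My own illustration of the two broad input formats; token names are hypothetical.

def encoder_decoder_inputs(context_turns, response):
    """Transformer-ED style: context goes to the encoder, response to the decoder."""
    encoder_input = " [SEP] ".join(context_turns)
    decoder_input = "[BOS] " + response
    decoder_target = response + " [EOS]"
    return encoder_input, decoder_input, decoder_target

def decoder_only_inputs(context_turns, response):
    """Transformer-Dec style: one left-to-right sequence; the loss is usually taken on the
    response part only. Reusing the same concatenation at generation time keeps the
    fine-tuning and generation phases aligned."""
    return " [SEP] ".join(context_turns) + " [BOS] " + response + " [EOS]"

print(encoder_decoder_inputs(["hi", "how are you?"], "i am fine"))
print(decoder_only_inputs(["hi", "how are you?"], "i am fine"))
```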
Automatically Select Emotion for Response via Personality-Affected Emotion Transition
---
<a href= "https://aclanthology.org/2021.findings-acl.444/">
<img src="http://img.shields.io/badge/Paper-PDF-red.svg" alt="venue"/>
</a>
<a href= "https://github.com/preke/PELD">
<img src="http://img.shields.io/badge/Code-Github-orange.svg" alt="venue"/>
</a>
To provide consistent emotional interaction with users, dialog systems should be capable of automatically selecting appropriate emotions for responses, as humans do. However, most existing work focuses on rendering specified emotions in responses or responding empathically to the users' emotions, while individual differences in emotion expression are overlooked.
I summarize several key points:
1. To tackle this issue (inconsistent emotional expressions & disinterested users), they propose to equip the dialog system with personality and enable it to automatically select emotions in responses by simulating the emotion transition of humans in conversation.
2. To achieve this (preceding dialog context & personality traits), they first model the emotion transition in the dialog system as the variation between the preceding emotion and the response emotion in the Valence-Arousal-Dominance (VAD) emotion space.
3. Then they design neural networks to encode the preceding dialog context & the specified personality traits to compose the variation (see the sketch below).
4. They construct a dialog dataset with emotion & personality labels and conduct an emotion prediction task for evaluation.
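A minimal sketch of how I picture the transition in points 2–3, assuming PyTorch, a 3-dimensional VAD vector, and a 5-dimensional (Big-Five-style) personality vector. The layer sizes and the additive composition are my assumptions, not the PELD implementation.
```python
import torch
import torch.nn as nn

# Sketch: response emotion = preceding emotion in VAD space + a variation composed
# from the encoded dialog context and the personality traits (my own toy model).

class EmotionTransition(nn.Module):
    def __init__(self, context_dim, personality_dim=5, hidden=64):
        super().__init__()
        self.variation = nn.Sequential(
            nn.Linear(context_dim + personality_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, 3),                # delta in (valence, arousal, dominance)
        )

    def forward(self, preceding_vad, context_vec, personality_vec):
        delta = self.variation(torch.cat([context_vec, personality_vec], dim=-1))
        return preceding_vad + delta             # predicted VAD of the response emotion

# usage with random tensors
model = EmotionTransition(context_dim=32)
response_vad = model(torch.rand(1, 3), torch.rand(1, 32), torch.rand(1, 5))
```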
Tweet Classification to Assist Human Moderation for Suicide Prevention
---
<a href= "https://pubmed.ncbi.nlm.nih.gov/35173997">
<img src="http://img.shields.io/badge/Paper-PDF-red.svg" alt="venue"/>
</a>
<a href= "https://github.com/jacobeisenstein/SAGE">
<img src="http://img.shields.io/badge/Code-Github-orange.svg" alt="venue"/>
</a>
Social media platforms are already leveraging existing online socio-technical systems to deliver just-in-time interventions for suicide prevention to the public. These efforts primarily rely on self-reports. Most recently, platforms have employed automated models to identify self-harm content, but acknowledge the limitations of these automated models.
I summarize several key points:
1. They analyze time-aware neural models that build on these language variants and factor in the historical, emotional spectrum of a user's tweeting activity (see the sketch below).
2. The strongest model achieves high (statistically significant) performance (macro F1=0.804, recall=0.813) in identifying social media posts indicative of suicidal intent.
3. Using three use cases of tweets with phrases common to suicidal intent, they qualitatively analyze and interpret how such models decided whether suicidal intent was present, and discuss how these analyses may be used to alleviate the burden on human moderators within the known constraints of how moderation is performed (e.g., no access to the user's timeline).
4. Finally, they discuss the ethical implications of such data-driven models and inferences about suicidal intent from social media.
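Here is my own rough sketch of what a "time-aware" model in the sense of point 1 could look like, assuming PyTorch. Combining a current-tweet embedding with a GRU summary of historical emotion features, and all dimensions, are my assumptions, not the paper's architecture.
```python
import torch
import torch.nn as nn

# Sketch only: current tweet embedding + recurrent summary of the user's
# historical emotion features -> classification of suicidal intent.

class TimeAwareClassifier(nn.Module):
    def __init__(self, tweet_dim=768, hist_feat_dim=8, hidden=64, num_classes=2):
        super().__init__()
        self.history_encoder = nn.GRU(hist_feat_dim, hidden, batch_first=True)
        self.classifier = nn.Linear(tweet_dim + hidden, num_classes)

    def forward(self, tweet_vec, history_feats):
        # history_feats: (batch, num_past_tweets, hist_feat_dim), ordered by time
        _, last_hidden = self.history_encoder(history_feats)
        combined = torch.cat([tweet_vec, last_hidden.squeeze(0)], dim=-1)
        return self.classifier(combined)

# usage with random tensors
model = TimeAwareClassifier()
logits = model(torch.rand(4, 768), torch.rand(4, 10, 8))
```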
Overview of Audio Visual Scene-Aware Dialog with Reasoning Track for Natural Language Generation in DSTC10
---
<a href= "https://www.merl.com/publications/docs/TR2022-016.pdf">
<img src="http://img.shields.io/badge/Paper-PDF-red.svg" alt="venue"/>
</a>
<a href= "https://github.com/tylin/coco-caption">
<img src="http://img.shields.io/badge/Code-Github-orange.svg" alt="venue"/>
</a>
In these challenges, the best-performing system relied heavily on human-generated descriptions of the video content, which were available in the datasets but would be unavailable in real-world applications.
I summarize several key points:
1. To promote further advancements for real-world applications, a third AVSD challenge is proposed at DSTC10, with two modifications:
* The human-created description is unavailable at inference time
* Systems must demonstrate temporal reasoning by finding evidence from the video to support each answer.
2. This paper introduces the new task that includes temporal reasoning and the new extension of the AVSD dataset for DSTC10, for which human-generated temporal reasoning data were collected.
3. A baseline system was built using an AV-transformer and the new datasets were released for the challenge.
4. Finally this paper reports the challenge results of 12 systems submitted to the AVSD task in DSTC10.
* The two systems using GPT-2-based multimodal transformers achieved the best performance in human rating, BLEU4, and CIDEr (a small scoring sketch follows this list).
* The temporal reasoning performed by those systems has outperformed the baseline method with temporal attention.
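Since the challenge reports BLEU4 and CIDEr, here is a small scoring sketch assuming the pip-installable `pycocoevalcap` port of the linked coco-caption code; the reference/hypothesis data below is made up.
```python
# Sketch: scoring generated answers with BLEU-4 and CIDEr via pycocoevalcap.
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider

# references and hypotheses keyed by question id (toy data)
refs = {"q1": ["a man is playing a guitar in the kitchen"]}
hyps = {"q1": ["a man plays guitar in a kitchen"]}

bleu_scores, _ = Bleu(4).compute_score(refs, hyps)   # BLEU-1 .. BLEU-4
cider_score, _ = Cider().compute_score(refs, hyps)

print("BLEU-4:", bleu_scores[3])
print("CIDEr:", cider_score)
```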
What Makes a Conversation Satisfying and Engaging? - An Analysis on Reddit Distress Dialogues
---
[](https://www.epfl.ch/labs/gr-pu/wp-content/uploads/2022/07/What-Makes-a-Conversation-Satisfying-and-Engaging-An-Analysis-on-Reddit-Distress-Dialogues.pdf)
AI-driven chatbots have gained interest as a way to help people deal with emotional distress and regulate their emotions.
I summarize several key points:
1. One challenge is ensuring that data collected from these platforms contain responses that lead to high engagement and satisfaction and avoid those that lead to dissatisfaction and disengagement.
2. They developed a novel scoring function that measures the level of satisfaction and engagement in distress-oriented conversations.
3. Using this scoring function, they classified dialogues in the Reddit Emotional Distress (RED) dataset as highly satisfying, less satisfying, highly engaging, and less engaging (see the sketch below).
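As a toy illustration of the bucketing in point 3, assuming the scoring function returns separate satisfaction and engagement scores in [0, 1]; the thresholds are placeholders, not the project's actual values.
```python
# Toy sketch of labelling a dialogue on the two axes (my own illustration).

def label_dialogue(satisfaction: float, engagement: float,
                   sat_threshold: float = 0.5, eng_threshold: float = 0.5):
    satisfaction_label = "highly satisfying" if satisfaction >= sat_threshold else "less satisfying"
    engagement_label = "highly engaging" if engagement >= eng_threshold else "less engaging"
    return satisfaction_label, engagement_label

print(label_dialogue(0.8, 0.3))   # ('highly satisfying', 'less engaging')
```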
Conditional Variational Autoencoders for Emotionally-aware Chatbot Based on Transformer
---
[](https://www.epfl.ch/labs/gr-pu/wp-content/uploads/2022/07/Conditional-Variational-Autoencoders-for-Emotionally-aware-Chatbot-Based-on-Transformer.pdf)
Rising demand for artificial-intelligence-powered chatbots with sentiment analysis is creating new growth opportunities in numerous areas.
Building Empathetic Transformer-based Chatbot: Deepening and Widening the Chatting Topic
---
[](https://www.epfl.ch/labs/gr-pu/wp-content/uploads/2022/07/Building-Empathetic-Transformer-based-Chatbot-Deepening-and-Widening-the-Chatting-Topic.pdf)
Human-machine interaction, particularly via dialogue systems, was a popular research area in the past decade.
Exploring Role of Interjections in Human Dialogs
---
[](https://www.epfl.ch/labs/gr-pu/wp-content/uploads/2022/07/Exploring-Role-of-Interjections-in-Human-Dialogs.pdf)
Interjections are words and expressions that people use to communicate sudden reactions, feelings, and emotions.
Question Types and Intents in Human Dialogues
---
[](https://www.epfl.ch/labs/gr-pu/wp-content/uploads/2022/07/Question-Types-and-Intents-in-Human-Dialogues.pdf)
This project proposes to preprocess the existing EmpatheticDialogues dataset, consisting of 25K conversations, into a new version that comprises dialogues containing an empathetic question in the final listener's turn.
Emotion Classification on Empathetic Dialogues using BERT-Based Models
---
[](https://raychen0617.github.io/assets/pdf/Project_Report_team_6.pdf)
This project classifies the emotions expressed in dialogues from the EmpatheticDialogues dataset using BERT-based models.
A Dialogue Dataset Containing Emotional Support for People in Distress
---
[](https://www.epfl.ch/labs/gr-pu/wp-content/uploads/2022/07/A-Dialogue-Dataset-Containing-Emotional-Support-for-People-in-Distress-3.pdf)
Many people suffer from emotional distress due to many reasons, such as significant life change, financial crisis, or various physical and mental health conditions.
Project Timeline
---
```mermaid
gantt
title Progress Training and Development
section WIDM Timeline
Resources Study :2023-01-12, 7d
Paper Study :7d
```
## Appendix and FAQ
:::info
**Find this document incomplete?** Leave a comment!
:::
###### tags: `WIDM` `Common Sense` `Emotion Recognition`