# Applied Deep Learning - YUN-NUNG (VIVIAN) CHEN (2022 Fall)
###### tags: NTU-2022-Fall
## Class info.
[課程資訊](https://www.csie.ntu.edu.tw/~miulab/f111-adl/)
<style>
.red{
color: red;
}
.bigger{
font-size: 30px;
}
</style>
* Intent Classification NTU ADL HW1 (Fall, 2022)
[Intent classification kaggle source code with LSTM](https://www.kaggle.com/code/hakim29/intent-classification-with-lstm)
* [5 Techniques to Prevent Overfitting in Neural Networks](https://www.kdnuggets.com/2019/12/5-techniques-prevent-overfitting-neural-networks.html)
## Date
### 9/8
:::info
Introduction to ML, DL, and how to apply them.
:::
**Learning $\approx$ Looking for a function**
$x$: input, $\hat{y}$: output, $f$: model
Training is to find the best function $f^*$
Testing: $y'=f^*(x')$, where $x'$ is unseen input and the label $y'$ is unknown.
---
End-to-end training: what each function should do is learned automatically
* [Representation learning](https://zhuanlan.zhihu.com/p/136554341)
> **Deep v.s. Shallow**
> Compared with a shallow model, a deep model is easier to train and has a higher performance ceiling, but it requires more data.
<table><tr>
<td><img src="https://i.imgur.com/FZnevG4.png" alt="drawing" width="400"/></td>
<td><img src="https://i.imgur.com/GEwxFZl.png" alt="drawing" width="400"/></td>
</tr>
</table>
Deep Neural Network: cascading the neurons to form a neural network
A neural network is a complex function: $f \ : R^N \rightarrow R^M$
* [Why are neural networks becoming deeper, but not wider ?](https://stats.stackexchange.com/questions/222883/why-are-neural-networks-becoming-deeper-but-not-wider)
---
The learned function $f$ maps the input domain $X$ to the output domain $Y$
$f \ : X \rightarrow Y$
How to Frame the Learning Problem ?
* Input domain
* Output domain
### 9/15
Training Procedure
1. What is the model? (function hypothesis set)
2. What does a “good” function mean? (Loss function design)
3. How do we pick the “best” function? (Optimization)
---
Classification Task: Binary / Multi-class
* $f(x)=y \ \ \ \ f \ : R^N \rightarrow R^M$
|Task|$x$ (input)|$y$ (output)|$R^N$ (input dim)|$R^M$ (output dim)|
|--|--|--|--|--|
|Handwriting Digit Classification|image|class/label|number of pixels|number of digit classes|
|Sentiment Analysis|word|class/label|vocabulary size|positive/negative/neutral|
* Why Bias ?
<table><tr>
<td><img src="https://i.imgur.com/h54PB7p.png" alt="drawing" width="400"/></td>
<td><img src="https://i.imgur.com/QhNhpqz.png" alt="drawing" width="400"/></td>
</tr>
</table>
* Model Parameter of A Single Neuron
$y=h_{w,b}(x)=\sigma(w^{T}x+b), \ \ \sigma(z)=\frac{1}{1+e^{-z}}$ (a small numpy sketch follows the links below)
* [Perceptron](https://medium.com/jameslearningnote/%E8%B3%87%E6%96%99%E5%88%86%E6%9E%90-%E6%A9%9F%E5%99%A8%E5%AD%B8%E7%BF%92-%E7%AC%AC3-2%E8%AC%9B-%E7%B7%9A%E6%80%A7%E5%88%86%E9%A1%9E-%E6%84%9F%E7%9F%A5%E5%99%A8-perceptron-%E4%BB%8B%E7%B4%B9-84d8b809f866)
* [Multi-Layer Perceptron](https://chih-sheng-huang821.medium.com/%E6%A9%9F%E5%99%A8%E5%AD%B8%E7%BF%92-%E7%A5%9E%E7%B6%93%E7%B6%B2%E8%B7%AF-%E5%A4%9A%E5%B1%A4%E6%84%9F%E7%9F%A5%E6%A9%9F-multilayer-perceptron-mlp-%E9%81%8B%E4%BD%9C%E6%96%B9%E5%BC%8F-f0e108e8b9af)
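A minimal numpy sketch of the single sigmoid neuron above; the vectors `x`, `w` and the bias value are made-up toy numbers, not from the course material:
``` python
import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b):
    # y = h_{w,b}(x) = sigma(w^T x + b)
    return sigmoid(np.dot(w, x) + b)

# toy example: 3-dimensional input
x = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, -0.25, 0.1])
b = 0.2
print(neuron(x, w, b))  # a value in (0, 1)
```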
:::info
**Multi-Layer Perceptron v.s. Linear Regression**
* [Geeksforgeeks](https://www.geeksforgeeks.org/difference-between-multilayer-perceptron-and-linear-regression/)
* [Stack Exchange](https://cs.stackexchange.com/questions/28597/difference-between-multilayer-perceptron-and-linear-regression)
**Multi-Layer Perceptron v.s. Logistic Regression**
* [Stack Exchange](https://stats.stackexchange.com/questions/162257/whats-the-difference-between-logistic-regression-and-perceptron)
:::
> Multiple layers enhance the model's expressiveness
→ The model can approximate more complex functions
(but it needs more data to build the model)
Deep Neural Networks (DNN): Fully connected [feedforward network](https://ithelp.ithome.com.tw/articles/10194255)
* Notation Definition
<table><tr>
<td><img src="https://i.imgur.com/pgpNayS.png" alt="drawing" width="400"/></td>
<td><img src="https://i.imgur.com/TKhEdwH.png" alt="drawing" width="400"/></td>
</tr>
<tr>
<td><img src="https://i.imgur.com/WmUBZAZ.png" alt="drawing" width="400"/></td>
<td><img src="https://i.imgur.com/YgupRIK.png" alt="drawing" width="400"/></td>
</tr>
</table>
* Notation Summary
|Symbol|Definition|Symbol|Definition|
|--|--|--|--|
|$a_i^{l}$|output of a neuron|$W^l$|a weight matrix|
|$a^l$|output vector of a layer|$b_i^l$|a bias|
|$z_i^l$|input of activation function|$b^l$|a bias vector|
|$w_{i j}^l$|a weight|||
$a^l=\sigma(z^l)=\sigma(W^la^{l-1}+b^l)$
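To make the layer-wise formula concrete, here is a hedged numpy sketch of the forward pass $a^l=\sigma(W^la^{l-1}+b^l)$; the 4→3→2 architecture and the sigmoid activation are arbitrary choices for illustration:
``` python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    # a^0 = x; a^l = sigma(W^l a^{l-1} + b^l) for each layer l
    a = x
    for W, b in zip(weights, biases):
        z = W @ a + b
        a = sigmoid(z)
    return a

rng = np.random.default_rng(0)
# assumed toy architecture: 4 -> 3 -> 2, i.e. f : R^4 -> R^2
weights = [rng.normal(size=(3, 4)), rng.normal(size=(2, 3))]
biases  = [rng.normal(size=3), rng.normal(size=2)]
print(forward(rng.normal(size=4), weights, biases))
```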
* Activation function
Commonly used ones: Sigmoid, Tanh, ReLU

**Without non-linearity**, a deep neural network collapses into a single linear transform
$W_1(W_2x)=(W_1W_2)x=Wx$
**With non-linearity**, networks with more layers can approximate more complex functions
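A quick numpy check of the "without non-linearity" claim above, with made-up matrix shapes:
``` python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(2, 3))
W2 = rng.normal(size=(3, 4))
x  = rng.normal(size=4)

# two stacked linear layers ...
two_layers = W1 @ (W2 @ x)
# ... equal one linear layer with W = W1 W2
one_layer  = (W1 @ W2) @ x
print(np.allclose(two_layers, one_layer))  # True
```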
---
* Model Parameters
Formal definition: $f(x;\theta) \ \ \ \ \theta=\{W^1,b^1,W^2,b^2,\ldots,W^L,b^L\}$
Define a function to measure the quality of a parameter set $\theta$ : Loss/cost/error function $C(\theta)$
- If $C(\theta)$ measures how bad $\theta$ is:
Best model parameter set: $\theta^{*}=\arg \min_{\theta}C(\theta)$
- If $C(\theta)$ measures how good $\theta$ is:
Best model parameter set: $\theta^{*}=\arg \max_{\theta}C(\theta)$
The better the trained function is, the closer its output is to the target: $f(x;\theta) \sim \hat{y} \Longrightarrow \|\hat{y}-f(x;\theta)\|\approx 0$
Define an example loss function: $C(\theta)=\sum_{k}\|\hat{y}_k-f(x_k;\theta)\|$
* Common Loss Functions
|Name|Function|When to use ?|type|
|--|--|--|--|
|Square loss|$C(\theta)=(1-\hat{y}f(x;\theta))^2$|[Link](https://ithelp.ithome.com.tw/articles/10218158)|regression problem|
|Hinge loss|$C(\theta)=\max(0,1-\hat{y}f(x;\theta))$|[Link](https://blog.csdn.net/hustqb/article/details/78347713)|binary classification|
|Logistic loss|$C(\theta)=-\hat{y}\log(f(x;\theta))$|[Link](https://linuxhint.com/logistic-regression-pytorch/)|binary classification|
|Cross entropy loss|$C(\theta)=-\sum\hat{y}\log(f(x;\theta))$|[Link](https://ithelp.ithome.com.tw/articles/10218158)|classification|
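A small numpy sketch of the cross-entropy loss from the table, applied to a softmax output so that $f(x;\theta)$ is a valid distribution (the softmax step and the toy numbers are assumptions, not from the course):
``` python
import numpy as np

def softmax(z):
    z = z - np.max(z)          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(y_hat, f_x):
    # C(theta) = -sum_k y_hat_k * log f_k(x; theta)
    return -np.sum(y_hat * np.log(f_x + 1e-12))

logits = np.array([2.0, 0.5, -1.0])   # raw network outputs
y_hat  = np.array([1.0, 0.0, 0.0])    # one-hot ground truth
print(cross_entropy(y_hat, softmax(logits)))
```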
---
Find a model parameter set that minimizes $C(\theta)$: find $\theta$ such that $\frac{\partial{C(\theta)}}{\partial{\theta}} = 0$
**Gradient Descent** updates the parameters iteratively: $\theta^{i+1} \leftarrow \theta^{i} - \eta \nabla_\theta C(\theta^i)$
Randomly start at $\theta^0$, compute $\nabla_\theta C(\theta^0)$, then update $\theta^{1} \leftarrow \theta^0 -\eta \nabla_\theta C(\theta^0)$, and repeat.
The cost function can be, for example, the square error loss.
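A minimal gradient-descent sketch on a 1-D square-error cost $C(\theta)=(\theta-3)^2$; the target value 3 and the learning rate are made up for illustration:
``` python
import numpy as np

# cost: C(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3)
def grad_C(theta):
    return 2.0 * (theta - 3.0)

theta = np.random.default_rng(0).normal()  # random start theta^0
eta = 0.1                                  # learning rate
for i in range(100):
    theta = theta - eta * grad_C(theta)    # theta^{i+1} <- theta^i - eta * dC/dtheta
print(theta)  # converges toward the minimizer theta* = 3
```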
* Stochastic Gradient Descent (SGD)

**<span class="red">It can make training faster, since parameters are updated after each example instead of after the whole dataset</span>**
* one epoch: see all training samples once
|GD|SGD|Mini-Batch SGD|
|--|--|--|
|Update after seeing all examples|If there are 20 examples, update 20 times in one epoch.|Pick a set of B training samples as a batch *b*|
|$\theta^{i+1}=\theta^{i}-\eta\frac{1}{K}\sum_{k}\nabla C_k(\theta^i)$|$\theta^{i+1}=\theta^{i}-\eta\nabla C_k(\theta^i)$|$\theta^{i+1}=\theta^{i}-\eta\frac{1}{B}\sum_{x_k \in b}\nabla C_k(\theta^i)$|
> Why is SGD slower than mini-batch SGD?
A: Modern computers run matrix-matrix multiplication faster than matrix-vector multiplication.
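A hedged sketch of the mini-batch update rule on a toy linear-regression problem (data and model are invented); setting `B = 1` gives SGD and `B = len(X)` gives full-batch GD:
``` python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))             # 100 samples, 3 features
y = X @ np.array([1.0, -2.0, 0.5])        # assumed "true" linear targets
theta, eta, B = np.zeros(3), 0.1, 20      # parameters, learning rate, batch size

for epoch in range(50):
    idx = rng.permutation(len(X))          # shuffle once per epoch
    for start in range(0, len(X), B):
        b = idx[start:start + B]           # indices of the current batch
        err = X[b] @ theta - y[b]
        grad = X[b].T @ err / len(b)       # average gradient over the batch
        theta -= eta * grad                # theta^{i+1} <- theta^i - eta * (1/B) sum grad
print(theta)  # approaches [1, -2, 0.5]
```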

* Overfitting
Solutions: more training data (or data augmentation), dropout, regularization, early stopping, ...
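A minimal sketch of (inverted) dropout at training time, one of the standard techniques (see also the overfitting link near the top of these notes); the keep probability 0.8 is an arbitrary choice:
``` python
import numpy as np

def dropout(a, keep_prob=0.8, training=True, rng=np.random.default_rng(0)):
    # randomly zero activations during training and rescale by 1/keep_prob,
    # so the expected activation stays the same at test time
    if not training:
        return a
    mask = rng.random(a.shape) < keep_prob
    return a * mask / keep_prob

a = np.ones(10)
print(dropout(a))                   # some entries zeroed, survivors scaled to 1.25
print(dropout(a, training=False))   # unchanged at test time
```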
---
<span class="bigger">$\Delta w \rightarrow \Delta x \rightarrow \Delta y \rightarrow \Delta z$</span>
<span class="bigger">$\frac{\partial z}{\partial w} = \frac{\partial z}{\partial y} \frac{\partial y}{\partial x} \frac{\partial x}{\partial w}$</span>
[NN from scratch](https://github.com/lionelmessi6410/Neural-Networks-from-Scratch)

``` python
import numpy as np

# Method of the NN class in the linked repo; it assumes the forward pass has
# filled self.params (weights/biases) and self.cache (X, Z1, A1), with columns
# corresponding to samples.
def back_propagate(self, y, output):
    current_batch_size = y.shape[0]
    # output layer: gradient w.r.t. Z2 (softmax + cross-entropy in the repo)
    dZ2 = output - y.T
    dW2 = (1./current_batch_size) * np.matmul(dZ2, self.cache["A1"].T)
    db2 = (1./current_batch_size) * np.sum(dZ2, axis=1, keepdims=True)
    # propagate the error back through W2 and the hidden activation
    dA1 = np.matmul(self.params["W2"].T, dZ2)
    dZ1 = dA1 * self.activation(self.cache["Z1"], derivative=True)
    dW1 = (1./current_batch_size) * np.matmul(dZ1, self.cache["X"])
    db1 = (1./current_batch_size) * np.sum(dZ1, axis=1, keepdims=True)
    self.grads = {"W1": dW1, "b1": db1, "W2": dW2, "b2": db2}
    return self.grads
```



## Reference
[YT Playlist](https://www.youtube.com/watch?v=rrw0IIEVEUo&list=PLOAQYZPRn2V5yumEV1Wa4JvRiDluf83vn)