# Applied Deep Learning - YUN-NUNG (VIVIAN) CHEN (2022 Fall)
###### tags: NTU-2022-Fall
## Class info.
[課程資訊](https://www.csie.ntu.edu.tw/~miulab/f111-adl/)
<style>
.red{
color: red;
}
.bigger{
font-size: 30px;
}
</style>
* Intent Classification NTU ADL HW1 (Fall, 2022)
[Intent classification kaggle source code with LSTM](https://www.kaggle.com/code/hakim29/intent-classification-with-lstm)
* [5 Techniques to Prevent Overfitting in Neural Networks](https://www.kdnuggets.com/2019/12/5-techniques-prevent-overfitting-neural-networks.html)
## Date
### 9/8
:::info
Introduction to ML, DL, and how to apply them.
:::
**Learning $\approx$ Looking for a function**
$x$: input, $\hat{y}$: output, $f$: model
Training is to find the best function $f^*$
Testing: $y'=f^*(x')$, where $x'$ is unseen input and the label $y'$ is unknown.
---
End-to-end training: what each function should do is learned automatically
* [Representation learning](https://zhuanlan.zhihu.com/p/136554341)
> **Deep v.s. Shallow**
> Compared with a shallow model, a deep model is easier to train and has a higher performance ceiling, but it requires more data.
<table><tr>
<td><img src="https://i.imgur.com/FZnevG4.png" alt="drawing" width="400"/></td>
<td><img src="https://i.imgur.com/GEwxFZl.png" alt="drawing" width="400"/></td>
</tr>
</table>
Deep Neural Network: cascading the neurons to form a neural network
A neural network is a complex function: $f \ : R^N \rightarrow R^M$
* [Why are neural networks becoming deeper, but not wider ?](https://stats.stackexchange.com/questions/222883/why-are-neural-networks-becoming-deeper-but-not-wider)
---
The learned function $f$ maps the input domain $X$ to the output domain $Y$
$f \ : X \rightarrow Y$
How to Frame the Learning Problem ?
* Input domain
* Output domain
### 9/15
Training Procedure
1. What is the model? (function hypothesis set)
2. What does a “good” function mean? (Loss function design)
3. How do we pick the “best” function? (Optimization)
---
Classification Task: Binary / Multi-class
* $f(x)=y \ \ \ \ f \ : R^N \rightarrow R^M$
|Task|$x$ (input)|$y$ (output)|$R^N$ (input dim)|$R^M$ (output dim)|
|--|--|--|--|--|
|Handwriting Digit Classification|image|class/label|number of pixels|number of digit classes|
|Sentiment Analysis|word|class/label|vocabulary size|positive/negative/neutral|
* Why Bias ?
<table><tr>
<td><img src="https://i.imgur.com/h54PB7p.png" alt="drawing" width="400"/></td>
<td><img src="https://i.imgur.com/QhNhpqz.png" alt="drawing" width="400"/></td>
</tr>
</table>
* Model Parameter of A Single Neuron
$y=h_{w,b}(x)=\sigma(w^{T}x+b), \ \ \sigma(z)=\frac{1}{1+e^{-z}}$ (a small numpy sketch follows the links below)
* [Perceptron](https://medium.com/jameslearningnote/%E8%B3%87%E6%96%99%E5%88%86%E6%9E%90-%E6%A9%9F%E5%99%A8%E5%AD%B8%E7%BF%92-%E7%AC%AC3-2%E8%AC%9B-%E7%B7%9A%E6%80%A7%E5%88%86%E9%A1%9E-%E6%84%9F%E7%9F%A5%E5%99%A8-perceptron-%E4%BB%8B%E7%B4%B9-84d8b809f866)
* [Multi-Layer Perceptron](https://chih-sheng-huang821.medium.com/%E6%A9%9F%E5%99%A8%E5%AD%B8%E7%BF%92-%E7%A5%9E%E7%B6%93%E7%B6%B2%E8%B7%AF-%E5%A4%9A%E5%B1%A4%E6%84%9F%E7%9F%A5%E6%A9%9F-multilayer-perceptron-mlp-%E9%81%8B%E4%BD%9C%E6%96%B9%E5%BC%8F-f0e108e8b9af)
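A minimal numpy sketch of the single sigmoid neuron above; the vectors `x`, `w` and the bias value are made-up toy numbers, not from the course material:
``` python
import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b):
    # y = h_{w,b}(x) = sigma(w^T x + b)
    return sigmoid(np.dot(w, x) + b)

# toy example: 3-dimensional input
x = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, -0.25, 0.1])
b = 0.2
print(neuron(x, w, b))  # a value in (0, 1)
```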
:::info
**Multi-Layer Perceptron v.s. Linear Regression**
* [Geeksforgeeks](https://www.geeksforgeeks.org/difference-between-multilayer-perceptron-and-linear-regression/)
* [Stack Exchange](https://cs.stackexchange.com/questions/28597/difference-between-multilayer-perceptron-and-linear-regression)
**Multi-Layer Perceptron v.s. Logistic Regression**
* [Stack Exchange](https://stats.stackexchange.com/questions/162257/whats-the-difference-between-logistic-regression-and-perceptron)
:::
> Multiple layers enhance the model's expressiveness
→ The model can approximate more complex functions
(but it needs more data to build the model)
Deep Neural Networks (DNN): Fully connected [feedforward network](https://ithelp.ithome.com.tw/articles/10194255)
* Notation Definition
<table><tr>
<td><img src="https://i.imgur.com/pgpNayS.png" alt="drawing" width="400"/></td>
<td><img src="https://i.imgur.com/TKhEdwH.png" alt="drawing" width="400"/></td>
</tr>
<tr>
<td><img src="https://i.imgur.com/WmUBZAZ.png" alt="drawing" width="400"/></td>
<td><img src="https://i.imgur.com/YgupRIK.png" alt="drawing" width="400"/></td>
</tr>
</table>
* Notation Summary
|Symbol|Definition|Symbol|Definition|
|--|--|--|--|
|$a_i^{l}$|output of a neuron|$W^l$|a weight matrix|
|$a^l$|output vector of a layer|$b_i^l$|a bias|
|$z_i^l$|input of activation function|$b^l$|a bias vector|
|$w_{i j}^l$|a weight|||
$a^l=\sigma(z^l)=\sigma(W^la^{l-1}+b^l)$
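To make the layer-wise formula concrete, here is a hedged numpy sketch of the forward pass $a^l=\sigma(W^la^{l-1}+b^l)$; the 4→3→2 architecture and the sigmoid activation are arbitrary choices for illustration:
``` python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    # a^0 = x; a^l = sigma(W^l a^{l-1} + b^l) for each layer l
    a = x
    for W, b in zip(weights, biases):
        z = W @ a + b
        a = sigmoid(z)
    return a

rng = np.random.default_rng(0)
# assumed toy architecture: 4 -> 3 -> 2, i.e. f : R^4 -> R^2
weights = [rng.normal(size=(3, 4)), rng.normal(size=(2, 3))]
biases  = [rng.normal(size=3), rng.normal(size=2)]
print(forward(rng.normal(size=4), weights, biases))
```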
* Activation function
Commonly used ones: Sigmoid, Tanh, ReLU

**Without non-linearity**, a deep neural network collapses into a single linear transform
$W_1(W_2x)=(W_1W_2)x=Wx$
**With non-linearity**, networks with more layers can approximate more complex functions
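A quick numpy check of the "without non-linearity" claim above, with made-up matrix shapes:
``` python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(2, 3))
W2 = rng.normal(size=(3, 4))
x  = rng.normal(size=4)

# two stacked linear layers ...
two_layers = W1 @ (W2 @ x)
# ... equal one linear layer with W = W1 W2
one_layer  = (W1 @ W2) @ x
print(np.allclose(two_layers, one_layer))  # True
```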
---
* Model Parameters
Formal definition: $f(x;\theta) \ \ \ \ \theta=\{W^1,b^1,W^2,b^2,\ldots,W^L,b^L\}$
Define a function to measure the quality of a parameter set $\theta$ : Loss/cost/error function $C(\theta)$
- If $C(\theta)$ measures how bad $\theta$ is:
Best model parameter set: $\theta^{*}=\arg \min_{\theta}C(\theta)$
- If $C(\theta)$ measures how good $\theta$ is:
Best model parameter set: $\theta^{*}=\arg \max_{\theta}C(\theta)$
The better the trained function is, the closer its output is to the target: $f(x;\theta) \sim \hat{y} \Longrightarrow \|\hat{y}-f(x;\theta)\|\approx 0$
Define an example loss function: $C(\theta)=\sum_{k}\|\hat{y}_k-f(x_k;\theta)\|$
* Common Loss Functions
|Name|Function|When to use ?|type|
|--|--|--|--|
|Square loss|$C(\theta)=(1-\hat{y}f(x;\theta))^2$|[Link](https://ithelp.ithome.com.tw/articles/10218158)|regression problem|
|Hinge loss|$C(\theta)=\max(0,1-\hat{y}f(x;\theta))$|[Link](https://blog.csdn.net/hustqb/article/details/78347713)|binary classification|
|Logistic loss|$C(\theta)=-\hat{y}\log(f(x;\theta))$|[Link](https://linuxhint.com/logistic-regression-pytorch/)|binary classification|
|Cross entropy loss|$C(\theta)=-\sum\hat{y}\log(f(x;\theta))$|[Link](https://ithelp.ithome.com.tw/articles/10218158)|classification|
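A small numpy sketch of the cross-entropy loss from the table, applied to a softmax output so that $f(x;\theta)$ is a valid distribution (the softmax step and the toy numbers are assumptions, not from the course):
``` python
import numpy as np

def softmax(z):
    z = z - np.max(z)          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(y_hat, f_x):
    # C(theta) = -sum_k y_hat_k * log f_k(x; theta)
    return -np.sum(y_hat * np.log(f_x + 1e-12))

logits = np.array([2.0, 0.5, -1.0])   # raw network outputs
y_hat  = np.array([1.0, 0.0, 0.0])    # one-hot ground truth
print(cross_entropy(y_hat, softmax(logits)))
```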
---
Find a model parameter set that minimizes $C(\theta)$: find $\theta$ such that $\frac{\partial{C(\theta)}}{\partial{\theta}} = 0$
**Gradient Descent** updates the parameters iteratively: $\theta^{i+1} \leftarrow \theta^{i} - \eta \nabla_\theta C(\theta^i)$
Randomly start at $\theta^0$, compute $\nabla_\theta C(\theta^0)$, then update $\theta^{1} \leftarrow \theta^0 -\eta \nabla_\theta C(\theta^0)$, and repeat.
The cost function can be, for example, the square error loss.
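A minimal gradient-descent sketch on a 1-D square-error cost $C(\theta)=(\theta-3)^2$; the target value 3 and the learning rate are made up for illustration:
``` python
import numpy as np

# cost: C(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3)
def grad_C(theta):
    return 2.0 * (theta - 3.0)

theta = np.random.default_rng(0).normal()  # random start theta^0
eta = 0.1                                  # learning rate
for i in range(100):
    theta = theta - eta * grad_C(theta)    # theta^{i+1} <- theta^i - eta * dC/dtheta
print(theta)  # converges toward the minimizer theta* = 3
```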
* Stochastic Gradient Descent (SGD)

**<span class="red">It can make training faster, since parameters are updated after each example instead of after the whole dataset</span>**
* one epoch: see all training samples once
|GD|SGD|Mini-Batch SGD|
|--|--|--|
|Update after seeing all examples|If there are 20 examples, update 20 times in one epoch.|Pick a set of B training samples as a batch *b*|
|$\theta^{i+1}=\theta^{i}-\eta\frac{1}{K}\sum_{k}\nabla C_k(\theta^i)$|$\theta^{i+1}=\theta^{i}-\eta\nabla C_k(\theta^i)$|$\theta^{i+1}=\theta^{i}-\eta\frac{1}{B}\sum_{x_k \in b}\nabla C_k(\theta^i)$|
> Why is SGD slower than mini-batch SGD?
A: Modern computers run matrix-matrix multiplication faster than matrix-vector multiplication.
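A hedged sketch of the mini-batch update rule on a toy linear-regression problem (data and model are invented); setting `B = 1` gives SGD and `B = len(X)` gives full-batch GD:
``` python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))             # 100 samples, 3 features
y = X @ np.array([1.0, -2.0, 0.5])        # assumed "true" linear targets
theta, eta, B = np.zeros(3), 0.1, 20      # parameters, learning rate, batch size

for epoch in range(50):
    idx = rng.permutation(len(X))          # shuffle once per epoch
    for start in range(0, len(X), B):
        b = idx[start:start + B]           # indices of the current batch
        err = X[b] @ theta - y[b]
        grad = X[b].T @ err / len(b)       # average gradient over the batch
        theta -= eta * grad                # theta^{i+1} <- theta^i - eta * (1/B) sum grad
print(theta)  # approaches [1, -2, 0.5]
```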

* Overfitting
Solutions: more training data (or data augmentation), dropout, regularization, early stopping, ...
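A minimal sketch of (inverted) dropout at training time, one of the standard techniques (see also the overfitting link near the top of these notes); the keep probability 0.8 is an arbitrary choice:
``` python
import numpy as np

def dropout(a, keep_prob=0.8, training=True, rng=np.random.default_rng(0)):
    # randomly zero activations during training and rescale by 1/keep_prob,
    # so the expected activation stays the same at test time
    if not training:
        return a
    mask = rng.random(a.shape) < keep_prob
    return a * mask / keep_prob

a = np.ones(10)
print(dropout(a))                   # some entries zeroed, survivors scaled to 1.25
print(dropout(a, training=False))   # unchanged at test time
```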
---
<span class="bigger">$\Delta w \rightarrow \Delta x \rightarrow \Delta y \rightarrow \Delta z$</span>
<span class="bigger">$\frac{\partial z}{\partial w} = \frac{\partial z}{\partial y} \frac{\partial y}{\partial x} \frac{\partial x}{\partial w}$</span>
[NN from scratch](https://github.com/lionelmessi6410/Neural-Networks-from-Scratch)

``` python
import numpy as np

# Method of the NN class in the linked repo; it assumes the forward pass has
# filled self.params (weights/biases) and self.cache (X, Z1, A1), with columns
# corresponding to samples.
def back_propagate(self, y, output):
    current_batch_size = y.shape[0]
    # output layer: gradient w.r.t. Z2 (softmax + cross-entropy in the repo)
    dZ2 = output - y.T
    dW2 = (1./current_batch_size) * np.matmul(dZ2, self.cache["A1"].T)
    db2 = (1./current_batch_size) * np.sum(dZ2, axis=1, keepdims=True)
    # propagate the error back through W2 and the hidden activation
    dA1 = np.matmul(self.params["W2"].T, dZ2)
    dZ1 = dA1 * self.activation(self.cache["Z1"], derivative=True)
    dW1 = (1./current_batch_size) * np.matmul(dZ1, self.cache["X"])
    db1 = (1./current_batch_size) * np.sum(dZ1, axis=1, keepdims=True)
    self.grads = {"W1": dW1, "b1": db1, "W2": dW2, "b2": db2}
    return self.grads
```



## Reference
[YT Playlist](https://www.youtube.com/watch?v=rrw0IIEVEUo&list=PLOAQYZPRn2V5yumEV1Wa4JvRiDluf83vn)