# How Neural Network Architectures Work

## Table of Contents

- Part1: Flowchart of how a neural network operates
- Part2: Derivation of the Perceptron forward formulas
- Part3: Derivation of the Backpropagation formulas

<br><br>

## Part1: Flowchart of How a Neural Network Operates

### I. Training a Neural Network: Picking a Math Competition Representative as an Example

<center> <img src="https://hackmd.io/_uploads/By7j43FjC.png", style=" width: 90%; height: auto;"> <div style=" border-bottom: 3px solid #d9d9d9; display: inline-block; color: #999; padding: 3px;"> </div> </center>

### II. Training a Neural Network: The Actual Training Flow

<center> <img src="https://hackmd.io/_uploads/rJccwL7Nye.png", style=" width: 90%; height: auto;"> <div style=" border-bottom: 3px solid #d9d9d9; display: inline-block; color: #999; padding: 3px;"> </div> </center>

:::success
**Key notes:** example dataset splits
- Training set vs. test set: 7:3 or 8:2
- Training set vs. test set vs. validation set: 7:1.5:1.5
:::

### III. Hand-Coding a Neural Network

**A hand-coded neural network needs 4 functions**

- Main function
- Initialize: generates the initial weights of the neural network
- Forward: computes the neural network from input to output
- Backward: updates every parameter in the neural network

<center> <img src="https://hackmd.io/_uploads/r1gTOy5o0.png", style=" width: 90%; height: auto;"> <div style=" border-bottom: 3px solid #d9d9d9; display: inline-block; color: #999; padding: 3px;"> </div> </center>

#### The Mathematics Behind Backpropagation

- It is one method from optimization theory.
- The optimization concept: in practice the optimal solution is hard to find, because different dimensions interact with each other and the search is easily trapped by a Local Minimum.

<center> <img src="https://hackmd.io/_uploads/rypNskcjA.png", style=" width: 60%; height: auto;"> <div style=" border-bottom: 3px solid #d9d9d9; display: inline-block; color: #999; padding: 3px;"> </div> </center>

- How backpropagation searches for the optimum: move in the direction opposite to the slope of the tangent line, which leads to a local optimum.

<center> <img src="https://hackmd.io/_uploads/BJRkhJqoR.png", style=" width: 60%; height: auto;"> <div style=" border-bottom: 3px solid #d9d9d9; display: inline-block; color: #999; padding: 3px;"> </div> </center>

- Why do we generate and train many neural networks at once and then pick the best one?
  Because the performance of a Neural Network depends heavily on its initial weights. A poor initialization falls into a Local Minimum.

<center> <img src="https://hackmd.io/_uploads/BJ4oT1cj0.png", style=" width: 60%; height: auto;"> <div style=" border-bottom: 3px solid #d9d9d9; display: inline-block; color: #999; padding: 3px;"> </div> </center>

:::success
**Key notes**:
- Viewed one variable at a time, the goal of backpropagation is to move that variable to the position where the loss is lowest.
- Randomize the initialization several times to see whether one run can land near the Global Minimum.
:::

- A training choice: if the budget is only 100,000 training iterations, which approach works better?
  1. One NN trained 100,000 times ( X )
  2. 10,000 NNs trained 10 times each ( O ) (a better chance of finding the Global Minimum)
- Why does Deep Learning not need to train as many times? Because it trades that for a larger solution space.
- Supplement: a method aimed at finding the global optimum, the genetic algorithm (it searches globally, though without a strict guarantee in a finite number of generations)

<center> <img src="https://hackmd.io/_uploads/S1R0yx5oR.png", style=" width: 90%; height: auto;"> <div style=" border-bottom: 3px solid #d9d9d9; display: inline-block; color: #999; padding: 3px;"> </div> </center>

- Supplement: how this is applied in deep learning

<center> <img src="https://hackmd.io/_uploads/r1QI-g5jR.png", style=" width: 70%; height: auto;"> <div style=" border-bottom: 3px solid #d9d9d9; display: inline-block; color: #999; padding: 3px;"> </div> </center>

## Part2: Derivation of the Perceptron Forward Formulas

**The formulas below are written in two forms:**

1. What the program actually computes. (marked with $\circledast$)
2. The formal mathematical notation used in written documents. (marked with $\circledcirc$)

<center> <img src="https://hackmd.io/_uploads/S1AuPcPqR.png", style=" width: 70%; height: auto;"> <div style=" border-bottom: 3px solid #d9d9d9; display: inline-block; color: #999; padding: 3px;"> </div> </center>

\begin{equation}
P_{Layer}
\end{equation}

\begin{align}
\circledast & \quad o_{p_{1}} = x_1, \quad o_{p_{2}} = x_2 \\
\\
\circledcirc & \quad \mathbf{o_{p}} = \begin{bmatrix} o_{p_{1}} \\ o_{p_{2}} \end{bmatrix} = \begin{bmatrix} x_{1} \\ x_{2} \end{bmatrix}
\end{align}

<center> <img src="https://hackmd.io/_uploads/r1jnx9w9A.png", style=" width: 50%; height: auto;"> <div style=" border-bottom: 3px solid #d9d9d9; display: inline-block; color: #999; padding: 3px;"> </div> </center>

:::success
**Key notes**:
- The $P$ layer has neither $\sum$ nor $f$; the input layer usually just passes the variables through.
:::

\begin{equation}
Q_{Layer}
\end{equation}

- First through $\sum$:

\begin{align}
\circledast & \quad h_{q_{1}} = w^1_{11} o_{p_{1}} + w^1_{12} o_{p_{2}} + b_1 \\
& \quad h_{q_{2}} = w^1_{21} o_{p_{1}} + w^1_{22} o_{p_{2}} + b_2 \\
& \quad h_{q_{3}} = w^1_{31} o_{p_{1}} + w^1_{32} o_{p_{2}} + b_3 \\
\\
\circledcirc & \quad \mathbf{h_q} = \mathbf{w^1} \mathbf{o_p} + \mathbf{b}
\end{align}

- Then through $f$:

\begin{align}
\circledast & \quad o_{q_{1}} = f(h_{q_{1}}) = \dfrac{e^{h_{q_{1}}} - e^{-h_{q_{1}}}}{e^{h_{q_{1}}} + e^{-h_{q_{1}}}} \\
& \quad o_{q_{2}} = f(h_{q_{2}}) = \dfrac{e^{h_{q_{2}}} - e^{-h_{q_{2}}}}{e^{h_{q_{2}}} + e^{-h_{q_{2}}}} \\
& \quad o_{q_{3}} = f(h_{q_{3}}) = \dfrac{e^{h_{q_{3}}} - e^{-h_{q_{3}}}}{e^{h_{q_{3}}} + e^{-h_{q_{3}}}} \\
\\
\circledcirc & \quad \mathbf{o_{q}} = \dfrac{e^{\mathbf{h}_{q}} - e^{-\mathbf{h}_{q}}}{e^{\mathbf{h}_{q}} + e^{-\mathbf{h}_{q}}}
\end{align}

:::success
**Key notes**:
- The $Q$ layer is also called the hidden layer: first $\sum$, then $f$ (Tangent Sigmoid, i.e. $\tanh$).
:::

<br>

\begin{equation}
R_{Layer}
\end{equation}

\begin{align}
\circledast & \quad o_r = h_r = w^2_{11} o_{q_{1}} + w^2_{12} o_{q_{2}} + w^2_{13} o_{q_{3}} + b \\
\circledcirc & \quad o_{r} = \mathbf{w^2} \mathbf{o_{q}} + b = \begin{bmatrix} w^2_{11} & w^2_{12} & w^2_{13} \end{bmatrix} \begin{bmatrix} o_{q_{1}} \\ o_{q_{2}} \\ o_{q_{3}} \end{bmatrix} + b
\end{align}

:::success
**Key notes**:
- A linear sum of arbitrarily many tangent sigmoids has the Universal approximation property.
- In the $R$ layer, $b$ is the single output-layer bias (a scalar), not the hidden-layer biases $b_1, b_2, b_3$.
:::

:::info
**Notes**: conventions for typesetting parameters
- Variables: italic.
- Matrices and vectors: bold.
- Numbers: upright.
:::

<br><br>

## Part3: Derivation of the Backpropagation Formulas

### I. Basic Concepts

\begin{equation}
v(t+1) = v(t) - \xi \frac{\partial E}{\partial v}
\end{equation}

<center> <img src="https://hackmd.io/_uploads/H1ipbgqjR.png", style=" width: 50%; height: auto;"> <div style=" border-bottom: 3px solid #d9d9d9; display: inline-block; color: #999; padding: 3px;"> </div> </center>

<center>

| $v(t)$ | $-$ | $\partial$ | $\xi$ |
| :--: | :--: | :--: | :--: |
| Starting point | Move opposite to the derivative | Derivative | Step size: how far to move each time |

</center>

- About $\xi$:
  1. A static $\xi$ is bad for training (the step size never changes).

<center> <img src="https://hackmd.io/_uploads/ryThAGciA.png", style=" width: 100%; height: auto;"> <div style=" border-bottom: 3px solid #d9d9d9; display: inline-block; color: #999; padding: 3px;"> </div> </center>

  2. How to set $\xi$: the values range from -1 to +1, a span of 2.

:::success
**Key notes**: suppose there are 100 questions (samples) and training runs for 10 Epochs $\to$ $100 \times 10$ updates, so an ideal $\xi = \frac{2}{100 \times 10}$ (the span of 2 divided by the number of updates).
:::

  3. A dynamic $\xi$: the momentum approach (large steps first, then small ones). This is one of the core ideas of Adam, currently among the best training methods.

<center> <img src="https://hackmd.io/_uploads/H1YGbQ5iA.png", style=" width: 50%; height: auto;"> <div style=" border-bottom: 3px solid #d9d9d9; display: inline-block; color: #999; padding: 3px;"> </div> </center>

<br>

### II. Loss Function

\begin{align}
E &= \frac12 error^2 \\
error &= y^d - y^{act} = y^d - o_r \\
& \quad (\mathrm{d: desired,\ act: actual})
\end{align}

:::success
**Key notes**:
- $error^2$: removes the sign of $error$.
- $\frac{1}{2}$: cancels the factor $2$ produced when the square is differentiated.
- $y^{d}-y^{act}$: the difference between the desired output and the actual network output.
:::

### III. Forward Formulas

\begin{align}
\circledcirc & \quad P : \mathbf{o_{p}} = \begin{bmatrix} o_{p_{1}} \\ o_{p_{2}} \end{bmatrix} = \begin{bmatrix} x_{1} \\ x_{2} \end{bmatrix} \\
\circledcirc & \quad Q : \mathbf{h_q} = \mathbf{w^1} \mathbf{o_p} + \mathbf{b}, \quad \mathbf{o_q} = \dfrac{e^{\mathbf{h}_{q}} - e^{-\mathbf{h}_{q}}}{e^{\mathbf{h}_{q}} + e^{-\mathbf{h}_{q}}} \\
\circledcirc & \quad R : o_{r} = \mathbf{w^2} \mathbf{o_{q}} + b = \begin{bmatrix} w^2_{11} & w^2_{12} & w^2_{13} \end{bmatrix} \begin{bmatrix} o_{q_{1}} \\ o_{q_{2}} \\ o_{q_{3}} \end{bmatrix} + b
\end{align}

### IV. Backward Formulas: deriving $w^2$, $b$, $w^1$ in turn

A Perceptron has two kinds of parameters to adjust, $w$ and $b$, so each is derived separately.

:::info
**Notes**:
\begin{equation}
v(t+1) = v(t) - \xi \frac{\partial E}{\partial v}
\end{equation}
\begin{equation}
E = \frac12 error^2
\end{equation}
\begin{align}
error &= y^d - y^{act} = y^d - o_r \\
&\quad (\mathrm{d: desired,\ act: actual})
\end{align}
:::

#### Update $w^2$

\begin{align}
w^2(t+1) &= w^2(t) - \xi\frac{\partial E}{\partial w^2} \\
\text{where} \; \frac{\partial E}{\partial w^2} &= \frac{\partial E}{\partial error} \cdot \frac{\partial error}{\partial y^{act}} \cdot \frac{\partial y^{act}}{\partial w^2} \\
&= error \cdot (-1) \cdot o_q \\
\therefore w^2(t+1) &= w^2(t) + \xi \cdot error \cdot o_q
\end{align}

- Individual update formula for $w^2$ (taking $w^2_{11}$ as the example; the same applies to $w^2_{12}$ and $w^2_{13}$):

\begin{align}
w^2_{11}(t+1) &= w^2_{11}(t) + \xi \cdot error \cdot o_{q_1} \\
\frac{\partial y^{act}}{\partial w^2_{11}} &= \frac{\partial (w^2_{11} o_{q_1} + w^2_{12} o_{q_2} + w^2_{13} o_{q_3} + b)}{\partial w^2_{11}} \\
&= \frac{\partial (w^2_{11} o_{q_1})}{\partial w^2_{11}} = o_{q_1}
\end{align}

<center> <img src="https://hackmd.io/_uploads/Hy04jwXEke.png", style=" width: 50%; height: auto;"> <div style=" border-bottom: 3px solid #d9d9d9; display: inline-block; color: #999; padding: 3px;"> </div> </center>

:::success
**Key notes**:
- Only the terms that the parameter itself affects matter; every other term differentiates to zero.
:::

\begin{align}
w^2_{12}(t+1) &= w^2_{12}(t) + \xi \cdot error \cdot o_{q_2} \\
w^2_{13}(t+1) &= w^2_{13}(t) + \xi \cdot error \cdot o_{q_3}
\end{align}

:::success
**Key notes**:
- Range of the $error$ term that $\xi$ multiplies: $(-1) \sim (+1)$
- Range of $o_q$: $(-1) \sim (+1)$
:::

<br>

#### Update $b$ (hidden-layer bias)

\begin{align}
b(t+1) &= b(t) - \xi \frac{\partial E}{\partial b} \\
\frac{\partial E}{\partial b} &= \frac{\partial E}{\partial error} \cdot \frac{\partial error}{\partial y^{act}} \cdot \frac{\partial y^{act}}{\partial o_q} \cdot \frac{\partial o_q}{\partial h_q} \cdot \frac{\partial h_q}{\partial b} \\
&= error \cdot (-1) \cdot w^2 \cdot \frac{4}{(e^{h_q}+e^{-h_q})^2} \cdot 1 \\
\therefore b(t+1) &= b(t) + \xi \cdot error \cdot w^2 \cdot \frac{4}{(e^{h_q}+e^{-h_q})^2}
\end{align}

- Taking $b_2$ as the example, the individual update formula is:

$$
b_2(t+1) = b_2(t) + \xi \cdot error \cdot w^2_{12} \cdot \frac{4}{(e^{h_{q_2}}+e^{-h_{q_2}})^2}
$$

<center> <img src="https://hackmd.io/_uploads/BJiQhPQ41g.png", style=" width: 50%; height: auto;"> <div style=" border-bottom: 3px solid #d9d9d9; display: inline-block; color: #999; padding: 3px;"> </div> </center>

#### Update $w^1$

\begin{align}
w^1(t+1) &= w^1(t) - \xi\frac{\partial E}{\partial w^1} \\
\frac{\partial E}{\partial w^1} &= \frac{\partial E}{\partial error} \cdot \frac{\partial error}{\partial y^{act}} \cdot \frac{\partial y^{act}}{\partial o_q} \cdot \frac{\partial o_q}{\partial h_q} \cdot \frac{\partial h_q}{\partial w^1} \\
&= error \cdot (-1) \cdot w^2 \cdot \frac{4}{(e^{h_q} + e^{-h_q})^2} \cdot o_p \\
\therefore w^1(t+1) &= w^1(t) + \xi \cdot error \cdot w^2 \cdot \frac{4}{(e^{h_q} + e^{-h_q})^2} \cdot o_{p}
\end{align}

- Taking $w^1_{32}$ as the example, the individual update formula is:

\begin{equation}
w^1_{32}(t+1) = w^1_{32}(t) + \xi \cdot error \cdot w^2_{13} \cdot \frac{4}{(e^{h_{q_3}} + e^{-h_{q_3}})^2} \cdot o_{p_2}
\end{equation}

<center> <img src="https://hackmd.io/_uploads/HkccpwX4Jg.png", style=" width: 50%; height: auto;"> <div style=" border-bottom: 3px solid #d9d9d9; display: inline-block; color: #999; padding: 3px;"> </div> </center>

:::success
**Key notes**: when hand-coding the neural network, repeated computations such as $\xi \cdot error$ can be stored in a new variable.
- Detail: back up all parameters before the update. Whenever an update formula uses one of these parameters, use the backed-up copy.
- Reason: otherwise the code would use the already-updated value $w^2_{12}(t+1)$ where the formula calls for $w^2_{12}(t)$.
:::

The same procedure applies to a parameter that feeds the output directly, for example a single Perceptron $y^{act} = w x + b$ (this also covers the output-layer bias $b$):

- First derive the update formula for $b$:

\begin{align}
b(t+1) &= b(t) - \xi \frac{\partial E}{\partial b} \\
\text{where} \; \frac{\partial E}{\partial b} &= \frac{\partial E}{\partial error} \frac{\partial error}{\partial y^{act}} \frac{\partial y^{act}}{\partial b} \\
&= error \cdot (-1) \cdot 1 = -error \\
\therefore b(t+1) &= b(t) + \xi \cdot error
\end{align}

- Then derive the update formula for $w$:

\begin{align}
w(t+1) &= w(t) - \xi \frac{\partial E}{\partial w} \\
\text{where} \; \frac{\partial E}{\partial w} &= \frac{\partial E}{\partial error} \frac{\partial error}{\partial y^{act}} \frac{\partial y^{act}}{\partial w} \\
&= error \cdot (-1) \cdot x = -x \cdot error \\
\therefore w(t+1) &= w(t) + \xi \cdot x \cdot error
\end{align}

:::info
**Notes**:
- $E$ is the objective function; the smaller its value, the better.
- $\frac{1}{2}$ is there to cancel the factor produced by the partial derivative.
- Chain rule for partial derivatives: how $c$ affects $E$
\begin{align}
E = y(a), \quad a &= f(b), \quad b = g(c) \\
\\
\frac{\partial E}{\partial c} &= \frac{\partial E}{\partial a} \frac{\partial a}{\partial b} \frac{\partial b}{\partial c}
\end{align}
:::

<br>

### V. Discussion of Backpropagation

#### Discussion 1: Multiple outputs: backpropagation works the same way; find every error the parameter affects and sum the contributions

<center> <img src="https://hackmd.io/_uploads/rkT3jt7i0.png", style=" width: 50%; height: auto;"> <div style=" border-bottom: 3px solid #d9d9d9; display: inline-block; color: #999; padding: 3px;"> </div> </center>

- **Case 1**: $w^2_{24}$ (affects only $y^{act}_2$ and $error_2$)

\begin{align}
w^2(t+1) &= w^2(t) + \xi \cdot error \cdot o_q \\
\rightarrow w^2_{24}(t+1) &= w^2_{24}(t) + \xi \cdot error_2 \cdot o_{q_4}
\end{align}

- **Case 2**: $w^1_{32}$ (affects both $error_1$ and $error_2$)

\begin{align}
w^1(t+1) &= w^1(t) + \xi \cdot error \cdot w^2 \cdot \frac{4}{(e^{h_q} + e^{-h_q})^2} \cdot o_{p} \\
\rightarrow w^1_{32}(t+1) &= w^1_{32}(t) + \xi \cdot error_1 \cdot w^2_{13} \cdot \frac{4}{(e^{h_{q_3}} + e^{-h_{q_3}})^2} \cdot o_{p_2} \\
&\quad + \xi \cdot error_2 \cdot w^2_{23} \cdot \frac{4}{(e^{h_{q_3}} + e^{-h_{q_3}})^2} \cdot o_{p_2}
\end{align}

:::success
**Key notes**: with multiple outputs, a single parameter update may receive several $error$ terms whose correction directions conflict (for example one positive and one negative), so the parameter cannot be trained to the best value for either output.
:::

<br>

#### Discussion 2: Multiple outputs, second version

<center> <img src="https://hackmd.io/_uploads/H1LARD7V1l.png", style=" width: 70%; height: auto;"> <div style=" border-bottom: 3px solid #d9d9d9; display: inline-block; color: #999; padding: 3px;"> </div> </center>

**Question 1: what if there are two or more errors?**

- The error function becomes $E = \frac{1}{2} error_{1}^2 + \frac{1}{2} error_{2}^2$.
- Therefore $\frac{\partial E}{\partial v} = error_{1} \frac{\partial error_{1}}{\partial v} + error_{2} \frac{\partial error_{2}}{\partial v}$. In other words, the update formula is the sum of two terms.

<center> <img src="https://hackmd.io/_uploads/SJ2uU07N1g.png", style=" width: 70%; height: auto;"> <div style=" border-bottom: 3px solid #d9d9d9; display: inline-block; color: #999; padding: 3px;"> </div> </center>

#### Discussion 3: Vanishing gradients

- During an update, the further back toward the input the $error$ is propagated, the smaller its influence becomes (those parameters barely move).
- Checking whether the hand-coded neural network has learned incorrectly:

<center> <img src="https://hackmd.io/_uploads/SyHqeqQoA.png", style=" width: 90%; height: auto;"> <div style=" border-bottom: 3px solid #d9d9d9; display: inline-block; color: #999; padding: 3px;"> </div> </center>

<br><br>
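
#### Supplement: a hand-coded sketch of the 2-3-1 network

The block below is a minimal sketch, not the course's reference implementation. It hand-codes the Initialize / Forward / Backward functions for the 2-3-1 network of Part2 and Part3, using $\tanh'(h) = \frac{4}{(e^{h}+e^{-h})^2}$ for the hidden layer. The function names, the dictionary keys (`b1` for the hidden biases, `b2` for the output bias), the initial range of -1 to +1, and the sample values are all assumptions made for illustration. The numerical gradient check is one common way to test a hand-coded Backward and is not necessarily the method shown in the figure above.

```python
import copy
import math
import random


def initialize(seed=0):
    """Random initial parameters for the 2-3-1 network (range is an assumption)."""
    rng = random.Random(seed)
    return {
        "w1": [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(3)],  # w^1, 3x2
        "b1": [rng.uniform(-1, 1) for _ in range(3)],                      # b_1..b_3
        "w2": [rng.uniform(-1, 1) for _ in range(3)],                      # w^2, 1x3
        "b2": rng.uniform(-1, 1),                                          # output bias
    }


def forward(p, x):
    """P layer passes x through, Q layer is sum + tanh, R layer is linear."""
    h_q = [sum(w * xi for w, xi in zip(row, x)) + b
           for row, b in zip(p["w1"], p["b1"])]
    o_q = [math.tanh(h) for h in h_q]
    o_r = sum(w * o for w, o in zip(p["w2"], o_q)) + p["b2"]
    return h_q, o_q, o_r


def loss(p, x, y_d):
    """E = 1/2 * error^2 with error = y_d - o_r."""
    return 0.5 * (y_d - forward(p, x)[2]) ** 2


def gradients(p, x, y_d):
    """dE/dparam from the chain rule, one entry per parameter."""
    h_q, o_q, o_r = forward(p, x)
    error = y_d - o_r
    d_tanh = [4.0 / (math.exp(h) + math.exp(-h)) ** 2 for h in h_q]
    # Shared factor error * (-1) * w^2_k * tanh'(h_qk), reused by b1 and w1.
    delta_q = [-error * w * d for w, d in zip(p["w2"], d_tanh)]
    return {
        "w2": [-error * o for o in o_q],
        "b2": -error,
        "b1": delta_q,
        "w1": [[dq * xi for xi in x] for dq in delta_q],
    }


def backward(p, x, y_d, xi=0.1):
    """One update v(t+1) = v(t) - xi * dE/dv using only time-t values."""
    g = gradients(p, x, y_d)  # computed before any parameter changes
    return {
        "w1": [[w - xi * gw for w, gw in zip(row, grow)]
               for row, grow in zip(p["w1"], g["w1"])],
        "b1": [b - xi * gb for b, gb in zip(p["b1"], g["b1"])],
        "w2": [w - xi * gw for w, gw in zip(p["w2"], g["w2"])],
        "b2": p["b2"] - xi * g["b2"],
    }


def numeric_grad_w1(p, x, y_d, i, j, eps=1e-6):
    """Central-difference estimate of dE/dw^1 at 0-based row i, column j."""
    plus, minus = copy.deepcopy(p), copy.deepcopy(p)
    plus["w1"][i][j] += eps
    minus["w1"][i][j] -= eps
    return (loss(plus, x, y_d) - loss(minus, x, y_d)) / (2 * eps)


if __name__ == "__main__":
    params = initialize()
    x, y_d = [0.5, -0.3], 0.8          # made-up sample, for illustration only
    analytic = gradients(params, x, y_d)["w1"][2][1]   # dE/dw^1_32
    numeric = numeric_grad_w1(params, x, y_d, 2, 1)
    print("gradient check difference:", abs(analytic - numeric))
    print("loss before:", loss(params, x, y_d))
    params = backward(params, x, y_d, xi=0.01)
    print("loss after one update:", loss(params, x, y_d))
```

Because `backward` builds every new value from gradients computed on the old parameters and returns a fresh dictionary, it follows the "back up before updating" rule without a separate copy step. If the analytic and numerical gradients disagree by much more than rounding error, the chain-rule code (for example a sign or a denominator) is the first place to look.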