# mid algorithms

```
class: algorithms
```

-------

### Vanishing and Exploding Gradients

Preface: I studied gradient descent in last semester's algorithms and data analysis course, and that is where I first encountered vanishing and exploding gradients. In short, their root causes are deep neural networks and backpropagation. A deep network stacks many nonlinear layers, so it can be viewed as a composite nonlinear multivariate function:

$$y = f_N(f_{N-1}(\dots f_2(f_1(x)) \dots))$$

Training seeks the best approximation $g(x)$, which should minimize $\text{Loss} = L(g(x), f(x))$; a simple choice is the squared loss:

$$\text{Loss} = \lVert g(x) - f(x) \rVert_2^2$$

The loss function looks roughly like the figure below:

![v2-d09ff67c849168b4db6db3201e8edef8_720w](https://hackmd.io/_uploads/ByWPQkRXa.png)

## Vanishing & Exploding Gradients

These gradient problems usually arise for two reasons: first, the network is deep; second, an unsuitable activation function is used.

### From the perspective of network depth

![v2-a49d6d008278e9b45a7c9db4c661319f_720w](https://hackmd.io/_uploads/H1Ml4JCQT.png)

The figure shows a four-layer fully connected network. The post-activation output of layer $i$ is $f_i(x)$, where $x$ denotes that layer's input, i.e. the output of layer $i-1$, and $f$ is the activation function. Then $f_{i+1} = f(w_{i+1} f_i + b_{i+1})$, or, dropping the bias for simplicity, $f_{i+1} = f(w_{i+1} f_i)$.

The BP (backpropagation) algorithm is based on gradient descent: parameters are adjusted along the negative gradient of the objective.

Gradient descent rule: the parameter update is $\theta = \theta - \alpha \cdot \nabla J(\theta)$, where $\theta$ is the parameter, $\alpha$ the learning rate, and $\nabla J(\theta)$ the gradient of the loss $J(\theta)$ with respect to $\theta$.

Update for the second hidden layer's weights: let the second hidden layer's weight matrix be $W^{(2)}$, the loss be $J$, and the layer's output be $a^{(2)}$; the gradient of the weights is $\frac{\partial J}{\partial W^{(2)}}$, and the update is $W^{(2)} = W^{(2)} - \alpha \cdot \frac{\partial J}{\partial W^{(2)}}$.

Chain rule: if $z = f(g(x))$, then $\frac{\partial z}{\partial x} = \frac{\partial f}{\partial g} \cdot \frac{\partial g}{\partial x}$.

By the chain rule, the gradient reaching an early layer is a product of one factor per later layer. As layers are added, if each factor is greater than 1 the computed gradient update grows exponentially, i.e. the gradient explodes.

> Demo code:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(a):
    # a is the sigmoid *output*, so sigmoid'(z) = a * (1 - a)
    return a * (1 - a)

def update_weights(weights, learning_rate, gradient):
    # Gradient descent step: theta = theta - alpha * grad
    return weights - learning_rate * gradient

def forward_propagation(inputs, weights):
    return sigmoid(np.dot(inputs, weights))

def backward_propagation(error, layer_input, layer_output, weights):
    # Chain rule: delta = upstream error * local sigmoid derivative
    delta = error * sigmoid_derivative(layer_output)
    # Gradient of the loss w.r.t. this layer's weight matrix
    weights_gradient = np.outer(layer_input, delta)
    # Error propagated back to the previous layer, through the weights
    previous_error = np.dot(weights, delta)
    return previous_error, weights_gradient

def train_neural_network(inputs, targets, learning_rate, epochs):
    input_size = len(inputs[0])
    hidden1_size = 4
    hidden2_size = 3
    output_size = len(targets[0])

    # Initialize weights randomly
    weights_input_hidden1 = np.random.rand(input_size, hidden1_size)
    weights_hidden1_hidden2 = np.random.rand(hidden1_size, hidden2_size)
    weights_hidden2_output = np.random.rand(hidden2_size, output_size)

    for epoch in range(epochs):
        for i in range(len(inputs)):
            # Forward propagation
            hidden1_output = forward_propagation(inputs[i], weights_input_hidden1)
            hidden2_output = forward_propagation(hidden1_output, weights_hidden1_hidden2)
            final_output = forward_propagation(hidden2_output, weights_hidden2_output)

            # Backward propagation (error = dLoss/d_output for squared loss)
            error = final_output - targets[i]
            error, grad_hidden2_output = backward_propagation(
                error, hidden2_output, final_output, weights_hidden2_output)
            error, grad_hidden1_hidden2 = backward_propagation(
                error, hidden1_output, hidden2_output, weights_hidden1_hidden2)
            _, grad_input_hidden1 = backward_propagation(
                error, inputs[i], hidden1_output, weights_input_hidden1)

            # Update weights
            weights_input_hidden1 = update_weights(weights_input_hidden1, learning_rate, grad_input_hidden1)
            weights_hidden1_hidden2 = update_weights(weights_hidden1_hidden2, learning_rate, grad_hidden1_hidden2)
            weights_hidden2_output = update_weights(weights_hidden2_output, learning_rate, grad_hidden2_output)

    return weights_input_hidden1, weights_hidden1_hidden2, weights_hidden2_output

# Example usage: learn XOR
inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
targets = np.array([[0], [1], [1], [0]])
learning_rate = 0.1
epochs = 10000

trained_weights = train_neural_network(inputs, targets, learning_rate, epochs)
print("Trained weights:")
print("Weights Input-Hidden1:\n", trained_weights[0])
print("Weights Hidden1-Hidden2:\n", trained_weights[1])
print("Weights Hidden2-Output:\n", trained_weights[2])
```

Conversely, if each factor in the product is less than 1, then as the number of layers grows the gradient update information decays exponentially, i.e. the gradient vanishes.
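Before the full training demo, here is a minimal sketch that isolates the multiplicative mechanism (the weight scales, depths, and the `gradient_scale` helper are illustrative assumptions, not from the original derivation). It evaluates the best case $\sigma'(z) = 0.25$, reached at $z = 0$, so real sigmoid networks shrink gradients at least this fast:

```python
# Per the chain rule, the gradient reaching the first layer of an n-layer
# sigmoid network is a product of n factors of the form w_i * sigmoid'(z_i).
# sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z)) peaks at 0.25 (at z = 0), so
# we evaluate that best case to isolate the effect of the weight scale.
SIGMOID_PRIME_MAX = 0.25

def gradient_scale(w, depth):
    """Best-case magnitude of the backpropagated gradient after `depth` layers."""
    return (abs(w) * SIGMOID_PRIME_MAX) ** depth

for w in (0.5, 4.0, 8.0):       # illustrative weight scales
    for depth in (2, 10, 30):   # illustrative depths
        print(f"|w|={w:4.1f}, depth={depth:2d}: "
              f"gradient scaled by {gradient_scale(w, depth):.3e}")
```

With $|w| < 4$ the product decays toward zero (vanishing); with $|w| > 4$, in this best case, it can grow without bound (exploding).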
The following demo code trains a deeper sigmoid network and plots the magnitude of the gradient that reaches the first layer at each training step:

```python
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(a):
    # a is the sigmoid *output*, so sigmoid'(z) = a * (1 - a)
    return a * (1 - a)

def forward_propagation(inputs, weights):
    return sigmoid(np.dot(inputs, weights))

def backward_propagation(error, layer_input, layer_output, weights):
    # Chain rule: delta = upstream error * local sigmoid derivative
    delta = error * sigmoid_derivative(layer_output)
    weights_gradient = np.outer(layer_input, delta)
    # Error propagated back to the previous layer, through the weights
    previous_error = np.dot(weights, delta)
    return previous_error, weights_gradient

def train_neural_network(inputs, targets, learning_rate, epochs, layer_sizes):
    # Initialize weights randomly
    weights = [np.random.rand(layer_sizes[i], layer_sizes[i + 1])
               for i in range(len(layer_sizes) - 1)]

    gradients_magnitude = []
    for epoch in range(epochs):
        for i in range(len(inputs)):
            # Forward propagation, keeping every layer's output
            layer_outputs = [inputs[i]]
            for j in range(len(layer_sizes) - 1):
                layer_outputs.append(forward_propagation(layer_outputs[j], weights[j]))

            # Backward propagation, from the output layer toward the input
            error = targets[i] - layer_outputs[-1]
            for j in range(len(layer_sizes) - 2, -1, -1):
                error, weights_gradient = backward_propagation(
                    error, layer_outputs[j], layer_outputs[j + 1], weights[j])
                weights[j] += learning_rate * weights_gradient

            # Record how large the gradient is by the time it reaches layer 1
            gradients_magnitude.append(np.linalg.norm(error))

    return gradients_magnitude

# Example usage
inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
targets = np.array([[0], [1], [1], [0]])
learning_rate = 0.1
epochs = 10000
layer_sizes = [2, 50, 50, 1]  # Number of neurons in each layer

gradients = train_neural_network(inputs, targets, learning_rate, epochs, layer_sizes)

# Plot gradient magnitude over training steps
plt.plot(gradients)
plt.title('Gradient Magnitude Over Training Steps')
plt.xlabel('Training Step')
plt.ylabel('Gradient Magnitude')
plt.show()
```

The curves below show weight-update speed. For a network with two hidden layers, hidden layer 1 (the one closer to the input) updates more slowly than hidden layer 2:

![v2-f6b9e851de6b876cb6f2cab65bd60b75_720w](https://hackmd.io/_uploads/Hk-tXeR7T.png)

With four hidden layers the effect is even more pronounced:

![v2-dffdfc852ee891e6f11ae068efa5737f_720w](https://hackmd.io/_uploads/rJUnXlAma.png)

## Solutions to Vanishing and Exploding Gradients

#### Pre-training and fine-tuning

This method comes from Hinton's 2006 paper, which proposed unsupervised layer-wise training. The basic idea is to train one layer of hidden nodes at a time: the output of the previous hidden layer serves as this layer's training input, and this layer's output in turn serves as the input of the next hidden layer. This is layer-wise pre-training. (https://paperswithcode.com/paper/reducing-the-dimensionality-of-data-with)

#### batch norm

Batchnorm is one of the important achievements in the development of deep learning. It speeds up network convergence and improves training stability, and in essence it addresses the gradient problems of backpropagation. The weights $w$ appear in every factor of the backpropagated gradient chain, so the magnitude of $w$ drives both vanishing and explosion. By normalizing each layer's output to a consistent mean and variance, batchnorm removes the amplifying or shrinking effect that $w$ would otherwise introduce (see the first sketch after the references).

#### Residual structure

The skip connection gives each block the form $y = x + F(x)$, so its local derivative $1 + F'(x)$ always contains an identity term and the backpropagated gradient cannot be multiplied down to zero (see the second sketch after the references).

![image](https://hackmd.io/_uploads/BJnOz4Cmp.png)

#### LSTM

The LSTM's gated cell state gives the gradient a path across many time steps that avoids repeated squashing activations, which mitigates vanishing gradients in recurrent networks.

![image](https://hackmd.io/_uploads/rJBCGV0XT.png)

## References

- 《新手村逃脫!初心者的python機器學習攻略》 (book)
- https://paperswithcode.com/paper/reducing-the-dimensionality-of-data-with (paper)
- ChatGPT (code drafting)
- https://zh.wikipedia.org/zh-tw/%E5%8F%8D%E5%90%91%E4%BC%A0%E6%92%AD%E7%AE%97%E6%B3%95 (explanation of the BP algorithm)
- https://www.cupoy.com/qa/club/ai_tw/0000016D6BA22D97000000016375706F795F72656C656173654B5741535354434C5542/0000017BAC14A4DE000000116375706F795F72656C656173655155455354 (explanation of vanishing and exploding gradients)
- https://zh.wikipedia.org/zh-tw/%E9%9D%9E%E7%B7%9A%E6%80%A7%E7%B3%BB%E7%B5%B1 (nonlinear systems)
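As referenced in the batch norm section above, here is a minimal sketch of the normalization idea (the `batch_norm` helper, batch shape, and weight scales are illustrative assumptions of mine, not any particular library's API): no matter how large the weight scale is, the normalized layer output keeps the same mean and variance, so the weight magnitude stops feeding into the gradient chain.

```python
import numpy as np

def batch_norm(z, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a batch of pre-activations to zero mean / unit variance
    per feature, then apply the learnable scale (gamma) and shift (beta)."""
    z_hat = (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + eps)
    return gamma * z_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 4))          # a batch of 32 four-dimensional inputs
for w_scale in (0.1, 1.0, 100.0):     # illustrative weight scales
    z = x @ (w_scale * rng.normal(size=(4, 4)))
    # Per-feature std of the normalized output stays ~1 at every scale
    print(f"w_scale={w_scale:6.1f} -> output std {batch_norm(z).std(axis=0)}")
```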
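And as referenced in the residual structure section, here is a minimal sketch contrasting a plain chain of sigmoid blocks with a residual chain (toy scalar blocks of my own construction, not the actual ResNet architecture): the residual block's derivative $1 + w\,\sigma'(z)$ keeps an identity term, so the product across depth cannot collapse to zero the way the plain chain's does.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def plain_chain_grad(w, depth, x=0.5):
    """d(output)/d(input) of `depth` stacked blocks y = sigmoid(w * x)."""
    grad, a = 1.0, x
    for _ in range(depth):
        a = sigmoid(w * a)
        grad *= w * a * (1 - a)        # one chain-rule factor per block
    return grad

def residual_chain_grad(w, depth, x=0.5):
    """Same chain, but each block is y = x + sigmoid(w * x)."""
    grad, a = 1.0, x
    for _ in range(depth):
        s = sigmoid(w * a)
        grad *= 1.0 + w * s * (1 - s)  # identity term keeps the gradient alive
        a = a + s
    return grad

for depth in (5, 20, 50):              # illustrative depths
    print(f"depth={depth:2d}: plain {plain_chain_grad(0.5, depth):.3e}, "
          f"residual {residual_chain_grad(0.5, depth):.3e}")
```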