# mid algorithms
```
class: algorithms
```
-------
### Vanishing and Exploding Gradients: Preface
I studied gradient descent in last semester's algorithms and data analysis course, which is how I first came across vanishing and exploding gradients. In short, the root cause of vanishing gradients lies in deep neural networks combined with backpropagation. A deep network is a stack of many nonlinear layers, so it can be viewed as a composite nonlinear multivariate function y = f_N(f_{N-1}(…f_2(f_1(x))…)). Training ultimately needs to find an optimal solution g(x) satisfying Loss = L(g(x), f(x)); a simple concrete form is the squared error Loss = (g(x) − f(x))^2.
The loss function looks something like the figure below.

## Vanishing & Exploding Gradients
Gradient problems usually arise for two reasons: one, the network is deep; two, an unsuitable activation function is used.
### From the perspective of network depth

The figure shows a four-layer fully connected network. The bolded value f_i(x) is the output of layer i after activation, where i is the layer index, x is the input to layer i (i.e. the output of layer i−1), and f is the activation function. Then f_{i+1} = f(f_i · w_{i+1} + b_{i+1}), which can be written more simply (dropping the bias) as f_{i+1} = f(f_i · w_{i+1}).
The BP algorithm, i.e. backpropagation, is based on gradient descent: parameters are adjusted in the direction of the negative gradient of the objective.
Gradient descent rule: the parameters are updated as θ = θ − α·∇J(θ), where θ is the parameter vector, α is the learning rate, and ∇J(θ) is the gradient of the loss function J(θ) with respect to θ.
Weight update for the second hidden layer: let the weight matrix of the second hidden layer be W^(2), the loss be J, and the layer's output be a^(2); the gradient used for the update is ∂J/∂W^(2), and the update rule is W^(2) = W^(2) − α·∂J/∂W^(2).
Chain rule: if z = f(g(x)), then ∂z/∂x = (∂f/∂g)·(∂g/∂x).
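To see why the chained product matters, here is a small hedged sketch (my own, not from the notes): assuming the same scalar weight in every layer and the sigmoid derivative at its maximum value of 0.25, the factor that multiplies the gradient at the first layer is (0.25·w)^depth, which shrinks toward zero when 0.25·w < 1 and blows up when 0.25·w > 1.
```
import numpy as np

# Hedged illustration: the gradient reaching layer 1 is proportional to a
# product of per-layer factors sigmoid'(z_k) * w_k (chain rule, layer by layer).
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def chained_factor(weight, depth):
    # assume every layer has the same scalar weight and pre-activation z = 0,
    # where the sigmoid derivative takes its maximum value of 0.25
    deriv = sigmoid(0.0) * (1 - sigmoid(0.0))   # = 0.25
    return (deriv * weight) ** depth

for depth in (2, 5, 10, 20):
    print(depth,
          chained_factor(weight=0.5, depth=depth),   # 0.25 * 0.5 < 1 -> vanishes
          chained_factor(weight=8.0, depth=depth))   # 0.25 * 8.0 > 1 -> explodes
```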
Because this gradient is a chained product over the layers, adding layers multiplies in more factors; when the per-layer factors (weight times activation derivative) are larger than 1, the computed gradient update grows exponentially with depth, i.e. the gradient explodes.
> The demonstration code follows.
```
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(output):
    # derivative of the sigmoid expressed in terms of its output
    return output * (1 - output)

def update_weights(weights, learning_rate, gradient):
    # gradient descent step: move against the gradient of the loss
    return weights - learning_rate * gradient

def forward_propagation(inputs, weights):
    return sigmoid(np.dot(inputs, weights))

def backward_propagation(delta, layer_input, layer_output, weights):
    # delta: error signal arriving from the layer above;
    # scale it by the activation derivative at this layer's output
    delta = delta * sigmoid_derivative(layer_output)
    # gradient of the loss with respect to this layer's weights
    weights_gradient = np.outer(layer_input, delta)
    # error signal propagated to the layer below (through the weights)
    return np.dot(delta, weights.T), weights_gradient

def train_neural_network(inputs, targets, learning_rate, epochs):
    input_size = len(inputs[0])
    hidden1_size = 4
    hidden2_size = 3
    output_size = len(targets[0])
    # Initialize weights randomly
    weights_input_hidden1 = np.random.rand(input_size, hidden1_size)
    weights_hidden1_hidden2 = np.random.rand(hidden1_size, hidden2_size)
    weights_hidden2_output = np.random.rand(hidden2_size, output_size)
    for epoch in range(epochs):
        for i in range(len(inputs)):
            # Forward propagation
            hidden1_output = forward_propagation(inputs[i], weights_input_hidden1)
            hidden2_output = forward_propagation(hidden1_output, weights_hidden1_hidden2)
            final_output = forward_propagation(hidden2_output, weights_hidden2_output)
            # Backward propagation (squared-error loss, so dL/dy = output - target)
            output_error = final_output - targets[i]
            delta, weights_gradient_hidden2_output = backward_propagation(
                output_error, hidden2_output, final_output, weights_hidden2_output)
            delta, weights_gradient_hidden1_hidden2 = backward_propagation(
                delta, hidden1_output, hidden2_output, weights_hidden1_hidden2)
            _, weights_gradient_input_hidden1 = backward_propagation(
                delta, inputs[i], hidden1_output, weights_input_hidden1)
            # Update weights
            weights_input_hidden1 = update_weights(weights_input_hidden1, learning_rate, weights_gradient_input_hidden1)
            weights_hidden1_hidden2 = update_weights(weights_hidden1_hidden2, learning_rate, weights_gradient_hidden1_hidden2)
            weights_hidden2_output = update_weights(weights_hidden2_output, learning_rate, weights_gradient_hidden2_output)
    return weights_input_hidden1, weights_hidden1_hidden2, weights_hidden2_output

# Example usage
inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
targets = np.array([[0], [1], [1], [0]])
learning_rate = 0.1
epochs = 10000
trained_weights = train_neural_network(inputs, targets, learning_rate, epochs)
print("Trained weights:")
print("Weights Input-Hidden1:\n", trained_weights[0])
print("Weights Hidden1-Hidden2:\n", trained_weights[1])
print("Weights Hidden2-Output:\n", trained_weights[2])
```
If instead this per-layer factor is smaller than 1, then as the number of layers grows the computed gradient update decays exponentially, i.e. the gradient vanishes.
The demonstration code follows.
```
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(output):
    # derivative of the sigmoid expressed in terms of its output
    return output * (1 - output)

def forward_propagation(inputs, weights):
    return sigmoid(np.dot(inputs, weights))

def backward_propagation(delta, layer_input, layer_output, weights):
    # scale the incoming error signal by the activation derivative,
    # build the weight gradient, and pass the signal down through the weights
    delta = delta * sigmoid_derivative(layer_output)
    weights_gradient = np.outer(layer_input, delta)
    return np.dot(delta, weights.T), weights_gradient

def train_neural_network(inputs, targets, learning_rate, epochs, layer_sizes):
    # Initialize weights randomly
    weights = [np.random.rand(layer_sizes[i], layer_sizes[i+1]) for i in range(len(layer_sizes)-1)]
    gradients_magnitude = []
    for epoch in range(epochs):
        for i in range(len(inputs)):
            # Forward propagation, keeping every layer's output
            layer_outputs = [inputs[i]]
            for j in range(len(layer_sizes)-1):
                layer_outputs.append(forward_propagation(layer_outputs[j], weights[j]))
            # Backward propagation (squared-error loss, so dL/dy = output - target)
            delta = layer_outputs[-1] - targets[i]
            # record the gradient magnitude at the output layer
            gradients_magnitude.append(np.linalg.norm(delta * sigmoid_derivative(layer_outputs[-1])))
            for j in range(len(layer_sizes)-2, -1, -1):
                delta, weights_gradient = backward_propagation(delta, layer_outputs[j], layer_outputs[j+1], weights[j])
                weights[j] -= learning_rate * weights_gradient
    return gradients_magnitude

# Example usage
inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
targets = np.array([[0], [1], [1], [0]])
learning_rate = 0.1
epochs = 10000
layer_sizes = [2, 50, 50, 1]  # Number of neurons in each layer
gradients = train_neural_network(inputs, targets, learning_rate, epochs, layer_sizes)
# Plot the recorded gradient magnitudes over training steps
plt.plot(gradients)
plt.title('Gradient Magnitude Over Training Steps')
plt.xlabel('Training Step')
plt.ylabel('Gradient Magnitude')
plt.show()
```
The curves below show the speed at which the weights are updated. For a network with two hidden layers, the hidden layer closer to the input updates its weights more slowly than the one closer to the output:

With four hidden layers the effect is even more pronounced:

## Solutions for Vanishing and Exploding Gradients
#### Pre-training and fine-tuning
This method comes from Hinton's 2006 paper, which proposed unsupervised layer-wise training. The basic idea is to train one layer of hidden units at a time, feeding the output of the previous hidden layer in as input and passing this layer's output on as the input of the next hidden layer; this is layer-wise pre-training, after which the whole network is fine-tuned (a simplified sketch follows the paper link below).
(https://paperswithcode.com/paper/reducing-the-dimensionality-of-data-with)
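The sketch below is a heavily simplified, hedged illustration of the layer-wise idea (it uses plain autoencoders trained with gradient descent rather than the RBMs in Hinton's paper): each layer is pre-trained to reconstruct the representation produced by the layer below, and its encoder weights are then kept as the initialization for supervised fine-tuning.
```
import numpy as np

# Hedged sketch of greedy layer-wise pre-training with simple autoencoders.
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def pretrain_layer(data, hidden_size, lr=0.1, epochs=500):
    rng = np.random.default_rng(0)
    n_in = data.shape[1]
    W_enc = rng.normal(0, 0.1, (n_in, hidden_size))
    W_dec = rng.normal(0, 0.1, (hidden_size, n_in))
    for _ in range(epochs):
        h = sigmoid(data @ W_enc)            # encode
        recon = sigmoid(h @ W_dec)           # decode (reconstruct the input)
        err = recon - data                   # reconstruction error
        d_recon = err * recon * (1 - recon)  # delta at the decoder
        d_h = (d_recon @ W_dec.T) * h * (1 - h)  # delta at the encoder
        W_dec -= lr * h.T @ d_recon
        W_enc -= lr * data.T @ d_h
    return W_enc

# Pre-train a stack of layers one at a time: the encoder output of each layer
# becomes the training input of the next layer (the "layer-wise" part).
X = np.random.default_rng(1).random((32, 8))
layer_sizes = [6, 4]
encoders, current = [], X
for size in layer_sizes:
    W = pretrain_layer(current, size)
    encoders.append(W)
    current = sigmoid(current @ W)
# `encoders` would then initialize the deep network before supervised fine-tuning.
```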
#### Batch normalization
Batch normalization (batchnorm) is one of the important advances in the development of deep learning. It speeds up network convergence and makes training more stable; at its core, it addresses the gradient problems that occur during backpropagation.
Backpropagation involves the weights w (they appear in the chained product above), so the scale of w influences whether gradients vanish or explode. By normalizing each layer's output to a consistent mean and variance, batchnorm removes the amplification or attenuation that the scale of w would otherwise introduce.
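A minimal hedged sketch of the normalization step itself (training-mode batch statistics only; the learnable scale/shift training, running averages for inference, and the backward pass are all left out):
```
import numpy as np

# Normalize each feature of a layer's output to zero mean and unit variance,
# then apply a scale (gamma) and shift (beta).
def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    # x: (batch_size, features) output of a layer
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

layer_output = np.random.randn(16, 4) * 10 + 5   # badly scaled activations
normalized = batch_norm(layer_output)
print(normalized.mean(axis=0).round(3), normalized.std(axis=0).round(3))
```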
#### Residual structure

#### LSTM

## References
新手村逃脫!初心者的python機器學習攻略 (book)
https://paperswithcode.com/paper/reducing-the-dimensionality-of-data-with (paper)
ChatGPT (assistance with writing the code)
https://zh.wikipedia.org/zh-tw/%E5%8F%8D%E5%90%91%E4%BC%A0%E6%92%AD%E7%AE%97%E6%B3%95 (explanation of the backpropagation algorithm)
https://www.cupoy.com/qa/club/ai_tw/0000016D6BA22D97000000016375706F795F72656C656173654B5741535354434C5542/0000017BAC14A4DE000000116375706F795F72656C656173655155455354 (explanation of vanishing and exploding gradients)
https://zh.wikipedia.org/zh-tw/%E9%9D%9E%E7%B7%9A%E6%80%A7%E7%B3%BB%E7%B5%B1 (explanation of nonlinear systems)