Introduction to Deep Learning

contributed by <kylekylehaha>

tags:`Data Science`

Neuron: the smallest unit of a neural network.

Without an activation function, the model gains no expressive power: it remains linear.

$w_{0}, \dots, w_{m}$ are the parameters to be learned.
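
A minimal sketch of a single neuron, to make the two points above concrete. The function names, the choice of sigmoid, and the example weights are assumptions for illustration and do not come from the original notes.

```python
import math

def sigmoid(z):
    """Logistic activation: squashes z into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(x, w, activation=None):
    """Compute w[0] + w[1]*x[0] + ... + w[m]*x[m-1], then apply the activation if given."""
    z = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return activation(z) if activation else z

def stacked_linear(x):
    """Two neurons in sequence, with no activation function."""
    h = neuron([x], [1.0, 2.0])      # h = 1 + 2x
    return neuron([h], [3.0, 4.0])   # 3 + 4h = 7 + 8x

def single_linear(x):
    """One linear neuron that computes the same function as stacked_linear."""
    return neuron([x], [7.0, 8.0])
```

Here `stacked_linear` and `single_linear` agree for every input, which is why depth alone adds nothing until a nonlinearity such as `sigmoid` is placed between the layers.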

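The Universality Theorem tells us that the more parameters a network has, the better it can approximate a target function and the better it performs. So why use a deep network rather than a fat (wide, shallow) one?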

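Because in a deep network each layer acts as a module, and later layers build on the results of earlier ones. With only one layer, more parameters are needed to reach the same performance.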

In addition, given the same amount of data, a deep network achieves better performance.

Task

model: a set of functions.
goal: use the training data to find, within the hypothesis function set, the best function $f^{*}$ for the task.

This breaks down into 3 steps:

What is the model? (function hypothesis set)
What is the "best" function?
How to pick the "best" function?

Task considered today (how we model the problem)
Classification

Binary classification (only two classes)
- Spam filtering
- Recommendation system
- Malware detection
- Stock prediction
Multi-class classification (more than two classes)
- Handwriting digit classification
- Image recognition

What is the model?

A layer of neurons

Single neuron

A single neuron can only do binary classification; it cannot handle multi-class classification.

So we use multiple neurons to do multi-class classification.
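
As a rough sketch of this idea, a layer can hold one neuron per class and predict the class whose neuron gives the highest score. The function names and the 3-class weights below are hypothetical, chosen only for illustration.

```python
def layer_scores(x, weights, biases):
    """One neuron per class: each row of `weights` plus its bias yields one score."""
    return [b + sum(w * xi for w, xi in zip(row, x))
            for row, b in zip(weights, biases)]

def predict_class(x, weights, biases):
    """Return the index of the neuron with the largest score."""
    scores = layer_scores(x, weights, biases)
    return max(range(len(scores)), key=lambda k: scores[k])

# Hypothetical 3-class problem with 2 input features.
W = [[1.0, 0.0],    # neuron for class 0
     [0.0, 1.0],    # neuron for class 1
     [-1.0, -1.0]]  # neuron for class 2
b = [0.0, 0.0, 0.0]
```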

Limitation of single layer

A single neuron can be viewed as a straight line (a linear decision boundary), so no matter how that line is drawn, it cannot separate the XOR function.

Borrowing the idea of logic gates, we can build XOR out of AND and OR.

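In other words, when the inputs cannot be separated directly, we can stack a few more neurons to apply a transformation that projects the data into another (typically higher-dimensional) space. Here the space is still two-dimensional; $x_{1}, x_{2}$ are simply mapped to $a_{1}, a_{2}$. After the mapping to $a_{1}, a_{2}$, a single line is enough to separate the classes.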
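
One possible way to realize this, sketched with threshold neurons: the hidden layer computes OR and AND, and the output neuron draws a single line in the $(a_{1}, a_{2})$ plane. The step activation and the specific weights and thresholds are assumptions chosen for illustration; the original figure may use different values.

```python
def step(z):
    """Threshold activation: 1 if z is positive, else 0."""
    return 1 if z > 0 else 0

def xor_net(x1, x2):
    # Hidden layer: transform (x1, x2) into (a1, a2).
    a1 = step(x1 + x2 - 0.5)   # behaves like OR
    a2 = step(x1 + x2 - 1.5)   # behaves like AND
    # Output neuron: one line in the (a1, a2) plane, "a1 and not a2".
    return step(a1 - a2 - 0.5)

truth_table = {(x1, x2): xor_net(x1, x2) for x1 in (0, 1) for x2 in (0, 1)}
```

In the transformed space the four inputs land on only three points, (0, 0), (1, 0), and (1, 1), and the line $a_{1} - a_{2} = 0.5$ separates (1, 0) from the other two.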

Neural Network

Notation

Relation between Layer Output

A model such as linear regression has relatively few parameters, so it cannot handle harder tasks.

What is the "best" function?

Finding the best function amounts to finding the best parameters. We therefore write $f(x)$ as $f(x; \theta)$, where $\theta = \{W^{1}, b^{1}, W^{2}, b^{2}, \dots, W^{L}, b^{L}\}$.
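
To show how $\theta$ determines the function, here is a small sketch of $f(x; \theta)$ as a forward pass. It assumes the usual layer relation $a^{l} = \sigma(W^{l} a^{l-1} + b^{l})$ with a sigmoid $\sigma$; the original figures are not available, so that relation, the function names, and the example weights are all assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, theta):
    """Evaluate f(x; theta). `theta` is a list of (W, b) pairs, one per layer,
    where W is a list of rows and b is a list of biases."""
    a = x
    for W, b in theta:
        z = [bi + sum(w * ai for w, ai in zip(row, a)) for row, bi in zip(W, b)]
        a = [sigmoid(zi) for zi in z]
    return a

# Hypothetical parameters for a 2 -> 2 -> 1 network.
theta = [
    ([[0.5, -0.5], [1.0, 1.0]], [0.0, -1.0]),  # W^1, b^1
    ([[1.0, -1.0]], [0.0]),                    # W^2, b^2
]
output = forward([1.0, 1.0], theta)
```

Changing the numbers in `theta` changes which function the same code computes, which is the sense in which choosing a function means choosing parameters.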

Cost function

The cost function serves as the measure of how good or bad a function is.

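$x^{r}$: the $r$-th training example
$\hat{y}^{r}$: the ground truth for the $r$-th training example

Passing $x^{r}$ through $f$ outputs a vector. Its entries give the probability that the input is "1", the probability that it is "2", and so on.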
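
The exact cost used in the original figures is not recoverable, so the following is only a sketch under the assumption that the output vector comes from a softmax and is compared with the ground truth by cross-entropy. All function names are hypothetical.

```python
import math

def softmax(z):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(z)                                # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target_index):
    """Cost for one example: small when the true class gets high probability."""
    return -math.log(probs[target_index])

def total_cost(outputs, targets):
    """Sum the per-example costs over the whole training set."""
    return sum(cross_entropy(p, t) for p, t in zip(outputs, targets))

probs = softmax([2.0, 1.0, 0.1])
```

With these scores, the cost is lower if the ground truth is class 0 than if it is class 2, because class 0 receives the highest probability.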

How to pick the "best" function?

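In deep learning the form of the function is fixed; candidate functions differ only in their parameters. We therefore look for the set of parameters that minimizes the training loss on the training data.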

Gradient Descent

If we knew the shape of the function, we could find its extrema directly by differentiation. In general we do not know it, so we use gradient descent.

Compute the gradient at the current point, step in the direction opposite to the gradient, and repeat until reaching a local minimum.
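
The update rule can be sketched in a few lines. The quadratic cost, the learning rate, and the number of steps below are hypothetical choices made so that the minimum is known in advance.

```python
def gradient_descent(grad, theta0, learning_rate=0.1, steps=100):
    """Repeatedly move against the gradient, starting from theta0."""
    theta = list(theta0)
    for _ in range(steps):
        g = grad(theta)
        theta = [t - learning_rate * gi for t, gi in zip(theta, g)]
    return theta

def grad_C(theta):
    """Gradient of the hypothetical cost C = (t1 - 3)^2 + (t2 + 1)^2,
    whose minimum is at (3, -1)."""
    return [2 * (theta[0] - 3), 2 * (theta[1] + 1)]

theta_star = gradient_descent(grad_C, [0.0, 0.0])
```

On this convex example the iterate approaches (3, -1). On a real network the cost is not convex, so the same procedure is only expected to reach a local minimum.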

Formal Derivation of Gradient Descent

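We can also justify gradient descent using the Taylor series.

Taylor series: around a given point, a function can be expanded so that the expansion closely matches the function near that point.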
We can see that near $x = \frac{\pi}{4}$ the values are essentially the same.
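
The figure this remark refers to is missing, so the following sketch assumes the function being expanded is $\sin(x)$ around $x_{0} = \frac{\pi}{4}$. It compares the truncated Taylor series with the true value; the function name and the sample points are hypothetical.

```python
import math

def taylor_sin(x, x0=math.pi / 4, order=1):
    """Taylor expansion of sin around x0, truncated after the given order."""
    # The derivatives of sin cycle through sin, cos, -sin, -cos.
    derivs = [math.sin(x0), math.cos(x0), -math.sin(x0), -math.cos(x0)]
    return sum(derivs[k % 4] * (x - x0) ** k / math.factorial(k)
               for k in range(order + 1))

x0 = math.pi / 4
error_near = abs(taylor_sin(x0 + 0.1) - math.sin(x0 + 0.1))  # close to x0
error_far = abs(taylor_sin(x0 + 2.0) - math.sin(x0 + 2.0))   # far from x0
```

The first-order approximation is accurate only close to $x_{0}$, which is the property the gradient descent derivation relies on.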

When the circle is small enough, we can expand around the point $(a, b)$.
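
The argument can be written out as follows. The symbols are assumptions, since the original figures are not available: $C$ is the cost, $(u, v)$ are its partial derivatives at $(a, b)$, $d$ is the radius of the circle, and $\eta$ is the learning rate.

$$
C(\theta_{1}, \theta_{2}) \approx C(a, b) + u\,(\theta_{1} - a) + v\,(\theta_{2} - b),
\qquad
u = \frac{\partial C}{\partial \theta_{1}}(a, b),\quad
v = \frac{\partial C}{\partial \theta_{2}}(a, b)
$$

Inside the circle $(\theta_{1} - a)^{2} + (\theta_{2} - b)^{2} \le d^{2}$, the term $C(a, b)$ is constant, so minimizing the approximation means making the inner product of $(\theta_{1} - a,\ \theta_{2} - b)$ with $(u, v)$ as negative as possible. That happens when the step points in the direction opposite to $(u, v)$:

$$
\begin{bmatrix} \theta_{1} \\ \theta_{2} \end{bmatrix}
=
\begin{bmatrix} a \\ b \end{bmatrix}
- \eta \begin{bmatrix} u \\ v \end{bmatrix},
\qquad \eta > 0
$$

This is the gradient descent update. The approximation only holds when the circle is small, which is why the learning rate $\eta$ must be small enough.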

To compute the gradient efficiently, we use backpropagation.

Forward Data & Backward Error

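First, the data is fed forward through the network to compute its final output, from which the final error is obtained. The error is then propagated backward according to the weights, and finally the parameters are updated.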

Forward Data

Calculate Error

Error Backpropagation

Update weights
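
The four steps above can be sketched on the smallest possible network: one input, one hidden sigmoid neuron, and one sigmoid output. The squared-error cost, the initial parameters, the learning rate, and the single training example are all assumptions made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_step(params, x, y, learning_rate=0.5):
    w1, b1, w2, b2 = params
    # 1. Forward data: compute each layer's output in turn.
    a1 = sigmoid(w1 * x + b1)
    a2 = sigmoid(w2 * a1 + b2)
    # 2. Calculate error (squared error, an assumption).
    error = 0.5 * (a2 - y) ** 2
    # 3. Error backpropagation: apply the chain rule from output to input.
    delta2 = (a2 - y) * a2 * (1 - a2)       # d(error)/d(z2)
    delta1 = delta2 * w2 * a1 * (1 - a1)    # d(error)/d(z1), passed back through w2
    grads = (delta1 * x, delta1, delta2 * a1, delta2)
    # 4. Update weights: step against the gradient.
    new_params = tuple(p - learning_rate * g for p, g in zip(params, grads))
    return new_params, error

params = (0.5, 0.0, -0.5, 0.0)  # hypothetical initial (w1, b1, w2, b2)
errors = []
for _ in range(200):
    params, err = train_step(params, x=1.0, y=1.0)
    errors.append(err)
```

Note how `delta1` reuses `delta2` instead of recomputing it. That reuse is what makes backpropagation cheaper than differentiating with respect to every parameter from scratch.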

Practical Issues for neural network

Parameter Initialization
Learning Rate
Stochastic gradient descent and Mini-batch
Recipe for Learning

Parameter Initialization

Learning Rate

Stochastic gradient descent and Mini-batch

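We must consider whether the gradients of individual data points are similar to one another. If they are, stochastic gradient descent can be used.

Gradient descent updates once per epoch. Stochastic gradient descent updates once per example, so with 20 examples it updates 20 times. After one epoch, stochastic gradient descent has therefore already moved much further.

Stochastic gradient descent is the case of batch size = 1.

No particular batch size is guaranteed to give better accuracy or training time. It is a hyperparameter that has to be chosen by experiment.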
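
The relation between batch size and the number of updates can be made concrete with a short sketch. The helper names are hypothetical; the 20-example dataset matches the example in the text.

```python
def minibatches(data, batch_size):
    """Split the data into consecutive batches; the last one may be smaller."""
    return [data[i:i + batch_size] for i in range(0, len(data), batch_size)]

def updates_per_epoch(num_examples, batch_size):
    """One parameter update per batch (ceiling division)."""
    return -(-num_examples // batch_size)

data = list(range(20))                      # 20 training examples
full_batch = updates_per_epoch(20, 20)      # gradient descent
stochastic = updates_per_epoch(20, 1)       # stochastic gradient descent
mini_batch = updates_per_epoch(20, 8)       # mini-batch with batch size 8
```

In practice the data is usually shuffled before it is split each epoch. That step is left out here to keep the sketch deterministic.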

Recipe for Learning

Concluding Remark

Tips for Deep Neural Network

Activation Function
Cost Function
Data Preprocessing
Optimization
Generalization

Activation Function

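Today we commonly use ReLU rather than sigmoid as the activation function. The derivative of ReLU is either 0 or 1, whereas the derivative of sigmoid lies between 0 and 1 and is always less than 1 (its maximum is 0.25). During backpropagation, each layer multiplies the gradient by this derivative and by the weights, so the gradient shrinks as it is passed back through the layers, causing the vanishing gradient problem.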
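
A rough numerical illustration of this effect, under simplifying assumptions: every layer has the same pre-activation `z` and the same scalar weight, so the gradient picks up the same factor at each layer. The function names and values are hypothetical.

```python
import math

def sigmoid_grad(z):
    """Derivative of sigmoid: s * (1 - s), which peaks at 0.25 when z = 0."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def relu_grad(z):
    """Derivative of ReLU: 1 for positive input, 0 otherwise."""
    return 1.0 if z > 0 else 0.0

def backprop_factor(act_grad, z, weight, num_layers):
    """Factor by which a gradient is scaled after passing back through
    num_layers identical layers."""
    factor = 1.0
    for _ in range(num_layers):
        factor *= act_grad(z) * weight
    return factor

sigmoid_factor = backprop_factor(sigmoid_grad, 0.0, 1.0, 10)  # best case for sigmoid
relu_factor = backprop_factor(relu_grad, 1.0, 1.0, 10)        # active ReLU units
```

Even in sigmoid's best case the factor after 10 layers is $0.25^{10}$, roughly one millionth, while active ReLU units pass the gradient through unchanged.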