
# 11.2 The Perceptron

## perceptron (linear discriminant function)

The perceptron is the ++basic processing element++. Its inputs may come from the environment or from the outputs of other perceptrons.

![11.1](https://hackmd.io/_uploads/SkVu1m5x0.png)

> $\rightarrow$ Each input $x_j \in \mathbb{R}$, $j=1,...,d$, has an associated weight $w_j \in \mathbb{R}$, called the connection weight or synaptic weight.
>
> $\rightarrow$ In this equation, $w_0$ is the intercept, which makes the model more general; we usually treat $w_0$ as the weight of an extra bias unit $x_0$.
>> The value of $x_0$ is always $+1$.
>
> $\rightarrow$ The $y$ on the left-hand side is the output. With this convention we can rewrite the equation as $y=w^Tx$, where $w$ and $x$ are vectors that include the bias weight and the bias input. The figure below shows why:

![image](https://hackmd.io/_uploads/SyKDZQceA.png)
> from video 6 of「機器學習基石」(Machine Learning Foundations)

During testing, given the weights $w$ and an input $x$, we compute the output $y$. To implement a particular task, we have to ++learn the weights $w$++: find which weights best suit $x_0,...,x_d$, that is, decide how, and how strongly, each element of the input $x$ affects the output.

$\rightarrow$ ++Learning the weights++ can be viewed as ++finding the parameters of the system++.

If $d$ is $1$ (we have only one input), equation $11.1$ above becomes:

![image](https://hackmd.io/_uploads/rkr4F49gA.png)
> This is the equation of a line, where $w$ is the slope and $w_0$ is the intercept.

$\rightarrow$ This perceptron has one input and one output, and can be used to implement a ++linear fit++.

- With more than one input, the line becomes a ++hyperplane++. A perceptron with more than one input can be used to implement a ++multivariate linear fit++. The equation in $11.1$ at the top defines a hyperplane that divides the input space into two parts, one positive and one negative. By using this equation as a linear discriminant function, the perceptron can separate two classes by checking the sign of the output.

> A hyperplane is a flat subspace with one dimension fewer than the space it sits in.
>> "a hyperplane of an $n$-dimensional space $V$ is a subspace of dimension $n-1$"

![image](https://hackmd.io/_uploads/B1A845iUR.png)
> As the figure shows, the hyperplane splits the points of the whole space into two parts, for example green for positive and red for negative.
>
> After we feed an input to the perceptron, we get the output $y$. As described below, we use another function to check whether $y$ is positive or negative: if $y > 0$ the input is assigned to one class (for example, all green points go to the same class), and if $y \le 0$ it is assigned to the other class.
>
> $\rightarrow$ Calling a line a "plane" may sound odd, but in two-dimensional space the one-dimensional "plane" (that line) is also called a hyperplane.

## threshold function

We define ==$s(\cdot)$== as the
threshold function:

![11.3](https://hackmd.io/_uploads/SkEts4clA.png)
> The threshold function $s(\cdot)$ can be read as a function that checks the sign of its argument.
>
> So when we pass the output $y=w^Tx$ to $s(\cdot)$, a positive output is assigned to class $C_1$ and any other output to $C_2$.

Note that:

:::warning
Using a linear discriminant function means we ++assume the classes are ==linearly separable==++.

$\rightarrow$ That is, we ++can find a hyperplane $w^Tx=0$ that separates the inputs $x^t \in C_1$ from the inputs $x^t \in C_2$++.
:::

![linear separable](https://hackmd.io/_uploads/Sk3q_HqxC.png)

## sigmoid function

At a later stage we may need the posterior probability (for example, to calculate risk). In that case we use the sigmoid function:

> Recall:
> ![posterior probability](https://hackmd.io/_uploads/BJBHZiiIC.png)
>> The posterior probability is $P(c|x)$ in the figure.

![image](https://hackmd.io/_uploads/SJi1RLclA.png)
> As before, multiplying the input vector $x$ by the weight vector $w$ gives the output $o$.
>
> Feeding $o$ into the sigmoid function then gives the posterior probability $y$.

The textbook's notation may be a little confusing, so let us restate it. Writing $\sigma()$ for the sigmoid function, its definition is:

:::info
\begin{equation}
\sigma(z) = \frac{1}{1+e^{-z}}
\end{equation}
:::

![sigmoid function](https://hackmd.io/_uploads/rJPGQijIC.png)
> Strictly speaking, "sigmoid function" refers to any S-shaped function; one commonly used example is the logistic function defined by the equation above.
>
> $\rightarrow$ Other examples include the hyperbolic tangent, which is a shifted and scaled logistic function, and the arctangent $f(x) = \arctan x$, among others.

Substituting $w^Tx$ into $\sigma()$ gives:

:::success
\begin{equation}
p(C_k|x) = \sigma(w^Tx)
\end{equation}
:::

> $p(C_k|x)$ is the probability that the input $x$ is classified into class $C_k$.
>
> $\rightarrow$ The plot of the sigmoid function above also shows that $\sigma(z)$ only takes values between $0$ and $1$, which is consistent with the definition of a probability.

:::danger
Q: Why does substituting into the sigmoid function give the posterior probability?
:::

Looking at the value of $\sigma()$ when the weighted sum is very large or very small, as in the figure below, makes this clear:

![image](https://hackmd.io/_uploads/r1Jxv6iUR.png)

A simple example:

![image](https://hackmd.io/_uploads/H1-IwaoIA.png)
> $\rightarrow$ For more on this part, see the video "The Sigmoid Function Clearly Explained" linked in the reference list at the end.

The whole flow is shown in the figure below:

![image](https://hackmd.io/_uploads/Hy-gXioIC.png)

Returning to the figure given in the textbook:

![11.2](https://hackmd.io/_uploads/BJB41PcgA.png)
>
> In the figure, each circle $x_j$ in the blue box is an input element; together they form the input vector $x$. As mentioned earlier, we also add a bias unit $x_0$ whose value is always $1$.
>
> Each circle $y_i$ in the green box at the top is an output element: the sum of the input elements $x_0,...,x_d$, each multiplied by its corresponding weight.
>> $w_{ij}$: the weight connecting input $x_j$ to output $y_i$
>
> In this example the output vector $y =[y_1,...,y_K]^T$ is $K$-dimensional, which means the classification problem has $K$ classes. After computing $y_1,...,y_K$, we do postprocessing to pick the maximum and so decide which class the input $x$ is assigned to.
>> If we instead want the posterior probability of each class, we use the softmax function.
>>
>> $\rightarrow$ Notes on the softmax function will be added later. In short, it is a function that post-processes the outputs $w_i^Tx$ (exponentiate, then normalize) so that the largest $y_i$ moves toward $1$ when it is well above the others, and all output elements sum to one, $\sum_i y_i=1$, which satisfies the definition of a probability.

Finally, as the figure above shows, if we have more than two outputs, say $K$ outputs, then we have $K$ perceptrons, each with its own weight vector $w_i$ (for example, the output $y_1$ in the figure corresponds to $w_1=[w_{10},\ w_{11},\ w_{12},...,\ w_{1d}]$).
> The boldface $w_i$ in the figure is a vector, and $w_{ij}$ is an element of it.

In mathematical form:

![11.5](https://hackmd.io/_uploads/HkE1ZwclC.png)
> where $W$ is a $K\times (d+1)$ weight matrix in which each row is a weight vector $w_i$, $i=1,...,K$.

Writing out the whole computation may make this clearer:

![image](https://hackmd.io/_uploads/r11aZnsI0.png)
> Once $Wx$ has been computed, if we only want to pick a class for the input, we simply take the largest $y_i$ in the result $y$. If, as mentioned above, we want the posterior probability of each class, we need to pass the values $w_i^Tx$ through the softmax function.
>
> - Computing the posterior this way means the neural network needs two stages of processing (the first stage computes the weighted sums, the second computes the softmax values), yet the textbook's figure shows it as a single layer.

# References

- wiki: [Sigmoid function](https://en.wikipedia.org/wiki/Sigmoid_function)
- [Logistic Regression: The good parts](https://sthalles.github.io/logistic-regression/)
- [The Sigmoid Function Clearly Explained](https://youtu.be/TPqr8t919YM?si=00k8CLwBwM28UzT9)
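
# Supplement: forward pass sketch

The computations described in this note (the weighted sum $w^Tx$ with a bias unit, the threshold function, the sigmoid, and the $K$-output case with softmax) can be sketched in plain Python. This is a minimal illustration, not the textbook's code: the helper names (`augment`, `weighted_sum`, `threshold`, `sigmoid`, `softmax`) and every weight value are made up for the example, and `threshold` is assumed to return $1$ for a positive argument and $0$ otherwise.

```python
import math

def augment(x):
    """Prepend the bias unit x_0 = +1 to the input vector."""
    return [1.0] + list(x)

def weighted_sum(w, x):
    """Compute w^T x; both vectors include the bias position 0."""
    return sum(wj * xj for wj, xj in zip(w, x))

def threshold(a):
    """s(a): assumed here to be 1 if a > 0, and 0 otherwise."""
    return 1 if a > 0 else 0

def sigmoid(z):
    """Logistic sigmoid 1 / (1 + e^{-z}), split by sign to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def softmax(o):
    """Exponentiate, then normalize, so the results are positive and sum to 1."""
    m = max(o)  # subtracting the max leaves the result unchanged but avoids overflow
    exps = [math.exp(v - m) for v in o]
    total = sum(exps)
    return [e / total for e in exps]

# Single perceptron with d = 2 inputs (hypothetical weights [w_0, w_1, w_2]).
w = [-1.0, 2.0, 0.5]
x = augment([1.0, 2.0])      # [1.0, 1.0, 2.0]
o = weighted_sum(w, x)       # -1 + 2 + 1 = 2.0
label = "C_1" if threshold(o) == 1 else "C_2"
posterior = sigmoid(o)       # a value strictly between 0 and 1

# K = 3 perceptrons: each row of W is one weight vector w_i (hypothetical values).
W = [
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, -1.0, -1.0],
]
y = [weighted_sum(w_i, x) for w_i in W]          # [1.0, 2.0, -2.0]
best = max(range(len(y)), key=lambda i: y[i])    # index of the largest y_i
posteriors = softmax(y)                          # second stage: softmax over w_i^T x
```

With these made-up weights the single perceptron's weighted sum is positive, so the threshold picks $C_1$. In the $K=3$ case the second output is the largest, so `best` is index `1`, and `posteriors` sums to $1$ with its largest entry at the same index.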