Perceptron Learning Algorithm

# PLA / Perceptron Learning Algorithm 經典且簡單的 $\mathcal{H}$。 Perceptron 中文是「感知器」，是演算法的核心。 # 感知器 Perceptron $$ y=h(\mathbf{x})=sign\left(\left(\sum_{i=1}^{d}w_{i}x_{i}\right)-threshold\right) $$ $y$ 別名「標籤 label」，在此處的範圍就只有兩個值： $$ \mathcal{Y}:\{-1,1\} $$ 往後也有可能會以 $\mathcal{Y}:\{\times, \circ\}$ 表示。 ## 符號的化簡把閾值當作是「第 0 項」則可以合併進級數裡面： $$ h(\mathbf{x})=sign\left(\sum_{i=0}^{d}w_{i}x_{i}\right)\\ x_{0}=1 $$ 使用向量來標示上面的公式，可以利用內積表示法來簡化： $$ h(\mathbf{x})=sign\left(\mathbf{w}^{T}\mathbf{x}\right) $$ :::info 雖然不容易出現，但確實會出現 0 的情況；出現的時候可以投個硬幣隨機決定是 1 或 -1，或者固定讓 0 為 1 或 -1。(通常為 -1)。 ::: # $\mathcal{A}$ 持續的檢查每個點。對於當前第 t 個點 $$ if\ \ h(\mathbf{x}_{n(t)}) \ne y_{n(t)}\\ then\ \ \mathbf{w}_{t+1}=\mathbf{w}_{t}+y_{n(t)}\mathbf{x}_{n(t)} $$ 直到全部預測都正確，即可停止；在這裡先假設資料是線性可分的。 ## 解釋在二維的情況會比較好解釋。 $h(\mathbf{x})=sign\left(\mathbf{w}^{T}\mathbf{x}_{n(t)}\right)$ 就是計算 $\mathbf{x}$ 跟直線法向量 $\mathbf{w}$ 的內積。 - 內積小於 0：兩向量夾角大於 90 度。 - 如果 $y_{n(t)}=1$，表示要將 $\mathbf{w}$ 轉到跟 $\mathbf{x}$ 大致同方向 - 內積大於 0：兩向量夾角小於 90 度。 - 如果 $y_{n(t)}=-1$，表示要將 $\mathbf{w}$ 轉到跟 $\mathbf{x}$ 大致反方向這兩種情況剛好都可以表示為修正公式： $$ \mathbf{w}_{t+1}=\mathbf{w}_{t}+y_{n(t)}\mathbf{x}_{n(t)} $$ --- # 停下來的證明如果我有一個完美的線，也就是正確的 $\mathbf{w}_{f}$，代表對於任意的點 n ，該點內積後的正負號跟該點 y 一樣，或者可以表示為： $$ y_{n} = sign(\mathbf{w}_{f}^{T}\mathbf{x}_{n}) $$ 兩個同號數值相乘，結果均為正數。因此可以推得： $$ \min_n \ y_{n}\mathbf{w}_{f}^{T}\mathbf{x}_{n} > 0 $$ 要記住我們那個完美的線是可以完美分隔出不同 y 值的點，也就是說不會有點在線上。對於在 t 次更新時的第 n 個點會符合下面的關係： $$ y_{n(t)}\mathbf{w}_{f}^{T}\mathbf{x}_{n(t)} \ge \min_n \ y_{n}\mathbf{w}_{f}^{T}\mathbf{x}_{n} > 0 $$ 畢竟最小值就是最小的那個。 ## 每次 $\mathbf{w}$ 的更新都會跟正確的 $\mathbf{w}$ 越來越靠近由上面的關係可以推導出下面的關係： $$ \mathbf{w}_{f}^{T}\mathbf{w}_{t+1}=\mathbf{w}_{f}^{T}(\mathbf{w}_{t}+y_{n(t)}\mathbf{x}_{n(t)})\\ \mathbf{w}_{f}^{T}\mathbf{w}_{t+1} \ge \mathbf{w}_{f}^{T}\mathbf{w}_{t}+ \min_n\ y_{n}\mathbf{w}_{f}^{T}\mathbf{x}_{n}\\ \mathbf{w}_{f}^{T}\mathbf{w}_{t+1} > \mathbf{w}_{f}^{T}\mathbf{w}_{t} + 0\\ \mathbf{w}_{f}^{T}\mathbf{w}_{t+1} > \mathbf{w}_{f}^{T}\mathbf{w}_{t} $$ 也就是說每次更新得到的 $\mathbf{w}_{t+1}$ 跟 $\mathbf{w}_{f}$ 的內積會越來越大。 :::warning 但是內積大小會受夾角跟向量長度影響，所以接下來證明另一件事。 ::: ## w 更新長度的成長有上限在第 t 次更新時發生錯誤的第 n 個點，因為異號的緣故可以推得： $$ sign(\mathbf{w}_{t}^{T}x_{n(t)}) \ne y_{n(t)} \Leftrightarrow y_{n(t)}\mathbf{w}_{t}^{T}\mathbf{x}_{n(t)} \le 0 $$ 此時去檢查 $\mathbf{w}_{t}$ 跟 $\mathbf{w}_{t+1}$ 的長度關係： $$ ||\mathbf{w}_{t+1}||^{2} = ||\mathbf{w}_{t} + y_{n(t)}\mathbf{x}_{n(t)}||^{2} =||\mathbf{w}_{t}||^{2} + 2y_{n(t)}\mathbf{w}_{t}^{T}\mathbf{x}_{n(t)} + ||y_{n(t)}\mathbf{x}_{n(t)}||^{2}\\ \Rightarrow ||\mathbf{w}_{t+1}||^{2} \le ||\mathbf{w}_{t}||^{2} + ||y_{n(t)}\mathbf{x}_{n(t)}||^{2}\\ \Rightarrow ||\mathbf{w}_{t+1}||^{2} \le ||\mathbf{w}_{t}||^{2} + \max_{n} ||y_{n}\mathbf{x}_{n}||^{2}\\ $$ 也就是說每次 $\mathbf{w}_{t+1}$ 的長度最多最多就是比 $\mathbf{w}_{t}$ 多 $\displaystyle\max_{n}||y_{n}\mathbf{x}_{n}||^{2}$ ## 綜合來看每次更新的 $\mathbf{w}_{t}$ 都會跟正確的 $\mathbf{w}_{f}$ 卻來越接近，但每次 $\mathbf{w}_{t+1}$ 最多就只會長 $\displaystyle\max_{n} ||y_{n}\mathbf{x}_{n}||^{2}$。下面經過 T 次修正的 $\mathbf{w}$ 以 $\mathbf{w}_{T}$ 表示： $$ \mathbf{w}_{f}^{T}\mathbf{w}_{T} \ge \mathbf{w}_{f}^{T}\mathbf{w}_{T-1}+ \min_{n}\ y_{n}\mathbf{w}_{f}^{T}\mathbf{x}_{n}\\ ||\mathbf{w}_{T}||^{2} \le ||\mathbf{w}_{T-1}||^{2} + \max_{n} ||y_{n}\mathbf{x}_{n}||^{2}\\ $$ 如果把上面兩個式子從 T-1 一直回推不等式到 $\mathbf{w}_{0}$，由於 $\mathbf{w}_{0}=\mathbf{0}$，會得到： $$ \mathbf{w}_{f}^{T}\mathbf{w}_{T} \ge \mathbf{w}_{f}^{T}\mathbf{w}_{0}+ T\min_{n}\ y_{n}\mathbf{w}_{f}^{T}\mathbf{x}_{n}\\ =T\min_{n}\ y_{n}\mathbf{w}_{f}^{T}\mathbf{x}_{n} --(1)\\ ||\mathbf{w}_{T}||^{2} \le ||\mathbf{w}_{0}||^{2} + T\max_{n} ||y_{n}\mathbf{x}_{n}||^{2}\\ =T\max_{n} ||y_{n}\mathbf{x}_{n}||^{2}--(2)\\ $$ 其中： $$ ||\mathbf{w}_{T}||^{2} \le T\max_{n} ||y_{n}\mathbf{x}_{n}||^{2}\\ \Rightarrow ||\mathbf{w}_{T}|| \le \sqrt{T}\max_{n} ||y_{n}\mathbf{x}_{n}||\\ \Rightarrow \frac{1}{||\mathbf{w}_{T}||} \ge \frac{1}{\sqrt{T}\displaystyle\max_{n}||y_{n}\mathbf{x}_{n}||}--(3)\\ $$ 上面不等式會成立的原因是兩者均大於 0。將 3 跟 1 結合後就會發現： $$ \frac{\mathbf{w}_{f}^{T}}{||\mathbf{w}_{f}||}\frac{\mathbf{w}_{T}}{||\mathbf{w}_{T}||} \ge \frac{T\displaystyle\min_{n}\ y_{n}\mathbf{w}_{f}^{T}\mathbf{x}_{n}}{||\mathbf{w}_{f}||\sqrt{T}\displaystyle\max_{n} ||y_{n}\mathbf{x}_{n}||} $$ 因此對於經過 T 次修正的 $\mathbf{w}_{T}$ 可以得到下面的關係： $$ \frac{\mathbf{w}_{f}^{T}}{||\mathbf{w}_{f}||}\frac{\mathbf{w}_{T}}{||\mathbf{w}_{T}||} \ge \sqrt{T}*constant $$ 也就是說正確的 $\mathbf{w}_{f}$ 跟 T 次更新的 $\mathbf{w}_{T}$ 進行正規化(標準化或單位化)，其內積的值會大於等於 $\sqrt{T}$ 乘上一個常數。既然是正規化的值，那麼就可以去除長度的影響；並且因為內積的值越來越大，也就代表**每次更新的 $\mathbf{w}_{T}$ 會越來越靠近正確的 $\mathbf{w}_{f}$。** 而內積的值最大最大就是 1，所以我們知道在某次更新之後，就會停下來。 $$ \sqrt{T}*\frac{\displaystyle\min_{n}\ y_{n}\mathbf{w}_{f}^{T}\mathbf{x}_{n}}{||\mathbf{w}_{f}||\displaystyle\max_{n} ||y_{n}\mathbf{x}_{n}||} \le 1\\ \Rightarrow T*\frac{(\displaystyle\min_{n}\ y_{n}\mathbf{w}_{f}^{T}\mathbf{x}_{n})^{2}}{(||\mathbf{w}_{f}||\displaystyle\max_{n} ||y_{n}\mathbf{x}_{n}||)^{2}} \le 1\\ \Rightarrow T \le \frac{(||\mathbf{w}_{f}||\displaystyle\max_{n} ||y_{n}\mathbf{x}_{n}||)^{2}}{(\displaystyle\min_{n}\ y_{n}\mathbf{w}_{f}^{T}\mathbf{x}_{n})^{2}} $$ 所以如果我們令： $$ R^{2}=(\displaystyle\max_{n} ||y_{n}\mathbf{x}_{n}||)^{2}\\ \rho = \frac{\displaystyle\min_{n}\ y_{n}\mathbf{w}_{f}^{T}\mathbf{x}_{n}}{||\mathbf{w}_{f}||} $$ 上面的 T 範圍可以簡潔的表示為： $$ T \le \frac{R^{2}}{\rho^{2}} $$ --- # 好處程式碼簡單，並且執行很快，且適合用在任意維度。 # 壞處 - 要先假設資料是「線性可分」 - 如果資料有雜訊，也就是本該為 $\circ$ 的變成了 $\times$，這樣資料就會是線性不可分的 - 或者打從一開始資料就不是線性的，可能是二次或三次等等。 - 因此下面會有對 PLA 的修正版本 - 不是很能確定多久會停下來 - 上面有推導出 T 的範圍，但是 T 的範圍是跟 $\mathbf{w}_{f}$ 有關 - 但我們並不知道 $\mathbf{w}_{f}$ 是多少，因此我們其實也不能確定到底 T 多久會停 - 雖然實務上很快就停了 --- ## Noise 本該為 $\circ$ 的變成了 $\times$。但如果 noise 夠小，那 **「通常來說」**： $$ y_{n}=f(\mathbf{x}_{n}) $$ 也就是說如果找到 $g \approx f$，也就代表 **「通常來說」**： $$ y_{n}=g(\mathbf{x}_{n}) $$ 那如何找到一個 w 使得錯誤的次數最少？ $$ \mathbf{w}_{g}\leftarrow \arg\min_{\mathbf{w}}\sum_{n=1}^{N}[[y_{n}\ne sign(\mathbf{w}^{T}\mathbf{x}_{n})]] $$ 很不幸的，這是一個 NP 問題。所以替代方案就是加入「口袋」。 :::info 翻譯： - argmin：數值最小的 argument，換句話說就是取會得到最小值的 $\mathbf{w}$ - [[]]：布林邏輯的式子，也就是說上面可以翻譯為「如果 y 不等於算出的值則等於 1，不然就等於 0」 ::: # Pocket PLA 跟原版 PLA 一樣，但是我們每得到一個新的 $\mathbf{w}_{t}$，會將它跟當前「口袋中的」$\mathbf{w}_{f}$，做比較，看誰在整個資料中的錯誤率最低。如果新的 $\mathbf{w}_{t}$ 比較好，就把他替換成 $\mathbf{w}_{f}$。如果沒有比較好，就繼續以 $\mathbf{w}_{t}$ 進行 PLA。直到 $\mathbf{w}_{f}$ 修正到「足夠次數後」，就會停止。我們相信足夠多次的結果會是蠻好的。 ## 差異由於比原版 PLA 多了檢查的動作，所以如果某個資料是真的線性可分的話，會比原本的 PLA 慢一些。