Kernel Support Vector Machine

# 前情提要上次說到會出問題的 $Q$ 矩陣，來看看他到底哪裡出問題： $$ \color{red}{\mathbf{q}_{n,m}}=y_{n}y_{m}z_{n}z_{m} $$ 原因就出在 $z_{n}z_{m}$ 這個矩陣內積。如果 $\tilde{d}+1$ 維很大，就會面臨之前提到的問題。解決方法就是去想辦法優化「內積」這個動作。 # Kernel Funtion 這裡先來討論多項式轉換 Polynomial transform 要如何優化。假設有一個特徵轉換： $$ \phi(\mathbf{x})=(1,x_{1},x_{2},...,x_{d},x_{1}^{2},x_{1}x_{2},...,x_{1}x_{d},x_{2}x_{1},x_{2}^{2},...,x_{2}x_{d},...,x_{d}x_{1},...,x_{d}^{2}) $$ 這裡同時存在 $x_{i}x_{j}$ 跟 $x_{j}x_{i}$ 是為了方便整理。總之經過上面的轉換，可以發現： $$ \phi(\mathbf{x})\phi(\mathbf{x}^{'})=1+\sum_{i=1}^{d}x_{i}x^{'}_{i}+\sum_{i=1}^{d}\sum_{j=1}^{d}x_{i}x_{j}x^{'}_{i}x^{'}_{j} $$ 根據連加的特性，我們可以化簡為： $$ \phi(\mathbf{x})\phi(\mathbf{x}^{'})=1+\sum_{i=1}^{d}x_{i}x^{'}_{i}+\sum_{i=1}^{d}x_{i}x_{j}\sum_{j=1}^{d}x^{'}_{i}x^{'}_{j}\\ \Rightarrow 1 + \mathbf{x}^{T}\mathbf{x}^{'}+(\mathbf{x}^{T}\mathbf{x}^{'})^{2} $$ 上面的結果告訴我們，我們不用真的把 $\mathbf{x}$ 投射到 $\mathcal{Z}$ 空間去，只要在 $\mathcal{X}$ 空間做好一次內積，就可以快速的算出轉換後的內積結果了。我們把這個方便的~~偷吃步~~叫做 Kernel Funtion。 $$ K_{\phi}(\mathbf{x},\mathbf{x}^{'})\equiv \phi(\mathbf{x})\phi(\mathbf{x}^{'}) $$ # Kernal SVM 既然可以這麼方便算出轉換後內積的值，那我們來看看哪裡還可以利用這個好東西： $$ b = y_{s}-\mathbf{w}^{T}\mathbf{z}_{s}=y_{s}-\left(\sum_{n=1}^{N}\alpha_{n}y_{n}\mathbf{z}_{n}\right)^{T}\mathbf{z}_{s}\\ \Rightarrow y_{s}-\sum_{n=1}^{N}\alpha_{n}y_{n}K(\mathbf{x}_{n},\mathbf{x}_{s}) $$ 甚至連最後的 $g_{SVM}$ 都不用把 $\mathbf{w}$ 給算出來： $$ g_{SVM}(\mathbf{x}) = sign\left(\sum_{\text{SV indices n}}\alpha_{n}y_{n}K(\mathbf{x}_{n},\mathbf{x})+b\right) $$ 這樣一來，我們成功消除對 $\tilde{d}$ 的依賴。如此使用 Kernal Function 來增加效率，並且只需要依靠 $\alpha$ 而不需要真的把 $\mathbf{w}$ 算出來的作法，叫做 Kernal SVM。 # General Polynomial Kernel / Polynomial SVM 如果可以把 Kernal Function 弄成完全平方式就更完美了，所以我們稍微將轉換改成： $$ \phi(\mathbf{x})=(\zeta,\sqrt{2\zeta\gamma}x_{1},...,\sqrt{2\zeta\gamma}x_{d},\gamma x_{1}^{2},...,\gamma x_{d}^{2}) $$ 我們就可以湊成： $$ K_{2}(\mathbf{x},\mathbf{x}^{'})=\left(\zeta+\gamma \mathbf{x}^{T}\mathbf{x}^{'}\right)^{2} $$ 甚至我們可以有了更高項次的轉換： $$ K_{Q}(\mathbf{x},\mathbf{x}^{'})=\left(\zeta+\gamma \mathbf{x}^{T}\mathbf{x}^{'}\right)^{Q} $$ 但是必須要限制： $$ \zeta \le 0\\ \gamma > 0 $$ 這種使用了多項式轉換的又可以叫做 **Polynomial SVM**。 ## 好處由於是胖胖超平面，所以 $d_{VC}$ 是有受到控制的。 ## 壞處如果項次太高，$\zeta+\gamma \mathbf{x}^{T}\mathbf{x}^{'}$ 的值很容易非常大或非常小。並且有三個參數需要作調整，更難去做選擇。 :::info 跟以前提到過的一樣，建議先從簡單的開始做起，也就是 $Q=1$ 的 Linear SVM。而且如果是 Linear SVM，通常會有特別優化的 QP 可以用，會更有效率。 ::: --- # Infinite Dimensional Transform / Gaussian Kernel 透過對指數函數/高斯函數的泰勒展開來達到無限多次項的特徵轉換。下面以one dimension 做舉例： $$ \large \mathbf{x}=x\\ \large K(x,x^{'})=e^{-(x-x^{'})^{2}}\\ \large =e^{-(x)^{2}}e^{-(x^{'})^{2}}e^{2xx^{'}}\\ \large =e^{-(x)^{2}}e^{-(x^{'})^{2}}\sum_{i=0}^{\infty}\frac{(2xx^{'})^{i}}{i!}\\ \Large =\sum_{i=0}^{\infty}\color{blue}{e^{-(x)^{2}}}\color{red}{e^{-(x^{'})^{2}}}\color{blue}{\sqrt{\frac{2^{i}}{i!}}}\color{red}{\sqrt{\frac{2^{i}}{i!}}} \color{blue}{x^{i}}\color{red}{x^{'i}}\\ \large =\color{blue}{\phi(x)^{T}}\color{red}{\phi(x^{i})} $$ 其中： $$ \large \phi(x)=e^{-(x)^{2}}\left(1,\sqrt{\frac{2^{1}}{1!}}x,\sqrt{\frac{2^{2}}{2!}}x^{2}...\right) $$ 這樣就達成了無限多項的轉換。而當 $\mathbf{x}$ 是多維的情形也可以得到一樣的結論，所以我們就有了 Gaussian Kernel 這個酷東西： $$ \Large K(\mathbf{x},\mathbf{x^{'}})=e^{-\gamma \left\|\mathbf{x}-\mathbf{x^{'}}\right\|^{2}} $$ # 不同角度 / Radial Basis Function (RBF) kernel 這個 Gaussian Kernel 在做的事情，直觀上來看就是將 $\mathbf{x}$ 轉換成無限多個維度，然後再從這個無限多個維度中找到那個最好的胖胖超平面來幫我們做預測。但是如果仔細一看： $$ g_{SVM}(\mathbf{x}) = sign\left(\sum_{SV}\alpha_{n}y_{n}K(\mathbf{x}_{n},\mathbf{x})+b\right)\\ =sign\left(\sum_{SV}\alpha_{n}y_{n}e^{-\gamma \left\|\mathbf{x}_{n}-\mathbf{x}\right\|^{2}}+b\right)\\ \Large =sign\left(\sum_{SV}\alpha_{n}y_{n}e^{-\gamma \left\|\mathbf{x}-\mathbf{x}_{n}\right\|^{2}}+b\right) $$ 最後一行告訴我們，這是個以 $\mathbf{x}_{n}$ 為中心，如此多個高斯函數的線性組合；所以他又叫做 Radial Basis Function kernel： - Radial：高斯函數的特性，從某個中心向外擴散 - Basis Function：拿來做線性組合的意思 # Gaussian SVM Gaussian Kernel 的 SVM。找出 $\alpha$，並找到支持項量的高斯函數，進行線性組合。 ## $\gamma$ 的值其實 $\gamma$ 就是 [Variance 的倒數](https://hackmd.io/EnJGsZCxS7OMPeu8arRmHw?view#PDF)，所以如果 $\gamma$ 越大，Variance 就越小，也就代表高斯函數長的尖尖的。因此太高的 $\gamma$ 會容易發生一個一個小島的情況；換句話說，如果 $\gamma \rightarrow \infty$，則 Kernel 就會退化為： $$ K_{lim}(\mathbf{x},\mathbf{x}^{'}) = [[x = x^{'}]] $$ --- # 優缺統整 - Linear Kernel： - 好處 - 簡單優先 - 有特別優化的 QP - 容易解釋 - 壞處 - 存在線性不可分的資料 :::info Linear 是一個重要的基礎工具 ::: - Polynomial Kernel： - 好處 - 可處理線性不可分 - 容易控制對資料的想像(潛在函數是幾次方) - 壞處 - 項次太大的危機 - 有三個參數要做選擇 :::info 小的項次或許會比較好。有時轉換後的維度沒有很高，所以在處理 $Q$ 矩陣的部分，或許此時直接使用 Linear Kernel 搭配優化 QP 會更快。 ::: - Gaussian Kernel： - 好處 - 有無限多個維度，超級 Powerful - 指數的關係使得 Kernel 值被限制 0 到 1 之間，沒有 Poly 遇到的項次問題 - 只有一個參數要選 - 壞處 - 難以解釋，神秘的無限維度 $\mathbf{w}$ - 如果用 Linear 就可以的話，可能會比 Linear 慢一些 - 可能會太有力量 too powerful # Other Valid Kernel 其實 Kernel 就是在算向量的相似性，所以可以找一些其他在算相似性的作為 Kernel function。但是要注意，必須也要是計算向量相似性的才可以當作是一種 Kernel function。 ## Mercer’s condition 數學家已經證明，一個 Kernel function 和以下兩個條件互為充分必要條件：令 $\mathbf{K}$ 矩陣中的 $k_{ij}=K(x_{i},x_{j})$，則 $\mathbf{K}$ 矩陣： - 必須有對稱性 Symmetric - 必須是半正定 Positive semi-definite 這兩個條件叫做 Mercer’s condition。但是通常來說，要自己證明一個 Kernel function 是一件很困難的事。 # Hard-Margin SVM / 線性可分目前提到的 Linear、Poly、Gaussian SVM 都是建立於資料在 $\mathcal{Z}$ 中是線性可分的，這種叫做 Hard-Margin SVM。那如果有 Noise 要怎麼辦，答案就是使用 [Soft-Margin SVM](https://hackmd.io/@ShawnNTU-CS/rJS8Wkchh)。