Support Vector Machine

<center> <img src = "https://i.imgur.com/MjdXyB5.jpg"> </center> ## Prerequisite Support vector machine 是將資料視為 $p$ 維度的向量，並希望透過 $(p-1)$ 的超平面來分開這些資料。而最佳超平面即是以最大間隔來把兩個類分開的超平面。 ### Hyperplane 超平面是 $n$ 維歐氏空間中的 $n-1$ 維的子空間。超平面可由以下方程式定義： $$ a_0x_0 + \dots + a_nx_n = b $$ 其中 $a_0, \dots, a_n$ 是不全為零的常數。 ### Generalization error 是衡量演算法能夠預測以前未見過的數據的結果值的準確性的指標 ### Hinge loss 對於一個預期輸出 $t = \pm 1$ 分類結果 $y$ 的 Hinge loss 定義為 $$ l(y) = \max (0, 1 - t \cdot y) $$ Hinge loss 是一個用於訓練分類器的損失函數。Hinge loss 被用於「最大間格分類」，非常適合用於支持向量機 (SVM)。 ## Goal Support vector machine 在高維或無限維空間中構造超平面或超平面集合，此演算法可以用於分類、回歸或其他任務。 ## Background 以下情境先假設為分類任務 ### Linear Support vector machine 以下考慮 $n$ 個資料，$(\vec{x}_1,y_1), \dots, (\vec{x}_n, y_n)$，其中 $y_i$ 代表的是 $\vec{x}_i$ 所屬的類別。而所要解決的最佳化問題是分隔每個資料點在正確一側，可以寫作 $$ y_i(\vec{w} \cdot \vec{x}_i - b) \ge 1, \ \ \ \text{for all } 1 \le i \le n $$ 也就是在 $y_i(\vec{w} \cdot \vec{x}_i - b) \ge 1$ 下，對於 $i = 1, \cdots, n$ 最小化 $\| \vec{w} \|$ 其中，最大間隔超平面完全是由最靠近它的那些 $\vec{x}_i$ 決定，而 $\vec{x}_i$ 也被稱為 support vector。 --- 若遇到資料線性不可分的情況，則必須使用 Hinge loss $$ \max (0, 1 - y_i (\vec{w} \cdot \vec{x_i} - b)) $$ 而當方條件式滿足時 $$ y_i(\vec{w} \cdot \vec{x}_i - b) \ge 1, \ \ \ \text{for all } 1 \le i \le n $$ 此函數為零。對於間隔的錯誤一側的資料，該函數的值與距間隔的距離成正比。而所要最小化的 loss function 如下 $$ \left[\frac{1}{n} \sum_{i=1}^n \max (0, 1- y_i (\vec{w} \cdot \vec{x}_i) - b)\right] + \lambda \| \vec{w} \|^2 $$ 其中參數 $\lambda$ 是一個超參數，確保 $\vec{x}_i$ 位於間隔的正確一側之間的關係。如此一來，對於足夠小的 $\lambda$，如果輸入資料是可以線性分類的，則軟間隔 SVM 與硬間隔 SVM 將表現相同，但即使不可線性分類，仍能用於分類任務。 ### Non-linear Support vector machine ![](https://hackmd.io/_uploads/Hkhc_yQ0h.png) <center> <a href="https://zh.wikipedia.org/zh-tw/%E6%94%AF%E6%8C%81%E5%90%91%E9%87%8F%E6%9C%BA#/media/File:Kernel_Machine.png">Kernel Function</a> </center><br> 透過 Kernel function 我們可以將低維非線性可分離函數計算成高維線性可分離函數，這樣的話在遇到非線性分類時，Support Vector Machine 還是可以進行分類任務。 ### Calculate Loss Function 我們可以透過 Sub-gradient descent 來計算 Support Vector Machine **Sub-gradient descent** $$ f(\vec{w},b) = \left[\frac{1}{n} \sum_{i=1}^n \max (0, 1- y_i (\vec{w} \cdot \vec{x}_i) - b)\right] + \lambda \| \vec{w} \|^2 $$ $f :\mathbb{R}^n \rightarrow \mathbb{R}$ 定義為 $\mathbb{R}^n$ 上的凸函數，其中使用下述格式更新 $$ x^{(k+1)} = x^{(k)} - \alpha_k g^{(k)} $$ ## Implementation ```python class SVM (object): def __init__(self, kernel='linear', C=1.0, gamma='auto'): self.kernel = kernel self.C = C self.gamma = gamma self.alpha = None self.b = None self.support_vector = None self.support_vector_label = None def _kernel(self, x1, x2): if self.kernel == 'linear': return x1 @ x2.T elif self.kernel == 'rbf': return np.exp(-self.gamma * np.linalg.norm(x1 - x2) ** 2) # gamma = 1 / (2 * sigma^2) elif self.kernel == 'poly': return (1 + x1 @ x2.T) ** self.gamma else: raise ValueError('kernel not supported') def fit(self, x, y): n_samples, n_features = x.shape K = np.zeros((n_samples, n_samples)) for i in range(n_samples): for j in range(n_samples): K[i, j] = self._kernel(x[i], x[j]) ''' define and solve QP problem min 1/2 x^T P x + q^T x s.t. Gx <= h Ax = b x >= 0 ''' P = np.outer(y, y) * K # P = y^T y K q = -1 * np.ones(n_samples) # q = -1 G = np.diag(np.ones(n_samples) * -1) # G = -I h = np.zeros(n_samples) # h = 0 A = y.T # A = y^T b = np.zeros(1) # b = 0 a = np.linalg.solve(np.dot(G, P) + np.eye(n_samples), np.dot(G, q)) # Lagrange multipliers sv = a > 1e-5 # support vectors ind = np.arange(len(a))[sv] # indices of support vectors self.alpha = a[sv] # Lagrange multipliers of support vectors self.support_vector = x[sv] # support vectors self.support_vector_label = y[sv] # labels of support vectors # Intercept # b = 1/n sum(y_n - sum(alpha_i y_i K(x_n, x_i))) self.b = 0 for n in range(len(self.alpha)): self.b += self.support_vector_label[n] self.b -= np.sum(self.alpha * self.support_vector_label * K[ind[n], sv]) self.b /= len(self.alpha) # Weight vector # w = sum(alpha_i y_i x_i) if self.kernel == 'linear': self.w = np.zeros(n_features) for n in range(len(self.alpha)): self.w += self.alpha[n] * self.support_vector_label[n] * self.support_vector[n] else: self.w = None def predict(self, x): if self.w is not None: return np.sign(x @ self.w + self.b) else: y_predict = np.zeros(len(x)) for i in range(len(x)): s = 0 for alpha, sv_y, sv in zip(self.alpha, self.support_vector_label, self.support_vector): s += alpha * sv_y * self._kernel(x[i], sv) y_predict[i] = s return np.sign(y_predict + self.b) def score(self, x, y): y_predict = self.predict(x) return np.mean(y_predict == y) ``` 　 ## From Scratch ## More example [House Prices - Advanced Regression Techniques](https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/overview) ## Reference 歡迎更仔細閱讀以下相關內容以了解本篇知識 - [支持向量機](https://zh.wikipedia.org/zh-tw/%E6%94%AF%E6%8C%81%E5%90%91%E9%87%8F%E6%9C%BA) - [R語言自學日記(20)-機器學習(一)：支持向量迴歸(Support Vector Regression)](https://medium.com/r-%E8%AA%9E%E8%A8%80%E8%87%AA%E5%AD%B8%E7%B3%BB%E5%88%97/r%E8%AA%9E%E8%A8%80%E8%87%AA%E5%AD%B8%E6%97%A5%E8%A8%98-20-%E6%A9%9F%E5%99%A8%E5%AD%B8%E7%BF%92-%E4%B8%80-%E6%94%AF%E6%8C%81%E5%90%91%E9%87%8F%E8%BF%B4%E6%AD%B8-support-vector-regression-30cd834a918)