論文閱讀: Random Features for Large-Scale Kernel Machines

--- tags: Data science --- # 論文閱讀: Random Features for Large-Scale Kernel Machines [論文](https://people.eecs.berkeley.edu/~brecht/papers/07.rah.rec.nips.pdf) ## Introduction kernel machine 是使得 SVM 強大的關鍵，其思考點是如果把資料投影到更高維度，在高維的空間中資料就能更好的被[超平面(hyperplain)](https://en.wikipedia.org/wiki/Hyperplane) 區隔開。但是 kernel machine 也伴隨著在巨量的資料下往往需要大量時間來運算的缺點。另一方面，不少的研究中優化了 linear Support Vector Machines 的計算，降低了建立模型的時間複雜度(如 [Training Linear SVMs in Linear Time](https://www.cse.iitb.ac.in/~soumen/readings/papers/Joachims2006linearTime.pdf))。論文中試圖結合這兩種優勢，通過將數據映射到相對低維的隨機特徵空間中，有效地將 kernel machine 的 training 和 evalutation 轉換成 linear machine 的對應操作。對於一組資料，如果透過 function $φ$ 提升資料的維度，就可以在高維空間中更好的透過 linear function 來區分，這個提升維度的作法稱為 Implicit lifting。而 kernel method 則是建立在此想法基礎上，如果存在一個 function $k$ ($k$ 需滿足正定, positive definite)，使得 $k(x,y)= <φ(x),φ(y)>$，這個 $k(x,y)$ 就是一個kernel function。我們可以利用 $k(x,y)$ 來取代 $<φ(x),φ(y)>$ 減少運算的負擔。 :::info $<a,b>$ 表示 a 與 b 的內積(dot) ::: 雖說 $k(x,y)$ 比起 $<φ(x),φ(y)>$ 的運算量有所縮減，但在透過 kernel method 降低實際提升資料維度對複雜度的影響時，也導致了資料數量對複雜度的影響提升，因此當資料過於龐大，會導致運算效率的低下。因此論文試圖找到一個轉換 $z$，在得解可以近似於高維映射的前提下，轉換到低維的空間， ## Random Fourier Features 整個方法的推導得從 [Bochner’s theorem](https://en.wikipedia.org/wiki/Bochner%27s_theorem) 開始說起 > Bochner’s theorem: A continuous kernel $k(\mathbf{x}, \mathbf{y}) = k(\mathbf{x} - \mathbf{y})$ on $\mathbb{R}^D$ is positive definite if and only if $k(\delta)$ is the Fourier transform of a non-negative measure. The Fourier transform 的 non-negative measure 稱作 $p(\omega)$，表示為: $$ k(\delta) =\int p(\omega)exp(i\omega\Delta) \,dw $$ 對於 shift-invariant (滿足 $k(\mathbf{x}, \mathbf{y}) = k(\mathbf{x} - \mathbf{y})$) 且有正定性的函式(gaussian、 radial basis function 皆是)，Bochner’s theorem 保證 shift invariant 的 $k(\delta)$ 的傅立葉轉換 coefficients 都是 non-negative，因此我們可以把轉換後的結果視為是[機率分布](https://en.wikipedia.org/wiki/Probability_distribution)，從中抽樣的 $w$ 則可以用來得到近似解。 :::danger 具體推導吃掉了，實力太差看不懂 :cry: ::: ![](https://i.imgur.com/G15ZOcg.png) 結論的演算法相當簡單: 透過在 kernel function 轉換的機率函數中抽出 $w$，以及在 uniform distribution 中抽出的 $b$ 可以得到 $z(x)$。$z(x)$ 轉換後的資料，會逼近於使用 kernel function 中的 $φ$ 做 implicit lifting 後的結果。透過下面的 python code 可以更好的理解，這個 python 程式可以檢驗 $z(x)$ 是否是 $k(x)$ 的近似解，也就是: ![](https://i.imgur.com/QoOqfi5.png) :::warning :book: 程式參考自 [kernelapprox.py](https://github.com/gwgundersen/random-fourier-features/blob/master/kernelapprox.py)，但稍微修改了一下: * 修改了一下變數名稱，盡量跟論文一致方便理解 * 這裡的 `ZZ` 跟原始碼稍有不同，不知道是我理解有誤還是原始程式寫錯，但我覺得應該不用再除 `R`，如果有清楚的人再麻煩指導一下 QAQ ::: ```python= import matplotlib.pyplot as plt import numpy as np from sklearn.metrics.pairwise import rbf_kernel from sklearn.datasets import make_s_curve fig, axes = plt.subplots(1, 5) fig.set_size_inches(15, 4) font = {'fontname': 'arial', 'fontsize': 18} N = 1000 d = 3 X, t = make_s_curve(N, noise=0.1) X = X[t.argsort()] # The RBF kernel is the Gaussian kernel if we let \gamma = 1 / (2 \sigma^2). K = rbf_kernel(X, gamma=1/2.) axes[0].imshow(K, cmap=plt.cm.Blues) axes[0].set_title('Exact RBF kernel', **font) axes[0].set_xticks([]) axes[0].set_yticks([]) for D, ax in zip([1, 10, 100, 1000], axes[1:]): W = np.random.normal(loc=0, scale=1, size=(D, d)) b = np.random.uniform(0, 2*np.pi, size=D) B = np.repeat(b[:, np.newaxis], N, axis=1) norm = 1./ np.sqrt(D) Z = norm * np.sqrt(2) * np.cos(W @ X.T + B) ZZ = Z.T@Z ax.imshow(ZZ, cmap=plt.cm.Blues) ax.set_title(r'$\mathbf{Z} \mathbf{Z}^{\top}$, $R=%s$' % D, **font) ax.set_xticks([]) ax.set_yticks([]) plt.tight_layout() plt.show() ``` 以 gaussain kernel 為例(當 rbf kernel 的 gamma = $1 / (2 \sigma^2)$ 時，為 gaussian kernel，因此這裡是 $\sigma$ 為 1 的 gaussian kernel)，對於 `make_s_curve` 產生的資料 `X` ，透過 rbf kernel 映射後得到 `K`。 Gaussian kernel 的 $p(w)$ 是 normal distribution，因此在 normal distribution 中抽樣 R 個 ω(scale = 1，因為前面 gaussian_kernel 的 $\sigma$ 也是 1)，並且在 uniform distribution 的 $[0,2\pi]$ 區間抽樣 $b$，然後透過論文所說得到 $z(x)$， $z(x) = \sqrt{2/D} (\sum\limits_{i = 1}^D{cos(w'i+bi))'}$ 去計算 $ZZ^T$ 從結果可見，當抽樣的數量越大，得到的映射確實越接近真實的解。 ![](https://i.imgur.com/T6oAgkY.png) ## Random Binning Feature 這個方法的思維是，將 input space 根據隨機的位移量跟 resolution 分成若干等份的 grid，並把 input 映射成其所落在的 grid 對應的 bit string，如此重複，以兩個數據點 x、y 落在同一個 grid 中的頻率來近似 $k(x,y)$。 :::danger 數學真的太差理解不能，建議大家還是直接參考 reference 或者直接讀論文... ::: 比起 random Fourier features，random binning feature 能應用的 kernel 又更受限。Random binning feature 適用於 non-smooth 的 Laplace kernel，但卻不適合我們更常用的 smooth kernel，例如 Gaussian kernel (雖然後續也有論文嘗試對此做出調整，參考[Scaling up Kernel Ridge Regression via Locality Sensitive Hashing](https://arxiv.org/abs/2003.09756))。 ## Experiment ![](https://i.imgur.com/GtI82od.png) 實驗中基於 5 個不同的大數據集，比較使用 * random feature + [ridge](https://medium.com/chung-yi/ml%E5%85%A5%E9%96%80-%E4%BA%8C%E5%8D%81%E4%BA%8C-ridge-regression-f638e1887a7e) * [CVM](https://www.jmlr.org/papers/v6/tsang05a.html): 根據其他論文表示是當時類似的模型中，速度更快、且表現最好的方法，故論文中特別與之比較 * 使用完整 kernel function 的 SVM: 包含 libsvm、SVMTorch、SVM light 等等知名的套件實驗顯示，論文的方法在使用 random feature，透過簡單的 linear model 訓練後，訓練速度不僅快過 CVM，也可以達到比擬同等 kernel method 的準確度。 Random Fourier feature 在依賴差值的任務中有較佳的表現;而 random binning features 對於 memorization tasks (由SVM所需的 support vectors 數量來衡量)的表現更佳，因為此方法準確的近似原始資料在 input space 中的 locality。在 *Forest* 資料集可以看出端倪：由於此問題的 decision boudary 不平滑而需要大量的 support vectors，導致 random Fourier feature 在此資料上的表現並不好。 ![](https://i.imgur.com/oFKUQp3.png) 實驗中也顯示了足夠的 training set 對於 testing set 的影響(圖左)，如果資料量夠大，就不需要依賴特別強大的模型而能達到好的結果。圖中和圖右則顯示了 random binning features 中 P 的增加對於時間耗費的影響緩慢，然而對於 error rates 的降低效果佳。 ## Reference [video: Random Kitchen Sinks: Replacing Optimization with Randomization in Learning](https://www.youtube.com/watch?v=Nqi2iU7kbD0&feature=youtu.) [機器學習: Kernel 函數 ](https://medium.com/@chih.sheng.huang821/%E6%A9%9F%E5%99%A8%E5%AD%B8%E7%BF%92-kernel-%E5%87%BD%E6%95%B8-47c94095171) [Implicit Lifting and the Kernel Trick](https://gregorygundersen.com/blog/2019/12/10/kernel-trick/) [Random Fourier Features](https://gregorygundersen.com/blog/2019/12/23/random-fourier-features/) [Reflections on Random Kitchen Sinks](http://www.argmin.net/2017/12/05/kitchen-sinks/) [随机装箱算法（Random Binning Features）](https://blog.csdn.net/rovance/article/details/78814219) [Ridge and Lasso Regression: L1 and L2 Regularization](https://towardsdatascience.com/ridge-and-lasso-regression-a-complete-guide-with-python-scikit-learn-e20e34bcbf0b)