# Machine learning - Kernel
## Kernel functions
A kernel function is a real-valued function of two arguments, $\kappa(\textbf{x},\textbf{x}')\in\mathbb R,$ where $\textbf{x},\textbf{x}'\in\mathcal X.$ A common kernel is the **linear kernel**, $$\kappa(\textbf{x},\textbf{x}')=\textbf{x}^T\textbf{x}'.$$
This is useful when the original data is already high dimensional and the individual features are informative on their own, for example a bag-of-words representation or the expression levels of many genes.
Another common kernel is the **RBF (radial basis function) kernel**: $$\kappa(\textbf{x},\textbf{x}')=\exp(-\frac{||\textbf{x}-\textbf{x}'||^2}{2\sigma^2}),$$
where $\sigma^2$ is known as the **bandwidth**. It originates from the **squared exponential kernel (SE kernel)**, or **Gaussian kernel**: $$\kappa(\textbf{x},\textbf{x}')=\exp[-\frac12(\textbf{x}-\textbf{x}')^T{\bf\Sigma}^{-1}(\textbf{x}-\textbf{x}')].$$
If $\bf\Sigma$ is diagonal, it can be written as $$\kappa(\textbf{x},\textbf{x}')=\exp[\ -\frac12\sum^D_{j=1}\frac1{\sigma_j^2}(x_j-x_j')^2\ ].$$
We can interpret $\sigma_j$ as the **characteristic length scale** of dimension $j.$ If one of the $\sigma_j$ is infinite, the corresponding dimension is ignored; this is known as the **ARD kernel (automatic relevance determination kernel)**.
If $\bf\Sigma=\sigma^2\,{\bf I},$ a **spherical covariance matrix** (meaning there is no directional bias in the Gaussian distribution), it reduces to the **RBF kernel** mentioned above.
The prior assumptions of the RBF kernel are
- no covariance between dimensions,
- uniform variance across dimensions.
Therefore, the RBF kernel is commonly used as a baseline, as is the linear kernel.
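For concreteness, here is a minimal NumPy sketch of the kernels above (the input vectors and hyper-parameter values are arbitrary illustrative choices, not from the text):
```python
import numpy as np

def linear_kernel(x, xp):
    # kappa(x, x') = x^T x'
    return x @ xp

def rbf_kernel_value(x, xp, sigma=1.0):
    # kappa(x, x') = exp(-||x - x'||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - xp) ** 2) / (2 * sigma ** 2))

def ard_kernel_value(x, xp, lengthscales):
    # kappa(x, x') = exp(-0.5 * sum_j (x_j - x'_j)^2 / sigma_j^2)
    return np.exp(-0.5 * np.sum((x - xp) ** 2 / lengthscales ** 2))

x  = np.array([1.0, 2.0, 3.0])
xp = np.array([1.5, 1.0, 3.5])
print(linear_kernel(x, xp))
print(rbf_kernel_value(x, xp, sigma=2.0))
# A huge length scale on dimension 2 effectively ignores that dimension (ARD behavior).
print(ard_kernel_value(x, xp, lengthscales=np.array([1.0, 1e6, 1.0])))
```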
Kernels can be derived from probabilistic generative models.
One approach is to define a kernel as: $$\kappa(\textbf{x}_i,\textbf{x}_j)=\int p(\textbf{x}|\textbf{x}_i)^\rho p(\textbf{x}|\textbf{x}_j)^\rho d\textbf{x},$$
where $\rho>0,$ and $p(\textbf{x}|\textbf{x}_i)$ is often approximated by $p(\textbf{x}|\hat{\pmb\theta}(\textbf{x}_i)),$ where $\hat{\pmb\theta}(\textbf{x}_i)$ is a parameter estimate computed from the single data vector $\textbf{x}_i.$ This is called a **probability product kernel**.
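As a toy illustration (a sketch, not from the text; the one-dimensional Gaussian model, the fixed `sigma`, and $\rho=1$ are all assumptions for concreteness), take $p(x|x_i)=\mathcal N(x\,|\,x_i,\sigma^2).$ For $\rho=1$ the integral has the closed form $\mathcal N(x_i\,|\,x_j,2\sigma^2),$ which a simple grid approximation reproduces:
```python
import numpy as np
from scipy.stats import norm

def prob_product_kernel(xi, xj, sigma=1.0, rho=1.0):
    # kappa(x_i, x_j) = integral of p(x|x_i)^rho * p(x|x_j)^rho dx, with p(x|x_i) = N(x | x_i, sigma^2)
    grid = np.linspace(-20.0, 20.0, 20001)
    dx = grid[1] - grid[0]
    integrand = norm.pdf(grid, loc=xi, scale=sigma)**rho * norm.pdf(grid, loc=xj, scale=sigma)**rho
    return np.sum(integrand) * dx  # simple Riemann-sum approximation of the integral

xi, xj, sigma = 0.3, 1.2, 1.0
print(prob_product_kernel(xi, xj, sigma=sigma))
print(norm.pdf(xi, loc=xj, scale=np.sqrt(2) * sigma))  # closed form for rho = 1; should match
```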
Another is the **Fisher kernel**: $$\kappa(\textbf{x},\textbf{x}')=\textbf{g}_\theta(\textbf{x})^T\textbf{F}_{\theta}^{-1}\textbf{g}_\theta(\textbf{x}'),$$where $$\textbf{g}_\theta(\textbf{x})=\nabla_{\theta}\log p(\textbf{x}|{\theta})$$is called the **score function**, and $\textbf{F}_\theta$ is the **Fisher information matrix**: $$\textbf{F}_\theta=\mathbb E_{\textbf{x}\sim p(\textbf{x}|\theta)}[\textbf{g}_\theta(\textbf{x})\textbf{g}_\theta(\textbf{x})^T].$$
The rationale of the Fisher kernel is that two vectors $\textbf{x},\textbf{x}'$ are similar if their score vectors (directional gradients) are similar with respect to the geometry encoded by the curvature of the likelihood function.
In practice, we can approximate the Fisher information matrix by the observed or the empirical Fisher information.
The observed Fisher information plugs in the MLE, $\theta=\hat\theta,$ and takes the negative Hessian of the log-likelihood: $$\textbf{F}=-\nabla_{\theta}\nabla_{\theta}\log p(\textbf{x}|\hat\theta).$$
The empirical Fisher information instead averages the outer product of the scores over the data: $$\textbf{F}=\frac1N\sum_{i=1}^N\textbf{g}_{\hat{\theta}}(\textbf{x}_i)\,\textbf{g}_{\hat\theta}(\textbf{x}_i)^T.$$
The code implementation is shown below:
```python
# Empirical Fisher information for a Gaussian N(mu, sigma^2), evaluated at the true parameters
import numpy as np

mu_true = 2.0
sigma_true = 1.5
n_samples = 10000
samples = np.random.normal(loc=mu_true, scale=sigma_true, size=n_samples)

# Per-sample scores with respect to (mu, sigma)
score_mu = (samples - mu_true) / sigma_true**2
score_sigma = ((samples - mu_true)**2 - sigma_true**2) / sigma_true**3
score_stack = np.vstack([score_mu, score_sigma])
empirical_fisher = (score_stack @ score_stack.T) / n_samples

# Observed Fisher information
from autograd import hessian
import autograd.numpy as anp

def total_log_likelihood(params):
    mu, sigma = params
    return anp.sum(-0.5 * anp.log(2 * anp.pi) - anp.log(sigma) - ((samples - mu) ** 2) / (2 * sigma ** 2))

# Negative Hessian of the total log-likelihood, divided by the sample size
observed_fisher = -hessian(total_log_likelihood)(anp.array([mu_true, sigma_true])) / n_samples
```
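Building on the code above (a sketch; the inputs `x1`, `x2` are arbitrary illustrative points), a Fisher kernel value can then be computed as the score inner product weighted by the inverse Fisher information:
```python
# Fisher kernel between two (arbitrary) data points, reusing empirical_fisher,
# mu_true, and sigma_true from the block above.
def score(x, mu, sigma):
    # Score g_theta(x): gradient of log N(x | mu, sigma^2) with respect to (mu, sigma)
    return np.array([(x - mu) / sigma**2,
                     ((x - mu)**2 - sigma**2) / sigma**3])

F_inv = np.linalg.inv(empirical_fisher)
x1, x2 = 1.0, 3.5
kappa_12 = score(x1, mu_true, sigma_true) @ F_inv @ score(x2, mu_true, sigma_true)
print(kappa_12)
```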
There is more to say about the Fisher information. Let us re-define it as: $$\textbf{F}_{\theta;\theta'}=\mathbb E_{\textbf{x}\sim p(\textbf{x}|\theta)}[\textbf{g}_{{\theta'}}(\textbf{x})\textbf{g}_{{\theta'}}(\textbf{x})^T].$$Notice that the two parameters are no longer the same: $\theta$ is the ground-truth parameter of $p(\textbf{x}|\theta),$ and $\theta'$ is a free parameter.
If we set $\theta'=\theta,$ $$\begin{align}\textbf{F}_{\theta;\theta}&=\mathbb E_{\textbf{x}\sim p(\textbf{x}|\theta)}[\textbf{g}_{\theta}(\textbf{x})\textbf{g}_{\theta}(\textbf{x})^T]\\&=\mathbb E_{\textbf{x}\sim p(\textbf{x}|\theta)}[[\nabla_{\theta}\log p(\textbf{x}|\theta)]\ [\nabla_{\theta}\log p(\textbf{x}|\theta)]^T]\\&=\mathbb E_{\textbf{x}\sim p(\textbf{x}|\theta)}[[\nabla_{\theta}\log p(\textbf{x}|\theta)-0][\nabla_{\theta}\log p(\textbf{x}|\theta)-0]^T]\\&=\mathbb E_{\textbf{x}\sim p(\textbf{x}|\theta)}[[\nabla_{\theta}\log p(\textbf{x}|\theta)-\mathbb E_{\textbf{x}\sim p(\textbf{x}|\theta)}[\nabla_{\theta}\log p(\textbf{x}|\theta)]][\nabla_{\theta}\log p(\textbf{x}|\theta)-\mathbb E_{\textbf{x}\sim p(\textbf{x}|\theta)}[\nabla_{\theta}\log p(\textbf{x}|\theta)]]^T]\\&=\text{cov}(\nabla_{\theta}\log p(\textbf{x}|\theta),\nabla_{\theta}\log p(\textbf{x}|\theta))\\&=\text{cov}[\textbf{g}_{\theta}(\textbf{x})].\end{align}$$Here we used the fact that $$\begin{align}\mathbb E_{\textbf{x}\sim p(\textbf{x}|\theta)}[\nabla_{\theta}\log p(\textbf{x}|\theta)]&=\int_{\mathcal X} p(\textbf{x}|\theta)\nabla_{\theta}\log p(\textbf{x}|\theta)\ d\textbf{x}\\&=\int_{\mathcal X}p(\textbf{x}|\theta)\frac{\nabla_{\theta}\, p(\textbf{x}|\theta)}{p(\textbf{x}|\theta)}\ d\textbf{x}\\&=\int_{\mathcal X}\nabla_{\theta}\ p(\textbf{x}|\theta)\ d\textbf{x}\\&=\nabla_{\theta}\int_{\mathcal X}p(\textbf{x}|\theta)\ d\textbf{x}=\nabla_{\theta}\ 1=0\end{align}$$
So when we plug in the ground-truth parameter $\theta,$ the Fisher information is **the covariance of the score function**.
If we take the gradient of the score function again, $$\begin{align}\nabla_{\theta}\ \textbf{g}_{\theta}(\textbf{x})&=\nabla_{\theta}\nabla_{\theta}\log p(\textbf{x}|\theta)\\&=\frac{\nabla_{\theta}\nabla_{\theta}\ p(\textbf{x}|\theta)}{p(\textbf{x}|\theta)}-\frac{[\nabla_{\theta}\ p(\textbf{x}|\theta)]\,[\nabla_{\theta}\ p(\textbf{x}|\theta)]^T}{p(\textbf{x}|\theta)^2}\\&=\frac{\nabla_{\theta}\nabla_{\theta}\ p(\textbf{x}|\theta)}{p(\textbf{x}|\theta)}-[\nabla_{\theta}\log p(\textbf{x}|\theta)]\,[\nabla_{\theta}\log p(\textbf{x}|\theta)]^T\end{align}$$Taking the expectation on both sides, $$\begin{align}&\mathbb E_{\textbf{x}\sim p(\textbf{x}|\theta)}[\nabla_{\theta}\textbf{g}_{\theta}(\textbf{x})]&=&\mathbb E_{\textbf{x}\sim p(\textbf{x}|\theta)}[\frac{\nabla_{\theta}\nabla_{\theta}\ p(\textbf{x}|\theta)}{p(\textbf{x}|\theta)}]-\mathbb E_{\textbf{x}\sim p(\textbf{x}|\theta)}[[\nabla_{\theta}\log p(\textbf{x}|\theta)]\,[\nabla_{\theta}\log p(\textbf{x}|\theta)]^T]\\&&=&\mathbb E_{\textbf{x}\sim p(\textbf{x}|\theta)}[\frac{\nabla_{\theta}\nabla_{\theta}\ p(\textbf{x}|\theta)}{p(\textbf{x}|\theta)}]-\textbf{F}_{\theta;\theta}\\&\Rightarrow\\\\&\textbf{F}_{\theta;\theta}&=&\mathbb E_{\textbf{x}\sim p(\textbf{x}|\theta)}[\frac{\nabla_{\theta}\nabla_{\theta}\ p(\textbf{x}|\theta)}{p(\textbf{x}|\theta)}]-\mathbb E_{\textbf{x}\sim p(\textbf{x}|\theta)}[\nabla_{\theta}\textbf{g}_{\theta}(\textbf{x})]\\&&=&-\mathbb E_{\textbf{x}\sim p(\textbf{x}|\theta)}[\nabla_{\theta}\textbf{g}_{\theta}(\textbf{x})]+\int_{\mathcal X}p(\textbf{x}|\theta)\frac{\nabla_{\theta}\nabla_{\theta}\ p(\textbf{x}|\theta)}{p(\textbf{x}|\theta)}d\textbf{x}\\&&=&-\mathbb E_{\textbf{x}\sim p(\textbf{x}|\theta)}[\nabla_{\theta}\textbf{g}_{\theta}(\textbf{x})]+\nabla_{\theta}\nabla_{\theta}\int_{\mathcal X}p(\textbf{x}|\theta)d\textbf{x}\\&&=&-\mathbb E_{\textbf{x}\sim p(\textbf{x}|\theta)}[\nabla_{\theta}\nabla_{\theta}\ \log p(\textbf{x}|\theta)]\\&&=&-\mathbb E_{\textbf{x}\sim p(\textbf{x}|\theta)}[\nabla^2_{\theta}\ \log p(\textbf{x}|\theta)].\end{align}$$This is the second meaning of the Fisher information: **the negative expected value of the Hessian matrix**. The Hessian can be thought of as the curvature of the log-likelihood function with respect to the parameters, so the larger the Fisher information, the more sharply curved the log-likelihood is around $\theta.$
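As a quick numerical sanity check of these two characterizations (continuing the Gaussian code above; the value $\text{diag}(1/\sigma^2,\,2/\sigma^2)$ is the standard Fisher information of $\mathcal N(\mu,\sigma^2)$ with respect to $(\mu,\sigma)$):
```python
# cov(score) and -E[Hessian] should both approximate the true per-sample Fisher
# information of N(mu, sigma) with respect to (mu, sigma): diag(1/sigma^2, 2/sigma^2).
true_fisher = np.diag([1 / sigma_true**2, 2 / sigma_true**2])
print(np.round(empirical_fisher, 3))
print(np.round(observed_fisher, 3))
print(np.round(true_fisher, 3))
```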
## Kernel trick
The kernel trick is a technique used widely in machine learning. Given two feature vectors $x$ and $x'$ and a feature mapping $\phi(x):\mathcal X\to\mathcal H,$ where $\mathcal X$ is the feature space and $\mathcal H$ is a Hilbert space,
many ML algorithms only need the inner product $\langle \phi(x),\phi(x')\rangle.$ Sometimes computing the feature mapping is expensive or even impossible. Can we somehow skip it?
Mercer's theorem offers hope. It says:
Assume
- the feature space $\mathcal X\subset\mathbb R^n$ is a compact set,
- a kernel function $\kappa(x,y):\mathcal X\times\mathcal X\to\mathbb R$ is symmetric ($\kappa(x,y)=\kappa(y,x)$) and continuous,
- $\int_{\mathcal X}\int_{\mathcal X}\kappa(x,y)f(x)f(y)\,dxdy\geq0,\forall f\in L^2(\mathcal X),$
where $L^2(\mathcal X)$ is the space of square-integrable functions ($\mathcal X\to\mathbb R$) on $\mathcal X.$
Then there exists an orthonormal set of basis functions $\{\phi_n\}\subset L^2(\mathcal X)$ and corresponding non-negative eigenvalues $\lambda_n$ such that $$\kappa(x,y)=\sum^{\infty}_{n=1}\lambda_n\phi_n(x)\phi_n(y).$$
This means the kernel $\kappa(x,y)$ has a corresponding feature mapping $\phi(x)=(\sqrt{\lambda_n}\,\phi_n(x))_n,$ so that $\kappa(x,y)=\langle\phi(x),\phi(y)\rangle.$
If the **Gram matrix**, $${\bf K}=\begin{bmatrix}\kappa(x_1,x_1)&\ldots&\kappa(x_1,x_n)\\&\vdots\\\kappa(x_n,x_1)&\ldots&\kappa(x_n,x_n)\end{bmatrix},$$is positive definite for any set of inputs $\{x_i\}_{i=1}^n,$ we call the kernel a **positive definite kernel** (also called a **Mercer kernel**).
If the kernel satisfies the integral inequality in Mercer's theorem, it must be a positive definite kernel, but the reverse is not true.
Mercer's theorem is usually cited as the justification for the kernel trick.
Strictly speaking, it is the reproducing kernel Hilbert space (RKHS) construction that justifies the kernel trick, not Mercer's theorem. RKHS theory is somewhat involved, so we skip it here. The only thing we need to know is:
As long as $\kappa(\cdot,\cdot)$ is a **symmetric** and **positive definite kernel**, an RKHS can be constructed from $\kappa(\cdot,\cdot).$ Therefore, the **kernel trick is valid**, that is, $$\langle\phi(x),\phi(x')\rangle_{\mathcal H}=\kappa(x,x'),$$
where $\kappa(\cdot,\cdot):\mathcal X\times\mathcal X\to\mathbb R,\ x,x'\in\mathcal X,$ and $\phi(\cdot):\mathcal X\to\mathcal H$ maps the feature space into the RKHS $\mathcal H$ corresponding to $\kappa.$ So for the kernel trick to hold, we do not need to satisfy the conditions of Mercer's theorem. For example, consider the RBF kernel:
- [x] RBF kernel is symmetric,
- [x] continuous,
- [ ] but it is defined on $\mathbb R^n,$ which is not **compact**.
Therefore, Mercer's theorem cannot be applied here, and there is no guarantee that the RBF kernel admits such an expansion. However, an RKHS can be constructed from the RBF kernel, so the kernel trick is still valid.
Note that if the kernel function does satisfy the conditions of Mercer's theorem, the feature mapping $\phi(\cdot)$ can be written down explicitly.
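To make the kernel trick concrete, here is a small sketch (not from the text) using the degree-2 polynomial kernel $\kappa(x,x')=(x^Tx')^2,$ whose explicit feature map contains all pairwise products $x_ix_j$; evaluating the kernel directly avoids ever building that map:
```python
import numpy as np

def poly2_kernel(x, xp):
    # kappa(x, x') = (x^T x')^2, computed without any explicit feature map
    return (x @ xp) ** 2

def poly2_feature_map(x):
    # Explicit feature map phi(x): all pairwise products x_i * x_j (dimension d^2)
    return np.outer(x, x).ravel()

x  = np.array([1.0, 2.0, 3.0])
xp = np.array([0.5, -1.0, 2.0])
print(poly2_kernel(x, xp))                           # kernel trick
print(poly2_feature_map(x) @ poly2_feature_map(xp))  # <phi(x), phi(x')>, same value
```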
## Gaussian process
### Noise-free observations
Given a set of training data, $X=\{\textbf{x}_1,\ldots,\textbf{x}_n\},\ Y=\{f(\textbf{x}_1),\ldots,f(\textbf{x}_n)\},$ and test data $X^{*}=\{\textbf{x}^*_{1},\ldots,\textbf{x}^*_m\}.$ Note that $f:\mathbb R^a\to\mathbb R.$
We want to predict the corresponding $f(\textbf{x}_i^*),$ and also get the posterior distribution, $p(f(\textbf{x}_i^*)|X,Y,\textbf{x}^*_i).$
A Gaussian process rests on a basic assumption: $p(f(\textbf{x}_1),\ldots,f(\textbf{x}_n))$ is jointly Gaussian with some mean $\mu(\textbf{x})$ and covariance $\Sigma(\textbf{x}),$ where $\Sigma(\textbf{x})_{ij}=\kappa(\textbf{x}_i,\textbf{x}_j)$ and $\kappa(\cdot,\cdot)$ is a positive definite kernel function. The kernel function acts as the prior over the target function.
Based on this assumption, $f(X)$ and $f(X^*)$ are jointly Gaussian: $$\begin{bmatrix}Y\\f(X^*)\end{bmatrix}\sim\mathcal N(\begin{bmatrix}\mu(X)\\\mu(X^*)\end{bmatrix},\begin{bmatrix}K(X,X)&K(X,X^*)\\K(X^*,X)&K(X^*,X^*)\end{bmatrix}),$$where $\mu(X)=\begin{bmatrix}\mu(\textbf{x}_1)\\\vdots\\\mu(\textbf{x}_n)\end{bmatrix}$ is the mean function evaluated at the training inputs (it could be any function of the inputs, e.g. $\mu(\textbf{x}_i)=\sum_{j=1}^a x_{i,j}$). In practice it is generally recommended to use a constant mean, $\mu(X)=\mu,$ or simply zero; the GP model is flexible enough to model the mean arbitrarily well.
By the Gaussian conditioning formula, $$p(f(X^*)|X^*,X,Y)=\mathcal N(f(X^*)|\mu^*,\Sigma^*),$$where $$\begin{align}\mu^*&=\mu(X^*)+K(X^*,X)[K(X,X)]^{-1}(Y-\mu(X)),\\\Sigma^*&=K(X^*,X^*)-K(X^*,X)K(X,X)^{-1}K(X,X^*).\end{align}$$
The code implementation is shown below (prior kernel is RBF and the mean function is 0):
```python
import numpy as np
import matplotlib.pyplot as plt
def rbf_kernel(X1, X2, length_scale=1.0, sigma_f=1.0):
    """Compute the RBF kernel matrix K(X1, X2)."""
    sqdist = np.sum(X1**2, axis=1).reshape(-1, 1) + np.sum(X2**2, axis=1) - 2 * (X1 @ X2.T)
    return sigma_f**2 * np.exp(-0.5 / length_scale**2 * sqdist)

# Training data: noise-free observations of f(x) = sin(x)
X_train = np.array([-3, -2, -1, 1, 2, 3]).reshape(-1, 1)
Y_train = np.sin(X_train)

# Test inputs
X_test = np.linspace(-5, 5, 100).reshape(-1, 1)

# Kernel matrices
K = rbf_kernel(X_train, X_train)    # K(X, X) between training points
K_s = rbf_kernel(X_train, X_test)   # K(X, X*) between training and test points
K_ss = rbf_kernel(X_test, X_test)   # K(X*, X*) between test points

# Posterior mean and covariance
K_inv = np.linalg.inv(K)            # inverse of the training kernel matrix
mu_s = K_s.T @ K_inv @ Y_train      # predictive mean
cov_s = K_ss - K_s.T @ K_inv @ K_s  # predictive covariance
std_s = np.sqrt(np.clip(np.diag(cov_s), 0, None))  # predictive standard deviation (clip tiny negatives)

# Plot the results
plt.figure(figsize=(8, 5))
plt.plot(X_train, Y_train, 'ro', label="Training Data")
plt.plot(X_test, np.sin(X_test), 'g--', label="True Function")
plt.plot(X_test, mu_s, 'b-', label="GP Mean Prediction")
plt.fill_between(X_test.ravel(), mu_s.ravel() - 1.96 * std_s,
                 mu_s.ravel() + 1.96 * std_s, alpha=0.2, color='blue',
                 label="95% Confidence Interval")
plt.legend(loc="upper left")
plt.title("Gaussian Process Regression (Manual Implementation)")
plt.show()
```

### Noisy observations
Consider the case where the observations are noisy: $y=f(\textbf{x})+\epsilon,\ \epsilon\sim\mathcal N(0,\sigma^2).$ The model is then not required to interpolate the data. The joint distribution becomes $$\begin{bmatrix}Y\\f(X^*)\end{bmatrix}\sim\mathcal N(\begin{bmatrix}\mu(X)\\\mu(X^*)\end{bmatrix},\begin{bmatrix}K(X,X)+\sigma^2\,I&K(X,X^*)\\K(X^*,X)&K(X^*,X^*)\end{bmatrix}).$$
This yields $$p(f(X^*)|X^*,X,Y)=\mathcal N(f(X^*)|\mu^*,\Sigma^*),$$where $$\begin{align}\mu^*&=\mu(X^*)+K(X^*,X)[K(X,X)+\sigma^2\,I]^{-1}(Y-\mu(X)),\\\Sigma^*&=K(X^*,X^*)-K(X^*,X)[K(X,X)+\sigma^2\,I]^{-1}K(X,X^*).\end{align}$$
The code implementation is shown below (prior kernel is RBF and the mean function is 0):
```python
import numpy as np
import matplotlib.pyplot as plt
def rbf_kernel(X1, X2, length_scale=1.0, sigma_f=1.0):
    """Compute the RBF kernel matrix K(X1, X2)."""
    sqdist = np.sum(X1**2, axis=1).reshape(-1, 1) + np.sum(X2**2, axis=1) - 2 * (X1 @ X2.T)
    return sigma_f**2 * np.exp(-0.5 / length_scale**2 * sqdist)

# Training data with additive observation noise
X_train = np.array([-3, -2, -1, 1, 2, 3]).reshape(-1, 1)
Y_train = np.sin(X_train) + 0.1 * np.random.randn(*X_train.shape)

# Test inputs
X_test = np.linspace(-5, 5, 100).reshape(-1, 1)

# Kernel matrices
sigma_n = 0.1                                                          # observation noise std
K = rbf_kernel(X_train, X_train) + sigma_n**2 * np.eye(len(X_train))   # K(X, X) + sigma^2 I
K_s = rbf_kernel(X_train, X_test)                                      # K(X, X*)
K_ss = rbf_kernel(X_test, X_test)                                      # K(X*, X*)

# Posterior mean and covariance
K_inv = np.linalg.inv(K)
mu_s = K_s.T @ K_inv @ Y_train      # predictive mean
cov_s = K_ss - K_s.T @ K_inv @ K_s  # predictive covariance
std_s = np.sqrt(np.clip(np.diag(cov_s), 0, None))  # predictive standard deviation

# Plot the results
plt.figure(figsize=(8, 5))
plt.plot(X_train, Y_train, 'ro', label="Training Data")
plt.plot(X_test, np.sin(X_test), 'g--', label="True Function")
plt.plot(X_test, mu_s, 'b-', label="GP Mean Prediction")
plt.fill_between(X_test.ravel(), mu_s.ravel() - 1.96 * std_s,
                 mu_s.ravel() + 1.96 * std_s, alpha=0.2, color='blue',
                 label="95% Confidence Interval")
plt.legend(loc="upper left")
plt.title("Gaussian Process Regression (Noisy Observations)")
plt.show()
```

### Hyper-parameters of Gaussian process
The noisy version has one intrinsic hyper-parameter, the noise variance $\sigma^2.$
The noise-free version has no intrinsic hyper-parameters.
The other hyper-parameters come from the kernel function.
For the RBF kernel, $$\kappa(x,x')=\sigma^2_f\exp[-\frac{||x-x'||^2}{2l^2}],$$where $\sigma_f$ controls the output scale (vertical scale) and $l$ controls the smoothness of the function (horizontal scale).
In the code above, $\sigma_f$ and $l$ correspond to `sigma_f` and `length_scale`, respectively.
> Figure (omitted): $\sigma_f=1,\ l=3.$
>
> Figure (omitted): $\sigma_f=3,\ l=1.$
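The original figures are not reproduced here; as a rough substitute (a sketch assuming they showed draws from the corresponding GP priors), the following samples from the zero-mean GP prior with the same `rbf_kernel` as above for the two settings:
```python
import numpy as np
import matplotlib.pyplot as plt

def rbf_kernel(X1, X2, length_scale=1.0, sigma_f=1.0):
    sqdist = np.sum(X1**2, axis=1).reshape(-1, 1) + np.sum(X2**2, axis=1) - 2 * (X1 @ X2.T)
    return sigma_f**2 * np.exp(-0.5 / length_scale**2 * sqdist)

X = np.linspace(-5, 5, 200).reshape(-1, 1)
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, (sigma_f, length_scale) in zip(axes, [(1.0, 3.0), (3.0, 1.0)]):
    K = rbf_kernel(X, X, length_scale=length_scale, sigma_f=sigma_f)
    # Small jitter on the diagonal for numerical stability when sampling
    samples = np.random.multivariate_normal(np.zeros(len(X)), K + 1e-8 * np.eye(len(X)), size=3)
    ax.plot(X, samples.T)
    ax.set_title(f"sigma_f={sigma_f}, l={length_scale}")
plt.show()
```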
### Estimating the kernel parameters
We can consider an empirical Bayes approach by maximizing the marginal likelihood: $$p(Y\ |\ X)=\int_{\mathbb R^n} p(Y\ |\ \textbf{f},X)\,p(\textbf{f}\ |\ X)\ d\,\textbf{f},$$where $\textbf{f}=[f(\textbf{x}_i)]_{i=1}^n$ and $Y=[y_i]_{i=1}^n=[f(\textbf{x}_i)+\epsilon_i]_{i=1}^n,\ \epsilon_i\sim \mathcal N(0,\sigma^2).$ Since $p(\textbf{f}\ |\ X)=\mathcal N(\textbf{f}\ |\ 0,K)$ and $p(Y\ |\ \textbf{f},X)=\prod_{i}\,\mathcal N(y_i\ |\ f(\textbf{x}_i),\sigma^2),$ the marginal likelihood is given by $$\begin{align}p(Y\ |\ X)&=\int_{\mathbb R^n}[\prod_{i=1}^n\mathcal N(y_i\ |\ f(\textbf{x}_{i}),\sigma^2)]\cdot\mathcal N(\textbf{f}\ |\ 0,K)\,d\textbf{f}\\&=\int_{\mathbb R^n}[\prod_{i=1}^n\frac{1}{(2\pi)^{\frac12}\sigma}\exp[{-\frac{(y_i-f(\textbf{x}_i))^2}{2\sigma^2}}]]\cdot\frac{1}{(2\pi)^{n/2}|K|^{1/2}}\exp[-\frac12\textbf{f}^TK^{-1}\textbf{f}]\,d\textbf{f}\\&=\frac{1}{(2\pi\sigma)^{n}|K|^{1/2}}\int_{\mathbb R^n}\exp[-\frac1{2\sigma^2}[\sum_{i=1}^n(y_i-f(\textbf{x}_i))^2]-\frac12\textbf{f}^TK^{-1}\textbf{f}]\,d\textbf{f}\\&=\frac{1}{(2\pi\sigma)^{n}|K|^{1/2}}\int_{\mathbb R^n}\exp[-\frac12[(Y-\textbf{f})^T(\sigma^{-2}I_n)(Y-\textbf{f})+\textbf{f}^TK^{-1}\textbf{f}\,]\,]\,d\textbf{f}\\&=\frac{1}{(2\pi\sigma)^{n}|K|^{1/2}}\int_{\mathbb R^n}\exp[-\frac12[(\textbf{f}-A^{-1}\textbf{b})^TA(\textbf{f}-A^{-1}\textbf{b})+Y^T(\sigma^{-2}I_n) Y-\textbf{b}^TA^{-1}\textbf{b}]]\,d\textbf{f}\\&=\frac{1}{(2\pi\sigma)^{n}|K|^{1/2}}\exp[-\frac12(Y^T(\sigma^{-2}I_n) Y-\textbf{b}^TA^{-1}\textbf{b})]\int_{\mathbb R^n}e^{-\frac12(\textbf{f}-A^{-1}\textbf{b})^TA(\textbf{f}-A^{-1}\textbf{b})}d\textbf{f}\\&=\frac{1}{(2\pi\sigma)^{n}|K|^{1/2}}\exp[-\frac12(Y^T(\sigma^{-2}I_n) Y-\textbf{b}^TA^{-1}\textbf{b})]\cdot(2\pi)^{n/2}|A^{-1}|^{1/2}\\&=\frac{|A^{-1}|^{1/2}}{(2\pi)^{n/2}\sigma^n|K|^{1/2}}\exp[-\frac12Y^T(\sigma^{-2}I_n-\sigma^{-4}(K^{-1}+\sigma^{-2}I_n)^{-1})Y]\\\\&\Rightarrow{\text{by the Woodbury matrix identity}}\\\\&=\frac1{(2\pi)^{n/2}|\sigma^2K|^{1/2}|A|^{1/2}}\exp[-\frac12Y^T\{\sigma^{-2}I_n-\sigma^{-4}[\sigma^2I_n-\sigma^4(K+\sigma^2I_n)^{-1}]\,\}Y]\\&=\frac1{(2\pi)^{n/2}|K+\sigma^2I_n|^{1/2}}\exp[-\frac12Y^T(K+\sigma^2I_n)^{-1}\,Y]\\\\&=\mathcal N(Y\ |\ 0,K+\sigma^2I_n),\end{align}$$where $A=K^{-1}+\sigma^{-2}I_n$ and $\textbf{b}=\sigma^{-2}Y.$
Therefore, the log likelihood is $$\begin{align}\log p(Y\ |\ X)&=-\frac{n}{2}\log(2\pi)-\frac12\log|K+\sigma^2I_n|-\frac12Y^T(K+\sigma^2I_n)^{-1}Y\\&=-\frac12Y^TK_y^{-1}Y-\frac12\log|K_y|-\frac{n}{2}\log(2\pi),\end{align}$$where $K_y\equiv K+\sigma^2I_n.$ The first term, $-\frac12Y^TK_y^{-1}Y,$ is a data-fit term; the second term, $-\frac12\log|K_y|,$ is a model-complexity term; and the third term, $-\frac n2\log(2\pi),$ is a constant.
The log likelihood can be maximized by gradient-based optimization, where one can show that $$\begin{align}\frac{\partial}{\partial\theta_j}\log p(Y\ |\ X)&=\frac12Y^TK_y^{-1}\frac{\partial K_y}{\partial\theta_j}K_y^{-1}Y-\frac12\text{tr}(K_y^{-1}\frac{\partial K_y}{\partial\theta_j})\\&=\frac12\text{tr}((\alpha\alpha^T-K_y^{-1})\frac{\partial K_y}{\partial\theta_j}),\end{align}$$where $\alpha=K_y^{-1}Y.$
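A minimal sketch of this empirical Bayes procedure (my own illustration: it reuses the RBF kernel from the GP code above, generates toy data, parameterizes $(l,\sigma_f,\sigma)$ in log space, and lets SciPy's L-BFGS use numerical gradients rather than the analytic formula above):
```python
import numpy as np
from scipy.optimize import minimize

def rbf_kernel(X1, X2, length_scale=1.0, sigma_f=1.0):
    sqdist = np.sum(X1**2, axis=1).reshape(-1, 1) + np.sum(X2**2, axis=1) - 2 * (X1 @ X2.T)
    return sigma_f**2 * np.exp(-0.5 / length_scale**2 * sqdist)

# Toy data: noisy observations of sin(x)
X_train = np.linspace(-3, 3, 20).reshape(-1, 1)
Y_train = np.sin(X_train).ravel() + 0.1 * np.random.randn(20)

def neg_log_marginal_likelihood(log_params):
    length_scale, sigma_f, sigma_n = np.exp(log_params)  # log space keeps them positive
    K_y = rbf_kernel(X_train, X_train, length_scale, sigma_f) \
          + (sigma_n**2 + 1e-8) * np.eye(len(X_train))
    L = np.linalg.cholesky(K_y)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, Y_train))  # K_y^{-1} Y
    # -log p(Y|X) = 0.5 Y^T K_y^{-1} Y + 0.5 log|K_y| + (n/2) log(2 pi)
    return 0.5 * Y_train @ alpha + np.sum(np.log(np.diag(L))) \
           + 0.5 * len(X_train) * np.log(2 * np.pi)

res = minimize(neg_log_marginal_likelihood, x0=np.log([1.0, 1.0, 0.1]), method="L-BFGS-B")
print(np.exp(res.x))  # learned (length_scale, sigma_f, sigma_n)
```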
### The connection to neural networks
Consider a simple fully connected neural network with one hidden layer, $$f(x)=\sum_{j=1}^n a_j\sigma(w_j^T x+b_j),$$where $x$ is the input, the $w_j,b_j$ are hidden weights and biases initialized from some distributions, the $a_j$ are output weights initialized from a standard normal distribution, and $\sigma$ is a non-linear activation function (for example, ReLU or tanh).
When $n\to\infty,$ by the *central limit theorem*, for any inputs $x,x'$ the outputs converge to a joint Gaussian distribution with $$\mathbb E[f(x)]=0,\ \text{cov}(f(x),f(x'))=K(x,x'),$$where $K(x,x')$ is a kernel function that depends on $\sigma(\cdot)$ and on the distributions of $w$ and $b.$
Note that $$\begin{align}\text{cov}(f(x),f(x'))&=\lim_{n\to\infty}\mathbb E[f(x)f(x')]\\&=\lim_{n\to\infty}\mathbb E[\,\sum_{i=1}^n\sum_{j=1}^na_ia_j\sigma(w_i^Tx+b_i)\sigma(w_j^Tx'+b_j)]\\&=\lim_{n\to\infty}\sum_{i\neq j}\mathbb E[a_i]\,\mathbb E[a_j]\,\mathbb E[\sigma(w_i^Tx+b_i)\sigma(w_j^Tx'+b_j)]+\sum_{i=1}^n\mathbb E[a_i^2]\mathbb E[\sigma(w_i^Tx+b_i)\sigma(w_i^Tx'+b_i)]\\&=\lim_{n\to\infty}0+\sum_{i=1}^n1\cdot\mathbb E[\sigma(w_i^Tx+b_i)\sigma(w_i^Tx'+b_i)]\\&=\lim_{n\to\infty}n\cdot\mathbb E_{w,b}[\sigma(w^Tx+b)\sigma(w^Tx'+b)].\end{align}$$This clearly does not converge. But if we scale the output by $\frac1{\sqrt n},$ $$f(x)=\frac1{\sqrt n}\sum_{j=1}^n a_j\sigma(w_j^T x+b_j),$$then $$\text{cov}(f(x),f(x'))=\mathbb E_{w,b}[\sigma(w^Tx+b)\sigma(w^Tx'+b)]=K(x,x'),$$which is a valid kernel.
The kernel is dependent on the specific activation function. For example, let $\sigma(z)=\text{ReLU}(z)=\max(0,z),$ and $w\sim\mathcal N(0,I),\,b=0.$
The kernel can be described as below ([Cho and Saul, 2009](https://proceedings.neurips.cc/paper_files/paper/2009/file/5751ec3e9a4feab575962e78e006250d-Paper.pdf)): $$K(x,x')=\frac1\pi||x||\,||x'||\cdot(\sin\theta+(\pi-\theta)\cos\theta).$$
The paper presents a new family of kernel functions called *arc-cosine kernels*: $$K_n(x,x')=\frac1\pi||x||^n\,||x'||^nJ_n(\theta),$$where $$J_n(\theta)=(-1)^n(\sin\theta)^{2n+1}\left(\frac1{\sin\theta}\frac{\partial}{\partial\theta}\right)^n\left(\frac{\pi-\theta}{\sin\theta}\right),\ \theta=\arccos\frac{x\cdot x'}{||x||\,||x'||}.$$
The arc-cosine kernel applies when the activation function is of the form $$\sigma(z)=\Theta(z)z^n=\frac12(1+\text{sign}(z))z^n.$$Plugging in $n=1$ recovers ReLU.
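As a sketch (my own check with arbitrary inputs), the $n=1$ closed form can be compared against a Monte Carlo estimate; note that Cho and Saul's definition includes a factor of 2 in front of the Gaussian expectation:
```python
import numpy as np

def arccos_kernel_n1(x, xp):
    # Closed form K_1(x, x') = (1/pi) ||x|| ||x'|| (sin(theta) + (pi - theta) cos(theta))
    cos_theta = np.clip(x @ xp / (np.linalg.norm(x) * np.linalg.norm(xp)), -1.0, 1.0)
    theta = np.arccos(cos_theta)
    return (np.linalg.norm(x) * np.linalg.norm(xp) / np.pi) * (np.sin(theta) + (np.pi - theta) * np.cos(theta))

def mc_estimate(x, xp, n_samples=200000):
    # Monte Carlo estimate of 2 * E_w[ReLU(w^T x) ReLU(w^T x')], w ~ N(0, I);
    # the factor 2 matches the kernel's definition in Cho and Saul (2009).
    W = np.random.randn(n_samples, len(x))
    return 2 * np.mean(np.maximum(W @ x, 0) * np.maximum(W @ xp, 0))

x  = np.array([1.0, 0.5])
xp = np.array([0.2, 1.0])
print(arccos_kernel_n1(x, xp), mc_estimate(x, xp))  # the two values should be close
```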
Plotting this kernel shows that it takes higher values when the directions of the two input vectors are close.
They found that using the arc-cosine kernel in an SVM can improve classification performance (see the experiments in the paper).
## Neural tangent kernel
The kernel above is derived from an infinitely wide network that is **not trained**. A natural question is: what kernel describes an infinitely wide fully connected network during and after training?
The training data is $\mathcal X=\{\textbf{x}^{(i)}\}_{i=1}^N,\ \mathcal Y=\{y^{(i)}\}_{i=1}^N.$
Assume the network outputs a scalar value, $f(\textbf{x};\theta)\in\mathbb R.$ The discrete gradient-descent training process can be approximated by continuous-time gradient descent (gradient flow).
The change of parameter at time step $t$ is: $$\frac{d\theta_t}{dt}=-\eta\nabla_{\theta_t}[\frac1{N}\sum_{i=1}^N \mathcal L(f(\textbf{x}^{(i)};\theta_t),\textbf{y}^{(i)})]=-\eta\frac1N\sum^N_{i=1}\nabla_{\theta_t} f(\textbf{x}^{(i)};\theta_t)\nabla_f \mathcal L(f,\textbf{y}^{(i)}).$$By chain rule, the behavior of network on the training data is: $$\begin{align}\frac{d f(\mathcal X;\theta_t)}{dt}&=\begin{bmatrix}\frac{d f(\textbf{x}^{(1)};\theta_t)}{dt}\\\vdots\\\frac{d f(\textbf{x}^{(N)};\theta_t)}{dt}\end{bmatrix}=\begin{bmatrix}\frac{d f(\textbf{x}^{(1)};\theta_t)}{d\theta_t}\frac{d\theta_t}{dt}\\\vdots\\\frac{d f(\textbf{x}^{(N)};\theta_t)}{d\theta_t}\frac{d\theta_t}{dt}\end{bmatrix}\\&=-\eta\frac1N\begin{bmatrix}\sum^N_{i=1}\nabla_{\theta_t}f(\textbf{x}^{(1)};\theta_t)^\top\nabla_{\theta_t} f(\textbf{x}^{(i)};\theta_t)\nabla_f \mathcal L(f,\textbf{y}^{(i)})\\\vdots\\\sum^N_{i=1}\nabla_{\theta_t}f(\textbf{x}^{(N)};\theta_t)^\top\nabla_{\theta_t} f(\textbf{x}^{(i)};\theta_t)\nabla_f \mathcal L(f,\textbf{y}^{(i)})\end{bmatrix}\\&=-\frac{\eta}{N}\begin{bmatrix}\sum^N_{i=1}K(\textbf{x}^{(1)},\textbf{x}^{(i)};\theta_t)\nabla_f \mathcal L(f,\textbf{y}^{(i)})\\\vdots\\\sum^N_{i=1}K(\textbf{x}^{(N)},\textbf{x}^{(i)};\theta_t)\nabla_f \mathcal L(f,\textbf{y}^{(i)})\end{bmatrix}\\&=-\frac{\eta}{N}K(\mathcal X,\mathcal X;\theta_t)\nabla_f\mathcal L(f,\mathcal Y),\end{align}$$where $K(\mathcal X,\mathcal X;\theta_t)_{ij}=K(\textbf{x}^{(i)},\textbf{x}^{(j)};\theta_t)=\nabla_{\theta_t}f(\textbf{x}^{(i)};\theta_t)^\top\nabla_{\theta_t}f(\textbf{x}^{(j)};\theta_t),$ and $\nabla_f\mathcal L(f,\mathcal Y)_i=\nabla_f\mathcal L(f,\textbf{y}^{(i)}).$
Assume the loss function is $\mathcal L(\hat y,y)=\frac12(\hat y-y)^2$ :$$\frac{df(\mathcal X;\theta_t)}{dt}=-\frac{\eta}{N}K(\mathcal X,\mathcal X;\theta_t)(f(\mathcal X;\theta_t)-\mathcal Y),$$where $f(\mathcal X;\theta_t)_i=f(\textbf{x}^{(i)};\theta_t),\ \mathcal Y_i=y^{(i)}.$
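As an illustration (a minimal sketch with an arbitrary tiny architecture; `autograd` is reused from the Fisher-information example), an entry of the empirical NTK, $K(\textbf{x},\textbf{x}';\theta_t)=\nabla_{\theta_t}f(\textbf{x};\theta_t)^\top\nabla_{\theta_t}f(\textbf{x}';\theta_t),$ can be computed directly:
```python
import numpy as np
import autograd.numpy as anp
from autograd import grad

d_in, width = 3, 64

# A tiny one-hidden-layer network with scalar output f(x; theta), theta = (W1, b1, w2)
def net(params, x):
    W1, b1, w2 = params
    h = anp.tanh(anp.dot(W1, x) + b1)
    return anp.dot(w2, h) / np.sqrt(width)  # 1/sqrt(n) output scaling, as in the text

params = [np.random.randn(width, d_in),
          np.random.randn(width),
          np.random.randn(width)]

def flat_grad(x):
    # Gradient of f(x; theta) with respect to all parameters, flattened into one vector
    g = grad(net)(params, x)
    return np.concatenate([np.ravel(gi) for gi in g])

x1, x2 = np.random.randn(d_in), np.random.randn(d_in)
ntk_entry = flat_grad(x1) @ flat_grad(x2)  # empirical NTK entry K(x1, x2; theta)
print(ntk_entry)
```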
We can prove that, when the width of the neural network approaches infinity, the NTK
1. is deterministic at initialization, depending only on the architecture, and
2. stays constant during training.
Therefore, in the infinite-width limit, the NTK satisfies $K(\textbf{x},\textbf{x}';\theta_t)\to K_\infty(\textbf{x},\textbf{x}').$
Solve the ODE: $$\begin{align}&&\frac{df(\mathcal X;\theta_t)}{dt}&=&-\frac{\eta}{N}K_\infty(\mathcal X,\mathcal X)(f(\mathcal X;\theta_t)-\mathcal Y)\\&\Rightarrow&\frac1{(f(\mathcal X;\theta_t)-\mathcal Y)}\,df(\mathcal X;\theta_t)&=&-\frac{\eta}{N}K_\infty(\mathcal X,\mathcal X)\,dt\\&\Rightarrow&\int\frac1{(f(\mathcal X;\theta_t)-\mathcal Y)}\,df(\mathcal X;\theta_t)&=&\int-\frac{\eta}{N}K_\infty(\mathcal X,\mathcal X)\,dt\\&\Rightarrow&\log [f(\mathcal X;\theta_t)-\mathcal Y]&=&-\frac{\eta}{N}K_\infty(\mathcal X,\mathcal X)\,t+C\\&\Rightarrow&f(\mathcal X;\theta_t)&=&\mathcal Y+e^{-\frac{\eta}{N}K_\infty(\mathcal X,\mathcal X)t}C\end{align}$$
When $t=0,\ f(\mathcal X;\theta_t)=f(\mathcal X;\theta_0),$ so $C=f(\mathcal X;\theta_0)-\mathcal Y.$
For a test $\textbf{x},$ the evolution of the network respect to time is $$\begin{align}\frac{df(\textbf{x};\theta_t)}{dt}&=\frac{df(\textbf{x};\theta_t)}{d\theta_t}\frac{d\theta_t}{dt}\\&=-\frac\eta N\sum^N_{i=1}\nabla_{\theta_t}f(\textbf{x};\theta_t)^\top\nabla_{\theta_t} f(\textbf{x}^{(i)};\theta_t)\nabla_f \mathcal L(f,\textbf{y}^{(i)})\\&=-\frac\eta N\sum^N_{i=1}K_\infty(\textbf{x},\textbf{x}^{(i)})\nabla_f \mathcal L(f,\textbf{y}^{(i)})\\&=-\frac\eta N K_\infty(\textbf{x},\mathcal X)^\top\nabla_f \mathcal L(f,\mathcal Y),\end{align}$$where $K(\textbf{x},\mathcal X)=\begin{bmatrix}K(\textbf{x},\textbf{x}^{(1)})&\ldots&K(\textbf{x},\textbf{x}^{(N)})\end{bmatrix}.$
Assume the loss function is $\mathcal L(\hat y,y)=\frac12(\hat y-y)^2,$ $$\frac{df(\textbf{x};\theta_t)}{dt}=-\frac{\eta}{N}K_\infty(\textbf{x},\mathcal X)^\top(f(\mathcal X;\theta_t)-\mathcal{Y}).$$We already know that $f(\mathcal X;\theta_t)-\mathcal{Y}=e^{-\frac{\eta}{N}K_\infty(\mathcal X,\mathcal X)t}(f(\mathcal X;\theta_0)-\mathcal Y).$ Therefore, $$\frac{df(\textbf{x};\theta_t)}{dt}=-\frac{\eta}{N}K_\infty(\textbf{x},\mathcal X)^\top\left[e^{-\frac{\eta}{N}K_\infty(\mathcal X,\mathcal X)t}(f(\mathcal X;\theta_0)-\mathcal Y)\right].$$
The total change of the output of network can be expressed as $$\begin{align}f(\textbf{x};\theta_t)-f(\textbf{x};\theta_0)&=\int_0^t\frac{df(\textbf{x};\theta_s)}{ds}ds=-\frac{\eta}{N}K_{\infty}(\textbf{x},\mathcal X)^\top\left(\int^t_0 \left[e^{-\frac{\eta}{N}K_{\infty}(\mathcal X,\mathcal X)s}ds\right]\right)(f(\mathcal X;\theta_0)-\mathcal Y)\\&=-\frac{\eta}{N}K_{\infty}(\textbf{x},\mathcal X)^\top\left(-\frac{N}{\eta}[K_{\infty}(\mathcal X,\mathcal X)]^{-1}(e^{-\frac{\eta}{N}K_\infty(\mathcal X,\mathcal X)t}-I_N)\right)(f(\mathcal X;\theta_0)-\mathcal Y)\\&=-K_{\infty}(\textbf{x},\mathcal X)^\top[K_{\infty}(\mathcal X,\mathcal X)]^{-1}(I_N-e^{-\frac{\eta}{N}K_\infty(\mathcal X,\mathcal X)t})(f(\mathcal X;\theta_0)-\mathcal Y).\end{align}$$Therefore, the output of a network at time step $t$ is $$f(\textbf{x};\theta_t)=f(\textbf{x};\theta_0)-K_{\infty}(\textbf{x},\mathcal X)^\top[K_{\infty}(\mathcal X,\mathcal X)]^{-1}(I_N-e^{-\frac{\eta}{N}K_\infty(\mathcal X,\mathcal X)t})(f(\mathcal X;\theta_0)-\mathcal Y).$$When $\lim t\to\infty,$ $$f(\textbf{x};\theta_\infty)=f(\textbf{x};\theta_0)-K_{\infty}(\textbf{x},\mathcal X)^\top[K_{\infty}(\mathcal X,\mathcal X)]^{-1}(f(\mathcal X;\theta_0)-\mathcal Y).$$It is exactly like GP: $$\begin{align}\mu^*&=\mu(X^*)+K(X^*,X)[K(X,X)]^{-1}(Y-\mu(X)),\\\Sigma^*&=K(X^*,X^*)-K(X^*,X)K(X,X)^{-1}K(X,X^*).\end{align}$$
We can do spectral decomposition, $K_\infty=U\Lambda U^{-1}.$ $$\begin{align}f(\textbf{x};\theta_t)&=f(\textbf{x};\theta_0)-K_{\infty}(\textbf{x},\mathcal X)^\top[U\Lambda U^{-1}]^{-1}(I_N-e^{-\frac{\eta}{N}K_\infty(\mathcal X,\mathcal X)t})(f(\mathcal X;\theta_0)-\mathcal Y)\\&=f(\textbf{x};\theta_0)-K_{\infty}(\textbf{x},\mathcal X)^\top[U\Lambda^{-1} U^{-1}](I_N-e^{-\frac{\eta}{N}U\Lambda U^{-1}t})(f(\mathcal X;\theta_0)-\mathcal Y)\\&=f(\textbf{x};\theta_0)-K_{\infty}(\textbf{x},\mathcal X)^\top[U\Lambda^{-1} U^{-1}](I_N-Ue^{-\frac{\eta}{N}\Lambda t}U^{-1})(f(\mathcal X;\theta_0)-\mathcal Y)\\&=f(\textbf{x};\theta_0)-K_{\infty}(\textbf{x},\mathcal X)^\top[U\Lambda^{-1} ](I_N-e^{-\frac{\eta}{N}\Lambda t})U^{-1}(f(\mathcal X;\theta_0)-\mathcal Y)\\&=f(\textbf{x};\theta_0)-\sum^N_{i=1}(\frac{1-e^{-\frac{\eta}{N}\lambda_i t}}{\lambda_i})[u_i^\top K_\infty(x,\mathcal X)][u_i^\top(f(\mathcal X;\theta_0)-\mathcal Y)].\end{align}$$
It can be seen that the convergence along direction $u_i$ is controlled by $$1-e^{-\frac{\eta}{N}\lambda_it},$$which means the larger the eigenvalue, the faster the network converges in that direction.
We can conclude that the direction with the smallest eigenvalue is the bottleneck; the overall convergence rate of the network is essentially determined by it.
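A small numerical check of the spectral form above (a sketch; the matrix `K_inf`, the vectors, and the constants are random stand-ins for $K_\infty(\mathcal X,\mathcal X),$ $K_\infty(\textbf{x},\mathcal X),$ and $f(\mathcal X;\theta_0)-\mathcal Y$):
```python
import numpy as np
from scipy.linalg import expm

N, eta, t = 5, 0.5, 3.0
A = np.random.randn(N, N)
K_inf = A @ A.T + 1e-3 * np.eye(N)   # stand-in for K_infinity(X, X), symmetric PSD
residual0 = np.random.randn(N)       # stand-in for f(X; theta_0) - Y
k_x = np.random.randn(N)             # stand-in for K_infinity(x, X)

# Matrix-exponential form: k_x^T K^{-1} (I - exp(-eta/N * K * t)) residual0
direct = k_x @ np.linalg.inv(K_inf) @ (np.eye(N) - expm(-eta / N * K_inf * t)) @ residual0

# Spectral form: sum_i (1 - e^{-eta lambda_i t / N}) / lambda_i * (u_i^T k_x)(u_i^T residual0)
lam, U = np.linalg.eigh(K_inf)
spectral = sum((1 - np.exp(-eta / N * lam[i] * t)) / lam[i]
               * (U[:, i] @ k_x) * (U[:, i] @ residual0) for i in range(N))

print(direct, spectral)  # the two values should match
```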