
# GA-NTK Proof Go Through

## *Definition 3.1*: $\beta$-Smoothness

A function $f: \mathbb{R}^d \to \mathbb{R}^n$ is $\beta$-smooth if $\exists \beta \in \mathbb{R}^+$ such that $\forall \mathbf{x}, \mathbf{y} \in \mathbb{R}^d, \ || \nabla_{y} f(\mathbf{y}) - \nabla_{x} f(\mathbf{x}) || \leq \beta || \mathbf{y} - \mathbf{x} ||$.

## *Lemma 3.1*: Convergence of Gradient Descent On $\beta$-Smoothness Function

If the function $f: \mathbb{R}^d \to \mathbb{R}$ is $\beta$-smooth, then gradient descent on $f$ converges to a critical point:

$$ \min_{t \in T} || \nabla f(\mathbf{x}_t) ||_2^2 \leq \sqrt{\frac{2 D \beta M}{T-1}} = O(\frac{1}{\sqrt{T-1}}) $$

Here $T$ is the number of gradient descent iterations and $\beta$ is the smoothness coefficient of $f$. We take $\max_{t \in T} || \nabla f(\mathbf{x}_t)||_2^2$ to be bounded along the iterates and denote its upper bound by $M$. In addition, the function value at the initial point $f(\mathbf{x}_1)$ is not too far from the function value at the final point $f(\mathbf{x}_T)$, which means $\exists D \in \mathbb{R}^+, \ | f(\mathbf{x}_1) - f(\mathbf{x}_T) | \leq D$.

By [ECE 901: Large-scale Machine Learning and Optimization Spring 2018 Lecture 9 — 02/22](https://papail.io/teaching/901/scribe_09.pdf)

## *Lemma:* Gradient Approximation Error of $\beta$-Smoothness Functions

If a function $f: \mathbb{R}^d \to \mathbb{R}$ is $\beta$-smooth, then $\forall \mathbf{x}, \mathbf{y} \in \mathbb{R}^d$

$$ | f(\mathbf{y}) - f(\mathbf{x}) - \nabla f(\mathbf{x})^{\top} (\mathbf{y} - \mathbf{x}) | \leq \frac{\beta}{2} || \mathbf{y} - \mathbf{x} ||_2^2 $$

Here $\nabla$ takes the gradient with respect to the input $\mathbf{x}$.
By [Sandwich theorem for β-smooth convex function - Page 3 ~ 8](https://angms.science/doc/CVX/CVX_betasmoothsandwich.pdf)

### Proof of *Lemma:* Gradient Approximation Error of $\beta$-Smoothness Functions

By the fundamental theorem of calculus along the segment from $\mathbf{x}$ to $\mathbf{y}$, the difference between $f(\mathbf{y})$ and $f(\mathbf{x})$ can be expressed as

$$ f(\mathbf{y}) - f(\mathbf{x}) = \int_0^1 \nabla f(\mathbf{x} + t(\mathbf{y} - \mathbf{x}))^{\top} (\mathbf{y} - \mathbf{x}) dt $$

Add and subtract the redundant term $\nabla f(\mathbf{x})$:

$$ f(\mathbf{y}) - f(\mathbf{x}) = \int_0^1 \left[ \nabla f(\mathbf{x} + t(\mathbf{y} - \mathbf{x})) + \nabla f(\mathbf{x}) - \nabla f(\mathbf{x}) \right]^{\top} (\mathbf{y} - \mathbf{x}) dt $$

Since $\int_0^1 \nabla f(\mathbf{x})^{\top} (\mathbf{y} - \mathbf{x}) dt = \nabla f(\mathbf{x})^{\top} (\mathbf{y} - \mathbf{x})$, take $\nabla f(\mathbf{x})$ out of the integral and move it to the left-hand side:

$$ f(\mathbf{y}) - f(\mathbf{x}) - \nabla f(\mathbf{x})^{\top}(\mathbf{y} - \mathbf{x}) = \int_0^1 \left[ \nabla f(\mathbf{x} + t(\mathbf{y} - \mathbf{x})) - \nabla f(\mathbf{x}) \right]^{\top} (\mathbf{y} - \mathbf{x}) dt $$

Take the absolute value of both sides:

$$ 0 \leq | f(\mathbf{y}) - f(\mathbf{x}) - \nabla f(\mathbf{x})^{\top}(\mathbf{y} - \mathbf{x}) | = \left| \int_0^1 \left[ \nabla f(\mathbf{x} + t(\mathbf{y} - \mathbf{x})) - \nabla f(\mathbf{x}) \right]^{\top} (\mathbf{y} - \mathbf{x}) dt \right| $$

By $| \int_a^b f(x) dx | \leq \int_a^b | f(x) | dx$ and the Cauchy-Schwarz inequality:

$$ 0 \leq | f(\mathbf{y}) - f(\mathbf{x}) - \nabla f(\mathbf{x})^{\top}(\mathbf{y} - \mathbf{x}) | \leq \int_0^1 || \nabla f(\mathbf{x} + t(\mathbf{y} - \mathbf{x})) - \nabla f(\mathbf{x}) ||_2 \cdot || \mathbf{y} - \mathbf{x} ||_2 dt $$

Because $f$ is $\beta$-smooth, we know

$$ || \nabla f(\mathbf{x} + t(\mathbf{y} - \mathbf{x})) - \nabla f(\mathbf{x})||_2 \leq \beta || \mathbf{x} + t(\mathbf{y} - \mathbf{x}) - \mathbf{x} ||_2 = \beta || t(\mathbf{y} - \mathbf{x})||_2 $$

Plug this into $\int_0^1 || \nabla f(\mathbf{x} + t(\mathbf{y} - \mathbf{x})) - \nabla f(\mathbf{x}) ||_2 \cdot || \mathbf{y} - \mathbf{x} ||_2 dt$:

$$ 0 \leq | f(\mathbf{y}) - f(\mathbf{x}) - \nabla f(\mathbf{x})^{\top}(\mathbf{y} - \mathbf{x}) | \leq \beta \int_0^1 || t(\mathbf{y} - \mathbf{x})||_2 \cdot || \mathbf{y} - \mathbf{x}||_2 dt = \beta \int_0^1 t || \mathbf{y} - \mathbf{x}||_2^2 dt $$

Since $\int_0^1 t \, dt = \frac{1}{2}$:

$$ 0 \leq | f(\mathbf{y}) - f(\mathbf{x}) - \nabla f(\mathbf{x})^{\top}(\mathbf{y} - \mathbf{x}) | \leq \frac{\beta}{2} || \mathbf{y} - \mathbf{x}||_2^2 $$

### Proof of *Lemma 3.1*: Convergence of Gradient Descent On $\beta$-Smoothness Function

The gradient descent update is

$$ \mathbf{x}_{t+1} = \mathbf{x}_{t} - \eta \nabla f(\mathbf{x}_t) $$

Apply *Lemma:* Gradient Approximation Error of $\beta$-Smoothness Functions with $\mathbf{y} = \mathbf{x}_{t+1}$ and $\mathbf{x} = \mathbf{x}_{t}$:

$$ | f(\mathbf{x}_{t+1}) - f(\mathbf{x}_{t}) - \nabla f(\mathbf{x}_{t})^{\top}(\mathbf{x}_{t+1} - \mathbf{x}_{t}) | \leq \frac{\beta}{2} || \mathbf{x}_{t+1} - \mathbf{x}_{t}||_2^2 $$

Replace $\mathbf{x}_{t+1} - \mathbf{x}_{t}$ by $- \eta \nabla f(\mathbf{x}_t)$:

$$ | f(\mathbf{x}_{t+1}) - f(\mathbf{x}_{t}) - \nabla f(\mathbf{x}_{t})^{\top}(- \eta \nabla f(\mathbf{x}_t)) | \leq \frac{\beta}{2} || \eta \nabla f(\mathbf{x}_t)||_2^2 $$

$$ | f(\mathbf{x}_{t+1}) - f(\mathbf{x}_{t}) + \eta \nabla f(\mathbf{x}_{t})^{\top} \nabla f(\mathbf{x}_t) | \leq \frac{\beta}{2} \eta^2 || \nabla f(\mathbf{x}_t)||_2^2 $$

Since we only need the upper bound, we drop the absolute value on the left-hand side.
$$ f(\mathbf{x}_{t+1}) - f(\mathbf{x}_{t}) + \eta \nabla f(\mathbf{x}_{t})^{\top} \nabla f(\mathbf{x}_t) \leq \frac{\beta}{2} \eta^2 || \nabla f(\mathbf{x}_t)||_2^2 $$

$$ \eta \nabla f(\mathbf{x}_{t})^{\top} \nabla f(\mathbf{x}_t) \leq f(\mathbf{x}_{t}) - f(\mathbf{x}_{t+1}) + \frac{\beta}{2} \eta^2 || \nabla f(\mathbf{x}_t)||_2^2 $$

$$ \nabla f(\mathbf{x}_{t})^{\top} \nabla f(\mathbf{x}_t) \leq \frac{f(\mathbf{x}_{t}) - f(\mathbf{x}_{t+1})}{\eta} + \frac{\beta}{2} \eta || \nabla f(\mathbf{x}_t)||_2^2 $$

Writing this inequality out for each iteration $t = 1, \dots, T-1$ of gradient descent gives

$$ \begin{array}{l} \nabla f(\mathbf{x}_1)^{\top} \nabla f(\mathbf{x}_1) \leq \frac{f(\mathbf{x}_{1}) - f(\mathbf{x}_{2})}{\eta} + \frac{\beta}{2} \eta || \nabla f(\mathbf{x}_1)||_2^2 \\ \nabla f(\mathbf{x}_2)^{\top} \nabla f(\mathbf{x}_2) \leq \frac{f(\mathbf{x}_{2}) - f(\mathbf{x}_{3})}{\eta} + \frac{\beta}{2} \eta || \nabla f(\mathbf{x}_2)||_2^2 \\ \vdots \\ \nabla f(\mathbf{x}_{T-1})^{\top} \nabla f(\mathbf{x}_{T-1}) \leq \frac{f(\mathbf{x}_{T-1}) - f(\mathbf{x}_{T})}{\eta} + \frac{\beta}{2} \eta || \nabla f(\mathbf{x}_{T-1})||_2^2 \\ \end{array} $$

Summing over the iterations, the function-value differences telescope:

$$ \sum_{t=1}^{T-1} \nabla f(\mathbf{x}_t)^{\top} \nabla f(\mathbf{x}_t) \leq \frac{f(\mathbf{x}_{1}) - f(\mathbf{x}_T)}{\eta} + \frac{\eta \beta}{2} \sum_{t=1}^{T-1} || \nabla f(\mathbf{x}_t)||_2^2 $$

Divide both sides by $T-1$. On the left, the minimum is at most the average; on the right, the average is at most the maximum:

$$ \min_{t \in T} \nabla f(\mathbf{x}_t)^{\top} \nabla f(\mathbf{x}_t) \leq \frac{f(\mathbf{x}_{1}) - f(\mathbf{x}_T)}{\eta (T-1)} + \frac{\eta \beta}{2} \max_{t \in T} || \nabla f(\mathbf{x}_t)||_2^2 $$

We take $\max_{t \in T} || \nabla f(\mathbf{x}_t)||_2^2$ to be bounded along the iterates and denote its upper bound by $M$.
According to the Lemma: Gradient Approximation Error of $\beta$-Smoothness Functions, the error of the first-order Taylor approximation of $f(\mathbf{x}_T)$ around $\mathbf{x}_1$ is bounded by $\frac{\beta}{2} || \mathbf{x}_1 - \mathbf{x}_T ||_2^2$, thus, $\exists D \in \mathbb{R}^+, f(\mathbf{x}_{1}) - f(\mathbf{x}_T) \leq D$. Using $\nabla f(\mathbf{x}_t)^{\top} \nabla f(\mathbf{x}_t) = || \nabla f(\mathbf{x}_t) ||_2^2$,

$$ \min_{t \in T} || \nabla f(\mathbf{x}_t) ||_2^2 \leq \frac{D}{\eta (T-1)} + \frac{\eta \beta M}{2} = \frac{2 D + \eta^2 \beta (T-1) M}{2 \eta (T-1)} $$

Let $\eta = \sqrt{\frac{2 D}{\beta M (T-1)}}$, the step size that minimizes the right-hand side. Both terms then equal $\sqrt{\frac{D \beta M}{2 (T-1)}}$:

$$ \min_{t \in T} || \nabla f(\mathbf{x}_t) ||_2^2 \leq \frac{D}{\eta (T-1)} + \frac{\eta \beta M}{2} = \sqrt{\frac{D \beta M}{2 (T-1)}} + \sqrt{\frac{D \beta M}{2 (T-1)}} = \sqrt{\frac{2 D \beta M}{T-1}} $$

Thus,

$$ \min_{t \in T} || \nabla f(\mathbf{x}_t) ||_2^2 \leq \sqrt{\frac{2 D \beta M}{T-1}} = O(\frac{1}{\sqrt{T-1}}) $$

---

## *Assumption 3.1*

(1) Hyperparameter $t$, $\exists C_1 \in \mathbb{R}^{+}, s.t. \ t \leq C_1$

(2) Hyperparameter $\sigma_w$, $\exists C_2 \in \mathbb{R}^{+}, s.t. \ \sigma_w \leq C_2$

(3) Hyperparameter $\eta$, $\exists C_3 \in \mathbb{R}^{+}, s.t. \ \eta \leq C_3$

(4) For every dimension $a \leq d, \ a \in \mathbb{N}$ of every data point $\mathbf{z}_{i}, \mathbf{x}_{i} \in \mathbb{R}^d, \mathbf{z}_{i} \in \mathbf{Z}^n, \mathbf{x}_{i} \in \mathbf{X}^n, \forall i \leq n$, $\exists C_4 \in \mathbb{R}^{+}, s.t. \ |\mathbf{x}_{i}^{a}| \leq C_4, |\mathbf{z}_{i}^{a}| \leq C_4$. $\mathbf{x}_{i}^{a}$ is the $a$-th dimension of the $i$-th real image and $\mathbf{z}_{i}^{a}$ is the $a$-th dimension of the $i$-th generated image.

(5) For each label $y_i, \forall i \leq n$, $\exists C_5 \in \mathbb{R}^{+}, s.t. \ |y_i| \leq C_5$

(6) The number of layers $L$ of the neural network is finite. That means $\exists C_6 \in \mathbb{N}, \ s.t. \ L \leq C_6$.

(7) Any two data points $e_i, e_j, i \neq j$ in the training dataset $e_i, e_j \in \{ \mathbf{X}^n, \mathbf{Z}^n\}$ are not identical, which means $\exists C_7 \in \mathbb{R}^+, s.t. \ || e_i - e_j || > C_7$.

## *Theorem 3.1*: Convergence of GA-NTK

If *Assumption 3.1* holds, then according to Theorem 3.7 of [Fan and Wang, 2020](https://arxiv.org/pdf/2005.11879.pdf), the GA-NTK algorithm almost surely converges to a critical point:

$$ \min_{t \in T} || \nabla_{\mathbf{Z}^n} \mathcal{L}(\mathbf{Z}^n) ||_2^2 \underset{a.s.}{\leq} O(\frac{1}{\sqrt{T-1}}) $$

where $T$ is the number of gradient descent iterations.

By Lemma: Convergence of Gradient Descent On $\beta$-Smoothness Function, we know that once the function is $\beta$-smooth and $\exists D\in\mathbb{R}^+,\ |f(\mathbf{x}_{1})-f(\mathbf{x}_{T})|\leq D$, gradient descent converges on the function. As a result, once we show that the loss function of GA-NTK $\mathcal{L}$ is $\beta$-smooth, it follows that gradient descent converges on $\mathcal{L}$.

### *Corollary*: Norm of The Gradient

If the norm of the gradient of a function $|| \nabla f(\mathbf{x}) ||, \ \mathbf{x} \in \mathbb{R}^d, \ f:\mathbb{R}^d \to \mathbb{R}^n$ has an upper bound, which means $\exists \alpha \in \mathbb{R}^+, \ || \nabla f(\mathbf{x}) || \leq \alpha$, and assumption (7) holds, then the function $f$ is $\beta$-smooth.

### Proof of *Theorem*: Convergence of GA-NTK

By Corollary: Norm of The Gradient, it suffices to bound the norm of the gradient of the GA-NTK loss function: the loss is then $\beta$-smooth, and by Lemma 3.1 gradient descent converges on it.

#### The Expansion of The Gradient of The Loss Function

First, we expand the gradient of the loss function with only 1 generated image $\mathbf{z} \in \mathbf{Z}^1$ and $n$ real images.
$$ \nabla_{\mathbf{z}} \mathcal{L}(\mathbf{z}) = \nabla_{\mathbf{z}} ||\mathbf{1}^{n+1} - \mathcal{D}(\mathbf{X}^n, \mathbf{Z}^1; k, \lambda)||_2^2 $$

Let $\mathcal{D}_{i}(\mathbf{X}^n, \mathbf{Z}^1; k, \lambda)$ be the prediction of the discriminator for the $i$-th data point. The first $n$ data points are the real images $\mathbf{x}_i \in \mathbf{X}^n$, and the generated image $\mathbf{z} \in \mathbb{R}^d$ is the $(n+1)$-th data point.

$$ \nabla_{\mathbf{z}} \mathcal{L}(\mathbf{z}) = \nabla_{\mathbf{z}} \left( \sum_{i=1}^{n+1} (1 - \mathcal{D}_i (\mathbf{X}^{n}, \mathbf{Z}^1; k, \lambda))^2 \right) $$

$$ = \sum_{i=1}^{n+1} 2 \cdot (\mathcal{D}_i (\mathbf{X}^{n}, \mathbf{Z}^1; k, \lambda) - 1) \cdot \nabla_{\mathbf{z}} \mathcal{D}_i (\mathbf{X}^{n}, \mathbf{Z}^1; k, \lambda) $$

Every term is a product of the discriminator output (shifted by 1) and the discriminator gradient, so it suffices to bound both factors.

#### *Remark*: The Bound of The Output And The Gradient of The Discriminator

If there exists a constant $\mathcal{B}_1 \in \mathbb{R}^+$ s.t. $|| \nabla_{\mathbf{z}} \mathcal{D}_{i}(\mathbf{X}^n, \mathbf{Z}^1; k, \lambda) ||_2 \leq \mathcal{B}_1$ and $| \mathcal{D}_{i}(\mathbf{X}^n, \mathbf{Z}^1; k, \lambda) | \leq \mathcal{B}_1$, then $\mathcal{L}(\mathbf{z})$ is $\beta$-smooth.

#### *Remark*: The Bound of The Output of The Discriminator

If assumptions (1) and (3) and Theorem 3.7 proved in the paper [NIPS'20 Spectra of the Conjugate Kernel and Neural Tangent Kernel for Linear-Width Neural Networks](https://arxiv.org/abs/2005.11879) hold, then $\exists \mathcal{B}_1 \in \mathbb{R}^+, \ s.t. \ | \mathcal{D}_{i}(\mathbf{X}^n, \mathbf{Z}^1; k, \lambda) | \underset{a.s.}{\leq} \mathcal{B}_1$

#### *Proof of Remark*: The Bound of The Output of The Discriminator

In this section, we want to show that $\exists \mathcal{B}_1 \in \mathbb{R}^+$ s.t. $|\mathcal{D}_{i}(\mathbf{X}^n, \mathbf{Z}^1; k, \lambda)| \leq \mathcal{B}_1, \ \forall i \leq n+1$.

$$ \mathcal{D}_{i}(\mathbf{X}^n, \mathbf{Z}^1; k, \lambda) = \left( \sum_{j=1}^{n} (I_{i,j} - e^{- \eta \tau^{(L)}(\mathbf{x}_i, \mathbf{x}_j) t}) y_{j} \right) + (I_{i,n+1} - e^{- \eta \tau^{(L)}(\mathbf{x}_i, \mathbf{z}) t}) y_{n+1} $$

Note that the $(n+1)$-th data point $\mathbf{x}_{n+1}$ is the generated image $\mathbf{z}$.

The paper [NIPS'20 Spectra of the Conjugate Kernel and Neural Tangent Kernel for Linear-Width Neural Networks](https://arxiv.org/abs/2005.11879) proves that the NTK kernel satisfies $||K^{NTK}||_2 \leq C$ almost surely for a constant $C > 0$ and a large enough training dataset. Excluding indeterminate operations between $\infty$ and $-\infty$, the kernel function of the NTK must therefore be bounded, which means $\exists \mathcal{B}_2 \in \mathbb{R}^+ \ s.t. \ \forall i, j \leq n+1, \ | \tau^{(L)}(\mathbf{x}_i, \mathbf{x}_j) | \underset{a.s.}{\leq} \mathcal{B}_2$. Combining this with assumptions (1) and (3) and the theorem proved in the paper, we can derive that $\exists \mathcal{B}_1 \in \mathbb{R}^+, \ s.t. \ | \mathcal{D}_{i}(\mathbf{X}^n, \mathbf{Z}^1; k, \lambda) | \underset{a.s.}{\leq} \mathcal{B}_1$

#### *Remark*: The Bound of The Gradient of The Discriminator

If assumptions (1), (3), and (5) hold, Theorem 3.7 proved in the paper [NIPS'20 Spectra of the Conjugate Kernel and Neural Tangent Kernel for Linear-Width Neural Networks](https://arxiv.org/abs/2005.11879) holds, and $\exists \mathcal{B}_4 \in \mathbb{R}^+, \ s.t. \ \forall j \leq n, \ || \nabla_{\mathbf{z}} \tau^{(L)}(\mathbf{z}, \mathbf{x}_j) ||_2 \leq \mathcal{B}_4, \ || \nabla_{\mathbf{z}} \tau^{(L)}(\mathbf{z}, \mathbf{z}) ||_2 \leq \mathcal{B}_4$ holds, then $\exists \mathcal{B}_1 \in \mathbb{R}^+$ s.t. $|| \nabla_{\mathbf{z}} \mathcal{D}_{i}(\mathbf{X}^n, \mathbf{Z}^1; k, \lambda) ||_2 \underset{a.s.}{\leq} \mathcal{B}_1$

#### *Proof of Remark*: The Bound of The Gradient of The Discriminator

In this section, we aim to show $\exists \mathcal{B}_1 \in \mathbb{R}^+$ s.t. $|| \nabla_{\mathbf{z}} \mathcal{D}_{i}(\mathbf{X}^n, \mathbf{Z}^1; k, \lambda) ||_2 \leq \mathcal{B}_1$

First, we expand the gradient of the discriminator:

$$ \nabla_{\mathbf{z}} \mathcal{D}_{i}(\mathbf{X}^n, \mathbf{Z}^1; k, \lambda) = \nabla_{\mathbf{z}} \left( \left( \sum_{j=1}^{n} (I_{i,j} - e^{- \eta \tau^{(L)}(\mathbf{x}_i, \mathbf{x}_j) t}) y_{j} \right) + (I_{i,n+1} - e^{- \eta \tau^{(L)}(\mathbf{x}_i, \mathbf{z}) t}) y_{n+1} \right) $$

For the prediction of the generated image $\mathbf{z} \in \mathbf{Z}^1$, i.e. $i = n+1$:

$$ \nabla_{\mathbf{z}} \mathcal{D}_{n+1}(\mathbf{X}^n, \mathbf{Z}^1; k, \lambda) = \nabla_{\mathbf{z}} \left( \left( \sum_{j=1}^{n} (I_{n+1,j} - e^{- \eta \tau^{(L)}(\mathbf{z}, \mathbf{x}_j) t}) y_{j} \right) + (I_{n+1,n+1} - e^{- \eta \tau^{(L)}(\mathbf{z}, \mathbf{z}) t}) y_{n+1} \right) $$

$$ = \sum_{j=1}^{n} \left(- \left(\nabla_{\mathbf{z}} e^{- \eta \tau^{(L)}(\mathbf{z}, \mathbf{x}_j) t} \right) y_{j} \right) - (\nabla_{\mathbf{z}} e^{- \eta \tau^{(L)}(\mathbf{z}, \mathbf{z}) t}) y_{n+1} $$

$$ = \left( \sum_{j=1}^{n} - y_{j} \left(e^{- \eta \tau^{(L)}(\mathbf{z}, \mathbf{x}_j) t} \right) \nabla_{\mathbf{z}} \tau^{(L)}(\mathbf{z}, \mathbf{x}_j) \cdot (- \eta t) \right) - (e^{- \eta \tau^{(L)}(\mathbf{z}, \mathbf{z}) t} \nabla_{\mathbf{z}} \tau^{(L)}(\mathbf{z}, \mathbf{z}) (-\eta t)) y_{n+1} $$

$$ = \left( \sum_{j=1}^{n} y_{j} \eta t e^{- \eta \tau^{(L)}(\mathbf{z}, \mathbf{x}_j) t} \color{red}{\nabla_{\mathbf{z}} \tau^{(L)}(\mathbf{z}, \mathbf{x}_j)} \right) + (y_{n+1} \eta t e^{- \eta \tau^{(L)}(\mathbf{z}, \mathbf{z}) t} \color{red}{\nabla_{\mathbf{z}} \tau^{(L)}(\mathbf{z}, \mathbf{z})}) $$

For the prediction of a real image $\mathbf{x}_i \in \mathbf{X}^n$, i.e. $i \leq n$:

$$ \nabla_{\mathbf{z}} \mathcal{D}_{i}(\mathbf{X}^n, \mathbf{Z}^1; k, \lambda) = \nabla_{\mathbf{z}} \left( \left( \sum_{j=1}^{n} (I_{i,j} - e^{- \eta \tau^{(L)}(\mathbf{x}_i, \mathbf{x}_j) t}) y_{j} \right) + (I_{i,n+1} - e^{- \eta \tau^{(L)}(\mathbf{x}_i, \mathbf{z}) t}) y_{n+1} \right) $$

$$ = \sum_{j=1}^{n} \left(- \left(\nabla_{\mathbf{z}} e^{- \eta \tau^{(L)}(\mathbf{x}_i, \mathbf{x}_j) t} \right) y_{j} \right) - (\nabla_{\mathbf{z}} e^{- \eta \tau^{(L)}(\mathbf{x}_i, \mathbf{z}) t}) y_{n+1} $$

$$ = \left( \sum_{j=1}^{n} - y_{j} \left(e^{- \eta \tau^{(L)}(\mathbf{x}_i, \mathbf{x}_j) t} \right) \nabla_{\mathbf{z}} \tau^{(L)}(\mathbf{x}_i, \mathbf{x}_j) \cdot (- \eta t) \right) - (e^{- \eta \tau^{(L)}(\mathbf{x}_i, \mathbf{z}) t} \nabla_{\mathbf{z}} \tau^{(L)}(\mathbf{x}_i, \mathbf{z}) (-\eta t)) y_{n+1} $$

$$ = \left( \sum_{j=1}^{n} y_{j} \eta t e^{- \eta \tau^{(L)}(\mathbf{x}_i, \mathbf{x}_j) t} \color{red}{\nabla_{\mathbf{z}} \tau^{(L)}(\mathbf{x}_i, \mathbf{x}_j)} \right) + (y_{n+1} \eta t e^{- \eta \tau^{(L)}(\mathbf{x}_i, \mathbf{z}) t} \color{red}{\nabla_{\mathbf{z}} \tau^{(L)}(\mathbf{x}_i, \mathbf{z})}) $$

Here $\tau^{(L)}(\mathbf{x}_i, \mathbf{x}_j)$ does not depend on $\mathbf{z}$, so $\nabla_{\mathbf{z}} \tau^{(L)}(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{0}$ and only the last term remains; by the symmetry of the kernel, $\nabla_{\mathbf{z}} \tau^{(L)}(\mathbf{x}_i, \mathbf{z}) = \nabla_{\mathbf{z}} \tau^{(L)}(\mathbf{z}, \mathbf{x}_i)$.

According to assumptions (1), (3), and (5) and Theorem 3.7, $\exists \mathcal{B}_3 \in \mathbb{R}^+$ s.t. $\forall i, j \leq n, \ | y_{j} \eta t e^{- \eta \tau^{(L)}(\mathbf{x}_i, \mathbf{x}_j) t} | \leq \mathcal{B}_3, \ | y_{n+1} \eta t e^{- \eta \tau^{(L)}(\mathbf{x}_i, \mathbf{z}) t} | \leq \mathcal{B}_3, \ | y_{j} \eta t e^{- \eta \tau^{(L)}(\mathbf{z}, \mathbf{x}_j) t} | \leq \mathcal{B}_3, \ | y_{n+1} \eta t e^{- \eta \tau^{(L)}(\mathbf{z}, \mathbf{z}) t} | \leq \mathcal{B}_3$. Therefore, to bound the norm of $\nabla_{\mathbf{z}} \mathcal{D}_{i}(\mathbf{X}^n, \mathbf{Z}^1; k, \lambda)$, we need to bound the norms of $\color{red}{\nabla_{\mathbf{z}} \tau^{(L)}(\mathbf{z}, \mathbf{x}_j)}, \ \forall j \leq n$, and $\color{red}{\nabla_{\mathbf{z}} \tau^{(L)}(\mathbf{z}, \mathbf{z})}$.
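The two discriminator bounds can be made explicit with a small numerical sketch. It reads the exponential in the formula for $\mathcal{D}_i$ entrywise, exactly as written in this note; if a matrix exponential is intended, the sketch does not apply directly. The function names, the constants `b2` and `c5` (standing in for $\mathcal{B}_2$ and $C_5$), and the random kernel values are illustrative assumptions, not part of GA-NTK.

```python
import math
import random


def discriminator_outputs(tau, y, eta, t):
    """D_i = sum_j (I_ij - exp(-eta * tau_ij * t)) * y_j, read entrywise."""
    m = len(y)
    return [
        sum(((1.0 if i == j else 0.0) - math.exp(-eta * tau[i][j] * t)) * y[j]
            for j in range(m))
        for i in range(m)
    ]


def output_bound(m, b2, c5, eta, t):
    """One explicit choice of B_1: each of the m terms is at most
    (1 + exp(eta * b2 * t)) * c5 when |tau_ij| <= b2 and |y_j| <= c5."""
    return m * (1.0 + math.exp(eta * b2 * t)) * c5


def coefficient_bound(b2, c5, eta, t):
    """One explicit choice of B_3 for |y_j * eta * t * exp(-eta * tau * t)|."""
    return c5 * eta * t * math.exp(eta * b2 * t)


random.seed(0)
m, b2, c5, eta, t = 6, 2.0, 1.0, 0.1, 3.0  # m = n + 1 data points

# Hypothetical symmetric kernel values in [-b2, b2] and labels in {-c5, c5}.
tau = [[0.0] * m for _ in range(m)]
for i in range(m):
    for j in range(i, m):
        tau[i][j] = tau[j][i] = random.uniform(-b2, b2)
y = [random.choice([-c5, c5]) for _ in range(m)]

d = discriminator_outputs(tau, y, eta, t)
b1 = output_bound(m, b2, c5, eta, t)
b3 = coefficient_bound(b2, c5, eta, t)
coefficients = [abs(y[j] * eta * t * math.exp(-eta * tau[i][j] * t))
                for i in range(m) for j in range(m)]
print(max(abs(v) for v in d) <= b1, max(coefficients) <= b3)
```

These constants are loose on purpose: they only show that bounded $t$, $\eta$, labels, and kernel values give finite $\mathcal{B}_1$ and $\mathcal{B}_3$.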
#### *Remark*: The Bound of The Gradient of The Neural Tangent Kernel

If assumptions (2) and (4) hold, then $\exists \mathcal{B}_4 \in \mathbb{R}^+, \ s.t. \ \forall j \leq n, \ || \nabla_{\mathbf{z}} \tau^{(L)}(\mathbf{z}, \mathbf{x}_j) ||_2 \leq \mathcal{B}_4, \ || \nabla_{\mathbf{z}} \tau^{(L)}(\mathbf{z}, \mathbf{z}) ||_2 \leq \mathcal{B}_4$

#### *Proof of Remark*: The Bound of The Gradient of The Neural Tangent Kernel

In this section, our goal is to show that $\exists \mathcal{B}_4 \in \mathbb{R}^+$ s.t. $\forall j \leq n, \ || \nabla_{\mathbf{z}} \tau^{(L)}(\mathbf{z}, \mathbf{x}_j) ||_2 \leq \mathcal{B}_4$ and $|| \nabla_{\mathbf{z}} \tau^{(L)}(\mathbf{z}, \mathbf{z}) ||_2 \leq \mathcal{B}_4$.

First, we show that $\forall j \leq n, \ || \nabla_{\mathbf{z}} \tau^{(L)}(\mathbf{z}, \mathbf{x}_j) ||_2 \leq \mathcal{B}_4$. The idea is that if each dimension of $\nabla_{\mathbf{z}} \tau^{(L)}(\mathbf{z}, \mathbf{x}_j)$ is bounded, then $||\nabla_{\mathbf{z}} \tau^{(L)}(\mathbf{z}, \mathbf{x}_j)||_2$ is bounded as well, because the dimension $d$ is finite.

The $a$-th dimension of $\nabla_{\mathbf{z}} \tau^{(L)}(\mathbf{z}, \mathbf{x}_j)$ is $\frac{\partial \tau^{(L)}(\mathbf{z}, \mathbf{x}_j)}{\partial \mathbf{z}^{a}}$, where $\mathbf{z}^{a}$ is the $a$-th dimension of the data point $\mathbf{z} \in \mathbb{R}^d$ and $\mathbf{x}^{a}$ is the $a$-th dimension of the data point $\mathbf{x} \in \mathbb{R}^d$. By the chain rule,

$$ \frac{\partial \tau^{(L)}(\mathbf{z}, \mathbf{x}_j)}{\partial \mathbf{z}^{a}} = \frac{\partial \tau^{(L)}(\mathbf{z}, \mathbf{x}_j)}{\partial \tau^{(L-1)}(\mathbf{z}, \mathbf{x}_j)} \cdots \frac{\partial \tau^{(1)}(\mathbf{z}, \mathbf{x}_j)}{\partial \mathbf{z}^{a}} $$

By assumption (6) the number of factors is finite. As a result, to show that $||\nabla_{\mathbf{z}} \tau^{(L)}(\mathbf{z}, \mathbf{x}_j)||_2$ is bounded, we need to

1. Show that $\exists \mathcal{B}_5 \in \mathbb{R}^+$ such that $\left| \frac{\partial \tau^{(L)}(\mathbf{z}, \mathbf{x}_j)}{\partial \tau^{(L-1)}(\mathbf{z}, \mathbf{x}_j)} \right| \leq \mathcal{B}_5$
2. Show that $\exists \mathcal{B}_6 \in \mathbb{R}^+$ such that $\left| \frac{\partial \tau^{(1)}(\mathbf{z}, \mathbf{x}_j)}{\partial \mathbf{z}^{a}} \right| \leq \mathcal{B}_6$

To prove the first bound, $\exists \mathcal{B}_5 \in \mathbb{R}^+$ such that $\left| \frac{\partial \tau^{(L)}(\mathbf{z}, \mathbf{x}_j)}{\partial \tau^{(L-1)}(\mathbf{z}, \mathbf{x}_j)} \right| \leq \mathcal{B}_5$, we expand the partial derivative as follows (the summation index is written $k$ to avoid a clash with the data index $j$):

$$ \frac{\partial \tau^{(L)}(\mathbf{z}, \mathbf{x}_j)}{\partial \tau^{(L-1)}(\mathbf{z}, \mathbf{x}_j)} = \frac{\sigma_{\mathbf{w}}^2}{D^{(L-1)}} \sum_{k} w_{ki}^{(L)} w_{ki}^{(L)} \phi'(z_k^{(L-1)}(\mathbf{z})) \phi'(z_k^{(L-1)}(\mathbf{x}_j)) $$

where $\sigma_w^2$ is the variance of the weight initialization, $D^{(L-1)}$ is the width of the $(L-1)$-th layer, $w_{ki}^{(L)}$ is the weight in the $i$-th row and $k$-th column of the $L$-th layer, $\phi$ is the activation function, and $z_k^{(L-1)}$ is the $k$-th entry of the pre-activation output of the $(L-1)$-th layer. As the width goes to infinity and the weights have zero mean, as assumed by the NTK, the sum reduces to an expectation:

$$ \frac{\partial \tau^{(L)}(\mathbf{z}, \mathbf{x}_j)}{\partial \tau^{(L-1)}(\mathbf{z}, \mathbf{x}_j)} = \sigma_w^2 \mathbb{E}_{w_{ki}^{(L)} \sim N(0, \sigma_{w}^2)}[w_{ki}^{(L)} w_{ki}^{(L)}] \mathbb{E}_{(z_k^{(L-1)}(\mathbf{z}), z_k^{(L-1)}(\mathbf{x}_j)) \sim N(\mathbf{0}_2, {\mathbf{K}}_{2,2}^{(L-1)})} \left[ \phi'(z_k^{(L-1)}(\mathbf{z})) \phi'(z_k^{(L-1)}(\mathbf{x}_j)) \right] $$

For a random variable $x$, the variance is $Var[x] = \mathbb{E}[(x - \mathbb{E}[x])^2] = \mathbb{E}[x^2] - \mathbb{E}[x]^2$, so a zero-mean weight has $\mathbb{E}[w_{ki}^{(L)} w_{ki}^{(L)}] = \sigma_w^2$. Thus,

$$ = \sigma_w^4 \mathbb{E}_{(z_k^{(L-1)}(\mathbf{z}), z_k^{(L-1)}(\mathbf{x}_j)) \sim N(\mathbf{0}_2, \mathbf{K}_{2,2}^{(L-1)})} \left[ \phi'(z_k^{(L-1)}(\mathbf{z})) \phi'(z_k^{(L-1)}(\mathbf{x}_j)) \right] $$

where $\mathbf{K}_{2,2}^{(L-1)} = \begin{bmatrix} k^{(L-1)}(\mathbf{z}, \mathbf{z}) & k^{(L-1)}(\mathbf{z}, \mathbf{x}_j) \\ k^{(L-1)}(\mathbf{x}_j, \mathbf{z}) & k^{(L-1)}(\mathbf{x}_j, \mathbf{x}_j) \\ \end{bmatrix}$ is the NNGP kernel matrix of layer $L-1$ and $k^{(L-1)}(\mathbf{z}, \mathbf{x}_j) = Cov[z_{k}^{(L-1)}(\mathbf{z}), z_{k}^{(L-1)}(\mathbf{x}_j)]$ is the NNGP kernel function of layer $L-1$.

1. Bound the derivative of the activation function $\phi'$: if the activation function has no vertical tangent, i.e. its derivative is bounded, this condition is satisfied. Most activation functions, including ReLU, sigmoid, tanh, softmax, and erf, satisfy this condition.
2. According to assumption (2), $|\sigma_w^4|$ is bounded.

Thus, $\exists \mathcal{B}_5 \in \mathbb{R}^+$ such that $\left| \frac{\partial \tau^{(L)}(\mathbf{z}, \mathbf{x}_j)}{\partial \tau^{(L-1)}(\mathbf{z}, \mathbf{x}_j)} \right| \leq \mathcal{B}_5$.

To prove the second bound, $\exists \mathcal{B}_6 \in \mathbb{R}^+$ such that $\left| \frac{\partial \tau^{(1)}(\mathbf{z}, \mathbf{x}_j)}{\partial \mathbf{z}^{a}} \right| \leq \mathcal{B}_6$, we expand the first-layer kernel and differentiate it:

$$ \tau^{(1)}(\mathbf{z}, \mathbf{x}_j) = \frac{\sigma_w^4}{(D^{(0)})^2} \mathbf{z}^{\top} \mathbf{x}_j + 1 = \frac{\sigma_w^4}{(D^{(0)})^2} \left( \sum_{p=1}^{D^{(0)}} \mathbf{z}^{p} \mathbf{x}_j^{p} \right) + 1 $$

$$ \frac{\partial \tau^{(1)}(\mathbf{z}, \mathbf{x}_j)}{\partial \mathbf{z}^{a}} = \frac{\sigma_w^4}{(D^{(0)})^2} \mathbf{x}_j^{a} $$

According to assumptions (2) and (4), we can derive that $\exists \mathcal{B}_6 \in \mathbb{R}^+$ such that $\left| \frac{\partial \tau^{(1)}(\mathbf{z}, \mathbf{x}_j)}{\partial \mathbf{z}^{a}} \right| \leq \mathcal{B}_6$.
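The first-layer derivative can be sanity-checked with a finite difference. The sketch below uses the first-layer kernel exactly as written in this note, $\tau^{(1)}(\mathbf{z}, \mathbf{x}_j) = \frac{\sigma_w^4}{(D^{(0)})^2} \mathbf{z}^{\top} \mathbf{x}_j + 1$. The helper names and the sample values for $\mathbf{z}$, $\mathbf{x}_j$, $\sigma_w$, $C_2$, and $C_4$ are assumptions made for illustration.

```python
def tau1(z, x, sigma_w):
    """First-layer kernel as written above: sigma_w^4 / D0^2 * <z, x> + 1."""
    d0 = len(z)
    return sigma_w ** 4 / d0 ** 2 * sum(zp * xp for zp, xp in zip(z, x)) + 1.0


def grad_tau1(x, sigma_w):
    """Closed-form gradient w.r.t. z: sigma_w^4 / D0^2 * x."""
    d0 = len(x)
    return [sigma_w ** 4 / d0 ** 2 * xa for xa in x]


def finite_difference(z, x, sigma_w, a, h=1e-6):
    """Central difference of tau1 along the a-th coordinate of z."""
    z_plus, z_minus = list(z), list(z)
    z_plus[a] += h
    z_minus[a] -= h
    return (tau1(z_plus, x, sigma_w) - tau1(z_minus, x, sigma_w)) / (2 * h)


# Illustrative values: sigma_w <= C2 and |x^a| <= C4.
c2, c4 = 2.0, 1.0
sigma_w = 1.5
z = [0.2, -0.5, 0.9]
x = [1.0, -0.3, 0.4]

analytic = grad_tau1(x, sigma_w)
numeric = [finite_difference(z, x, sigma_w, a) for a in range(len(z))]
b6 = c2 ** 4 * c4 / len(z) ** 2  # one explicit choice of B_6

print(analytic)
print(numeric)
```

Because $\tau^{(1)}$ is linear in $\mathbf{z}$, the gradient does not depend on $\mathbf{z}$ at all, which is why $C_2$ and $C_4$ alone are enough for $\mathcal{B}_6$.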
The proof of $\exists \mathcal{B}_4 \in \mathbb{R}^+, \ s.t. \ || \nabla_{\mathbf{z}} \tau^{(L)}(\mathbf{z}, \mathbf{z}) ||_2 \leq \mathcal{B}_4$ is similar.

---

At the $l$-th layer, the NTK kernel function is

$$ \tau^{(l)}(\mathbf{x}, \mathbf{x}') = \nabla_{\mathbf{\theta}^{(\leq l)}} z_i^{(l)} (\mathbf{x})^{\top} \nabla_{\mathbf{\theta}^{(\leq l)}} z_i^{(l)}(\mathbf{x}') $$

$$ = \nabla_{\mathbf{\theta}^{(l)}} z_i^{(l)} (\mathbf{x})^{\top} \nabla_{\mathbf{\theta}^{(l)}} z_i^{(l)}(\mathbf{x}') + \nabla_{\mathbf{\theta}^{(\leq l-1)}} z_i^{(l)} (\mathbf{x})^{\top} \nabla_{\mathbf{\theta}^{(\leq l-1)}} z_i^{(l)}(\mathbf{x}') $$

$$ \begin{array}{c} = \left( \frac{\sigma_{\mathbf{w}}^2}{D^{(l-1)}} \sum_{j} \phi(z_j^{(l-1)}(\mathbf{x})) \phi(z_j^{(l-1)}(\mathbf{x}')) + \sigma_b^2 \right) + \\ \left( \tau^{(l-1)}(\mathbf{x}, \mathbf{x}') \frac{\sigma_{\mathbf{w}}^2}{D^{(l-1)}} \sum_{j} w_{ji}^{(l)} w_{ji}^{(l)} \phi'(z_j^{(l-1)}(\mathbf{x})) \phi'(z_j^{(l-1)}(\mathbf{x}')) \right) \end{array} $$
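Returning to Lemma 3.1, the rate $\sqrt{\frac{2 D \beta M}{T-1}}$ can be checked numerically on a simple non-convex $\beta$-smooth function. The sketch below is an illustration under stated assumptions, not part of the GA-NTK proof: the test function $f(\mathbf{x}) = \sum_i \sin(\mathbf{x}^i)$ (with $\beta = 1$, $|f| \leq d$, and $||\nabla f||_2^2 \leq d$), the starting point, and all helper names are chosen here for demonstration. The step size is $\eta = \sqrt{\frac{2 D}{\beta M (T-1)}}$, the value for which $\frac{D}{\eta (T-1)} + \frac{\eta \beta M}{2}$ equals the stated bound.

```python
import math


def f(x):
    """Illustrative non-convex, 1-smooth test function: sum of sines."""
    return sum(math.sin(v) for v in x)


def grad_f(x):
    return [math.cos(v) for v in x]


def sq_norm(g):
    return sum(v * v for v in g)


def gradient_descent(x1, eta, steps):
    """Run x_{t+1} = x_t - eta * grad f(x_t).

    Returns the squared gradient norms at x_1 .. x_{T-1} (T = steps + 1)
    and the final iterate x_T.
    """
    x = list(x1)
    sq_norms = []
    for _ in range(steps):
        g = grad_f(x)
        sq_norms.append(sq_norm(g))
        x = [xi - eta * gi for xi, gi in zip(x, g)]
    return sq_norms, x


d, T = 3, 101
beta = 1.0        # |sin''| <= 1, so the gradient is 1-Lipschitz
D = 2.0 * d       # f takes values in [-d, d], so f(x_1) - f(x_T) <= 2d
M = float(d)      # ||grad f||^2 = sum of cos^2 <= d

eta = math.sqrt(2.0 * D / (beta * M * (T - 1)))
bound = math.sqrt(2.0 * D * beta * M / (T - 1))

x1 = [0.3, 2.0, -1.0]
sq_norms, x_T = gradient_descent(x1, eta, T - 1)

print(eta, bound, min(sq_norms))
```

In practice the smallest squared gradient norm ends up far below the bound here, which is expected: the lemma is a worst-case guarantee that only uses $\beta$, $D$, and $M$.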
