# Research Notes

---

[Overleaf Link](https://www.overleaf.com/project/62607cc0e382ea40b83f8872)

## Finishing the Paper

### TODOs

1. Introduction
   - 1.1 Related Work (Flo)
     - [ ] **2** Intro: limited work on infinite NLMs. Why does existing work not generalize, already at the level of priors, before we even get to training:
       - [ ] **2** Proof BNN -> GP (prior), Neal '96
       - [ ] **2** NTK papers
       - [ ] **2** Yang: why does it not apply
   - 1.2 Rephrasing/Improvements (Maxime | Flo above)
     - [ ] **3** Infinite-width behavior
     - [ ] **3** Contribution
2. Proof (Maxime)
   - [ ] **1** Define the $\Theta$ notation
   - [ ] **3** Decide on the dependence of the limit $L(x)$ on $t$ for the condition "sum of $h_i^2$ tends to $L$" => decided, we add a $t$
   - [ ] **3** Intuition: make the initialization point stand out more, maybe with an intro sentence that presents the 3 important messages: (i) the CLT does not apply, (ii) but because everything is Gaussian, we can use another limit, (iii) for which we can prove things at initialization only
   - [ ] **2** Add an introduction sentence to Lemma 3
   - [x] **ABANDONED (cannot prove strong Lemma 2)** **2** Weak and strong: introduce the notion. Weak = what we have; strong = there exists $N_0$ such that the probability is 1 for all $N \ge N_0$
   - [x] **ABANDONED (cannot prove strong Lemma 2)** **2** Weak and strong: edit the 3 lemmas, i.e. both presentation and proof
   - [x] **ABANDONED (cannot prove strong Lemma 2)** **2** Weak and strong: edit the proposition
   - [ ] **1** Write the theorem
3. Kernel (Flo)
   - [x] **1** Title should be about the KERNEL
   - [ ] **2** Then there should be a paragraph explaining how we define the kernel for the GP limit in the case of an NLM. "We limit ourselves to the case of no training"
   - [x] **1** Change notation $\xi$ to $x$
   - [x] **1** Change notation $w, v$ to $\omega, \theta$
   - [ ] **1** Fix missing citations "(..)"
   - [ ] **2** Add dependence of $h_i$ on $\omega$
4. Conclusion (Maxime)
   - [ ] **1** Conclusion

### Questions from Florian

- Eq (6)
  - "which implies"

Others:
- Proposition 1: $\Delta h$ is not defined yet
  - shouldn't it be $\Delta h = O(1/N)$, i.e. at most of order $1/N$? I think there was a mistake in Yang's paper
  - $h_i(x, w_t)$, not $θ_t$
  - $\Delta h = \Theta(1/N)$: cite Yang, maybe $|\Delta h_i|$?

## Week 15 May 2 - May 9

Check that the NNGP defines a multivariate Gaussian in the case of multiple outputs.

Angle of the paper: BNNs, NLMs, Ensemble NLMs (to be defined) and Ensembles are four different techniques for obtaining uncertainty. We prove that all these techniques converge to a GP in the prior, with the NNGP as the kernel for all but the Ensembles.

**Important points to write clearly on:**
- When we say "initialization" for an NLM, we mean "initialization (sampled) + Bayesian inference"

#### Notes

$w\sim \mathcal{D}_{0}$ at initialization and $w\sim \mathcal{D}_t$ after training (or $t$ training steps).
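As a concrete reference for this two-stage view, here is a minimal sketch of what "initialization (sampled) + Bayesian inference" means for an NLM: sample the feature weights $w$ once, then compute the exact Gaussian posterior over the last-layer weights $θ$. This assumes a 1-hidden-layer tanh network and the conjugate BLR formulas from the Week 10 notes below; all names and hyperparameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stage 1 ("initialization"): sample the NN weights w ~ D_0 once and keep them fixed.
d, N = 1, 512                                    # input dim, hidden width (illustrative)
sigma_w, sigma_theta, sigma_eps = 1.0, 1.0, 0.1
W = rng.normal(0.0, sigma_w, size=(d, N))

def features(X):
    """h_i(x, w) for the sampled w: a 1-hidden-layer tanh feature map."""
    return np.tanh(X @ W)

# Stage 2: exact Bayesian inference on the last-layer weights theta (conjugate BLR).
X = rng.uniform(-2, 2, size=(20, d))
Y = np.sin(3 * X).ravel() + 0.1 * rng.normal(size=20)
H = features(X)
Sigma = np.linalg.inv(H.T @ H / sigma_eps**2 + np.eye(N) / sigma_theta**2)
mu = Sigma @ H.T @ Y / sigma_eps**2              # posterior over theta: N(mu, Sigma)

# Posterior predictive at a test input, for this particular draw of w.
h_star = features(np.array([[0.3]]))
pred_mean = (h_star @ mu).item()
pred_var = (h_star @ Sigma @ h_star.T).item() + sigma_eps**2
print(pred_mean, pred_var)
```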
Expected kernel:
$$
E_w[E_θ[f_w(x, θ) f_w(x', θ)]]
$$

### Definition of NLM Ensemble

For an NLM $f_w$ with fixed NN weights $w$ and Bayesian weights $θ\sim \mathbb{P}_θ$, the output $f_w(x)$ is a random variable depending on the Bayesian weights $θ$:
\begin{equation}
f_w(x; \theta) = \sum_{i=1}^N θ_i h_i(x,w)
\end{equation}
The mean prediction of an NLM $f_w$ is $μ_{w}(x) = E_θ[f_w(x;θ)]$.

**Definition (NLM Ensemble):** Following Lakshminarayanan et al., we define an ensemble of $k$ NLMs $f_{w_1}, … , f_{w_k}$ as the mixture
$$
\mathrm{ensemble}(f_{w_1}, … , f_{w_k})(x) = \frac1k\sum_{j=1}^k f_{w_j}\left(x, θ^{(j)}\right)
$$
where $θ^{(j)} \sim \mathbb{P}_{θ^{(j)}}$, with the probability distribution
$$
p(y|x) = \frac1k\sum_{j=1}^kp_j(y | x, θ^{(j)}).
$$
If all $θ^{(j)}\sim \mathbb{P}_θ$ are iid, then in the limit of infinitely many members the ensemble converges to
$$
f^*(x) = \mathbb{E}_w[f_w(x, θ)],
$$
which is still a random variable with respect to $θ^{(j)}\sim \mathbb{P}_θ$.

TODO: If we know the distribution $w\sim \mathbb{P}_w$, then we can define the expected ensemble function:
$$
E_w[f^*] = \frac1k E[f]
$$

## Week 14 April 25 - May 2

### Proof for 1 hidden layer

\begin{align}
\frac 1 N \sum_{i = 1}^N {{h_i(x, \theta_t)}^2} &= \frac 1 N \sum_{i = 1}^N {{\left( h_i(x, \theta_0) + \Delta h_{i, t}(x) \right)}^2} \\
&= \underbrace{\frac 1 N \sum_{i = 1}^N {{h_i(x, \theta_0)}^2}}_{A_N} + \underbrace{\frac 2 N \sum_{i = 1}^N {h_i(x, \theta_0) \Delta h_{i, t}(x)}}_{B_N} + \underbrace{\frac 1 N \sum_{i = 1}^N {{\Delta h_{i, t}(x)}^2}}_{C_N}
\end{align}

### Kernel for 1 hidden layer

\begin{align}
K^{t}(x, x') - K^{0}(x, x') = \frac 1 N \sum_{i = 1}^N h_i(x, \theta_t)h_i(x', \theta_t) - \frac 1 N \sum_{i = 1}^N h_i(x, \theta_0)h_i(x', \theta_0)\underset{N→∞}{\longrightarrow} 0
\end{align}

#### Comparing Kernels:

For fixed NLM weights $w$:
$$
K_w(x, x') = \tfrac1Nσ_θ^2 \sum_{i=1}^N h_i(x; w) h_i(x'; w)
$$
BNN prior:
$$
K_{\mathrm{BNN}}^{prior}(x, x') = σ_θ^2 \mathbb{E}_w\left[h_i(x; w) h_i(x'; w)\right]
$$
BNN posterior (function space):
$$
E[y|x,X,Y] = K^p(x, X)K^p(X, X)^{-1}Y\\
K_{\mathrm{BNN}}^{posterior}(x, x') = K^p(x, x') - K^p(x, X)K^p(X, X)^{-1}K^p(X, x')
$$

## Week 13 April 18 - April 25

### Meeting notes:
- Compare with a task in mind, e.g. generalization error, expressiveness
- Analyze data-dependence

### Infinite NLM with NTK

Let us write $\omega$ for the parameters of the last layer (a vector of length $N$) and $\theta$ for the parameters of all the previous layers (a vector of length $P - N$). The [empirical NTK](https://arxiv.org/pdf/2007.05864.pdf) is:
$$
\Theta_t(x, x') = {\nabla_{\omega_t, \theta_t} f(x, \omega_t, \theta_t)}^T {\nabla_{\omega_t, \theta_t} f(x', \omega_t, \theta_t)}
$$
Splitting the expression at the last layer, we can write the gradient as:
$$
\nabla_{\omega_t, \theta_t} f(x, \omega_t, \theta_t) = \pmatrix{ h_1(x, \theta_t) \\ \vdots \\ h_N(x, \theta_t) \\ \sum_{i = 1}^N \omega_{t, i} \nabla_{\theta_t} h_i(x, \theta_t) }
$$
where the summed expression is a vector.
This leads to the following, taking $x = x'$:
$$
\begin{align}
\Theta_t(x, x) &= \sum_{i = 1}^N {h_i(x, \theta_t)}^2 + {\left(\sum_{i = 1}^N \omega_{t, i} \nabla_{\theta_t} h_i(x, \theta_t)\right)}^T {\left(\sum_{j = 1}^N \omega_{t, j} \nabla_{\theta_t} h_j(x, \theta_t)\right)}\\
&= \sum_{i = 1}^N {h_i(x, \theta_t)}^2 + \sum_{i = 1}^N \omega_{t, i}^2{\left\| \nabla_{\theta_t} h_i(x, \theta_t) \right\|}^2 + \sum_{\substack{i,j = 1 \\ i \neq j}}^N \omega_{t, i} \omega_{t, j} {\nabla_{\theta_t} h_i(x, \theta_t)}^T \nabla_{\theta_t} h_j(x, \theta_t)
\end{align}
$$
The last term of the sum looks like a kernel (dot product of gradients), but it relates to 2 outputs at the same input (maybe try 2 different inputs? Or rephrase into 2 inputs?). If the latter expression converges, we may have a lead to prove that the sum of $h_i^2$ converges in probability or in distribution.

### Observations

For finite NLMs we have:
$$
K_{NLM}(ξ, ξ') = σ_v^2 \frac1N \sum_{i=1}^N h_i(ξ) h_i(ξ')
$$
In the limit:
$$
\lim_{N→∞} K_{NLM}(ξ, ξ')
$$

### Ideas for Comparisons in Framework 1

How do we compare
$$
K_{BNN}(ξ, ξ') = σ_v^2 E\left[h_i(ξ) h_i(ξ')\right]
$$
with
$$
K_{NLM}(ξ, ξ') = σ_v^2 \frac1N \sum_{i=1}^N h_i(ξ) h_i(ξ')
$$

#### Point-wise variance

Let's fix a point $ξ_0$ and assume $f(ξ_0) = f_0$. Then, what is the distribution of $f(ξ_1)|f(ξ_0) = f_0$? We know
$$
(f_0, f_1) \sim \mathcal{N}(0, K_{ξ_0ξ_1})
$$
so the conditional distribution [becomes](https://en.wikipedia.org/wiki/Covariance_matrix#Block_matrices):
$$
f_1 | f_0 = \mathcal{N}\left(\frac{K(ξ_0, ξ_1)}{K(ξ_0, ξ_0)} f_0,\quad K(ξ_1, ξ_1) - \frac{K(ξ_0, ξ_1)^2}{K(ξ_0, ξ_0)}\right)
$$
or
$$
f_1 | f_0 = \mathcal{N}\left(\frac{K_{01}}{K_{00}} f_0,\quad K_{11} - \frac{K_{01}^2}{K_{00}}\right)
$$

**BNN**
$$
\begin{aligned}
Var(f_1 | f_0) &= σ_v^2 E[h_i(ξ_1)^2] - σ_v^2\frac{E[h_i(ξ_0)h_i(ξ_1)]^2}{E[h_i(ξ_0)^2]}
\end{aligned}
$$

**NLM**

The variance depends on the sampled $h_i$'s. So we compute the _expected_ variance:
$$
E[Var(f_1 | f_0)] = σ_v^2E[h_i(ξ_1)^2] - σ_v^2E\left[\frac{\left(\sum h_i(ξ_0)h_i(ξ_1)\right)^2}{N\cdot\sum h_i(ξ_0)^2}\right]
$$

#### Posterior:

## Week 12 April 11 - April 18

#### Main Questions:
- Can we stick to taking the limit in 2 steps? (Do we have an alternative?)
- What framework for randomness should we use? What can we even expect?

Meeting questions:
- go from function space $K$ to weight space $ϕ$.

#### Frameworks of Randomness

In analogy to our two-step training process of an NLM, we look at the following simple model:
$$
Z = X + Y,\quad X,Y \sim \mathcal{N}(0, 1)
$$
where sampling $X$ corresponds to the first step of initializing and training the neural network, and sampling $Y$ corresponds to sampling from the Bayesian linear regression posterior.

**Framework 1:** Here we complete the first step, i.e. sample $X$. Let's say $X=1$. Then we look at the posterior distribution that we get from this sampled "basis":

![](https://i.imgur.com/Lg3WPK2.png)

For a different $X$ value, the distribution would look different. In the figure below we show some more conditional probabilities $Z|X=x$. If we average over all choices of $X$, i.e. "integrate out" $X$, we recover the distribution of $Z$.

![](https://i.imgur.com/JP9FC1z.png)

**Framework 2:** Here we look at the output distribution over both random steps, i.e. we directly get the distribution of $Z$. In the NLM setting, framework 2 tells us about the overall probability of outputting a specific value $Z=z$ over both steps.
However, it does not show the uncertainty properties of a single NLM that we would get in practice. Rather, it can be thought of as analyzing an ensemble of NLMs. Therefore, if we want to compare NLMs with BNNs, we want to work in framework 1.

**What properties can we hope to find in framework 1?**

The specific properties of our NLM posterior might depend on the choice of sampled $h_i$'s (unless the infinite-width limit removes this dependency). The $h_i$'s will determine the mean and kernel function of our GP. We hope to find a statistic such as the "expected variance" of the posterior. Can we define something like an "expected kernel function" to compare it to the BNN kernel?

#### What is the GP Kernel of the NLM in Framework 1?

Let's say $f_i^{(L)}(ξ) \sim \mathrm{NNGP}$ and $h_i(ξ) = ϕ(f_i^{(L)}(ξ))$. What can we say about $y(ξ) = \sum_{i=1}^N h_i(ξ)v_i$ with $v_i \sim \mathcal{N}(0, σ_v^2/N)$? It is a GP with mean function 0 and kernel:
$$
K(ξ, ξ') = E[y(ξ)y(ξ')] = E\left[\sum_{i,j=1}^N h_i(ξ)h_j(ξ')v_iv_j\right] = \sum_{i=1}^N h_i(ξ) h_i(ξ') E\left[v_i^2\right] \\
K(ξ, ξ') = σ_v^2 \frac1N \sum_{i=1}^N h_i(ξ) h_i(ξ')
$$

## Week 11 April 4 - April 11

#### Meeting with Beau

Textbook: [The Principles of Deep Learning Theory](https://arxiv.org/pdf/2106.10165.pdf)

Look at [Gaussian Process Behaviour in Wide Deep Neural Networks](https://arxiv.org/pdf/1804.11271.pdf)

**Initialization**

A deep net with random initialization (or a BNN) converges to an NNGP when the width goes to infinity (see [Gaussian Process Behaviour in Wide Deep Neural Networks](https://arxiv.org/pdf/1804.11271.pdf)). This means that if we fix the width of the hidden layer containing our $h_i$ and let the width of everything preceding it go to infinity, then our $h_i$ become multivariate Gaussian variables. If we can prove that they are decorrelated *in the limit* (our intuition is that they are decorrelated even in the finite case) using the textbook [The Principles of Deep Learning Theory](https://arxiv.org/pdf/2106.10165.pdf), which gives a closed-form expression for $\mathbb{E}(h_{i_1}\cdots h_{i_n})$, then we are done: since decorrelated Gaussians are independent, our $h_i$ are independent Gaussians, which opens the door to studying $\sum_{i = 1}^n h_i^2$.

**Posterior**

[Exact posterior distributions of wide Bayesian neural networks](https://arxiv.org/pdf/2006.10541.pdf)

It is known that the posterior of a BNN converges to the posterior of its NNGP. Can we use that result directly? Otherwise, we need to write down the posterior derivation (BLR) again and take the limit at some convenient point.

#### Next steps

1. Review the proof random initialization => NNGP (valid for more than 2 layers? Multiple outputs? No hidden conditions?)
2. Verify decorrelation in the finite case
3. Compute the GP kernel for (no training, prior) and some fixed activation function
4. What happens during **training**, does the kernel stay the same?
5. What happens if we take the **posterior** of the NLM? At which point do we apply the limit?

#### Convergence to NNGP (Matthews et al. '18):

Assumptions:
- **Input:** (1) countable input set
- **Randomness:** (2) Gaussian, (3) centered, and (4) independent initialization/prior
- **Nonlinearity:** (5) continuous activation function, (6) obeying the linear envelope condition (i.e. no heavy-tail behavior)
- **Limit:** (7) width functions must be increasing, i.e.
  the width tends to infinity in all layers at the same time (but not necessarily at the same rate)

#### NLM case

What we know:
- $h_i$'s are iid in the infinite-width limit
- $E[f^{(2)}_i(ξ) f^{(2)}_i(ξ')] = σ_w^2E[ϕ(f^{(1)}_j(ξ))ϕ(f^{(1)}_j(ξ'))]$ with appropriate scaling
- $\mathrm{var}(f_i^{(2)}) = σ_w^2E[ϕ(f^{(1)}_j(ξ))^2]$
- For $ϕ = \tanh$, what is $E[\tanh(f_j^{(1)}(ξ))^2]$?
  - $E[\tanh(x)^2] \le 1$ as $\tanh^2(x) ∈[0, 1]$
  - $E[\tanh(x)^2] > 0$, should follow from $\mathrm{var}(x) > 0$
- So our $h_i$'s have bounded non-zero variance → GP.

**What is the GP Kernel of the NLM?**

Let's say $f_i^{(L)}(ξ) \sim \mathrm{NNGP}$ and $h_i(ξ) = ϕ(f_i^{(L)}(ξ))$. What can we say about $y(ξ) = \sum_{i=1}^N h_i(ξ)v_i$ with $v_i \sim \mathcal{N}(0, σ_v^2/N)$? It is a GP with mean function 0 and kernel:
$$
K(ξ, ξ') = E[y(ξ)y(ξ')] = E\left[\sum_{i,j=1}^N h_i(ξ)h_j(ξ')v_iv_j\right] = \sum_{i=1}^N E\left[h_i(ξ) h_i(ξ')\right] E\left[v_i^2\right] \\
K(ξ, ξ') = σ_v^2 E\left[h_i(ξ) h_i(ξ')\right] = K_{NNGP}(ξ, ξ')
$$
So the prior predictive of the untrained NLM is exactly the NNGP of the full network, including the last layer.

**Training**

## Week 10 March 28 - April 4

Connecting the dots...

#### Revision, NLM:

Input $x∈ℝ^{1\times d}$; the neural network acts as a feature transformation $ϕ_t(x)∈ℝ^{1\times n}$ after training for $t$ timesteps.

**Bayesian linear regression**: [[derivations]](https://www.cs.toronto.edu/~rgrosse/courses/csc411_f18/slides/lec19-slides.pdf)

Weights $w\sim\mathcal{N}(0, σ_w^2I)$ with $I∈ℝ^{n\times n}$.
$$
f(x) = ϕ(x)w \ ∈ℝ
$$
or for the entire dataset $\mathcal{X}∈ℝ^{m\times d}, \mathcal{Y}∈ℝ^{m}$:
$$
f(\mathcal{X}) = ϕ(\mathcal{X})w \ ∈ℝ^{m}
$$
and
$$
\mathcal{Y} = f(\mathcal{X}) + \mathbf{ε}\quad\text{ with }\quad\mathbf{ε} \sim \mathcal{N}(0, σ_ε^2I).
$$
The likelihood of the data is $p(\mathcal Y | w) = p(\mathbf{ε} = \mathcal Y-f(\mathcal{X}) | w)$, i.e. $\mathcal Y | w \sim \mathcal{N}(f(\mathcal{X}), σ_ε^2I)$.

With prior $w\sim \mathcal{N}(0, σ_w^2I)$, the posterior over weights is:
$$
\begin{align}
\log p(w|\mathcal{Y}) &= \log p(\mathcal{Y}|w) + \log p(w) + \text{const}\\
&= -\frac1{2σ_ε^2}\|\mathcal Y - ϕ(\mathcal X)w\|^2 - \frac1{2σ_w^2}w^\top w + \text{const}\\
&= -\frac1{2σ_ε^2}\left( \mathcal Y^\top \mathcal Y - 2 \mathcal Y^\top ϕ(\mathcal X)w + w^\topϕ(\mathcal X)^\top ϕ(\mathcal X) w \right) - \frac1{2σ_w^2}w^\top w + \text{const}\\
&= -\frac12 (w - μ)^\top Σ^{-1}(w-μ) + \text{const}
\end{align}
$$
where
$$
\begin{align}
μ &= \tfrac1{σ_ε^2} Σϕ(\mathcal X)^\top \mathcal Y\\
Σ^{-1} &= \tfrac1{σ_ε^2} ϕ(\mathcal X)^\top ϕ(\mathcal X) + \tfrac{1}{σ_w^2}I
\end{align}
$$
so
$$
w|\mathcal Y \sim \mathcal{N}(μ, Σ)
$$
and the posterior predictive is:
$$
\begin{align}
p(y | \mathcal{Y}) &= \int p(y, w| \mathcal Y)\text{d}w= \int p(y| w, \mathcal Y)p(w| \mathcal Y)\text{d}w
\end{align}
$$
which gives
$$
\begin{align}
f(x) &\sim \mathcal{N}(μ_{pred}, σ_{pred}^2)\\
μ_{pred} &= μ^\topϕ(x)\\
σ_{pred}^2 &= ϕ(x)^\top Σϕ(x) + σ_ε^2
\end{align}
$$
Note that the matrix $Σ$ and the representations $μ, ϕ(x)$ become infinite-dimensional as $n→∞$.
$$
\mathbb{E}[f(x)] = μ^\top ϕ(x) = \tfrac1{σ_ε^2} \mathcal Y^\top ϕ(\mathcal X)Σ^\top ϕ(x)
$$

## Week 9 March 21 - March 28

Meeting notes:
- necessary conditions of convergence (i.e.
  can we prove it is **not** a GP)
- meeting with Beau

### Feature learning in Infinite-Width NNs

[Yang, Hu '21](https://arxiv.org/pdf/2011.14522.pdf)

#### Parameterizations

$abc$-Parameterization:
$$
h^{l}(ξ) = \frac1{n^{a_l}}w^lx^{l-1}(ξ),\qquad x^{l}(ξ) = ϕ(h^l(ξ))
$$
$$
w^l_{ij} \sim \mathcal{N}\left(0, \frac1{n^{2b_l}}\right)
$$
$$
LR = \frac{η}{n^c}
$$

Examples

| Name | $a_1$ | $a_{>1}, (a_{L+1})$ | $b_1$ | $b_{>1}$ | $c$ |
|:---------------- |:---------- | ------------------- | --------- |:--------- |:--- |
| Standard (SP) | 0 | 0 | 0 | $\frac12$ | 0 |
| NTK Param (NTP) | 0 | $\frac12$ | 0 | 0 | 0 |
| Mean Field (MFP) | 0 | 1 | 0 | 0 | -1 |
| Max. Update (μP) | $-\frac12$ | $0$ ($\frac12$) | $\frac12$ | $\frac12$ | 0 |

μP = MFP in the 1-hidden-layer case.

#### Limits

Two distinct limits occur:
- Feature learning limit
- Kernel limit

##### Kernel limit
- linearization holds: $f_{t+1}(ξ) = f_t(ξ) - ηK(ξ,ξ_t)χ_t$
- weights and features move little: $Δx_t(ξ) = x_t(ξ) - x_0(ξ) = Θ(1/\sqrt n) → 0$

##### Feature learning limit
- no linearization (even with h.o.t.) can represent $f$. Intuition: the state is not sufficiently described by the function alone; the internal representation also changes.
- features change
- Only possible if $\lim_{n→∞}f_0(ξ) = 0$ for all $ξ$ almost surely.

## Week 7-8 March 7 - March 21

Todos:
- plot $s_n$ (grows linearly?) (done)
- single neuron distribution (if useful)
- literature search for distribution of $h_i$'s

#### $s_n$ grows linearly in practice:
$$
λ_ns_n^2 = λ_n\sum_{i=1}^n h_i^2
$$

![](https://i.imgur.com/YhntKCx.png)
![](https://i.imgur.com/qDBdBUG.png)

#### $h_i$'s are not independent

Let's consider a simple network with two hidden layers of widths one and two and a one-dimensional input $x$:
$$
k = ϕ(wx),\quad h_1 = v_1k,\ h_2 = v_2k
$$
with $ϕ = \mathrm{id}$, $w,v_1,v_2 \sim \mathcal{N}(0, 1)$ and a fixed $x ∈ ℝ$. Let $φ$ denote the standard normal pdf. Then (writing densities informally and dropping the $1/|wx|$ Jacobian factors, which does not affect the conclusion)
$$
\begin{align}
p(h_1 = t, h_2 = t) &= p(v_1k = t, v_2k = t)\\
&= p(v_1wx = t, v_2wx = t)\\
&= \int_{-∞}^∞φ(w) p(v_1 = \tfrac t{wx}, v_2=\tfrac t{wx} | w) \mathrm{d}w\\
&= \int_{-∞}^∞φ(w) p(v_1 = \tfrac t{wx}|w)p(v_2= \tfrac t{wx} | w) \mathrm{d}w\\
&= \int_{-∞}^∞φ(w) φ(\tfrac t{wx})^2\mathrm{d}w\\
&= \mathbb{E}_w\left[φ(\tfrac t{wx})^2\right]
\end{align}
$$
But individually:
$$
\begin{align}
p(h_1 = t) &= p(v_1wx = t)\\
&=\int_{-∞}^∞φ(w) p(v_1 = \tfrac t{wx}| w) \mathrm{d}w\\
&= \mathbb{E}_w[φ(\tfrac t{wx})]
\end{align}
$$
so
$$
p(h_1 = t)p(h_2 = t) = \mathbb{E}_w[φ(\tfrac t{wx})]^2 \neq \mathbb{E}_w\left[φ(\tfrac t{wx})^2\right] = p(h_1 = t, h_2 = t)
$$

## Week 6 Feb 28 - March 7

### New Insights/Ideas

- $h_i$ are iid at initialization (?)
- $h_i$ are iid even after training (?) [(Zavatone-Veth et al. '22)](https://arxiv.org/pdf/2203.00573.pdf)

If yes, then the convergence to a GP is almost trivial.

### Sum of Infinitely Many Gaussians

If $X_k \sim \mathcal N(\mu_k, \sigma_k^2)$ are independent Gaussians, then:
$$
S_n = \sum_{k = 0}^n X_k \sim \mathcal N \left(\sum_{k = 0}^n \mu_k, \sum_{k = 0}^n \sigma_k^2 \right)
$$
If both $\sum_{k = 0}^n \mu_k$ and $\sum_{k = 0}^n \sigma_k^2$ converge, then the characteristic function of $S_n$ converges pointwise to that of a Gaussian distribution, and by Lévy's continuity theorem, this proves that $S_n$ converges in distribution to a Gaussian variable.
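A quick numerical sanity check of the statement above (purely illustrative: we take $μ_k = 0$ and $σ_k = 1/k$, so the sum of variances converges to $π^2/6$):

```python
import numpy as np

rng = np.random.default_rng(0)

n, n_samples = 1000, 10_000
sigmas = 1.0 / np.arange(1, n + 1)            # sigma_k = 1/k, so sum of sigma_k^2 -> pi^2/6
S_n = (rng.normal(size=(n_samples, n)) * sigmas).sum(axis=1)

print(S_n.var(), np.pi**2 / 6)                 # empirical variance vs. the limiting variance
```

The empirical variance of $S_n$ matches the limiting value up to Monte Carlo error, and a Gaussian fit is visually indistinguishable from the histogram of the samples.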
This fact could be used in our research: with a scaling function $λ_n$, we have
$$
y = \sum_{i=1}^n \sqrt{λ_n}\,h_i v_i
$$
After training, the terms in the sum are distributed as $u_i \sim \mathcal{N}(0, λ_nh_i^2)$. If we manage to prove that the sum of variances $\sum_{i = 1}^n h_i^2$, or the product of this sum with a sequence $\lambda_n$ (which we could use to scale our priors), converges, then we are done. So we want to find a $λ_n$ such that
$$
λ_ns_n^2 = λ_n\sum_{i=1}^n h_i^2 \overset{n→∞}\longrightarrow c
$$

#### Case $λ_n = \frac1{n^{1+ε}}$

For $λ_n = \frac1{n^{1+ε}}$ and bounded $|h_i|\le M$, we know that
$$
λ_n s_n^2 \le λ_n\, n\, M^2 = \frac{M^2}{n^ε} \to 0
$$
This is not what we want, as now the variance of the output becomes 0.

#### Case $λ_n = \frac1n$

However, for the intuitive $\lambda_n = \frac 1 n$, it turns out that the product $\lambda_n \sum_{i = 1}^n h_i^2$ may not converge, even when $h_i$ is bounded. As a counterexample, consider:

| $i$ | 1 | 2 | 3 | 4 | 5 | $\dots$ | 16 | 17 | $\dots$ | 64 |
|-------|---|---|---|-----|---|---|------|----|---|-----------|
| $h_i$ | 0 | 1 | 1 | 1 | 0 | $\dots$ | 0 | 1 | $\dots$ | 1 |
| $s_i$ | 0 | $\dots$ | $\dots$ | 3 | $\dots$ | $\dots$ | 3 | $\dots$ | $\dots$ | $3+48 = 51$ |
| $s_i/i$ | 0 | $\dots$ | $\dots$ | $\frac 3 4$ | $\dots$ | $\dots$ | $\frac 3 {16}$ | $\dots$ | $\dots$ | $\frac {51} {64}$ |

where we alternately repeat zeros and ones for a number of times that triples at each step. More precisely:
- we start with 1 zero,
- then 3 ones (3 times the number of previous samples, 1),
- then 12 zeros (3 times the number of previous samples, 4),
- then 48 ones (3 times the number of previous samples, 16).

In this manner, the ratio $s_i/i$ alternates between small and large values and does not converge. To see this, note that we have a pattern every power of 4:
$$
\begin{align}
&s_1 = 0\\
&s_{4^{k + 1}} = s_{4^k} + \begin{cases} 0 &\text{if } k \text{ odd},\\ 3 \cdot 4^k &\text{otherwise} \end{cases}
\end{align}
$$
Writing $\tau_i = s_{4^i}$, it follows that:
$$
\begin{align}
\tau_{i + 1} &= \tau_{i} + \begin{cases} 0 &\text{if } i \text{ odd},\\ 3 \cdot 4^i &\text{otherwise} \end{cases}\\
&=\tau_i + \frac{1 + {(-1)}^i}2 \cdot 3 \cdot 4^i\\
\implies \tau_i &= \frac{4^i - 1}2 + \frac 3 {10}\left(1 - {(-4)}^i\right)\\
\implies \frac{\tau_i}{4^i} &= \frac12 + \underbrace{\frac3{10\cdot 4^i} - \frac1{2\cdot4^i}}_{→ 0} - \frac3{10} (-1)^i
\end{align}
$$
The expression above, which corresponds to $s_{4^i} / 4^i$, does not converge, which means that $\frac {s_n} n$ does not either.

**Conclusion:** it seems like we need more conditions on $h_i$, such as the iid property.

## Week 5 Feb 21 - Feb 28

![](https://i.imgur.com/Q0FXhtp.png)
![](https://i.imgur.com/VQV6xon.png)

**Explanation: Behaviour of $\tanh$**

![](https://i.imgur.com/2xCnQsK.png)
![](https://i.imgur.com/YjHWl7X.png)

**Convergence of $s_n$**
$$
s_n = \frac {\sum_{i = 1}^n {h_i^\nu}} {{\left( \sum_{i = 1}^n {h_i^2} \right)}^{\nu / 2}}
$$
with $\nu = 3$:

![](https://i.imgur.com/Vbogc8B.png)

## Week 4 Feb 14 - Feb 21

**Weekly summary:**

**Goal for the week:**
1. Empirically simulate the distribution of $h_i$
2. Look at the "Tensor Programs" paper that shows convergence of different architectures to GPs

### 1. Experiments

### 2. Tensor Programs (general GP convergence results)

High-level overview in this [talk](https://www.epfl.ch/labs/tml/wp-content/uploads/2021/05/tensor_programs_slides_Eugene.pdf).

**Key results:** Under certain conditions (e.g.
  $g_α^i$ are _asymptotically_ Gaussian entries):

![](https://i.imgur.com/pVIqmZ4.png)

## Week 3 Feb 7 - Feb 14

**Weekly summary:**
- Fixed problems in previous derivations
- Showed the prior distribution converges to a GP when the step function is used
- Spent a bunch of time understanding why our counterexample to Slutsky's theorem is not valid

### Lyapunov CLT (2)

**Absolute centered moment of a normal distribution:** ([Source](https://arxiv.org/pdf/1209.4340.pdf))

Let $X\sim \mathcal{N}(0, σ^2)$, then:
$$
E[|X|^\nu] = (2σ^2)^{\nu/2}\frac{\Gamma(\frac{\nu + 1}2)}{\sqrt{π}}
$$
For $\nu = 3$, we have $E[|X|^3] = \frac1{\sqrt{π}} (2σ^2)^{3/2}$.

The Lyapunov condition becomes the following: there exists $\nu > 2$ such that
$$
\frac {\sum_{i = 1}^N {|h_i|^\nu}} {{\left( \sum_{i = 1}^N {h_i^2} \right)}^{\nu / 2}} \ = \ \sum_{i = 1}^N {\left( \frac {|h_i|} {\sqrt{\sum_{i = 1}^N {h_i^2}}} \right)}^\nu \underset {N \to \infty} \to 0
$$
This fails for $h_i = \frac 1 {2^i}$, and more generally for $h_i = q^i, |q| < 1$. Intuitively, the problem is that only finitely many variables have a significant impact on the result, as the variance decreases exponentially.

We think the following hypothesis is a sufficient condition for the CLT convergence:
$$
∃ ε > 0: \left|\{X_i \mid Var(X_i) \ge ε, 1 \le i \le N\}\right|\underset {N → ∞} → ∞
$$
We hope the random initialization (and training) fulfils this condition.

Previous results with a stepwise activation function remain valid: if all our $h_i$ are equal to $\pm 1$, then the Lyapunov condition is verified and our prior predictive tends to a Gaussian variable.

### Counterexample to Slutsky's Theorem (3)

#### or: does 1 converge to 0?

We define
$$
g_n^\square(x) = \begin{cases} 1 & -1 \le x < - \frac1{n+1}\\ 0 & \text{else} \end{cases}
$$
$$
g_n^\triangle(x) = \begin{cases} -n+1 + (n+1)x & \frac{n-1}{n+1} \le x < \frac{n}{n+1}\\ n+1 -(n+1)x & \frac{n}{n+1} \le x < 1\\ 0 & \text{else} \end{cases}
$$
then
$$
f_n(x) = g_n^\square(x) + g_n^\triangle(x)
$$
or equivalently:
$$
f_n(x) = \begin{cases}
0 & x < -1\\
1 & -1 \le x < - \frac1{n+1}\\
0 & -\frac1{n+1} \le x < \frac{n-1}{n+1}\\
-n+1 + (n+1)x & \frac{n-1}{n+1} \le x < \frac{n}{n+1}\\
n+1 -(n+1)x & \frac{n}{n+1} \le x < 1\\
0 & \text{else}
\end{cases}
$$
The sequence of functions $f_n$ converges pointwise to $f = \lim_{n \to \infty} f_n = \mathbf{1}_{[-1,0]}$, which is also a probability density (that of $\mathrm{Uniform}[-1, 0]$). Choosing $X_n \sim f_n$, we therefore have that $X_n \overset d \to X \sim f$.

Let us write $c_n = \frac n {n + 1}$ and $u_n = \frac 1 {c_n} = 1 + \frac 1 n$. Since $u_n \to u =1$, we have by Slutsky's theorem that $u_nX_n \overset d \to uX$. But the PDF of $u_nX_n$ is $x \mapsto c_n f_n(\frac x {u_n}) = c_n f_n(c_nx)$, and at the point $x = 1$ it equals $c_n f_n(c_n) = c_n \to 1$, which does not converge to $f(1)=0$.

**Solution:** Convergence in distribution is defined through the CDF $F$, not the density $f$. The densities may fail to converge at isolated points, but once we integrate $f$ the problem disappears, so this is not a counterexample to Slutsky's theorem.

### $h_i$ after random init

If we have an $M$-dimensional input, then
$$
h_i = \sum_{j=1}^M ϕ(w_{ij} x_j)
$$
Let's consider the **identity function** for $ϕ$. Then
$$
Var(h_i) = σ_w^2\sum_{j=1}^M x_j^2 = σ_w^2s_x^2
$$
where we define $s_x^2 = \sum_j x_j^2$. Then $h_i \sim \mathcal{N}(0, σ_w^2s_x^2)$ and our condition holds.
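A small Monte Carlo check of the identity-activation case above (the input $x$, $σ_w$, and the number of draws are illustrative; each row of `W` is an independent draw of the weight vector $w_{i\cdot}$):

```python
import numpy as np

rng = np.random.default_rng(0)

x = np.array([0.5, -1.2, 2.0])                      # a fixed M-dimensional input
sigma_w, n_draws = 0.7, 1_000_000
W = rng.normal(0.0, sigma_w, size=(n_draws, x.size))

h = (W * x).sum(axis=1)                             # h_i = sum_j w_ij x_j  (phi = identity)
print(h.var(), sigma_w**2 * (x**2).sum())           # empirical Var(h_i) vs. sigma_w^2 * s_x^2
```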
If $ϕ$ is **not the identity** but ReLU, and $w_{ij}x_j$ is symmetric ($f_{pdf}(z) = f_{pdf}(-z)$; does this hold?), then
$$
Var(\mathrm{ReLU}(z)) = \left(\tfrac12 - \tfrac1{2π}\right) Var(z), \qquad E[\mathrm{ReLU}(z)^2] = \tfrac12 E[z^2],
$$
so $h_i$ has variance $\left(\tfrac12 - \tfrac1{2π}\right) σ_w^2s_x^2$ and the condition still holds.

## Week 2 Jan 31 - Feb 7

**Weekly summary:**

**Questions:**
- How do we get from
$$
\frac 1 {\sqrt{\sum_{i = 1}^N h_i^2}} \sum_{i = 1}^N h_i v_i, \quad v_i \sim \mathcal{N}(0, 1)
$$
to
$$
\sum_{i = 1}^N h_i v^*_i, \quad v^*_i \sim \mathcal{N}(0, \psi_N)
$$
- (How can you use _dropout_ to model Bayesian inference?)

### Central Limit Theorem to NLMs:

One-layer NLM:
$$
H = ϕ(XW)\\
Y = HV
$$
Bayesian Linear Regression: $y = \sum_{i=1}^n h_i v_i$ where $v_i \sim\mathcal{N}(0, 1)$. So $u_i := h_i v_i \sim \mathcal{N}(0, h_i^2)$.

**Lyapunov CLT**

Let
$$
s_n^2 = \sum_{i=1}^n σ_i^2
$$
_Condition_: For some $δ > 0$:
$$
\lim_{n→∞} \frac1{s_n^{2+δ}} \sum_{i=1}^n\mathbb{E} \left[|X_i-μ_i|^{2+δ}\right] = 0
$$
In our case, $X_i = h_i v_i$, $σ_i = h_i$, and $μ_i = 0$. We choose $δ=1$ and observe that the third centered moment of a normal distribution is zero. So the condition holds. (This is the step corrected in Week 3 above: the Lyapunov condition involves the absolute moment $E[|X_i - μ_i|^3]$, which is not zero.)

Then, the Lyapunov CLT states that:
$$
\frac1{s_n}\sum_{i=1}^n(X_i - μ_i) \overset{d}→ \mathcal{N}(0, 1)
$$
So we know that $\frac1{\sqrt{\sum_i h_i^2}}\sum_i h_i v_i \overset{d}→ \mathcal{N}(0, 1)$.

**Controlling Variance**

_Assumption:_ we choose
$$
ϕ(x)=\begin{cases}-1 & x \le 0\\ 1 & x > 0 \end{cases}
$$
Then $h_i = \pm 1$ for all $i$, so $\sum_{i=1}^N h_i^2 = N$. Then we can choose the priors for $v_i$ in the following way:
$$
v^*_i \sim \mathcal{N}(0, \tfrac1{N})
$$
This gives:
$$
\begin{align}
y &= \sum_i h_i v^*_i\\
&= \sum_i h_i \frac1{\sqrt N} v_i \\
&=\frac1{\sqrt{\sum_i h_i^2}}\sum_i h_i v_i \\
&\overset{d}→ \mathcal{N}(0, 1)
\end{align}
$$

### Generalization attempt

If we want to extend our result beyond a binary activation, then we must find a way to use the fact that
$$
\frac 1 {\sqrt{\sum_{i = 1}^N h_i^2}} \sum_{i = 1}^N h_i v_i \overset{d}\to \mathcal{N}(0, 1)
$$
to prove that there exist a sequence $\psi_N$ and a variance $\sigma^2$ such that the output of our BLR,
$$
\sum_{i = 1}^N h_i v^*_i, \quad v^*_i \sim \mathcal{N}(0, \psi_N),
$$
converges in some way (in distribution, in probability, ...) to a normal distribution $\mathcal{N}(0, \sigma^2)$ as $N$ tends to infinity. An attempt to prove convergence in distribution resulted in the following problems.

#### Problems

A key problem is the following fact: for random variables $X_n$ and real numbers $u_n$, if, as $n → ∞$, $X_n \overset{d}→ P$ and $u_n → l$, it does **not** follow that $u_nX_n \overset{d}→ "lP"$ (?). In terms of functions: if $f_k$ is a sequence of functions that converges pointwise to $f$ and $u_k$ is a sequence that converges to $u$, then $f_k(u_k)$ does not necessarily converge to $f(u)$.

Counterexample:
$$
f_n(x) = \begin{cases} 1 & x = 1/n\\ 0 & \text{else} \end{cases}
$$
and $u_n = \frac1n$. Then $f_n(u_n) = 1$ but $\lim_{n→∞} f_n(0) = 0$. (See the Week 3 notes above: the issue only arises at the level of densities; at the level of CDFs, i.e. convergence in distribution, Slutsky's theorem does apply.)

## Week 1 Jan 24 - 31

**Weekly summary:**
- Getting familiar with GP, NTK, and representation learning
- Understanding the convergence proof NN → GP
- Starting with the NLM proof: looking at a 1-hidden-layer NLM. GP in the finite case (depends on learned weights), and when we do not use an activation function.

**Questions:**
- A Bayesian linear regression can be described as a GP even in the finite case; is the special thing about the infinite case just that it no longer depends on the initialization?
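Regarding the last question: for *fixed* (learned or sampled) weights $W$, the NLM prior over functions is already an exact (degenerate) GP, since any finite collection of outputs is a linear map of the Gaussian last-layer weights. A minimal numerical illustration, assuming a tanh feature map, a last-layer prior $v \sim \mathcal{N}(0, σ_v^2 I)$, and no bias term (sizes and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

d, n, sigma_v = 2, 50, 1.0
W = rng.normal(size=(d, n))                       # fixed "learned" feature weights
X = rng.normal(size=(4, d))                       # four test inputs
Phi = np.tanh(X @ W)                              # feature matrix, shape (4, n)

K_analytic = sigma_v**2 * Phi @ Phi.T             # finite-case kernel (no bias term)

V = rng.normal(0.0, sigma_v, size=(n, 100_000))   # prior draws of the last-layer weights
F = Phi @ V                                       # corresponding function values at X
K_empirical = np.cov(F)                           # rows of F are the 4 outputs

print(np.abs(K_empirical - K_analytic).max())     # small (Monte Carlo error only)
```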
**Recap BNN → GP:**
- last layer: zero-mean iid values with finite variance
- Central Limit Theorem: Normal distribution

**Bayesian Linear Regression as GP:**
$$
K(x_i, x_j) = σ_w^2x_i^\top x_j + σ_b^2
$$
[Derivation](https://www.inf.ed.ac.uk/teaching/courses/mlpr/2016/notes/w7c_gaussian_process_kernels.html)

### NLM as a GP

**Finite case:** Kernel for fixed, learned $W$:
$$
K(x_i, x_j) = σ_w^2ϕ(x_iW) ϕ(x_jW)^\top + σ_b^2
$$

**Linear case without training**

With $ϕ(x)=x$, the kernel is
$$
K(x_i, x_j) = σ_w^2x_iWW^\top x_j^\top + σ_b^2.
$$
If the entries of $W$ have variance $σ_w^2/n$ (the usual scaling), then $WW^\top → σ_w^2I$ as $n → ∞$, since $\mathbb{E}[(WW^\top)_{kl}] = \begin{cases}σ_w^2 & \text{if } k=l\\0 & \text{else}\end{cases}$. So:
$$
K(x_i, x_j) = σ_w^4x_i x_j^\top + σ_b^2
$$
Next week, we try to look at $ϕ=\mathrm{erf}$ or $ϕ = \tanh$.
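A numerical check of the linear, no-training limit above (assuming the $σ_w^2/n$ per-entry scaling mentioned there; the dimensions, $σ$ values, and test inputs are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

d, n = 3, 200_000
sigma_w, sigma_b = 0.8, 0.1
W = rng.normal(0.0, sigma_w / np.sqrt(n), size=(d, n))       # per-entry variance sigma_w^2 / n

xi, xj = rng.normal(size=(1, d)), rng.normal(size=(1, d))

K_finite = sigma_w**2 * (xi @ W) @ (xj @ W).T + sigma_b**2   # phi = identity, finite n
K_limit = sigma_w**4 * xi @ xj.T + sigma_b**2                # claimed limit as n -> infinity
print(K_finite.item(), K_limit.item())                       # close for large n
```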