Let's suppose we're training a model parameterized by $\theta$, and let's denote by $\theta_t$ the parameter $\theta$ at step $t$ given by the optimization algorithm of our choice. In machine learning, it is often helpful to decompose the error $E(\theta)$ as $B^2(\theta)+V(\theta)+N(\theta)$, where $B$ represents the bias, $V$ the variance, and $N$ the noise (irreducible error). In most cases, the decomposition is performed on an optimal solution $\theta^*$ (for instance, $\lim_{t \rightarrow \infty} \theta_t$, or its early-stopped version), for example in order to understand how the bias and variance change with the complexity of the function implementing $\theta$, the size of this function, etc. This has helped explain phenomena such as model-wise double descent. On the other hand, it can also be interesting to visualize how $B(\theta_t)$ and $V(\theta_t)$ evolve with $t$ (which can help explain phenomena like epoch-wise double descent): that's what we'll be doing in this blog post.

# Notations

* $\mathcal{X}$ : domain set (input space)
* $\mathcal{Y}$ : label set (output space)
* $\mathcal{H}$ : hypothesis class (class of possible models we can learn)

# Definitions and Preliminaries

**Definition 1 (Loss function)** The loss function $\ell(t, y)$ is a function that takes two labels, produces a value between $0$ and some constant $M \in [0, \infty]$, and measures the cost of predicting $y$ when the true value is $t$.

\begin{align*}
\ell \colon \mathcal{Y} \times \mathcal{Y} &\to [0, M] \\
(t, y) &\mapsto \ell(t, y)
\end{align*}

Examples include the square loss $\ell(t, y) = (t-y)^2$, the absolute loss $\ell(t, y) = |t-y|$, and the zero-one loss $\ell(t, y) = \mathbb{I}[t\ne y]$.

**Definition 2 (Training set)** A set $S$ of $|S|$ values $z_i = (x_i, t_i) \in \mathcal{X} \times \mathcal{Y}$, where $x_i \in \mathcal{X}$ is a feature vector and $t_i = t(x_i) \in \mathcal{Y}$ the label of the $i^{th}$ sample. The $z_i$ are assumed to be i.i.d. and sampled from an unknown data distribution $\mathcal{D}$.

$$S = \{z_1, \cdots , z_n\}$$

Since the training set size is an important parameter of a learning problem, we will assume in the following that all datasets have the same size $n$.

**Definition 3 (Optimal prediction)** In our setup the joint distribution is not arbitrary: $x$ is random and $t = t(x)$ is a deterministic or random function of $x$, so that $p(x, t) = p(x)p(t|x)$. The optimal prediction for an example $x$ is given by

$$y^* (x) = \arg\min_y \mathbb{E}_{t \sim p(t|x)} [\ell(t, y)]$$

In the deterministic case, there exists $t \in \mathcal{Y}$ such that $p(t|x)=1$, so that $y^* (x) = \arg\min_y \ell(t, y) = t$. In the nondeterministic case:

* Using the square loss, we have:

> \begin{split}
> y^{*}(x)
> &= \arg\min_{y} \mathbb{E}_{t \sim p(t|x)}[(t - y)^2]
> \\ &= \arg\min_{y} \mathbb{E}_{t \sim p(t|x)}[t^2 - 2yt + y^2]
> \\ &= \arg\min_{y} \ y^2 - 2 \mathbb{E}_{t \sim p(t|x)} [t] y + \mathbb{E}_{t \sim p(t|x)} [t^2]
> \\ &= \mathbb{E}_{t \sim p(t|x)} [t]
> \end{split}

That is, the mean of $p(t|x)$. For example, suppose that for some $x$ we observe the example $(x, 1)$ with probability $p_x \in [0, 1]$ and the example $(x, 0)$ with probability $1-p_x$, that is:

> $$t(x) = \left\{
> \begin{array}{ll}
> 1 & \mbox{with probability } p_x \\
> 0 & \mbox{with probability } 1-p_x
> \end{array}
> \right.$$

Then, using the square loss, we have $y^*(x) = \mathbb{E}_{t \sim p(t|x)}[t] = p_x$.
* Using the absolute loss, we have:

> \begin{split}
> y^*(x)
> &= \arg\min_{y} \mathbb{E}_{t \sim p(t|x)}[|y - t|]
> \\ &= \arg\min_{y} \ 2 \int_{-\infty}^{y} F_{t|x}(t)dt - y + \mathbb{E}_{t \sim p(t|x)} [t] + 2\lim_{t \rightarrow - \infty} t F_{t|x}(t)
> \\ &= F_{t|x}^{-1}(1/2)
> \end{split}

with $F_{t|x}$ the cumulative distribution function of $p(t|x)$. So $y^*(x)$ is the median of $p(t|x)$.

> The last line follows from the fact that $y \mapsto 2 F_{t|x}(y) - 1$ is the derivative of the function $y \mapsto 2 \int_{-\infty}^{y} F_{t|x}(t)dt - y + \mathbb{E}_{t \sim p(t|x)} [t] + 2\lim_{t \rightarrow - \infty} t F_{t|x}(t)$, which thus reaches its minimum when $F_{t|x}(y) = 1/2$, that is when $y = F_{t|x}^{-1}(1/2)$.

> The second line follows from
> \begin{split}
> \mathbb{E}_{t \sim p(t|x)}[|y - t|]
> &= \mathbb{E}_{t \sim p(t|x)}\Big[(y - t) \mathbb{I}[y\ge t] - (y - t) \mathbb{I}[y<t] \Big]
> \\ &= \mathbb{E}_{t \sim p(t|x)}\Big[(y - t) \mathbb{I}[y\ge t] - (y - t) (1-\mathbb{I}[y\ge t]) \Big]
> \\ &= \mathbb{E}_{t \sim p(t|x)}\Big[(y - t) (2\mathbb{I}[y\ge t] - 1) \Big]
> \\ &= \mathbb{E}_{t \sim p(t|x)}\Big[2y \mathbb{I}[y\ge t] - y - 2 t \mathbb{I}[y\ge t] + t \Big]
> \\ &= 2y\mathbb{E}_{t \sim p(t|x)}\big[\mathbb{I}[y\ge t] \big] - y\mathbb{E}_{t \sim p(t|x)}[1] - 2\mathbb{E}_{t \sim p(t|x)}\big[t\mathbb{I}[y\ge t] \big] + \mathbb{E}_{t \sim p(t|x)}[t]
> \\ &= 2yF_{t|x}(y) - y - 2 \int_{[-\infty, y]} t \ dF_{t|x}(t) + \mathbb{E}_{t \sim p(t|x)}[t]
> \\ &= 2yF_{t|x}(y) - y - 2 \Big( [t F_{t|x}(t)]_{-\infty}^{y} - \int_{[-\infty, y]} F_{t|x} (t)dt \Big) + \mathbb{E}_{t \sim p(t|x)}[t]
> \\ &= 2yF_{t|x}(y) - y - 2 \Big( y F_{t|x}(y) - \lim_{t \rightarrow - \infty} t F_{t|x}(t) - \int_{-\infty}^y F_{t|x}(t)dt \Big) + \mathbb{E}_{t \sim p(t|x)}[t]
> \\ &= 2yF_{t|x}(y) - y - 2yF_{t|x}(y) + 2\lim_{t \rightarrow - \infty} t F_{t|x}(t) + 2\int_{-\infty}^y F_{t|x}(t)dt + \mathbb{E}_{t \sim p(t|x)}[t]
> \\ &= 2\int_{-\infty}^y F_{t|x}(t)dt - y + \mathbb{E}_{t \sim p(t|x)}[t] + 2\lim_{t \rightarrow - \infty} t F_{t|x}(t)
> \end{split}

With the illustrative example used above for the square loss, we have $F_{t|x}(t) = (1-p_x)\mathbb{I}[0\le t<1] + \mathbb{I}[1\le t]$, so $y^*(x) = \mathbb{I}[1-p_x < 1/2] + \lambda \mathbb{I}[1-p_x = 1/2] = \mathbb{I}[p_x > 1/2] + \lambda \mathbb{I}[p_x = 1/2]$ for any $\lambda \in [0, 1]$.

* Using the zero-one loss, we have:

> \begin{split}
> y^*(x)
> &= \arg\min_{y} \mathbb{E}_{t \sim p(t|x)}[\mathbb{I}[t\ne y]]
> \\ &= \arg\min_{y} \mathbb{E}_{t \sim p(t|x)}[1-\mathbb{I}[t=y]]
> \\ &= \arg\min_{y} 1 - \mathbb{E}_{t \sim p(t|x)}[\mathbb{I}[t=y]]
> \\ &= \arg\max_{y} \mathbb{E}_{t \sim p(t|x)}[\mathbb{I}[t=y]]
> \\ &= \arg\max_{y} p(y|x) = \arg\max_{y} p(y, x)
> \end{split}

That is, the most frequent label. With the illustrative example used above for the square loss, we have

> \begin{equation*}
> \begin{split}
> y^*(x) &= \arg\min_y \ p_x\mathbb{I}[y\ne 1] + (1-p_x)\mathbb{I}[y\ne 0]
> \\ &= \mathbb{I}[p_x\ge 1-p_x]
> \\ &= \mathbb{I}[p_x\ge 1/2]
> \end{split}
> \end{equation*}

A quick numerical check of these three optimal predictions on this example is sketched right after Definition 4 below.

**Definition 4 (Algorithm)** A learning algorithm $\mathcal{A}$ is a mapping $\mathcal{A} \colon ( \mathcal{X} \times \mathcal{Y})^n \to \mathcal{H}$. The learning algorithm takes a dataset $S \in ( \mathcal{X} \times \mathcal{Y})^n$ of $n$ samples and returns a model $h = \mathcal{A}(S) \in \mathcal{H}$. The optimal model $f$ is the one for which $f(x) = y^*(x)$ for every $x$.
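Here is the quick numerical check promised above for the Bernoulli-style example of Definition 3 ($t(x)=1$ with probability $p_x$, $0$ otherwise). It is a minimal sketch in NumPy: the value $p_x = 0.7$ and the prediction grid are arbitrary illustrative choices, and we simply minimize the expected loss over the grid for each of the three losses.

```python
import numpy as np

# Bernoulli example from Definition 3: t(x) = 1 with probability p_x, else 0.
# We minimize E_{t ~ p(t|x)}[loss(t, y)] over a grid of candidate predictions y.
# Expected minimizers: p_x (square), I[p_x > 1/2] (absolute), I[p_x >= 1/2] (zero-one).
p_x = 0.7                              # illustrative value
ys = np.linspace(0.0, 1.0, 1001)       # grid of candidate predictions

def expected_loss(loss):
    """E_{t ~ p(t|x)}[loss(t, y)] for every candidate y on the grid."""
    return p_x * loss(1.0, ys) + (1.0 - p_x) * loss(0.0, ys)

square = expected_loss(lambda t, y: (t - y) ** 2)
absolute = expected_loss(lambda t, y: np.abs(t - y))
zero_one = expected_loss(lambda t, y: (t != y).astype(float))

print("square  :", ys[np.argmin(square)])    # ~0.7 = p_x
print("absolute:", ys[np.argmin(absolute)])  # 1.0 = I[p_x > 1/2]
print("zero-one:", ys[np.argmin(zero_one)])  # 1.0 = I[p_x >= 1/2]
```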
For example, in the case of the zero-one loss, the optimal model is the Bayes classifier, and its loss is called the Bayes rate. In the example used above for the zero-one loss, the Bayes classifier is $f(x)=\mathbb{I}[ \mathbb{P}[t=1|x] \ge 1/2]$.

**Definition 5 (True risk)** Given $h \in \mathcal{H}$,

$$R[h] = \mathbb{E}_{(x,t) \sim \mathcal{D}}[ \ell(t, h(x))]$$

**Definition 6 (Empirical risk)** For $h \in \mathcal{H}$ and a dataset $S = \{(x_1, t_1), \cdots , (x_n, t_n)\}$,

$$\hat{R}_S[h] =\frac{1}{n} \sum_{i=1}^n \ell(t_i, h(x_i))$$

The essential task of supervised learning is to perform well on the whole data distribution by adjusting $h$ on one particular sample $S$ drawn from it. We won't use these risk expressions much in this post, as is customary in statistical learning theory, but we recall them to highlight the dependence of $h = \mathcal{A}(S)$ on $S$.

**Definition 7 (Expected loss for each input feature vector)** Since the same learner $\mathcal{A}$ generally produces different models $h$ for different training sets $S$, $\ell(t, h(x))$ is a function of the training set $S$ through $h = \mathcal{A}(S)$. This dependency can be removed by averaging over training sets. Let $D_n$ be a set of training sets of size $n$, $\hat{y}_n(x)$ the prediction produced for the example $x$ by applying the learner to a training set in $D_n$, and $Y_n(x) = \{ \mathcal{A}(S)(x) \ | \ S \in D_n\}$ the multiset of these predictions (a specific prediction $\hat{y}_n(x)$ will appear more than once in $Y_n(x)$ if it is produced by more than one training set in $D_n$). The quantity of interest is the expected loss

$$E_n(x) = \mathbb{E}_{D_n, \ t \sim p(t|x)}[\ell(t, \hat{y}_n(x))] = \mathbb{E}_{y \sim Y_n(x), \ t \sim p(t|x)}[\ell(t, y)]$$

Our objective is to decompose the expected loss $E_n(x)$ into three terms: **bias**, **variance** and **noise (irreducible error)**. A standard such decomposition exists for the squared loss, and a number of different ones have been proposed for the zero-one loss.

**Definition 8 (Main prediction)** The main prediction for a loss function $\ell$ and a set of training sets $D_n$ is

$$y^{\ell, D_n} (x) = \arg\min_{y'} \mathbb{E}_{D_n}[\ell(\hat{y}_n(x), y')] = \arg\min_{y'} \mathbb{E}_{y \sim Y_n(x)}[\ell(y, y')]$$

In words, the main prediction is the value whose average loss relative to all the predictions in $Y_n(x)$ is minimum (i.e., it is the prediction that “differs least” from all the predictions in $Y_n(x)$ according to $\ell$). It is a measure of the “central tendency” of a learner. The main prediction is not necessarily a member of $Y_n(x)$.

**Theorem 1** Under the squared loss the main prediction is the mean of the predictions in $Y_n(x)$, under the absolute loss it is the median of $Y_n(x)$, and under the zero-one loss it is the mode of $Y_n(x)$ (i.e., the most frequent prediction).
*Proof*

> * Under the squared loss the main prediction is the mean of the predictions in $Y_n(x)$ since
>
> \begin{equation*}
> \begin{split}
> y^{\ell, D_n}(x)
> &= \arg\min_{y'} \mathbb{E}_{y \sim Y_n(x)}[(y - y')^2]
> \\ &= \arg\min_{y'} \ {y'}^2 - 2 \mathbb{E}_{y \sim Y_n(x)} [y] y' + \mathbb{E}_{y \sim Y_n(x)} [y^2]
> \\ &= \mathbb{E}_{y \sim Y_n(x)}[y]
> \end{split}
> \end{equation*}
>
> * Under the absolute loss it is the median of $Y_n(x)$ since
> \begin{equation*}
> \begin{split}
> y^{\ell, D_n}(x)
> &= \arg\min_{y'} \mathbb{E}_{y \sim Y_n(x)}[|y' - y|]
> \\ &= \arg\min_{y'} \ 2 \int_{-\infty}^{y'} F_{Y_n(x)}(y)dy - y' + \mathbb{E}_{y \sim Y_n(x)} [y] + 2\lim_{y \rightarrow - \infty} y F_{Y_n(x)}(y)
> \\ &= F_{Y_n(x)}^{-1}(1/2)
> \end{split}
> \end{equation*}
> with $F_{Y_n(x)}$ the cumulative distribution function of $y_n(x) \sim Y_n(x)$. The derivation is similar to what we did in **Definition 3** above for the absolute loss.
>
> * Under the zero-one loss it is the mode of $Y_n(x)$ (i.e., the most frequent prediction) since
> \begin{equation*}
> \begin{split}
> y^{\ell, D_n}(x)
> &= \arg\min_{y'} \mathbb{E}_{y \sim Y_n(x)}[\mathbb{I}[y\ne y']]
> \\ &= \arg\min_{y'} \mathbb{E}_{y \sim Y_n(x)}[1-\mathbb{I}[y=y']]
> \\ &= \arg\min_{y'} 1 - \mathbb{E}_{y \sim Y_n(x)}[\mathbb{I}[y=y']]
> \\ &= \arg\max_{y'} \mathbb{E}_{y \sim Y_n(x)}[\mathbb{I}[y=y']]
> \\ &= \arg\max_{y'} f_{Y_n(x)} (y')
> \end{split}
> \end{equation*}
> with $f_{Y_n(x)}$ the mass function (if $Y_n(x)$ is finite) or the density function (if $Y_n(x)$ is not finite) of $y_n(x) \sim Y_n(x)$.

**Definition 9 (Bias, variance, noise)** The bias, variance and noise (irreducible error) of a learner on an example $x$ are respectively

> $$B^2(x) = \ell( y^*(x), y^{\ell, D_n}(x))
> \\ V(x) = \mathbb{E}_{D_n}[\ell( y^{\ell, D_n}(x), \hat{y}_n(x))]
> = \mathbb{E}_{y \sim Y_n(x)}[\ell( y^{\ell, D_n}(x), y )]
> \\ N(x) = \mathbb{E}_{t \sim p(t|x)}[\ell(t, y^*(x))]$$

In words, the squared bias is the loss incurred by the main prediction relative to the optimal prediction, the variance is the average loss incurred by the predictions relative to the main prediction, and the noise is the unavoidable component of the loss, incurred independently of the learning algorithm. In the deterministic case, $N(x)= \ell(t(x), t(x))$ for all $x$, which is $0$ for the losses considered here. Bias and variance may be averaged over all examples, in which case we will refer to them as the average squared bias $\mathbb{E}_{x \sim p(x)}[B^2(x)]$ and the average variance $\mathbb{E}_{x \sim p(x)}[V(x)]$. The average noise is

$$\mathbb{E}_{x \sim p(x)}[N(x)] = \mathbb{E}_{(x,t) \sim p(x,t)}[\ell(t, y^*(x))]$$
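As a sanity check of Definition 9, here is a minimal Monte Carlo sketch under the squared loss, in a deliberately simple setting that is not part of the original derivations: at a single input $x$ the label is $t \sim \mathcal{N}(m, \sigma^2)$, and the "learner" maps a training set of $n$ labels to a shrunk sample mean, so it has both bias and variance. All names and numerical values are illustrative; the last line anticipates the squared-loss decomposition $E_n(x) = B^2(x) + V(x) + N(x)$ proved in the next section.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setting at a single input x: t | x ~ N(m, sigma_noise^2), squared loss.
# The learner returns a shrunk sample mean of its n training labels.
m, sigma_noise, n, shrink = 2.0, 1.0, 10, 0.7      # illustrative values
n_sets = 200_000                                    # number of training sets (Monte Carlo)

y_opt = m                                           # optimal prediction: E[t | x]

# Multiset Y_n(x): one prediction per training set.
train_sets = rng.normal(m, sigma_noise, size=(n_sets, n))
preds = shrink * train_sets.mean(axis=1)

y_main = preds.mean()                               # main prediction (mean, by Theorem 1)
B2 = (y_opt - y_main) ** 2                          # squared bias
V = ((y_main - preds) ** 2).mean()                  # variance
t = rng.normal(m, sigma_noise, size=n_sets)         # fresh labels, independent of the predictions
N = ((t - y_opt) ** 2).mean()                       # noise, ~ sigma_noise^2

E_n = ((t - preds) ** 2).mean()                     # expected loss E_n(x)
print(f"B^2={B2:.3f}  V={V:.3f}  N={N:.3f}  B^2+V+N={B2+V+N:.3f}  E_n={E_n:.3f}")
```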
**Theorem 2** For the square loss $\ell(t, y) = (t-y)^2$,

$$V(x) = \mathbb{E}_{y \sim Y_n(x)}[y^2] - ( y^{\ell, D_n}(x))^2 \text{ and } N(x) = \mathbb{E}_{t \sim p(t|x)}[t^2] - (y^*(x))^2$$

*Proof*

> \begin{split}
> V(x)
> &= \mathbb{E}_{y \sim Y_n(x)}[( y^{\ell, D_n}(x) - y)^2]
> \\&= \mathbb{E}_{y \sim Y_n(x)}[( y^{\ell, D_n}(x))^2] - 2 y^{\ell, D_n}(x) \mathbb{E}_{y \sim Y_n(x)}[y] + \mathbb{E}_{y \sim Y_n(x)}[y^2]
> \\&= ( y^{\ell, D_n}(x))^2 - 2 ( y^{\ell, D_n}(x))^2 + \mathbb{E}_{y \sim Y_n(x)}[y^2]
> \\&= \mathbb{E}_{y \sim Y_n(x)}[y^2] - ( y^{\ell, D_n}(x))^2
> \end{split}

> \begin{split}
> N(x)
> &= \mathbb{E}_{t \sim p(t|x)}[(t - y^*(x))^2]
> \\&= \mathbb{E}_{t \sim p(t|x)}[t^2] - 2 y^*(x) \mathbb{E}_{t \sim p(t|x)}[t] + \mathbb{E}_{t \sim p(t|x)}[(y^*(x))^2]
> \\&= \mathbb{E}_{t \sim p(t|x)}[t^2] - 2 (y^*(x))^2 + (y^*(x))^2
> \\&= \mathbb{E}_{t \sim p(t|x)}[t^2] - (y^*(x))^2
> \end{split}

# Bias-variance decomposition

For a given loss function $\ell$, we are looking for two constants $c_1(x, \ell)$ and $c_2(x, \ell)$ such that

$$E_n(x) = \ B^2(x) + c_1(x, \ell) \ V(x) + c_2(x, \ell) \ N(x)$$

**Theorem 3 [1]** For the square loss $\ell(t, y) = (t-y)^2$, $c_1(x, \ell)=c_2(x, \ell)=1$.

*Proof*

> \begin{split}
> E_n(x)
> & = \mathbb{E}_{y \sim Y_n(x), \ t \sim p(t|x)}[(t-y)^2]
> \\ & = \mathbb{E}_{y \sim Y_n(x), \ t \sim p(t|x)}[(t-y^*(x) + y^*(x) - y)^2]
> \\ & = \mathbb{E}_{t \sim p(t|x)}[(t-y^*(x))^2] + 2(\mathbb{E}_{t \sim p(t|x)}[t]-y^*(x))(y^*(x) - \mathbb{E}_{y \sim Y_n(x)}[y]) + \mathbb{E}_{y \sim Y_n(x)}[(y^*(x) - y)^2]
> \\ & = N(x) + 2\times0\times(y^*(x) - \mathbb{E}_{y \sim Y_n(x)}[y]) + \mathbb{E}_{y \sim Y_n(x)}[(y^*(x) - y^{\ell, D_n}(x) + y^{\ell, D_n}(x) - y)^2]
> \\ & = N(x) + (y^*(x) - y^{\ell, D_n}(x))^2 + 2(y^*(x) - y^{\ell, D_n}(x))(y^{\ell, D_n}(x) - \mathbb{E}_{y \sim Y_n(x)}[y]) + \mathbb{E}_{y \sim Y_n(x)}[(y^{\ell, D_n}(x) - y)^2]
> \\ & = N(x) + B^2(x) + 2(y^*(x) - y^{\ell, D_n}(x))\times 0 + V(x)
> \\ & = B^2(x)+V(x)+N(x)
> \end{split}

Let $\mathbb{P}_{D_n}(x) = \mathbb{P}_{y \sim Y_n(x)}[y = y^*(x)]$ be the probability over training sets in $D_n$ that the learner predicts the optimal class for $x$.

**Theorem 4 [1]** For the zero-one loss $\ell(t, y) = \mathbb{I}[t\ne y]$ in two-class problems, $c_1 (x, \ell) = 2\mathbb{P}_{D_n}(x) - 1$ and $c_2(x, \ell) = 2 \mathbb{I}[y^{\ell, D_n}(x) = y^*(x)] - 1$.

*Proof*

> See [1]

**Theorem 5 [1]** For the zero-one loss $\ell(t, y) = \mathbb{I}[t\ne y]$ in multiclass problems, $c_1(x, \ell) = TODO$ and $c_2(x, \ell) = TODO$.

*Proof*

> See [1]

**Theorem 6** For the absolute loss $\ell(t, y) = |t-y|$, TODO

# Application: teacher-student setup

Suppose now that we are in a supervised linear regression problem in which the training labels $y$ are generated by a noisy linear model, called the *teacher*:

$$y = y^* + z, \ y^* = \langle w^*, x \rangle \text{ with } \ x \sim \mathcal{N} \left(0,\, \Sigma = diag(\sigma^{2}_1, \dots, \sigma^{2}_d) \right) \text{ and } \ z \sim \mathcal{N} \left(0, \, \sigma_z^{2} \right)$$

where $x \in \mathbb{R}^d$ is the teacher’s input, $y^*$ and $y$ are the teacher’s noiseless and noisy outputs, respectively, $w^* \in \mathbb{R}^d$ represents the (fixed) weights of the teacher, and $z \in \mathbb{R}$ is the label noise.
The *student* model $\hat{y} = \langle w, x \rangle$ is trained on $n$ training pairs $S = \{(x_i, y_i)\}_{i=1}^n$, with the labels $y_i$ generated by the teacher above. The student is trained with SGD on the regularized mean-squared loss, evaluated on the $n$ training examples as:

$$E(w) := \frac{1}{2n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2 + \frac{\gamma}{2} \| w \|_2^2$$

where $\hat{y}_i = \langle w, x_i \rangle$ and $\gamma \in \mathbb{R}^{+}$ is the weight decay ($\ell_2$ regularization) coefficient. Optimizing this equation with SGD yields the typical update rule

$$w^t = w^{t-1} - \eta \odot \nabla_{w} E \left( w^{t-1} \right)$$

in which $t$ denotes the training step (epoch) and $\eta$ the learning rate ($\eta \in \mathbb{R}$ for a single global rate, or $\eta \in \mathbb{R}^d$ for per-parameter rates, $\odot$ being the elementwise product).

Let $X = \frac{1}{\sqrt{n}} \left[ x_1, \dots, x_n \right]^T \in \mathbb{R}^{n \times d} \text{ with } \ x_i \sim \mathcal{N} \left(0,\, \Sigma \right)$, $Y^* = X w^* \in \mathbb{R}^{n}$, $Z = \frac{1}{\sqrt{n}} \left[ z_1, \dots, z_n \right]^T \in \mathbb{R}^{n} \text{ with } \ z_i \sim \mathcal{N} \left(0,\, \sigma_z^{2} \right)$, $Y = Y^* + Z \in \mathbb{R}^{n}$, $\hat{Y} = X w \in \mathbb{R}^{n}$, $\Sigma_{xx} = X^T X \in \mathbb{R}^{d \times d}$ the input correlation matrix, $\Sigma_{xy^*} = X^T Y^* = \Sigma_{xx} w^* \in \mathbb{R}^{d}$ the input-output noiseless correlation vector, $\Sigma_{x z} = X^T Z \in \mathbb{R}^{d}$ the input-noise correlation vector, and $\Sigma_{xy} = X^T Y = \Sigma_{xy^*} + \Sigma_{x z} \in \mathbb{R}^{d}$ the input-output noisy correlation vector.

The update obtained is $w^t - w^* = A (w^{t-1} - w^*) + b$ with $A = \mathbb{I}_d - diag(\eta) H$ and $b = diag(\eta) \left( \Sigma_{x z} - \gamma w^* \right)$, where $H = \Sigma_{xx} + \gamma \mathbb{I}_d$ is the Hessian of the training error $E(w)$, $\Sigma_{xx} = X^T X$ the input correlation matrix and $\Sigma_{x z} = X^T Z$ the input-noise correlation vector.

Using the projection $w^t=Vv^t$ (for a scalar learning rate $\eta$), the update becomes $v^t = \left[ \mathbb{I}_d - \eta \left( \Lambda + \gamma \mathbb{I}_d \right) \right] v^{t-1} + \eta \tilde{s}$, where we used $\Sigma_{xx} = V \Lambda V^T$ and $\Sigma_{xy} = V \tilde{s}$. We have $\Lambda = diag \left( \lambda_1, \dots, \lambda_d \right)$ with $\lambda_1 \ge \dots \ge \lambda_{r} > \lambda_{r+1} = \dots = \lambda_d = 0$, where $r = \min { (n, d) }$ is (almost surely) the rank of $\Sigma_{xx}$.

We will assume $w_i^* \sim \mathcal{N} \left( \mu_{w^*}, \sigma^2_{w^*} \right)$ and $w_i^0 \sim \mathcal{N} \left(\mu_{w^0}, \sigma^2_{w^0} \right)$ for each $i \in [d]$, and the average iterate $\mathbb{E}_{w^*, w^0}[w^t]$ with respect to the choice of $w^0$ (initialization) and $w^*$ (oracle) will be used in place of $w^t$.
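As a numerical sanity check of this setup, the following sketch builds one training set with the $1/\sqrt{n}$ scaling above, runs the (full-batch) gradient update on $E(w)$, and verifies that it coincides with the affine recursion $w^t - w^* = A(w^{t-1} - w^*) + b$. All sizes and hyper-parameters are arbitrary illustrative choices, and a single scalar learning rate is assumed. The same scaffold is reused below to check the expectation formulas of Approaches 1 and 2.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes and hyper-parameters (not taken from the post).
n, d = 50, 10                                # training-set size, input dimension
gamma, eta, T = 0.1, 0.05, 300
sigma_z = 0.5
sigma2 = rng.uniform(0.5, 2.0, size=d)       # diagonal of Sigma

w_star = rng.normal(0.0, 1.0, size=d)        # teacher weights w*
w0 = rng.normal(0.0, 1.0, size=d)            # student initialization w^0

# One training set, with the 1/sqrt(n) scaling of the post:
# X = [x_1, ..., x_n]^T / sqrt(n), Z = [z_1, ..., z_n]^T / sqrt(n), Y = X w* + Z.
X = rng.normal(0.0, np.sqrt(sigma2), size=(n, d)) / np.sqrt(n)
Z = rng.normal(0.0, sigma_z, size=n) / np.sqrt(n)
Y = X @ w_star + Z

Sigma_xx = X.T @ X                           # input correlation matrix
Sigma_xz = X.T @ Z                           # input-noise correlation vector
H = Sigma_xx + gamma * np.eye(d)             # Hessian of E(w)
A = np.eye(d) - eta * H
b = eta * (Sigma_xz - gamma * w_star)

# (1) Full-batch gradient descent on E(w); with the scaled X and Y above,
#     the gradient is X^T (X w - Y) + gamma * w.
w = w0.copy()
for _ in range(T):
    w = w - eta * (X.T @ (X @ w - Y) + gamma * w)

# (2) The affine recursion w^t - w* = A (w^{t-1} - w*) + b.
delta = w0 - w_star
for _ in range(T):
    delta = A @ delta + b

print(np.max(np.abs((w - w_star) - delta)))  # tiny: the two descriptions coincide up to float error
```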
## Optimal prediction

$t(x) = \langle w^*, x \rangle + z$ is a nondeterministic function of $x$ through the noise $z \sim \mathcal{N} \left(0, \, \sigma_z^{2} \right)$, so:

\begin{split}
y^*(x) & = \arg\min_y \mathbb{E}_{z \sim \mathcal{N} \left(0, \, \sigma_z^{2}\right)} [(\langle w^*, x \rangle + z - y)^2]
\\ &= \arg\min_y \ (\langle w^*, x \rangle - y)^2 + \sigma_z^2
\\ &= \arg\min_y \ y^2 - 2\langle w^*, x \rangle y + \langle w^*, x \rangle^2 + \sigma_z^2
\\ &= \langle w^*, x \rangle
\end{split}

## Main prediction

\begin{split}
D_n & = \Big\{ \big\{ (x_i, \langle w^*, x_i \rangle + z_i ) \big\}_{i=1}^n \ | \ \ x_i \sim \mathcal{N} \left(0,\, \Sigma \right) \text{ and } z_i \sim \mathcal{N} \left(0, \, \sigma_z^{2} \right) \Big\}
\\ &= \Big\{ \big( X, Xw^* + Z \big), \ \sqrt{n}X = \left[ x_1, \dots, x_n \right]^T \in \mathbb{R}^{n \times d} \text{ with } x_i \sim \mathcal{N} \left(0,\, \Sigma \right) \text{ and } \sqrt{n}Z = \left[ z_1, \dots, z_n \right]^T \in \mathbb{R}^{n} \text{ with } \ z_i \sim \mathcal{N} \left(0,\, \sigma_z^{2} \right) \Big\}
\end{split}

The learning algorithm $\mathcal{A}$ here is SGD, so it has an additional parameter $t \in [0, \infty]$, the early stopping time. $\mathcal{A}(S, t)$ returns the student parameter $\mathbb{E}_{w^*, w^0}[w(S, t)]$ after $t$ steps of SGD on the dataset $S \in D_n$.

\begin{split}
Y_n^t(x) &= \{ \mathcal{A}(S, t)(x) \ | \ S \in D_n\}
\\ &= \{ \langle w(S, t), x \rangle \ | \ S \in D_n\}
\end{split}

So

\begin{split}
y_n(x, t) := y^{\ell, D_n}(x) &= \mathbb{E}_{y \sim Y_n^t(x)}[y]
\\ &= \mathbb{E}_{S \sim D_n}[ \langle w(S, t), x \rangle]
\\ &= \langle \mathbb{E}_{S \sim D_n}[w(S, t)], x \rangle
\\ &= \langle \mathbb{E}_{X, Z}[w^t], x \rangle
\end{split}

Let us compute

$$K(t) = \mathbb{E}_{X, Z}[w^t]$$

* ### Approach 1

We have

\begin{split}
& w^t - w^* = A^t (w^0 - w^*) + \left( \sum_{k=0}^{t-1} A^k \right) b = A^t (w^0 - w^*) + (\mathbb{I}_d-A)^{+} (\mathbb{I}_d-A^t) \ b
\\ & \Longrightarrow \mathbb{E}_{X, Z}[w^t] = \mathbb{E}_{X}[A^t] (w^0 - w^*) + \mathbb{E}_{X,Z} \left[ \left( \sum_{k=0}^{t-1} A^k \right) b \right]
\end{split}

$A = A(X) = \mathbb{I}_d - diag(\eta) (X^TX + \gamma \mathbb{I}_d)$ and $b = b(X, Z) = diag(\eta) \left( X^TZ - \gamma w^* \right)$ are correlated through $X$. We need to compute $\mathbb{E}_{X}[A^t]$ and $\mathbb{E}_{X,Z} \left[ \left( \sum_{k=0}^{t-1} A^k \right) b \right]$ to be able to use this approach. We have

* $\mathbb{E}_{Z} \left[ b \right] = \mathbb{E}_{X} \left[ b \right] = \mathbb{E}_{X,Z} \left[ b \right] = - \gamma diag(\eta) w^*, \ \text{i.e.} \ \mathbb{E}_{X,Z} \left[ b_i \right] = - \gamma \eta_i w^*_i$
* $\mathbb{E}_{X} \left[ A \right] = \mathbb{I}_d - diag(\eta) \left( \Sigma + \gamma \mathbb{I}_d \right) = diag \left( 1 - \eta_1 (\sigma_1^2 + \gamma), \dots, 1 - \eta_d (\sigma_d^2 + \gamma) \right)$
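These two expectations can be checked by Monte Carlo; the sketch below samples many $(X, Z)$ pairs and compares the empirical averages of $A$ and $b$ against the formulas above (scalar learning rate, illustrative sizes).

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative sizes and hyper-parameters; scalar learning rate eta.
n, d, gamma, eta, sigma_z = 40, 5, 0.1, 0.05, 0.5
sigma2 = rng.uniform(0.5, 2.0, size=d)       # diagonal of Sigma
w_star = rng.normal(0.0, 1.0, size=d)

n_mc = 50_000
A_sum, b_sum = np.zeros((d, d)), np.zeros(d)
for _ in range(n_mc):
    X = rng.normal(0.0, np.sqrt(sigma2), size=(n, d)) / np.sqrt(n)
    Z = rng.normal(0.0, sigma_z, size=n) / np.sqrt(n)
    A_sum += np.eye(d) - eta * (X.T @ X + gamma * np.eye(d))   # A(X)
    b_sum += eta * (X.T @ Z - gamma * w_star)                  # b(X, Z)

E_A_formula = np.eye(d) - eta * (np.diag(sigma2) + gamma * np.eye(d))
E_b_formula = -gamma * eta * w_star
print(np.max(np.abs(A_sum / n_mc - E_A_formula)))   # small (Monte Carlo error only)
print(np.max(np.abs(b_sum / n_mc - E_b_formula)))   # small (Monte Carlo error only)
```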
For two matrices $B$ and $C$ that commute ($BC=CB$), the binomial theorem applies to $(B+C)^k$: $(B+C)^k = \sum_{p=0}^k {k\choose p} B^{k-p} C^p$. So, for a scalar learning rate $\eta \in \mathbb{R}$, we have:

\begin{split}
A^k = (\mathbb{I}_d - \eta (X^TX + \gamma \mathbb{I}_d))^k & = \sum_{p=0}^k {k\choose p} (-\eta)^p (X^TX + \gamma \mathbb{I}_d)^p
\\ & = \sum_{p=0}^k {k\choose p} (-\eta)^p \sum_{q=0}^p {p\choose q} (X^TX)^q \gamma^{p-q}
\\ & = \sum_{p=0}^k \sum_{q=0}^p {k\choose p} {p\choose q} (-\eta)^p \gamma^{p-q} (X^TX)^q
\end{split}

This implies, using $m_q(X) = \mathbb{E}_{X} \left[(X^TX)^q\right] = \mathbb{E}_{X} \left[\Sigma_{xx}^q\right]$,

\begin{split}
\mathbb{E}_{X} \left[A^k\right] = \sum_{p=0}^k \sum_{q=0}^p {k\choose p} {p\choose q} (-\eta)^p \gamma^{p-q} m_q(X)
\end{split}

and

\begin{split}
\mathbb{E}_{X,Z} \left[ \left( \sum_{k=0}^{t-1} A^k \right) b \right] &= \eta \sum_{k=0}^{t-1} \sum_{p=0}^k \sum_{q=0}^p {k\choose p} {p\choose q} (-\eta)^p \gamma^{p-q} \left( \mathbb{E}_{X,Z}\left[ (X^TX)^q X^TZ \right] - \gamma \mathbb{E}_{X}\left[ (X^TX)^q w^* \right] \right)
\\ &= - \eta \gamma \sum_{k=0}^{t-1} \sum_{p=0}^k \sum_{q=0}^p {k\choose p} {p\choose q} (-\eta)^p \gamma^{p-q} \mathbb{E}_{X}\left[ (X^TX)^q\right] w^*
\\ & = - \eta \gamma \left( \sum_{k=0}^{t-1} \mathbb{E}_{X} \left[A^k\right] \right) w^*
\end{split}

So

\begin{split}
\mathbb{E}_{X, Z}[w^t] & = \mathbb{E}_{X}[A^t] (w^0 - w^*) - \eta \gamma \left( \sum_{k=0}^{t-1} \mathbb{E}_{X} \left[A^k\right] \right) w^*
\\ & = \mathbb{E}_{X}[A^t] (w^0 - (1-\eta \gamma) w^*) - \eta \gamma \left( \sum_{k=0}^{t} \mathbb{E}_{X} \left[A^k\right] \right) w^*
\end{split}

TODO:

$$m_q(X) = \mathbb{E}_{X} \left[(X^TX)^q\right] = TODO$$

$$\Sigma_{{xx}_{ij}} = [X^T X]_{ij} = \frac{1}{n} \sum_{k=1}^n x_{ki} x_{kj}$$

* ### Approach 2

Let $\alpha_j = \alpha_j(X) = 1 - \eta \left( \lambda_j + \gamma \right)$ for each $j \in [d]$, and write $v^0 = V^T w^0$, $v^* = V^T w^*$ and $\sqrt{\lambda_j} \tilde{Z}_j = \left( V^T \Sigma_{xz} \right)_j$. We have:

\begin{split}
v^t_j = \left\{
\begin{array}{ll}
v^{0}_j & \mbox{if } \lambda_j = 0 \land \gamma = 0 \\
(v^{0}_j - v_j^*) \alpha_j^t + \frac{\sqrt{\lambda_j} \tilde{Z}_j -\gamma v^*_j}{\lambda_j + \gamma} \left( 1-\alpha_j^t \right) + v_j^* & \mbox{if } \lambda_j \ne 0 \lor \gamma \ne 0 \\
\end{array}
\right.
\end{split}

\begin{split}
\mathbb{E}_{Z} [v_j^t] & = \left\{
\begin{array}{ll}
v^{0}_j & \mbox{if } \lambda_j = 0 \land \gamma = 0 \\
(v^{0}_j - v_j^*) \alpha_j^t - \frac{\gamma v_j^* }{\lambda_j + \gamma} \left( 1-\alpha_j^t \right) + v_j^* & \mbox{if } \lambda_j \ne 0 \lor \gamma \ne 0 \\
\end{array}
\right.
\\ & = v^{0}_j \mathbb{I}[\lambda_j = 0 \land \gamma = 0] + \left[ (v^{0}_j - v_j^*) \alpha_j^t - \frac{\gamma v_j^* }{\lambda_j + \gamma} \left( 1-\alpha_j^t \right) + v_j^* \right] \mathbb{I}[\lambda_j \ne 0 \lor \gamma \ne 0]
\\ & = v^{0}_j + \left[ (v^{0}_j - v_j^*) \alpha_j^t - \frac{\gamma v_j^* }{\lambda_j + \gamma} \left( 1-\alpha_j^t \right) - (v^{0}_j - v_j^*) \right] \mathbb{I}[\lambda_j \ne 0 \lor \gamma \ne 0]
\\ & = v^{0}_j + \left[ -(v^{0}_j - v_j^*) (1-\alpha_j^t) - \frac{\gamma v_j^* }{\lambda_j + \gamma} \left( 1-\alpha_j^t \right) \right] \mathbb{I}[\lambda_j \ne 0 \lor \gamma \ne 0]
\\ & = v^{0}_j - \left( v^{0}_j - v_j^* + \frac{\gamma v_j^* }{\lambda_j + \gamma} \right) (1-\alpha_j^t) \mathbb{I}[\lambda_j \ne 0 \lor \gamma \ne 0]
\\ & = v^{0}_j - \left[ v^{0}_j + \left( \frac{\gamma}{\lambda_j + \gamma} - 1\right) v_j^* \right] (1-\alpha_j^t) \mathbb{I}[\lambda_j \ne 0 \lor \gamma \ne 0]
\\ & = v^{0}_j - \left( v^{0}_j - \frac{\lambda_j}{\lambda_j + \gamma} v_j^* \right) (1-\alpha_j^t) \mathbb{I}[\lambda_j \ne 0 \lor \gamma \ne 0]
\end{split}

Using

$$w^t = V v^t \Longrightarrow w_i^t = \langle V_i, v^t \rangle = \sum_{j=1}^d V_{ij} v_j^t,$$

we have

\begin{split}
\mathbb{E}_{X,Z}[w_i^t] & = \mathbb{E}_{X}\left[ \sum_{j=1}^d V_{ij} \mathbb{E}_{Z}[v_j^t] \right]
\\ & = \mathbb{E}_{X}\left[ \sum_{j=1}^d V_{ij} \left[ v^{0}_j - \left( v^{0}_j - \frac{\lambda_j}{\lambda_j + \gamma} v_j^* \right) (1-\alpha_j^t) \mathbb{I}[\lambda_j \ne 0 \lor \gamma \ne 0] \right] \right]
\\ & = \mathbb{E}_{X}\left[ \sum_{j=1}^d V_{ij} v^{0}_j \right] - \mathbb{E}_{X}\left[ \sum_{j=1}^d V_{ij} \left( v^{0}_j - \frac{\lambda_j}{\lambda_j + \gamma} v_j^* \right) (1-\alpha_j^t) \mathbb{I}[\lambda_j \ne 0 \lor \gamma \ne 0] \right]
\\ & = w_i^0 - \mathbb{E}_{X}\left[ \sum_{j=1}^d V_{ij} \left( v^{0}_j - \frac{\lambda_j}{\lambda_j + \gamma} v_j^* \right) (1-\alpha_j^t) \mathbb{I}[\lambda_j \ne 0 \lor \gamma \ne 0] \right]
\end{split}
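The closed form just obtained is easy to verify numerically. The sketch below fixes one input matrix $X$, simulates the eigenbasis update $v^t = \left[\mathbb{I}_d - \eta(\Lambda + \gamma \mathbb{I}_d)\right]v^{t-1} + \eta \tilde{s}$ for many independent draws of the label noise $Z$, averages the result, and compares it with the expression for $\mathbb{E}_Z[v_j^t]$ above (here $\gamma \ne 0$, so the indicator is always one). Sizes, hyper-parameters and the seed are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative sizes and hyper-parameters (scalar learning rate, gamma != 0).
n, d = 80, 6
gamma, eta, T = 0.1, 0.05, 100
sigma_z = 0.5
sigma2 = rng.uniform(0.5, 2.0, size=d)

w_star = rng.normal(0.0, 1.0, size=d)
w0 = rng.normal(0.0, 1.0, size=d)

# Fix one input matrix X and its eigendecomposition Sigma_xx = V Lambda V^T.
X = rng.normal(0.0, np.sqrt(sigma2), size=(n, d)) / np.sqrt(n)
lam, V = np.linalg.eigh(X.T @ X)
v0, v_star = V.T @ w0, V.T @ w_star
alpha = 1.0 - eta * (lam + gamma)

# Monte Carlo over the label noise Z: iterate v^t = alpha * v^{t-1} + eta * s~,
# with s~ = V^T X^T (X w* + Z), then average over the draws of Z.
n_mc = 20_000
Z = rng.normal(0.0, sigma_z, size=(n_mc, n)) / np.sqrt(n)  # one noise vector per training set
s_tilde = (X @ w_star + Z) @ X @ V                         # row i: V^T X^T Y_i
v = np.tile(v0, (n_mc, 1))
for _ in range(T):
    v = alpha * v + eta * s_tilde
v_mc = v.mean(axis=0)

# Closed form derived above: E_Z[v_j^t] = v_j^0 - (v_j^0 - lam_j/(lam_j+gamma) v_j^*) (1 - alpha_j^t).
v_closed = v0 - (v0 - lam / (lam + gamma) * v_star) * (1.0 - alpha ** T)

print(np.max(np.abs(v_mc - v_closed)))   # small, shrinking with n_mc (Monte Carlo error only)
```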
## Bias, variance, noise

Assume $\mu_{w^0} = \mu_{w^*} = 0$. Then

* $\mathbb{E}_{w^*, w^0, X, Z}[w^t] = 0$

* $$y_n(x, t) = \langle \mathbb{E}_{X, Z}[w^t], x \rangle = \langle K(t), x \rangle = \sum_{i=1}^d K_i(t) x_i$$

\begin{split}
\mathbb{E}_{w^*, w^0, x} y_n^2(x, t) & = \mathbb{E}_{w^*, w^0, x} \sum_{i=1}^d \sum_{j=1}^d K_i(t) K_j(t) x_i x_j
\\ & = \sum_{i=1}^d \sigma_i^2 \mathbb{E}_{w^*, w^0} K_i^2(t)
\end{split}

------------

\begin{split}
\mathbb{E}_{w^*, w^0, x} \langle w^*, x \rangle y_n(x, t) & = \mathbb{E}_{w^*, w^0, x} \left( \sum_{i=1}^d w^*_i x_i \right) \left( \sum_{i=1}^d K_i(t) x_i \right)
\\ & = \mathbb{E}_{w^*, w^0, x} \sum_{i=1}^d \sum_{j=1}^d w^*_i K_j(t) x_i x_j
\\ & = \sum_{i=1}^d \sigma_i^2 \mathbb{E}_{w^*, w^0} [w^*_i K_i(t)]
\end{split}

-----------

* \begin{split}
B_n^2(x, t) &= ( y^*(x) - y_n(x, t))^2 = ( \langle w^*, x \rangle - y_n(x, t))^2
\\&= \langle w^*, x \rangle^2 - 2 \langle w^*, x \rangle y_n(x, t) + y_n^2(x, t)
\end{split}

\begin{split}
\mathbb{E}_{w^*, w^0, x} B_n^2(x, t) = \mathbb{E}_{w^*, x} \langle w^*, x \rangle^2 - 2 \mathbb{E}_{w^*, w^0, x} [\langle w^*, x \rangle y_n(x, t)] + \mathbb{E}_{w^*, w^0, x} y_n^2(x, t)
\end{split}

\begin{split}
\mathbb{E}_{w^*, x} \langle w^*, x \rangle^2 &= \mathbb{E}_{w^*, x} \left( \sum_{i=1}^d w_i^* x_i \right)^2
\\ &= \mathbb{E}_{w^*, x} \sum_{i=1}^d \sum_{j=1}^d w_i^* x_i w_j^* x_j
\\ &= \sum_{i=1}^d \sum_{j=1}^d \mathbb{E}_{w^*} [ w_i^* w_j^*] \mathbb{E}_{x} [x_i x_j]
\\ &= \sum_{i=1}^d \sum_{j=1}^d \left[\sigma_{w^*}^2 \delta_{ij} + \mu_{w^*}^2 (1-\delta_{ij}) \right] \left[\sigma_i^2 \delta_{ij} + \mu_i \mu_j (1-\delta_{ij}) \right]
\\ &= \sum_{i=1}^d \sigma_{w^*}^2 \sigma_i^2 \qquad (\text{since } x \text{ has zero mean, } \mu_i = 0)
\\ &= \sigma_{w^*}^2 \sum_{i=1}^d \sigma_i^2 = \sigma_{w^*}^2 tr(\Sigma)
\end{split}

* $$V_n(x, t) = \mathbb{E}_{y \sim Y_n^t(x)}[y^2] - y_n^2(x, t)$$

$$ \mathbb{E}_{w^*, w^0, x} V_n(x, t) = \mathbb{E}_{w^*, w^0, x}\mathbb{E}_{y \sim Y_n^t(x)}[y^2] - \mathbb{E}_{w^*, w^0, x}y_n^2(x, t) $$

\begin{split}
\mathbb{E}_{y \sim Y_n^t(x)}[y^2] &= \mathbb{E}_{S \sim D_n} \langle w(S, t), x \rangle^2
\\ &= \mathbb{E}_{S \sim D_n} \left( \sum_{i=1}^d w_i(S, t) x_i \right)^2
\\ &= \mathbb{E}_{S \sim D_n} \sum_{i=1}^d \sum_{j=1}^d w_i(S, t) w_j(S, t) x_i x_j
\end{split}

* $$N(x) = \mathbb{E}_{z \sim \mathcal{N} \left(0, \, \sigma_z^{2}\right)}[(\langle w^*, x \rangle + z - y^*(x))^2] = \mathbb{E}_{z \sim \mathcal{N} \left(0, \, \sigma_z^{2}\right)}[z^2] = \sigma_z^2 $$

# References

[1] Pedro Domingos. A unified bias-variance decomposition for zero-one and squared loss. In Proc. of the 17th National Conf. on Artificial Intelligence, pages 564–569, Austin, TX, July 2000.

[2] IFT 6085 - Lecture 9: Stability, Generalization and the Applications of Stability.