These are my notes on the following paper: https://arxiv.org/abs/2104.04874, which analyses the inductive bias of SGD relative to full-batch GD using Taylor expansions. To make the paper easier to understand, I present the key ideas using modern machine learning notation.
The generalisation gap, for the purposes of this paper, is the difference between a model's test loss and training loss. In this work we look at how this generalisation gap changes in a single step of gradient descent on the training data, starting from a fixed parameter $\theta$. Since the training and test data are random, the change in the generalisation gap is also random; we will be interested in the average of this change when we average over realisations of the training and test data.
Let $D$ and $D'$ be a random training and a random test set, each drawn i.i.d. from the same underlying distribution $\mathcal{P}$ over datapoints $x$.
Let $L_S(\theta) = \frac{1}{|S|} \sum_{x \in S} \ell(\theta, x)$ be the empirical loss evaluated on a set of data $S$.
Let $g_S(\theta) = \nabla_\theta L_S(\theta)$ be the gradient evaluated on a set of data $S$.
Let's say we take a gradient step on $D$ with learning rate $\eta$, i.e. $\theta' = \theta - \eta\, g_D(\theta)$. We can approximate how the training loss changes via a first-order Taylor expansion of $L_D$ around $\theta$:

$$
L_D(\theta') \approx L_D(\theta) - \eta\, \|g_D(\theta)\|^2 + \mathcal{O}(\eta^2).
$$
Similarly, we can express the change in the test loss as:

$$
L_{D'}(\theta') \approx L_{D'}(\theta) - \eta\, \langle g_{D'}(\theta), g_D(\theta)\rangle + \mathcal{O}(\eta^2).
$$
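To make the first-order approximation concrete, here is a minimal numpy sketch (my own, not from the paper) that compares the actual change in training and test loss after one gradient step with the Taylor predictions above, on a toy linear regression problem; the model, data distribution and all function names are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear regression: per-example loss is 0.5 * (x @ theta - y)^2.
d, N, eta = 5, 100, 0.01
theta_true = rng.normal(size=d)
theta = rng.normal(size=d)          # fixed starting parameter

def make_dataset(n):
    X = rng.normal(size=(n, d))
    y = X @ theta_true + 0.1 * rng.normal(size=n)
    return X, y

def loss(theta, data):
    X, y = data
    return 0.5 * np.mean((X @ theta - y) ** 2)

def grad(theta, data):
    X, y = data
    return X.T @ (X @ theta - y) / len(y)

D, D_test = make_dataset(N), make_dataset(N)

g_D = grad(theta, D)
theta_new = theta - eta * g_D       # one full-batch gradient step on the training set

# Actual loss changes vs. the first-order Taylor predictions above.
print("train: actual", loss(theta_new, D) - loss(theta, D),
      "  predicted", -eta * g_D @ g_D)
print("test:  actual", loss(theta_new, D_test) - loss(theta, D_test),
      "  predicted", -eta * grad(theta, D_test) @ g_D)
```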
To move forward, we will want to average the above quantities over random realisations of the training and test datasets. Let's consider a dataset of size $N$, $D = \{x_1, \ldots, x_N\}$, where each datapoint $x_n$ is drawn independently from the same distribution $\mathcal{P}$. Let's say we have two index sets $I, J \subseteq \{1, \ldots, N\}$, and corresponding subsets $D_I$ and $D_J$. We now consider the following expectation:

$$
\begin{aligned}
\mathbb{E}\,\langle g_{D_I}(\theta), g_{D_J}(\theta)\rangle
&= \frac{1}{|I||J|} \sum_{i \in I} \sum_{j \in J} \mathbb{E}\,\langle g_i, g_j\rangle \\
&= \frac{1}{|I||J|} \Bigg( \sum_{i \in I \cap J} \mathbb{E}\,\|g_i\|^2 + \sum_{\substack{i \in I,\ j \in J \\ i \neq j}} \langle \mathbb{E}\, g_i, \mathbb{E}\, g_j\rangle \Bigg) \\
&= \frac{|I \cap J|}{|I||J|}\big(\|\mu\|^2 + \operatorname{tr}\Sigma\big) + \frac{|I||J| - |I \cap J|}{|I||J|}\,\|\mu\|^2 \\
&= \|\mu\|^2 + \frac{|I \cap J|}{|I||J|}\,\operatorname{tr}\Sigma,
\end{aligned}
$$

where $g_i$ is shorthand for the single-datapoint gradient $g_{\{x_i\}}(\theta)$, and $\mu$ and $\Sigma$ denote its mean and covariance.
Here we used the fact that the expectation of the product of independent random variables is the product of their expectations on the second term in the second line, and the following identity for a vector-valued random variable $g$ with mean $\mu$ and covariance $\Sigma$, which we prove below using the cyclic property of the trace:

$$
\mathbb{E}\,\|g\|^2 = \mathbb{E}\,\operatorname{tr}\big(g g^\top\big) = \operatorname{tr}\big(\mathbb{E}\, g g^\top\big) = \operatorname{tr}\big(\Sigma + \mu\mu^\top\big) = \operatorname{tr}\Sigma + \|\mu\|^2.
$$
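The identity and the overlap formula above can be sanity-checked numerically. The sketch below (again mine, not the paper's) models the per-datapoint gradients directly as i.i.d. Gaussian vectors with a known mean and covariance, and estimates $\mathbb{E}\,\langle g_{D_I}, g_{D_J}\rangle$ by Monte Carlo; the Gaussian gradient model is an assumption made only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Model the per-datapoint gradient directly as an i.i.d. random vector with
# known mean mu and covariance Sigma (a stand-in for real gradients).
d = 4
mu = rng.normal(size=d)
A = rng.normal(size=(d, d))
Sigma = A @ A.T

N = 10
I = np.arange(0, 6)                  # index set I = {0, ..., 5}
J = np.arange(3, 10)                 # index set J = {3, ..., 9}
overlap = len(np.intersect1d(I, J))  # |I ∩ J| = 3

trials = 100_000
g = rng.multivariate_normal(mu, Sigma, size=(trials, N))  # gradients for many "datasets"
g_I = g[:, I].mean(axis=1)           # minibatch gradient over I
g_J = g[:, J].mean(axis=1)           # minibatch gradient over J
monte_carlo = np.mean(np.sum(g_I * g_J, axis=1))

closed_form = mu @ mu + overlap / (len(I) * len(J)) * np.trace(Sigma)
print("Monte Carlo:", monte_carlo, "  closed form:", closed_form)
```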
Using this, we can now express the expected change in the training and test losses, under the assumption that the test set $D'$ is drawn independently of the training set $D$. For the training loss we take $I = J = \{1, \ldots, N\}$, so the overlap is $N$; for the test loss, the data used to compute the gradient and the data used to evaluate the loss do not overlap at all:

$$
\begin{aligned}
\mathbb{E}\big[L_D(\theta') - L_D(\theta)\big] &\approx -\eta \left( \|\mu\|^2 + \frac{\operatorname{tr}\Sigma}{N} \right),\\
\mathbb{E}\big[L_{D'}(\theta') - L_{D'}(\theta)\big] &\approx -\eta\, \|\mu\|^2.
\end{aligned}
$$
Thus, the training error is reduced more than the test error, in expectation. This is what the authors mean when they write that 'GD develops a “bias” toward overfitting', because in expectation, one gradient step increases the gap between training and test losses.
This is the first main finding of the paper, equation (1), just written differently: in expectation, a single gradient step increases the generalisation gap by

$$
\mathbb{E}\Big[\big(L_{D'}(\theta') - L_D(\theta')\big) - \big(L_{D'}(\theta) - L_D(\theta)\big)\Big] \approx \frac{\eta\,\operatorname{tr}\Sigma}{N},
$$

where $\Sigma$ is the covariance matrix of gradients on a single datapoint.
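Here is a rough numerical illustration of this finding (my own sketch, not an experiment from the paper): for a toy linear regression problem with a fixed $\theta$ chosen independently of the data, we average the change in the generalisation gap after one full-batch gradient step over many fresh draws of the training and test sets, and compare it to $\eta\,\operatorname{tr}\Sigma / N$. The toy model and all helper names (`make_dataset`, `per_example_grads`, etc.) are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy setup: linear regression with squared loss, fixed theta chosen
# independently of the data, fresh train and test sets on every trial.
d, N, eta = 5, 50, 0.01
theta_true = rng.normal(size=d)
theta = rng.normal(size=d)

def make_dataset(n):
    X = rng.normal(size=(n, d))
    y = X @ theta_true + 0.5 * rng.normal(size=n)
    return X, y

def loss(theta, data):
    X, y = data
    return 0.5 * np.mean((X @ theta - y) ** 2)

def grad(theta, data):
    X, y = data
    return X.T @ (X @ theta - y) / len(y)

def per_example_grads(theta, data):
    X, y = data
    return X * (X @ theta - y)[:, None]      # one gradient per datapoint (rows)

gap_changes, trace_sigma = [], []
for _ in range(20_000):
    D, D_test = make_dataset(N), make_dataset(N)
    theta_new = theta - eta * grad(theta, D)  # one full-batch step on D
    gap_before = loss(theta, D_test) - loss(theta, D)
    gap_after = loss(theta_new, D_test) - loss(theta_new, D)
    gap_changes.append(gap_after - gap_before)
    trace_sigma.append(per_example_grads(theta, D).var(axis=0).sum())

# The two numbers should roughly agree, up to O(eta^2) terms and Monte Carlo noise.
print("mean increase in generalisation gap:", np.mean(gap_changes))
print("eta * tr(Sigma) / N:               ", eta * np.mean(trace_sigma) / N)
```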
This also provides a nice, intuitive explanation of how the local total variance of gradients, $\operatorname{tr}\Sigma$, is related to generalisation. If the variance of per-datapoint gradients is negligible, or at least small, then running gradient descent is a 'safe' way to learn, because in this case the gradient one would compute on the training set is nearly the same as the one one would compute on the test set.
A key limitation of this analysis is that it assumes $\theta$ is chosen independently of $D$. So long as we never recycle data in subsequent training steps, this would be true in each step of gradient descent. In practice, however, we cycle through the training data multiple times. If $\theta$ in our equations is some intermediate state, itself found by running gradient descent on data that overlaps with the training data we currently evaluate on, the above analysis is flawed (because it treats $\theta$ as fixed and not as a function of $D$). In this situation it can no longer be guaranteed that the distribution of gradients is the same on the training and test sets, and the whole argument fails.
The second finding of this paper contrasts stochastic and full-batch gradient descent.