Rethinking Parameter Counting in Deep Models: Effective Dimensionality Revisited

tags: papers, parameters

Main Ideas:

  • It is mysterious why neural networks can generalize well despite having many more parameters than data points
  • Effective dimensionality (ED) measures the dimensionality of the parameter space determined by the data
    • Relates ED to posterior contraction in Bayesian deep learning, model selection, width-depth tradeoffs, double descent, and functional diversity in loss surfaces.
    • ED compares favorably to alternative norm- and flatness-based generalization measures.

Effective Dimensionality (ED)

  • Definition of the effective dimensionality of a symmetric matrix $A$ with eigenvalues $\lambda_1, \dots, \lambda_k$:

    $$N_{\mathrm{eff}}(A, z) = \sum_{i=1}^{k} \frac{\lambda_i}{\lambda_i + z},$$

    where $z > 0$ is a regularization constant.
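
A minimal NumPy sketch of this formula (my own illustration, not code from the paper; the helper name and the rank-2 example matrix are made up):

```python
import numpy as np

def effective_dimensionality(A, z):
    """N_eff(A, z) = sum_i lambda_i / (lambda_i + z), over the eigenvalues of A."""
    eigvals = np.linalg.eigvalsh(A)         # eigenvalues of the symmetric matrix A
    return np.sum(eigvals / (eigvals + z))

# Example: a rank-2 matrix in 5 dimensions has ED close to 2 for small z.
rng = np.random.default_rng(0)
U = rng.normal(size=(5, 2))
print(effective_dimensionality(U @ U.T, z=1e-3))
```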

Relationship to Posterior

  • In Bayesian models, the variance of the posterior is significantly reduced relative to the variance of the prior (posterior contraction)
  • The ED of the posterior parameter covariance decreases with more training data (toy example below)
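
A hedged toy illustration of posterior contraction (my own example, not from the paper), using Bayesian linear regression, where the posterior covariance has the closed form $(X^\top X/\sigma^2 + I/\alpha)^{-1}$; all variable names below are placeholders:

```python
import numpy as np

def effective_dimensionality(A, z):
    ev = np.linalg.eigvalsh(A)
    return np.sum(ev / (ev + z))

d, alpha, sigma2 = 10, 1.0, 0.25                  # dimension, prior variance, noise variance
rng = np.random.default_rng(0)
prior_cov = alpha * np.eye(d)
for n in [5, 50, 500]:
    X = rng.normal(size=(n, d))
    post_cov = np.linalg.inv(X.T @ X / sigma2 + np.eye(d) / alpha)
    # Posterior variance shrinks relative to the prior, and so does the ED of the covariance.
    print(n, np.trace(post_cov) / np.trace(prior_cov),
          effective_dimensionality(post_cov, z=1e-2))
```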

Relationship to Hessian

  • The ED of the Hessian increases with more training data (continuation of the toy example below)
  • It indicates the number of parameters that have been determined by the data
  • It is inversely correlated with the ED of the posterior covariance
    • The curvature of the posterior is captured by the Hessian ED, and curvature increases with more data
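
Continuing the same hedged toy model (again my own illustration, not the paper's code): here the Hessian of the data fit is $X^\top X/\sigma^2$ and the posterior covariance is its regularized inverse, so the two effective dimensionalities move in opposite directions as $n$ grows:

```python
import numpy as np

def effective_dimensionality(A, z):
    ev = np.linalg.eigvalsh(A)
    return np.sum(ev / (ev + z))

d, alpha, sigma2 = 10, 1.0, 0.25
rng = np.random.default_rng(0)
for n in [5, 50, 500]:
    X = rng.normal(size=(n, d))
    H = X.T @ X / sigma2                              # Hessian of the negative log-likelihood
    post_cov = np.linalg.inv(H + np.eye(d) / alpha)   # exact posterior covariance for this model
    # ED of the Hessian grows with n while the ED of the posterior covariance shrinks.
    print(n, effective_dimensionality(H, z=1.0 / alpha),
          effective_dimensionality(post_cov, z=1e-2))
```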

Other Notes

Growth in the eigenvalues of the Hessian of the loss corresponds to increased certainty about the parameters: as this happens, the eigenvalues of the covariance matrix in our approximation to the posterior distribution shrink, indicating contraction. Effective dimensionality can therefore serve as a proxy for the number of parameters that have been determined by the data. For models with the same parametrization, we expect the one with lower effective dimensionality to generalize better.

Degenerate directions: directions in parameter space that have not been determined by the data. Moving the parameters along these directions yields functionally homogeneous models. This should also connect with some notion of uncertainty.

Effective dimensionality can also be viewed as compression: disregard the high-dimensional subspaces that contain little information about the model (see the sketch below).
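
A short sketch of this compression view (my own toy, assuming the Hessian `H` is available as a dense array and `theta` is a flattened parameter vector; both names are hypothetical): keep only the top-$k$ eigendirections, the ones the data has determined.

```python
import numpy as np

def compress_parameters(theta, H, k):
    """Project theta onto the k most-determined directions (top-k eigenvectors of H)."""
    eigvals, eigvecs = np.linalg.eigh(H)   # eigenvalues in ascending order
    top = eigvecs[:, -k:]                  # directions with the largest curvature
    return top @ (top.T @ theta)           # components in the remaining directions are dropped
```

Perturbations along the discarded directions correspond to the degenerate, undetermined directions above, which is why those subspaces carry little information about the model.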

Computational issues: for deep and wide NNs, the Hessian of the loss is very large, so computing its eigenvalues and eigenvectors directly is nontrivial (one possible workaround is sketched below).
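
One common workaround (a sketch under my own assumptions, not the paper's code; the tiny model and random data below are placeholders) is to compute only the top eigenvalues with Lanczos via Hessian-vector products, so the full Hessian is never materialized:

```python
import numpy as np
import torch
from scipy.sparse.linalg import LinearOperator, eigsh

torch.manual_seed(0)
model = torch.nn.Sequential(
    torch.nn.Linear(20, 50), torch.nn.Tanh(), torch.nn.Linear(50, 1)
).double()
X = torch.randn(128, 20, dtype=torch.float64)
y = torch.randn(128, 1, dtype=torch.float64)

params = [p for p in model.parameters() if p.requires_grad]
loss = torch.nn.functional.mse_loss(model(X), y)
grads = torch.autograd.grad(loss, params, create_graph=True)
flat_grad = torch.cat([g.reshape(-1) for g in grads])
n_params = flat_grad.numel()

def hvp(v):
    # Hessian-vector product: differentiate (gradient . v) with respect to the parameters.
    v = torch.from_numpy(np.ascontiguousarray(v)).reshape(-1)
    Hv = torch.autograd.grad(flat_grad @ v, params, retain_graph=True)
    return torch.cat([h.reshape(-1) for h in Hv]).numpy()

H_op = LinearOperator((n_params, n_params), matvec=hvp, dtype=np.float64)
top_eigvals = eigsh(H_op, k=10, return_eigenvectors=False)   # 10 largest-magnitude eigenvalues
print(np.sum(top_eigvals / (top_eigvals + 1e-3)))             # truncated effective dimensionality
```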

Questions

  • Does lower effective dimensionality imply anything about uncertainty?
    • Maybe? There are papers suggesting that more overparameterization leads to more robustness.
    • Hypothesis: lower effective dimensionality means less diversity among models and worse uncertainty estimates