Rethinking Parameter Counting in Deep Models: Effective Dimensionality Revisited

tags: papers, parameters

Main Ideas:

  • It is mysterious why neural networks can generalize well despite having many more parameters than data points
  • Effective dimensionality (ED) measures the dimensionality of the parameter space determined by the data
    • Relates ED to posterior contraction in Bayesian deep learning, model selection, width-depth tradeoffs, double descent, and functional diversity in loss surfaces.
    • ED compares favorably to alternative norm- and flatness-based generalization measures.

Effective Dimensionality (ED)

  • Definition of the effective dimensionality of a symmetric matrix $A$ with eigenvalues $\lambda_1, \dots, \lambda_k$:

    $$N_{\mathrm{eff}}(A, z) = \sum_{i=1}^{k} \frac{\lambda_i}{\lambda_i + z},$$

    where $z > 0$ is a regularization constant.
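
A minimal NumPy sketch of this formula (my own illustration, not code from the paper; the helper name and the rank-2 example matrix are made up):

```python
import numpy as np

def effective_dimensionality(A, z):
    """N_eff(A, z) = sum_i lambda_i / (lambda_i + z), over the eigenvalues of A."""
    eigvals = np.linalg.eigvalsh(A)         # eigenvalues of the symmetric matrix A
    return np.sum(eigvals / (eigvals + z))

# Example: a rank-2 matrix in 5 dimensions has ED close to 2 for small z.
rng = np.random.default_rng(0)
U = rng.normal(size=(5, 2))
print(effective_dimensionality(U @ U.T, z=1e-3))
```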

Relationship to Posterior

  • In Bayesian models, the variance of the posterior is significantly reduced relative to the variance of the prior (posterior contraction)
  • The ED of the posterior parameter covariance decreases with more training data (toy example below)
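
A hedged toy illustration of posterior contraction (my own example, not from the paper), using Bayesian linear regression, where the posterior covariance has the closed form $(X^\top X/\sigma^2 + I/\alpha)^{-1}$; all variable names below are placeholders:

```python
import numpy as np

def effective_dimensionality(A, z):
    ev = np.linalg.eigvalsh(A)
    return np.sum(ev / (ev + z))

d, alpha, sigma2 = 10, 1.0, 0.25                  # dimension, prior variance, noise variance
rng = np.random.default_rng(0)
prior_cov = alpha * np.eye(d)
for n in [5, 50, 500]:
    X = rng.normal(size=(n, d))
    post_cov = np.linalg.inv(X.T @ X / sigma2 + np.eye(d) / alpha)
    # Posterior variance shrinks relative to the prior, and so does the ED of the covariance.
    print(n, np.trace(post_cov) / np.trace(prior_cov),
          effective_dimensionality(post_cov, z=1e-2))
```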

Relationship to Hessian

  • The ED of the Hessian increases with more training data (continuation of the toy example below)
  • It indicates the number of parameters that have been determined by the data
  • It is inversely correlated with the ED of the posterior covariance
    • The curvature of the posterior is captured by the Hessian ED, and curvature increases with more data
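
Continuing the same hedged toy model (again my own illustration, not the paper's code): here the Hessian of the data fit is $X^\top X/\sigma^2$ and the posterior covariance is its regularized inverse, so the two effective dimensionalities move in opposite directions as $n$ grows:

```python
import numpy as np

def effective_dimensionality(A, z):
    ev = np.linalg.eigvalsh(A)
    return np.sum(ev / (ev + z))

d, alpha, sigma2 = 10, 1.0, 0.25
rng = np.random.default_rng(0)
for n in [5, 50, 500]:
    X = rng.normal(size=(n, d))
    H = X.T @ X / sigma2                              # Hessian of the negative log-likelihood
    post_cov = np.linalg.inv(H + np.eye(d) / alpha)   # exact posterior covariance for this model
    # ED of the Hessian grows with n while the ED of the posterior covariance shrinks.
    print(n, effective_dimensionality(H, z=1.0 / alpha),
          effective_dimensionality(post_cov, z=1e-2))
```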

Other Notes

Growth in the eigenvalues of the Hessian of the loss corresponds to increased certainty about the parameters: as this happens, the eigenvalues of the covariance matrix in our approximation to the posterior distribution shrink, indicating contraction. Effective dimensionality can therefore serve as a proxy for the number of parameters that have been determined by the data. For models with the same parametrization, we expect the one with lower effective dimensionality to generalize better.

Degenerate directions: directions in parameter space that have not been determined by the data. Moving the parameters along these directions yields functionally homogeneous models. This should also connect with some notion of uncertainty.

Effective dimensionality can also be viewed as compression: disregard the high-dimensional subspaces that contain little information about the model (see the sketch below).
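
A short sketch of this compression view (my own toy, assuming the Hessian `H` is available as a dense array and `theta` is a flattened parameter vector; both names are hypothetical): keep only the top-$k$ eigendirections, the ones the data has determined.

```python
import numpy as np

def compress_parameters(theta, H, k):
    """Project theta onto the k most-determined directions (top-k eigenvectors of H)."""
    eigvals, eigvecs = np.linalg.eigh(H)   # eigenvalues in ascending order
    top = eigvecs[:, -k:]                  # directions with the largest curvature
    return top @ (top.T @ theta)           # components in the remaining directions are dropped
```

Perturbations along the discarded directions correspond to the degenerate, undetermined directions above, which is why those subspaces carry little information about the model.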

Computational issues: for deep and wide NNs, the Hessian of the loss is very large, so computing its eigenvalues and eigenvectors directly is nontrivial (one possible workaround is sketched below).
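
One common workaround (a sketch under my own assumptions, not the paper's code; the tiny model and random data below are placeholders) is to compute only the top eigenvalues with Lanczos via Hessian-vector products, so the full Hessian is never materialized:

```python
import numpy as np
import torch
from scipy.sparse.linalg import LinearOperator, eigsh

torch.manual_seed(0)
model = torch.nn.Sequential(
    torch.nn.Linear(20, 50), torch.nn.Tanh(), torch.nn.Linear(50, 1)
).double()
X = torch.randn(128, 20, dtype=torch.float64)
y = torch.randn(128, 1, dtype=torch.float64)

params = [p for p in model.parameters() if p.requires_grad]
loss = torch.nn.functional.mse_loss(model(X), y)
grads = torch.autograd.grad(loss, params, create_graph=True)
flat_grad = torch.cat([g.reshape(-1) for g in grads])
n_params = flat_grad.numel()

def hvp(v):
    # Hessian-vector product: differentiate (gradient . v) with respect to the parameters.
    v = torch.from_numpy(np.ascontiguousarray(v)).reshape(-1)
    Hv = torch.autograd.grad(flat_grad @ v, params, retain_graph=True)
    return torch.cat([h.reshape(-1) for h in Hv]).numpy()

H_op = LinearOperator((n_params, n_params), matvec=hvp, dtype=np.float64)
top_eigvals = eigsh(H_op, k=10, return_eigenvectors=False)   # 10 largest-magnitude eigenvalues
print(np.sum(top_eigvals / (top_eigvals + 1e-3)))             # truncated effective dimensionality
```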

Questions

  • Does lower effective dimensionality imply anything about uncertainty?
    • Maybe? There are papers suggesting that more overparameterization leads to more robustness.
    • Hypothesis: lower effective dimensionality means less diversity among models and worse uncertainty estimates