## Paper TODO
#### Gbetondji
<!-- - cat plots for 2 eigenvalue conditions -->
<!-- - table of 2 eigenvalue conditions for all inputs + all layers -->
<!-- - mimetic initialization -->
<!-- - $|| HFC(X_l) ||_2 / || LFC(X_l) ||_2$ (y-axis) vs. each layer l (x-axis) in each of the models (can plot all models in the same plot in different colors) -->
<!-- - add asymmetry of H and A plots -->
- show that ViT- outperforms ViT and ViT+ as the number of layers is increased
<!-- - flatness measure -->
<!-- - Figure 1 increase norm a bit, right now neighboring images look very similar -->
<!-- - send me all individual photos in Figure 1, will modify the figure to add more detail and other details -->
<!-- - Figure 2 increase font sizes, change legends to remove parentheses and text inside parentheses -->
<!-- - Figure 4 increase font sizes -->
- describing the tables / plots
- describing the experimental setups
#### Matt
<!-- - related work -->
<!-- - limitations -->
<!-- - conclusion -->
MATT TODO
- modify Figure 1
<!-- - show how without residual connection, we get rank collapse, an alternative proof to \cite{dong2021attention}. describe how this is a generalization of work below! (look at work again to make sure it has fixed weights) -->
- talk about pitfalls of prior theoretical analyses (e.g., anti-oversmoothing statements; the work below assuming $W_V W_{proj} = I$ (2nd bullet point); when do ViTs work)
- talk about how other work can be explained by this work (e.g., mimetic initialization; small eigenspectrum (third bullet point below, successful pretrained models))
RELATED:
- SIDENOTE: this related work analyzes eigenvalues of self-attention to motivate a contrastive-learning solution to oversmoothing, a new layer called ContraNorm: \url{https://arxiv.org/pdf/2303.06562.pdf}. Here they define `effective rank', which measures how degenerate (close to vanishing) the distribution of singular values is. Also vaguely related, this paper focuses on the dynamics of contrastive learning: \url{https://arxiv.org/pdf/2303.04435.pdf}
- another related paper which shows that with $W_V W_{proj} = I$ and no residual connections we get rank deficiency: \url{https://arxiv.org/pdf/2306.01610.pdf}
- another fix which divides all weights by their spectral norm: \url{https://proceedings.mlr.press/v202/zhai23a/zhai23a.pdf}. This forces the eigenspectrum to be small
- double-check all proofs
- full pass through paper
## new directions
#### Gbetondji
- doubtful about weight-sharing
- success has to do with domain
- might not do well on images, e.g., Sinkformer image results aren't very convincing (cat vs. dog); best results are on point clouds
- stability
- overlap between stability and normalization
- without normalization, models don't work, even with $(1-\tau)$
- theory from graph NNs
- need new architecture
- prefer to be between theory and practice, not develop new strong theoretical results
- doubly stochastic attention: seems to work okay (in the Sinkformer paper)
- HOW DO VISION TRANSFORMERS WORK? (https://arxiv.org/pdf/2202.06709.pdf)
- ViT low-pass, CNNs high-pass
- ViT loss is smoother and flatter but not convex; CNN loss is closer to convex but not as flat
- flatness: seems related to generalization
- convexity: related to data efficiency
- avg pooling: works as a low-pass filter, balances the high-pass filter bias, and flattens the landscape (see the quick check after this list)
- what's the equivalent for ViT (i.e., a high-pass filter)? this is what the anti-oversmoothing paper does: upweight high frequencies
- best possible architecture:
- flat and convex
- above paper: non-principled ViT + CNN layers
- can we use gradient flow ideas to make a block that balances high and low pass
- i.e., GNNs as Gradient Flow
- experiments:
- visualize loss landscape for different eigenvalue choices: both flat and convex
- similar to HOW DO VISION TRANSFORMERS WORK?
- convexity: Hessian experiments
- data efficiency:
- small datasets CNNs outperform ViT
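A quick numerical check of the "avg pooling works as a low-pass filter" point above; this is a purely illustrative sketch in which a 1-D moving average stands in for pooling:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(256)          # white-noise signal: roughly flat spectrum

# 1-D "average pooling" as a stride-1 moving-average filter of width 4
kernel = np.ones(4) / 4
y = np.convolve(x, kernel, mode="same")

# Compare spectral energy in the upper vs. lower half of the frequency range
X_hat, Y_hat = np.abs(np.fft.rfft(x)), np.abs(np.fft.rfft(y))
half = len(X_hat) // 2
print("input  high/low energy ratio:", (X_hat[half:] ** 2).sum() / (X_hat[:half] ** 2).sum())
print("pooled high/low energy ratio:", (Y_hat[half:] ** 2).sum() / (Y_hat[:half] ** 2).sum())
# The ratio drops sharply after averaging: high frequencies are attenuated (low-pass behavior).
```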
## new idea
Consider the doubly-stochastic transformer:
$X^{(t+1)} = X^{(t)} + A X^{(t)} W$
where $A$ is doubly stochastic (i.e., its rows and columns each sum to 1) and symmetric, and $W$ is symmetric. We can parameterize both $A$ and $W$ as follows:
$W = L^\top L$
$A = n_r(n_c( \cdots n_r(n_c(\exp(X^\top V^\top V X))) \cdots ))$
where $n_r(\cdot)$ and $n_c(\cdot)$ mean row and column normalization. This is Sinkhorn's algorithm, described in more detail here: https://proceedings.mlr.press/v151/sander22a/sander22a.pdf
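A minimal NumPy sketch of one such layer (an illustrative, assumption-laden sketch, not the Sinkformer implementation): token features are taken row-wise, $X \in \mathbb{R}^{n \times d}$, so the score matrix becomes $(XV^\top)(XV^\top)^\top$; Sinkhorn is truncated to a fixed number of iterations; and `sinkhorn` / `dst_layer` are made-up names.

```python
import numpy as np

def sinkhorn(S, n_iters=20):
    """Alternate row and column normalization (Sinkhorn's algorithm) to map
    exp(S) to an (approximately) doubly stochastic matrix."""
    A = np.exp(S - S.max())                      # positive entries; the shift is only for numerical stability
    for _ in range(n_iters):
        A = A / A.sum(axis=1, keepdims=True)     # n_r: rows sum to 1
        A = A / A.sum(axis=0, keepdims=True)     # n_c: columns sum to 1
    return A

def dst_layer(X, V, L, n_iters=20):
    """One doubly-stochastic transformer step X <- X + A X W,
    with A = Sinkhorn(exp of symmetric scores) and W = L^T L (symmetric)."""
    S = X @ V.T @ V @ X.T                        # (n, n), symmetric score matrix
    A = sinkhorn(S, n_iters)
    W = L.T @ L                                  # (d, d), symmetric weight matrix
    return X + A @ X @ W

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 4))                  # n = 8 tokens, d = 4 channels
V = rng.standard_normal((4, 4))
L = 0.1 * rng.standard_normal((4, 4))
print(dst_layer(X, V, L).shape)                  # (8, 4)
```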
This transformer has some very nice properties:
- eigenvalues of $A$ are in $[-1,1]$ (a short argument is sketched after this list):
- (upper) https://res.mdpi.com/d_attachment/symmetry/symmetry-12-00369/article_deploy/symmetry-12-00369-v2.pdf
- (both, need better source) https://mathoverflow.net/questions/239951/lower-bound-for-smallest-eigenvalue-of-symmetric-doubly-stochastic-metropolis-ha
- the above property means we can have our own version of Theorem 5.1 from GNNs as Gradient Flows: https://arxiv.org/pdf/2206.10991.pdf
- this means we have conditions on eigenvalues of W to avoid oversmoothing
- and can prove an analogue of Theorem 1 from GRAND: https://arxiv.org/pdf/2106.10934.pdf (just require that the maximum eigenvalue of $W$ is $1/\lambda_{A,\max}$, and weight $X^{(t)}$ by $\alpha$ and $A X^{(t)} W$ by $1-\alpha$ for $\alpha \in (0,1)$, which shouldn't change the analysis of Theorem 5.1)
- (maybe) can make an extension to sheaves
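Since the note above asks for a better source for the eigenvalue bound, here is a short self-contained sketch of the standard argument:

```latex
% Claim: a symmetric doubly-stochastic matrix A has all of its eigenvalues in [-1, 1].
% Symmetry gives real eigenvalues; (row-)stochasticity bounds the spectral radius.
Let $A v = \lambda v$ with $v \neq 0$, and pick $i$ such that $|v_i| = \max_j |v_j|$. Then
\[
  |\lambda|\,|v_i| = \Big|\sum_j A_{ij} v_j\Big| \le \sum_j A_{ij} |v_j| \le |v_i| \sum_j A_{ij} = |v_i|,
\]
using $A_{ij} \ge 0$ and row sums equal to $1$, so $|\lambda| \le 1$. Symmetry of $A$ makes every
$\lambda$ real, hence $\lambda \in [-1, 1]$; the value $\lambda = 1$ is attained since $A \mathbf{1} = \mathbf{1}$.
```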
## meeting notes
### 27/06/23
## derivation
$$ \frac{\partial \mathbf{x}(t)}{\partial t} = \mbox{div}[\frac{1}{g(\mathbf{x}(t),t)}\nabla\mathbf{x}(t)]$$
$$g(\mathbf{x}(t),t) := k(x(t),x(t')) = \exp(\phi(x(t))^\top\phi(x(t'))) $$
$$ \frac{\partial \mathbf{x}(t)}{\partial t} = \nabla\mathbf{K}^*\nabla\mathbf{x}(t)$$
$$y_i = \sum_{j \in N_i} g_\theta(x_j - x_i) (x_j-x_i)$$
$$div(g \nabla x)$$
$$TV(X) = \sum_{i}\sum_{j\in N_i} |x_i-x_j|$$
$$\dot{x} = div(\frac{\nabla x}{\|\nabla x \|})$$
- vector calculus background: https://en.wikipedia.org/wiki/Vector_calculus_identities
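A minimal discrete sketch of the neighborhood update $y_i = \sum_{j \in N_i} g_\theta(x_j - x_i)(x_j - x_i)$ above, using an explicit Euler step and a Perona-Malik-style edge-stopping function for $g$; the choice of $g$, the step size, and all names are illustrative assumptions:

```python
import numpy as np

def conductance(diff, kappa=1.0):
    """Perona-Malik-style edge-stopping function: close to 1 for small |diff|,
    close to 0 for large |diff|, so diffusion is suppressed across sharp edges."""
    return np.exp(-(diff / kappa) ** 2)

def diffusion_step(x, neighbors, tau=0.1, kappa=1.0):
    """Explicit Euler step of x_i <- x_i + tau * sum_{j in N_i} g(x_j - x_i) (x_j - x_i)."""
    y = np.zeros_like(x)
    for i, nbrs in enumerate(neighbors):
        for j in nbrs:
            d = x[j] - x[i]
            y[i] += conductance(d, kappa) * d
    return x + tau * y

# 1-D signal with a jump, on a chain graph
x = np.array([0.0, 0.1, 0.0, 1.0, 0.9, 1.0])
neighbors = [[1], [0, 2], [1, 3], [2, 4], [3, 5], [4]]
for _ in range(50):
    x = diffusion_step(x, neighbors, tau=0.2, kappa=0.5)
print(x)   # small bumps inside each plateau are smoothed; the jump is largely preserved
```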
## papers to read
READ!!!!:
### Semantic Diffusion Network for Semantic Segmentation
- https://proceedings.neurips.cc/paper_files/paper/2022/file/396446770f5e8496ca1feb02079d4fb7-Paper-Conference.pdf
### IMAGE SELECTIVE SMOOTHING AND EDGE DETECTION BY NONLINEAR DIFFUSION
- https://epubs.siam.org/doi/pdf/10.1137/0729012?casa_token=27EEtqsq9nAAAAAA:FY_KstyZAMyWeFQr624t3VBG89QU10t0v0KHF7dXt1nc6NzyjXhQ63fJSodwoZzOveG_rBnalF8
### Scale-Space and Edge Detection Using Anisotropic Diffusion
- https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=56205&casa_token=x8kggJ_mwH8AAAAA:q_qPlNaLPj-bArP6_EWQjpNHNevGMscMglQkgr06D5GfEyCJdxrivFARjRw8Tzg5Xd_0IvnaIg&tag=1
### GREAD: Graph Neural Reaction-Diffusion Networks
- https://arxiv.org/pdf/2211.14208.pdf
### STABILIZING TRANSFORMER TRAINING BY PREVENTING ATTENTION ENTROPY COLLAPSE
- https://arxiv.org/pdf/2303.06296.pdf
### Non-local Neural Networks
- https://arxiv.org/pdf/1711.07971.pdf
### A Tour of Modern Image Filtering
- http://i2pc.es/coss/Docencia/SignalProcessingReviews/Milanfar2013.pdf
### ANTI-OVERSMOOTHING IN DEEP VISION TRANSFORMERS VIA THE FOURIER DOMAIN ANALYSIS: FROM THEORY TO PRACTICE
- https://arxiv.org/pdf/2203.05962.pdf
### Anisotropic Diffusion in Image Processing
- https://www.mia.uni-saarland.de/weickert/Papers/book.pdf
## Dynamical Models for NNs
### NEURAL MECHANICS: SYMMETRY AND BROKEN CONSERVATION LAWS IN DEEP LEARNING DYNAMICS
- Daniel Kunin*, Javier Sagastuy-Brena, Surya Ganguli, Daniel L.K. Yamins, Hidenori Tanaka*
- https://arxiv.org/pdf/2012.04728.pdf
- per-channel norm is more stable during training than parameter weights
- differentiable symmetry (for parameters): if loss doesn't change under differentiable transformation of these parameters
- i.e., $f(\theta) = f(\psi(\theta, \alpha))$ where $\theta \rightarrow \psi(\theta, \alpha)$ is the differentiable transformation
- let $I \in G$ be the identity element of $G$, so that $\psi(\theta,I) = \theta$ for all $\theta$; then this enforces the following structure on the gradient $\nabla f$: $$\langle \nabla f, \partial_\alpha \psi|_{\alpha=I} \rangle = 0$$ i.e., the gradient $\nabla f$ is perpendicular to the generator vector field $\partial_\alpha \psi|_{\alpha=I}$ for all $\theta$ (a similar result holds for the Hessian; a numerical illustration appears after this list)
- they describe constraints or symmetries on gradients/Hessians for common symmetries (translation, scale, rescale)
- GD update: $\theta = \theta - \eta g(\theta)$ is a forward Euler discretization with step size $\eta$ of the ODE (gradient flow) $\frac{d\theta}{d t} = -g(\theta)$ (as $\eta \rightarrow 0$, GD matches the ODE)
- given symmetries, they derive conservation laws throughout training: how the gradient w.r.t. time is constrained
- e.g., for translation symmetry, the sum of parameters is conserved, effectively constraining dynamics to a hyperplane.
- gradient flow is too simple, need to take into account other things!
- weight decay: $$ \frac{d \theta}{d t} = -g(\theta) - \lambda \theta$$
- momentum: $$ (1 - \beta) \frac{d \theta}{d t} = -g(\theta) $$
- stochasticity: via CLT assume $\hat{g}_{\mathcal{B}}(\theta) - g(\theta)$ is a Gaussian r.v. with mean $\mu = 0$ and covariance $\Sigma(\theta)$. NOTE! The gradient of the loss (whether batched or not) is orthogonal to the generator vector field $\partial_\alpha \psi|_{\alpha=I}$, so the noise must observe the same property. For this to hold for arbitrary noise we must have $\Sigma(\theta) \partial_\alpha \psi|_{\alpha=I} = 0$, i.e., the symmetry in NN architectures projects the SGD noise onto low-rank subspaces, leaving the gradient flow dynamics in these directions unchanged. TODO: the interaction between stochasticity and discretization is non-trivial though!
- discretization: can use _modified equation analysis_ (Warming and Hyett, 1974)
- _modified loss_: the divergence between gradient flow and discretized SGD is given by the gradient correction $-\frac{\eta}{2}Hg$, i.e., the negative gradient of the squared gradient norm $\frac{\eta}{4}\|\nabla \mathcal{L}\|^2$. The modified loss is $\tilde{\mathcal{L}} = \mathcal{L} + \frac{\eta}{4}\|\nabla \mathcal{L}\|^2$ and the modified gradient flow ODE is $$\frac{d \theta}{d t} = -g(\theta) - \frac{\eta}{2}H(\theta) g(\theta)$$
- _modified flow_: think of a continuous trajectory $\theta(t)$ that passes through the discrete steps taken by GD, then identify the differential equation that generates this trajectory. Rearranging the GD update $\theta_{t+1} = \theta_t - \eta g(\theta_t)$ and setting $\theta(t) = \theta_t$ and $\theta(t + \eta) = \theta_{t+1}$ gives $-g(\theta_t) = \frac{\theta(t + \eta) - \theta(t)}{\eta}$; Taylor expanding the RHS gives $-g(\theta_t) = \frac{d \theta}{d t} + \frac{\eta}{2} \frac{d^2 \theta}{dt^2} + O(\eta^2)$. As $\eta \rightarrow 0$ we recover gradient flow, but for small finite $\eta$ we get $$\frac{d \theta}{d t} = -g(\theta) - \frac{\eta}{2} \frac{d^2 \theta}{dt^2}$$ This was done by Kovachki \& Stuart, 2019 to get a more realistic continuous model for momentum.
- they get exact learning dynamics for more realistic versions of gradient flow
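A tiny numerical illustration of the orthogonality constraint above for scale symmetry, using a made-up scale-invariant toy loss (not code from the paper): for pure scaling $\psi(\theta,\alpha) = \alpha\theta$ the generator vector field at the identity is $\theta$ itself, so the check is $\langle \nabla f(\theta), \theta \rangle \approx 0$ (which also means $\|\theta\|^2$ is conserved under gradient flow).

```python
import numpy as np

def f(w, x, y):
    """Toy scale-invariant loss: predictions use the normalized weights w / ||w||,
    mimicking a linear layer followed by weight normalization, so f(a * w) = f(w)."""
    u = w / np.linalg.norm(w)
    return 0.5 * np.mean((x @ u - y) ** 2)

def num_grad(fun, w, eps=1e-6):
    """Central-difference numerical gradient of a scalar function."""
    g = np.zeros_like(w)
    for i in range(w.size):
        e = np.zeros_like(w)
        e[i] = eps
        g[i] = (fun(w + e) - fun(w - e)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
w = rng.standard_normal(5)
x = rng.standard_normal((20, 5))
y = rng.standard_normal(20)

grad = num_grad(lambda v: f(v, x, y), w)
print(np.dot(grad, w))   # ~0: the gradient is orthogonal to the scaling generator field
```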
### Noether’s Learning Dynamics: Role of Symmetry Breaking in Neural Networks
- Hidenori Tanaka , Daniel Kunin
- https://arxiv.org/pdf/2105.02716.pdf
- gap between tools in physics and learning dynamics in NNs
- root cause: gradient flow $\frac{d q}{dt} = -\nabla f$ is first-order in time whereas classical mechanics, e.g., Newton's second law $m \frac{d^2q}{dt^2} = - \nabla f$ is second-order in time
- bridging the gap: (discretized) GD with finite learning rate introduces implicit acceleration, making it amenable to the Lagrangian formulation: a variational formulation used to study the role of symmetry in dynamical systems
### Graph Neural Networks as Gradient Flows: understanding graph convolutions via energy
- Francesco Di Giovanni, James Rowbottom, Benjamin P. Chamberlain, Thomas Markovich, Michael M. Bronstein
- https://arxiv.org/pdf/2206.10991.pdf
- leverage fundamental physical idea: particles evolve by minimizing an energy. So! we can study the dynamics of a system through the functional expression of the energy.
- class of differential equations that minimize an energy: _gradient flows_
- 2 ways to understand GNN dynamics
1. start from energy functional
- parameterize energy functional
- take GNN equations to follow direction of steepest descent
- <b>physical interpretation</b>: multi-particle dynamics
- e.g., $\mathcal{E}_\theta(\cdot): \mathbb{R}^{n \times d} \rightarrow \mathbb{R}$ is the energy function of: $\dot{\mathbf{F}}(t) := \frac{d \mathbf{F}(t)}{dt} = -\nabla_{\mathbf{F}} \mathcal{E}_\theta(\mathbf{F}(t))$
2. start from evolution equations (i.e., $\frac{d \theta}{d t}$)
- describe a general GNN update rule that has GCN, GraphSage, and GCNII as special cases
- derive an energy which when differentiated yields the general GNN update rule
- they interpret this energy physically: they separate the interaction matrix into positive- and negative-eigenvalue components and describe these as attraction vs. repulsion terms
- [!] different from Hidenori Tanaka's works, time here refers to GNN layers instead of training epochs
- closed form update for any layer given eigenvectors of $\mathbf{W}$ and $\Delta$ (the graph Laplacian)
- describe oversmoothing as when positive eigenvalues of $\mathbf{W}$ dominate the negative ones
- without residual connections $\mathbf{W}$ is less powerful
- MAIN MESSAGE OF SECTION 4: discrete gradient flows of $\mathcal{E}_\theta$ are equivalent to linear graph convolutions with symmetric weights shared across layers
- MAIN MESSAGE OF SECTION 4: graph convolutions with symmetric weights identify curves along which the energy $\mathcal{E}_\theta$ decreases, i.e., they act as 'approximate' gradient flows (see the sketch after this list)
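A minimal sketch of the point referenced above: take an illustrative quadratic energy $\mathcal{E}(F) = -\tfrac{1}{2}\operatorname{tr}(F^\top \bar{A} F W)$ with symmetric $W$ and symmetric normalized adjacency $\bar{A}$ (same flavor as, but not identical to, the paper's parameterization). Its gradient is $-\bar{A} F W$, so the explicit Euler step of the gradient flow is exactly a residual graph convolution $F \leftarrow F + \tau \bar{A} F W$ with weights shared across steps, and the energy decreases along the iterates.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10, 4

# Symmetric normalized adjacency D^{-1/2} A D^{-1/2} of a random undirected graph (with self-loops)
Adj = (rng.random((n, n)) < 0.3).astype(float)
Adj = np.maximum(Adj, Adj.T)
np.fill_diagonal(Adj, 1.0)
deg = Adj.sum(axis=1)
A_bar = Adj / np.sqrt(np.outer(deg, deg))

W = rng.standard_normal((d, d))
W = 0.5 * (W + W.T)                   # symmetric channel-mixing matrix (shared across layers)
F = rng.standard_normal((n, d))

def energy(F):
    """Illustrative quadratic energy E(F) = -1/2 tr(F^T A_bar F W); grad_F E = -A_bar F W."""
    return -0.5 * np.trace(F.T @ A_bar @ F @ W)

tau = 0.05
for t in range(5):
    F = F + tau * A_bar @ F @ W       # residual symmetric graph convolution = Euler step of the flow
    print(t, energy(F))               # non-increasing for small enough tau
```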
### Charting Flat Minima Using the Conserved Quantities of Gradient Flow
- Bo Zhao, Iordan Ganev, Robin Walters, Rose Yu, Nima Dehmamy
- https://openreview.net/pdf?id=JvcLG3eek70
- 2022
- NeurIPS Workshop
- question: _How are parameter space symmetries of NN architectures related to flat minima?_
- discover a set of continuous symmetries which keep nonlinear NNs invariant under certain conditions
- <b>another reference to Noether's theorem!!!</b>: when a continuous symmetry exists, some quantity will remain unchanged during gradient flow dynamics
- given a group $G$, an action $\cdot$ on parameter space $\Theta$ is a symmetry of the loss $\mathcal{L}$ if it leaves the loss invariant: $\mathcal{L}(g \cdot \theta) = \mathcal{L}(\theta), \;\;\; \forall \theta \in \Theta, g \in G$
- describe a non-linear symmetry for NNs
- the consequence of this is the existence of extended, flat minima
- define: _conserved quantity_ of gradient flow (GF) is a function $Q(\cdot)$ such that for any two time points $s,t$ along a GF trajectory $Q(\theta(s)) = Q(\theta(t))$
- some $Q$ for GF have been discovered <b>(see the paper below; a concrete example is checked numerically after this list)</b>
- they generalize this result to all $M$ in the Lie algebra $\mathfrak{g}$
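A concrete example of such a conserved quantity, checked numerically with plain gradient descent at a small step size (variable names and data are illustrative): for a two-layer linear net $x \mapsto W_2 W_1 x$ trained on squared loss, the rescale symmetry between the layers makes the balancedness matrix $W_1 W_1^\top - W_2^\top W_2$ conserved under gradient flow (this is the result of the 'Layers are Automatically Balanced' paper below); with a finite step size it drifts only on the order of $\eta$.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 6))                 # inputs
Y = rng.standard_normal((50, 3))                 # targets
W1 = 0.1 * rng.standard_normal((4, 6))           # first layer
W2 = 0.1 * rng.standard_normal((3, 4))           # second layer

def balancedness(W1, W2):
    """Candidate conserved quantity of gradient flow: W1 W1^T - W2^T W2."""
    return W1 @ W1.T - W2.T @ W2

Q0 = balancedness(W1, W2)
eta = 1e-3
for _ in range(2000):
    # gradients of L = 1/(2n) * ||X W1^T W2^T - Y||_F^2
    E = X @ W1.T @ W2.T - Y                      # residuals, shape (n, d_out)
    G2 = E.T @ (X @ W1.T) / len(X)               # dL/dW2
    G1 = W2.T @ E.T @ X / len(X)                 # dL/dW1
    W1 -= eta * G1
    W2 -= eta * G2

print(np.linalg.norm(balancedness(W1, W2) - Q0))  # small drift ~O(eta); exactly 0 in the flow limit
```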
### Algorithmic Regularization in Learning Deep Homogeneous Models: Layers are Automatically Balanced
## Learning Symmetries
### Automatic Symmetry Discovery with Lie Algebra Convolutional Network
- Nima Dehmamy, Robin Walters, Yanchen Liu, Dashun Wang, Rose Yu
- https://arxiv.org/pdf/2109.07103.pdf
- introduce L-conv: infinitesimal version of G-conv
- L-conv leads to Noether's theorem and conservation laws
- can optimize Noether current to discover symmetries
- future work: try to incorporate spatial symmetries (e.g., particle physics)
### A tradeoff between universality of equivariant models and learnability of symmetries
- Vasco Portilheiro
- https://arxiv.org/pdf/2210.09444.pdf
- formalizing symmetry learning and providing impossibility results
### Learning Linear Symmetries in Data Using Moment Matching
- Colin Hagemeyer
- https://arxiv.org/pdf/2204.01213.pdf
- April 4th, 2022
- motivation: symmetries are useful for a lot of physics/ML problems (data augmentation, CNNs, identity-preserving transformations)
- goal: learning <b>discrete</b> symmetries
- makes assumptions to restrict symmetries to have eigenvalues that are either 1 or -1
### AugNet: End-to-End Unsupervised Visual Representation Learning with Image Augmentation
- Mingxiang Chen, Zhanguo Chang, Haonan Lu, Bitao Yang, Zhuang Li, Liufang Guo, Zhecheng Wang
- https://arxiv.org/pdf/2106.06250.pdf
- June 11th, 2021
- learn low-dimensional embeddings to encode semantic similarity
- they use 8 augmentation methods to make the embedding invariant to these augmentations
## Surveys
### Geometric Deep Learning Grids, Groups, Graphs, Geodesics, and Gauges
## Meeting Notes