## Paper TODO
#### Gbetondji
<!-- - cat plots for 2 eigenvalue conditions -->
<!-- - table of 2 eigenvalue conditions for all inputs + all layers -->
<!-- - mimetic initialization -->
<!-- - $|| HFC(X_l) ||_2 / || LFC(X_l) ||_2$ (y-axis) vs. each layer l (x-axis) in each of the models (can plot all models in the same plot in different colors) -->
<!-- - add asymmetry of H and A plots -->
- show that ViT- outperforms ViT and ViT+ as the number of layers is increased
<!-- - flatness measure -->
<!-- - Figure 1 increase norm a bit, right now neighboring images look very similar -->
<!-- - send me all individual photos in Figure 1, will modify the figure to add more detail and other details -->
<!-- - Figure 2 increase font sizes, change legends to remove parentheses and text inside parentheses -->
<!-- - Figure 4 increase font sizes -->
- describing the tables / plots
- describing the experimental setups
#### Matt
<!-- - related work -->
<!-- - limitations -->
<!-- - conclusion -->
MATT TODO
- modify Figure 1
<!-- - show how without residual connection, we get rank collapse, an alternative proof to \cite{dong2021attention}. describe how this is a generalization of work below! (look at work again to make sure it has fixed weights) -->
- talk about pitfalls of prior theoretical analyses (e.g., anti-oversmoothing statements; the work below assuming $W_V W_{proj} = I$ (2nd bullet point); when do ViTs work)
- talk about how other work can be explained by this work (e.g., mimetic initialization; small eigenspectrum (third bullet point below, successful pretrained models))
RELATED:
- SIDENOTE: this related work analyzes eigenvalues of self-attention to motivate a contrastive-learning solution to oversmoothing, a new layer called ContraNorm: \url{https://arxiv.org/pdf/2303.06562.pdf}. Here they define `effective rank', which measures how degenerate (close to vanishing) the distribution of singular values is. Also vaguely related, this paper focuses on the dynamics of contrastive learning: \url{https://arxiv.org/pdf/2303.04435.pdf}
- another related paper which shows that with $W_V W_{proj} = I$ and no residual connections we get rank deficiency: \url{https://arxiv.org/pdf/2306.01610.pdf}
- another fix which divides all weights by their spectral norm: \url{https://proceedings.mlr.press/v202/zhai23a/zhai23a.pdf}. This forces the eigenspectrum to be small
- double-check all proofs
- full pass through paper
## new directions
#### Gbetondji
- doubtful about weight-sharing
- success has to do with domain
- might not do well on images, e.g., Sinkformer image results aren't very convincing (cat vs. dog); best results are on point clouds
- stability
- overlap between stability and normalization
- without normalization, models don't work, even with $(1-\tau)$
- theory from graph NNs
- need new architecture
- prefer to be between theory and practice, not develop new strong theoretical results
- doubly stochastic attention: seems to work okay (in the Sinkformer paper)
- HOW DO VISION TRANSFORMERS WORK? (https://arxiv.org/pdf/2202.06709.pdf)
- ViT low-pass, CNNs high-pass
- ViT loss is smoother and flatter but not convex; CNN loss is closer to convex but not as flat
- flatness: seems related to generalization
- convexity: related to data efficiency
- avg pooling: works as a low-pass filter, balances the high-pass filter bias, and flattens the landscape (see the quick check after this list)
- what's the equivalent for ViT (i.e., a high-pass filter)? this is what the anti-oversmoothing paper does: upweight high frequencies
- best possible architecture:
- flat and convex
- above paper: non-principled ViT + CNN layers
- can we use gradient flow ideas to make a block that balances high and low pass
- i.e., GNNs as Gradient Flow
- experiments:
- visualize loss landscape for different eigenvalue choices: both flat and convex
- similar to HOW DO VISION TRANSFORMERS WORK?
- convexity: Hessian experiments
- data efficiency:
- small datasets CNNs outperform ViT
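A quick numerical check of the "avg pooling works as a low-pass filter" point above; this is a purely illustrative sketch in which a 1-D moving average stands in for pooling:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(256)          # white-noise signal: roughly flat spectrum

# 1-D "average pooling" as a stride-1 moving-average filter of width 4
kernel = np.ones(4) / 4
y = np.convolve(x, kernel, mode="same")

# Compare spectral energy in the upper vs. lower half of the frequency range
X_hat, Y_hat = np.abs(np.fft.rfft(x)), np.abs(np.fft.rfft(y))
half = len(X_hat) // 2
print("input  high/low energy ratio:", (X_hat[half:] ** 2).sum() / (X_hat[:half] ** 2).sum())
print("pooled high/low energy ratio:", (Y_hat[half:] ** 2).sum() / (Y_hat[:half] ** 2).sum())
# The ratio drops sharply after averaging: high frequencies are attenuated (low-pass behavior).
```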
## new idea
Consider the doubly-stochastic transformer:
$X^{(t+1)} = X^{(t)} + A X^{(t)} W$
where $A$ is doubly stochastic (i.e., its rows and columns each sum to 1) and symmetric, and $W$ is symmetric. We can parameterize both $A$ and $W$ as follows:
$W = L^\top L$
$A = n_r(n_c( \cdots n_r(n_c(\exp(X^\top V^\top V X))) \cdots ))$
where $n_r(\cdot)$ and $n_c(\cdot)$ mean row and column normalization. This is Sinkhorn's algorithm, described in more detail here: https://proceedings.mlr.press/v151/sander22a/sander22a.pdf
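A minimal NumPy sketch of one such layer (an illustrative, assumption-laden sketch, not the Sinkformer implementation): token features are taken row-wise, $X \in \mathbb{R}^{n \times d}$, so the score matrix becomes $(XV^\top)(XV^\top)^\top$; Sinkhorn is truncated to a fixed number of iterations; and `sinkhorn` / `dst_layer` are made-up names.

```python
import numpy as np

def sinkhorn(S, n_iters=20):
    """Alternate row and column normalization (Sinkhorn's algorithm) to map
    exp(S) to an (approximately) doubly stochastic matrix."""
    A = np.exp(S - S.max())                      # positive entries; the shift is only for numerical stability
    for _ in range(n_iters):
        A = A / A.sum(axis=1, keepdims=True)     # n_r: rows sum to 1
        A = A / A.sum(axis=0, keepdims=True)     # n_c: columns sum to 1
    return A

def dst_layer(X, V, L, n_iters=20):
    """One doubly-stochastic transformer step X <- X + A X W,
    with A = Sinkhorn(exp of symmetric scores) and W = L^T L (symmetric)."""
    S = X @ V.T @ V @ X.T                        # (n, n), symmetric score matrix
    A = sinkhorn(S, n_iters)
    W = L.T @ L                                  # (d, d), symmetric weight matrix
    return X + A @ X @ W

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 4))                  # n = 8 tokens, d = 4 channels
V = rng.standard_normal((4, 4))
L = 0.1 * rng.standard_normal((4, 4))
print(dst_layer(X, V, L).shape)                  # (8, 4)
```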
This transformer has some very nice properties:
- eigenvalues of $A$ are in $[-1,1]$ (a short argument is sketched after this list):
- (upper) https://res.mdpi.com/d_attachment/symmetry/symmetry-12-00369/article_deploy/symmetry-12-00369-v2.pdf
- (both, need better source) https://mathoverflow.net/questions/239951/lower-bound-for-smallest-eigenvalue-of-symmetric-doubly-stochastic-metropolis-ha
- the above property means we can have our own version of Theorem 5.1 from GNNs as Gradient Flows: https://arxiv.org/pdf/2206.10991.pdf
- this means we have conditions on eigenvalues of W to avoid oversmoothing
- and can prove an analogue of Theorem 1 from GRAND: https://arxiv.org/pdf/2106.10934.pdf (just require that the maximum eigenvalue of $W$ is $1/\lambda_{A,\max}$, and weight $X^{(t)}$ by $\alpha$ and $A X^{(t)} W$ by $1-\alpha$ for $\alpha \in (0,1)$, which shouldn't change the analysis of Theorem 5.1)
- (maybe) can make an extension to sheaves
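Since the note above asks for a better source for the eigenvalue bound, here is a short self-contained sketch of the standard argument:

```latex
% Claim: a symmetric doubly-stochastic matrix A has all of its eigenvalues in [-1, 1].
% Symmetry gives real eigenvalues; (row-)stochasticity bounds the spectral radius.
Let $A v = \lambda v$ with $v \neq 0$, and pick $i$ such that $|v_i| = \max_j |v_j|$. Then
\[
  |\lambda|\,|v_i| = \Big|\sum_j A_{ij} v_j\Big| \le \sum_j A_{ij} |v_j| \le |v_i| \sum_j A_{ij} = |v_i|,
\]
using $A_{ij} \ge 0$ and row sums equal to $1$, so $|\lambda| \le 1$. Symmetry of $A$ makes every
$\lambda$ real, hence $\lambda \in [-1, 1]$; the value $\lambda = 1$ is attained since $A \mathbf{1} = \mathbf{1}$.
```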
## meeting notes
### 27/06/23
## derivation
$$ \frac{\partial \mathbf{x}(t)}{\partial t} = \mbox{div}[\frac{1}{g(\mathbf{x}(t),t)}\nabla\mathbf{x}(t)]$$
$$g(\mathbf{x}(t),t) := k(x(t),x(t')) = \exp(\phi(x(t))^\top\phi(x(t'))) $$
$$ \frac{\partial \mathbf{x}(t)}{\partial t} = \nabla\mathbf{K}^*\nabla\mathbf{x}(t)$$
$$y_i = \sum_{j \in N_i} g_\theta(x_j - x_i) (x_j-x_i)$$
$$div(g \nabla x)$$
$$TV(X) = \sum_{i}\sum_{j\in N_i} |x_i-x_j|$$
$$\dot{x} = div(\frac{\nabla x}{\|\nabla x \|})$$
- vector calculus background: https://en.wikipedia.org/wiki/Vector_calculus_identities
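A minimal discrete sketch of the neighborhood update $y_i = \sum_{j \in N_i} g_\theta(x_j - x_i)(x_j - x_i)$ above, using an explicit Euler step and a Perona-Malik-style edge-stopping function for $g$; the choice of $g$, the step size, and all names are illustrative assumptions:

```python
import numpy as np

def conductance(diff, kappa=1.0):
    """Perona-Malik-style edge-stopping function: close to 1 for small |diff|,
    close to 0 for large |diff|, so diffusion is suppressed across sharp edges."""
    return np.exp(-(diff / kappa) ** 2)

def diffusion_step(x, neighbors, tau=0.1, kappa=1.0):
    """Explicit Euler step of x_i <- x_i + tau * sum_{j in N_i} g(x_j - x_i) (x_j - x_i)."""
    y = np.zeros_like(x)
    for i, nbrs in enumerate(neighbors):
        for j in nbrs:
            d = x[j] - x[i]
            y[i] += conductance(d, kappa) * d
    return x + tau * y

# 1-D signal with a jump, on a chain graph
x = np.array([0.0, 0.1, 0.0, 1.0, 0.9, 1.0])
neighbors = [[1], [0, 2], [1, 3], [2, 4], [3, 5], [4]]
for _ in range(50):
    x = diffusion_step(x, neighbors, tau=0.2, kappa=0.5)
print(x)   # small bumps inside each plateau are smoothed; the jump is largely preserved
```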
## papers to read
READ!!!!:
### Semantic Diffusion Network for Semantic Segmentation
- https://proceedings.neurips.cc/paper_files/paper/2022/file/396446770f5e8496ca1feb02079d4fb7-Paper-Conference.pdf
### IMAGE SELECTIVE SMOOTHING AND EDGE DETECTION BY NONLINEAR DIFFUSION
- https://epubs.siam.org/doi/pdf/10.1137/0729012?casa_token=27EEtqsq9nAAAAAA:FY_KstyZAMyWeFQr624t3VBG89QU10t0v0KHF7dXt1nc6NzyjXhQ63fJSodwoZzOveG_rBnalF8
### Scale-Space and Edge Detection Using Anisotropic Diffusion
- https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=56205&casa_token=x8kggJ_mwH8AAAAA:q_qPlNaLPj-bArP6_EWQjpNHNevGMscMglQkgr06D5GfEyCJdxrivFARjRw8Tzg5Xd_0IvnaIg&tag=1
### GREAD: Graph Neural Reaction-Diffusion Networks
- https://arxiv.org/pdf/2211.14208.pdf
### STABILIZING TRANSFORMER TRAINING BY PREVENTING ATTENTION ENTROPY COLLAPSE
- https://arxiv.org/pdf/2303.06296.pdf
### Non-local Neural Networks
- https://arxiv.org/pdf/1711.07971.pdf
### A Tour of Modern Image Filtering
- http://i2pc.es/coss/Docencia/SignalProcessingReviews/Milanfar2013.pdf
### ANTI-OVERSMOOTHING IN DEEP VISION TRANSFORMERS VIA THE FOURIER DOMAIN ANALYSIS: FROM THEORY TO PRACTICE
- https://arxiv.org/pdf/2203.05962.pdf
### Anisotropic Diffusion in Image Processing
- https://www.mia.uni-saarland.de/weickert/Papers/book.pdf
## Dynamical Models for NNs
### NEURAL MECHANICS: SYMMETRY AND BROKEN CONSERVATION LAWS IN DEEP LEARNING DYNAMICS
- Daniel Kunin*, Javier Sagastuy-Brena, Surya Ganguli, Daniel L.K. Yamins, Hidenori Tanaka*
- https://arxiv.org/pdf/2012.04728.pdf
- per-channel norm is more stable during training than parameter weights
- differentiable symmetry (for parameters): if loss doesn't change under differentiable transformation of these parameters
- i.e., $f(\theta) = f(\psi(\theta, \alpha))$ where $\theta \rightarrow \psi(\theta, \alpha)$ is the differentiable transformation
- let $I \in G$ be the identity element of $G$, so that $\psi(\theta,I) = \theta$ for all $\theta$; then this enforces the following structure on the gradient $\nabla f$: $$\langle \nabla f, \partial_\alpha \psi|_{\alpha=I} \rangle = 0$$ i.e., the gradient $\nabla f$ is perpendicular to the generator vector field $\partial_\alpha \psi|_{\alpha=I}$ for all $\theta$ (a similar result holds for the Hessian; a numerical illustration appears after this list)
- they describe constraints or symmetries on gradients/Hessians for common symmetries (translation, scale, rescale)
- GD update: $\theta = \theta - \eta g(\theta)$ is a forward Euler discretization with step size $\eta$ of the ODE (gradient flow) $\frac{d\theta}{d t} = -g(\theta)$ (as $\eta \rightarrow 0$, GD matches the ODE)
- given symmetries, they derive conservation laws throughout training: how the gradient w.r.t. time is constrained
- e.g., for translation symmetry, the sum of parameters is conserved, effectively constraining dynamics to a hyperplane.
- gradient flow is too simple, need to take into account other things!
- weight decay: $$ \frac{d \theta}{d t} = -g(\theta) - \lambda \theta$$
- momentum: $$ (1 - \beta) \frac{d \theta}{d t} = -g(\theta) $$
- stochasticity: via CLT assume $\hat{g}_{\mathcal{B}}(\theta) - g(\theta)$ is a Gaussian r.v. with mean $\mu = 0$ and covariance $\Sigma(\theta)$. NOTE! The gradient of the loss (whether batched or not) is orthogonal to the generator vector field $\partial_\alpha \psi|_{\alpha=I}$, so the noise must observe the same property. For this to hold for arbitrary noise we must have $\Sigma(\theta) \partial_\alpha \psi|_{\alpha=I} = 0$, i.e., the symmetry in NN architectures projects the SGD noise onto low-rank subspaces, leaving the gradient flow dynamics in these directions unchanged. TODO: the interaction between stochasticity and discretization is non-trivial though!
- discretization: can use _modified equation analysis_ (Warming and Hyett, 1974)
- _modified loss_: the divergence between gradient flow and discretized SGD is given by the gradient correction $-\frac{\eta}{2}Hg$, i.e., the negative gradient of the squared gradient norm $\frac{\eta}{4}\|\nabla \mathcal{L}\|^2$. The modified loss is $\tilde{\mathcal{L}} = \mathcal{L} + \frac{\eta}{4}\|\nabla \mathcal{L}\|^2$ and the modified gradient flow ODE is $$\frac{d \theta}{d t} = -g(\theta) - \frac{\eta}{2}H(\theta) g(\theta)$$
- _modified flow_: think of a continuous trajectory $\theta(t)$ that passes through the discrete steps taken by GD, then identify the differential equation that generates this trajectory. Rearranging the GD update $\theta_{t+1} = \theta_t - \eta g(\theta_t)$ and setting $\theta(t) = \theta_t$ and $\theta(t + \eta) = \theta_{t+1}$ gives $-g(\theta_t) = \frac{\theta(t + \eta) - \theta(t)}{\eta}$; Taylor expanding the RHS gives $-g(\theta_t) = \frac{d \theta}{d t} + \frac{\eta}{2} \frac{d^2 \theta}{dt^2} + O(\eta^2)$. As $\eta \rightarrow 0$ we recover gradient flow, but for small finite $\eta$ we get $$\frac{d \theta}{d t} = -g(\theta) - \frac{\eta}{2} \frac{d^2 \theta}{dt^2}$$ This was done by Kovachki \& Stuart, 2019 to get a more realistic continuous model for momentum.
- they get exact learning dynamics for more realistic versions of gradient flow
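A tiny numerical illustration of the orthogonality constraint above for scale symmetry, using a made-up scale-invariant toy loss (not code from the paper): for pure scaling $\psi(\theta,\alpha) = \alpha\theta$ the generator vector field at the identity is $\theta$ itself, so the check is $\langle \nabla f(\theta), \theta \rangle \approx 0$ (which also means $\|\theta\|^2$ is conserved under gradient flow).

```python
import numpy as np

def f(w, x, y):
    """Toy scale-invariant loss: predictions use the normalized weights w / ||w||,
    mimicking a linear layer followed by weight normalization, so f(a * w) = f(w)."""
    u = w / np.linalg.norm(w)
    return 0.5 * np.mean((x @ u - y) ** 2)

def num_grad(fun, w, eps=1e-6):
    """Central-difference numerical gradient of a scalar function."""
    g = np.zeros_like(w)
    for i in range(w.size):
        e = np.zeros_like(w)
        e[i] = eps
        g[i] = (fun(w + e) - fun(w - e)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
w = rng.standard_normal(5)
x = rng.standard_normal((20, 5))
y = rng.standard_normal(20)

grad = num_grad(lambda v: f(v, x, y), w)
print(np.dot(grad, w))   # ~0: the gradient is orthogonal to the scaling generator field
```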
### Noether’s Learning Dynamics: Role of Symmetry Breaking in Neural Networks
- Hidenori Tanaka , Daniel Kunin
- https://arxiv.org/pdf/2105.02716.pdf
- gap between tools in physics and learning dynamics in NNs
- root cause: gradient flow $\frac{d q}{dt} = -\nabla f$ is first-order in time whereas classical mechanics, e.g., Newton's second law $m \frac{d^2q}{dt^2} = - \nabla f$ is second-order in time
- bridging the gap: (discretized) GD with finite learning rate introduces implicit acceleration, making it amenable to the Lagrangian formulation: a variational formulation used to study the role of symmetry in dynamical systems
### Graph Neural Networks as Gradient Flows: understanding graph convolutions via energy
- Francesco Di Giovanni, James Rowbottom, Benjamin P. Chamberlain, Thomas Markovich, Michael M. Bronstein
- https://arxiv.org/pdf/2206.10991.pdf
- leverage fundamental physical idea: particles evolve by minimizing an energy. So! we can study the dynamics of a system through the functional expression of the energy.
- class of differential equations that minimize an energy: _gradient flows_
- 2 ways to understand GNN dynamics
1. start from energy functional
- parameterize energy functional
- take GNN equations to follow direction of steepest descent
- <b>physical interpretation</b>: multi-particle dynamics
- e.g., $\mathcal{E}_\theta(\cdot): \mathbb{R}^{n \times d} \rightarrow \mathbb{R}$ is the energy function of: $\dot{\mathbf{F}}(t) := \frac{d \mathbf{F}(t)}{dt} = -\nabla_{\mathbf{F}} \mathcal{E}_\theta(\mathbf{F}(t))$
2. start from evolution equations (i.e., $\frac{d \theta}{d t}$)
- describe a general GNN update rule that has GCN, GraphSage, and GCNII as special cases
- derive an energy which when differentiated yields the general GNN update rule
- they interpret this energy physically: they separate the interaction matrix into positive- and negative-eigenvalue components and describe these as attraction vs. repulsion terms
- [!] different from Hidenori Tanaka's works, time here refers to GNN layers instead of training epochs
- closed form update for any layer given eigenvectors of $\mathbf{W}$ and $\Delta$ (the graph Laplacian)
- describe oversmoothing as when positive eigenvalues of $\mathbf{W}$ dominate the negative ones
- without residual connections $\mathbf{W}$ is less powerful
- MAIN MESSAGE OF SECTION 4: discrete gradient flows of $\mathcal{E}_\theta$ are equivalent to linear graph convolutions with symmetric weights shared across layers
- MAIN MESSAGE OF SECTION 4: graph convolutions with symmetric weights identify curves along which the energy $\mathcal{E}_\theta$ decreases, i.e., they act as 'approximate' gradient flows (see the sketch after this list)
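A minimal sketch of the point referenced above: take an illustrative quadratic energy $\mathcal{E}(F) = -\tfrac{1}{2}\operatorname{tr}(F^\top \bar{A} F W)$ with symmetric $W$ and symmetric normalized adjacency $\bar{A}$ (same flavor as, but not identical to, the paper's parameterization). Its gradient is $-\bar{A} F W$, so the explicit Euler step of the gradient flow is exactly a residual graph convolution $F \leftarrow F + \tau \bar{A} F W$ with weights shared across steps, and the energy decreases along the iterates.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10, 4

# Symmetric normalized adjacency D^{-1/2} A D^{-1/2} of a random undirected graph (with self-loops)
Adj = (rng.random((n, n)) < 0.3).astype(float)
Adj = np.maximum(Adj, Adj.T)
np.fill_diagonal(Adj, 1.0)
deg = Adj.sum(axis=1)
A_bar = Adj / np.sqrt(np.outer(deg, deg))

W = rng.standard_normal((d, d))
W = 0.5 * (W + W.T)                   # symmetric channel-mixing matrix (shared across layers)
F = rng.standard_normal((n, d))

def energy(F):
    """Illustrative quadratic energy E(F) = -1/2 tr(F^T A_bar F W); grad_F E = -A_bar F W."""
    return -0.5 * np.trace(F.T @ A_bar @ F @ W)

tau = 0.05
for t in range(5):
    F = F + tau * A_bar @ F @ W       # residual symmetric graph convolution = Euler step of the flow
    print(t, energy(F))               # non-increasing for small enough tau
```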
### Charting Flat Minima Using the Conserved Quantities of Gradient Flow
- Bo Zhao, Iordan Ganev, Robin Walters, Rose Yu, Nima Dehmamy
- https://openreview.net/pdf?id=JvcLG3eek70
- 2022
- NeurIPS Workshop
- question: _How are parameter space symmetries of NN architectures related to flat minima?_
- discover a set of continuous symmetries which keep nonlinear NNs invariant under certain conditions
- <b>another reference to Noether's theorem!!!</b>: when a continuous symmetry exists, some quantity will remain unchanged during gradient flow dynamics
- given a group $G$, an action $\cdot$ on parameter space $\Theta$ is a symmetry of the loss $\mathcal{L}$ if it leaves the loss invariant: $\mathcal{L}(g \cdot \theta) = \mathcal{L}(\theta), \;\;\; \forall \theta \in \Theta, g \in G$
- describe a non-linear symmetry for NNs
- the consequence of this is the existence of extended, flat minima
- define: _conserved quantity_ of gradient flow (GF) is a function $Q(\cdot)$ such that for any two time points $s,t$ along a GF trajectory $Q(\theta(s)) = Q(\theta(t))$
- some $Q$ for GF have been discovered <b>(see the paper below; a concrete example is checked numerically after this list)</b>
- they generalize this result to all $M$ in the Lie algebra $\mathfrak{g}$
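A concrete example of such a conserved quantity, checked numerically with plain gradient descent at a small step size (variable names and data are illustrative): for a two-layer linear net $x \mapsto W_2 W_1 x$ trained on squared loss, the rescale symmetry between the layers makes the balancedness matrix $W_1 W_1^\top - W_2^\top W_2$ conserved under gradient flow (this is the result of the 'Layers are Automatically Balanced' paper below); with a finite step size it drifts only on the order of $\eta$.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 6))                 # inputs
Y = rng.standard_normal((50, 3))                 # targets
W1 = 0.1 * rng.standard_normal((4, 6))           # first layer
W2 = 0.1 * rng.standard_normal((3, 4))           # second layer

def balancedness(W1, W2):
    """Candidate conserved quantity of gradient flow: W1 W1^T - W2^T W2."""
    return W1 @ W1.T - W2.T @ W2

Q0 = balancedness(W1, W2)
eta = 1e-3
for _ in range(2000):
    # gradients of L = 1/(2n) * ||X W1^T W2^T - Y||_F^2
    E = X @ W1.T @ W2.T - Y                      # residuals, shape (n, d_out)
    G2 = E.T @ (X @ W1.T) / len(X)               # dL/dW2
    G1 = W2.T @ E.T @ X / len(X)                 # dL/dW1
    W1 -= eta * G1
    W2 -= eta * G2

print(np.linalg.norm(balancedness(W1, W2) - Q0))  # small drift ~O(eta); exactly 0 in the flow limit
```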
### Algorithmic Regularization in Learning Deep Homogeneous Models: Layers are Automatically Balanced
## Learning Symmetries
### Automatic Symmetry Discovery with Lie Algebra Convolutional Network
- Nima Dehmamy, Robin Walters, Yanchen Liu, Dashun Wang, Rose Yu
- https://arxiv.org/pdf/2109.07103.pdf
- introduce L-conv: infinitesimal version of G-conv
- L-conv leads to Noether's theorem and conservation laws
- can optimize Noether current to discover symmetries
- future work: try to incorporate spatial symmetries (e.g., particle physics)
### A tradeoff between universality of equivariant models and learnability of symmetries
- Vasco Portilheiro
- https://arxiv.org/pdf/2210.09444.pdf
- formalizing symmetry learning and providing impossibility results
### Learning Linear Symmetries in Data Using Moment Matching
- Colin Hagemeyer
- https://arxiv.org/pdf/2204.01213.pdf
- April 4th, 2022
- motivation: symmetries are useful for a lot of physics/ML problems (data augmentation, CNNs, identity-preserving transformations)
- goal: learning <b>discrete</b> symmetries
- makes assumptions to restrict symmetries to have eigenvalues that are either 1 or -1
### AugNet: End-to-End Unsupervised Visual Representation Learning with Image Augmentation
- Mingxiang Chen, Zhanguo Chang, Haonan Lu, Bitao Yang, Zhuang Li, Liufang Guo, Zhecheng Wang
- https://arxiv.org/pdf/2106.06250.pdf
- June 11th, 2021
- learn low-dimensional embeddings to encode semantic similarity
- they use 8 augmentation methods to make the embedding invariant to these augmentations
## Surveys
### Geometric Deep Learning Grids, Groups, Graphs, Geodesics, and Gauges
## Meeting Notes