# NeurIPS 2021: Nikos' notes
## 1. ML for Physics and Physics for ML
### <ins>Shirley overview</ins>
| ML for Physics | Physics for ML |
| -------- | -------- |
| Accelerate simulations for known physics | Information theory, stat. physics as an intuition for models |
| Assist scientific discovery (interesting work on representing quantum states with a NN) | Enforce symmetries on models |
| Models that are physically interpretable | |
### <ins>Miles inductive bias</ins>
* (past) The Ising model has been a fruitful starting point for more complicated dynamical systems: Hopfield networks, models of human memory.
* (present) 3 types of inductive bias:
* Energy preservation (Energy based models, Hamiltonian NNs)
* Geometry (CNNs: translationally equivariant, group equivariant CNNs, GNNs: permutation equivariant)
    * Differential equations (Neural ODEs; RNNs can be viewed as an Euler discretization of an ODE (?); see the sketch below)
**Soft vs. hard inductive bias:** interestingly, data augmentation can be viewed as a soft inductive bias on our ML pipeline.
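A minimal sketch of the ODE point above (my own, not from the talk): a residual-style update $x_{t+1} = x_t + h\,f_\theta(x_t)$ is literally an explicit Euler step for $\dot{x} = f_\theta(x)$. The layer `f_theta` below is a hypothetical stand-in for a learned map.

```python
import numpy as np

def f_theta(x, W, b):
    """Hypothetical stand-in for a learned layer: a single tanh map."""
    return np.tanh(W @ x + b)

def euler_rollout(x0, W, b, step=0.1, n_steps=10):
    """Explicit Euler discretization of dx/dt = f_theta(x).

    Each update x <- x + step * f_theta(x) has the form of a residual/recurrent
    block, which is the sense in which such nets discretize an ODE."""
    x = x0.copy()
    trajectory = [x.copy()]
    for _ in range(n_steps):
        x = x + step * f_theta(x, W, b)
        trajectory.append(x.copy())
    return np.stack(trajectory)

rng = np.random.default_rng(0)
d = 4
W, b = rng.normal(size=(d, d)) / np.sqrt(d), np.zeros(d)
traj = euler_rollout(rng.normal(size=d), W, b)
print(traj.shape)  # (n_steps + 1, d): the discretized trajectory
```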
### <ins>The Problem with Deep Learning for Physics (and how to fix it)</ins>
Problem: lack of interpretability
Remedy: symbolic regression (when ML is used for physics)
*comment: I liked Miles' argument on how mathematical equations are the language for physical laws, and hence it makes sense to use symbolic regression to discover the equations that ML models compute (when applied to physical sciences). Yet, I still think that researching interpretability is like giving up on (fully) mathematically describing and understanding the behavior and properties of DL. + Welling's comment*
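To make the symbolic-regression remedy concrete, here is a toy sketch I wrote afterwards (not from the talk, which as far as I remember relies on genetic-programming-style search over expressions): distill a black-box predictor into the best-fitting formula from a small hand-written library of candidate forms. `black_box` is a hypothetical stand-in for a trained network.

```python
import numpy as np

# Hypothetical stand-in for a trained model that has learned an inverse-square law.
def black_box(r):
    noise = 0.01 * np.random.default_rng(1).normal(size=r.shape)
    return 1.0 / r**2 + noise

# Tiny hand-written library of candidate symbolic forms (each linear in one constant c);
# real symbolic regression searches a much larger expression space.
candidates = {
    "c / r":      lambda r: 1.0 / r,
    "c / r**2":   lambda r: 1.0 / r**2,
    "c * r":      lambda r: r,
    "c * log(r)": lambda r: np.log(r),
}

r = np.linspace(0.5, 5.0, 200)
y = black_box(r)          # query the model we want to "interpret"

best = None
for name, basis_fn in candidates.items():
    basis = basis_fn(r)
    c = (basis @ y) / (basis @ basis)        # least-squares fit of the constant c
    mse = np.mean((y - c * basis) ** 2)
    if best is None or mse < best[2]:
        best = (name, c, mse)

print(f"recovered law: {best[0]} with c ≈ {best[1]:.3f} (mse {best[2]:.2e})")
```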
## 2. Benign overfitting
## 3. Universal Law of Robustness (Mark Sellke)
Setup: $$y_i = g(x_i) + z_i, \; \; \mathrm{Var}[z_i] = \sigma^2$$
Goal no. 1: learn a memorizer of the labels $y_i$.
* Perfect memorizer $f: f(x_i) = y_i$
* Partial memorizer $f$: $\sum_{i} (f(x_i) - y_i)^2 \leq \frac{1}{2} \sum_i (g(x_i) - y_i)^2 = \frac{1}{2} \sum_i z_i^2$. Interpretation: $f$ fits the labels much "better" than the actual signal $g$ does.
Goal no. 2: learn a **robust** partial memorizer, i.e. a partial memorizer with a small Lipschitz constant.
Main assertion: to do so, you need on the order of $nd$ parameters. More specifically, if $\mathcal{F}_p$ is a parameterized function class with $p$ parameters, then any partial memorizer $f \in \mathcal{F}_p$ must satisfy $$\mathrm{Lip}(f) \gtrsim \sigma \sqrt{\frac{nd}{p}}$$
See [here](https://www.youtube.com/watch?v=__Kj9HeU5tE) for a great exposition of the theorem + proof by Ronen Eldan.
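To get a feel for the scale, a quick numerical reading of the bound (my own numbers, not from the talk): with $n = 10^6$ samples in dimension $d = 10^4$ and $\sigma = 1$, an $O(1)$-Lipschitz partial memorizer only becomes possible once $p$ approaches $nd$.

```python
import numpy as np

def lipschitz_lower_bound(n, d, p, sigma=1.0):
    """The sigma * sqrt(n*d/p) lower bound from the universal law of robustness:
    any p-parameter model fitting noisy labels below the noise level must have
    at least this Lipschitz constant (up to constants)."""
    return sigma * np.sqrt(n * d / p)

n, d, sigma = 10**6, 10**4, 1.0
for p in [n, 10 * n, 100 * n, n * d]:
    print(f"p = {p:.1e}  ->  Lip(f) ≳ {lipschitz_lower_bound(n, d, p, sigma):.1f}")
```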
## 4. Do we know how to estimate the mean? (Gabor Lugosi)
Two mean estimators (on the real line) with good rates (sub-Gaussian deviation bounds); a minimal implementation of both follows below:
* Median of means: group the points into $k$ blocks, compute each block's mean, and take the median of those block means as your estimator.
* Trimmed mean! Trim the outliers and take the mean of the remaining points (an old idea in robust statistics).
Remark: these one-dimensional results were already known; the talk discussed recent results for high-dimensional data.
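A minimal 1-D sketch of both estimators (my own; the block count and trimming fraction are tuning knobs that, in the theory, depend on the desired confidence level):

```python
import numpy as np

def median_of_means(x, k=10):
    """Split the sample into k blocks, average each block,
    and return the median of the block means."""
    rng = np.random.default_rng(0)
    blocks = np.array_split(rng.permutation(x), k)   # random blocks
    return np.median([b.mean() for b in blocks])

def trimmed_mean(x, trim=0.05):
    """Discard the lowest and highest `trim` fraction of the points,
    then average the rest (the classic robust-statistics recipe)."""
    lo, hi = np.quantile(x, [trim, 1 - trim])
    return x[(x >= lo) & (x <= hi)].mean()

# Heavy-tailed sample (Student-t with 2 degrees of freedom, true mean 0):
# the plain empirical mean is fragile here, the robust estimators much less so.
rng = np.random.default_rng(1)
x = rng.standard_t(df=2, size=10_000)
print("empirical mean :", x.mean())
print("median of means:", median_of_means(x))
print("trimmed mean   :", trimmed_mean(x))
```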
## 5. When/why is deep learning easy? (my title) (Shalev-Shwartz)
Deep learning is computationally hard in the worst case, yet it works in practice:
1. Some distributions must be easy to learn.
    Motivating example: linear classifiers on 3x3 patches of images yield non-trivial results (better than random guessing).
    * $\log n$-parity problem: learn the parity (product) of an unknown subset of about $\log n$ of the $\pm 1$ bits. Under the unbiased (uniform) distribution this takes super-polynomial time; under a biased distribution it becomes polynomial, the key being **local correlation** (see the toy sketch after this list).
    * For a function to be (efficiently) learnable, shallow networks must already be able to weakly approximate it.
    * $\log n$-parities are learnable with neural nets but cannot be learned by any kernel method (Daniely & Malach), a criticism of the NTK viewpoint.
2. The problem is actually low-dimensional (so the worst-case complexity implications do not kick in).
* Example with conv nets.
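A toy sketch of the local-correlation point in the parity bullet above (my own, illustrating the folklore computation rather than the talk's construction): under the uniform distribution every single bit has zero correlation with the parity label, while under a biased product distribution the correlation is non-zero, which is exactly the kind of signal gradient-based learners can latch onto.

```python
import numpy as np

rng = np.random.default_rng(0)
n_bits, n_samples = 20, 200_000
subset = rng.choice(n_bits, size=5, replace=False)   # the secret parity set (|S| ~ log n)

def sample_parity(bias):
    """Draw ±1 bits i.i.d. with P(x_j = +1) = bias; the label is the product over the secret subset."""
    x = np.where(rng.random((n_samples, n_bits)) < bias, 1.0, -1.0)
    y = np.prod(x[:, subset], axis=1)
    return x, y

for bias in (0.5, 0.8):                               # unbiased vs biased bits
    x, y = sample_parity(bias)
    corr = np.abs(np.mean(x * y[:, None], axis=0))    # E[x_j * y] for each single bit j
    print(f"bias = {bias}: max single-bit correlation with the parity ≈ {corr.max():.3f}")
```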