# Introduction
Over the past decade, a surge in available data, combined with increasingly powerful computational resources, has enabled neural networks to achieve significant gains across a wide range of tasks. [citations]
Despite their outstanding empirical success, neural networks remain unpredictable in many ways. It is not entirely clear how a neural network reacts to a particular alteration in the training algorithm, its regularization, or its architecture. As an example, the batch normalization method [citations] has been remarkably successful and is widely used in practice. Nevertheless, we still cannot pinpoint the underlying reasons behind its effectiveness. Another example is the existence of adversarial examples [citations]. It is surprising to observe how drastically a neural network's prediction changes when it is fed adversarial examples whose perturbations are imperceptible to humans.
A theoretical understanding of neural networks enables us to combat such unpredictable behavior and to assess or guarantee their reliability. A systematic study of neural networks' learning dynamics would pave the way for theoretical guarantees and guide practitioners towards models with improved reliability and performance.
### History of theoretical studies
Theoretical work on neural networks began more than two decades ago with studies of the capacity and generalization of shallow networks [citations]. After neural networks regained popularity in the 2010s, seminal studies [citations] investigated the learning dynamics of deep linear networks, showing that several intuitions gained from deep linear models carry over to more complex non-linear networks. These works provided explanations for the effectiveness of depth observed in practice. More recently, significant progress has been made in studying non-linear deep networks [citations] as well as in incorporating the structure of the data into their analysis [citations].
### The puzzling phenomena
Notwithstanding the significant progress made towards establishing theoretical foundations for neural networks, several learning phenomena that are well known in practice remain theoretically opaque.
One such phenomenon is the existence of perplexing non-monotonic patterns in the generalization curves of neural networks. It is well known in statistical learning theory that the generalization error follows a U-shaped curve: as model complexity increases, the generalization error initially decreases and then, beyond a certain threshold, increases, i.e., overfitting occurs [citations]. However, the practice of neural networks goes against this classical wisdom. In practice, as the model size is increased, neural networks follow a double descent curve [citations]: an initial U-shaped curve followed by a second descent. [citations] show that double descent is not limited to varying the model size but is also observed as training proceeds. Once again, this so-called epoch-wise double descent is in apparent contradiction with the classical understanding of overfitting [citations], where one expects that training a sufficiently large model beyond a certain point should result in overfitting.
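To make the model-size flavor of this phenomenon concrete, the following is a minimal, self-contained sketch (not taken from the articles presented later): minimum-norm least squares on random ReLU features of a noisy linear teacher. The dimensions, noise level, and widths are illustrative assumptions; with them, the test error typically dips, peaks near the interpolation threshold (width ≈ number of training samples), and then descends again.

```python
# Model-wise double descent with random ReLU features: the test error of the
# minimum-norm least-squares fit peaks near width == n_train and drops again
# as the model grows further. All constants below are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
d, n_train, n_test, noise = 20, 100, 2000, 0.5

# Noisy linear teacher.
w_star = rng.standard_normal(d) / np.sqrt(d)
X_train = rng.standard_normal((n_train, d))
X_test = rng.standard_normal((n_test, d))
y_train = X_train @ w_star + noise * rng.standard_normal(n_train)
y_test = X_test @ w_star  # evaluate against the noiseless teacher

def relu_features(X, W):
    """Random ReLU feature map shared by the train and test sets."""
    return np.maximum(X @ W, 0.0)

for width in (10, 50, 90, 100, 110, 200, 500, 2000):
    W = rng.standard_normal((d, width)) / np.sqrt(d)
    Phi_train, Phi_test = relu_features(X_train, W), relu_features(X_test, W)
    # lstsq returns the minimum-norm solution in the overparameterized regime.
    theta, *_ = np.linalg.lstsq(Phi_train, y_train, rcond=None)
    print(f"width={width:5d}  test MSE={np.mean((Phi_test @ theta - y_test) ** 2):.3f}")
```

The minimum-norm solution returned by `np.linalg.lstsq` is what keeps the overparameterized regime well defined; the exact height of the peak depends on the noise level and the random seed.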
Another puzzling phenomenon of neural networks' dynamics is their reliance on spurious correlations. It has been reported that, in many cases, state-of-the-art neural networks tend to focus on low-level superficial correlations rather than on more abstract and robustly informative features of interest. As an example, in a recent study conducted by researchers at Cambridge [citations], the authors review more than 300 papers on predicting COVID-19 from CT scan images. According to the article, none of the proposed models generalized from one hospital's data to another's, since the models learn to latch onto hospital-specific features. Because of such reliance, neural networks are vulnerable to slight distributional shifts between the training and test sets [citations].
This vulnerability is particularly revealed when neural networks are tested on out-of-distribution (OOD) data, where the assumption that training and test data are independent and identically distributed (IID) breaks down. Most existing learning models rely on this fundamental assumption; however, in many practical applications it is violated and distributional shifts are observed between the training and test sets. While state-of-the-art neural networks generally achieve excellent IID generalization, small discrepancies between the training and test distributions can cause them to fail in spectacular ways [citations]. Neural networks trained with gradient-based optimization methods latch onto correlations in the training distribution that might be of no use at test time or could even hurt generalization.
### The guiding question
In the quest to develop a scientific understanding of neural networks, our guiding question throughout this dissertation is:
"How do neural networks learn different features and how does that affect their ability to generalize?"
Towards answering this question, the puzzling phenomena and failure modes of neural networks serve as guiding examples: studying them brings us a step closer to a better understanding of neural networks. In the remainder of this dissertation, we present three articles aimed at answering our guiding question:
- Chapters 3 and 4 present "Multi-scale Feature Learning Dynamics: Insights for Double Descent", in which we study a linear teacher-student setup exhibiting epoch-wise double descent similar to that observed in deep neural networks. In this setting, we derive closed-form analytical expressions for the evolution of the generalization error over the course of training. We find that double descent can be attributed to distinct features being learned at different scales: as fast-learning features overfit, slower-learning features start to fit, resulting in a second descent in the test error (see the gradient-flow sketch after this list).
- Chapters 5 and 6 present "Gradient Starvation: A Learning Proclivity in Neural Networks", in which we identify and formalize a fundamental phenomenon of gradient descent that results in a learning proclivity in over-parameterized neural networks. Gradient starvation arises when the cross-entropy loss is minimized by capturing only a subset of the features relevant to the task, despite the presence of other predictive features that fail to be discovered. This work provides a theoretical explanation for the emergence of such feature imbalance in neural networks. Using tools from dynamical systems theory, we identify simple properties of the learning dynamics during gradient descent that lead to this imbalance and prove that such a situation can be expected given certain statistical structures in the training data. Based on our proposed formalism, we develop guarantees for a novel regularization method aimed at decoupling feature learning dynamics, improving accuracy and robustness in cases hindered by gradient starvation.
- Chapters 7 and 8 present "Simple data balancing achieves competitive worst-group-accuracy", in which we empirically study the problem of generalization under distributional shifts. We look into classifiers that achieve good average performance yet fail to generalize to underrepresented groups (minority examples). After observing that common worst-group-accuracy datasets suffer from substantial imbalances, we set out to compare state-of-the-art methods to a simple balancing of classes and groups by either subsampling or reweighting data (a minimal sketch of such balancing follows this list). Our results show that these data balancing baselines achieve state-of-the-art accuracy, calling for closer examination of benchmarks and methods for research in worst-group-accuracy optimization.
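To unpack the "different scales" intuition behind the first article, the following is a back-of-the-envelope sketch of the mechanism, stated for plain linear regression under gradient flow rather than the articles' exact teacher-student setting; the quantities $\Sigma$ and $\bar{w}$ and the fast/slow reading are simplifying assumptions of this sketch.

```latex
% Gradient flow on linear least squares: each eigendirection of the input
% covariance is learned on its own time scale (a simplification, not the
% articles' exact setting).
\[
  L(w) = \tfrac{1}{2n}\lVert Xw - y \rVert^2,
  \qquad
  \dot{w}(t) = -\nabla L\big(w(t)\big) = -\Sigma\,\big(w(t) - \bar{w}\big),
  \qquad
  \Sigma = \tfrac{1}{n} X^\top X,
\]
where $\bar{w}$ solves the normal equations $\Sigma \bar{w} = \tfrac{1}{n} X^\top y$.
In the eigenbasis $\{(\lambda_i, v_i)\}$ of $\Sigma$,
\[
  \big\langle w(t) - \bar{w},\, v_i \big\rangle
  = e^{-\lambda_i t}\, \big\langle w(0) - \bar{w},\, v_i \big\rangle ,
\]
so directions with large $\lambda_i$ ("fast" features) are fitted on a time scale
of order $1/\lambda_i$, long before directions with small $\lambda_i$ ("slow" features).
```

If the fast directions also absorb label noise, the test error can rise once they are fitted and fall again when the slow, informative directions catch up, which is the intuition behind the second descent described above.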
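The data balancing baselines in the third article are simple enough to sketch. Below is a minimal, hypothetical Python implementation of balancing at the (class, group) level, by reweighting or by subsampling; the cell granularity, function names, and normalization are assumptions of this sketch rather than the article's code.

```python
# A minimal sketch of (class, group) balancing: either reweight every example by
# the inverse frequency of its cell, or subsample every cell down to the smallest
# cell's size. Labels are assumed to be integer-encoded.
import numpy as np

def balancing_weights(classes, groups):
    """Per-example weights inversely proportional to (class, group) frequency."""
    cells = np.stack([np.asarray(classes), np.asarray(groups)], axis=1)
    _, cell_ids, counts = np.unique(cells, axis=0,
                                    return_inverse=True, return_counts=True)
    weights = 1.0 / counts[cell_ids.ravel()]
    return weights * len(weights) / weights.sum()  # normalize to mean weight 1

def balanced_subsample(classes, groups, seed=0):
    """Indices of a subsample with equally many examples per (class, group) cell."""
    rng = np.random.default_rng(seed)
    cells = np.stack([np.asarray(classes), np.asarray(groups)], axis=1)
    _, cell_ids, counts = np.unique(cells, axis=0,
                                    return_inverse=True, return_counts=True)
    cell_ids, n_min = cell_ids.ravel(), counts.min()
    picks = [rng.choice(np.flatnonzero(cell_ids == c), size=n_min, replace=False)
             for c in range(len(counts))]
    return np.concatenate(picks)
```

The reweighting variant would feed into a weighted loss or a weighted sampler, while the subsampling variant simply restricts training to the returned indices.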
Before diving into the main articles, we first review the necessary background in Chapter 2. Chapters 3 through 8 present the three articles. Lastly, Chapter 9 concludes this dissertation.