## Notes
Multiple definitions of **sparsity**:
- data sparsity: $N$ is too small to obtain a good estimate of $\mathbf{w}$
- probability sparsity: probability distribution over events, most of which receive zero probability
- sparsity in the dual: associated with SVMs and other kernel-based methods; implies that the predictor can be represented via kernel calculations involving just a few training instances
- model sparsity: most dimensions of $\mathbf{f}$ are not needed for a good $h_\mathbf{w}$ (where $h_\mathbf{w}(x)=\arg\max_{y\in\mathcal{Y}(x)}\mathbf{w}^T\mathbf{f}(x,y)$ and $\mathbf{f}$ is a vector-valued function that encodes all the relevant things about $(x,y)$); those dimensions of $\mathbf{w}$ can be zero, leading to a sparse $\mathbf{w}$
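A tiny illustrative sketch of model sparsity (the feature map, candidate labels, and weights below are all made up, not from any paper): a linear predictor $h_\mathbf{w}$ whose weight vector has only a few nonzero entries, so most feature dimensions never influence the prediction.

```python
import numpy as np

# toy feature map f(x, y): one fixed random vector per candidate label
rng = np.random.default_rng(0)
CANDIDATES = ["A", "B", "C"]
FEATURES = {y: rng.normal(size=8) for y in CANDIDATES}

def predict(w, x, candidates=CANDIDATES):
    # h_w(x) = argmax_{y in Y(x)} w^T f(x, y); x is ignored by this toy feature map
    return max(candidates, key=lambda y: w @ FEATURES[y])

w = np.zeros(8)
w[[0, 3]] = [1.5, -0.7]   # sparse w: only 2 of 8 feature dimensions matter
print(predict(w, x="some input"))
```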
Why is it useful to predict when modules form during training?
## Research Questions/Thoughts
- hypothesis: optimization -> modularity (w/ redundant modules) -> LM as compression
- read superposition paper
- how to measure compression? not just complexity, also reconstruction error
- overlap in modules may cause failures in composition
- b/c successful composition requires orthogonal bases?
- failures of the model/compression -> failures of modularity
## Related Work
- "Transformers learn through gradual rank increase"
- difference between trained and initial weights of Transformers progressively increases in rank
- "Parameter-Efficient Sparsity for Large Language Models Fine-Tuning", Li et al., https://arxiv.org/abs/2205.11005
- seems like they use relevant measures of sparsity, like low-rankness and structuredness
- "Learning Transformer Programs", Friedman et al., https://arxiv.org/abs/2306.01128
- designed a transformer that can be automatically converted into a discrete, human-readable program
- they convert Transformers into Python programs and use off-the-shelf code analysis tools to debug model errors and identify the "circuits" used to solve different sub-problems
- "The Quantization Model of Neural Scaling," Michaud et al., https://arxiv.org/abs/2303.13506
- "learned network capabilities are quantized into discrete chunks"
- quanta: basic blocks of model performance
- proposes QDG: “quanta discovery with gradients”
- clusters samples together based on whether gradients on those samples point in similar directions
- So for $g_i:=\nabla_\theta L(f_\theta(x_i),y_i)$, normalize each gradient to $\hat{g}_i$ so that $\hat{g}_i\cdot \hat{g}_i=1$, then compute the affinity matrix $C=AA^T$, where $A$ is the matrix whose rows are the normalized gradients. Then $C_{ij}=\hat{g}_i\cdot\hat{g}_j$, the cosine similarity between the gradients of samples $i$ and $j$. Can also define $\hat{C}_{ij}=1-\mathrm{arccos}(C_{ij})/\pi$. Perform spectral clustering with $\hat{C}$ to cluster the samples $(x_i,y_i)$ (a minimal sketch follows this entry).
- components of the "quantization hypothesis":
- many natural prediction problems involve quanta, and model prediction is determined by which quanta have been learned
- Q sequence: natural ordering of the quanta determined by which abilities are more useful for reducing loss than others
- power law relationship between the frequencies at which the quanta are used and prediction drop off
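A minimal sketch of the QDG procedure as described above (per-sample gradients, angular affinity, then spectral clustering). The toy model, data, and cluster count are placeholders, not Michaud et al.'s setup.

```python
import numpy as np
import torch
from sklearn.cluster import SpectralClustering

def per_sample_gradients(model, loss_fn, xs, ys):
    grads = []
    for x, y in zip(xs, ys):
        model.zero_grad()
        loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)).backward()
        grads.append(torch.cat([p.grad.flatten() for p in model.parameters()]).numpy())
    return np.stack(grads)

def qdg_clusters(grads, n_clusters=3):
    A = grads / np.linalg.norm(grads, axis=1, keepdims=True)  # rows are g_hat_i
    C = A @ A.T                                               # cosine similarities
    C_hat = 1 - np.arccos(np.clip(C, -1.0, 1.0)) / np.pi      # angular affinity
    return SpectralClustering(n_clusters=n_clusters,
                              affinity="precomputed").fit_predict(C_hat)

# toy stand-ins for a trained model and its evaluation samples
model = torch.nn.Linear(10, 4)
xs, ys = torch.randn(50, 10), torch.randint(0, 4, (50,))
labels = qdg_clusters(per_sample_gradients(model, torch.nn.functional.cross_entropy, xs, ys))
print(labels)
```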
- "Break It Down: Evidence for Structural Compositionality in Neural Networks," Lepori et al., https://arxiv.org/abs/2301.10884
- uses continuous sparsification to learn sub-networks for tasks requiring structured compositionality
- then evaluates whether the learned sub-networks perform equally well on all the tasks; if a sub-network is truly particular to a single task (or group of tasks), there should be a performance drop when it is evaluated on the other task(s)
- they also try ablating the sub-networks to see if performance is only harmed on one task and not the other
- "Winning the Lottery with Continuous Sparsification," Savarese et al., https://arxiv.org/abs/1912.04427
- proposes **continuous sparsification**, which does lottery ticket search by learning a binary mask over model weights
- iteratively learns the mask $m$ by training the model with the mask applied, plus an $\ell_1$ regularizer term (equivalent to the $\ell_0$ regularizer for binary matrices); the learned mask is continuous, with a sigmoid applied element-wise inside the regularization term; when outputting the final mask, the Heaviside step function is applied
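A minimal sketch of that idea, assuming a single masked linear layer and made-up hyperparameters (mask initialization, temperature schedule, penalty weight, data): a soft mask $\mathrm{sigmoid}(\beta s)$ is applied to the weights during training, an $\ell_1$ penalty on the soft mask encourages sparsity, and the deployed mask is the Heaviside step of $s$.

```python
import torch
import torch.nn as nn

class MaskedLinear(nn.Module):
    def __init__(self, in_f, out_f):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_f, in_f) * 0.02)
        self.bias = nn.Parameter(torch.zeros(out_f))
        self.s = nn.Parameter(torch.full((out_f, in_f), 0.1))  # mask logits (init is a guess)
        self.beta = 1.0                                         # temperature, annealed upward

    def soft_mask(self):
        return torch.sigmoid(self.beta * self.s)

    def hard_mask(self):
        return (self.s > 0).float()                             # Heaviside step, used at the end

    def forward(self, x):
        return nn.functional.linear(x, self.weight * self.soft_mask(), self.bias)

layer = MaskedLinear(16, 4)
opt = torch.optim.Adam(layer.parameters(), lr=1e-3)
x, y = torch.randn(32, 16), torch.randint(0, 4, (32,))
for step in range(200):
    loss = nn.functional.cross_entropy(layer(x), y) + 1e-3 * layer.soft_mask().sum()
    opt.zero_grad(); loss.backward(); opt.step()
    layer.beta = min(layer.beta * 1.05, 100.0)                  # temperature annealing
print("fraction of weights kept:", layer.hard_mask().mean().item())
```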
- "A Theory for Emergence of Complex Skills in Language Models," Arora and Goyal
- Models "skill graphs," which are bipartite graphs with a set $V_1$ of nodes representing "skills" $s\sim \mu_1$ needed to understand pieces of text $t\sim \mu_2$ (the nodes in $V_2$). Edge exists between node $s$ and $t$ if skill $s$ is needed to attain low prediction loss on $t$
- Also defines unknown procedure $\mathrm{Gen}$ for converting randomly sampled tuple of skills into a text-piece exhibiting those skills
- $\mathrm{Cloze}$: unknown procedure for creating a Cloze prompt for a given piece of text
- Some takeaways:
- reduction in excess cross-entropy loss drives skill acquisition, together with the assumption that normal language already utilizes multiple skills, mixed up randomly
- "On The Specialization of Neural Modules," Jarvis et al. (https://openreview.net/pdf?id=Fh97BDaR6I)
- parametrized space of datasets w/ input and output matrices $X=\begin{bmatrix}\Omega_x & \Gamma_x\end{bmatrix}^T$ and $Y=\begin{bmatrix}\Omega_y & \Gamma_y\end{bmatrix}$ where $\Omega_x\in\{-1,1\}^{2^{n_x}\times n_x}$ and $\Omega_y\in\{-1,1\}^{2^{n_x}\times n_y}$ are the compositional input and output feature matrices (where $n_x$ is the no. of bits in the compositional input structure and $n_y$ is the no. of features uniformly sampled from $\Omega_x$); $\Gamma_x,\Gamma_y$ are non-compositional input and output feature matrices consisting of scaled identity matrices
- Systematic generalization: the identification and learning of lower-rank sub-structure ($\Omega_x,\Omega_y)$ in the full dataset $(X,Y)$ s.t. the rank of the population covariance on the sub-structure ($E[\Omega_x\Omega_x^T],E[\Omega_y\Omega_x^T]$) is lower than the rank of the population covariance on the dataset as a whole ($E[XX^T],E[YX^T]$)
- Showed that linear dense neural networks learn the SVD of the dataset statistics
- for all datasets in this space, it's impossible for a dense architecture to learn a systematic mapping between compositional components
- this is b/c non-compositional components affect the compositional mappings, so the mapping cannot be systematic (in order for a network to be systematic the compositional output needs to be independent of the non-compositional input)
- necessary condition for systematicity: the mapping b/w the lower-rank sub-structures must be orthogonal to (i.e., rely on different modes of variation than) the rest of the network mapping
- but in dense networks any correlation between sub-structures in the dataset will couple the mappings
- Modularity does impose some form of bias towards systematicity but if any non-compositional structure is considered in the input then a systematic mapping will not be learned
- The fully partitioned network achieves systematicity by learning the lower-rank (compositional) sub-structure in isolation of the rest of the mapping.
- although theoretical results are only proven for linear networks, empirical results generalize on CMNIST for non-linear CNNs
- in dense networks the error of the compositional mapping is tied to the error of the non-compositional mapping (but this is not observed with the split network)
- non-compositional components result in a lack of generalization for both dense and split networks
- compositional mapping of the split architecture is the only mapping that generalizes well (on compositional portion of the dataset)
- "Thus, as our theoretical module-centric perspective displays, modules must be allocated perfectly; such that the only consistent correlation presented to a module is
the one it must specialise to"
- "A mathematical theory of semantic development in deep neural networks," Saxe et al. (https://www.pnas.org/doi/10.1073/pnas.1820226116)
- goal is to show that deep learning dynamics can self-organize emergent hidden representations in a manner that recapitulates many empirical phenomena in human semantic development
- second-order statistics of inputs and outputs drives synaptic weight changes through coupled nonlinear differential equations w/ up to cubic interactions in the weights
- mean change in weights per learning epoch, for deep network $\hat{\mathbf{y}}=\mathbf{W}^2\mathbf{W}^1\mathbf{x}$:
\begin{align}
\tau\frac{d}{dt}\mathbf{W}^1 &= {\mathbf{W}^2}^\top(\Sigma^{yx}-\mathbf{W}^2\mathbf{W}^1\Sigma^x) \\
\tau\frac{d}{dt}\mathbf{W}^2 &= (\Sigma^{yx}-\mathbf{W}^2\mathbf{W}^1\Sigma^x){\mathbf{W}^1}^\top
\end{align}
- assumed that influence of "perceptual correlations" is minimal, i.e. $\Sigma^x\approx\mathbf{I}$
- goal then is to understand dynamics of learning as a function of $\Sigma^{yx}$ via SVD and the evolution of the singular values of $\Sigma^{yx}$
- shallow linear NNs have different dynamics than deep linear NNs: the effective singular values of the former follow an exponential trajectory, while those of the deep network follow a sigmoidal trajectory
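A small numerical integration of the two coupled equations above (taking $\Sigma^x=\mathbf{I}$), just to see the effective singular values of $\mathbf{W}^2\mathbf{W}^1$ rise sigmoidally toward those of $\Sigma^{yx}$; the dimensions, $\tau$, step size, and random $\Sigma^{yx}$ are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, tau, dt = 6, 1.0, 0.01
Sigma_yx = rng.normal(size=(n, n))            # input-output correlation matrix
W1 = rng.normal(scale=1e-3, size=(n, n))      # small random initialization
W2 = rng.normal(scale=1e-3, size=(n, n))

for step in range(20001):
    err = Sigma_yx - W2 @ W1                  # (Sigma^yx - W^2 W^1 Sigma^x) with Sigma^x = I
    dW1 = (dt / tau) * (W2.T @ err)
    dW2 = (dt / tau) * (err @ W1.T)
    W1, W2 = W1 + dW1, W2 + dW2
    if step % 5000 == 0:
        # effective singular values of the end-to-end map approach those of Sigma^yx
        print(step, np.round(np.linalg.svd(W2 @ W1, compute_uv=False), 3))
```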
- "Language modeling is compression" (https://arxiv.org/pdf/2309.10668.pdf)
- studies the ability of LLMs to compress info by using the token probabilities emitted by the LLM to implement arithmetic coding (see a good reference on arithmetic coding here: https://www.cs.cmu.edu/~aarti/Class/10704/lec9-arithmeticcoding.pdf)
- predicting well can be seen as compressing well (i.e. optimal log-loss is achieved if the expected coding redundancy is minimized)
- raw compression rates of transformers are impressive (and mostly better than traditional compressors), but adjusted compression rates (which account for model size) are inferior for larger models (e.g. the 1B/7B/70B Chinchilla models); a 3.2M-parameter transformer does well even on adjusted compression rate when compressing enwik9
- Chinchilla models do well (unadjusted) at compressing general-purpose content, even images and audio (despite being only trained on text)
- larger models achieve better adjusted compression rates on larger datasets
- the longer the sequence, the more data the model can condition on in its context, and the better the compression
- compression rates (lower is better) drop quickly with increasing sequence length: the models learn some of the data statistics in-context, without gradient-based training
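A minimal sketch of the prediction-equals-compression link discussed above: the idealized arithmetic-coding length of a sequence under a model is just its total negative log2-likelihood, so better next-token prediction means fewer bits. The character-level bigram "model" below is a stand-in for an LLM's next-token distribution; the text and the 8-bits-per-character baseline are arbitrary.

```python
import math
from collections import Counter, defaultdict

def bigram_model(tokens):
    counts = defaultdict(Counter)
    for a, b in zip(tokens, tokens[1:]):
        counts[a][b] += 1
    vocab_size = len(set(tokens))
    def prob(prev, nxt):  # add-one smoothed next-token probability
        return (counts[prev][nxt] + 1) / (sum(counts[prev].values()) + vocab_size)
    return prob

text = list("the cat sat on the mat and the cat sat on the hat")
prob = bigram_model(text)
model_bits = sum(-math.log2(prob(a, b)) for a, b in zip(text, text[1:]))  # ideal code length
raw_bits = 8 * (len(text) - 1)                                            # 1 byte per character
print(f"{model_bits:.1f} bits vs {raw_bits} raw bits "
      f"(compression rate {model_bits / raw_bits:.2f})")
```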
- "A Meta-Transfer Objective for Learning to Disentangle Causal Mechanisms
" (https://arxiv.org/abs/1901.10912)
- "Recurrent Independent Mechanisms" (https://arxiv.org/abs/1909.10893), Goyal et al., 2019
- "The intriguing role of module criticality in the generalization of deep networks" (https://arxiv.org/abs/1912.00528), Chatterji et al., 2019
- defined a module as: a node in the computation graph that has incoming edges from other modules and outgoing edges to other nodes and performs a linear transformation on its inputs
- also notes that for layered models such as VGG, the module definition is equivalent to the definition of a layer
- nonlinearity is not part of the module
- interested in understanding how different modules interact with each other and influence generalization as a whole
- past work (Zhang et al., "Are all layers created equal?") indicated that different modules have different robustness to parameter perturbation (like rewinding module back to initialization value while keeping all other modules fixed)
- Given the convex combination $\theta_i^\alpha=(1-\alpha)\theta_i^0+\alpha\theta_i^F$, $\alpha\in [0,1]$, where $\theta_i^0$ and $\theta_i^F$ are the initial and final weights of module $i$, respectively, $\alpha_i$ is the minimum value of $\alpha$ for which the train error of the network (with all other modules kept at their final weights) stays within a threshold value $\epsilon$
- small $\alpha_i$ implies low module criticality
- factors that fail to distinguish critical layers: spectrum of weight matrices, distance to initialization, change in activation patterns
- proposed the following measure of module criticality:
$$\mu_{i,\epsilon}(f_\Theta)=\min_{0\leq\alpha_i,\sigma_i\leq 1} \left\{\frac{\alpha_i^2\|\theta_i^F-\theta_i^0\|^2_{Fr}}{\sigma_i^2}:\mathbb{E}_{u\sim\mathcal{N}(0,\sigma_i^2)}[\mathcal{L}_S(f_{\theta_i^{\alpha_i}+u,\Theta_{-i}^F})]\leq\epsilon\right\}$$
- and the network criticality is the sum of the module criticality over all modules: $\mu_\epsilon(f_\Theta)=\sum_{i=1}^d\mu_{i,\epsilon}(f_\Theta)$
- proved how $\mu_\epsilon$ can be used in generalization bounds (via PAC-Bayes)
- showed that this network criticality measure is better at ranking different models by generalization error than other measures (e.g. product of Frobenius norms, product of spectral norms, distance to initialization, no. of parameters, sum of spectral norms, PAC Bayes (e.g. $\sum_i\|\theta_i^0-\theta_i^F\|_{Fr}^2/\sigma_i^2$))
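A rough sketch of probing $\alpha_i$ for a single module, per the definition above: sweep $\alpha$ along the line between the module's initial and final weights (all other parameters kept at their final values) and return the smallest $\alpha$ whose train loss stays within the threshold. The Gaussian perturbation $u$ from the full criticality measure is omitted for brevity, and the model, loss, and $\epsilon$ in the toy usage are placeholders.

```python
import copy
import torch

def module_alpha(model_init, model_final, module_name, train_loss, eps, steps=21):
    # smallest alpha on the init->final line for which train loss stays within eps
    theta0 = dict(model_init.named_parameters())[module_name].detach()
    thetaF = dict(model_final.named_parameters())[module_name].detach()
    for alpha in torch.linspace(0.0, 1.0, steps):
        probe = copy.deepcopy(model_final)        # all other modules stay at final weights
        with torch.no_grad():
            dict(probe.named_parameters())[module_name].copy_((1 - alpha) * theta0 + alpha * thetaF)
        if train_loss(probe) <= eps:
            return float(alpha)
    return 1.0

# toy usage: two copies of a linear layer stand in for initial / final checkpoints
init, final = torch.nn.Linear(4, 2), torch.nn.Linear(4, 2)
x, y = torch.randn(64, 4), torch.randint(0, 2, (64,))
loss_fn = lambda m: torch.nn.functional.cross_entropy(m(x), y).item()
print(module_alpha(init, final, "weight", loss_fn, eps=1.0))
```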
- ["Can Subnetwork Structure be the Key to Out-of-Distribution Generalization?"](https://arxiv.org/pdf/2106.02890.pdf), Zhang et al., 2021
- past work ([Empirical or invariant risk minimization? a sample complexity perspective](https://openreview.net/forum?id=jrA5GAccy_)) shows that ERM uses *all* the features in a Markov blanket, including those which are not the causes of the label; causes poor OOD generalization
- hypothesis: trained models that exploit spurious correlation can contain subnetworks that capture invariant features
- want to study whether the choice of structure matters in the training process
- functional lottery ticket hypothesis: the full network contains a subnetwork that can achieve better OOD generalization than the full network
- used **modular probing** to separately train and obtain subnetworks for digit task and color task in a biased colored MNIST dataset
- modular probing: train a binary mask over each layer; use the Gumbel-sigmoid trick (Jang et al., 2016) to make the mask differentiable, together with a logit regularization term (Csordás et al., 2020) to promote subnetwork sparsity
- modular probing confirmed the functional lottery ticket hypothesis
- "appropriate structure induction can impose a needed inductive bias to prevent the model from fitting the spurious correlation"
- Then they propose **modular risk minimization**: training in a way that balances the predictiveness for the invariant feature and sparsity
- first pre-train the entire model
- then do module structure probing by randomly sampling subnetworks $\mathbf{m}\sim\mathrm{sigmoid}(\pi)$ and updating the mask logits $\pi$ with the following loss: $\mathcal{L}_\text{MOD}=\mathcal{L}_\text{CE}(\mathbf{m}\odot\mathbf{w})+\alpha\sum_{l,j}\pi_{l,j}$, where $\alpha$ is the coefficient of the sparsity penalty, $\pi$ are the mask logits, $\mathbf{m}$ is a binary mask obtained by hard-thresholding $\mathrm{sigmoid}(\pi)$, and $\mathbf{w}$ are the weights of the network
- lastly, retrain the subnetwork: obtain the mask $\mathbf{m}$ via hard thresholding, set the model parameters back to their initialization, and update the network $f$ with $\mathcal{L}_\text{CE}(\mathbf{m}\odot\mathbf{w})$
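A sketch of the probing step described above, for a single frozen weight matrix: sample a relaxed binary mask from the logits $\pi$ with the Gumbel-sigmoid trick, run the network with the masked weights, and add the logit-sparsity penalty. Shapes, temperature, and $\alpha$ are illustrative; the real method also includes the pretraining and retraining stages.

```python
import torch

def gumbel_sigmoid(logits, tau=1.0):
    # relaxed Bernoulli sample; differentiable w.r.t. the logits
    u = torch.rand_like(logits).clamp(1e-6, 1 - 1e-6)
    noise = torch.log(u) - torch.log(1 - u)       # Logistic(0, 1) noise
    return torch.sigmoid((logits + noise) / tau)

def probing_loss(pi, weights, x, y, forward_fn, alpha=1e-4):
    m = gumbel_sigmoid(pi)                        # soft mask m ~ sigmoid(pi)
    logits = forward_fn(m * weights, x)           # run the network with masked weights
    return torch.nn.functional.cross_entropy(logits, y) + alpha * pi.sum()

# toy usage: learn a mask over a single frozen weight matrix
w = torch.randn(4, 10)                            # frozen "pretrained" weights
pi = torch.zeros(4, 10, requires_grad=True)       # mask logits to be learned
x, y = torch.randn(32, 10), torch.randint(0, 4, (32,))
loss = probing_loss(pi, w, x, y, forward_fn=lambda mw, inp: inp @ mw.T)
loss.backward()                                   # gradients flow into pi only
```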
- ["Automatically Composing Representation Transformations as a Means for Generalization"](https://arxiv.org/abs/1807.04640), Chang et al., 2018
- ["Neural Module Networks"](https://arxiv.org/abs/1511.02799), Andreas et al., 2016
- ["The Evolutionary Origins of Modularity"](https://royalsocietypublishing.org/doi/10.1098/rspb.2012.2863), Clune et al., 2013
- ["PackNet: Adding Multiple Tasks to a Single Network by Iterative Pruning"](https://arxiv.org/abs/1711.05769), Mallya et al., 2018
- ["Understanding community structure in layered neural networks"](https://www.sciencedirect.com/science/article/pii/S0925231219311476), Watanabe et al., 2019
- ["Pruned Neural Networks are Surprisingly Modular"](https://arxiv.org/abs/2003.04881), Filan et al., 2020
- "a "module" is a set of neurons with strong internal connectivity but weak external connectivity"
- ["Clusterability in Neural Networks"](https://arxiv.org/abs/2103.03386), Filan et al., 2021
- ["Quantifying Local Specialization in Deep Neural Networks"](https://arxiv.org/abs/2110.08058), Hod et al., 2021
- tries to quantify how functionally specialized local clusters of neurons are, via two properties:
- importance, which reflects how crucial sets of neurons are to network performance
- coherence, which reflects how consistently their neurons associate with features of the inputs
- applied graph-based partitioning using spectral clustering of model weights and correlations of activations
- ["Are Neural Nets Modular? Inspecting Functional Modularity Through Differentiable Weight Masks"](https://arxiv.org/abs/2010.02066), Csordás et al. 2021
- ["Is a Modular Architecture Enough?"](https://arxiv.org/abs/2206.02713)
## Other possible related works:
- [Detecting Modularity in Deep Neural Networks](https://openreview.net/forum?id=tFQyjbOz34)
- rejected from ICLR though - look through a bit more
## Meeting Notes
### 10/31/2023
- let $E$ be the environment and let $H(x, E)$ be a module
- $y=F(E,x,H(x, E))$
- want to find $\arg\min_H \mathbb{E}_{E, x\sim D}\|\mathrm{Jac}_EH(x, E)\|$ (a toy rendering is sketched at the end of these notes)
- and also $\max_H\mathbb{E}[\dots]$
- look at invariant risk minimization
- WILDS - OOD dataset
- but find only the ones where domain shift should *not* affect prediction
- maybe for distillation? see if we can train a more robust student model using the masked teacher model
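A toy rendering of the $\mathrm{Jac}_E$ objective above, assuming the environment enters the module as an extra input vector concatenated with $x$ (the module architecture and dimensions are made up):

```python
import torch

# made-up module: sees the input x and the environment e concatenated
H = torch.nn.Sequential(torch.nn.Linear(8, 16), torch.nn.Tanh(), torch.nn.Linear(16, 4))

def env_sensitivity(x, e):
    # || Jac_E H(x, E) ||: how much the module's output moves when only E changes
    module_of_e = lambda e_: H(torch.cat([x, e_]))
    jac = torch.autograd.functional.jacobian(module_of_e, e, create_graph=True)
    return jac.norm()

x, e = torch.randn(4), torch.randn(4)
penalty = env_sensitivity(x, e)   # could be averaged over x, E and minimized over H
penalty.backward()                # gradients reach H's parameters via create_graph=True
print(penalty.item())
```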
### 10/24/2023
- lit review on invariant predictors
- if compression -> generalization, do we still care about compression?
- min. CE works for compression only when distribution doesn't shift
- can we use invariance across a family of distributions to define a module?
- want to have a definition that is more functional
Claim: implicit to our interest in modularity and compression is an interest in OOD generalization. The optimal compressor for addition is the addition algorithm when we consider every possible setting. But in some distributions over addition problems, the optimal compression may be to memorize. Similarly, if a neural network will only be tested on in-distribution data for a single task, the network need not be modular--that is, the "module" can simply be the whole network.
### [Causal inference using invariant prediction](https://arxiv.org/abs/1501.01332)
- Observation: when a target's direct causes are all held fixed, changing other, non-direct-cause variables will not change the target.
- Exploit this fact to discover the correct causal predictor (a toy version of the invariance check is sketched after the example below). The algorithm:
1. Collect a set of causal predictors.
2. Perform interventions.
3. Reject causal predictors that change under the interventions.
4. The correct causal predictor must be among the remaining invariant predictors.
Example intervention from the paper: Say we have a subset of genes $\mathcal{G}_S \subset \mathcal{G}$ we are interested in. If we perform changes (knockout) on genes in $\mathcal{G}_{\neg S}$, and the relationship between two genes $p(g_1 | g_2)$ is unaffected, then $p(g_1 | g_2)$ is invariant given $\mathcal{G}_{\neg S}$.
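A simplified sketch of the invariance check behind the algorithm above: for each candidate predictor set $S$, fit the target on $X_S$ separately in every environment and accept $S$ only if the fitted coefficients (roughly) agree across environments. Real ICP uses proper statistical tests on both coefficients and residuals; the plain tolerance check and the toy data here are stand-ins.

```python
import itertools
import numpy as np

def invariant_sets(envs, tol=0.1):
    # envs: list of (X, y) pairs, one per environment / intervention setting
    d = envs[0][0].shape[1]
    accepted = []
    for r in range(1, d + 1):
        for S in itertools.combinations(range(d), r):
            coefs = [np.linalg.lstsq(X[:, S], y, rcond=None)[0] for X, y in envs]
            if max(np.abs(c - coefs[0]).max() for c in coefs) < tol:
                accepted.append(S)   # coefficients look invariant across environments
    return accepted                  # causal parents should lie in every accepted set

# toy data: X0 causes y; X1 is a noisy copy of y whose noise level changes across environments
rng = np.random.default_rng(0)
envs = []
for noise in (0.1, 2.0):
    x0 = rng.normal(size=500)
    y = 2.0 * x0 + rng.normal(scale=0.1, size=500)
    x1 = y + rng.normal(scale=noise, size=500)
    envs.append((np.column_stack([x0, x1]), y))
print(invariant_sets(envs))          # with this data, only the set (0,) should pass
```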
Notes for our purposes:
- This paper suggests taking an intervention-based perspective to modularity.
- Our goal is to find the part of the representation that is invariant under shifts.
- Intuition: if there is a network trained to recognize faces, then the representation for the same face should be similar even when the background is different.
- How this might work for language: Find the subnetwork that computes syntax by permuting semantics.
- This can be formalized as a coupling of random variables.
- Suppose two distributions $p, q$ are drawn from a family of distributions, and a neural network is trained on $p$. We can then use $q$ to find the subnetwork of that network that is invariant.
Questions:
1. How do we induce modularity? Options:
1. Data. We choose a multitask dataset and hope that modules corresponding to tasks arise naturally. We have intuition that this is the case via language modeling--[attention heads specialize](https://aclanthology.org/P19-1580.pdf), [we can label neurons in transformers according to different functions, etc](https://openai.com/blog/our-approach-to-alignment-research).
2. Data and algorithm. We also place additional constraints on the training algorithm to encourage modularity. See [Invariant Risk Minimization](https://arxiv.org/abs/1907.02893), where the authors add a regularization term to ERM meant to encourage invariance (sketched below).
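A sketch of that regularizer, roughly following the IRMv1 penalty form (the squared gradient of each environment's risk with respect to a dummy scale factor fixed at 1.0); the model, environments, and $\lambda$ below are placeholders.

```python
import torch

def irm_penalty(logits, y):
    # squared gradient of the environment's risk w.r.t. a dummy scale factor fixed at 1.0
    scale = torch.tensor(1.0, requires_grad=True)
    loss = torch.nn.functional.cross_entropy(logits * scale, y)
    grad = torch.autograd.grad(loss, [scale], create_graph=True)[0]
    return grad.pow(2)

def irm_loss(model, env_batches, lam=1.0):
    total = 0.0
    for x, y in env_batches:          # one (x, y) batch per environment
        logits = model(x)
        total = total + torch.nn.functional.cross_entropy(logits, y) + lam * irm_penalty(logits, y)
    return total

# toy usage
model = torch.nn.Linear(5, 3)
env_batches = [(torch.randn(16, 5), torch.randint(0, 3, (16,))) for _ in range(2)]
irm_loss(model, env_batches).backward()
```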