# IKDD Internship

<details>
<summary>Reading</summary>

<details>
<summary>MLLMs</summary>

#### [A Primer on Pretrained Multilingual Language Models](https://arxiv.org/pdf/2107.00676.pdf)
* The idea is to pre-train an MLLM with large amounts of unlabeled data from multiple languages, with the hope that low-resource languages benefit from high-resource languages due to shared vocabulary, genetic relatedness, and contact relatedness.
* Trade-off: a monolingual LM suffers no capacity dilution, while an MLLM benefits from additional pre-training data from other languages.
* While representations learned by MLLMs share commonalities across languages, as identified by different correlation analyses, these commonalities are dominant within the same family and only in certain parts of the network.
* The vocabulary is usually built from a concatenation of monolingual data from multiple languages.
* Pre-training objectives: MLM, CLM, MRTD. ![](https://i.imgur.com/nJxEJOn.png)
* Pre-train the MLLM using data from multiple languages -> fine-tune on task-specific data in the source language -> evaluate on the target language.
* Performance of zero-shot transfer has a strong correlation with the amount of data in the target language used for pre-training.
* Knowledge distillation from large models helps.
* Performance of MLLMs on zero-shot cross-lingual transfer is better when:
    1. Source and target language share vocabulary
    2. There is some similarity between source and target languages
    3. The architecture is deep
    4. Enough pre-training data is available for the target language
    5. Continual learning is used
    6. Representations are explicitly aligned using parallel text
* Given a sentence in a source language and a few candidate sentences in a target language, can we find the correct translation by identifying the nearest neighbor in the representation space? Such retrieval depends strongly on the layer from which the representation is taken, peaking in the middle layers (between layers 5 and 8).
* There is evidence that MLLMs learn embeddings which have high overlap across languages, primarily between those of the same family. These common representations seem to be clearest in the middle layers, after which the network specializes for different languages as required by the pre-training objectives.

#### [Analyzing the Mono- and Cross-Lingual Pretraining Dynamics of Multilingual Language Models](https://blvns.github.io/papers/blevins2022analyzing.pdf)
* We investigate when these models acquire their in-language and cross-lingual abilities by probing checkpoints taken from throughout XLM-R pretraining, using a suite of linguistic tasks.
* Our analysis shows that the model achieves high in-language performance early on, with lower-level linguistic skills acquired before more complex ones.
* In contrast, when the model learns to transfer cross-lingually depends on the language pair.
* Across many languages and tasks, the final, converged model checkpoint exhibits significant performance degradation, and no single checkpoint performs best on all languages.

#### [The Geometry of Multilingual Language Model Representations](https://arxiv.org/pdf/2205.10964.pdf)
* We show that languages occupy similar linear subspaces after mean-centering, evaluated based on causal effects on language modeling performance and direct comparisons between subspaces for 88 languages.
* To extract contextualized token representations from XLM-R, we input text sequences from the OSCAR corpus, concatenating consecutive sentences such that each sequence contained 512 tokens. A rough sketch of this extraction step follows.
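A minimal sketch of how such contextualized representations could be pulled out of XLM-R with the HuggingFace `transformers` library; the layer index, input text, and single-sequence handling are illustrative assumptions, not the paper's exact setup:

```python
# Sketch: extract contextualized token representations from a middle layer of XLM-R.
# The layer index and the input text are placeholders, not the paper's configuration.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base", output_hidden_states=True)
model.eval()

text = "Concatenated OSCAR sentences for one language would go here."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    outputs = model(**inputs)

# hidden_states is a tuple of (num_layers + 1) tensors of shape (batch, seq_len, hidden_dim);
# index 0 is the embedding layer, so hidden_states[8] is the output of transformer layer 8.
layer = 8
token_reps = outputs.hidden_states[layer][0]  # (seq_len, hidden_dim)
print(token_reps.shape)
```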
* We defined an affine subspace for each language A using the language's mean representation µ_A ∈ R^d along with k directions of maximal variance in the language, defined by an orthonormal basis V_A ∈ R^(d×k). To identify this subspace, we applied singular value decomposition (SVD) centered at µ_A using 262K contextualized token representations from language A (512 sequences from the OSCAR corpus). We selected the subspace dimensionality k such that the subspace accounted for 90% of the total variance in the language (see the sketch after these notes).
* We computed the ratio of the projected perplexity to the original perplexity in language A. Affine language subspaces encode much of the information relevant to the language modeling task in their corresponding languages. ![](https://i.imgur.com/Z3VXJMo.png)
* The model maps text in different languages into distinct subspaces. Mean-shifted subspaces were similar to one another.
* Differences in subspace means demonstrate that the language subspaces still differ along particular axes. Intuitively, these axes should encode language-sensitive information, i.e. information that has high mutual information with the input language identity.
* Shifting by language means induced target-language vocabulary. Projecting onto subspaces induced additional target-language vocabulary. The combination of mean-shifting and subspace projection further increased the proportion of predicted tokens in language B, beyond mean-shifting and projection individually.
* We applied linear discriminant analysis (LDA) to identify specific axes that separate language subspaces. Given n sets of representations (in this case, one set of 4K randomly sampled representations for each language), LDA computes n − 1 axes that maximize separation between the sets.
* LDA axes encode linguistic typological features and language families.
* Language-sensitive axes were stable in middle layers.
* We used LDA to identify axes that encode potentially more language-neutral information: token positions and part-of-speech (POS).
* Position axes were language-neutral. We performed LDA on sets of representations corresponding to each group of sixteen token positions, identifying axes that separated the different positions. We used 8K representations sampled uniformly from all languages for each position index. We projected representations from all token positions onto the identified position axes to qualitatively determine whether the axes encode position information language-neutrally. Token position information remains largely language-neutral as it passes through the model.
* Position information was encoded along nonlinear structures. ![](https://i.imgur.com/sbssZOf.png)
* Position representations were stable across layers.
* POS is not inputted directly into the model; in order to encode POS in a language-neutral way, the model must align features (e.g. features of nouns vs. verbs) cross-linguistically without supervision. We mapped language model tokens to the POS tag(s) that they were annotated with anywhere in the UD corpus. Using this mapping from tokens to POS tags, we extracted token representations in each language for each POS tag. To identify axes separating specific POS tags using LDA, we used a set of 8K token representations for each POS tag, sampled uniformly from all languages with tokens appearing in the UD corpus. When projecting onto n dimensions, we used LDA over n + 1 POS tags, resulting in n axes that separated representations for the provided POS tags.
* POS axes were language-neutral and stable across layers.
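A small numpy sketch of the subspace construction and mean-shifting described above: mean-centering, SVD, keeping enough components for 90% of the variance, then projecting or shifting a representation. The random toy data and function names are illustrative assumptions, not the paper's code.

```python
# Sketch of the affine-subspace construction described above, on stand-in data.
import numpy as np

def language_subspace(X, var_threshold=0.90):
    """X: (n_tokens, d) contextualized representations for one language.
    Returns the language mean mu (d,) and an orthonormal basis V (d, k)."""
    mu = X.mean(axis=0)
    _, s, vt = np.linalg.svd(X - mu, full_matrices=False)
    var_ratio = np.cumsum(s**2) / np.sum(s**2)
    k = int(np.searchsorted(var_ratio, var_threshold)) + 1  # smallest k covering the threshold
    return mu, vt[:k].T  # V has shape (d, k)

def project_to_subspace(x, mu, V):
    """Project a representation onto the affine subspace defined by (mu, V)."""
    return mu + V @ (V.T @ (x - mu))

def mean_shift(x, mu_A, mu_B):
    """Shift a language-A representation by the difference of language means."""
    return x - mu_A + mu_B

# Toy usage with random stand-ins for real XLM-R representations:
rng = np.random.default_rng(0)
X_A, X_B = rng.normal(size=(4096, 768)), rng.normal(size=(4096, 768))
mu_A, V_A = language_subspace(X_A)
mu_B, V_B = language_subspace(X_B)
x = X_A[0]
x_shifted = project_to_subspace(mean_shift(x, mu_A, mu_B), mu_B, V_B)
```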
* Individual languages occupy affine subspaces that are roughly similar to one another after mean-shifting. These language subspaces encode information such as token positions and part-of-speech along shared language-neutral axes. The subspaces differ primarily along language-sensitive axes.

</details>

<details>
<summary>Adapters</summary>

#### [Parameter-Efficient Transfer Learning for NLP](https://arxiv.org/pdf/1902.00751.pdf)
* Tasks arrive in a stream, and we want a system that performs well on all of them. Fine-tuning is parameter-inefficient: a new fine-tuned model is needed for each task. Adapters add a small number of trainable parameters per task, while the base model's parameters are trained once during pre-training and shared across tasks. ![](https://i.imgur.com/R5uVc9B.png)
* Adapters are added in each transformer layer after the feedforward sub-layer. An adapter is composed of a down-projection (from d to m dimensions, where m < d), a non-linearity, and an up-projection (from m back to d dimensions); a minimal sketch of this module appears below, after the MAD-X notes. The bottleneck dimension m trades off performance against parameter efficiency, with a regularization effect similar to autoencoders. A skip connection ensures the module starts out close to an identity function when the projection parameters are initialized near zero; if the initialization deviates too far from the identity function, the model may fail to train.
* Training adapters with sizes 0.5-5% of the original model, performance is within 1% of the published results for BERT-Large.

#### [AdapterFusion: Non-Destructive Task Composition for Transfer Learning](https://arxiv.org/pdf/2005.00247.pdf)
* Two-stage transfer learning. First, in the knowledge extraction stage, learn task-specific parameters called adapters that encapsulate the task-specific information. Then combine the adapters in a separate knowledge composition step. By separating the two stages, i.e. knowledge extraction and knowledge composition, the classifier can effectively exploit the representations learned from multiple tasks in a non-destructive manner. ![](https://i.imgur.com/7xcIPM7.png)
* Given the context, AdapterFusion learns a parameterized mixer of the available trained adapters. It learns to identify and activate the most useful adapter for a given input. ![](https://i.imgur.com/o2hFmp1.png)
* While Fusion with MT-A does provide gains over simply using MT-A, the effort required to train these in a multi-task setting followed by the Fusion step is not warranted by the limited gains in performance. On the other hand, we find that Fusion with ST-A is an efficient and versatile approach to transfer learning.

#### [MAD-X: An Adapter-Based Framework for Multi-Task Cross-Lingual Transfer](https://aclanthology.org/2020.emnlp-main.617.pdf)
* The framework comprises three types of adapters: language, task, and invertible adapters. ![](https://i.imgur.com/OA8R1wB.png) ![](https://i.imgur.com/cRS4WTt.png)
* Task adapters have the same architecture as language adapters. They are stacked on top of the language adapters and thus receive the output of the language adapter as input, together with the residual of the Transformer's feed-forward layer. Task adapters are the only parameters that are updated when training on a downstream task.
* Invertible adapters are stacked on top of the embedding layer while their respective inverses precede the output embedding layer.
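The Houlsby-style adapters above and MAD-X's language/task adapters share the same bottleneck shape. A minimal PyTorch sketch of such a module; the class name, bottleneck size, and the GELU non-linearity are illustrative assumptions rather than any paper's exact configuration:

```python
# Minimal sketch of a bottleneck adapter: down-projection, non-linearity,
# up-projection, and a residual (skip) connection around the bottleneck.
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    def __init__(self, d_model: int, bottleneck: int):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.act = nn.GELU()
        # Zero-initializing the up-projection keeps the module close to identity at the start.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h + self.up(self.act(self.down(h)))  # skip connection

# Toy usage: hidden size d = 768, bottleneck m = 48.
adapter = BottleneckAdapter(d_model=768, bottleneck=48)
h = torch.randn(2, 16, 768)   # (batch, seq_len, hidden)
out = adapter(h)              # same shape as the input
```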
<img src="https://i.imgur.com/rLqBifv.jpg" width="500" height="500"> * As input and output embeddings are tied in multilingual pretrained models, invertibility allows us to leverage the same set of parameters for adapting both input and output representations. * Non-linear Independent Component Estimation enables the invertibility of arbitrary nonlinear functions through a set of coupling operations. Split input embeddings into two vectors of equal dimensionality. Ainv() : <img src="https://i.imgur.com/GZMacMn.jpg" width="500" height="100"> A-1inv() : <img src="https://i.imgur.com/K6gFWL2.jpg" width="500" height="100"> <img src="https://i.imgur.com/ejyAyFY.jpg" width="500" height="100"> * The invertible adapter has a similar function to the language adapter, but aims to capture token level language specific transformations. * It is trained together with the language adapters using MLM on unlabelled data of a specific language. During task specific training we use the fixed invertible adapter of the source language, and replace it with the target-language invertible during zero shot transfer. #### [Efficient Test Time Adapter Ensembling for Low-resource Language Varieties](https://arxiv.org/pdf/2109.04877.pdf) <img src="https://i.imgur.com/6JwyVTH.jpg" width="600" height="400"> * Ensembling multiple existing language adapters makes the fine-tuned model significantly more robust to other language varieties not included in these adapters. Entropy Minimized Ensemble of Adapters (EMEA) is a method that optimizes the ensemble weights of the pretrained language adapters for each test sentence by minimizing the entropy of its predictions. * Let R be the set of the source and related language adapters. Lavg(h) is the weighted sum of R language adapters : ![](https://i.imgur.com/jwmjSww.png) ![](https://i.imgur.com/IsU88jK.png) * Zero shot cross lingual transfer with English as the source language. Using multiple language adapters brings significant gains. ![](https://i.imgur.com/UQaZNR9.jpg) #### [ADAMIX: MIXTURE-OF-ADAPTER FOR PARAMETER-EFFICIENT TUNING OF LARGE LANGUAGE MODELS](https://arxiv.org/pdf/2205.12410.pdf) * A new mechanism to improve adapter capacity without increasing parameters or computational cost by two techniques. * We introduce multiple shared adapter components in each layer of the Transformer architecture. We leverage sparse learning via random routing to update the adapter parameters (encoder is kept frozen) resulting in the same amount of computational cost (FLOPs) as that of training a single adapter. * We propose a simple merging mechanism to average the weights of multiple adapter components to collapse to a single adapter in each Transformer layer, thereby, keeping the overall parameters also the same but with significant performance improvement. * Mixture of experts (MoE) models induce sparsity by activating only a subset of the neural network weights for each incoming example. This is achieved via conditional computation based on routing input examples to a subset of experts introduced in each other layer of the Transformer model. This conditional computation, for instance, selection of top − 1 expert in each other layer, allows the sparse models to be computationally efficient i.e. match the FLOPs of that of a dense model, but also improves its capacity by increasing the number of parameters. * In order to introduce sparsity, we inject multiple feedforward layers (FFN) (corresponding to project-up and project-down) in each Transformer layer. 
We introduce a simple protocol to stochastically route instances to one of the project-up and then one of the project-down FFNs, resulting in the same amount of computational cost (FLOPs) as using a single adapter while introducing more capacity. ![](https://i.imgur.com/zuz1VnO.png)
* At any training step, we randomly select a pair of feedforward-up and feedforward-down projection matrices in the i-th Transformer layer. Given this selection of adapter components A_i and B_i in each Transformer layer at every step, all the inputs in a given batch are processed through the same set of adapters. However, the random routing protocol during training creates a challenge at inference time: which sets of projection matrices should be used? We address this challenge with the following two techniques, which further allow us to collapse the adapter parameters and obtain the same computational cost (FLOPs) as a single-adapter design.
* The objective of consistency regularization is to enable the adapter components to share information and prevent divergence. To this end, we add the following consistency loss as a regularizer to the task-specific optimization loss: ![](https://i.imgur.com/RWeKen6.png)
* Consistency regularization alone still results in increased serving cost, since all the projection matrices from the different adapter components must be hosted.
* Adapter merging is employed only during inference: simply average the weights of all the project-up (respectively project-down) matrices in every Transformer layer to collapse to a single adapter component.

</details>

<details>
<summary>Prefix Tuning</summary>

#### [Prefix-Tuning: Optimizing Continuous Prompts for Generation](https://arxiv.org/pdf/2101.00190.pdf)
![](https://i.imgur.com/Xlwm30Y.png)
* Keeps the language model parameters frozen, but optimizes a small continuous task-specific vector called the prefix.
* Prefix-tuning prepends a sequence of continuous task-specific vectors to the input. For subsequent tokens, the Transformer can attend to the prefix as if it were a sequence of "virtual tokens". The prefix consists entirely of free parameters which do not correspond to real tokens. In contrast to fine-tuning, prefix-tuning only optimizes the prefix.
* Prefix-tuning requires fewer parameters compared to adapter-tuning.

</details>

<details>
<summary>DEMIX</summary>

#### [DEMIX Layers: Disentangling Domains for Modular Language Modeling](https://arxiv.org/pdf/2108.05036.pdf)
* A DEMIX layer is a collection of expert feedforward networks, each specialized to a domain, that makes the LM modular: experts can be mixed, added or removed after initial training.
* We propose a modular LM that has components specialized to distinct domains in the training data, and can be customized at inference time by mixing, adding, or removing these separated components as needed.
* A DEMIX layer is a drop-in substitute for a feedforward layer in a transformer LM (e.g., GPT-3), creating a specialized version of the layer (or expert) per domain. Every feedforward layer is replaced. ![](https://i.imgur.com/xgj1DVb.jpg)
* Naive: we replace every feedforward layer in the transformer with a DEMIX layer. Under this setting, the domain of the test data is known and revealed to the model, e.g. the CS expert is used for CS test data.
* Dynamic: we introduce a domain variable, D_t, alongside each word. <img src="https://i.imgur.com/02PZQkV.jpg" width="600" height="150">
* The modification is to treat g_1, ..., g_n as a posterior probability over domains, calculated at each timestep, given the history so far.
(n -> number of domains) <img src="https://i.imgur.com/AYd5dy6.jpg" width="600" height="200">
* Uniform: fix the prior to be uniform across the known domains.
* Updating: set the prior at timestep t to be an exponentially-weighted moving average of the posteriors from previous timesteps. <img src="https://i.imgur.com/sYloXD7.jpg" width="600" height="150">
* The decay factor avoids putting too much weight on calculations made early in the dataset. During evaluation, this moving average is calculated over the posterior at the end of each sequence block.
* Cached: if, prior to testing, some data from the test distribution is available, we calculate the posterior over domain labels from that data and fix the prior to that estimate. <img src="https://i.imgur.com/I3YeGy5.jpg" width="500" height="350">

</details>

<details>
<summary>T-Few</summary>

#### [Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning](https://arxiv.org/pdf/2205.05638.pdf)
* In this paper, we rigorously compare few-shot ICL and parameter-efficient fine-tuning and demonstrate that the latter offers better accuracy as well as dramatically lower computational costs.
* We introduce a new parameter-efficient fine-tuning method called (IA)³ that scales activations by learned vectors, attaining stronger performance while only introducing a relatively tiny number of new parameters.
* Evaluated on the T0 model. T0 was created by fine-tuning T5 on a multitask mixture of datasets in order to enable zero-shot generalization, i.e. the ability to perform tasks without any additional gradient-based training.
* Two additional loss terms are used to improve the performance of few-shot fine-tuning of language models.
* Other PEFT methods did not allow for mixed-task batches. ![](https://i.imgur.com/VxVhahr.png)
* (IA)³ makes mixed-task batches possible because each sequence of activations in the batch can be separately and cheaply multiplied by its associated learned task vector. ![](https://i.imgur.com/RecYpaz.png)

</details>

<details>
<summary>Analysis of PEFT methods</summary>

#### [Towards a Unified View of Parameter-Efficient Transfer Learning](https://arxiv.org/pdf/2110.04366.pdf)
![](https://i.imgur.com/oK98zlt.png)
* Adapter vs. prefix-tuning: prefix tuning uses x, the input of the PLM layer, to compute ∆h, while adapters use h, the output of the PLM layer. Thus, prefix tuning can be thought of as a "parallel" computation to the PLM layer, whereas the typical adapter is a "sequential" computation (see the toy sketch after these notes). Adapters are more flexible with respect to where they are inserted than prefix tuning: adapters typically modify attention or FFN outputs, while prefix tuning only modifies the attention output of each head. Empirically, this makes a large difference. <img src="https://i.imgur.com/DbMzlor.png" width="1000" height="250">
* Any method with FFN modification outperforms all the methods with attention modification in all cases, often with fewer parameters. The same method applied at the FFN always improves over its attention counterpart. For example, LoRA (ffn) improves over LoRA (attn) by 1 R-2 point on XSum.
* Results suggest that FFN modification can utilize the added parameters more effectively than attention modification, no matter what the functional form or composition function is. We hypothesize that this is because the FFN learns task-specific textual patterns while attention learns pairwise positional interactions, which do not require large capacity for adapting to new tasks.
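A toy illustration of the "sequential" vs. "parallel" distinction above, using a generic bottleneck as a stand-in for the tuned module. All names and shapes are hypothetical, and prefix tuning itself modifies attention rather than adding an explicit module like this; the point is only where ∆h comes from.

```python
# Toy contrast: delta-h is computed from the sub-layer output h (sequential,
# adapter-style) versus from the sub-layer input x (parallel, prefix-tuning-like).
import torch
import torch.nn as nn

d, m = 768, 48
sublayer = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))  # stand-in for a frozen PLM FFN
delta = nn.Sequential(nn.Linear(d, m), nn.GELU(), nn.Linear(m, d))             # small tuned module

def sequential_composition(x):
    h = sublayer(x)
    return h + delta(h)   # adapter: delta-h is a function of h

def parallel_composition(x):
    h = sublayer(x)
    return h + delta(x)   # prefix-tuning-like: delta-h is a function of x

x = torch.randn(2, 16, d)
print(sequential_composition(x).shape, parallel_composition(x).shape)
```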
* Modifying the attention heads shows the best results when the parameter budget is very small, while the FFN can better utilize modifications at larger capacities. This suggests that it may be effective to allocate a larger parameter budget to FFN modification instead of treating attention and FFN equally.

#### [Transformer Feed-Forward Layers Are Key-Value Memories](https://arxiv.org/pdf/2012.14913.pdf)
* Feed-forward layers in transformer-based language models operate as key-value memories, where each key correlates with textual patterns in the training examples, and each value induces a distribution over the output vocabulary.
* The learned patterns are human-interpretable: lower layers tend to capture shallow patterns, while upper layers learn more semantic ones. ![](https://i.imgur.com/ajAQucX.png)
* Each key vector k_i captures a particular pattern (or set of patterns) in the input sequence, and its corresponding value vector v_i represents the distribution of tokens that follows said pattern.
* FFNs emulate neural memory.
* The residual connection between layers acts as a refinement mechanism, gently tuning the prediction at each layer while retaining most of the residual's information.
* Every feed-forward layer combines multiple memories to produce a distribution that is qualitatively different from each of its component memories' value distributions. These layer-wise distributions are then combined via residual connections in a refinement process, where each feed-forward layer updates the residual's distribution to finally form the model's output.

</details>

<details>
<summary>MuRIL</summary>

#### [MuRIL: Multilingual Representations for Indian Languages](https://arxiv.org/pdf/2103.10730.pdf)
* SOTA multilingual systems perform poorly on Indian languages: Indian languages are under-represented in the vocabulary and training data, and MLLMs are weak in low-resource settings. MuRIL is an MLLM trained specifically and exclusively on large amounts of Indian text corpora.
* Two language modeling objectives -> MLM (masked language modeling) and TLM (translation language modeling). ![](https://i.imgur.com/HVloNNE.png)
* Trained on monolingual, translated, and transliterated data. Transliteration refers to mapping one script to another based on phonetic similarity. Data is upsampled to ensure balance between all languages.
* Fertility ratio -> average number of sub-words per word. A higher fertility ratio means a greater loss in preservation of semantic meaning. mBERT has a higher fertility ratio than MuRIL, owing to its little representation of Indian languages and lack of transliterated data. The vocabulary plays a significant role in the model's improvement over mBERT. (A small fertility-ratio sketch follows these notes.)
* Results are computed in a zero-shot setting -> fine-tuning models on the labeled training set of one language and evaluating on test sets for all languages. Compared across Indian languages only.
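A quick sketch of how the fertility ratio could be measured with HuggingFace tokenizers. The checkpoint ids are real hub models, but the sample sentence and the whitespace word-splitting are simplifying assumptions, not the paper's evaluation setup.

```python
# Fertility ratio = average number of sub-word tokens per (whitespace-separated) word.
from transformers import AutoTokenizer

def fertility(tokenizer, sentences):
    total_subwords, total_words = 0, 0
    for s in sentences:
        total_words += len(s.split())
        total_subwords += len(tokenizer.tokenize(s))
    return total_subwords / total_words

# Hypothetical comparison on a single Hindi sentence:
sents = ["यह एक उदाहरण वाक्य है।"]
for name in ["bert-base-multilingual-cased", "google/muril-base-cased"]:
    tok = AutoTokenizer.from_pretrained(name)
    print(name, round(fertility(tok, sents), 2))
```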
</details>

<details>
<summary>Distillation</summary>

#### [Adaptive Multi-Teacher Multi-level Knowledge Distillation](https://arxiv.org/pdf/2103.04062.pdf)
#### [MergeDistill: Merging Pre-trained Language Models using Distillation](https://arxiv.org/pdf/2106.02834.pdf)

</details>

<details>
<summary>Benchmarks</summary>

#### [XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization](https://arxiv.org/pdf/2003.11080.pdf)
#### [XTREME-R: Towards More Challenging and Nuanced Multilingual Evaluation](https://arxiv.org/pdf/2104.07412.pdf)

</details>

</details>

<details>
<summary>Ideas/Experiments</summary>

* Tasks:
    * SIQA -> XCOPA
    * NER
* Interleave task training and target-language MLM.
* MAD-X
* FFNs
* Combination of FFNs
* Constrain the output label distribution when combining multiple FFNs / language adapters
* [Jamboard](https://jamboard.google.com/d/1qCnHdly_kJ-EIbiypbix-tqL53baplq7UpKNjyYDK5Y/viewer?f=0)

</details>

<details>
<summary>Additional Reading (paper names)</summary>

* T5
* Language Model Prior for Low-Resource Neural Machine Translation
* Analyzing the Mono- and Cross-Lingual Pretraining Dynamics of Multilingual Language Models
* Domain-Adversarial Training of Neural Networks
* MAUVE - Measuring the Gap Between Neural Text and Human Text Using Divergence Frontiers
* LAReQA - Language-Agnostic Answer Retrieval from a Multilingual Pool
* A Simple Method to Eliminate Self-Language Bias in Multilingual Representations
* Lifting the Curse of Multilinguality by Pre-training Modular Transformers
* Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space
* BERT has a Mouth, and It Must Speak - BERT as a Markov Random Field Language Model
* Masked Language Model Scoring
* Neural Unsupervised Domain Adaptation in NLP - A Survey
* Fine-Tuning Pretrained Language Models - Weight Initializations, Data Orders, and Early Stopping
* Multi Task Learning for Zero Shot Performance Prediction of Multilingual Models
* How Multilingual is Multilingual BERT?
* A Balanced Data Approach for Evaluating Cross-Lingual Transfer - Mapping the Linguistic Blood Bank
* A Structural Probe for Finding Syntax in Word Representations

</details>