###### tags: `PROGRESS`

[Challenges and Applications of Large Language Models](https://arxiv.org/pdf/2307.10169.pdf#page11)

## 8/1

### Task
- To resolve Fine-Tuning Overhead
  - The additional computational and memory resources required to adapt a pre-trained Large Language Model to perform well on a specific downstream task.
- Limited Context Length
  - The challenge of processing long inputs in natural language processing (NLP) tasks.

### Datasets
- [ChatGLM-Efficient-Tuning/data/](https://github.com/hiyouga/ChatGLM-Efficient-Tuning/tree/main/data)
- [ChatGLM-6B/ptuning](https://github.com/THUDM/ChatGLM-6B/tree/main/ptuning#%E4%B8%8B%E8%BD%BD%E6%95%B0%E6%8D%AE%E9%9B%86)
- [LLaMA_3b_v2_huggingface](https://huggingface.co/openlm-research/open_llama_3b_v2)

### Fine-Tuning Overhead
- Fine-tuning an LLM for a specific downstream task. [(Challenges and Applications of Large Language Models)](https://arxiv.org/pdf/2307.10169.pdf#page11)
![](https://hackmd.io/_uploads/BkRIzbGih.png)
- Adapter [(Towards a Unified View of Parameter-Efficient Transfer Learning)](https://arxiv.org/pdf/2110.04366.pdf)
![](https://hackmd.io/_uploads/HkzJm-zi3.png)
- Recent work:
  - Liu et al. [331] introduce $(IA)^3$.
  - Malladi et al. [355] propose a memory-efficient zeroth-order (MeZO) optimizer.
  - Hu et al. [218] propose LoRA.
  - Dettmers et al. [118] extend LoRA to quantized LLMs.
- Remaining issue:
  - Despite substantial improvements in the *memory complexity* needed to fine-tune LLMs for specific tasks, a remaining challenge is the **time complexity**.
  - Parameter-efficient fine-tuning of LLMs still requires computing full forward/backward passes through the whole network.

### Limited Context Length
- Having an architecture that can ingest long inputs does not guarantee that the LLM will perform as well on them as on shorter inputs.
- Limited context lengths are a barrier to handling long inputs well, which is needed for applications such as novel or textbook writing and summarization.

#### 1. Efficient Attention Mechanisms
- **Luna**: a linear unified nested attention mechanism that approximates softmax attention with two nested linear attention functions.
- Approaches that approximate **the dot-product attention** while requiring substantially less memory and compute resources.
- **Transient Global**: an extension of local attention where each token can attend to nearby tokens and a set of global tokens.
- The fundamental building block is the self-attention mechanism.
- The longer the input is, the more important the positional embedding becomes.

#### 2. Positional Embeddings
- Absolute Positional Embeddings
  - sinusoidal embeddings
- Relative Positional Embeddings
  - All unseen absolute positions will be converted to previously observed relative offsets between positions, enabling better generalization to long input sequences at inference time.
- Rotary Position Embeddings (RoPE)
  - Incorporate absolute positional information in a rotation matrix and model the relative positional offset through a rotation.
$$
\operatorname{softmax}\left(\frac{1}{\sqrt{d}} \sum_{i, j} \boldsymbol{x}_i^{\top} \boldsymbol{W}_q^{\top} \boldsymbol{R}_{\Theta,(i-j)}^d \boldsymbol{W}_k \boldsymbol{x}_j\right)
$$
- Relative Positional Bias
$$
\operatorname{softmax}\left(\frac{1}{\sqrt{d}} \sum_{i, j} \boldsymbol{x}_i^{\top} \boldsymbol{W}_q^{\top} \boldsymbol{W}_k \boldsymbol{x}_j+b_{i-j}\right)
$$
- ALiBi (Attention with Linear Biases)
$$
\operatorname{softmax}\left(\frac{1}{\sqrt{d}} \sum_{i, j} \boldsymbol{x}_i^{\top} \boldsymbol{W}_q^{\top} \boldsymbol{W}_k \boldsymbol{x}_j+m \times -(i-j)\right)
$$
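As a concrete reading of the ALiBi formula above, here is a minimal PyTorch sketch (a toy illustration, not the paper's implementation) of adding the head-specific linear bias $m \times -(i-j)$ to causal attention scores for a single head; the tensor shapes and the slope value are arbitrary.

```python
import torch

def alibi_attention(q, k, slope):
    """q, k: (seq_len, d) for a single head; slope: the per-head scalar m."""
    seq_len, d = q.shape
    scores = q @ k.T / d ** 0.5                    # (1 / sqrt(d)) * q_i . k_j
    i = torch.arange(seq_len).unsqueeze(1)         # query positions
    j = torch.arange(seq_len).unsqueeze(0)         # key positions
    scores = scores + slope * -(i - j)             # the linear bias m * -(i - j)
    causal = j <= i                                # attend only to current and past tokens
    scores = scores.masked_fill(~causal, float("-inf"))
    return torch.softmax(scores, dim=-1)

q, k = torch.randn(8, 16), torch.randn(8, 16)
attn = alibi_attention(q, k, slope=0.5)            # ALiBi uses head-specific slopes such as 2^-1, 2^-2, ...
print(attn.shape)                                  # torch.Size([8, 8])
```

Because the bias depends only on the distance $i-j$, no positional embedding is added to the token embeddings, which is what makes ALiBi attractive for extrapolating to longer inputs.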
#### 3. Transformer Alternatives
- One line of work tries to replace the attention mechanism with state space models (**SSMs**).
- *H3*, with a shift matrix to recall previous tokens and multiplicative interactions for token comparisons.
- **Hyena operator**, a convolution-based sub-quadratic attention, applies an element-wise gating operation based on the operator's input to mimic the attention contextualization.
- **Block-State Transformer**, which builds upon a hybrid layer that combines an SSM for long-range contextualization with a Transformer for short-range interactions between tokens.
- **Receptance Weighted Key Value (RWKV)** combines the parallelization benefits of Transformer-based LLMs during training with the fast inference and low compute requirements of RNNs.

------------------

## 8/23

### Issue
- Limited context length

### Tasks
- Positional Encoding
- Scaling Transformers

### Survey
[Rethinking Positional Encoding in Language Pre-training](https://openreview.net/pdf?id=09-528y2Fgf)
![](https://hackmd.io/_uploads/rJoDQ2u23.png)
* Transformer with Untied Positional Encoding (TUPE) computes **the word contextual correlation** and **positional correlation** separately with different parameterizations and then adds them together.
![](https://hackmd.io/_uploads/BJun3t1Th.png)

[LONGNET: Scaling Transformers to 1,000,000,000 Tokens](https://arxiv.org/pdf/2307.02486.pdf)
* It has a linear computation complexity and a logarithmic dependency between any two tokens in a sequence.
* It can serve as a distributed trainer for extremely long sequences.
* Its **dilated attention** is a drop-in replacement for standard attention.
![](https://hackmd.io/_uploads/r1zcFHkp2.png)
* The figure shows how dilated attention splits the input (Q, K, V) into segments and sparsifies each segment along the sequence dimension by selecting rows with an interval $r$.
* The sparsified segments are then processed in parallel, and the outputs are concatenated to form the final output.

\begin{equation}
\begin{aligned}
&\widetilde{Q}_i=\left[Q_{i w}, Q_{i w+r}, Q_{i w+2 r}, \ldots, Q_{(i+1) w-1}\right]\\
&\widetilde{K}_i=\left[K_{i w}, K_{i w+r}, K_{i w+2 r}, \ldots, K_{(i+1) w-1}\right]\\
&\widetilde{V}_i=\left[V_{i w}, V_{i w+r}, V_{i w+2 r}, \ldots, V_{(i+1) w-1}\right]
\end{aligned}
\end{equation}
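To sanity-check the segment equations above, here is a toy numpy sketch (my own, not the LONGNET code) of the split into length-$w$ segments and the stride-$r$ row selection; it assumes the sequence length is a multiple of $w$.

```python
import numpy as np

def dilate(x, w, r):
    """x: (seq_len, d). Split into length-w segments, keep every r-th row in each."""
    seq_len, _ = x.shape
    segments = [x[i:i + w] for i in range(0, seq_len, w)]   # segment i covers rows iw ... (i+1)w - 1
    return [seg[::r] for seg in segments]                   # rows iw, iw + r, iw + 2r, ...

seq_len, d, w, r = 16, 4, 8, 2
Q, K, V = (np.random.randn(seq_len, d) for _ in range(3))
Q_t, K_t, V_t = dilate(Q, w, r), dilate(K, w, r), dilate(V, w, r)
# Attention runs inside each sparsified segment in parallel; the per-segment
# outputs are then concatenated (scattered back) to form the final output.
print(len(Q_t), Q_t[0].shape)                               # 2 (4, 4)
```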
### Thinking
- If the input is longer than the maximum context size, it cannot be processed as-is and has to be truncated.
- GPT-3 "davinci": supports at most 2,049 tokens.
- GPT-3 "curie": supports at most 2,049 tokens.
- GPT-3 "babbage": supports at most 2,049 tokens.
- Method:
  - Scale the Transformer, or use Transformer alternatives?
  - Find another positional encoding.
  - Summarize the previous inputs as the next input.

### Feedback
- Explore HMMs for generation.

------------------

## 9/13 Constrained text generation

### Problem of constrained text generation
* Sampling from the conditional distribution is intractable.
![Generating Language with Tractable Constraints](https://hackmd.io/_uploads/Syg0vV2Rh.png)
* We can explore the use of [GeLaTo](https://openreview.net/attachment?id=ET6qkbzeOx&name=pdf) (Generating Language with Tractable Constraints) for other natural language generation tasks, such as dialogue generation or machine translation.

### [Why is constrained neural language generation particularly challenging?](https://arxiv.org/pdf/2206.05395.pdf)
- Lack of model expressiveness
  - Current models are not expressive enough to incorporate arbitrary constraints.
- Lack of suitable evaluation metrics
- Difficulty in constrained optimization
  - Constraints are usually non-differentiable, especially at the token level.
- Lack of constrained text generation datasets
  - [CommonGen](https://arxiv.org/pdf/1911.03705.pdf)

### Task
* Text Style Transfer

### Thinking
* Limiting the generation scope:
  * In an **HMM**, the generation scope can be controlled by restricting state transitions and output probabilities.
  * This can prevent the generation of meaningless or nonsensical text.
* Data sparsity problem
* Computational complexity problem
* Model selection problem
* Multi-constraints
* Parameter efficiency
* Few-shot and zero-shot constrained generation

### Feedback
* Search more and deeper.

------------------

## 10/4 Survey of constrained text generation
![](https://hackmd.io/_uploads/r1j3k8qea.png)

## Challenge
1. Diversity and Quality:
   - Ensuring that generated text remains diverse and of high quality while adhering to constraints.
   - Strict constraints may limit the diversity of generated outputs, and maintaining high quality becomes challenging when constraints are complex or conflicting.
2. Incorporating Multiple Constraints:
   - Effectively handling multiple and possibly conflicting constraints.
   - Combining constraints in a way that produces coherent and meaningful text is challenging, especially when constraints have varying degrees of importance.

## Approach
- Tractable probabilistic models (TPMs)
  - HMMs
    - [GeLaTo](https://openreview.net/attachment?id=ET6qkbzeOx&name=pdf)
  - Probabilistic Circuits
    - [Probabilistic Generating Circuits](https://arxiv.org/pdf/2102.09768.pdf)
    - [Probabilistic Circuits: A Unifying Framework for Tractable Probabilistic Models](http://starai.cs.ucla.edu/papers/ProbCirc20.pdf)
    - [Scaling Up Probabilistic Circuits by Latent Variable Distillation](https://arxiv.org/pdf/2210.04398.pdf)
    - [Einsum Networks: Fast and Scalable Learning of Tractable Probabilistic Circuits](https://arxiv.org/pdf/2004.06231.pdf)
    - [Lossless Compression with Probabilistic Circuits](https://arxiv.org/pdf/2111.11632.pdf)
- Others
  - [COLD Decoding: Energy-based Constrained Text Generation with Langevin Dynamics](https://openreview.net/pdf?id=TiZYrQ-mPup)
  - [COLLIE: Systematic Construction of Constrained Text Generation Tasks](https://arxiv.org/pdf/2307.08689.pdf)

## Thinking
- Define constraints in a probabilistic circuit.
  - Specify the constraints you want to impose on the generated text using a probabilistic circuit.
- Optimization with a probabilistic circuit.
  - Learn the parameters of both the text generation model and the probabilistic circuit in a way that satisfies the defined constraints.
- Datasets
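As a concrete (and heavily simplified) picture of the first two points above — defining a constraint in a tractable model and using it during generation — here is a toy sketch in which a stand-in function plays the role of the tractable query $\Pr(\alpha \mid \text{prefix})$. The vocabulary, the probabilities, and the `keyword_ok` stand-in are all made up for illustration; this is not the GeLaTo implementation.

```python
import numpy as np

def constrained_step(p_lm, prefix, vocab, p_constraint):
    """p_lm: (V,) next-token probabilities from the LM.
    p_constraint(prefix + [tok]): probability that the constraint alpha can still be
    satisfied given the extended prefix -- assumed to be an exact query on a TPM."""
    weights = np.array([p_constraint(prefix + [tok]) for tok in vocab])
    p = p_lm * weights            # p(x_t | x_<t, alpha) ∝ p_LM(x_t | x_<t) * Pr(alpha | x_<=t)
    return p / p.sum()

# Hypothetical constraint: the keyword "cat" must appear somewhere in the output.
vocab = ["the", "cat", "sat", "down"]
p_lm = np.array([0.5, 0.1, 0.2, 0.2])

def keyword_ok(prefix):           # stand-in for the tractable Pr(alpha | prefix) query
    return 1.0 if "cat" in prefix else 0.4

print(constrained_step(p_lm, ["the"], vocab, keyword_ok))   # probability mass shifts toward "cat"
```

The point of a TPM such as an HMM or a PC is that the stand-in query can be computed exactly and efficiently instead of being approximated.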
## Feedback
- Go ahead.
- Catch the key words of the papers.
- Show the relations between different papers (in a table).
- Challenge.

## 10/24 Survey

### Probabilistic Circuits

| Name | Main difference |
|:-------- |:--------:|
| [Probabilistic Circuits: A Unifying Framework for Tractable Probabilistic Models](http://starai.cs.ucla.edu/papers/ProbCirc20.pdf) | Probabilistic inference, the many faces of probabilistic circuits, and so on. |
| [Probabilistic circuits: Representations, inference, learning and applications](https://www.youtube.com/watch?v=2RAG5-L9R70) | Why tractable inference? PCs, learning circuits, and advanced representations. |
| [Tractable Regularization of Probabilistic Circuits](https://openreview.net/pdf?id=W9oywyjO8VN) | They **combine** advantages of probabilistic graphical models (PGMs) with those of neural networks (NNs). |
| [Probabilistic Generating Circuits](https://arxiv.org/pdf/2102.09768.pdf) | Leaf nodes, which are $z_i$ or constants. |
| [Sparse Probabilistic Circuits via Pruning and Growing](https://openreview.net/pdf?id=KieCChVB6mN) | Combining pruning and growing operations to exploit the sparsity of PC structures. |
| [Generating Language with Tractable Constraints (GeLaTo)](https://openreview.net/attachment?id=ET6qkbzeOx&name=pdf) | Uses distilled **hidden Markov models**, with which we can efficiently compute $Pr(\text{text} \mid \alpha)$. |
| [Scaling Up Probabilistic Circuits by Latent Variable Distillation](https://arxiv.org/pdf/2210.04398.pdf) | To **address** the phenomenon that, as the number of parameters in PCs increases, their performance immediately **plateaus**. |

-----------------------

### Definition
- Probabilistic circuits (PCs):
  - A probabilistic circuit (PC) $\mathcal{C}$ over RVs $\mathbf{X}$ is a pair $(\mathcal{G}, \boldsymbol{\theta})$, where $\mathcal{G}$ is a **computational graph**, also called the circuit structure, that is parameterized by $\boldsymbol{\theta}$.
  - The PC $\mathcal{C}$ computes a function that characterizes a distribution $p(\mathbf{X})$.
![](https://hackmd.io/_uploads/Hy6YneBM6.png)
- Tractable probabilistic inference:
  - A class of queries $\mathbf{Q}$ is tractable on a family of probabilistic models $\mathcal{M}$ iff any query $q \in \mathbf{Q}$ on a model $m \in \mathcal{M}$ can be computed in time $\mathcal{O}(\operatorname{poly}(|m|))$.

### Motivation
- The first purpose is to unify the disparate formalisms proposed so far in the literature for tractable models.
- The second purpose of the PC framework is to enable reasoning over the tractable bands of a model class in terms of some well-defined structural properties only.

### Challenge
- Scaling up such models is a key challenge.
- Learn tractable models on millions of datapoints and thousands of features in tractable time.

### Feedback
- Find some tasks to solve.

## 11/15

### Surveys
[Parallel Refinements for Lexically Constrained Text Generation with BART](https://arxiv.org/pdf/2109.12487.pdf)
- CBART leverages the pre-trained model BART and transfers part of the generation burden from the decoder to the encoder.
![image.png](https://hackmd.io/_uploads/r1ReaGqQp.png =80%x)
- Guided by the encoder, the decoder refines multiple tokens of the input in one step by inserting tokens before specific positions and re-predicting tokens with low confidence.
- To further reduce the inference latency, the decoder predicts all tokens in **parallel**.

[POINTER: Constrained Progressive Text Generation via Insertion-based Generative Pre-training](http://aclanthology.lst.uni-saarland.de/2020.emnlp-main.698.pdf)
- The proposed method operates by progressively inserting new tokens between existing tokens in a **parallel** manner.
- POINTER allows long-term control over generation due to the top-down progressive structure.
![](https://hackmd.io/_uploads/SkFeBYDX6.png)
![image](https://hackmd.io/_uploads/rJQxRoxET.png)

### Advantage
- Customization and control
- Task-specific requirements

### Challenge
- Diversity
- Multi-constraint

:::info
:bulb: **Idea**
1. Combine probabilistic circuits (PCs) with constrained text generation (CTG).
2. Leveraging PCs might introduce challenges related to model training and computational complexity.
:::
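The idea above hinges on the tractable-inference property from the 10/24 definition. To keep that definition concrete, here is a minimal hand-built sum-product circuit over two binary variables (my own toy example, not code from any of the cited papers); marginalizing a variable only requires setting its leaves to 1 during a single bottom-up pass.

```python
def leaf(p_true, value):
    """Bernoulli leaf over one binary variable; value in {0, 1, None}, None = marginalized out."""
    if value is None:
        return 1.0                              # the leaf summed over both of its states
    return p_true if value == 1 else 1.0 - p_true

def pc(x1, x2):
    """Sum node over two product nodes; the mixture weights and leaf parameters are theta."""
    prod_a = leaf(0.9, x1) * leaf(0.2, x2)      # product node with scope {X1, X2}
    prod_b = leaf(0.1, x1) * leaf(0.8, x2)
    return 0.6 * prod_a + 0.4 * prod_b          # sum node (mixture) on top

print(pc(1, 0))       # joint probability p(X1=1, X2=0) = 0.44
print(pc(1, None))    # marginal p(X1=1) = 0.58, from the same single bottom-up pass
```

For such marginals to be exact, the circuit has to be smooth and decomposable, which this tiny example is by construction.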
## 12/6

### Title
Domain adaptation to control the user's input

### The problem to handle
:::info
**Direction**
Different prompts but the same output
- Understand whether the user wants to ask about a theoretical method?
  - prompt
  - controllable
  - constraint
  - parameter-efficient

What I want to do: handle different domains within the same task
- domain adaptation

Novelty? Previous weakness? How do I improve?
:::

-----------

## 12/26 Enhancing NLG Consistency

### Title
"Enhancing NLG Consistency Across Diverse Inputs Using Data Augmentation and Keyword-Driven Prompts"
"CID: **C**onsistent NLG with **I**nput **D**iversity using Data Augmentation and Keyword-Driven Prompts"

### Problem definition
![image](https://hackmd.io/_uploads/HJobChIP6.png =80%x)

Data Augmentation
![image](https://hackmd.io/_uploads/ByOW9yODa.png =80%x)

Inference example
- Input: "I'm currently immerse in deep research of nature language generation task."
- Answer: "If you have any specific questions or if there's a particular aspect of your research you'd like to discuss, feel free to share. I'm here to assist you in your endeavors related to natural language generation."
- Input: "I concentrating to address the various challenges brings by natural language generation."
- **The output should stay consistent even when the input varies.**

#### Why this task is an issue
**Real-world Application Scenarios:**
- NLG systems often encounter diverse inputs from different users or contexts.
- Effectively handling this diversity and generating consistent outputs can better meet user requirements, enhancing the practicality of the system.

**Robustness and Generalization:**
- Considering the diversity of inputs in the real world, making NLG models more robust and capable of generalization is crucial.
- Introducing diverse inputs during training and emphasizing consistency can assist the model in adapting better to a variety of situations.

**Reduced Bias:**
- Denoising can help reduce biases present in the input, promoting fairness and equity in the generated content.

### Previous tasks
[Semantic Accuracy in Natural Language Generation: A Thesis Proposal](https://aclanthology.org/2023.acl-srw.48.pdf)
- They propose a unified benchmark for NLG metrics focusing on semantic accuracy.

Prompt?
[AUTOPROMPT: Eliciting Knowledge from Language Models with Automatically Generated Prompts](https://arxiv.org/pdf/2010.15980.pdf)
![image](https://hackmd.io/_uploads/HybXCoUw6.png)

[Towards a Better Understanding of Noise in Natural Language Processing](https://aclanthology.org/2021.ranlp-1.7.pdf)
![](https://hackmd.io/_uploads/Hkbp3GAsh.png)

Self-supervised learning
- SimCLR

Disentangled representation learning for text and emotion or keywords?
- This aims to capture the different dimensions of variation of a text in separate vector embeddings.

### Idea
Disentanglement-based models offer two main advantages:
1. Sampling from the latent space of the style embeddings allows for more diverse and controlled stylistic generation.
2. Similarity of documents can now be calculated for each aspect of variation, allowing for finer-grained retrieval.

Objective
$$
p(y \mid x_1) = p(y \mid x_2)
$$

Problem
$$
\prod_{t=1}^{T} p(y_t \mid y_{<t}, x, c)
$$
where $c$ can be the keyword condition.

### Challenge
- Not enough datasets:
  - Use an autoencoder to generate similar sentences.
- How to extract the keywords.
- How to know the inputs are the same.
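One way to operationalize the last challenge above ("how to know the inputs are the same") is embedding-based paraphrase detection. Below is a minimal sketch, assuming the `sentence-transformers` package; the model name and the 0.8 threshold are illustrative choices, not tuned values.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")   # illustrative model choice

def same_intent(x1, x2, threshold=0.8):
    """Treat two inputs as 'the same' if their embeddings are close enough."""
    e1, e2 = model.encode([x1, x2], convert_to_tensor=True)
    return util.cos_sim(e1, e2).item() >= threshold

x1 = "I'm currently immersed in deep research on natural language generation."
x2 = "I am concentrating on addressing the various challenges brought by natural language generation."
print(same_intent(x1, x2))
```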
:::danger
**Feedback**:
- Title
- Novelty
- Method - can't just combine prompting and extraction
- Previous work
- Fix the equation
:::

-----------------------

## 1/10 Survey previous works

### Title: Enhancing Consistency in Output Despite Poor Input toward ... Approach

### Coherence, semantics, and paraphrasing
"coherent response generation"
- Learning to Copy Coherent Knowledge for Response Generation
- [Towards Diverse, Relevant and Coherent Open-Domain Dialogue Generation via Hybrid Latent Variables](https://ojs.aaai.org/index.php/AAAI/article/view/26594)
![image](https://hackmd.io/_uploads/BJvipIBua.png =80%x)

"semantic similarity in NLG," "paraphrasing consistency"
- [Unsupervised Paraphrasing Consistency Training for Low Resource Named Entity Recognition](https://aclanthology.org/2021.emnlp-main.430.pdf)
  - They convert the Conditional Random Field (CRF) into a multi-label classification module and encourage consistency of entity appearance between the original and paraphrased sequences.

### Others' previous tasks
- Story generation
  - Stories using an abstract as the outline
  - [Consistency and Coherency Enhanced Story Generation](https://arxiv.org/pdf/2010.08822.pdf)
  ![image](https://hackmd.io/_uploads/SyFaG4-_T.png)
- Summarization

### Idea
I want to train a model to generate coherent responses based on input sentences with similar meanings but expressed differently.

Objective:
$$
\operatorname{sim}(f(x_1), f(x_2))
$$
$$
L(x_1, x_2) = \max\left(0,\ m - \operatorname{sim}(f(x_1), f(x_2)) + \operatorname{sim}(f(x_1'), f(x_2))\right)
$$
where $x_1'$ is a negative (non-paraphrase) input.
$$
L(x_1, x_2) + \alpha \cdot C(x_1, y_2) + \beta \cdot C(x_2, y_1)
$$
$C$ is a consistency metric. (A minimal PyTorch sketch of this objective appears at the end of this entry.)

Because of the lack of gold answers for this task:
- Contrastive Learning
- Self-Supervised Learning

### Todo
- Key information extraction
- Context-aware processing
- Consistency modeling
- Try to use the datasets from BERTScore.
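A minimal PyTorch rendering of the contrastive consistency objective sketched in the Idea section above, with the encoder $f$ left abstract; the margin, batch size, and embedding size are placeholders.

```python
import torch
import torch.nn.functional as F

def consistency_loss(f_x1, f_x2, f_x1_neg, margin=0.5):
    """f_x1, f_x2: encodings of two paraphrased inputs; f_x1_neg: encoding of an unrelated input."""
    pos = F.cosine_similarity(f_x1, f_x2, dim=-1)          # pull paraphrases together
    neg = F.cosine_similarity(f_x1_neg, f_x2, dim=-1)      # push the negative away
    return torch.clamp(margin - pos + neg, min=0).mean()

# Toy usage with random vectors standing in for f(x); in practice f is a text encoder.
f_x1, f_x2, f_x1_neg = (torch.randn(4, 128) for _ in range(3))
print(consistency_loss(f_x1, f_x2, f_x1_neg))
```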