Challenges and Applications of Large Language Models
8/1
Task
- To resolve Fine-Tuning Overhead
- The additional computational and memory resources required to adapt a pre-trained Large Language Model to perform well on a specific downstream task.
- Limited Context Length
- The challenge of processing long inputs in natural language processing (NLP) tasks.
Datasets
Fine-Tuning Overhead
- Fine-tuning an LLM for a specific downstream task (Challenges and Applications of Large Language Models)
- Adapter (Towards a Unified View of Parameter-Efficient Transfer Learning)
- Recent work:
- Liu et al. [331] introduce
- Malladi et al. [355] propose a memory-efficient zeroth-order (MeZO) optimizer.
- Hu et al. [218] propose LoRA (see the sketch after this list).
- Dettmers et al. [118] extend LoRA to quantized LLMs.
- Recent issues:
- Despite substantial improvements in memory complexity needed to fine-tune LLMs for specific tasks, a remaining challenge is the time complexity.
- Parameter-efficient fine-tuning of LLMs still requires computing full forward/backward passes throughout the whole network.
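A minimal LoRA-style sketch of the idea behind Hu et al. [218], assuming PyTorch; the rank `r`, scaling `alpha`, and layer sizes are illustrative choices, not values from the survey.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pre-trained linear layer plus a trainable low-rank update (alpha/r) * B A."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                       # pre-trained weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768))
out = layer(torch.randn(2, 10, 768))   # only A and B receive gradients during fine-tuning
```

Only the low-rank matrices A and B are updated, which is why optimizer-state memory drops sharply; the remaining cost of full forward/backward passes through the whole network is exactly the time-complexity issue noted above.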
Limited Context Length
- Having an architecture that can infer long inputs does not guarantee that the LLM will perform as well on those as on shorter inputs.
- Limited context lengths are a barrier to handling long inputs well, which is needed for applications such as novel or textbook writing and summarization.
1. Efficient Attention Mechanisms
- Luna, a linear unified nested attention mechanism that approximates softmax attention with two nested linear attention functions.
- These nested linear attention functions require substantially less memory and compute than full dot-product attention.
- Transient Global, which is an extension of local attention where each token can attend to nearby tokens and a set of global tokens.
- The self-attention mechanism is the fundamental building block of the Transformer.
- The longer the input, the more important the positional embedding scheme becomes.
2. Positional Embeddings
- Absolute Positional Embeddings, e.g., sinusoidal embeddings
- Relative Positional Embeddings
- All unseen absolute positions will be converted to previously observed relative offsets between positions, enabling better generalization to long input sequences at inference time.
- Rotary Position Embeddings (RoPE)
- Incorporates absolute positional information via a rotation matrix and models the relative positional offset through a rotation (see the sketch after this list).
- Relative Positional Bias
- ALiBi (Attention with Linear Biases)
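A minimal RoPE sketch, assuming PyTorch and the "rotate-half" convention; the sequence length and head dimension are arbitrary illustrative values.

```python
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate pairs of feature dimensions by a position-dependent angle."""
    seq_len, dim = x.shape[-2], x.shape[-1]
    half = dim // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)   # one frequency per pair
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q, k = rope(torch.randn(8, 64)), rope(torch.randn(8, 64))
scores = q @ k.T   # each score now depends only on the relative offset between positions
```

Applying the same position-dependent rotation to queries and keys before the dot product is what turns absolute positions into a purely relative offset inside the attention score.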
3. Transformer Alternatives
- One line of work tries to replace the attention mechanism with state space models (SSMs).
- H3 uses a shift matrix to recall previous tokens and multiplicative interactions for token comparisons.
- Hyena operator, a convolution-based sub-quadratic attention, applies an element-wise gating operation based on the operator’s input to mimic the attention contextualization.
- Block-State Transformer, which builds upon a hybrid layer that combines an SSM for long-range contextualization and a Transformer for short-range interactions between tokens.
- Receptance Weighted Key Value (RWKV) to combine the parallelization benefits of Transformer-based LLMs during training with the fast inference and low compute requirements of RNNs.
8/23
Issue
Tasks
- Positional Encoding
- Scaling Transformers
Survey
Rethinking Positional Encoding in Language Pre-training
- Transformer with Untied Positional Encoding (TUPE) computes the word contextual correlation and positional correlation separately with different parameterizations and then adds them together (a rough sketch follows).
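A rough sketch of the untied score computation in TUPE, assuming PyTorch; single head, random projections, and the special [CLS] untying from the paper is omitted.

```python
import torch

d, seq = 16, 6
x = torch.randn(seq, d)                              # token (word) embeddings
p = torch.randn(seq, d)                              # absolute positional embeddings
Wq, Wk = torch.randn(d, d), torch.randn(d, d)        # projections for word correlation
Uq, Uk = torch.randn(d, d), torch.randn(d, d)        # separate projections for positions

word_corr = (x @ Wq) @ (x @ Wk).T                    # word-to-word contextual correlation
pos_corr = (p @ Uq) @ (p @ Uk).T                     # position-to-position correlation
scores = (word_corr + pos_corr) / (2 * d) ** 0.5     # the two terms are simply added
attn = torch.softmax(scores, dim=-1)
```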
LONGNET: Scaling Transformers to 1,000,000,000 Tokens
- It has linear computational complexity and a logarithmic dependency between any two tokens in a sequence.
- It can serve as a distributed trainer for extremely long sequences.
- Its dilated attention is a drop-in replacement for standard attention.
- Dilated attention splits the input (Q, K, V) into segments and sparsifies each segment along the sequence dimension by selecting rows with an interval r (sketched below).
- The sparsified segments are then processed in parallel, and the outputs are concatenated to form the final output.
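A toy sketch of the split-and-sparsify step described above, assuming NumPy; real dilated attention mixes several (segment length, interval) pairs and runs attention inside each sparsified segment before recombining.

```python
import numpy as np

def sparsify(x: np.ndarray, segment: int, r: int) -> list[np.ndarray]:
    """Split the sequence into segments and keep every r-th row inside each segment."""
    return [x[start:start + segment][::r] for start in range(0, x.shape[0], segment)]

tokens = np.arange(16 * 4).reshape(16, 4)        # toy (seq_len=16, dim=4) input
parts = sparsify(tokens, segment=8, r=2)         # 2 segments, every 2nd row kept
print([p.shape for p in parts])                  # [(4, 4), (4, 4)] -> attended to in parallel
```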
Thinking
- If the input is longer than the model's maximum context size, the excess tokens cannot be processed and must be truncated or dropped.
- The GPT-3 base models ("davinci", "curie", "babbage") each support at most 2,048 tokens.
- Method:
- Scale the Transformer, or use Transformer alternatives?
- Find another positional encoding.
- Summarize the previous inputs and feed the summary as the next input (see the sketch below).
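A rough sketch of the "summarize previous inputs" idea, assuming a hypothetical `summarize(prompt)` LLM call and a crude character budget; this is only one possible way to realize the note above, not an established recipe.

```python
def rolling_summary(chunks: list[str], summarize, char_budget: int = 8000) -> str:
    """Fold a long document into a limited context window by carrying a running summary."""
    summary = ""
    for chunk in chunks:
        prompt = (f"Summary so far:\n{summary}\n\n"
                  f"New text:\n{chunk}\n\n"
                  f"Update the summary to cover both.")
        summary = summarize(prompt[:char_budget])   # summarize() is a hypothetical LLM call
    return summary
```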
Feedback
9/13 Constrained text generation
Problem of Constrained text generation
- Sampling from the LM's conditional distribution given a constraint is intractable (a naive rejection-sampling baseline is sketched after this list).
- We can explore the use of GeLaTo (Generating Language with Tractable Constraints) for other natural language generation tasks, such as dialogue generation or machine translation.
- Lack of model expressiveness
- Current models are not expressive enough to incorporate arbitrary constraints.
- Lack of suitable evaluation metrics
- Difficulty in constrained optimization
- Constraints are usually non-differentiable, especially at the token level.
- Lack of constrained text generation datasets
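A naive rejection-sampling baseline for a single lexical constraint, assuming a hypothetical unconditional sampler `sample_lm()`; it illustrates why exact conditional sampling is treated as intractable: the acceptance rate collapses as constraints become rarer or more numerous, which is what approaches like GeLaTo are designed to avoid.

```python
def constrained_sample(sample_lm, must_contain: str, max_tries: int = 1000) -> str | None:
    """Keep sampling until an output happens to satisfy the lexical constraint."""
    for _ in range(max_tries):
        text = sample_lm()              # hypothetical unconditional LM sampler
        if must_contain in text:        # accept only if the constraint word appears
            return text
    return None                         # rare constraints: expected tries blow up
```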
Task
Thinking
- Limiting Generation Scope:
- In an HMM, the generation scope can be controlled by restricting state transitions and emission probabilities (see the toy sketch after this list).
- This can prevent the generation of meaningless or nonsensical text.
- Data sparsity problem
- Computational complexity problem
- The model selection problem
- Multi-constraints
- Few-Shot and Zero-Shot Constrained Generation
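A toy HMM sketch of "limiting the generation scope", assuming a hand-made 3-state model with NumPy; the states, vocabulary, and probabilities are invented for illustration. Zeroed-out entries in the transition and emission matrices are exactly the "restrictions" mentioned above.

```python
import numpy as np

rng = np.random.default_rng(0)
states = ["TOPIC", "DETAIL", "END"]
vocab = ["cat", "sits", "mat", "."]
A = np.array([[0.0, 0.9, 0.1],        # TOPIC cannot loop on itself (banned transition = 0)
              [0.0, 0.6, 0.4],
              [0.0, 0.0, 1.0]])
B = np.array([[1.0, 0.0, 0.0, 0.0],   # TOPIC may only emit "cat" (scope restriction)
              [0.0, 0.5, 0.5, 0.0],
              [0.0, 0.0, 0.0, 1.0]])  # END always emits "."

s, out = 0, []
while states[s] != "END" and len(out) < 10:
    out.append(str(rng.choice(vocab, p=B[s])))     # emit a word allowed in this state
    s = int(rng.choice(len(states), p=A[s]))       # move only along allowed transitions
out.append(str(rng.choice(vocab, p=B[s])))         # END emits the closing "."
print(" ".join(out))
```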
Feedback
10/4 Survey of constrained text generation
Challenge
- Diversity and Quality:
- Ensuring that generated text remains diverse and of high quality while adhering to constraints.
- Strict constraints may limit the diversity of generated outputs, and maintaining high quality becomes challenging when constraints are complex or conflicting.
- Incorporating Multiple Constraints:
- Effectively handling multiple and possibly conflicting constraints.
- Combining constraints in a way that produces coherent and meaningful text is challenging, especially when constraints may have varying degrees of importance.
Approach
- Tractable probabilistic models (TPMs)
- HMMs
- Probabilistic Circuits
- Others
Thinking
- Define Constraints in Probabilistic Circuit.
- Specify the constraints you want to impose on the generated text using a probabilistic circuit.
- Optimization with Probabilistic Circuit
- Learning the parameters of both the text generation model and the probabilistic circuit in a way that satisfies the defined constraints.
- Datasets
Feedback
Go ahead.
Catch the key words of the papers.
Show the relations between different papers (in a table).
Challenge
10/24 Survey
Probabilistic Circuits
Definition
- Probabilistic circuits (PCs):
- A probabilistic circuit over random variables $\mathbf{X}$ is a pair $(\mathcal{G}, \boldsymbol{\theta})$, where $\mathcal{G}$ is a computational graph, also called the circuit structure, that is parameterized by $\boldsymbol{\theta}$.
- The PC computes a function that characterizes a distribution $p(\mathbf{X})$ (a toy circuit is sketched below).
- Tractable probabilistic inference:
- A class of queries $\mathcal{Q}$ is tractable on a family of probabilistic models $\mathcal{M}$ iff any query $q \in \mathcal{Q}$ on a model $m \in \mathcal{M}$ can be computed in time $O(\mathrm{poly}(|m|))$.
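A toy probabilistic circuit over two binary variables, written as plain Python to make the definition concrete: one sum node over two product nodes of Bernoulli leaves, with invented weights. Evaluating the graph bottom-up yields $p(x_1, x_2)$ in a single feed-forward pass, which is where the tractability comes from.

```python
def bernoulli(p):                                   # leaf unit: returns p(x) for x in {0, 1}
    return lambda x: p if x == 1 else 1 - p

leaf_a1, leaf_a2 = bernoulli(0.8), bernoulli(0.3)   # leaves of product node A
leaf_b1, leaf_b2 = bernoulli(0.2), bernoulli(0.9)   # leaves of product node B
w_a, w_b = 0.6, 0.4                                 # sum-node (mixture) weights

def pc(x1, x2):
    prod_a = leaf_a1(x1) * leaf_a2(x2)              # product node: factorized term
    prod_b = leaf_b1(x1) * leaf_b2(x2)
    return w_a * prod_a + w_b * prod_b              # sum node: weighted mixture

print(sum(pc(a, b) for a in (0, 1) for b in (0, 1)))   # 1.0 -> a normalized distribution
```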
Motivation
- The first one is to unify the disparate formalisms proposed so far in the literature for tractable models.
- The second purpose of the PC framework is to enable reasoning over the tractable bands of a model class in terms of some well-defined structural properties only.
Challenge
- Scaling up such models is a key challenge
- Learn tractable models on millions of datapoints and thousands of features in tractable time.
Feedback
Find some tasks to solve.
11/15
Surveys
Parallel Refinements for Lexically Constrained Text Generation with BART
- CBART leverages the pre-trained model BART and transfers part of the generation burden from the decoder to the encoder

- Guided by the encoder, the decoder refines multiple tokens of the input in one step by inserting tokens before specific positions and re-predicting tokens with low confidence
- To further reduce the inference latency, the decoder predicts all tokens in parallel
POINTER: Constrained Progressive Text Generation via Insertion-based Generative Pre-training
- The proposed method operates by progressively inserting new tokens between existing tokens in a parallel manner (a toy insertion loop is sketched below).
- POINTER allows long-term control over generation due to its top-down progressive structure.
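A toy, model-free sketch of the progressive insertion loop behind POINTER/CBART; `propose_insertions` is a hypothetical stand-in for the insertion model, here a fixed lookup table so the loop is runnable.

```python
def progressive_insert(tokens: list[str], propose_insertions, rounds: int = 3) -> list[str]:
    """Repeatedly insert new tokens between existing ones, coarse-to-fine."""
    for _ in range(rounds):
        out = []
        for left, right in zip(tokens, tokens[1:] + [None]):
            out.append(left)
            filler = propose_insertions(left, right)   # may return None (no insertion)
            if filler:
                out.append(filler)
        tokens = out
    return tokens

# hypothetical "model": a lookup table from (left, right) pairs to a token to insert
table = {("the", "sat"): "cat", ("cat", "sat"): "quietly", ("sat", None): "down"}
print(progressive_insert(["the", "sat"], lambda l, r: table.get((l, r))))
# ['the', 'cat', 'quietly', 'sat', 'down']
```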


Advantage
- Customization and Control
- Task-specific Requirements
Challenge
- Diversity
- Multi-constraint
Idea
- Incorporate probabilistic circuits (PCs) into constrained text generation (CTG)
- Leveraging PCs might introduce challenges related to model training and computational complexity
12/6
Title
Domain-adaptation to control the user's input
The problem to handle
Direction
Different prompts but the same output
- Understand what the user wants to ask
Theoretical method?
- prompt
- controllable
- constraint
- parameter-efficient
What I want to do: the task is handling different domains within the same task
Novelty?
Previous weakness?
How do I improve?
12/26 Enhancing NLG Consistency
Title
"Enhancing NLG Consistency Across Diverse Inputs Using Data Augmentation and Keyword-Driven Prompts"
"CID: Consistent NLG with Input Diversity using Data Augmentation and Keyword-Driven Prompts"
Problem definition

Data Augmentation

Inference Example
Input: I'm currently immerse in deep research of nature language generation task.
Answer: If you have any specific questions or if there's a particular aspect of your research you'd like to discuss, feel free to share. I'm here to assist you in your endeavors related to natural language generation.
Input: I concentrating to address the various challenges brings by natural language generation.
The outputs should be consistent even when the inputs vary in phrasing.
Why this task is an issue
Real-world Application Scenarios:
- NLG systems often encounter diverse inputs from different users or contexts.
- Effectively handling this diversity and generating consistent outputs can better meet user requirements, enhancing the practicality of the system.
Robustness and Generalization:
- Considering the diversity of inputs in the real world, making NLG models more robust and capable of generalization is crucial.
- Introducing diverse inputs during training and emphasizing consistency can assist the model in adapting better to a variety of situations.
Reduced Bias:
- Denoising can help reduce biases present in the input, promoting fairness and equity in the generated content.
Previous tasks
Semantic Accuracy in Natural Language Generation: A Thesis Proposal
- They proposed a unified benchmark for NLG metrics focusing on semantic accuracy
Prompt?
AUTOPROMPT: Eliciting Knowledge from Language Models with Automatically Generated Prompts

Towards a Better Understanding of Noise in Natural Language Processing

Self-supervised-learning
Disentangled representation learning for text and emotion or keywords?
- This aims to capture the different dimensions of variation of a text in separate vector embeddings.
Idea
Disentanglement-based models offer two main advantages:
- Sampling from the latent space of the style embeddings allows for more diverse and controlled stylistic generation.
- Similarity of documents can now be calculated for each aspect of variation, allowing for finer-grained retrieval.
Objective
Problem
c can be the keyword condition
Challenge
Not enough datasets:
- Use an autoencoder to generate similar sentences.
How to extract the keywords (a toy extractor is sketched below)
How to know whether the inputs have the same meaning
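A toy frequency-based keyword extractor, just to make the open question concrete; the stop-word list is ad hoc, and in practice something like TF-IDF, RAKE, or KeyBERT would be a more plausible choice.

```python
from collections import Counter

STOP = {"the", "a", "an", "of", "to", "in", "is", "and", "i", "am", "i'm", "currently"}

def keywords(text: str, k: int = 3) -> list[str]:
    tokens = [t.strip(".,!?").lower() for t in text.split()]
    counts = Counter(t for t in tokens if t and t not in STOP)
    return [w for w, _ in counts.most_common(k)]

print(keywords("I'm currently immersed in deep research of the natural language generation task."))
```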
Feedback:
Title, novelty, method
- Can't just combine prompting and keyword extraction
Previous work
Fix the equation
1/10 Survey previous works
Coherence, semantic similarity, and paraphrasing.
"coherent response generation,"
Learning to Copy Coherent Knowledge for Response Generation
Towards Diverse, Relevant and Coherent Open-Domain Dialogue Generation via Hybrid Latent Variables

"semantic similarity in NLG,"
"paraphrasing consistency."
Unsupervised Paraphrasing Consistency Training for Low Resource Named Entity Recognition
- They convert the Conditional Random Field (CRF) into a multi-label classification module and encourage consistency of entity appearance between the original and paraphrased sequences.
Others' previous tasks
story generation
summarization
Idea
I want to train a model to generate coherent responses based on input sentences with similar meanings but expressed differently.
Objective:
- A consistency metric between outputs generated from paraphrased inputs (see the sketch after this list).
Because of the lack of correct answers in this task:
- Contrastive Learning
- Self-Supervised Learning
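A sketch of a pairwise consistency metric over outputs generated from paraphrased inputs, assuming a hypothetical sentence-embedding function `embed` (e.g., a BERT-style encoder; BERTScore could play a similar role at the token level).

```python
import numpy as np

def consistency(outputs: list[str], embed) -> float:
    """Mean pairwise cosine similarity; closer to 1.0 means more consistent outputs."""
    vecs = [np.asarray(embed(o), dtype=float) for o in outputs]
    sims = []
    for i in range(len(vecs)):
        for j in range(i + 1, len(vecs)):
            a, b = vecs[i], vecs[j]
            sims.append(float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b))))
    return sum(sims) / len(sims)

# usage idea: score the outputs produced from several paraphrases of the same input
```

Maximizing this kind of score, or using it to select positive pairs, is one way to set up the contrastive or self-supervised objectives listed above.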
Todo
Key Information Extraction
Context-Aware Processing Consistency Modeling
Try to use the datasets from the BERTScore paper.