###### tags: `PROGRESS`

[Challenges and Applications of Large Language Models](https://arxiv.org/pdf/2307.10169.pdf#page11)

## 8/1

### Task
- To resolve Fine-Tuning Overhead
  - The additional computational and memory resources required to adapt a pre-trained Large Language Model to perform well on a specific downstream task.
- Limited Context Length
  - The challenge of processing long inputs in natural language processing (NLP) tasks.

### Datasets
- [ChatGLM-Efficient-Tuning/data/](https://github.com/hiyouga/ChatGLM-Efficient-Tuning/tree/main/data)
- [ChatGLM-6B/ptuning](https://github.com/THUDM/ChatGLM-6B/tree/main/ptuning#%E4%B8%8B%E8%BD%BD%E6%95%B0%E6%8D%AE%E9%9B%86)
- [LLaMA_3b_v2_huggingface](https://huggingface.co/openlm-research/open_llama_3b_v2)

### Fine-Tuning Overhead
- Fine-tuning an LLM for a specific downstream task. [(Challenges and Applications of Large Language Models)](https://arxiv.org/pdf/2307.10169.pdf#page11)
![](https://hackmd.io/_uploads/BkRIzbGih.png)
- Adapter [(Towards a Unified View of Parameter-Efficient Transfer Learning)](https://arxiv.org/pdf/2110.04366.pdf)
![](https://hackmd.io/_uploads/HkzJm-zi3.png)
- Recent work:
  - Liu et al. [331] introduce $(IA)^3$.
  - Malladi et al. [355] propose a memory-efficient zeroth-order (MeZO) optimizer.
  - Hu et al. [218] propose LoRA.
  - Dettmers et al. [118] extend LoRA to quantized LLMs.
- Remaining issue:
  - Despite substantial improvements in the *memory complexity* needed to fine-tune LLMs for specific tasks, a remaining challenge is the **time complexity**.
  - Parameter-efficient fine-tuning of LLMs still requires computing full forward/backward passes through the whole network.

### Limited Context Length
- Having an architecture that can ingest long inputs does not guarantee that the LLM will perform as well on them as on shorter inputs.
- Limited context lengths are a barrier to handling long inputs well, which is needed for applications such as novel or textbook writing and summarization.

#### 1. Efficient Attention Mechanisms
- **Luna**: a linear unified nested attention mechanism that approximates softmax attention with two nested linear attention functions.
- Approaches that approximate **the dot-product attention** while requiring substantially less memory and compute resources.
- **Transient Global**: an extension of local attention where each token can attend to nearby tokens and a set of global tokens.
- The fundamental building block is the self-attention mechanism.
- The longer the input is, the more important the positional embedding becomes.

#### 2. Positional Embeddings
- Absolute Positional Embeddings
  - sinusoidal embeddings
- Relative Positional Embeddings
  - All unseen absolute positions will be converted to previously observed relative offsets between positions, enabling better generalization to long input sequences at inference time.
- Rotary Position Embeddings (RoPE)
  - Incorporate absolute positional information in a rotation matrix and model the relative positional offset through a rotation.
$$
\operatorname{softmax}\left(\frac{1}{\sqrt{d}} \sum_{i, j} \boldsymbol{x}_i^{\top} \boldsymbol{W}_q^{\top} \boldsymbol{R}_{\Theta,(i-j)}^d \boldsymbol{W}_k \boldsymbol{x}_j\right)
$$
- Relative Positional Bias
$$
\operatorname{softmax}\left(\frac{1}{\sqrt{d}} \sum_{i, j} \boldsymbol{x}_i^{\top} \boldsymbol{W}_q^{\top} \boldsymbol{W}_k \boldsymbol{x}_j+b_{i-j}\right)
$$
- ALiBi (Attention with Linear Biases)
$$
\operatorname{softmax}\left(\frac{1}{\sqrt{d}} \sum_{i, j} \boldsymbol{x}_i^{\top} \boldsymbol{W}_q^{\top} \boldsymbol{W}_k \boldsymbol{x}_j+m \times -(i-j)\right)
$$
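As a concrete reading of the ALiBi formula above, here is a minimal PyTorch sketch (a toy illustration, not the paper's implementation) of adding the head-specific linear bias $m \times -(i-j)$ to causal attention scores for a single head; the tensor shapes and the slope value are arbitrary.

```python
import torch

def alibi_attention(q, k, slope):
    """q, k: (seq_len, d) for a single head; slope: the per-head scalar m."""
    seq_len, d = q.shape
    scores = q @ k.T / d ** 0.5                    # (1 / sqrt(d)) * q_i . k_j
    i = torch.arange(seq_len).unsqueeze(1)         # query positions
    j = torch.arange(seq_len).unsqueeze(0)         # key positions
    scores = scores + slope * -(i - j)             # the linear bias m * -(i - j)
    causal = j <= i                                # attend only to current and past tokens
    scores = scores.masked_fill(~causal, float("-inf"))
    return torch.softmax(scores, dim=-1)

q, k = torch.randn(8, 16), torch.randn(8, 16)
attn = alibi_attention(q, k, slope=0.5)            # ALiBi uses head-specific slopes such as 2^-1, 2^-2, ...
print(attn.shape)                                  # torch.Size([8, 8])
```

Because the bias depends only on the distance $i-j$, no positional embedding is added to the token embeddings, which is what makes ALiBi attractive for extrapolating to longer inputs.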
#### 3. Transformer Alternatives
- One line of work tries to replace the attention mechanism with state space models (**SSMs**).
- *H3*, with a shift matrix to recall previous tokens and multiplicative interactions for token comparisons.
- **Hyena operator**, a convolution-based sub-quadratic attention, applies an element-wise gating operation based on the operator's input to mimic the attention contextualization.
- **Block-State Transformer**, which builds upon a hybrid layer that combines an SSM for long-range contextualization with a Transformer for short-range interactions between tokens.
- **Receptance Weighted Key Value (RWKV)** combines the parallelization benefits of Transformer-based LLMs during training with the fast inference and low compute requirements of RNNs.

------------------

## 8/23

### Issue
- Limited context length

### Tasks
- Positional Encoding
- Scaling Transformers

### Survey
[Rethinking Positional Encoding in Language Pre-training](https://openreview.net/pdf?id=09-528y2Fgf)
![](https://hackmd.io/_uploads/rJoDQ2u23.png)
* Transformer with Untied Positional Encoding (TUPE) computes **the word contextual correlation** and **positional correlation** separately with different parameterizations and then adds them together.
![](https://hackmd.io/_uploads/BJun3t1Th.png)

[LONGNET: Scaling Transformers to 1,000,000,000 Tokens](https://arxiv.org/pdf/2307.02486.pdf)
* It has a linear computation complexity and a logarithmic dependency between any two tokens in a sequence.
* It can serve as a distributed trainer for extremely long sequences.
* Its **dilated attention** is a drop-in replacement for standard attention.
![](https://hackmd.io/_uploads/r1zcFHkp2.png)
* The figure shows how dilated attention splits the input (Q, K, V) into segments and sparsifies each segment along the sequence dimension by selecting rows with an interval $r$.
* The sparsified segments are then processed in parallel, and the outputs are concatenated to form the final output.

\begin{equation}
\begin{aligned}
&\widetilde{Q}_i=\left[Q_{i w}, Q_{i w+r}, Q_{i w+2 r}, \ldots, Q_{(i+1) w-1}\right]\\
&\widetilde{K}_i=\left[K_{i w}, K_{i w+r}, K_{i w+2 r}, \ldots, K_{(i+1) w-1}\right]\\
&\widetilde{V}_i=\left[V_{i w}, V_{i w+r}, V_{i w+2 r}, \ldots, V_{(i+1) w-1}\right]
\end{aligned}
\end{equation}
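To sanity-check the segment equations above, here is a toy numpy sketch (my own, not the LONGNET code) of the split into length-$w$ segments and the stride-$r$ row selection; it assumes the sequence length is a multiple of $w$.

```python
import numpy as np

def dilate(x, w, r):
    """x: (seq_len, d). Split into length-w segments, keep every r-th row in each."""
    seq_len, _ = x.shape
    segments = [x[i:i + w] for i in range(0, seq_len, w)]   # segment i covers rows iw ... (i+1)w - 1
    return [seg[::r] for seg in segments]                   # rows iw, iw + r, iw + 2r, ...

seq_len, d, w, r = 16, 4, 8, 2
Q, K, V = (np.random.randn(seq_len, d) for _ in range(3))
Q_t, K_t, V_t = dilate(Q, w, r), dilate(K, w, r), dilate(V, w, r)
# Attention runs inside each sparsified segment in parallel; the per-segment
# outputs are then concatenated (scattered back) to form the final output.
print(len(Q_t), Q_t[0].shape)                               # 2 (4, 4)
```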
### Thinking
- If the input is longer than the maximum context size, it cannot be processed as-is and has to be truncated.
- GPT-3 "davinci": supports at most 2,049 tokens.
- GPT-3 "curie": supports at most 2,049 tokens.
- GPT-3 "babbage": supports at most 2,049 tokens.
- Method:
  - Scale the Transformer, or use Transformer alternatives?
  - Find another positional encoding.
  - Summarize the previous inputs as the next input.

### Feedback
- Explore HMMs for generation.

------------------

## 9/13 Constrained text generation

### Problem of constrained text generation
* Sampling from the conditional distribution is intractable.
![Generating Language with Tractable Constraints](https://hackmd.io/_uploads/Syg0vV2Rh.png)
* We can explore the use of [GeLaTo](https://openreview.net/attachment?id=ET6qkbzeOx&name=pdf) (Generating Language with Tractable Constraints) for other natural language generation tasks, such as dialogue generation or machine translation.

### [Why is constrained neural language generation particularly challenging?](https://arxiv.org/pdf/2206.05395.pdf)
- Lack of model expressiveness
  - Current models are not expressive enough to incorporate arbitrary constraints.
- Lack of suitable evaluation metrics
- Difficulty in constrained optimization
  - Constraints are usually non-differentiable, especially at the token level.
- Lack of constrained text generation datasets
  - [CommonGen](https://arxiv.org/pdf/1911.03705.pdf)

### Task
* Text Style Transfer

### Thinking
* Limiting the generation scope:
  * In an **HMM**, the generation scope can be controlled by restricting state transitions and output probabilities.
  * This can prevent the generation of meaningless or nonsensical text.
* Data sparsity problem
* Computational complexity problem
* Model selection problem
* Multi-constraints
* Parameter efficiency
* Few-shot and zero-shot constrained generation

### Feedback
* Search more and deeper.

------------------

## 10/4 Survey of constrained text generation
![](https://hackmd.io/_uploads/r1j3k8qea.png)

## Challenge
1. Diversity and Quality:
   - Ensuring that generated text remains diverse and of high quality while adhering to constraints.
   - Strict constraints may limit the diversity of generated outputs, and maintaining high quality becomes challenging when constraints are complex or conflicting.
2. Incorporating Multiple Constraints:
   - Effectively handling multiple and possibly conflicting constraints.
   - Combining constraints in a way that produces coherent and meaningful text is challenging, especially when constraints have varying degrees of importance.

## Approach
- Tractable probabilistic models (TPMs)
  - HMMs
    - [GeLaTo](https://openreview.net/attachment?id=ET6qkbzeOx&name=pdf)
  - Probabilistic Circuits
    - [Probabilistic Generating Circuits](https://arxiv.org/pdf/2102.09768.pdf)
    - [Probabilistic Circuits: A Unifying Framework for Tractable Probabilistic Models](http://starai.cs.ucla.edu/papers/ProbCirc20.pdf)
    - [Scaling Up Probabilistic Circuits by Latent Variable Distillation](https://arxiv.org/pdf/2210.04398.pdf)
    - [Einsum Networks: Fast and Scalable Learning of Tractable Probabilistic Circuits](https://arxiv.org/pdf/2004.06231.pdf)
    - [Lossless Compression with Probabilistic Circuits](https://arxiv.org/pdf/2111.11632.pdf)
- Others
  - [COLD Decoding: Energy-based Constrained Text Generation with Langevin Dynamics](https://openreview.net/pdf?id=TiZYrQ-mPup)
  - [COLLIE: Systematic Construction of Constrained Text Generation Tasks](https://arxiv.org/pdf/2307.08689.pdf)

## Thinking
- Define constraints in a probabilistic circuit.
  - Specify the constraints you want to impose on the generated text using a probabilistic circuit.
- Optimization with a probabilistic circuit.
  - Learn the parameters of both the text generation model and the probabilistic circuit in a way that satisfies the defined constraints.
- Datasets
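As a concrete (and heavily simplified) picture of the first two points above — defining a constraint in a tractable model and using it during generation — here is a toy sketch in which a stand-in function plays the role of the tractable query $\Pr(\alpha \mid \text{prefix})$. The vocabulary, the probabilities, and the `keyword_ok` stand-in are all made up for illustration; this is not the GeLaTo implementation.

```python
import numpy as np

def constrained_step(p_lm, prefix, vocab, p_constraint):
    """p_lm: (V,) next-token probabilities from the LM.
    p_constraint(prefix + [tok]): probability that the constraint alpha can still be
    satisfied given the extended prefix -- assumed to be an exact query on a TPM."""
    weights = np.array([p_constraint(prefix + [tok]) for tok in vocab])
    p = p_lm * weights            # p(x_t | x_<t, alpha) ∝ p_LM(x_t | x_<t) * Pr(alpha | x_<=t)
    return p / p.sum()

# Hypothetical constraint: the keyword "cat" must appear somewhere in the output.
vocab = ["the", "cat", "sat", "down"]
p_lm = np.array([0.5, 0.1, 0.2, 0.2])

def keyword_ok(prefix):           # stand-in for the tractable Pr(alpha | prefix) query
    return 1.0 if "cat" in prefix else 0.4

print(constrained_step(p_lm, ["the"], vocab, keyword_ok))   # probability mass shifts toward "cat"
```

The point of a TPM such as an HMM or a PC is that the stand-in query can be computed exactly and efficiently instead of being approximated.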
## Feedback
- Go ahead.
- Catch the key words of the papers.
- Show the relations between different papers (in a table).
- Challenge.

## 10/24 Survey

### Probabilistic Circuits

| Name | Main difference |
|:-------- |:--------:|
| [Probabilistic Circuits: A Unifying Framework for Tractable Probabilistic Models](http://starai.cs.ucla.edu/papers/ProbCirc20.pdf) | Probabilistic inference, the many faces of probabilistic circuits, and so on. |
| [Probabilistic circuits: Representations, inference, learning and applications](https://www.youtube.com/watch?v=2RAG5-L9R70) | Why tractable inference? PCs, learning circuits, and advanced representations. |
| [Tractable Regularization of Probabilistic Circuits](https://openreview.net/pdf?id=W9oywyjO8VN) | They **combine** advantages of probabilistic graphical models (PGMs) with those of neural networks (NNs). |
| [Probabilistic Generating Circuits](https://arxiv.org/pdf/2102.09768.pdf) | Leaf nodes, which are $z_i$ or constants. |
| [Sparse Probabilistic Circuits via Pruning and Growing](https://openreview.net/pdf?id=KieCChVB6mN) | Combining pruning and growing operations to exploit the sparsity of PC structures. |
| [Generating Language with Tractable Constraints (GeLaTo)](https://openreview.net/attachment?id=ET6qkbzeOx&name=pdf) | Uses distilled **hidden Markov models**, with which we can efficiently compute $Pr(\text{text} \mid \alpha)$. |
| [Scaling Up Probabilistic Circuits by Latent Variable Distillation](https://arxiv.org/pdf/2210.04398.pdf) | To **address** the phenomenon that, as the number of parameters in PCs increases, their performance immediately **plateaus**. |

-----------------------

### Definition
- Probabilistic circuits (PCs):
  - A probabilistic circuit (PC) $\mathcal{C}$ over RVs $\mathbf{X}$ is a pair $(\mathcal{G}, \boldsymbol{\theta})$, where $\mathcal{G}$ is a **computational graph**, also called the circuit structure, that is parameterized by $\boldsymbol{\theta}$.
  - The PC $\mathcal{C}$ computes a function that characterizes a distribution $p(\mathbf{X})$.
![](https://hackmd.io/_uploads/Hy6YneBM6.png)
- Tractable probabilistic inference:
  - A class of queries $\mathbf{Q}$ is tractable on a family of probabilistic models $\mathcal{M}$ iff any query $q \in \mathbf{Q}$ on a model $m \in \mathcal{M}$ can be computed in time $\mathcal{O}(\operatorname{poly}(|m|))$.

### Motivation
- The first purpose is to unify the disparate formalisms proposed so far in the literature for tractable models.
- The second purpose of the PC framework is to enable reasoning over the tractable bands of a model class in terms of some well-defined structural properties only.

### Challenge
- Scaling up such models is a key challenge.
- Learn tractable models on millions of datapoints and thousands of features in tractable time.

### Feedback
- Find some tasks to solve.

## 11/15

### Surveys
[Parallel Refinements for Lexically Constrained Text Generation with BART](https://arxiv.org/pdf/2109.12487.pdf)
- CBART leverages the pre-trained model BART and transfers part of the generation burden from the decoder to the encoder.
![image.png](https://hackmd.io/_uploads/r1ReaGqQp.png =80%x)
- Guided by the encoder, the decoder refines multiple tokens of the input in one step by inserting tokens before specific positions and re-predicting tokens with low confidence.
- To further reduce the inference latency, the decoder predicts all tokens in **parallel**.

[POINTER: Constrained Progressive Text Generation via Insertion-based Generative Pre-training](http://aclanthology.lst.uni-saarland.de/2020.emnlp-main.698.pdf)
- The proposed method operates by progressively inserting new tokens between existing tokens in a **parallel** manner.
- POINTER allows long-term control over generation due to the top-down progressive structure.
![](https://hackmd.io/_uploads/SkFeBYDX6.png)
![image](https://hackmd.io/_uploads/rJQxRoxET.png)

### Advantage
- Customization and control
- Task-specific requirements

### Challenge
- Diversity
- Multi-constraint

:::info
:bulb: **Idea**
1. Combine probabilistic circuits (PCs) with constrained text generation (CTG).
2. Leveraging PCs might introduce challenges related to model training and computational complexity.
:::
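The idea above hinges on the tractable-inference property from the 10/24 definition. To keep that definition concrete, here is a minimal hand-built sum-product circuit over two binary variables (my own toy example, not code from any of the cited papers); marginalizing a variable only requires setting its leaves to 1 during a single bottom-up pass.

```python
def leaf(p_true, value):
    """Bernoulli leaf over one binary variable; value in {0, 1, None}, None = marginalized out."""
    if value is None:
        return 1.0                              # the leaf summed over both of its states
    return p_true if value == 1 else 1.0 - p_true

def pc(x1, x2):
    """Sum node over two product nodes; the mixture weights and leaf parameters are theta."""
    prod_a = leaf(0.9, x1) * leaf(0.2, x2)      # product node with scope {X1, X2}
    prod_b = leaf(0.1, x1) * leaf(0.8, x2)
    return 0.6 * prod_a + 0.4 * prod_b          # sum node (mixture) on top

print(pc(1, 0))       # joint probability p(X1=1, X2=0) = 0.44
print(pc(1, None))    # marginal p(X1=1) = 0.58, from the same single bottom-up pass
```

For such marginals to be exact, the circuit has to be smooth and decomposable, which this tiny example is by construction.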
## 12/6

### Title
Domain adaptation to control the user's input

### The problem to handle
:::info
**Direction**
Different prompts but the same output
- Understand whether the user wants to ask about a theoretical method?
  - prompt
  - controllable
  - constraint
  - parameter-efficient

What I want to do: handle different domains within the same task
- domain adaptation

Novelty? Previous weakness? How do I improve?
:::

-----------

## 12/26 Enhancing NLG Consistency

### Title
"Enhancing NLG Consistency Across Diverse Inputs Using Data Augmentation and Keyword-Driven Prompts"
"CID: **C**onsistent NLG with **I**nput **D**iversity using Data Augmentation and Keyword-Driven Prompts"

### Problem definition
![image](https://hackmd.io/_uploads/HJobChIP6.png =80%x)

Data Augmentation
![image](https://hackmd.io/_uploads/ByOW9yODa.png =80%x)

Inference example
- Input: "I'm currently immerse in deep research of nature language generation task."
- Answer: "If you have any specific questions or if there's a particular aspect of your research you'd like to discuss, feel free to share. I'm here to assist you in your endeavors related to natural language generation."
- Input: "I concentrating to address the various challenges brings by natural language generation."
- **The output should stay consistent even when the input varies.**

#### Why this task is an issue
**Real-world Application Scenarios:**
- NLG systems often encounter diverse inputs from different users or contexts.
- Effectively handling this diversity and generating consistent outputs can better meet user requirements, enhancing the practicality of the system.

**Robustness and Generalization:**
- Considering the diversity of inputs in the real world, making NLG models more robust and capable of generalization is crucial.
- Introducing diverse inputs during training and emphasizing consistency can assist the model in adapting better to a variety of situations.

**Reduced Bias:**
- Denoising can help reduce biases present in the input, promoting fairness and equity in the generated content.

### Previous tasks
[Semantic Accuracy in Natural Language Generation: A Thesis Proposal](https://aclanthology.org/2023.acl-srw.48.pdf)
- They propose a unified benchmark for NLG metrics focusing on semantic accuracy.

Prompt?
[AUTOPROMPT: Eliciting Knowledge from Language Models with Automatically Generated Prompts](https://arxiv.org/pdf/2010.15980.pdf)
![image](https://hackmd.io/_uploads/HybXCoUw6.png)

[Towards a Better Understanding of Noise in Natural Language Processing](https://aclanthology.org/2021.ranlp-1.7.pdf)
![](https://hackmd.io/_uploads/Hkbp3GAsh.png)

Self-supervised learning
- SimCLR

Disentangled representation learning for text and emotion or keywords?
- This aims to capture the different dimensions of variation of a text in separate vector embeddings.

### Idea
Disentanglement-based models offer two main advantages:
1. Sampling from the latent space of the style embeddings allows for more diverse and controlled stylistic generation.
2. Similarity of documents can now be calculated for each aspect of variation, allowing for finer-grained retrieval.

Objective
$$
p(y \mid x_1) = p(y \mid x_2)
$$

Problem
$$
\prod_{t=1}^{T} p(y_t \mid y_{<t}, x, c)
$$
where $c$ can be the keyword condition.

### Challenge
- Not enough datasets:
  - Use an autoencoder to generate similar sentences.
- How to extract the keywords.
- How to know the inputs are the same.
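One way to operationalize the last challenge above ("how to know the inputs are the same") is embedding-based paraphrase detection. Below is a minimal sketch, assuming the `sentence-transformers` package; the model name and the 0.8 threshold are illustrative choices, not tuned values.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")   # illustrative model choice

def same_intent(x1, x2, threshold=0.8):
    """Treat two inputs as 'the same' if their embeddings are close enough."""
    e1, e2 = model.encode([x1, x2], convert_to_tensor=True)
    return util.cos_sim(e1, e2).item() >= threshold

x1 = "I'm currently immersed in deep research on natural language generation."
x2 = "I am concentrating on addressing the various challenges brought by natural language generation."
print(same_intent(x1, x2))
```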
:::danger
**Feedback**:
- Title
- Novelty
- Method - can't just combine prompting and extraction
- Previous work
- Fix the equation
:::

-----------------------

## 1/10 Survey previous works

### Title: Enhancing Consistency in Output Despite Poor Input toward ... Approach

### Coherence, semantics, and paraphrasing
"coherent response generation"
- Learning to Copy Coherent Knowledge for Response Generation
- [Towards Diverse, Relevant and Coherent Open-Domain Dialogue Generation via Hybrid Latent Variables](https://ojs.aaai.org/index.php/AAAI/article/view/26594)
![image](https://hackmd.io/_uploads/BJvipIBua.png =80%x)

"semantic similarity in NLG," "paraphrasing consistency"
- [Unsupervised Paraphrasing Consistency Training for Low Resource Named Entity Recognition](https://aclanthology.org/2021.emnlp-main.430.pdf)
  - They convert the Conditional Random Field (CRF) into a multi-label classification module and encourage consistency of entity appearance between the original and paraphrased sequences.

### Others' previous tasks
- Story generation
  - Stories using an abstract as the outline
  - [Consistency and Coherency Enhanced Story Generation](https://arxiv.org/pdf/2010.08822.pdf)
  ![image](https://hackmd.io/_uploads/SyFaG4-_T.png)
- Summarization

### Idea
I want to train a model to generate coherent responses based on input sentences with similar meanings but expressed differently.

Objective:
$$
\operatorname{sim}(f(x_1), f(x_2))
$$
$$
L(x_1, x_2) = \max\left(0,\ m - \operatorname{sim}(f(x_1), f(x_2)) + \operatorname{sim}(f(x_1'), f(x_2))\right)
$$
where $x_1'$ is a negative (non-paraphrase) input.
$$
L(x_1, x_2) + \alpha \cdot C(x_1, y_2) + \beta \cdot C(x_2, y_1)
$$
$C$ is a consistency metric. (A minimal PyTorch sketch of this objective appears at the end of this entry.)

Because of the lack of gold answers for this task:
- Contrastive Learning
- Self-Supervised Learning

### Todo
- Key information extraction
- Context-aware processing
- Consistency modeling
- Try to use the datasets from BERTScore.
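A minimal PyTorch rendering of the contrastive consistency objective sketched in the Idea section above, with the encoder $f$ left abstract; the margin, batch size, and embedding size are placeholders.

```python
import torch
import torch.nn.functional as F

def consistency_loss(f_x1, f_x2, f_x1_neg, margin=0.5):
    """f_x1, f_x2: encodings of two paraphrased inputs; f_x1_neg: encoding of an unrelated input."""
    pos = F.cosine_similarity(f_x1, f_x2, dim=-1)          # pull paraphrases together
    neg = F.cosine_similarity(f_x1_neg, f_x2, dim=-1)      # push the negative away
    return torch.clamp(margin - pos + neg, min=0).mean()

# Toy usage with random vectors standing in for f(x); in practice f is a text encoder.
f_x1, f_x2, f_x1_neg = (torch.randn(4, 128) for _ in range(3))
print(consistency_loss(f_x1, f_x2, f_x1_neg))
```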