tags: PROGRESS

Challenges and Applications of Large Language Models

8/1

Task

  • To resolve Fine-Tuning Overhead
    • The additional computational and memory resources required to adapt a pre-trained Large Language Model to perform well on a specific downstream task.
  • Limited Context Length
    • The challenge of processing long inputs in natural language processing (NLP) tasks.

Datasets

Fine-Tuning Overhead

  • Fine-tuning an LLM for a specific downstream task.
    (Challenges and Applications of Large Language Models)

  • Adapter (Towards a Unified View of Parameter-Efficient Transfer Learning)

  • Recent work:

    • Liu et al. [331] introduce $(IA)^3$.
    • Malladi et al. [355] propose a memory-efficient zeroth-order (MeZO) optimizer.
    • Hu et al. [218] propose LoRA (Low-Rank Adaptation; a minimal sketch follows this list).
    • Dettmers et al. [118] extend LoRA to quantized LLMs.
  • Recent issue:

    • Despite substantial improvements in the memory complexity of fine-tuning LLMs for specific tasks, a remaining challenge is time complexity.
    • Parameter-efficient fine-tuning of LLMs still requires computing full forward/backward passes through the whole network.
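
A minimal LoRA-style sketch (PyTorch; the layer size, rank, and scaling are illustrative assumptions, not the exact setup of Hu et al.): the pre-trained linear layer is frozen and only a low-rank update is trained, which is where the memory savings come from.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: W x + scale * (B A) x."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # freeze the pre-trained weights
            p.requires_grad = False
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r                # common scaling convention

    def forward(self, x):
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scale

# usage: wrap a projection layer and fine-tune only the low-rank factors
layer = LoRALinear(nn.Linear(768, 768), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)   # 2 * 8 * 768 parameters instead of 768 * 768
```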

Limited Context Length

  • Having an architecture that can infer long inputs does not guarantee that the LLM will perform as well on those as on shorter inputs.
  • Limited context lengths are a barrier to handling long inputs well, which is needed for applications such as novel or textbook writing and summarization.

1. Efficient Attention Mechanisms

  • Luna, a linear unified nested attention mechanism that approximates softmax attention with two nested linear attention functions.
  • The dot-product attention requires substantially less memory and compute resources.
  • Transient Global, an extension of local attention in which each token can attend to nearby tokens and a set of global tokens.
  • The fundamental building block is the self-attention mechanism.
  • The longer the input, the more important the positional embedding.

2. Positional Embeddings

  • Absolute Positional Embeddings
    • sinusoidal embeddings
  • Relative Positional Embeddings
    • All unseen absolute positions will be converted to previously observed relative offsets between positions, enabling better generalization to long input sequences at inference time.
  • Rotary Position Embeddings (RoPE)
    • Incorporate absolute positional information in a rotation matrix and model the relative positional offset through a rotation.
      $\mathrm{softmax}\!\left(\tfrac{1}{\sqrt{d}}\, x_i W_q\, R^d_{\Theta,(i-j)}\, W_k^\top x_j^\top\right) \quad \forall\, i, j$
  • Relative Positional Bias
    $\mathrm{softmax}\!\left(\tfrac{1}{\sqrt{d}}\, x_i W_q W_k^\top x_j^\top + b_{i-j}\right) \quad \forall\, i, j$
  • ALiBi (Attention with Linear Biases)
    $\mathrm{softmax}\!\left(\tfrac{1}{\sqrt{d}}\, x_i W_q W_k^\top x_j^\top + m \times (i-j)\right) \quad \forall\, i, j$
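
A minimal sketch of ALiBi-style biases (head count, sequence length, and the power-of-two slope formula are assumptions for illustration): each head adds a linear penalty proportional to the query-key distance to the attention logits.

```python
import torch

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    """Linear distance penalties added to attention logits (one slope per head)."""
    # Geometric head-specific slopes, assuming n_heads is a power of two.
    slopes = torch.tensor([2 ** (-8 * (h + 1) / n_heads) for h in range(n_heads)])
    # distance[i, j] = i - j  (how far key j lies behind query i)
    pos = torch.arange(seq_len)
    distance = pos[:, None] - pos[None, :]
    # Penalize distant keys; positions with j > i are removed by the causal mask anyway.
    bias = -slopes[:, None, None] * distance.clamp(min=0)
    return bias  # shape: (n_heads, seq_len, seq_len), added to QK^T / sqrt(d)

scores = torch.randn(8, 128, 128) + alibi_bias(8, 128)
```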

3. Transformer Alternatives

  • One line of work tries to replace the attention mechanism using state space models (SSMs).
  • H3, which uses a shift matrix to recall previous tokens and multiplicative interactions for token comparisons.
  • Hyena operator, a convolution-based sub-quadratic attention, applies an element-wise gating operation based on the operator’s input to mimic the attention contextualization.
  • Block-State Transformer, which builds upon a hybrid layer that combines an SSM for long-range contextualization and a Transformer for short-range interactions between tokens.
  • Receptance Weighted Key Value (RWKV), which combines the parallelization benefits of Transformer-based LLMs during training with the fast inference and low compute requirements of RNNs.

8/23

Issue

  • Limited context length

Tasks

  • Positional Encoding
  • Scaling Transformers

Survey

Rethinking Positional Encoding in Language Pre-training

  • Transformer with Untied Positional Encoding (TUPE) computes the word contextual correlation and positional correlation separately with different parameterizations and then adds them together.

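A minimal sketch of the untied scoring idea (toy shapes; the $1/\sqrt{2d}$ scaling follows my reading of the TUPE formulation): content-content and position-position correlations are computed with separate projections and then summed.

```python
import torch

def tupe_scores(x, p, Wq, Wk, Uq, Uk):
    """Untied attention logits: the word contextual correlation and the
    positional correlation use separate projections and are added together."""
    d = Wq.shape[1]
    content = (x @ Wq) @ (x @ Wk).transpose(-1, -2)    # word contextual correlation
    position = (p @ Uq) @ (p @ Uk).transpose(-1, -2)   # positional correlation
    return (content + position) / (2 * d) ** 0.5

# toy shapes: sequence of 16 tokens, model width 32
x = torch.randn(16, 32)            # token embeddings (positions not added in)
p = torch.randn(16, 32)            # absolute positional embeddings
Wq, Wk, Uq, Uk = (torch.randn(32, 32) for _ in range(4))
print(tupe_scores(x, p, Wq, Wk, Uq, Uk).shape)   # (16, 16)
```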

LONGNET: Scaling Transformers to 1,000,000,000 Tokens

  • It has linear computational complexity and a logarithmic dependency between any two tokens in a sequence.
  • It can serve as a distributed trainer for extremely long sequences.
  • Its dilated attention is a drop-in replacement for standard attention.

  • It shows how dilated attention splits the input (Q, K, V) into segments and sparsifies each segment along the sequence dimension by selecting rows with an interval r.
  • The sparsified segments are then processed in parallel, and the outputs are concatenated to form the final output.

$\tilde{Q}_i = [Q_{iw},\, Q_{iw+r},\, Q_{iw+2r},\, \ldots,\, Q_{(i+1)w-1}]$
$\tilde{K}_i = [K_{iw},\, K_{iw+r},\, K_{iw+2r},\, \ldots,\, K_{(i+1)w-1}]$
$\tilde{V}_i = [V_{iw},\, V_{iw+r},\, V_{iw+2r},\, \ldots,\, V_{(i+1)w-1}]$
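
A minimal sketch of the segment-and-sparsify step above (the segment length w, dilation interval r, and the divisibility assumption are mine for illustration):

```python
import torch

def dilate_segments(x: torch.Tensor, w: int, r: int) -> torch.Tensor:
    """Split a (seq_len, d) tensor into segments of length w and keep every
    r-th row inside each segment, i.e. rows iw, iw+r, iw+2r, ..."""
    seq_len, d = x.shape
    segments = x.view(seq_len // w, w, d)   # assumes seq_len is divisible by w
    return segments[:, ::r, :]

Q = torch.randn(1024, 64)
Q_tilde = dilate_segments(Q, w=128, r=4)    # (8 segments, 32 rows each, 64)
# Attention is then computed per sparsified segment in parallel, and the
# outputs are concatenated / scattered back to form the final output.
```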

Thinking

  • If the input exceeds the model's maximum context length, it cannot be processed in full (it must be truncated or rejected).
    • GPT-3 "davinci": supports at most 2048 tokens.
    • GPT-3 "curie": supports at most 1024 tokens.
    • GPT-3 "babbage": supports at most 4096 tokens.
  • Method:
    • Scale the Transformer, or use Transformer alternatives?
    • Find another positional encoding.
    • Summarize the previous inputs as the next input (a rough sketch follows this list).
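
A rough sketch of the "summarize the previous inputs as the next input" idea (the `summarize` callable is a placeholder for any LLM call that fits within the context window):

```python
from typing import Callable, List

def rolling_summarize(chunks: List[str], summarize: Callable[[str], str]) -> str:
    """Process a long document chunk by chunk, carrying a running summary
    forward as part of the next chunk's input."""
    summary = ""
    for chunk in chunks:
        prompt = f"Summary so far:\n{summary}\n\nNew text:\n{chunk}\n\nUpdated summary:"
        summary = summarize(prompt)   # any model call that fits the context window
    return summary

# usage with a trivial stand-in summarizer
print(rolling_summarize(["part one ...", "part two ..."], lambda p: p[-80:]))
```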

Feedback

  • Look into HMMs for generation.

9/13 Constrained text generation

Problem of Constrained text generation

  • Sampling from the conditional distribution is intractable.
  • We can explore the use of GeLaTo (Generating Language with Tractable Constraints) for other natural language generation tasks, such as dialogue generation or machine translation.

Why is constrained neural language generation particularly challenging?

  • Lack of model expressiveness
    • Current models are not expressive enough to incorporate arbitrary constraints.
  • Lack of suitable evaluation metrics
  • Difficulty in constrained optimization
    • Constraints are usually non-differentiable, especially at the token level.
  • Lack of constrained text generation datasets

Task

  • Text Style Transfer

Thinking

  • Limiting Generation Scope:
    • In an HMM, the generation scope can be controlled by restricting state transitions and emission probabilities (see the sketch after this list).
      • It can prevent the generation of meaningless or nonsensical text.
      • Data sparsity problem
      • Computational complexity problem
      • The model selection problem
  • Multi-constraints
    • Parameter-efficiency
  • Few-Shot and Zero-Shot Constrained Generation
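
A toy sketch of the "limiting the generation scope" idea above (the transition and emission matrices are made up): zeroed entries in the HMM's transition and emission matrices make certain continuations impossible, which constrains what can be generated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy HMM: 3 hidden states, 4 output symbols.
A = np.array([[0.6, 0.4, 0.0],       # transition matrix; zero entries forbid
              [0.0, 0.5, 0.5],       # transitions and restrict the reachable
              [0.3, 0.0, 0.7]])      # generation scope
B = np.array([[0.7, 0.3, 0.0, 0.0],  # emission matrix; zeros forbid symbols
              [0.0, 0.6, 0.4, 0.0],
              [0.0, 0.0, 0.2, 0.8]])

def sample(length: int, start_state: int = 0) -> list:
    """Generate a symbol sequence; masked (zero-probability) transitions and
    emissions can never be produced."""
    state, out = start_state, []
    for _ in range(length):
        out.append(int(rng.choice(B.shape[1], p=B[state])))
        state = int(rng.choice(A.shape[0], p=A[state]))
    return out

print(sample(10))
```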

Feedback

  • Search more broadly and more deeply.

10/4 Survey of constrained text generation

Challenge

  1. Diversity and Quality:

    • Ensuring that generated text remains diverse and of high quality while adhering to constraints.
    • Strict constraints may limit the diversity of generated outputs, and maintaining high quality becomes challenging when constraints are complex or conflicting.
  2. Incorporating Multiple Constraints:

    • Effectively handling multiple and possibly conflicting constraints.
    • Combining constraints in a way that produces coherent and meaningful text is challenging, especially when constraints may have varying degrees of importance.

Approach

Thinking

  • Define constraints in a probabilistic circuit.
    • Specify the constraints you want to impose on the generated text using a probabilistic circuit.
  • Optimization with a probabilistic circuit
    • Learn the parameters of both the text generation model and the probabilistic circuit in a way that satisfies the defined constraints.
  • Datasets

Feedback

Go ahead.
Catch the key words of the papers.
Show the relations between different papers (in a table).
Identify the challenge.

10/24 Survey

Probabilistic Circuits

| Name | Main difference |
| --- | --- |
| Probabilistic Circuits: A Unifying Framework for Tractable Probabilistic Models | Probabilistic inference, the many faces of probabilistic circuits, and so on |
| Probabilistic circuits: Representations, inference, learning and applications | Why tractable inference? PCs, learning circuits, and advanced representations |
| Tractable Regularization of Probabilistic Circuits | Combines the advantages of probabilistic graphical models (PGMs) with those of neural networks (NNs) |
| Probabilistic Generating Circuits | Leaf nodes, which are $z_i$ or constants |
| Sparse Probabilistic Circuits via Pruning and Growing | Combines pruning and growing operations to exploit the sparsity of PC structures |
| Generating Language with Tractable Constraints (GeLaTo) | Uses distilled hidden Markov models, with which $\Pr(\text{text} \mid \alpha)$ can be computed efficiently |
| Scaling Up Probabilistic Circuits by Latent Variable Distillation | Addresses the issue that when the number of parameters in PCs increases, their performance immediately plateaus |

Definition

  • Probabilistic circuits (PCs):
    • A probabilistic circuit (PC) $C$ over RVs $\mathbf{X}$ is a pair $(G, \theta)$, where $G$ is a computational graph, also called the circuit structure, that is parameterized by $\theta$.
    • The PC $C$ computes a function that characterizes a distribution $p(\mathbf{X})$.

  • Tractable probabilistic inference:
    • A class of queries $Q$ is tractable on a family of probabilistic models $\mathcal{M}$ iff any query $q \in Q$ on a model $m \in \mathcal{M}$ can be computed in time $O(\mathrm{poly}(|m|))$.
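
A toy sketch matching the definition above (my own example, not taken from the papers): a PC over two binary RVs whose computational graph is two product nodes feeding one sum node; evaluating it bottom-up gives $p(x_1, x_2)$ in time linear in the circuit size, which is what makes inference tractable.

```python
# Leaf distributions: Bernoulli probability mass functions.
def bernoulli_leaf(p):
    return lambda x: p if x == 1 else 1 - p

# Two mixture components, each a product of independent leaves.
comp1 = (bernoulli_leaf(0.9), bernoulli_leaf(0.2))
comp2 = (bernoulli_leaf(0.1), bernoulli_leaf(0.7))
weights = (0.4, 0.6)   # sum-node parameters (theta); they sum to 1

def pc_density(x1, x2):
    prod1 = comp1[0](x1) * comp1[1](x2)              # product node 1
    prod2 = comp2[0](x1) * comp2[1](x2)              # product node 2
    return weights[0] * prod1 + weights[1] * prod2   # sum node (root)

# Sanity check: the densities over all assignments sum to 1.
print(sum(pc_density(a, b) for a in (0, 1) for b in (0, 1)))
```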

Motivation

  • The first one is to unify the disparate formalisms proposed so far in the literature for tractable models.
  • The second purpose of the PC framework is to enable reasoning over the tractable bands of a model class in terms of some well-defined structural properties only.

Challenge

  • Scaling up such models is a key challenge.
  • Learning tractable models on millions of data points and thousands of features in tractable time.

Feedback

Find some tasks to solve.

11/15

Surveys

Parallel Refinements for Lexically Constrained Text Generation with BART

  • CBART leverages the pre-trained model BART and transfers part of the generation burden from the decoder to the encoder
  • Guided by the encoder, the decoder refines multiple tokens of the input in one step by inserting tokens before specific positions and re-predicting tokens with low confidence
  • To further reduce the inference latency, the decoder predicts all tokens in parallel

POINTER: Constrained Progressive Text Generation via Insertion-based Generative Pre-training

  • The proposed method operates by progressively inserting new tokens between existing tokens in a parallel manner
  • POINTER allows long-term control over generation due to the top-down progressive structure

Advantage

  • Customization and Control
  • Task-specific Requirements

Challenge

  • Diversity
  • Multi-constraint

Idea

  1. Combine probabilistic circuits (PCs) with constrained text generation (CTG)
  2. Leveraging PCs might introduce challenges related to model training and computational complexity

12/6

Title

Domain-adaptation to control the user's input

The problem to handle

Direction
Different prompts but the same output

  • Understand what the user wants to ask

Theoretical method?

  • prompt
  • controllable
  • constraint
  • parameter-efficient

What I want to do: handle different domains within the same task

  • domain-adaptation

Novelty?
Previous weakness?
How do I improve?


12/26 Enhancing NLG Consistency

Title

"Enhancing NLG Consistency Across Diverse Inputs Using Data Augmentation and Keyword-Driven Prompts"

"CID: Consistent NLG with Input Diversity using Data Augmentation and Keyword-Driven Prompts"

Problem definition

Data Augmentation

Inference Example
Input: I'm currently immerse in deep research of nature language generation task.

ANS: If you have any specific questions or if there's a particular aspect of your research you'd like to discuss, feel free to share. I'm here to assist you in your endeavors related to natural language generation.

Input: I concentrating to address the various challenges brings by natural language generation.

The output should be consistent even when the input varies.

Why this task is an issue

Real-world Application Scenarios:

  • NLG systems often encounter diverse inputs from different users or contexts.
  • Effectively handling this diversity and generating consistent outputs can better meet user requirements, enhancing the practicality of the system.

Robustness and Generalization:

  • Considering the diversity of inputs in the real world, making NLG models more robust and capable of generalization is crucial.
  • Introducing diverse inputs during training and emphasizing consistency can assist the model in adapting better to a variety of situations.

Reduced Bias:

  • Denoising can help reduce biases present in the input, promoting fairness and equity in the generated content.

Previous tasks

Semantic Accuracy in Natural Language Generation: A Thesis Proposal

  • They proposed a unified benchmark for NLG metrics focusing on semantic accuracy

Prompt?
AUTOPROMPT: Eliciting Knowledge from Language Models with Automatically Generated Prompts

Towards a Better Understanding of Noise in Natural Language Processing

Self-supervised-learning

  • SimCLR

Disentangled Representation Learning for texts and emotions or keywords?

  • This aims to capture the different dimensions of variation of a text in separate vector embeddings.

Idea

Disentanglement-based models offer two main advantages:

  1. Sampling from the latent space of the style embeddings allows for more diverse and controlled stylistic generation.
  2. Similarity of documents can now be calculated for each aspect of variation, allowing for finer-grained retrieval.

Objective

$p(y \mid x_1) = p(y \mid x_2)$
Problem
$\prod_{t=0}^{T} p(y_t \mid y_{<t}, x, c)$

$c$ can be the keyword condition.
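
A minimal sketch of one way to realize the factorization above with an off-the-shelf causal LM (the model choice, prompt template, and keyword format are my assumptions): the keyword condition $c$ is injected into the prompt so that each token is generated given $(y_{<t}, x, c)$.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

x = "I'm currently immersed in research on natural language generation."
c = "keywords: research, natural language generation"   # the condition c

prompt = f"{c}\nInput: {x}\nResponse:"
inputs = tok(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=40, do_sample=False,
                     pad_token_id=tok.eos_token_id)
# Decode only the newly generated tokens y.
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```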

Challenge

Not enough datasets:

  • Use an autoencoder to generate similar sentences.

How to extract the keywords?

How to know the inputs are the same?

Feedback:
Title, novelty, method

  • Can't just combine prompting and extraction.

Previous work
Fix the equation


1/10 Survey previous works

Title: Enhancing Consistency in Output Despite Poor Input (toward an approach)

Coherence, semantics, and paraphrasing.

"coherent response generation,"
Learning to Copy Coherent Knowledge for Response Generation

Towards Diverse, Relevant and Coherent Open-Domain Dialogue Generation via Hybrid Latent Variables

"semantic similarity in NLG,"

"paraphrasing consistency."
Unsupervised Paraphrasing Consistency Training for Low Resource Named Entity Recognition

  • We convert Conditional Random Field (CRF) into a multi-label classification module and encourage consistency on the entity appearance between the original and paraphrased sequences.

Others' previous tasks

story generation

summarization

Idea

I want to train a model to generate coherent responses based on input sentences with similar meanings but expressed differently.

Objective:

$\mathrm{sim}(f(x_1), f(x_2))$

$L(x_1, x_2) = \max\!\big(0,\; m + \mathrm{Similarity}(f(x_1), f(x^-)) - \mathrm{Similarity}(f(x_1), f(x_2))\big)$, where $x^-$ is a negative sample.

$L(x_1, x_2) + \alpha\, C(x_1, y_2) + \beta\, C(x_2, y_1)$, where $C$ is a consistency metric.
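
A minimal PyTorch sketch of the margin objective above (the negative sample, the encoder outputs, and the margin value are assumptions for illustration): paraphrase embeddings are pulled together while a negative sample is pushed away by at least the margin.

```python
import torch
import torch.nn.functional as F

def margin_loss(f_x1, f_x2, f_neg, m=0.5):
    """Hinge loss: encourage sim(f(x1), f(x2)) to exceed sim(f(x1), f(x^-)) by m."""
    pos = F.cosine_similarity(f_x1, f_x2, dim=-1)
    neg = F.cosine_similarity(f_x1, f_neg, dim=-1)
    return torch.clamp(m + neg - pos, min=0).mean()

# toy encoder outputs for a batch of 4 sentence pairs plus negatives
f_x1, f_x2, f_neg = torch.randn(4, 256), torch.randn(4, 256), torch.randn(4, 256)
print(margin_loss(f_x1, f_x2, f_neg))
```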

Because of the lack of correct answers in this task:

  • Contrastive Learning
  • Self-Supervised Learning

Todo

Key Information Extraction
Context-Aware Processing
Consistency Modeling
Try to use the datasets from BERTScore.
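
A small sketch of measuring output consistency with the bert-score package (the sentences are made-up examples): the responses produced for two paraphrased inputs are compared directly.

```python
from bert_score import score

candidates = ["Happy to help with your NLG research questions."]
references = ["I'm glad to assist with your natural language generation research."]

P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"BERTScore F1: {F1.mean().item():.3f}")   # higher = more consistent outputs
```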