---
title: Text2Text Diffusion Background
tags: diffusion, research
---
###### tags: `diffusion`, `research`
# Look at this recent Oct 2025 YT review of text diffusion
https://youtu.be/bmr718eZYGU
# Text2Text Diffusion Background
[repo](https://github.com/JamesL404/T2T-Diffu-Summarization), [project hackmd](https://hackmd.io/@blxCI4wNTdaijtEueJy7vA/SyRoCykVi/edit)
[TOC]
## DiffuSeq: Sequence to Sequence Text Generation with Diffusion Models
Gong et al., 2022, [arXiv](https://arxiv.org/abs/2210.08933), [repo](https://github.com/Shark-NLP/DiffuSeq)
### Background
- Diffusion-LM (Li et al., 2022) conditions on simple control attributes and does not generalize to conditional language modeling where the condition is itself a sequence of words, so applying DLM to seq2seq tasks is difficult.
- Seq2Seq is essential to NLP: dialogue, paraphrasing, text style transfer, etc.
### DiffuSeq
**Forward process**
- Like DLM, uses an embedding function $\mathrm{EMB}(\mathbf{w})$.
- Embeddings of the condition $\mathbf{w}^x$ and target $\mathbf{w}^y$ are learned simultaneously via concatenation, $\mathrm{EMB}(\mathbf{w}^{x \oplus y}) = [\mathrm{EMB}(\mathbf{w}^x); \mathrm{EMB}(\mathbf{w}^y)]$, which gives $\mathbf{z}_0$.
- Only the $y$ part of $\mathbf{z}_t$ is noised (termed partially noising); see the sketch below.
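A minimal sketch of this partial-noising forward step, assuming a standard Gaussian schedule with cumulative noise level $\bar\alpha_t$; the function and tensor names are illustrative, not DiffuSeq's actual API.
```python
import torch

def partial_noising(emb_x, emb_y, alpha_bar_t):
    """Forward step q(z_t | z_0) that corrupts only the target part.

    emb_x: (len_x, d) clean embedding of the condition w^x
    emb_y: (len_y, d) clean embedding of the target w^y
    alpha_bar_t: cumulative noise level at step t (scalar in (0, 1))
    """
    z0 = torch.cat([emb_x, emb_y], dim=0)                      # z_0 = EMB(w^{x (+) y})
    noise = torch.randn_like(z0)
    z_t = (alpha_bar_t ** 0.5) * z0 + ((1 - alpha_bar_t) ** 0.5) * noise
    z_t[: emb_x.shape[0]] = emb_x                              # keep the x part clean
    return z_t

# e.g. a 16-token condition and a 32-token target with 128-d embeddings
z_t = partial_noising(torch.randn(16, 128), torch.randn(32, 128), alpha_bar_t=0.5)
```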
**Reverse process**
Similar to DLM, but different to GLIDE: conditioning comes from the clean $x$ part of $\mathbf{z}_t$ itself rather than from an external classifier or a separately conditioned denoiser; a rough sketch follows.
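A rough sketch of one reverse step under my reading of the method (not necessarily the paper's exact sampler): the denoiser sees the whole concatenated $\mathbf{z}_t$, but the $x$ slice is re-anchored to its clean embedding after every update.
```python
import torch

@torch.no_grad()
def reverse_step(model, z_t, emb_x, t, ab_prev):
    """One reverse (denoising) step; a sketch only.

    model(z_t, t) is assumed to predict the clean embeddings z_0
    (x0-parameterisation, as in Diffusion-LM); ab_prev is alpha-bar at t-1.
    """
    z0_hat = model(z_t, t)                                     # predicted z_0
    z_prev = (ab_prev ** 0.5) * z0_hat + ((1 - ab_prev) ** 0.5) * torch.randn_like(z_t)
    z_prev[: emb_x.shape[0]] = emb_x                           # re-anchor the condition
    return z_prev
```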
### Results

- Generates highly diverse sequences
### Comments
- Why only noise y?
- Training this way makes the model task-specific and does not leverage a general-purpose LLM.
## A Deep Reinforced Model for Abstractive Summarization
Paulus et al., [ICLR 2018](https://arxiv.org/abs/1705.04304)
### Background
Training with supervised learning (teacher forcing) alone leads to exposure bias: at test time the model must condition on its own previous outputs, which it never saw during training.
### Intra-Attention
- Intra-temporal attention over the encoder states penalizes re-attending to input positions already attended at earlier decoding steps, which reduces repetition
- Intra-decoder attention additionally attends over previously decoded tokens; see the sketch after this list
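A rough sketch of the intra-temporal normalization as I read it: raw attention scores over the input are divided by the accumulated (exponentiated) scores each position received at earlier decoding steps, so already-attended positions get down-weighted. Names are illustrative.
```python
import torch

def intra_temporal_attention(scores_t, past_exp_scores=None):
    """scores_t: (src_len,) raw attention scores e_{t,i} at decoder step t.
    past_exp_scores: (src_len,) running sum of exp(e_{j,i}) over steps j < t
    (None at the first step). Returns (attention weights, updated running sum)."""
    exp_t = torch.exp(scores_t)
    if past_exp_scores is None:
        temporal = exp_t                        # first step: no temporal penalty
        running = exp_t.clone()
    else:
        temporal = exp_t / past_exp_scores      # penalise already-attended positions
        running = past_exp_scores + exp_t
    attn = temporal / temporal.sum()            # normalise over the source tokens
    return attn, running
```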
### Generation
- At each timestep, a switch function (sketched below) decides whether to use:
    - a softmax layer over the output vocabulary, or
    - a pointer mechanism that copies rare or unseen words from the input sequence
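A minimal sketch of that switch in a pointer-generator style, assuming a learned sigmoid gate over the decoder state and attention context; all weights and names here are illustrative, not the paper's exact parameterisation.
```python
import torch
import torch.nn.functional as F

def token_distribution(dec_state, context, attn, src_ids, W_vocab, w_switch, vocab_size):
    """dec_state: (d,) decoder hidden state at step t
    context:     (d,) attention context over the input
    attn:        (src_len,) attention weights over the input tokens
    src_ids:     (src_len,) vocabulary ids of the input tokens (copy targets)
    Returns one distribution over the vocabulary mixing generation and copying."""
    feats = torch.cat([dec_state, context])
    p_vocab = F.softmax(W_vocab @ feats, dim=-1)            # softmax over words
    p_copy = torch.sigmoid(w_switch @ feats)                # switch probability
    copy_dist = torch.zeros(vocab_size).scatter_add(0, src_ids, attn)
    return p_copy * copy_dist + (1 - p_copy) * p_vocab

# e.g. d = 256, src_len = 40, vocab = 10_000
d, src_len, V = 256, 40, 10_000
dist = token_distribution(torch.randn(d), torch.randn(d),
                          F.softmax(torch.randn(src_len), dim=-1),
                          torch.randint(0, V, (src_len,)),
                          torch.randn(V, 2 * d), torch.randn(2 * d), V)
```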
### Hybrid Objective
**Max-likelihood** (teacher forcing) loss:
$$
L_{m l}=-\sum_{t=1}^{n^{\prime}} \log p\left(y_t^* \mid y_1^*, \ldots, y_{t-1}^*, x\right)
$$
**Self-critical policy gradient training**:
$$
L_{r l}=\left(r(\hat{y})-r\left(y^s\right)\right) \sum_{t=1}^{n^{\prime}} \log p\left(y_t^s \mid y_1^s, \ldots, y_{t-1}^s, x\right)
$$
where $y^s$ is produced by sampling from the model, $\hat{y}$ is obtained by maximizing the output probability distribution at each time step (greedy search), and $r(\cdot)$ is a reward function of the output sequence. Minimizing this loss increases the likelihood of sampled sequences that score better than the greedy baseline.
**Mixed objective**:
$$
L_{\text {mixed }}=\gamma L_{r l}+(1-\gamma) L_{m l}
$$
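The three losses side by side as a small sketch; `log_probs` and `rouge` are hypothetical helpers (per-token log-probabilities under the model and a reward such as ROUGE, respectively), and the default $\gamma$ value is arbitrary.
```python
def hybrid_loss(log_probs, rouge, y_star, y_sample, y_greedy, gamma=0.99):
    """log_probs(seq) -> list of log p(y_t | y_<t, x) for an output sequence
    rouge(seq)       -> scalar reward for the sequence (e.g. ROUGE vs. reference)
    y_star:   ground-truth summary (teacher-forcing target)
    y_sample: sequence sampled from the model's output distribution
    y_greedy: greedy-decoded baseline sequence y_hat
    gamma:    mixing weight between the RL and ML terms (value here is arbitrary)."""
    # Maximum-likelihood (teacher forcing) loss L_ml.
    l_ml = -sum(log_probs(y_star))
    # Self-critical policy gradient loss L_rl: minimising it raises the
    # likelihood of samples that beat the greedy baseline and lowers it otherwise.
    l_rl = (rouge(y_greedy) - rouge(y_sample)) * sum(log_probs(y_sample))
    # Mixed objective L_mixed.
    return gamma * l_rl + (1 - gamma) * l_ml
```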
### Results

Results using only the RL objective are surprisingly good.
### Comments
- Can we use this intra-attention over conditional latent states? This is probably already handled by a Transformer's self-attention.
- If we use the RL objective, what are the two samples?
## RewardsOfSum: RL Rewards for Summarisation
Parnell et al, [Workshop at ACL-IJCNLP 2021](https://arxiv.org/abs/2106.04080)