---
title: Text2Text Diffusion Background
tags: diffusion, research
---
###### tags: `diffusion`, `research`
# Look at this recent Oct 2025 YT review of text diffusion
https://youtu.be/bmr718eZYGU
# Text2Text Diffusion Background
[repo](https://github.com/JamesL404/T2T-Diffu-Summarization), [project hackmd](https://hackmd.io/@blxCI4wNTdaijtEueJy7vA/SyRoCykVi/edit)
[TOC]
## DiffuSeq: Sequence to Sequence Text Generation with Diffusion Models
Gong et al., 2022, [arXiv](https://arxiv.org/abs/2210.08933), [repo](https://github.com/Shark-NLP/DiffuSeq)
### Background
- Diffusion-LM (Li et al., 2022) conditions on simple control attributes and does not generalize to conditional language modeling where the condition is itself a sequence of words, so applying DLM to seq2seq tasks is difficult.
- Seq2Seq is essential to NLP: dialogue, paraphrasing, text style transfer, etc.
### DiffuSeq
**Forward process**
- Like DLM, uses an embedding function $\mathrm{EMB}(\mathbf{w})$.
- Embeddings of the condition $\mathbf{w}^x$ and target $\mathbf{w}^y$ are learned simultaneously via concatenation, $\mathrm{EMB}(\mathbf{w}^{x \oplus y}) = [\mathrm{EMB}(\mathbf{w}^x); \mathrm{EMB}(\mathbf{w}^y)]$, which gives $\mathbf{z}_0$.
- Only the $y$ part of $\mathbf{z}_t$ is noised (termed partially noising); see the sketch below.
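A minimal sketch of this partial-noising forward step, assuming a standard Gaussian schedule with cumulative noise level $\bar\alpha_t$; the function and tensor names are illustrative, not DiffuSeq's actual API.
```python
import torch

def partial_noising(emb_x, emb_y, alpha_bar_t):
    """Forward step q(z_t | z_0) that corrupts only the target part.

    emb_x: (len_x, d) clean embedding of the condition w^x
    emb_y: (len_y, d) clean embedding of the target w^y
    alpha_bar_t: cumulative noise level at step t (scalar in (0, 1))
    """
    z0 = torch.cat([emb_x, emb_y], dim=0)                      # z_0 = EMB(w^{x (+) y})
    noise = torch.randn_like(z0)
    z_t = (alpha_bar_t ** 0.5) * z0 + ((1 - alpha_bar_t) ** 0.5) * noise
    z_t[: emb_x.shape[0]] = emb_x                              # keep the x part clean
    return z_t

# e.g. a 16-token condition and a 32-token target with 128-d embeddings
z_t = partial_noising(torch.randn(16, 128), torch.randn(32, 128), alpha_bar_t=0.5)
```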
**Reverse process**
Similar to DLM, but different to GLIDE: conditioning comes from the clean $x$ part of $\mathbf{z}_t$ itself rather than from an external classifier or a separately conditioned denoiser; a rough sketch follows.
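A rough sketch of one reverse step under my reading of the method (not necessarily the paper's exact sampler): the denoiser sees the whole concatenated $\mathbf{z}_t$, but the $x$ slice is re-anchored to its clean embedding after every update.
```python
import torch

@torch.no_grad()
def reverse_step(model, z_t, emb_x, t, ab_prev):
    """One reverse (denoising) step; a sketch only.

    model(z_t, t) is assumed to predict the clean embeddings z_0
    (x0-parameterisation, as in Diffusion-LM); ab_prev is alpha-bar at t-1.
    """
    z0_hat = model(z_t, t)                                     # predicted z_0
    z_prev = (ab_prev ** 0.5) * z0_hat + ((1 - ab_prev) ** 0.5) * torch.randn_like(z_t)
    z_prev[: emb_x.shape[0]] = emb_x                           # re-anchor the condition
    return z_prev
```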
### Results

- Generates highly diverse sequences
### Comments
- Why only noise y?
- Training this way makes the model task-specific and does not leverage a general-purpose LLM.
## A Deep Reinforced Model for Abstractive Summarization
Paulus et al., [ICLR 2018](https://arxiv.org/abs/1705.04304)
### Background
Training with supervised learning (teacher forcing) alone leads to exposure bias: at test time the model must condition on its own previous outputs, which it never saw during training.
### Intra-Attention
- Intra-temporal attention over the encoder states penalizes re-attending to input positions already attended at earlier decoding steps, which reduces repetition
- Intra-decoder attention additionally attends over previously decoded tokens; see the sketch after this list
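A rough sketch of the intra-temporal normalization as I read it: raw attention scores over the input are divided by the accumulated (exponentiated) scores each position received at earlier decoding steps, so already-attended positions get down-weighted. Names are illustrative.
```python
import torch

def intra_temporal_attention(scores_t, past_exp_scores=None):
    """scores_t: (src_len,) raw attention scores e_{t,i} at decoder step t.
    past_exp_scores: (src_len,) running sum of exp(e_{j,i}) over steps j < t
    (None at the first step). Returns (attention weights, updated running sum)."""
    exp_t = torch.exp(scores_t)
    if past_exp_scores is None:
        temporal = exp_t                        # first step: no temporal penalty
        running = exp_t.clone()
    else:
        temporal = exp_t / past_exp_scores      # penalise already-attended positions
        running = past_exp_scores + exp_t
    attn = temporal / temporal.sum()            # normalise over the source tokens
    return attn, running
```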
### Generation
- At each timestep, a switch function (sketched below) decides whether to use:
    - a softmax layer over the output vocabulary, or
    - a pointer mechanism that copies rare or unseen words from the input sequence
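A minimal sketch of that switch in a pointer-generator style, assuming a learned sigmoid gate over the decoder state and attention context; all weights and names here are illustrative, not the paper's exact parameterisation.
```python
import torch
import torch.nn.functional as F

def token_distribution(dec_state, context, attn, src_ids, W_vocab, w_switch, vocab_size):
    """dec_state: (d,) decoder hidden state at step t
    context:     (d,) attention context over the input
    attn:        (src_len,) attention weights over the input tokens
    src_ids:     (src_len,) vocabulary ids of the input tokens (copy targets)
    Returns one distribution over the vocabulary mixing generation and copying."""
    feats = torch.cat([dec_state, context])
    p_vocab = F.softmax(W_vocab @ feats, dim=-1)            # softmax over words
    p_copy = torch.sigmoid(w_switch @ feats)                # switch probability
    copy_dist = torch.zeros(vocab_size).scatter_add(0, src_ids, attn)
    return p_copy * copy_dist + (1 - p_copy) * p_vocab

# e.g. d = 256, src_len = 40, vocab = 10_000
d, src_len, V = 256, 40, 10_000
dist = token_distribution(torch.randn(d), torch.randn(d),
                          F.softmax(torch.randn(src_len), dim=-1),
                          torch.randint(0, V, (src_len,)),
                          torch.randn(V, 2 * d), torch.randn(2 * d), V)
```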
### Hybrid Objective
**Max-likelihood** (teacher forcing) loss:
$$
L_{m l}=-\sum_{t=1}^{n^{\prime}} \log p\left(y_t^* \mid y_1^*, \ldots, y_{t-1}^*, x\right)
$$
**Self-critical policy gradient training**:
$$
L_{r l}=\left(r(\hat{y})-r\left(y^s\right)\right) \sum_{t=1}^{n^{\prime}} \log p\left(y_t^s \mid y_1^s, \ldots, y_{t-1}^s, x\right)
$$
where $y^s$ is produced by sampling from the model, $\hat{y}$ is obtained by maximizing the output probability distribution at each time step (greedy search), and $r(\cdot)$ is a reward function of the output sequence. Minimizing this loss increases the likelihood of sampled sequences that score better than the greedy baseline.
**Mixed objective**:
$$
L_{\text {mixed }}=\gamma L_{r l}+(1-\gamma) L_{m l}
$$
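The three losses side by side as a small sketch; `log_probs` and `rouge` are hypothetical helpers (per-token log-probabilities under the model and a reward such as ROUGE, respectively), and the default $\gamma$ value is arbitrary.
```python
def hybrid_loss(log_probs, rouge, y_star, y_sample, y_greedy, gamma=0.99):
    """log_probs(seq) -> list of log p(y_t | y_<t, x) for an output sequence
    rouge(seq)       -> scalar reward for the sequence (e.g. ROUGE vs. reference)
    y_star:   ground-truth summary (teacher-forcing target)
    y_sample: sequence sampled from the model's output distribution
    y_greedy: greedy-decoded baseline sequence y_hat
    gamma:    mixing weight between the RL and ML terms (value here is arbitrary)."""
    # Maximum-likelihood (teacher forcing) loss L_ml.
    l_ml = -sum(log_probs(y_star))
    # Self-critical policy gradient loss L_rl: minimising it raises the
    # likelihood of samples that beat the greedy baseline and lowers it otherwise.
    l_rl = (rouge(y_greedy) - rouge(y_sample)) * sum(log_probs(y_sample))
    # Mixed objective L_mixed.
    return gamma * l_rl + (1 - gamma) * l_ml
```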
### Results

Results using only the RL objective are surprisingly good.
### Comments
- Can we use this intra-attention over conditional latent states? This is probably already handled by a Transformer's self-attention.
- If we use the RL objective, what are the two samples?
## RewardsOfSum: RL Rewards for Summarisation
Parnell et al, [Workshop at ACL-IJCNLP 2021](https://arxiv.org/abs/2106.04080)