# [CONT: Contrastive Neural Text Generation](https://arxiv.org/pdf/2205.14690.pdf)
NeurIPS 2022
--------
# Outline
Introduction
Method
Experiment
Conclusion
----
# Introduction
**Previous methods** that apply contrastive learning to neural text generation usually lead to inferior performance:
- Naive CL: simply using from-batch positive and negative samples following SimCLR with the InfoNCE loss, which **ignores the difference between negative samples**.
- Previous work attempts to build better contrastive samples by perturbing the ground truth in the discrete token space or the continuous embedding space.
**Contrastive Neural Text Generation (CONT)** addresses the bottlenecks that prevent contrastive learning from being widely adopted in generation tasks from three aspects:
- the construction of contrastive examples (via the diverse beam search algorithm)
- the choice of the contrastive loss (an N-pairs loss)
- the strategy used in decoding (the learned sequence similarity score)
We validate CONT on five generation tasks with ten benchmarks:
- machine translation
- summarization
- code comment generation
- data-to-text generation
- commonsense generation

---
# Method
## Architecture

### Contrastive Examples from Predictions
We use the **diverse beam search** algorithm to create contrastive examples from the **top-K** list of the model's latest predictions, then append them to the from-batch samples to form the contrastive examples.
A warm-up stage in which the model is supervised only by $\mathcal{L}_\text{NLL}$ is recommended, as it guarantees the quality of the examples drawn from the model's predictions.
These self-generated contrastive examples alleviate the model’s exposure bias.
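As an illustration, here is a minimal sketch of drawing the top-K candidates with diverse beam search, assuming a Hugging Face seq2seq model; `model`, `batch`, and the parameter values are placeholders, not the authors' implementation:

```python
import torch

# Assumption: `model` is a Hugging Face encoder-decoder (e.g., loaded via
# AutoModelForSeq2SeqLM) and `batch` holds the tokenized source sequences.
# Diverse beam search spreads the beams over `num_beam_groups` groups and
# penalizes tokens already chosen by other groups, so the K returned
# hypotheses differ from each other.
K = 8
with torch.no_grad():
    candidate_ids = model.generate(
        input_ids=batch["input_ids"],
        attention_mask=batch["attention_mask"],
        num_beams=K,
        num_beam_groups=K,          # one beam per group -> maximally diverse
        diversity_penalty=1.0,      # strength of the inter-group penalty
        num_return_sequences=K,     # keep the full top-K list
        max_new_tokens=128,
    )
# candidate_ids: (batch_size * K, seq_len); these self-generated sequences
# are appended to the from-batch samples to form the contrastive examples.
```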
### N-Pairs Contrastive Loss
We first rank all contrastive examples with an oracle function $o(\cdot, y)$, which computes a sequence-level score against the ground truth $y$.
$$
\mathcal{L}_{\text {N-Pairs }}=\sum_{\left(y^{+}, \boldsymbol{y}^{-}\right) \in \mathcal{P}} \mathcal{L}\left(\boldsymbol{y}^{+}, \boldsymbol{y}^{-}\right)=\sum_{\left(\boldsymbol{y}^{+}, \boldsymbol{y}^{-}\right) \in \mathcal{P}} \max \left\{0, \cos \left(\mathbf{z}_{\boldsymbol{x}}, \mathbf{z}_{\boldsymbol{y}^{-}}\right)-\cos \left(\mathbf{z}_{\boldsymbol{x}}, \mathbf{z}_{\boldsymbol{y}^{+}}\right)+\xi\right\}
$$
$\mathcal{P}$ contains $C_K^2$ pairs constructed from $\mathcal{B}$, the ground truth $y$, and the from-batch examples.
$\xi=\gamma \cdot\left(\operatorname{rank}\left(\boldsymbol{y}^{-}\right)-\operatorname{rank}\left(\boldsymbol{y}^{+}\right)\right)$ reflects the quality difference within each pair.
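A minimal PyTorch sketch of this loss, assuming `sim` holds the cosine similarities $\cos(\mathbf{z}_{\boldsymbol{x}}, \mathbf{z}_{\boldsymbol{y}_i})$ of the contrastive examples already sorted by oracle score (index = rank, 0 = best); the names and the explicit double loop are illustrative, not the official implementation:

```python
import torch

def n_pairs_loss(sim: torch.Tensor, gamma: float = 0.01) -> torch.Tensor:
    """Pairwise margin loss over the C(K, 2) pairs in P.

    sim:   (K,) cosine similarities cos(z_x, z_{y_i}); examples are assumed
           to be sorted by the oracle score, so index i is rank(y_i).
    gamma: scale of the rank-dependent margin xi.
    """
    K = sim.size(0)
    losses = []
    for i in range(K):                  # y+ : the better-ranked example
        for j in range(i + 1, K):       # y- : the worse-ranked example
            xi = gamma * (j - i)        # xi grows with the rank gap
            losses.append(torch.relu(sim[j] - sim[i] + xi))
    return torch.stack(losses).sum()
```

For example, with `sim = torch.tensor([0.9, 0.7, 0.8])` and `gamma = 0.05`, only the (rank 1, rank 2) pair violates its margin, so the loss is `0.8 - 0.7 + 0.05 = 0.15`.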
### Inference with Learned Similarity Function
$$
\boldsymbol{y}^{*}=\arg \max _{\hat{\boldsymbol{y}}}\left\{\alpha \cdot \cos \left(\mathbf{z}_{\boldsymbol{x}}, \mathbf{z}_{\hat{\boldsymbol{y}}}\right)+(1-\alpha) \prod_{t=0}^{n} p\left(\hat{y}_{t} \mid \boldsymbol{x}, \hat{\boldsymbol{y}}_{<t}\right)\right\}
$$
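A hedged sketch of this re-ranking step, assuming each of the K beam candidates comes with its cosine similarity to the source representation and its summed token log-probabilities (whose exponent is the product in the equation); names are illustrative:

```python
import torch

def rerank(cand_sim: torch.Tensor,
           cand_logprob: torch.Tensor,
           alpha: float = 0.5) -> int:
    """Pick the candidate maximizing the combined score above.

    cand_sim:     (K,) learned similarity cos(z_x, z_y_hat) per candidate.
    cand_logprob: (K,) sum of token log-probabilities per candidate, so its
                  exp() equals the product prod_t p(y_t | x, y_<t).
    alpha:        interpolation weight between similarity and likelihood.
    """
    seq_prob = cand_logprob.exp()                  # product of token probs
    score = alpha * cand_sim + (1.0 - alpha) * seq_prob
    return int(torch.argmax(score).item())         # index of y*
```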


---
# Experiment



---
# Conclusion
CONT adds a contrastive learning objective that provides sequence-level supervision for auto-regressive neural text generation models.
It identifies and addresses three shortcomings that have limited contrastive learning on text generation tasks.
Speeding up the training stage without losing accuracy is the next important step for improving CONT.
---
# Appendix
## Batch samples
In this paper, batch samples are the group of text sequences (data points) fed into the neural network simultaneously; the from-batch contrastive examples mentioned in the Method section are drawn from this group.
## N-Pairs Contrastive Loss Algorithm
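The algorithm box itself is not reproduced here; below is a hedged reconstruction of one training step from the definitions above. `generate_candidates`, `encode`, `oracle`, and `model.nll` are hypothetical stand-ins for diverse beam search, the representation $\mathbf{z}$, the sequence-level oracle $o(\cdot, y)$, and $\mathcal{L}_\text{NLL}$; `n_pairs_loss` is the sketch from the Method section. The actual training loop may differ.

```python
import torch
import torch.nn.functional as F

def training_step(model, batch, gamma=0.01, warmup_done=True):
    """One CONT-style step: L_NLL plus the N-pairs contrastive loss.

    All helpers used below (generate_candidates, encode, oracle, model.nll,
    n_pairs_loss) are hypothetical stand-ins; only the overall structure
    follows the Method section.
    """
    nll = model.nll(batch["src"], batch["tgt"])     # token-level L_NLL
    if not warmup_done:                             # warm-up: NLL only
        return nll

    # 1. Self-generated candidates (top-K diverse beam search),
    #    plus the ground truth and the from-batch examples.
    examples = (generate_candidates(model, batch["src"])
                + [batch["tgt"]] + batch["from_batch"])

    # 2. Rank every contrastive example by the oracle score o(., y).
    examples.sort(key=lambda y: oracle(y, batch["tgt"]), reverse=True)

    # 3. Cosine similarity between the source and each ranked example,
    #    then the pairwise margin loss over all pairs in P.
    z_x = encode(model, batch["src"])
    sim = torch.stack([F.cosine_similarity(z_x, encode(model, y), dim=-1)
                       for y in examples])
    return nll + n_pairs_loss(sim, gamma=gamma)
```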

## Background
$$
\begin{gathered}
\mathcal{L}_{\mathrm{NCE}}=-\log \frac{\exp \left(\cos \left(\mathbf{z}_{\boldsymbol{x}}, \mathbf{z}_{\boldsymbol{y}}\right) / \tau\right)}{\sum_{\boldsymbol{y}^{\prime} \in \mathcal{B}} \exp \left(\cos \left(\mathbf{z}_{\boldsymbol{x}}, \mathbf{z}_{\boldsymbol{y}^{\prime}}\right) / \tau\right)} \\
\mathcal{L}_{\mathrm{NLL}}=-\sum_{t=1}^{N} \log p_{\theta}\left(y_{t} \mid \boldsymbol{x}, \boldsymbol{y}_{<t}\right), \\
\end{gathered}
$$
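For concreteness, here is a minimal PyTorch sketch of these two background losses, assuming `z_x` and `z_batch` are the source representation and the stacked representations of all in-batch targets, and `logits`/`target_ids` cover one target sequence (shapes and names are illustrative):

```python
import torch
import torch.nn.functional as F

def info_nce(z_x: torch.Tensor, z_batch: torch.Tensor,
             pos_idx: int, tau: float = 1.0) -> torch.Tensor:
    """Naive CL: the paired target is the positive, every other in-batch
    target is a negative, and all negatives enter the denominator equally."""
    sims = F.cosine_similarity(z_x.unsqueeze(0), z_batch, dim=-1) / tau  # (B,)
    target = torch.tensor([pos_idx], device=sims.device)
    return F.cross_entropy(sims.unsqueeze(0), target)   # -log softmax at pos

def nll(logits: torch.Tensor, target_ids: torch.Tensor) -> torch.Tensor:
    """Token-level negative log-likelihood over one target sequence."""
    log_probs = F.log_softmax(logits, dim=-1)            # (N, vocab)
    return -log_probs.gather(1, target_ids.unsqueeze(1)).sum()
```

Because every in-batch negative is weighted identically in the denominator, this is exactly the "ignores the difference between negative samples" issue that the N-pairs loss addresses.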