###### tags: `PaperReview`
[Paper Link](https://arxiv.org/abs/1908.05731)
> Kyra Yee, Nathan Ng, Yann N. Dauphin, Michael Auli
>
> Facebook AI Research, Google Brain
>
## Introduction
- Sequence to sequence models directly **estimate the posterior** probability of a **target sequence $y$ given a source sequence $x$** and can be trained with pairs of source and target sequences.
- Direct models **cannot naturally take advantage** of unpaired data.
- The noisy channel approach is an alternative which **entails a channel model probability $p(x|y)$** that operates in the reverse direction as well as a **language model probability $p(y)$**; it can **mitigate explaining-away effects** that result in the source being ignored for highly likely output prefixes.
- Previous work on neural noisy channel modeling relied on a complex latent variable model that **incrementally processes source and target prefixes**, and which performs **less well** than a vanilla sequence to sequence model.
- This paper trains the channel model on **full sentences and applies it to score the source** given an **incomplete target sentence**.
## Approach
- The noisy channel approach applies Bayes' rule $p(\boldsymbol{y} \mid \boldsymbol{x})= p(\boldsymbol{x} \mid \boldsymbol{y})p(\boldsymbol{y})/p(\boldsymbol{x})$.
$$
\log p(x \mid y) = \sum_{j=1}^{|x|} \log p(x_j \mid x_0, x_1, \dots, x_{j-1}, y)
$$
- A critical choice is to model $p(x \mid y)$ with a standard Transformer architecture.
- Train $p(x \mid y)$ on complete sentence pairs and perform inference with **incomplete target prefixes of varying size $k$**, i.e., $p(x \mid y_1, \dots, y_k)$. Results show that **standard sequence to sequence models are very robust to this mismatch** (see the sketch below).
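A minimal sketch of this setup (the function and variable names are hypothetical and the scorer is a stub rather than a real Transformer): the channel model is an ordinary target-to-source seq2seq model, and at inference time it is simply fed a partial target prefix while scoring the complete source.

```python
import math

# Hypothetical stand-in for the reverse (target -> source) Transformer.
# In practice this would be a forward pass of the channel model; a constant
# keeps the sketch self-contained and runnable.
def channel_token_logprob(src_prefix, next_src_token, tgt_prefix):
    """Return log p(x_j | x_<j, y_1..y_k)."""
    return math.log(0.1)

def channel_score(source, target_prefix):
    """log p(x | y_1..y_k) = sum_j log p(x_j | x_<j, y_1..y_k).

    The model is trained on complete target sentences, but `target_prefix`
    may be an incomplete prefix of length k at decoding time.
    """
    return sum(
        channel_token_logprob(source[:j], source[j], target_prefix)
        for j in range(len(source))
    )

# Score the full source given only the first two target tokens.
source = ["Das", "ist", "ein", "Test", "."]
print(channel_score(source, target_prefix=["This", "is"]))
```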
### Decoding
- Compute $\operatorname{argmax}_y \log p(x \mid y) + \log p(y)$ to generate $y$ given $x$ with the channel model.
- However, decoding this way is **computationally expensive** because the channel model $p(x \mid y)$ is conditioned on each candidate target prefix, so it requires a **separate forward pass for every vocabulary word**.
### Approximation
- Perform a **two-step beam search** where the direct model pre-prunes the vocabulary.
- For beam size $k_1$, **collect $k_2$ possible next-word extensions for each beam** according to the **direct model**. Next, **score the resulting $k_1 \times k_2$ partial candidates with the channel model** and then **prune this set back to size $k_1$** (see the sketch below).
- This adds **computational overhead**: the channel model must be invoked $k_1 \times k_2$ times per step, compared to just $k_1$ times for the direct model.
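A compact sketch of the two-step search (all names are hypothetical and the three scorers are random stubs standing in for the direct model, channel model, and language model; step 2 here re-ranks with channel + LM scores only, whereas the full combination in the Model Combination section below also mixes in the direct model score):

```python
import math
import random

random.seed(0)
VOCAB = ["the", "a", "cat", "sat", "."]

# Random stubs standing in for Transformer forward passes of the three models.
def direct_logprobs(source, prefix):
    """log p(y_t | y_<t, x) for every word in VOCAB (direct model)."""
    return {w: math.log(random.uniform(0.01, 1.0)) for w in VOCAB}

def channel_logprob(source, prefix):
    """log p(x | y_1..y_t): the channel model scores the full source."""
    return math.log(random.uniform(0.01, 1.0))

def lm_logprob(prefix):
    """log p(y_1..y_t) under the language model."""
    return math.log(random.uniform(0.01, 1.0))

def beam_step(source, beams, k1=5, k2=10):
    """One step of the two-step beam search over target prefixes."""
    candidates = []
    for prefix in beams:
        scores = direct_logprobs(source, prefix)
        # Step 1: the direct model pre-prunes the vocabulary to the top-k2 extensions.
        for w in sorted(scores, key=scores.get, reverse=True)[:k2]:
            candidates.append(prefix + [w])

    # Step 2: re-score the (at most) k1 * k2 partial candidates with the
    # channel model and language model, then prune back to k1 beams.
    def noisy_channel_score(prefix):
        return channel_logprob(source, prefix) + lm_logprob(prefix)

    return sorted(candidates, key=noisy_channel_score, reverse=True)[:k1]

# One decoding step starting from a single empty hypothesis.
source = ["Das", "ist", "ein", "Test", "."]
print(beam_step(source, beams=[[]], k1=5, k2=10))
```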
### Complexity
- The previous incremental method has complexity $\mathcal{O}(n + m)$, where $n$ and $m$ are the source and target lengths respectively.
- This method has complexity $\mathcal{O}(nm)$, but the overhead is **negligible** in practice because **all source tokens can be scored in parallel** on modern GPUs (see the illustration below).
- Inference is mostly **slowed down** by the **autoregressive** nature of decoding.
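An illustrative count (the concrete numbers here are made up; only the asymptotics come from the paper): for $n = m = 20$,

$$
\underbrace{n + m}_{\text{incremental method}} = 40
\qquad \text{vs.} \qquad
\underbrace{n \cdot m}_{\text{this method}} = 400,
$$

but the $n = 20$ channel scores needed at each target step come from a single parallel forward pass, so only the $m = 20$ autoregressive decoding steps are sequential.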
### Model Combination
- Since the direct model needs to be evaluated for pre-pruning anyway, its probabilities are also included in the decoding decision:
$$
\frac{1}{t} \log p(y \mid x)+\frac{\lambda}{s}(\log p(x \mid y)+\log p(y))
$$
- where $t$ is the length of the target prefix $y$, $s$ is the source sentence length, and $\lambda$ is a tunable weight (a small sketch of this score follows below).
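A direct transcription of this score as a helper function (the function name, default weight, and example numbers are made up; the three log-probabilities would come from the direct model, channel model, and language model):

```python
def combined_score(log_p_direct, log_p_channel, log_p_lm,
                   target_prefix_len, source_len, lam=0.5):
    """(1/t) * log p(y|x) + (lambda/s) * (log p(x|y) + log p(y)).

    t = length of the target prefix y, s = source sentence length,
    lam = tunable interpolation weight (tuned on validation data).
    """
    t, s = target_prefix_len, source_len
    return log_p_direct / t + (lam / s) * (log_p_channel + log_p_lm)

# Example with made-up log-probabilities.
print(combined_score(-3.2, -12.5, -9.8, target_prefix_len=4, source_len=7))
```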
## Experiments
### Dataset
- Models are trained on **WMT'17** data, validated on **news2016**, and tested on **news2017**.
- For reranking, train models with a **40K joint BPE** vocabulary.
- The target side uses the **vocabulary of the language model** so that the language model can be applied during online decoding.
### Language Models
- Train two big Transformer language models with 12 blocks.
### Sequence to Sequence Model Training
- A big Transformer model is used.
- For the online decoding experiments, encoder and decoder embeddings are not shared because the source and target vocabularies are learned separately.
- Generally, $k_1 = 5$ and $k_2 = 10$ are used, and $\lambda$ is tuned.
### Simple Channel Model
- Train Transformer models that translate from the target to the source and compare two variants:
- A standard seq2seq model trained to predict full source sentences based on full targets (seq2seq).
- A model trained to predict the full source based on a prefix of the target (prefix-model).

- **The prefix-model performs slightly better up to target prefixes of about 15 tokens.**
### Effect of Scoring the Entire Source Given Partial Target Prefixes

- **Scoring the entire source (src 1) performs better than, or comparably to, all other source prefix lengths.**
### Online Decoding

### Analysis
- Using the channel model in online decoding **enables searching a much larger space** compared to n-best list re-ranking.
- But **online decoding is challenging** because the channel model **needs to score the entire source sequence** given only a partial target, which is difficult.

- **The channel model benefits much more from additional target context** than re-ranking with just the direct model and an LM (DIR+LM) or re-ranking with a direct ensemble (DIR ENS).
### Re-ranking

- RL = right-to-left configuration
## Conclusion
- The paper proposes to **make use of the entire source** in the channel model, which makes the noisy channel approach more beneficial.
- Standard sequence to sequence models are a **simple parameterization for the channel probability** that naturally exploits the entire source.
- This parameterization outperforms strong baselines such as ensembles of direct models and right-to-left models.
- **Channel models are particularly effective with large context sizes** and an interesting future direction is to iteratively refine the output while conditioning on previous contexts.