###### tags: `PaperReview`
[Paper Link](https://arxiv.org/abs/1908.05731)
> Kyra Yee, Nathan Ng, Yann N. Dauphin, Michael Auli
>
> Facebook AI Research, Google Brain
>
## Introduction
- Sequence to sequence models directly **estimate the posterior** probability of a **target sequence $y$ given a source sequence $x$** and can be trained with pairs of source and target sequences.
- Direct models **cannot naturally take advantage** of unpaired data.
- The noisy channel approach is an alternative which **entails a channel model probability $p(x|y)$** that operates in the reverse direction as well as a **language model probability $p(y)$**; it can **mitigate explaining-away effects** that result in the source being ignored for highly likely output prefixes.
- Previous work on neural noisy channel modeling relied on a complex latent variable model that **incrementally processes source and target prefixes**, and which performs **less well** than a vanilla sequence to sequence model.
- This paper trains the channel model on **full sentences and applies it to score the source** given an **incomplete target sentence**.
## Approach
- The noisy channel approach applies Bayes' rule $p(\boldsymbol{y} \mid \boldsymbol{x})= p(\boldsymbol{x} \mid \boldsymbol{y})p(\boldsymbol{y})/p(\boldsymbol{x})$.
$$
\log p(x \mid y) = \sum_{j=1}^{|x|} \log p(x_j \mid x_0, x_1, \dots, x_{j-1}, y)
$$
- A critical choice is to model $p(x \mid y)$ with a standard Transformer architecture.
- Train $p(x \mid y)$ on complete sentence pairs and perform inference with **incomplete target prefixes of varying size $k$**, i.e., $p(x \mid y_1, \dots, y_k)$. Results show that **standard sequence to sequence models are very robust to this mismatch** (see the sketch below).
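A minimal sketch of this setup (the function and variable names are hypothetical and the scorer is a stub rather than a real Transformer): the channel model is an ordinary target-to-source seq2seq model, and at inference time it is simply fed a partial target prefix while scoring the complete source.

```python
import math

# Hypothetical stand-in for the reverse (target -> source) Transformer.
# In practice this would be a forward pass of the channel model; a constant
# keeps the sketch self-contained and runnable.
def channel_token_logprob(src_prefix, next_src_token, tgt_prefix):
    """Return log p(x_j | x_<j, y_1..y_k)."""
    return math.log(0.1)

def channel_score(source, target_prefix):
    """log p(x | y_1..y_k) = sum_j log p(x_j | x_<j, y_1..y_k).

    The model is trained on complete target sentences, but `target_prefix`
    may be an incomplete prefix of length k at decoding time.
    """
    return sum(
        channel_token_logprob(source[:j], source[j], target_prefix)
        for j in range(len(source))
    )

# Score the full source given only the first two target tokens.
source = ["Das", "ist", "ein", "Test", "."]
print(channel_score(source, target_prefix=["This", "is"]))
```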
### Decoding
- Compute $\operatorname{argmax}_y \log p(x \mid y) + \log p(y)$ to generate $y$ given $x$ with the channel model.
- However, decoding this way is **computationally expensive** because the channel model $p(x \mid y)$ is conditioned on each candidate target prefix, so it requires a **separate forward pass for every vocabulary word**.
### Approximation
- Perform a **two-step beam search** where the direct model pre-prunes the vocabulary.
- For beam size $k_1$, **collect $k_2$ possible next-word extensions for each beam** according to the **direct model**. Next, **score the resulting $k_1 \times k_2$ partial candidates with the channel model** and then **prune this set back to size $k_1$** (see the sketch below).
- This adds **computational overhead**: the channel model must be invoked $k_1 \times k_2$ times per step, compared to just $k_1$ times for the direct model.
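A compact sketch of the two-step search (all names are hypothetical and the three scorers are random stubs standing in for the direct model, channel model, and language model; step 2 here re-ranks with channel + LM scores only, whereas the full combination in the Model Combination section below also mixes in the direct model score):

```python
import math
import random

random.seed(0)
VOCAB = ["the", "a", "cat", "sat", "."]

# Random stubs standing in for Transformer forward passes of the three models.
def direct_logprobs(source, prefix):
    """log p(y_t | y_<t, x) for every word in VOCAB (direct model)."""
    return {w: math.log(random.uniform(0.01, 1.0)) for w in VOCAB}

def channel_logprob(source, prefix):
    """log p(x | y_1..y_t): the channel model scores the full source."""
    return math.log(random.uniform(0.01, 1.0))

def lm_logprob(prefix):
    """log p(y_1..y_t) under the language model."""
    return math.log(random.uniform(0.01, 1.0))

def beam_step(source, beams, k1=5, k2=10):
    """One step of the two-step beam search over target prefixes."""
    candidates = []
    for prefix in beams:
        scores = direct_logprobs(source, prefix)
        # Step 1: the direct model pre-prunes the vocabulary to the top-k2 extensions.
        for w in sorted(scores, key=scores.get, reverse=True)[:k2]:
            candidates.append(prefix + [w])

    # Step 2: re-score the (at most) k1 * k2 partial candidates with the
    # channel model and language model, then prune back to k1 beams.
    def noisy_channel_score(prefix):
        return channel_logprob(source, prefix) + lm_logprob(prefix)

    return sorted(candidates, key=noisy_channel_score, reverse=True)[:k1]

# One decoding step starting from a single empty hypothesis.
source = ["Das", "ist", "ein", "Test", "."]
print(beam_step(source, beams=[[]], k1=5, k2=10))
```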
### Complexity
- The previous incremental method has complexity $\mathcal{O}(n + m)$, where $n$ and $m$ are the source and target lengths respectively.
- This method has complexity $\mathcal{O}(nm)$, but the overhead is **negligible** in practice because **all source tokens can be scored in parallel** on modern GPUs (see the illustration below).
- Inference is mostly **slowed down** by the **autoregressive** nature of decoding.
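An illustrative count (the concrete numbers here are made up; only the asymptotics come from the paper): for $n = m = 20$,

$$
\underbrace{n + m}_{\text{incremental method}} = 40
\qquad \text{vs.} \qquad
\underbrace{n \cdot m}_{\text{this method}} = 400,
$$

but the $n = 20$ channel scores needed at each target step come from a single parallel forward pass, so only the $m = 20$ autoregressive decoding steps are sequential.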
### Model Combination
- Since the direct model needs to be evaluated for pre-pruning anyway, its probabilities are also included in the decoding decision:
$$
\frac{1}{t} \log p(y \mid x)+\frac{\lambda}{s}(\log p(x \mid y)+\log p(y))
$$
- where $t$ is the length of the target prefix $y$, $s$ is the source sentence length, and $\lambda$ is a tunable weight (a small sketch of this score follows below).
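A direct transcription of this score as a helper function (the function name, default weight, and example numbers are made up; the three log-probabilities would come from the direct model, channel model, and language model):

```python
def combined_score(log_p_direct, log_p_channel, log_p_lm,
                   target_prefix_len, source_len, lam=0.5):
    """(1/t) * log p(y|x) + (lambda/s) * (log p(x|y) + log p(y)).

    t = length of the target prefix y, s = source sentence length,
    lam = tunable interpolation weight (tuned on validation data).
    """
    t, s = target_prefix_len, source_len
    return log_p_direct / t + (lam / s) * (log_p_channel + log_p_lm)

# Example with made-up log-probabilities.
print(combined_score(-3.2, -12.5, -9.8, target_prefix_len=4, source_len=7))
```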
## Experiments
### Dataset
- Models are trained on **WMT'17** data, validated on **news2016**, and tested on **news2017**.
- For reranking, train models with a **40K joint BPE** vocabulary.
- The target side uses the **vocabulary of the language model** so that the language model can be applied during online decoding.
### Language Models
- Train two big Transformer language models with 12 blocks.
### Sequence to Sequence Model Training
- A big Transformer model is used.
- For the online decoding experiments, encoder and decoder embeddings are not shared because the source and target vocabularies are learned separately.
- Generally, $k_1 = 5$ and $k_2 = 10$ are used, and $\lambda$ is tuned.
### Simple Channel Model
- Train Transformer models that translate from the target to the source and compare two variants:
- A standard seq2seq model trained to predict full source sentences based on full targets (seq2seq).
- A model trained to predict the full source based on a prefix of the target (prefix-model).

- **The prefix-model performs slightly better up to target prefixes of about 15 tokens.**
### Effect of Scoring the Entire Source Given Partial Target Prefixes

- **Scoring the entire source (src 1) performs better than, or comparably to, all other source prefix lengths.**
### Online Decoding

### Analysis
- Using the channel model in online decoding **enables searching a much larger space** compared to n-best list re-ranking.
- But **online decoding is challenging** because the channel model **needs to score the entire source sequence** given only a partial target, which is difficult.

- **The channel model benefits much more from additional target context** than re-ranking with just the direct model and an LM (DIR+LM) or re-ranking with a direct ensemble (DIR ENS).
### Re-ranking

- RL = right-to-left configuration
## Conclusion
- The paper proposes to **make use of the entire source** in the channel model, which makes the noisy channel approach more beneficial.
- Standard sequence to sequence models are a **simple parameterization for the channel probability** that naturally exploits the entire source.
- This parameterization outperforms strong baselines such as ensembles of direct models and right-to-left models.
- **Channel models are particularly effective with large context sizes** and an interesting future direction is to iteratively refine the output while conditioning on previous contexts.