---
title: Effective Approaches to Attention-based Neural Machine Translation
date: 2020-04-22 15:12:00
comments: true
author: Darcy
categories:
  - nlp study group
tags:
  - NLP
---

###### tags: `study` `paper` `DSMI lab`

paper: [Effective Approaches to Attention-based Neural Machine Translation](https://www.aclweb.org/anthology/D15-1166/)

## Introduction

* Neural Machine Translation (NMT) requires minimal domain knowledge and is conceptually simple.
* NMT generalizes well to very long word sequences, so it does not need to store explicit phrase tables.
* The concept of "attention": learn alignments between different modalities
    * image caption generation: visual features of a picture vs. the text description
    * speech recognition: speech frames vs. text
* Proposed method: novel types of attention-based models
    * global approach
    * local approach

<!-- more -->

## Neural Machine Translation

* Goal: translate the source sentence $x_1, x_2, ..., x_n$ into the target sentence $y_1, y_2, ..., y_m$.
* A basic form of NMT consists of two components:
    * Encoder: computes a representation $s$ for each source sentence
    * Decoder: generates one target word at a time
      $p(y_j \mid y_{<j}, s) = \text{softmax}(g(h_j))$, where $g$ is a transformation function that outputs a vocabulary-sized vector and $h_j$ is the RNN hidden state.
* Training objective: $J_t = \sum_{(x, y) \in D} -\log p(y \mid x)$, where $D$ is the parallel training corpus.

## Attention-based Models

1. Global Attention
   ![](https://i.imgur.com/924nX0n.png)
   Differences compared with Bahdanau et al.:
   * Bahdanau uses a bidirectional encoder
   * Bahdanau uses deep-output and max-out layers
   * Bahdanau uses a different alignment function (the authors argue their simpler ones work better): $e_{ij} = v^\top \tanh(W_a h_i + U_a \hat{h}_j)$
2. Local Attention
   * Global attention is computationally costly when the source sentence is long.
   * Local attention instead attends only to a small window of source positions centered at an aligned position $p_t$ (chosen monotonically or by a predictive alignment).
   ![](https://i.imgur.com/t0PuRdN.png)
   ![](https://i.imgur.com/iCGUXnx.png)
3. Input-feeding Approach
   * feed the attentional vector $\tilde{h}_t$ as an extra input at the next time step, so the model is fully aware of previous alignment choices
   * creates a very deep network spanning both horizontally and vertically
   ![](https://i.imgur.com/J4Wt6n4.png)

## Experiments

* Training data: WMT'14
    * 4.5M sentence pairs
    * 116M English words, 110M German words
    * vocabularies: top 50K most frequent words for both languages
* Model:
    * stacked LSTMs with 4 layers
    * 1000 cells per layer
    * 1000-dimensional embeddings
* Results:
    * English-German results
      ![](https://i.imgur.com/1LGjS9Z.png)
      ![](https://i.imgur.com/dLtcfdW.png)
    * German-English results
      ![](https://i.imgur.com/qcFfwKn.png)

## Analysis

![](https://i.imgur.com/pMcXT7s.png)
![](https://i.imgur.com/8XjcNya.png)

* Sample Translations
  Problems with the baseline model:
    * mistranslates person names
    * mistranslates double negation
  ![](https://i.imgur.com/osralYQ.png)

## Reference

* PyTorch implementation on GitHub: [https://github.com/AotY/Pytorch-NMT](https://github.com/AotY/Pytorch-NMT)
* slides: [https://slideplayer.com/slide/7710523/](https://slideplayer.com/slide/7710523/)
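
## Appendix: Global Attention Sketch

To make the global attention step concrete, here is a minimal PyTorch sketch (not the paper's or the linked repo's code; the class name `GlobalAttention` and the tensor sizes are illustrative assumptions). It uses the "general" score $h_t^\top W_a \bar{h}_s$, a softmax over source positions, and the attentional hidden state $\tilde{h}_t = \tanh(W_c[c_t; h_t])$:

```python
# Minimal sketch of Luong-style global attention with the "general" score.
# Not the authors' implementation; names and shapes are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalAttention(nn.Module):
    def __init__(self, hidden_size):
        super().__init__()
        self.W_a = nn.Linear(hidden_size, hidden_size, bias=False)      # "general" score weights
        self.W_c = nn.Linear(2 * hidden_size, hidden_size, bias=False)  # attentional layer

    def forward(self, h_t, encoder_states):
        # h_t:            (batch, hidden)          current decoder hidden state
        # encoder_states: (batch, src_len, hidden) all source hidden states
        scores = torch.bmm(encoder_states, self.W_a(h_t).unsqueeze(2)).squeeze(2)  # (batch, src_len)
        a_t = F.softmax(scores, dim=1)                                   # alignment weights
        c_t = torch.bmm(a_t.unsqueeze(1), encoder_states).squeeze(1)     # context vector (batch, hidden)
        h_tilde = torch.tanh(self.W_c(torch.cat([c_t, h_t], dim=1)))     # attentional hidden state
        return h_tilde, a_t

# Toy forward pass with random tensors
attn = GlobalAttention(hidden_size=8)
h_tilde, a_t = attn(torch.randn(2, 8), torch.randn(2, 5, 8))
print(h_tilde.shape, a_t.shape)  # torch.Size([2, 8]) torch.Size([2, 5])
```

The output $\tilde{h}_t$ is what the softmax prediction layer consumes, and (with input feeding) what is concatenated to the next decoder input; local attention would only differ in restricting the softmax to a window around $p_t$.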