---
title: Effective Approaches to Attention-based Neural Machine Translation
date: 2020-04-22 15:12:00
comments: true
author: Darcy
categories:
- nlp study group
tags:
- NLP
---
###### tags: `study` `paper` `DSMI lab`
paper: [Effective Approaches to Attention-based Neural Machine Translation](https://www.aclweb.org/anthology/D15-1166/)
## Introduction
* Neural Machine Translation (NMT) requires minimal domain knowledge and is conceptually simple
* NMT generalizes well to very long word sequences and has a small memory footprint, since it does not need to store gigantic phrase tables
* The concept of "attention": learn alignments between different modalities
    * image caption generation task: visual features of a picture vs. its text description
    * speech recognition task: speech frames vs. text
* Proposed method: two novel types of attention-based models
* global approach
* local approach
<!-- more -->
## Neural Machine Translation
* Goal: translate the source sentence $x_1, x_2,...,x_n$ to the target sentence $y_1, y_2,...,y_m$
* A basic form of NMT consists of two components:
    * Encoder: computes a representation $s$ of the source sentence
    * Decoder: generates one target word at a time:
    $p(y_j \mid y_{<j}, s) = \mathrm{softmax}(g(h_j))$, where $g$ is a transformation function that outputs a vocabulary-sized vector and $h_j$ is the RNN hidden state at step $j$.
* Training objective: $J_t = \sum_{(x,y)\in D} -\log p(y|x)$, where $D$ is the parallel training corpus (a minimal sketch of this setup follows below).
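A minimal PyTorch sketch of this encoder-decoder setup with the cross-entropy training objective. The class name, toy sizes, and random batch are illustrative assumptions, not the paper's actual model:

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Minimal encoder-decoder: encode x into s, then decode one word at a time."""
    def __init__(self, src_vocab, tgt_vocab, dim):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, dim)
        self.encoder = nn.LSTM(dim, dim, batch_first=True)
        self.decoder = nn.LSTM(dim, dim, batch_first=True)
        self.g = nn.Linear(dim, tgt_vocab)  # g: hidden state -> vocabulary-sized vector

    def forward(self, src, tgt_in):
        _, s = self.encoder(self.src_emb(src))        # s: representation of the source
        h, _ = self.decoder(self.tgt_emb(tgt_in), s)  # h_j: decoder hidden states
        return self.g(h)                              # softmax of this gives p(y_j | y_<j, s)

# Training objective J = sum_{(x,y) in D} -log p(y|x), i.e. token-level cross-entropy
model = Seq2Seq(src_vocab=50, tgt_vocab=50, dim=32)   # toy sizes for illustration
src = torch.randint(0, 50, (2, 7))                    # batch of 2 source sentences
tgt_in, tgt_out = torch.randint(0, 50, (2, 5)), torch.randint(0, 50, (2, 5))
loss = nn.CrossEntropyLoss()(model(src, tgt_in).reshape(-1, 50), tgt_out.reshape(-1))
loss.backward()
```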
## Attention-based model
1. Global Attention

At each decoding step, global attention derives a context vector as a weighted average over *all* source hidden states (sketched below). Differences compared with Bahdanau et al. (2015):
* Bahdanau uses a bidirectional encoder
* Bahdanau uses deep-output and maxout layers
* Bahdanau uses a different alignment function (the authors report their simpler score functions work better): $e_{ij}=v^\top \tanh(W_a h_i + U_a \hat{h}_j)$
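The paper scores each source state with a *dot*, *general*, or *concat* function, softmaxes the scores into alignment weights $a_t$, and averages the source states into the context $c_t$. A sketch of the dot and general variants (tensor shapes and names are my assumptions):

```python
import torch
import torch.nn.functional as F

def global_attention(h_t, h_s, W_a=None, score="dot"):
    """h_t: (batch, dim) decoder state; h_s: (batch, src_len, dim) encoder states."""
    if score == "dot":        # score(h_t, h_s) = h_t . h_s
        e = torch.bmm(h_s, h_t.unsqueeze(2)).squeeze(2)           # (batch, src_len)
    elif score == "general":  # score(h_t, h_s) = h_t . (W_a h_s)
        e = torch.bmm(h_s @ W_a.T, h_t.unsqueeze(2)).squeeze(2)
    a_t = F.softmax(e, dim=1)                          # alignment over all source words
    c_t = torch.bmm(a_t.unsqueeze(1), h_s).squeeze(1)  # context vector (batch, dim)
    return c_t, a_t

# The attentional hidden state then combines context and decoder state:
# h_tilde_t = tanh(W_c [c_t; h_t])
```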
2. Local Attention
* Global attention is computationally costly when the source sentence is long, since it attends to every source word for every target word.
* Local attention instead attends to a small window of source positions per target word; the local-p variant predicts the window center $p_t$ from the current decoder state (see the sketch below).
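A sketch of local-p attention: predict the aligned position $p_t = S\cdot\mathrm{sigmoid}(v_p^\top\tanh(W_p h_t))$ and weight alignments with a Gaussian centered at $p_t$, with $\sigma = D/2$. For brevity this sketch softmaxes over all positions rather than only the window $[p_t-D,\,p_t+D]$, a simplification on my part:

```python
import torch
import torch.nn.functional as F

def local_p_attention(h_t, h_s, W_p, v_p, D=10):
    """h_t: (batch, dim); h_s: (batch, S, dim); D: half-width of window (paper uses 10)."""
    batch, S, dim = h_s.shape
    # Predicted aligned position p_t = S * sigmoid(v_p^T tanh(W_p h_t)), a real in [0, S]
    p_t = S * torch.sigmoid(torch.tanh(h_t @ W_p.T) @ v_p)   # (batch,)
    e = torch.bmm(h_s, h_t.unsqueeze(2)).squeeze(2)          # dot scores (batch, S)
    a_t = F.softmax(e, dim=1)
    # Gaussian centered at p_t favors source words near the predicted position
    pos = torch.arange(S, dtype=h_s.dtype).unsqueeze(0)      # (1, S)
    a_t = a_t * torch.exp(-((pos - p_t.unsqueeze(1)) ** 2) / (2 * (D / 2) ** 2))
    c_t = torch.bmm(a_t.unsqueeze(1), h_s).squeeze(1)        # context (batch, dim)
    return c_t, a_t
```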


3. Input-feeding approach
* the attentional vector $\tilde{h}_t$ is fed back as part of the decoder input at the next time step (see the sketch below)
* makes the model fully aware of previous alignment choices
* creates a very deep network spanning both horizontally and vertically
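A sketch of one input-feeding decoder step: the previous attentional vector $\tilde{h}_{t-1}$ is concatenated with the current word embedding before entering the LSTM. The `attend` callback and all names are illustrative; any attention from above can be plugged in:

```python
import torch
import torch.nn as nn

class InputFeedingDecoder(nn.Module):
    def __init__(self, vocab, dim):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.cell = nn.LSTMCell(2 * dim, dim)  # input is [word embedding; h_tilde_{t-1}]
        self.W_c = nn.Linear(2 * dim, dim)     # h_tilde_t = tanh(W_c [c_t; h_t])

    def step(self, y_prev, h_tilde_prev, state, attend):
        # Feed the previous attentional vector in with the current word embedding
        x = torch.cat([self.emb(y_prev), h_tilde_prev], dim=1)
        h_t, cell_t = self.cell(x, state)
        c_t = attend(h_t)  # context from global or local attention
        h_tilde = torch.tanh(self.W_c(torch.cat([c_t, h_t], dim=1)))
        return h_tilde, (h_t, cell_t)
```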

## Experiments
* Training Data: WMT'14
* 4.5M sentence pairs.
* 116M English words, 110M German words
    * vocabularies: top 50K most frequent words for both languages
* Model:
    * stacked LSTM with 4 layers (sketched below)
    * 1000 cells per layer
    * 1000-dimensional embeddings
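A direct transcription of the listed configuration in PyTorch; training details such as dropout and the optimization schedule are omitted here:

```python
import torch.nn as nn

VOCAB, DIM, LAYERS = 50000, 1000, 4  # 50K vocab, 1000-dim embeddings/cells, 4 layers
embedding = nn.Embedding(VOCAB, DIM)
encoder = nn.LSTM(input_size=DIM, hidden_size=DIM, num_layers=LAYERS)
decoder = nn.LSTM(input_size=DIM, hidden_size=DIM, num_layers=LAYERS)
```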
* Results:
    * English-German results


* German-English results

## Analysis


* Sample Translations
    * Problems with the baseline model:
        * it mistranslates person names
        * it mistranslates double negatives

## Reference
* PyTorch implementation: [https://github.com/AotY/Pytorch-NMT](https://github.com/AotY/Pytorch-NMT)
* Slides: [https://slideplayer.com/slide/7710523/](https://slideplayer.com/slide/7710523/)