---
title: Sequence to Sequence Learning with Neural Networks
date: 2020-04-22 15:12:00
comments: true
author: Darcy
categories:
- nlp study group
tags:
- NLP
---
###### tags: `study` `paper` `DSMI lab`
paper: [Sequence to Sequence Learning with Neural Networks](https://arxiv.org/abs/1409.3215)
# Abstract and Introduction
* DNNs cannot be used to map sequences to sequences because they only work when the dimensionality of the input/output is fixed and known
* Task: English-to-French translation from the WMT’14 dataset
* Advantages of the proposed method:
* Minimal assumptions on the sequence structure: it can handle sequences of almost any structure
* Sensitive to word order
* Does well on long sentences
<!-- more -->
# The model
* Goal: given an input sentence $(x_1,x_2,...,x_T)$ and its corresponding output sentence $(y_1, y_2, ...,y_{T'})$ (where $T$ need not equal $T'$), estimate $p(y_1, y_2, ...,y_{T'}|x_1,x_2,...,x_T)$
* $p(y_1, y_2, ...,y_{T'}|x_1,x_2,...,x_T)=\prod_{t=1}^{T'}p(y_{t}|v,y_1,...,y_{t-1})$, where $v$ is the fixed-dimensional representation of the input sequence produced by the encoder LSTM, and each distribution $p(y_{t}|v,y_1,...,y_{t-1})$ is represented with a softmax over all the words in the vocabulary
* Feeding the input sequence in reverse order gives better results (a minimal encoder-decoder sketch follows this list)
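
Below is a minimal sketch of the encoder-decoder factorization above, assuming PyTorch; the class name `Seq2Seq`, the toy dimensions, and the random inputs are all illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Toy encoder-decoder: the encoder LSTM compresses the source sequence into a
    fixed-dimensional state v; the decoder LSTM models
    p(y_1..y_T' | x_1..x_T) = prod_t p(y_t | v, y_1..y_{t-1})
    with a softmax over the target vocabulary at every step."""

    def __init__(self, src_vocab, tgt_vocab, emb_dim=256, hidden=512, layers=2):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb_dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden, layers, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, hidden, layers, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)   # logits; softmax over the vocabulary

    def forward(self, src, tgt_in):
        # src: (batch, T) source token ids; tgt_in: (batch, T') shifted target ids
        _, state = self.encoder(self.src_emb(src))          # state acts as v
        dec_out, _ = self.decoder(self.tgt_emb(tgt_in), state)
        return self.out(dec_out)                            # (batch, T', tgt_vocab)

# toy usage: cross-entropy on these logits maximizes sum_t log p(y_t | v, y_<t)
model = Seq2Seq(src_vocab=100, tgt_vocab=120)
logits = model(torch.randint(0, 100, (2, 7)), torch.randint(0, 120, (2, 5)))
print(logits.shape)   # torch.Size([2, 5, 120])
```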

# Experiment
* Dataset: WMT’14 English to French dataset
* 12M sentences
* Vocabulary: 160,000 most frequently used English words and 80,000 most frequently used French words
* Words that do not appear in the vocabulary are replaced with "UNK"
* For evaluation, the LSTM was also used to rescore the 1000-best lists generated by the baseline SMT system
* Training objective: maximize the log probability of a correct translation $T$ given the source sentence $S$; at test time, produce the most likely translation $\hat{T}=\mathop{\arg\max}\limits_{T}p(T|S)$
* Decoding: a simple left-to-right beam search
* Reverse the source sentence:
* Improves performance, but the authors do not have a complete explanation XDD
* When the input sentence is reversed, its first few words end up much closer to the first few words of the output sentence, which helps the model generate the beginning of the output more accurately, and the later words then also come out more accurately (something like "a good start is half the battle"?)
* Training details (a configuration sketch follows this list)
* 1000-dimensional word embeddings (the paper does not say how the word embeddings are obtained)
* LSTM: 4 layers, 1000 cells in each layer
* Parameters initialized from a uniform distribution between -0.08 and 0.08
* Parallelization: 8 GPUs in total, about 10 days of training
* Experimental Results
* The best results were obtained by ensembling LSTMs trained from different random initializations
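
A rough, self-contained sketch of the training configuration listed above (4-layer LSTMs with 1000 cells, 1000-dimensional embeddings, uniform initialization in [-0.08, 0.08], UNK replacement, source reversal), again assuming PyTorch; the helper `preprocess` and all names are illustrative, not the authors' code.

```python
import torch.nn as nn

# hyperparameters reported in the paper (instantiating these uses a few GB of RAM)
SRC_VOCAB, TGT_VOCAB = 160_000, 80_000     # most frequent English / French words
EMB_DIM, HIDDEN, LAYERS = 1000, 1000, 4

src_emb = nn.Embedding(SRC_VOCAB, EMB_DIM)
tgt_emb = nn.Embedding(TGT_VOCAB, EMB_DIM)
encoder = nn.LSTM(EMB_DIM, HIDDEN, LAYERS)
decoder = nn.LSTM(EMB_DIM, HIDDEN, LAYERS)
out_proj = nn.Linear(HIDDEN, TGT_VOCAB)

# initialize every parameter from a uniform distribution in [-0.08, 0.08]
for module in (src_emb, tgt_emb, encoder, decoder, out_proj):
    for p in module.parameters():
        nn.init.uniform_(p, -0.08, 0.08)

def preprocess(src_tokens, vocab):
    """Replace out-of-vocabulary words with 'UNK' and reverse the source sentence."""
    ids = [vocab.get(w, vocab["UNK"]) for w in src_tokens]
    return ids[::-1]
```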

* Model analysis:
* It can distinguish sentences that use the same words but in a different order, and it recognizes sentences with the same meaning expressed with different words (see the sketch below)
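
The paper visualizes the learned sentence representations with a 2-D PCA projection of the LSTM hidden states; the sketch below is a simpler, hypothetical probe of the same idea: treat the encoder's final hidden state as the sentence vector $v$ and compare sentences by cosine similarity (with a trained encoder, reordering the words should move $v$, while paraphrases should stay close). The dimensions and token ids are made up.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

emb = nn.Embedding(1000, 64)                     # toy sizes, untrained weights
encoder = nn.LSTM(64, 128, batch_first=True)

def sentence_vector(token_ids):
    """Encode a sentence and return the encoder's final hidden state as v."""
    x = emb(torch.tensor([token_ids]))           # (1, T, 64)
    _, (h, _) = encoder(x)                       # h: (num_layers, 1, 128)
    return h[-1, 0]

# same words, different order -> different meaning, so v should differ after training
a = sentence_vector([5, 8, 13, 2])               # e.g. "John admires Mary"
b = sentence_vector([13, 8, 5, 2])               # e.g. "Mary admires John"
print(F.cosine_similarity(a, b, dim=0).item())
```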

* Performance remains good on long sentences (left plot: the x-axis is sentence length)
* Performance also holds up when a sentence contains many rare words (right plot: the x-axis is the average rank, by frequency in the vocabulary, of the words in the sentence)

# Supplement
* SMT system: Statistical Machine Translation
* Builds a statistical translation model through statistical analysis of large parallel corpora
* Does not rely on hand-written grammar rules, so it is easy to extend to translation between different languages
* Word-based translation: translates one word at a time
* Phrase-based translation: groups several words into phrases and translates them as units where appropriate
* Syntax-based translation: uses syntactic analysis (e.g., parse trees) as the basis for translation
* Hierarchical phrase-based translation: a combination of phrase-based and syntax-based translation
* Beam search: algorithm details are [here](https://zhuanlan.zhihu.com/p/36029811?group_id=972420376412762112); a minimal sketch follows below
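
A minimal left-to-right beam search sketch to go with the link above. It only assumes a `log_prob_fn(prefix)` that returns the log-probabilities of the next token given the tokens generated so far; the function names and the toy scorer are hypothetical, not from the paper.

```python
import math

def beam_search(log_prob_fn, vocab_size, eos_id, beam_size=3, max_len=20):
    """Keep the beam_size best partial hypotheses and extend them left to right;
    a hypothesis is moved to 'finished' once it emits the EOS token."""
    beams = [([], 0.0)]                           # (token list, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            log_probs = log_prob_fn(tokens)       # log p(next token | prefix)
            for tok in range(vocab_size):
                candidates.append((tokens + [tok], score + log_probs[tok]))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            (finished if tokens[-1] == eos_id else beams).append((tokens, score))
        if not beams:                             # every surviving hypothesis ended
            break
    return max(finished + beams, key=lambda c: c[1])

# toy next-token scorer: a fixed distribution that always gives EOS (id 0) prob 0.4
def toy_log_prob_fn(prefix):
    probs = [0.4] + [0.15] * 4
    return [math.log(p) for p in probs]

print(beam_search(toy_log_prob_fn, vocab_size=5, eos_id=0))
```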