---
title: Sequence to Sequence Learning with Neural Networks
date: 2020-04-22 15:12:00
comments: true
author: Darcy
categories:
- nlp study group
tags:
- NLP
---

###### tags: `study` `paper` `DSMI lab`

paper: [Sequence to Sequence Learning with Neural Networks](https://arxiv.org/abs/1409.3215)

# Abstract and Introduction
* DNNs cannot be used to map sequences to sequences directly, because they only work when the dimensionality of the input/output is fixed and known
* Task: English-to-French translation on the WMT’14 dataset
* Advantages of the proposed method:
    * Minimal assumptions on the sequence structure: it can handle sequences of any structure
    * Sensitive to word order
    * Does well on long sentences

<!-- more -->

# The model
* Goal: given an input sentence $(x_1,x_2,...,x_T)$ and its corresponding output sentence $(y_1, y_2, ...,y_{T'})$ (where $T$ need not equal $T'$), estimate $p(y_1, y_2, ...,y_{T'}|x_1,x_2,...,x_T)$
* $p(y_1, y_2, ...,y_{T'}|x_1,x_2,...,x_T)=\prod_{t=1}^{T'}p(y_{t}|v,y_1,...,y_{t-1})$, where $v$ is the fixed-dimensional representation of the input sequence produced by the encoder LSTM, and each distribution $p(y_{t}|v,y_1,...,y_{t-1})$ is represented with a softmax over all the words in the vocabulary (a minimal code sketch of this encoder-decoder setup is given at the end of this post)
* Feeding the input sequence in reversed order works better

![](https://i.imgur.com/kqvV9r8.png)

# Experiment
* Dataset: WMT’14 English to French dataset
    * 12M sentences
    * Vocabulary: the 160,000 most frequent English words and the 80,000 most frequent French words
    * Words that do not appear in the vocabulary are replaced with "UNK" (see the preprocessing sketch at the end of this post)
    * Evaluation: the model is also used to rescore the 1000-best lists generated by the baseline SMT system
* Objective: maximize the log probability of a correct translation $T$ given the source sentence $S$; at test time, output $\hat{T}=\mathop{argmax}\limits_{T}p(T|S)$
* Left-to-right beam search (see the beam-search sketch at the end of this post)
* Reversing the source sentence:
    * Improves the performance, but the authors do not have a complete explanation XDD
    * After the input sentence is reversed, its first few words end up closer to the first few words of the output sentence, which helps the model generate the beginning of the output more accurately, and the rest of the generation then also tends to be more accurate (something like "a good start is half the battle"?)
* Training details
    * 1000-dimensional word embeddings (the paper does not say how the word embeddings are obtained)
    * LSTM: 4 layers, 1000 cells in each layer
    * Parameters initialized from a uniform distribution between -0.08 and 0.08
    * Parallelization: 8 GPUs in total, about 10 days of training
* Experimental results: the best result is obtained by an ensemble of LSTMs trained from different random initializations
![](https://i.imgur.com/egkt2co.png)
* Model analysis:
    * It can distinguish sentences that use the same words in a different order, as well as sentences that express the same meaning with different words
![](https://i.imgur.com/2CgPoaY.png)
    * Performance stays good on long sentences (left plot: the x-axis is sentence length)
    * Performance also holds up when a sentence contains many rare words (right plot: the x-axis is the average rank, in the whole vocabulary, of the frequencies of the words in the sentence)
![](https://i.imgur.com/lZrQPq4.png)

# Supplement
* SMT system: Statistical Machine Translation
    * Builds a statistical translation model by statistical analysis of a large parallel corpus
    * Does not rely on grammar rules, so it generalizes easily to translation between different language pairs
    * Word-based translation: translates word by word
    * Phrase-based translation: groups words into phrases and translates them as units when appropriate
    * Syntax-based translation: uses syntactic analysis (e.g. a parse tree) as the basis for translation
    * Hierarchical phrase-based translation: a combination of phrase-based and syntax-based
* Beam search: algorithm details are [here](https://zhuanlan.zhihu.com/p/36029811?group_id=972420376412762112)
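
# Code sketches

To make the model section more concrete, here is a minimal sketch (not the authors' code) of the encoder-decoder LSTM in PyTorch, using the hyperparameters reported in the paper (4 layers, 1000 units, 1000-dimensional embeddings, uniform initialization in [-0.08, 0.08]). The class and variable names, the `batch_first` tensor layout, and the tiny sizes in the demo at the bottom are my own assumptions.

```python
# A minimal sketch, not the authors' code: encoder-decoder LSTM with the
# hyperparameters reported in the paper. Names and layout are assumptions.
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab=160000, tgt_vocab=80000,
                 emb_dim=1000, hidden=1000, layers=4):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb_dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb_dim)
        # Encoder reads the (reversed) source sentence; its final states act
        # as the fixed-dimensional representation v of the input.
        self.encoder = nn.LSTM(emb_dim, hidden, num_layers=layers, batch_first=True)
        # Decoder is conditioned on v through its initial states and predicts
        # one target word at a time.
        self.decoder = nn.LSTM(emb_dim, hidden, num_layers=layers, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)  # logits for the softmax over the vocabulary
        # Uniform initialization in [-0.08, 0.08], as in the paper.
        for p in self.parameters():
            nn.init.uniform_(p, -0.08, 0.08)

    def forward(self, src, tgt_in):
        # src: (batch, T) source word ids (already reversed)
        # tgt_in: (batch, T') target word ids shifted right (starting with <BOS>)
        _, v = self.encoder(self.src_emb(src))           # v = (h_n, c_n)
        dec_out, _ = self.decoder(self.tgt_emb(tgt_in), v)
        return self.out(dec_out)                         # (batch, T', tgt_vocab)

# Training maximizes log p(y_1..y_T' | x_1..x_T), i.e. minimizes the
# cross-entropy of each p(y_t | v, y_1..y_{t-1}) against the reference word.
model = Seq2Seq(src_vocab=100, tgt_vocab=100, emb_dim=32, hidden=32, layers=2)  # toy sizes
src = torch.randint(0, 100, (2, 7))
tgt_in = torch.randint(0, 100, (2, 5))
tgt_out = torch.randint(0, 100, (2, 5))
logits = model(src, tgt_in)
loss = nn.functional.cross_entropy(logits.reshape(-1, 100), tgt_out.reshape(-1))
```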
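
The vocabulary truncation, "UNK" replacement, and source-sentence reversal mentioned in the experiment section could look roughly like this; the helper names and the toy corpus are made up for illustration.

```python
# A minimal preprocessing sketch (assumed helper names): keep the k most
# frequent words, map everything else to "UNK", and reverse source sentences
# before feeding them to the encoder.
from collections import Counter

def build_vocab(tokenized_sentences, k):
    counts = Counter(w for sent in tokenized_sentences for w in sent)
    words = [w for w, _ in counts.most_common(k)]
    return {w: i for i, w in enumerate(["UNK"] + words)}

def encode(sentence, vocab, reverse=False):
    ids = [vocab.get(w, vocab["UNK"]) for w in sentence]
    return ids[::-1] if reverse else ids  # the paper reverses source sentences

corpus = [["the", "cat", "sat"], ["the", "dog", "ran"]]
src_vocab = build_vocab(corpus, k=160000)  # paper: 160k English source words
print(encode(["the", "platypus", "sat"], src_vocab, reverse=True))
```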
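
Finally, a sketch of the left-to-right beam search used to approximate $\hat{T}=\mathop{argmax}\limits_{T}p(T|S)$. The `step_log_probs` callback is an assumed stand-in for the decoder's per-step log-softmax $p(y_t|v,y_1,...,y_{t-1})$; the paper notes that even a beam size of 2 already gives most of the benefit.

```python
# A minimal left-to-right beam search sketch. step_log_probs(prefix) is an
# assumed stand-in that returns {word_id: log p(word | v, prefix)}.
import math

def beam_search(step_log_probs, bos, eos, beam_size=12, max_len=50):
    beams = [(0.0, [bos])]        # each hypothesis: (log probability, word ids)
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, prefix in beams:
            for word, logp in step_log_probs(prefix).items():
                candidates.append((score + logp, prefix + [word]))
        # Keep only the beam_size most probable partial translations.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, prefix in candidates[:beam_size]:
            (finished if prefix[-1] == eos else beams).append((score, prefix))
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[0])[1]

# Toy stand-in for the decoder: always prefers word 1; word 0 plays <EOS>.
def toy_step(prefix):
    return {1: math.log(0.6), 2: math.log(0.3), 0: math.log(0.1)}

print(beam_search(toy_step, bos=3, eos=0, beam_size=2, max_len=5))
```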