
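# Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation

###### tags: `paper notes` `deep learning`

[paper link](https://www.aclweb.org/anthology/D14-1179.pdf)

---

## Intro

* RNN Encoder-Decoder
    * The Encoder maps a variable-length sequence to a fixed-length vector
    * The Decoder maps that vector back to a variable-length target sequence
    * The two networks are trained jointly, and the shared objective is to maximize the conditional probability of the target sequence given the input sequence
* Proposes a sophisticated hidden unit to improve memory capacity and ease of training
* The task in this paper is English-to-French translation
* Evaluation: the phrase scores computed by the model are added to an existing phrase-based SMT system and compared with its baseline

## RNN Encoder-Decoder

* Review of RNN: $h(t) = f(h(t-1), x_t)$
    * An RNN learns to predict which symbol is likely to come next in the sequence, given the inputs seen so far
    * Example of 1-of-K coding
    ![](https://i.imgur.com/iMhUmcK.jpg)
    * $p(x)$ is the sequence probability to be learned; it is the product of the conditional probabilities of $x_t$ over all time steps
    ![](https://i.imgur.com/Co6G4my.jpg)
* The Encoder reads each symbol of the input sequence $x$ until it reaches the EOS (end of sequence) symbol; its final hidden state is the summary $c$ of the whole input sequence
* The Decoder is trained to generate the output sequence by predicting the next symbol $y_t$
    * The Decoder's computation differs from the plain RNN above
        * $h(t) = f(h(t-1), y_{t-1}, c)$
        * $y$ is the Decoder's output and is fed into the next time step
    * To turn the hidden state into a distribution over the next symbol, an extra function $g$ (usually softmax) is applied
        * So $P(y_t|y_{t-1},y_{t-2},...,y_1,c) = g(h(t), y_{t-1}, c)$
    ![](https://i.imgur.com/XWpXlgp.jpg)
* The two RNNs are trained jointly to maximize the conditional log-likelihood
    ![](https://i.imgur.com/LutFt7l.jpg)
    * $\theta$ is the set of model parameters
    * $(x_n, y_n)$ is an (input sequence, output sequence) pair
* Once trained, the model can be used in two ways:
    1. Generate a target sequence given an input sequence
    2. Score a given pair of input and output sequences
        * Score = $p_{\theta}(y|x)$

## Hidden Unit that Adaptively Remembers and Forgets (GRU)

* This section describes the hidden unit that later became known as the GRU
* The unit is motivated by the LSTM but is simpler to compute and implement
    * LSTM memory cell (forget gate & input gate) -> GRU update gate
    ![](https://i.imgur.com/3SGtxpE.jpg)
* **Reset Gate**: $r_j = \sigma\left( [W_r x]_j + [U_r h^{t-1}]_j \right)$
    * $\sigma$ is the sigmoid function
    * $[\cdot]_j$ denotes the $j$-th element of a vector
    * $x$ and $h^{t-1}$ are the input and the previous hidden state
    * $W$ and $U$ are the weight matrices to be learned
* **Update Gate** has the same form as the reset gate: $z_j = \sigma\left( [W_z x]_j + [U_z h^{t-1}]_j \right)$
    * $h^{t}_j = z_j h^{t-1}_j + (1 - z_j)\tilde{h}^{t}_j$
    * The candidate state is $\tilde{h}^{t}_j = \phi\left( [W x]_j + [U (r \odot h^{t-1})]_j \right)$
* Combined:
    ![](https://i.imgur.com/xo0Q1bo.jpg)
* When the reset gate is close to 0, the previous hidden state is dropped (reset) and only the input $x$ remains
* The update gate $z$ controls how much information is carried over from the hidden state of the previous time step, similar to the LSTM memory cell
* Every hidden unit has its own reset gate and update gate, so each unit learns to capture dependencies over a different time scale
    * Units that learn short-term dependencies tend to have a frequently active reset gate
    * Units that learn long-term dependencies tend to have a mostly active update gate

## SMT (Statistical Machine Translation)

* The usual goal in SMT is to find the translation $f$ of a given source sequence $e$, that is, the $f$ that maximizes $p(f|e)$
    ![](https://i.imgur.com/K15FSIR.jpg)
    In practice, $\log p(f|e)$ is usually modeled with a set of additional features
    ![](https://i.imgur.com/jxb629a.jpg)
    * $f_n$ and $w_n$ are the $n$-th feature and its weight
    * $Z(e)$ is a normalization constant that does not depend on the weights (loosely, it can be thought of as a bias)
* Many works have used neural networks to rescore translation hypotheses
* Using a representation of the source sentence as an extra input when scoring the translated sentence is also an interesting direction

## Scoring Phrase Pairs with RNN Encoder–Decoder

* When training the RNN Encoder-Decoder, they ignore the frequency of each phrase pair in the original corpus
    * This reduces the computational cost of randomly sampling phrase pairs from the corpus according to their normalized frequencies, and it ensures that the RNN Encoder-Decoder does not learn to rank phrase pairs by frequency alone
    * Another reason is that the translation probability in the phrase table already reflects the frequencies of the phrase pairs in the original corpus
* They want to make sure the model really learns linguistic regularities
    * That is, it should be able to tell plausible translations from implausible ones
    * Or, in their words, learn the "manifold" (region of probability concentration) of plausible translations
        * (I did not really follow the manifold part)
* Once the model is trained, it assigns a new score to every phrase pair in the **existing phrase table**
    * Scoring only the existing table also keeps the computation down
    * (Schwenk, 2012) pointed out that the existing phrase table could eventually be replaced entirely; in that case the RNN Encoder-Decoder would have to generate a list of good target phrases, which requires an expensive sampling procedure
    * This paper only considers **rescoring** the phrase pairs in the phrase table

### Related Approaches: Neural Networks in Machine Translation

* This section summarizes the important work so far on neural networks in MT
* (Schwenk, 2012) trains a feedforward NN to score phrase pairs for an SMT system; both input and output have a fixed size
* (Devlin et al., 2014) is similar to Schwenk and also uses a feedforward NN, but predicts one word at a time; it is an improvement, but the input phrase length still has to be fixed
* (Chandar et al., 2014) also uses a feedforward NN, mapping a bag-of-words representation of an input phrase to an output phrase
    * This is already close to the paper's approach, apart from the bag-of-words representation
* (Socher et al., 2011) used an Encoder-Decoder model built from two RNNs, but it was restricted to a monolingual setting (the model reconstructs the input sequence)
* (Auli et al., 2013): their Decoder is conditioned on a representation of either a source sentence or a source context
* The closest paper is [Recurrent Continuous Translation Models by Kalchbrenner and Blunsom, 2013](https://www.aclweb.org/anthology/D13-1176.pdf)
    * The difference is that the Encoder in that paper is a convolutional n-gram model (CGM)
    * Its Decoder is a hybrid of an inverse CGM and an RNN

## RNN Encoder–Decoder Experiments

* The model used here has 1000 hidden units
    ![](https://i.imgur.com/qYtQMF1.jpg)

### Neural Language Model

* They also train a neural language model (CSLM) on the target language as a more traditional point of comparison
    ![](https://i.imgur.com/GOniAPw.jpg)

### Qualitative Analysis

* The goal is to understand where the performance improvement actually comes from
* A traditional SMT system only looks for statistical patterns; they expect their model to score better than SMT in most cases and worse only on a small number of phrases
    * This also ties back to the frequencies that were deliberately ignored during training
* They visualize the words and phrases at several stages; there are too many figures to include here

### Word and Phrase Representations

* They visualize the word and phrase representations learned by the model, citing (Mikolov et al., 2013) for the observation that neural language models learn semantically meaningful embeddings

## Conclusion

* An RNN Encoder-Decoder that maps a sequence of arbitrary length to another sequence (possibly from a different set) of arbitrary length
* GRU (reset gate & update gate)
* Evaluated on an SMT task
* In terms of BLEU score, the RNN Encoder-Decoder does improve overall performance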
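
As a recap of the gated hidden unit from the GRU section, here is a minimal pure-Python sketch of one update step and of an encoder loop that folds a sequence into a summary vector $c$. It is only an illustration under assumptions: the names `gru_step`, `encode`, `random_params` and the parameter keys are hypothetical, `tanh` is assumed for the candidate activation, biases are omitted, and the weights are random rather than trained.

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def matvec(m, v):
    """Multiply matrix m (a list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def gru_step(x, h_prev, p):
    """One step of the gated hidden unit.

    p is a dict of weight matrices with hypothetical keys:
    W_r, U_r (reset gate), W_z, U_z (update gate), W, U (candidate state).
    """
    # Reset gate r and update gate z: one value in (0, 1) per hidden unit.
    r = [sigmoid(a + b) for a, b in zip(matvec(p["W_r"], x), matvec(p["U_r"], h_prev))]
    z = [sigmoid(a + b) for a, b in zip(matvec(p["W_z"], x), matvec(p["U_z"], h_prev))]
    # Candidate state: the reset gate masks the previous state before U is applied,
    # so r close to 0 leaves only the current input x.
    gated = [ri * hi for ri, hi in zip(r, h_prev)]
    h_tilde = [math.tanh(a + b) for a, b in zip(matvec(p["W"], x), matvec(p["U"], gated))]
    # z close to 1 keeps the old state, z close to 0 takes the candidate.
    return [zi * hi + (1.0 - zi) * ci for zi, hi, ci in zip(z, h_prev, h_tilde)]


def random_params(n_in, n_hid, seed=0):
    """Random (untrained) weights, for illustration only."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    return {
        "W_r": mat(n_hid, n_in), "U_r": mat(n_hid, n_hid),
        "W_z": mat(n_hid, n_in), "U_z": mat(n_hid, n_hid),
        "W": mat(n_hid, n_in), "U": mat(n_hid, n_hid),
    }


def encode(xs, p, n_hid):
    """Fold a variable-length list of input vectors into one fixed-length summary."""
    h = [0.0] * n_hid
    for x in xs:
        h = gru_step(x, h, p)
    return h


params = random_params(n_in=3, n_hid=4)
short = encode([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]], params, 4)
longer = encode([[0.0, 0.0, 1.0]] * 5, params, 4)
print(len(short), len(longer))  # both summaries have the same length
```

Inputs of length 2 and 5 both end up as 4-dimensional vectors, which is the "variable length in, fixed length out" property the Encoder relies on.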
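
The log-linear model from the SMT section, with the RNN Encoder-Decoder score added as one more feature, can be sketched roughly as below. Everything concrete here is made up for illustration: the feature names, the weights, and the log-probabilities are hypothetical and do not come from the paper. $Z(e)$ is left out because it is the same for every candidate translation of one source sentence, so it does not change which candidate wins.

```python
def loglinear_score(features, weights):
    """Weighted sum of feature values: sum over n of w_n * f_n(f, e)."""
    return sum(weights[name] * value for name, value in features.items())


def best_hypothesis(hypotheses, weights):
    """Return the candidate translation with the highest log-linear score."""
    return max(hypotheses, key=lambda h: loglinear_score(h["features"], weights))


# Hypothetical feature names and made-up numbers, for illustration only.
weights = {"phrase_table_logprob": 1.0, "lm_logprob": 0.5, "rnn_encdec_logprob": 0.8}
hypotheses = [
    {
        "text": "hypothesis A",
        "features": {"phrase_table_logprob": -2.0, "lm_logprob": -4.0, "rnn_encdec_logprob": -6.0},
    },
    {
        "text": "hypothesis B",
        "features": {"phrase_table_logprob": -2.5, "lm_logprob": -4.2, "rnn_encdec_logprob": -1.0},
    },
]

# With the extra feature switched off, the ranking uses only the baseline features.
baseline_weights = dict(weights, rnn_encdec_logprob=0.0)
print(best_hypothesis(hypotheses, baseline_weights)["text"])
print(best_hypothesis(hypotheses, weights)["text"])
```

With these numbers the baseline features alone prefer one candidate and the added feature flips the choice, which is the kind of effect rescoring is meant to have.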