---
title: XLNet Generalized Autoregressive Pretraining
tags: NLP
---

# 6/29 Paper #2

## XLNet: Generalized Autoregressive Pretraining for Language Understanding

## Prerequisites

### Transformer

[Transformer overview (Zhihu)](https://zhuanlan.zhihu.com/p/54356280)

### BERT

[BERT overview (Zhihu)](https://zhuanlan.zhihu.com/p/46652512)

### Transformer-XL (optional)

[paper link](https://openreview.net/forum?id=HJePno0cYm)

### [xlnet github](https://github.com/zihangdai/xlnet)

### Abstract:

The authors report that XLNet outperforms BERT on 20 NLP tasks and reaches state-of-the-art results on 18 of them. Their model can also capture dependency information that BERT cannot.

BERT has been the go-to language model since its release in October 2018, because its embeddings clearly beat the alternatives on many NLP tasks. You could call it the ResNet of NLP. Now this paper's XLNet beats BERT.

The main new idea in the paper is **permutation**. It is combined with two pieces from Transformer-XL: the segment recurrence method, which lets a sequence see information from further back, and the relative positional embedding. Each of these terms is introduced below.

### Overview

AR (Auto Regressive) and AE (Auto Encoding) are the two approaches used by today's well-known language models.

1. AR

    Well-known models: **ELMo**, **GPT**

    Take a sequence seq = (x1, x2, x3, x4, x5). During training, the model predicts the next token at every time step. For example, reading the sentence from the left, after consuming x1 the output should maximize the probability of the next token x2. In other words, we maximize a conditional probability: given x1, the probability of outputting x2 should be as high as possible. The goal is for the model to estimate the probability distribution of the text corpus.

    The upside is that the model captures dependencies between tokens. The downsides are that it is hard and slow to train, and that the model struggles to capture information from far away in the sequence. This approach is called an AR language model.

2. AE (Auto Encoding)

    Well-known model: **BERT**

    During training, 15% of the tokens in the corpus are automatically replaced with [MASK], and the model is asked to reconstruct the original text. This approach is called an AE language model. It also assumes that the masked tokens are independent of each other given the unmasked tokens. (Easy to train and efficient.)

So what is XLNet? **AR!!** But it also keeps the advantages of AE models:

1. It maximizes the expected log likelihood over all possible **permutations of the factorization order** (what "order" means is explained under Methods).
2. Because it does not use masking, it can still model the dependencies among the tokens that BERT would have masked, and between those tokens and the context!

### Background:

**ELMo, GPT**

![](https://i.imgur.com/pPH1fbA.png)

**BERT**

![](https://i.imgur.com/NLoNF9b.png)

Pros and Cons:

1. Independence Assumption -> BERT
2. Input Noise -> BERT
3. Context dependency -> ELMo

### Methods:

1. Permutations of the factorization order

    ![](https://i.imgur.com/61wRicu.png)

    ![](https://i.imgur.com/UW86Ujg.png)

2. Two-Stream Self-Attention for Target-Aware Representation

    Motivation: with the standard parameterization, the output cannot change depending on the target position.

    Original

    ![](https://i.imgur.com/dAGyOsX.png)

    Modified

    ![](https://i.imgur.com/UpFR2YK.png)

    ------

    Partial Prediction

    ![](https://i.imgur.com/saRD1Wp.png)

    ------

    There are two kinds of representation:

    ![](https://i.imgur.com/z3GlWy1.png)

    - content representation (a): encodes the context and the token at the position itself
    - query representation (b): sees the context and the target position, but not the token at that position

    ![](https://i.imgur.com/7G2hfzV.png)

3. Integration with the segment recurrence and relative positional encoding methods

    segment recurrence & relative positional encoding

    ![](https://i.imgur.com/3khQAgs.png)

4. Modeling Multiple Segments

    relative segment encoding

### Experiment, Result:

1. Comparing with BERT

    1. ![](https://i.imgur.com/XSIZyKB.png)
    2. Define I = {(x = York, U = {New}), (x = York, U = {city}), (x = York, U = {New, city})}

        ![](https://i.imgur.com/wDIcUd4.png)

2. Comparing with ELMo, GPT

    ELMo and GPT cover (x = York, U = {New}) but not (x = New, U = {York}). XLNet can capture both dependencies.

3. Pretraining and Implementation
4. Ablation study

    ![](https://i.imgur.com/ggj7Uiu.png)
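
To make the permutation idea and the two streams from Methods a bit more concrete, here is a minimal sketch in plain Python. It is not code from the xlnet repository: the helper names `stream_masks` and `visible_context`, the example sentence, and the chosen factorization orders are all assumptions made for illustration. It only shows which positions each target may look at under a given order. The content stream may see the token at its own position, while the query stream may not.

```python
def stream_masks(order):
    """Build attention masks for one factorization order.

    order: a permutation of positions 0..n-1 (hypothetical input format).
    Returns (content_mask, query_mask); mask[i][j] is True when
    position i is allowed to attend to position j.
    """
    n = len(order)
    rank = [0] * n  # rank[pos] = step at which pos appears in the order
    for step, pos in enumerate(order):
        rank[pos] = step
    # Content stream: sees everything up to and including itself in the order.
    content_mask = [[rank[j] <= rank[i] for j in range(n)] for i in range(n)]
    # Query stream: sees only what comes strictly earlier in the order.
    query_mask = [[rank[j] < rank[i] for j in range(n)] for i in range(n)]
    return content_mask, query_mask


def visible_context(tokens, order, target):
    """List the tokens the query stream can use when predicting `target`."""
    _, query_mask = stream_masks(order)
    return [tokens[j] for j in range(len(tokens)) if query_mask[target][j]]


tokens = ["New", "York", "is", "a", "city"]

# Illustrative order: is -> a -> city -> New -> York
order = [2, 3, 4, 0, 1]
print(visible_context(tokens, order, 0))  # ['is', 'a', 'city']
print(visible_context(tokens, order, 1))  # ['New', 'is', 'a', 'city']

# Plain left-to-right order (what a standard AR model uses): "New" sees nothing.
print(visible_context(tokens, [0, 1, 2, 3, 4], 0))  # []

# An order that puts "York" before "New" covers (x = New, U = {York}).
print(visible_context(tokens, [1, 0, 2, 3, 4], 0))  # ['York']
```

With the left-to-right order, "York" can be conditioned on "New" but never the other way round, which is the ELMo/GPT limitation noted above. Sampling different orders lets both directions show up during training, while each single order still factorizes autoregressively.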