---
# System prepended metadata

title: XLNet Generalized Autoregressive Pretraining
tags: [NLP]

---

# 6/29 Paper #2
## XLNet: Generalized Autoregressive Pretraining for Language Understanding

## Prerequisites
### Transformer [Transformer overview (Zhihu)](https://zhuanlan.zhihu.com/p/54356280)
### BERT [BERT overview (Zhihu)](https://zhuanlan.zhihu.com/p/46652512)
### Transformer-XL (optional) [paper link](https://openreview.net/forum?id=HJePno0cYm)
### [xlnet github](https://github.com/zihangdai/xlnet)
### Abstract:
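The authors report that XLNet outperforms BERT on 20 NLP tasks and reaches state-of-the-art results on 18 of them, and that the model can capture dependency information that BERT cannot. Since last October, BERT has been the language model everyone reaches for, because its embeddings beat everything else on many NLP tasks; it is arguably the ResNet of NLP. Now the XLNet paper beats BERT. The main new idea in the paper is permutation (of the factorization order). It is combined with two ideas from Transformer-XL: the segment recurrence method, which lets a sequence see information from further back, and relative positional encoding. Each of these terms is introduced below.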

### Overview

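AR (autoregressive) and AE (autoencoding) are the two approaches behind today's best-known language models.
1. AR. Well-known models: **ELMo**, **GPT**
Take a sequence seq = (x1, x2, x3, x4, x5).
At every time step during training, the model predicts the next token. For example, reading the sentence from the left, after consuming x1 the output should maximize the probability of the next token x2; in other words, training maximizes the conditional probability of x2 given x1. The goal is for the model to estimate the probability distribution of the text corpus. The upside is that it captures the dependencies between words. The downsides are that it is hard and slow to train, and that information from far back in the sequence is hard for the model to capture. This is the AR class of language model.

2. AE (autoencoding). Well-known model: **BERT**
During training, 15% of the tokens in the corpus are automatically replaced with [MASK], and the model is asked to reconstruct the original text. This is the AE class of language model. It also carries an assumption: the masked tokens are treated as independent of each other given the unmasked context. (It is easy and efficient to train.)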
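
For reference, the two objectives described above can be written as follows, using the notation of the XLNet paper. Here $\hat{\mathbf{x}}$ is the corrupted (masked) input, $\bar{\mathbf{x}}$ is the set of masked tokens, and $m_t = 1$ when $x_t$ is masked. The $\approx$ in the second line is where the independence assumption enters.

$$
\begin{aligned}
\text{AR:}\quad & \max_\theta \; \log p_\theta(\mathbf{x}) = \sum_{t=1}^{T} \log p_\theta(x_t \mid \mathbf{x}_{<t}) \\
\text{AE (BERT):}\quad & \max_\theta \; \log p_\theta(\bar{\mathbf{x}} \mid \hat{\mathbf{x}}) \approx \sum_{t=1}^{T} m_t \log p_\theta(x_t \mid \hat{\mathbf{x}})
\end{aligned}
$$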

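So what about XLNet?
 **AR!!** But it also keeps the advantages of AE models.

1. It maximizes the expected log likelihood over all possible **permutations of the factorization order** (what "order" means here is explained later).


2. Because it does not use masking, it can also capture the dependencies among the tokens being predicted (the ones BERT would mask), as well as between those tokens and the context!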
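
Written out in the paper's notation, the permutation language modeling objective looks like this, where $\mathcal{Z}_T$ is the set of all permutations of the index sequence $[1, 2, \dots, T]$, and $z_t$ and $\mathbf{z}_{<t}$ are the $t$-th element and the first $t-1$ elements of a permutation $\mathbf{z}$:

$$
\max_\theta \; \mathbb{E}_{\mathbf{z} \sim \mathcal{Z}_T} \left[ \sum_{t=1}^{T} \log p_\theta\left(x_{z_t} \mid \mathbf{x}_{\mathbf{z}_{<t}}\right) \right]
$$

Only the factorization order is permuted; the tokens keep their original positions in the input.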

### Background: 

**ELMo, GPT**
![](https://i.imgur.com/pPH1fbA.png)


**BERT**
![](https://i.imgur.com/NLoNF9b.png)


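Pros and Cons: 
1. Independence assumption -> weakness of BERT (masked tokens are predicted independently of each other)
2. Input noise -> weakness of BERT ([MASK] appears during pretraining but never during finetuning)
3. Context dependency -> weakness of ELMo and GPT (an AR model conditions on context from one direction only)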
 
### Methods:

1. permutations of the factorization order

![](https://i.imgur.com/61wRicu.png)

![](https://i.imgur.com/UW86Ujg.png)
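
To make "factorization order" concrete, here is a minimal sketch that lists, for a given order, which tokens are visible when each target is predicted. The function name `factorization_contexts` and the 0-based indexing are assumptions made for this illustration; they do not come from the paper or the official code.

```python
def factorization_contexts(tokens, order):
    """For each step t, return (target token, tokens visible when predicting it).

    order[t] is the 0-based position predicted at step t. The visible tokens
    are the ones at positions that appear earlier in the order.
    """
    steps = []
    for t, pos in enumerate(order):
        visible = [tokens[p] for p in sorted(order[:t])]
        steps.append((tokens[pos], visible))
    return steps


tokens = ["x1", "x2", "x3", "x4"]
order = [2, 1, 3, 0]  # the order 3 -> 2 -> 4 -> 1 in 1-based indexing

for target, visible in factorization_contexts(tokens, order):
    print(target, "<-", visible)
```

With the order 3 -> 2 -> 4 -> 1, x3 is predicted with no visible tokens, while x1 is predicted with x2, x3 and x4 all visible. Averaged over many sampled orders, every token gets to see context on both sides.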


2. Two-Stream Self-Attention for Target-Aware Representation 

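Motivation: with the standard parameterization, the output cannot change with the target position, so the model does not know which position it is predicting.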

Original 
![](https://i.imgur.com/dAGyOsX.png)

After modification
![](https://i.imgur.com/UpFR2YK.png)

------

Partial Prediction 
![](https://i.imgur.com/saRD1Wp.png)
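
As a rough sketch of partial prediction: the paper splits a factorization order at a cutting point c and only predicts the tokens after it, with a hyperparameter K chosen so that about 1/K of the tokens are targets. The helper name `split_targets` and the rounding rule below are assumptions made for this illustration.

```python
def split_targets(order, K):
    """Split a factorization order into (non-target part, target part).

    Roughly len(order) / K positions at the end of the order become
    prediction targets; the rest only serve as context.
    """
    T = len(order)
    num_targets = max(1, round(T / K))  # assumption: simple rounding
    c = T - num_targets                 # cutting point
    return order[:c], order[c:]


context_part, target_part = split_targets([2, 1, 3, 0, 5, 4], K=3)
print(context_part, target_part)
```

The last positions in the order are the ones with the longest context, which is why they are chosen as targets.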

------

There are two kinds of representation:

![](https://i.imgur.com/z3GlWy1.png)

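Query representation (b): sees the context and the target position, but not the content of the target token.

Content representation (a): sees the context and the target token itself, like a standard Transformer hidden state.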

![](https://i.imgur.com/7G2hfzV.png)
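
The difference between the two streams comes down to their attention masks. The sketch below builds both masks from a factorization order with NumPy. It is a simplified illustration, not the official implementation, and the name `two_stream_masks` is an assumption.

```python
import numpy as np


def two_stream_masks(order):
    """Return (content_mask, query_mask) as boolean (T, T) arrays.

    mask[i, j] is True when position i may attend to position j.
    order[t] is the 0-based position predicted at step t.
    """
    order = np.asarray(order)
    T = len(order)
    rank = np.empty(T, dtype=int)
    rank[order] = np.arange(T)  # rank[i] = step at which position i is predicted

    # Query stream: only positions strictly earlier in the order (no self).
    query_mask = rank[None, :] < rank[:, None]
    # Content stream: the same positions, plus the token itself.
    content_mask = query_mask | np.eye(T, dtype=bool)
    return content_mask, query_mask


content_mask, query_mask = two_stream_masks([2, 1, 3, 0])
print(query_mask.astype(int))
print(content_mask.astype(int))
```

The two masks differ only on the diagonal: the content stream can see its own token, the query stream cannot.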


3. Integration with segment recurrence and relative positional encoding

Segment recurrence & relative positional encoding (both taken from Transformer-XL)

![](https://i.imgur.com/3khQAgs.png)
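
A minimal single-head NumPy sketch of the segment recurrence idea: queries come from the current segment only, while keys and values come from the cached memory of the previous segment concatenated with the current segment. The function name and weight matrices are hypothetical, and the sketch leaves out the permutation mask, relative positional encoding, multiple heads, and the fact that no gradient flows into the memory.

```python
import numpy as np


def attend_with_memory(h, mem, Wq, Wk, Wv):
    """Single-head attention over [memory; current segment].

    h:   (T, d) hidden states of the current segment
    mem: (M, d) cached hidden states of the previous segment
    Returns the (T, d_v) outputs and the (T, M + T) attention weights.
    """
    context = np.concatenate([mem, h], axis=0)  # (M + T, d)
    q = h @ Wq                                  # queries: current segment only
    k = context @ Wk                            # keys: memory + current segment
    v = context @ Wv                            # values: memory + current segment

    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


rng = np.random.default_rng(0)
h = rng.normal(size=(3, 4))    # current segment, T = 3
mem = rng.normal(size=(2, 4))  # cached memory, M = 2
W = np.eye(4)
out, weights = attend_with_memory(h, mem, W, W, W)
print(out.shape, weights.shape)
```

Because the keys include the memory, each position in the current segment can attend to tokens from the previous segment without recomputing them.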


4. Modeling Multiple Segments

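Relative segment encoding: instead of BERT's absolute segment embeddings, the model only encodes whether two positions belong to the same segment.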
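
In the paper, the segment term added to the attention score between positions i and j is (q_i + b)^T s_ij, where s_ij is one learnable vector (s+) when i and j are in the same segment and another (s-) otherwise. The sketch below computes that term with NumPy. The function and argument names are assumptions for illustration.

```python
import numpy as np


def relative_segment_bias(q, seg_ids, s_same, s_diff, b):
    """Return a (T, T) matrix of segment attention terms.

    q:       (T, d) query vectors
    seg_ids: (T,)   segment id of each position
    s_same:  (d,)   encoding used when i and j share a segment
    s_diff:  (d,)   encoding used when they do not
    b:       (d,)   bias vector added to every query
    """
    same = seg_ids[:, None] == seg_ids[None, :]  # (T, T) same-segment indicator
    qb = q + b
    bias_same = qb @ s_same                      # (T,) value if same segment
    bias_diff = qb @ s_diff                      # (T,) value if different segment
    return np.where(same, bias_same[:, None], bias_diff[:, None])


q = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
seg_ids = np.array([0, 0, 1])
bias = relative_segment_bias(
    q, seg_ids,
    s_same=np.array([1.0, 0.0]),
    s_diff=np.array([0.0, 2.0]),
    b=np.zeros(2),
)
print(bias)
```

Only the same/different relation matters, so the actual segment ids can be swapped without changing the result.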


### Experiment, Result:

1. Comparison with BERT
    1.
    ![](https://i.imgur.com/XSIZyKB.png)
    
    2. Define I = {(x = York, U = {New}), (x = York, U = {city}), (x = York, U = {New, city})}
    
    ![](https://i.imgur.com/wDIcUd4.png)

    
2. Comparison with ELMo, GPT

    ELMo and GPT (left-to-right AR models) cover
    (x = York, U = {New}) but not (x = New, U = {York}).
    
    XLNet can capture both dependencies, in expectation over all factorization orders.

3. Pretraining and Implementation

4. Ablation study

![](https://i.imgur.com/ggj7Uiu.png)
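
Returning to the comparison with ELMo and GPT, the coverage argument can be checked with a small enumeration. The sketch below collects the (target, context token) pairs that a set of factorization orders can train on. It is a simplification in that each context U is a single token, and the name `covered_pairs` is an assumption.

```python
from itertools import permutations


def covered_pairs(tokens, orders):
    """Collect (target, context token) pairs seen under the given orders.

    A pair (x, u) is covered when u comes before x in some factorization
    order, so that u is visible while x is being predicted.
    """
    pairs = set()
    for order in orders:
        for t, pos in enumerate(order):
            for ctx in order[:t]:
                pairs.add((tokens[pos], tokens[ctx]))
    return pairs


tokens = ["New", "York"]
left_to_right = covered_pairs(tokens, [(0, 1)])              # GPT-style order
all_orders = covered_pairs(tokens, permutations(range(2)))   # XLNet-style

print(left_to_right)
print(all_orders)
```

The fixed left-to-right order only covers (York, New), while the set of all orders also covers (New, York).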