[TOC]

# RNN model and NLP summary

## Data processing

1. One-hot encoding

### Processing Text Data

1. Tokenization (text to words).
2. Count word frequencies, stored in a hash table:

   ![image-20210424092307075](https://i.loli.net/2021/04/24/jGxWkTSFAXsOZt4.png)

   - Sort the table so that the frequencies are in descending order.
   - Replace "frequency" by "index" (starting from 1).
   - The number of unique words is called the vocabulary.

   ![image-20210424092500378](https://i.loli.net/2021/04/24/hAWLJjTZefmwI8n.png)

## Text Processing and Word Embedding

### Text processing

1. Tokenization. Considerations in tokenization:
   - Upper case to lower case. ("Apple" to "apple"?)
   - Remove stop words, e.g., "the", "a", "of", etc.
   - Typo correction. ("goood" to "good".)
2. Build the dictionary.

   ![image-20210424093157449](https://i.loli.net/2021/04/24/doZwqEPGBbX2LaS.png)

3. Align sequences (so that every sequence has the same length).

### Word embedding

1. One-hot encoding
   - First, represent words using one-hot vectors.
   - Suppose the dictionary contains $v$ unique words (vocabulary $= v$).
   - Then the one-hot vectors $\mathbf{e}_{1}, \mathbf{e}_{2}, \mathbf{e}_{3}, \cdots, \mathbf{e}_{v}$ are $v$-dimensional.
2. Word embedding

   ![image-20210424102903819](https://i.loli.net/2021/04/24/a3OWxvNzKGYRlPE.png)

   **The embedding dimension `d` is chosen by the user.**

   ![image-20210424103504604](https://i.loli.net/2021/04/24/CuXl2YjnEV8JiWZ.png)

   ![image-20210424103619516](https://i.loli.net/2021/04/24/NbABKXgMHPIprTF.png)

## Recurrent Neural Networks (RNNs)

![image-20210424111328256](https://i.loli.net/2021/04/24/buk5wmHZnArjvIi.png)

1. `h0` contains the information of *the*;
   `h1` contains the information of *the cat*;
   `h2` contains the information of *the cat sat*;
   ...
   `ht` contains the information of the whole sentence.
2. The entire RNN has only one parameter matrix `A`, no matter how long the chain is.

### Simple RNN

![image-20210424111925427](https://i.loli.net/2021/04/24/JOVp3za2cHUfYmZ.png)

### Simple RNN for IMDB Review

![image-20210424112111109](https://i.loli.net/2021/04/24/QUvgNuAj7e2ID6V.png)

![image-20210424112148940](https://i.loli.net/2021/04/24/H39ZGkPtxqMpNLY.png)

![image-20210424112234197](https://i.loli.net/2021/04/24/trEBK7GgJvV53iq.png)

`return_sequences = False`: the RNN returns only the last state `ht`.

![image-20210424112404189](https://i.loli.net/2021/04/24/MZ2bPjWBg4Klzhw.png)

### Shortcomings of Simple RNN

1. Simple RNN is good at short-term dependence.
2. Simple RNN is bad at long-term dependence.

![image-20210424112706164](https://i.loli.net/2021/04/24/H7No1tCnUQPeJlh.png)

### Summary

![image-20210424112756390](https://i.loli.net/2021/04/24/5dhxs4HubAavfpt.png)

![image-20210424112830348](https://i.loli.net/2021/04/24/hOaeQRKDXI1uw7y.png)

## Long Short-Term Memory (LSTM)

### LSTM Model

![image-20210424141759826](https://i.loli.net/2021/04/24/PAr5MHWDuISZ2d6.png)

![image-20210424142317536](https://i.loli.net/2021/04/24/tBecGI2Cp4UHuzl.png)

- Conveyor belt: the past information directly flows to the future.

#### Forget gate

![image-20210424142554393](https://i.loli.net/2021/04/24/JKdxOg2LimHhqcD.png)

![image-20210424142621438](https://i.loli.net/2021/04/24/tFic2hu3OAIxwmX.png)

![image-20210424142650070](https://i.loli.net/2021/04/24/dnWZK9ema3VsoU6.png)

- Forget gate (`f`): a vector (the same shape as $\mathbf{c}$ and $\mathbf{h}$).
- A value of **zero** means "let nothing through".
- A value of **one** means "let everything through!" (A minimal sketch of the gate follows below.)
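As a concrete illustration of the bullets above, here is a minimal NumPy sketch of how the forget gate could be computed and applied to the conveyor belt. The weight names `W_f`, `b_f` and all dimensions are hypothetical placeholders, not values from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical dimensions: cell/state size h_dim, input size x_dim.
h_dim, x_dim = 4, 3
rng = np.random.default_rng(0)

W_f = rng.normal(size=(h_dim, h_dim + x_dim))  # forget-gate weight matrix (placeholder)
b_f = np.zeros(h_dim)                          # forget-gate bias (placeholder)

h_prev = rng.normal(size=h_dim)   # previous hidden state h_{t-1}
x_t    = rng.normal(size=x_dim)   # current input x_t
c_prev = rng.normal(size=h_dim)   # previous cell state c_{t-1} (the conveyor belt)

# The gate: a sigmoid squashes every entry into (0, 1).
f_t = sigmoid(W_f @ np.concatenate([h_prev, x_t]) + b_f)

# Applied elementwise to c_{t-1}: 0 = "let nothing through", 1 = "let everything through".
c_scaled = f_t * c_prev
print(f_t.round(2), c_scaled.round(2))
```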
The forget gate is applied elementwise to $\mathbf{c}_{t-1}$. Computation of `f`:

![image-20210424143033375](https://i.loli.net/2021/04/24/A76kLQuDciHlR1v.png)

#### Input gate

![image-20210424150935131](https://i.loli.net/2021/04/24/WAebZ8rG27MYlB1.png)

#### New value

![image-20210424151017314](https://i.loli.net/2021/04/24/sGE685CwmKbR1A7.png)

#### $C_{t}$

![image-20210424151141395](https://i.loli.net/2021/04/24/NDWpGqKAJe1l4is.png)

#### Output gate

![image-20210424151318393](https://i.loli.net/2021/04/24/izo8ejIrEkwa1NQ.png)

![image-20210424151401890](https://i.loli.net/2021/04/24/SUIfqh6iF9BzlX2.png)

#### LSTM: Number of parameters

![image-20210424151512818](https://i.loli.net/2021/04/24/2JWlzqBjLoDcmvM.png)

### LSTM Using Keras

![image-20210424151638923](https://i.loli.net/2021/04/24/tFhrx8CHjo16S73.png)

![image-20210424151651650](https://i.loli.net/2021/04/24/UJfeAuK4nLMvqrN.png)

### Summary

![image-20210424151922930](https://i.loli.net/2021/04/24/Mvly9XCc8qPTINJ.png)

## Making RNNs More Effective

### Stacked RNN

![image-20210424160555123](https://i.loli.net/2021/04/24/AY9xE4RL8STtNBk.png)

### Stacked LSTM

![image-20210424160728788](https://i.loli.net/2021/04/24/WNrx3VEUB2uPGA9.png)

Every LSTM layer that feeds another LSTM layer needs `return_sequences = True`, so that the upper layer receives the whole sequence of states rather than only the last one.

### Bidirectional RNN

![image-20210424161040651](https://i.loli.net/2021/04/24/HJYB8VMgQ21vycL.png)

If there is no upper RNN layer, then return $\left[\mathbf{h}_{t}, \mathbf{h}_{t}^{\prime}\right]$.

![image-20210424161152476](https://i.loli.net/2021/04/24/ZV4ufUJsbL2C9Gw.png)

### Pretraining

![image-20210424161239868](https://i.loli.net/2021/04/24/r6cWz9yNEGPMgIk.png)

**Step 1:** Train a model on a large dataset.

- Perhaps a different problem.
- Perhaps a different model.

**Step 2:** Keep only the embedding layer.

**Step 3:**

![image-20210424161503939](https://i.loli.net/2021/04/24/Z8UygmxoD4w9VCr.png)

### Summary

- SimpleRNN and LSTM are two kinds of RNNs; always use LSTM instead of SimpleRNN.
- Use Bi-RNN instead of RNN whenever possible.
- Stacked RNN may be better than a single RNN layer (if $n$ is big).
- Pretrain the embedding layer (if $n$ is small).

## Machine Translation and Seq2Seq Model

### Sequence-to-Sequence Model (Seq2Seq)

1. Tokenization & Build Dictionary
   - Use 2 different tokenizers for the 2 languages.
   - Then build 2 different dictionaries.

   ![image-20210425104618944](https://i.loli.net/2021/04/25/Vez79GnOuMtlJf5.png)

2. One-hot encoding

   ![image-20210425104709325](https://i.loli.net/2021/04/25/XRbhYmvwHeAtJSc.png)

3. Train the Seq2Seq model (a minimal Keras sketch appears at the end of this section).

   ![image-20210425104909292](https://i.loli.net/2021/04/25/i8wmTrfPRzKqSHE.png)

   ![image-20210425105357101](https://i.loli.net/2021/04/25/wJXt9r5kh1fPYSQ.png)

### Improvements

![image-20210425105528131](https://i.loli.net/2021/04/25/YUF8qKf2vxG1gWo.png)

Use Bi-LSTM in the encoder; use a unidirectional LSTM in the decoder.
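As a rough reference for the Seq2Seq training step above, here is a minimal Keras encoder–decoder sketch. The vocabulary sizes and dimensions are made-up placeholders, and a plain unidirectional LSTM encoder is used for brevity; the improvement above would replace it with a Bi-LSTM.

```python
from tensorflow import keras
from tensorflow.keras import layers

src_vocab, tgt_vocab, embed_dim, state_dim = 10000, 9000, 32, 64  # placeholder sizes

# Encoder: embeds the source sentence and keeps only the final LSTM states.
enc_inputs = keras.Input(shape=(None,))
enc_emb = layers.Embedding(src_vocab, embed_dim)(enc_inputs)
_, state_h, state_c = layers.LSTM(state_dim, return_state=True)(enc_emb)

# Decoder: initialized with the encoder's final states, predicts the next
# target token at every position (teacher forcing during training).
dec_inputs = keras.Input(shape=(None,))
dec_emb = layers.Embedding(tgt_vocab, embed_dim)(dec_inputs)
dec_seq = layers.LSTM(state_dim, return_sequences=True)(
    dec_emb, initial_state=[state_h, state_c])
dec_outputs = layers.Dense(tgt_vocab, activation="softmax")(dec_seq)

model = keras.Model([enc_inputs, dec_inputs], dec_outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```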
## Attention

**Shortcoming** of the Seq2Seq model: the final state is incapable of remembering a **long** sequence.

![image-20210425110230206](https://i.loli.net/2021/04/25/kP9OInafDB8pqVm.png)

### Seq2Seq model with attention

![image-20210425110351977](https://i.loli.net/2021/04/25/UQBfwG3RNE6FcPr.png)

![image-20210425110820289](https://i.loli.net/2021/04/25/skAPtlBoZVifpSC.png)

![image-20210425110923416](https://i.loli.net/2021/04/25/1QpWiSw76ckxAE5.png)

![image-20210425111201865](https://i.loli.net/2021/04/25/DU9BSY7fnj6Fq5P.png)

### Summary

![image-20210425111354457](https://i.loli.net/2021/04/25/CrpAn4EUQlBDza9.png)

## Self-Attention

![image-20210425112211760](https://i.loli.net/2021/04/25/MkVeREyTLl1t5gY.png)

![image-20210425114211398](https://i.loli.net/2021/04/25/G6niSpUzkCMOYHw.png)

![image-20210425114315102](https://i.loli.net/2021/04/25/zYcSRF7hCLobgt3.png)

# Transformer Model

- Transformer is a Seq2Seq model.
- Transformer is not an RNN.
- It is purely based on attention and dense layers.
- Higher accuracy than RNNs on large datasets.

![image-20210425133549242](https://i.loli.net/2021/04/25/VTh1MUw5WrEagQp.png)

## Attention for Seq2Seq Model

![image-20210425133951048](https://i.loli.net/2021/04/25/dlXoZ9aLuSQywm1.png)

![image-20210425134113776](https://i.loli.net/2021/04/25/LhIoOrZYFXEUxH7.png)

![image-20210425134203558](https://i.loli.net/2021/04/25/7MIqF3xBECOknZz.png)

![image-20210425134453128](https://i.loli.net/2021/04/25/k3Qd1R5soH8uSBj.png)

### Attention without RNN

![image-20210425134737743](https://i.loli.net/2021/04/25/bz5noRjELTGJXSk.png)

![image-20210425134800504](https://i.loli.net/2021/04/25/RmcEBhTwdrs3oVW.png)

1. Compute weights:

   ![image-20210425135116933](https://i.loli.net/2021/04/25/9c6zduTPYwKExe2.png)

   ![image-20210425135152983](https://i.loli.net/2021/04/25/rs7Ni3hIR6dYWgO.png)

2. Compute the context vector:

   ![image-20210425135229911](https://i.loli.net/2021/04/25/Jhe7Qpvq8tTRCFM.png)

   ![image-20210425135252982](https://i.loli.net/2021/04/25/rciny5befF6jdat.png)

3. Repeat this process:

   ![image-20210425135333471](https://i.loli.net/2021/04/25/Q3q9GwUT1MHPxks.png)

#### Output of attention layer

![image-20210425135424255](https://i.loli.net/2021/04/25/U6SdlyhTmZzct8b.png)

### Attention Layer for Machine Translation

![image-20210425135543534](https://i.loli.net/2021/04/25/3Wy1lGpuncQPTFs.png)

- RNN for machine translation: the states `h` are used as the feature vectors.
- Attention layer for machine translation: the context vectors `c` are used as the feature vectors.

### Summary

![image-20210425135908028](https://i.loli.net/2021/04/25/WcyAPeVkN6Toq4f.png)

## Self-Attention without RNN

![image-20210425140006287](https://i.loli.net/2021/04/25/oWSzwdg7ERl4hTV.png)

![image-20210425140112774](https://i.loli.net/2021/04/25/bUVCsdugDL7e9Ny.png)

![image-20210425140129045](https://i.loli.net/2021/04/25/Q1HyC63jTiGmDnP.png)

![image-20210425140151702](https://i.loli.net/2021/04/25/TzZaIgxeH3L7oJ5.png)

## Transformer model

### Word Embedding + Positional Encoding

![](https://i.imgur.com/5EthvBo.png)

- The `PE` vector has the same dimension as the embedded word vector.
- `PE` is determined by the word's position in the sentence, which is how the model accounts for word order in the input sequence.
- Formula for `PE`:

$$
\begin{aligned}
PE(\mathrm{pos}, 2i)   &= \sin\left(\mathrm{pos} / 10000^{2i/d_{\mathrm{model}}}\right) \\
PE(\mathrm{pos}, 2i+1) &= \cos\left(\mathrm{pos} / 10000^{2i/d_{\mathrm{model}}}\right)
\end{aligned}
$$

- `pos` is the position of the current word in the sentence.
- `i` is the index of the entry within the word vector.
- Even dimensions use the sine encoding; odd dimensions use the cosine encoding.

### Single-Head Self-Attention

![image-20210425162116842](https://i.loli.net/2021/04/25/G3gqB7dvta9yFkA.png)
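Here is a minimal NumPy sketch of one single-head self-attention: three parameter matrices produce queries, keys, and values; softmax weights are computed from $\mathbf{K}^{\top}\mathbf{q}_{j}$; and the outputs are weighted sums of the value vectors. All sizes and the random matrices are placeholder assumptions, and the usual $1/\sqrt{d}$ scaling is omitted for brevity.

```python
import numpy as np

def softmax(z, axis=0):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical sizes: m input vectors of dimension in_dim, attention dimension d.
in_dim, d, m = 8, 4, 5
rng = np.random.default_rng(1)
X = rng.normal(size=(in_dim, m))        # input word vectors x_1..x_m as columns

W_Q = rng.normal(size=(d, in_dim))      # the 3 parameter matrices of one head
W_K = rng.normal(size=(d, in_dim))
W_V = rng.normal(size=(d, in_dim))

Q, K, V = W_Q @ X, W_K @ X, W_V @ X     # queries, keys, values (each d x m)

# Column j of A holds the weights softmax(K^T q_j): one weight per input position.
A = softmax(K.T @ Q, axis=0)            # shape (m, m)

# Context vectors: each output column is a weighted sum of the value vectors.
C = V @ A                               # shape (d, m)
print(C.shape)
```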
- A single-head self-attention has 3 parameter matrices: $\mathrm{W}_{Q}, \mathrm{~W}_{K}, \mathrm{~W}_{V}$ - Totally $3 l$ parameters matrices - Concatenating outputs of single-head self-attentions. - Suppose single-head self-attentions' outputs are $d \times m$ matrices. - Multi-head's output shape: $(l d) \times m$. ### Self-Attention Layer + Dense Layer ![image-20210425163322433](https://i.loli.net/2021/04/25/giCMs8FRwoyUOrq.png) ​ 这些全连接层完全一样 (same $W_{U}$) ### Stacked Self-Attention Layers ![image-20210425163614508](https://i.loli.net/2021/04/25/nLygqzd3AcBtGl2.png) ### Transformer's Encoder ![image-20210425165738016](https://i.loli.net/2021/04/25/tXRpCU6M5ksB34T.png) - 输入和输出一样, 所以可以用 resnet的残差结构, 把输入加到输出上 ![image-20210425170000786](https://i.loli.net/2021/04/25/WVB83jSFGeXn7A6.png) ### Transformer's Decoder: One Block ![image-20210425170318480](https://i.loli.net/2021/04/25/RAnFQtJu3NMdlbD.png) ### Transformer ![image-20210425170438697](https://i.loli.net/2021/04/25/MrT6QDy2LSjvIgm.png) ### Summary ![image-20210425170607623](https://i.loli.net/2021/04/25/QcvOBHEXy6MTSU5.png) ![image-20210425170619671](https://i.loli.net/2021/04/25/uvsHm86A54SL2jn.png) ![image-20210425170656647](https://i.loli.net/2021/04/25/mtcPRWsLrI6nTO7.png)
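To tie the encoder pieces together, here is a rough Keras sketch of one encoder block: multi-head self-attention followed by position-wise dense layers, each wrapped in the ResNet-style skip connection mentioned above. It leans on Keras's built-in `MultiHeadAttention` layer rather than hand-built heads, includes the layer normalization used by the standard Transformer (not discussed in the slides), and all dimensions are placeholder assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers

def encoder_block(d_model=512, num_heads=8, d_ff=2048):
    """One encoder block: multi-head self-attention + position-wise dense layers,
    each followed by a skip connection and layer normalization."""
    inputs = keras.Input(shape=(None, d_model))   # (sequence length, d_model)

    # Multi-head self-attention: query, key, and value all come from the same inputs.
    attn = layers.MultiHeadAttention(num_heads=num_heads,
                                     key_dim=d_model // num_heads)(inputs, inputs)
    x = layers.LayerNormalization()(layers.Add()([inputs, attn]))  # add input to output

    # Position-wise dense layers: the same weights are applied at every position.
    ff = layers.Dense(d_ff, activation="relu")(x)
    ff = layers.Dense(d_model)(ff)
    outputs = layers.LayerNormalization()(layers.Add()([x, ff]))   # second skip connection

    return keras.Model(inputs, outputs)

block = encoder_block()
block.summary()   # stacking several such blocks gives the full encoder
```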