
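# NLP

## Meaning Representation

The meaning carried by a word or phrase, in a form a machine can understand: the idea a person wants to express by using words and signs.

### Knowledge-Based Representation

#### Hypernym (inclusion) relationships of WordNet

- Vocabulary defined by linguists
- Uses a notion of containment, e.g. Dog is a subspace of Animal
- ![](https://i.imgur.com/vnFg0Nc.png)
- Words are defined through their relations to other words
- May include relations such as synonyms and antonyms
- ![](https://i.imgur.com/KjXjeqt.png)
- Drawbacks:
    - newly-invented words cannot be included
    - subjective: annotators have their own opinions, so borderline cases are unclear
    - annotation effort: requires manual labeling
    - difficult to compute word similarity: the degree of relatedness between a pair of words is not defined

### Corpus-Based Representation

- Knowledge is obtained from a large corpus. The distribution in the data expresses how words are really used, with no subjective human definitions. Words are represented by the contexts in which they are used in text, through the different patterns they appear in.

#### Atomic symbols: one-hot representation

- ![](https://i.imgur.com/Ks2Am6D.png)
- Each word in the vocabulary defines one dimension
- difficult to compute the similarity
    - Different words sit on different dimensions, so even words that are related cannot be shown as similar
- Solution: neighbors, i.e. use the surrounding context to determine word meaning

#### Neighbor-based representation

- Co-occurrence matrix
    - definition: full document vs. windows
        - **full document**: words that appear in the same document
            - general topics
            - Latent Semantic Analysis
            - e.g. sports-related words that appear in sports news end up close to each other
        - **windows**:
            - capture syntactic (POS) and semantic information
            - e.g. a verb is more likely to appear after a noun
    - Example:
        - window length = 1
        - left or right context
        - corpus:
            - I love AI.
            - I love deep learning.
            - I enjoy learning.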
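
The window-based counting described above can be sketched in a few lines of Python. This is a minimal illustration rather than the method used to produce the figures in these notes: the `tokenize` helper, the decision to drop the final period, and the function names `cooccurrence`, `row`, and `dot` are all assumptions made for the example.

```python
from collections import Counter

corpus = ["I love AI.", "I love deep learning.", "I enjoy learning."]

def tokenize(sentence):
    # Assumption: drop the final period and split on whitespace.
    return sentence.rstrip(".").split()

def cooccurrence(sentences, window=1):
    """Count how often each ordered pair of words appears within `window` positions."""
    counts = Counter()
    for sentence in sentences:
        tokens = tokenize(sentence)
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

counts = cooccurrence(corpus)
vocab = sorted({word for pair in counts for word in pair})

def row(word):
    # The co-occurrence vector of `word`: one count per vocabulary entry.
    return [counts[(word, other)] for other in vocab]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

print(vocab)
print(counts[("I", "love")])           # co-occurrence count of "I" and "love"
print(dot(row("love"), row("enjoy")))  # positive because both co-occur with "I"
```

With window length 1, "I" and "love" are adjacent in two sentences, so their count is 2. The rows for "love" and "enjoy" overlap only in the "I" dimension, which is why their dot product is greater than 0.
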
- ![](https://i.imgur.com/ZZLYIeM.png)
    - Co-occurrence count of "I" and "love": 2
- ![](https://i.imgur.com/G1gdVV8.png)
    - Comparing "love" and "enjoy": similarity > 0, because both have a count greater than 0 in the "I" dimension
- Issues:
    - matrix size increases with vocabulary
    - high dimensional
    - sparsity -> poor robustness
        - The information is concentrated in a small number of entries while the dimensionality is very high. We want every dimension to contribute; otherwise training takes too long or works poorly. Ideally the vectors are filled with non-zero values.

#### Low-Dimensional Dense Word Vector

- dimension reduction on the matrix
- SVD of co-occurrence matrix $X$ (dimension reduction)
    - ![](https://i.imgur.com/yv8zcIe.png)
    - Going from $r$ dimensions to $k$ dimensions, the resulting $\hat{X}$ stays reasonably close to the original $X$
- semantic relations
    - "dog" and "cat" form a group; "China" and "Russia" form a group
    - ![](https://i.imgur.com/hHgRhLd.png)
- syntactic relations
    - part of speech or participle forms, determined from context
    - ![](https://i.imgur.com/jyQZN36.png)
- Issues:
    - computationally expensive: $O(mn^2)$
    - difficult to add new words
        - adding a word requires recomputing everything from scratch
- Solution: directly learn low-dimensional word vectors
    - A neural probabilistic language model
    - Word2vec (2013)
    - Word embeddings

## Language Modeling

- Goal: estimate the probability of a word sequence
- ![](https://i.imgur.com/PyRn9Du.png)

### N-Gram

- Probability is conditioned on a window of $(n-1)$ previous words
- ![](https://i.imgur.com/BsRFvyU.png)
    - the probability of the word sequence $w_1, \dots, w_m$
    - considered from $i$ up to $m-1$
- ![](https://i.imgur.com/rReaw4I.png)
    - treat every $n$ words as one group
- Estimate the probability based on the training data
- Issue:
    - some sequences may not appear in the training data
        - a sequence that never appeared gets probability 0
    - Solution: smoothing
        - ![](https://i.imgur.com/pdP1UBn.png)
        - but the resulting probability is not accurate

### Feed-Forward Neural Language Model

- Estimates the probability of the $i$-th word given $w_{i-n+1}, \dots, w_{i-1}$. An n-gram model handles this with counts.
- ![](https://i.imgur.com/iRw6QYo.png)
- ![](https://i.imgur.com/fbTocrP.png)
    - the output is the probability of the next word
- 2003 Model
    - the previous $n-1$ words are the input, and the final layer is a softmax
    - the last layer can directly access the information of the first layer
    - output: one dimension per word in the vocabulary
    - ![](https://i.imgur.com/1FRIc2j.png)
    - $W^{(3)}$ is the connection through which the last layer accesses the first layer
    - ![](https://i.imgur.com/0GfbTCw.png)
- Architecture:
    - ![](https://i.imgur.com/VBUnNWa.png)
- Advantages:
    - Dataset: cat run, cat
jump, but no rabbit run or rabbit jump
    - Because the vectors of "cat" and "rabbit" are close, the probability of predicting "run" or "jump" after "rabbit" is naturally higher as well
    - The probability is not computed from actual occurrence counts; it is computed from the current input
    - So no smoothing is needed
- Issue: fixed context window for conditioning
    - When interpreting a word we still consider the whole sentence, not only the immediately surrounding words

### RNNLM

- Removes the fixed context window so that the model considers all preceding information
- Idea: condition the neural network on all previous words and tie the weights at each time step
- Plain NN architecture:
    - when predicting "nice", the model only sees "a"
    - ![](https://i.imgur.com/luGwYlm.png)
- RNN architecture:
    - when predicting "nice", the model sees not only "a" but also "wreck" and "start"
    - so it does not look only at the current word and can consider all context
    - ![](https://i.imgur.com/j53EB8X.png)
- RNN Formulation
    - ![](https://i.imgur.com/JaXdzcg.png)
    - $y_{t,j}$ is the probability that the next word $x_{t+1}$ is $w_j$, given all previous words
    - ![](https://i.imgur.com/eykT0Fq.png)
    - ![](https://i.imgur.com/d9QGdd8.png)
    - the same parameters $U$, $V$, $W$ are used at every time step
    - ![](https://i.imgur.com/bV328iD.png)
- Cost function: $C$
    - ![](https://i.imgur.com/pOC85wl.png)
    - in the end we update $\{U, V, W\}$
    - the update uses gradient descent
    - ![](https://i.imgur.com/QoynHTw.png)

#### Training via Backpropagation through Time (BPTT)

- Standard backpropagation
    - ![](https://i.imgur.com/xxg1g9r.png)
    - ![](https://i.imgur.com/ibB6cPP.png)
- BPTT
    - ![](https://i.imgur.com/wx6hWIX.png)
    - ![](https://i.imgur.com/T9yaoIx.png)
    - ![](https://i.imgur.com/ZjomNZw.png)
    - ![](https://i.imgur.com/BRPbB5z.png)
    - ![](https://i.imgur.com/MpAwE1A.png)
    - ![](https://i.imgur.com/G8MTgDC.png)
    - The whole procedure:
    - ![](https://i.imgur.com/hIud9J0.png)
    - ![](https://i.imgur.com/DXMXRFn.png)
- Training Issue
    - vanishing or exploding gradient
    - rough error surface
    - ![](https://i.imgur.com/cawW6iF.png)
- Extension: Bidirectional RNN
    - ![](https://i.imgur.com/jJXp3Uf.png)
    - ![](https://i.imgur.com/5kZe0y7.png)

## Word Representation

### Word embedding

- Not obtained by dimension reduction: the vectors are learned directly, embedding the words directly
- word2vec, GloVe
- Better than the earlier representations
- Advantages:
    - no labeled training corpus is needed
    - semantic similarity
    - powerful features (synonyms end up very close, so the model automatically predicts very similar results for them)
    - propagate any information into them (the embeddings can be adjusted to fit the task for an end-to-end setup, and can also be used with pretraining or fine-tuning)
- ![](https://i.imgur.com/ahX6mmV.png)
    - pretrain: only the blue part is updated
    - fine-tune: the yellow part can be updated as well

### Word2Vec - Skip Gram Model

- Goal: given the target word, predict its neighbors (within a window)
- ![](https://i.imgur.com/e5uDnq3.png)
- ![](https://i.imgur.com/byyT0JJ.png)
- Objective function: maximize the probability of any context word given the current center word
    - ![](https://i.imgur.com/mZ9PYes.png)
    - Taking the log:
    - ![](https://i.imgur.com/GoBUb8A.png)

## Contextualized Word Embedding

- Different tokens get different embedding vectors
- Even tokens of the same type each get their own vector
    - so these vectors are likely to be close to each other
- Each word token has its own embedding.
- Uses ELMo (Embeddings from Language Models)
    - RNN-based language models (trained on lots of sentences)
    - bidirectional: considers both history and future tokens
    - Which hidden layer is used as the embedding vector?
        - all hidden layers are summed, each first multiplied by a weight $\alpha$
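
The weighted sum over hidden layers can be sketched as follows. This is only an illustration under stated assumptions: the notes say each layer is multiplied by a weight $\alpha$ before summing, and the sketch additionally assumes the $\alpha$ values come from a softmax over raw scores so that they sum to 1. The names `softmax`, `combine_layers`, and the toy vectors are hypothetical.

```python
import math

def softmax(scores):
    # Subtract the maximum for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def combine_layers(layer_vectors, raw_weights):
    """Weighted sum of one token's hidden vectors, one vector per layer.

    Assumption: the weights alpha are the softmax of `raw_weights`.
    """
    alphas = softmax(raw_weights)
    dim = len(layer_vectors[0])
    return [
        sum(alpha * vec[d] for alpha, vec in zip(alphas, layer_vectors))
        for d in range(dim)
    ]

# Toy hidden states for a single token from three layers (made-up numbers).
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Equal raw scores give equal alphas, so the result is the plain average.
embedding = combine_layers(layers, [0.0, 0.0, 0.0])
print(embedding)
```

In a trained model the raw scores would be learned for the downstream task, so different tasks can emphasize different layers.
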