NLP Text Representation - GloVe Word Vectors and Linguistic Structure

---
title: 'NLP Text Representation - GloVe Word Vectors and Linguistic Structure'
tags: NLP
---

NLP Text Representation - GloVe Word Vectors and Linguistic Structure
===

## Table of Contents

[TOC]

## Learning Objectives

* A third text representation method: ==GloVe==
    * Developed at Stanford
    * Emphasizes that ==structure in language== matters

---

GloVe
---

* [GloVe (Global Vectors for Word Representation)](https://nlp.stanford.edu/projects/glove/)
    * An unsupervised learning method
    * Uses word-word co-occurrence statistics
* [Code download](https://nlp.stanford.edu/software/GloVe-1.2.zip)
* Ruby [script](https://nlp.stanford.edu/projects/glove/preprocess-twitter.rb) for preprocessing Twitter data

GloVe Basic Concepts
---

### 1. ==Nearest Neighbors==

* `Find the nearest neighbors`
    * If words have a ==degree of similarity== to one another, nearest-neighbor search can find the words most similar to a given word.
* Similarity between two word vectors indicates linguistic or semantic similarity.
    * Each word is represented by a vector (two word vectors); the more alike the vectors are, the more alike the words are linguistically or semantically.
* Nearest neighbors reveal rare but relevant words.
    * They surface uncommon words that are still related or similar to the query to some degree. Ex: Lizard...

![](https://i.imgur.com/oYGCWtm.png)

### 2. ==Linear Substructures==

* The similarity metrics can quantify the relatedness of two words, however:
    * Similarity -> "quantifies" the relatedness between words as a single number
    * (man, woman) can be considered similar, since both represent human beings -> similar!
    * (man, woman) can also be considered opposites that highlight a primary axis -> opposite!
    * When should two words count as similar, and when not?
        * Consider the substructures between words (small structures: how are the words related?)
* Linear Substructures
    * How close together or far apart words are indicates their degree of similarity.
    * Ex: compare the difference between `strong` and `stronger`

![](https://i.imgur.com/XTVJgWQ.png)

* Instead of one value, GloVe is designed to use the vector difference between the two word vectors.
    * ==vector difference -> the idea of a ratio==

### 3. ==Model Overview==

* Ratio (`conditional probability`):
    * The probability that two words occur together
    * From this we obtain the ==Linear Substructures==
        * The degree of relatedness between words
* Ratios of word-word co-occurrence probabilities have the potential for ==encoding some form of meaning==.
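* `ice` (solid state):
    * $P(solid|ice)$ and $P(water|ice)$ are high: `ice` is related to `solid` and `water`.
* `steam` (gaseous state):
    * $P(gas|steam)$ and $P(water|steam)$ are high: `steam` is related to `gas` and `water`.
* The value of $P(k|ice)/P(k|steam)$:
    * Much greater than 1: k is related to ice but not to steam (e.g. `solid`)
    * Much less than 1: k is related to steam but not to ice (e.g. `gas`)
    * Approximately 1: k is either related to both ice and steam (e.g. `water`) or related to neither; the ratio alone does not distinguish these two cases

![](https://i.imgur.com/CAjaNr3.png)

### 4. ==Encoding Meaning==

* Insight: Ratios of co-occurrence probabilities can encode meaning
* Q: How can these ratios be expressed in the word vector space?
* A:
    * Log-bilinear model: $W_i \cdot W_j$ = ==$\log$== $P(i|j)$
        * Conditional probabilities often differ by orders of magnitude (they behave exponentially), so dividing them directly does not give a clear signal.
        * Taking the $\log$ turns the ratio into a difference, which gives a linear relationship.
        * `IDF` uses the same idea.
    * Vector differences (division): $W_x \cdot (W_a - W_b)$ = ==$\log$== $\dfrac{P(x|a)}{P(x|b)}$
        * This follows from subtracting the log-bilinear equations for $P(x|a)$ and $P(x|b)$.

### 5. ==Goal==

* Purpose of training the model:
    * Not to predict the target word, and not to predict the context.
    * To make the system learn the `dot product`
* The objective of GloVe is: the dot product of word vectors equals the ==logarithm of the words' probability of co-occurrence==

==$J = \displaystyle\sum_{i,j=1}^{V} f(X_{ij})\left(W_i^T \widetilde{W}_j + b_i + \widetilde{b}_j - \log X_{ij}\right)^2$==

* $X$: co-occurrence matrix
* $V$: vocabulary size
* $f$: weighting function applied to each co-occurrence count
* $W$: word vectors
    * Word vectors can be obtained with GloVe or Word2Vec
* $\widetilde{W}$: context word vectors
* $b$, $\widetilde{b}$: bias
* This objective associates (the logarithm of) ratios of co-occurrence probabilities with vector differences in the word vector space
* Goal: make the error as small as possible.

GloVe VS. Word2Vec
---

* There seems to be little difference? Both use ==word vectors==
* The biggest difference between them: ==the idea of a ratio==
    * How do we know the distance between words?
        1. Word2Vec:
            * Takes the vectors as given, like "borrowing arrows with straw boats" (the arrows being the word weights), and adds or subtracts the "arrows" directly.
        2. GloVe:
            * Considers plain subtraction on its own to be insufficient.
            * Vector differences should correspond to the "ratio" of co-occurrence probabilities.
* GloVe advantages:
    * Fast training
        * Because the amount of text required is small.
    * Does not need a very large corpus
        * Performs well even when the corpus is small
    * Word analogy (performs very well on word ==analogies==)
        * Ex: `frog` vs. `toad` -> similar

---

* ##### tags: `NLP Text Representation - GloVe Word Vectors and Linguistic Structure`
* [The Stanford Natural Language Processing Group - Teaching](https://nlp.stanford.edu/teaching/)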
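
The conditional probability $P(k|w)$ from the Model Overview and the objective $J$ from the Goal section can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the Stanford implementation: the function names (`cooccurrence_counts`, `cond_prob`, `weight`, `glove_loss`), the toy corpus, and the window size are all hypothetical. The notes do not define $f$; the form used here, with `x_max=100` and `alpha=0.75`, is taken from the GloVe paper. Since $f(0)=0$, the sum only needs to visit word pairs that actually co-occur.

```python
import math
from collections import Counter


def cooccurrence_counts(sentences, window=2):
    """Build symmetric co-occurrence counts X[(word, context)] within a window."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for pos, word in enumerate(tokens):
            # Look back at up to `window` preceding tokens and count both directions.
            for ctx_pos in range(max(0, pos - window), pos):
                context = tokens[ctx_pos]
                counts[(word, context)] += 1
                counts[(context, word)] += 1
    return counts


def cond_prob(counts, k, word):
    """P(k | word) = X[word, k] / (sum of X[word, k'] over all k')."""
    row_total = sum(x for (i, _), x in counts.items() if i == word)
    return counts.get((word, k), 0) / row_total if row_total else 0.0


def weight(x, x_max=100.0, alpha=0.75):
    """Weighting f(x) (assumed form from the GloVe paper): capped at 1 for frequent pairs."""
    return (x / x_max) ** alpha if x < x_max else 1.0


def glove_loss(counts, w, w_ctx, b, b_ctx):
    """J = sum of f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2 over nonzero X_ij."""
    total = 0.0
    for (i, j), x in counts.items():
        dot = sum(a * c for a, c in zip(w[i], w_ctx[j]))
        diff = dot + b[i] + b_ctx[j] - math.log(x)
        total += weight(x) * diff ** 2
    return total


# Toy data for illustration only (not from the notes).
corpus = ["ice is solid water", "steam is gas water"]
counts = cooccurrence_counts(corpus, window=3)
vocab = {word for pair in counts for word in pair}
zero_vectors = {word: [0.0, 0.0] for word in vocab}
zero_bias = {word: 0.0 for word in vocab}

print("P(solid|ice) =", cond_prob(counts, "solid", "ice"))
print("J at all-zero parameters =", glove_loss(counts, zero_vectors, zero_vectors, zero_bias, zero_bias))
```

Training would then adjust `w`, `w_ctx`, `b`, and `b_ctx` to reduce the value returned by `glove_loss`; the optimizer itself is left out of this sketch.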