# Natural Language Processing & Word Embeddings
## word representation
### one hot

problem:
- the dot product of any two different one-hot vectors is zero
- so the representation captures no notion of similarity or relationship between words
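
A minimal numpy sketch with a made-up 5-word vocabulary, showing that any two different one-hot vectors have a zero dot product:

```python
import numpy as np

# Toy vocabulary (hypothetical); real systems use 10K-1M words.
vocab = ["man", "woman", "king", "queen", "apple"]

def one_hot(word, vocab):
    """Return the one-hot vector for `word`."""
    v = np.zeros(len(vocab))
    v[vocab.index(word)] = 1.0
    return v

o_man, o_king = one_hot("man", vocab), one_hot("king", vocab)
print(o_man @ o_king)  # 0.0: the representation encodes no similarity at all
```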
### featurized representation

- similar vectors => similar words
#### visualizing word embeddings

- algo: t-SNE
- maps the high-dimensional embeddings down to 2D so that we can see the relationships between words
(embedding: each word corresponds to a point in a high-dimensional feature space)
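
A sketch of the visualization step using scikit-learn's t-SNE; the 300-dimensional embeddings here are random stand-ins for real pretrained vectors:

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

words = ["man", "woman", "king", "queen", "apple", "orange"]
E = np.random.randn(len(words), 300)   # stand-in for real 300-D embeddings

# t-SNE maps the 300-D points down to 2-D while trying to preserve neighborhoods.
coords = TSNE(n_components=2, perplexity=2, init="random").fit_transform(E)

plt.scatter(coords[:, 0], coords[:, 1])
for (x, y), w in zip(coords, words):
    plt.annotate(w, (x, y))
plt.show()
```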
## using word embeddings

- transfer learning: learn embeddings from a large corpus (task A), then transfer them to a task with a smaller labeled dataset (task B)
- useful: named entity recognition, text summarization, co-reference resolution, parsing
- less useful: language modeling, machine translation (these tasks usually already have large task-specific datasets)
## properties of word embeddings
### Analogies using word vectors
*example: man is to woman as king is to ?*


- use the difference between the vectors ("how much they differ" is captured by subtraction): e_man - e_woman ≈ e_king - e_queen, so find the word w that maximizes sim(e_w, e_king - e_man + e_woman)
### cosine similarity
find the similarity between vectors: sim(u, v) = (u · v) / (||u|| ||v||)
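
A numpy sketch of cosine similarity and the analogy search; `embeddings` is a hypothetical dict mapping words to pretrained vectors:

```python
import numpy as np

def cosine_similarity(u, v):
    """sim(u, v) = u·v / (||u|| * ||v||)."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def complete_analogy(a, b, c, embeddings):
    """Return the word d such that a : b :: c : d, via e_b - e_a ≈ e_d - e_c."""
    target = embeddings[b] - embeddings[a] + embeddings[c]
    candidates = (w for w in embeddings if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine_similarity(embeddings[w], target))

# Usage (embeddings is assumed to hold pretrained vectors):
# complete_analogy("man", "woman", "king", embeddings)  # -> "queen", ideally
```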

## embedding matrix

- E (300*10K): embedding matrix (300 features, 10K vocabulary words)
- O_j: one-hot vector for word j
- e_j = E · O_j: embedding vector for word j (in practice, just look up column j of E instead of doing the full matrix product)
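
A small sketch of this notation with a random stand-in `E`: the product E · O_j just selects column j, so real implementations use a direct lookup instead of the matrix multiplication.

```python
import numpy as np

vocab_size, n_features = 10_000, 300
E = np.random.randn(n_features, vocab_size)    # stand-in embedding matrix (300 x 10K)

j = 1234                                       # index of some word in the vocabulary
o_j = np.zeros(vocab_size); o_j[j] = 1.0       # one-hot vector O_j

e_j = E @ o_j                                  # e_j = E · O_j (mostly multiplying by zeros)
assert np.allclose(e_j, E[:, j])               # in practice: just read column j of E
```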
## Learning Word Embeddings

- e: embedding vector
- context/target pairs: the context can be the last 4 words, 4 words on the left and right, the last 1 word, or a nearby 1 word (skip-gram), etc.; the model learns embeddings by predicting the target word from the context
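
A toy sketch of building (context, target) pairs under the "last 4 words" convention, using a sentence like the course example:

```python
sentence = "I want a glass of orange juice to go along with my cereal".split()

window = 4                                            # "last 4 words" context
pairs = [(sentence[i - window:i], sentence[i])        # (context, target)
         for i in range(window, len(sentence))]

print(pairs[0])  # (['I', 'want', 'a', 'glass'], 'of')
```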
## Word2Vec
### skip-gram

randomly pick a word to be the context word, then randomly pick a target word within some window (e.g. ±5 or ±10 words) around it
#### model

idea:
- look up the embedding vector e_c = E · O_c from the embedding matrix and the one-hot vector, then feed it into a softmax: p(t | c) = exp(theta_t^T e_c) / sum_j exp(theta_j^T e_c) (sketched below)
** theta_t: the parameter vector associated with target word t
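
A numpy sketch of that softmax step, with random stand-ins for E and the parameter matrix theta (one column theta_t per possible target word):

```python
import numpy as np

vocab_size, n_features = 10_000, 300
E = np.random.randn(n_features, vocab_size) * 0.01      # embedding matrix
theta = np.random.randn(n_features, vocab_size) * 0.01  # softmax parameters, theta_t per word

def p_target_given_context(c):
    """p(t | c) = exp(theta_t^T e_c) / sum_j exp(theta_j^T e_c), for every t at once."""
    e_c = E[:, c]                       # embedding lookup for the context word
    logits = theta.T @ e_c              # theta_t^T e_c for all 10,000 words t
    logits -= logits.max()              # for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()              # note: the sum runs over the whole vocabulary

probs = p_target_given_context(c=42)    # distribution over all 10,000 possible targets
```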
#### problem with softmax classification

- problem: computational speed (the softmax denominator sums over the entire 10,000-word vocabulary)
- sol: hierarchical softmax classifier, a tree of binary classifiers with cost O(log |V|) instead of O(|V|)
e.g. is it in the first 5,000 words? the first 2,500? ...
## Negative Sampling

- negative examples: for each positive (context, target) pair, pick k words at random from the dictionary to create k new examples and label them 0
** smaller datasets: k = 5-20; larger datasets: k = 2-5

- turns one 10,000-way softmax into 10,000 independent binary classification problems (see the sketch below)
- on each training example only k + 1 of those binary classifiers are updated
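
A numpy sketch of one negative-sampling update set: one positive (context, target) pair plus k random negatives become k + 1 logistic (sigmoid) problems; E, theta and the word indices are stand-ins:

```python
import numpy as np

vocab_size, n_features, k = 10_000, 300, 5
E = np.random.randn(n_features, vocab_size) * 0.01
theta = np.random.randn(n_features, vocab_size) * 0.01

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

context, positive_target = 42, 777                      # one observed (c, t) pair
negatives = np.random.randint(vocab_size, size=k)       # k random words labeled 0

e_c = E[:, context]
words = np.concatenate(([positive_target], negatives))  # only these k+1 classifiers
labels = np.array([1] + [0] * k)                        # are touched on this example

preds = sigmoid(theta[:, words].T @ e_c)                # P(label = 1 | c, word) for each
loss = -np.mean(labels * np.log(preds) + (1 - labels) * np.log(1 - preds))
```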
### selecting negative examples

f(w_i): the observed frequency of word w_i in the training corpus; sample negatives with P(w_i) = f(w_i)^(3/4) / sum_j f(w_j)^(3/4), a compromise between the uniform distribution and sampling by raw frequency
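
A sketch of that heuristic sampling distribution, P(w_i) = f(w_i)^(3/4) / sum_j f(w_j)^(3/4), with made-up word counts:

```python
import numpy as np

counts = np.array([5_000, 1_200, 300, 40, 3])   # made-up corpus frequencies f(w_i)
p = counts ** 0.75                               # f(w_i)^(3/4)
p = p / p.sum()                                  # P(w_i) = f(w_i)^(3/4) / sum_j f(w_j)^(3/4)

negatives = np.random.choice(len(counts), size=5, p=p)   # draw k negative words
```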
## GloVe

X_ij: # of times word j appears in the context of word i (counting context words on both the left and the right makes X symmetric)

- objective: minimize sum_{i,j} f(X_ij) (theta_i^T e_j + b_i + b'_j - log X_ij)^2
- f: a frequency-based weighting term
f(x) = 0 if x = 0, so the sum runs only over the pairs of words that have co-occurred at least once in that context-target relationship (and log 0 never appears); f also keeps very frequent words from dominating while still giving rare words a meaningful weight
- theta, e play symmetric roles => e(final) = (e + theta)/2
- note: we can't guarantee that the axes learned this way align with easily human-interpretable features (i.e., an individual dimension of the vector does not necessarily correspond to one of the human-understandable features described earlier, but the earlier intuition is still a useful way to think about the algorithm)
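
A numpy sketch of the objective above with a made-up co-occurrence matrix; the particular capped weighting f(x) = min(1, (x / x_max)^0.75) is an assumption (the notes only require f(0) = 0 and sensible weights for frequent vs. rare words):

```python
import numpy as np

V, d = 50, 8                                           # tiny vocab and embedding size
X = np.random.poisson(0.3, size=(V, V)).astype(float)  # made-up co-occurrence counts X_ij

theta = np.random.randn(V, d) * 0.01
e = np.random.randn(V, d) * 0.01
b, b_prime = np.zeros(V), np.zeros(V)                  # bias terms b_i, b'_j

def glove_loss():
    # Weighting f: 0 when X_ij == 0 (those pairs are skipped), capped for frequent pairs.
    f = np.where(X > 0, np.minimum(1.0, (X / 100.0) ** 0.75), 0.0)
    logX = np.log(np.where(X > 0, X, 1.0))             # placeholder 1.0: those terms vanish via f
    diff = theta @ e.T + b[:, None] + b_prime[None, :] - logX  # theta_i^T e_j + b_i + b'_j - log X_ij
    return np.sum(f * diff ** 2)

# After training (gradient descent on glove_loss), theta and e are symmetric, so:
e_final = (e + theta) / 2
```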
## sentiment classification
- telling whether someone liked something or not from a piece of text (e.g. predicting a star rating from a review)
- labeled sentiment datasets are often small, so pretrained word embeddings help
### simple sentiment classification model

idea:
- sum or average the embedding vectors of all the words in the text
- feed the result to a softmax to generate the prediction (e.g. 1-5 stars)
problem:
- ignores word order ("completely lacking in good taste" still looks positive because of "good")
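
A numpy sketch of this simple model with stand-in embeddings and a 5-way (star-rating) softmax; it also shows the word-order problem:

```python
import numpy as np

vocab = {"completely": 0, "lacking": 1, "in": 2, "good": 3, "taste": 4}
E = np.random.randn(300, len(vocab)) * 0.01            # stand-in embedding matrix
W, b = np.random.randn(5, 300) * 0.01, np.zeros(5)     # softmax over 5 star ratings

def predict(review):
    # Average the embeddings of all words, then apply the softmax classifier.
    e_avg = np.mean([E[:, vocab[w]] for w in review.split()], axis=0)
    logits = W @ e_avg + b
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()                              # P(1 star), ..., P(5 stars)

print(predict("completely lacking in good taste"))     # word order is lost: "good" still counts
```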
### RNN for sentiment classification

- the word embeddings can be trained on a much larger unlabeled corpus and then reused here
- so the model generalizes even to words that never appeared in the labeled training set
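
A hedged Keras sketch of a many-to-one RNN sentiment classifier on top of frozen, transferred embeddings (random stand-ins here):

```python
import numpy as np
import tensorflow as tf

vocab_size, emb_dim = 10_000, 300
pretrained = np.random.randn(vocab_size, emb_dim)   # stand-in for embeddings trained on a big corpus

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(
        vocab_size, emb_dim,
        embeddings_initializer=tf.keras.initializers.Constant(pretrained),
        trainable=False),                           # keep the transferred embeddings frozen
    tf.keras.layers.LSTM(128),                      # many-to-one: read the whole review in order
    tf.keras.layers.Dense(5, activation="softmax"), # 1-5 star rating
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# model.fit(x_train, y_train, ...)  # x_train: padded sequences of word indices
```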
## Debiasing Word Embeddings
problem:
- Word embeddings can reflect the gender, ethnicity, age, sexual orientation, and other biases of the text used to train the model

Steps:
1. Identify the bias direction to address (i.e., decide which bias to remove; e.g. the gender direction, obtained by averaging differences such as e_he - e_she).
2. Neutralize every non-definitional word (e.g. doctor) by removing its component along the bias axis, so that these words show no difference along it (definitional words such as mother and father carry gender by definition rather than by bias).
3. Equalize pairs: move each definitional pair so that both words are equidistant from the non-bias axis; this makes their distances to potentially biased words equal (e.g. grandmother and grandfather end up at the same distance from babysitter).
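
A numpy sketch of steps 1 and 2 above (finding a bias direction and neutralizing a non-definitional word); `embeddings` is a hypothetical dict of word vectors:

```python
import numpy as np

def bias_direction(pairs, embeddings):
    """Step 1: average differences such as e_he - e_she to get the bias axis g."""
    g = np.mean([embeddings[a] - embeddings[b] for a, b in pairs], axis=0)
    return g / np.linalg.norm(g)

def neutralize(word, g, embeddings):
    """Step 2: remove the component of e_word along g, so the word is neutral on that axis."""
    e = embeddings[word]
    e_bias = np.dot(e, g) * g          # projection of e onto the bias direction
    return e - e_bias                  # orthogonal to g: no component along the bias axis

# Usage (embeddings assumed to map words to vectors):
# g = bias_direction([("he", "she"), ("male", "female")], embeddings)
# embeddings["doctor"] = neutralize("doctor", g, embeddings)
```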