# Natural Language Processing & Word Embeddings
## word representation
### one hot

problem:
- the dot product of any two different one-hot vectors is zero
- so the representation captures no notion of similarity or relationship between words
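
A minimal numpy sketch with a made-up 5-word vocabulary, showing that any two different one-hot vectors have a zero dot product:

```python
import numpy as np

# Toy vocabulary (hypothetical); real systems use 10K-1M words.
vocab = ["man", "woman", "king", "queen", "apple"]

def one_hot(word, vocab):
    """Return the one-hot vector for `word`."""
    v = np.zeros(len(vocab))
    v[vocab.index(word)] = 1.0
    return v

o_man, o_king = one_hot("man", vocab), one_hot("king", vocab)
print(o_man @ o_king)  # 0.0: the representation encodes no similarity at all
```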
### featurized representation

- similar vectors => similar words
#### visualizing word embeddings

- algo: t-SNE
- maps the high-dimensional embeddings down to 2D so that we can see the relationships between words
(embedding: each word corresponds to a point in a high-dimensional feature space)
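
A sketch of the visualization step using scikit-learn's t-SNE; the 300-dimensional embeddings here are random stand-ins for real pretrained vectors:

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

words = ["man", "woman", "king", "queen", "apple", "orange"]
E = np.random.randn(len(words), 300)   # stand-in for real 300-D embeddings

# t-SNE maps the 300-D points down to 2-D while trying to preserve neighborhoods.
coords = TSNE(n_components=2, perplexity=2, init="random").fit_transform(E)

plt.scatter(coords[:, 0], coords[:, 1])
for (x, y), w in zip(coords, words):
    plt.annotate(w, (x, y))
plt.show()
```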
## using word embeddings

- transfer learning: learn embeddings from a large corpus (task A), then transfer them to a task with a smaller labeled dataset (task B)
- useful: named entity recognition, text summarization, co-reference resolution, parsing
- less useful: language modeling, machine translation (these tasks usually already have large task-specific datasets)
## properties of word embeddings
### Analogies using word vectors
*example: man is to woman as king is to ?*


- use the difference between the vectors ("how much they differ" is captured by subtraction): e_man - e_woman ≈ e_king - e_queen, so find the word w that maximizes sim(e_w, e_king - e_man + e_woman)
### cosine similarity
find the similarity between vectors: sim(u, v) = (u · v) / (||u|| ||v||)
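
A numpy sketch of cosine similarity and the analogy search; `embeddings` is a hypothetical dict mapping words to pretrained vectors:

```python
import numpy as np

def cosine_similarity(u, v):
    """sim(u, v) = u·v / (||u|| * ||v||)."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def complete_analogy(a, b, c, embeddings):
    """Return the word d such that a : b :: c : d, via e_b - e_a ≈ e_d - e_c."""
    target = embeddings[b] - embeddings[a] + embeddings[c]
    candidates = (w for w in embeddings if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine_similarity(embeddings[w], target))

# Usage (embeddings is assumed to hold pretrained vectors):
# complete_analogy("man", "woman", "king", embeddings)  # -> "queen", ideally
```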

## embedding matrix

- E (300*10K): embedding matrix (300 features, 10K vocabulary words)
- O_j: one-hot vector for word j
- e_j = E · O_j: embedding vector for word j (in practice, just look up column j of E instead of doing the full matrix product)
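
A small sketch of this notation with a random stand-in `E`: the product E · O_j just selects column j, so real implementations use a direct lookup instead of the matrix multiplication.

```python
import numpy as np

vocab_size, n_features = 10_000, 300
E = np.random.randn(n_features, vocab_size)    # stand-in embedding matrix (300 x 10K)

j = 1234                                       # index of some word in the vocabulary
o_j = np.zeros(vocab_size); o_j[j] = 1.0       # one-hot vector O_j

e_j = E @ o_j                                  # e_j = E · O_j (mostly multiplying by zeros)
assert np.allclose(e_j, E[:, j])               # in practice: just read column j of E
```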
## Learning Word Embeddings

- e: embedding vector
- context/target pairs: the context can be the last 4 words, 4 words on the left and right, the last 1 word, or a nearby 1 word (skip-gram), etc.; the model learns embeddings by predicting the target word from the context
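
A toy sketch of building (context, target) pairs under the "last 4 words" convention, using a sentence like the course example:

```python
sentence = "I want a glass of orange juice to go along with my cereal".split()

window = 4                                            # "last 4 words" context
pairs = [(sentence[i - window:i], sentence[i])        # (context, target)
         for i in range(window, len(sentence))]

print(pairs[0])  # (['I', 'want', 'a', 'glass'], 'of')
```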
## Word2Vec
### skip-gram

randomly pick a word to be the context word, then randomly pick a target word within some window (e.g. ±5 or ±10 words) around it
#### model

idea:
- look up the embedding vector e_c = E · O_c from the embedding matrix and the one-hot vector, then feed it into a softmax: p(t | c) = exp(theta_t^T e_c) / sum_j exp(theta_j^T e_c) (sketched below)
** theta_t: the parameter vector associated with target word t
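
A numpy sketch of that softmax step, with random stand-ins for E and the parameter matrix theta (one column theta_t per possible target word):

```python
import numpy as np

vocab_size, n_features = 10_000, 300
E = np.random.randn(n_features, vocab_size) * 0.01      # embedding matrix
theta = np.random.randn(n_features, vocab_size) * 0.01  # softmax parameters, theta_t per word

def p_target_given_context(c):
    """p(t | c) = exp(theta_t^T e_c) / sum_j exp(theta_j^T e_c), for every t at once."""
    e_c = E[:, c]                       # embedding lookup for the context word
    logits = theta.T @ e_c              # theta_t^T e_c for all 10,000 words t
    logits -= logits.max()              # for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()              # note: the sum runs over the whole vocabulary

probs = p_target_given_context(c=42)    # distribution over all 10,000 possible targets
```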
#### problem with softmax classification

- problem: computational speed (the softmax denominator sums over the entire 10,000-word vocabulary)
- sol: hierarchical softmax classifier, a tree of binary classifiers with cost O(log |V|) instead of O(|V|)
e.g. is it in the first 5,000 words? the first 2,500? ...
## Negative Sampling

- negative examples: for each positive (context, target) pair, pick k words at random from the dictionary to create k new examples and label them 0
** smaller datasets: k = 5-20; larger datasets: k = 2-5

- turns one 10,000-way softmax into 10,000 independent binary classification problems (see the sketch below)
- on each training example only k + 1 of those binary classifiers are updated
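
A numpy sketch of one negative-sampling update set: one positive (context, target) pair plus k random negatives become k + 1 logistic (sigmoid) problems; E, theta and the word indices are stand-ins:

```python
import numpy as np

vocab_size, n_features, k = 10_000, 300, 5
E = np.random.randn(n_features, vocab_size) * 0.01
theta = np.random.randn(n_features, vocab_size) * 0.01

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

context, positive_target = 42, 777                      # one observed (c, t) pair
negatives = np.random.randint(vocab_size, size=k)       # k random words labeled 0

e_c = E[:, context]
words = np.concatenate(([positive_target], negatives))  # only these k+1 classifiers
labels = np.array([1] + [0] * k)                        # are touched on this example

preds = sigmoid(theta[:, words].T @ e_c)                # P(label = 1 | c, word) for each
loss = -np.mean(labels * np.log(preds) + (1 - labels) * np.log(1 - preds))
```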
### selecting negative examples

f(w_i): the observed frequency of word w_i in the training corpus; sample negatives with P(w_i) = f(w_i)^(3/4) / sum_j f(w_j)^(3/4), a compromise between the uniform distribution and sampling by raw frequency
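
A sketch of that heuristic sampling distribution, P(w_i) = f(w_i)^(3/4) / sum_j f(w_j)^(3/4), with made-up word counts:

```python
import numpy as np

counts = np.array([5_000, 1_200, 300, 40, 3])   # made-up corpus frequencies f(w_i)
p = counts ** 0.75                               # f(w_i)^(3/4)
p = p / p.sum()                                  # P(w_i) = f(w_i)^(3/4) / sum_j f(w_j)^(3/4)

negatives = np.random.choice(len(counts), size=5, p=p)   # draw k negative words
```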
## GloVe

X_ij: # of times word j appears in the context of word i (counting context words on both the left and the right makes X symmetric)

- objective: minimize sum_{i,j} f(X_ij) (theta_i^T e_j + b_i + b'_j - log X_ij)^2
- f: a frequency-based weighting term
f(x) = 0 if x = 0, so the sum runs only over the pairs of words that have co-occurred at least once in that context-target relationship (and log 0 never appears); f also keeps very frequent words from dominating while still giving rare words a meaningful weight
- theta, e play symmetric roles => e(final) = (e + theta)/2
- note: we can't guarantee that the axes learned this way align with easily human-interpretable features (i.e., an individual dimension of the vector does not necessarily correspond to one of the human-understandable features described earlier, but the earlier intuition is still a useful way to think about the algorithm)
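
A numpy sketch of the objective above with a made-up co-occurrence matrix; the particular capped weighting f(x) = min(1, (x / x_max)^0.75) is an assumption (the notes only require f(0) = 0 and sensible weights for frequent vs. rare words):

```python
import numpy as np

V, d = 50, 8                                           # tiny vocab and embedding size
X = np.random.poisson(0.3, size=(V, V)).astype(float)  # made-up co-occurrence counts X_ij

theta = np.random.randn(V, d) * 0.01
e = np.random.randn(V, d) * 0.01
b, b_prime = np.zeros(V), np.zeros(V)                  # bias terms b_i, b'_j

def glove_loss():
    # Weighting f: 0 when X_ij == 0 (those pairs are skipped), capped for frequent pairs.
    f = np.where(X > 0, np.minimum(1.0, (X / 100.0) ** 0.75), 0.0)
    logX = np.log(np.where(X > 0, X, 1.0))             # placeholder 1.0: those terms vanish via f
    diff = theta @ e.T + b[:, None] + b_prime[None, :] - logX  # theta_i^T e_j + b_i + b'_j - log X_ij
    return np.sum(f * diff ** 2)

# After training (gradient descent on glove_loss), theta and e are symmetric, so:
e_final = (e + theta) / 2
```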
## sentiment classification
- telling whether someone liked something or not from a piece of text (e.g. predicting a star rating from a review)
- labeled sentiment datasets are often small, so pretrained word embeddings help
### simple sentiment classification model

idea:
- sum or average the embedding vectors of all the words in the text
- feed the result to a softmax to generate the prediction (e.g. 1-5 stars)
problem:
- ignores word order ("completely lacking in good taste" still looks positive because of "good")
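
A numpy sketch of this simple model with stand-in embeddings and a 5-way (star-rating) softmax; it also shows the word-order problem:

```python
import numpy as np

vocab = {"completely": 0, "lacking": 1, "in": 2, "good": 3, "taste": 4}
E = np.random.randn(300, len(vocab)) * 0.01            # stand-in embedding matrix
W, b = np.random.randn(5, 300) * 0.01, np.zeros(5)     # softmax over 5 star ratings

def predict(review):
    # Average the embeddings of all words, then apply the softmax classifier.
    e_avg = np.mean([E[:, vocab[w]] for w in review.split()], axis=0)
    logits = W @ e_avg + b
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()                              # P(1 star), ..., P(5 stars)

print(predict("completely lacking in good taste"))     # word order is lost: "good" still counts
```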
### RNN for sentiment classification

- the word embeddings can be trained on a much larger unlabeled corpus and then reused here
- so the model generalizes even to words that never appeared in the labeled training set
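
A hedged Keras sketch of a many-to-one RNN sentiment classifier on top of frozen, transferred embeddings (random stand-ins here):

```python
import numpy as np
import tensorflow as tf

vocab_size, emb_dim = 10_000, 300
pretrained = np.random.randn(vocab_size, emb_dim)   # stand-in for embeddings trained on a big corpus

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(
        vocab_size, emb_dim,
        embeddings_initializer=tf.keras.initializers.Constant(pretrained),
        trainable=False),                           # keep the transferred embeddings frozen
    tf.keras.layers.LSTM(128),                      # many-to-one: read the whole review in order
    tf.keras.layers.Dense(5, activation="softmax"), # 1-5 star rating
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# model.fit(x_train, y_train, ...)  # x_train: padded sequences of word indices
```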
## Debiasing Word Embeddings
problem:
- Word embeddings can reflect the gender, ethnicity, age, sexual orientation, and other biases of the text used to train the model

Steps:
1. Identify the bias direction to address (i.e., decide which bias to remove; e.g. the gender direction, obtained by averaging differences such as e_he - e_she).
2. Neutralize every non-definitional word (e.g. doctor) by removing its component along the bias axis, so that these words show no difference along it (definitional words such as mother and father carry gender by definition rather than by bias).
3. Equalize pairs: move each definitional pair so that both words are equidistant from the non-bias axis; this makes their distances to potentially biased words equal (e.g. grandmother and grandfather end up at the same distance from babysitter).
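
A numpy sketch of steps 1 and 2 above (finding a bias direction and neutralizing a non-definitional word); `embeddings` is a hypothetical dict of word vectors:

```python
import numpy as np

def bias_direction(pairs, embeddings):
    """Step 1: average differences such as e_he - e_she to get the bias axis g."""
    g = np.mean([embeddings[a] - embeddings[b] for a, b in pairs], axis=0)
    return g / np.linalg.norm(g)

def neutralize(word, g, embeddings):
    """Step 2: remove the component of e_word along g, so the word is neutral on that axis."""
    e = embeddings[word]
    e_bias = np.dot(e, g) * g          # projection of e onto the bias direction
    return e - e_bias                  # orthogonal to g: no component along the bias axis

# Usage (embeddings assumed to map words to vectors):
# g = bias_direction([("he", "she"), ("male", "female")], embeddings)
# embeddings["doctor"] = neutralize("doctor", g, embeddings)
```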