# NLP Assignment I
## Artemii Bykov, DS-01
###### tags: `NLP`, `ML`, `Word2Vec`, `Embedding`
#### Link to [Colab](https://colab.research.google.com/drive/1Z7xHOk5G4kjMMHfjEWYgFnxqTVjY-ABo)
[ToC]
### Used technologies
* Python3
* Numpy (Matrix operations)
* NLTK (Reuters, Tokenizer)
* SciPy (Softmax)
### Introduction
I have implemented **Word2Vec** using the **Skip-Gram** (**SG** from here on) approach. Why not **Continuous Bag-Of-Words (CBOW)**? Because **SG** works well with a *small amount of training data* and represents even rare words and phrases well. That suits my setup, because I use only a part of **the Reuters dataset**: **1000 sentences** and **3709 unique words**.
### Hyperparameters
| Hyperparameter | Value |
| -------- | -------- |
| Window size | 2 |
| Embedding dimension | 10 |
| Number of epochs | 200 |
| Learning rate | 0.01 |
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hyperparameters:
    window_size: int = 2
    dimensions: int = 10
    epochs: int = 200
    learning_rate: float = 0.01
```
### Data preparation/generation
As mentioned above, I use **the Reuters dataset**, but only a part of it, because training on the whole dataset would take days. The **window size** is 2, so each **data row** consists of *1 target* word and *4 context* words (2 from each side), except when the target word is near the beginning or end of a sentence, in which case it has fewer context words.
```python
from dataclasses import dataclass
from typing import List


@dataclass
class DataRow:
    target: List[int]          # one-hot vector of the target word
    context: List[List[int]]   # one-hot vectors of the surrounding context words
```
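For illustration, here is what one data row would contain for a toy sentence (a sketch, not taken from the notebook):
```python
# Toy illustration of one window with window_size = 2 (not real Reuters data).
sentence = ["the", "cat", "sat", "on", "the", "mat"]
# For the target "sat" (index 2) the context words are:
#   ["the", "cat", "on", "the"]
# Both target and context are stored as one-hot vectors of length vocabulary_size.
```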
#### Subsampling frequent words
I perform subsampling of frequent words. Let's call $p_i$ the proportion of word $w_i$ in the corpus. Then the probability $P(w_i)$ of keeping the word in the corpus is defined as follows:
$$
P(w_i) = \frac{10^{-3}}{p_i}(\sqrt{10^3p_i} + 1)
$$
```python
import math
import random
from typing import List

from nltk import FreqDist


def subsample_frequent_words(sentences: List[List[str]]) -> List[List[str]]:
    subsampled_sentences: List[List[str]] = []
    freq_dist = FreqDist(token for sentence in sentences for token in sentence)
    total_amount_of_words = sum(freq_dist.values())
    # Relative frequency p_i of every word in the corpus.
    freq_dist = {word: count / total_amount_of_words for word, count in freq_dist.items()}
    for sentence in sentences:
        subsampled_sentences.append([])
        for word in sentence:
            # Keep the word with probability P(w_i) from the formula above.
            if random.random() < (1 + math.sqrt(freq_dist[word] * 1e3)) * 1e-3 / freq_dist[word]:
                subsampled_sentences[-1].append(word)
    return subsampled_sentences
```
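For context, a minimal sketch of how the input sentences could be obtained from the NLTK Reuters corpus mentioned above; the exact preprocessing in the Colab notebook may differ, so the lower-casing and token filtering here are assumptions:
```python
import nltk
from nltk.corpus import reuters

nltk.download("reuters")

# Assumption: take the first 1000 pre-tokenized sentences, lower-case them,
# keep alphabetic tokens only, then subsample frequent words.
sentences = [
    [token.lower() for token in sentence if token.isalpha()]
    for sentence in reuters.sents()[:1000]
]
sentences = subsample_frequent_words(sentences)
```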
#### One-hot encoding
I used **one-hot encoding** to represent each *word*, so every *word* becomes a **vector** whose **length** equals the **vocabulary size**. It is not memory-efficient, but it is a very straightforward approach.
```python
import numpy as np


def word2onehot(word: str) -> np.ndarray:
    # vocabulary_size and word_to_index are module-level globals built from the
    # corpus (see the sketch below).
    vector = np.zeros(vocabulary_size)
    vector[word_to_index[word]] = 1
    return vector
```
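The helper above relies on `vocabulary_size` and `word_to_index`, which are not shown in this write-up; a minimal sketch of how they could be built from the subsampled sentences (the `index_to_word` reverse mapping is my own addition for later lookups):
```python
# Build the vocabulary from the (subsampled) corpus.
vocabulary = sorted({token for sentence in sentences for token in sentence})
vocabulary_size = len(vocabulary)

# Word <-> index mappings used by word2onehot and by similarity lookups.
word_to_index = {word: index for index, word in enumerate(vocabulary)}
index_to_word = {index: word for word, index in word_to_index.items()}
```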
#### Training data generator
Unfortunately, I ran into a memory problem. When the **vocabulary size** is big enough, the whole **training data** set occupies gigabytes of memory (thanks to **one-hot encoding**). The obvious solution is not to store the whole **training data** set in RAM, so I used the nice Python concept of **generators** to produce training rows lazily, one at a time.
```python
from typing import Generator, List


def training_data(sentences: List[List[str]]) -> Generator[DataRow, None, None]:
    for sentence in sentences:
        for index, word in enumerate(sentence):
            data_row = DataRow(target=word2onehot(word), context=[])
            context: List[np.ndarray] = []
            # Collect up to window_size words on each side of the target.
            for jndex in range(
                index - Hyperparameters.window_size,
                index + Hyperparameters.window_size + 1,
            ):
                if jndex != index and 0 <= jndex < len(sentence):
                    context.append(word2onehot(sentence[jndex]))
            data_row.context = np.array(context)
            yield data_row
```
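A quick illustrative check of what the generator yields (shapes only; `sentences` as built above):
```python
# Each yielded row pairs one one-hot target with up to 2 * window_size
# one-hot context vectors.
first_row = next(training_data(sentences))
print(first_row.target.shape)   # (vocabulary_size,)
print(first_row.context.shape)  # (number_of_context_words, vocabulary_size)
```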
### Training process
#### Bird's-eye overview

#### Forward propagation
Simple matrix multiplications, with **softmax** applied to turn the output layer into a probability distribution over the vocabulary (each element between 0 and 1):
```python
from typing import Tuple

from scipy.special import softmax


def forward_pass(embeddings, context, vector) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:
    hidden_layer = np.dot(embeddings.T, vector)      # (dimensions,): the target word's embedding
    output_layer = np.dot(context.T, hidden_layer)   # (vocabulary_size,): one score per word
    prob_calculated = softmax(output_layer)          # scores turned into probabilities
    return prob_calculated, hidden_layer, output_layer
```
#### Error calculation
The error is calculated by summing up the differences between the calculated probabilities and each of the one-hot context words:
```python
error = np.sum(
[
np.subtract(prob_calculated, word)
for word in data_row.context
],
axis=0
)
```
#### Backpropagation
We use plain **gradient descent** to compute how much each of the two weight matrices needs to be adjusted:
```python
def backpropagation(embeddings, context, hidden_layer, error, vector):
    # Gradient w.r.t. the input->hidden matrix: outer(one-hot target, context @ error).
    delta_embeddings = np.outer(vector, np.dot(context, error.T))
    # Gradient w.r.t. the hidden->output matrix: outer(hidden layer, error).
    delta_context = np.outer(hidden_layer, error)
    embeddings = embeddings - (Hyperparameters.learning_rate * delta_embeddings)
    context = context - (Hyperparameters.learning_rate * delta_context)
    return embeddings, context
```
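In the notation of the code above (one-hot target $\mathbf{x}$, hidden layer $\mathbf{h}$, summed error $\mathbf{e}$, embedding matrix $W$, context matrix $W'$, learning rate $\eta$), these two updates are the usual skip-gram gradient-descent steps:
$$
W' \leftarrow W' - \eta\,(\mathbf{h}\otimes\mathbf{e}),
\qquad
W \leftarrow W - \eta\,\big(\mathbf{x}\otimes(W'\mathbf{e})\big)
$$
where $\otimes$ denotes the outer product, so only the row of $W$ corresponding to the target word actually changes.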
#### Loss function
I use the **loss function** only to check whether the algorithm is converging. The loss function is taken from [here](https://arxiv.org/pdf/1411.2738.pdf).
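Concretely, the value accumulated per data row in the train loop below (the `output_sum` term plus the log-sum-exp term) corresponds to the skip-gram loss
$$
E = -\sum_{c=1}^{C} u_{j^*_c} + C \cdot \log \sum_{j'=1}^{V} \exp(u_{j'})
$$
where $u$ is the output layer, $j^*_c$ is the index of the $c$-th actual context word, $C$ is the number of context words, and $V$ is the vocabulary size.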
#### Train loop
1. Perform forward propagation
2. Calculate Error
3. Update weights using Backpropagation
4. Calculate Loss
```python
def train(sentences: List[List[str]]):
    # Randomly initialise the input->hidden (embeddings) and hidden->output (context) matrices.
    embeddings = np.random.uniform(-1, 1, (vocabulary_size, Hyperparameters.dimensions))
    context = np.random.uniform(-1, 1, (Hyperparameters.dimensions, vocabulary_size))
    for epoch in range(1, Hyperparameters.epochs + 1):
        try:
            loss = 0
            for data_row in training_data(sentences):
                if len(data_row.context) == 0:
                    continue
                # 1. Forward propagation.
                prob_calculated, hidden_layer, output_layer = forward_pass(
                    embeddings, context, data_row.target,
                )
                # 2. Error: sum of differences against every context word.
                error = np.sum([np.subtract(prob_calculated, word) for word in data_row.context], axis=0)
                # 3. Update the weights via backpropagation.
                embeddings, context = backpropagation(embeddings, context, hidden_layer, error, data_row.target)
                # 4. Accumulate the loss (see the formula above).
                output_sum = -np.sum(
                    [output_layer[np.where(context_word == 1)[0][0]] for context_word in data_row.context]
                )
                loss += output_sum + len(data_row.context) * np.log(np.sum(np.exp(output_layer)))
            print('Epoch:', epoch, "Loss:", loss)
            yield embeddings, context
        except KeyboardInterrupt:
            # Allow stopping a long run from the notebook: yield a sentinel and stop the generator.
            yield -1, -1
            return
```
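Finally, a minimal sketch of how the trained embeddings can be queried with **cosine similarity**; this helper (`most_similar`, together with the `index_to_word` mapping from earlier) is my own addition for illustration and is not part of the original notebook:
```python
def most_similar(word: str, embeddings: np.ndarray, top_n: int = 3):
    """Return the top_n vocabulary words closest to `word` by cosine similarity."""
    if word not in word_to_index:
        return None  # out-of-vocabulary words are simply not handled
    target_vector = embeddings[word_to_index[word]]
    # Cosine similarity between the target embedding and every row of the embedding matrix.
    norms = np.linalg.norm(embeddings, axis=1) * np.linalg.norm(target_vector)
    similarities = (embeddings @ target_vector) / norms
    ranked = np.argsort(-similarities)
    return [index_to_word[i] for i in ranked if i != word_to_index[word]][:top_n]


# Example usage (after training):
#   *_, (trained_embeddings, trained_context) = train(sentences)
#   most_similar("tomato", trained_embeddings)
```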
### Answers to the questions
1. Yes, I have enough data, but using the whole dataset would lead to a very long training process. Possible improvements:
    * Using **negative sampling** to significantly reduce the number of weight adjustments
2. If a word is not in the vocabulary, just return None.
3. This is a basic and straightforward implementation of Word2Vec trained on a fairly small dataset, which is why only simple lexical-semantic relations were learned, for example (using **cosine similarity**; see the sketch after the train loop above):
* tomato -> purees, peanuts
* debit -> avg, bonus, revs
* democratic -> liberal
* advantage -> bonus
4. BATS testing: only the **exact synonyms** and **verb tenses** tests give non-zero accuracy (1-3%). Why? Not enough training data was used.