# NLP Assignment I

## Artemii Bykov, DS-01

###### tags: `NLP`, `ML`, `Word2Vec`, `Embedding`

#### Link to [Colab](https://colab.research.google.com/drive/1Z7xHOk5G4kjMMHfjEWYgFnxqTVjY-ABo)

[ToC]

### Used technologies

* Python3
* Numpy (matrix operations)
* NLTK (Reuters corpus, tokenizer)
* SciPy (softmax)

### Introduction

I implemented **Word2Vec** using the **Skip-Gram** (**SG** below) approach. Why not **CBOW (Continuous Bag-Of-Words)**? Because **SG** works well with a *small amount of training data* and represents even rare words and phrases reasonably well. This suits my setup, since I use only part of **The Reuters Dataset**: **1000 sentences** with **3709 unique words**.

### Hyperparameters

| Hyperparameter | Value |
| -------- | -------- |
| Window size | 2 |
| Embedding dimension | 10 |
| Number of epochs | 200 |
| Learning rate | 0.01 |

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hyperparameters:
    window_size: int = 2
    dimensions: int = 10
    epochs: int = 200
    learning_rate: float = 0.01
```

### Data preparation/generation

As mentioned above, I use only part of **The Reuters Dataset**, because training on the whole dataset would take days. The **window size** is 2, therefore each **data row** consists of *1 target* word and up to *4 context* words (2 from each side), excluding the positions where the *target word* is at the beginning or at the end of the sentence.

```python
from typing import List


@dataclass
class DataRow:
    target: List[int]
    context: List[List[int]]
```

#### Subsampling frequent words

I perform subsampling of frequent words. Let's call $p_i$ the proportion of $word_i$ in the corpus. Then the probability $P(w_i)$ of keeping the word in the corpus is defined as follows:

$$
P(w_i) = \frac{10^{-3}}{p_i}\left(\sqrt{10^3 p_i} + 1\right)
$$

```python
import math
import random

from nltk import FreqDist


def subsample_frequent_words(sentences: List[List[str]]) -> List[List[str]]:
    subsampled_sentences: List[List[str]] = []
    freq_dist = FreqDist([token for sentence in sentences for token in sentence])
    total_amount_of_words = sum(list(freq_dist.values()))
    freq_dist = {word: freq_dist[word] / float(total_amount_of_words) for word in freq_dist.keys()}
    for sentence in sentences:
        subsampled_sentences.append([])
        for word in sentence:
            # Keep the word with probability P(w_i): frequent words are dropped more often.
            if random.random() < (1 + math.sqrt(freq_dist[word] * 1e3)) * 1e-3 / float(freq_dist[word]):
                subsampled_sentences[-1].append(word)
    return subsampled_sentences
```

#### Onehot encoding

I use **onehot encoding** to represent each *word*, so every *word* becomes a **vector** whose **length** equals the **vocabulary size**. It is not very memory-efficient, but it is a very straightforward representation.

```python
import numpy as np


def word2onehot(word: str) -> np.ndarray:
    vector = np.zeros(vocabulary_size)
    vector[word_to_index[word]] = 1
    return vector
```
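The snippet above relies on a `word_to_index` mapping and a `vocabulary_size` that are built elsewhere in the notebook. A minimal sketch of how they could be constructed (the `build_vocabulary` helper name is mine, not from the original code):

```python
from typing import Dict, List


def build_vocabulary(sentences: List[List[str]]) -> Dict[str, int]:
    """Assign a stable index to every unique token in the (subsampled) corpus."""
    unique_words = sorted({token for sentence in sentences for token in sentence})
    return {word: index for index, word in enumerate(unique_words)}


# Hypothetical usage, producing the globals that word2onehot relies on:
# word_to_index = build_vocabulary(subsample_frequent_words(sentences))
# vocabulary_size = len(word_to_index)
```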
#### Training data generator

Unfortunately, I ran into a memory problem: when the **vocabulary size** is big enough, the whole **training data** occupies GBs of memory (thanks to **onehot encoding**). The obvious solution is not to store the whole **training data** in RAM. I used the nice Python concept of **generators** to produce training data lazily, one data row at a time.

```python
from typing import Generator


def training_data(sentences: List[List[str]]) -> Generator[DataRow, None, None]:
    for sentence in sentences:
        for index, word in enumerate(sentence):
            data_row = DataRow(target=np.array(word2onehot(word)), context=[])
            context: List[List[int]] = []
            for jndex in range(
                index - Hyperparameters.window_size,
                index + Hyperparameters.window_size + 1,
            ):
                # Skip the target position itself and indices at or beyond the sentence boundaries.
                if jndex != index and len(sentence) - 1 > jndex > 0:
                    context.append(word2onehot(sentence[jndex]))
            data_row.context = np.array(context)
            yield data_row
```

### Training process

#### Bird's-eye overview

![](https://i.imgur.com/pi1GCxw.png)

#### Forward propagation

Simple matrix multiplications, with **softmax** forcing each output element into the range between 0 and 1:

```python
from typing import Tuple

from scipy.special import softmax


def forward_pass(embeddings, context, vector) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:
    hidden_layer = np.dot(embeddings.T, vector)     # the target word's embedding
    output_layer = np.dot(context.T, hidden_layer)  # a score for every word in the vocabulary
    prob_calculated = softmax(output_layer)
    return prob_calculated, hidden_layer, output_layer
```

#### Error calculation

The error is calculated by summing up the differences between the calculated probabilities and each of the context words:

```python
error = np.sum(
    [
        np.subtract(prob_calculated, word)
        for word in data_row.context
    ],
    axis=0,
)
```

#### Backpropagation

We use simple **gradient descent** to calculate the adjustments for both weight matrices:

```python
def backpropagation(embeddings, context, hidden_layer, error, vector):
    delta_embeddings = np.outer(vector, np.dot(context, error.T))
    delta_context = np.outer(hidden_layer, error)
    embeddings = embeddings - (Hyperparameters.learning_rate * delta_embeddings)
    context = context - (Hyperparameters.learning_rate * delta_context)
    return embeddings, context
```

#### Loss function

I use the **loss function** only to check whether the algorithm is converging. The loss function is taken from [here](https://arxiv.org/pdf/1411.2738.pdf); its exact expression is written out after the train loop below.

#### Train loop

1. Perform forward propagation
2. Calculate the error
3. Update the weights using backpropagation
4. Calculate the loss

```python
def train():
    embeddings = np.random.uniform(-1, 1, (vocabulary_size, Hyperparameters.dimensions))
    context = np.random.uniform(-1, 1, (Hyperparameters.dimensions, vocabulary_size))
    for epoch in range(1, Hyperparameters.epochs + 1):
        try:
            loss = 0
            # `sentences` (the preprocessed corpus) is assumed to be available in scope,
            # just like `vocabulary_size` above.
            for data_row in training_data(sentences):
                if len(data_row.context) == 0:
                    continue
                prob_calculated, hidden_layer, output_layer = forward_pass(
                    embeddings,
                    context,
                    data_row.target,
                )
                error = np.sum([np.subtract(prob_calculated, word) for word in data_row.context], axis=0)
                embeddings, context = backpropagation(embeddings, context, hidden_layer, error, data_row.target)
                output_sum = -np.sum(
                    [output_layer[np.where(context_word == 1)[0][0]] for context_word in data_row.context]
                )
                loss += output_sum + len(data_row.context) * np.log(np.sum(np.exp(output_layer)))
            print('Epoch:', epoch, 'Loss:', loss)
            # Yield the current weights so training can be inspected or stopped at any epoch.
            yield embeddings, context
        except KeyboardInterrupt:
            yield -1, -1
```
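For reference, the `loss` accumulated in the loop above is the skip-gram objective for a multi-word context from the paper linked in the *Loss function* section; this is my transcription of it, with $u$ denoting `output_layer`, $C$ the number of context words, $j_c^*$ the vocabulary index of the $c$-th actual context word, and $V$ the vocabulary size:

$$
E = -\sum_{c=1}^{C} u_{j_c^*} + C \cdot \log \sum_{j'=1}^{V} \exp(u_{j'})
$$

The first term is `output_sum` in the code, and the second term is `len(data_row.context) * np.log(np.sum(np.exp(output_layer)))`.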
### Answers to the questions

1. Yes, I have enough data, but using the whole dataset would lead to a very long training process. Possible improvement:
   * use **negative sampling** to significantly reduce the number of weight adjustments per update.
2. If a word is not in the vocabulary, I just return None (see the sketch after this list).
3. This is a basic, straightforward implementation of Word2Vec trained on a small dataset, which is why only simple lexical semantic relations were learned, for example (using **cosine similarity**):
   * tomato -> purees, peanuts
   * debit -> avg, bonus, revs
   * democratic -> liberal
   * advantage -> bonus
4. BATS testing: only the **exact synonyms** and **verb tenses** tests give non-zero accuracy (1-3%). Why? Not enough training data was used.
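A minimal sketch of the lookup behind answers 2 and 3, assuming only the `embeddings` matrix and the `word_to_index` mapping from the code above (the `most_similar` name and the `top_n` parameter are illustrative, not from the original notebook):

```python
from typing import Dict, List, Optional, Tuple

import numpy as np


def most_similar(
    word: str,
    embeddings: np.ndarray,
    word_to_index: Dict[str, int],
    top_n: int = 5,
) -> Optional[List[Tuple[str, float]]]:
    """Return the top_n nearest words by cosine similarity, or None for out-of-vocabulary words."""
    if word not in word_to_index:
        return None  # answer 2: unknown words simply yield None
    index_to_word = {index: token for token, index in word_to_index.items()}
    query = embeddings[word_to_index[word]]
    # Cosine similarity between the query embedding and every row of the embedding matrix.
    similarities = embeddings @ query / (np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query))
    ranked = np.argsort(-similarities)
    return [(index_to_word[i], float(similarities[i])) for i in ranked if i != word_to_index[word]][:top_n]
```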