# NLP Assignment II
## Artemii Bykov, DS-01
###### tags: `NLP`, `ML`, `Supervised learning`, `RNN`, `LSTM`
#### Link to [Colab](https://colab.research.google.com/drive/1-RTpFsVrqv-8zF71Z-T5QJp3xqIeVnvM)
[ToC]
### Used technologies
* Python3
* `conllu` package for parsing **.conllu** files
* PyTorch (layers, optimizer, loss function, LogSoftmax)
### Introduction
I have developed a **Part-of-Speech Tagger** using an **LSTM RNN**. Why not use a simple **feed-forward NN**? In the PoS tagging problem, context (especially the previous states) is the core: the same word can be a different part of speech depending on its surroundings, so **context matters**. As we studied, and as the picture below shows, the concept of **forget gates** lets an LSTM carry information from previous states forward and use it to predict the current tag.

### Hyperparameters
| Hyperparameter | Value |
| -------- | -------- |
| Embedding dimension size | 32 |
| Hidden dimension size | 32 |
| Number of epochs | 300 |
| Learning rate | 0.1 |
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hyperparameter:
    embedding_dim: int = 32
    hidden_dim: int = 32
    epochs: int = 300
    learning_rate: float = 0.1
```
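The training code later in this report reads these values from a single `hyperparameters` instance; a minimal way to create it (the notebook may construct it differently):
```python
# Instance referenced as `hyperparameters` in the training loop below
hyperparameters = Hyperparameter()
```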
### Data preparation
I use the **Universal Dependencies** version of the syntax annotations from the **GUM (Georgetown University Multilayer) corpus**.
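The loading step is not shown in this report; below is a minimal sketch of how the corpus could be parsed with the `conllu` package (the file names are my assumption, based on the standard UD GUM release):
```python
# Hedged sketch: file names are assumed from the standard UD GUM release,
# not taken from the notebook.
from conllu import parse

with open('en_gum-ud-train.conllu', encoding='utf-8') as train_file:
    train_sentences = parse(train_file.read())
with open('en_gum-ud-test.conllu', encoding='utf-8') as test_file:
    test_sentences = parse(test_file.read())

print(len(train_sentences), len(test_sentences))  # 3753 and 890 sentences
```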
| Statistic | Value |
| -------- | -------- |
| Number of different tags | 17 |
| Number of unique words | 9498 |
| Number of sentences in the train part | 3753 |
| Number of words in the train part | 69760 |
| Number of sentences in the test part | 890 |
| Number of words in the test part | 15924 |
#### Distribution of tags in the dataset:

#### Tags/Words mapping
The methods below build a mapping from a human-readable **tag name** to an **integer index**; the same is done for words:
```python
# typing helpers used in the data-preparation functions below
from typing import Dict, List, Set, Tuple


def create_tags_mapping(sentences: List) -> Dict[str, int]:
    tags: Set[str] = set()
    for sentence in sentences:
        for token in sentence:
            tags.add(token['upostag'])
    return {tag: index for index, tag in enumerate(tags)}


def create_words_mapping(
    train_sentences: List, test_sentences: List,
) -> Dict[str, int]:
    unique_words: Set[str] = set()
    for sentence in train_sentences:
        for token in sentence:
            unique_words.add(token['lemma'])
    for sentence in test_sentences:
        for token in sentence:
            unique_words.add(token['lemma'])
    return {word: index for index, word in enumerate(unique_words)}
```
These methods encode words/tags into tensors using the built mappings, and decode them back:
```python
import torch


def encode_data(
    data: List[str], mapping: Dict[str, int],
) -> torch.Tensor:
    return torch.tensor(
        [mapping[token] for token in data],
        dtype=torch.long,
    )


def decode_data(encoded_data: List[int], mapping: Dict[int, str]) -> List[str]:
    return [mapping[encoded] for encoded in encoded_data]
```
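For completeness, a small usage sketch: building both mappings and the inverted tag mapping that `decode_data` expects (the names `words_mapping` and `tag_to_index_mapping` come from the training code below; the inverted mapping is my own addition for illustration):
```python
tag_to_index_mapping = create_tags_mapping(train_sentences)
words_mapping = create_words_mapping(train_sentences, test_sentences)

# Inverted mapping (index -> tag) used by decode_data; assumed helper, built here for illustration
index_to_tag_mapping = {index: tag for tag, index in tag_to_index_mapping.items()}
```
Note that `create_words_mapping` indexes lemmas from both the train and the test parts, so there are no out-of-vocabulary lemmas at evaluation time.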
This method builds the dataset from the parsed **.conllu** files. The dataset has the following format: `List[Tuple[List[str], List[str]]]`, i.e. the **1st element** of each Tuple is the **list of words (lemmas)** and the **2nd element** is the **list of corresponding tags**:
```python
def create_dataset(sentences: List) -> List[Tuple[List[str], List[str]]]:
    result: List[Tuple[List[str], List[str]]] = []
    for sentence in sentences:
        tokens: List[str] = []
        tags: List[str] = []
        for token in sentence:
            tokens.append(token['lemma'])
            tags.append(token['upostag'])
        result.append((tokens, tags))
    return result
```
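A short usage sketch, assuming the `train_sentences`/`test_sentences` from the loading step above; this produces the `train_dataset` iterated over in the training loop:
```python
train_dataset = create_dataset(train_sentences)
test_dataset = create_dataset(test_sentences)  # named by analogy; used for evaluation later
```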
### Training process
#### Bird's-eye overview:

```python
import torch
from torch import nn
import torch.nn.functional as F


class PartOfSpeechTagger(nn.Module):
    def __init__(
        self,
        embedding_dim: int,
        hidden_dim: int,
        number_of_unique_words: int,
        number_of_unique_tags: int,
    ) -> None:
        super().__init__()
        self.hidden_dim = hidden_dim
        self.word_embeddings = nn.Embedding(
            number_of_unique_words, embedding_dim,
        )
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)
        self.linear = nn.Linear(hidden_dim, number_of_unique_tags)

    def forward(self, sentence: torch.Tensor) -> torch.Tensor:
        embeddings = self.word_embeddings(sentence)
        lstm_output, _ = self.lstm(embeddings.view(len(sentence), 1, -1))
        tags = self.linear(lstm_output.view(len(sentence), -1))
        tags_probabilities = F.log_softmax(tags, dim=1)
        return tags_probabilities
```
#### Layers (Forward propagation):
* **Embedding** - in fact a simple look-up table that retrieves a word embedding by index. **List of indices -> Corresponding word embeddings**
* **LSTM** - **Word embeddings -> LSTM output (hidden state per token)**
* **Linear** - a linear transformation of the incoming data; we use it to map from the hidden state space to the tag space. **Hidden state -> Tag space**
* **LogSoftmax** - **Tag space -> Tag scores (log-probabilities)**
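To make the data flow concrete, here is a small shape-tracing sketch (my own illustration, not from the notebook): with the hyperparameters above, a sentence of N tokens produces an (N, 32) embedding matrix, an (N, 1, 32) LSTM output, and an (N, 17) matrix of tag log-probabilities.
```python
# Illustration only: trace tensor shapes through the tagger with dummy word indices
dummy_sentence = torch.randint(0, 9498, (6,))  # 6 random word indices
tagger = PartOfSpeechTagger(
    embedding_dim=32, hidden_dim=32,
    number_of_unique_words=9498, number_of_unique_tags=17,
)
print(tagger(dummy_sentence).shape)  # torch.Size([6, 17]) - one row of log-probabilities per token
```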
#### Others:
* **Loss function** - the negative log likelihood loss (NLLLoss), a good choice for training a classification problem
* **Optimizer** - stochastic gradient descent (SGD)
#### Training loop:
1. Model initialization
2. Loss function initialization
3. Optimizer initialization
4. For each $epoch_i$ and each training pair $(sentence_j,\ tags_j)$:
    * Clear the accumulated gradients
    * Transform the sentence into a tensor of word indices
    * Forward propagation
    * Compute the loss and its gradient
    * Update the network parameters
```python
from torch import optim


model = PartOfSpeechTagger(
    hyperparameters.embedding_dim,
    hyperparameters.hidden_dim,
    len(words_mapping),
    len(tag_to_index_mapping),
)
loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=hyperparameters.learning_rate)

for epoch in range(hyperparameters.epochs):
    for sentence, tags in train_dataset:
        model.zero_grad()
        encoded_sentence = encode_data(sentence, words_mapping)
        encoded_tags = encode_data(tags, tag_to_index_mapping)
        tags_probabilities = model(encoded_sentence)
        loss = loss_function(tags_probabilities, encoded_tags)
        loss.backward()
        optimizer.step()
    # MODEL_DUMP is the checkpoint path defined elsewhere in the notebook
    torch.save(model, MODEL_DUMP)
    print(f'Epoch: {epoch}. Loss: {loss}')
```
### Analysis
#### Accuracy:
| Target | Accuracy |
| -------- | -------- |
| Train, per word | 92% |
| Train, whole sentences fully correct | 41% |
| Test, per word | 82% |
| Test, whole sentences fully correct | 15% |
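The evaluation code is not shown in this report; below is a minimal sketch of how the per-word and per-sentence accuracies could be computed (my own reconstruction, the notebook may differ):
```python
def evaluate(model, dataset, words_mapping, tag_to_index_mapping):
    correct_words = total_words = correct_sentences = 0
    with torch.no_grad():
        for sentence, tags in dataset:
            encoded_sentence = encode_data(sentence, words_mapping)
            encoded_tags = encode_data(tags, tag_to_index_mapping)
            predictions = model(encoded_sentence).argmax(dim=1)
            correct_words += (predictions == encoded_tags).sum().item()
            total_words += len(tags)
            correct_sentences += int(torch.equal(predictions, encoded_tags))
    return correct_words / total_words, correct_sentences / len(dataset)


word_accuracy, sentence_accuracy = evaluate(
    model, test_dataset, words_mapping, tag_to_index_mapping,
)
```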
#### Loss function:

#### Examples:
* Fully correctly labelled sentence:
    * **Sentence**: how many sample do you anticipate be find during the course of the project ?
    * **Real tags**: SCONJ ADJ NOUN AUX PRON VERB AUX VERB ADP DET NOUN ADP DET NOUN PUNCT
    * **Predicted tags**: SCONJ ADJ NOUN AUX PRON VERB AUX VERB ADP DET NOUN ADP DET NOUN PUNCT
* Partially correctly labelled sentence:
    * **Sentence**: there be electronic unit sell that emit a ultrasonic beeping sound that rodent hate .
    * **Real tags**: PRON VERB ADJ NOUN VERB PRON VERB DET ADJ NOUN NOUN SCONJ NOUN VERB PUNCT
    * **Predicted tags**: PRON VERB VERB VERB VERB SCONJ VERB DET ADJ ADJ VERB SCONJ VERB NOUN PUNCT
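The predicted tags above can be obtained by encoding a (lemmatized) sentence, taking the argmax of the model output, and decoding the indices back. A hedged sketch using the helpers defined earlier and the `index_to_tag_mapping` built above (my own wrapper, not from the notebook):
```python
def tag_sentence(model, lemmas, words_mapping, index_to_tag_mapping):
    with torch.no_grad():
        log_probs = model(encode_data(lemmas, words_mapping))
    predicted_indices = log_probs.argmax(dim=1).tolist()
    return decode_data(predicted_indices, index_to_tag_mapping)
```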
#### A few words:
I think the task of Part-of-Speech tagging was done pretty good. First of all thanks to the dataset with good distribution between classes. Moreover, this distribution is almost the same for train and test parts that is one of reason for good accuracy on test data. As I said in the introduction part PoS Tagging problem is context problem, that\`s why we use LSTM RNN to handle the previous state.