# NLP FINAL PROJECT REPORT

- Tan Hui Ru Eda 1003098
- Latasha Lenus 1003106
- Ivan Christian 1003056

# How to run the code

## Part 2

The code is contained in the Part2i and Part2ii files. If you are using the command line, run the command ```py Part2ii``` to get the scores. The output file is `data/partial/dev.p2.out`.

## Part 3

The code is contained in the Part3i and Part3ii files. If you are using the command line, run the command ```py Part3ii``` to get the scores.

## Part 4

The code is contained in the Part4 file. If you are using the command line, run the command ```py Part4``` to get the result for Part4i and the scores for Part4ii. The output file is `data/partial/dev.p4.out`.

## Part 5

The code is contained in the Part5i, Part5ii and Part5iii files. If you are using the command line:

- For Part5i, run the command ```py Part5i``` to get the scores. The output file for Part5i is `data/full/dev.p5.CRF.f3.out`.
- For Part5ii, run the command ```py Part5ii``` to get the scores. The output file for Part5ii is `data/full/dev.p5.CRF.f4.out`.
- For Part5iii, run the command ```py Part5iii``` to get the scores. The output file for Part5iii is `data/full/dev.p5.CRF.SP.out`.

## Part 6

The code is contained in the Part6i and Part6ii files. If you are using the command line, for Part 6i, run the command ```py Part6i```. The output file is `data/full/test.p6.CRF.out`.

For Part 6ii, the code and the results are contained in the `6ii` folder. Navigate into the folder and run the command ```py main``` to start the training. The output file is `test/partial/test.p6.model.out`, inside the `Part6ii` folder. Our models are stored in `checkpoint/`. We will only be storing our last model for submission.

To generate the dev or the test set predictions, run `py generate`. When prompted, choose 'D' to generate the dev set or choose 'T' to generate the test set.
The test set predictions are saved in `test/partial/test.p6.model.out`. To evaluate the dev set predictions, run `py evaluate`. This already defaults to comparing `dev.p6.model.out` against `dev.out` in partial.

# Explanation of code

## Part 1

The goal for this section is to calculate the emission probabilities and transition probabilities based on the training set. We created a function called ```q1``` that takes in the input file path for the training set and outputs a feature-weight dictionary. The input file is parsed to obtain the sentences and individual word-label pairs.

```python
with open(inputpath, encoding="utf-8") as train_file:
    train_data = train_file.read()

sentences = train_data.split("\n\n")  # sentences
data = train_data.split("\n")  # word-label pairs
```

From the training set data, each word-label pair (e.g. Word: transportation, Label: O) is counted and $e(x|y)$ is evaluated.

```python
# Count word-label pair & labels
label_count = {}
pair_count = {}
for pair in data:
    if pair:
        parts = pair.split(" ")
        word = parts[0]
        # label
        label = parts[len(parts)-1]
        # word-label pair
        xy = str(label) + "+" + str(word)
        # count labels
        if (label in label_count.keys()):
            label_count[label] += 1
        else:
            label_count[label] = 1
        # count word-label pair
        if (xy in pair_count.keys()):
            pair_count[xy] += 1
        else:
            pair_count[xy] = 1
```

The emission probability is calculated using the following function:

$$e(x|y) = \frac{\text{count}(y \to x)}{\text{count}(y)}$$

```python
# Evaluate e(x|y)
for pair in pair_count.keys():
    emi = "emission:" + pair
    label = pair.split("+")[0]
    emi_prob = pair_count[pair]/label_count[label]
    f[emi] = math.log(emi_prob)
```

In the same function ```q1```, for each sentence in the training set, each $(y_{i-1}, y_i)$ pair is counted and $q(y_i|y_{i-1})$ is calculated. Note that the special labels "START" and "STOP" are added to the start and end of the sentence before the transition probability is calculated.
```python
# Count transition pairs
count_tran_pairs = {}
label_count["START"] = len(sentences)
# for each sentence
for sent in sentences:
    sentence = sent.split("\n")
    if sentence:
        sentence_labels = []
        # append START to start of sentence_labels
        sentence_labels.append("START")
        for word_label in sentence:
            if word_label:
                word_label_parts = word_label.split(" ")
                # append label in sentence to sentence_labels
                sentence_labels.append(word_label_parts[len(word_label_parts)-1])
        # append STOP to end of sentence_labels
        sentence_labels.append("STOP")
        # count transition pair
        for i in range(len(sentence_labels)-1):
            pair = str(sentence_labels[i]) + "+" + str(sentence_labels[i+1])
            if (pair in count_tran_pairs.keys()):
                count_tran_pairs[pair] += 1
            else:
                count_tran_pairs[pair] = 1
```

The transition probability is calculated using the following function:

$$q(y_i|y_{i-1}) = \frac{\text{count}(y_{i-1}, y_i)}{\text{count}(y_{i-1})}$$

```python
# Evaluate q(yi|yi-1)
for pair in count_tran_pairs.keys():
    trans = "transition:" + pair
    label_0 = pair.split("+")[0]
    trans_prob = count_tran_pairs[pair]/label_count[label_0]
    f[trans] = math.log(trans_prob)
```

## Part 2i

The goal for this section is to calculate the score for a given pair of input and output sequences (x, y), using the function below:

$$w \cdot f(x,y) = \sum_j w_j f_j(x,y)$$

where $f_j(x,y)$ is the function that returns the number of times the j-th feature in the dictionary appears in (x, y) and $w_j$ is the weight of the j-th feature.

We created a function called ```q2i``` that takes in two lists of strings (the input sentence x and the output tag sequence y) and the dictionary of feature-weight mappings. The function signature is ```def q2i(x, y, feature_dict)```.

We then calculate the counts of the emission and transition features by iterating through the entire sentence and storing the counts of all the emission and transition features in a dictionary.
```python
# emission features
for i in range(len(x)):
    word = x[i]
    emissionf = "emission:{}+{}".format(y[i], word)
    count_dict[emissionf] += 1

# transition features
newy = ["START"] + y + ["STOP"]
for i in range(1, len(x) + 2):
    transitionf = "transition:{}+{}".format(newy[i - 1], newy[i])
    count_dict[transitionf] += 1
```

We calculate the score by iterating through the count dictionary and multiplying each count by its weight in the feature dictionary.

```python
wfproduct = 0
for key, value in count_dict.items():
    wfproduct += feature_dict[key] * value
```

## Part 2ii

Before we start the Viterbi algorithm, we need to retrieve the tags that will be used in Viterbi. Thus, we have a function called ```get_tags``` that takes in an input file path, reads through the lines and returns a list of unique tags. Since we are training on the partial dataset, the format is `<English Word> <output tag>` (e.g. `The O`), so we take the second element after splitting the line (`splitlines[1]`).

```python
def get_tags(inputpath):
    '''
    Inputs:
        inputpath: (str) Path of file
    Outputs:
        tags: (List[str]) List of unique tags
    '''
    tags = set()
    with open(inputpath) as f:
        lines = f.readlines()
    for line in lines:
        formatted_lines = line.strip()
        if formatted_lines != '':
            splitlines = formatted_lines.split(" ")
            y = splitlines[1]
            tags.add(y)
    return list(tags)
```

This set of tags will then be used by the ```viterbi_decode``` function, which also takes in the input sequence x as a list of strings, and the feature-weight dictionary. The function first initialises a score array of size (number of words, number of tags) where the values are negative infinity. It also initialises a backpointer array of size (number of words, number of tags) where the values are zero.

```python
scores = [[-np.inf for j in range(numtags)] for i in range(numwords)]
back = [[0 for j in range(numtags)] for i in range(numwords)]
```

As we loop through the sentence length and the tags, we calculate the score.
The score is the sum of the emission score, the transition score and the best score stored for the previous tag. It is a sum because the scores are log probabilities, not probabilities. If the new score is higher than the score currently stored at the same node, we store the new score and set the backpointer to the index of the previous tag.

```python
for i in range(1, numwords):
    for prev in range(numtags):  # v
        for now in range(numtags):  # u
            prevtag = nodes[prev]
            currenttag = nodes[now]
            transitionf = 'transition:{}+{}'.format(prevtag, currenttag)
            emissionf = 'emission:{}+{}'.format(currenttag, x[i])
            prevscore = scores[i - 1][prev]
            totalscore = featuredict.get(emissionf, -10 ** 8) + featuredict.get(transitionf, -10 ** 8) + prevscore
            if totalscore > scores[i][now]:
                scores[i][now] = totalscore
                back[i][now] = prev
```

After we have reached the STOP tag, we work backwards using the backpointers to find the best tag sequence and return it.

```python
path = [nodes[stopbp]]
prevbp = stopbp
for i in range(numwords - 1, 0, -1):
    prevbp = back[i][prevbp]
    output = nodes[prevbp]
    path = [output] + path
```

To predict the tag sequences for a file, the ```viterbi_inference``` function takes an input file path. It reads the file, splits it into sentences and feeds each sentence into the Viterbi algorithm for prediction.
```python
def viterbi_inference(inputpath, outputpath, nodes, featuredict):
    '''
    Function creates a viterbi inference
    Inputs:
        inputpath: (str) Path of input file
        outputpath: (str) Path of output file
        nodes: (List(str)) List of tags
        featuredict: (Dict(str) -> float) Dictionary containing feature-weight mappings
    Outputs:
        None
    '''
    sentences = []
    with open(inputpath) as f:
        lines = f.readlines()
    sentence = []
    for line in lines:
        strippedline = line.strip()
        if strippedline == '':
            sentences.append(sentence)
            sentence = []
        else:
            sentence.append(strippedline)

    with open(outputpath, "w") as f:
        for index, sentence in enumerate(sentences):
            pred = viterbi_decode(sentence, nodes, featuredict)
            for i in range(len(sentence)):
                f.write(sentence[i] + " " + pred[i] + "\n")
            f.write("\n")
```

### Results

The accuracy, precision, recall and F1 scores as calculated by the ```conlleval.evaluate``` function are shown below:

![](https://i.imgur.com/pvwhKr8.png)

## Part 3i

### Computing forward score

The goal of this section is to implement the forward algorithm and the CRF loss that is based on it. To achieve this, we implemented a `compute_forward` function to calculate the forward score. It needs the sentence that we want to score, the feature dictionary that we generated previously and the list of possible tags. We first initialise a matrix of zeros to store the scores, as seen in the following code snippet.

```python
...
forward_scores = np.zeros((sent_len, tags_n))  # Initialise the scores to zeroes
forward_prob = 0
...
```

We start the score calculation with the transition from the `START` tag to the current tag. If this transition feature is not in the feature dictionary that we generated previously, its score defaults to -10^2, which at this point we treat as a large penalty.
Similarly, we obtain the emission score from the feature dictionary and default it to -10^2 if we are not able to find it there. At this point we do this only for the first word, as we need to initialise the scores at the beginning of the sentence. The sum of the two scores is then saved in the forward score array, as seen in the following code snippet.

```python
for idx, current_y in enumerate(tags):
    transition_key = f'transition:START+{current_y}'
    emission_key = f'emission:{current_y}+{sent[0]}'
    transition_score = feature_dict.get(transition_key, -10**2)
    emission_score = feature_dict.get(emission_key,-10**2)
    forward_scores[(0, idx)] = transition_score + emission_score
```

We then proceed to calculate the forward scores by iterating through the rest of the sentence. For each word, we look up the transition and emission scores for every pair of current (`current_y`) and previous (`prev_y`) tags, defaulting any score that we do not find to -10^2. We exponentiate and sum these terms over the previous tags into a temporary score, whose log updates the forward score array, as seen in the following code snippet.
```python
'''
Sum over the previous tags based on the transition and emission scores
'''
for i in range(1,sent_len):
    for j, current_y in enumerate(tags):
        temp_score = 0  #set score to 0 for each iteration
        for k, prev_y in enumerate(tags):
            transition_key = f'transition:{prev_y}+{current_y}'
            emission_key = f'emission:{current_y}+{sent[i]}'
            transition_score = feature_dict.get(transition_key, -10**2)
            emission_score = feature_dict.get(emission_key,-10**2)
            temp_score += np.exp(min(emission_score + transition_score + forward_scores[i-1, k], 700))
        forward_scores[i, j] = np.log(temp_score) if temp_score else -10**2
```

Once we are done iterating through the sentence, we calculate the score for the transition to STOP and return both the forward score array (`forward_scores`) and the log of the total forward score (`log_forward_score`, referred to as alpha) for use in the CRF loss computation. To avoid overflow, each exponent is capped at 700, and if the summed probability is zero we default the log score to -700.

```python
'''
Calculate STOP
'''
for j, prev_y in enumerate(tags):
    transition_key = f'transition:{prev_y}+STOP'
    transition_score = feature_dict.get(transition_key, -10**2)
    # Sum exponentials
    overall_score = np.exp(min(transition_score + forward_scores[sent_len-1, j], 700))
    forward_prob += overall_score

log_forward_score = np.log(forward_prob) if forward_prob else -700
return forward_scores, log_forward_score
```

### Computing CRF loss

Computing the CRF loss is based on the following equation:

![](https://i.imgur.com/ZIWn5Gk.png)

To compute the CRF loss, we need the input sentences, the tags corresponding to those sentences, the feature dictionary that we generated previously, and the list of unique tags available in the set. We first initialise `loss` to 0 and then iterate through the sentences, accumulating the total loss in the `loss` variable. For each sentence, we compute the loss at that step by calling the `compute_crf_loss_step` function. This can be seen in the following code snippet.

```python
...
for i in tqdm(range(len(input_sequences))):
    step_loss = compute_crf_loss_step(input_sequences[i], input_labels[i], feature_dict, tags)
    loss += step_loss
...
```

The `compute_crf_loss_step` function computes the loss for an individual sentence by comparing `log_forward_score` with the "ground truth" score that can be obtained from `q2i`. We take the difference between `log_forward_score` and the "ground truth" score to obtain the loss value for one step, which is then added to the overall loss. The `compute_crf_loss_step` function essentially computes this part of the equation:

![](https://i.imgur.com/sw8Hv8p.png)

The full implementation can be seen in the following function snippet:

```python
def compute_crf_loss_step(sent, groundtruth, feature_dict, tags):
    '''
    Function to calculate the crf loss based on the sentence and the ground truth for 1 instance

    Args:
        - sent : list of sequence of str
        - groundtruth : list of true sequence of str ( compared with sent to get the score )
        - feature_dict : dict that maps the feature -> score (w)
        - tags : list of unique tags

    returns:
        - loss : loss value for this particular instance
    '''
    crf_score = q2i(sent, groundtruth, feature_dict)
    _, log_forward_score = compute_forward(sent, feature_dict, tags)
    loss = -(crf_score - log_forward_score)  # easier to sum for later
    return loss
```

We have also added a regularisation term, controlled by a regularisation constant, should we need to regularise the CRF loss. This however results in a very high value for the CRF loss.

```python
...
if regularisation:
    reg_loss = 0
    for feature_key in feature_dict:
        reg_loss += feature_dict[feature_key]**2
    reg_loss = nabla*reg_loss
    loss += reg_loss
return loss
...
```

### Results

Using the train set:

```
Computed CRF loss value : 465.4578456572406
Regularised Computed CRF loss value : 3308.939500862148
```

## Part 3ii

This section is the continuation of the previous section and focuses on the backward algorithm and the gradient calculation for Part 3.
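
The forward pass in Part 3i and the backward pass below both sum exponentials of log scores, and our code guards against overflow by capping each exponent at 700. As a side note, a common alternative is the log-sum-exp trick, which subtracts the maximum before exponentiating. The sketch below is not part of our submission: the helper name `log_sum_exp` is hypothetical and it uses only the standard library.

```python
import math


def log_sum_exp(log_values):
    """Return log(sum(exp(v) for v in log_values)) without overflow.

    Hypothetical helper for illustration; not used in the submitted code.
    """
    if not log_values:
        return float("-inf")
    m = max(log_values)
    if m == float("-inf"):
        # Every term is exp(-inf) = 0, so the log of the sum is -inf.
        return float("-inf")
    # Shift by the maximum so the largest exponent is exp(0) = 1.
    return m + math.log(sum(math.exp(v - m) for v in log_values))


# Scores this large would overflow math.exp directly.
big = log_sum_exp([1000.0, 1000.0])
```

Inside the forward recursion, this would replace the running `temp_score` with a list of the log terms for each previous tag, so no cap on the exponent would be needed.
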
### Backward Algorithm

We implemented the backward algorithm by first taking in the input sentence, the feature dictionary (generated previously) and the list of unique tags (generated previously). To start the backward algorithm calculations, we initialise all the scores in the array to 0, as seen in the following snippet.

```python
sent_len = len(sent)
tags_n = len(tags)
backward_scores = np.zeros((sent_len, tags_n))
```

We then compute the scores from the end of the sequence, meaning that we start from the last word (whose tag has a corresponding `transition:<current tag>+STOP` value) and work backwards from there. These scores are stored in `backward_scores`.

```python
# Initialise the last position with the transition score into STOP
for j, current_y in enumerate(tags):
    transition_key = f"transition:{current_y}+STOP"
    transition_score = feature_dict.get(transition_key, -10**2)
    backward_scores[sent_len-1, j] = transition_score
```

We use a method similar to the forward algorithm to combine the transition and emission scores and store the results in the `backward_scores` array. The following is the code snippet for the implementation.

```python
for i in range(sent_len-1, 0, -1):
    for k, current_y in enumerate(tags):
        temp_score = 0
        for j, next_y in enumerate(tags):
            transition_key = f"transition:{current_y}+{next_y}"
            emission_key = f"emission:{next_y}+{sent[i]}"
            transition_score = feature_dict.get(transition_key, -10**2)
            emission_score = feature_dict.get(emission_key, -10**2)
            # Sum exponentials
            temp_score += np.exp(min(emission_score + transition_score + backward_scores[i, j], 700))
        # Add to backward scores array
        backward_scores[i-1, k] = np.log(temp_score) if temp_score else -10**2
```

Similar to the forward algorithm, we also calculate the backward score for the end of the recursion.
Unlike the forward algorithm, the recursion ends at `START`, which gives us the full array of scores needed for the gradient calculation. The `compute_backward` function returns the `backward_scores` array and the `log_backward_score` value.

```python
'''
Calculate START
'''
backward_prob = 0
for j, next_y in enumerate(tags):
    transition_key = f"transition:START+{next_y}"
    emission_key = f"emission:{next_y}+{sent[0]}"  # Emission of first word
    transition_score = feature_dict.get(transition_key, -10**2)
    emission_score = feature_dict.get(emission_key, -10**2)
    overall_score = np.exp(min(emission_score + transition_score + backward_scores[0, j], 700))
    backward_prob += overall_score

log_backward_score = np.log(backward_prob) if backward_prob else -700
return backward_scores, log_backward_score
```

### Compute Forward Backward

This is an intermediary step towards the gradients. It calculates the `expected_feature_counts` from the forward and backward algorithms, where `expected_feature_counts` refers to the expected count of each feature under the model. We initialise `expected_feature_counts` as a dictionary with default values of 0, and obtain `forward_scores` and `alpha` from the `compute_forward` function and `backward_scores` from the `compute_backward` function. This can be seen in the following snippet:

```python
expected_feature_counts = defaultdict(float)
forward_scores, alpha = compute_forward(sent, feature_dict, tags)
backward_scores, _ = compute_backward(sent, feature_dict, tags)
forward_prob = np.exp(min(alpha, 700))
```

We then go through the features (emission and transition) and add their contributions to the expected feature count dictionary, to be used later for the gradient calculation. For the emission features, we add the forward and backward scores at each position and tag and subtract alpha. This gives a normalised score for each individual emission feature.
The code snippet is as follows:

```python
'''
Get emission for the expected feature counts for each word
'''
for idx in range(sent_len):
    for tag_idx, curr_y in enumerate(tags):
        emission_key = f"emission:{curr_y}+{sent[idx]}"
        expected_feature_counts[emission_key] += np.exp(min(forward_scores[idx, tag_idx] + backward_scores[idx, tag_idx] - alpha, 700))
```

Next is the implementation for the transition features. We calculate the scores from START all the way to the end of the sequence and save them to the dictionary. The following is the implementation for the transition features:

```python
'''
Get transition for START and STOP
'''
for tag_idx, next_y in enumerate(tags):
    start_transition_key = f"transition:START+{next_y}"
    expected_feature_counts[start_transition_key] += np.exp(min(forward_scores[0, tag_idx] + backward_scores[0, tag_idx] - alpha, 700))
    stop_transition_key = f"transition:{next_y}+STOP"
    expected_feature_counts[stop_transition_key] += np.exp(min(forward_scores[sent_len-1, tag_idx] + backward_scores[sent_len-1, tag_idx] - alpha, 700))

'''
Get transition for the rest of the sentence
'''
for tag_idx, curr_y in enumerate(tags):
    for next_tag_idx, next_y in enumerate(tags):
        transition_key = f"transition:{curr_y}+{next_y}"
        transition_score = feature_dict.get(transition_key, -10**2)
        total = 0
        for idx in range(sent_len-1):
            emission_key = f"emission:{next_y}+{sent[idx+1]}"
            emission_score = feature_dict.get(emission_key, -10**2)
            total += np.exp(min(forward_scores[idx, tag_idx] + backward_scores[idx+1, next_tag_idx] + transition_score + emission_score - alpha, 700))
        expected_feature_counts[transition_key] = total
```

Once we finish both the transition and emission calculations, we return `expected_feature_counts`, which has been updated with the values. This will be used to compute the gradients.

### Compute Gradients

To compute the gradient, we implemented a function that uses the result of `forward_backward`, which is the `expected_feature_counts`.
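
The gradient rests on the identity that the derivative of the negative log-likelihood with respect to a feature weight is the expected count of that feature minus its observed count. As a sanity check, this can be verified by brute force on a toy example small enough to enumerate every tag sequence. The sketch below is illustrative only: the function names, the two-tag set and the weights are made up, and unseen features get weight 0 here rather than the -10^2 default used in our code.

```python
import itertools
import math
from collections import defaultdict


def features(x, y):
    """Count emission and transition features of (x, y), adding START/STOP."""
    counts = defaultdict(float)
    for word, tag in zip(x, y):
        counts[f"emission:{tag}+{word}"] += 1
    padded = ["START"] + list(y) + ["STOP"]
    for prev, cur in zip(padded, padded[1:]):
        counts[f"transition:{prev}+{cur}"] += 1
    return counts


def score(x, y, w):
    """w . f(x, y); features missing from w are assumed to have weight 0."""
    return sum(w.get(k, 0.0) * v for k, v in features(x, y).items())


def brute_force_loss_and_grad(x, y, w, tags):
    """Negative log-likelihood and gradient by enumerating all tag sequences."""
    all_paths = list(itertools.product(tags, repeat=len(x)))
    scores = [score(x, path, w) for path in all_paths]
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    expected = defaultdict(float)
    for path, s in zip(all_paths, scores):
        p = math.exp(s - log_z)  # probability of this tag sequence
        for k, v in features(x, path).items():
            expected[k] += p * v
    loss = log_z - score(x, y, w)
    grad = dict(expected)  # expected counts ...
    for k, v in features(x, y).items():
        grad[k] = grad.get(k, 0.0) - v  # ... minus observed counts
    return loss, grad


# Toy data, made up for illustration.
tags = ["O", "B"]
x = ["the", "cat"]
y = ["O", "B"]
w = {"emission:O+the": 0.5, "transition:O+B": -0.3, "transition:START+O": 0.2}

loss, grad = brute_force_loss_and_grad(x, y, w, tags)

# Finite-difference estimate of the gradient for one feature.
key = "transition:O+B"
delta = 1e-6
w_plus = dict(w)
w_plus[key] += delta
numeric = (brute_force_loss_and_grad(x, y, w_plus, tags)[0] - loss) / delta
```

Here `numeric` and `grad[key]` should agree up to the finite-difference error, which is the same kind of check reported in the Results at the end of this part.
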
To compute the feature gradients, we need the list of training sentences, the training tags, the feature dictionary, and the list of available tags. We then initialise the feature gradient dictionary (as specified in the problem set) to 0. The implementation is as follows:

```python
feature_gradients = defaultdict(float)  #set the default values to 0.0
```

We then iterate through the training set to get both the expected feature counts, which we obtain from the `forward_backward` function, and the actual counts of the features. To get the actual counts of the features, we implemented a function called `count_true_features`, which counts the number of times that each transition and emission feature appears in the sentence. For the emission features, every time the `emission_key`, which is formatted as `emission:current_tag+word`, appears, we update the count of that key in `feature_count`. We apply the same concept to the transition counts after padding the tag sequence with 'START' and 'STOP'. Once all the features are counted, we return the `feature_count` dictionary. The implementation can be seen in the following snippet:

```python
for i in range(n):
    formatted_word = x[i]
    emission_key = f"emission:{y[i]}+{formatted_word}"
    feature_count[emission_key] += 1  # count the emission

updated_y = ["START"] + y + ["STOP"]  #update for transition
for i in range(1, n+2):
    prev_y = updated_y[i-1]
    y_i = updated_y[i]
    transition_key = f"transition:{prev_y}+{y_i}"
    feature_count[transition_key] += 1
```

At this point we have the `feature_expected_counts` as well as the `actual_counts`. These are then used to update the `feature_gradients` dictionary. The update implements the following formula:

![](https://i.imgur.com/pzERxx4.png)

Since the `feature_gradients` dictionary is initialised to 0, we can just take the difference between the expected count and the actual count.
We can do this by adding the value of the expected count first and then subtracting the value of the actual count. The following is the snippet for this implementation:

```python
for k, v in feature_expected_counts.items():
    feature_gradients[k] += v
for k, v in actual_counts.items():
    feature_gradients[k] -= v
```

As part of the requirements of Part 4, we have implemented a regularisation method. The implementation is as follows:

```python
if regularisation:
    for k, v in f.items():
        feature_gradients[k] += 2*nabla*v
return feature_gradients
```

This returns the feature gradients for use in Part 4.

### Results

We checked the analytical gradients against numerical gradients, which were calculated by finding the difference in loss when a single weight is perturbed by a predefined constant delta and dividing that difference by delta. We checked this for the following emission and transition features:

`['emission:O+the', 'transition:START+O', 'transition:O+O', 'transition:O+STOP']`

This is the result:

![](https://i.imgur.com/lrNsfGw.png)

## Part 4i

Now, the loss function will have the L2 regularisation term to prevent overfitting. The new loss function with L2 regularisation is:

$$loss = -\sum_i \log p(y_i|x_i) + \eta \sum_j w_j^2$$

where $w_j$ is the weight of the j-th feature and $\eta$ is the coefficient of the L2 regularisation term. The gradient of the new loss function with respect to each weight $w_j$ will thus be:

$$\frac{\partial\, loss}{\partial w_j} = -\sum_i \frac{\partial}{\partial w_j} \log p(y_i|x_i) + 2\eta w_j$$

We will focus on getting new weights using the L-BFGS algorithm, via the ```fmin_l_bfgs_b``` function. There are two functions, ```callbackF``` and ```get_loss_grad```, that will be used in that function.
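
To make the effect of the regularisation term concrete, the sketch below adds $\eta \sum_j w_j^2$ to a loss value and $2\eta w_j$ to each gradient entry. The helper name `add_l2` and all the numbers are hypothetical and chosen only for illustration; in our code the coefficient is passed as `nabla`.

```python
def add_l2(loss, grads, weights, eta):
    """Return the L2-regularised loss and gradients.

    Hypothetical helper: adds eta * sum(w_j^2) to the loss and
    2 * eta * w_j to the gradient entry of every weight.
    """
    reg_loss = loss + eta * sum(v ** 2 for v in weights.values())
    reg_grads = {k: grads.get(k, 0.0) + 2 * eta * v for k, v in weights.items()}
    return reg_loss, reg_grads


# Made-up weights and unregularised gradients for two features.
weights = {"emission:O+the": 1.0, "transition:O+O": -2.0}
grads = {"emission:O+the": 0.25, "transition:O+O": -0.5}

reg_loss, reg_grads = add_l2(10.0, grads, weights, eta=0.1)
```

Because the penalty grows with the square of every weight, a dictionary with many large log-probability weights can make the regularised loss far larger than the unregularised one.
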
The ```callbackF``` function is as follows:

```python
loss = get_loss_grad(w)[0]
print('Loss:{0:.4f}'.format(loss))
```

In the ```get_loss_grad``` function, we compute the loss and the gradient of the loss function using ```compute_gradients``` and ```compute_crf_loss``` from Part 3. The code is shown below. We need the train_inputs, the train_labels, the feature-weight dictionary and all the possible tags from the training dataset. We also change the weights $w$ from a numpy array to a dictionary, ```newf```.

```python
trainpath = "data/partial/train"
train_inputs, train_labels = get_dataset(trainpath)
f = q1(trainpath)
states = get_tags(trainpath)
newf = {}
for i, key in enumerate(f.keys()):
    newf[key] = w[i]
```

We subsequently use the two functions from Part 3 to get the loss and the gradient of the loss function.

```python
loss = compute_crf_loss(train_inputs, train_labels, newf, states, 0.1, regularisation=True)
grads = compute_gradients(train_inputs, train_labels, newf, states, 0.1, regularisation=True)
```

We have to convert the grads from a dictionary to a numpy array before returning them.

```python
np_grads = np.zeros(len(newf))
for i, k in enumerate(newf.keys()):
    np_grads[i] = grads[k]
# return loss and grad
return loss, np_grads
```

### Results

The intermediary loss is 468.2452 and the final loss is 331.8489.

## Part 4ii

In the ```part4iiviterbi``` function, we take the weights stored in the results array calculated by the ```fmin_l_bfgs_b``` function. These weights are then used in the Viterbi algorithm from Part 2ii, which generates new predictions. We first need to convert the weights from a numpy array to a dictionary, so that the feature-weight dictionary can be passed into the Viterbi algorithm.
```python
newfeaturedict = {}
for i, key in enumerate(featuredict.keys()):
    newfeaturedict[key] = newweights[i]
viterbi_inference(inputpath, outputpath, tags, newfeaturedict)
```

### Results

The accuracy, precision, recall and F1 scores as calculated by the ```conlleval.evaluate``` function are shown below:

![](https://i.imgur.com/9BxxDur.png)

## Part 5i

In Part 5i, we now add part-of-speech (POS) tags into the emission features for the CRF, such as "emission:B-geo+NNS" where NNS is the POS tag. We created a function called ```part5i_get_tags_weights``` that takes in the file path for the training set. From the training set data, each POS-label pair (e.g. POS: NNS, Label: B-geo) is counted and $e(x_{POS}|y)$ is evaluated. The emission probability is calculated using the following function:

$$e(x|y) = \frac{\text{count}(y \to x)}{\text{count}(y)}$$

In the same function ```part5i_get_tags_weights```, similar to the function in Part 1:

1. Each word-label pair (e.g. Word: transportation, Label: O) is counted and $e(x|y)$ is evaluated.
2. For each sentence in the training set, each $(y_{i-1}, y_i)$ pair is counted and $q(y_i|y_{i-1})$ is calculated. Note that the special labels "START" and "STOP" are added to the start and end of the sentence before the transition probability is calculated.

We reapply the Viterbi algorithm on the dataset ```full/dev.in```, which is of the format `<English Word> <part of speech tag like NNS> <output tag>`. Since the format has changed, we need to change the code for ```get_tags``` that we used in Part 2ii. Instead of taking the second element from splitlines (`splitlines[1]`), we take `splitlines[2]`, which is the output tag. The ```part5i_get_tags``` function is as follows:

```python
tags = set()
with open(inputpath) as f:
    lines = f.readlines()
for line in lines:
    formatted_lines = line.strip()
    if formatted_lines != '':
        splitlines = formatted_lines.split(" ")
        y = splitlines[2]
        tags.add(y)
return list(tags)
```

For the decoding algorithm, we now need to account for the new type of tag.
In addition to the transition features, the emission features for the word (e.g. "emission:O+START") and the previous tag's score, we also have to add the emission POS feature into the score. Thus, we change the code from Part 2ii to include the emission POS feature in the score in the ```part5i_viterbi_decode``` function:

```python
transitionf = 'transition:{}+{}'.format(prevtag, currenttag)
emissionwordf = 'emission:{}+{}'.format(currenttag, x[i][0])
emissionposf = 'emission:{}+{}'.format(currenttag, x[i][1])
prevscore = scores[i - 1][prev]
totalscore = featuredict.get(emissionwordf, -10 ** 8) + featuredict.get(emissionposf, -10 ** 8) + featuredict.get(transitionf, -10 ** 8) + prevscore
```

When writing the predictions to ```full/dev.p5.CRF.f3.out```, we have to write them in the same format `<English Word> <part of speech tag like NNS> <output tag>`. Thus, the ```viterbi_inference``` code from Part 2ii needs to be changed to:

```python
with open(outputpath, "w") as f:
    for sentence in sentences:
        pred = part5i_viterbi_decode(sentence, nodes, featuredict)
        for i in range(len(sentence)):
            # word, POS tag and predicted tag, separated by single spaces
            f.write(sentence[i][0] + " " + sentence[i][1] + " " + pred[i] + "\n")
        f.write("\n")
```

### Results

The accuracy, precision, recall and F1 scores as calculated by the ```conlleval.evaluate``` function are shown below:

![](https://i.imgur.com/J7YJ4fb.png)

## Part 5ii

We have a new feature "$combine:y_{i-1}+y_i+x_i$", which is combined from the emission feature "$emission:y_i+x_i$" and the transition feature "$transition:y_{i-1}+y_i$". We need to add this combined feature into the CRF model and train the new model.

We created a function called ```part5ii_get_tags_weights``` that takes in the file path for the training set. From the training set data:

1. Each word-label pair (e.g. Word: transportation, Label: O) is counted and $e(x|y)$ is evaluated.
2. Each POS-label pair (e.g. POS: NNS, Label: B-geo) is counted and $e(x_{POS}|y)$ is evaluated.
3. For each sentence in the training set, each $(y_{i-1}, y_i)$ pair is counted and $q(y_i|y_{i-1})$ is calculated.

The emission probability is calculated using the following function:

$$e(x|y) = \frac{\text{count}(y \to x)}{\text{count}(y)}$$

The transition probability is calculated using the following function:

$$q(y_i|y_{i-1}) = \frac{\text{count}(y_{i-1}, y_i)}{\text{count}(y_{i-1})}$$

To evaluate the combined probability, we enumerate all possible combinations of transition pairs $(y_{i-1}, y_i)$ and words $x_i$ in the data set to get $(y_{i-1}, y_i, x_i)$:

```python
for pair in pair_count.keys():  # emission counts: pair_count = {label+word: counts}
    word = pair.split("+")[1]  # word from emission
    for tpair in count_tran_pairs.keys():  # transition counts: count_tran_pairs = {(yi-1)+(yi): counts }
        yi = tpair.split("+")[1]  # yi from transition
        # combine: yi-1 (from trans) yi (from trans) xi (from emi)
        combination = "combine:" + tpair + "+" + word
```

The transition log probability of $(y_{i-1}, y_i)$ and the emission log probability of $(y_i, x_i)$ are taken from the existing feature-weight dictionary f and summed to get the "combine" score. If either the transition or the emission entry does not exist in f, it is set to a large negative value of -10e8 before the two are summed.
```python for pair in pair_count.keys(): # emission counts: pair_count = {label+word: counts} word = pair.split("+")[1] # word from emission for tpair in count_tran_pairs.keys(): # transition counts: count_tran_pairs = {(yi-1)+(yi): counts } yi = tpair.split("+")[1] # yi from transition combination = "combine:" + tpair + "+" + word # combine: yi-1 (from trans) yi (from trans) xi (from emi) combi_trans_prob = -10e8 combi_emi_prob = -10e8 # if value exists input it if (str("transition:" + tpair) in f.keys()): combi_trans_prob = f["transition:" + tpair] if (str("emission:" + yi + "+" + word) in f.keys()): combi_emi_prob = f["emission:" + yi + "+" + word] combi_prob = combi_trans_prob + combi_emi_prob f[combination] = combi_prob ``` The functions for getting the tags and for Viterbi inference are the same as in Part 5i, but the Viterbi decode function changes to include the new combine feature alongside the emission POS feature. When iterating through the layers to find the output tag, we consider the combine feature for English words as well as the combine feature for POS tags, because $x_i$ can represent both English words and POS tags. The scores for these two features are added to the total score. 
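To make the scoring concrete, the minimal sketch below sums the five feature weights for a single edge of the Viterbi lattice, using the same `-10 ** 8` fallback for features that are missing from the dictionary. The helper name `edge_score` and the toy weights are hypothetical and chosen only for illustration; they are not taken from our submitted code.

```python
UNSEEN = -10 ** 8  # fallback weight for features missing from the dictionary


def edge_score(featuredict, prevtag, currenttag, word, pos, prevscore=0.0):
    """Score of moving from prevtag to currenttag while observing (word, pos)."""
    keys = [
        'transition:{}+{}'.format(prevtag, currenttag),
        'emission:{}+{}'.format(currenttag, word),
        'emission:{}+{}'.format(currenttag, pos),
        'combine:{}+{}+{}'.format(prevtag, currenttag, word),
        'combine:{}+{}+{}'.format(prevtag, currenttag, pos),
    ]
    return prevscore + sum(featuredict.get(k, UNSEEN) for k in keys)


# Toy log-weights (made up for this example).
toy = {
    'transition:O+B-geo': -2.0,
    'emission:B-geo+London': -1.0,
    'emission:B-geo+NNP': -0.5,
    'combine:O+B-geo+London': -3.0,
    'combine:O+B-geo+NNP': -2.5,
}

seen = edge_score(toy, 'O', 'B-geo', 'London', 'NNP')   # all five features found
unseen = edge_score(toy, 'O', 'B-geo', 'Paris', 'NNP')  # two word features missing
print(seen)
print(unseen)
```

In this toy case an unseen word costs two fallback penalties (its emission feature and its combine feature), which is why a single unknown word dominates the score of every path through that position.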
Thus the Viterbi decode code can be changed to: ```python transitionf = 'transition:{}+{}'.format(prevtag, currenttag) emissionwordf = 'emission:{}+{}'.format(currenttag, x[i][0]) emissionposf = 'emission:{}+{}'.format(currenttag, x[i][1]) combinewordf = 'combine:{}+{}+{}'.format(prevtag, currenttag, x[i][0]) combineposf = 'combine:{}+{}+{}'.format(prevtag, currenttag, x[i][1]) prevscore = scores[i - 1][prev] totalscore = featuredict.get(emissionwordf, -10 ** 8) + featuredict.get(emissionposf, -10 ** 8) + featuredict.get(transitionf, -10 ** 8) + featuredict.get(combinewordf, -10 ** 8) + featuredict.get(combineposf, -10 ** 8) + prevscore ``` ### Results The accuracy, precision, recall and F1 scores as calculated by the ```conlleval.evaluate``` function are shown below: ![](https://i.imgur.com/PrtHN0d.png) ## Part 5iii ### Structured Perceptron To implement the structured perceptron method, we need the tags, the feature dictionary and the training inputs, which were generated in Part 5i and Part 5ii. Our implementation is based on the pseudo code in the lecture notes as well as the following: ![](https://i.imgur.com/zxcT3rE.jpg) To implement the structured perceptron, we first initialised a `weights` dictionary with the same keys as the feature dictionary. We did this by copying the whole feature dictionary and setting every value to 0. The implementation is as follows: ```python weights = copy.deepcopy(featuredict) #initialise weights to 0 weights = dict.fromkeys(weights, 0) # create mapping for weights ``` Once the weights are initialised, we obtain the `g_truth` (ground truth) list of tags from the training inputs, as well as the predicted tags. We get the predicted tags using the `part5ii_viterbi_decode` function defined earlier in Part 5ii. We can then proceed to update the weights. 
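For reference, the textbook form of the structured perceptron update, w += phi(x, y_gold) - phi(x, y_pred), can be sketched as below. This is a generic count-based variant in which phi counts how often each feature fires along a tag sequence; it is not our submitted implementation, which uses the values in the feature dictionary as phi. The names `sequence_features` and `perceptron_update`, and the `(word, pos)` sentence layout, are assumptions made for this example.

```python
from collections import Counter


def sequence_features(sentence, tags):
    """Count transition and emission features along one tag sequence.

    sentence: list of (word, pos) tuples; tags: list of tags of the same length.
    """
    counts = Counter()
    path = ['START'] + list(tags) + ['STOP']
    for prev, cur in zip(path, path[1:]):
        counts['transition:{}+{}'.format(prev, cur)] += 1
    for (word, pos), tag in zip(sentence, tags):
        counts['emission:{}+{}'.format(tag, word)] += 1
        counts['emission:{}+{}'.format(tag, pos)] += 1
    return counts


def perceptron_update(weights, sentence, gold, pred):
    """In-place update: add the gold feature counts, subtract the predicted ones."""
    if gold == pred:
        return weights  # correct prediction, nothing to update
    for feat, count in sequence_features(sentence, gold).items():
        weights[feat] = weights.get(feat, 0) + count
    for feat, count in sequence_features(sentence, pred).items():
        weights[feat] = weights.get(feat, 0) - count
    return weights


sentence = [('London', 'NNP'), ('calls', 'VBZ')]
weights = perceptron_update({}, sentence, ['B-geo', 'O'], ['O', 'O'])
print(weights['emission:B-geo+London'], weights['emission:O+London'])
```

Features shared by the gold and predicted sequences (here `transition:O+STOP` and the emissions for the second word) cancel out, so only the features on which the two paths disagree are changed.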
We ran the structured perceptron learning method for 25 epochs: ```python for epoch in tqdm(range(epochs)): for sentence in train_inputs: g_truth = [] for i in sentence: g_truth.append(i[2]) pred = part5ii_viterbi_decode(sentence, nodes, weights) #predicted path weights = updated_get_update_weights(weights, featuredict,sentence, g_truth, pred) ``` To update the weights, we implemented a function that replaces the current weights with the updated weights. Based on the pseudo code and the lecture notes, the weights are updated only with the difference between the feature scores of the true and predicted values. For each feature type (transition, word emission and POS emission), the corresponding weight in the weights dictionary is updated with the difference between the values of the true and predicted features in the original feature dictionary, which in this case acts as phi from the pseudo code. The following is the code implementation for the weights update: ```python for i, (t, p, words) in enumerate(zip(target, pred, sentence)): if t!=p: if i == 0: prev_tag = 'START' t_key = f'transition:{prev_tag}+{t}' p_key = f'transition:{prev_tag}+{p}' w[t_key] += featuredict.get(t_key, 0) - featuredict.get(p_key,0) elif i == len(sentence) - 1: STOP = 'STOP' t_key = f'transition:{sentence[i-1][2]}+{t}' t_stop_key = f'transition:{t}+{STOP}' p_key = f'transition:{sentence[i-1][2]}+{p}' p_stop_key = f'transition:{p}+{STOP}' w[t_key] += featuredict.get(t_key, 0) - featuredict.get(p_key,0) w[t_stop_key] += featuredict.get(t_stop_key, 0) - featuredict.get(p_stop_key,0) else: t_key = f'transition:{sentence[i-1][2]}+{t}' p_key = f'transition:{sentence[i-1][2]}+{p}' w[t_key] += featuredict.get(t_key, 0) - featuredict.get(p_key,0) t_emission = f'emission:{t}+{words[0]}' p_emission = f'emission:{p}+{words[0]}' t_pos = f'emission:{t}+{words[1]}' p_pos = f'emission:{p}+{words[1]}' w[t_emission] += featuredict.get(t_emission, 0) - featuredict.get(p_emission,0) w[t_pos] += featuredict.get(t_pos, 0) 
- featuredict.get(p_pos,0) ... ``` Once the function has updated the weights, it is called again for every remaining sentence and epoch of the training. Once all epochs have finished, the learning is considered done and the weights are returned for evaluation. After the weights were learned, we wrote the results to `dev.P5.SP.out`, validated them against `dev.out` and evaluated them using `conlleval.evaluate`. ### Results The accuracy, precision, recall and F1 scores as calculated by the ```conlleval.evaluate``` function are shown below: ![](https://i.imgur.com/ekcwNAt.png) ## Part 6i For this section, our main idea was to design an additional transition trigram feature. We created a function called ```part6i_get_tags_weights``` that takes in the file path for the training set. For each sentence in the training set, each (yi-1, yi) pair is counted and q(yi|yi-1) is calculated. The transition probability is calculated using the following function: $$q(y_i|y_{i-1})= count(y_{i-1},y_i) / count(y_{i-1})$$ To evaluate the trigram probability, we enumerate all possible combinations of three labels (yi), together with the special labels "START" and "STOP". To get (yi-1, yi, yi+1): ```python # label_count = {label: count} # dictionary of all labels in data # ["START"] and ["STOP"] are added to account for special trigrams e.g. (START, label, STOP) or (label, STOP, STOP) for L1 in list (label_count.keys()) + ["START"]: for L2 in list (label_count.keys())+ ["STOP"]: for L3 in list (label_count.keys()) + ["STOP"]: trans_tri = "transition_trigram:" + L1 + "+" + L2 + "+" + L3 f[trans_tri] = getTriProb(f, L1, L2, L3) ``` The transition probabilities of (yi-1, yi) and (yi, yi+1) are taken from the existing feature-weight dictionary f and summed to get the "transition_trigram" probability. 
If either transition probability does not exist in f, it is set to a large negative value, -10e8, before summing. ```python def getTriProb(f, L1, L2, L3): # input: permutations of 3 labels trans1_prob = -10e8 trans2_prob = -10e8 # if value exists input it if (str("transition:" + L1 + "+" + L2) in f.keys()): trans1_prob = f["transition:" + L1 + "+" + L2] if (str("transition:" + L2 + "+" + L3) in f.keys()): trans2_prob = f["transition:" + L2 + "+" + L3] trans_triple_prob = trans1_prob + trans2_prob return trans_triple_prob ``` At this point we have 4 different features: 1. Emission Feature ($emission:y_{i}+x_{i}$) 2. Bigram Transition Feature ($transition:y_{i-1}+y_i$) 3. Combine Feature ($combine:y_{i-1}+y_i+x_i$) 4. Trigram Transition Feature ($transition\_trigram:y_{i-1}+y_i+y_{i+1}$) We designed three different sets of features (made of different combinations of the 4 features) and ran them on the dev set and test set to determine which set of features performs best. In each set, we replaced the bigram transition feature with the trigram transition feature. The three different sets of features are: #### Set 1 * Transition trigram feature ($transition\_trigram:y_{i-1}+y_i+y_{i+1}$) where $y_{i-1}$, $y_{i}$ and $y_{i+1}$ are the tags such as O * Emission word and POS features for the current tag layer ($emission:y_{i}+x_{i}$) The code is as follows: For the first layer, n = 0, after START: ```python for i in range(numtags): tag = nodes[i] emissionwordf = 'emission:{}+{}'.format(tag, x[0][0]) emissionposf = 'emission:{}+{}'.format(tag, x[0][1]) emissionscore = featuredict.get(emissionwordf, -10 ** 8) + featuredict.get(emissionposf, -10 ** 8) for j in range(numtags): nexttag = nodes[j] transitionf = 'transition_trigram:START+{}+{}'.format(tag, nexttag) scores[0][i] = emissionscore + featuredict.get(transitionf, -10 ** 8) ``` For the layers n = 1 to n = number of words - 2: ```python # loop through layer = 1 to layer = n - 2 for i in range(1, numwords - 1): for prev in 
range(numtags): # v for now in range(numtags): # u prevtag = nodes[prev] currenttag = nodes[now] emissionwordf = 'emission:{}+{}'.format(currenttag, x[i][0]) emissionposf = 'emission:{}+{}'.format(currenttag, x[i][1]) emissionscore = featuredict.get(emissionwordf, -10 ** 8) + featuredict.get(emissionposf, -10 ** 8) prevscore = scores[i - 1][prev] for nexti in range(numtags): nexttag = nodes[nexti] transitionf = 'transition_trigram:{}+{}+{}'.format(prevtag, currenttag, nexttag) totalscore = emissionscore + featuredict.get(transitionf, -10 ** 8) + prevscore if totalscore > scores[i][now]: scores[i][now] = totalscore back[i][now] = prev ``` For the last layer, just before STOP, the transition trigram feature must end in STOP, i.e. $transition\_trigram:y_{i-1}+y_i+STOP$: ```python # Score for last layer before STOP's nodes for prev in range(numtags): prevtag = nodes[prev] for now in range(numtags): currenttag = nodes[now] transitionf = 'transition_trigram:{}+{}+STOP'.format(prevtag, currenttag) emissionwordf = 'emission:{}+{}'.format(currenttag, x[numwords - 1][0]) emissionposf = 'emission:{}+{}'.format(currenttag, x[numwords - 1][1]) emissionscore = featuredict.get(emissionwordf, -10 ** 8) + featuredict.get(emissionposf, -10 ** 8) prevscore = scores[numwords - 2][prev] totalscore = featuredict.get(transitionf, -10 ** 8) + emissionscore + prevscore if totalscore > scores[numwords - 1][now]: scores[numwords - 1][now] = totalscore back[numwords - 1][now] = prev ``` The results for the dev set are: ![](https://i.imgur.com/1eieMWa.png) #### Set 2 * Transition trigram feature ($transition\_trigram:y_{i-1}+y_i+y_{i+1}$) where $y_{i-1}$, $y_{i}$ and $y_{i+1}$ are the tags such as O * Emission features for the previous word and POS, the current word and POS and the next word and POS ($emission:y_{i-1}+x_{i-1}$, $emission:y_{i}+x_{i}$ and $emission:y_{i+1}+x_{i+1}$). 
The code is as follows: For the first layer, n = 0, after START: ```python for i in range(numtags): tag = nodes[i] emissionwordf = 'emission:{}+{}'.format(tag, x[0][0]) emissionposf = 'emission:{}+{}'.format(tag, x[0][1]) emissionscore = featuredict.get(emissionwordf, -10 ** 8) + featuredict.get(emissionposf, -10 ** 8) for j in range(numtags): nexttag = nodes[j] transitionf = 'transition_trigram:START+{}+{}'.format(tag, nexttag) scores[0][i] = emissionscore + featuredict.get(transitionf, -10 ** 8) ``` For the layers n = 1 to n = number of words - 2: ```python # loop through layer = 1 to layer = n - 2 for i in range(1, numwords - 1): for prev in range(numtags): # v for now in range(numtags): # u prevtag = nodes[prev] currenttag = nodes[now] emissionwordf = 'emission:{}+{}'.format(currenttag, x[i][0]) emissionposf = 'emission:{}+{}'.format(currenttag, x[i][1]) emissionscore = featuredict.get(emissionwordf, -10 ** 8) + featuredict.get(emissionposf, -10 ** 8) emissionprevwordf = 'emission:{}+{}'.format(prevtag, x[i-1][0]) emissionprevposf = 'emission:{}+{}'.format(prevtag, x[i-1][1]) emissionprevscore = featuredict.get(emissionprevwordf, -10 ** 8) + featuredict.get(emissionprevposf, -10 ** 8) prevscore = scores[i - 1][prev] for nexti in range(numtags): nexttag = nodes[nexti] transitionf = 'transition_trigram:{}+{}+{}'.format(prevtag, currenttag, nexttag) emissionnextwordf = 'emission:{}+{}'.format(nexttag, x[i+1][0]) emissionnextposf = 'emission:{}+{}'.format(nexttag, x[i+1][1]) emissionnextscore = featuredict.get(emissionnextwordf, -10 ** 8) + featuredict.get(emissionnextposf, -10 ** 8) totalscore = emissionscore + featuredict.get(transitionf, -10 ** 8) + prevscore + emissionprevscore + emissionnextscore if totalscore > scores[i][now]: scores[i][now] = totalscore back[i][now] = prev ``` For the last layer, just before STOP, the transition trigram feature must end in STOP, i.e. $transition\_trigram:y_{i-1}+y_i+STOP$: ```python # Score for last layer before STOP's nodes for prev in 
range(numtags): prevtag = nodes[prev] for now in range(numtags): currenttag = nodes[now] transitionf = 'transition_trigram:{}+{}+STOP'.format(prevtag, currenttag) emissionwordf = 'emission:{}+{}'.format(currenttag, x[numwords - 1][0]) emissionposf = 'emission:{}+{}'.format(currenttag, x[numwords - 1][1]) emissionscore = featuredict.get(emissionwordf, -10 ** 8) + featuredict.get(emissionposf, -10 ** 8) emissionprevwordf = 'emission:{}+{}'.format(prevtag, x[numwords - 2][0]) emissionprevposf = 'emission:{}+{}'.format(prevtag, x[numwords - 2][1]) emissionprevscore = featuredict.get(emissionprevwordf, -10 ** 8) + featuredict.get(emissionprevposf, -10 ** 8) prevscore = scores[numwords - 2][prev] totalscore = featuredict.get(transitionf, -10 ** 8) + emissionscore + prevscore + emissionprevscore if totalscore > scores[numwords - 1][now]: scores[numwords - 1][now] = totalscore back[numwords - 1][now] = prev ``` The results for the dev set are: ![](https://i.imgur.com/gWrnEV3.png) #### Set 3 * Transition trigram feature ($transition\_trigram:y_{i-1}+y_i+y_{i+1}$) where $y_{i-1}$, $y_{i}$ and $y_{i+1}$ are the tags such as O * Combine features ($combine:y_{i-1}+y_i+x_i$) * Emission feature for the current POS and word ($emission:y_{i}+x_{i}$) The code for calculating the score is as follows: For the first layer, n = 0, after START: ```python for i in range(numtags): tag = nodes[i] emissionwordf = 'emission:{}+{}'.format(tag, x[0][0]) emissionposf = 'emission:{}+{}'.format(tag, x[0][1]) emissionscore = featuredict.get(emissionwordf, -10 ** 8) + featuredict.get(emissionposf, -10 ** 8) combinewordf = 'combine:START+{}+{}'.format(tag, x[0][0]) combineposf = 'combine:START+{}+{}'.format(tag, x[0][1]) for j in range(numtags): nexttag = nodes[j] transitionf = 'transition_trigram:START+{}+{}'.format(tag, nexttag) scores[0][i] = emissionscore + featuredict.get(transitionf, -10 ** 8) + featuredict.get(combinewordf, -10 ** 8) + featuredict.get(combineposf, -10 ** 8) ``` For the layers n = 1 to n = 
number of words - 2: ```python # loop through layer = 1 to layer = n - 2 for i in range(1, numwords - 1): for prev in range(numtags): # v for now in range(numtags): # u prevtag = nodes[prev] currenttag = nodes[now] emissionwordf = 'emission:{}+{}'.format(currenttag, x[i][0]) emissionposf = 'emission:{}+{}'.format(currenttag, x[i][1]) emissionscore = featuredict.get(emissionwordf, -10 ** 8) + featuredict.get(emissionposf, -10 ** 8) combinewordf = 'combine:{}+{}+{}'.format(prevtag, currenttag, x[i][0]) combineposf = 'combine:{}+{}+{}'.format(prevtag, currenttag, x[i][1]) prevscore = scores[i - 1][prev] for nexti in range(numtags): nexttag = nodes[nexti] transitionf = 'transition_trigram:{}+{}+{}'.format(prevtag, currenttag, nexttag) totalscore = emissionscore + featuredict.get(transitionf, -10 ** 8) + prevscore + featuredict.get(combinewordf, -10 ** 8) + featuredict.get(combineposf, -10 ** 8) if totalscore > scores[i][now]: scores[i][now] = totalscore back[i][now] = prev ``` For the last layer, just before STOP, the transition trigram feature must end in STOP, i.e. $transition\_trigram:y_{i-1}+y_i+STOP$: ```python # Score for last layer before STOP's nodes for prev in range(numtags): prevtag = nodes[prev] for now in range(numtags): currenttag = nodes[now] transitionf = 'transition_trigram:{}+{}+STOP'.format(prevtag, currenttag) emissionwordf = 'emission:{}+{}'.format(currenttag, x[numwords - 1][0]) emissionposf = 'emission:{}+{}'.format(currenttag, x[numwords - 1][1]) emissionscore = featuredict.get(emissionwordf, -10 ** 8) + featuredict.get(emissionposf, -10 ** 8) combinewordf = 'combine:{}+{}+{}'.format(prevtag, currenttag, x[numwords - 1][0]) combineposf = 'combine:{}+{}+{}'.format(prevtag, currenttag, x[numwords - 1][1]) prevscore = scores[numwords - 2][prev] totalscore = featuredict.get(transitionf, -10 ** 8) + emissionscore + prevscore + featuredict.get(combinewordf, -10 ** 8) + featuredict.get(combineposf, -10 ** 8) if totalscore > scores[numwords - 1][now]: 
scores[numwords - 1][now] = totalscore back[numwords - 1][now] = prev ``` The results for the dev set are: ![](https://i.imgur.com/LaNLoye.png) **Set 2** performed the worst. We hypothesise that this is due to the additional emission features for the previous and the next word, which may have caused the model to overfit to the training data. When the emission scores for the previous and next word are included in the total score, the total score becomes dependent on the previous and next words (contextual words). The contextual words differ from sentence to sentence and from dataset to dataset. Thus, the model would not perform well on new, unseen data. **Set 1** performed second best, with an F1 score very close to that of Set 3. We presume that it fell slightly short because it does not have the combine features that we used in Set 3. Since the combine score is the sum of the emission and transition scores, it emphasises the importance (or lack of importance) of the tag. Thus, having the combine features helps to boost the F1 score. Since the F1 score for Set 3 was the best, we used that set of features for the test set. ## Part 6ii For Part 6ii, we decided to design an Attention Seq2Seq model in order to compare it with the rest of the project. Our main idea is to capture the more important features using the attention model. What we learned in class was that the attention weights help in learning the more important features, as they act as a "multiplier" weight for the features that appear more often. This, in theory, should improve the recall and precision values when compared to the non-neural network approach. The following screenshot of the slide explains the concept of Seq2Seq with attention. ![](https://i.imgur.com/2Icb2Rp.png) We used the PyTorch library to create this model by inheriting the `nn.Module` class. 
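As a library-free illustration of the attention idea described above, the sketch below computes softmax-normalised weights over a few encoder states and uses them to build a weighted context vector. It assumes simple dot-product scoring, which is a common textbook choice and not necessarily what our model does (our model scores with a linear layer); the names `softmax`, `attend` and the toy vectors are made up for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, encoder_states):
    """Return (weights, context) for dot-product attention."""
    # One score per encoder state: dot product with the query vector.
    scores = [sum(q * h for q, h in zip(query, state)) for state in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    # Context vector: attention-weighted sum of the encoder states.
    context = [sum(w * state[d] for w, state in zip(weights, encoder_states))
               for d in range(dim)]
    return weights, context


encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = attend([2.0, 0.0], encoder_states)
print(weights)  # states aligned with the query receive larger weights
print(context)
```

The weights always sum to 1, so the context vector is a convex combination of the encoder states, with the states most similar to the query contributing the most.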
We first initialise the layers needed for the seq2seq architecture, mainly the encoder and the decoder, both of which are bidirectional LSTM layers. We also initialise the attention layer so that the model can learn the weights for the attention mechanism. For our approach, we decided to train on the partial dataset because we felt that it would be simpler to implement and would yield results broadly comparable to the non-neural network approach trained on the full dataset, since the difference between the `full` and `partial` datasets is the POS tagging. An improvement for future development would be to use the full dataset and its POS content to build a more accurate model. The AttentionSeq2Seq model architecture is as follows: ```python import torch import torch.nn as nn import torch.nn.functional as F class AttentionSeq2Seq(nn.Module): ''' Sequence to Sequence Model with some attention to capture context of sentence before assigning tags, uses LSTM Args: - vocab_size : int (size of the vocabulary set) - embedding_dim : int (embedding size) - hidden_dim : int (hidden dimension size) - n_layers : int (number of layers) - tagset_size : int (number of tags) ''' def __init__(self, vocab_size, embedding_dim, hidden_dim, n_layers, tagset_size): super(AttentionSeq2Seq, self).__init__() self.encoder_embed = nn.Embedding(vocab_size, embedding_dim) self.encoder = nn.LSTM(embedding_dim, hidden_dim, n_layers, bidirectional=True) self.decoder_embed = nn.Embedding(vocab_size, embedding_dim) self.attention = nn.Linear(embedding_dim+hidden_dim*2, embedding_dim) self.decoder = nn.LSTM(embedding_dim, hidden_dim, n_layers, bidirectional=True) self.hiddentotag = nn.Linear(hidden_dim*2, tagset_size) def forward(self, x): embeds = self.encoder_embed(x) encoder_out, (hn, cn) = self.encoder(embeds) decode_emb = self.decoder_embed(x) attn = self.attention(torch.cat([encoder_out, decode_emb], dim=2)) attn = F.softmax(attn, dim=2) decoder_out, _ = self.decoder(decode_emb, (hn, cn)) tag_space = 
self.hiddentotag(decoder_out) tag_scores = F.softmax(tag_space, dim=2) return tag_scores ``` The attention model architecture would take in the following arguments to create the model and train it: ```python - vocab_size : int (size of the vocabulary set) - embedding_dim : int (embedding size) - hidden_dim : int (hidden dimension size) - n_layers : int (number of layers) - tagset_size : int (number of tags) ``` We train the model based on the following hyper-parameters: ```python 'EPOCH' : 50 'EMBEDDING_SIZE' : 512, 'HIDDEN_DIM' : 512, 'N_LAYERS' : 3, 'LEARNING_RATE' : 1e-4, 'MOMENTUM' : 0.9, 'WEIGHT_DECAY' : 1e-5, 'OPTIMISER' : Adam, 'LOSS FUNCTION' : Cross Entropy Loss ``` ### Results from Training: #### Precision ![](https://i.imgur.com/79RIJmK.png) #### Recall ![](https://i.imgur.com/EcjZH8q.png) #### Train Losses ![](https://i.imgur.com/p0GxvyM.png) #### Validation Loss ![](https://i.imgur.com/04JgB8o.png) #### Validation Accuracy ![](https://i.imgur.com/fAyF8IU.png) #### Training/Validation process ```python Epoch : 1, Train Loss : 2.5631164767912455, Validation Accuracy : 0.8071428571428572, validation Loss : 2.465279563835689, Precision : 0.734375, Recall : 0.49333333333333335 Epoch : 2, Train Loss : 2.34437217520816, Validation Accuracy : 0.8214285714285714, validation Loss : 2.3961641132831573, Precision : 0.76875, Recall : 0.5666666666666667 Epoch : 3, Train Loss : 2.281953721174172, Validation Accuracy : 0.8428571428571429, validation Loss : 2.378270112616675, Precision : 0.7170634920634921, Recall : 0.6055555555555555 Epoch : 4, Train Loss : 2.2407905993717057, Validation Accuracy : 0.8142857142857143, validation Loss : 2.3696454414299555, Precision : 0.5749158249158249, Recall : 0.5877777777777777 Epoch : 5, Train Loss : 2.2054331332445143, Validation Accuracy : 0.8142857142857143, validation Loss : 2.356927933011736, Precision : 0.6027777777777777, Recall : 0.5655555555555556 Epoch : 6, Train Loss : 2.160745951746191, Validation Accuracy : 
0.8214285714285714, validation Loss : 2.344608277933938, Precision : 0.625, Recall : 0.5877777777777777 Epoch : 7, Train Loss : 2.148998691354479, Validation Accuracy : 0.8214285714285714, validation Loss : 2.3437087884971075, Precision : 0.6361111111111111, Recall : 0.5877777777777777 Epoch : 8, Train Loss : 2.1380474009684156, Validation Accuracy : 0.8214285714285714, validation Loss : 2.3379396608897616, Precision : 0.6361111111111111, Recall : 0.5877777777777777 Epoch : 9, Train Loss : 2.144378139930112, Validation Accuracy : 0.8142857142857143, validation Loss : 2.3376920342445375, Precision : 0.5833333333333334, Recall : 0.5655555555555556 Epoch : 10, Train Loss : 2.1301851836698398, Validation Accuracy : 0.8142857142857143, validation Loss : 2.3377557243619647, Precision : 0.5023809523809524, Recall : 0.5655555555555556 Epoch : 11, Train Loss : 2.1221915464316097, Validation Accuracy : 0.8071428571428572, validation Loss : 2.33673711674554, Precision : 0.5283882783882784, Recall : 0.5655555555555556 Epoch : 12, Train Loss : 2.1106248474546843, Validation Accuracy : 0.8285714285714286, validation Loss : 2.3321053717817577, Precision : 0.5357142857142857, Recall : 0.5655555555555556 Epoch : 13, Train Loss : 2.1064239738242967, Validation Accuracy : 0.8285714285714286, validation Loss : 2.3325614316122874, Precision : 0.5357142857142857, Recall : 0.5655555555555556 Epoch : 14, Train Loss : 2.1024797022342683, Validation Accuracy : 0.8285714285714286, validation Loss : 2.330783372265952, Precision : 0.5357142857142857, Recall : 0.5655555555555556 Epoch : 15, Train Loss : 2.0983524652464047, Validation Accuracy : 0.8285714285714286, validation Loss : 2.329663963828768, Precision : 0.5357142857142857, Recall : 0.5655555555555556 Epoch : 16, Train Loss : 2.094123219379357, Validation Accuracy : 0.8285714285714286, validation Loss : 2.3280977036271775, Precision : 0.5283882783882784, Recall : 0.5655555555555556 Epoch : 17, Train Loss : 2.0890290632843973, Validation 
Accuracy : 0.8285714285714286, validation Loss : 2.3266073618616376, Precision : 0.5283882783882784, Recall : 0.5655555555555556 Epoch : 18, Train Loss : 2.090371375637395, Validation Accuracy : 0.8285714285714286, validation Loss : 2.32764288016728, Precision : 0.5357142857142857, Recall : 0.5655555555555556 Epoch : 19, Train Loss : 2.095219942288739, Validation Accuracy : 0.8285714285714286, validation Loss : 2.3254343450069426, Precision : 0.511904761904762, Recall : 0.5655555555555556 Epoch : 20, Train Loss : 2.078696032507079, Validation Accuracy : 0.8285714285714286, validation Loss : 2.3245972079890116, Precision : 0.5357142857142857, Recall : 0.5655555555555556 Epoch : 21, Train Loss : 2.08551876864263, Validation Accuracy : 0.8285714285714286, validation Loss : 2.3229639640876223, Precision : 0.49489795918367346, Recall : 0.5655555555555556 Epoch : 22, Train Loss : 2.07079145865781, Validation Accuracy : 0.8285714285714286, validation Loss : 2.322780510357448, Precision : 0.6, Recall : 0.5655555555555556 Epoch : 23, Train Loss : 2.073653793334961, Validation Accuracy : 0.8142857142857143, validation Loss : 2.3228782185486385, Precision : 0.511904761904762, Recall : 0.5655555555555556 Epoch : 24, Train Loss : 2.0815608524850435, Validation Accuracy : 0.8214285714285714, validation Loss : 2.322839352914265, Precision : 0.49489795918367346, Recall : 0.5655555555555556 Epoch : 25, Train Loss : 2.072092562488147, Validation Accuracy : 0.8357142857142857, validation Loss : 2.3215960468564716, Precision : 0.5, Recall : 0.5655555555555556 Epoch : 26, Train Loss : 2.074716635474137, Validation Accuracy : 0.8285714285714286, validation Loss : 2.3244994734014783, Precision : 0.45, Recall : 0.5655555555555556 Epoch : 27, Train Loss : 2.080420016178063, Validation Accuracy : 0.8285714285714286, validation Loss : 2.322144877910614, Precision : 0.4761904761904762, Recall : 0.5655555555555556 Epoch : 28, Train Loss : 2.0739777152027403, Validation Accuracy : 
0.8214285714285714, validation Loss : 2.321476606811796, Precision : 0.5722222222222223, Recall : 0.5655555555555556 Epoch : 29, Train Loss : 2.0728431410023145, Validation Accuracy : 0.8285714285714286, validation Loss : 2.3193008601665497, Precision : 0.4904761904761905, Recall : 0.5655555555555556 Epoch : 30, Train Loss : 2.0750978135636875, Validation Accuracy : 0.8357142857142857, validation Loss : 2.319464339528765, Precision : 0.5095238095238095, Recall : 0.6044444444444445 Epoch : 31, Train Loss : 2.0633754361953054, Validation Accuracy : 0.8357142857142857, validation Loss : 2.314763639654432, Precision : 0.5722222222222223, Recall : 0.5822222222222223 Epoch : 32, Train Loss : 2.0682562383157865, Validation Accuracy : 0.8357142857142857, validation Loss : 2.3142775901726313, Precision : 0.5714285714285714, Recall : 0.5822222222222223 Epoch : 33, Train Loss : 2.0651103651949336, Validation Accuracy : 0.8357142857142857, validation Loss : 2.314954174416406, Precision : 0.5, Recall : 0.5822222222222223 Epoch : 34, Train Loss : 2.0706394421202794, Validation Accuracy : 0.8357142857142857, validation Loss : 2.313420624392373, Precision : 0.5523809523809524, Recall : 0.5822222222222223 Epoch : 35, Train Loss : 2.0726764434150287, Validation Accuracy : 0.8357142857142857, validation Loss : 2.313703598294939, Precision : 0.5636752136752137, Recall : 0.5822222222222223 Epoch : 36, Train Loss : 2.062797150228705, Validation Accuracy : 0.8357142857142857, validation Loss : 2.3143895370619636, Precision : 0.4904761904761905, Recall : 0.5822222222222223 Epoch : 37, Train Loss : 2.065275397683893, Validation Accuracy : 0.8357142857142857, validation Loss : 2.317206164768764, Precision : 0.5555555555555556, Recall : 0.5822222222222223 Epoch : 38, Train Loss : 2.064026991171496, Validation Accuracy : 0.8357142857142857, validation Loss : 2.318472003085273, Precision : 0.5833333333333334, Recall : 0.5822222222222223 Epoch : 39, Train Loss : 2.072461142071656, Validation 
Accuracy : 0.8357142857142857, validation Loss : 2.3155653962067198, Precision : 0.5747863247863249, Recall : 0.5822222222222223 Epoch : 40, Train Loss : 2.0605525123221535, Validation Accuracy : 0.8357142857142857, validation Loss : 2.3137236527034215, Precision : 0.5747863247863249, Recall : 0.5822222222222223 Epoch : 41, Train Loss : 2.0626814878412656, Validation Accuracy : 0.8357142857142857, validation Loss : 2.3140461002077375, Precision : 0.5747863247863249, Recall : 0.5822222222222223 Epoch : 42, Train Loss : 2.0600655572755, Validation Accuracy : 0.8357142857142857, validation Loss : 2.3147521955626353, Precision : 0.5555555555555556, Recall : 0.5822222222222223 Epoch : 43, Train Loss : 2.0527449526957104, Validation Accuracy : 0.8357142857142857, validation Loss : 2.314120681796755, Precision : 0.5714285714285714, Recall : 0.5822222222222223 Epoch : 44, Train Loss : 2.06457995729787, Validation Accuracy : 0.8357142857142857, validation Loss : 2.312924523012979, Precision : 0.5636752136752137, Recall : 0.5822222222222223 Epoch : 45, Train Loss : 2.0538459283964974, Validation Accuracy : 0.8357142857142857, validation Loss : 2.3117253720760345, Precision : 0.5142857142857143, Recall : 0.5822222222222223 Epoch : 46, Train Loss : 2.0531645672661916, Validation Accuracy : 0.8214285714285714, validation Loss : 2.312817904778889, Precision : 0.48831168831168836, Recall : 0.5822222222222223 Epoch : 47, Train Loss : 2.049910864021097, Validation Accuracy : 0.8357142857142857, validation Loss : 2.306066176721028, Precision : 0.5142857142857143, Recall : 0.5822222222222223 Epoch : 48, Train Loss : 2.057153474858829, Validation Accuracy : 0.8357142857142857, validation Loss : 2.3062964379787445, Precision : 0.5142857142857143, Recall : 0.5822222222222223 Epoch : 49, Train Loss : 2.039928914819445, Validation Accuracy : 0.8357142857142857, validation Loss : 2.3062496185302734, Precision : 0.4904761904761905, Recall : 0.5822222222222223 Epoch : 50, Train Loss : 
2.03866273441485, Validation Accuracy : 0.8357142857142857, validation Loss : 2.30652220419475, Precision : 0.4904761904761905, Recall : 0.5822222222222223 ``` #### Results ![](https://i.imgur.com/k8OXc7Y.png) Based on the results, the Attention Seq2Seq model seems to perform worse than the other models. One possible reason is that the model lacks a variety of features to learn from, as it is trained only on the `partial` dataset. Training on the `full` dataset may produce a better model, as it could learn the association between the POS tags and the NER tags. Another possible reason is that the model has not learned enough in the number of epochs trained, in which case it may still improve if it is trained for more epochs. Additionally, the model may learn more and predict better if it is given a larger and higher-quality dataset, since both the quality and the quantity of the training data affect the resulting model.
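The training log reports precision and recall but not F1, whereas the earlier parts are compared on F1. As a rough aid for that comparison, the sketch below computes F1 as the harmonic mean of the precision and recall logged for the final epoch. The helper name `f1_score` is hypothetical, and the result is only indicative: we assume here that the logged precision and recall are comparable to those produced by `conlleval.evaluate`, which may not hold if they are averaged differently.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the epoch 50 line of the training log.
final_precision = 0.4904761904761905
final_recall = 0.5822222222222223

final_f1 = f1_score(final_precision, final_recall)
print(round(final_f1, 4))
```

Because F1 is a harmonic mean, it always lies between the precision and the recall and is pulled towards the lower of the two.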