# 50.007 Machine Learning - Sentiment Analysis Project Report
## Part 2
### Emission Function
Implement the emission function that estimates the emission parameters from the training set using MLE (maximum likelihood estimation).
```python
'''
Inputs:
x (str): word to be queried
y (str): label to be queried
tag_word_table (dictionary): data in table form
k (float): smoothed count assigned to the #UNK# token
Output:
probability (float): probability of generating x from y based on tag_word_table
Function:
Calculates emission probability
'''
def emission(x, y, tag_word_table, k = 0.5):
word_list = tag_word_table[y]
if x == "#UNK#":
emission_count = k
else:
emission_count = word_list.count(x)
ycount = len(word_list)
return emission_count / (ycount + k)
```
With the estimated emission parameters, we generate predictions for the output labels and evaluate the results with the provided evaluation script.
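For reference, a minimal sketch of how such a prediction could be made with the `emission` function above; the helper name `predict_tag` and the `training_words` vocabulary argument are ours, not part of the project code:
```python
# Predict a word's tag as the one with the highest emission probability,
# mapping words never seen in training to "#UNK#" first.
def predict_tag(word, tag_word_table, training_words, k=0.5):
    if word not in training_words:
        word = "#UNK#"
    best_tag, best_prob = None, -1.0
    for tag in tag_word_table:
        prob = emission(word, tag, tag_word_table, k)
        if prob > best_prob:
            best_tag, best_prob = tag, prob
    return best_tag
```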
#### EN:
> #Entity in gold data: 13179
> #Entity in prediction: 18650
>
> #Correct Entity : 9542
> Entity precision: 0.5116
> Entity recall: 0.7240
> Entity F: 0.5996
>
> #Correct Sentiment : 8456
> Sentiment precision: 0.4534
> Sentiment recall: 0.6416
> Sentiment F: 0.5313
#### CN:
> #Entity in gold data: 700
> #Entity in prediction: 4248
>
> #Correct Entity : 345
> Entity precision: 0.0812
> Entity recall: 0.4929
> Entity F: 0.1395
>
> #Correct Sentiment : 167
> Sentiment precision: 0.0393
> Sentiment recall: 0.2386
> Sentiment F: 0.0675
#### SG:
> #Entity in gold data: 4301
> #Entity in prediction: 12237
>
> #Correct Entity : 2386
> Entity precision: 0.1950
> Entity recall: 0.5548
> Entity F: 0.2885
>
> #Correct Sentiment : 1531
> Sentiment precision: 0.1251
> Sentiment recall: 0.3560
> Sentiment F: 0.1851
## Part 3
### Transition Function
Implement the transition function that estimates the transition parameters from the training set using MLE (maximum likelihood estimation).
```python
'''
Inputs:
u (str): initial state/tag
v (str): final state/tag
words (list): list of training sentences, each a list of (word, tag) pairs
Output:
transition_probability (float): transition probability from state u to state v
Function:
Takes initial state u and final state v, calculates the transition probability from state u to state v
'''
def transition(u,v,words):
count_u = 0
count_u_v = 0
for sentence in words:
if u == "START":
count_u = len(words)
if sentence[0][1] == v:
count_u_v +=1
elif v == "STOP":
for i in range(len(sentence)):
if sentence[i][1] == u:
count_u += 1
if i+1 == len(sentence):
count_u_v += 1
else:
for i in range(len(sentence)):
if sentence[i][1] == u:
count_u += 1
if i+1!=len(sentence) and sentence[i+1][1] == v:
count_u_v += 1
return count_u_v/count_u
```
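For context, a hedged sketch (ours, not the report's exact code) of how the transition table used by the Viterbi function below could be assembled from `transition()`, assuming `l_labels = ["START", tag_1, ..., tag_k, "STOP"]` so that rows are the possible "from" states and columns the possible "to" states:
```python
import numpy as np

# Build the transition table by querying transition() for every (from, to) pair.
def build_transition_table(l_labels, training_sentences):
    rows = l_labels[:-1]   # START and all tags (states a transition can start from)
    cols = l_labels[1:]    # all tags and STOP (states a transition can end at)
    tr_arr = np.zeros([len(rows), len(cols)])
    for i, u in enumerate(rows):
        for j, v in enumerate(cols):
            tr_arr[i][j] = transition(u, v, training_sentences)
    return tr_arr
```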
### Viterbi Algorithm
Implement the Viterbi Algorithm that predicts the output labels for a given sentence.
```python
'''
Inputs:
n (list): list of input words to be predicted
l_labels (list): list of all possible states
tr_arr (2d numpy array): transition table
em_arr (2d numpy array): emission table
word_list (list): list of all words in the training set
Output:
args (list): list of predicted states
Function:
Takes input sentence of n words, outputs list of n predicted tags/states corresponding to input words
'''
def viterbi(n, l_labels, tr_arr, em_arr, word_list):
# Initializing pi array
pi_arr = np.zeros([len(l_labels[1:-1]),len(n)+2])
### Forwarding Algo
# Iterating through the columns of the pi array
for j in range(len(n)+2):
# column 0: START nodes, assign value 1 to all nodes in column 0
if j == 0:
for i in range(len(pi_arr)):
pi_arr[i][j] = 1
# column 1: nodes right after START, pi value = 1 (pi value of START) * transition_prob * emission_prob
elif j == 1:
if n[j-1] not in word_list:
n[j-1] = "#UNK#"
for u_idx, u in enumerate((l_labels[1:-1])):
if tr_arr[0][u_idx] == 0 or em_arr[word_list.index(n[j-1])][u_idx] == 0:
pi_arr[u_idx][j] = float('-inf')
else:
pi_arr[u_idx][j] = np.log(tr_arr[0][u_idx]) + np.log(em_arr[word_list.index(n[j-1])][u_idx])
# columns 2 to n : pi value = max(pi value of previous nodes * transition_prob * emission_prob)
elif j > 1 and j < len(n)+1:
# n[j-1]: current word in sentence, if not found in word_list, replace with "#UNK#"
if n[j-1] not in word_list:
n[j-1] = "#UNK#"
# Iterating through column v: current column
for u_idx, u in enumerate((l_labels[1:-1])): # v
# array to store temp scores
pi_temp = []
# Iterating through column u: previous column
for u_idx_temp, u_temp in enumerate((l_labels[1:-1])): # u
if tr_arr[u_idx_temp+1][u_idx] == 0 or em_arr[word_list.index(n[j-1])][u_idx] == 0:
pi_temp.append(float('-inf'))
else:
# append pi_value_v (current) = pi_value_u (previous) * transition_prob(u,v) * emission_prob(v,word)
pi_temp.append(pi_arr[u_idx_temp][j-1] + np.log(tr_arr[u_idx_temp+1][u_idx]) + np.log(em_arr[word_list.index(n[j-1])][u_idx]))
#pi_value_v = max(_temp)
pi_arr[u_idx][j] = max(pi_temp)
# column n+1 : STOP node: pi value = max(pi value of previous nodes * transition_prob)
elif j == len(n)+1:
pi_temp = []
for u_idx, u in enumerate((l_labels[1:-1])):
if tr_arr[u_idx+1][len(l_labels[1:-1])] == 0:
pi_temp.append(float("-inf"))
else:
pi_temp.append(np.log(tr_arr[u_idx+1][len(l_labels[1:-1])]) + pi_arr[u_idx][j-1])
for u_idx_temp, u_temp in enumerate((l_labels[1:-1])):
pi_arr[u_idx_temp][j] = max(pi_temp)
### Backtracking Algo
# list to store predicted outputs
args = []
# Stores the index of the current node with the highest score
last_idx = len(l_labels[1:-1])
# Iterating from n to 1: n, n-1, n-2...1
for i in range(len(n),0,-1):
# array to store all temp scores calculated
temp = []
# Iterating through the rows
for u_idx, u in enumerate((l_labels[1:-1])):
if tr_arr[u_idx+1][last_idx] == 0:
temp.append(float("-inf"))
else:
# append the score = transition_prob * pi value to temp
temp.append(np.log(tr_arr[u_idx+1][last_idx]) + pi_arr[u_idx][i])
# update last_idx with the index of the node that had the highest score
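# if every candidate score is -inf (impossible path), fall back to a fixed index that forces an "O" prediction (see explanation below)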
if np.max(temp) == float("-inf"):
last_idx = 7
else:
last_idx = np.argmax(temp)
# append tag/label corresponding to node with highest score to output
args.append(l_labels[last_idx+1])
return list(reversed(args))
```
To account for the numerical underflow issue that occurs when many small probabilities (values between 0 and 1) are repeatedly multiplied together during the forward portion of the algorithm, we modified the forward pi score recurrence to:
> `pi_current = pi_prev + log(transition(prev,current)) + log(emission(current, observation))`

instead of the previous:
> `pi_current = pi_prev * transition(prev,current) * emission(current, observation)`

Since `log(0)` is undefined, any node whose transition or emission probability is 0 has its score set to `float("-inf")` instead.
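As a quick illustration of why the log transform is needed (a toy example we added, not project code):
```python
# Multiplying many small probabilities underflows to 0.0 in double precision,
# while the equivalent sum of logs remains perfectly representable.
import numpy as np

p = 1e-5
print(p ** 100)         # 0.0 -> underflow; the true value is 1e-500
print(100 * np.log(p))  # -1151.29..., still usable as a Viterbi score
```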
Furthermore, to account for the impossible-path issue that occurs during the backtracking portion of the algorithm, where all the pi values for a given observation are `float("-inf")`, we force an output label of "O", which increases the accuracy on the EN dataset.
Finally, with the estimated emission and transition parameters, we generate predictions for the output labels using the Viterbi algorithm and evaluate the results with the provided evaluation script.
#### EN:
> #Entity in gold data: 13179
> #Entity in prediction: 12520
>
> #Correct Entity : 10139
> Entity precision: 0.8098
> Entity recall: 0.7693
> Entity F: 0.7891
>
> #Correct Sentiment : 9656
> Sentiment precision: 0.7712
> Sentiment recall: 0.7327
> Sentiment F: 0.7515
#### SG:
> #Entity in gold data: 4301
> #Entity in prediction: 4221
>
> #Correct Entity : 2193
> Entity precision: 0.5195
> Entity recall: 0.5099
> Entity F: 0.5147
>
> #Correct Sentiment : 1805
> Sentiment precision: 0.4276
> Sentiment recall: 0.4197
> Sentiment F: 0.4236
#### CN:
> #Entity in gold data: 700
> #Entity in prediction: 815
>
> #Correct Entity : 204
> Entity precision: 0.2503
> Entity recall: 0.2914
> Entity F: 0.2693
>
> #Correct Sentiment : 121
> Sentiment precision: 0.1485
> Sentiment recall: 0.1729
> Sentiment F: 0.1597
## Part 4
### Finding the top 3 paths using the Viterbi Algorithm
We modified the Viterbi algorithm to store, for each node, the top 3 pi values from the previous layer. This guarantees that we find the top 3 paths when backtracking from the 'STOP' node.
```python
'''
Input:
n (list of str): sentence split into list by spaces
l_labels (list of str): list of tags
tr_arr (2d numpy arr): transition table tr_arr[initial state, final state]
em_arr (2d numpy arr): emission table em_arr[word, state]
word_list (list of str): list of all words in the training set
Output:
pred_tag (list of string): list of predicted tags
Function:
Does modified viterbi algorithm to find the top 3 best path
'''
def viterbi_top3(n, l_labels, tr_arr, em_arr, word_list):
args = []
all_states = l_labels[1:-1]
len_states = len(all_states)
# Set up pi array where each node stores the top 3 [pi_prob, parent_idx] pairs
pi_arr = np.zeros([len_states,len(n),3,2])
# going through the table
for j in range(len(n)):
curr_word = n[j]
if curr_word not in word_list:
curr_word = "#UNK#"
word_idx = word_list.index(curr_word)
# First column
if j == 0:
for curr_state in range(len_states):
if tr_arr[0][curr_state] == 0 or em_arr[word_idx][curr_state] == 0:
pi_val = float('-inf')
else:
pi_val = np.log(tr_arr[0][curr_state]) + np.log(em_arr[word_idx][curr_state])
pi_arr[curr_state,j] = np.array([[pi_val, 0], [pi_val, 0], [pi_val, 0]])
# Second column
elif j == 1:
for curr_state in range(len_states):
temp_vals = []
prev_pi_vals = pi_arr[:,j-1,:,:]
for u_idx, state in enumerate(prev_pi_vals):
pi_val, parent_idx = state[0,:]
prev_state = u_idx
if tr_arr[prev_state+1][curr_state] == 0 or em_arr[word_idx][curr_state] == 0:
new_pi_val = float('-inf')
else:
new_pi_val = pi_val + np.log(tr_arr[prev_state+1][curr_state]) + np.log(em_arr[word_idx][curr_state])
temp_vals.append([new_pi_val, prev_state])
# Sort the values in descending order by using the first value of the list (pi_val)
sorted_vals = sorted(temp_vals, key = lambda x: x[0], reverse=True)
pi_arr[curr_state,j] = np.array(sorted_vals[:3])
# All other columns
elif j > 1:
for curr_state in range(len_states):
prev_pi_vals = pi_arr[:,j-1,:,:]
top3 = find_top3(prev_pi_vals, len_states, tr_arr, em_arr, curr_state, word_idx)
pi_arr[curr_state,j] = top3
temp_vals = []
last_pi_vals = pi_arr[:,len(n)-1,:,:]
for u_idx, state in enumerate(last_pi_vals):
for pi_val, parent_idx in state:
prev_state = u_idx
tr_val = tr_arr[prev_state+1][-1]
if tr_val == 0:
new_pi_val = float('-inf')
else:
new_pi_val = pi_val + np.log(tr_val)
temp_vals.append([new_pi_val, prev_state])
# Sort the values in descending order by using the first value of the list (pi_val)
sorted_vals = sorted(temp_vals, key = lambda x: x[0], reverse=True)
top3 = np.array(sorted_vals[:3])
# paths_table contains all the paths that visited that node for each node in the graph
# the path at index j and with state_ind state is paths_table[j][state]
paths_table = []
path = []
for j in range(len(n)):
temp_paths = []
for state in range(len_states):
temp_paths.append([])
paths_table.append(temp_paths)
pred_vals_best, paths_table = traceback(int(top3[0,1]), n, pi_arr, all_states, paths_table)
pred_vals_2ndbest, paths_table = traceback(int(top3[1,1]), n, pi_arr, all_states, paths_table)
pred_vals_3rdbest, paths_table = traceback(int(top3[2,1]), n, pi_arr, all_states, paths_table)
return pred_vals_best
```
In the code above, the pi array has the different tags (not including START and STOP) as rows and the words of the input sentence as columns. Each node holds a 3x2 numpy array storing the top 3 [pi_val, parent_idx] pairs; storing the parent index helps with the backtracking later on. In the first column, all 3 entries are the same, as the node can only be reached from the 'START' state. In the 2nd column, we compare all the nodes from the 1st column and keep the top 3. From the 3rd column onwards, we compare all the top 3 entries from the previous layer, giving us three times as many pi values to compare as in the original Viterbi algorithm, and store the top 3 [pi_val, parent_idx] pairs in each node. This is computed using the find_top3 function.
```python
'''
Input:
prev_pi_vals (len_statesx3x2 numpy array): previous pi values
len_states (int): num of possible states
tr_arr (2d numpy arr): transition table
em_arr (2d numpy arr): emission table
curr_state (int): current state index in the table
word_idx (int): index of current word in word_list
Output:
top3 (3x2 numpy array): array of the top 3 values with the 1st value being the highest
Function:
Finds the top 3 probabilities and their parents
'''
def find_top3(prev_pi_vals, len_states, tr_arr, em_arr, curr_state, word_idx):
# Initialize
temp_vals = []
for u_idx, state in enumerate(prev_pi_vals):
for pi_val, parent_idx in state:
prev_state = u_idx
tr_val = tr_arr[prev_state+1][curr_state]
em_val = em_arr[word_idx][curr_state]
if tr_val == 0 or em_val == 0:
new_pi_val = float('-inf')
else:
new_pi_val = pi_val + np.log(tr_val) + np.log(em_val)
temp_vals.append([new_pi_val, prev_state])
# Sort the values in descending order by using the first value of the list (pi_val)
sorted_vals = sorted(temp_vals, key = lambda x: x[0], reverse=True)
top3 = np.array(sorted_vals[:3])
return top3
```
When we reach the end of the pi array, the pi values in the last column, each combined with the transition value from that node to the STOP state, are compared to find the top 3. These top 3 values are then used in the backtracking algorithm.
```python
'''
Input:
parent_idx (int): index of parent of STOP
n (list): input sentence
pi_arr (2d numpy arr): table of pi values
all_states (list): list of all the tags not including START and STOP
paths_table (nested list): table containing all the paths of previous iterations for each node
Output:
pred_vals (list): predicted values
paths_table (list of list): updated path table for this iteration
Function:
Traceback using the parent values
'''
def traceback(parent_idx, n, pi_arr, all_states, paths_table):
pred_vals = []
path = ['STOP']
for curr_idx in range(len(n)-1,0,-1):
best_idx = paths_table[curr_idx][parent_idx].count(path)
paths_table[curr_idx][parent_idx].append(path)
next_path = path + [all_states[parent_idx]]
path = next_path
pred_vals.append(all_states[parent_idx])
_, parent_idx = pi_arr[parent_idx, curr_idx, best_idx, :]
parent_idx = int(parent_idx)
pred_vals.append(all_states[parent_idx])
pred_vals.reverse()
return pred_vals, paths_table
```
In the backtracking algorithm, we use the parent value calculated above to start the traceback. This parent value leads us to the previous node, where we check how many times this same path has already been taken; that count tells us which of the top 3 values to use when backtracking. E.g. if backtracking visits node 1 from ['STOP', 'O'], it stores ['STOP', 'O'] in the list for node 1. If ['STOP', 'O'] is taken again the next time the backtracking algorithm is run, the 2nd best value is used for this node instead. This is repeated until the first word of the sentence, and the full predicted list is returned.
#### EN:
> #Entity in gold data: 13179
> #Entity in prediction: 14371
>
> #Correct Entity : 10611
> Entity precision: 0.7384
> Entity recall: 0.8051
> Entity F: 0.7703
>
> #Correct Sentiment : 9736
> Sentiment precision: 0.6775
> Sentiment recall: 0.7388
> Sentiment F: 0.7068
## Part 5
### Structural Perceptron
For Part 5, we attempted to use a Structural Perceptron to make the predictions. We followed the algorithm described at http://www.phontron.com/slides/nlp-programming-en-12-struct.pdf, but made a few modifications to better suit our use case.
#### Features
Making use of what we already have, we decided to define our features using the transition and emission probabilities that we created and used for the standard HMM Viterbi algorithm in Part 3. We define a function `get_features(key)` which takes in a key and returns the respective feature value. The key has the following format: `((prev_state, current_state), (current_state, word))`
```python
'''
Inputs:
key (2d tuple): Takes in a 2d tuple that contains 2 tuples of transition state and
emission state. I.e: ((prev_state, current_state), (current_state, word))
Output:
A float value that is the emission probability * transition probability
Function:
Takes in a key and finds the respective emission and transition probability.
Multiply the two and return the value
'''
def get_features(self,key):
#Extract the states and word from the key
transition = key[0]
emission = key[1]
prev_state = transition[0]
current_state = transition[1]
word = emission[1]
#Retrieve the index of the states and word
prev_state_index = self.labels_list.index(prev_state)
#labels_list contains START, but the current_state columns in both tr_table and em_table do not, so we subtract 1 to offset the index to match the tables
current_state_index = self.labels_list.index(current_state) - 1
word_index = self.word_list.index(word)
#Get the transition feature value
trans_val = self.transition_table[prev_state_index][current_state_index]
#Get the emission feature value
if current_state == "STOP": # if the state is STOP, there is no emission value
em_val = 0
else:
em_val = self.emission_table[word_index][current_state_index]
# Return the product of the transition probability and the emission probability
return trans_val * em_val
```
#### Weights
As we are trying to improve on the Part 3 Viterbi algorithm, we initialise all the weights to a value of 1. This way, the predicted output of our algorithm before any weight updates is exactly the same as the output of the Part 3 Viterbi algorithm. We then spot the mistakes made by the Part 3 Viterbi algorithm and update the weights to better handle them: whenever we spot a mistake in the prediction, we reduce the predicted key's weight by subtracting the difference between the actual and predicted feature values, and we increase the correct key's weight by adding that difference.
```python
'''
Inputs:
training_sentences (list): list of sentences from the training set
training_labels (list): list of correct labels from the training set
epochs (int): Specify the number of epochs to run the training algo
Output:
None
Function:
Takes in a list of sentences and their respective labels from the training set and trains the weights using the perceptron update algorithm.
'''
# The following code is truncated for the sake of being succinct
def perceptron_training(self,training_sentences, training_labels, epochs):
if current_state != actual_current_state:
pred_features = self.get_features(key)
actual_features = self.get_features(correct_key)
if (actual_features - pred_features) > 0:
#if actual features is higher, we want to reduce the predicted weights and boost the actual weights
#Reduce predicted weights
weights -= (actual_features - pred_features)
self.update_weights(weights, key)
#Boost the actual weights
correct_weights += (actual_features - pred_features)
self.update_weights(correct_weights,correct_key)
elif (actual_features - pred_features) < 0:
#if actual features is lower, we want to reduce the predicted weights and boost the actual weights
#Reduce predicted weights (actual-pred) is negative
weights -= (-(actual_features - pred_features))
self.update_weights(weights, key)
#Boost the actual weights: -(actual-pred) is positive
correct_weights += (-(actual_features - pred_features))
self.update_weights(correct_weights,correct_key)
```
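For clarity, a minimal self-contained sketch (our own simplification, not the class method above) of the update rule just described:
```python
# On a mistaken tag, shrink the weight of the predicted key and boost the weight
# of the correct key by the absolute difference between the actual and predicted
# feature values. Weights default to 1, matching the initialisation described above.
def perceptron_update(weights, pred_key, correct_key, pred_feat, actual_feat):
    diff = abs(actual_feat - pred_feat)
    weights[pred_key] = weights.get(pred_key, 1.0) - diff
    weights[correct_key] = weights.get(correct_key, 1.0) + diff
    return weights
```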
#### Regularization
In order to prevent particular weights from being boosted to very high or very low values when the algorithm makes many mistakes on the same instance, we added a learning rate and a regularizer to control how much each weight gets updated. When a weight is updated, we take the reciprocal of its regularisation parameter and multiply it by the update value. Two hyperparameters can be set when calling the `perceptron_training` function: `irp` defines the initial regularisation parameter, and `learning_rate` defines how much is added to the regularisation parameter after each update to that weight. The `perceptron_training` function is now as follows:
```python
'''
Inputs:
training_sentences (list): list of sentences from the training set
training_labels (list): list of correct labels from the training set
epochs (int): Specify the number of epochs to run the training algo
irp (int/float): Specify the initial regularisation parameter
learning_rate (int/float): Specify how much to add to the regularisation parameter after each update
Output:
None
Function:
Takes in a list of sentences and their respective labels from the training set and trains the weights using the perceptron update algorithm.
'''
# The following code is truncated for the sake of being succinct
def perceptron_training(self,training_sentences, training_labels, epochs, irp = 5, learning_rate = 0.1):
if current_state != actual_current_state:
pred_features = self.get_features(key)
actual_features = self.get_features(correct_key)
if key in regularisation_dict.keys():
regularisation_param = regularisation_dict[key]
else:
regularisation_param = irp
if correct_key in regularisation_dict.keys():
correct_regularisation_param = regularisation_dict[correct_key]
else:
correct_regularisation_param = irp
if (actual_features - pred_features) > 0:
#if actual features is higher, we want to reduce the predicted weights and boost the actual weights
#Reduce predicted weights
weights -= ((1/regularisation_param) * (actual_features - pred_features))
self.update_weights(weights, key)
regularisation_dict[key] = regularisation_param + learning_rate
#Boost the actual weights
correct_weights += ((1/correct_regularisation_param) * (actual_features - pred_features))
self.update_weights(correct_weights,correct_key)
regularisation_dict[correct_key] = correct_regularisation_param + learning_rate
elif (actual_features - pred_features) < 0:
#if actual features is lower, we want to reduce the predicted weights and boost the actual weights
#Reduce predicted weights (actual-pred) is negative
weights -= ((1/regularisation_param) * (-(actual_features - pred_features)))
self.update_weights(weights, key)
regularisation_dict[key] = regularisation_param + learning_rate
#Boost the actual weights: -(actual-pred) is positive
correct_weights += ((1/correct_regularisation_param) * (-(actual_features - pred_features)))
self.update_weights(correct_weights,correct_key)
regularisation_dict[correct_key] = correct_regularisation_param + learning_rate
```
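A toy illustration (ours) of the resulting behaviour: every update to a key increases its regularisation parameter by `learning_rate`, so the effective step size `1/param` applied to that key shrinks over time.
```python
# With irp = 5 and learning_rate = 0.1, successive updates to the same key
# use a slightly smaller step size each time.
irp, learning_rate = 5, 0.1
param = irp
for update in range(1, 4):
    print(f"update {update}: step size {1 / param:.4f}")
    param += learning_rate
# update 1: step size 0.2000
# update 2: step size 0.1961
# update 3: step size 0.1923
```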
#### Modified Viterbi
In order to make use of the weights when running the Viterbi algorithm, we had to update it. The only change is to multiply the weight of each key with its emission and transition terms, e.g. `get_weights((u,v),(v,o)) * tr_array(u,v) * em_array(v,o) * pi(u,j-1)`
```python
'''
Inputs:
n (list): list of input words to be predicted
l_labels (list): list of all possible states
tr_arr (2d numpy array): transition table
em_arr (2d numpy array): emission table
word_list (list): list of all words in the training set
weights (dict): dictionary of weights keyed by ((prev_state, current_state), (current_state, word))
Output:
args (list): list of predicted states
Function:
Takes input sentence of n words, outputs list of n predicted tags/states corresponding to input words
'''
def viterbi_sp(n, l_labels, tr_arr, em_arr, word_list, weights):
# Initializing pi array
pi_arr = np.zeros([len(l_labels[1:-1]),len(n)+2])
parent = np.zeros([len(l_labels[1:-1]),len(n)+2])
pred = []
### Forwarding Algo
# Iterating through the columns of the pi array
for j in range(len(n)+2):
# column 0: START nodes, assign value 0 (log of 1) to all nodes in column 0
if j == 0:
for i in range(len(pi_arr)):
pi_arr[i][j] = 0
# column 1: nodes right after START, pi value = 1 (pi value of START) * transition_prob * emission_prob
elif j == 1:
if n[j-1] not in word_list:
n[j-1] = "#UNK#"
for u_idx, u in enumerate((l_labels[1:-1])):
if tr_arr[0][u_idx] == 0 or em_arr[word_list.index(n[j-1])][u_idx] == 0:
pi_arr[u_idx][j] = float('-inf')
else:
pi_arr[u_idx][j] = weights[('START', l_labels[u_idx+1]),(l_labels[u_idx+1], n[j-1])]*((np.log(tr_arr[0][u_idx])+1) + (np.log(em_arr[word_list.index(n[j-1])][u_idx])+1))
parent[u_idx][j] = 0
# columns 2 to n : pi value = max(pi value of previous nodes * transition_prob * emission_prob)
elif j > 1 and j < len(n)+1:
# n[j-1]: current word in sentence, if not found in word_list, replace with "#UNK#"
if n[j-1] not in word_list:
n[j-1] = "#UNK#"
# Iterating through column v: current column
for u_idx, u in enumerate((l_labels[1:-1])): # v
# array to store temp scores
pi_temp = []
# Iterating through column u: previous column
for u_idx_temp, u_temp in enumerate((l_labels[1:-1])): # u
if tr_arr[u_idx_temp+1][u_idx] == 0 or em_arr[word_list.index(n[j-1])][u_idx] == 0:
pi_temp.append(float('-inf'))
else:
# append pi_value_v (current) = pi_value_u (previous) * transition_prob(u,v) * emission_prob(v,word)
pi_temp.append(pi_arr[u_idx_temp][j-1] +
weights[(l_labels[u_idx_temp+1], l_labels[u_idx+1]),(l_labels[u_idx+1], n[j-1])]*
((np.log(tr_arr[u_idx_temp+1][u_idx])+1) + (np.log(em_arr[word_list.index(n[j-1])][u_idx])+1)))
#pi_value_v = max(_temp)
pi_arr[u_idx][j] = max(pi_temp)
parent[u_idx][j] = np.argmax(pi_temp)+1
# column n+1 : STOP node: pi value = max(pi value of previous nodes * transition_prob)
elif j == len(n)+1:
pi_temp = []
for u_idx, u in enumerate((l_labels[1:-1])):
if tr_arr[u_idx+1][len(l_labels[1:-1])] == 0:
pi_temp.append(float("-inf"))
else:
pi_temp.append(weights[(l_labels[u_idx+1], 'STOP'),('STOP', 'none')]*
(np.log(tr_arr[u_idx+1][len(l_labels[1:-1])])+1) + pi_arr[u_idx][j-1])
for u_idx_temp, u_temp in enumerate((l_labels[1:-1])):
pi_arr[u_idx_temp][j] = max(pi_temp)
parent[u_idx_temp][j] = np.argmax(pi_temp)+1
### Backtracking Algo
# list to store predicted outputs
args = []
# Stores the index of the current node with the highest score
last_idx = len(l_labels[1:-1])
# Iterating from n to 1: n, n-1, n-2...1
for i in range(len(n)+1,1,-1):
# update last_idx with the index of the node that had the highest score
if i == len(n)+1:
last_idx = int(parent[0][i])
else:
last_idx = int(parent[last_idx-1][i]) #we have to -1 to undo the offset made when saving the argmax earlier
# append tag/label corresponding to node with highest score to output
args.append(l_labels[int(last_idx)])
return list(reversed(args))
```
#### Results
Using this algorithm, we were able to achieve the following results for EN/dev.in:
> #Entity in gold data: 13179
> #Entity in prediction: 12310
>
> #Correct Entity : 9948
> Entity precision: 0.8081
> Entity recall: 0.7548
> Entity F: 0.7806
>
> #Correct Sentiment : 9482
> Sentiment precision: 0.7703
> Sentiment recall: 0.7195
> Sentiment F: 0.7440
The output is written to `EN/dev_sp.p5.out`.
### 2nd Order HMM
We also tried modifying the initial HMM to use the previous 2 states to predict the next state. This requires a larger transition table that stores the probability of the next state given the previous 2 states. It is calculated using the function below.
```python
'''
Inputs:
states (list):list of labels
l_labels (list): list of all possible states
label_rows (list): list of all possible double initial states
Output:
transition_table (2d numpy array): transition table for given training set
Function:
Takes list of all possible states and outputs the transition table for given training set
'''
def generate_2ndorder_transition_table(states, l_labels, label_rows):
transition_table = np.zeros([len(label_rows),len(l_labels[1:])])
cols = l_labels[1:]
for l in states:
for ind in range(len(l)-2):
state_1 = l[ind]
state_2 = l[ind+1]
state_3 = l[ind+2]
ind_row = label_rows.index((state_1, state_2))
ind_col = cols.index(state_3)
transition_table[ind_row, ind_col] += 1
for ind_row, row in enumerate(transition_table):
total = row.sum()
if total == 0:
continue
for ind_col, val in enumerate(row):
transition_table[ind_row, ind_col] = transition_table[ind_row, ind_col]/total
return transition_table
```
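A hedged sketch (ours) of how the list of "double initial states" (`label_rows` / `row_states`) could be enumerated, assuming it pairs START or any tag with any tag:
```python
# Enumerate every (state_1, state_2) pair that can precede a tag; state_1 may be
# START so that pairs at the beginning of a sentence are covered.
def generate_label_rows(l_labels):
    tags = l_labels[1:-1]                      # exclude START and STOP
    rows = [("START", t) for t in tags]
    rows += [(t1, t2) for t1 in tags for t2 in tags]
    return rows
```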
This also changes the Viterbi algorithm: each pi value now compares all possible configurations of the previous 2 states and keeps the best. The modified Viterbi algorithm is shown below.
```python
'''
Input:
n (list of str): sentence split into list by spaces
l_labels (list of str): list of tags
tr_arr (2d numpy arr): transition table tr_arr[initial state, final state]
em_arr (2d numpy arr): emission table em_arr[word, state]
word_list (list of str): list of all words in the training set
row_states (list): list of all possible double initial states
Output:
pred_tag (list of string): list of predicted tags
Function:
Does modified viterbi algorithm to find the best path using 2 initial states to predict the next state
'''
def viterbi_2ndorder(n, l_labels, tr_arr, em_arr, word_list, row_states):
args = []
all_states = l_labels[1:-1]
len_states = len(all_states)
# Set up pi array where each node stores a single pi value
pi_arr = np.zeros([len_states,len(n)])
# going through the table
for j in range(len(n)):
curr_word = n[j]
if curr_word not in word_list:
curr_word = "#UNK#"
word_idx = word_list.index(curr_word)
# First column
if j == 0:
for curr_state in range(len_states):
if tr_arr[0][curr_state] == 0 or em_arr[word_idx][curr_state] == 0:
pi_val = float('-inf')
else:
pi_val = np.log(tr_arr[0][curr_state]) + np.log(em_arr[word_idx][curr_state])
pi_arr[curr_state,j] = pi_val
# Second column
elif j == 1:
for curr_state in range(len_states):
temp_vals = []
prev_pi_vals = pi_arr[:,j-1]
for u_idx, state in enumerate(prev_pi_vals):
pi_val = state
prev_state = all_states[u_idx]
prev_prev_state = 'START'
tr_row_idx = row_states.index((prev_prev_state, prev_state))
if tr_arr[tr_row_idx][curr_state] == 0 or em_arr[word_idx][curr_state] == 0:
new_pi_val = float('-inf')
else:
new_pi_val = pi_val + np.log(tr_arr[tr_row_idx][curr_state]) + np.log(em_arr[word_idx][curr_state])
temp_vals.append(new_pi_val)
pi_arr[curr_state,j] = max(temp_vals)
# All other columns
elif j > 1:
for curr_state in range(len_states):
prev_pi_vals = pi_arr[:,j-1]
# Initialize
temp_vals = []
for u_idx, pi_val in enumerate(prev_pi_vals):
for state_ind, state in enumerate(all_states):
prev_state = all_states[u_idx]
prev_prev_state = state
tr_row_idx = row_states.index((prev_prev_state, prev_state))
tr_val = tr_arr[tr_row_idx][curr_state]
em_val = em_arr[word_idx][curr_state]
if tr_val == 0 or em_val == 0:
new_pi_val = float('-inf')
else:
new_pi_val = pi_val + np.log(tr_val) + np.log(em_val)
temp_vals.append(new_pi_val)
pi_arr[curr_state,j] = max(temp_vals)
# Checks if the length of the sentence is 1
if len(n)==1:
temp_vals = []
last_pi_vals = pi_arr[:,len(n)-1]
for u_idx, pi_val in enumerate(last_pi_vals):
prev_state = all_states[u_idx]
prev_prev_state = 'START'
tr_row_idx = row_states.index((prev_prev_state, prev_state))
tr_val = tr_arr[tr_row_idx][-1]
if tr_val == 0:
new_pi_val = float('-inf')
else:
new_pi_val = pi_val + np.log(tr_val)
temp_vals.append(new_pi_val)
parent = all_states[np.argmax(temp_vals)]
return [parent]
# When the length of the sentence is more than 1
# Calculate the best 2 parent nodes from the STOP node
temp_vals = []
parent_vals = []
last_pi_vals = pi_arr[:,len(n)-1]
last_word = n[len(n)-1]
second_last_word = n[len(n)-2]
if last_word not in word_list:
last_word = '#UNK#'
if second_last_word not in word_list:
second_last_word = '#UNK#'
for u_idx, pi_val in enumerate(last_pi_vals):
for state_ind, state in enumerate(all_states):
prev_state = all_states[u_idx]
prev_prev_state = state
tr_row_idx = row_states.index((prev_prev_state, prev_state))
tr_val = tr_arr[tr_row_idx][-1]
if tr_val == 0 or em_arr[word_list.index(last_word)][all_states.index(prev_state)]==0 or em_arr[word_list.index(second_last_word)][all_states.index(prev_prev_state)] == 0:
new_pi_val = float('-inf')
else:
new_pi_val = np.log(tr_val) + np.log(em_arr[word_list.index(last_word)][all_states.index(prev_state)]) + np.log(em_arr[word_list.index(second_last_word)][all_states.index(prev_prev_state)])
temp_vals.append(new_pi_val)
parent_vals.append((prev_prev_state, prev_state))
# Checks if an impossible path has been taken and if so, force the previous 2 states to be 'I-NP', 'O'
if max(temp_vals)==float('-inf'):
prev_prev_state, prev_state = 'I-NP', 'O'
else:
best_idx = np.argmax(temp_vals)
prev_prev_state, prev_state = parent_vals[best_idx]
# Checks if the length of the sentence is 2
if len(n)==2:
return [prev_prev_state, prev_state]
else:
path = traceback(prev_state, prev_prev_state, n, pi_arr, all_states, tr_arr, row_states)
return path
```
Finally, the backtracking algorithm is slightly different as well. From the STOP state, we predict the previous 2 states by going through all the pi values. These 2 states are then used to predict the state before them, repeating until the first word of the sentence. The backtracking algorithm is shown below.
```python
'''
Input:
parent (str): 1st parent of STOP
second_parent (str): 2nd parent of STOP
n (list): input sentence
pi_arr (2d numpy arr): table of pi values
all_states (list): list of all the tags not including START and STOP
tr_arr (numpy arr): transition table
rows_states (list): list of all possible double initial states
Output:
pred_vals (list): predicted labels
Function:
Traceback using the parent values
'''
def traceback(parent, second_parent, n, pi_arr, all_states, tr_arr, rows_states):
pred_vals = [parent, second_parent]
for curr_idx in range(len(n)-3,-1,-1):
temp_arr = []
for state_ind, state in enumerate(all_states):
pi_val = pi_arr[state_ind, curr_idx]
tr_row = rows_states.index((state,second_parent))
tr_col = all_states.index(parent)
tr_val = tr_arr[tr_row][tr_col]
if tr_val == 0:
temp_arr.append(float('-inf'))
else:
temp_arr.append(pi_val+np.log(tr_val))
# Checks if an impossible path has been taken and if so, force the predicted states to be 'O'
if max(temp_arr)==float('-inf'):
pred_vals.append('O')
parent = second_parent
second_parent = 'O'
else:
parent_idx = np.argmax(temp_arr)
pred_vals.append(all_states[parent_idx])
parent = second_parent
second_parent = all_states[parent_idx]
pred_vals.reverse()
return pred_vals
```
We found that the 2nd order HMM is unable to predict the last 2 tags well. Thus, for these tags we changed the method of prediction to exclude the pi value and only consider the transition value and the emission values of the last 2 words. This improved our results on the EN test set.
For all cases where the algorithm reaches an impossible path, 'O' is predicted. This significantly reduces the number of entities predicted and improves our precision at the cost of recall. Overall, it still improves the F-score, so we decided to keep this behaviour.
## Collaborators
- Cornelius Yap 1003358
- Tan Yu Xiang 1003317
- Wang Zefan 1003791