# 50.007 Machine Learning - Sentiment Analysis Project Report

## Part 2

### Emission Function

Implement the emission function that estimates the emission parameters from the training set using MLE (maximum likelihood estimation).

```python
'''
Inputs:
    x (str): word to be queried
    y (str): label to be queried
    tag_word_table (dictionary): data in table form
    k (float): count assigned to the #UNK# token
Output:
    probability (float): probability of generating x from y based on tag_word_table
Function: Calculates emission probability
'''
def emission(x, y, tag_word_table, k=0.5):
    word_list = tag_word_table[y]
    if x == "#UNK#":
        emission_count = k
    else:
        emission_count = word_list.count(x)
    ycount = len(word_list)
    return emission_count / (ycount + k)
```
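As a quick check of the smoothing behaviour, here is a minimal usage sketch; the `train` data and its one-(word, tag)-pair-per-token format are assumptions for illustration, not the project's actual loader.

```python
# Minimal usage sketch (assumed toy data, not the project's loader):
# tag_word_table maps each tag to the list of words observed with that tag.
train = [[("the", "O"), ("movie", "O"), ("great", "B-positive")],
         [("great", "B-positive"), ("film", "O")]]

tag_word_table = {}
for sentence in train:
    for word, tag in sentence:
        tag_word_table.setdefault(tag, []).append(word)

print(emission("great", "B-positive", tag_word_table))  # 2 / (2 + 0.5) = 0.8
print(emission("#UNK#", "O", tag_word_table))           # 0.5 / (3 + 0.5)
```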
With the calculated emission parameters, generate predictions on the output labels and evaluate the results with the evaluation script provided.

#### EN:

> #Entity in gold data: 13179
> #Entity in prediction: 18650
>
> #Correct Entity : 9542
> Entity precision: 0.5116
> Entity recall: 0.7240
> Entity F: 0.5996
>
> #Correct Sentiment : 8456
> Sentiment precision: 0.4534
> Sentiment recall: 0.6416
> Sentiment F: 0.5313

#### CN:

> #Entity in gold data: 700
> #Entity in prediction: 4248
>
> #Correct Entity : 345
> Entity precision: 0.0812
> Entity recall: 0.4929
> Entity F: 0.1395
>
> #Correct Sentiment : 167
> Sentiment precision: 0.0393
> Sentiment recall: 0.2386
> Sentiment F: 0.0675

#### SG:

> #Entity in gold data: 4301
> #Entity in prediction: 12237
>
> #Correct Entity : 2386
> Entity precision: 0.1950
> Entity recall: 0.5548
> Entity F: 0.2885
>
> #Correct Sentiment : 1531
> Sentiment precision: 0.1251
> Sentiment recall: 0.3560
> Sentiment F: 0.1851

## Part 3

### Transition Function

Implement the transition function that estimates the transition parameters from the training set using MLE (maximum likelihood estimation).

```python
'''
Inputs:
    u (str): initial state/tag
    v (str): final state/tag
    words (list): list of labelled sentences
Output:
    transition_probability (float): transition probability from state u to state v
Function: Takes initial state u and final state v, calculates the transition probability from state u to state v
'''
def transition(u, v, words):
    count_u = 0
    count_u_v = 0
    for sentence in words:
        if u == "START":
            count_u = len(words)
            if sentence[0][1] == v:
                count_u_v += 1
        elif v == "STOP":
            for i in range(len(sentence)):
                if sentence[i][1] == u:
                    count_u += 1
                    if i+1 == len(sentence):
                        count_u_v += 1
        else:
            for i in range(len(sentence)):
                if sentence[i][1] == u:
                    count_u += 1
                    if i+1 != len(sentence) and sentence[i+1][1] == v:
                        count_u_v += 1
    return count_u_v / count_u
```
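The Viterbi implementation in the next section indexes into a precomputed transition table `tr_arr`. The report does not show how that table is assembled; the sketch below is one plausible construction, inferred from the indexing conventions used later (row 0 for START, rows 1 to K for the K tags, last column for STOP), and should be read as an assumption rather than the project's actual code.

```python
import numpy as np

# Hedged sketch (not from the report): build tr_arr with transition(), using
# the layout the Viterbi code appears to assume -- row 0 is START, rows 1..K
# are the K tags, columns 0..K-1 are the tags and column K is STOP.
def build_transition_table(l_labels, words):
    inner = l_labels[1:-1]  # tags, excluding START and STOP
    tr_arr = np.zeros([len(inner) + 1, len(inner) + 1])
    for r, u in enumerate(["START"] + inner):
        for c, v in enumerate(inner + ["STOP"]):
            tr_arr[r][c] = transition(u, v, words)
    return tr_arr
```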
### Viterbi Algorithm

Implement the Viterbi Algorithm that predicts the output labels for a given sentence.

```python
import numpy as np

'''
Inputs:
    n (list): list of input words to be predicted
    l_labels (list): list of all possible states
    tr_arr (2d numpy array): transition table
    em_arr (2d numpy array): emission table
    word_list (list): list of words
Output:
    args (list): list of predicted states
Function: Takes input sentence of n words, outputs list of n predicted tags/states corresponding to input words
'''
def viterbi(n, l_labels, tr_arr, em_arr, word_list):
    # Initializing pi array
    pi_arr = np.zeros([len(l_labels[1:-1]), len(n)+2])

    ### Forwarding Algo
    # Iterating through the columns of the pi array
    for j in range(len(n)+2):
        # column 0: START nodes, assign value 1 to all nodes in column 0
        if j == 0:
            for i in range(len(pi_arr)):
                pi_arr[i][j] = 1
        # column 1: nodes right after START, pi value = 1 (pi value of START) * transition_prob * emission_prob
        elif j == 1:
            if n[j-1] not in word_list:
                n[j-1] = "#UNK#"
            for u_idx, u in enumerate(l_labels[1:-1]):
                if tr_arr[0][u_idx] == 0 or em_arr[word_list.index(n[j-1])][u_idx] == 0:
                    pi_arr[u_idx][j] = float('-inf')
                else:
                    pi_arr[u_idx][j] = np.log(tr_arr[0][u_idx]) + np.log(em_arr[word_list.index(n[j-1])][u_idx])
        # columns 2 to n: pi value = max(pi value of previous nodes * transition_prob * emission_prob)
        elif j > 1 and j < len(n)+1:
            # n[j-1]: current word in sentence; if not found in word_list, replace with "#UNK#"
            if n[j-1] not in word_list:
                n[j-1] = "#UNK#"
            # Iterating through column v: current column
            for u_idx, u in enumerate(l_labels[1:-1]):  # v
                # array to store temp scores
                pi_temp = []
                # Iterating through column u: previous column
                for u_idx_temp, u_temp in enumerate(l_labels[1:-1]):  # u
                    if tr_arr[u_idx_temp+1][u_idx] == 0 or em_arr[word_list.index(n[j-1])][u_idx] == 0:
                        pi_temp.append(float('-inf'))
                    else:
                        # append pi_value_v (current) = pi_value_u (previous) * transition_prob(u,v) * emission_prob(v,word)
                        pi_temp.append(pi_arr[u_idx_temp][j-1] + np.log(tr_arr[u_idx_temp+1][u_idx]) + np.log(em_arr[word_list.index(n[j-1])][u_idx]))
                # pi_value_v = max(pi_temp)
                pi_arr[u_idx][j] = max(pi_temp)
        # column n+1: STOP node: pi value = max(pi value of previous nodes * transition_prob)
        elif j == len(n)+1:
            pi_temp = []
            for u_idx, u in enumerate(l_labels[1:-1]):
                if tr_arr[u_idx+1][len(l_labels[1:-1])] == 0:
                    pi_temp.append(float("-inf"))
                else:
                    pi_temp.append(np.log(tr_arr[u_idx+1][len(l_labels[1:-1])]) + pi_arr[u_idx][j-1])
            for u_idx_temp, u_temp in enumerate(l_labels[1:-1]):
                pi_arr[u_idx_temp][j] = max(pi_temp)

    ### Backtracking Algo
    # list to store predicted outputs
    args = []
    # To store the index of the current node with the highest score
    last_idx = len(l_labels[1:-1])
    # Iterating from n to 1: n, n-1, n-2...1
    for i in range(len(n), 0, -1):
        # array to store all temp scores calculated
        temp = []
        # Iterating through the rows
        for u_idx, u in enumerate(l_labels[1:-1]):
            if tr_arr[u_idx+1][last_idx] == 0:
                temp.append(float("-inf"))
            else:
                # append the score = transition_prob * pi value to temp
                temp.append(np.log(tr_arr[u_idx+1][last_idx]) + pi_arr[u_idx][i])
        # update last_idx with the index of the node that had the highest score
        if np.max(temp) == float("-inf"):
            last_idx = 7  # index of the 'O' tag: forced output for impossible paths
        else:
            last_idx = np.argmax(temp)
        # append tag/label corresponding to node with highest score to output
        args.append(l_labels[last_idx+1])
    return list(reversed(args))
```

To account for the numerical underflow issue that occurs when small probabilities are repeatedly multiplied together during the forwarding portion of the algorithm, we modified the forward pi score update to:

> `pi_current = pi_prev + log(transition(prev, current)) + log(emission(current, observation))`

instead of the previous:

> `pi_current = pi_prev * transition(prev, current) * emission(current, observation)`

Subsequently, all nodes with `pi = 0` had their scores replaced with `float("-inf")`.
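The underflow is easy to reproduce; the snippet below is purely illustrative.

```python
import numpy as np

# Illustrative only: a product of many small probabilities underflows to 0.0
# in double precision, while the equivalent sum of logs stays representable.
p = 1e-5
print(p ** 400)         # 0.0 (true value 1e-2000 is below float64's ~1e-308 floor)
print(400 * np.log(p))  # -4605.17..., still fine for comparing paths
```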
Furthermore, to handle the impossible-path issue that occurs during the backtracking portion of the algorithm, when all the pi values for a given observation are `float("-inf")` we force an output label of "O", which increases accuracy on the EN dataset.

Finally, with the calculated emission and transition parameters, generate predictions on the output labels using the Viterbi Algorithm and evaluate the results with the evaluation script provided.

#### EN:

> #Entity in gold data: 13179
> #Entity in prediction: 12520
>
> #Correct Entity : 10139
> Entity precision: 0.8098
> Entity recall: 0.7693
> Entity F: 0.7891
>
> #Correct Sentiment : 9656
> Sentiment precision: 0.7712
> Sentiment recall: 0.7327
> Sentiment F: 0.7515

#### SG:

> #Entity in gold data: 4301
> #Entity in prediction: 4221
>
> #Correct Entity : 2193
> Entity precision: 0.5195
> Entity recall: 0.5099
> Entity F: 0.5147
>
> #Correct Sentiment : 1805
> Sentiment precision: 0.4276
> Sentiment recall: 0.4197
> Sentiment F: 0.4236

#### CN:

> #Entity in gold data: 700
> #Entity in prediction: 815
>
> #Correct Entity : 204
> Entity precision: 0.2503
> Entity recall: 0.2914
> Entity F: 0.2693
>
> #Correct Sentiment : 121
> Sentiment precision: 0.1485
> Sentiment recall: 0.1729
> Sentiment F: 0.1597

## Part 4

### Finding Top 3 Paths Using the Viterbi Algorithm

Modified the Viterbi algorithm to store the top 3 pi values from the previous layer for each node. This guarantees that we find the top 3 paths when backtracking from the 'STOP' node.

```python
'''
Input:
    n (list of str): sentence split into list by spaces
    l_labels (list of str): list of tags
    tr_arr (2d numpy arr): transition table tr_arr[initial state, final state]
    em_arr (2d numpy arr): emission table em_arr[word, state]
    word_list (list of str): list of all words in the training set
Output:
    pred_tag (list of str): list of predicted tags
Function: Does modified Viterbi algorithm to find the top 3 best paths
'''
def viterbi_top3(n, l_labels, tr_arr, em_arr, word_list):
    args = []
    all_states = l_labels[1:-1]
    len_states = len(all_states)
    # Set up pi array: each node stores its top 3 entries, each entry a [pi_prob, parent_idx] pair
    pi_arr = np.zeros([len_states, len(n), 3, 2])
    # going through the table
    for j in range(len(n)):
        curr_word = n[j]
        if curr_word not in word_list:
            curr_word = "#UNK#"
        word_idx = word_list.index(curr_word)
        # First column
        if j == 0:
            for curr_state in range(len_states):
                if tr_arr[0][curr_state] == 0 or em_arr[word_idx][curr_state] == 0:
                    pi_val = float('-inf')
                else:
                    pi_val = np.log(tr_arr[0][curr_state]) + np.log(em_arr[word_idx][curr_state])
                pi_arr[curr_state, j] = np.array([[pi_val, 0], [pi_val, 0], [pi_val, 0]])
        # Second column
        elif j == 1:
            for curr_state in range(len_states):
                temp_vals = []
                prev_pi_vals = pi_arr[:, j-1, :, :]
                for u_idx, state in enumerate(prev_pi_vals):
                    pi_val, parent_idx = state[0, :]
                    prev_state = u_idx
                    if tr_arr[prev_state+1][curr_state] == 0 or em_arr[word_idx][curr_state] == 0:
                        new_pi_val = float('-inf')
                    else:
                        new_pi_val = pi_val + np.log(tr_arr[prev_state+1][curr_state]) + np.log(em_arr[word_idx][curr_state])
                    temp_vals.append([new_pi_val, prev_state])
                # Sort the values in descending order using the first value of the list (pi_val)
                sorted_vals = sorted(temp_vals, key=lambda x: x[0], reverse=True)
                pi_arr[curr_state, j] = np.array(sorted_vals[:3])
        # All other columns
        elif j > 1:
            for curr_state in range(len_states):
                prev_pi_vals = pi_arr[:, j-1, :, :]
                top3 = find_top3(prev_pi_vals, len_states, tr_arr, em_arr, curr_state, word_idx)
                pi_arr[curr_state, j] = top3

    temp_vals = []
    last_pi_vals = pi_arr[:, len(n)-1, :, :]
    for u_idx, state in enumerate(last_pi_vals):
        for pi_val, parent_idx in state:
            prev_state = u_idx
            tr_val = tr_arr[prev_state+1][-1]
            if tr_val == 0:
                new_pi_val = float('-inf')
            else:
                new_pi_val = pi_val + np.log(tr_val)
            temp_vals.append([new_pi_val, prev_state])
    # Sort the values in descending order using the first value of the list (pi_val)
    sorted_vals = sorted(temp_vals, key=lambda x: x[0], reverse=True)
    top3 = np.array(sorted_vals[:3])

    # paths_table contains, for each node in the graph, all the paths that visited that node;
    # the path at index j with state index `state` is paths_table[j][state]
    paths_table = []
    path = []
    for j in range(len(n)):
        temp_paths = []
        for state in range(len_states):
            temp_paths.append([])
        paths_table.append(temp_paths)

    pred_vals_best, paths_table = traceback(int(top3[0, 1]), n, pi_arr, all_states, paths_table)
    pred_vals_2ndbest, paths_table = traceback(int(top3[1, 1]), n, pi_arr, all_states, paths_table)
    pred_vals_3rdbest, paths_table = traceback(int(top3[2, 1]), n, pi_arr, all_states, paths_table)

    return pred_vals_best
```

In the code above, the pi array has the different tags (not including START and STOP) as its rows and the words of the input sentence as its columns. Each node holds a 3x2 numpy array which stores the top 3 [pi_val, parent_idx] pairs. Storing the parent index helps with the backtracking later on.
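The shape of that array is easier to see with a toy example; the sizes and values below are made up purely for illustration.

```python
import numpy as np

# Illustrative only (hypothetical sizes and scores): pi_arr[state, position]
# is a 3x2 block of [pi_val, parent_idx] rows, ordered best-first.
len_states, n_words = 7, 4
pi_arr = np.zeros([len_states, n_words, 3, 2])
pi_arr[2, 1] = np.array([[-3.1, 0],   # best score into state 2 at word 1, from state 0
                         [-4.7, 5],   # second best, from state 5
                         [-6.0, 3]])  # third best, from state 3
best_val, best_parent = pi_arr[2, 1, 0, :]
```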
For the first column, all 3 rows are the same, as each node could only have come from the 'START' state. In the 2nd column, we compare all the nodes from the 1st column and find the top 3. From the 3rd column onwards, we compare all the top 3 entries from the previous layer, giving us 3x the number of pi values to compare relative to the original Viterbi algorithm. We then store the top 3 [pi_val, parent_idx] pairs in each node. This is computed using the `find_top3` function.

```python
'''
Input:
    prev_pi_vals (len_states x 3 x 2 numpy array): previous pi values
    len_states (int): num of possible states
    tr_arr (2d numpy arr): transition table
    em_arr (2d numpy arr): emission table
    curr_state (int): current state index in the table
    word_idx (int): index of current word in word_list
Output:
    top3 (3x2 numpy array): array of the top 3 values with the 1st value being the highest
Function: Finds the top 3 probabilities and their parents
'''
def find_top3(prev_pi_vals, len_states, tr_arr, em_arr, curr_state, word_idx):
    # Initialize
    temp_vals = []
    for u_idx, state in enumerate(prev_pi_vals):
        for pi_val, parent_idx in state:
            prev_state = u_idx
            tr_val = tr_arr[prev_state+1][curr_state]
            em_val = em_arr[word_idx][curr_state]
            if tr_val == 0 or em_val == 0:
                new_pi_val = float('-inf')
            else:
                new_pi_val = pi_val + np.log(tr_val) + np.log(em_val)
            temp_vals.append([new_pi_val, prev_state])
    # Sort the values in descending order using the first value of the list (pi_val)
    sorted_vals = sorted(temp_vals, key=lambda x: x[0], reverse=True)
    top3 = np.array(sorted_vals[:3])
    return top3
```

When we reach the end of the pi array, the last column's values of (pi value * transition value from that node to the STOP state) are compared to find the top 3. These top 3 values are used in the backtracking algorithm.

```python
'''
Input:
    parent_idx (int): index of parent of STOP
    n (list): input sentence
    pi_arr (numpy arr): table of pi values
    all_states (list): list of all the tags, not including START and STOP
    paths_table (nested list): table containing all the paths of previous iterations for each node
Output:
    pred_vals (list): predicted values
    paths_table (list of list): updated path table for this iteration
Function: Traceback using the parent values
'''
def traceback(parent_idx, n, pi_arr, all_states, paths_table):
    pred_vals = []
    path = ['STOP']
    for curr_idx in range(len(n)-1, 0, -1):
        best_idx = paths_table[curr_idx][parent_idx].count(path)
        paths_table[curr_idx][parent_idx].append(path)
        next_path = path + [all_states[parent_idx]]
        path = next_path
        pred_vals.append(all_states[parent_idx])
        _, parent_idx = pi_arr[parent_idx, curr_idx, best_idx, :]
        parent_idx = int(parent_idx)
    pred_vals.append(all_states[parent_idx])
    pred_vals.reverse()
    return pred_vals, paths_table
```

In the backtracking algorithm, we use the parent value calculated before to start the backtracking. This parent value leads us to the previous node, where we check the number of times this same path has been taken before. This count tells us which of the top 3 values we should use when backtracking. E.g. if backtracking visits node 1 from ['STOP', 'O'], it stores ['STOP', 'O'] in the list for node 1. If ['STOP', 'O'] reaches node 1 again the next time the backtracking algorithm is run, the 2nd best value is used for this node. This is repeated until the last node, and the entire predicted list is returned.
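The counting rule is compact enough to show in isolation; this standalone snippet is illustrative, not part of the project code.

```python
# Illustrative: paths_table[curr_idx][parent_idx] records each partial path
# that already passed through this node; count() of an identical path picks
# which of the stored top-3 entries the current traceback should follow.
cell = []                # one paths_table[curr_idx][parent_idx] cell
path = ['STOP', 'O']
print(cell.count(path))  # 0 -> first visit follows the best entry
cell.append(path)
print(cell.count(path))  # 1 -> a repeat visit follows the 2nd best entry
```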
#### EN:

> #Entity in gold data: 13179
> #Entity in prediction: 14371
>
> #Correct Entity : 10611
> Entity precision: 0.7384
> Entity recall: 0.8051
> Entity F: 0.7703
>
> #Correct Sentiment : 9736
> Sentiment precision: 0.6775
> Sentiment recall: 0.7388
> Sentiment F: 0.7068

## Part 5

### Structural Perceptron

For part 5, we attempted using a Structural Perceptron to make the predictions. We followed the algorithm described at http://www.phontron.com/slides/nlp-programming-en-12-struct.pdf but made a few modifications to better suit our use case.

#### Features

Making use of what we already have, we decided to define our features using the transition and emission probabilities that we created and used for the normal HMM Viterbi algorithm in Part 3. We define a function `get_features(key)` which takes in a key and returns the respective feature value. The key follows this format: `((prev_state, current_state), (current_state, word))`

```python
'''
Inputs:
    key (2d tuple): a 2d tuple that contains 2 tuples of transition state and emission state,
                    i.e. ((prev_state, current_state), (current_state, word))
Output:
    A float value that is the emission probability * transition probability
Function: Takes in a key and finds the respective emission and transition probability.
          Multiplies the two and returns the value.
'''
def get_features(self, key):
    # Extract the states and word from the key
    transition = key[0]
    emission = key[1]
    prev_state = transition[0]
    current_state = transition[1]
    word = emission[1]
    # Retrieve the index of the states and word
    prev_state_index = self.labels_list.index(prev_state)
    # labels_list contains START, but the current_state columns in both tr_table and em_table
    # do not, so we subtract 1 to offset the index to match that of the tables
    current_state_index = self.labels_list.index(current_state) - 1
    word_index = self.word_list.index(word)
    # Get the transition feature value
    trans_val = self.transition_table[prev_state_index][current_state_index]
    # Get the emission feature value
    if current_state == "STOP":
        # if the state is STOP, there is no emission value
        em_val = 0
    else:
        em_val = self.emission_table[word_index][current_state_index]
    # Return the transition probability * the emission probability
    return trans_val * em_val
```
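A hypothetical call, matching the key format described above (the `model` instance and the tag names here are illustrative assumptions, not identifiers from the report):

```python
# Hypothetical usage: the feature value for transitioning O -> B-positive
# while emitting the word "great".
key = (("O", "B-positive"), ("B-positive", "great"))
feature_val = model.get_features(key)  # transition prob * emission prob
```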
#### Weights

As we are trying to improve on the Part 3 Viterbi algorithm, we decided to initialise all the weights to a value of 1. This way, the predicted output of our algorithm before any weight update is exactly the same as the output of our Part 3 Viterbi algorithm. We plan to spot the mistakes made by the Part 3 Viterbi algorithm and update the weights accordingly to better handle those mistakes. Whenever we spot a mistake in the prediction, we update the predicted weights by subtracting the difference between the actual features and the predicted features, so as to make them smaller, and we update the actual weights by adding the difference, so as to make them larger.

```python
'''
Inputs:
    training_sentences (list): list of sentences from the training set
    training_labels (list): list of correct labels from the training set
    epochs (int): number of epochs to run the training algo
Output:
    None
Function: Takes in a list of sentences and their respective labels from the training set
          and trains the weights using the perceptron update algorithm.
'''
# The following code is truncated for the sake of being succinct
def perceptron_training(self, training_sentences, training_labels, epochs):
    ...  # (surrounding training loop omitted)
    if current_state != actual_current_state:
        pred_features = self.get_features(key)
        actual_features = self.get_features(correct_key)
        if (actual_features - pred_features) > 0:
            # if actual features are higher, we want to reduce the predicted weights
            # and boost the actual weights
            # Reduce predicted weights
            weights -= (actual_features - pred_features)
            self.update_weights(weights, key)
            # Boost the actual weights
            correct_weights += (actual_features - pred_features)
            self.update_weights(correct_weights, correct_key)
        elif (actual_features - pred_features) < 0:
            # if actual features are lower, we still reduce the predicted weights
            # and boost the actual weights
            # Reduce predicted weights: (actual - pred) is negative
            weights -= (-(actual_features - pred_features))
            self.update_weights(weights, key)
            # Boost the actual weights: -(actual - pred) is positive
            correct_weights += (-(actual_features - pred_features))
            self.update_weights(correct_weights, correct_key)
```
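Note that the two branches perform the same net update; a distilled, standalone form (illustrative, not the project's code) is:

```python
# Equivalent distilled update: both branches shrink the predicted feature's
# weight and grow the gold feature's weight by |actual - pred|.
def perceptron_step(pred_weight, gold_weight, pred_features, actual_features):
    diff = abs(actual_features - pred_features)
    return pred_weight - diff, gold_weight + diff
```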
#### Regularization

To prevent a scenario where particular weights get boosted to very high or very low values as a result of the algorithm making many mistakes on a particular instance, we decided to add a learning rate and a regularizer to control how much each weight gets updated. When the weights are updated, we take the reciprocal of the regularisation parameter and multiply it with the update value. We added two hyperparameters that can be set when calling the `perceptron_training` function: the user can pass an integer `irp` to define the initial regularizer parameter, and a value for `learning_rate` to define how much to add to the regularizer after each update to a weight. The `perceptron_training` function is now as follows:

```python
'''
Inputs:
    training_sentences (list): list of sentences from the training set
    training_labels (list): list of correct labels from the training set
    epochs (int): number of epochs to run the training algo
    irp (int/float): the initial regularisation parameter
    learning_rate (int/float): how much to add to the regularisation parameter after each update
Output:
    None
Function: Takes in a list of sentences and their respective labels from the training set
          and trains the weights using the perceptron update algorithm.
'''
# The following code is truncated for the sake of being succinct
def perceptron_training(self, training_sentences, training_labels, epochs, irp=5, learning_rate=0.1):
    ...  # (surrounding training loop omitted)
    if current_state != actual_current_state:
        pred_features = self.get_features(key)
        actual_features = self.get_features(correct_key)
        if key in regularisation_dict.keys():
            regularisation_param = regularisation_dict[key]
        else:
            regularisation_param = irp
        if correct_key in regularisation_dict.keys():
            correct_regularisation_param = regularisation_dict[correct_key]
        else:
            correct_regularisation_param = irp
        if (actual_features - pred_features) > 0:
            # if actual features are higher, we want to reduce the predicted weights
            # and boost the actual weights
            # Reduce predicted weights
            weights -= ((1/regularisation_param) * (actual_features - pred_features))
            self.update_weights(weights, key)
            regularisation_dict[key] = regularisation_param + learning_rate
            # Boost the actual weights
            correct_weights += ((1/correct_regularisation_param) * (actual_features - pred_features))
            self.update_weights(correct_weights, correct_key)
            regularisation_dict[correct_key] = correct_regularisation_param + learning_rate
        elif (actual_features - pred_features) < 0:
            # if actual features are lower, we still reduce the predicted weights
            # and boost the actual weights
            # Reduce predicted weights: (actual - pred) is negative
            weights -= ((1/regularisation_param) * (-(actual_features - pred_features)))
            self.update_weights(weights, key)
            regularisation_dict[key] = regularisation_param + learning_rate
            # Boost the actual weights: -(actual - pred) is positive
            correct_weights += ((1/correct_regularisation_param) * (-(actual_features - pred_features)))
            self.update_weights(correct_weights, correct_key)
            regularisation_dict[correct_key] = correct_regularisation_param + learning_rate
```
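The effect of the per-key regularizer is that the effective step size `1/irp` decays each time that key is updated; a purely illustrative trace:

```python
# Illustrative: with irp = 5 and learning_rate = 0.1, each update to a key
# shrinks that key's effective step size 1/irp a little further.
irp, learning_rate = 5, 0.1
for update in range(3):
    print(f"update {update}: step size = {1 / irp:.4f}")
    irp += learning_rate
# update 0: step size = 0.2000
# update 1: step size = 0.1961
# update 2: step size = 0.1923
```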
#### Modified Viterbi

To make use of the weights when running the Viterbi algorithm, we had to update it. The change is simply to multiply the weights with the emission and transition scores, e.g.:

> `get_weights((u,v),(v,o)) * tr_array(u,v) * em_array(v,o) * pi(u,j-1)`

```python
'''
Inputs:
    n (list): list of input words to be predicted
    l_labels (list): list of all possible states
    tr_arr (2d numpy array): transition table
    em_arr (2d numpy array): emission table
    word_list (list): list of words
    weights (dict): a dictionary of weights
Output:
    args (list): list of predicted states
Function: Takes input sentence of n words, outputs list of n predicted tags/states corresponding to input words
'''
def viterbi_sp(n, l_labels, tr_arr, em_arr, word_list, weights):
    # Initializing pi array
    pi_arr = np.zeros([len(l_labels[1:-1]), len(n)+2])
    parent = np.zeros([len(l_labels[1:-1]), len(n)+2])
    pred = []

    ### Forwarding Algo
    # Iterating through the columns of the pi array
    for j in range(len(n)+2):
        # column 0: START nodes, assign value 0 (log of 1) to all nodes in column 0
        if j == 0:
            for i in range(len(pi_arr)):
                pi_arr[i][j] = 0
        # column 1: nodes right after START
        elif j == 1:
            if n[j-1] not in word_list:
                n[j-1] = "#UNK#"
            for u_idx, u in enumerate(l_labels[1:-1]):
                if tr_arr[0][u_idx] == 0 or em_arr[word_list.index(n[j-1])][u_idx] == 0:
                    pi_arr[u_idx][j] = float('-inf')
                else:
                    pi_arr[u_idx][j] = weights[('START', l_labels[u_idx+1]), (l_labels[u_idx+1], n[j-1])] * ((np.log(tr_arr[0][u_idx])+1) + (np.log(em_arr[word_list.index(n[j-1])][u_idx])+1))
                parent[u_idx][j] = 0
        # columns 2 to n: pi value = max over previous nodes of (pi + weighted score)
        elif j > 1 and j < len(n)+1:
            # n[j-1]: current word in sentence; if not found in word_list, replace with "#UNK#"
            if n[j-1] not in word_list:
                n[j-1] = "#UNK#"
            # Iterating through column v: current column
            for u_idx, u in enumerate(l_labels[1:-1]):  # v
                # array to store temp scores
                pi_temp = []
                # Iterating through column u: previous column
                for u_idx_temp, u_temp in enumerate(l_labels[1:-1]):  # u
                    if tr_arr[u_idx_temp+1][u_idx] == 0 or em_arr[word_list.index(n[j-1])][u_idx] == 0:
                        pi_temp.append(float('-inf'))
                    else:
                        pi_temp.append(pi_arr[u_idx_temp][j-1] + weights[(l_labels[u_idx_temp+1], l_labels[u_idx+1]), (l_labels[u_idx+1], n[j-1])] * ((np.log(tr_arr[u_idx_temp+1][u_idx])+1) + (np.log(em_arr[word_list.index(n[j-1])][u_idx])+1)))
                # pi_value_v = max(pi_temp)
                pi_arr[u_idx][j] = max(pi_temp)
                parent[u_idx][j] = np.argmax(pi_temp)+1
        # column n+1: STOP node
        elif j == len(n)+1:
            pi_temp = []
            for u_idx, u in enumerate(l_labels[1:-1]):
                if tr_arr[u_idx+1][len(l_labels[1:-1])] == 0:
                    pi_temp.append(float("-inf"))
                else:
                    pi_temp.append(weights[(l_labels[u_idx+1], 'STOP'), ('STOP', 'none')] * (np.log(tr_arr[u_idx+1][len(l_labels[1:-1])])+1) + pi_arr[u_idx][j-1])
            for u_idx_temp, u_temp in enumerate(l_labels[1:-1]):
                pi_arr[u_idx_temp][j] = max(pi_temp)
                parent[u_idx_temp][j] = np.argmax(pi_temp)+1

    ### Backtracking Algo
    # list to store predicted outputs
    args = []
    # To store the index of the current node with the highest score
    last_idx = len(l_labels[1:-1])
    # Iterating from n+1 to 2: n+1, n, n-1...2
    for i in range(len(n)+1, 1, -1):
        # update last_idx with the index of the node that had the highest score
        if i == len(n)+1:
            last_idx = int(parent[0][i])
        else:
            last_idx = int(parent[last_idx-1][i])  # -1 undoes the +1 offset applied when saving the argmax earlier
        # append tag/label corresponding to node with highest score to output
        args.append(l_labels[int(last_idx)])
    return list(reversed(args))
```
#### Results

Using this algorithm, we were able to achieve the following results for EN/dev.in:

> #Entity in gold data: 13179
> #Entity in prediction: 12310
>
> #Correct Entity : 9948
> Entity precision: 0.8081
> Entity recall: 0.7548
> Entity F: 0.7806
>
> #Correct Sentiment : 9482
> Sentiment precision: 0.7703
> Sentiment recall: 0.7195
> Sentiment F: 0.7440

The output is written to `EN/dev_sp.p5.out`.

### 2nd Order HMM

We also tried modifying the initial HMM to use the two previous states to predict the next state. We therefore have a larger transition table that holds the probability of the next state given the previous two states. This is calculated using the function below.

```python
'''
Inputs:
    states (list): list of labelled sentences
    l_labels (list): list of all possible states
    label_rows (list): list of all possible double initial states
Output:
    transition_table (2d numpy array): transition table for given training set
Function: Takes list of all possible states and outputs the transition table for the given training set
'''
def generate_2ndorder_transition_table(states, l_labels, label_rows):
    transition_table = np.zeros([len(label_rows), len(l_labels[1:])])
    cols = l_labels[1:]
    for l in states:
        for ind in range(len(l)-2):
            state_1 = l[ind]
            state_2 = l[ind+1]
            state_3 = l[ind+2]
            ind_row = label_rows.index((state_1, state_2))
            ind_col = cols.index(state_3)
            transition_table[ind_row, ind_col] += 1
    for ind_row, row in enumerate(transition_table):
        total = row.sum()
        if total == 0:
            continue
        for ind_col, val in enumerate(row):
            transition_table[ind_row, ind_col] = transition_table[ind_row, ind_col] / total
    return transition_table
```
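`label_rows`, the list of all ordered pairs of previous states, is not shown in the report; one plausible construction, assumed here purely for illustration, is:

```python
from itertools import product

# Hedged sketch (assumed, not shown in the report): enumerate every ordered
# pair of previous states, with START allowed as an earlier state.
def build_label_rows(l_labels):
    prev_states = l_labels[:-1]  # everything except STOP
    return list(product(prev_states, prev_states))
```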
This also results in a change to the Viterbi algorithm: the pi value now compares all possible configurations of the previous two states and finds the best. The modified Viterbi algorithm is shown below.

```python
'''
Input:
    n (list of str): sentence split into list by spaces
    l_labels (list of str): list of tags
    tr_arr (2d numpy arr): transition table tr_arr[initial state, final state]
    em_arr (2d numpy arr): emission table em_arr[word, state]
    word_list (list of str): list of all words in the training set
    row_states (list): list of all possible double initial states
Output:
    pred_tag (list of string): list of predicted tags
Function: Does modified Viterbi algorithm to find the best path, using 2 initial states to predict the next state
'''
def viterbi_2ndorder(n, l_labels, tr_arr, em_arr, word_list, row_states):
    args = []
    all_states = l_labels[1:-1]
    len_states = len(all_states)
    # Set up pi array
    pi_arr = np.zeros([len_states, len(n)])
    # going through the table
    for j in range(len(n)):
        curr_word = n[j]
        if curr_word not in word_list:
            curr_word = "#UNK#"
        word_idx = word_list.index(curr_word)
        # First column
        if j == 0:
            for curr_state in range(len_states):
                if tr_arr[0][curr_state] == 0 or em_arr[word_idx][curr_state] == 0:
                    pi_val = float('-inf')
                else:
                    pi_val = np.log(tr_arr[0][curr_state]) + np.log(em_arr[word_idx][curr_state])
                pi_arr[curr_state, j] = pi_val
        # Second column
        elif j == 1:
            for curr_state in range(len_states):
                temp_vals = []
                prev_pi_vals = pi_arr[:, j-1]
                for u_idx, state in enumerate(prev_pi_vals):
                    pi_val = state
                    prev_state = all_states[u_idx]
                    prev_prev_state = 'START'
                    tr_row_idx = row_states.index((prev_prev_state, prev_state))
                    if tr_arr[tr_row_idx][curr_state] == 0 or em_arr[word_idx][curr_state] == 0:
                        new_pi_val = float('-inf')
                    else:
                        new_pi_val = pi_val + np.log(tr_arr[tr_row_idx][curr_state]) + np.log(em_arr[word_idx][curr_state])
                    temp_vals.append(new_pi_val)
                pi_arr[curr_state, j] = max(temp_vals)
        # All other columns
        elif j > 1:
            for curr_state in range(len_states):
                prev_pi_vals = pi_arr[:, j-1]
                # Initialize
                temp_vals = []
                for u_idx, pi_val in enumerate(prev_pi_vals):
                    for state_ind, state in enumerate(all_states):
                        prev_state = all_states[u_idx]
                        prev_prev_state = state
                        tr_row_idx = row_states.index((prev_prev_state, prev_state))
                        tr_val = tr_arr[tr_row_idx][curr_state]
                        em_val = em_arr[word_idx][curr_state]
                        if tr_val == 0 or em_val == 0:
                            new_pi_val = float('-inf')
                        else:
                            new_pi_val = pi_val + np.log(tr_val) + np.log(em_val)
                        temp_vals.append(new_pi_val)
                pi_arr[curr_state, j] = max(temp_vals)

    # Checks if the length of the sentence is 1
    if len(n) == 1:
        temp_vals = []
        last_pi_vals = pi_arr[:, len(n)-1]
        for u_idx, pi_val in enumerate(last_pi_vals):
            prev_state = all_states[u_idx]
            prev_prev_state = 'START'
            tr_row_idx = row_states.index((prev_prev_state, prev_state))
            tr_val = tr_arr[tr_row_idx][-1]
            if tr_val == 0:
                new_pi_val = float('-inf')
            else:
                new_pi_val = pi_val + np.log(tr_val)
            temp_vals.append(new_pi_val)
        parent = all_states[np.argmax(temp_vals)]
        return [parent]

    # When the length of the sentence is more than 1:
    # calculate the best 2 parent nodes from the STOP node
    temp_vals = []
    parent_vals = []
    last_pi_vals = pi_arr[:, len(n)-1]
    last_word = n[len(n)-1]
    second_last_word = n[len(n)-2]
    if last_word not in word_list:
        last_word = '#UNK#'
    if second_last_word not in word_list:
        second_last_word = '#UNK#'
    for u_idx, pi_val in enumerate(last_pi_vals):
        for state_ind, state in enumerate(all_states):
            prev_state = all_states[u_idx]
            prev_prev_state = state
            tr_row_idx = row_states.index((prev_prev_state, prev_state))
            tr_val = tr_arr[tr_row_idx][-1]
            if tr_val == 0 or em_arr[word_list.index(last_word)][all_states.index(prev_state)] == 0 or em_arr[word_list.index(second_last_word)][all_states.index(prev_prev_state)] == 0:
                new_pi_val = float('-inf')
            else:
                new_pi_val = np.log(tr_val) + np.log(em_arr[word_list.index(last_word)][all_states.index(prev_state)]) + np.log(em_arr[word_list.index(second_last_word)][all_states.index(prev_prev_state)])
            temp_vals.append(new_pi_val)
            parent_vals.append((prev_prev_state, prev_state))

    # Checks if an impossible path has been taken and, if so, forces the previous 2 states to 'I-NP', 'O'
    if max(temp_vals) == float('-inf'):
        prev_prev_state, prev_state = 'I-NP', 'O'
    else:
        best_idx = np.argmax(temp_vals)
        prev_prev_state, prev_state = parent_vals[best_idx]

    # Checks if the length of the sentence is 2
    if len(n) == 2:
        return [prev_prev_state, prev_state]
    else:
        path = traceback(prev_state, prev_prev_state, n, pi_arr, all_states, tr_arr, row_states)
        return path
```
Finally, the backtracking algorithm is slightly different as well. From the STOP state, we predict the previous 2 states when going through all the pi values. These 2 states are then used to predict the next state, working back to the first word in the sentence. The backtracking algorithm is shown below.

```python
'''
Input:
    parent (str): 1st parent of STOP
    second_parent (str): 2nd parent of STOP
    n (list): input sentence
    pi_arr (2d numpy arr): table of pi values
    all_states (list): list of all the tags, not including START and STOP
    tr_arr (numpy arr): transition table
    rows_states (list): list of all possible double initial states
Output:
    pred_vals (list): predicted labels
Function: Traceback using the parent values
'''
def traceback(parent, second_parent, n, pi_arr, all_states, tr_arr, rows_states):
    pred_vals = [parent, second_parent]
    for curr_idx in range(len(n)-3, -1, -1):
        temp_arr = []
        for state_ind, state in enumerate(all_states):
            pi_val = pi_arr[state_ind, curr_idx]
            tr_row = rows_states.index((state, second_parent))
            tr_col = all_states.index(parent)
            tr_val = tr_arr[tr_row][tr_col]
            if tr_val == 0:
                temp_arr.append(float('-inf'))
            else:
                temp_arr.append(pi_val + np.log(tr_val))
        # Checks if an impossible path has been taken and, if so, forces the predicted state to 'O'
        if max(temp_arr) == float('-inf'):
            pred_vals.append('O')
            parent = second_parent
            second_parent = 'O'
        else:
            parent_idx = np.argmax(temp_arr)
            pred_vals.append(all_states[parent_idx])
            parent = second_parent
            second_parent = all_states[parent_idx]
    pred_vals.reverse()
    return pred_vals
```

We found that the 2nd order HMM is unable to predict the last 2 tags well. Thus, we changed the method of prediction for those tags to exclude the pi value and to consider only the transition value and the emission values of the previous 2 words. This improved our results on the EN test set. For all cases where the algorithm reached an impossible path, the output 'O' is predicted. This significantly reduces the number of entities predicted, improving our precision score at a tradeoff to our recall score. Overall, it still improves the F-score, so we decided to stick with this.

## Collaborators

- Cornelius Yap 1003358
- Tan Yu Xiang 1003317
- Wang Zefan 1003791
