# 50.007 Machine Learning - Sentiment Analysis Project Report
## Part 2
### Emission Function
Implement the emission function that estimates the emission parameters from the training set using MLE (maximum likelihood estimation).
```python
'''
Inputs:
x (str): word to be queried
y (str): label to be queried
tag_word_table (dictionary): data in table form
k (float): smoothed count assigned to the #UNK# token
Output:
probability (float): probability of generating x from y based on tag_word_table
Function:
Calculates emission probability
'''
def emission(x, y, tag_word_table, k = 0.5):
word_list = tag_word_table[y]
if x == "#UNK#":
emission_count = k
else:
emission_count = word_list.count(x)
ycount = len(word_list)
return emission_count / (ycount + k)
```
With the estimated emission parameters, we generate predictions for the output labels and evaluate the results with the provided evaluation script.
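For reference, a minimal sketch of how such a prediction could be made with the `emission` function above; the helper name `predict_tag` and the `training_words` vocabulary argument are ours, not part of the project code:
```python
# Predict a word's tag as the one with the highest emission probability,
# mapping words never seen in training to "#UNK#" first.
def predict_tag(word, tag_word_table, training_words, k=0.5):
    if word not in training_words:
        word = "#UNK#"
    best_tag, best_prob = None, -1.0
    for tag in tag_word_table:
        prob = emission(word, tag, tag_word_table, k)
        if prob > best_prob:
            best_tag, best_prob = tag, prob
    return best_tag
```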
#### EN:
> #Entity in gold data: 13179
> #Entity in prediction: 18650
>
> #Correct Entity : 9542
> Entity precision: 0.5116
> Entity recall: 0.7240
> Entity F: 0.5996
>
> #Correct Sentiment : 8456
> Sentiment precision: 0.4534
> Sentiment recall: 0.6416
> Sentiment F: 0.5313
#### CN:
> #Entity in gold data: 700
> #Entity in prediction: 4248
>
> #Correct Entity : 345
> Entity precision: 0.0812
> Entity recall: 0.4929
> Entity F: 0.1395
>
> #Correct Sentiment : 167
> Sentiment precision: 0.0393
> Sentiment recall: 0.2386
> Sentiment F: 0.0675
#### SG:
> #Entity in gold data: 4301
> #Entity in prediction: 12237
>
> #Correct Entity : 2386
> Entity precision: 0.1950
> Entity recall: 0.5548
> Entity F: 0.2885
>
> #Correct Sentiment : 1531
> Sentiment precision: 0.1251
> Sentiment recall: 0.3560
> Sentiment F: 0.1851
## Part 3
### Transition Function
Implement the transition function that estimates the transition parameters from the training set using MLE (maximum likelihood estimation).
```python
'''
Inputs:
u (str): initial state/tag
v (str): final state/tag
words (list): list of training sentences, each a list of (word, tag) pairs
Output:
transition_probability (float): transition probability from state u to state v
Function:
Takes initial state u and final state v, calculates the transition probability from state u to state v
'''
def transition(u,v,words):
count_u = 0
count_u_v = 0
for sentence in words:
if u == "START":
count_u = len(words)
if sentence[0][1] == v:
count_u_v +=1
elif v == "STOP":
for i in range(len(sentence)):
if sentence[i][1] == u:
count_u += 1
if i+1 == len(sentence):
count_u_v += 1
else:
for i in range(len(sentence)):
if sentence[i][1] == u:
count_u += 1
if i+1!=len(sentence) and sentence[i+1][1] == v:
count_u_v += 1
return count_u_v/count_u
```
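For context, a hedged sketch (ours, not the report's exact code) of how the transition table used by the Viterbi function below could be assembled from `transition()`, assuming `l_labels = ["START", tag_1, ..., tag_k, "STOP"]` so that rows are the possible "from" states and columns the possible "to" states:
```python
import numpy as np

# Build the transition table by querying transition() for every (from, to) pair.
def build_transition_table(l_labels, training_sentences):
    rows = l_labels[:-1]   # START and all tags (states a transition can start from)
    cols = l_labels[1:]    # all tags and STOP (states a transition can end at)
    tr_arr = np.zeros([len(rows), len(cols)])
    for i, u in enumerate(rows):
        for j, v in enumerate(cols):
            tr_arr[i][j] = transition(u, v, training_sentences)
    return tr_arr
```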
### Viterbi Algorithm
Implement the Viterbi Algorithm that predicts the output labels for a given sentence.
```python
'''
Inputs:
n (list): list of input words to be predicted
l_labels (list): list of all possible states
tr_arr (2d numpy array): transition table
em_arr (2d numpy array): emission table
word_list (list): list of all words in the training set
Output:
args (list): list of predicted states
Function:
Takes input sentence of n words, outputs list of n predicted tags/states corresponding to input words
'''
def viterbi(n, l_labels, tr_arr, em_arr, word_list):
# Initializing pi array
pi_arr = np.zeros([len(l_labels[1:-1]),len(n)+2])
### Forwarding Algo
# Iterating through the columns of the pi array
for j in range(len(n)+2):
# column 0: START nodes, assign value 1 to all nodes in column 0
if j == 0:
for i in range(len(pi_arr)):
pi_arr[i][j] = 1
# column 1: nodes right after START, pi value = 1 (pi value of START) * transition_prob * emission_prob
elif j == 1:
if n[j-1] not in word_list:
n[j-1] = "#UNK#"
for u_idx, u in enumerate((l_labels[1:-1])):
if tr_arr[0][u_idx] == 0 or em_arr[word_list.index(n[j-1])][u_idx] == 0:
pi_arr[u_idx][j] = float('-inf')
else:
pi_arr[u_idx][j] = np.log(tr_arr[0][u_idx]) + np.log(em_arr[word_list.index(n[j-1])][u_idx])
# columns 2 to n : pi value = max(pi value of previous nodes * transition_prob * emission_prob)
elif j > 1 and j < len(n)+1:
# n[j-1]: current word in sentence, if not found in word_list, replace with "#UNK#"
if n[j-1] not in word_list:
n[j-1] = "#UNK#"
# Iterating through column v: current column
for u_idx, u in enumerate((l_labels[1:-1])): # v
# array to store temp scores
pi_temp = []
# Iterating through column u: previous column
for u_idx_temp, u_temp in enumerate((l_labels[1:-1])): # u
if tr_arr[u_idx_temp+1][u_idx] == 0 or em_arr[word_list.index(n[j-1])][u_idx] == 0:
pi_temp.append(float('-inf'))
else:
# append pi_value_v (current) = pi_value_u (previous) * transition_prob(u,v) * emission_prob(v,word)
pi_temp.append(pi_arr[u_idx_temp][j-1] + np.log(tr_arr[u_idx_temp+1][u_idx]) + np.log(em_arr[word_list.index(n[j-1])][u_idx]))
#pi_value_v = max(_temp)
pi_arr[u_idx][j] = max(pi_temp)
# column n+1 : STOP node: pi value = max(pi value of previous nodes * transition_prob)
elif j == len(n)+1:
pi_temp = []
for u_idx, u in enumerate((l_labels[1:-1])):
if tr_arr[u_idx+1][len(l_labels[1:-1])] == 0:
pi_temp.append(float("-inf"))
else:
pi_temp.append(np.log(tr_arr[u_idx+1][len(l_labels[1:-1])]) + pi_arr[u_idx][j-1])
for u_idx_temp, u_temp in enumerate((l_labels[1:-1])):
pi_arr[u_idx_temp][j] = max(pi_temp)
### Backtracking Algo
# list to store predicted outputs
args = []
# Stores the index of the current node with the highest score
last_idx = len(l_labels[1:-1])
# Iterating from n to 1: n, n-1, n-2...1
for i in range(len(n),0,-1):
# array to store all temp scores calculated
temp = []
# Iterating through the rows
for u_idx, u in enumerate((l_labels[1:-1])):
if tr_arr[u_idx+1][last_idx] == 0:
temp.append(float("-inf"))
else:
# append the score = transition_prob * pi value to temp
temp.append(np.log(tr_arr[u_idx+1][last_idx]) + pi_arr[u_idx][i])
# update last_idx with the index of the node that had the highest score
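# if every candidate score is -inf (impossible path), fall back to a fixed index that forces an "O" prediction (see explanation below)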
if np.max(temp) == float("-inf"):
last_idx = 7
else:
last_idx = np.argmax(temp)
# append tag/label corresponding to node with highest score to output
args.append(l_labels[last_idx+1])
return list(reversed(args))
```
To account for the numerical underflow issue that occurs when many small probabilities (values between 0 and 1) are repeatedly multiplied together during the forward portion of the algorithm, we modified the forward pi score recurrence to:
> `pi_current = pi_prev + log(transition(prev,current)) + log(emission(current, observation))`

instead of the previous:
> `pi_current = pi_prev * transition(prev,current) * emission(current, observation)`

Since `log(0)` is undefined, any node whose transition or emission probability is 0 has its score set to `float("-inf")` instead.
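As a quick illustration of why the log transform is needed (a toy example we added, not project code):
```python
# Multiplying many small probabilities underflows to 0.0 in double precision,
# while the equivalent sum of logs remains perfectly representable.
import numpy as np

p = 1e-5
print(p ** 100)         # 0.0 -> underflow; the true value is 1e-500
print(100 * np.log(p))  # -1151.29..., still usable as a Viterbi score
```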
Furthermore, to account for the impossible-path issue that occurs during the backtracking portion of the algorithm, where all the pi values for a given observation are `float("-inf")`, we force an output label of "O", which increases the accuracy on the EN dataset.
Finally, with the estimated emission and transition parameters, we generate predictions for the output labels using the Viterbi algorithm and evaluate the results with the provided evaluation script.
#### EN:
> #Entity in gold data: 13179
> #Entity in prediction: 12520
>
> #Correct Entity : 10139
> Entity precision: 0.8098
> Entity recall: 0.7693
> Entity F: 0.7891
>
> #Correct Sentiment : 9656
> Sentiment precision: 0.7712
> Sentiment recall: 0.7327
> Sentiment F: 0.7515
#### SG:
> #Entity in gold data: 4301
> #Entity in prediction: 4221
>
> #Correct Entity : 2193
> Entity precision: 0.5195
> Entity recall: 0.5099
> Entity F: 0.5147
>
> #Correct Sentiment : 1805
> Sentiment precision: 0.4276
> Sentiment recall: 0.4197
> Sentiment F: 0.4236
#### CN:
> #Entity in gold data: 700
> #Entity in prediction: 815
>
> #Correct Entity : 204
> Entity precision: 0.2503
> Entity recall: 0.2914
> Entity F: 0.2693
>
> #Correct Sentiment : 121
> Sentiment precision: 0.1485
> Sentiment recall: 0.1729
> Sentiment F: 0.1597
## Part 4
### Finding the top 3 paths using the Viterbi Algorithm
We modified the Viterbi algorithm to store, for each node, the top 3 pi values from the previous layer. This guarantees that we find the top 3 paths when backtracking from the 'STOP' node.
```python
'''
Input:
n (list of str): sentence split into list by spaces
l_labels (list of str): list of tags
tr_arr (2d numpy arr): transition table tr_arr[initial state, final state]
em_arr (2d numpy arr): emission table em_arr[word, state]
word_list (list of str): list of all words in the training set
Output:
pred_tag (list of string): list of predicted tags
Function:
Does modified viterbi algorithm to find the top 3 best path
'''
def viterbi_top3(n, l_labels, tr_arr, em_arr, word_list):
args = []
all_states = l_labels[1:-1]
len_states = len(all_states)
# Set up pi array where each node stores the top 3 [pi_prob, parent_idx] pairs
pi_arr = np.zeros([len_states,len(n),3,2])
# going through the table
for j in range(len(n)):
curr_word = n[j]
if curr_word not in word_list:
curr_word = "#UNK#"
word_idx = word_list.index(curr_word)
# First column
if j == 0:
for curr_state in range(len_states):
if tr_arr[0][curr_state] == 0 or em_arr[word_idx][curr_state] == 0:
pi_val = float('-inf')
else:
pi_val = np.log(tr_arr[0][curr_state]) + np.log(em_arr[word_idx][curr_state])
pi_arr[curr_state,j] = np.array([[pi_val, 0], [pi_val, 0], [pi_val, 0]])
# Second column
elif j == 1:
for curr_state in range(len_states):
temp_vals = []
prev_pi_vals = pi_arr[:,j-1,:,:]
for u_idx, state in enumerate(prev_pi_vals):
pi_val, parent_idx = state[0,:]
prev_state = u_idx
if tr_arr[prev_state+1][curr_state] == 0 or em_arr[word_idx][curr_state] == 0:
new_pi_val = float('-inf')
else:
new_pi_val = pi_val + np.log(tr_arr[prev_state+1][curr_state]) + np.log(em_arr[word_idx][curr_state])
temp_vals.append([new_pi_val, prev_state])
# Sort the values in descending order by using the first value of the list (pi_val)
sorted_vals = sorted(temp_vals, key = lambda x: x[0], reverse=True)
pi_arr[curr_state,j] = np.array(sorted_vals[:3])
# All other columns
elif j > 1:
for curr_state in range(len_states):
prev_pi_vals = pi_arr[:,j-1,:,:]
top3 = find_top3(prev_pi_vals, len_states, tr_arr, em_arr, curr_state, word_idx)
pi_arr[curr_state,j] = top3
temp_vals = []
last_pi_vals = pi_arr[:,len(n)-1,:,:]
for u_idx, state in enumerate(last_pi_vals):
for pi_val, parent_idx in state:
prev_state = u_idx
tr_val = tr_arr[prev_state+1][-1]
if tr_val == 0:
new_pi_val = float('-inf')
else:
new_pi_val = pi_val + np.log(tr_val)
temp_vals.append([new_pi_val, prev_state])
# Sort the values in descending order by using the first value of the list (pi_val)
sorted_vals = sorted(temp_vals, key = lambda x: x[0], reverse=True)
top3 = np.array(sorted_vals[:3])
# paths_table contains all the paths that visited that node for each node in the graph
# the path at index j and with state_ind state is paths_table[j][state]
paths_table = []
path = []
for j in range(len(n)):
temp_paths = []
for state in range(len_states):
temp_paths.append([])
paths_table.append(temp_paths)
pred_vals_best, paths_table = traceback(int(top3[0,1]), n, pi_arr, all_states, paths_table)
pred_vals_2ndbest, paths_table = traceback(int(top3[1,1]), n, pi_arr, all_states, paths_table)
pred_vals_3rdbest, paths_table = traceback(int(top3[2,1]), n, pi_arr, all_states, paths_table)
return pred_vals_best
```
In the code above, the pi array has the different tags (not including START and STOP) as rows and the words of the input sentence as columns. Each node holds a 3x2 numpy array storing the top 3 [pi_val, parent_idx] pairs; storing the parent index helps with the backtracking later on. In the first column, all 3 entries are the same, as the node can only be reached from the 'START' state. In the 2nd column, we compare all the nodes from the 1st column and keep the top 3. From the 3rd column onwards, we compare all the top 3 entries from the previous layer, giving us three times as many pi values to compare as in the original Viterbi algorithm, and store the top 3 [pi_val, parent_idx] pairs in each node. This is computed using the find_top3 function.
```python
'''
Input:
prev_pi_vals (len_statesx3x2 numpy array): previous pi values
len_states (int): num of possible states
tr_arr (2d numpy arr): transition table
em_arr (2d numpy arr): emission table
curr_state (int): current state index in the table
word_idx (int): index of current word in word_list
Output:
top3 (3x2 numpy array): array of the top 3 values with the 1st value being the highest
Function:
Finds the top 3 probabilities and their parents
'''
def find_top3(prev_pi_vals, len_states, tr_arr, em_arr, curr_state, word_idx):
# Initialize
temp_vals = []
for u_idx, state in enumerate(prev_pi_vals):
for pi_val, parent_idx in state:
prev_state = u_idx
tr_val = tr_arr[prev_state+1][curr_state]
em_val = em_arr[word_idx][curr_state]
if tr_val == 0 or em_val == 0:
new_pi_val = float('-inf')
else:
new_pi_val = pi_val + np.log(tr_val) + np.log(em_val)
temp_vals.append([new_pi_val, prev_state])
# Sort the values in descending order by using the first value of the list (pi_val)
sorted_vals = sorted(temp_vals, key = lambda x: x[0], reverse=True)
top3 = np.array(sorted_vals[:3])
return top3
```
When we reach the end of the pi array, the pi values in the last column, each combined with the transition value from that node to the STOP state, are compared to find the top 3. These top 3 values are then used in the backtracking algorithm.
```python
'''
Input:
parent_idx (int): index of parent of STOP
n (list): input sentence
pi_arr (2d numpy arr): table of pi values
all_states (list): list of all the tags not including START and STOP
paths_table (nested list): table containing all the paths of previous iterations for each node
Output:
pred_vals (list): predicted values
paths_table (list of list): updated path table for this iteration
Function:
Traceback using the parent values
'''
def traceback(parent_idx, n, pi_arr, all_states, paths_table):
pred_vals = []
path = ['STOP']
for curr_idx in range(len(n)-1,0,-1):
best_idx = paths_table[curr_idx][parent_idx].count(path)
paths_table[curr_idx][parent_idx].append(path)
next_path = path + [all_states[parent_idx]]
path = next_path
pred_vals.append(all_states[parent_idx])
_, parent_idx = pi_arr[parent_idx, curr_idx, best_idx, :]
parent_idx = int(parent_idx)
pred_vals.append(all_states[parent_idx])
pred_vals.reverse()
return pred_vals, paths_table
```
In the backtracking algorithm, we use the parent value calculated above to start the traceback. This parent value leads us to the previous node, where we check how many times this same path has already been taken; that count tells us which of the top 3 values to use when backtracking. E.g. if backtracking visits node 1 from ['STOP', 'O'], it stores ['STOP', 'O'] in the list for node 1. If ['STOP', 'O'] is taken again the next time the backtracking algorithm is run, the 2nd best value is used for this node instead. This is repeated until the first word of the sentence, and the full predicted list is returned.
#### EN:
> #Entity in gold data: 13179
> #Entity in prediction: 14371
>
> #Correct Entity : 10611
> Entity precision: 0.7384
> Entity recall: 0.8051
> Entity F: 0.7703
>
> #Correct Sentiment : 9736
> Sentiment precision: 0.6775
> Sentiment recall: 0.7388
> Sentiment F: 0.7068
## Part 5
### Structural Perceptron
For Part 5, we attempted to use a Structural Perceptron to make the predictions. We followed the algorithm described at http://www.phontron.com/slides/nlp-programming-en-12-struct.pdf, but made a few modifications to better suit our use case.
#### Features
Making use of what we already have, we decided to define our features using the transition and emission probabilities that we created and used for the standard HMM Viterbi algorithm in Part 3. We define a function `get_features(key)` which takes in a key and returns the respective feature value. The key has the following format: `((prev_state, current_state), (current_state, word))`
```python
'''
Inputs:
key (2d tuple): Takes in a 2d tuple that contains 2 tuples of transition state and
emission state. I.e: ((prev_state, current_state), (current_state, word))
Output:
A float value that is the emission probability * transition probability
Function:
Takes in a key and finds the respective emission and transition probability.
Multiply the two and return the value
'''
def get_features(self,key):
#Extract the states and word from the key
transition = key[0]
emission = key[1]
prev_state = transition[0]
current_state = transition[1]
word = emission[1]
#Retrieve the index of the states and word
prev_state_index = self.labels_list.index(prev_state)
#labels_list contains START, but the current_state columns in both tr_table and em_table do not, so we subtract 1 to offset the index to match the tables
current_state_index = self.labels_list.index(current_state) - 1
word_index = self.word_list.index(word)
#Get the transition feature value
trans_val = self.transition_table[prev_state_index][current_state_index]
#Get the emission feature value
if current_state == "STOP": # if the state is STOP, there is no emission value
em_val = 0
else:
em_val = self.emission_table[word_index][current_state_index]
# Return the product of the transition probability and the emission probability
return trans_val * em_val
```
#### Weights
As we are trying to improve on the Part 3 Viterbi algorithm, we initialise all the weights to a value of 1. This way, the predicted output of our algorithm before any weight updates is exactly the same as the output of the Part 3 Viterbi algorithm. We then spot the mistakes made by the Part 3 Viterbi algorithm and update the weights to better handle them: whenever we spot a mistake in the prediction, we reduce the predicted key's weight by subtracting the difference between the actual and predicted feature values, and we increase the correct key's weight by adding that difference.
```python
'''
Inputs:
training_sentences (list): list of sentences from the training set
training_labels (list): list of correct labels from the training set
epochs (int): Specify the number of epochs to run the training algo
Output:
None
Function:
Takes in a list of sentences and their respective labels from the training set and trains the weights using the perceptron update algorithm.
'''
# The following code is truncated for the sake of being succinct
def perceptron_training(self,training_sentences, training_labels, epochs):
if current_state != actual_current_state:
pred_features = self.get_features(key)
actual_features = self.get_features(correct_key)
if (actual_features - pred_features) > 0:
#if actual features is higher, we want to reduce the predicted weights and boost the actual weights
#Reduce predicted weights
weights -= (actual_features - pred_features)
self.update_weights(weights, key)
#Boost the actual weights
correct_weights += (actual_features - pred_features)
self.update_weights(correct_weights,correct_key)
elif (actual_features - pred_features) < 0:
#if actual features is lower, we want to reduce the predicted weights and boost the actual weights
#Reduce predicted weights (actual-pred) is negative
weights -= (-(actual_features - pred_features))
self.update_weights(weights, key)
#Boost the actual weights: -(actual-pred) is positive
correct_weights += (-(actual_features - pred_features))
self.update_weights(correct_weights,correct_key)
```
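For clarity, a minimal self-contained sketch (our own simplification, not the class method above) of the update rule just described:
```python
# On a mistaken tag, shrink the weight of the predicted key and boost the weight
# of the correct key by the absolute difference between the actual and predicted
# feature values. Weights default to 1, matching the initialisation described above.
def perceptron_update(weights, pred_key, correct_key, pred_feat, actual_feat):
    diff = abs(actual_feat - pred_feat)
    weights[pred_key] = weights.get(pred_key, 1.0) - diff
    weights[correct_key] = weights.get(correct_key, 1.0) + diff
    return weights
```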
#### Regularization
In order to prevent particular weights from being boosted to very high or very low values when the algorithm makes many mistakes on the same instance, we added a learning rate and a regularizer to control how much each weight gets updated. When a weight is updated, we take the reciprocal of its regularisation parameter and multiply it by the update value. Two hyperparameters can be set when calling the `perceptron_training` function: `irp` defines the initial regularisation parameter, and `learning_rate` defines how much is added to the regularisation parameter after each update to that weight. The `perceptron_training` function is now as follows:
```python
'''
Inputs:
training_sentences (list): list of sentences from the training set
training_labels (list): list of correct labels from the training set
epochs (int): Specify the number of epochs to run the training algo
irp (int/float): Specify the initial regularisation parameter
learning_rate (int/float): Specify how much to add to the regularisation parameter after each update
Output:
None
Function:
Takes in a list of sentences and their respective labels from the training set and trains the weights using the perceptron update algorithm.
'''
# The following code is truncated for the sake of being succinct
def perceptron_training(self,training_sentences, training_labels, epochs, irp = 5, learning_rate = 0.1):
if current_state != actual_current_state:
pred_features = self.get_features(key)
actual_features = self.get_features(correct_key)
if key in regularisation_dict.keys():
regularisation_param = regularisation_dict[key]
else:
regularisation_param = irp
if correct_key in regularisation_dict.keys():
correct_regularisation_param = regularisation_dict[correct_key]
else:
correct_regularisation_param = irp
if (actual_features - pred_features) > 0:
#if actual features is higher, we want to reduce the predicted weights and boost the actual weights
#Reduce predicted weights
weights -= ((1/regularisation_param) * (actual_features - pred_features))
self.update_weights(weights, key)
regularisation_dict[key] = regularisation_param + learning_rate
#Boost the actual weights
correct_weights += ((1/correct_regularisation_param) * (actual_features - pred_features))
self.update_weights(correct_weights,correct_key)
regularisation_dict[correct_key] = correct_regularisation_param + learning_rate
elif (actual_features - pred_features) < 0:
#if actual features is lower, we want to reduce the predicted weights and boost the actual weights
#Reduce predicted weights (actual-pred) is negative
weights -= ((1/regularisation_param) * (-(actual_features - pred_features)))
self.update_weights(weights, key)
regularisation_dict[key] = regularisation_param + learning_rate
#Boost the actual weights: -(actual-pred) is positive
correct_weights += ((1/correct_regularisation_param) * (-(actual_features - pred_features)))
self.update_weights(correct_weights,correct_key)
regularisation_dict[correct_key] = correct_regularisation_param + learning_rate
```
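A toy illustration (ours) of the resulting behaviour: every update to a key increases its regularisation parameter by `learning_rate`, so the effective step size `1/param` applied to that key shrinks over time.
```python
# With irp = 5 and learning_rate = 0.1, successive updates to the same key
# use a slightly smaller step size each time.
irp, learning_rate = 5, 0.1
param = irp
for update in range(1, 4):
    print(f"update {update}: step size {1 / param:.4f}")
    param += learning_rate
# update 1: step size 0.2000
# update 2: step size 0.1961
# update 3: step size 0.1923
```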
#### Modified Viterbi
In order to make use of the weights when running the Viterbi algorithm, we had to update it. The only change is to multiply the weight of each key with its emission and transition terms, e.g. `get_weights((u,v),(v,o)) * tr_array(u,v) * em_array(v,o) * pi(u,j-1)`
```python
'''
Inputs:
n (list): list of input words to be predicted
l_labels (list): list of all possible states
tr_arr (2d numpy array): transition table
em_arr (2d numpy array): emission table
word_list (list): list of all words in the training set
weights (dict): dictionary of weights keyed by ((prev_state, current_state), (current_state, word))
Output:
args (list): list of predicted states
Function:
Takes input sentence of n words, outputs list of n predicted tags/states corresponding to input words
'''
def viterbi_sp(n, l_labels, tr_arr, em_arr, word_list, weights):
# Initializing pi array
pi_arr = np.zeros([len(l_labels[1:-1]),len(n)+2])
parent = np.zeros([len(l_labels[1:-1]),len(n)+2])
pred = []
### Forwarding Algo
# Iterating through the columns of the pi array
for j in range(len(n)+2):
# column 0: START nodes, assign value 0 (log of 1) to all nodes in column 0
if j == 0:
for i in range(len(pi_arr)):
pi_arr[i][j] = 0
# column 1: nodes right after START, pi value = 1 (pi value of START) * transition_prob * emission_prob
elif j == 1:
if n[j-1] not in word_list:
n[j-1] = "#UNK#"
for u_idx, u in enumerate((l_labels[1:-1])):
if tr_arr[0][u_idx] == 0 or em_arr[word_list.index(n[j-1])][u_idx] == 0:
pi_arr[u_idx][j] = float('-inf')
else:
pi_arr[u_idx][j] = weights[('START', l_labels[u_idx+1]),(l_labels[u_idx+1], n[j-1])]*((np.log(tr_arr[0][u_idx])+1) + (np.log(em_arr[word_list.index(n[j-1])][u_idx])+1))
parent[u_idx][j] = 0
# columns 2 to n : pi value = max(pi value of previous nodes * transition_prob * emission_prob)
elif j > 1 and j < len(n)+1:
# n[j-1]: current word in sentence, if not found in word_list, replace with "#UNK#"
if n[j-1] not in word_list:
n[j-1] = "#UNK#"
# Iterating through column v: current column
for u_idx, u in enumerate((l_labels[1:-1])): # v
# array to store temp scores
pi_temp = []
# Iterating through column u: previous column
for u_idx_temp, u_temp in enumerate((l_labels[1:-1])): # u
if tr_arr[u_idx_temp+1][u_idx] == 0 or em_arr[word_list.index(n[j-1])][u_idx] == 0:
pi_temp.append(float('-inf'))
else:
# append pi_value_v (current) = pi_value_u (previous) * transition_prob(u,v) * emission_prob(v,word)
pi_temp.append(pi_arr[u_idx_temp][j-1] +
weights[(l_labels[u_idx_temp+1], l_labels[u_idx+1]),(l_labels[u_idx+1], n[j-1])]*
((np.log(tr_arr[u_idx_temp+1][u_idx])+1) + (np.log(em_arr[word_list.index(n[j-1])][u_idx])+1)))
#pi_value_v = max(_temp)
pi_arr[u_idx][j] = max(pi_temp)
parent[u_idx][j] = np.argmax(pi_temp)+1
# column n+1 : STOP node: pi value = max(pi value of previous nodes * transition_prob)
elif j == len(n)+1:
pi_temp = []
for u_idx, u in enumerate((l_labels[1:-1])):
if tr_arr[u_idx+1][len(l_labels[1:-1])] == 0:
pi_temp.append(float("-inf"))
else:
pi_temp.append(weights[(l_labels[u_idx+1], 'STOP'),('STOP', 'none')]*
(np.log(tr_arr[u_idx+1][len(l_labels[1:-1])])+1) + pi_arr[u_idx][j-1])
for u_idx_temp, u_temp in enumerate((l_labels[1:-1])):
pi_arr[u_idx_temp][j] = max(pi_temp)
parent[u_idx_temp][j] = np.argmax(pi_temp)+1
### Backtracking Algo
# list to store predicted outputs
args = []
# Stores the index of the current node with the highest score
last_idx = len(l_labels[1:-1])
# Iterating from n to 1: n, n-1, n-2...1
for i in range(len(n)+1,1,-1):
# update last_idx with the index of the node that had the highest score
if i == len(n)+1:
last_idx = int(parent[0][i])
else:
last_idx = int(parent[last_idx-1][i]) #we have to -1 to undo the offset made when saving the argmax earlier
# append tag/label corresponding to node with highest score to output
args.append(l_labels[int(last_idx)])
return list(reversed(args))
```
#### Results
Using this algorithm, we were able to achieve the following results for EN/dev.in:
> #Entity in gold data: 13179
> #Entity in prediction: 12310
>
> #Correct Entity : 9948
> Entity precision: 0.8081
> Entity recall: 0.7548
> Entity F: 0.7806
>
> #Correct Sentiment : 9482
> Sentiment precision: 0.7703
> Sentiment recall: 0.7195
> Sentiment F: 0.7440
The output is written to `EN/dev_sp.p5.out`.
### 2nd Order HMM
We also tried modifying the initial HMM to use the previous 2 states to predict the next state. This requires a larger transition table that stores the probability of the next state given the previous 2 states. It is calculated using the function below.
```python
'''
Inputs:
states (list):list of labels
l_labels (list): list of all possible states
label_rows (list): list of all possible double initial states
Output:
transition_table (2d numpy array): transition table for given training set
Function:
Takes list of all possible states and outputs the transition table for given training set
'''
def generate_2ndorder_transition_table(states, l_labels, label_rows):
transition_table = np.zeros([len(label_rows),len(l_labels[1:])])
cols = l_labels[1:]
for l in states:
for ind in range(len(l)-2):
state_1 = l[ind]
state_2 = l[ind+1]
state_3 = l[ind+2]
ind_row = label_rows.index((state_1, state_2))
ind_col = cols.index(state_3)
transition_table[ind_row, ind_col] += 1
for ind_row, row in enumerate(transition_table):
total = row.sum()
if total == 0:
continue
for ind_col, val in enumerate(row):
transition_table[ind_row, ind_col] = transition_table[ind_row, ind_col]/total
return transition_table
```
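A hedged sketch (ours) of how the list of "double initial states" (`label_rows` / `row_states`) could be enumerated, assuming it pairs START or any tag with any tag:
```python
# Enumerate every (state_1, state_2) pair that can precede a tag; state_1 may be
# START so that pairs at the beginning of a sentence are covered.
def generate_label_rows(l_labels):
    tags = l_labels[1:-1]                      # exclude START and STOP
    rows = [("START", t) for t in tags]
    rows += [(t1, t2) for t1 in tags for t2 in tags]
    return rows
```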
This also changes the Viterbi algorithm: each pi value now compares all possible configurations of the previous 2 states and keeps the best. The modified Viterbi algorithm is shown below.
```python
'''
Input:
n (list of str): sentence split into list by spaces
l_labels (list of str): list of tags
tr_arr (2d numpy arr): transition table tr_arr[initial state, final state]
em_arr (2d numpy arr): emission table em_arr[word, state]
word_list (list of str): list of all words in the training set
row_states (list): list of all possible double initial states
Output:
pred_tag (list of string): list of predicted tags
Function:
Does modified viterbi algorithm to find the best path using 2 initial states to predict the next state
'''
def viterbi_2ndorder(n, l_labels, tr_arr, em_arr, word_list, row_states):
args = []
all_states = l_labels[1:-1]
len_states = len(all_states)
# Set up pi array where each node stores a single pi value
pi_arr = np.zeros([len_states,len(n)])
# going through the table
for j in range(len(n)):
curr_word = n[j]
if curr_word not in word_list:
curr_word = "#UNK#"
word_idx = word_list.index(curr_word)
# First column
if j == 0:
for curr_state in range(len_states):
if tr_arr[0][curr_state] == 0 or em_arr[word_idx][curr_state] == 0:
pi_val = float('-inf')
else:
pi_val = np.log(tr_arr[0][curr_state]) + np.log(em_arr[word_idx][curr_state])
pi_arr[curr_state,j] = pi_val
# Second column
elif j == 1:
for curr_state in range(len_states):
temp_vals = []
prev_pi_vals = pi_arr[:,j-1]
for u_idx, state in enumerate(prev_pi_vals):
pi_val = state
prev_state = all_states[u_idx]
prev_prev_state = 'START'
tr_row_idx = row_states.index((prev_prev_state, prev_state))
if tr_arr[tr_row_idx][curr_state] == 0 or em_arr[word_idx][curr_state] == 0:
new_pi_val = float('-inf')
else:
new_pi_val = pi_val + np.log(tr_arr[tr_row_idx][curr_state]) + np.log(em_arr[word_idx][curr_state])
temp_vals.append(new_pi_val)
pi_arr[curr_state,j] = max(temp_vals)
# All other columns
elif j > 1:
for curr_state in range(len_states):
prev_pi_vals = pi_arr[:,j-1]
# Initialize
temp_vals = []
for u_idx, pi_val in enumerate(prev_pi_vals):
for state_ind, state in enumerate(all_states):
prev_state = all_states[u_idx]
prev_prev_state = state
tr_row_idx = row_states.index((prev_prev_state, prev_state))
tr_val = tr_arr[tr_row_idx][curr_state]
em_val = em_arr[word_idx][curr_state]
if tr_val == 0 or em_val == 0:
new_pi_val = float('-inf')
else:
new_pi_val = pi_val + np.log(tr_val) + np.log(em_val)
temp_vals.append(new_pi_val)
pi_arr[curr_state,j] = max(temp_vals)
# Checks if the length of the sentence is 1
if len(n)==1:
temp_vals = []
last_pi_vals = pi_arr[:,len(n)-1]
for u_idx, pi_val in enumerate(last_pi_vals):
prev_state = all_states[u_idx]
prev_prev_state = 'START'
tr_row_idx = row_states.index((prev_prev_state, prev_state))
tr_val = tr_arr[tr_row_idx][-1]
if tr_val == 0:
new_pi_val = float('-inf')
else:
new_pi_val = pi_val + np.log(tr_val)
temp_vals.append(new_pi_val)
parent = all_states[np.argmax(temp_vals)]
return [parent]
# When the length of the sentence is more than 1
# Calculate the best 2 parent nodes from the STOP node
temp_vals = []
parent_vals = []
last_pi_vals = pi_arr[:,len(n)-1]
last_word = n[len(n)-1]
second_last_word = n[len(n)-2]
if last_word not in word_list:
last_word = '#UNK#'
if second_last_word not in word_list:
second_last_word = '#UNK#'
for u_idx, pi_val in enumerate(last_pi_vals):
for state_ind, state in enumerate(all_states):
prev_state = all_states[u_idx]
prev_prev_state = state
tr_row_idx = row_states.index((prev_prev_state, prev_state))
tr_val = tr_arr[tr_row_idx][-1]
if tr_val == 0 or em_arr[word_list.index(last_word)][all_states.index(prev_state)]==0 or em_arr[word_list.index(second_last_word)][all_states.index(prev_prev_state)] == 0:
new_pi_val = float('-inf')
else:
new_pi_val = np.log(tr_val) + np.log(em_arr[word_list.index(last_word)][all_states.index(prev_state)]) + np.log(em_arr[word_list.index(second_last_word)][all_states.index(prev_prev_state)])
temp_vals.append(new_pi_val)
parent_vals.append((prev_prev_state, prev_state))
# Checks if an impossible path has been taken and if so, force the previous 2 states to be 'I-NP', 'O'
if max(temp_vals)==float('-inf'):
prev_prev_state, prev_state = 'I-NP', 'O'
else:
best_idx = np.argmax(temp_vals)
prev_prev_state, prev_state = parent_vals[best_idx]
# Checks if the length of the sentence is 2
if len(n)==2:
return [prev_prev_state, prev_state]
else:
path = traceback(prev_state, prev_prev_state, n, pi_arr, all_states, tr_arr, row_states)
return path
```
Finally, the backtracking algorithm is slightly different as well. From the STOP state, we predict the previous 2 states by going through all the pi values. These 2 states are then used to predict the state before them, repeating until the first word of the sentence. The backtracking algorithm is shown below.
```python
'''
Input:
parent (str): 1st parent of STOP
second_parent (str): 2nd parent of STOP
n (list): input sentence
pi_arr (2d numpy arr): table of pi values
all_states (list): list of all the tags not including START and STOP
tr_arr (numpy arr): transition table
rows_states (list): list of all possible double initial states
Output:
pred_vals (list): predicted labels
Function:
Traceback using the parent values
'''
def traceback(parent, second_parent, n, pi_arr, all_states, tr_arr, rows_states):
pred_vals = [parent, second_parent]
for curr_idx in range(len(n)-3,-1,-1):
temp_arr = []
for state_ind, state in enumerate(all_states):
pi_val = pi_arr[state_ind, curr_idx]
tr_row = rows_states.index((state,second_parent))
tr_col = all_states.index(parent)
tr_val = tr_arr[tr_row][tr_col]
if tr_val == 0:
temp_arr.append(float('-inf'))
else:
temp_arr.append(pi_val+np.log(tr_val))
# Checks if an impossible path has been taken and if so, force the predicted states to be 'O'
if max(temp_arr)==float('-inf'):
pred_vals.append('O')
parent = second_parent
second_parent = 'O'
else:
parent_idx = np.argmax(temp_arr)
pred_vals.append(all_states[parent_idx])
parent = second_parent
second_parent = all_states[parent_idx]
pred_vals.reverse()
return pred_vals
```
We found that the 2nd order HMM is unable to predict the last 2 tags well. Thus, for these tags we changed the method of prediction to exclude the pi value and only consider the transition value and the emission values of the last 2 words. This improved our results on the EN test set.
For all cases where the algorithm reaches an impossible path, 'O' is predicted. This significantly reduces the number of entities predicted and improves our precision at the cost of recall. Overall, it still improves the F-score, so we decided to keep this behaviour.
## Collaborators
- Cornelius Yap 1003358
- Tan Yu Xiang 1003317
- Wang Zefan 1003791