# What is attention in the context of Recurrent Neural Networks?
Attention mechanisms are a key building block of modern natural language processing. They allow a model to "pay attention" to specific parts of the input while processing it, rather than weighting the entire input equally. This is especially useful for Recurrent Neural Networks (RNNs), where the input may be a long sequence and the model needs to focus on different parts of that sequence at different time steps.
One of the most widely used attention mechanisms is Luong attention, introduced in a 2015 paper by Minh-Thang Luong et al. It works by first computing a set of attention weights, one for each element of the input sequence. These weights represent the relative importance of each element and are used to weight each element's contribution when generating the final output.
To calculate the attention weights, Luong attention compares each element of the input sequence to the current hidden state of the RNN using a scoring function with learnable parameters. The scoring function differs between implementations; common choices include the dot product, a "general" bilinear form, and concatenation. The scores are then passed through a softmax, which ensures the resulting weights sum to 1 and can be interpreted as probabilities.
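As a sketch, following the notation of the Luong paper (with $h_t$ the current RNN hidden state, $\bar{h}_s$ the encoder hidden state at source position $s$, and $W_a$, $v_a$ learnable parameters):

```
% Luong scoring variants
\mathrm{score}(h_t, \bar{h}_s) =
\begin{cases}
  h_t^\top \bar{h}_s                              & \text{dot} \\
  h_t^\top W_a \bar{h}_s                          & \text{general} \\
  v_a^\top \tanh\!\big(W_a [h_t; \bar{h}_s]\big)  & \text{concat}
\end{cases}

% Attention weights (softmax over source positions) and context vector
a_t(s) = \frac{\exp\big(\mathrm{score}(h_t, \bar{h}_s)\big)}
              {\sum_{s'} \exp\big(\mathrm{score}(h_t, \bar{h}_{s'})\big)},
\qquad
c_t = \sum_s a_t(s)\, \bar{h}_s
```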
Another popular attention mechanism is Bahdanau attention, introduced in a 2014 paper by Dzmitry Bahdanau et al., and therefore predating Luong attention. Like Luong attention, it calculates an attention weight for each element in the input sequence, but it uses a different approach to compute those weights.
Instead of a predefined scoring function such as the dot product, Bahdanau attention uses a small feed-forward neural network to compute the scores. This network takes the current hidden state of the RNN and the encoder hidden state for each input element, and outputs a scalar score for that element. The scores are then normalized with a softmax, as in Luong attention.
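As a sketch in similar notation (with $s_{i-1}$ the previous decoder hidden state, $h_j$ the encoder hidden state at input position $j$, and $W_a$, $U_a$, $v_a$ learned parameters):

```
% Bahdanau (additive) score, attention weights, and context vector
e_{ij} = v_a^\top \tanh\big(W_a s_{i-1} + U_a h_j\big)

\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k} \exp(e_{ik})},
\qquad
c_i = \sum_j \alpha_{ij} h_j
```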
Here is an example of how to implement both attention mechanisms as Keras layers in Python using TensorFlow:
```
import tensorflow as tf


class LuongDotAttention(tf.keras.layers.Layer):
    def __init__(self):
        super(LuongDotAttention, self).__init__()

    def call(self, query, values):
        # query shape == (batch_size, hidden_size)
        # values shape == (batch_size, max_length, hidden_size)
        query_with_time_axis = tf.expand_dims(query, 1)
        values_transposed = tf.transpose(values, perm=[0, 2, 1])

        # Luong dot-product score, shape == (batch_size, max_length, 1)
        score = tf.transpose(tf.matmul(query_with_time_axis,
                                       values_transposed), perm=[0, 2, 1])

        # attention_weights shape == (batch_size, max_length, 1)
        attention_weights = tf.nn.softmax(score, axis=1)

        # context_vector shape after sum == (batch_size, hidden_size)
        context_vector = attention_weights * values
        context_vector = tf.reduce_sum(context_vector, axis=1)

        return context_vector, attention_weights


class BahdanauAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        super(BahdanauAttention, self).__init__()
        # Small feed-forward network used as the scoring function
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, query, values):
        # query shape == (batch_size, hidden_size)
        # values shape == (batch_size, max_length, hidden_size)
        query_with_time_axis = tf.expand_dims(query, 1)

        # score shape == (batch_size, max_length, 1)
        score = self.V(tf.nn.tanh(self.W1(query_with_time_axis) + self.W2(values)))

        # attention_weights shape == (batch_size, max_length, 1)
        attention_weights = tf.nn.softmax(score, axis=1)

        # context_vector shape after sum == (batch_size, hidden_size)
        context_vector = attention_weights * values
        context_vector = tf.reduce_sum(context_vector, axis=1)

        return context_vector, attention_weights
```
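And here is a minimal usage sketch for the two layers above. The batch size, sequence length, and hidden size are arbitrary illustrative values, and the random tensors stand in for a decoder hidden state and a stack of encoder outputs:

```
import tensorflow as tf

# Illustrative shapes only: 64 sequences, 16 encoder time steps, hidden size 1024
batch_size, max_length, hidden_size = 64, 16, 1024

# Stand-ins for a decoder hidden state and the encoder outputs
query = tf.random.normal((batch_size, hidden_size))
values = tf.random.normal((batch_size, max_length, hidden_size))

luong = LuongDotAttention()
context_vector, attention_weights = luong(query, values)
print(context_vector.shape)     # (64, 1024)
print(attention_weights.shape)  # (64, 16, 1)

bahdanau = BahdanauAttention(units=10)
context_vector, attention_weights = bahdanau(query, values)
print(context_vector.shape)     # (64, 1024)
print(attention_weights.shape)  # (64, 16, 1)
```

In a full sequence-to-sequence model, `query` would be the decoder's hidden state at the current step and `values` would be the encoder's outputs; the returned context vector is then concatenated with the decoder input before producing the next token.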