In this section, we organize and summarize key concepts about fully recurrent networks: the forward pass, derivatives, and visualization of computational graphs. We also provide mathematical equations, explanations, and illustrations for better understanding.
A fully recurrent network (RNN) is a type of neural network designed for processing sequences of data, such as time-series data or text. Unlike traditional feed-forward networks, RNNs have connections that loop back on themselves, allowing the network to remember information from previous time steps. This property makes RNNs well-suited for sequential tasks like language modeling and speech recognition.
The basic equations for a fully recurrent network are:

$$z_t = W_{xh} x_t + W_{hh} h_{t-1} + b_h$$

$$h_t = \tanh(z_t)$$

$$o = w_o^{\top} h_T + b_o$$

$$\hat{y} = \sigma(o)$$

Here $x_t$ is the input at time step $t$, $h_t$ is the hidden state, $W_{xh}$, $W_{hh}$, $b_h$, $w_o$, and $b_o$ are the learned parameters, $o$ is the logit produced at the final time step $T$, and $\hat{y}$ is the predicted probability.
The forward pass is the process of passing an input sequence through the network to get an output. In an RNN, this involves processing each time step sequentially, computing the intermediate values, hidden activations, and predictions.
The forward pass for the RNN involves the following steps (a minimal code sketch follows the list):
Initialization: Start with a zero hidden state at the initial time step, $h_0 = \mathbf{0}$.
Loop Through Time Steps: For each time step $t = 1, \dots, T$:
Compute $z_t = W_{xh} x_t + W_{hh} h_{t-1} + b_h$.
This combines the current input with the memory carried over from the previous time step.
Compute $h_t = \tanh(z_t)$.
This calculates the hidden state for the current time step, adding non-linearity to the model.
Output Calculation (Final Time Step):
Compute $o = w_o^{\top} h_T + b_o$.
This is the logit value at the final time step.
Compute the predicted output $\hat{y} = \sigma(o) = \dfrac{1}{1 + e^{-o}}$.
The sigmoid function converts the logit into a probability.
Loss Calculation: Use binary cross-entropy to calculate the loss:

$$\mathcal{L} = -\big[\, y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \,\big]$$
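A minimal NumPy sketch of this forward pass is shown below; the parameter names (`W_xh`, `W_hh`, `b_h`, `w_o`, `b_o`) and shapes are assumptions chosen to match the equations above rather than any particular implementation.

```python
import numpy as np

def rnn_forward(X, W_xh, W_hh, b_h, w_o, b_o):
    """Forward pass for a single sequence X of shape (T, d_in)."""
    T = X.shape[0]
    h = np.zeros(W_hh.shape[0])               # h_0 = 0
    for t in range(T):
        z = W_xh @ X[t] + W_hh @ h + b_h      # pre-activation z_t
        h = np.tanh(z)                        # hidden state h_t
    o = w_o @ h + b_o                         # logit at the final time step
    y_hat = 1.0 / (1.0 + np.exp(-o))          # sigmoid turns the logit into a probability
    return o, y_hat, h

# Example usage with random parameters
rng = np.random.default_rng(0)
T, d_in, d_h = 6, 3, 4
X = rng.normal(size=(T, d_in))
params = dict(W_xh=rng.normal(size=(d_h, d_in)),
              W_hh=0.1 * rng.normal(size=(d_h, d_h)),
              b_h=np.zeros(d_h),
              w_o=rng.normal(size=d_h),
              b_o=0.0)
o, y_hat, h_T = rnn_forward(X, **params)
```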
The binary cross-entropy loss function can suffer from numerical instability due to the use of logarithms and exponentials. There are three main problems to consider:
Overflow: When the logit $o$ becomes very large and positive, $e^{o}$ grows extremely quickly, leading to overflow, where the value exceeds the range that can be represented numerically.
Underflow: When $o$ becomes very negative, $e^{o}$ becomes extremely small, leading to underflow, where the value becomes too tiny for the computer to represent, effectively resulting in zero.
Logarithm of Near-Zero Values: The logarithm approaches negative infinity as its argument approaches zero, so when the predicted probability $\hat{y}$ is very close to 0 or 1, the terms $\log \hat{y}$ or $\log(1 - \hat{y})$ can blow up and the cross-entropy becomes extremely large.
To address these problems, it is better to work with the logits directly rather than the probabilities. Instead of calculating the sigmoid and then applying the cross-entropy loss, you combine the sigmoid and the loss into a single expression to maintain stability:

$$\mathcal{L} = -\big[\, y \log \sigma(o) + (1 - y) \log(1 - \sigma(o)) \,\big] = \log\!\big(1 + e^{o}\big) - y\,o$$

Using the log-sum-exp trick, $\log(1 + e^{o}) = \max(o, 0) + \log(1 + e^{-|o|})$, which gives:

$$\mathcal{L} = \max(o, 0) - y\,o + \log\!\big(1 + e^{-|o|}\big)$$
This formulation ensures that the exponent passed to the exponential is never positive, so $e^{-|o|}$ stays in $(0, 1]$ and the argument of the logarithm stays between 1 and 2, which avoids both overflow and underflow.
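A direct implementation of the stable formulation might look like this; the function name `bce_with_logits` is illustrative (frameworks such as PyTorch ship an equivalent built-in).

```python
import numpy as np

def bce_with_logits(o, y):
    """Numerically stable binary cross-entropy computed directly from the logit o."""
    # max(o, 0) - y*o + log(1 + exp(-|o|)): the exponent is never positive,
    # so exp cannot overflow and the log argument stays in (1, 2].
    return np.maximum(o, 0.0) - y * o + np.log1p(np.exp(-np.abs(o)))

# Matches the naive sigmoid-then-log version where that version is still stable
o, y = 3.0, 1.0
p = 1.0 / (1.0 + np.exp(-o))
naive = -(y * np.log(p) + (1 - y) * np.log(1 - p))
print(bce_with_logits(o, y), naive)   # both ~0.0486
```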
The derivative of the binary cross-entropy loss function with respect to the logit $o$ is:

$$\frac{\partial \mathcal{L}}{\partial o} = \sigma(o) - y = \hat{y} - y$$
This result is remarkably simple: the gradient is just the difference between the predicted probability $\hat{y}$ and the true label $y$, which makes the optimization step cheap to compute and easy to interpret.
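For completeness, the result follows from applying the chain rule to the loss and the sigmoid:

$$\frac{\partial \mathcal{L}}{\partial \hat{y}} = -\frac{y}{\hat{y}} + \frac{1 - y}{1 - \hat{y}}, \qquad \frac{\partial \hat{y}}{\partial o} = \hat{y}(1 - \hat{y})$$

$$\frac{\partial \mathcal{L}}{\partial o} = \left(-\frac{y}{\hat{y}} + \frac{1 - y}{1 - \hat{y}}\right)\hat{y}(1 - \hat{y}) = -y(1 - \hat{y}) + (1 - y)\hat{y} = \hat{y} - y$$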
The forward pass as described above loops over the time steps one by one, which can be inefficient, especially for long sequences. To speed it up, you can reconfigure the computation so that as much work as possible happens outside the loop: the input-to-hidden products for all time steps can be computed with one matrix multiplication up front, leaving only the lightweight recurrent update inside the loop (and, if the non-linearity is dropped, even the recurrence can be unrolled; see the techniques listed further below).
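A sketch of this reorganization, assuming the same shapes as before: the input projections for every time step are computed in a single matrix multiplication, and only the recurrence itself stays inside the loop.

```python
import numpy as np

def rnn_forward_precomputed(X, W_xh, W_hh, b_h):
    """Forward pass with the input-to-hidden products hoisted out of the loop."""
    T = X.shape[0]
    Z_in = X @ W_xh.T + b_h               # one (T, d_h) matrix multiply replaces T small ones
    h = np.zeros(W_hh.shape[0])
    for t in range(T):                    # only the cheap recurrent update remains
        h = np.tanh(Z_in[t] + W_hh @ h)
    return h
```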
The computational graph provides a visual representation of how the inputs, activations, and outputs are connected across time steps in an RNN.
The following code creates a computational graph using NetworkX and Matplotlib to represent the operations happening at each time step.
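A sketch along those lines is shown below; the node names, number of time steps, and layout are illustrative choices rather than the original listing.

```python
import networkx as nx
import matplotlib.pyplot as plt

T = 3
G = nx.DiGraph()

# Unroll the recurrence: each time step contributes x_t -> z_t, h_{t-1} -> z_t, z_t -> h_t
for t in range(1, T + 1):
    G.add_edge(f"x_{t}", f"z_{t}")
    G.add_edge(f"h_{t-1}", f"z_{t}")
    G.add_edge(f"z_{t}", f"h_{t}")

# Output, sigmoid, and loss hang off the final hidden state
G.add_edge(f"h_{T}", "o")
G.add_edge("o", "y_hat")
G.add_edge("y_hat", "loss")

pos = nx.spring_layout(G, seed=0)         # deterministic layout
nx.draw(G, pos, with_labels=True, node_color="lightblue", node_size=1200, arrows=True)
plt.title("Unrolled RNN computational graph")
plt.show()
```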
In frameworks like PyTorch, the computational graph is automatically constructed during the forward pass. Each operation on tensors creates nodes and edges in the computational graph, which PyTorch uses to calculate gradients during the backward pass.
PyTorch builds a dynamic computational graph during the forward pass:
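A minimal sketch of such a forward pass, with illustrative dimensions and parameter names; every tensor operation below adds nodes and edges to the graph.

```python
import torch
import torch.nn.functional as F

T, d_in, d_h = 5, 3, 4
x = torch.randn(T, d_in)                      # input sequence
y = torch.tensor([1.0])                       # binary target

W_xh = torch.randn(d_h, d_in, requires_grad=True)
W_hh = torch.randn(d_h, d_h, requires_grad=True)
b_h  = torch.zeros(d_h, requires_grad=True)
w_o  = torch.randn(d_h, requires_grad=True)
b_o  = torch.zeros(1, requires_grad=True)

h = torch.zeros(d_h)
for t in range(T):
    # Each operation here is recorded in the dynamic computational graph
    h = torch.tanh(W_xh @ x[t] + W_hh @ h + b_h)

o = w_o @ h + b_o                             # logit at the final time step
loss = F.binary_cross_entropy_with_logits(o, y)   # stable BCE computed from the logit
```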
During the backward pass, PyTorch traverses this graph in reverse, applying the chain rule at each node and accumulating the result into each leaf tensor's `.grad` attribute:
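Continuing the sketch above, a single call performs that reverse traversal and fills in the gradients of the leaf tensors.

```python
loss.backward()                # reverse traversal of the recorded graph

# Gradients are now available on every parameter that had requires_grad=True
print(W_xh.grad.shape)         # torch.Size([4, 3])
print(W_hh.grad.shape)         # torch.Size([4, 4])
print(w_o.grad.shape)          # torch.Size([4])
```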
Understanding the computational graph also helps when computing gradients manually, since backpropagation through time simply walks the unrolled graph in reverse:
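In the notation introduced earlier, walking the unrolled graph backwards gives the backpropagation-through-time recursion (sketched here for the $\tanh$ hidden units and single sigmoid output described above):

$$\frac{\partial \mathcal{L}}{\partial o} = \hat{y} - y, \qquad \frac{\partial \mathcal{L}}{\partial h_T} = (\hat{y} - y)\, w_o$$

$$\delta_t = \frac{\partial \mathcal{L}}{\partial h_t} \odot \big(1 - h_t^{2}\big), \qquad \frac{\partial \mathcal{L}}{\partial h_{t-1}} = W_{hh}^{\top} \delta_t$$

$$\frac{\partial \mathcal{L}}{\partial W_{xh}} = \sum_{t=1}^{T} \delta_t\, x_t^{\top}, \qquad \frac{\partial \mathcal{L}}{\partial W_{hh}} = \sum_{t=1}^{T} \delta_t\, h_{t-1}^{\top}$$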
Several further techniques can speed up or approximate the RNN computation:
Batch Processing: process many sequences at once, so that each time step becomes a single matrix-matrix multiplication instead of many separate matrix-vector products.
Sequence Chunking: split very long sequences into shorter chunks and process (and backpropagate through) each chunk separately.
ReLU Instead of Tanh: replace the $\tanh$ non-linearity with the cheaper ReLU activation.
Linear Approximation: drop the non-linearity entirely, turning the recurrence into a linear map that can be unrolled in closed form.
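As an illustration of the last point, if the hidden state starts at zero and the $\tanh$ is replaced by the identity, the recurrence unrolls into a closed form whose terms can all be computed in parallel:

$$h_T = \sum_{t=1}^{T} W_{hh}^{\,T - t}\big(W_{xh} x_t + b_h\big)$$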
These concepts are foundational for understanding how recurrent neural networks learn to process sequences, remember important information, and make predictions based on past inputs. The computational graph is particularly important for implementing backpropagation, whether done manually or by using frameworks like PyTorch.