Back-Propagation

## **Introduction**

Imagine teaching a child to ride a bicycle. They wobble, they might fall, but with each attempt they adjust their balance, pedal a bit more smoothly, and steer more confidently. This process of learning from errors is similar to how **Artificial Neural Networks (ANNs)** learn to perform tasks like recognizing your friend's face in a photo or translating languages. At the core of this learning process lies a fascinating algorithm called **Backpropagation**.

Think of **Backpropagation** as the method by which a neural network reflects on its mistakes and tweaks itself to improve future performance. It uses the magic of calculus, specifically partial derivatives, to figure out exactly how to adjust the **"knobs and dials"** inside the network so that it produces more accurate output.

In this article, we'll dive deep into backpropagation. We'll use a simple numerical example with two layers and two neurons in each layer to make the concepts concrete. We'll also touch on the activation functions commonly used in ANNs.

## **Understanding Neural Networks**

Before we dive into backpropagation, let's understand what a neural network is.

* **Neurons (Nodes):** The basic units of a neural network, analogous to neurons in the human brain.
* **Layers:** Neurons are organized into layers: input, hidden, and output.
* **Weights and Biases:** Connections between neurons have weights that amplify or diminish signals, and biases that shift the input to the activation function.
* **Activation Functions:** Functions that introduce non-linearity, allowing the network to learn complex patterns. Examples include the sigmoid, ReLU, and tanh functions.

A Simple Neural Network

![SNN](https://hackmd.io/_uploads/ry6FLylzJl.png)

## **Forward Propagation:** The Quest for Predictions

Imagine you're tossing a ball to a friend standing on a hill. The ball's journey upwards represents the forward pass in our neural network.
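Before the formulas, it may help to see what a single neuron computes during a forward pass. The sketch below is only an illustration, not the article's worked example: the input, weight, and bias values and the helper name `neuron_output` are assumptions chosen for demonstration. It feeds the same weighted sum through the three activation functions named earlier.

```python
import math


def sigmoid(z):
    """Squash z into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def relu(z):
    """Pass positive values through unchanged and clamp negatives to 0."""
    return max(0.0, z)


def tanh(z):
    """Squash z into the range (-1, 1)."""
    return math.tanh(z)


def neuron_output(inputs, weights, bias, activation):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)


# Illustrative values only (assumptions for this sketch).
inputs = [0.5, -1.0]
weights = [0.8, 0.2]
bias = 0.1

for name, fn in [("sigmoid", sigmoid), ("relu", relu), ("tanh", tanh)]:
    print(name, round(neuron_output(inputs, weights, bias, fn), 4))
```

The weighted sum is identical in all three cases; only the final squashing step changes, which is why the activation function can be swapped without touching the rest of the layer.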
![FNN](https://hackmd.io/_uploads/HJscUJlG1e.png)

#### **Let's consider a simple neural network:**

**Input Layer:** 2 neurons (features $x_1$ and $x_2$)

**Hidden Layer:** 2 neurons ($h_1$ and $h_2$)

**Output Layer:** 2 neurons (outputs $\hat{y}_1$ and $\hat{y}_2$)

Input to Hidden Layer

For each neuron in the hidden layer:

$$ z^{(1)} = \sum_i w_i^{(1)} x_i + b^{(1)} $$

$$ a^{(1)} = \sigma(z^{(1)}) $$

Where:

- $z^{(1)}$: Weighted sum of inputs.
- $w_i^{(1)}$: Weights from input neurons to a hidden neuron.
- $x_i$: Input features.
- $b^{(1)}$: Bias term.
- $a^{(1)}$: Activation from the hidden neuron.
- $\sigma$: Activation function (e.g., sigmoid).

Hidden to Output Layer

For each neuron in the output layer:

$$ z^{(2)} = \sum_j w_j^{(2)} a_j^{(1)} + b^{(2)} $$

$$ \hat{y} = \sigma(z^{(2)}) $$

Where:

- $w_j^{(2)}$: Weights from hidden neurons to the output neuron.
- $a_j^{(1)}$: Activations from hidden neurons.
- $\hat{y}$: Predicted output.

### **Numerical Example**

Given:

Inputs: $x_1 = 0.05$, $x_2 = 0.10$

Weights from Input to Hidden Layer:

\begin{align*} w_{1,1} &= 0.15, & w_{1,2} &= 0.20 \\ w_{2,1} &= 0.25, & w_{2,2} &= 0.30 \end{align*}

Weights from Hidden to Output Layer:

\begin{align*} w_{3,1} &= 0.40, & w_{3,2} &= 0.45 \\ w_{4,1} &= 0.50, & w_{4,2} &= 0.55 \end{align*}

Here $w_{1,j}$ and $w_{2,j}$ connect inputs $x_1$ and $x_2$ to hidden neuron $h_j$, while $w_{3,k}$ and $w_{4,k}$ connect $h_1$ and $h_2$ to output $\hat{y}_k$.

Biases:

- Hidden layer biases: $b^{(1)}_1 = 0.35$, $b^{(1)}_2 = 0.35$
- Output layer biases: $b^{(2)}_1 = 0.60$, $b^{(2)}_2 = 0.60$

Target Outputs: $y_1 = 0.01$, $y_2 = 0.99$

Activation Function: We'll use the **sigmoid function**:

$$ \sigma(z) = \frac{1}{1 + e^{-z}} $$

All values below are shown rounded to four decimal places, but each step is computed from the unrounded results of the previous step. Multiplying the rounded numbers by hand can therefore differ from the printed result in the last digit.

**Step-by-Step Forward Pass**

### **1. Compute the Weighted Sum for Hidden Layer Neurons**

For $h_1$:

\begin{align*} z^{(1)}_1 &= x_1 w_{1,1} + x_2 w_{2,1} + b^{(1)}_1 \\ &= (0.05)(0.15) + (0.10)(0.25) + 0.35 \\ &= 0.0075 + 0.025 + 0.35 = 0.3825 \end{align*}

For $h_2$:

\begin{align*} z^{(1)}_2 &= x_1 w_{1,2} + x_2 w_{2,2} + b^{(1)}_2 \\ &= (0.05)(0.20) + (0.10)(0.30) + 0.35 \\ &= 0.01 + 0.03 + 0.35 = 0.39 \end{align*}

### **2. Apply Activation Function to Hidden Layer Neurons**

For $h_1$:

$$ a^{(1)}_1 = \sigma(z^{(1)}_1) = \frac{1}{1 + e^{-0.3825}} \approx 0.5945 $$

For $h_2$:

$$ a^{(1)}_2 = \sigma(z^{(1)}_2) = \frac{1}{1 + e^{-0.39}} \approx 0.5963 $$

### **3. Compute the Weighted Sum for Output Layer Neurons**

For $\hat{y}_1$:

\begin{align*} z^{(2)}_1 &= a^{(1)}_1 w_{3,1} + a^{(1)}_2 w_{4,1} + b^{(2)}_1 \\ &= (0.5945)(0.40) + (0.5963)(0.50) + 0.60 \\ &\approx 0.2378 + 0.2981 + 0.60 = 1.1359 \end{align*}

For $\hat{y}_2$:

\begin{align*} z^{(2)}_2 &= a^{(1)}_1 w_{3,2} + a^{(1)}_2 w_{4,2} + b^{(2)}_2 \\ &= (0.5945)(0.45) + (0.5963)(0.55) + 0.60 \\ &\approx 0.2675 + 0.3280 + 0.60 = 1.1955 \end{align*}

### **4. Apply Activation Function to Output Layer Neurons**

For $\hat{y}_1$:

$$ \hat{y}_1 = \sigma(z^{(2)}_1) = \frac{1}{1 + e^{-1.1359}} \approx 0.7569 $$

For $\hat{y}_2$:

$$ \hat{y}_2 = \sigma(z^{(2)}_2) = \frac{1}{1 + e^{-1.1955}} \approx 0.7677 $$

![FNN1](https://hackmd.io/_uploads/ryghIygG1g.png)

### Calculating Loss

After the forward pass, the network makes predictions $\hat{y}_1$ and $\hat{y}_2$. But how good are these predictions?

### Calculating Total Loss

Using the **Mean Squared Error (MSE)** for each output neuron, written here as half the sum of squared errors (the factor $\frac{1}{2}$ simplifies the derivative later):

$$ L = \frac{1}{2} \sum_{i} (y_i - \hat{y}_i)^2 $$

Compute the loss for each output.

For $\hat{y}_1$:

$$ L_1 = \frac{1}{2} (y_1 - \hat{y}_1)^2 = \frac{1}{2} (0.01 - 0.7569)^2 \approx \frac{1}{2} (-0.7469)^2 \approx 0.2790 $$

For $\hat{y}_2$:

$$ L_2 = \frac{1}{2} (y_2 - \hat{y}_2)^2 = \frac{1}{2} (0.99 - 0.7677)^2 \approx \frac{1}{2} (0.2223)^2 \approx 0.0247 $$

Total Loss:

$$ L_{\text{total}} = L_1 + L_2 \approx 0.2790 + 0.0247 = 0.3037 $$

## Backward Propagation: Learning from Mistakes

Backpropagation, short for "backward propagation of errors," is a supervised learning algorithm used to train artificial neural networks. It's like retracing your steps after getting lost to figure out where you went wrong. First, we need the derivative of the loss with respect to each output neuron.

#### Backpropagation chain rule for neural networks

To compute the gradient of the loss function $L$ with respect to a weight $w$ in a neural network, we use the chain rule. Suppose the loss $L$ depends on $w$ through a weighted sum $z$ and an activation $a$. Then:

$$ \frac{\partial L}{\partial w} = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial w} $$

Here:

- $\frac{\partial L}{\partial a}$: Gradient of the loss with respect to the activation.
- $\frac{\partial a}{\partial z}$: Derivative of the activation function.
- $\frac{\partial z}{\partial w}$: Derivative of the weighted sum with respect to the weight.

### Compute Output Layer Deltas

For neuron $\hat{y}_1$:

$$ \delta^{(2)}_1 = \frac{\partial L}{\partial z^{(2)}_1} = \frac{\partial L}{\partial \hat{y}_1} \cdot \frac{\partial \hat{y}_1}{\partial z^{(2)}_1} $$

The first factor comes from differentiating the loss:

$$ \frac{\partial L}{\partial \hat{y}_1} = \hat{y}_1 - y_1 = 0.7569 - 0.01 = 0.7469 $$

The second factor uses the sigmoid's derivative, $\sigma'(z) = \sigma(z)(1 - \sigma(z))$:

$$ \frac{\partial \hat{y}_1}{\partial z^{(2)}_1} = \hat{y}_1 (1 - \hat{y}_1) = 0.7569 \times 0.2431 \approx 0.1840 $$

Multiplying the two:

$$ \delta^{(2)}_1 = 0.7469 \times 0.1840 \approx 0.1374 $$

Similarly, for neuron $\hat{y}_2$:

\begin{align*} \frac{\partial L}{\partial \hat{y}_2} &= \hat{y}_2 - y_2 = 0.7677 - 0.99 = -0.2223 \\ \frac{\partial \hat{y}_2}{\partial z^{(2)}_2} &= \hat{y}_2 (1 - \hat{y}_2) = 0.7677 \times 0.2323 \approx 0.1783 \\ \delta^{(2)}_2 &= (-0.2223) \times 0.1783 \approx -0.0396 \end{align*}

### Compute Gradients for Weights from Hidden to Output Layer

For weight $w_{3,1}$:

$$ \frac{\partial L}{\partial w_{3,1}} = \delta^{(2)}_1 \cdot a^{(1)}_1 = 0.1374 \times 0.5945 \approx 0.0817 $$

For weight $w_{4,1}$:

$$ \frac{\partial L}{\partial w_{4,1}} = \delta^{(2)}_1 \cdot a^{(1)}_2 = 0.1374 \times 0.5963 \approx 0.0819 $$

For weight $w_{3,2}$:

$$ \frac{\partial L}{\partial w_{3,2}} = \delta^{(2)}_2 \cdot a^{(1)}_1 = (-0.0396) \times 0.5945 \approx -0.0236 $$

For weight $w_{4,2}$:

$$ \frac{\partial L}{\partial w_{4,2}} = \delta^{(2)}_2 \cdot a^{(1)}_2 = (-0.0396) \times 0.5963 \approx -0.0236 $$

### Compute Gradients for Biases in Output Layer

For bias $b^{(2)}_1$:

$$ \frac{\partial L}{\partial b^{(2)}_1} = \delta^{(2)}_1 = 0.1374 $$

For bias $b^{(2)}_2$:

$$ \frac{\partial L}{\partial b^{(2)}_2} = \delta^{(2)}_2 = -0.0396 $$

### Calculating Gradients at the Hidden Layer

#### Compute Hidden Layer Deltas

Each hidden neuron feeds both output neurons, so its error signal is the sum of the output deltas weighted by the connecting weights, multiplied by the derivative of its own activation.

For neuron $h_1$:

$$ \delta^{(1)}_1 = \left( \delta^{(2)}_1 w_{3,1} + \delta^{(2)}_2 w_{3,2} \right) \cdot a^{(1)}_1 (1 - a^{(1)}_1) $$

Compute the sum:

$$ S_1 = \delta^{(2)}_1 w_{3,1} + \delta^{(2)}_2 w_{3,2} = (0.1374)(0.40) + (-0.0396)(0.45) \approx 0.0550 - 0.0178 \approx 0.0371 $$

Compute $a^{(1)}_1 (1 - a^{(1)}_1)$:

$$ a^{(1)}_1 (1 - a^{(1)}_1) = 0.5945 (1 - 0.5945) \approx 0.5945 \times 0.4055 \approx 0.2411 $$

Compute $\delta^{(1)}_1$:

$$ \delta^{(1)}_1 = 0.0371 \times 0.2411 \approx 0.0090 $$

Similarly, for neuron $h_2$:

$$ S_2 = \delta^{(2)}_1 w_{4,1} + \delta^{(2)}_2 w_{4,2} = (0.1374)(0.50) + (-0.0396)(0.55) \approx 0.0687 - 0.0218 = 0.0469 $$

$$ a^{(1)}_2 (1 - a^{(1)}_2) = 0.5963 (1 - 0.5963) \approx 0.5963 \times 0.4037 \approx 0.2407 $$

$$ \delta^{(1)}_2 = 0.0469 \times 0.2407 \approx 0.0113 $$

### Compute Gradients for Weights from Input to Hidden Layer

For weight $w_{1,1}$:

$$ \frac{\partial L}{\partial w_{1,1}} = \delta^{(1)}_1 \cdot x_1 = 0.0090 \times 0.05 \approx 0.0004 $$

For weight $w_{2,1}$:

$$ \frac{\partial L}{\partial w_{2,1}} = \delta^{(1)}_1 \cdot x_2 = 0.0090 \times 0.10 \approx 0.0009 $$

For weight $w_{1,2}$:

$$ \frac{\partial L}{\partial w_{1,2}} = \delta^{(1)}_2 \cdot x_1 = 0.0113 \times 0.05 \approx 0.0006 $$

For weight $w_{2,2}$:

$$ \frac{\partial L}{\partial w_{2,2}} = \delta^{(1)}_2 \cdot x_2 = 0.0113 \times 0.10 \approx 0.0011 $$

### Compute Gradients for Biases in Hidden Layer

For bias $b^{(1)}_1$:

$$ \frac{\partial L}{\partial b^{(1)}_1} = \delta^{(1)}_1 = 0.0090 $$

For bias $b^{(1)}_2$:

$$ \frac{\partial L}{\partial b^{(1)}_2} = \delta^{(1)}_2 = 0.0113 $$

## Updating the Weights: Making Adjustments

With the gradients computed, we adjust the weights to reduce the loss.

### Gradient Descent Update Rule

Using a learning rate $\eta = 0.5$, each parameter moves a small step against its gradient.

#### Update Weights from Hidden to Output Layer

$$ w_{3,1}^{\text{new}} = w_{3,1}^{\text{old}} - \eta \cdot \frac{\partial L}{\partial w_{3,1}} = 0.40 - 0.5 \times 0.0817 \approx 0.3592 $$

$$ w_{4,1}^{\text{new}} = 0.50 - 0.5 \times 0.0819 \approx 0.4590 $$

$$ w_{3,2}^{\text{new}} = 0.45 - 0.5 \times (-0.0236) \approx 0.4618 $$

$$ w_{4,2}^{\text{new}} = 0.55 - 0.5 \times (-0.0236) \approx 0.5618 $$

#### Update Biases in Output Layer

$$ b^{(2),\text{new}}_{1} = 0.60 - 0.5 \times 0.1374 \approx 0.5313 $$

$$ b^{(2),\text{new}}_{2} = 0.60 - 0.5 \times (-0.0396) \approx 0.6198 $$

#### Update Weights from Input to Hidden Layer

$$ w_{1,1}^{\text{new}} = 0.15 - 0.5 \times 0.0004 \approx 0.1498 $$

$$ w_{2,1}^{\text{new}} = 0.25 - 0.5 \times 0.0009 \approx 0.2496 $$

$$ w_{1,2}^{\text{new}} = 0.20 - 0.5 \times 0.0006 \approx 0.1997 $$

$$ w_{2,2}^{\text{new}} = 0.30 - 0.5 \times 0.0011 \approx 0.2994 $$

#### Update Biases in Hidden Layer

$$ b^{(1),\text{new}}_{1} = 0.35 - 0.5 \times 0.0090 \approx 0.3455 $$

$$ b^{(1),\text{new}}_{2} = 0.35 - 0.5 \times 0.0113 \approx 0.3444 $$

# The Magic of the Chain Rule

The **chain rule** allows us to compute the derivative of a composite function by multiplying the derivatives of its constituent functions.

## Applying the Chain Rule in Backpropagation

For the loss function $L$ and weight $w$:

$$ \frac{\partial L}{\partial w} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial w} $$

In our numerical example, this chain of derivatives allowed us to compute how changes in weights affect the overall loss.
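The forward pass, loss, deltas, gradients, and single gradient-descent update walked through above can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions, not a reference implementation: the variable names and the nested-list layout (`w_hidden[j][i]`, `w_output[k][j]`) are choices made for this illustration. Because the script works with unrounded values, its output may differ from hand-rounded figures in the last digit.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Inputs, targets, and learning rate from the worked example.
x = [0.05, 0.10]
y = [0.01, 0.99]
eta = 0.5

# Assumed layout: w_hidden[j][i] is the weight from input i to hidden neuron j.
w_hidden = [[0.15, 0.25],   # into h1: w_{1,1}, w_{2,1}
            [0.20, 0.30]]   # into h2: w_{1,2}, w_{2,2}
b_hidden = [0.35, 0.35]
# Assumed layout: w_output[k][j] is the weight from hidden neuron j to output k.
w_output = [[0.40, 0.50],   # into y1_hat: w_{3,1}, w_{4,1}
            [0.45, 0.55]]   # into y2_hat: w_{3,2}, w_{4,2}
b_output = [0.60, 0.60]

# Forward pass: weighted sum, then sigmoid, layer by layer.
a_hidden = [sigmoid(sum(w * xi for w, xi in zip(w_hidden[j], x)) + b_hidden[j])
            for j in range(2)]
y_hat = [sigmoid(sum(w * a for w, a in zip(w_output[k], a_hidden)) + b_output[k])
         for k in range(2)]
loss = 0.5 * sum((t - p) ** 2 for t, p in zip(y, y_hat))

# Output deltas: dL/dy_hat times the sigmoid derivative y_hat * (1 - y_hat).
delta_out = [(y_hat[k] - y[k]) * y_hat[k] * (1 - y_hat[k]) for k in range(2)]
# Hidden deltas: output deltas sent back through the original output weights.
delta_hid = [sum(delta_out[k] * w_output[k][j] for k in range(2))
             * a_hidden[j] * (1 - a_hidden[j])
             for j in range(2)]

# Gradient descent step; a bias gradient is just the delta of its neuron.
new_w_output = [[w_output[k][j] - eta * delta_out[k] * a_hidden[j]
                 for j in range(2)] for k in range(2)]
new_b_output = [b_output[k] - eta * delta_out[k] for k in range(2)]
new_w_hidden = [[w_hidden[j][i] - eta * delta_hid[j] * x[i]
                 for i in range(2)] for j in range(2)]
new_b_hidden = [b_hidden[j] - eta * delta_hid[j] for j in range(2)]

print("predictions:", [round(p, 4) for p in y_hat])
print("total loss:", round(loss, 4))
print("output deltas:", [round(d, 4) for d in delta_out])
print("hidden deltas:", [round(d, 4) for d in delta_hid])
print("updated output weights:", [[round(w, 4) for w in row] for row in new_w_output])
print("updated hidden weights:", [[round(w, 4) for w in row] for row in new_w_hidden])
```

Note that the hidden deltas are computed before any weight is changed, so the backward pass sees the same weights the forward pass used.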
# Intuition Behind Partial Derivatives

Partial derivatives tell us how a small change in one variable (holding the others constant) affects the output.

- $\frac{\partial L}{\partial w}$: How does changing a specific weight affect the loss?
- By calculating these, we know which weights to adjust, in which direction, and by how much relative to one another.

## A Real-Life Analogy

Consider a chef tweaking a recipe to improve a dish:

- **Ingredients (Weights)**: The amounts of each ingredient.
- **Taste (Loss)**: The quality of the dish.
- **Adjustments (Gradients)**: Knowing which ingredient to increase or decrease to improve the taste.

Backpropagation is like the chef tasting the dish and deciding how to adjust each ingredient.

## Putting It All into Perspective

Backpropagation may seem complex, but at its core it's about learning from mistakes and making adjustments, a process we humans are intimately familiar with.

- **Forward Propagation**: Making a prediction based on current knowledge.
- **Calculating Loss**: Assessing how good that prediction was.
- **Backward Propagation**: Understanding how to improve future predictions.
- **Weight Updates**: Implementing those improvements.

## Types of Backpropagation

These are some common variants of backpropagation.

### 1. Standard Backpropagation

- **Description**: The classic form of backpropagation used in fully connected feedforward neural networks. Errors are propagated backward from the output layer toward the input layer using the chain rule of differentiation.
- **Use Case**: General-purpose neural networks for tasks like classification and regression.

### 2. Mini-Batch Backpropagation

- **Description**: Gradients are computed for small, randomly selected subsets of the dataset (mini-batches).
- **Use Case**: The most common form of backpropagation in modern deep learning, especially for training efficiently on large datasets.

### 3. Truncated Backpropagation Through Time (TBPTT)

- **Description**: A version of backpropagation designed for recurrent neural networks (RNNs). Instead of backpropagating through the entire sequence, gradients are computed over a fixed time window.
- **Use Case**: Time-series analysis, natural language processing (NLP), and sequential data tasks with long sequences.

### 4. Backpropagation Through Time (BPTT)

- **Description**: A backpropagation method applied to recurrent neural networks (RNNs) by unrolling the network over time and computing gradients for all time steps.
- **Use Case**: Training recurrent neural networks to handle sequential data with dependencies across time.

### 5. Backpropagation with Regularization

- **Description**: Combines backpropagation with regularization, such as L1 or L2 penalty terms added to the loss function to discourage large weights, or dropout applied during training, to improve generalization.
- **Use Case**: Preventing overfitting and improving model generalization in complex neural networks.

### 6. Gradient Clipping in Backpropagation

- **Description**: Limits the magnitude of gradients during backpropagation to prevent exploding gradients, especially in deep or recurrent networks.
- **Use Case**: Stabilizing training for recurrent or deep neural networks with high sensitivity to large gradients.

## Conclusion

Backpropagation enables neural networks to learn and adapt. By walking through a numerical example, we have explained the mathematics behind backpropagation. Just like a musician refining their performance through practice, or a child learning to ride a bike, neural networks learn by iteratively improving based on feedback. And it's all made possible by the power of partial derivatives and the chain rule.

# Key Takeaways

- **Backpropagation** allows neural networks to learn from errors by adjusting weights.
- **Numerical Examples** help in understanding the practical implementation.
- **Partial Derivatives** indicate how changes in weights affect the loss.
- **Activation Functions** like sigmoid, tanh, and ReLU introduce non-linearity.
- **The Chain Rule** is essential for computing gradients in layered networks.
- **Learning** is an iterative process of prediction, evaluation, and adjustment.
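
The last takeaway, learning as repeated prediction, evaluation, and adjustment, can be sketched as a small training loop on the example network. This is a rough illustration under assumptions: the function name `train`, its parameters, and the choice of 1000 steps are hypothetical and not taken from the article, and real training would use many examples rather than a single input.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(x, y, w_h, b_h, w_o, b_o, eta=0.5, steps=1000):
    """Repeat forward pass, loss, backward pass, and update.

    w_h[j][i] is the weight from input i to hidden neuron j, and
    w_o[k][j] is the weight from hidden neuron j to output neuron k
    (a layout assumed for this sketch). The lists are updated in place.
    Returns the loss recorded before each update.
    """
    history = []
    n_h, n_o = len(w_h), len(w_o)
    for _ in range(steps):
        # Prediction: forward pass through both layers.
        a = [sigmoid(sum(w * xi for w, xi in zip(w_h[j], x)) + b_h[j])
             for j in range(n_h)]
        p = [sigmoid(sum(w * aj for w, aj in zip(w_o[k], a)) + b_o[k])
             for k in range(n_o)]
        # Evaluation: half the sum of squared errors.
        history.append(0.5 * sum((t - q) ** 2 for t, q in zip(y, p)))
        # Adjustment: compute all deltas first, then change the weights.
        d_o = [(p[k] - y[k]) * p[k] * (1 - p[k]) for k in range(n_o)]
        d_h = [sum(d_o[k] * w_o[k][j] for k in range(n_o)) * a[j] * (1 - a[j])
               for j in range(n_h)]
        for k in range(n_o):
            for j in range(n_h):
                w_o[k][j] -= eta * d_o[k] * a[j]
            b_o[k] -= eta * d_o[k]
        for j in range(n_h):
            for i in range(len(x)):
                w_h[j][i] -= eta * d_h[j] * x[i]
            b_h[j] -= eta * d_h[j]
    return history


history = train(
    x=[0.05, 0.10], y=[0.01, 0.99],
    w_h=[[0.15, 0.25], [0.20, 0.30]], b_h=[0.35, 0.35],
    w_o=[[0.40, 0.50], [0.45, 0.55]], b_o=[0.60, 0.60],
)
print(f"loss before the first update: {history[0]:.4f}")
print(f"loss before the last update:  {history[-1]:.6f}")
```

Each pass through the loop is one round of the cycle described in this article, and comparing the first and last entries of `history` shows how far the loss has moved after many small adjustments.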