# Backpropagation: A Simple Explanation Using Partial Derivatives
Backpropagation is the algorithm neural networks use to learn.
It computes how much each weight contributed to the prediction error using partial derivatives, so that the weights can then be adjusted to reduce that error.
:::info
Analogy: Learning to Hit a Target
A simple way to understand backpropagation is to imagine teaching someone to throw a ball at a target:
• You watch where the ball lands.
• You compare it to where it should have landed.
• Then you give feedback:
• “You threw too far — next time use a little less force.”
• “You threw too far to the left — adjust your angle to the right.”
:::
:::success
Backpropagation works the same way:
• The network makes a prediction (throws the ball).
• It measures the error (how far from the target it landed).
• It determines which weights caused the mistake (which muscle movements were off).
• It nudges each weight slightly to reduce the error (a small correction to the throw).
:::
Just like practicing throws, this process repeats many times.
Each round of feedback improves the weights, reducing prediction error and making the network more accurate.
# Neural Network Training Has Two Phases
:::warning
1. Forward Pass
Compute the prediction.
:::
:::warning
2. Backward Pass
Compute gradients and update weights.
:::
This article uses a small, interpretable neural network to demonstrate how backpropagation works. The network contains two input neurons, two hidden neurons, and a single output neuron, with sigmoid activations in both the hidden and output layers. This structure is often written as a 2–2–1 architecture, and it is chosen here for conceptual clarity and mathematical simplicity.
The following diagram shows the network that will be used to explain backpropagation:
:::info
A neuron is the fundamental computational unit of the network: it computes a weighted sum of its inputs plus a bias and applies an activation function to produce its output. (The example network here omits biases to keep the math short.)
:::
```mermaid
flowchart LR
I1((i₁))
I2((i₂))
H1((h₁))
H2((h₂))
Y((ŷ))
%% Input → Hidden
I1 -->|w₁| H1
I1 -->|w₂| H2
I2 -->|w₃| H1
I2 -->|w₄| H2
%% Hidden → Output
H1 -->|w₅| Y
H2 -->|w₆| Y
```
# 1. Forward Pass: Compute Outputs
Step 1 — Hidden Neuron Weighted Inputs
Each hidden neuron receives a weighted sum of the inputs. This value is the raw signal before the activation function is applied. The neuron is called hidden because it belongs to an internal layer of the network: its activation is neither observed in the input data nor exposed as a final output, but serves as an intermediate feature representation learned by the model.
$$
u_1 = w_1 i_1 + w_3 i_2
$$
$$
u_2 = w_2 i_1 + w_4 i_2
$$
:::info
Intuition: Think of each weight as a volume knob.
• Large & positive → input strongly activates the neuron
• Negative → input pushes the neuron downward
• Near zero → input is mostly ignored
:::
:::success
What is happening:
The network is “mixing” $i_1$ and $i_2$ into two different combinations ($u_1$ and $u_2$), one for each hidden neuron.
:::
Step 2 — Hidden Neuron Activations
We apply the sigmoid function to introduce nonlinearity:
$$
h_1 = \sigma(u_1)
$$
$$
h_2 = \sigma(u_2)
$$
:::info
Intuition: Sigmoid acts like a soft switch. It forces values to be between 0 and 1.
• Large positive → near 1
• Large negative → near 0
• Zero → 0.5
:::
This helps the network learn patterns that aren’t purely linear.
Step 3 — Output Neuron Weighted Input
The output neuron combines information from both hidden neurons:
$$
u_3 = w_5 h_1 + w_6 h_2
$$
:::success
Intuition:
This step mixes the two hidden-layer outputs into a single score that represents the final “confidence” before applying sigmoid.
:::
Step 4 — Output Prediction
The network applies sigmoid again to convert $u_3$ into a probability:
$$
\hat{y} = \sigma(u_3)
$$
:::info
Interpretation:
$\hat{y}$ is the model’s predicted probability that the input belongs to class 1.
Examples:
• $\hat{y}$ = 0.92 → “92% confidence this is class 1.”
• $\hat{y}$ = 0.15 → “15% confidence this is class 1.”
:::
Step 5 — Compute Error (Mean Squared Error)
To measure how far the prediction is from the true label, we use Mean Squared Error (MSE) for a single training example:
$$
J = \frac{1}{2}(\hat{y} - y)^2
$$
:::warning
Interpretation:
• If $\hat{y}$ is close to the true label $y$, the error is small.
• If $\hat{y}$ is far from the correct answer, the error is large.
:::
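To make the forward pass concrete, here is a minimal Python sketch of the computation above. The helper names (`sigmoid`, `forward`, `loss`) and the dict of weights are illustrative choices, not part of any particular library:
```python
import math

def sigmoid(u):
    """Squash any real value into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-u))

def forward(i1, i2, w):
    """Forward pass for the 2-2-1 network from the diagram (no biases)."""
    u1 = w['w1'] * i1 + w['w3'] * i2     # Step 1: weighted input to h1
    u2 = w['w2'] * i1 + w['w4'] * i2     # Step 1: weighted input to h2
    h1, h2 = sigmoid(u1), sigmoid(u2)    # Step 2: hidden activations
    u3 = w['w5'] * h1 + w['w6'] * h2     # Step 3: output weighted input
    y_hat = sigmoid(u3)                  # Step 4: predicted probability
    return u1, u2, h1, h2, u3, y_hat

def loss(y_hat, y):
    """Step 5: mean squared error for one example, J = (1/2)(y_hat - y)^2."""
    return 0.5 * (y_hat - y) ** 2
```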
# Numerical Example
Let’s plug in simple numbers so you can see the entire forward pass in action.
:::info
Given:
• Inputs: $i_1 = 1.0$, $i_2 = 0.5$
• Weights:
• $w_1 = 0.3,\; w_3 = -0.2$
• $w_2 = 0.8,\; w_4 = 0.4$
• $w_5 = 1.2,\; w_6 = -0.7$
• True label: $y = 1$
:::
1. Compute Hidden Inputs
$$
u_1 = 0.3(1.0) + (-0.2)(0.5) = 0.2
$$
$$
u_2 = 0.8(1.0) + 0.4(0.5) = 1.0
$$
2. Hidden Activations
$$
h_1 = \sigma(0.2) \approx 0.5498
$$
$$
h_2 = \sigma(1.0) \approx 0.7311
$$
3. Output Neuron Input
$$
u_3 = 1.2(0.5498) - 0.7(0.7311)
$$
$$
u_3 = 0.6598 - 0.5118 \approx 0.1480
$$
4. Output Prediction
$$
\hat{y} = \sigma(0.1480) \approx 0.5369
$$
:::success
Interpretation:
The network predicts a ~53.7% probability that the label is 1.
:::
5. Compute Loss (Mean Squared Error)
We use Mean Squared Error (MSE) to measure how far the prediction is from the true label:
$$
J = \frac{1}{2}(\hat{y} - y)^2
$$
Plug in the values:
$$
J = \frac{1}{2}(0.5369 - 1)^2
$$
$$
J = \frac{1}{2}(-0.4631)^2
= \frac{1}{2}(0.2145)
\approx 0.1072
$$
:::warning
Loss ≈ 0.1072
The prediction is somewhat wrong, but not extremely far from the true value.
:::
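As a sanity check, the sketch above reproduces these numbers (assuming the `forward` and `loss` helpers defined earlier; the worked example rounds $h_1$ and $h_2$ before multiplying, so the last digit of $u_3$ differs slightly):
```python
w = {'w1': 0.3, 'w2': 0.8, 'w3': -0.2, 'w4': 0.4, 'w5': 1.2, 'w6': -0.7}
u1, u2, h1, h2, u3, y_hat = forward(1.0, 0.5, w)
print(f"u1={u1:.4f}  u2={u2:.4f}")        # u1=0.2000  u2=1.0000
print(f"h1={h1:.4f}  h2={h2:.4f}")        # h1=0.5498  h2=0.7311
print(f"u3={u3:.4f}  y_hat={y_hat:.4f}")  # u3=0.1481  y_hat=0.5369
print(f"J={loss(y_hat, 1.0):.4f}")        # J=0.1072
```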
# Backward Pass – Compute Gradients Using Partial Derivatives
Backpropagation uses the Chain Rule.
Every weight in a neural network affects the final loss indirectly, by flowing through several intermediate steps:
$w \;\rightarrow\; u \;\rightarrow\; h \;\rightarrow\; u_3 \;\rightarrow\; \hat{y} \;\rightarrow\; J$
Because of this, we cannot take the derivative of the loss with respect to a weight in a single step.
Instead, we apply the chain rule, which says:
:::info
If a quantity changes through several intermediate steps,
the total derivative is the product of all the small derivatives along the path.
:::
For example:
$$
\frac{\partial J}{\partial w_5} =
\frac{\partial J}{\partial \hat{y}}
\cdot
\frac{\partial \hat{y}}{\partial u_3}
\cdot
\frac{\partial u_3}{\partial w_5}
$$
This multiplication naturally appears for every weight.
But rewriting long chain-rule expressions everywhere would be messy and repetitive.
So backpropagation introduces a shortcut.
Why We Use “Delta” (δ)
A delta is simply a collected chunk of chain-rule derivatives.
For example:
$\delta_3 = \frac{\partial J}{\partial u_3}$
This bundles together the repeated derivatives:
• how the loss changes with the prediction, and
• how the prediction changes with the pre-activation $u_3$
Using a delta avoids rewriting the chain rule every time.
Once we define:
$\delta_3 = \frac{\partial J}{\partial u_3}$
the gradient becomes much simpler:
$\frac{\partial J}{\partial w_5} = \delta_3 h_1$
:::success
Backpropagation = the chain rule, plus grouping repeated pieces into “delta” terms to make the math clean and reusable.
:::
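To see the bundling concretely, expand each factor of the chain rule for $w_5$ (using $J = \frac{1}{2}(\hat{y} - y)^2$, $\hat{y} = \sigma(u_3)$, and $u_3 = w_5 h_1 + w_6 h_2$):
$$
\frac{\partial J}{\partial \hat{y}} = \hat{y} - y,
\qquad
\frac{\partial \hat{y}}{\partial u_3} = \hat{y}(1 - \hat{y}),
\qquad
\frac{\partial u_3}{\partial w_5} = h_1
$$
$$
\frac{\partial J}{\partial w_5}
= \underbrace{(\hat{y} - y)\,\hat{y}(1 - \hat{y})}_{\delta_3}\; h_1
= \delta_3\, h_1
$$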
Step 1 — Output Delta
You begin by computing how the loss J changes with respect to the output neuron’s pre-activation $u_3$.
Because we are using MSE loss and a sigmoid output, the chain rule gives:
:::success
$$\delta_3 = \frac{\partial J}{\partial u_3}
= (\hat{y} - y)\,\hat{y}(1 - \hat{y})$$
:::
This value $\delta_3$ is the error signal at the output neuron —
it measures how much a small change in $u_3$ would change the final loss.
Step 2 — Gradients for Output Weights
:::info
Convert $\delta_3$ into gradients for the hidden → output weights.
:::
$$\frac{\partial J}{\partial w_5} = \delta_3\, h_1$$
$$\frac{\partial J}{\partial w_6} = \delta_3\, h_2$$
:::success
gradient = $\delta$ (delta) × activation feeding into that weight
:::
Step 3 — Hidden Deltas
To determine how much each hidden neuron contributed to the error, we backpropagate $\delta_3$ through the output weights and multiply by the sigmoid derivative.
Sigmoid derivative:
$$
\sigma'(u) = \sigma(u)(1 - \sigma(u))
$$
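This identity follows directly from the definition $\sigma(u) = \frac{1}{1 + e^{-u}}$:
$$
\sigma'(u)
= \frac{e^{-u}}{(1 + e^{-u})^2}
= \frac{1}{1 + e^{-u}} \cdot \frac{e^{-u}}{1 + e^{-u}}
= \sigma(u)\,(1 - \sigma(u))
$$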
Hidden neuron error signals:
:::warning
$$
\delta_1 = \delta_3 w_5 h_1(1 - h_1)
$$
$$
\delta_2 = \delta_3 w_6 h_2(1 - h_2)
$$
:::
$\delta_1$ and $\delta_2$ tell us how much each hidden neuron contributed to the output error.
Step 4 — Gradients for Input → Hidden Weights
Now convert each hidden neuron’s error signal into gradients for the input → hidden weights:
$$
\frac{\partial J}{\partial w_1} = \delta_1 i_1
\qquad
\frac{\partial J}{\partial w_3} = \delta_1 i_2
$$
$$
\frac{\partial J}{\partial w_2} = \delta_2 i_1
\qquad
\frac{\partial J}{\partial w_4} = \delta_2 i_2
$$
This completes the backward pass.
Now you have the gradient of the loss with respect to every weight in the network.
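Here is the backward pass as a minimal Python sketch, continuing the `forward` helper from the forward-pass section (the function name `backward` and the gradient dict are illustrative choices):
```python
def backward(i1, i2, y, w, h1, h2, y_hat):
    """Backward pass: gradient of J with respect to every weight."""
    # Step 1: error signal at the output neuron
    delta3 = (y_hat - y) * y_hat * (1.0 - y_hat)
    # Step 2: gradients for the hidden -> output weights
    grads = {'w5': delta3 * h1, 'w6': delta3 * h2}
    # Step 3: backpropagate delta3 through w5/w6, times the sigmoid slope
    delta1 = delta3 * w['w5'] * h1 * (1.0 - h1)
    delta2 = delta3 * w['w6'] * h2 * (1.0 - h2)
    # Step 4: gradients for the input -> hidden weights
    grads.update({'w1': delta1 * i1, 'w3': delta1 * i2,
                  'w2': delta2 * i1, 'w4': delta2 * i2})
    return grads
```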
Step 5 — Update the Weights
Backpropagation only computes gradients — it does not change any weights.
Weight updates are done by an optimizer such as gradient descent:
:::info
$$
w_{\text{new}} = w_{\text{old}} - \eta \frac{\partial J}{\partial w}
$$
:::
Where:
• $\eta$ is the learning rate
• $\frac{\partial J}{\partial w}$ is the gradient from backpropagation
:::success
If the gradient is positive → the weight decreases.
If the gradient is negative → the weight increases.
:::
By repeating the cycle:
forward pass → backward pass → weight update
the network gradually improves its predictions over time.
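Putting the pieces together, one full training iteration can be sketched like this (assuming the `forward`, `loss`, and `backward` helpers above; `eta` is the learning rate $\eta$):
```python
def train_step(i1, i2, y, w, eta=0.1):
    """One cycle: forward pass -> backward pass -> weight update."""
    u1, u2, h1, h2, u3, y_hat = forward(i1, i2, w)
    grads = backward(i1, i2, y, w, h1, h2, y_hat)
    for name, g in grads.items():
        w[name] -= eta * g   # w_new = w_old - eta * dJ/dw
    return loss(y_hat, y)
```
Calling `train_step` in a loop repeats the cycle; the returned loss should shrink over iterations.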
# Numerical Example
1. Compute Output Delta
Compute sigmoid derivative:
$$
\hat{y}(1 - \hat{y})
= 0.5369(1 - 0.5369)
= 0.5369 \cdot 0.4631
\approx 0.2484
$$
Now multiply:
$$
\delta_3
= (\hat{y} - y)\,\hat{y}(1 - \hat{y})
= (-0.4631)(0.2484)
\approx -0.1150$$
:::warning
$\delta_3 \approx -0.1150$
The negative sign tells us the prediction is too low compared to the true label $y = 1$.
:::
2. Gradients for Output Weights
Multiply $\delta_3$ by the hidden activations to get the gradients for the hidden → output weights:
$$
\frac{\partial J}{\partial w_5}
= \delta_3 h_1
\approx (-0.1150)(0.5498)
\approx -0.0632
$$
$$
\frac{\partial J}{\partial w_6}
= \delta_3 h_2
\approx (-0.1150)(0.7311)
\approx -0.0841
$$
:::success
Gradients for hidden → output weights:
• $\dfrac{\partial J}{\partial w_5} \approx -0.0632$
• $\dfrac{\partial J}{\partial w_6} \approx -0.0841$
:::
3. Hidden Deltas
Next, compute how much each hidden neuron contributed to the error.
First, the sigmoid slopes:
$$
h_1(1-h_1) \approx 0.5498(1 - 0.5498) \approx 0.2475
$$
$$
h_2(1-h_2) \approx 0.7311(1 - 0.7311) \approx 0.1966
$$
Then the hidden neuron error signals:
$$
\delta_1
= \delta_3 w_5 h_1(1-h_1)
= (-0.1150)(1.2)(0.2475)
\approx -0.0341
$$
$$
\delta_2
= \delta_3 w_6 h_2(1-h_2)
= (-0.1150)(-0.7)(0.1966)
\approx 0.0158
$$
:::info
Hidden deltas:
• $\delta_1 \approx -0.0341$ (hidden neuron $h_1$)
• $\delta_2 \approx 0.0158$ (hidden neuron $h_2$)
:::
4. Gradients for Input → Hidden Weights
Now convert hidden deltas into gradients for the first layer:
For the weights into $h_1$:
$$
\frac{\partial J}{\partial w_1}
= \delta_1 i_1
= (-0.0341)(1.0)
= -0.0341
$$
$$
\frac{\partial J}{\partial w_3}
= \delta_1 i_2
= (-0.0341)(0.5)
= -0.0170
$$
For the weights into $h_2$:
$$
\frac{\partial J}{\partial w_2}
= \delta_2 i_1
= (0.0158)(1.0)
= 0.0158
$$
$$
\frac{\partial J}{\partial w_4}
= \delta_2 i_2
= (0.0158)(0.5)
= 0.0079
$$
:::success
Gradients for input → hidden weights:
• $\dfrac{\partial J}{\partial w_1} = -0.0341$
• $\dfrac{\partial J}{\partial w_3} = -0.0170$
• $\dfrac{\partial J}{\partial w_2} = 0.0158$
• $\dfrac{\partial J}{\partial w_4} = 0.0079$
:::
5. Update the Weights (Gradient Descent)
Now that all gradients are known, we can update the weights.
The general update rule is:
$$
w_{\text{new}} = w_{\text{old}} - \eta \frac{\partial J}{\partial w}
$$
Let’s update a few weights as an example, using a learning rate $\eta = 0.1$.
For $w_5$:
$$
w_5^{\text{new}}
= 1.2 - 0.1(-0.0632)
= 1.2 + 0.00632
\approx 1.2063
$$
For $w_1$:
$$
w_1^{\text{new}}
= 0.3 - 0.1(-0.0341)
= 0.3 + 0.00341
\approx 0.3034
$$
:::warning
Because their gradients were negative:
• $w_5$ and $w_1$ both increased slightly.
• This adjustment should move the prediction closer to the true label $y = 1$ on the next forward pass.
:::
Over many repetitions of:
forward pass → backward pass → weight update
the network gradually reduces the loss and improves its predictions.
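The gradients above can be checked with the sketches from earlier. Small rounding differences in the last digit are expected, because the worked example rounds intermediate values:
```python
w = {'w1': 0.3, 'w2': 0.8, 'w3': -0.2, 'w4': 0.4, 'w5': 1.2, 'w6': -0.7}
u1, u2, h1, h2, u3, y_hat = forward(1.0, 0.5, w)
grads = backward(1.0, 0.5, 1.0, w, h1, h2, y_hat)
for name in ('w5', 'w6', 'w1', 'w3', 'w2', 'w4'):
    print(f"dJ/d{name} = {grads[name]:+.4f}")
# dJ/dw5 = -0.0633   dJ/dw6 = -0.0842   dJ/dw1 = -0.0342
# dJ/dw3 = -0.0171   dJ/dw2 = +0.0158   dJ/dw4 = +0.0079
```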
# Practical Example: Predicting Whether an Email Is Spam
To make backpropagation concrete, consider a neural network trained to classify emails as spam or not spam.
:::info
Each email is converted into numerical input values:
• $i_1 = 1$ if the subject contains “FREE”
• $i_2 =$ number of hyperlinks in the email
• $i_3 =$ percentage of capital letters
• $i_4 =$ length of the message
• (more features can be added)
These form the input vector fed into the network (see the sketch below).
:::
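As a sketch, converting one email into such a vector might look like the following. The function `email_features` and its exact rules are hypothetical, chosen only to mirror the feature list above:
```python
import re

def email_features(subject: str, body: str) -> list[float]:
    """Map one email to the numeric features described above (hypothetical)."""
    has_free = 1.0 if 'FREE' in subject else 0.0           # i1
    n_links = float(len(re.findall(r'https?://', body)))   # i2
    letters = [c for c in body if c.isalpha()]
    caps_pct = 100.0 * sum(c.isupper() for c in letters) / max(len(letters), 1)  # i3
    length = float(len(body))                               # i4
    return [has_free, n_links, caps_pct, length]
```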
:::success
During training the network learns meaningful weights:
• Does “FREE” strongly indicate spam? (increase $w_1$)
• Do many links increase the likelihood of spam? (increase $w_2$)
• Are capital letters important? (adjust $w_3$)
• How should these signals combine? (learn $w_5, w_6$, etc.)
The network discovers these relationships automatically.
:::
Training Cycle (Where Backpropagation Happens)
___
1. Forward Pass — Compute a Prediction
The network outputs a probability:
$$
\hat{y} = \text{probability the email is spam}
$$
:::info
Example:
• $\hat{y} = 0.87$ → “87% chance this email is spam.”
:::
___
2. Compute the Loss
The prediction is compared with the true label $y \in \{0,1\}$:
$$
J = \text{how wrong the prediction was}
$$
:::info
small error → small weight update
large error → large weight update
:::
___
3. Backpropagation — Assign Blame to Each Weight
Backprop figures out how each weight contributed to the error:
• If “FREE” should matter more → increase its weight ($w_1$ rises)
• If capital letters were over-weighted → decrease that weight
• If number of links helped correctly → strengthen that weight
:::info
All of this is computed using the chain rule.
:::
___
4. After Many Training Examples…
The network learns stable patterns:
• “FREE!!!” → very strong spam indicator
• Many links → moderately strong spam indicator
• High capitalization → weak indicator
• Short message + many URLs → suspicious
:::info
Over time, accuracy improves as gradient descent repeatedly adjusts weights.
:::
# Limitations of Backpropagation
Backpropagation is powerful, but not perfect.
Here are the major limitations:
:::warning
1. Requires Differentiable Activation Functions
Backprop can only be used when every layer is differentiable.
• Works with: sigmoid, tanh, ReLU
• Does not work with discrete or rule-based systems
:::
:::warning
2. Sensitive to Vanishing and Exploding Gradients
With deep networks:
• gradients may shrink to zero → training stalls
• or grow too large → weights diverge
This is why activations like ReLU and architectures like LSTMs were introduced.
:::
:::warning
3. Needs Labeled Data
Supervised backprop requires many $(x, y)$ pairs.
If data is scarce or expensive to label, performance suffers.
:::
:::warning
4. Local Minima and Saddle Points
Gradient descent follows the slope, which means:
• it may get stuck in poor local minima
• it may pause in flat saddle-point regions
:::
:::warning
5. Computationally Expensive
Backprop requires:
• storing activations for every layer
• computing gradients for every weight
• iterating over the dataset many times
Larger models → more compute, more memory.
:::