# What is this thing called backpropagation?
## A student's companion to Skansi 2018, pp. 93-97
### author: Sebastian Klaßmann, sklassm1@uni-koeln.de
### date: August $22^{nd}$, 2021
### version: 0.4
---
### 1.0 Needed derivation rules (Skansi 2018:23–25)
**CHAINRULE**:
$\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}$, for some $u$
**LD**:
$[f(x)a + g(x)b]' = a \cdot f'(x) + b \cdot g'(x)$
Or, equivalently, split into two rules:
**LDa** $[a \cdot f(x)]' = a \cdot f'(x)$
**LDb** $[f(x) + g(x)]' = f'(x) + g'(x)$
**Exp**:
$[f(x)^n]' = n \cdot f(x)^{n-1} \cdot f'(x)$
**DerDifVar**:
$\frac{d}{dz}z = 1$
**Const**:
$c' = 0$
**REC**:
$(\frac{1}{f(x)})' = -(\frac{f'(x)}{f(x)^2})$
**ChainExp**:
$(e^{f(x)})' = e^{f(x)} \cdot f'(x)$
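These are standard rules of differentiation. As a quick sanity check (my addition, not part of Skansi's text), REC and ChainExp can be confirmed symbolically with SymPy for an arbitrary example choice of $f$:

```python
# Symbolic sanity check of REC and ChainExp using SymPy.
# f(z) = 1 + z**2 is an arbitrary example choice.
import sympy as sp

z = sp.symbols('z')
f = 1 + z**2

# REC: (1/f)' should equal -f'/f**2
print(sp.simplify(sp.diff(1 / f, z) - (-sp.diff(f, z) / f**2)))          # 0

# ChainExp: (e**f)' should equal e**f * f'
print(sp.simplify(sp.diff(sp.exp(f), z) - sp.exp(f) * sp.diff(f, z)))    # 0
```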
---
**Disclaimer:** Skansi's account of backpropagation relies on the SSE error function and the sigmoid activation function. While the concepts explained below also carry over to other network architectures, the equations in this specific form hold only for these hyperparameters.
### 2.0 Deriving the delta rule from SSE
As groundwork for backpropagation, Skansi 2018 (pp. 90-93) provides, among other things, two derivatives that we will need below.
Note that Skansi works through these derivatives in order to demonstrate how they relate to the generalized delta rule for updating the weights of the linear and the logistic neuron. For our purposes, we will focus on the derivatives themselves and use them to explore backpropagation for the simple network described in section 3.
More specifically, Skansi in this section provides the derivative of the output of a logistic neuron with respect to its logit (1) and the derivative of the error function (SSE) with respect to the output *y* of the output layer for a single target *t* (2).
$$(1): \frac{dy}{dz} = \frac{d}{dz} [\frac{1}{1+e^{-z}}]= y(1-y)$$
$$(2): \frac{dE}{dy} = \frac{d}{dy}[\frac{1}{2}(t-y)^2] = -1 \cdot (t-y)$$
Starting with (1):
We know that in the case of a logistic neuron, we are using the activation function $y(z) = \frac{1}{1 + e^{-z}}$.
Let us derive $\frac{dy}{dz}$ from the definition of $y$:
$$
\frac{d}{dz} [\frac{1}{1+e^{-z}}]
$$
Apply the reciprocal rule (REC) with $f(z) = 1 + e^{-z}$:
$$
= -\frac{\frac{d}{dz}[1 + e^{-z}]} {(1 + e^{-z})^2}
$$
*Please note that the exponent $^2$ is missing in Skansi's explanation!*
LD (Linear Differentiation) applied to the numerator:
$$
= -\frac{ \frac{d}{dz} [1] + \frac{d}{dz} [e^{-z}] }{(1+e^{-z})^2}
$$
Apply the constant rule (Const) to the first summand of the numerator, making it 0:
$$
= -\frac{ 0 + \frac{d}{dz} [e^{-z}] }{(1+e^{-z})^2}
$$
Apply ChainExp to the second summand ($(e^{f(x)})' = e^{f(x)} \cdot f'(x)$):
$$
= -\frac{ e^{-z} \cdot \frac{d}{dz}[-1 \cdot z] } {(1+e^{-z})^2}
$$
Apply LDa to pull the constant factor $-1$ out of the derivative:
$$
= -\frac{ e^{-z} \cdot (-1) \cdot \frac{d}{dz}[z] } {(1+e^{-z})^2}
$$
Differentiate the differentiation variable $z$ (DerDifVar):
$$
= -\frac{ e^{-z} \cdot (-1) \cdot 1 } {(1+e^{-z})^2}
$$
Rearrange:
$$
= -\frac{-e^{-z}}{(1+e^{-z})^2}
$$
Tidy up signs:
$$
= \frac{ e^{-z} } {(1+e^{-z})^2}
$$
Therefore:
$$
\frac{d}{dz} (\frac{1}{1+e^{-z}}) = \frac{ e^{-z} } {(1+e^{-z})^2}
$$
Factor into two factors $A$ and $B$:
$$
= \frac{1}{(1+e^{-z})} \cdot \frac{e^{-z}}{(1+e^{-z})}
$$
By definition, $A = \frac{1}{(1+e^{-z})} = y(z)$. Let us focus on $B = \frac{e^{-z}}{(1+e^{-z})}$ and start by adding and subtracting 1 in the numerator, setting brackets as needed:
$$
\frac{e^{-z}}{(1+e^{-z})} = \frac{(1 + e^{-z})-1}{(1+e^{-z})}
$$
Split the fraction into two terms with the same denominator:
$$= \frac{(1 + e^{-z})}{(1 + e^{-z})} - \frac{1}{(1 + e^{-z})}$$
Simplify and apply definition of $y(z) = \frac{1}{(1+e^{-z})}$ to the second fraction:
$$B = 1 - y$$
Multiplying $A \cdot B$ yields:
$$ y(1-y)$$
Result:
$$\frac{dy}{dz} = \frac{d}{dz} [\frac{1}{1+e^{-z}}]= y(1-y)$$
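As a quick numerical cross-check (my addition, not from Skansi), a central finite difference of the sigmoid should agree with $y(1-y)$ at any sample point $z$:

```python
# Finite-difference check that d/dz sigmoid(z) equals y * (1 - y).
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

z = 0.7      # arbitrary sample point
h = 1e-6     # step size for the central difference
y = sigmoid(z)

numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
analytic = y * (1 - y)
print(numeric, analytic)   # both roughly 0.2217
```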
___
Let us have a look at (2). For simplicity, we only want to consider one training case, i.e. one target value $t$ and one output $y$, thereby neglecting the $\sum$ inherent in the definition of the SSE:
$$\frac{dE}{dy} = \frac{d}{dy}[\frac{1}{2}(t-y)^2]$$
Apply Exponent rule:
$$
= \frac{1}{2} \cdot 2 \cdot (t-y) \cdot \frac{d}{dy}[t-y]
$$
Cancellation of factors:
$$
= (t-y) \cdot \frac{d}{dy}[t-y]
$$
Linear Diff:
$$
= (t-y) \cdot (\frac{d}{dy}[t] - \frac{d}{dy}[y])
$$
*Please note that there is a typo in Skansi 2018 here.*
Since $t$ is a constant, its derivative is 0 (Const), and since $y$ is the differentiation variable, its derivative is 1 (DerDifVar):
$$ = (t-y) \cdot (0 - 1)$$
By tidying up the expression we get:
$$ = (-1)\cdot(t-y)$$
and finally:
$$(2): \frac{dE}{dy} = \frac{d}{dy}[\frac{1}{2}(t-y)^2]= -1 \cdot (t-y)$$
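The same finite-difference idea (again my addition, purely as a sanity check) confirms $\frac{dE}{dy} = -(t-y)$ for a single training case:

```python
# Finite-difference check that d/dy [0.5 * (t - y)**2] equals -(t - y).
t, y, h = 1.0, 0.6, 1e-6   # arbitrary target, output and step size

E = lambda y_: 0.5 * (t - y_) ** 2
numeric = (E(y + h) - E(y - h)) / (2 * h)
analytic = -(t - y)
print(numeric, analytic)   # both roughly -0.4
```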
---
### 3.0 Backpropagation
**Skansi, p. 94: Mathematically speaking, backpropagation is: $$w_{updated} = w_{old} - \eta \nabla E$$**
Note that in order for backpropagation to work, we have to start with a forward pass for a given set of training samples. In the simple case of processing one input data point, this forward pass means passing an input vector through the network, calculating weighted inputs and activations along the way and storing all of these values along with an error pertaining to the desired output of the network given the input.
What backpropagation is trying to find is essentially the influence that individual parameters (i.e., weights and biases) have on the network's error. Mathematically speaking and focusing on the weights as examples, we want to determine the rate of change of the error function with respect to the individual weights in our network, i.e. for a given weight $w_i$, we want to find the partial derivative of $E$ with respect to that single weight: $\frac{\partial E}{\partial w_i}$.
A second intuition that we need to understand in order to grasp backpropagation is the fact that a forward pass through a given network consisting of three layers (input, hidden, output) can be considered as a nested function application to the input x, resulting in the network's output y: $$ y_o = f_o(f_h(f_i(x))) $$
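To make this nesting concrete, here is a minimal sketch (my own illustration, not Skansi's code) with one logistic neuron per layer and arbitrary placeholder weights:

```python
# The forward pass as nested function application: y_o = f_o(f_h(f_i(x))).
# All weights and the input value are arbitrary placeholders.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def f_i(x):                       # input layer: passes the input through
    return x

def f_h(x, w_h=0.5):              # hidden layer: a single logistic neuron
    return sigmoid(w_h * x)

def f_o(y_h, w_o=-0.3):           # output layer: a single logistic neuron
    return sigmoid(w_o * y_h)

x = 0.8
y_o = f_o(f_h(f_i(x)))            # the nested application from the text
print(y_o)
```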
Turning back to Skansi, p. 96, let us first consider the case of a single hidden, fully connected layer and propagate the error backwards until we arrive at the input layer. For this, we will be using the derivatives from section 2.0 wherever applicable.
#### 3.1 From Output to Hidden layer
Remember that for this example, in line with Skansi 2018, we will be using the SSE error function (sum of squared errors). Please note that Skansi 2018 is inconsistent in naming the error function in chapter 4; he is in fact referring to the SSE throughout the chapter.
$$E = \frac{1}{2} \sum_{o \in Output}{(t_o - y_o)^2}$$
*Error function: sum of squared errors (SSE)*
Our first partial goal is to determine how much of a role a single weight between the hidden and the output layer plays in our overall error, i.e. we need to calculate $\frac{\partial E}{\partial w_{oh}}$.
E is based on the squared difference between target values and observed outputs given a training set. To make things easier, we will only be considering a single training case / input again.
A first step is to determine the partial derivative $\frac{\partial E}{\partial y_o}$ for a single output.
$$\frac{\partial E}{\partial y_o} = \frac{\partial}{\partial y}[\frac{1}{2}(t_o - y_o)^2]= -1 \cdot (t_o - y_o)$$
*as shown in section 2.0.*
Knowing that we can also view the value $y_o$ as the result of a function applied to $z_o$ (the logit, i.e. the weighted input to the output layer), a logical next step is to differentiate our error function with respect to this weighted input, making use of the CHAINRULE and treating $y_o$ as the result of applying the inner function $u(z_o)$.
$$ \frac{\partial E}{\partial z_o} = \frac{\partial E}{\partial y_o} \cdot \frac{\partial y_o}{\partial z_o} = \frac{\partial y_o}{\partial z_o} \cdot \frac{\partial E}{\partial y_o}$$
From section 2.0, we already know the first of these factors, even though we are interested in a partial derivative here: $\frac{dy}{dz} = \frac{d}{dz} [\frac{1}{1+e^{-z}}]= y(1-y)$. Therefore, we can substitute this for the first factor and get:
$$ = y_o(1-y_o) \cdot \frac{\partial E}{\partial y_o}$$
Next, we need to propagate the error back to the outputs of our hidden layer. We know that the value of the weighted input $z_o$ is given by a function application to the activations $y_h$ in the hidden layer. Therefore, we again need to apply the chain rule.
For this, we need to keep in mind that for any two successive layers $i$ and $j$ that are connected via a weighted edge, $\frac{\partial z_j}{\partial y_i} = w_{ji}$. This is because in a partial derivative, all other outputs from the previous layer are treated as constants, so the remaining summands of the weighted input vanish due to the application of *Const* ($c' = 0$).
$$\frac{\partial E}{\partial y_h} = \sum_o(\frac{\partial z_o}{\partial y_h} \cdot \frac{\partial E}{\partial z_o}) = \sum_o (w_{oh} \cdot \frac{\partial E}{\partial z_o})$$
**To recapitulate,** we have now successfully expressed the relationship between the value of $E$ and a given activation (output) of the hidden layer. We know that any given logit $z_o = \sum_h (y_h \cdot w_{oh})$. If we want to look one step further backwards and determine the role of any of the weights between the hidden and output layer in generating our error, we can treat all other weights as constants and thus easily see that $\frac{\partial z_o}{\partial w_{oh}} = y_{h}$. Therefore, once more applying the chain rule:
$$\frac{\partial E}{\partial w_{oh}} = \frac{\partial z_o}{\partial w_{oh}} \cdot \frac{\partial E}{\partial z_o} = y_{h} \cdot \frac{\partial E}{\partial z_o} $$
According to Skansi 2018, this gives us the full backbone of backpropagation, as we can repeat the above steps to obtain any $\frac{\partial E}{\partial w}$ throughout our network and, through bias absorption, express the gradient of our error function with respect to any trainable network parameter.
For our purposes, the gradient of the error function can be thought of as a vector containing the partial derivatives of the error function with respect to all the weights in our network. Using vector notation, this clarifies the mathematical backbone of backpropagation, the generalized weight-update rule:
$$w_{updated} = w_{old} + \eta (-\nabla E) = w_{old} - \eta \nabla E$$
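Putting section 3.1 together, a minimal sketch of one backpropagation step for a single training case might look as follows. This is my own illustration, not Skansi's code: the layer sizes, the initial weights and the learning rate $\eta$ are arbitrary placeholders, biases are omitted (or assumed to be absorbed into the weights), and the activations and error follow the sigmoid/SSE setup used throughout this text.

```python
# One backpropagation step for a single training case:
# sigmoid activations, SSE error, one fully connected hidden layer.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = np.array([0.23, 0.82])        # input vector (placeholder values)
t = np.array([1.0])               # target output (placeholder value)
W_hi = rng.normal(size=(2, 2))    # weights input -> hidden (placeholders)
W_oh = rng.normal(size=(1, 2))    # weights hidden -> output (placeholders)
eta = 0.1                         # learning rate (placeholder)

# Forward pass: store logits and activations for the backward pass.
z_h = W_hi @ x
y_h = sigmoid(z_h)
z_o = W_oh @ y_h
y_o = sigmoid(z_o)
E = 0.5 * np.sum((t - y_o) ** 2)

# Backward pass, using the quantities derived above.
dE_dy_o = -(t - y_o)                    # derivative (2)
dE_dz_o = y_o * (1 - y_o) * dE_dy_o     # derivative (1) + chain rule
dE_dW_oh = np.outer(dE_dz_o, y_h)       # dE/dw_oh = y_h * dE/dz_o
dE_dy_h = W_oh.T @ dE_dz_o              # sum over outputs of w_oh * dE/dz_o
dE_dz_h = y_h * (1 - y_h) * dE_dy_h
dE_dW_hi = np.outer(dE_dz_h, x)

# Weight update: w_updated = w_old - eta * grad E
W_oh -= eta * dE_dW_oh
W_hi -= eta * dE_dW_hi
print(E)
```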
#### 3.2 A sample forward pass and backpropagation (Skansi 2018, pp. 98-102)
To see the above derivations in action, let us consider the following example taken from Skansi. Note that the simplicity of the network structure depicted below allows us to ignore the $\sum$ inherent in our formulation of $\frac{\partial E}{\partial y_h}$ as given above.

**The network used in this example has inputs $x_A$ and $x_B$, hidden neurons C and D, and a single output neuron F; the weights are $w_1$ ($x_A \to C$), $w_2$ ($x_A \to D$), $w_3$ ($x_B \to C$), $w_4$ ($x_B \to D$), $w_5$ ($C \to F$) and $w_6$ ($D \to F$). Please note that the neuron F has been renamed relative to Skansi's figure, in accordance with the textual information in Skansi 2018, pp. 98-102.**
If we want to determine $\frac{\partial E}{\partial w_3}$, we can do so by means of repeatedly applying the chain rule:
$$\frac{\partial E}{\partial w_3} = \frac{\partial E}{\partial y_C} \cdot \frac{\partial y_C}{\partial z_C} \cdot \frac{\partial z_C}{\partial w_3} $$
Looking at these three derivation terms, we can observe that:
1.
$$\frac{\partial E}{\partial y_C} = \frac{\partial z_F}{\partial y_C} \cdot \frac{\partial E}{\partial z_F} = w_5 \cdot \frac{\partial y_F}{\partial z_F} \cdot \frac{\partial E}{\partial y_F} = w_5 \cdot (y_F \cdot (1 - y_F)) \cdot (-(t - y_F))$$
2.
$$ \frac{\partial y_C}{\partial z_C} = y_C \cdot (1 - y_C)$$
3.
$$
\frac{\partial z_C}{\partial w_3} = x_B
$$
as discussed in section 3.1, with the notable difference that we have now arrived at the input $x_B$.
We can now substitute these observations into the partial derivative from above:
$$\frac{\partial E}{\partial w_3} = \frac{\partial E}{\partial y_C} \cdot \frac{\partial y_C}{\partial z_C} \cdot \frac{\partial z_C}{\partial w_3} = (w_5 \cdot (y_F \cdot (1 - y_F)) \cdot (-(t - y_F))) \cdot (y_C \cdot (1 - y_C)) \cdot x_B $$
This final form relies only on quantities that are known after the forward pass and shows how the partial derivative can be calculated using the information stored along the way.
It is now easy to see how the other partial derivatives of $E$ with respect to the remaining weights can be calculated by following the respective paths backwards through the network:
$$
\frac{\partial E}{\partial w_1} = \frac{\partial E}{\partial y_C} \cdot \frac{\partial y_C}{\partial z_C} \cdot \frac{\partial z_C}{\partial w_1} = (w_5 \cdot (y_F \cdot (1 - y_F)) \cdot (-(t - y_F))) \cdot (y_C \cdot (1 - y_C)) \cdot x_A
$$
$$
\frac{\partial E}{\partial w_2} = \frac{\partial E}{\partial y_D} \cdot \frac{\partial y_D}{\partial z_D} \cdot \frac{\partial z_D}{\partial w_2} = (w_6 \cdot (y_F \cdot (1 - y_F)) \cdot (-(t - y_F))) \cdot (y_D \cdot (1 - y_D)) \cdot x_A
$$
$$
\frac{\partial E}{\partial w_4} = \frac{\partial E}{\partial y_D} \cdot \frac{\partial y_D}{\partial z_D} \cdot \frac{\partial z_D}{\partial w_4} = (w_6 \cdot (y_F \cdot (1 - y_F)) \cdot (-(t - y_F))) \cdot (y_D \cdot (1 - y_D)) \cdot x_B
$$
$$
\frac{\partial E}{\partial w_5} = \frac{\partial E}{\partial y_F} \cdot \frac{\partial y_F}{\partial z_F} \cdot \frac{\partial z_F}{\partial w_5} = -(t-y_F) \cdot (y_F \cdot (1-y_F)) \cdot y_C
$$
$$
\frac{\partial E}{\partial w_6} = \frac{\partial E}{\partial y_F} \cdot \frac{\partial y_F}{\partial z_F} \cdot \frac{\partial z_F}{\partial w_6} = -(t-y_F) \cdot (y_F \cdot (1-y_F)) \cdot y_D
$$
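To see that these six expressions really do only require values stored during the forward pass, the sketch below (my addition) evaluates them for placeholder weights and inputs and checks one of them against a finite-difference estimate. The concrete numbers used in Skansi 2018, pp. 98-102, are not reproduced here, and biases are omitted for brevity.

```python
# The six partial derivatives of the example network, evaluated with
# placeholder weights and inputs; dE/dw3 is checked via finite differences.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Placeholder parameters, input and target (not Skansi's numbers).
x_A, x_B = 0.5, 0.2
w1, w2, w3, w4, w5, w6 = 0.1, 0.4, -0.2, 0.3, 0.5, -0.6
t = 1.0

# Forward pass (values stored for backpropagation).
z_C = w1 * x_A + w3 * x_B; y_C = sigmoid(z_C)
z_D = w2 * x_A + w4 * x_B; y_D = sigmoid(z_D)
z_F = w5 * y_C + w6 * y_D; y_F = sigmoid(z_F)

# Common factor dE/dz_F = (y_F * (1 - y_F)) * (-(t - y_F)).
dE_dz_F = y_F * (1 - y_F) * (-(t - y_F))

dE_dw1 = w5 * dE_dz_F * y_C * (1 - y_C) * x_A
dE_dw2 = w6 * dE_dz_F * y_D * (1 - y_D) * x_A
dE_dw3 = w5 * dE_dz_F * y_C * (1 - y_C) * x_B
dE_dw4 = w6 * dE_dz_F * y_D * (1 - y_D) * x_B
dE_dw5 = dE_dz_F * y_C
dE_dw6 = dE_dz_F * y_D

# Finite-difference check for dE/dw3.
def E_of_w3(w3_):
    y_C_ = sigmoid(w1 * x_A + w3_ * x_B)
    y_F_ = sigmoid(w5 * y_C_ + w6 * y_D)
    return 0.5 * (t - y_F_) ** 2

h = 1e-6
print(dE_dw3, (E_of_w3(w3 + h) - E_of_w3(w3 - h)) / (2 * h))
```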
____
### Exercise:
Considering the network structure described above, please calculate a forward pass and the error (SSE) for the input $[x_A = 0.23, x_B=0.82]$.
Using the derivations described above, please calculate the partial error derivatives for the target output $t=1$. Update the weights accordingly and compare the results of a new forward pass to the first one.
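For reference, here is a sketch of the whole exercise procedure. Since neither the initial weights nor the learning rate are reproduced in this companion, the values below are placeholders only; substitute Skansi's numbers to reproduce the results from the book.

```python
# Exercise sketch: forward pass, SSE, gradients, weight update, new forward
# pass. Initial weights and learning rate are placeholders.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w, x_A, x_B):
    w1, w2, w3, w4, w5, w6 = w
    y_C = sigmoid(w1 * x_A + w3 * x_B)
    y_D = sigmoid(w2 * x_A + w4 * x_B)
    y_F = sigmoid(w5 * y_C + w6 * y_D)
    return y_C, y_D, y_F

def gradients(w, x_A, x_B, t):
    w1, w2, w3, w4, w5, w6 = w
    y_C, y_D, y_F = forward(w, x_A, x_B)
    dE_dz_F = y_F * (1 - y_F) * (-(t - y_F))
    return [w5 * dE_dz_F * y_C * (1 - y_C) * x_A,   # dE/dw1
            w6 * dE_dz_F * y_D * (1 - y_D) * x_A,   # dE/dw2
            w5 * dE_dz_F * y_C * (1 - y_C) * x_B,   # dE/dw3
            w6 * dE_dz_F * y_D * (1 - y_D) * x_B,   # dE/dw4
            dE_dz_F * y_C,                          # dE/dw5
            dE_dz_F * y_D]                          # dE/dw6

x_A, x_B, t, eta = 0.23, 0.82, 1.0, 0.1             # eta is a placeholder
w = [0.1, 0.4, -0.2, 0.3, 0.5, -0.6]                # placeholder weights

_, _, y_F = forward(w, x_A, x_B)
print("SSE before update:", 0.5 * (t - y_F) ** 2)

w = [wi - eta * gi for wi, gi in zip(w, gradients(w, x_A, x_B, t))]

_, _, y_F = forward(w, x_A, x_B)
print("SSE after update: ", 0.5 * (t - y_F) ** 2)
```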