---
title: A bite of Math behind AI
tags: AI
description: A bite on Mathematics behind AI
type: pdf
slideOptions:
theme: white
---
###### tags: `AI` `Applied Mathematics`
### **1. Predict from scratch, who cares about mistakes!**
We will start with a simple prediction task:
> **Example forever**
>
> You are going to predict the price of a house from a selection of the following factors:
> - Living area (in $m^2$)
> - Number of bedrooms
> - Number of garages
> - Number of bathrooms
> - Neighbourhood average income
> - Some other factors

The factors above are just a simplified selection of all the possible factors people weigh when considering or predicting house prices. Different people might make different choices. Some might use only one factor, such as whether the house is "good-looking" or not, whereas others might have a very complicated model that considers many factors. **But for now**, we will just use two factors that I believe are correct for predicting house prices, and you have no say in this, just take it. The two factors are `"Living area"` and `"Number of bedrooms"`, let's go!
> **1. My stupid prediction, initially**
> I am extremely stupid and have no idea what to do with this prediction, so I just grab whatever numbers are in my head. I believe that the price per unit of living area should be $\$100\ per\ m^2$, and the price per bedroom should be $\$5000\ per\ bedroom$. So I turn my prediction into an algebraic equation: $$Price = area \times 100\ + NumberOfBedrooms \times 5000$$

So a 3-bedroom house with $120\ m^2$ of living area would cost $120 \times 100 + 3 \times 5000 = 27000$, based on my naive model.
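The naive model above is easy to write as a few lines of code (a minimal sketch; the function name `predict` is mine, and the hard-coded numbers are just my stupid guesses from the text):

```python
def predict(living_area, num_bedrooms):
    """Naive house-price guess: $100 per m^2 plus $5000 per bedroom."""
    return living_area * 100 + num_bedrooms * 5000

# A 3-bedroom house with 120 m^2 of living area:
print(predict(120, 3))  # 27000
```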
Of course, my model is far from accurate. But how can I **learn** to make it more accurate? I will learn by collecting information on sold houses, like the table below:
| # Bedrooms | Living area | Sold Price |
| -----------| ----------- | ---------- |
| 4 | 170 | 1,200,000 |
| 3 | 130 | 980,000 |
| 5 | 230 | 1,600,000 |
| ...... | ...... | ...... |
<br/>
Then, I will formally "define" what I have done so far. The two factors I chose, I will call **variables**. The numbers from my stupid guess, namely $5000$ and $100$, I will call the **weights** of each variable. So, for example, 5000 is the weight of the number of bedrooms, and 100 is the weight of the living area. Thus my model can be defined as $$a \times weight_a + b \times weight_b = prediction$$ in which $a$ is the living area and $b$ is the number of bedrooms.
<br/>
One more thing I will do is to add my initial predictions to the table above. Let's see how stupid I was!
| # Bedrooms | Living area | Sold Price | My prediction |
| ---------- | ----------- | ---------- | --- |
| 4 | 170 | 1,200,000 |37000|
| 3 | 130 | 980,000 |28000|
| 5 | 230 | 1,600,000 |48000|
| ...... | ...... | ...... | |
<br/>
### **2. Loss Functions: Evaluate how bad my model is**
<br/>
Let's summarise what I have done so far. I created a function, to which I pass two sets of values:
- The first set of values is the value pair (Number of bedrooms, Living area). Let's call this pair a *sample*, and write it in the form $\Bbb{x} := (x_1, x_2)$, where $x_1$ is the number of bedrooms and $x_2$ is the living area.
- The second set of values is the value pair (Weight1, Weight2), which is the weight for number of bedrooms and living area respectively. Let's call them just weights.
So my function is defined as follows: every time I give it two pairs, one sample pair and one weight pair, it gives me the predicted house price. Written in mathematical form: $$f(\Bbb{x}, \Bbb{w}) := x_1\times w_1 + x_2\times w_2$$
Now I have collected 100 data points of houses, and I have made my naive predictions by passing the sample pairs to my model 100 times (my naive **weights** don't change at this point), and I got 100 **wrong** predictions. Before I can improve, I need to know how badly I have just done in general. This can be done easily by *calculating the average difference between the sold price and my prediction*: the larger the average difference (let's call it the **error** from now on), the worse my model.
But wait a minute: if I have two predictions, one 10000 *above* the sold price and another 10000 *below* the sold price, then the average would be zero, despite such bad performance! We can handle this by squaring the errors, so $\frac{10000^2 + (-10000)^2}{2} = 100M$, which gives us reasonable feedback.
Now, we can define this "averaging of the squared errors" as a function as well! Let's call it the **loss function**, denoted by $L(f(\Bbb{x}, \Bbb{w}),\Bbb{y})$, where $\Bbb{y}$ is the sold prices. A warm reminder for some readers: you can read $L(f(\Bbb{x}, \Bbb{w}),\Bbb{y})$ as $L(prediction, y)$ for simplicity, since $f(\Bbb{x}, \Bbb{w}) = prediction$.
Our purpose in learning, or model training, is to make our loss smaller by updating the weights; in mathematical notation: $$\arg\min_{\Bbb{w}\in \Bbb{R}^d}\ \ \ \frac{1}{l}\sum_{i=1}^{l}L(f(\Bbb{x_i}, \Bbb{w}),y_i)$$
where $\Bbb{R}^d$ is the set of all possible weight vectors, and $d$ is the dimension of the weights. In our example we call them pairs, meaning $d = 2$; of course the weights will become higher-dimensional as we consider more and more factors! $l$ is the number of samples (the number of rows in the table above), like the 100 in our example.
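The "averaging of the squared errors" is short enough to sketch directly (the helper name `mse` is mine; it assumes predictions and sold prices arrive as two equal-length lists):

```python
def mse(predictions, sold_prices):
    """Mean of the squared differences between predictions and sold prices."""
    errors = [p - y for p, y in zip(predictions, sold_prices)]
    return sum(e * e for e in errors) / len(errors)

# The two-prediction example from above: errors of +10000 and -10000.
# A plain average of the errors would cancel to 0; squaring keeps them.
print(mse([110000, 90000], [100000, 100000]))  # 100000000.0
```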
<br/>
<br/>
### **3. A good thing about “Averaging the squaring of the errors”**
<br/>
Now let's list out what we have on hand:
- Samples of houses with their factors as pairs, i.e. (Number of bedrooms, Living area)
- Corresponding samples of sold-price
- A pair of **NAIVE** weights correspond to the two factors
- A list of predicted price based on factors and our **NAIVE** weights
- A function that computes the **LOSS** (the differences between the real sold prices and the predicted prices), then averages over the list of samples

We hope to learn something from the **LOSS** and thus update our weights to a smarter pair which gives us predictions closer to the real prices. How can we achieve this?
A better question could be:
> does this **LOSS** function have a minimum that we can reach?
The answer is positive for our lucky choice, "averaging the squares of the errors". Let's check how lucky we are. From now on, we will denote $f(\Bbb{x}, \Bbb{w}) := \hat{y}$, called "y-hat"; then the loss function can be written as $L(f(\Bbb{x_i}, \Bbb{w}),\Bbb{y})= L(\hat{y},y) :=\frac{\sum_{i=1}^{l}(\hat{y}_i-y_i)^2}{l}$. As there is only one kind of term in the expression, we can replace $\hat{y}_i - y_i$ with $t_i$, so we have $L(t) := \frac{\sum_{i=1}^{l}t_i^2}{l}$, which is more familiar to us. It looks like $f(t) = at^2$ for some positive constant $a$, right? Yes, a quadratic function! Let's graph it:
<img src="https://i.ibb.co/KmzvqCz/Quodratic1.png" width="300" height="200">
<br/>
<br/>
One word we can use to summarise the graph above is "**convex**". Mathematically, the convexity of a function $f: \Bbb{R}^d \mapsto \Bbb{R}$ can be expressed as: $$f(tx + (1-t)y)\le tf(x) + (1-t)f(y);\ \forall x,y \in \Bbb{R}^d,\ t\in [0,1]$$
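We can spot-check this inequality numerically for the quadratic shape $f(t) = t^2$ of our squared-error terms (a quick sketch, not a proof; the sampling range and random seed are arbitrary choices):

```python
import random

def f(t):
    """The quadratic shape of our squared-error terms."""
    return t * t

# Spot-check f(t*x + (1-t)*y) <= t*f(x) + (1-t)*f(y) on random points.
random.seed(0)
for _ in range(1000):
    x, y = random.uniform(-100, 100), random.uniform(-100, 100)
    t = random.uniform(0, 1)
    assert f(t * x + (1 - t) * y) <= t * f(x) + (1 - t) * f(y) + 1e-6
```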
But we need a friendly translation here! Let's do it in both words and pictures.
- In words: pick any two input points and connect the function's outputs at those points with a straight line; the curve of the function between those two points always lies on or below this line.
- From Graph:
<img src="https://i.ibb.co/D5LzPRT/Quodratic2.png" width="200" height="200">
On the graph above, I randomly picked two points on the x-axis to generate two points on the curve, called A and B, and connected them with a straight line. The function curve between the two x-axis points lies below, or exactly on, the line between A and B. This is the idea of the convexity of a function. It helps us answer the first question above: yes, the **LOSS** function does have a minimum (more specifically, due to convexity, a global minimum) for us to approach! Now, our second question:
> How do we approach there?
In general, we need two pieces of information to achieve this: the direction of the move, and the magnitude of the move. Following the graph above, we can demonstrate the idea through another graph:
<img src="https://i.ibb.co/Jkkg1zz/Quodratic3.png" width="300" height="200">
<br/>
Let's say that the LOSS of our naive model is at D. We want the loss to go down to the bottom of the curve; one way is to move in the direction shown by the black arrow. This can be done by letting the target variable (i.e. the horizontal one, the input of the function) *move in the negative direction of the gradient of the **LOSS** function*, with some magnitude of movement attached (controlled by a term called the "learning rate"). After this, the new input of the **LOSS** function becomes $A^{'}$, and the new output of the **LOSS** function becomes E, which is lower than D, as required.
<br/>
Now we know that:
- there is a minimum of our LOSS function to approach
- there is a way we can approach that minimum

But before we move on to implementing our method, there are two more questions to consider:
- How can I be guaranteed that, using the method provided here, I will eventually reach the minimum in a finite number of steps?
- What is the "learning rate"?

We will answer the two questions together in the next section.
<br/>
### **4. Another good thing about “Averaging the squaring of the errors”**
Like the previous section, let's summarise what we have so far:
- Samples of factors, e.g. (Number of bedrooms, living area)
- samples of sold-price
- A pair of **NAIVE** weights on two factors
- Predicted price based on factors and our **NAIVE** weights
- A **Loss** function: convex
- A way of decreasing the loss by updating the weights: move in the negative direction of the gradient of the loss function

However, even though we know there is a minimum, and we know a way to approach it, **it is not guaranteed that we can actually reach it in a finite number of steps**. Let's see a classic counterexample:
<img src="https://i.ibb.co/pfJB1Q5/Quodratic4.png" width="300" height="200">
From the graph above, the function $f(x):= |x|$ is convex. If we use the method we discussed, after a certain number of steps the loss will get stuck at a certain level without decreasing any further. So, for the method of gradient descent, we need additional features from our **loss** function: differentiability and smoothness (<ins>**NOTE**</ins>: in practice, there are ways to deal with non-differentiable loss functions, such as subgradients, but that is beyond the scope of this series).
Luckily, our choice of “averaging the squared errors” satisfies both requirements! It is differentiable, and it is smooth; more precisely, **Lipschitz smooth**. This means its gradient never changes abruptly the way the gradient of the absolute-value function above does. Lipschitz smoothness guarantees that the gradient of the function changes smoothly. More importantly, it guarantees that, if we choose a correct learning rate, we will decrease our loss!
Why were we so lucky as to choose a function that fits every requirement here? Because the function we chose is a famous, classic one called the **Mean Squared Error (MSE)**.
<br/>
Now let's look in detail at how to compute the gradient of our loss function. Readers without a maths background can just accept the result for now.
$$L(w) = \frac{1}{l} \sum_{i=1}^{l}(y_i-x_iw)^2$$
$$\frac{\partial L}{\partial w} = \frac{2}{l}\sum_{i=1}^{l}x_i(x_iw - y_i) = \frac{2}{l}\sum_{i=1}^{l}x_ie_i$$ where $e_i = x_iw - y_i$ is the difference between the predicted value and the real price in the $i$th row. From the equation above, one can check that the gradient of the MSE is larger when the error is larger, and smaller when the error is smaller.
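This gradient can be sketched in a few lines (the function name `gradient` is mine; each $x_i$ is a pair (bedrooms, living area), and $e_i$ is prediction minus sold price, matching the error convention of the worked tables):

```python
def gradient(xs, ws, ys):
    """Gradient of the MSE loss w.r.t. the weights:
    (2/l) * sum_i x_i * e_i, where e_i = prediction_i - y_i."""
    l = len(xs)
    grad = [0.0] * len(ws)
    for x, y in zip(xs, ys):
        pred = sum(xj * wj for xj, wj in zip(x, ws))
        e = pred - y  # error: predicted minus real
        for j, xj in enumerate(x):
            grad[j] += 2.0 * xj * e / l
    return grad

# Three sample houses: (bedrooms, living area) and their sold prices.
xs = [(4, 170), (3, 130), (5, 230)]
ys = [1_200_000, 980_000, 1_600_000]
g = gradient(xs, [5000.0, 100.0], ys)
# Both components are large and negative: the loss decreases
# if both naive weights grow, as the update step below will show.
```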
<br/>
<br/>
### **5. Let's try 2 iterations**
<br/>
**Step 0**:
| # Bedrooms | Living area | Sold Price | My prediction | Error |
| ---------- | ----------- | ---------- | ------------ | --- |
| 4 | 170 | 1,200,000 | 37000 |-1,163,000 |
| 3 | 130 | 980,000 | 28000 |-952,000 |
| 5 | 230 | 1,600,000 | 48000 |-1,552,000 |
- $w_0 = (5000, 100)$
- learning rate $\alpha = 0.00001$
- Calculate $$w_1 = w_0 - \alpha \frac{2}{3}\sum_{i=1}^{3}x_ie_i = w_0 -(-101.8, -4522.9) = (5000 + 101.8, 100 + 4522.9) = (5101.8, 4622.9)$$
<br/>
**Step 1**: Calculate new predictions and errors using updated weights, $w_1$
| # Bedrooms | Living area | Sold Price | My prediction | Error |
| ---------- | ----------- | ---------- | ------------ | --- |
| 4 | 170 | 1,200,000 | 806300.2|-393699.8|
| 3 | 130 | 980,000 | 616282.4|-363717.6 |
| 5 | 230 | 1,600,000 | 1088776|-511224|
- $w_1 = (5101.8, 4622.9)$
- learning rate $\alpha = 0.00001$
- Calculate new weights by:
$$w_2 = w_1 - \alpha \frac{2}{3}\sum_{i=1}^{3}x_ie_i= w_1 -(-34.8, -1545.3) = (5101.8 + 34.8, 4622.9 + 1545.3) = (5136.6, 6168.2)$$
<br/>
**Step 2**: Calculate new predictions and errors using updated weights, $w_2$, (5136.6, 6168.2)
| # Bedrooms | Living area | Sold Price | My prediction | Error |
| ---------- | ----------- | ---------- | ------------ | --- |
| 4 | 170 | 1,200,000 | 1,069,140.4|-130,859.6|
| 3 | 130 | 980,000 | 817,275.8|-162,724.2 |
| 5 | 230 | 1,600,000 | 1,444,369|-155,631|
<br/>
As you can see from the example above, after 2 iterations our naive model has become much closer to the real prices. You might also expect that after more rounds the errors would become smaller. Some readers might also realise that in our example here, the optimal weights can be found explicitly by **linear regression**. We will not go into the details of linear regression in this article. The real power of the Gradient Descent algorithm shows when we want to build a Black Box.
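The two hand-worked iterations can be reproduced with a short gradient-descent loop (a sketch; weights are ordered (bedroom weight, area weight) as in $w_0$, and small rounding differences from the hand-worked tables are expected):

```python
xs = [(4, 170), (3, 130), (5, 230)]   # (bedrooms, living area)
ys = [1_200_000, 980_000, 1_600_000]  # sold prices
w = [5000.0, 100.0]                   # naive initial weights w_0
alpha = 0.00001                       # learning rate

for step in range(2):
    # Errors: prediction minus sold price, one per house.
    errors = [sum(xj * wj for xj, wj in zip(x, w)) - y
              for x, y in zip(xs, ys)]
    # Gradient of the MSE loss: (2/l) * sum_i x_i * e_i.
    grad = [2 / len(xs) * sum(x[j] * e for x, e in zip(xs, errors))
            for j in range(len(w))]
    # Move against the gradient.
    w = [wj - alpha * gj for wj, gj in zip(w, grad)]

print(w)  # close to (5136.6, 6168.2), as in the hand-worked step 2
```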
<br/>
<br/>
### **6. Another Layer of weights with activations**
<br/>
The current model architecture can be represented below:
<img
src="https://i.ibb.co/4MjvK22/NN01.png" width="250" height="200">
In the input layer we have two factors; in the output layer we have one result, which is the predicted value. If we have more factors, say 10, we can simply extend our weights to 10 dimensions, as $\Bbb{x}$ becomes a 10-dimensional variable. The graph of the 10-factor model looks like this:
<br/>
<img
src="https://i.ibb.co/r22skQm/NN02.png" width="250" height="300">
The powerful part of this architecture is that you can add layers to it! Let's first provide a visual representation, with a hidden layer (with a non-linear activation function) between the input factors and the output result.
<img
src="https://i.ibb.co/vJmG4Wr/NN1.png" width="450" height="300">
Now we have a fancier model. We are no longer predicting the price (i.e. the output layer in $\Bbb{R}$) directly; instead we "predict" 10 hidden factors, or activations, using one set of weights, and then use those 10 hidden factors "as the inputs", with 10 more weights, to predict the price.
You might ask:
1. Why do we create this hidden layer? Why not just use the original factors?
2. What is the activation function in the hidden layer?
I can only give you my personal answer on this:
1. The inputs, the factors, are variables that can be observed by us and thus collected as data. But there might be factors that are latent and cannot be observed or explained by us. The hidden layer actually gives our model a chance to represent some latent factors using the observable ones. So we can understand that our model no longer relies only on the factors we can observe, but also has the potential to **"consider"** factors behind the scenes.
2. A non-linear activation function is what preserves the multi-layer structure. If we used a linear activation, or no activation at all, the entire structure would collapse into a single layer, with the weight matrices multiplied together.
The hidden factors in the hidden layer, together with the activation, are called **neurons**, an analogy to the biological ones in our brain. The number of layers and the number of neurons in each layer are actually up to you, with consideration of the computational cost: the larger the network structure, the more computation is required.
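A forward pass through one hidden layer can be sketched like this (a toy sketch: the layer sizes, the ReLU activation, and the random weights are my illustrative choices, not the article's fixed model):

```python
import random

def relu(t):
    """A common non-linear activation: max(0, t)."""
    return max(0.0, t)

def forward(x, W_hidden, w_out):
    """Input factors -> activated hidden neurons -> predicted price."""
    hidden = [relu(sum(xj * wj for xj, wj in zip(x, row)))
              for row in W_hidden]
    return sum(h * wj for h, wj in zip(hidden, w_out))

random.seed(42)
n_in, n_hidden = 2, 10
W_hidden = [[random.uniform(-1, 1) for _ in range(n_in)]
            for _ in range(n_hidden)]          # input -> hidden weights
w_out = [random.uniform(-1, 1) for _ in range(n_hidden)]  # hidden -> output
price = forward([3, 130], W_hidden, w_out)     # (bedrooms, living area)
```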
The next question would be: does our training method (formally called **Gradient Descent**) fit this new architecture? Thanks to math again, the answer is positive! To explain this, let's go back to a simpler version: 2 layers, with 2 neurons in each layer.
<img
src="https://i.ibb.co/Zgc14yp/NN2.png" width="520" height="300">
For the input layer, we now have 4 weights, forming a $2 \times 2$ matrix $\begin{bmatrix}
w_{11} & w_{12} \\
w_{21} & w_{22}
\end{bmatrix}$, in which the first row of the matrix takes the two input values and calculates the value of the upper hidden neuron, whereas the second row takes the two input values and calculates the value of the lower hidden neuron.
The hidden layer then also has 2 weights, which we can treat as a $1 \times 2$ matrix $\begin{bmatrix}
w_{o1} & w_{o2} \\
\end{bmatrix}$, in which this matrix computes the output value by taking the values of the two hidden neurons as inputs.
So we have two functions and a composed loss function:
- $f(\Bbb{x}, \Bbb{w_1})$, where $\Bbb{x}$ is the input vector and $\Bbb{w_1}$ is the first weight matrix; for simplicity, we assume the non-linear activation function is already built into $f$;
- $g(f(\Bbb{x},\Bbb{w_1}), \Bbb{w_o})$, where $g$ takes the output of $f$, as well as our second-layer weights, to compute the final price;
- the **loss** function becomes $L(g(f(\Bbb{x},\Bbb{w_1}), \Bbb{w_o}),\Bbb{y})$.

Thanks to the **Chain Rule**!
$$\frac{\partial L}{\partial w_o} = \frac{\partial L}{\partial g} \frac{\partial g}{\partial w_o} $$
$$\frac{\partial L}{\partial w_1} = \frac{\partial L}{\partial f} \frac{\partial f}{\partial w_1} = \frac{\partial L}{\partial g}\frac{\partial g}{\partial f} \frac{\partial f}{\partial w_1}$$
One thing to notice is that for the gradient of $\Bbb{w_o}$ we calculate $\frac{\partial g}{\partial w_o}$, whereas for the gradient of $\Bbb{w_1}$ we calculate $\frac{\partial g}{\partial f}$; these are two different partial derivatives of $g$. You can also attach another chain-rule factor to $\frac{\partial f}{\partial w_1}$ with respect to the related activation function. For example, if $a(f(x, \Bbb{w_1}))$ is the activated neuron, then $\frac{\partial f}{\partial w_1}$ can be replaced with $\frac{\partial a}{\partial f}\frac{\partial f}{\partial w_1}$. Once we have the gradients for both $\Bbb{w_o}$ and $\Bbb{w_1}$, we can update them with the **gradient descent** described above.
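The chain-rule computations above can be sketched for the 2-2-1 network (a minimal sketch: the sigmoid activation, the squared-error loss on a single sample, and the specific numbers are my illustrative choices), with one backpropagated gradient checked against a finite difference:

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def forward(x, W1, wo):
    """2-2-1 network: f produces the activated hidden neurons, g the output."""
    z = [sum(xj * wj for xj, wj in zip(x, row)) for row in W1]  # pre-activation
    a = [sigmoid(zi) for zi in z]                               # hidden neurons
    return z, a, sum(ai * wi for ai, wi in zip(a, wo))          # output of g

def gradients(x, y, W1, wo):
    """Backpropagation via the chain rule for the loss L = (g - y)^2."""
    z, a, out = forward(x, W1, wo)
    dL_dg = 2.0 * (out - y)
    # Output weights: dL/dwo_i = dL/dg * dg/dwo_i, with dg/dwo_i = a_i.
    g_wo = [dL_dg * ai for ai in a]
    # Hidden weights: dL/dw1 = dL/dg * dg/df * da/dz * dz/dw1,
    # using sigmoid'(z) = a * (1 - a).
    g_W1 = [[dL_dg * wo[i] * a[i] * (1 - a[i]) * xj for xj in x]
            for i in range(len(W1))]
    return g_wo, g_W1

# Check one entry against a finite difference (a standard sanity test).
x, y = [1.0, 2.0], 3.0
W1, wo = [[0.1, -0.2], [0.3, 0.4]], [0.5, -0.6]
g_wo, g_W1 = gradients(x, y, W1, wo)
eps = 1e-6
W1p = [row[:] for row in W1]
W1p[0][1] += eps
num = ((forward(x, W1p, wo)[2] - y) ** 2
       - (forward(x, W1, wo)[2] - y) ** 2) / eps
assert abs(num - g_W1[0][1]) < 1e-4
```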
This is the most powerful part of the Neural Network architecture: starting from random weights, we can learn much better weights for specific tasks through the process of forward and backward propagation. The weights we train can form a good representation of latent variables.
**But how do we explain those hidden neurons?**
Well, this is the reason this powerful architecture is still seen as a **Black Box**: we cannot find an appropriate and robust representation of knowledge to explain the function of those hidden neurons. All we know for now is that **it works**!