# ft_linear_regression

An introductory data science / machine learning project.

Code can be found [here](https://github.com/LaiAnTan/42KL-ft_linear_regression).

---

## Prologue: Predictive Analysis

Predictive analysis is the process of analyzing patterns in data with the purpose of predicting future outcomes.

There are many techniques used in predictive analysis, and linear regression is one such technique.

## Linear Regression

Linear regression is the simplest method used for predictive analysis.

In simple terms, linear regression is about using the value of one variable to predict the value of another variable. The variable used for prediction is called the ***independent variable***. The variable being predicted is called the ***dependent variable***.

Linear regression fits a straight line that minimizes the difference between predicted values and actual values (i.e. it finds the best-fit line).

For example:

![Linear Regression Example](https://hackmd.io/_uploads/HkjHaRQop.png)

In the picture, we can use the red line to get a good prediction of a car's price based on its mileage.

### Least Squares

One way of obtaining the linear regression line is through the ***least squares*** method.

The least squares method is a parameter estimation method based on minimizing the sum of the squares of the residuals. It produces a closed-form solution for linear regression.

The least squares formula for the slope is as follows:

$$m = \frac{\sum_{i=0}^{n-1}(x_i - \bar x)(y_i - \bar y)}{\sum_{i=0}^{n-1}(x_i - \bar x)^2}$$

with the intercept given by $b = \bar y - m\bar x$.

However, obtaining a closed-form solution might be unfeasible most of the time, due to time / mathematical constraints.

The solution? Gradient descent.

### Accuracy of linear regression

The coefficient of determination, $R^2$, is a measure of how well a regression line approximates the actual data.

Other definitions:

- i.e. a measure of the strength of the relationship between the independent variable and the dependent variable.
- i.e. how much of the variation in the dependent variable can be explained by the independent variable.

The formula for $R^2$ is as follows:

$$R^2 = 1 - \frac{\sum(y - \hat y)^2}{\sum(y - \bar y)^2}$$

$R^2 = Coefficient\ of\ determination$

$\hat y = Predicted\ value$

$\bar y = Mean\ of\ the\ actual\ values$

The ***scale to interpret $R^2$*** is as follows:

|$R^2$    |Strength of relationship|
|---------|------------------------|
|0 - 0.2  |No relationship         |
|0.2 - 0.4|Weak relationship       |
|0.4 - 0.6|Medium relationship     |
|0.6 - 0.8|Strong relationship     |
|0.8 - 1  |Very strong relationship|
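To make the closed-form solution and $R^2$ concrete, here is a minimal Python sketch (illustrative only, with made-up mileage / price numbers rather than the project's dataset) that computes the least squares slope, intercept and coefficient of determination:

```py
# Least squares + R^2 sketch (illustrative only, values are made up).
mileage = [240000, 150000, 74000, 61000, 22899]  # independent variable x
price = [3650, 6800, 7990, 8290, 9990]           # dependent variable y

n = len(mileage)
x_mean = sum(mileage) / n
y_mean = sum(price) / n

# slope m = sum((x - x_mean)(y - y_mean)) / sum((x - x_mean)^2)
m = sum((x - x_mean) * (y - y_mean) for x, y in zip(mileage, price)) \
    / sum((x - x_mean) ** 2 for x in mileage)

# intercept b = y_mean - m * x_mean
b = y_mean - m * x_mean

# R^2 = 1 - (sum of squared residuals / total sum of squares)
predicted = [m * x + b for x in mileage]
ss_res = sum((y - y_hat) ** 2 for y, y_hat in zip(price, predicted))
ss_tot = sum((y - y_mean) ** 2 for y in price)
r_squared = 1 - ss_res / ss_tot

print(f"m = {m:.5f}, b = {b:.2f}, R^2 = {r_squared:.3f}")
```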
## Gradient Descent

Gradient descent is arguably the most basic algorithm in machine learning. It is an optimization algorithm, which is used to **approximate the minimum of a differentiable function.**

Gradient descent does so by repeatedly taking steps in the direction of steepest descent (i.e. the negative direction of the gradient), thereby moving towards a minimum of the function.

### Benefits of Gradient Descent

For larger and more complex datasets, a closed-form solution (i.e. using least squares) may not always be available, and thus gradient descent is preferred for its efficiency and stability.

Another benefit of gradient descent is that it is a general algorithm which can be applied to different use cases (objective functions), whereas least squares is a non-iterative way to solve one specific case of an optimization problem.

### Pseudocode / Code

The pseudocode of gradient descent is as follows:

1. Initialize the weights randomly.
2. Calculate the gradient of the cost function with respect to each parameter using partial differentiation.
3. Update the weights: w = w - learning rate * weight gradient.
4. Repeat until the number of iterations exceeds a threshold or the change in the weights is smaller than a threshold.

Now for the simplified code:

```py
for step in range(1, self.max_steps + 1):

    N = data.shape[0]

    slope_grad = 0  # d Cost / d slope
    intercept_grad = 0  # d Cost / d intercept

    for x, y in data:
        slope_grad += (self.estimate(x) - y) * x
        intercept_grad += (self.estimate(x) - y)

    self.slope -= self.learning_rate * (1 / N) * slope_grad
    self.intercept -= self.learning_rate * (1 / N) * intercept_grad
```

(The constant factor of 2 that appears in the derivative of MSE below is omitted here; it can be absorbed into the learning rate.)

### Objective Functions, Cost Functions, Loss Functions

An ***objective function*** is any function that is optimized during training.

A ***cost function*** is a measure of how well a machine learning model performs, quantifying the difference between predicted and actual outputs. It might be a sum of loss functions over the training set plus some model complexity penalty (regularization).

To **minimize** the cost function is simply to find the optimal parameters that result in the lowest cost (output). The reason we minimize the cost function is that **the lower the cost, the smaller the error in the model, and thus the better the model performs**.

The cost function used for linear regression is Mean Squared Error (MSE).

A ***loss function*** is a function defined on a single data point which calculates the loss (i.e. penalty).

The loss function used for linear regression is Squared Loss.

Therefore, in order from general to specific:

```
objective function -> cost function -> loss function
```

### Examples of Cost Functions

#### Mean Absolute Error (MAE)

In MAE, we take the mean of the absolute residuals over the dataset.

$$MAE=\frac{1}{n}\sum_{i=0}^{n-1}|\hat y_i - y_i|$$

#### Mean Squared Error (MSE)

In MSE, we take the mean of the squares of the residuals of each data point in the dataset.

The formula is as follows:

$$MSE=\frac{1}{n}\sum_{i=0}^{n-1}(\hat y_i - y_i)^2$$

This is the one we use in our version of gradient descent.

#### Root Mean Squared Error (RMSE)

This is MSE, but with the square root taken afterwards.
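As a quick illustration of the three metrics (a minimal sketch with made-up values, not part of the project code):

```py
import math

# Illustrative only: MAE, MSE and RMSE for a handful of made-up values.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.5, 6.0, 9.5]

n = len(actual)
mae = sum(abs(p - a) for a, p in zip(actual, predicted)) / n
mse = sum((p - a) ** 2 for a, p in zip(actual, predicted)) / n
rmse = math.sqrt(mse)

print(mae, mse, rmse)  # 0.625 0.4375 0.661...
```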
### Derivative of Mean Squared Error

One important aspect of MSE is that it is [differentiable](https://youtu.be/KLXP2RL0-Vg?si=3dNDcFxhVnq0XveZ).

We have to take the derivative of MSE to calculate the gradient.

Let $C$ be the cost function, which is the Mean Squared Error.

$$C=\frac{1}{n}{\sum_{i=0}^{n-1}(\hat y_i-y_i)^2}$$

Perform partial differentiation of the cost function, $C$:

$$u=(\hat y_i-y_i)$$

$$C=\frac{1}{n}{\sum_{i=0}^{n-1}u^2}$$

$$\frac{\partial C}{\partial u}=\frac{2}{n}{\sum_{i=0}^{n-1}u}$$

$$\frac{\partial C}{\partial u}=\frac{2}{n}{\sum_{i=0}^{n-1}(\hat y_i-y_i)}$$

$$\frac{\partial u}{\partial \hat y_i}=1$$

Let the hypothesis be:

$$\hat y_i = mx_i+b$$

Perform partial differentiation of the hypothesis (linear regression formula):

$$\frac{\partial \hat y_i}{\partial m}=x_i$$

$$\frac{\partial \hat y_i}{\partial b}=1$$

Partial derivative of the cost, $C$, with respect to the slope, $m$, and the intercept, $b$:

$$\frac{\partial C}{\partial b}=\frac{\partial C}{\partial u}\times\frac{\partial u}{\partial \hat y_i}\times\frac{\partial \hat y_i}{\partial b}$$

$$\frac{\partial C}{\partial b}=\frac{2}{n}{\sum_{i=0}^{n-1}(\hat y_i-y_i)}\times1\times1$$

$$\therefore \frac{\partial C}{\partial b}=\frac{2}{n}{\sum_{i=0}^{n-1}(\hat y_i-y_i)}$$

$$\frac{\partial C}{\partial m}=\frac{\partial C}{\partial u}\times\frac{\partial u}{\partial \hat y_i}\times\frac{\partial \hat y_i}{\partial m}$$

$$\therefore \frac{\partial C}{\partial m}=\frac{2}{n}{\sum_{i=0}^{n-1}(\hat y_i-y_i)x_i}$$

Therefore, we can use $\dfrac{\partial C}{\partial b}$ and $\dfrac{\partial C}{\partial m}$ to calculate the gradients of $b$ and $m$.

### Problems with Gradient Descent

Converging to a minimum can be really slow, causing long training times.

If the cost function has multiple local minima, it is quite possible that the algorithm never converges to the global minimum (i.e. the model never reaches its full potential).

### Variants of Gradient Descent

The gradient descent algorithm described above is the most basic version, called ***Batch Gradient Descent***.

***Stochastic Gradient Descent*** is a faster version of batch gradient descent, where the gradient is calculated on only a single randomly selected data point per update. This improves the speed at which it converges for large datasets.

***Mini-batch Gradient Descent*** splits the training dataset into small batches and updates the weights after each batch (a minimal sketch follows the quote below).

> [Mini-batch gradient descent seeks to find a balance between the robustness of stochastic gradient descent and the efficiency of batch gradient descent. It is the most common implementation of gradient descent used in the field of deep learning.](https://machinelearningmastery.com/gentle-introduction-mini-batch-gradient-descent-configure-batch-size/)
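For illustration, here is a minimal mini-batch version of the earlier training loop. It is only a sketch: it assumes the same `estimate`, `slope`, `intercept` and `learning_rate` attributes as the batch snippet above, a two-column NumPy array of (x, y) pairs, and a hypothetical `batch_size` parameter.

```py
import numpy as np

def run_epoch_mini_batch(self, data, batch_size=32):
    """One pass over the data, updating the weights after every mini-batch."""
    np.random.shuffle(data)  # visit the batches in a random order each epoch

    for start in range(0, data.shape[0], batch_size):
        batch = data[start:start + batch_size]
        n = batch.shape[0]

        slope_grad = 0.0      # d Cost / d slope, over this batch only
        intercept_grad = 0.0  # d Cost / d intercept, over this batch only

        for x, y in batch:
            slope_grad += (self.estimate(x) - y) * x
            intercept_grad += (self.estimate(x) - y)

        # update after every batch instead of after the whole dataset
        self.slope -= self.learning_rate * slope_grad / n
        self.intercept -= self.learning_rate * intercept_grad / n
```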
## Normalization

The current gradient descent we have is very prone to errors:

- [overflow errors](https://stackoverflow.com/questions/49865952/implementing-gradient-descent-in-python-and-receiving-an-overflow-error)
- getting stuck in a local minimum / saddle point

The resulting linear regression is also not very accurate (as shown by the $R^2$ value).

The solution? Normalize / standardize the data.

### Benefits of normalization

- reduces the influence outliers have over the data
- makes the scale of the data more manageable, i.e. no overflows

Overall, normalization helps increase the efficiency of training and the overall performance of the model.

### Types of Normalization

#### Min Max Feature Scaling

Min Max Feature Scaling (Min Max Normalization) scales all values into the range [0, 1]. It guarantees that all features will have the same scale, but does not handle outliers well.

The formula for Min Max Feature Scaling is as follows:

$$x_n = \frac{x_o - x_{min}}{x_{max} - x_{min}}\; ,\; y_n = \frac{y_o - y_{min}}{y_{max} - y_{min}}$$

#### Z-Score Scaling (Standardization)

Z-Score Scaling scales values based on the mean and standard deviation. It handles outliers pretty well, but does not guarantee that all features end up on the same scale.

$$x_n = \frac{x_o - \mu_x}{\sigma_x}\; ,\; y_n = \frac{y_o - \mu_y}{\sigma_y}$$

### Reverting coefficients

After training the linear regression model on normalized data, we may have to recover the coefficients on the original scale, which can then be used to predict new data.

There is no need to revert the coefficients to their original scale if the relationship between the variables is all that matters to the user. In my case, however, some basic algebra will suffice:

#### Reverting coefficients after Z-Score Scaling

Our goal is to rearrange the equation until we get an equation that relates the original y, $y_o$, to the original x, $x_o$, in linear form.

Linear regression formula (normalized):

$$y_n = m_nx_n+c_n$$

Substitute the Z-Score Scaling formula:

$$\frac{y_o - \mu_y}{\sigma_y} = m_n\left(\frac{x_o - \mu_x}{\sigma_x}\right)+c_n$$

$$\frac{y_o - \mu_y}{\sigma_y} = \frac{m_nx_o - m_n\mu_x}{\sigma_x}+c_n$$

$$\frac{y_o - \mu_y}{\sigma_y} - c_n = \frac{m_nx_o - m_n\mu_x}{\sigma_x}$$

$$\frac{y_o - \mu_y -c_n\sigma_y}{\sigma_y} = \frac{m_nx_o - m_n\mu_x}{\sigma_x}$$

$$\sigma_x(y_o - \mu_y -c_n\sigma_y) = \sigma_y(m_nx_o - m_n\mu_x)$$

$$y_o\sigma_x - \mu_y\sigma_x -c_n\sigma_y\sigma_x = m_nx_o\sigma_y - m_n\mu_x\sigma_y$$

$$y_o\sigma_x = m_nx_o\sigma_y - m_n\mu_x\sigma_y + \mu_y\sigma_x + c_n\sigma_y\sigma_x$$

$$y_o = \frac{\sigma_ym_n}{\sigma_x}x_o + \frac{- m_n\mu_x\sigma_y + \mu_y\sigma_x + c_n\sigma_y\sigma_x}{\sigma_x}$$

$$y_o = \frac{\sigma_ym_n}{\sigma_x}x_o - \frac{m_n\mu_x\sigma_y}{\sigma_x} + \mu_y + c_n\sigma_y$$

$$\therefore m_o=\frac{\sigma_ym_n}{\sigma_x}\;,\;c_o= - \frac{m_n\mu_x\sigma_y}{\sigma_x} + \mu_y + c_n\sigma_y$$

We can now use the original coefficients to predict new data.

#### Reverting Min Max Feature Scaling Coefficients

I'm too lazy to type out the LaTeX, here's an ugly picture:

![image](https://hackmd.io/_uploads/Bk1gdpQip.png)
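Here is a small illustrative sketch of that last step (hypothetical function names, not the project's actual code): standardize the data, then map the learned slope / intercept back to the original scale using the formulas derived above.

```py
import statistics

def zscore_normalize(values):
    """Standardize a list of values; returns (normalized values, mean, std)."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # population standard deviation
    return [(v - mu) / sigma for v in values], mu, sigma

def revert_zscore_coefficients(m_n, c_n, mu_x, sigma_x, mu_y, sigma_y):
    """Convert slope / intercept learned on standardized data back to the original scale."""
    m_o = sigma_y * m_n / sigma_x
    c_o = mu_y + c_n * sigma_y - m_n * mu_x * sigma_y / sigma_x
    return m_o, c_o

# Usage sketch:
# x_norm, mu_x, sigma_x = zscore_normalize(mileage)
# y_norm, mu_y, sigma_y = zscore_normalize(price)
# ... run gradient descent on (x_norm, y_norm) to obtain m_n, c_n ...
# m_o, c_o = revert_zscore_coefficients(m_n, c_n, mu_x, sigma_x, mu_y, sigma_y)
```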
In summary, the current process for training a linear regression model via gradient descent is:

1. Normalize / standardize the dataset
2. Perform gradient descent
3. Revert the normalized coefficients

## Sources

#### Linear Regression

https://www.youtube.com/watch?v=zPG4NjIkCjc
https://youtu.be/JvS2triCgOY?si=3rRvQAAagro8nZfX
https://youtu.be/w2FKXOa0HGA?si=lvcBXh9yvXA7AiMB
http://www.sthda.com/english/articles/38-regression-model-validation/158-regression-model-accuracy-metrics-r-square-aic-bic-cp-and-more/
https://www.ibm.com/topics/linear-regression

#### Gradient Descent

https://youtu.be/IHZwWFHWa-w?si=JlFPhOA9Hpcg3d2E
https://youtu.be/sDv4f4s2SB8?si=9Fx8NQqbI6u1tnxL
https://stats.stackexchange.com/questions/179026/objective-function-cost-function-loss-function-are-they-the-same-thing
https://en.wikipedia.org/wiki/Mean_squared_error
https://youtu.be/Wk8vycBT4qM?si=Tq3mhlytVoUdxBt7
https://youtu.be/KLXP2RL0-Vg?si=3dNDcFxhVnq0XveZ
https://www.youtube.com/watch?v=Souzjv6WfrY&t=1084s

#### Normalization

https://www.codecademy.com/article/normalization
https://stackoverflow.com/questions/26414913/normalize-columns-of-a-dataframe
https://en.wikipedia.org/wiki/Normalization_(statistics)
https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing