# ft_linear_regression
An introductory data science / machine learning project.
Code can be found [here](https://github.com/LaiAnTan/42KL-ft_linear_regression).
---
## Prologue: Predictive analysis
Predictive analysis is the process of analyzing patterns in data in order to predict future outcomes.
There are many types of techniques used in predictive analysis, and linear regression is one such technique.
## Linear Regression
Linear regression is the simplest method used for predictive analysis.
In simple terms, linear regression is about using the value of a variable to predict the value of another variable.
The variable used for prediction is called the ***independent variable***.
The variable predicted is called the ***dependent variable***.
Linear regression fits a straight line that minimizes the difference between predicted values and actual values (i.e. it finds the best-fit line).
For example:

In the picture, we can use the red line to get a good prediction of a car's price based on its mileage.
### Least Squares
One way of obtaining the linear regression line is through the ***least squares*** method.
The least squares method is a parameter estimation method based on minimizing the sum of the squares of the residuals. It provides a closed form solution for linear regression.
The formula for the slope, $m$, is as follows:
$$m = \frac{\sum_{i=0}^{n-1}(x_i - \bar x)(y_i - \bar y)}{\sum_{i=0}^{n-1}(x_i - \bar x)^2}$$
The intercept then follows from the means: $b = \bar y - m\bar x$.
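Here is a minimal sketch of the closed-form fit in plain Python (the function and variable names are my own, not from the project code):

```py
def least_squares_fit(xs: list[float], ys: list[float]) -> tuple[float, float]:
    """Closed-form least squares fit for one feature: returns (slope, intercept)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # m = sum((x - mean_x)(y - mean_y)) / sum((x - mean_x)^2)
    numerator = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    denominator = sum((x - x_mean) ** 2 for x in xs)
    slope = numerator / denominator
    intercept = y_mean - slope * x_mean  # b = mean_y - m * mean_x
    return slope, intercept
```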
However, obtaining such a closed-form solution might be infeasible in practice, due to time / mathematical constraints.
The solution? Gradient Descent.
### Accuracy of linear regression
The coefficient of determination, $R^2$, is a measure of how well a regression line approximates the actual data.
Other definitions:
- i.e. a measure of the strength of the relationship between the independent and dependent variables.
- i.e. how much of the variation in the dependent variable can be explained by the independent variable.
The formula for $R^2$ is as follows:
$$
R^2=1-\frac{\sum(y - \hat y)^2}{\sum(y-\bar y)^2}
$$
$R^2 = Coefficient\ of\ determination$
$\hat y = Predicted\ value$
$\bar y = Mean\ of\ the\ actual\ values$
The ***scale to interpret $R^2$*** is as follows:
|$R^2$ |Strength of relationship|
|---------|------------------------|
|0 - 0.2 |No relationship |
|0.2 - 0.4|Weak relationship |
|0.4 - 0.6|Medium relationship |
|0.6 - 0.8|Strong relationship |
|0.8 - 1 |Very Strong relationship|
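As a sanity check, $R^2$ can be computed directly from the model's predictions. A minimal sketch in plain Python (function and variable names are my own, not from the project code):

```py
def r_squared(ys: list[float], preds: list[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_mean = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))  # residual sum of squares
    ss_tot = sum((y - y_mean) ** 2 for y in ys)            # total sum of squares
    return 1 - ss_res / ss_tot
```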
## Gradient Descent
Gradient descent is arguably the most basic algorithm in machine learning.
It is an optimization algorithm, which is used to **approximate the minimum of a differentiable function.**
Gradient descent does so by repeatedly taking steps in the direction of steepest descent (i.e. the negative direction of the gradient), gradually moving towards a minimum of the function.
### Benefits of Gradient Descent
For larger and more complex datasets, a closed-form solution (i.e. using least squares) may not always be available, so gradient descent is preferred for its efficiency and stability.
Another benefit of gradient descent is that it is a general algorithm which can be applied to many different objective functions, whereas least squares is a non-iterative way to solve one specific optimisation problem.
### Pseudocode / Code
The pseudocode of gradient descent is as follows:
1. initialize the weights randomly
2. calculate the gradient of the cost function with respect to each parameter using partial differentiation
3. update the weights: w = w - learning rate * weight gradient
4. repeat until the number of iterations exceeds a threshold or the change in weights is smaller than a threshold
Now for the simplified code:
```py
for step in range(1, self.max_steps + 1):
    N = data.shape[0]
    slope_grad = 0      # d Cost / d slope
    intercept_grad = 0  # d Cost / d intercept
    for x, y in data:
        slope_grad += (self.estimate(x) - y) * x
        intercept_grad += (self.estimate(x) - y)
    self.slope -= self.learning_rate * (1 / N) * slope_grad
    self.intercept -= self.learning_rate * (1 / N) * intercept_grad
```
### Objective functions, Cost Functions, Loss Functions
An ***objective function*** is any function that is optimized during training.
A ***cost function*** is a measure of how well a machine learning model performs by quantifying the difference between predicted and actual outputs. It might be a sum of loss functions over your training set plus some model complexity penalty (regularization).
To **minimize** the cost function is just to find optimal parameters that result in the lowest cost (output).
The reason we minimize the cost function is that **the lower the cost, the smaller the error in the model, and thus the better the model performs**.
The cost function used for linear regression is Mean Squared Error (MSE).
A ***loss function*** is a function defined on a single data point which calculates the loss (i.e. penalty).
The loss function used for linear regression is Squared Loss.
Therefore, in order from general to specific:
```
objective function -> cost function -> loss function
```
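To make the distinction concrete, here is a minimal sketch (my own naming) of a squared loss on a single point versus the MSE cost over the whole training set:

```py
def squared_loss(y_pred: float, y_true: float) -> float:
    """Loss function: penalty for a single data point."""
    return (y_pred - y_true) ** 2

def mse_cost(preds: list[float], ys: list[float]) -> float:
    """Cost function: mean of the per-point squared losses over the dataset."""
    return sum(squared_loss(p, y) for p, y in zip(preds, ys)) / len(ys)
```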
### Examples of Cost Functions
#### Mean Absolute Error (MAE)
In MAE, we take the mean of the absolute residuals of each data point in the dataset.
$$MAE=\frac{1}{n}\sum_{i=0}^{n-1}|\hat y_i - y_i|$$
#### Mean Squared Error (MSE)
In MSE, we take the mean of the squares of the residuals of each data point in the dataset.
The formula is as follows:
$$MSE=\frac{1}{n}\sum_{i=0}^{n-1}(\hat y_i - y_i)^2$$
This is the one we use in our version of gradient descent.
#### Root Mean Squared Error (RMSE)
This is MSE, but the square root is taken afterwards.
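For reference, a small sketch computing all three metrics with NumPy (assuming `preds` and `ys` are arrays of predicted and actual values; the helper names are my own):

```py
import numpy as np

def mae(preds: np.ndarray, ys: np.ndarray) -> float:
    return float(np.mean(np.abs(preds - ys)))

def mse(preds: np.ndarray, ys: np.ndarray) -> float:
    return float(np.mean((preds - ys) ** 2))

def rmse(preds: np.ndarray, ys: np.ndarray) -> float:
    return float(np.sqrt(mse(preds, ys)))
```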
### Derivative of Mean Squared Error
One important aspect of MSE is that it is [differentiable](https://youtu.be/KLXP2RL0-Vg?si=3dNDcFxhVnq0XveZ).
We have to take the derivative of MSE to calculate the gradient.
Let $C$ be the cost function, which is the Mean Squared Error.
$$C=\frac{1}{n}{\sum_{i=0}^{n-1}(\hat y_i-y_i)^2}$$
Perform partial differentiation of the cost function, $C$:
$$u=(\hat y_i-y_i)$$
$$C=\frac{1}{n}{\sum_{i=0}^{n-1}u^2}$$
$$\frac{\delta C}{\delta u}=\frac{2}{n}{\sum_{i=0}^{n-1}u}$$
$$\frac{\delta C}{\delta u}=\frac{2}{n}{\sum_{i=0}^{n-1}(\hat y_i-y_i)}$$
$$\frac{\delta u}{\delta \hat y_i}=1$$
Let the hypothesis be:
$$\hat y_i = mx_i+b$$
Perform partial differentiation of the hypothesis (linear regression formula):
$$\frac{\delta \hat y_i}{\delta m}=x_i$$
$$\frac{\delta \hat y_i}{\delta b}=1$$
Partial derivative of cost, $C$ with respect to slope, $m$ and intercept, $b$:
$$\frac{\delta C}{\delta b}=\frac{\delta C}{\delta u}\times\frac{\delta u}{\delta \hat y_i}\times\frac{\delta \hat y_i}{\delta b}$$
$$\frac{\delta C}{\delta b}=\frac{2}{n}{\sum_{i=0}^{n-1}(\hat y_i-y_i)}\times1\times1$$
$$\therefore \frac{\delta C}{\delta b}=\frac{2}{n}{\sum_{i=0}^{n-1}(\hat y_i-y_i)}$$
$$\frac{\delta C}{\delta m}=\frac{\delta C}{\delta u}\times\frac{\delta u}{\delta \hat y_i}\times\frac{\delta \hat y_i}{\delta m}$$
$$\therefore \frac{\delta C}{\delta m}=\frac{2}{n}{\sum_{i=0}^{n-1}(\hat y_i-y_i)\,x_i}$$
(the $x_i$ stays inside the sum, since it differs for each data point)
Therefore, $\dfrac{\delta C}{\delta b}$ and $\dfrac{\delta C}{\delta m}$ give us the gradients with respect to the intercept $b$ and the slope $m$, which are used in the weight update step.
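Translating the derived formulas directly into code gives the gradient computation below (a sketch with my own names; note that the earlier simplified training loop drops the constant factor $2$, which is equivalent to absorbing it into the learning rate):

```py
def mse_gradients(xs, ys, m, b):
    """Partial derivatives of the MSE cost with respect to slope m and intercept b."""
    n = len(xs)
    grad_m = (2 / n) * sum((m * x + b - y) * x for x, y in zip(xs, ys))  # dC/dm
    grad_b = (2 / n) * sum((m * x + b - y) for x, y in zip(xs, ys))      # dC/db
    return grad_m, grad_b
```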
### Problems with Gradient Descent
Converging to a minimum can be very slow, causing long training times.
If the cost function has multiple local minima, it is quite likely that the algorithm converges to a local minimum rather than the global one (i.e. the algorithm never reaches its full potential). This is not an issue for simple linear regression with MSE, whose cost function is convex, but it matters for more complex models.
### Variants of Gradient Descent
The gradient descent algorithm described above is the most basic gradient descent algorithm, called ***Batch Gradient Descent***.
***Stochastic Gradient Descent*** is a faster variant of batch gradient descent, where each update uses the gradient of a single randomly chosen data point (or a small random subset) instead of the whole dataset. This improves the speed at which it converges for large datasets.
Mini-batch Gradient Descent splits the training dataset into small batches, and updates the weights after each batch.
> [Mini-batch gradient descent seeks to find a balance between the robustness of stochastic gradient descent and the efficiency of batch gradient descent. It is the most common implementation of gradient descent used in the field of deep learning.](https://machinelearningmastery.com/gentle-introduction-mini-batch-gradient-descent-configure-batch-size/)
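For illustration, here is a rough sketch of a mini-batch update loop (my own simplified code, not the project implementation), built on the same slope/intercept update as before:

```py
import random

def minibatch_gradient_descent(data, lr=0.01, batch_size=32, epochs=100):
    """data: list of (x, y) pairs. Returns (slope, intercept)."""
    m, b = 0.0, 0.0
    for _ in range(epochs):
        random.shuffle(data)                       # reshuffle every epoch
        for i in range(0, len(data), batch_size):
            batch = data[i:i + batch_size]
            n = len(batch)
            grad_m = sum((m * x + b - y) * x for x, y in batch) / n
            grad_b = sum((m * x + b - y) for x, y in batch) / n
            m -= lr * grad_m                       # update after each mini-batch
            b -= lr * grad_b
    return m, b
```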
## Normalization
The gradient descent implementation we have so far is quite prone to problems:
- [overflow errors](https://stackoverflow.com/questions/49865952/implementing-gradient-descent-in-python-and-receiving-an-overflow-error)
- getting stuck in a local minimum / saddle point
The resulting linear regression is also not very accurate (as shown by the $R^2$ value).
The solution? To normalize / standardize the data.
### Benefits of normalization
- reduce the influence outliers have over the data
- make the scale of the data more manageable, i.e. no overflows
Overall, normalization helps increase the efficiency of training models and the overall performance of the model.
### Types of Normalization
#### Min Max Feature Scaling
Min Max Feature Scaling (Min Max Normalization) scales all values into the range [0, 1].
It guarantees that all features will have the same scale, but it does not handle outliers well.
The formula for Min Max Feature Scaling is as follows:
$$\hat x = \frac{x - x_{min}}{x_{max} - x_{min}}\; ,\; \hat y = \frac{y - y_{min}}{y_{max} - y_{min}}$$
#### Z-Score Scaling (Standardization)
Z-Score Scaling scales values based on the mean and standard deviation.
It handles outliers fairly well, but does not guarantee that all features end up on exactly the same scale.
$$\hat x = \frac{x - \mu_x}{\sigma_x}\; ,\;\hat y = \frac{y - \mu_y}{\sigma_y}$$
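A minimal sketch of both scalings in plain Python (my own helper names), returning the scaled values together with the statistics needed later to revert the coefficients:

```py
def min_max_scale(values):
    """Scale values into [0, 1]; also return (min, max) for later reverting."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values], (lo, hi)

def z_score_scale(values):
    """Scale values to zero mean and unit variance; also return (mean, std)."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values], (mean, std)
```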
### Reverting coefficients
After training the linear regression model on normalized data, we may need to recover the coefficients on the original scale, which can then be used to predict new data.
There is no need to revert coefficients to their original scale if the relationship between variables is all that matters to the user.
In my case, however, the original coefficients are needed, and some basic algebra will suffice:
#### Reverting coefficients after Z-Score Scaling
Our goal is to change the subject of the equation until we get an equation that relates the original y, $y_o$ to the original x, $x_o$ in linear form.
Linear regression formula (normalized):
$$y_n = m_nx_n+c_n$$
Substitute Z-Score Scaling formula:
$$\frac{y_o - \mu_y}{\sigma_y} = m_n(\frac{x_o - \mu_x}{\sigma_x})+c_n$$
$$\frac{y_o - \mu_y}{\sigma_y} = \frac{m_nx_o - m_n\mu_x}{\sigma_x}+c_n$$
$$\frac{y_o - \mu_y}{\sigma_y} - c_n = \frac{m_nx_o - m_n\mu_x}{\sigma_x}$$
$$\frac{y_o - \mu_y -c_n\sigma_y }{\sigma_y} = \frac{m_nx_o - m_n\mu_x}{\sigma_x}$$
$${\sigma_x}({y_o - \mu_y -c_n\sigma_y }) = \sigma_y({m_nx_o - m_n\mu_x})$$
$${y_o\sigma_x - \mu_y\sigma_x -c_n\sigma_y\sigma_x } = {m_nx_o\sigma_y - m_n\mu_x\sigma_y}$$
$$y_o\sigma_x = {m_nx_o\sigma_y - m_n\mu_x\sigma_y} + \mu_y\sigma_x + c_n\sigma_y\sigma_x$$
$$y_o = \frac{\sigma_ym_n}{\sigma_x}x_o + \frac{- m_n\mu_x\sigma_y + \mu_y\sigma_x + c_n\sigma_y\sigma_x}{\sigma_x}$$
$$y_o = \frac{\sigma_ym_n}{\sigma_x}x_o - \frac{m_n\mu_x\sigma_y}{\sigma_x} + \mu_y + c_n\sigma_y$$
$$\therefore m_o=\frac{\sigma_ym_n}{\sigma_x}\;,\;c_o= - \frac{m_n\mu_x\sigma_y}{\sigma_x} + \mu_y + c_n\sigma_y$$
We can now use the original coefficients for predicting new data.
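In code, the reversion is just a couple of lines (a sketch using the statistics returned by the scaling helpers sketched earlier; the names are my own):

```py
def revert_z_score_coefficients(m_n, c_n, x_stats, y_stats):
    """Convert slope/intercept learned on z-scored data back to the original scale."""
    mu_x, sigma_x = x_stats
    mu_y, sigma_y = y_stats
    m_o = sigma_y * m_n / sigma_x
    c_o = mu_y + c_n * sigma_y - m_o * mu_x  # same as -m_n*mu_x*sigma_y/sigma_x + mu_y + c_n*sigma_y
    return m_o, c_o
```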
#### Reverting coefficients after Min Max Feature Scaling
I'm too lazy to type out the LaTeX, so here's an ugly picture:

In summary, the current process for training a linear regression model via gradient descent is:
1. Normalize / Standardize dataset
2. Perform Gradient Descent
3. Revert the normalized coefficients
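Putting the three steps together, here is a rough end-to-end sketch (my own simplified code, assuming the `z_score_scale` and `revert_z_score_coefficients` helpers sketched earlier; the actual project code lives in the repository linked at the top):

```py
def train(xs, ys, lr=0.1, steps=1000):
    """Standardize, run batch gradient descent, then revert the coefficients."""
    # 1. Standardize the dataset
    xn, x_stats = z_score_scale(xs)
    yn, y_stats = z_score_scale(ys)

    # 2. Batch gradient descent on the normalized data
    m, b = 0.0, 0.0
    n = len(xn)
    for _ in range(steps):
        grad_m = sum((m * x + b - y) * x for x, y in zip(xn, yn)) / n
        grad_b = sum((m * x + b - y) for x, y in zip(xn, yn)) / n
        m -= lr * grad_m
        b -= lr * grad_b

    # 3. Revert the normalized coefficients to the original scale
    return revert_z_score_coefficients(m, b, x_stats, y_stats)
```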
## Sources
#### Linear Regression
https://www.youtube.com/watch?v=zPG4NjIkCjc
https://youtu.be/JvS2triCgOY?si=3rRvQAAagro8nZfX
https://youtu.be/w2FKXOa0HGA?si=lvcBXh9yvXA7AiMB
http://www.sthda.com/english/articles/38-regression-model-validation/158-regression-model-accuracy-metrics-r-square-aic-bic-cp-and-more/
https://www.youtube.com/watch?v=w2FKXOa0HGA
https://www.ibm.com/topics/linear-regression
#### Gradient Descent
https://youtu.be/IHZwWFHWa-w?si=JlFPhOA9Hpcg3d2E
https://youtu.be/sDv4f4s2SB8?si=9Fx8NQqbI6u1tnxL
https://stats.stackexchange.com/questions/179026/objective-function-cost-function-loss-function-are-they-the-same-thing
https://en.wikipedia.org/wiki/Mean_squared_error
https://youtu.be/Wk8vycBT4qM?si=Tq3mhlytVoUdxBt7
https://youtu.be/KLXP2RL0-Vg?si=3dNDcFxhVnq0XveZ
https://www.youtube.com/watch?v=Souzjv6WfrY&t=1084s
#### Normalization
https://www.codecademy.com/article/normalization
https://stackoverflow.com/questions/26414913/normalize-columns-of-a-dataframe
https://en.wikipedia.org/wiki/Normalization_(statistics)
https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing