Regression is when we have a dataset consisting of a dependent variable (the outcome) and an independent variable (the thing we can change / control), and we are trying to find the relationship between them.
After the relationship is found, we can use it to predict the outcome for a given change. There are multiple regression types:
> Linear regression is regression using a straight line
>
> Logistic regression is regression using an S-shaped logistic (sigmoid) curve
To generate the line that represents a linear regression, we can use the least squares method. The idea is to use our data points and a formula, derived from minimizing the sum of squared residuals, to determine the line.
change | outcome |
---|---|
1 | 2 |
2 | 4 |
3 | 5 |
4 | 4 |
5 | 5 |
change mean = 3, outcome mean = 4
So we start with a sample dataset; we get the mean of both the x and y values (independent and dependent variables), which is (3, 4), and it turns out the regression line has to go through (3, 4).
> Notice the linear regression is a straight line and can be written as y = mx + c
Then we take the difference between each x value and the mean of x, and the same goes for y:
x | y | x - mean_x | y - mean_y | (x - mean_x) ^ 2 | (x - mean_x)(y - mean_y) |
---|---|---|---|---|---|
1 | 2 | -2 | -2 | 4 | 4 |
2 | 4 | -1 | 0 | 1 | 0 |
3 | 5 | 0 | 1 | 0 | 0 |
4 | 4 | 1 | 0 | 1 | 0 |
5 | 5 | 2 | 1 | 4 | 2 |
We also need to calculate (x - mean_x)^2 and (x - mean_x)(y - mean_y), because they are used in the following formula for least squares, which is derived by finding the minimum value of the residuals function:

$$b_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}$$
This gives us the gradient of the regression line. We can then calculate the intercept of the regression line by substituting the gradient into the y = mx + c equation, using (3, 4) as our x and y respectively.
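Using the sums from the table above ($\sum(x - \bar{x})(y - \bar{y}) = 6$ and $\sum(x - \bar{x})^2 = 10$), this works out to:

$$b_1 = \frac{6}{10} = 0.6, \qquad c = 4 - 0.6 \times 3 = 2.2$$

so the regression line for this sample dataset is $y = 0.6x + 2.2$.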
R² is used to represent the accuracy of the relationship. A completely accurate mapping will have an R² of 1, while less accurate relationships approach 0. It takes the sum of squared differences between the predicted outcomes and the mean outcome, over the sum of squared differences between the actual outcomes and the mean outcome:

$$R^2 = \frac{\sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$$

It is basically asking: how much of the actual values' spread around the mean is captured by the predicted values? While R² compares the predicted and actual values through their spread around the mean, the standard error of the residuals compares the predicted values and the actual values directly.
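For the example line $y = 0.6x + 2.2$ above, the predicted outcomes are 2.8, 3.4, 4.0, 4.6 and 5.2, so:

$$R^2 = \frac{(-1.2)^2 + (-0.6)^2 + 0^2 + 0.6^2 + 1.2^2}{(-2)^2 + 0^2 + 1^2 + 0^2 + 1^2} = \frac{3.6}{6} = 0.6$$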
Residuals are the differences between the predicted values and the actual values themselves, not the mean (like R²). The least squares function's objective is to minimize this as much as possible. Hence, it is derived by finding the minimum value of the sum of all squared residuals.
We will start off with the sum of all residuals squared:

$$SSR = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$
> Why squared? Because squaring removes negatives, and it has properties that make it a good estimator.
Since the predicted value is $\hat{y}_i = b_0 + b_1 x_i$, the above can be substituted, giving us

$$SSR = \sum_{i=1}^{n}\big(y_i - (b_0 + b_1 x_i)\big)^2$$
We are trying to find the values of $b_0$ and $b_1$ for which the sum of squared residuals is at its minimum, so we can use partial derivatives and set them to zero. Taking the partial derivative with respect to $b_0$:

$$\frac{\partial SSR}{\partial b_0} = \sum_{i=1}^{n} -2\big(y_i - (b_0 + b_1 x_i)\big) = 0$$
Note that the factor of $-2$ appears because of the chain rule: we also need to calculate the derivative of the expression inside the parentheses, which with respect to $b_0$ is $-1$ (and with respect to $b_1$ is $-x_i$), not just the outer power. The first equation can be simplified to

$$\sum_{i=1}^{n} y_i - n b_0 - b_1 \sum_{i=1}^{n} x_i = 0$$
The second equation can be derived the same way, taking the partial derivative with respect to $b_1$:

$$\frac{\partial SSR}{\partial b_1} = \sum_{i=1}^{n} -2x_i\big(y_i - (b_0 + b_1 x_i)\big) = 0$$

which simplifies to

$$\sum_{i=1}^{n} x_i y_i - b_0\sum_{i=1}^{n} x_i - b_1\sum_{i=1}^{n} x_i^2 = 0$$
We can solve the first equation by rearranging it in terms of $b_0$:

$$b_0 = \frac{\sum_{i=1}^{n} y_i - b_1 \sum_{i=1}^{n} x_i}{n} = \bar{y} - b_1\bar{x}$$
The last step uses the constant rule for summation: if $c$ is a constant, then $\sum_{i=1}^{n} c = nc$. The $\bar{x}$ and $\bar{y}$ are the means $\frac{1}{n}\sum_{i=1}^{n} x_i$ and $\frac{1}{n}\sum_{i=1}^{n} y_i$.
We can use the above and solve the second equation by substituting this expression for $b_0$:

$$\sum_{i=1}^{n} x_i y_i - (\bar{y} - b_1\bar{x})\sum_{i=1}^{n} x_i - b_1\sum_{i=1}^{n} x_i^2 = 0$$

$$b_1 = \frac{\sum_{i=1}^{n} x_i y_i - \bar{y}\sum_{i=1}^{n} x_i}{\sum_{i=1}^{n} x_i^2 - \bar{x}\sum_{i=1}^{n} x_i}$$
Which is equivalent to

$$b_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}$$
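As a quick sanity check, here is a minimal Python sketch (plain numpy, not the repository's `train.py`) that computes $b_1$ and $b_0$ for the sample dataset above using this closed-form result:

```python
import numpy as np

# Sample dataset from the tables above
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 5], dtype=float)

x_mean, y_mean = x.mean(), y.mean()

# b1 = sum((x - mean_x)(y - mean_y)) / sum((x - mean_x)^2)
b1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
# b0 = mean_y - b1 * mean_x (the line passes through the point of means)
b0 = y_mean - b1 * x_mean

print(b1, b0)  # 0.6 2.2
```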
Gradient descent is another way to find the minimum of the sum of squared residuals, using approximation. We can start with a simple example to demonstrate how it works:
> The regression line should follow the formula $y = b_0 + b_1 x$
We can slowly guess $b_1$ until the sum of squared residuals reaches its minimum, according to this graph. However, doing this is slow if we have a large dataset, as we need to make many adjustments. Gradient descent helps us do fewer calculations by taking larger steps at the beginning and slowing down as it approaches the minimum.
The first thing that we need is the loss function (a function whose value we want to minimize). In this case, the loss function will be the sum of squared residuals. We then take the following steps:

1. Take the derivative of the loss function with respect to the parameter, evaluated at the current value.
2. step size = result of the derivative * learning rate (another arbitrary value)
3. new value = old value - step size
4. Repeat until the step size gets very close to zero.
We can also use gradient descent with multiple parameters; say we are trying to minimize the loss with respect to both $b_1$ and $b_0$, we can take the partial derivative of the loss function with respect to each parameter and update both in every iteration, as sketched below. The 0 and 1 used as coefficients are random initial values for the first iteration; they will be changed to the updated values after each iteration ends.
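Here is a minimal Python sketch of this on the sample dataset above (illustrative only, not the repository's `train.py`; the learning rate, iteration count, and the 0 and 1 starting values are assumptions):

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 5], dtype=float)

b0, b1 = 0.0, 1.0      # random initial values for the first iteration
learning_rate = 0.01   # arbitrary value; too large and the steps diverge
n_iterations = 10_000  # assumed stopping condition for this sketch

for _ in range(n_iterations):
    residuals = y - (b0 + b1 * x)
    # Partial derivatives of the sum of squared residuals w.r.t. b0 and b1
    d_b0 = -2 * np.sum(residuals)
    d_b1 = -2 * np.sum(x * residuals)
    # step size = derivative * learning rate; new value = old value - step size
    b0 -= learning_rate * d_b0
    b1 -= learning_rate * d_b1

print(b0, b1)  # converges towards 2.2 and 0.6, matching least squares
```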
The Z-score is a metric that measures how far a data point is from the mean, and the standard deviation is the unit of measurement for the z-score. E.g. if the Z-score of the value 69 is 0.2, then 69 is 0.2 standard deviations away from the mean.
As such, the magnitude of 1 standard deviation needs to be derived from the dataset itself, so that the z-score calculations remain relative. Naturally, a dataset which is more spread out will have a bigger standard deviation, because when we want to determine the z-score, we have to take bigger steps to cover the same relative distance, and vice-versa.
> Standard deviation: $\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2}$
>
> Z score: $z = \frac{x - \mu}{\sigma}$
Z score visualization (mean = 85, sd = 7.9):

| 61.3 | 69.2 | 77.1 | 85 | 92.9 | 100.8 | 108.7 |
|---|---|---|---|---|---|---|
| mean - 3(sd) | mean - 2(sd) | mean - sd | mean | mean + sd | mean + 2(sd) | mean + 3(sd) |
| z = -3 | z = -2 | z = -1 | z = 0 | z = 1 | z = 2 | z = 3 |
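For example, using these numbers, the z-score of 100.8 works out to $z = \frac{100.8 - 85}{7.9} = 2$.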
The problem: when approaching values by step size like in gradient descent, there are cases where one feature (axis) might overpower or have a much greater range of change than the other. This causes unnecessary updates for the less dominant feature and can make it oscillate or, even worse, diverge before the optimal minimum is reached for both features.
Feature scaling mitigates this by manipulating the data so that it fits certain constraints, which results in a faster learning / approximation process.
Normalization, aka min-max scaling, is a feature scaling method which scales the data so that the minimum of the data is 0 and the maximum is 1. This decreases the magnitude of the changes to ease the approximation process. This feature scaling retains the relative ratio between data points, but it is sensitive to outliers. E.g.:
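For instance, take the hypothetical values 1, 2, 3 together with an outlier of increasing size (these numbers are only for illustration):

data | min-max scaled |
---|---|
1, 2, 3, 100 | 0, 0.010, 0.020, 1 |
1, 2, 3, 10000 | 0, 0.0001, 0.0002, 1 |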
As we can see, the more drastic the outlier, the closer together the values 1, 2, and 3 get scaled. This can cause them to be treated as equal in some cases, which causes data loss.
Min-max formula:

$$x_{scaled} = \frac{x - x_{min}}{x_{max} - x_{min}}$$
Standardization is a feature scaling method which shifts the data so that the mean is 0 and the standard deviation is 1. This does not retain the original scale of the data.
Standardization formula:

$$x_{scaled} = \frac{x - \mu}{\sigma}$$
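A minimal Python sketch of both scaling methods, assuming a 1-D numpy array as the feature (illustrative, not the repository's code):

```python
import numpy as np

def min_max_scale(x: np.ndarray) -> np.ndarray:
    """Normalization: squeeze the data into the range [0, 1]."""
    return (x - x.min()) / (x.max() - x.min())

def standardize(x: np.ndarray) -> np.ndarray:
    """Standardization: shift to mean 0 and scale to standard deviation 1."""
    return (x - x.mean()) / x.std()

feature = np.array([1, 2, 3, 100], dtype=float)
print(min_max_scale(feature))  # [0.     0.0101 0.0202 1.    ]
print(standardize(feature))    # result has mean ~0 and standard deviation 1
```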
Although z-score scaling might look like it eliminates the relative scale between data points, it does not, since it only subtracts a constant (the mean) and divides by a constant (the standard deviation), which preserves the ordering and the relative distances between points.
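For example, the equally spaced values 2, 4 and 6 have mean 4 and standard deviation $\sqrt{8/3} \approx 1.63$, and their z-scores of roughly $-1.22$, $0$ and $1.22$ are still equally spaced.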
Activate env
source env/bin/activate
Run training program
python train.py
Run Prediction program
python predict.py
I also did some comparison between GD and the least squares method, and here are the results:
You can tune the learning rate and the other parameters to optimize the time, but in my case of a dataset of 200 data points with ±40 random deviation, any learning rate bigger than the ones shown above will diverge. Notice how the 0.5 learning rate performs slower than the ones with smaller learning rates; I'm guessing it is oscillating around the true minimum approximation until it stops, which creates a lot of unnecessary computations.
I didn't test with a larger data size or more iterations because those would take forever to run on my machine, but I hope to see more robust and complete benchmarks of this, especially with more options for the parameters, to get a good idea of how to estimate those values.
Here is the benchmark code