# Linear Regression

## Articles

There is no shortage of linear regression articles; here are some of the best and clearest that have working code examples. Linear regression is one of many regression-based tools for determining which variables have an effect on which other variables. It's used in science and business all the time! These tools are very handy, but you can fool yourself if you misuse them.

* [Real Python Linear Regression](https://realpython.com/linear-regression-in-python/)
* https://www.w3schools.com/python/python_ml_linear_regression.asp
* https://en.wikipedia.org/wiki/Regression_analysis
* [Linear Regressions Explained Simply](https://www.youtube.com/watch?v=7ArmBVF2dCs)

Reading about regression on Wikipedia will probably feel demotivating. There's so much dense math, and it breaks apart into complex detail that never really gets tied together at the end. It's a complex topic, and it's addressed really well in the Think Stats and Think Bayes books by Allen B. Downey (a heroic educator I learned, and continue to learn, a lot from).

- https://greenteapress.com/wp/think-stats-2e/
- http://allendowney.github.io/ThinkBayes2/

These books have exercises and code in them! They're courses unto themselves. For this class, we'll focus on a single chapter of one book: chapter one of Think Stats. This is where we'll start to get into simple linear regression.

## Confusing Terms Explained Simply

Coefficient - the number a variable gets multiplied by (the m in y = mx + b)

Estimation / Prediction - using one variable to predict another (used to establish good Lead Measures)

"Errors" or Residuals - the distance of each data point from the line we're looking at; how far off the data is from the line

Squared Errors - the distance from the line, squared (squaring makes every error positive and weights big misses more heavily)

Sum of Squared Errors - how far all the data is, cumulatively, from the particular value we're comparing it against

```python=
## single dimension of variation
data = [3, 3, 4, 4, 2, 1, 6, 7, 9, 14]

def get_variation_around_mean(data):
    the_mean = sum(data) / len(data)
    sum_of_squared_errors = 0
    for datum in data:
        residual = datum - the_mean
        squared_error = residual * residual
        sum_of_squared_errors += squared_error
    variation_around_the_mean = sum_of_squared_errors / len(data)
    return sum_of_squared_errors, variation_around_the_mean
```

Ordinary Least Squares - used to find the "line of best fit" that minimizes the sum of the squared errors (SSE) between the observed values of the response variable and the values predicted by the linear equation. The line of best fit is determined by finding the parameters (coefficients) for the equation that minimize the SSE.
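For a single x variable, the slope and intercept that minimize the SSE don't have to be found by trial and error; there is a closed-form formula (slope = the sum of (x - mean of x) times (y - mean of y), divided by the sum of (x - mean of x) squared; intercept = mean of y minus slope times mean of x). Below is a minimal sketch of that formula in plain Python; the function name and the sample points are just for illustration, not part of the class code.

```python=
## closed-form ordinary least squares for a single x variable (illustrative sketch)
def ols_line(data):
    xs = [datum["x"] for datum in data]
    ys = [datum["y"] for datum in data]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    ## slope = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x) ** 2)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    slope = numerator / denominator
    intercept = mean_y - slope * mean_x
    return slope, intercept

## made-up points that lie roughly on y = 0.5x + 1
points = [{"x": 1, "y": 1.4}, {"x": 2, "y": 2.1}, {"x": 3, "y": 2.4}, {"x": 4, "y": 3.1}]
print(ols_line(points))
```

The next code block takes the brute-force route instead: try a grid of candidate slopes and intercepts and keep the pair with the smallest sum of squared errors.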
```python=
import math

## two dimensions of distance
## using the slope intercept formula (y = mx + b) to predict a value for y, given x
## this function gets the difference between the predicted value and the actual value,
## which we want to minimize with ordinary least squares
def distance(x_coefficient, x_value, y_intercept, actual_value):
    predicted_value = (x_coefficient * x_value) + y_intercept
    return actual_value - predicted_value

## given line parameters, determine the sum of squared errors
def fit_data_to_line(x_coefficient, y_intercept, data):
    sum_of_squared_errors = 0
    for datum in data:
        x = datum.get("x")
        y = datum.get("y")
        residual = distance(x_coefficient, x, y_intercept, y)
        squared_error = residual * residual
        sum_of_squared_errors += squared_error
    variation_around_fit = sum_of_squared_errors / len(data)
    return sum_of_squared_errors, variation_around_fit

def determine_linear_fit_sse(data):
    results = {}  ## records every candidate line we try (handy for inspection)
    smallest_sse = (math.inf, None, None, None)
    ## check every coefficient between 0 and 1 in steps of 0.01
    for step in range(0, 101):
        x_coefficient = step * 0.01
        for y_intercept in range(0, 5):
            sse, fit_variation = fit_data_to_line(x_coefficient, y_intercept, data)
            results[(x_coefficient, y_intercept)] = (sse, fit_variation)
            if sse < smallest_sse[0]:
                smallest_sse = (sse, fit_variation, x_coefficient, y_intercept)
    return smallest_sse
```

Fit - how well a particular line matches the data; the best fit is the line whose slope and intercept produce the smallest sum of squared errors (the ordinary least squares value)

Variation Around the Mean - Sum of Squared Errors / Count of Data Points (this is the variance)

R^2 (r squared) / Coefficient of Determination - how much of the variation in y can be explained by x. Computed after finding the smallest-SSE line: (variation around the mean - variation around the best-fit line) / variation around the mean. The scikit-learn sketch at the end of this page prints this value via `.score()`.

Regression - checking candidate lines (slopes and intercepts) to figure out which one has the best fit

One-hot - a binary number like 00010000 or 00000001 where there is only one 1, and the rest are 0s. The position of the 1 usually represents one of a small (less than 8) number of categories. Used because one-hot values are "cheap" to store and fast to work with. See also: Bitmask. (A short pandas sketch of one-hot encoding appears at the end of this page.)

## Python Packages

- Scikit-Learn
- statsmodels
- Pandas
- NumPy

## Kaggle Examples

- https://www.kaggle.com/code/lizhoward/linear-regression-on-medical-insurance
- https://www.kaggle.com/code/ashydv/sales-prediction-simple-linear-regression
- https://www.kaggle.com/datasets/podsyp/regression-with-categorical-data
- https://www.kaggle.com/code/anurag629/medical-cost-multivariate-linear-regression

Pandas selection basics: https://pandas.pydata.org/docs/user_guide/10min.html#selection

Careful not to make [Spurious Correlations](https://www.tylervigen.com/spurious-correlations).
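To tie the terms above back to the packages listed: here is a minimal sketch of a simple linear regression done with scikit-learn and NumPy (assuming both are installed; the sample data is made up for illustration). `LinearRegression.fit()` finds the coefficient and intercept, and `.score()` returns the R^2 / coefficient of determination described earlier.

```python=
## minimal scikit-learn sketch (the sample data is made up)
import numpy as np
from sklearn.linear_model import LinearRegression

## x must be two-dimensional: rows are observations, columns are features
x = np.array([[1], [2], [3], [4], [5]])
y = np.array([1.2, 2.1, 2.9, 4.2, 4.8])

model = LinearRegression()
model.fit(x, y)

print(model.coef_)        ## the x coefficient (slope)
print(model.intercept_)   ## the y intercept
print(model.score(x, y))  ## R^2, the coefficient of determination
```

statsmodels fits the same kind of model but reports more statistical detail (standard errors, p-values), which is why both packages show up in regression tutorials.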
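Several of the Kaggle datasets above include categorical columns, which is where the one-hot encoding mentioned earlier comes in. Here is a minimal sketch using pandas (the column names and values are made up for illustration); `pd.get_dummies()` turns each category into its own 0/1 column, which a linear regression can then use alongside the numeric columns.

```python=
## minimal one-hot encoding sketch with pandas (column names are made up)
import pandas as pd

df = pd.DataFrame({
    "age": [23, 35, 47],
    "region": ["north", "south", "north"],
})

## each region value becomes its own 0/1 column: region_north, region_south
one_hot = pd.get_dummies(df, columns=["region"])
print(one_hot)
```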