# Linear regression
###### tags: `linear regression`, `multiple linear regression`, `multivariate linear regression`, `regression`

TL;DR: In sklearn.linear_model.LinearRegression,
- If you pass one y, it is multiple linear regression, even if there is only one x.
- If you pass multiple y's, it is multivariate linear regression.
- If you have multiple y's & would like one regressor for each y, apply MultiOutputRegressor.
- [jupyter notebook](https://github.com/syhsu/sklearn_linear_regression/blob/main/linear_regression.ipynb)
- Since the module uses the [plain ordinary least squares method](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html), it fits each y on its own and ignores any covariance among the y's, that is, it treats them as independent of each other (covariance=0).

## Introduction
We use **sklearn.linear_model.LinearRegression** to demonstrate the difference between **multiple linear regression & multivariate linear regression.** Please find the definitions & references below.

### Definitions
**1. Multiple linear regression:** Linear regression with multiple predictor variables
- Multiple inputs & ONE output **(y is a scalar)**

$$ y=f(x_1,x_2,\ldots,x_n) $$

**2. Multivariate linear regression:** Linear regression with a multivariate response variable
- Multiple inputs & multiple outputs **(y is a vector)**,

$$ y_1,y_2,\ldots,y_m=f(x_1,x_2,\ldots,x_n) $$

### Reference
- [matlab](https://nl.mathworks.com/help/stats/linear-regression.html?s_tid=CRUX_lftnav)
- [stackexchange](https://stats.stackexchange.com/a/224234)

## Let's begin
```python=
from random import random

import pandas as pd
import numpy as np
from sklearn import linear_model

# Dummy dataset (ref. from https://stackoverflow.com/a/34172907)
lr = lambda: [random() for _ in range(100)]  # 100 uniform random numbers in [0, 1)
x = pd.DataFrame({'x1': lr(), 'x2': lr(), 'x3': lr()})
# y is an exact linear function of x: slopes 1, 2, 3 and intercept 4
y = x.x1 + x.x2 * 2 + x.x3 * 3 + 4
```
Since the dummy dataset has no noise, the expected regression results are
- R2 = 1,
- coefficients=[1,2,3]
- bias=4

## Multiple linear regression: only ONE target
```python=
model = linear_model.LinearRegression()
model.fit(x[["x1", "x2", "x3"]], y)

# check results
model.score(x[["x1", "x2", "x3"]], y)  # R2
model.coef_  # slopes
model.intercept_  # bias
```
You will see the following results (up to floating-point rounding), which match what we expect.
- 1.0
- array([1., 2., 3.])
- 4.0

## What happens if we input multiple y in LinearRegression()?
Let's use y to create two targets, y1 & y2, and see what LinearRegression does.
```python=
# multivariate linear regression
y_multiple = pd.DataFrame({"y1": y, "y2": y})
model.fit(x[["x1", "x2", "x3"]], y_multiple)

# check results
model.score(x[["x1", "x2", "x3"]], y_multiple)
model.coef_  # a (2,3) matrix
model.intercept_  # a (2,) vector
```
You will see the following results.
- 1.0
- array([[1., 2., 3.], [1., 2., 3.]])
- array([4., 4.])

Apparently it's **multivariate linear regression**, since LinearRegression() fits on multiple y's. **However, due to the module implementation (see the notes section in [link](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)), the covariance among the y's is assumed to be 0, that is, y1 & y2 are treated as independent.**

## What if I want one regressor for each y?
Ans: MultiOutputRegressor
```python=
from sklearn.multioutput import MultiOutputRegressor

wrapper = MultiOutputRegressor(model).fit(x[["x1", "x2", "x3"]], y_multiple)

# check results
wrapper.estimators_
wrapper.estimators_[0].score(x[["x1", "x2", "x3"]], y_multiple['y1'])  # only ONE y
wrapper.estimators_[0].coef_
```
You will see the following results.
- [LinearRegression(), LinearRegression()]
- 1.0
- array([1., 2., 3.])

Since we have two y's, MultiOutputRegressor yields 2 estimators, and each estimator fits on one y. This also explains why y_multiple['y1'] is the input in line 7 when calculating the fitting score.

<!--```python=
>>> from sklearn import linear_model
>>> X = [[1, 2, 3, 4, 5, 6]]
>>> Y = [[1, 2, 3]]
>>> lr = linear_model.LinearRegression()
>>> model = lr.fit(X, Y)
>>> model.predict([[1,2,3,4,5,6]])
array([[ 1., 2., 3.]])
# https://stackoverflow.com/questions/36984987/multivariate-linear-regression-in-python-analog-of-mvregress-in-matlab
```-->

<!--
- [Math definition](https://stats.stackexchange.com/a/224234):
![](https://hackmd.io/_uploads/r16MCuVEt.png)

Reference:
[stackoverflow](https://stackoverflow.com/questions/11479064/multiple-linear-regression-in-python)
[matlab](https://nl.mathworks.com/help/stats/mvregress.html#d123e603515)

## Unequal sizes
[ref](https://www.theanalysisfactor.com/when-unequal-sample-sizes-are-and-are-not-a-problem-in-anova/)-->
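
To check the claim that LinearRegression() handles each y on its own, here is a minimal sketch that compares one joint fit against the per-target fits from MultiOutputRegressor. The dataset is hypothetical and not part of the notebook above: the seed, the coefficients, and the shared noise term are assumptions chosen only to make the two targets correlated.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.multioutput import MultiOutputRegressor

rng = np.random.default_rng(0)  # fixed seed, an arbitrary choice

# Hypothetical dataset: 100 samples, 3 features, 2 targets
X = rng.random((100, 3))
shared_noise = rng.normal(scale=0.1, size=100)  # same noise in both targets, so they are correlated
Y = np.column_stack([
    X @ np.array([1.0, 2.0, 3.0]) + 4.0 + shared_noise,
    X @ np.array([3.0, 2.0, 1.0]) - 1.0 + shared_noise,
])

# One LinearRegression fitted on both targets at once
joint = LinearRegression().fit(X, Y)

# One LinearRegression per target
separate = MultiOutputRegressor(LinearRegression()).fit(X, Y)
separate_coef = np.vstack([est.coef_ for est in separate.estimators_])
separate_intercept = np.array([est.intercept_ for est in separate.estimators_])

# Compare the two sets of parameters up to floating-point tolerance
same_coef = np.allclose(joint.coef_, separate_coef)
same_intercept = np.allclose(joint.intercept_, separate_intercept)
print(same_coef, same_intercept)
```

If both comparisons come out true, the joint fit gained nothing from the correlation between the targets, which is what "covariance=0" means in practice. With this module, MultiOutputRegressor mainly changes how you access the per-target estimators, not the fitted numbers.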