{%hackmd 8nGPjMOiTy-0DWU2xy0q0Q %}

# Simple Linear Regression

[Home Page](/_9v1g3C3TXmUfBbxkjnn0A)

[toc]

## Formula

| Name                         | Notation | Definition                              |
|:---------------------------- |:-------- |:--------------------------------------- |
| Sum of square total          | $SSTO$   | $\sum_{i = 1}^{n} (y_i - \ol y)^2$      |
| Sum of square errors         | $SSE$    | $\sum_{i = 1}^{n} (y_i - \hat y_i)^2$   |
| Sum of square regression     | $SSR$    | $\sum_{i = 1}^{n} (\hat y_i - \ol y)^2$ |
| Mean squared error           | $MSE$    | $SSE / (n - p)$                         |
| Mean squared regression      | $MSR$    | $SSR / (p - 1)$                         |
| Coefficient of Determination | $R^2$    | $1 - SSE / SSTO = SSR / SSTO$           |

Here $p$ is the number of regression parameters; for simple linear regression $p = 2$.

## Simple Linear Regression

Given data $(X_1, Y_1), (X_2, Y_2), \cdots, (X_n, Y_n)$, we use the *predictor* $X$ to predict the *response* $Y$. Ideally we would find a *deterministic* model
$$
Y_i = f (X_i)
$$
But the observations contain random noise, so with $\varepsilon_i$ i.i.d. random variables the model becomes
$$
y_i = f (x_i) + \varepsilon_i
$$
We want to fit a *simple linear regression* function $f$, where $\hat y_i$ is the *fitted value*
$$
\hat y_i = f (x_i) = \beta_0 + \beta_1 x_i
$$

## Least Squares Estimate (LSE)

The *residual* of the $i$-th observation is
$$
e_i = y_i - \hat y_i
$$
and the *residual sum of squares (RSS)* is
$$
RSS = \sum_{i = 1}^{n} e_i^2 = \sum_{i = 1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2
$$
The *least squares estimate (LSE)* of $(\beta_0, \beta_1)$ is
$$
(b_0, b_1) = \argmin_{(\beta_0, \beta_1) \in \mathbb R^2} \ RSS
$$

### Properties

#### Properties of LSE

Define the sample means of the data as
$$
\ol{x} = \frac{1}{n} \sum_{i = 1}^{n} x_i, \qquad \ol{y} = \frac{1}{n} \sum_{i = 1}^{n} y_i
$$
The LSE has the closed form
$$
b_0 = \overline{y} - b_1 \overline{x}
$$
and
$$
b_1 = \frac{\sum_{i = 1}^{n} (x_i - \overline{x}) (y_i - \overline{y})}{\sum_{i = 1}^{n} (x_i - \overline{x})^2} = \frac{\sum_{i = 1}^{n} x_i (y_i - \overline{y})}{\sum_{i = 1}^{n} (x_i - \overline{x})^2} = \frac{\sum_{i = 1}^{n} (x_i - \overline{x}) y_i}{\sum_{i = 1}^{n} (x_i - \overline{x})^2}
$$
The regression line passes through the point of means $(\overline{x}, \overline{y})$, i.e.
$$
\overline{y} = b_0 + b_1 \overline{x}
$$

#### Properties of Residuals

$$
\sum_{i = 1}^{n} e_i = \sum_{i = 1}^{n} e_i x_i = \sum_{i = 1}^{n} e_i \hat y_i = 0
$$

# Normal Assumption

For the model
$$
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i
$$
assume $\varepsilon_i \iid N (0, \sigma^2)$, that is,

1. $E (\varepsilon_i) = 0$
2. $\var (\varepsilon_i) = \sigma^2$
3. $\cov (\varepsilon_i, \varepsilon_j) = 0$ for $i \ne j$

This implies that the $Y_i$ are independent with $Y_i \sim N (\beta_0 + \beta_1 x_i, \sigma^2)$, that is,

1. $E (Y_i) = \beta_0 + \beta_1 x_i$
2. $\var (Y_i) = \sigma^2$
3. $\cov (Y_i, Y_j) = 0$ for $i \ne j$

## Square Errors (SE's)

Define the *sum of square errors (SSE)*
$$
SSE = \sum_{i = 1}^{n} e_i^2 = \sum_{i = 1}^{n} (Y_i - \hat Y_i)^2
$$
with expectation $E (SSE) = (n - 2) \sigma^2$. Define the *mean squared error (MSE)* by
$$
MSE = \frac{SSE}{n - 2}
$$
which is an unbiased estimator of $\sigma^2$, i.e. $E (MSE) = \sigma^2$.
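The closed-form LSE, the residual identities, and $MSE$ as an estimate of $\sigma^2$ are easy to check numerically. Below is a minimal NumPy sketch (not part of the original notes; the synthetic data, the seed, and the parameter values $\beta_0 = 2$, $\beta_1 = 0.5$, $\sigma = 1$ are illustrative assumptions):

```python
# Minimal numerical check of the closed-form LSE, the residual properties,
# and MSE as an estimate of sigma^2, on synthetic normal-error data.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y_i = beta_0 + beta_1 x_i + eps_i, with eps_i ~ N(0, sigma^2)
n, beta0, beta1, sigma = 50, 2.0, 0.5, 1.0
x = rng.uniform(0, 10, size=n)
y = beta0 + beta1 * x + rng.normal(0, sigma, size=n)

# Closed-form least squares estimates
x_bar, y_bar = x.mean(), y.mean()
Sxx = np.sum((x - x_bar) ** 2)
b1 = np.sum((x - x_bar) * (y - y_bar)) / Sxx
b0 = y_bar - b1 * x_bar

# Fitted values, residuals, SSE, and MSE = SSE / (n - 2)
y_hat = b0 + b1 * x
e = y - y_hat
SSE = np.sum(e ** 2)
MSE = SSE / (n - 2)

print("b0, b1:", b0, b1)
# Residual identities: each sum should be zero up to floating-point rounding
print("sum e_i, sum e_i x_i, sum e_i yhat_i:", e.sum(), (e * x).sum(), (e * y_hat).sum())
print("MSE (estimates sigma^2 = 1):", MSE)
```

The three residual sums should print at the order of machine precision, and $MSE$ should land near the true $\sigma^2 = 1$; averaging $MSE$ over repeated simulations illustrates the unbiasedness $E (MSE) = \sigma^2$.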
Estimate the standard errors of $b_0$ and $b_1$ by
$$
s^2 (b_0) = MSE \left( \frac{1}{n} + \frac{\overline{x}^2}{\sum_{i = 1}^{n} (x_i - \overline{x})^2} \right)
$$
and
$$
s^2 (b_1) = MSE \left( \frac{1}{\sum_{i = 1}^{n} (x_i - \overline{x})^2} \right)
$$

### Distribution of SE's

The distributions of $b_0$ and $b_1$ are
$$
\begin{align*}
b_0 & \sim N (\beta_0, s^2 (\beta_0)) = N \brs{ \beta_0, \sigma^2 \brs{\frac{1}{n} + \frac{\ol x^2 }{\sum_{i = 1}^{n} (x_i - \ol x)^2} }} \\
b_1 & \sim N (\beta_1, s^2 (\beta_1)) = N \brs{ \beta_1, \sigma^2 \brs{\frac{1}{\sum_{i = 1}^{n} (x_i - \ol x)^2}} }
\end{align*}
$$
where $s^2 (\beta_j)$ denotes the variance of $b_j$ with the true $\sigma^2$ plugged in, while $s^2 (b_j)$ above replaces $\sigma^2$ by $MSE$.

The distribution of $SSE$ is
$$
\frac{SSE}{\sigma^2} \sim \chi_{n - 2}^2
$$
so that, for $j = 0, 1$,
$$
\frac{b_j - \beta_j}{s (b_j)} \sim t_{n - 2}
$$

:::spoiler Proof
$$
\begin{align*}
\frac{b_j - \beta_j}{s (b_j)} & = \frac{b_j - \beta_j}{s (\beta_j)} / \frac{s (b_j)}{s (\beta_j)} \\
& = \frac{b_j - \beta_j}{s (\beta_j)} / \sqrt{\frac{MSE}{\sigma^2}} \\
& \overset{D}{=} N (0, 1) / \sqrt{ \frac{\chi_{n - 2}^2}{n - 2} } \\
& \sim t_{n - 2}
\end{align*}
$$
using that $b_j$ and $SSE$ are independent under the normal assumption.
:::

## Hypothesis Test

### Hypothesis Test

To test
$$
H : \beta_j = c \quad \text{vs.} \quad A : \beta_j \ne c
$$
let
$$
t^* = \frac{b_j - c}{s (b_j)}
$$
At level $\alpha \in (0, 1)$:

- If $| t^* | \leq t_{n - 2} (\alpha / 2)$, conclude $H$
- If $| t^* | > t_{n - 2} (\alpha / 2)$, conclude $A$

where $t_{d} (q)$ is the $1 - q$ quantile of the $t$ distribution with $d$ degrees of freedom.

### Confidence Interval

A $1 - \alpha$ confidence interval for $\beta_j$ is
$$
\brs{ b_j \pm s (b_j) \cdot t_{n - 2} (\alpha / 2) }
$$

### Prediction Interval

Given a new point $x_h$, the predicted value is
$$
\hat y_h = b_0 + b_1 x_h
$$
Assume $Y_h \sim N (\beta_0 + \beta_1 x_h, \sigma^2)$, independent of the original sample, so
$$
Y_h - \hat Y_h \sim N (0, s^2 (pred))
$$
where
$$
s^2 (pred) = \sigma^2 \brs{ 1 + \frac{1}{n} + \frac{(x_h - \ol x)^2}{\sum_{i = 1}^{n} (x_i - \ol x)^2} }
$$
In practice the unknown $\sigma^2$ is replaced by $MSE$ when computing $s (pred)$.

:::spoiler Proof
Since $Y_h$ is independent of $\hat Y_h$,
$$
\begin{align*}
s^2 (pred) & = s^2 (Y_h - \hat Y_h) \\
& = s^2 (Y_h) + s^2 (\hat Y_h) \\
& = \sigma^2 + \sigma^2 \brs{\frac{1}{n} + \frac{(x_h - \ol x)^2}{\sum_{i = 1}^{n} (x_i - \ol x)^2} } \\
& = \sigma^2 \brs{ 1 + \frac{1}{n} + \frac{(x_h - \ol x)^2}{\sum_{i = 1}^{n} (x_i - \ol x)^2} }
\end{align*}
$$
:::

A $1 - \alpha$ prediction interval for $Y_h$ is
$$
\brs{ \hat y_h \pm s (pred) \cdot t_{n - 2} (\alpha / 2) }
$$

# Analysis of Variance (ANOVA)

## Coefficient of Determination (R square)

Define the *coefficient of determination*
$$
R^2 = \frac{SSR}{SSTO} = 1 - \frac{SSE}{SSTO}
$$
An $R^2$ close to $1$ indicates that the linear model fits the data well, but note that $R^2$ tends to increase with the spread of the predictor, $\sum_{i = 1}^{n} (x_i - \ol x)^2$.

Limitations:

1. $R^2 \approx 1$ does not imply useful prediction performance
2. $R^2 \approx 1$ does not imply that the fitted model is the correct model
3. $R^2 \approx 0$ does not imply that $X$ and $Y$ are unrelated; the relationship may be non-linear
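To see the inference results above in one place, here is a self-contained sketch of the $t$-test for $H : \beta_1 = 0$, the confidence interval for $\beta_1$, the prediction interval at a new point, and $R^2$ (again not from the original notes; the synthetic data, the level $\alpha = 0.05$, and the point $x_h = 5$ are assumptions made for illustration, and SciPy is used only for the quantile $t_{n - 2} (\alpha / 2)$):

```python
# Hedged sketch: t-test for beta_1, confidence interval, prediction interval,
# and R^2 for a simple linear regression fit on illustrative synthetic data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 50
x = rng.uniform(0, 10, size=n)
y = 2.0 + 0.5 * x + rng.normal(0, 1.0, size=n)   # y_i = beta_0 + beta_1 x_i + eps_i

# Least squares fit (closed form from the LSE section)
x_bar, y_bar = x.mean(), y.mean()
Sxx = np.sum((x - x_bar) ** 2)
b1 = np.sum((x - x_bar) * (y - y_bar)) / Sxx
b0 = y_bar - b1 * x_bar
e = y - (b0 + b1 * x)
SSE = np.sum(e ** 2)
MSE = SSE / (n - 2)

alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)    # t_{n-2}(alpha / 2)

# Test H: beta_1 = 0 vs A: beta_1 != 0
s_b1 = np.sqrt(MSE / Sxx)
t_star = b1 / s_b1
conclude_A = abs(t_star) > t_crit

# 1 - alpha confidence interval for beta_1
ci_b1 = (b1 - t_crit * s_b1, b1 + t_crit * s_b1)

# 1 - alpha prediction interval for Y_h at a new x_h (sigma^2 estimated by MSE)
x_h = 5.0
y_h_hat = b0 + b1 * x_h
s_pred = np.sqrt(MSE * (1 + 1 / n + (x_h - x_bar) ** 2 / Sxx))
pi_yh = (y_h_hat - t_crit * s_pred, y_h_hat + t_crit * s_pred)

# Coefficient of determination
SSTO = np.sum((y - y_bar) ** 2)
R2 = 1 - SSE / SSTO

print(f"t* = {t_star:.2f}, conclude A: {conclude_A}")
print(f"{1 - alpha:.0%} CI for beta_1: ({ci_b1[0]:.3f}, {ci_b1[1]:.3f})")
print(f"{1 - alpha:.0%} PI for Y_h at x_h = {x_h}: ({pi_yh[0]:.3f}, {pi_yh[1]:.3f})")
print(f"R^2 = {R2:.3f}")
```

Since the data here are generated with a nonzero slope, the test should conclude $A$ (reject $H : \beta_1 = 0$), and the prediction interval is wider than the confidence interval because $s^2 (pred)$ carries the extra $1$ for the new observation's own noise.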