---
tags: ISLR
---

# ISLR hw4

## Q4

### a

The question is not entirely clear. If the relationship between X and Y is exactly a straight line, the linear model matches the true form; even so, the cubic regression is more flexible and will sit at least as close to the training data, so its training RSS will be no higher than that of the linear regression, and with noisy data it is usually slightly lower.

### b

The cubic regression should have the higher test RSS: when the true relationship is linear, the extra flexibility mostly fits noise, so the cubic model tends to overfit.

### c

The cubic regression would have the lower training RSS, because its higher capacity lets it follow a more complex (non-linear) relationship, and a more flexible model never fits the training data worse.

### d

There is not enough information to tell. The test RSS depends on how far the true relationship is from linear and on how similar the test data are to the training data, so the cubic fit could end up with either a lower or a higher test RSS than the linear fit.

## Q5

pass

## Q6

The least squares line is $y = \beta_0 + \beta_1 x$, and the least squares estimates satisfy

$\beta_0 = \bar{y} - \beta_1 \bar{x}$

Plugging $x = \bar{x}$ into the fitted line gives

$\hat{y} = \beta_0 + \beta_1 \bar{x} = (\bar{y} - \beta_1 \bar{x}) + \beta_1 \bar{x} = \bar{y}$

which proves that the least squares line always passes through $(\bar{x}, \bar{y})$.
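As a quick numerical sanity check of this result (the simulated data and variable names below are only for illustration, not part of the exercise):

```r
# Check that the fitted least squares line passes through (mean(x), mean(y)).
# The data here are made up purely for illustration.
set.seed(42)
x_chk <- rnorm(50)
y_chk <- 3 + 2 * x_chk + rnorm(50)
fit_chk <- lm(y_chk ~ x_chk)

# The prediction at x = mean(x_chk) should equal mean(y_chk)
predict(fit_chk, newdata = data.frame(x_chk = mean(x_chk)))
mean(y_chk)
```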
## Q11

```r
set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100)
```

### a

The null hypothesis is $H_0 : \beta = 0$.

```r
q11_lm <- lm(y ~ x + 0)
summary(q11_lm)
# Call:
# lm(formula = y ~ x + 0)
#
# Residuals:
#     Min      1Q  Median      3Q     Max
# -1.9154 -0.6472 -0.1771  0.5056  2.3109
#
# Coefficients:
#   Estimate Std. Error t value Pr(>|t|)
# x   1.9939     0.1065   18.73   <2e-16 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 0.9586 on 99 degrees of freedom
# Multiple R-squared:  0.7798, Adjusted R-squared:  0.7776
# F-statistic: 350.7 on 1 and 99 DF,  p-value: < 2.2e-16
```

The p-value of the t-test is far below 0.001, so the null hypothesis is rejected; the estimated slope 1.9939 is close to the true value of 2.

### b

```r
q11b_lm <- lm(x ~ y + 0)
summary(q11b_lm)
# Call:
# lm(formula = x ~ y + 0)
#
# Residuals:
#     Min      1Q  Median      3Q     Max
# -0.8699 -0.2368  0.1030  0.2858  0.8938
#
# Coefficients:
#   Estimate Std. Error t value Pr(>|t|)
# y  0.39111    0.02089   18.73   <2e-16 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 0.4246 on 99 degrees of freedom
# Multiple R-squared:  0.7798, Adjusted R-squared:  0.7776
# F-statistic: 350.7 on 1 and 99 DF,  p-value: < 2.2e-16
```

The p-value of the t-test is again far below 0.001, so the null hypothesis is rejected.

### c

The regressions in (a) and (b) are fit to the same data generated from $y = 2x + \varepsilon$, just with the roles of response and predictor swapped, so it is no surprise that they give the same t-statistic (18.73), the same $R^2$, and the same p-value.

### d

Skipping the algebraic proof. The exercise gives the t-statistic for the no-intercept regression of $y$ onto $x$ as

$t = \dfrac{\sqrt{n-1}\,\sum_i x_i y_i}{\sqrt{\sum_i x_i^2 \sum_i y_i^2 - \left(\sum_i x_i y_i\right)^2}}$

which can be checked numerically (it should reproduce the t-statistic reported by `lm()` in part (a)):

```r
x_len <- length(x)
p1 <- sqrt(x_len - 1) * (x %*% y)
p2 <- sqrt(sum(x^2) * sum(y^2) - (x %*% y)^2)
p1 / p2
#          [,1]
# [1,] 23.16648
```

### e

Swapping the positions of $x$ and $y$ leaves the formula unchanged, because it is symmetric in $x$ and $y$, so the t-statistic for regressing $x$ onto $y$ is exactly the same:

```r
y_len <- length(y)
p1 <- sqrt(y_len - 1) * (y %*% x)
p2 <- sqrt(sum(y^2) * sum(x^2) - (y %*% x)^2)
p1 / p2
#          [,1]
# [1,] 23.16648
```

### f

```r
q11f_lm <- lm(y ~ x)
summary(q11f_lm)
# Call:
# lm(formula = y ~ x)
#
# Residuals:
#      Min       1Q   Median       3Q      Max
# -1.73798 -0.67412  0.07978  0.72408  1.95851
#
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.004363   0.092613   0.047    0.963
# x           2.037699   0.091071  22.375   <2e-16 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 0.8995 on 98 degrees of freedom
# Multiple R-squared:  0.8363, Adjusted R-squared:  0.8346
# F-statistic: 500.6 on 1 and 98 DF,  p-value: < 2.2e-16
```

```r
q11f2_lm <- lm(x ~ y)
summary(q11f2_lm)
# Call:
# lm(formula = x ~ y)
#
# Residuals:
#     Min      1Q  Median      3Q     Max
# -1.2345 -0.2643  0.0074  0.2856  0.9842
#
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)
# (Intercept)  0.03786    0.04139   0.915    0.363
# y            0.41041    0.01834  22.375   <2e-16 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 0.4037 on 98 degrees of freedom
# Multiple R-squared:  0.8363, Adjusted R-squared:  0.8346
# F-statistic: 500.6 on 1 and 98 DF,  p-value: < 2.2e-16
```

The t-statistic for the slope is 22.375 in both regressions, so the same result holds when an intercept is included.

## Q13

### a

```r
x <- rnorm(100)
```

### b

```r
eps <- rnorm(100, sd = sqrt(0.25))
```

### c

```r
y <- -1 + 0.5 * x + eps
length(y)
```

The length of `y` is 100; $\beta_0 = -1$ and $\beta_1 = 0.5$.

### d

```r
plot(x, y)
```

![](https://i.imgur.com/Uwe5qSO.png)

The scatter plot shows a positive, roughly linear relationship between `x` and `y`, with noise around the line.

### e

```r
q13d_lm <- lm(y ~ x)
summary(q13d_lm)
# Call:
# lm(formula = y ~ x)
#
# Residuals:
#      Min       1Q   Median       3Q      Max
# -1.46738 -0.33594 -0.00888  0.37595  1.14212
#
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)
# (Intercept) -0.97275    0.05538 -17.563   <2e-16 ***
# x            0.53553    0.05369    9.975   <2e-16 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 0.5468 on 98 degrees of freedom
# Multiple R-squared:  0.5038, Adjusted R-squared:  0.4987
# F-statistic: 99.49 on 1 and 98 DF,  p-value: < 2.2e-16
```

The estimates $\hat{\beta}_0 \approx -0.97$ and $\hat{\beta}_1 \approx 0.54$ are close to the true values $-1$ and $0.5$. Both coefficients have large t-statistics and very small p-values, so we can reject $H_0: \beta = 0$ for each of them.

### f

```r
plot(x, y)
abline(q13d_lm, col = 'red')   # fitted least squares line
abline(-1, 0.5, col = 'blue')  # population regression line y = -1 + 0.5x
legend('bottomright', c('Least square', 'Regression'), col = c('red', 'blue'), lwd = 3)
```

![](https://i.imgur.com/SJKocld.png)

### g

```r
q13g_lm <- lm(y ~ x + I(x ^ 2))
summary(q13g_lm)
# Call:
# lm(formula = y ~ x + I(x^2))
#
# Residuals:
#      Min       1Q   Median       3Q      Max
# -1.50259 -0.34018  0.01148  0.35115  1.13079
#
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)
# (Intercept) -0.90733    0.06574 -13.802  < 2e-16 ***
# x            0.51746    0.05403   9.577  1.1e-15 ***
# I(x^2)      -0.06426    0.03572  -1.799   0.0751 .
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 0.5407 on 97 degrees of freedom
# Multiple R-squared:  0.5198, Adjusted R-squared:  0.5099
# F-statistic: 52.5 on 2 and 97 DF,  p-value: 3.54e-16
```

The p-value for the $x^2$ term (0.0751) is not small enough to conclude that a quadratic coefficient $\beta_2$ is needed, so there is little evidence that the quadratic term improves the fit.

### h, i, j

pass
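Although these parts are skipped, a minimal sketch of what (h)–(j) ask, reusing `x` and `q13d_lm` from above: repeat the simulation with less and with more noise in the errors and compare the coefficient confidence intervals. The noise levels 0.1 and 1 below are my own choices, not values from the exercise.

```r
# Sketch only: rerun the simulation with smaller and larger error variance
# (sd values 0.1 and 1 are assumptions, not taken from the exercise),
# then compare confidence intervals for the coefficients.
eps_less <- rnorm(100, sd = 0.1)
y_less   <- -1 + 0.5 * x + eps_less
fit_less <- lm(y_less ~ x)

eps_more <- rnorm(100, sd = 1)
y_more   <- -1 + 0.5 * x + eps_more
fit_more <- lm(y_more ~ x)

confint(q13d_lm)   # original fit (error sd = 0.5)
confint(fit_less)  # less noise: intervals should be narrower
confint(fit_more)  # more noise: intervals should be wider
```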