---
tags: ISLR
---

# ISLR hw7

Range: Textbook Chapter 4 exercises: Q4, Q6, Q7, Q10, Q11

Run a simulation experiment to observe and illustrate this phenomenon:

1. Randomly generate 50 observations from a normal distribution, each with 100,000 predictors ($X_1, \dots, X_{100000}$).
2. Generate 50 responses $y$ from a Bernoulli(0.5) distribution and randomly assign them as the responses for the observations from step 1, so the true test error is 50%.
3. Estimate the test error with the *wrong* CV procedure.
4. Estimate the test error with the *correct* CV procedure.

See slides 33–35 of the Chapter 5 slide deck for details.

## ch04 Q4

### a

On average, $1/10 = 10\%$ of the observations.

### b

On average, $\frac{1}{10} \times \frac{1}{10} = \frac{1}{100} = 1\%$.

### c

$0.1^{100} = 10^{-100}$, i.e., essentially no training observations at all.

### d

In a high-dimensional space (large $p$), points lie far apart: as (a)–(c) show, the fraction of training observations "near" a test point shrinks exponentially with $p$. So KNN needs an enormous number of observations when the data have many predictors.

### e

To contain 10% of the data on average, each side of the hypercube must have length $0.1^{1/p}$: 0.1 when $p = 1$, $0.1^{1/2} \approx 0.32$ when $p = 2$, and $0.1^{1/100} \approx 0.98$ when $p = 100$, i.e., nearly the entire range of every predictor.

## Q6

$$
P(X) = \frac{e^{\beta_0+\beta_1X_1+\beta_2X_2}}{1 + e^{\beta_0+\beta_1X_1+\beta_2X_2}} \\
\beta_0 = -6, \beta_1 = 0.05, \beta_2 = 1.
$$

### a

```r
e <- -6 + 0.05 * 40 + 1 * 3.5   # linear predictor for X1 = 40 hours, X2 = 3.5 GPA
exp(e) / (1 + exp(e))
# [1] 0.3775407
```

### b

$x_1$ is unknown, $x_2 = 3.5$. Set the probability to 0.5 and solve:

$$
i = -6 + 0.05 x_1 + 1 \cdot 3.5 = -2.5 + 0.05 x_1 \\
0.5 = \frac{e^i}{1 + e^i} \Rightarrow 0.5 + 0.5 e^i = e^i \Rightarrow e^i = 1 \\
\text{Taking logs: } -2.5 + 0.05 x_1 = 0 \Rightarrow x_1 = 50
$$

The student needs to study 50 hours.

## Q7

The response is Yes or No, so by Bayes' theorem:

$$
Pr(Y=\text{Yes}|X=x)=\frac{\pi_1 f_1(x)}{\pi_1 f_1(x) + \pi_2 f_2(x)} \\
\pi_1 = 0.8, \pi_2 = 0.2
$$

where $f_1(x)$ is the normal density with mean 10 and variance 36 (companies that issue a dividend) and $f_2(x)$ is the normal density with mean 0 and variance 36 (companies that don't).

```r
p1 <- 0.8 * dnorm(4, mean = 10, sd = 6)
p2 <- 0.2 * dnorm(4, mean = 0, sd = 6)
p1 / (p1 + p2)
# [1] 0.7518525
```

The probability is about $75\%$.

## Q10

### a

```r
library(ISLR)
summary(Weekly)
#       Year           Lag1               Lag2               Lag3               Lag4
#  Min.   :1990   Min.   :-18.1950   Min.   :-18.1950   Min.   :-18.1950   Min.   :-18.1950
#  1st Qu.:1995   1st Qu.: -1.1540   1st Qu.: -1.1540   1st Qu.: -1.1580   1st Qu.: -1.1580
#  Median :2000   Median :  0.2410   Median :  0.2410   Median :  0.2410   Median :  0.2380
#  Mean   :2000   Mean   :  0.1506   Mean   :  0.1511   Mean   :  0.1472   Mean   :  0.1458
#  3rd Qu.:2005   3rd Qu.:  1.4050   3rd Qu.:  1.4090   3rd Qu.:  1.4090   3rd Qu.:  1.4090
#  Max.   :2010   Max.   : 12.0260   Max.   : 12.0260   Max.   : 12.0260   Max.   : 12.0260
#       Lag5              Volume            Today           Direction
#  Min.   :-18.1950   Min.   :0.08747   Min.   :-18.1950   Down:484
#  1st Qu.: -1.1660   1st Qu.:0.33202   1st Qu.: -1.1540   Up  :605
#  Median :  0.2340   Median :1.00268   Median :  0.2410
#  Mean   :  0.1399   Mean   :1.57462   Mean   :  0.1499
#  3rd Qu.:  1.4050   3rd Qu.:2.05373   3rd Qu.:  1.4050
#  Max.   : 12.0260   Max.   :9.32821   Max.   : 12.0260
```

```r
pairs(Weekly)
```

![](https://i.imgur.com/mSdVVlx.png)

Year appears to be associated with Volume; no other clear pattern stands out.
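As a quick numeric follow-up to the scatterplot matrix, a minimal check (it assumes Direction, the only qualitative variable, sits in column 9 of `Weekly`):

```r
# Pairwise correlations among the quantitative variables; column 9
# (Direction) is dropped because cor() needs numeric input.
cor(Weekly[, -9])
# The only substantial off-diagonal correlation should be Year-Volume,
# matching the trend in the pairs plot; the lag variables are essentially
# uncorrelated with Today.
```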
### b

The original notes only kept the summary output; the fitting call below is reconstructed from the `Call:` line and the use of `direction_glm` in (c).

```r
direction_formula <- Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume
direction_glm <- glm(direction_formula, family = binomial, data = Weekly)
summary(direction_glm)
# Call:
# glm(formula = direction_formula, family = binomial, data = Weekly)
#
# Deviance Residuals:
#     Min       1Q   Median       3Q      Max
# -1.6949  -1.2565   0.9913   1.0849   1.4579
#
# Coefficients:
#             Estimate Std. Error z value Pr(>|z|)
# (Intercept)  0.26686    0.08593   3.106   0.0019 **
# Lag1        -0.04127    0.02641  -1.563   0.1181
# Lag2         0.05844    0.02686   2.175   0.0296 *
# Lag3        -0.01606    0.02666  -0.602   0.5469
# Lag4        -0.02779    0.02646  -1.050   0.2937
# Lag5        -0.01447    0.02638  -0.549   0.5833
# Volume      -0.02274    0.03690  -0.616   0.5377
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# (Dispersion parameter for binomial family taken to be 1)
#
#     Null deviance: 1496.2  on 1088  degrees of freedom
# Residual deviance: 1486.4  on 1082  degrees of freedom
# AIC: 1500.4
#
# Number of Fisher Scoring iterations: 4
```

Only Lag2 is statistically significant: its p-value (0.0296) is the only one below 0.05.

### c

```r
direction_pred <- predict(direction_glm, type = 'response')
direction_res <- ifelse(direction_pred >= 0.5, 'Up', 'Down')
table(direction_res, Weekly$Direction)
# direction_res Down  Up
#          Down   54  48
#          Up    430 557
```

```r
acc <- (54 + 557) / nrow(Weekly)   # overall accuracy
tp <- 54 / (430 + 54)              # share of actual Down weeks predicted Down
tn <- 557 / (48 + 557)             # share of actual Up weeks predicted Up
# > acc
# [1] 0.5610652
# > tp
# [1] 0.1115702
# > tn
# [1] 0.9206612
```

Overall accuracy is about $56\%$, but the model almost always predicts Up: only $11.2\%$ of actual Down weeks are identified, versus $92.1\%$ of Up weeks.

### d

```r
train_data <- Weekly[Weekly$Year <= 2008, ]   # 1990-2008
test_data <- Weekly[Weekly$Year > 2008, ]     # 2009-2010
weekly_glm <- glm(Direction ~ Lag2, data = train_data, family = 'binomial')
weekly_pred <- predict(weekly_glm, test_data, type = 'response')
weekly_res <- ifelse(weekly_pred >= 0.5, 'Up', 'Down')
table(weekly_res, test_data[, 'Direction'])
# weekly_res Down Up
#       Down    9  5
#       Up     34 56
```

```r
mean(weekly_res == test_data[, 'Direction'])
# [1] 0.625
```

Test accuracy is $62.5\%$.

### e

```r
library(MASS)
lda_model <- lda(Direction ~ Lag2, train_data)
lda_pred <- predict(lda_model, test_data)
lda_class <- lda_pred$class
table(lda_class, test_data[, 'Direction'])
# lda_class Down Up
#      Down    9  5
#      Up     34 56
```

```r
mean(lda_class == test_data[, 'Direction'])
# [1] 0.625
```

Test accuracy is $62.5\%$, identical to logistic regression.

### f

```r
qda_model <- qda(Direction ~ Lag2, train_data)
qda_pred <- predict(qda_model, test_data)
qda_class <- qda_pred$class
table(qda_class, test_data[, 'Direction'])
# qda_class Down Up
#      Down    0  0
#      Up     43 61
```

```r
mean(qda_class == test_data[, 'Direction'])
# [1] 0.5865385
```

Test accuracy is about $58.7\%$, but QDA achieves it by predicting Up for every single week.

### g

```r
library(class)
knn_pred <- knn(as.matrix(train_data$Lag2),   # train x
                as.matrix(test_data$Lag2),    # test x
                train_data$Direction, k = 1)
table(knn_pred, test_data$Direction)
# knn_pred Down Up
#     Down   21 30
#     Up     22 31
```

```r
mean(knn_pred == test_data[, 'Direction'])
# [1] 0.5
```

Test accuracy is $50\%$, no better than a coin flip.

### h

Logistic regression and LDA perform better than the other methods, each reaching $62.5\%$ test accuracy.
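For a side-by-side view, here is a small recap that just collects the test accuracies reported in the outputs of (d)–(g):

```r
# Test accuracies copied from the outputs above.
accs <- c(logistic = 0.625, LDA = 0.625, QDA = 0.5865, `KNN (k=1)` = 0.5)
sort(accs, decreasing = TRUE)

# QDA's accuracy equals the share of Up weeks in the test set (61/104),
# since it predicts Up for every observation -- the naive always-Up baseline.
mean(test_data$Direction == "Up")
# [1] 0.5865385
```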
### i

```r
k <- c(1, 3, 5, 9, 15, 31, 51)
knn_m <- function(k) {
  knn_pred <- knn(as.matrix(train_data$Lag2),
                  as.matrix(test_data$Lag2),
                  train_data$Direction, k = k)
  mean(knn_pred == test_data[, "Direction"])   # test accuracy
}
sapply(k, knn_m)
# [1] 0.5096154 0.5576923 0.5288462 0.5576923 0.5865385 0.5384615 0.5576923
```

Among these, $k = 15$ gives the best KNN accuracy ($58.7\%$), but that is still below the $62.5\%$ from logistic regression and LDA.

## Q11

### a

```r
mid <- median(Auto$mpg)
mpg01 <- rep(0, nrow(Auto))
mpg01[Auto$mpg >= mid] <- 1    # 1 if mpg is at or above its median
new_auto <- data.frame(Auto, mpg01)
head(new_auto)
```

### b

```r
pairs(new_auto)
```

![](https://i.imgur.com/DPfxQlW.png)

weight, horsepower, acceleration, and displacement look useful for predicting mpg01. (mpg itself separates mpg01 perfectly by construction, so using it as a predictor, as the models below do, leaks the label; see the note in (f).)

### c

```r
n <- nrow(new_auto)
idx <- sample(1:n, n * 0.7)      # random 70/30 train/test split (no seed was set)
train_data <- new_auto[idx, ]
test_data <- new_auto[-idx, ]
```

### d

```r
lda_model <- lda(mpg01 ~ mpg + weight + horsepower + acceleration + displacement,
                 train_data)
lda_pred <- predict(lda_model, test_data)
table(lda_pred$class, test_data$mpg01)
#      0  1
#   0 60  3
#   1  4 51
```

```r
1 - mean(lda_pred$class == test_data$mpg01)
# [1] 0.05932203
```

The test error is about $5.9\%$.

### e

```r
qda_model <- qda(mpg01 ~ mpg + weight + horsepower + acceleration + displacement,
                 train_data)
qda_pred <- predict(qda_model, test_data)
table(qda_pred$class, test_data$mpg01)
#      0  1
#   0 59  3
#   1  5 51
```

```r
1 - mean(qda_pred$class == test_data$mpg01)
# [1] 0.06779661
```

The test error is about $6.8\%$.

### f

```r
glm_model <- glm(mpg01 ~ mpg + weight + horsepower + acceleration + displacement,
                 train_data, family = binomial)
glm_pred <- predict(glm_model, test_data, type = 'response')
glm_pred <- ifelse(glm_pred > 0.5, 1, 0)
table(glm_pred, test_data$mpg01)
# glm_pred  0  1
#        0 64  0
#        1  0 54
```

```r
1 - mean(glm_pred == test_data$mpg01)
# [1] 0
```

The test error is $0\%$. A perfect score here is a warning sign rather than a triumph: mpg01 was defined from mpg, so including mpg as a predictor leaks the answer. Refitting the models without mpg would give a more honest comparison.

### g

Skipped.

## Simulation

Following the four steps described at the top of this note (see the sketch below):
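A minimal sketch of the wrong-vs-correct CV comparison. It assumes a screening step that keeps the 20 predictors most correlated with $y$ and a 1-NN classifier evaluated with 5-fold CV; the slides referenced above may use different choices, so treat the exact numbers as illustrative.

```r
library(class)

set.seed(1)
n <- 50; p <- 100000; n_keep <- 20; n_folds <- 5
x <- matrix(rnorm(n * p), nrow = n)      # step 1: 50 obs, 100,000 predictors
y <- factor(rbinom(n, 1, 0.5))           # step 2: labels independent of x,
                                         #         so the true test error is 50%
folds <- sample(rep(1:n_folds, length.out = n))

# Step 3 -- wrong CV: screen predictors on ALL the data first,
# then cross-validate only the classifier.
cors <- abs(cor(x, as.numeric(y)))
keep <- order(cors, decreasing = TRUE)[1:n_keep]
wrong_err <- mean(sapply(1:n_folds, function(f) {
  tr <- folds != f
  pred <- knn(x[tr, keep, drop = FALSE], x[!tr, keep, drop = FALSE],
              y[tr], k = 1)
  mean(pred != y[!tr])
}))

# Step 4 -- correct CV: redo the screening inside every fold,
# using the training folds only.
right_err <- mean(sapply(1:n_folds, function(f) {
  tr <- folds != f
  cors_f <- abs(cor(x[tr, ], as.numeric(y[tr])))
  keep_f <- order(cors_f, decreasing = TRUE)[1:n_keep]
  pred <- knn(x[tr, keep_f, drop = FALSE], x[!tr, keep_f, drop = FALSE],
              y[tr], k = 1)
  mean(pred != y[!tr])
}))

c(wrong = wrong_err, right = right_err)
```

Because the wrong procedure lets the held-out folds influence which predictors are kept, it reports an error rate far below 50%; the correct procedure, which keeps the screening inside each training fold, hovers around the true 50%.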