---
tags: ISLR
---
# ISLR hw7
Scope: Textbook Chapter 4 exercises: Questions 4, 6, 7, 10, and 11, plus a simulation experiment to observe and illustrate the following phenomenon (code in the Simulation section at the end):

1. Generate 50 observations from a normal distribution, where each observation has 100,000 predictors ($X_1, \dots, X_{100000}$).
2. Generate 50 responses $y$ from a Bernoulli(0.5) distribution and randomly assign them as the responses for step 1, so the true test error is 50%.
3. Estimate the test error with the wrong CV procedure.
4. Estimate the test error with the correct CV procedure.

For details, see slides 33-35 of the Chapter 5 slides.
## ch04 Q4
### a
On average, $1/10 = 10\%$ of the available observations.
### b
On average, $\frac{1}{10} \times \frac{1}{10} = \frac{1}{100} = 1\%$ of the available observations.
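A quick Monte Carlo sanity check of (a) and (b), assuming the test point sits at 0.5 in each feature so the 10% window stays inside $[0, 1]$:
```r
set.seed(1)
n <- 1e5
x1 <- runif(n)  # feature 1, uniform on [0, 1]
x2 <- runif(n)  # feature 2, uniform on [0, 1]
mean(abs(x1 - 0.5) <= 0.05)                          # ~0.10
mean(abs(x1 - 0.5) <= 0.05 & abs(x2 - 0.5) <= 0.05)  # ~0.01
```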
### c
$0.1^{100} = 10^{-100}$, an essentially empty neighborhood.
### d
In a high-dimensional space (i.e., when $p$ is large), observations are far from one another, so very few training points lie "near" any given test point. The number of observations needed to keep neighborhoods local therefore grows exponentially with $p$.
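A small illustration of this claim (`avg_dist` is a hypothetical helper, not from the text): the average distance between two random points in the unit hypercube grows with $p$.
```r
set.seed(1)
# average Euclidean distance between pairs of uniform points in p dimensions
avg_dist <- function(p, n = 1000) {
  a <- matrix(runif(n * p), n)
  b <- matrix(runif(n * p), n)
  mean(sqrt(rowSums((a - b)^2)))
}
sapply(c(1, 10, 100, 1000), avg_dist)  # grows roughly like sqrt(p)
```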
### e
To contain, on average, 10% of the training observations, the hypercube centered on a test point must have side length $0.1^{1/p}$. For $p = 1$ this is 0.1, for $p = 2$ about 0.32, and for $p = 100$ about 0.98, i.e., nearly the full range of each feature.
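Evaluating the side length for a few values of $p$:
```r
p <- c(1, 2, 100)
0.1 ^ (1 / p)  # side length needed for the hypercube to cover 10% of the data
# [1] 0.1000000 0.3162278 0.9772372
```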
## Q6
$$
P(X) = \frac{e^{\beta_0+\beta_1X_1+\beta_2X_2}}{1 + e^{\beta_0+\beta_1X_1+\beta_2X_2}}, \quad
\beta_0 = -6,\ \beta_1 = 0.05,\ \beta_2 = 1
$$
### a
```r
e <- -6 + 0.05 * 40 + 1 * 3.5  # log-odds for 40 hours studied and a 3.5 GPA
exp(e) / (1 + exp(e))
# [1] 0.3775407
```
The predicted probability is about $37.8\%$.
### b
$x_1$ is unknown and $x_2 = 3.5$; solve for the $x_1$ that makes the probability 0.5:
$$
i = -6 + 0.05x_1 + 1 \times 3.5 \\
0.5 = \frac{e^i}{1 + e^i} \\
0.5 + 0.5e^i = e^i \implies e^i = 1 \implies i = 0 \\
-2.5 + 0.05x_1 = 0 \implies x_1 = \frac{2.5}{0.05} = 50
$$
So the student needs to study 50 hours.
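A quick check that $x_1 = 50$ indeed gives a probability of 0.5:
```r
e <- -6 + 0.05 * 50 + 1 * 3.5  # log-odds at x1 = 50, x2 = 3.5
exp(e) / (1 + exp(e))
# [1] 0.5
```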
## Q7
The response is Yes or No. By Bayes' theorem,
$$
Pr(Y=Yes|X=x)=\frac{\pi_1 f_1(x)}{\pi_1 f_1(x) + \pi_2 f_2(x)} \\
\pi_1 = 0.8, \pi_2 = 0.2
$$
Here $f_1(x)$ is the PDF of a normal random variable with mean 10 and variance 36, and $f_2(x)$ is the PDF of a normal random variable with mean 0 and variance 36.
```r
p1 <- 0.8 * dnorm(4, mean = 10,sd = 6)
p2 <- 0.2 * dnorm(4, mean = 0,sd = 6)
p1 / (p1 + p2)
# [1] 0.7518525
```
The probability is about $0.75$, i.e. $75\%$, so we predict the company will issue a dividend.
## Q10
### a
```r
library(ISLR)
summary(Weekly)
# Year Lag1 Lag2 Lag3 Lag4
# Min. :1990 Min. :-18.1950 Min. :-18.1950 Min. :-18.1950 Min. :-18.1950
# 1st Qu.:1995 1st Qu.: -1.1540 1st Qu.: -1.1540 1st Qu.: -1.1580 1st Qu.: -1.1580
# Median :2000 Median : 0.2410 Median : 0.2410 Median : 0.2410 Median : 0.2380
# Mean :2000 Mean : 0.1506 Mean : 0.1511 Mean : 0.1472 Mean : 0.1458
# 3rd Qu.:2005 3rd Qu.: 1.4050 3rd Qu.: 1.4090 3rd Qu.: 1.4090 3rd Qu.: 1.4090
# Max. :2010 Max. : 12.0260 Max. : 12.0260 Max. : 12.0260 Max. : 12.0260
# Lag5 Volume Today Direction
# Min. :-18.1950 Min. :0.08747 Min. :-18.1950 Down:484
# 1st Qu.: -1.1660 1st Qu.:0.33202 1st Qu.: -1.1540 Up :605
# Median : 0.2340 Median :1.00268 Median : 0.2410
# Mean : 0.1399 Mean :1.57462 Mean : 0.1499
# 3rd Qu.: 1.4050 3rd Qu.:2.05373 3rd Qu.: 1.4050
# Max. : 12.0260 Max. :9.32821 Max. : 12.0260
```
```r
pairs(Weekly)
```

Year appears to be associated with Volume; the other pairs show no obvious relationship.
### b
```r
direction_formula <- Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume
direction_glm <- glm(direction_formula, family = binomial, data = Weekly)
summary(direction_glm)
# Call:
# glm(formula = direction_formula, family = binomial, data = Weekly)
# Deviance Residuals:
# Min 1Q Median 3Q Max
# -1.6949 -1.2565 0.9913 1.0849 1.4579
# Coefficients:
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) 0.26686 0.08593 3.106 0.0019 **
# Lag1 -0.04127 0.02641 -1.563 0.1181
# Lag2 0.05844 0.02686 2.175 0.0296 *
# Lag3 -0.01606 0.02666 -0.602 0.5469
# Lag4 -0.02779 0.02646 -1.050 0.2937
# Lag5 -0.01447 0.02638 -0.549 0.5833
# Volume -0.02274 0.03690 -0.616 0.5377
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# (Dispersion parameter for binomial family taken to be 1)
# Null deviance: 1496.2 on 1088 degrees of freedom
# Residual deviance: 1486.4 on 1082 degrees of freedom
# AIC: 1500.4
# Number of Fisher Scoring iterations: 4
```
Only Lag2 appears statistically significant: its p-value (about 0.03) is below the usual 0.05 threshold.
### c
```r
direction_pred <- predict(direction_glm, type = 'response')
direction_res <- ifelse(direction_pred >= 0.5, 'Up', 'Down')
table(direction_res, Weekly$Direction)
# direction_res Down Up
# Down 54 48
# Up 430 557
```
```r
acc <- (54 + 557) / nrow(Weekly)  # overall accuracy
tp <- 54 / (430 + 54)             # fraction of true Down weeks predicted Down
tn <- 557 / (48 + 557)            # fraction of true Up weeks predicted Up
acc
# [1] 0.5610652
tp
# [1] 0.1115702
tn
# [1] 0.9206612
```
Overall accuracy is about $56\%$. The model predicts Up most of the time: it correctly flags $92\%$ of true Up weeks but only $11\%$ of true Down weeks.
### d
```r
train_data <- Weekly[Weekly$Year <= 2008,]
test_data <- Weekly[Weekly$Year > 2008,]
weekly_glm <- glm(Direction ~ Lag2, data = train_data, family = 'binomial')
weekly_pred <- predict(weekly_glm, test_data, type = 'response')
weekly_res <- ifelse(weekly_pred >= 0.5, 'Up', 'Down')
table(weekly_res, test_data[,'Direction'])
# weekly_res Down Up
# Down 9 5
# Up 34 56
```
```r
mean(weekly_res == test_data[,'Direction'])
# [1] 0.625
```
Accuracy is $62.5\%$.
### e
```r
library(MASS)  # provides lda() and qda()
lda_model <- lda(Direction ~ Lag2, train_data)
lda_pred <- predict(lda_model, test_data)
lda_class <- lda_pred$class
table(lda_class, test_data[,'Direction'])
# lda_class Down Up
# Down 9 5
# Up 34 56
```
```r
mean(lda_class == test_data[,'Direction'])
# 0.625
```
Accuracy is $62.5\%$, identical to the logistic regression in (d).
### f
```r
qda_model <- qda(Direction ~ Lag2, train_data)
qda_pred <- predict(qda_model, test_data)
qda_class <- qda_pred$class
table(qda_class, test_data[,'Direction'])
# qda_class Down Up
# Down 0 0
# Up 43 61
```
```r
mean(qda_class == test_data[,'Direction'])
# [1] 0.5865385
```
Accuracy is about $58.7\%$, but QDA predicts Up for every week, so this is just the proportion of Up weeks in the test set.
### g
```r
library(class)
knn_pred <- knn(as.matrix(train_data$Lag2), as.matrix(test_data$Lag2),  # train X, test X
                train_data$Direction, k = 1)
table(knn_pred, test_data$Direction)
# knn_pred Down Up
# Down 21 30
# Up 22 31
```
```r
mean(knn_pred == test_data[,'Direction'])
# [1] 0.5
```
Accuracy is $50\%$, no better than random guessing.
### h
Logistic regression and LDA perform best, each with $62.5\%$ test accuracy, ahead of QDA ($58.7\%$) and 1-NN ($50\%$).
### i
```r
k <- c(1, 3, 5, 9, 15, 31, 51)
knn_m <- function(k) {
  knn_pred <- knn(as.matrix(train_data$Lag2), as.matrix(test_data$Lag2),
                  train_data$Direction, k = k)
  mean(knn_pred == test_data[, "Direction"])  # test accuracy
}
sapply(k, knn_m)
# [1] 0.5096154 0.5576923 0.5288462 0.5576923 0.5865385 0.5384615 0.5576923
```
k = 15 gives the best accuracy here ($58.7\%$), still worse than logistic regression and LDA.
## Q11
### a
```r
mid <- median(Auto$mpg)
mpg01 <- rep(0, nrow(Auto))
mpg01[Auto$mpg >= mid] <- 1  # 1 when mpg is at or above its median
new_auto <- data.frame(Auto, mpg01)
head(new_auto)
```
### b
```r
pairs(new_auto)
```

mpg, weight, horsepower, acceleration, and displacement look useful for predicting mpg01. Note, though, that mpg01 is a deterministic function of mpg, so any model that includes mpg can in principle classify perfectly (see (f) below).
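As a numeric companion to the scatterplots, the correlation of each numeric column with mpg01 (name is a factor and is dropped):
```r
num_cols <- sapply(new_auto, is.numeric)
round(cor(new_auto[, num_cols])["mpg01", ], 2)  # correlation with the class label
```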
### c
```r
n <- nrow(new_auto)
idx <- sample(1:n, n * 0.7)   # random 70/30 train/test split
train_data <- new_auto[idx, ]
test_data <- new_auto[-idx, ]
```
### d
```r
lda_model <- lda(mpg01 ~ mpg + weight + horsepower + acceleration + displacement, train_data)
lda_pred <- predict(lda_model, test_data)
table(lda_pred$class, test_data$mpg01)
# 0 1
# 0 60 3
# 1 4 51
```
```r
1 - mean(lda_pred$class == test_data$mpg01)
# [1] 0.05932203
```
Test error is about $5.9\%$.
### e
```r
qda_model <- qda(mpg01 ~ mpg + weight + horsepower + acceleration + displacement, train_data)
qda_pred <- predict(qda_model, test_data)
table(qda_pred$class, test_data$mpg01)
# 0 1
# 0 59 3
# 1 5 51
```
```r
1 - mean(qda_pred$class == test_data$mpg01)
# [1] 0.06779661
```
Test error is about $6.8\%$.
### f
```r
glm_model <- glm(mpg01 ~ mpg + weight + horsepower + acceleration + displacement,
train_data, family=binomial)
glm_pred <- predict(glm_model, test_data, type = 'response')
glm_pred <- ifelse(glm_pred > 0.5, 1, 0)
table(glm_pred, test_data$mpg01)
# glm_pred 0 1
# 0 64 0
# 1 0 54
```
```r
1 - mean(glm_pred == test_data$mpg01)
# 0
```
Test error is $0\%$. This perfect score comes from including mpg as a predictor: mpg01 was constructed directly from mpg, so the classes are perfectly separable (glm typically warns that fitted probabilities of 0 or 1 occurred).
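A quick confirmation that mpg01 is fully determined by mpg (reusing mid from (a)):
```r
all((new_auto$mpg >= mid) == new_auto$mpg01)  # mpg alone recovers every label
# [1] TRUE
```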
### g
Skipped.
## Simulation
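A minimal sketch of the simulation described at the top of this file. The fold count (k = 5), the screening size (d = 25), and the 1-NN classifier are illustrative choices of mine, not taken from the slides, and p is reduced from 100,000 to 5,000 only to keep the run time reasonable (set `p <- 100000` to match the assignment exactly).
```r
library(class)  # for knn(), already used in Q10 above

set.seed(1)
n <- 50      # observations
p <- 5000    # predictors; the assignment uses 100,000 (reduced for run time)
d <- 25      # number of screened predictors -- illustrative choice
k <- 5       # number of CV folds -- illustrative choice

x <- matrix(rnorm(n * p), nrow = n)
y <- factor(rbinom(n, 1, 0.5))  # labels independent of x, so true test error = 50%
folds <- sample(rep(1:k, length.out = n))

# indices of the d predictors most correlated with the labels
screen <- function(x, y, d) {
  order(abs(cor(x, as.numeric(y))), decreasing = TRUE)[1:d]
}

# test error of a 1-NN classifier on one held-out fold, given predictor indices
fold_err <- function(f, keep) {
  tr <- folds != f
  pred <- knn(x[tr, keep, drop = FALSE], x[!tr, keep, drop = FALSE],
              y[tr], k = 1)
  mean(pred != y[!tr])
}

# Wrong CV: screen predictors once on ALL the data, then cross-validate
keep_all <- screen(x, y, d)
wrong_cv <- mean(sapply(1:k, function(f) fold_err(f, keep_all)))

# Correct CV: redo the screening inside every training fold
right_cv <- mean(sapply(1:k, function(f) {
  tr <- folds != f
  fold_err(f, screen(x[tr, ], y[tr], d))
}))

wrong_cv  # well below the true 50% -- the screening saw the held-out labels
right_cv  # close to the true 50%
```
The wrong procedure lets the screening step see the held-out labels, so its CV estimate lands far below the true 50% error; rerunning the screening inside every training fold removes that leakage and brings the estimate back to roughly 50%.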