# Multivariate HW3

###### tags: `multivariate`

## Q1

> Make a draftsman plot for the car data with the variables `X1=price, X2=mileage, X8=weight, X9=length`.
>
> Move the brush into the region of heavy cars. What can you say about price, mileage and length? Move the brush onto high fuel economy. Mark the Japanese, European and American cars. You should find the same condition as in the boxplot.
>
> (The operation of highlighting points corresponding to a selected range of one of the variables is called brushing.)

Plot the car dataset as a draftsman plot, separating the regions of origin by color (red/green/blue).

![](https://i.imgur.com/vVJwVR9.png)

Looking at the third column, weight has a small effect on price: when weight is larger, price tends to be higher. The second row shows that mileage drops as weight increases. The last row shows a clearly positive relationship: weight increases with length.

For fuel economy, look at the mileage-vs-price panel (second row, first column): the slope of the line from a point to the origin reflects fuel efficiency per unit price. In the most efficient group the slopes are steepest, and most of those points are green, representing the Japanese cars. The red points (U.S. cars) and the blue points (European cars) are both less efficient.

We plot the mileage/price ratio (M/P) as a boxplot:

![](https://i.imgur.com/CRiB3TK.png)

The high fuel efficiency of the Japanese cars is obvious, while the European and U.S. cars score lower. R makes this quite convenient:

``` R
# draftsman plot: scatter plot matrix of price, mileage, weight, length,
# colored by region of origin (column C)
pairs(data[1:4], main="Draftsman Plot of car dataset", pch=21,
      bg=c("red", "green3", "blue")[c(data[5])$C])

# boxplot of mileage per unit price for each region
data["CP"] = data["M"] / data["P"]
boxplot(CP~C, data=data, names=c("U.S.", "Japan", "Europe"),
        main="Mileage / Price", horizontal=T)
```

The code: https://gist.github.com/linnil1/45f4f154de3e22d84cf7e0baad2c33a9

## Q2

> A physical anthropologist performed a mineral analysis of nine ancient Peruvian hairs.
> The results for the chromium (x1) and strontium (x2) levels, in parts per million (ppm), were as follows:
>
> ```
>       cr    st
> 1   0.48 12.57
> 2  40.53 73.68
> 3   2.19 11.13
> 4   0.55 20.03
> 5   0.74 20.29
> 6   0.66  0.78
> 7   0.93  4.64
> 8   0.37  0.43
> 9   0.22  1.08
> ```
>
> It is known that low levels (less than or equal to .100 ppm) of chromium suggest the presence of diabetes, while strontium is an indication of animal protein intake.
>
> (a) Construct and plot a 90% joint confidence ellipse for the population mean vector $\mu' = [\mu_1, \mu_2]$, assuming that these nine Peruvian hairs represent a random sample from individuals belonging to a particular ancient Peruvian culture.
>
> (b) Obtain the individual simultaneous 90% confidence intervals for $\mu_1$ and $\mu_2$ by "projecting" the ellipse constructed in Part a on each coordinate axis. Does it appear as if this Peruvian culture has a mean strontium level of 10? That is, are any of the points ($\mu_1$ arbitrary, 10) in the confidence regions? Is $(.30, 10)'$ a plausible value for $\mu$? Discuss.
>
> (c) Do these data appear to be bivariate normal? Discuss their status with reference to Q-Q plots and a scatter diagram. If the data are not bivariate normal, what implications does this have for the results in Parts a and b?
>
> (d) Repeat the analysis with the obvious "outlying" observation removed. Do the inferences change? Comment.

### a,b

I think the easiest way to plot the ellipse is to enumerate points around the sample mean and mark on the scatter plot those that fall inside the confidence region. To do so, create a mesh grid:

``` R
# mesh grid of deviations around the sample mean
outn = 100
devx = seq(from=-20, to=20, length.out=outn)
devy = seq(from=-30, to=30, length.out=outn)
x = rep(devx, outn)
dim(x) = c(outn, outn)
x = t(x)
dim(x) = c(outn * outn)
y = rep(devy, outn)
xy = matrix(c(x, y), outn * outn)  # outn^2 rows, 2 columns
```

Then a matrix computation (just two lines) calculates the F values.
``` R
# (a) quadratic form for the joint confidence region at each grid point
q1 = apply(xy %*% inv_v * xy, 1, sum)
q1_crit = qf(0.90, p, n - p) / ((n - p) * n / p / (n - 1))

# (b) simultaneous intervals, projecting onto each axis
fx = x ** 2 * n / var(cr)
fy = y ** 2 * n / var(st)
q2_crit = qf(0.90, 1, n - 1)
q2 = (fx < q2_crit) & (fy < q2_crit)
```

Plot it!

![](https://i.imgur.com/cpmBU2t.png)

Then I found another useful tool: `contour`. Show only the level I am interested in (the critical value) and R will draw the boundary of the ellipse automatically:

``` R
# remap so the original sample mean is the center
devx = devx + m[1]
devy = devy + m[2]
contour(devx, devy, q1, levels=c(q1_crit), label=c("Q1"),
        col="Blue", xlab="cr", ylab="st")
```

![](https://i.imgur.com/gVSVNbN.png)

Finally, plot both contours together.

![](https://i.imgur.com/axLJeBe.png)

Judging from the plot, `(.30, 10)` is inside the confidence region, and so is `(0, 10)`.

### c,d

For bivariate normal data, the squared Mahalanobis distance follows a chi-square distribution, so I use a chi-square Q-Q plot. The distances are calculated with

$$
d^2 = (x-\bar{x})^T S^{-1} (x-\bar{x})
$$

``` R
# squared Mahalanobis distance of each observation from the sample mean
d = t(apply(data, 1, function(x) x - m))
dis = diag(d %*% inv_v %*% t(d))
qqplot(qchisq(ppoints(500), df=p), dis,
       main = expression("Q-Q plot for" ~~ {chi^2}[nu == 2]),
       xlab="Chi-squared", ylab="sample", asp=1)
```

We can see that the outlier makes the Q-Q plot noticeably non-linear. The largest distance is found at index 2:

```
0.42154138 7.09781710 0.05102815 2.37090174 2.34159676 1.17407982 0.44341984 1.15673357 0.94288163
```

After removing it,

``` R
data = data[c(-2),]
```

the distances are

```
0.5181444 5.3838751 2.1849994 1.9728961 0.9369430 0.3951284 1.2245387 1.3834750
```

Still, the line does not change much, because it is dominated by the first several points.

![](https://i.imgur.com/sEZiVo2.png)

### The code is here

* ab https://gist.github.com/linnil1/02b8c90bb5a9f8f72644ec54f81eed0f
* cd https://gist.github.com/linnil1/34b637616e616529f23d39a9598c61fa

## Q3

> 3. Carry out a profile analysis on the beetle data in the table.
Evaluate whether the two profiles are parallel.

``` R
# parallelism: compare the mean vectors of successive differences
data1_d = data1[, 2:p] - data1[, 1:(p-1)]
data2_d = data2[, 2:p] - data2[, 1:(p-1)]
p_d = p - 1
mean1_d = colMeans(data1_d)
mean2_d = colMeans(data2_d)
var1_d = var(data1_d)
var2_d = var(data2_d)
varpool_d = (var1_d * (n1 - 1) + var2_d * (n2 - 1)) / (n1 + n2 - 2)

# two-sample Hotelling T^2 on the difference vectors, converted to F
t2_d = n1 * n2 / (n1 + n2) *
    (mean1_d - mean2_d) %*% solve(varpool_d) %*% (mean1_d - mean2_d)
f_d = (n1 + n2 - p_d - 1) / p_d / (n1 + n2 - 2) * t2_d
fcrit_d = qf(0.95, p_d, n1 + n2 - p_d - 1)
if (f_d < fcrit_d) {
    print("Cannot Reject Parallelism")
} else
    print("Reject Parallelism")
```

The output:

```
f_d = 41.83803
fcrit_d = 2.874187
```

So we reject the parallelism hypothesis. I also plotted the mean profile of each group; the red line crosses the green line at the first variable.

![](https://i.imgur.com/XJZ5ZF8.png)

code https://gist.github.com/linnil1/8d9112c45bce0f2bc24b6e279cc2a87e
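As a sanity check on the parallelism test itself, the same construction (successive differences, pooled covariance, two-sample Hotelling $T^2$ converted to an F statistic) can be reproduced in Python with NumPy/SciPy. This is a minimal sketch: the beetle measurements are not reproduced here, so the group sizes and means below are synthetic stand-ins chosen to give clearly non-parallel profiles.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic stand-in for the two groups: n1, n2 observations on p variables.
# The second group's profile has a deliberately different shape.
n1, n2, p = 15, 12, 4
data1 = rng.normal(loc=[10, 12, 11, 13], scale=1.0, size=(n1, p))
data2 = rng.normal(loc=[9, 14, 10, 15], scale=1.0, size=(n2, p))

# Successive differences turn "parallel profiles" into
# "equal mean vectors of the differences".
d1 = np.diff(data1, axis=1)   # shape (n1, p-1)
d2 = np.diff(data2, axis=1)   # shape (n2, p-1)
p_d = p - 1

m1, m2 = d1.mean(axis=0), d2.mean(axis=0)
S_pool = ((n1 - 1) * np.cov(d1, rowvar=False) +
          (n2 - 1) * np.cov(d2, rowvar=False)) / (n1 + n2 - 2)

# Two-sample Hotelling T^2 on the difference vectors.
diff = m1 - m2
t2 = (n1 * n2 / (n1 + n2)) * diff @ np.linalg.solve(S_pool, diff)

# Convert T^2 to an F statistic and compare with the 95% critical value.
f_stat = (n1 + n2 - p_d - 1) / (p_d * (n1 + n2 - 2)) * t2
f_crit = stats.f.ppf(0.95, p_d, n1 + n2 - p_d - 1)

print("T^2 =", t2, "F =", f_stat, "F crit =", f_crit)
print("Reject parallelism" if f_stat > f_crit else "Cannot reject parallelism")
```

With profiles this differently shaped, the F statistic lands far above the critical value, mirroring the rejection seen in the R analysis above.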