---
title: Homework 2
label: homework
layout: post
geometry: margin=2cm
tags: homework
---

# CS 100 Homework #2

### Hypothesis Testing

##### Due: November 5, 2022 at 10 pm

### Instructions

Please submit your R Markdown (.Rmd) file to Gradescope. Please also knit your R Markdown file, and submit the resulting PDF/HTML (PDF preferred) file as well.

Be sure to follow the CS100 course collaboration policy as you work on this and all CS100 assignments.

### Objectives

By the end of this homework, you will know:

* what hypothesis testing is

By the end of this homework, you will be able to:

* calculate a \\(z\\)-score, and a corresponding \\(p\\)-value
* perform a hypothesis test

### Introduction

**Hypothesis testing** is a branch of classical statistics that asks questions of the form "how likely is it that observed data are the result of random chance, rather than being attributable to other factors?"

In this homework, you will investigate how likely it is that the data gathered about the air pressure of the footballs used during the 2014 playoff games were the result of random chance. You will also investigate presidential approval ratings, to determine whether the differences among recent presidents' ratings are statistically significant or merely the result of sampling error.

### Part 1: Deflategate

During the 2014 playoffs, the New England Patriots (our home team; sigh!) were accused by the Indianapolis Colts of purposefully supplying footballs with pressure below the minimum amount required, perhaps to make their balls easier to catch. During halftime, two NFL officials measured the pressure of most of the Patriots' footballs (11 of them) and some of the Colts' footballs (4 of them). (Halftime ended before all the balls could be inspected.) It was determined that, on average, the Patriots' balls had lower air pressure than the Colts'.

Your task in this question is to work with the available data (15 measurements) to try to determine whether this outcome could be attributed to random chance.

This question is based on an analysis in [Inferential Thinking](https://www.inferentialthinking.com). For more details about the real-world events on which their analysis is based, you can read, for example, [this ESPN article](http://www.espn.com/blog/new-england-patriots/post/_/id/4782561/timeline-of-events-for-deflategate-tom-brady).

#### Data

You can find the deflategate data [here](https://cs.brown.edu/courses/cs100/homeworks/data/2/deflategate.csv). It contains 3 variables:

- Football: the team whose ball was measured
- Blakeman: the first official's measurement
- Prioleau: the second official's measurement

##### Giving the Patriots the benefit of the doubt: Testing the null hypothesis

1. The two officials obtained slightly different measurements for each ball, presumably due to a normal amount of measurement variation. Compute the average of the two measurements for each of the footballs, and then average these average measurements across the two teams. Confirm our earlier assertion that the Patriots' footballs had less pressure, on average, than the Colts'. (One possible approach is sketched below.) *Hint:* You might find the `separate` function in the `tidyr` library useful.
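Here is a minimal sketch of one way to begin, assuming the CSV is read directly from the course URL and that the `Football` column combines the team name and a ball number (which is why `separate` is handy); check the actual file, since the column's exact format may differ.

```{r}
# A sketch, not a definitive solution. Assumes the Football column
# combines team and ball number (e.g., "Patriots 1"); separate()'s
# default separator splits on any run of non-alphanumeric characters.
library(tidyverse)  # includes readr (read_csv) and tidyr (separate)

balls <- read_csv("https://cs.brown.edu/courses/cs100/homeworks/data/2/deflategate.csv") %>%
  separate(Football, into = c("Team", "Ball")) %>%
  mutate(Halftime = (Blakeman + Prioleau) / 2)  # average of the two officials

# Average the per-ball averages within each team.
balls %>%
  group_by(Team) %>%
  summarize(MeanHalftime = mean(Halftime))
```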
2. The NFL requires its footballs to be inflated to a pressure between 12.5 and 13.5 pounds per square inch (psi). The Patriots' footballs were measured at the start of the game to be inflated to 12.5 psi, while the Colts' were inflated to 13 psi. After play and exposure to air, the footballs are expected to lose some pressure. Calculate the drop in pressure for each ball, between its measurement at the start of the game and its average measurement at halftime. As above, calculate the average drop in pressure for each team. Which team's average drop was greater?

3. Since you are trying to determine whether the Patriots' footballs lost more pressure than the Colts', a natural test statistic is the difference between the means, in this case the difference between the two mean drops (i.e., the Patriots' mean drop minus the Colts' mean drop). Calculate the value of this statistic.

If you calculated this statistic as we did (the Patriots' mean drop minus the Colts'), the result is positive, which indicates that the Patriots' average drop was larger than the Colts'. The next obvious question then is: Can this positive difference be explained by chance? Or is the difference large enough that it requires an alternative explanation?

Let's assume that each of these 15 measurements was as likely to have been one of the 11 measurements of the Patriots' balls as one of the 4 measurements of the Colts'. Your next goal is to test this assumption, which is more formally known as a **null hypothesis**. To do so, you'll run a simulation! Your simulation should randomly assign 11 of the recorded average drops to the Patriots, and 4 to the Colts, and then recompute the test statistic. If the measurements really are equally likely to have been associated with either team, the simulated test statistics should be roughly as large as the actual test statistic: i.e., the one corresponding to the data observed at halftime.

4. Use the `sample` function, which by default samples without replacement, to shuffle the vector of mean drops, and then assume the first 11 values correspond to the Patriots' measurements, and the last 4, to the Colts'. Then recalculate the test statistic. (*Hint:* Recall that you can select a range of values in a vector, say from the 3rd value to the 7th value, like so: `vector[3:7]`.)

5. Redo this exercise 10,000 times to calculate 10,000 values of the test statistic, and then create a histogram of these simulated test statistics. (A sketch of this simulation appears after this question.) Where does the actual test statistic lie? Eyeballing your histogram, does it appear to be a high- or a low-probability outcome? Calculate the empirical probability of the test statistic by counting how many times it falls below one of your simulated test statistics and dividing by 10,000. This empirical probability tells you the probability of the observed value of the test statistic, given these 15 measurements, and assuming the null hypothesis that the Patriots' mean drop is the same as the Colts'. Given the empirical probability that you calculated, do you think this difference in mean drops occurred by chance?
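Here is a minimal sketch of Questions 4 and 5, assuming you have stored the 15 mean drops in a vector called `drops` and the observed test statistic in a variable called `observed_stat` (both names are ours, not part of the assignment):

```{r}
# A sketch of the permutation simulation, assuming `drops` holds the
# 15 mean pressure drops and `observed_stat` holds the test statistic
# computed from the real halftime data.
simulated_stats <- replicate(10000, {
  shuffled <- sample(drops)                       # shuffle without replacement
  mean(shuffled[1:11]) - mean(shuffled[12:15])    # Patriots minus Colts
})

# Histogram of the simulated statistics, with the observed one marked.
library(ggplot2)
ggplot(data = data.frame(stat = simulated_stats), aes(x = stat)) +
  geom_histogram(bins = 20) +
  geom_vline(xintercept = observed_stat, linetype = 2)

# Empirical probability: the fraction of simulated statistics that are
# at least as large as the observed one.
sum(simulated_stats >= observed_stat) / 10000
```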
Here's a brief recap of this problem, from the point of view of hypothesis testing. The histogram you built was a representation of the distribution of differences in mean drops, assuming the null hypothesis, namely that the observed pressure drops were as likely to have been the result of measuring one of the Patriots' balls as one of the Colts' balls. You then calculated the empirical probability of the observed data, given this distribution. Had we started by assuming a significance level, we could have then decided to accept or reject the null hypothesis based on this empirical probability. In short, in this problem, you simulated a hypothesis test.

### Part 2: Approval Ratings

In this problem, you will explore the sense in which confidence intervals and two-sided hypothesis tests can be seen as two sides of the same coin.

The approval ratings for each of the last five presidents after ten months in office, as reported [here](https://news.gallup.com/interactives/185273/presidential-job-approval-center.aspx), are listed in the table below.

| President | Approve | Disapprove | Sample Size |
| --------- | ------- | ---------- | ----------- |
| Trump     | 615     | 810        | 1500        |
| Obama     | 600     | 795        | 1500        |
| Bush Jr   | 900     | 555        | 1500        |
| Clinton   | 690     | 660        | 1500        |
| Bush Sr   | 1065    | 285        | 1500        |

Bush Sr has the highest approval rating, well above the others, while Trump's and Obama's ratings seem more or less indistinguishable (by which we mean that any difference between them could be the result of sampling error). But it is less clear whether the difference between, say, Clinton's ratings and Obama's is statistically significant.

These approval ratings are binomially distributed, as they are the outcomes of 1500 Bernoulli trials. Recall that a binomial random variable, say $X$ (denoting the number of successes), has mean $np$ and variance $np(1-p)$, where $p$ is the success probability and $n$ is the number of trials. The success probability $p$ associated with such an $X$ can be estimated by a sample proportion, namely $\hat{p} = \frac{X}{n}$. For example, the sample proportion $\hat{p}_T$ for Trump is $\frac{615}{1500} = 0.41$, while the sample proportion $\hat{p}_O$ for Obama is $\frac{600}{1500} = 0.40$.

This sample proportion is itself a random variable, with mean $\mathbb{E} \left[ \frac{X}{n} \right] = \frac{1}{n} \mathbb{E} [X] = \frac{np}{n} = p$ and variance $\textrm{Var} \left( \frac{X}{n} \right) = \frac{1}{n^2} \textrm{Var}(X) = \frac{p(1 - p)}{n}$. By the central limit theorem, a sample proportion (like any sample mean) is approximately normally distributed with mean equal to the population mean, in this case $p$, and standard deviation equal to the standard error SE, namely $\sqrt{\frac{p(1-p)}{n}}$.

Throughout this problem, we assume the significance level \\(\alpha = 0.05\\).

1. Compute 95% confidence intervals on the sample proportions for Trump and Obama. Follow the method described in class, in which we use an estimate of the SE, obtained by substituting $\hat{p}$ for $p$ in the formula, since $p$ is unknown. (A sketch of this calculation appears after Question 4.)

2. Plot a histogram of the distribution of each of these sample proportions, along with the lower and upper bounds of your confidence intervals. *Hint*: Use the `rnorm` function to generate a vector of sufficient length (at least 1000) of random numbers normally distributed with mean \\(\hat{p}\\) and standard deviation SE.

> *Another Hint*: You can use the following lines of code to plot a histogram, assuming you store the results of your call to the `rnorm` function in a vector called `x`. Note that it plots a vertical line at the mean, `p_hat`. If you called the sample proportion something else, substitute the name you used for `p_hat`.

```{r}
ggplot(data = as.data.frame(x), aes(x = x)) +
  geom_histogram(bins = 20) +
  geom_vline(xintercept = p_hat, linetype = 2)
```

> *Hint*: You can use the `grid.arrange` function in the "gridExtra" package to see two (or more) ggplots side by side. Give each ggplot a name, say `one` and `two`, and then use `grid.arrange(one, two, ncol = 2)`.

3. Do your 95% confidence intervals overlap? If so, you cannot conclude that Trump's and Obama's ratings are different, at the $0.05$ significance level. But if not, you can.

4. What about Clinton's and Obama's approval ratings? Do they differ at the $0.05$ significance level? Explain why in reference to the confidence intervals.
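Here is a minimal sketch of the confidence-interval calculation for Trump and Obama (Questions 1 and 3), using the counts from the table above; the Clinton-versus-Obama comparison in Question 4 is analogous. All variable names are ours:

```{r}
# A sketch of Questions 1 and 3, using the counts from the table.
n <- 1500
p_hat_T <- 615 / n   # Trump's sample proportion
p_hat_O <- 600 / n   # Obama's sample proportion

# Estimated standard errors, substituting p_hat for the unknown p.
se_T <- sqrt(p_hat_T * (1 - p_hat_T) / n)
se_O <- sqrt(p_hat_O * (1 - p_hat_O) / n)

# 95% confidence intervals: p_hat plus or minus 1.96 standard errors.
ci_T <- c(p_hat_T - 1.96 * se_T, p_hat_T + 1.96 * se_T)
ci_O <- c(p_hat_O - 1.96 * se_O, p_hat_O + 1.96 * se_O)
ci_T
ci_O
```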
Another way to answer these same questions would have been to construct a 95% confidence interval around the difference between two sample proportions, e.g., Trump's and Obama's approval ratings, and to ask whether 0 lies in that interval. If it does, then you cannot conclude that Trump's and Obama's ratings are different, at the $0.05$ significance level. But if it does not, you can.

In the remainder of this problem, you will instead perform a hypothesis test to determine whether Trump's and Obama's approval ratings differ at the $0.05$ significance level.

The random variable of interest here is again the difference between two sample proportions, e.g., Trump's and Obama's, which, overloading notation, we again denote by $\hat{p}$. By the central limit theorem, under the null hypothesis that the two population proportions are equal, this random variable is approximately normally distributed with mean 0 and standard deviation given by the standard error SE. The standard error of a difference of sample proportions is the square root of the sum of the variance, say $\sigma_T^2$, of the proportion who approved of Trump divided by the number of people polled (say $n_T$), and the variance, say $\sigma_O^2$, of the proportion who approved of Obama, again divided by the number of people polled (say $n_O$), namely $\text{SE} = \sqrt{\frac{\sigma_T^2}{n_T} + \frac{\sigma_O^2}{n_O}}$.

5. Plot a histogram of a normal distribution with mean 0 and standard deviation SE. Draw a dotted vertical line at `p_hat`: i.e., at the difference of the two sample proportions.

6. Plot a histogram of a (standard) normal distribution with mean 0 and standard deviation 1. This time, draw a vertical line at the $z$-statistic, i.e., $z = \frac{\hat{p} - 0}{\text{SE}} = \frac{\hat{p}}{\text{SE}}$.

7. Perform a two-sided hypothesis test by calling the `pnorm` function on your $z$-statistic to find the $p$-value, and compare this value to 0.05. Should `lower.tail` be set to TRUE or FALSE? *Hint*: You need to adjust the output of the `pnorm` function in some way to account for the two-sidedness of the test. (A two-sided test is appropriate because we are checking whether these sample proportions differ, not whether one is greater than the other.)

8. Finally, generate another vector of random numbers with mean 0 and standard deviation SE. Compare the values in this vector to `p_hat` to compute the $p$-value (without calling the `pnorm` function). (A sketch covering Questions 5 through 8 appears after this list.)

9. At a 5% significance level, were Trump's and Obama's ratings different? What is the relevant $p$-value? Is your conclusion consistent with your answer to Question 3?
10. What about Clinton's and Obama's ratings? What is the relevant $p$-value in this case? Again, is your conclusion consistent with your answer to Question 4?
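Here is a minimal sketch of the two-sided test in Questions 5 through 8, for Trump versus Obama; all variable names are ours, and the plots from Questions 5 and 6 are omitted for brevity:

```{r}
# A sketch of Questions 5-8 for Trump vs. Obama. Each variance is
# estimated by substituting the sample proportion for the unknown p.
n_T <- 1500
n_O <- 1500
p_hat_T <- 615 / n_T
p_hat_O <- 600 / n_O
p_hat <- p_hat_T - p_hat_O   # difference of sample proportions

# Standard error of the difference.
SE <- sqrt(p_hat_T * (1 - p_hat_T) / n_T + p_hat_O * (1 - p_hat_O) / n_O)

# z-statistic, and a two-sided p-value via pnorm: take the upper tail
# of |z| and double it to account for both tails.
z <- (p_hat - 0) / SE
p_value <- 2 * pnorm(abs(z), lower.tail = FALSE)

# Empirical alternative: simulate the null distribution directly and
# count how often a draw is at least as extreme as the observed
# difference (two-sided, hence the absolute values).
sims <- rnorm(10000, mean = 0, sd = SE)
empirical_p <- sum(abs(sims) >= abs(p_hat)) / 10000
```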