# Stat 120 Study Guide Solutions
## Data Collection
### D.1 (core)
The table below shows the first 8 observations from a sample of 200 individuals, who reported their age, race, income, and job satisfaction score (on a scale from 0 to 100).
<img src="https://hackmd.io/_uploads/By-a8nK0a.png" width="60%"/>
1. What are the cases?
*The cases are the 200 respondents.*
2. What are the variables recorded in this study? List them and identify each as either categorical or quantitative.
*Age (Quantitative), Income (Categorical), Race (Categorical), Score (Quantitative)*
### D.2 (core)
Below is a brief overview of an experiment published in *Archives of General Psychiatry*. Over a 4-month period, among 30 people with bipolar disorder, patients who were given a high dose (10 g/day) of omega-3 fats from fish oil improved more than those given a placebo.
1. Why didn't the experimenters just give everyone the omega-3 fats to see if they improved?
*Without a comparison group it would be impossible to separate the effect of the omega-3 fats from confounding variables, such as time.*
2. The experimenters randomly assigned patients to the two groups. Why was this important?
*Randomizing patients into groups accounts for potential confounders that the researchers did not control. The randomization should balance them between the treatment groups.*
3. Can the experimenters generalize their results to all bipolar patients?
*The researchers cannot generalize the results since the subjects were not randomly sampled from this population.*
4. Can the experimenters claim that the omega-3 fats caused the improvement?
*Yes, since this was a randomized experiment the researchers can attribute the difference in improvement they observed to the treatment.*
### D.3 (core)
In a large city school system with 20 elementary schools, the school board is considering the adoption of a new policy that would require elementary students to pass a test in order to be promoted to the next grade. The school board wants to find out whether parents agree with this plan and is considering using one of the following sampling schemes:
1. Put a big ad in the newspaper asking people to log their opinions on the district website.
2. Randomly select one of the elementary schools and contact every parent by phone.
3. Send a survey home with every student and ask parents to fill it out and return it the next day.
4. Randomly select 50 parents from each elementary school. Send them a survey and follow up with a phone call if they do not return the survey within a week.
Which sampling scheme would you recommend to the school board? Justify your answer.
*Sampling scheme "4" has the best chance of obtaining a representative sample of parents. Random sampling is used in this scheme, so many possible biases may be avoided. In addition, the follow-up call will help reduce non-response bias, though it may still be present.*
### D.4 (core)
The journal *Circulation* reported that among 1900 people who had heart attacks, those who drank an average of 19 cups of tea a week were 44% more likely than non-tea drinkers to survive at least 3 years after the heart attack.
1. What is the population of interest?
*All people who have had heart attacks.*
1. What is the sample?
*The 1900 people who had heart attacks in this study.*
1. What is the parameter of interest?
*Difference in the proportion of people surviving at least 3 years between tea drinkers and non-tea drinkers.*
1. What is the statistic?
$\widehat{p}_{\text{tea}}-\widehat{p}_{\text{non-tea}}$ = 0.44
## Exploratory Data Analysis
### E.1
In a survey conducted by the Gallup organization September 6-9, 2012, 1,017 adults were asked "In general, how much trust and confidence do you have in the mass media - such as newspapers, TV, and radio - when it comes to reporting the news fully, accurately, and fairly?" 81 said that they had a "great deal" of confidence, 325 said they had a "fair amount" of confidence, 397 said they had "not very much" confidence, and 214 said they had "no confidence at all".
1. Display the results in a frequency table.
| Great deal | Fair amount | Not very much | No confidence at all |
|------------|-------------|---------------|----------------------|
| 81 | 325 | 397 | 214 |
1. Sketch a bar chart of the data.
<img src="https://hackmd.io/_uploads/ryRYcSEyA.png" width="60%"/>
1. What proportion of respondents have a fair amount of confidence in the media?
$\widehat{p} = 325/1017 \approx 0.32$
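As a quick check, the relative frequencies can be computed in base R. This is only a sketch: the vector name `counts` is made up for illustration, and the numbers are copied from the problem statement.

```r
# Counts reported in the problem statement
counts <- c("Great deal" = 81, "Fair amount" = 325,
            "Not very much" = 397, "No confidence at all" = 214)

n <- sum(counts)        # total number of respondents
props <- counts / n     # proportion in each category

n
round(props, 2)
```

Passing `counts` to `barplot()` would draw the corresponding bar chart.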
### E.2 (core)
The histogram below displays the distance (in miles) from a random sample of 500 New York taxi trips. The data come from the New York City Taxi and Limousine Commission's database of yellow-taxi trips.
<img src="https://hackmd.io/_uploads/Sy7NoRKCp.png" width="60%"/>
1. Describe the distribution of trip distances, commenting on modes, symmetry, and unusual observations.
*The distribution is unimodal and skewed to the right. There is one outlier around 25 miles.*
2. Is it better to report the mean and standard deviation or the median and IQR for this data set? Explain.
*Since the distribution is skewed and there is an outlier, we should use resistant measures of center and spread; thus, we prefer the median and IQR.*
### E.3
Intro statistics students conducted a survey as part of their final project. The questions asked included:
- How would you rate yourself politically?
- How would you describe your diet?
Below is a stacked bar chart of the 289 responses.
<img src="https://hackmd.io/_uploads/Bkw-hCYCT.png" width="80%"/>
Describe what this plot is showing about the association between politics and diet.
*There is an association between student diet and political rating. We see this because the distribution of political rating is not the same across the three diets. For example, there were no conservatives within the vegetarian diet, but about 25% of carnivores were conservatives.*
### E.4
Below are boxplots displaying the relationship between vitamin use and the concentration of retinol (a micronutrient) in the blood for a sample of $n = 315$ individuals. Does there seem to be an association between these two variables? Briefly justify your answer.
<img src="https://hackmd.io/_uploads/HJ9q2AtR6.png" width="80%"/>
*There does not appear to be an association between vitamin use and the concentration of retinol in the blood. The distributions (median, IQR, min) are all approximately equal across the three vitamin use groups.*
### E.5 (core)
The scatterplot below displays data collected from a random sample of 500 New York taxi trips. The data come from the New York City Taxi and Limousine Commission's database of yellow-taxi trips, which contains a number of variables including distance (in miles) and total cost of the trip (in dollars). Describe the association between the distance and total cost of a taxi ride in New York.
<img src="https://hackmd.io/_uploads/HJOba0YAp.png" width="60%"/>
*There is a strong, positive, linear association between the total cost and distance of a taxi ride in New York. There are a couple of potential outliers at about (2, 60) and (28, 60).*
## Simple Linear Regression
### R.1
Meadowfoam is a small plant found growing in moist meadows of the U.S. Pacific Northwest. Researchers reported the results from one study in a series designed to find out how to elevate meadowfoam production to a profitable crop. In a controlled growth chamber, they focused on the effects of two light-related factors: light intensity and the timing of the onset of the light treatment. Below are results from a quick regression analysis in R.
```
Call:
lm(formula = Flowers ~ Intensity, data = meadowfoam)
Coefficients:
(Intercept) Intensity
77.38500 -0.04047
```
Write down the equation of the fitted regression line using proper notation.
$\widehat{y} = 77.385 - 0.0405x$
### R.2 (core)
Meadowfoam is a small plant found growing in moist meadows of the U.S. Pacific Northwest. Researchers reported the results from one study in a series designed to find out how to elevate meadowfoam production to a profitable crop. In a controlled growth chamber, they focused on the effects of two light-related factors: light intensity and the timing of the onset of the light treatment. Below are results from a quick regression analysis in R.
```
Call:
lm(formula = Flowers ~ Intensity, data = meadowfoam)
Coefficients:
(Intercept) Intensity
77.38500 -0.04047
```
1. Write a one-sentence interpretation of the slope, in context.
*A 1* $\mu mol/m^{2}/sec$ *increase in light intensity is associated with a decrease in the average number of flowers per plant of 0.0405.*
*Note: a 1 $\mu mol/m^{2}/sec$ increase in light intensity is pretty silly, so a better interpretation might use a 50 $\mu mol/m^{2}/sec$ increase. Here, a 50 $\mu mol/m^{2}/sec$ increase in light intensity is associated with a decrease in the average number of flowers per plant of $0.0405(50) \approx 2.03$*.
2. Does the intercept make sense in the context of the problem? Explain briefly.
*The intercept does not make sense in this context since plants need light to survive, so it doesn't seem reasonable that meadowfoam would produce so many flowers per plant when it isn't exposed to light.*
### R.3
Meadowfoam is a small plant found growing in moist meadows of the U.S. Pacific Northwest. Researchers reported the results from one study in a series designed to find out how to elevate meadowfoam production to a profitable crop. In a controlled growth chamber, they focused on the effects of two light-related factors: light intensity and the timing of the onset of the light treatment. Below are results from a quick regression analysis in R.
```
Call:
lm(formula = Flowers ~ Intensity, data = meadowfoam)
Coefficients:
(Intercept) Intensity
77.38500 -0.04047
```
Use the fitted regression line to predict the number of flowers per plant when light intensity is 450 $\mu$mol/$m^2$/sec.
$\widehat{y}=77.385-0.0405(450)=59.16$ *flowers per plant*
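The same prediction can be sketched in R by plugging the printed coefficients into the fitted line. The helper `predict_flowers()` is a hypothetical name used only for illustration. Because it uses the unrounded slope (-0.04047), it gives about 59.17, slightly different from the 59.16 obtained with the rounded slope.

```r
# Coefficients copied from the lm() output above
b0 <- 77.385     # intercept
b1 <- -0.04047   # slope for Intensity

# Hypothetical helper: predicted flowers per plant at a given light intensity
predict_flowers <- function(intensity) {
  b0 + b1 * intensity
}

predict_flowers(450)
```

With the fitted model object available, `predict()` with `newdata = data.frame(Intensity = 450)` would be the usual route.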
### R.4
A regression model was used to predict penguin heart rate as a function of duration of dive (in minutes). Below is a scatterplot of the observed data with the fitted regression line superimposed.
<img src="https://hackmd.io/_uploads/ByyW0g5Ra.png" width="60%"/>
Should you use this model to predict a penguin's heart rate for a 20 minute dive? Justify your answer using statistical reasoning.
*No, you should not use this model to predict a penguin's heart rate for a 20 minute dive, since x = 20 is outside the observed range of the sample data, making this an extrapolation. Further, it's evident that the relationship is not linear, so this model is not useful for prediction.*
### R.5
A regression model was used to predict penguin heart rate as a function of duration of dive (in minutes). Below is the residual plot and a histogram of the residuals. Does the regression model adequately describe the association between heart rate and dive duration? Explain your reasoning using specific evidence from these plots.
<img src="https://hackmd.io/_uploads/rygxae5CT.png" width="80%"/>
*The regression model does not adequately describe the association between heart rate and dive duration. The residual plot shows signs of curvature, indicating that there is not a linear association between heart rate and dive duration. In addition, there may be non-constant spread of the residuals, and the distribution of the residuals is skewed.*
## Hypothesis Testing
### HT.1
Below are two histograms. One is a population distribution and the other is a null distribution for the test of the following hypotheses: $H_0: \mu = 85$ vs. $H_a: \mu > 85$. Which is which? Support your answer with statistical justification.

*The null distribution will be centered around the value specified in the null hypothesis, so the plot on the right is the null distribution (centered around 85).*
### HT.2 (core)
The Stanford University Heart Transplant Study was conducted to determine whether an experimental heart transplant program increased lifespan. Each patient entering the program was designated an official heart transplant candidate. Some patients got a transplant and some did not. Patients in the treatment group got a transplant and those in the control group did not. Of the 34 patients in the control group, 4 survived to the end of the study. Of the 69 people in the treatment group, 24 survived to the end of the study.
Clearly state the hypotheses being tested.
*$H_0: p_1 - p_2 = 0$ vs. $H_a: p_1 - p_2 > 0$, where $p_1$ denotes the proportion of patients in the treatment group who survived to the end of the study and $p_2$ denotes the proportion of patients in the control group who survived to the end of the study.*
### HT.3 (core)
The Stanford University Heart Transplant Study was conducted to determine whether an experimental heart transplant program increased lifespan. Each patient entering the program was designated an official heart transplant candidate. Some patients got a transplant and some did not. Patients in the treatment group got a transplant and those in the control group did not. In the study, the difference in the proportion of subjects surviving in the treatment and the control groups (treatment - control) is 0.23. Below is a randomization distribution (comprised of 250 simulated statistics) that can be used to conduct this hypothesis test. Calculate the p-value using this distribution. Be sure to show your work.
<img src="https://hackmd.io/_uploads/ByJtxb9Ca.png" width="80%"/>
*The p-value is the proportion of test statistics in the null distribution that are at least as "extreme" (or rare) as the observed test statistic, which is $\widehat{p}_1 - \widehat{p}_2 \approx 0.23$ here. There are 5 observations at 0.23 or above in this randomization distribution, so p-value $=5/250 = 0.02$.*
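A randomization distribution like the one pictured could be built roughly as follows. This is a sketch, not the code behind the figure: the seed and the number of shuffles (`n_sims`) are arbitrary choices, and the survival counts (24 of 69 in treatment, 4 of 34 in control) come from HT.2. The simulated p-value will therefore not match 5/250 exactly.

```r
set.seed(120)  # arbitrary seed so the sketch is reproducible

# Outcomes from HT.2: 1 = survived, 0 = did not survive
treatment <- c(rep(1, 24), rep(0, 45))   # 69 patients
control   <- c(rep(1, 4),  rep(0, 30))   # 34 patients
outcomes  <- c(treatment, control)

# Observed difference in sample proportions (treatment - control)
observed <- mean(treatment) - mean(control)

# Shuffle the outcomes across the two groups, as if treatment had no effect
n_sims <- 5000   # assumed number of shuffles; the figure used 250
sim_diffs <- replicate(n_sims, {
  shuffled <- sample(outcomes)
  mean(shuffled[1:69]) - mean(shuffled[70:103])
})

# One-sided p-value: share of simulated differences at least as large as observed
p_value <- mean(sim_diffs >= observed)

observed
p_value
```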
### HT.4 (core)
The US Environmental Protection Agency has set the action level for lead contamination of drinking water at 15 ppb (parts per billion). Samples are regularly tested to ensure that mean lead contamination is not above this level. Suppose we are testing:
$H_0: \mu = 15$ vs. $H_a: \mu > 15$
Suppose that researchers calculated a p-value of 0.05. What does this tell you about the strength of evidence relating to the hypotheses?
*There is some evidence that the average lead concentration is above 15 ppb.*
### HT.5 (core)
The Stanford University Heart Transplant Study was conducted to determine whether an experimental heart transplant program increased lifespan. Each patient entering the program was designated an official heart transplant candidate. Some patients got a transplant and some did not. Patients in the treatment group got a transplant and those in the control group did not. In the study, the difference in the proportion of subjects surviving in the treatment and the control groups (treatment - control) is 0.23.
Suppose that researchers calculated a p-value of 0.02. What does this tell you about the strength of evidence relating to the hypotheses?
*There is strong evidence that the proportion of patients in the treatment group who survived to the end of the study is discernibly higher than the proportion of patients in the control group who survived to the end of the study.*
### HT.6
Every year, the United States Department of Health and Human Services releases to the public a large data set containing information on births recorded in the country. In this problem you will work with a random sample of 1,000 cases from the data set released in 2014.
You might have heard that human gestation is typically 40 weeks; however, a friend mentions that recent increases in cesarean births are likely to have decreased length of gestation.
Below are summary statistics of the gestation length from the random sample of 1,000 people.
<img src="https://hackmd.io/_uploads/HyeoX5i0T.png" width="55%"/>
Calculate the standardized test statistic for this situation.
$t = \dfrac{38.67 - 40}{3/\sqrt{1000}} \approx -14.019$
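The arithmetic can be checked in R. This is a minimal sketch; the mean (38.67) and standard deviation (3) are taken from the calculation above, and the variable names are illustrative.

```r
x_bar <- 38.67   # sample mean gestation length (weeks)
mu_0  <- 40      # value under the null hypothesis
s     <- 3       # sample standard deviation used in the calculation above
n     <- 1000    # sample size

se     <- s / sqrt(n)            # standard error of the sample mean
t_stat <- (x_bar - mu_0) / se    # standardized test statistic
t_stat

# Lower-tail p-value, since the claim is that gestation length has decreased
pt(t_stat, df = n - 1)
```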
### HT.7
An insurance company checks records on 582 accidents selected at random and notes that teenagers were at the wheel in 91 of them. Calculate the standardized test statistic that can be used to determine whether less than 20% of auto accidents involve teenage drivers.
Notice that $91/582 \approx 0.156$
$$z = \dfrac{0.156 - 0.2}{\sqrt{\dfrac{0.2(1-0.2)}{582}}}$$
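To finish the calculation, here is a short R sketch with illustrative variable names. It uses the unrounded sample proportion 91/582, which gives a z statistic of roughly -2.63. The standard error is based on the null value 0.2, as in the formula above.

```r
x   <- 91     # accidents with a teenage driver
n   <- 582    # accidents sampled
p_0 <- 0.20   # null value

p_hat <- x / n
se_0  <- sqrt(p_0 * (1 - p_0) / n)   # standard error computed under H_0
z     <- (p_hat - p_0) / se_0
z

# Lower-tail p-value for H_a: p < 0.2
pnorm(z)
```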
### HT.8
Suppose that you wish to test $H_0 : p = 0.2$ vs $H_a : p \ne 0.2$ using the sample results from a random sample of size $n = 1000$. You have calculated the test statistic $z=4.74$. Assume that all conditions necessary for inference are met.
Describe how to find the p-value for this hypothesis test.
*To find the p-value, we use the standard normal distribution. The alternative hypothesis is two-sided (≠), so we need to find the area under the standard normal curve that is at least 4.74 units away from the center (0). Below is a sketch.*
<img src="https://hackmd.io/_uploads/SyFEdFV10.png" width="50%"/>
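In R, the two tail areas described above can be found with `pnorm()`. This is a minimal sketch, with the test statistic taken from the problem:

```r
z <- 4.74

# Area beyond |z| in one tail, doubled for the two-sided alternative
p_value <- 2 * pnorm(-abs(z))
p_value
```

Using `-abs(z)` keeps the calculation correct whether the observed statistic is positive or negative.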
### HT.9
A random sample of 48 students at a large university reported getting an average of 7 hours of sleep on weeknights, with standard deviation 1.62 hours. Assuming that all conditions for inference are met, describe how you can calculate the p-value for the following hypotheses if the test statistic is $t=-4.28$.
$H_0: \mu = 8$ vs. $H_a: \mu < 8$.
*To find the p-value, we use the t distribution with 47 degrees of freedom. The alternative hypothesis is one-sided, so we need to find the area under the t curve that is to the left of -4.28. (A clear sketch would also be fine.)*
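The description above translates to a single `pt()` call in R. The sketch below also recomputes the test statistic from the summary values in the problem; the variable names are illustrative.

```r
x_bar <- 7      # sample mean hours of sleep
mu_0  <- 8      # null value
s     <- 1.62   # sample standard deviation
n     <- 48     # sample size

t_stat <- (x_bar - mu_0) / (s / sqrt(n))
t_stat

# Area to the left of the test statistic under a t curve with n - 1 = 47 df
p_value <- pt(t_stat, df = n - 1)
p_value
```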
### HT.10 (core)
The US Environmental Protection Agency has set the action level for lead contamination of drinking water at 15 ppb (parts per billion). Samples are regularly tested to ensure that mean lead contamination is not above this level. Suppose we are testing:
$H_0: \mu = 15$ vs. $H_a: \mu > 15$
Suppose that researchers calculated a p-value of 0.02. State the conclusion to this hypothesis test in the context of the problem.
*There is statistically discernible evidence that the average lead concentration in drinking water is above the action level of 15 ppb (p-value = 0.02).*
## Confidence Intervals
### CI.1
In the 2016 Olympic Men's Marathon, 140 athletes finished the race. Below are summary statistics of those finish times (in minutes).
|n | mean | median | standard deviation|
|--|----|-----|----|
|140|142.367|140.765|7.723|
Suppose that you draw a random sample of size $n=15$ from these 140 marathon times.
Describe how the sampling distribution of the sample mean from a random sample of size $n=15$ compares to the distribution of all marathon times. Be sure to comment on the shape, center, and spread.
*With a smaller sample size of $n=15$ we can't assume that the sampling distribution of the sample mean will be unimodal and symmetric, so we'll focus on the mean and standard deviation of the distributions.*
*The center (mean) of the sampling distribution should be the same as the population distribution of all marathon times, since we have random sampling.*
*The spread (sd) of the sampling distribution for the sample mean is $\sigma/\sqrt{n} = 7.723/\sqrt{15} \approx 1.994$, which is considerably smaller than the spread of the population distribution (7.723).*
### CI.2 (core)
Lead in groundwater poses a serious public-health problem. A study conducted in Minnesota measured the water quality of 895 randomly selected wells. One of the contaminants measured was lead concentration (in ppb). Percentiles of the bootstrap distribution for the sample mean are provided. Find a 97% confidence interval for the mean lead concentration in Minnesota wells.
<img src="https://hackmd.io/_uploads/S1P9-ZcR6.png" width="90%"/>
*To find a 97% confidence interval, we need the 1.5th and 98.5th percentiles, so our interval is 0.7 to 2.05 ppb.*
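For reference, a percentile bootstrap interval of this kind could be produced along the following lines. The well data are not included here, so this sketch uses simulated, right-skewed values (`lead`) as a hypothetical stand-in; its output will not match the interval above.

```r
set.seed(120)  # arbitrary seed so the sketch is reproducible

# Hypothetical stand-in for the 895 lead measurements (ppb)
lead <- rexp(895, rate = 1 / 1.27)

# Resample with replacement and record the sample mean each time
boot_means <- replicate(5000, mean(sample(lead, replace = TRUE)))

# 97% percentile interval: cut 1.5% from each tail
ci <- quantile(boot_means, probs = c(0.015, 0.985))
ci
```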
### CI.3 (core)
Lead in groundwater poses a serious public-health problem. A study conducted in Minnesota measured the water quality of 895 randomly selected wells. One of the contaminants measured was lead concentration (in ppb). A statistician calculated an 89% confidence interval to be (0.82, 1.8). Interpret this confidence interval in the context of the problem.
*We are 89% confident that the average lead concentration in Minnesota wells is between 0.82 and 1.8 ppb.*
### CI.4 (core)
A group of researchers who are interested in the possible effects of distracting stimuli during eating, such as an increase or decrease in the amount of food consumption, monitored food intake for a group of 44 patients who were randomized into two equal groups. The treatment group ate lunch while playing solitaire, and the control group ate lunch without any added distractions. Patients in the treatment group ate 52.1 grams of biscuits, with a standard deviation of 45.1 grams, and patients in the control group ate 27.1 grams of biscuits, with a standard deviation of 26.4 grams.
The researchers found a 95% confidence interval for the difference in mean biscuit consumption to be (6.41, 43.59). Interpret this confidence interval in the context of the problem.
*We are 95% confident that people who eat while distracted (playing solitaire) eat, on average, between 6.41 and 43.59 grams more biscuits than people who eat without distractions.*
### CI.5
According to a survey by the UCLA Higher Education Institute, 69 percent of the first year college students in the sample reported feeling homesick. A 95% confidence interval for this proportion is given by (67%, 71%).
Explain to a friend who has not taken Stat 120 why we use the words "confidence" or "sure" rather than "probability" or "chance" when interpreting this confidence interval.
*We don't use the word probability or chance because the parameter value either is contained in our interval or not (i.e., a 100% chance or 0% chance). Our confidence is in the process used to create our confidence interval. Thus, when we say "we are 95% confident" we are saying that 95% of intervals constructed using the same process will contain the population proportion (parameter).*
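The "process" idea can be illustrated with a small simulation. Everything in this sketch is assumed for illustration: it pretends the true proportion is 0.69 and that each survey has 2000 respondents (the actual sample size is not given here), then counts how often the resulting 95% interval captures the true value.

```r
set.seed(120)  # arbitrary seed so the sketch is reproducible

p_true <- 0.69   # assumed "true" proportion, for illustration only
n      <- 2000   # assumed sample size
reps   <- 5000   # number of simulated surveys

covered <- replicate(reps, {
  p_hat <- rbinom(1, n, p_true) / n                      # one simulated survey
  me    <- qnorm(0.975) * sqrt(p_hat * (1 - p_hat) / n)  # 95% margin of error
  (p_hat - me <= p_true) && (p_true <= p_hat + me)       # did the interval capture p_true?
})

# Share of intervals that captured the true proportion (should be near 0.95)
mean(covered)
```

Each individual interval either contains 0.69 or it does not; the 95% describes how often the method succeeds across repeated samples.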
### CI.6 (core)
An insurance company checks records on 582 accidents selected at random and calculated a 95% confidence interval for the proportion of all auto accidents that involve teenage drivers to be (0.128, 0.189).
A politician urging tighter restrictions on drivers' licenses issued to teens says, "In one of every five auto accidents, a teenager is behind the wheel." Do the insurance company's findings support or contradict this statement? Explain.
*One in every five auto accidents means that p = 0.2. Since 0.2 is not in our confidence interval it is not a plausible value of the proportion of accidents involving teenage drivers; thus, our findings contradict this statement.*
### CI.7
An insurance company checks records on 582 accidents selected at random and notes that teenagers were at the wheel in 91 of them. Calculate a 92% confidence interval for the proportion of all auto accidents that involve teenage drivers. (Plug-in completely, but do not simplify.)
```{r}
qnorm(.90)  # 1.281552
qnorm(.92)  # 1.405072
qnorm(.95)  # 1.644854
qnorm(.96)  # 1.750686
qnorm(.97)  # 1.880794
qnorm(.98)  # 2.053749
qnorm(.99)  # 2.326348
```
$\widehat{p} \pm z^* SE = \frac{91}{582} \pm 1.750686 \sqrt{\frac{(91/582)(1-91/582)}{582}}$
*Note that $z^*$ is the 0.96 quantile from the standard normal distribution.*
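Although the problem does not ask for a simplified answer, the plug-in expression can be evaluated in R as a check. This is a sketch with illustrative variable names:

```r
x <- 91    # accidents with a teenage driver
n <- 582   # accidents sampled

p_hat  <- x / n
z_star <- qnorm(0.96)                     # leaves 4% in each tail for 92% confidence
se     <- sqrt(p_hat * (1 - p_hat) / n)   # standard error of the sample proportion

ci <- p_hat + c(-1, 1) * z_star * se
ci
```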
### CI.8
Lead in groundwater poses a serious public-health problem. A study conducted in Minnesota measured the water quality of 895 randomly selected wells. One of the contaminants measured was lead concentration (in ppb). Below are summary statistics from this random sample calculated via `favstats()`.
```
min Q1 median Q3 max mean sd n missing
0.02 0.09 0.23 0.61 210.29 1.268246 9.253287 895 0
```
Calculate a 95% confidence interval for the mean lead concentration in Minnesota wells. (Plug-in completely, but do not simplify.)
$\bar{x} \pm t^* (s/\sqrt{n}) = 1.268246 \pm t^* (9.253287/\sqrt{895})$
*Here, $t^*$ is the 0.975 quantile from a t distribution with 894 degrees of freedom.*
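The critical value and the resulting interval can be obtained in R with `qt()`. This is a minimal sketch using the summary statistics printed above; the variable names are illustrative.

```r
x_bar <- 1.268246   # sample mean (ppb)
s     <- 9.253287   # sample standard deviation (ppb)
n     <- 895        # number of wells

t_star <- qt(0.975, df = n - 1)   # 0.975 quantile of t with 894 df
se     <- s / sqrt(n)             # standard error of the sample mean

ci <- x_bar + c(-1, 1) * t_star * se
t_star
ci
```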
### CI.9
Suppose that we have calculated a 99% confidence interval for the proportion of University of Minnesota students graduating from a Minnesota high school. If we plan to construct a second 99% confidence interval based on a new sample, how can we reduce the margin of error?
*To reduce the margin of error of our new interval, we can increase the sample size.*
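To see why a larger sample helps, the sketch below evaluates the margin of error $z^*\sqrt{\widehat{p}(1-\widehat{p})/n}$ at a few sample sizes. The sample proportion (0.5) and the sample sizes are hypothetical values chosen only for illustration.

```r
z_star <- qnorm(0.995)        # critical value for 99% confidence
p_hat  <- 0.5                 # hypothetical sample proportion
n      <- c(100, 400, 1600)   # hypothetical sample sizes

me <- z_star * sqrt(p_hat * (1 - p_hat) / n)
data.frame(n = n, margin_of_error = round(me, 3))
```

Because $n$ sits under a square root, quadrupling the sample size cuts the margin of error in half.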