owned this note
owned this note
Published
Linked with GitHub
---
title: "Tutorial 9: Paired t-test"
author: "Elizabeth A. Albright, PhD"
output: rtf_document
---
# Tutorial 9: Paired t-test
```{r libraries, include=F}
library(moments) #package for skewness
library(knitr) #package for making tables (kable)
library(tidyverse) #multiple packages for data wrangling
library(gt) # a package to make tables
library(lubridate) # a package to manipulate dates
```
```{r tutorial6}
rm(list=ls()) #removing objects
airquality.df<-read_csv("airquality.csv") #reading data
airquality.df <- airquality.df %>% #making airquality data frame
mutate(date=mdy(`Date Local`))%>% #making date variable
glimpse() #looking at data
```
In this next chunk I am making a new data frame of paired data, so i can show you how to calculate a paired t-test. For each monitoring site (`Site Num`) i want two observed values, one observation on July 1, 2019, and one observation of December 31, 2019. I am doing this to make a paired data set for demonstration purposes.
```{r paired}
az.paired.df<-airquality.df %>% # making a new dataframe named az.paired.df
filter(`State Name`=="Arizona") %>% #filtering only Arizona by State Name
filter(date =="2019-07-01"|date=="2019-12-31")%>% #filtering only two dates
group_by(`Site Num`)%>% #im grouping by Site Num so i can count how many observations for each Site Num
mutate(n=n())%>% #i'm using mutate to count # of observations for each Site Num. the function is n()
filter(n==2)%>% #i'm filtering here on n for two observations
glimpse() #i'm looking at the data
```
To make things easier, I'm selecting the variables of interest: `Site Num`, date, ozone.
```{r}
az.paired.df<-az.paired.df%>%
select(`Site Num`, date, ozone)%>%
arrange(`Site Num`, date)
head(az.paired.df)
```
We can run the paired t-test using the following chunk with the grouping variable date.
```{r}
t.test(ozone ~ date, az.paired.df, alternative="greater", paired=T)
```
But now we need to test the assumption of the paired t-test--this is normally distributed population of differences. This is a little tricky because of how the data are structured with a grouping variable (long data!), but we can pivot the data to make it wide with the dplyr function pivot_wider().
```{r}
az.paired.wide.df<-az.paired.df%>% #making a new data frame in the wide format
pivot_wider(names_from=date, values_from=ozone)%>% #making data wider by taking the names from date (2019-12-31 and 2019-07-01) and placing the values of the observations into those two columns
glimpse()
az.paired.wide.df
```
We now can use mutate() to calculate the difference between the two variables
Our new wide data set should have three variables: (1) `Site Num`, `2019-12-31` and `2019-07-01`. And now we can calculate the difference in ozone levels (ppm) between the two dates.
```{r}
az.paired.wide.df<-az.paired.wide.df%>% #Now I am taking the
mutate(diff=`2019-07-01`-`2019-12-31`) #making a new variable using subtraction
az.paired.wide.df #allows us to see the new data frame with the diff variable
```
Now we can check to see whether the differences are pulled from a normal distribution (with the Shapiro-Wilk test).
```{r}
shapiro.test(az.paired.wide.df$diff)
```
Argh. Barely okay. Not great. Ugh. let's make a histogram of the differences. I'm not going to make it pretty. You should do that in your assignments, though = ).
```{r}
ggplot(data=az.paired.wide.df, aes(x=diff))+
geom_histogram()
```
Yeah, looks to be positively skewed (you should calculate skewness).
What could we do? You guessed it--we could log the data and take the difference of the logs.
```{r}
az.paired.wide.df<-az.paired.wide.df%>%
mutate(log.2019.07.01=log(`2019-07-01`),
log.2019.12.31=log(`2019-12-31`),
diff.in.log=log.2019.07.01-log.2019.12.31)%>%
glimpse()
```
Now we can do a paired t-test on the differences in the logged data.
```{r}
t.test(az.paired.wide.df$log.2019.07.01, az.paired.wide.df$log.2019.12.31, paired=TRUE, alternative="greater")
```
```{r}
exp(0.2874)
exp(0.2238)
```
1.33 is the estimate of the median of the individual ratios. Please see the Statistical Sleuth on page 74. In a randomized paired treatment study, this value exp(Z) represents a multiplicative treatment effect.
Now let's calculate Shapiro Wilk on the differences of the logged values!
```{r}
shapiro.test(az.paired.wide.df$diff.in.log)
```
```{r}
ggplot(data=az.paired.wide.df, aes(x=diff.in.log))+
geom_histogram()
```
Well crap--that really didn't help (and it looks like it made things worse (Shapiro Wilk has an even smaller p-value which means we reject the null that the sample (of the differences) was drawn from a normal distribution)).
What else could we try?!?! How about a comparison test that does NOT assume normality of the population of differences? That's the non-parametric Wilcoxon signed rank test. The function is wilcox.test(). Use ?wilcox.test to see the argument.
```{r}
non.parametric.paired <- wilcox.test(az.paired.wide.df$`2019-07-01`, az.paired.wide.df$`2019-12-31`, paired = TRUE, alternative="greater")
non.parametric.paired
```
We report the V (701) and p-value (p<0.001) in the context of the hypotheses for the Wilcoxon signed rank test (see lecture notes).
So, okay, let's summarize what we did:
(1) a paired t-test comparing ozone on July 1, 2019 to December 31, 2019 to say something about the mean of the differences. the differences marginally passed the Shapiro-Wilk test. Paired t-test isn't very sensitive to violations of normality assumption, so may be okay.
(2) a paired t-test comparing the log of ozone on July 1, 2019 to log ozone on December 31, 2019 to say something about the mean of the differences in log ozone. Differences in log ozone levels did not pass Shapiro Wilk test--I would report, but put less weight behind this test than (1).
(3) ran a non-parametric Wilcoxon signed rank test. No assumption of normality needed. Conclusion is broader those (a location shift in distribution).