# Group Lab
[TOC]
# Code
## Setup the environment
```{r libraries, include=FALSE, echo = FALSE, warning= FALSE, message = FALSE}
rm(list=ls())
options(scipen = 999)
library(moments)
library(knitr)
library(tidyverse)
library(gt)
library(broom)
getwd()
setwd("wherever it is for you") ## set your working directory
fracking.df <- read.csv("waterquality.csv")
fracking.df$proximity <- factor(fracking.df$proximity, levels=c("Near", "Far")) ## define the proximity as factor
fracking.df$location <- factor(fracking.df$location, levels=c("Valley", "Upland")) ## define the location as factor
```
### Logging the data
``` {r logging the data, echo = FALSE, warning= FALSE, message = FALSE}
fracking.df<- fracking.df %>%
mutate(log_methane=log(methane))
```
### Filter
```{r filtering data, echo = FALSE, warning= FALSE, message = FALSE}
near_upland.df <- fracking.df %>%
filter(proximity == "Near" &
location == "Upland")
far_upland.df <- fracking.df %>%
filter(proximity == "Far" &
location == "Upland")
near_valley.df <- fracking.df %>%
filter(proximity == "Near" &
location == "Valley")
far_valley.df <- fracking.df %>%
filter(proximity == "Far" &
location == "Valley")
valley.df <- fracking.df %>%
filter(location == "Valley")
upland.df <- fracking.df %>%
filter(location == "Upland")
```
## Describing the data
### summary table
```{r summary table, echo = FALSE, warning= FALSE, message = FALSE}
summary_methane.df <- fracking.df %>%
dplyr::group_by(proximity) %>%
select(methane) %>%
dplyr::summarize(
length.methane=length(methane),
mean.methane=mean(methane),
median.methane=median(methane),
sd.methane=sd(methane),
skew.methane=skewness(methane))
summary.valley.df <- valley.df %>%
group_by(proximity) %>%
dplyr::summarize(
length.methane = length(methane),
mean.methane = mean(methane),
median.methane = median(methane),
sd.methane = sd(methane),
skew.methane = skewness(methane))
summary.upland.df <- upland.df %>%
group_by(proximity) %>%
dplyr::summarize(
length.methane = length(methane),
mean.methane = mean(methane),
median.methane = median(methane),
sd.methane = sd(methane),
skew.methane = skewness(methane))
summary.four.df <- fracking.df %>%
group_by(location) %>%
select(methane) %>%
dplyr::summarize(
length.methane=length(methane),
mean.methane=mean(methane),
median.methane=median(methane),
sd.methane=sd(methane),
skew.methane=skewness(methane))
proximity <- c("All", "All")
summary.four.df <- cbind(summary.four.df,proximity)
summary.four.df = summary.four.df[,c(1,7,2,3,4,5,6)]
location <- c("All","All","Valley","Valley", "Upland", "Upland")
summary.df <- rbind(summary_methane.df,
summary.valley.df,
summary.upland.df)
summary.df <- summary.df %>%
cbind(location)
summary.df = summary.df[,c(7,1,2,3,4,5,6)]
summary.df <- rbind(summary.df, summary.four.df)
summary.df %>%
gt() %>%
tab_options(
table.width = pct(80)) %>%
cols_width(
c(length.methane) ~ px(100),
everything() ~ px(80)) %>%
tab_header(
title = md("Summary Statistics of Methane Levels around northeastern Pennsylvania Fracking Sites"),
subtitle = "ug/L methane in well water") %>%
fmt_passthrough(columns=c(proximity)) %>%
fmt_passthrough(columns =c(location))%>%
fmt_number(columns = c(length.methane), decimals = 0) %>%
fmt_number(columns = c(mean.methane), decimals=2) %>%
fmt_number(columns = c(median.methane), decimals = 2) %>%
fmt_number(columns = c(sd.methane), decimals = 2) %>%
fmt_number(columns = c(skew.methane), decimals = 2) %>%
cols_label(
location = "Location",
proximity ="Proximity",
length.methane = "Observations",
mean.methane = "Mean",
median.methane = "Median",
sd.methane = "SD",
skew.methane = "Skewness")
```
## Comparison between Proximity in the whole Area
``` {r data.entry(),echo = FALSE, warning= FALSE, message = FALSE}
fracking.df <- read.csv("PAFracking.csv")
```
``` {r subset,echo = FALSE, warning= FALSE, message = FALSE}
near_fracking.df <- fracking.df %>%
filter(`proximity` == "Near")
far_fracking.df <- fracking.df %>%
filter(`proximity` == "Far")
```
##### Histogram
``` {r facet_hist,echo = FALSE, warning= FALSE, message = FALSE}
methane<-ggplot(fracking.df, aes(x=methane, fill=`proximity`))+
geom_histogram(col="black", binwidth = .5)+
scale_x_continuous(trans="log")+
scale_fill_manual(values=c("royalblue", "gray"))
methane+facet_grid(`proximity`~.)+
labs(x="Methane Concentration (ug/L)", title="Methane Concentration in PA Drinking Wells by Proximity to Fracking,
All Topographies",
fill="Proximity")
```
``` {r tables,echo = FALSE, warning= FALSE, message = FALSE}
summary_methane.df <- fracking.df %>%
dplyr::group_by(proximity) %>%
select(methane) %>%
dplyr::summarize(
length_methane=length(methane),
mean_methane=mean(methane),
median_methane=median(methane),
sd_methane=sd(methane),
skew_methane=skewness(methane)
)
summary_methane.df
```
```{r gt_table,echo = FALSE, warning= FALSE, message = FALSE}
summary_methane.df %>%
gt() %>%
tab_header(
title = md("Summary Statistics of Methane Levels in Pennsylvania Wells")) %>%
#this section sets what is in each column and formats the numbers in the columns.
fmt_passthrough (columns=c(`proximity`)) %>%
fmt_number(columns = c(length_methane), decimals = 0) %>%
fmt_number(columns = c(mean_methane), decimals=4) %>%
fmt_number(columns = c(median_methane), decimals = 4) %>%
fmt_number(columns = c(sd_methane), decimals = 4) %>%
fmt_number(columns = c(skew_methane), decimals = 3) %>%
cols_label(
`proximity`="Proximity",
length_methane = "Observations",
mean_methane = "Mean",
median_methane = "Median",
sd_methane = "SD",
skew_methane = "Skewness" )
```
```{r normality test,echo = FALSE, warning= FALSE, message = FALSE}
shapiro.test(near_fracking.df$methane)
shapiro.test(far_fracking.df$methane)
shapiro.test(near_fracking.df$log_methane)
shapiro.test(far_fracking.df$log_methane)
```
``` {r tests,echo = FALSE, warning= FALSE, message = FALSE}
unlogged_methane<-t.test(methane ~ `proximity`, fracking.df, alternative="greater")
logged_methane<-t.test(log_methane ~ `proximity`, fracking.df, alternative="greater")
non_parametric<-wilcox.test(methane ~ `proximity`, data=fracking.df)
unlogged_methane
logged_methane
non_parametric
results.df<-rbind(tidy(unlogged_methane,conf.int=TRUE),tidy(logged_methane,conf.int=TRUE))
rowname_results <- c("Original Data", "Logged Data")
results.df <- results.df %>%
cbind(rowname_results) %>%
subset(select = c(-conf.high,-method,-alternative))%>%
select(rowname_results,everything())
results.df
non_para_results.df<- rbind(tidy(non_parametric))
non_para_rowname_results <- c("Original Data")
non_para_results.df <- non_para_results.df %>%
cbind(non_para_rowname_results) %>%
subset(select = c(-method, -alternative)) %>%
select(non_para_rowname_results,everything())
non_para_results.df
```
``` {r para_test_table,echo = FALSE, warning= FALSE, message = FALSE}
results.df %>%
gt() %>%
tab_header(
title = md("Parametric Test Statistics of Methane Levels in Pennsylvania Wells, All Topographies")) %>%
#this section sets what is in each column and formats the numbers in the columns.
fmt_passthrough(columns = c(rowname_results)) %>%
fmt_number (columns=c(estimate), decimals = 2) %>%
fmt_number(columns = c(estimate1), decimals = 2) %>%
fmt_number(columns = c(estimate2), decimals=2) %>%
fmt_number(columns = c(statistic), decimals = 2) %>%
fmt_number(columns = c(p.value), decimals = 2) %>%
fmt_number(columns = c(parameter), decimals = 2) %>%
fmt_number(columns = c(conf.low), decimals = 2) %>%
cols_label(
rowname_results = "Data",
estimate = "Difference in Means (ug/L)",
estimate1 = "Near Mean (ug/L)",
estimate2 = "Far Mean (ug/L)",
statistic = "T-Stat",
p.value = "P-Value",
parameter = "Degrees Freedom",
conf.low = "Confidence Interval Low")
```
``` {r non_para_test_table,echo = FALSE, warning= FALSE, message = FALSE}
non_para_results.df %>%
gt() %>%
tab_header(
title = md("Non-Parametric Test Statistics of Methane Levels in Pennsylvania Wells")) %>%
fmt_passthrough(columns = c(non_para_rowname_results)) %>%
fmt_number(columns = c(statistic), decimals = 3) %>%
fmt_number(columns = c(p.value), decimals = 3) %>%
cols_label(
non_para_rowname_results = "Data",
statistic = "W",
p.value = "P-Value")
```
``` {r log_transformed_interpretation,echo = FALSE, warning= FALSE, message = FALSE}
exp(1.60013-0.689035)
```
## Comparison between Proximity in Valley
```{r libraries, include=FALSE,echo = FALSE, warning= FALSE, message = FALSE}
rm(list=ls())
options(scipen = 999)
library(moments)
library(knitr)
library(tidyverse)
library(gt)
library(broom)
getwd()
fracking.df<-read.csv("waterquality.csv")
```
```{r filtering data,echo = FALSE, warning= FALSE, message = FALSE}
near_valley.df <- fracking.df %>%
filter(proximity == "Near" &
location == "Valley")
far_valley.df <- fracking.df %>%
filter(proximity == "Far" &
location == "Valley")
valley.df <- fracking.df %>%
filter(location == "Valley")
```
### Q2 Summary
```{r summary for Q2,echo = FALSE, warning= FALSE, message = FALSE}
hist(near_valley.df$methane, breaks = 8)
table(near_valley.df$methane)
summary.valley.df <- valley.df %>%
group_by(proximity) %>%
dplyr::summarize(
length.methane = length(methane),
mean.methane = mean(methane),
median.methane = median(methane),
sd.methane = sd(methane),
skew.methane = skewness(methane))
summary.valley.df
```
### Histogram
```{r data observation for Q2,echo = FALSE, warning= FALSE, message = FALSE}
p<-ggplot(valley.df, aes(x=methane, fill=proximity))+
geom_histogram(col="black")+
scale_x_continuous(trans="log")+
scale_fill_manual(values=c("royalblue", "gray"))
p + facet_grid(proximity ~ .)+
labs(x="Methane Levels around northeastern Pennsylvania Fracking Sites, ug/L", title="Well methane levels at near(n=127) and far from(n=709) fracking Sites in Valley",
fill="Proximity")
```
### GT
```{r good-looking table for data observation for Q2,echo = FALSE, warning= FALSE, message = FALSE}
summary.valley.df %>%
gt() %>%
tab_options(
table.width = pct(80)) %>%
cols_width(
c(length.methane) ~ px(100),
everything() ~ px(80)) %>%
tab_header(
title = md("Summary Statistics of Methane Levels around northeastern Pennsylvania Fracking Sites"),
subtitle = "ug/L methane in well water") %>%
fmt_passthrough (columns=c(`proximity`)) %>%
fmt_number(columns = c(length.methane), decimals = 0) %>%
fmt_number(columns = c(mean.methane), decimals=2) %>%
fmt_number(columns = c(median.methane), decimals = 2) %>%
fmt_number(columns = c(sd.methane), decimals = 2) %>%
fmt_number(columns = c(skew.methane), decimals = 2) %>%
cols_label(
proximity ="Proximity",
length.methane = "Observations",
mean.methane = "Mean",
median.methane = "Median",
sd.methane = "SD",
skew.methane = "Skewness" )
```
### Q2 Normality Test
```{r normality test for Q2,echo = FALSE, warning= FALSE, message = FALSE}
shapiro.test(near_valley.df$methane)
shapiro.test(far_valley.df$methane)
```
The p-values are small, so we reject the null hypothesis of normality: the data are not normally distributed.
```{r logarithm for Q2,echo = FALSE, warning= FALSE, message = FALSE}
near_valley.df$log.methane <- log(near_valley.df$methane)
far_valley.df$log.methane <- log(far_valley.df$methane)
valley.df$log.methane <- log(valley.df$methane)
```
```{r shapiro test for logged data,echo = FALSE, warning= FALSE, message = FALSE}
shapiro.test(near_valley.df$log.methane)
shapiro.test(far_valley.df$log.methane)
```
The p-values are still small, so we again reject the null hypothesis of normality: the logged data are not normally distributed either.
### Q2 t-test
```{r t.test for Q2,echo = FALSE, warning= FALSE, message = FALSE}
valley.df$proximity <- as.factor(valley.df$proximity)
levels(valley.df$proximity)
## Reverse the levels once so that "Near" comes first (assuming the default
## alphabetical order Far, Near); the one-sided test is then Near > Far
valley.df$proximity <- fct_rev(valley.df$proximity)
t.origin2 <- t.test(methane ~ proximity, valley.df, alternative = "greater")
t.log2 <- t.test(log.methane ~ proximity, valley.df, alternative = "greater")
question2.t.p <- rbind(
tidy(t.origin2, conf.int=TRUE),
tidy(t.log2))
rowname2 <- c("Original Data", "Logged Data")
question2.t <- question2.t.p %>%
cbind(rowname2) %>%
subset(select = c(-conf.high,-method,-alternative))%>%
select(rowname2,everything())
question2.t
## Undo the reverse
valley.df$proximity <- fct_rev(valley.df$proximity)
```
```{r good-looking t.test table for Q2,echo = FALSE, warning= FALSE, message = FALSE}
question2.t %>%
gt() %>%
tab_options(
table.width = pct(80)) %>%
tab_header(
title = md("t-test Results for Comparison between *Near* and *Far* Fracking Sites in valley Area")) %>%
fmt_passthrough (columns=c(rowname2)) %>%
fmt_number(columns = c(estimate), decimals = 2) %>%
fmt_number(columns = c(estimate1), decimals=2) %>%
fmt_number(columns = c(estimate2), decimals = 2) %>%
fmt_number(columns = c(statistic), decimals = 2) %>%
fmt_number(columns = c(p.value)) %>%
fmt_number(columns = c(parameter), decimals = 2)%>%
fmt_number(columns = c(conf.low), decimals = 2)%>%
cols_label(
rowname2 ="Data",
estimate = "Sample Mean Difference",
estimate1 = "Near (valley) Sample Mean",
estimate2 = "Far (valley) Sample Mean",
statistic = "t-stats",
p.value = "p-value",
parameter = "Degree of Freedom",
conf.low = "C.I. Lower")
```
### Q2 Wilcoxon test
```{r Wilcoxon/Mann Whitney U Test for Q2,echo = FALSE, warning= FALSE, message = FALSE}
wilcox.methane.2 <- wilcox.test(methane~proximity, data=valley.df)
valley.w <- tidy(wilcox.methane.2)
valley.w
```
## Comparison between Proximity in Upland
### Summary for Q3
```{r summary for Q3,echo = FALSE, warning= FALSE, message = FALSE}
hist(near_upland.df$methane, breaks = 100)
table(near_upland.df$methane)
summary.upland.df <- upland.df %>%
group_by(proximity) %>%
dplyr::summarize(
length.methane = length(methane),
mean.methane = mean(methane),
median.methane = median(methane),
sd.methane = sd(methane),
skew.methane = skewness(methane))
summary.upland.df
```
#### Histogram
```{r data observation for Q3,echo = FALSE, warning= FALSE, message = FALSE}
p<-ggplot(upland.df, aes(x=methane, fill=proximity))+
geom_histogram(col="black")+
scale_fill_manual(values=c("royalblue", "gray"))
p + facet_grid(proximity ~ .)+
labs(x="Methane Levels around northeastern Pennsylvania Fracking Sites, ug/L", title="Well methane levels at near(n=127) and far from(n=709) fracking Sites in Upland",
fill="Proximity")
```
##### logged histogram
```{r data observation with log scale for Q3,echo = FALSE, warning= FALSE, message = FALSE}
p<-ggplot(upland.df, aes(x=methane, fill=proximity))+
geom_histogram(col="black")+
scale_x_continuous(trans = "log", labels = scales::label_number(accuracy = 0.01))+
scale_fill_manual(values=c("royalblue", "gray"))
p + facet_grid(proximity ~ .)+
labs(x="Methane Levels around northeastern Pennsylvania Fracking Sites, ug/L(logged)", title="Well methane levels at near(n=127) and far from(n=709) fracking Sites in Upland",
fill="Proximity")
```
#### gt
```{r good-looking table for data observation for Q3,echo = FALSE, warning= FALSE, message = FALSE}
summary.upland.df %>%
gt() %>%
tab_options(
table.width = pct(80)) %>%
cols_width(
c(length.methane) ~ px(100),
everything() ~ px(80)) %>%
tab_header(
title = md("Summary Statistics of Methane Levels around northeastern Pennsylvania Fracking Sites"),
subtitle = "ug/L methane in well water") %>%
fmt_passthrough (columns=c(`proximity`)) %>%
fmt_number(columns = c(length.methane), decimals = 0) %>%
fmt_number(columns = c(mean.methane), decimals=2) %>%
fmt_number(columns = c(median.methane), decimals = 2) %>%
fmt_number(columns = c(sd.methane), decimals = 2) %>%
fmt_number(columns = c(skew.methane), decimals = 2) %>%
cols_label(
proximity ="Proximity",
length.methane = "Observations",
mean.methane = "Mean",
median.methane = "Median",
sd.methane = "SD",
skew.methane = "Skewness" )
```
### Analysis for Q3
#### Normalize Attempt
```{r normality test for Q3,echo = FALSE, warning= FALSE, message = FALSE}
shapiro.test(near_upland.df$methane)
shapiro.test(far_upland.df$methane)
```
Both samples fail the Shapiro-Wilk test: we reject the null hypothesis that the samples were drawn from normal distributions.
```{r logarithm for Q3,echo = FALSE, warning= FALSE, message = FALSE}
near_upland.df$log.methane <- log(near_upland.df$methane)
far_upland.df$log.methane <- log(far_upland.df$methane)
upland.df$log.methane <- log(upland.df$methane)
```
```{r shapiro test for logged data,echo = FALSE, warning= FALSE, message = FALSE}
shapiro.test(near_upland.df$log.methane)
shapiro.test(far_upland.df$log.methane)
```
The p-values are still very small. Let's try another way: remove the observations recorded at a detection limit.
#### normalize attempt 2: remove detection limit data (neglect this part)
```{r remove data with dl for Q3,echo = FALSE, warning= FALSE, message = FALSE}
far_upland.ndl.df <- far_upland.df%>%
filter(dl == 0)
near_upland.ndl.df <- near_upland.df %>%
filter(dl == 0)
upland.ndl.df <- upland.df %>%
filter(dl == 0)
summary.upland.log.ndl.df <- upland.ndl.df %>%
group_by(proximity) %>%
dplyr::summarize(
length.methane = length(log.methane),
mean.methane = mean(log.methane),
median.methane = median(log.methane),
sd.methane = sd(log.methane),
skew.methane = skewness(log.methane))
summary.upland.log.ndl.df
table(upland.ndl.df$log.methane)
hist(upland.ndl.df$log.methane)
shapiro.test(far_upland.ndl.df$log.methane)
shapiro.test(near_upland.ndl.df$log.methane)
```
The histogram seems better, but the data are still not normally distributed. We can still run a t-test on them.
### t-test and non-parametric test for Q3
#### t-test
```{r t.test for Q3,echo = FALSE, warning= FALSE, message = FALSE}
upland.df$proximity <- as.factor(upland.df$proximity)
levels(upland.df$proximity)
upland.df$proximity <- fct_rev(upland.df$proximity)
upland.ndl.df$proximity <- fct_rev(upland.ndl.df$proximity)
t.origin3 <- t.test(methane ~ proximity, upland.df, alternative = "greater")
t.log3 <- t.test(log.methane ~ proximity, upland.df, alternative = "greater")
t.log.ndl3 <- t.test(log.methane ~ proximity, upland.ndl.df, alternative = "greater")
question3.t.p <- rbind(
tidy(t.origin3, conf.int=TRUE),
tidy(t.log3),
tidy(t.log.ndl3))
rowname3 <- c("Original Data", "Logged Data", "Logged without Detection Limit Data")
question3.t <- question3.t.p %>%
cbind(rowname3) %>%
subset(select = c(-conf.high,-method,-alternative))%>%
select(rowname3,everything())
question3.t
## Undo the reverse
upland.df$proximity <- fct_rev(upland.df$proximity)
upland.ndl.df$proximity <- fct_rev(upland.ndl.df$proximity)
```
##### gt for t-test 3
```{r good-looking t.test table for Q3,echo = FALSE, warning= FALSE, message = FALSE}
question3.t %>%
gt() %>%
tab_options(
table.width = pct(80)) %>%
tab_header(
title = md("t-test Results for Comparison between *Near* and *Far* Fracking Sites in Upland Area")) %>%
fmt_passthrough (columns=c(rowname3)) %>%
fmt_number(columns = c(estimate), decimals = 2) %>%
fmt_number(columns = c(estimate1), decimals=2) %>%
fmt_number(columns = c(estimate2), decimals = 2) %>%
fmt_number(columns = c(statistic), decimals = 2) %>%
fmt_number(columns = c(p.value)) %>%
fmt_number(columns = c(parameter), decimals = 2)%>%
fmt_number(columns = c(conf.low), decimals = 2)%>%
cols_label(
rowname3 ="Data",
estimate = "Sample Mean Difference",
estimate1 = "Near (upland) Sample Mean",
estimate2 = "Far (upland) Sample Mean",
statistic = "t-stats",
p.value = "p-value",
parameter = "Degree of Freedom",
conf.low = "C.I. Lower")
```
#### Wilcoxon test for Q3
```{r Wilcoxon/Mann Whitney U Test for Q3,echo = FALSE, warning= FALSE, message = FALSE}
wilcox.methane.3 <- wilcox.test(methane~proximity, data=upland.df)
upland.w <- tidy(wilcox.methane.3)
upland.w
```
```{r gt for w-test,echo = FALSE, warning= FALSE, message = FALSE}
upland.w <- upland.w %>%
mutate(data = c("original data"))%>%
subset(select = c(-method, -alternative)) %>%
select(data,everything())
upland.w
upland.w %>%
gt() %>%
tab_header(
title = md("Non-Parametric Test Statistics of Methane Levels for upland area in Pennsylvania Wells")) %>%
#this section sets what is in each column and formats the numbers in the columns.
fmt_passthrough(columns = c(data)) %>%
fmt_number(columns = c(statistic), decimals = 3) %>%
fmt_number(columns = c(p.value), decimals = 3) %>%
cols_label(
data = "Data",
statistic = "W",
p.value = "P-Value")
```
## Comparison between Location
### Summary for Q4
#### Histogram
```{r data observation for Q4,echo = FALSE, warning= FALSE, message = FALSE}
q<-ggplot(fracking.df, aes(x=methane, fill=location))+
geom_histogram(col="black")+
scale_fill_manual(values=c("royalblue", "gray"))
q + facet_grid(location ~ .)+
labs(x="Methane Levels around northeastern Pennsylvania Fracking Sites, ug/L", title="Well methane levels at valley(n=865) and upland(n=836)",
fill="Location")
```
##### logged data histogram
```{r logged data observation for Q4,echo = FALSE, warning= FALSE, message = FALSE}
q<-ggplot(fracking.df, aes(x=methane, fill=location))+
geom_histogram(col="black")+
scale_x_log10()+
scale_fill_manual(values=c("royalblue", "gray"))
q + facet_grid(location ~ .)+
labs(x="Methane Levels around northeastern Pennsylvania Fracking Sites, ug/L, logged", title="Well methane levels at valley(n=865) and upland(n=836)",
fill="Location")
```
#### gt
```{r summary and gt for question 4,echo = FALSE, warning= FALSE, message = FALSE}
summary.four.df <- fracking.df %>%
group_by(location) %>%
select(methane) %>%
dplyr::summarize(
length.methane=length(methane),
mean.methane=mean(methane),
median.methane=median(methane),
sd.methane=sd(methane),
skew.methane=skewness(methane),
)
summary.four.df
summary.four.df %>%
gt() %>%
tab_options(
table.width = pct(80)) %>%
cols_width(
c(length.methane) ~ px(100),
everything() ~ px(80)) %>%
tab_header(
title = md("Summary Statistics of Methane Levels around northeastern Pennsylvania Fracking Sites"),
subtitle = "ug/L methane in well water") %>%
fmt_passthrough (columns=c(location)) %>%
fmt_number(columns = c(length.methane), decimals = 0) %>%
fmt_number(columns = c(mean.methane), decimals=2) %>%
fmt_number(columns = c(median.methane), decimals = 2) %>%
fmt_number(columns = c(sd.methane), decimals = 2) %>%
fmt_number(columns = c(skew.methane), decimals = 2) %>%
cols_label(
location ="Location",
length.methane = "Observations",
mean.methane = "Mean",
median.methane = "Median",
sd.methane = "SD",
skew.methane = "Skewness" )
```
### Analysis for Q4
#### Normalize Attempt
```{r normality test for question 4,echo = FALSE, warning= FALSE, message = FALSE}
shapiro.test(valley.df$methane)
shapiro.test(upland.df$methane)
```
```{r normalize attempt,echo = FALSE, warning= FALSE, message = FALSE}
valley.df$log.methane <- log(valley.df$methane)
upland.df$log.methane <- log(upland.df$methane)
four.log.df <- rbind(valley.df,upland.df)
```
```{r shapiro test again for Q4,echo = FALSE, warning= FALSE, message = FALSE}
shapiro.test(valley.df$log.methane)
shapiro.test(upland.df$log.methane)
```
```{r remove data with dl for Q4,echo = FALSE, warning= FALSE, message = FALSE}
valley.ndl.df <- valley.df%>%
filter(dl == 0)
summary.valley.df <- valley.ndl.df %>%
group_by(proximity) %>%
dplyr::summarize(
length.methane = length(log.methane),
mean.methane = mean(log.methane),
median.methane = median(log.methane),
sd.methane = sd(log.methane),
skew.methane = skewness(log.methane))
summary.valley.df
hist(valley.ndl.df$log.methane)
shapiro.test(valley.ndl.df$log.methane)
upland.ndl.df <- upland.df %>%
filter(dl == 0)
shapiro.test(upland.ndl.df$log.methane)
## combine the dataset
four.log.ndl.df <- rbind(valley.ndl.df,upland.ndl.df)
```
### t-test and non-parametric test for Q4
#### t-test
```{r t.test for Q4,echo = FALSE, warning= FALSE, message = FALSE}
fracking.df$location <- as.factor(fracking.df$location)
levels(fracking.df$location)
fracking.df$location <- fct_rev(fracking.df$location)
four.log.df$location <- fct_rev(four.log.df$location)
four.log.ndl.df$location <- fct_rev(four.log.ndl.df$location)
t.origin4 <- t.test(methane ~ location, fracking.df)
t.log4 <- t.test(log.methane ~ location, four.log.df)
t.log.ndl4 <- t.test(log.methane ~ location, four.log.ndl.df)
question4.t.p <- rbind(
tidy(t.origin4, conf.int=TRUE),
tidy(t.log4),
tidy(t.log.ndl4))
rowname4 <- c("Original Data", "Logged Data", "Logged without Detection Limit Data")
question4.t <- question4.t.p %>%
cbind(rowname4) %>%
subset(select = c(-method,-alternative))%>%
select(rowname4,everything())
question4.t
###undo the reverse
fracking.df$location <- fct_rev(fracking.df$location)
four.log.df$location <- fct_rev(four.log.df$location)
four.log.ndl.df$location <- fct_rev(four.log.ndl.df$location)
```
```{r good-looking table for Q4,echo = FALSE, warning= FALSE, message = FALSE}
question4.t %>%
gt() %>%
tab_options(
table.width = pct(90)) %>%
tab_header(
title = md("t-test Results for the Comparison between Valley and Upland Methane Concentration Level")) %>%
fmt_passthrough (columns = c(rowname4)) %>%
fmt_number(columns = c(estimate), decimals = 2) %>%
fmt_number(columns = c(estimate1), decimals=2) %>%
fmt_number(columns = c(estimate2), decimals = 2) %>%
fmt_number(columns = c(statistic), decimals = 2) %>%
fmt_number(columns = c(p.value), decimals = 15) %>%
fmt_number(columns = c(parameter), decimals = 2)%>%
fmt_number(columns = c(conf.low), decimals = 2)%>%
fmt_number(columns = c(conf.high), decimals = 2)%>%
cols_label(
rowname4 ="Data",
estimate = "Sample Mean Difference",
estimate1 = "Valley Sample Mean",
estimate2 = "Upland Sample Mean",
statistic = "t-stats",
p.value = "p-value",
parameter = "Degree of Freedom",
conf.low = "C.I. Lower",
conf.high = "C.I. Upper")
```
#### Wilcoxon test
```{r Wilcoxon/Mann Whitney UTest,echo = FALSE, warning= FALSE, message = FALSE}
loc.4 <- wilcox.test(methane~location, data = fracking.df)
loc.w <- tidy(loc.4)
loc.w
```
##### gt for w-test 4
```{r gt for w-test for location, echo = FALSE, warning= FALSE, message = FALSE}
loc.w <- loc.w %>%
mutate(data = c("Comparison of Location"))%>%
subset(select = c(-method, -alternative)) %>%
select(data,everything())
loc.w
loc.w %>%
gt() %>%
tab_header(
title = md("Non-Parametric Test Statistics of Methane Levels for Different Location in Pennsylvania Wells")) %>%
fmt_passthrough(columns = c(data)) %>%
fmt_number(columns = c(statistic), decimals = 2) %>%
fmt_number(columns = c(p.value), decimals = 16) %>%
cols_label(
data = "Data",
statistic = "W",
p.value = "P-Value")
```
# Lab Report
Reference:
Molofsky, L.J., Connor, J.A., Wylie, A.S., Wagner, T. and S.K. Farhat (2013).
Evaluation of Methane Sources in Groundwater in Northeastern Pennsylvania. Groundwater, 51(3): 333-349.
## 1. Introduction
The main question we address in this lab is whether methane levels in Pennsylvania water wells differ to a statistically significant degree with proximity to fracking operations and with topography. We used a data set gathered by Lisa Molofsky et al. (2013) of methane levels in 1,701 water wells, with additional variables for each measurement indicating proximity to fracking operations (near or far) and the topography of the well (upland or valley). The data set also specifies whether the monitoring equipment used to take each methane measurement had a detection limit.
## 2. Data Description
Of the 1,701 observations, 322 were near fracking sites and 1,379 were far from them. The mean methane concentration was 795.02 ug/L for wells near fracking sites and 684.26 ug/L for wells far from them. The median was 5.90 ug/L near fracking sites and 0.60 ug/L far from them; means this far above the medians indicate that the data are positively skewed. The same pattern of the mean exceeding the median holds when the observations are broken down into valley and upland topographies, and when valley and upland wells are compared across all proximities (see Table 1).
For all comparisons, the distribution is strongly right (positively) skewed. The skewness for wells near fracking sites across all topographies is 6.47, and for wells far from fracking sites it is 6.23. The Shapiro-Wilk normality test for all topographies also indicates that neither the original near observations (W = 0.19548, p < 0.001) nor the original far observations (W = 0.22984, p < 0.001) are normally distributed. The histograms of all cuts of the data show the same right skew visually (see Graphs 1-4).
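As a rough illustration of the two diagnostics used above, the sketch below computes a moment-based skewness and runs the Shapiro-Wilk test on a synthetic right-skewed sample, before and after a log transform. The data are simulated (they are not the well measurements), and `sample_skewness` is a helper written for this example, using the same moment formula as `moments::skewness`.

```r
# Synthetic right-skewed sample; NOT the Molofsky et al. measurements.
set.seed(1)
x <- rlnorm(500, meanlog = 0, sdlog = 2)

# Moment-based sample skewness: mean cubed deviation divided by the
# 3/2 power of the mean squared deviation (hypothetical helper).
sample_skewness <- function(v) {
  d <- v - mean(v)
  mean(d^3) / mean(d^2)^(3 / 2)
}

skew_raw <- sample_skewness(x)
skew_log <- sample_skewness(log(x))

# Shapiro-Wilk: a small p-value is evidence against normality.
sw_raw <- shapiro.test(x)
sw_log <- shapiro.test(log(x))

c(skew_raw = skew_raw, skew_log = skew_log,
  p_raw = sw_raw$p.value, p_log = sw_log$p.value)
```

Because the simulated sample is exactly lognormal, the log transform removes most of its skew; the well data behave differently, since the report found them still non-normal after logging.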
There are noticeable frequency peaks at 0.1 ug/L and 26 ug/L, which occur 168 and 209 times respectively in the full data set. Tracing back to the source data set, these are the values measured by instruments with detection limits. These detection-limit values may distort the analysis, but the direction of the distortion is unclear because the range of the data is large. Across all observations, 217 outliers were identified by the frequent rule, all of them greater than 63 ug/L.
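One way to spot such pile-ups is to tabulate the exact values and look at the most frequent ones. The sketch below does this on made-up data in which two hypothetical reporting limits (0.1 and 26) have been planted; the counts are invented for illustration and do not reproduce the 168 and 209 reported above.

```r
# Synthetic example; the values and counts are made up for illustration.
set.seed(2)
methane <- round(rlnorm(300, meanlog = 1, sdlog = 2), 2)

# Hypothetical reporting limits: some readings are recorded AT the limit.
methane <- c(methane, rep(0.1, 40), rep(26, 50))

# Tabulate exact values and list the most frequent ones.
value_counts <- sort(table(methane), decreasing = TRUE)
head(value_counts, 3)

# Share of observations sitting on the two most common values.
top_two_share <- sum(value_counts[1:2]) / length(methane)
top_two_share
```

A value that appears far more often than its neighbours in a continuous measurement is a hint that it is a limit rather than a true reading.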
## 3. Statistical Analysis and Discussion
In all cases, our null hypothesis is that the mean of near observations minus the mean of far observations is less than or equal to zero. Our alternative hypothesis is that the mean of near observations minus the mean of far observations is greater than zero.
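In symbols, writing $\mu_{\text{near}}$ and $\mu_{\text{far}}$ for the population mean methane levels near and far from fracking sites (notation introduced here for convenience), the one-sided hypotheses are:

$$
H_0:\ \mu_{\text{near}} - \mu_{\text{far}} \le 0
\qquad\text{versus}\qquad
H_A:\ \mu_{\text{near}} - \mu_{\text{far}} > 0
$$

For the tests on logged data, the same statement applies to the means of log methane.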
The t-test on the original near vs. far observations for all topographies does not provide sufficient evidence to reject the null hypothesis (t = 0.46, df = 413.32, p = 0.32). For the log-transformed observations across all topographies, however, we estimate the median methane level in wells near fracking to be about 249% of the median level in wells far from fracking, and the t-test provides strong statistical evidence to reject the null hypothesis (t = 4.97, df = 520.92, p < 0.001). The Wilcoxon Rank Sum test also suggests that the population distributions of methane levels in wells near and far from fracking sites across all topographies are different (W = 272,669, p < 0.001).
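The 249% figure can be traced as follows, taking 1.60013 and 0.689035 from the `log_transformed_interpretation` chunk to be the near and far sample means of log methane (an assumption on our part; the chunk does not label them). Exponentiating a difference of log-scale means gives a ratio on the original scale, which is read as a ratio of medians when the logged data are roughly symmetric:

$$
\frac{\widehat{\text{median}}_{\text{near}}}{\widehat{\text{median}}_{\text{far}}}
\approx \exp\left(\overline{\log x}_{\text{near}} - \overline{\log x}_{\text{far}}\right)
= \exp(1.60013 - 0.689035)
= \exp(0.911095)
\approx 2.49
$$

A ratio of about 2.49 corresponds to the near median being about 249% of the far median.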
For valley observations, the t-test on the original data gives t = -0.10 with 267.29 degrees of freedom and a p-value of 0.54, so we fail to reject the null hypothesis. On the log scale, the median methane level in valley wells near fracking sites is roughly 162% of the median level in valley wells far from fracking. The t-test on the logged data gives t = 1.87 with 377.96 degrees of freedom and a p-value of 0.03, so we can reject the null hypothesis at the 5% level. The Wilcoxon test indicates that the distributions for valley wells near and far from fracking sites are different (W = 55,564.5, p = 0.001).
For upland observations, the t-test on the original data did not find evidence that the mean methane level in wells near fracking sites is larger than the mean level far from them (t = -0.78, df = 385.64, p = 0.78). Nonetheless, the logged data indicate that methane concentrations near fracking sites are higher than those far from them in the upland area (t = 4.42, df = 168.44, p < 0.001). We estimate the median methane concentration near fracking sites to be about 390% of the median far from fracking sites in the upland area. The Wilcoxon test also strongly indicates a difference between the two distributions (W = 32,805, p < 0.001).
In all instances, the data are not normally distributed, but the sample sizes are large. The t-test may therefore not be fully reliable, although the large sample sizes partly compensate for the violated distributional assumption. After the log transformation the data remained right skewed, as shown in the histograms in the appendix (see Graphs 1-4). Given these conditions, the non-parametric tests are the more reliable ones in this analysis.
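To make the comparison of the three tests concrete, here is a small sketch on simulated data. It assumes two lognormal groups with arbitrary sizes and parameters (none of them taken from the lab data) and runs a Welch t-test on the raw scale, a Welch t-test on the log scale, and a Wilcoxon rank-sum test.

```r
# Synthetic data only: two lognormal groups whose medians differ by a
# factor of exp(1). Group sizes are arbitrary, not the lab's.
set.seed(3)
sim <- data.frame(
  group = factor(rep(c("Near", "Far"), times = c(200, 800)),
                 levels = c("Near", "Far")),
  methane = c(rlnorm(200, meanlog = 2, sdlog = 2.5),
              rlnorm(800, meanlog = 1, sdlog = 2.5))
)

# With "Near" as the first level, alternative = "greater" tests Near > Far.
t_raw <- t.test(methane ~ group, data = sim, alternative = "greater")
t_log <- t.test(log(methane) ~ group, data = sim, alternative = "greater")
w_raw <- wilcox.test(methane ~ group, data = sim, alternative = "greater")

# The rank test is unchanged by the log transform, because log() is monotone.
w_log <- wilcox.test(log(methane) ~ group, data = sim, alternative = "greater")

c(t_raw = t_raw$p.value, t_log = t_log$p.value,
  wilcox = w_raw$p.value, wilcox_log = w_log$p.value)
```

The rank-based test depends only on the ordering of the observations, which is one reason it is less sensitive to heavy right tails than a t-test on the raw scale.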
## 4. Conclusion
Based on the data and our results, we can generalize from our sample to the population of groundwater wells in Susquehanna County, PA, that have similar topographies and nearby fracking activity. The data are strongly right skewed, even after log transformation, which makes our parametric t-tests less reliable. Our analysis indicates that Molofsky et al. overstated the confidence in the lack of correlation between fracking activity and methane levels in groundwater wells. Based on our Wilcoxon Rank Sum tests, we can say with a high degree of confidence that the distributions of the near and far samples are different, but given the impact of the poor data quality on our ability to perform parametric t-tests, we cannot confidently state the direction of this difference.
## 5. Appendix
### Tables
**Table 1**

(All)
**Table 2**

(Simon)
**Table 3**

(Simon)
**Table 4**

(Jon)
**Table 5**

(Jon)
**Table 6**

(Chia Shen)
**Table 7**

(Chia Shen)
**Table 8**

(Chia Shen)
**Table 9**

(Chia Shen)
### Graphs
**Graph 1**

(Simon)
**Graph 2**

(Jon)
**Graph 3**

(Chia Shen)
**Graph 4**

(Chia Shen)