Chia Shen Tsai
# Group Lab

[TOC]

# Code

## Setup the environment

```{r libraries, include=FALSE, echo = FALSE, warning= FALSE, message = FALSE}
rm(list=ls())
options(scipen = 999)
library(moments)
library(knitr)
library(tidyverse)
library(gt)
library(broom)
getwd()
setwd("wherever it is for you") ## set your wd
fracking.df <- read.csv("waterquality.csv")
fracking.df$proximity <- factor(fracking.df$proximity, levels=c("Near", "Far")) ## define the proximity as factor
fracking.df$location <- factor(fracking.df$location, levels=c("Valley", "Upland")) ## define the location as factor
```

### Logging the data

```{r logging the data, echo = FALSE, warning= FALSE, message = FALSE}
fracking.df <- fracking.df %>% mutate(log_methane=log(methane))
```

### Filter

```{r filtering data, echo = FALSE, warning= FALSE, message = FALSE}
near_upland.df <- fracking.df %>% filter(proximity == "Near" & location == "Upland")
far_upland.df <- fracking.df %>% filter(proximity == "Far" & location == "Upland")
near_valley.df <- fracking.df %>% filter(proximity == "Near" & location == "Valley")
far_valley.df <- fracking.df %>% filter(proximity == "Far" & location == "Valley")
valley.df <- fracking.df %>% filter(location == "Valley")
upland.df <- fracking.df %>% filter(location == "Upland")
```

## Describing the data

### summary table

```{r summary table, echo = FALSE, warning= FALSE, message = FALSE}
## summary by proximity, all topographies
summary_methane.df <- fracking.df %>%
  dplyr::group_by(proximity) %>%
  select(methane) %>%
  dplyr::summarize(
    length.methane=length(methane),
    mean.methane=mean(methane),
    median.methane=median(methane),
    sd.methane=sd(methane),
    skew.methane=skewness(methane))

## summary by proximity, valley only
summary.valley.df <- valley.df %>%
  group_by(proximity) %>%
  dplyr::summarize(
    length.methane = length(methane),
    mean.methane = mean(methane),
    median.methane = median(methane),
    sd.methane = sd(methane),
    skew.methane = skewness(methane))

## summary by proximity, upland only
summary.upland.df <- upland.df %>%
  group_by(proximity) %>%
  dplyr::summarize(
    length.methane = length(methane),
    mean.methane = mean(methane),
    median.methane = median(methane),
    sd.methane = sd(methane),
    skew.methane = skewness(methane))

## summary by location, all proximities
summary.four.df <- fracking.df %>%
  group_by(location) %>%
  select(methane) %>%
  dplyr::summarize(
    length.methane=length(methane),
    mean.methane=mean(methane),
    median.methane=median(methane),
    sd.methane=sd(methane),
    skew.methane=skewness(methane))

## stack the four summaries into one table with location and proximity columns
proximity <- c("All", "All")
summary.four.df <- cbind(summary.four.df,proximity)
summary.four.df = summary.four.df[,c(1,7,2,3,4,5,6)]
location <- c("All","All","Valley","Valley", "Upland", "Upland")
summary.df <- rbind(summary_methane.df, summary.valley.df, summary.upland.df)
summary.df <- summary.df %>% cbind(location)
summary.df = summary.df[,c(7,1,2,3,4,5,6)]
summary.df <- rbind(summary.df, summary.four.df)

summary.df %>%
  gt() %>%
  tab_options(table.width = pct(80)) %>%
  cols_width(
    c(length.methane) ~ px(100),
    everything() ~ px(80)) %>%
  tab_header(
    title = md("Summary Statistics of Methane Levels around northeastern Pennsylvania Fracking Sites"),
    subtitle = "ug/L methane in well water") %>%
  fmt_passthrough(columns=c(proximity)) %>%
  fmt_passthrough(columns =c(location))%>%
  fmt_number(columns = c(length.methane), decimals = 0) %>%
  fmt_number(columns = c(mean.methane), decimals=2) %>%
  fmt_number(columns = c(median.methane), decimals = 2) %>%
  fmt_number(columns = c(sd.methane), decimals = 2) %>%
  fmt_number(columns = c(skew.methane), decimals = 2) %>%
  cols_label(
    location = "Location",
    proximity ="Proximity",
    length.methane = "Observations",
    mean.methane = "Mean",
    median.methane = "Median",
    sd.methane = "SD",
    skew.methane = "Skewness")
```

## Comparison between Proximity in the whole Area

```{r data.entry(),echo = FALSE, warning= FALSE, message = FALSE}
fracking.df <- read.csv("PAFracking.csv")
```

```{r subset,echo = FALSE, warning= FALSE, message = FALSE}
near_fracking.df <- fracking.df %>% filter(`proximity` == "Near")
far_fracking.df <- fracking.df %>% filter(`proximity` == "Far")
```

##### Histogram

```{r facet_hist,echo = FALSE, warning= FALSE, message = FALSE}
methane<-ggplot(fracking.df, aes(x=methane, fill=`proximity`))+
  geom_histogram(col="black", binwidth = .5)+
  scale_x_continuous(trans="log")+
  scale_fill_manual(values=c("royalblue", "gray"))
methane+facet_grid(`proximity`~.)+
  labs(x="Methane Concentration (ug/L)",
       title="Methane Concentration in PA Drinking Wells by Proximity to Fracking, All Topographies",
       fill="Proximity")
```

```{r tables,echo = FALSE, warning= FALSE, message = FALSE}
summary_methane.df <- fracking.df %>%
  dplyr::group_by(proximity) %>%
  select(methane) %>%
  dplyr::summarize(
    length_methane=length(methane),
    mean_methane=mean(methane),
    median_methane=median(methane),
    sd_methane=sd(methane),
    skew_methane=skewness(methane)
  )
summary_methane.df
```

```{r gt_table,echo = FALSE, warning= FALSE, message = FALSE}
summary_methane.df %>%
  gt() %>%
  tab_header(
    title = md("Summary Statistics of Methane Levels in Pennsylvania Wells")) %>%
  #this section sets what is in each column and formats the numbers in the columns.
  fmt_passthrough(columns=c(`proximity`)) %>%
  fmt_number(columns = c(length_methane), decimals = 0) %>%
  fmt_number(columns = c(mean_methane), decimals=4) %>%
  fmt_number(columns = c(median_methane), decimals = 4) %>%
  fmt_number(columns = c(sd_methane), decimals = 4) %>%
  fmt_number(columns = c(skew_methane), decimals = 3) %>%
  cols_label(
    `proximity`="Proximity",
    length_methane = "Observations",
    mean_methane = "Mean",
    median_methane = "Median",
    sd_methane = "SD",
    skew_methane = "Skewness"
  )
```

```{r normality test,echo = FALSE, warning= FALSE, message = FALSE}
shapiro.test(near_fracking.df$methane)
shapiro.test(far_fracking.df$methane)
shapiro.test(near_fracking.df$log_methane)
shapiro.test(far_fracking.df$log_methane)
```

```{r tests,echo = FALSE, warning= FALSE, message = FALSE}
unlogged_methane<-t.test(methane ~ `proximity`, fracking.df, alternative="greater")
logged_methane<-t.test(log_methane ~ `proximity`, fracking.df, alternative="greater")
non_parametric<-wilcox.test(methane ~ `proximity`, data=fracking.df)
unlogged_methane
logged_methane
non_parametric

results.df<-rbind(tidy(unlogged_methane,conf.int=TRUE),tidy(logged_methane,conf.int=TRUE))
rowname_results <- c("Original Data", "Logged Data")
results.df <- results.df %>%
  cbind(rowname_results) %>%
  subset(select = c(-conf.high,-method,-alternative))%>%
  select(rowname_results,everything())
results.df

non_para_results.df<- rbind(tidy(non_parametric))
non_para_rowname_results <- c("Original Data")
non_para_results.df <- non_para_results.df %>%
  cbind(non_para_rowname_results) %>%
  subset(select = c(-method, -alternative)) %>%
  select(non_para_rowname_results,everything())
non_para_results.df
```

```{r para_test_table,echo = FALSE, warning= FALSE, message = FALSE}
results.df %>%
  gt() %>%
  tab_header(
    title = md("Parametric Test Statistics of Methane Levels in Pennsylvania Wells, All Topographies")) %>%
  #this section sets what is in each column and formats the numbers in the columns.
  fmt_passthrough(columns = c(rowname_results)) %>%
  fmt_number(columns=c(estimate), decimals = 2) %>%
  fmt_number(columns = c(estimate1), decimals = 2) %>%
  fmt_number(columns = c(estimate2), decimals=2) %>%
  fmt_number(columns = c(statistic), decimals = 2) %>%
  fmt_number(columns = c(p.value), decimals = 2) %>%
  fmt_number(columns = c(parameter), decimals = 2) %>%
  fmt_number(columns = c(conf.low), decimals = 2) %>%
  cols_label(
    rowname_results = "Data",
    estimate = "Difference in Means (ug/L)",
    estimate1 = "Near Mean (ug/L)",
    estimate2 = "Far Mean (ug/L)",
    statistic = "T-Stat",
    p.value = "P-Value",
    parameter = "Degrees Freedom",
    conf.low = "Confidence Interval Low")
```

```{r non_para_test_table,echo = FALSE, warning= FALSE, message = FALSE}
non_para_results.df %>%
  gt() %>%
  tab_header(
    title = md("Non-Parametric Test Statistics of Methane Levels in Pennsylvania Wells")) %>%
  fmt_passthrough(columns = c(non_para_rowname_results)) %>%
  fmt_number(columns = c(statistic), decimals = 3) %>%
  fmt_number(columns = c(p.value), decimals = 3) %>%
  cols_label(
    non_para_rowname_results = "Data",
    statistic = "W",
    p.value = "P-Value")
```

```{r log_transformed_interpretation,echo = FALSE, warning= FALSE, message = FALSE}
exp(1.60013-0.689035)
```

## Comparison between Proximity in Valley

```{r libraries for Q2, include=FALSE,echo = FALSE, warning= FALSE, message = FALSE}
rm(list=ls())
options(scipen = 999)
library(moments)
library(knitr)
library(tidyverse)
library(gt)
library(broom)
getwd()
fracking.df<-read.csv("waterquality.csv")
```

```{r filtering data for Q2,echo = FALSE, warning= FALSE, message = FALSE}
near_valley.df <- fracking.df %>% filter(proximity == "Near" & location == "Valley")
far_valley.df <- fracking.df %>% filter(proximity == "Far" & location == "Valley")
valley.df <- fracking.df %>% filter(location == "Valley")
```

### Q2 Summary

```{r summary for Q2,echo = FALSE, warning= FALSE, message = FALSE}
hist(near_valley.df$methane, breaks = 8)
table(near_valley.df$methane)
summary.valley.df <- valley.df %>%
  group_by(proximity) %>%
  dplyr::summarize(
    length.methane = length(methane),
    mean.methane = mean(methane),
    median.methane = median(methane),
    sd.methane = sd(methane),
    skew.methane = skewness(methane))
summary.valley.df
```

### Histogram

```{r data observation for Q2,echo = FALSE, warning= FALSE, message = FALSE}
p<-ggplot(valley.df, aes(x=methane, fill=proximity))+
  geom_histogram(col="black")+
  scale_x_continuous(trans="log")+
  scale_fill_manual(values=c("royalblue", "gray"))
p + facet_grid(proximity ~ .)+
  labs(x="Methane Levels around northeastern Pennsylvania Fracking Sites, ug/L",
       title="Well methane levels at near(n=127) and far from(n=709) fracking Sites in Valley",
       fill="Proximity")
```

### GT

```{r good-looking table for data observation for Q2,echo = FALSE, warning= FALSE, message = FALSE}
summary.valley.df %>%
  gt() %>%
  tab_options(table.width = pct(80)) %>%
  cols_width(
    c(length.methane) ~ px(100),
    everything() ~ px(80)) %>%
  tab_header(
    title = md("Summary Statistics of Methane Levels around northeastern Pennsylvania Fracking Sites"),
    subtitle = "ug/L methane in well water") %>%
  fmt_passthrough(columns=c(`proximity`)) %>%
  fmt_number(columns = c(length.methane), decimals = 0) %>%
  fmt_number(columns = c(mean.methane), decimals=2) %>%
  fmt_number(columns = c(median.methane), decimals = 2) %>%
  fmt_number(columns = c(sd.methane), decimals = 2) %>%
  fmt_number(columns = c(skew.methane), decimals = 2) %>%
  cols_label(
    proximity ="Proximity",
    length.methane = "Observations",
    mean.methane = "Mean",
    median.methane = "Median",
    sd.methane = "SD",
    skew.methane = "Skewness"
  )
```

### Q2 Normality Test

```{r normality test for Q2,echo = FALSE, warning= FALSE, message = FALSE}
shapiro.test(near_valley.df$methane)
shapiro.test(far_valley.df$methane)
```

The p-values are small, so we reject the null hypothesis that the data are normally distributed.

```{r logarithm for Q2,echo = FALSE, warning= FALSE, message = FALSE}
near_valley.df$log.methane <- log(near_valley.df$methane)
far_valley.df$log.methane <- log(far_valley.df$methane)
valley.df$log.methane <- log(valley.df$methane)
```

```{r shapiro test for logged data for Q2,echo = FALSE, warning= FALSE, message = FALSE}
shapiro.test(near_valley.df$log.methane)
shapiro.test(far_valley.df$log.methane)
```

The p-values are still small, so the logged data are not normally distributed either and we again reject the null.

### t-test for Q2

```{r t.test for Q2,echo = FALSE, warning= FALSE, message = FALSE}
valley.df$proximity <- as.factor(valley.df$proximity)
levels(valley.df$proximity)
## reverse the levels once so that "Near" comes first (t.test then estimates Near minus Far)
valley.df$proximity <- fct_rev(valley.df$proximity)

t.origin2 <- t.test(methane ~ proximity, valley.df, alternative = "greater")
t.log2 <- t.test(log.methane ~ proximity, valley.df, alternative = "greater")

question2.t.p <- rbind(
  tidy(t.origin2, conf.int=TRUE),
  tidy(t.log2))
rowname2 <- c("Original Data", "Logged Data")
question2.t <- question2.t.p %>%
  cbind(rowname2) %>%
  subset(select = c(-conf.high,-method,-alternative))%>%
  select(rowname2,everything())
question2.t

## Undo the reverse
valley.df$proximity <- fct_rev(valley.df$proximity)
```

```{r good-looking t.test table for Q2,echo = FALSE, warning= FALSE, message = FALSE}
question2.t %>%
  gt() %>%
  tab_options(table.width = pct(80)) %>%
  tab_header(
    title = md("t-test Results for Comparison between *Near* and *Far* Fracking Sites in valley Area")) %>%
  fmt_passthrough(columns=c(rowname2)) %>%
  fmt_number(columns = c(estimate), decimals = 2) %>%
  fmt_number(columns = c(estimate1), decimals=2) %>%
  fmt_number(columns = c(estimate2), decimals = 2) %>%
  fmt_number(columns = c(statistic), decimals = 2) %>%
  fmt_number(columns = c(p.value)) %>%
  fmt_number(columns = c(parameter), decimals = 2)%>%
  fmt_number(columns = c(conf.low), decimals = 2)%>%
  cols_label(
    rowname2 ="Data",
    estimate = "Sample Mean Difference",
    estimate1 = "Near (valley) Sample Mean",
    estimate2 = "Far (valley) Sample Mean",
    statistic = "t-stats",
    p.value = "p-value",
    parameter = "Degree of Freedom",
    conf.low = "C.I. Lower")
```

### Wilcoxon test

```{r Wilcoxon/Mann Whitney U Test for Q2,echo = FALSE, warning= FALSE, message = FALSE}
wilcox.methane.2 <- wilcox.test(methane~proximity, data=valley.df)
valley.w <- tidy(wilcox.methane.2)
valley.w
```

## Comparison between Proximity in Upland

### Summary for Q3

```{r summary for Q3,echo = FALSE, warning= FALSE, message = FALSE}
hist(near_upland.df$methane, breaks = 100)
table(near_upland.df$methane)
summary.upland.df <- upland.df %>%
  group_by(proximity) %>%
  dplyr::summarize(
    length.methane = length(methane),
    mean.methane = mean(methane),
    median.methane = median(methane),
    sd.methane = sd(methane),
    skew.methane = skewness(methane))
summary.upland.df
```

#### Histogram

```{r data observation for Q3,echo = FALSE, warning= FALSE, message = FALSE}
p<-ggplot(upland.df, aes(x=methane, fill=proximity))+
  geom_histogram(col="black")+
  scale_fill_manual(values=c("royalblue", "gray"))
p + facet_grid(proximity ~ .)+
  labs(x="Methane Levels around northeastern Pennsylvania Fracking Sites, ug/L",
       title="Well methane levels at near(n=127) and far from(n=709) fracking Sites in Upland",
       fill="Proximity")
```

##### logged histogram

```{r data observation with log scale for Q3,echo = FALSE, warning= FALSE, message = FALSE}
## label_number() comes from the scales package, which tidyverse does not attach
p<-ggplot(upland.df, aes(x=methane, fill=proximity))+
  geom_histogram(col="black")+
  scale_x_continuous(trans = "log", labels = scales::label_number(accuracy = 0.01))+
  scale_fill_manual(values=c("royalblue", "gray"))
p + facet_grid(proximity ~ .)+
  labs(x="Methane Levels around northeastern Pennsylvania Fracking Sites, ug/L(logged)",
       title="Well methane levels at near(n=127) and far from(n=709) fracking Sites in Upland",
       fill="Proximity")
```

#### gt

```{r good-looking table for data observation for Q3,echo = FALSE, warning= FALSE, message = FALSE}
summary.upland.df %>%
  gt() %>%
  tab_options(table.width = pct(80)) %>%
  cols_width(
    c(length.methane) ~ px(100),
    everything() ~ px(80)) %>%
  tab_header(
    title = md("Summary Statistics of Methane Levels around northeastern Pennsylvania Fracking Sites"),
    subtitle = "ug/L methane in well water") %>%
  fmt_passthrough(columns=c(`proximity`)) %>%
  fmt_number(columns = c(length.methane), decimals = 0) %>%
  fmt_number(columns = c(mean.methane), decimals=2) %>%
  fmt_number(columns = c(median.methane), decimals = 2) %>%
  fmt_number(columns = c(sd.methane), decimals = 2) %>%
  fmt_number(columns = c(skew.methane), decimals = 2) %>%
  cols_label(
    proximity ="Proximity",
    length.methane = "Observations",
    mean.methane = "Mean",
    median.methane = "Median",
    sd.methane = "SD",
    skew.methane = "Skewness"
  )
```

### Analysis for Q3

#### Normalize Attempt

```{r normality test for Q3,echo = FALSE, warning= FALSE, message = FALSE}
shapiro.test(near_upland.df$methane)
shapiro.test(far_upland.df$methane)
```

Both samples fail the Shapiro-Wilk test, so we reject the null hypothesis (that the samples were drawn from normal distributions).

```{r logarithm for Q3,echo = FALSE, warning= FALSE, message = FALSE}
near_upland.df$log.methane <- log(near_upland.df$methane)
far_upland.df$log.methane <- log(far_upland.df$methane)
upland.df$log.methane <- log(upland.df$methane)
```

```{r shapiro test for logged data for Q3,echo = FALSE, warning= FALSE, message = FALSE}
shapiro.test(near_upland.df$log.methane)
shapiro.test(far_upland.df$log.methane)
```

The p-value is still very small. Let's try another way: remove the observations flagged with a detection limit.

#### normalize attempt 2: remove detection limit data (neglect this part)

```{r remove data with dl for Q3,echo = FALSE, warning= FALSE, message = FALSE}
far_upland.ndl.df <- far_upland.df%>% filter(dl == 0)
near_upland.ndl.df <- near_upland.df %>% filter(dl == 0)
upland.ndl.df <- upland.df %>% filter(dl == 0)
summary.upland.log.ndl.df <- upland.ndl.df %>%
  group_by(proximity) %>%
  dplyr::summarize(
    length.methane = length(log.methane),
    mean.methane = mean(log.methane),
    median.methane = median(log.methane),
    sd.methane = sd(log.methane),
    skew.methane = skewness(log.methane))
summary.upland.log.ndl.df
table(upland.ndl.df$log.methane)
hist(upland.ndl.df$log.methane)
shapiro.test(far_upland.ndl.df$log.methane)
shapiro.test(near_upland.ndl.df$log.methane)
```

The histogram seems better, but the data are still not normally distributed. We could still run a t-test on them.

### t-test and non-parametric test for Q3

#### t-test

```{r t.test for Q3,echo = FALSE, warning= FALSE, message = FALSE}
upland.df$proximity <- as.factor(upland.df$proximity)
levels(upland.df$proximity)
upland.df$proximity <- fct_rev(upland.df$proximity)
upland.ndl.df$proximity <- fct_rev(upland.ndl.df$proximity)

t.origin3 <- t.test(methane ~ proximity, upland.df, alternative = "greater")
t.log3 <- t.test(log.methane ~ proximity, upland.df, alternative = "greater")
t.log.ndl3 <- t.test(log.methane ~ proximity, upland.ndl.df, alternative = "greater")

question3.t.p <- rbind(
  tidy(t.origin3, conf.int=TRUE),
  tidy(t.log3),
  tidy(t.log.ndl3))
rowname3 <- c("Original Data", "Logged Data", "Logged without Detection Limit Data")
question3.t <- question3.t.p %>%
  cbind(rowname3) %>%
  subset(select = c(-conf.high,-method,-alternative))%>%
  select(rowname3,everything())
question3.t

## Undo the reverse
upland.df$proximity <- fct_rev(upland.df$proximity)
upland.ndl.df$proximity <- fct_rev(upland.ndl.df$proximity)
```

##### gt for t-test

```{r good-looking t.test table for Q3,echo = FALSE, warning= FALSE, message = FALSE}
question3.t %>%
  gt() %>%
  tab_options(table.width = pct(80)) %>%
  tab_header(
    title = md("t-test Results for Comparison between *Near* and *Far* Fracking Sites in Upland Area")) %>%
  fmt_passthrough(columns=c(rowname3)) %>%
  fmt_number(columns = c(estimate), decimals = 2) %>%
  fmt_number(columns = c(estimate1), decimals=2) %>%
  fmt_number(columns = c(estimate2), decimals = 2) %>%
  fmt_number(columns = c(statistic), decimals = 2) %>%
  fmt_number(columns = c(p.value)) %>%
  fmt_number(columns = c(parameter), decimals = 2)%>%
  fmt_number(columns = c(conf.low), decimals = 2)%>%
  cols_label(
    rowname3 ="Data",
    estimate = "Sample Mean Difference",
    estimate1 = "Near (upland) Sample Mean",
    estimate2 = "Far (upland) Sample Mean",
    statistic = "t-stats",
    p.value = "p-value",
    parameter = "Degree of Freedom",
    conf.low = "C.I. Lower")
```

#### Wilcoxon test for Q3

```{r Wilcoxon/Mann Whitney U Test for Q3,echo = FALSE, warning= FALSE, message = FALSE}
wilcox.methane.3 <- wilcox.test(methane~proximity, data=upland.df)
upland.w <- tidy(wilcox.methane.3)
upland.w
```

```{r gt for w-test,echo = FALSE, warning= FALSE, message = FALSE}
upland.w <- upland.w %>%
  mutate(data = c("original data"))%>%
  subset(select = c(-method, -alternative)) %>%
  select(data,everything())
upland.w
upland.w %>%
  gt() %>%
  tab_header(
    title = md("Non-Parametric Test Statistics of Methane Levels for upland area in Pennsylvania Wells")) %>%
  #this section sets what is in each column and formats the numbers in the columns.
  fmt_passthrough(columns = c(data)) %>%
  fmt_number(columns = c(statistic), decimals = 3) %>%
  fmt_number(columns = c(p.value), decimals = 3) %>%
  cols_label(
    data = "Data",
    statistic = "W",
    p.value = "P-Value")
```

## Comparison between Location

### Summary for Q4

#### Histogram

```{r data observation for Q4,echo = FALSE, warning= FALSE, message = FALSE}
q<-ggplot(fracking.df, aes(x=methane, fill=location))+
  geom_histogram(col="black")+
  scale_fill_manual(values=c("royalblue", "gray"))
q + facet_grid(location ~ .)+
  labs(x="Methane Levels around northeastern Pennsylvania Fracking Sites, ug/L",
       title="Well methane levels at valley(n=865) and upland(n=836)",
       fill="Location")
```

##### logged data histogram

```{r logged data observation for Q4,echo = FALSE, warning= FALSE, message = FALSE}
q<-ggplot(fracking.df, aes(x=methane, fill=location))+
  geom_histogram(col="black")+
  scale_x_log10()+
  scale_fill_manual(values=c("royalblue", "gray"))
q + facet_grid(location ~ .)+
  labs(x="Methane Levels around northeastern Pennsylvania Fracking Sites, ug/L, logged",
       title="Well methane levels at valley(n=865) and upland(n=836)",
       fill="Location")
```

#### gt

```{r summary and gt for question 4,echo = FALSE, warning= FALSE, message = FALSE}
summary.four.df <- fracking.df %>%
  group_by(location) %>%
  select(methane) %>%
  dplyr::summarize(
    length.methane=length(methane),
    mean.methane=mean(methane),
    median.methane=median(methane),
    sd.methane=sd(methane),
    skew.methane=skewness(methane)
  )
summary.four.df
summary.four.df %>%
  gt() %>%
  tab_options(table.width = pct(80)) %>%
  cols_width(
    c(length.methane) ~ px(100),
    everything() ~ px(80)) %>%
  tab_header(
    title = md("Summary Statistics of Methane Levels around northeastern Pennsylvania Fracking Sites"),
    subtitle = "ug/L methane in well water") %>%
  fmt_passthrough(columns=c(location)) %>%
  fmt_number(columns = c(length.methane), decimals = 0) %>%
  fmt_number(columns = c(mean.methane), decimals=2) %>%
  fmt_number(columns = c(median.methane), decimals = 2) %>%
  fmt_number(columns = c(sd.methane), decimals = 2) %>%
  fmt_number(columns = c(skew.methane), decimals = 2) %>%
  cols_label(
    location ="Location",
    length.methane = "Observations",
    mean.methane = "Mean",
    median.methane = "Median",
    sd.methane = "SD",
    skew.methane = "Skewness"
  )
```

### Analysis for Q4

#### Normalize Attempt

```{r normality test for question 4,echo = FALSE, warning= FALSE, message = FALSE}
shapiro.test(valley.df$methane)
shapiro.test(upland.df$methane)
```

```{r normalize attempt,echo = FALSE, warning= FALSE, message = FALSE}
valley.df$log.methane <- log(valley.df$methane)
upland.df$log.methane <- log(upland.df$methane)
four.log.df <- rbind(valley.df,upland.df)
```

```{r shapiro test again for Q4,echo = FALSE, warning= FALSE, message = FALSE}
shapiro.test(valley.df$log.methane)
shapiro.test(upland.df$log.methane)
```

```{r remove data with dl for Q4,echo = FALSE, warning= FALSE, message = FALSE}
valley.ndl.df <- valley.df%>% filter(dl == 0)
summary.valley.df <- valley.ndl.df %>%
  group_by(proximity) %>%
  dplyr::summarize(
    length.methane = length(log.methane),
    mean.methane = mean(log.methane),
    median.methane = median(log.methane),
    sd.methane = sd(log.methane),
    skew.methane = skewness(log.methane))
summary.valley.df
hist(valley.ndl.df$log.methane)
shapiro.test(valley.ndl.df$log.methane)
upland.ndl.df <- upland.df %>% filter(dl == 0)
shapiro.test(upland.ndl.df$log.methane)
## combine the dataset
four.log.ndl.df <- rbind(valley.ndl.df,upland.ndl.df)
```

### t-test and non-parametric test for Q4

#### t-test

```{r t.test for Q4,echo = FALSE, warning= FALSE, message = FALSE}
fracking.df$location <- as.factor(fracking.df$location)
levels(fracking.df$location)
fracking.df$location <- fct_rev(fracking.df$location)
four.log.df$location <- fct_rev(four.log.df$location)
four.log.ndl.df$location <- fct_rev(four.log.ndl.df$location)

t.origin4 <- t.test(methane ~ location, fracking.df)
t.log4 <- t.test(log.methane ~ location, four.log.df)
t.log.ndl4 <- t.test(log.methane ~ location, four.log.ndl.df)

question4.t.p <- rbind(
  tidy(t.origin4, conf.int=TRUE),
  tidy(t.log4),
  tidy(t.log.ndl4))
rowname4 <- c("Original Data", "Logged Data", "Logged without Detection Limit Data")
question4.t <- question4.t.p %>%
  cbind(rowname4) %>%
  subset(select = c(-method,-alternative))%>%
  select(rowname4,everything())
question4.t

### undo the reverse
fracking.df$location <- fct_rev(fracking.df$location)
four.log.df$location <- fct_rev(four.log.df$location)
four.log.ndl.df$location <- fct_rev(four.log.ndl.df$location)
```

```{r good-looking table for Q4,echo = FALSE, warning= FALSE, message = FALSE}
question4.t %>%
  gt() %>%
  tab_options(table.width = pct(90)) %>%
  tab_header(
    title = md("t-test Results for the Comparison between Valley and Upland Methane Concentration Level")) %>%
  fmt_passthrough(columns = c(rowname4)) %>%
  fmt_number(columns = c(estimate), decimals = 2) %>%
  fmt_number(columns = c(estimate1), decimals=2) %>%
  fmt_number(columns = c(estimate2), decimals = 2) %>%
  fmt_number(columns = c(statistic), decimals = 2) %>%
  fmt_number(columns = c(p.value), decimals = 15) %>%
  fmt_number(columns = c(parameter), decimals = 2)%>%
  fmt_number(columns = c(conf.low), decimals = 2)%>%
  fmt_number(columns = c(conf.high), decimals = 2)%>%
  cols_label(
    rowname4 ="Data",
    estimate = "Sample Mean Difference",
    estimate1 = "Valley Sample Mean",
    estimate2 = "Upland Sample Mean",
    statistic = "t-stats",
    p.value = "p-value",
    parameter = "Degree of Freedom",
    conf.low = "C.I. Lower",
    conf.high = "C.I. Upper")
```

#### Wilcoxon test

```{r Wilcoxon/Mann Whitney UTest,echo = FALSE, warning= FALSE, message = FALSE}
loc.4 <- wilcox.test(methane~location, data = fracking.df)
loc.w <- tidy(loc.4)
loc.w
```

##### gt for w-test 4

```{r gt for w-test for location, echo = FALSE, warning= FALSE, message = FALSE}
loc.w <- loc.w %>%
  mutate(data = c("Comparison of Location"))%>%
  subset(select = c(-method, -alternative)) %>%
  select(data,everything())
loc.w
loc.w %>%
  gt() %>%
  tab_header(
    title = md("Non-Parametric Test Statistics of Methane Levels for Different Location in Pennsylvania Wells")) %>%
  fmt_passthrough(columns = c(data)) %>%
  fmt_number(columns = c(statistic), decimals = 2) %>%
  fmt_number(columns = c(p.value), decimals = 16) %>%
  cols_label(
    data = "Data",
    statistic = "W",
    p.value = "P-Value")
```

# Lab Report

Reference: Molofsky, L.J., Connor, J.A., Wylie, A.S., Wagner, T. and S.K. Farhat (2013). Evaluation of Methane Sources in Groundwater in Northeastern Pennsylvania. Groundwater, 51(3): 333-349.

## 1. Introduction

The main question we are interested in answering in this lab is whether there is a statistically significant difference in methane levels in water wells in Pennsylvania based on proximity to fracking operations and topography. We used a data set gathered by Lisa Molofsky et al. of methane levels in 1,701 water wells, with additional variables for each measurement indicating proximity to fracking operations (near or far) and topography of the water well (upland or valley). The data set also specifies whether the monitoring equipment used to take each methane measurement had a detection limit.

## 2. Data Description

Of the 1,701 total observations, 322 were near fracking sites, while 1,379 were far from fracking sites. The mean methane concentration was 795.02 ug/L for wells near fracking sites and 684.26 ug/L for wells far from fracking sites.
The median methane concentration is 5.90 ug/L near fracking sites and 0.60 ug/L far from fracking sites. Both medians are far below the corresponding means, indicating that the data are positively skewed. A similar pattern of means and medians holds when the observations are broken down into valley and upland topographies, as well as when comparing valley and upland wells across all proximities (see Table 1). For all comparisons, the distribution is strongly right (positively) skewed. The skewness for all topographies near fracking sites is 6.47, while the skewness for all topographies far from fracking sites is 6.23. Additionally, the Shapiro-Wilk normality test for all topographies indicates that the data are not normally distributed for both the original near observations (W = 0.19548, p-value = 0.000) and the original far observations (W = 0.22984, p-value = 0.000). The histograms of all cuts of the data show this same right skewness visually (see Graphs 1-4).

There are noticeable frequency peaks at 0.1 ug/L and 26 ug/L, which occur 168 and 209 times respectively in the whole data set. Tracing back to the source data set, these are the values measured by instruments with detection limits. These limited points may distort the analysis, but the direction of the distortion remains unclear, since the range of the data is large. Also, across all the observations, there are 217 outliers identified by the frequent rule, all of which are greater than 63 ug/L.

## 3. Statistical Analysis and Discussion

In all cases, our null hypothesis is that the mean of near observations minus the mean of far observations is less than or equal to zero. Our alternative hypothesis is that the mean of near observations minus the mean of far observations is greater than zero. The t-test results for the near vs. far original observations for all topographies do not provide sufficient evidence to reject the null hypothesis (t=0.46, df=413.32, p=0.32).
However, for the log-transformed observations across all topographies, we estimate the median methane level in wells near fracking to be about 249% of the median level in wells far from fracking. The t-test on the log-transformed data provides strong statistical evidence to reject the null hypothesis (t = 4.97, df = 520.92, p < 0.01). The Wilcoxon Rank Sum test also suggests that the population distributions of methane levels in wells near and far from fracking sites across all topographies are different (W = 272,669, p < 0.001).

For valley observations, the t-test on the original data gives a t-value of -0.10 with 267.29 degrees of freedom and a p-value of 0.54. Therefore, we fail to reject the null hypothesis. For the log-transformed valley observations, the median methane level in wells near fracking sites is roughly 162% of the median level in valley wells far from fracking. The t-test on the log-transformed data gives a t-value of 1.87 with 377.96 degrees of freedom and a p-value of 0.03, so we can reject the null hypothesis. The Wilcoxon test results show that the distributions of valley wells near fracking sites and valley wells far from fracking sites are different (W = 55,564.5, p = 0.001).

For upland observations, the t-test on the original data did not find evidence that the mean methane level in wells near fracking sites is larger than the mean level in wells far from them (t = -0.78, df = 385.64, p = 0.78). Nonetheless, the log-transformed data show that the population mean of the methane concentration far from fracking sites is smaller than the mean near them in upland areas (t = 4.42, df = 168.44, p < 0.01). Additionally, the estimated median methane concentration near fracking sites is about 390% of the median far from fracking sites in upland areas. The Wilcoxon test also strongly indicates a difference between the two distributions (W = 32,805, p < 0.001).
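
The "percent of the median" figures above can be read as back-transformed differences in log means. The sketch below illustrates one common way to get such an estimate, assuming the figures come from exponentiating the log-scale difference (for example, a ratio of 2.49 corresponds to a log-scale difference of about 0.91). The data are simulated and hypothetical, not the Molofsky et al. measurements, and the names `near`, `far`, `ratio_est` and `ratio_ci` are introduced only for illustration.

```r
# Hypothetical, simulated right-skewed samples (NOT the study data).
set.seed(42)
near <- rlnorm(300,  meanlog = 1.5, sdlog = 2)  # stand-in for wells near fracking
far  <- rlnorm(1300, meanlog = 0.5, sdlog = 2)  # stand-in for wells far from fracking

# One-sided Welch t-test on the log scale (H1: log-mean near > log-mean far).
# t.test() uses the Welch (unequal variance) form by default.
log_test <- t.test(log(near), log(far), alternative = "greater")

# Two-sided 95% interval for the difference in log means.
log_ci <- as.numeric(t.test(log(near), log(far))$conf.int)

# Back-transform: exp(difference in log means) estimates the ratio of medians
# when the log-scale distributions are roughly symmetric.
ratio_est <- exp(mean(log(near)) - mean(log(far)))
ratio_ci  <- exp(log_ci)

# Rank-based comparison on the original scale, which does not need normality.
rank_test <- wilcox.test(near, far)

cat(sprintf("Median near is about %.0f%% of median far (95%% CI: %.0f%% to %.0f%%)\n",
            100 * ratio_est, 100 * ratio_ci[1], 100 * ratio_ci[2]))
cat(sprintf("Welch t on logs: t = %.2f, df = %.2f\n",
            log_test$statistic, log_test$parameter))
cat(sprintf("Wilcoxon rank-sum: W = %.0f\n", rank_test$statistic))
```

Reporting the interval alongside the point estimate shows how precisely the ratio of medians is pinned down, which a percentage alone does not convey.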

In all instances, the data are not normally distributed, but the sample size is large. Thus, the t-test may not be fully reliable here, although the large sample size partly compensates for the violation of the normality condition. After log transformation, the data remained right skewed, as shown in the histograms in the appendix (see Graphs 1-4). Given these sample conditions, the non-parametric tests are more reliable in this analysis.

## 4. Conclusion

Based on the data and our results, we can generalize from our sample to the population of groundwater wells in Susquehanna County, PA, which have similar topographies and fracking activity nearby. The data are strongly right skewed, even after log transformation, which renders our parametric t-tests less accurate. Our analysis indicates that Molofsky et al. overstated the confidence in the lack of correlation between fracking activity and methane levels in groundwater wells. Based on our Wilcoxon Rank Sum tests, we can say with a high degree of confidence that the distributions of the near and far samples are different, but given the impact of the poor data quality on our ability to perform parametric t-tests, we are unable to confidently state the direction of this difference.

## 5. Appendix

### Tables

**Table 1**
![](https://i.imgur.com/W2VrO0O.png)
(All)

**Table 2**
![](https://i.imgur.com/6EK2I7N.png)
(Simon)

**Table 3**
![](https://i.imgur.com/rPT2udp.png)
(Simon)

**Table 4**
![](https://i.imgur.com/QK1iJIs.png)
(Jon)

**Table 5**
![](https://i.imgur.com/CUMxcyc.png)
(Jon)

**Table 6**
![](https://i.imgur.com/vWKS3iZ.png)
(Chia Shen)

**Table 7**
![](https://i.imgur.com/8iWEdX5.png)
(Chia Shen)

**Table 8**
![](https://i.imgur.com/DFMiG5y.png)
(Chia Shen)

**Table 9**
![](https://i.imgur.com/itmhQcm.png)
(Chia Shen)

### Graphs

**Graph 1**
![](https://i.imgur.com/i8eod6w.png)
(Simon)

**Graph 2**
![](https://i.imgur.com/vreoGyZ.png)
(Jon)

**Graph 3**
![](https://i.imgur.com/JU2NAOe.png)
(Chia Shen)

**Graph 4**
![](https://i.imgur.com/nS9pJhN.png)
(Chia Shen)
