---
title: "R Tutorial 6: Data Wrangling, the t-test and the Log Transformation"
author: "Elizabeth A. Albright, PhD"
output: rtf_document
---

# Tutorial 6: Data Wrangling, the t-test and the Log Transformation

Hi All! Welcome to Tutorial 6! **You do NOT need to turn in anything as a part of this tutorial**, but it is critical that you understand this tutorial to be successful in your Group Lab that is due on October 2nd.

In this tutorial, we will work on what we call data wrangling (managing data). This is often a critical step before any analysis can occur.

You will need to install a couple of new packages. Please do so in the console if you have not already.

install.packages("tidyverse") # a package of packages used to manipulate data

install.packages("gt") # we will try to make fancier tables today with the gt package

After you install these packages, please run the library chunk to make sure all of these packages are loaded. Don't load the package papeR today. You can read about tidyverse here: https://www.tidyverse.org/packages/. ggplot2 is part of tidyverse.

Notice below that you can hide certain messages in the knit markdown output by adding the argument `include=F` to the chunk header (F is short for FALSE). Similarly, you can suppress warnings with `warning=F`, or hide just the code (while keeping its output) with `echo=F`. Adjusting these chunk options will become more important when you are delivering documents to bosses, supervisors or professional audiences.

```{r libraries, include=F}
library(moments)   # package for skewness
library(knitr)     # package for making tables (kable)
library(tidyverse) # multiple packages for data wrangling
library(gt)        # a package to make tables
library(lubridate) # a package to manipulate dates
```

Your libraries should be set now. And we should remove our objects (to start fresh).

```{r remove.object}
rm(list=ls()) ## This code removes all the objects from your Global Environment.
```

Please download from Sakai the air quality data that I downloaded from USEPA. The data can be found in the lab folder under Resources; the file is airquality.csv. Save airquality.csv and this .Rmd in the same folder, and be sure you have set your working directory appropriately.

```{r read.airquality}
airquality.df <- read_csv("airquality.csv") # the csv should be in your working directory
```

We have used the str() function to look at the structure of data frames and head() to look at the first few rows. Now let's use glimpse(). What information does the glimpse() function give us?

```{r glimpse}
glimpse(airquality.df)
```

The data should have 392,669 rows (observations) and 14 columns (vectors within our data frame). Our variable of interest is ozone, which was measured over an 8-hour period (Sample.Duration) and is reported in parts per million (Units.of.Measure). We can see State and County Codes (these are what we call FIPS codes) as well as the State.Name.

You should see from the glimpse() function that we have a date in our dataframe (`Date Local`). R has set the data type as a character string, but we want R to read it as a date. So let's do that. We can use the lubridate package that we played with in an earlier tutorial. The function mdy() recognizes the date as month/day/year format. We will make a new variable named `date` below using the mutate() function (in dplyr) and the mdy() function (in lubridate).

```{r date}
airquality.df <- airquality.df %>%
  mutate(date=mdy(`Date Local`)) %>%
  glimpse()
```

**Important** Is the new variable `date` in date format year-month-day (such as 2019-12-31)? If yes, you are all set! If not, in the chunk above you may need to change the function mdy() to dmy() or some other lubridate function, depending on the format of the `Date Local` variable.
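If you're not sure which lubridate function matches your date strings, here is a quick sketch (the date strings below are made up for illustration): each function parses a different ordering, but all three return the same Date.

```r
library(lubridate)

# The same calendar date written in three different orderings:
mdy("12/31/2019") # month/day/year -> 2019-12-31
dmy("31/12/2019") # day/month/year -> 2019-12-31
ymd("2019-12-31") # year/month/day -> 2019-12-31
```

Pick the function whose ordering matches how `Date Local` is written in your file.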
### Wrangling data with dplyr

The package dplyr uses a handful of functions to manipulate, or wrangle, data: select, filter, arrange, mutate, group_by and summarize.

***select()*** used to select the columns in a dataframe that you want to use. You can also use select() to reorder your columns by naming the variables you want to have first or last and then selecting the rest by using a range within select with the colon. Example: `select(var1, var2, var3:var15)`

***filter()*** allows us to select rows of data that we are interested in, based on a condition.

***arrange()*** we use this to arrange the rows (such as by sorting them from the largest to smallest value in one column).

***mutate()*** makes new columns in our data.

***group_by()*** allows us to group rows together by some criteria.

***summarize()*** summarizes data, often once we have grouped or filtered rows.

***count()*** counts all observations in each group: https://dplyr.tidyverse.org/reference/count.html

We can use these functions together to manipulate data, summarize data and do analysis. For example, group_by() groups rows together (e.g., observations from the same state) that we can then summarize.

### The Pipe Operator %>% ###

Core to the dplyr package is the pipe operator, %>%. This operator takes the data before the operator and shoots it to the following line, where we can manipulate it in some way. Please read about the pipe here: https://style.tidyverse.org/pipes.html. Some find the pipe operator annoying to type, but you can use the keyboard shortcut Command + Shift + M on Mac or Control + Shift + M on Windows.

To see if all the air quality data are from the same year, you could create a "year" table with the mutate(), year(), group_by() and count() functions. With the mutate function, I am creating a new variable called year using the year() function. I then want to group the observations by the year, and count the observations in each year.
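As a warm-up, here is the same pipe pattern on a tiny made-up data frame (toy.df and its columns are hypothetical, not part of the EPA file), chaining filter(), mutate() and arrange():

```r
library(dplyr)

# A tiny hypothetical data frame, just to show the verbs chained with %>%
toy.df <- tibble(state = c("NC", "NC", "AZ"),
                 ozone = c(0.031, 0.052, 0.044))

toy.df %>%
  filter(state == "NC") %>%             # keep only the NC rows
  mutate(ozone.ppb = ozone * 1000) %>%  # new column: convert ppm to ppb
  arrange(desc(ozone.ppb))              # sort rows, largest value first
```

Each line receives the result of the line above it, which is why reading a pipe top to bottom feels like reading a recipe.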
We now have made a new table called year.tbl.

```{r year.tbl}
year.tbl <- airquality.df %>%
  mutate(year=year(date)) %>%
  group_by(year) %>%
  count()
year.tbl
```

In running the chunk above, you should see that all 392,669 observations are in year 2019. If it did not work, please reach out to one of the TAs (or Betsy) in lab and we will help get things sorted.

Now we will develop a new data frame called indiana.df. To do so, we shoot the airquality.df through the pipe to be filtered based on `State Name`=="Indiana". When I set up a pipe, I like to start a new line after each pipe--it helps me see and understand the code. In the last row of the pipe, I run the glimpse() function on our new data frame so we can look at it.

```{r indiana}
indiana.df <- airquality.df %>%
  filter(`State Name`=="Indiana") %>%
  glimpse()
```

How many rows of data are in our new data frame? How many columns?

The pipe is a pretty handy tool to use. Let's make a different dataframe, this time including only observations made in North Carolina on January 1, 2019. We will name this data frame northcarolina.df.

```{r northcarolina}
northcarolina.df <- airquality.df %>%
  filter(`State Name`=="North Carolina") %>%
  filter(date == "2019-01-01") %>%
  glimpse()
```

How many ozone observations were made in North Carolina on January 1, 2019?

**Now pick your own state and date and make a new data frame for that state (to practice).**

### dplyr's summarize function ###

Now let's use dplyr's summarize function to summarize ozone for all observations in North Carolina. I put dplyr:: in front of summarize because many packages have a function called summarize and I want to use the one from dplyr. So the logic of this pipe is: (0) we are making a new data frame of summary statistics called ozone.nc.summ; (1) we start with the airquality.df; (2) we filter airquality.df for all observations where `State Name` equals "North Carolina" (always use the double equals ==); (3) we then select the column ozone.
(4) Finally, we summarize with dplyr::summarize, where we set which summary stats we want to include by defining them.

```{r summary.nc}
ozone.nc.summ <- airquality.df %>%
  filter(`State Name`=="North Carolina") %>%
  select(ozone) %>%
  dplyr::summarize(
    length.ozone=length(ozone)-sum(is.na(ozone)), # number of non-missing observations
    mean.ozone=mean(ozone, na.rm=TRUE),           # na.rm=TRUE ignores any missing values
    median.ozone=median(ozone, na.rm=TRUE),
    sd.ozone=sd(ozone, na.rm=TRUE),
    skew.ozone=skewness(ozone, na.rm=TRUE)
  )
ozone.nc.summ
```

Great! How many ozone observations were made in 2019? What is the mean level of ozone? The median?

## Making Pretty Tables with the gt Package ##

Now let's see if we can get a little fancier with our tables. We will use the package gt. To read more about gt, please check out this website: https://blog.rstudio.com/2020/04/08/great-looking-tables-gt-0-2/ or this site: https://gt.rstudio.com

Check out the chunk below. We will start our pipe with the ozone.nc.summ data frame that we just made. We will pipe that to gt(), which creates a table object. From there we develop the structure of the table:

**tab_header** we can set the title here with title=md(""), where md is short for markdown (meaning we can use markdown formatting), and the subtitle with subtitle=md(""). We then pipe to cols_label(), where we structure our columns and label them.

**tab_options** sets the options for the table overall. I set the table width with pct(80), which means 80% of the width of the page. This looks fine if you knit to html, but it's not great when you knit to rtf (which you will need to do). There doesn't appear to be an easy solution to this (they are working on updating this very new package). **For now, I would make column width adjustments in Word once you knit to rtf. You will need to knit to rtf (not Word) for the table formatting to work.**

**fmt_number** with the fmt_number() function, we can format the columns as numbers and say how many decimal places to show in the table.
There are a lot of formats beyond numbers (e.g., dates, times, etc.), but we won't worry about those for now. As you can see, all of the statistics are formatted with fmt_number(). In the first fmt_number(columns = c(length.ozone), decimals = 0) call, we format the length.ozone column with zero decimals. The mean.ozone column gets four decimal places, and so on.

**cols_width** sets the width of the columns. I'm setting the width of the length.ozone column to 100px and the rest to 80px. Read more about widths here: https://gt.rstudio.com/reference/cols_width.html. I encourage you to play around with this.

**cols_label()** the cols_label() function allows us to label our columns. For example, I set the label of the length.ozone column to "Observations", the label of mean.ozone to "Mean", and so on.

```{r}
ozone.nc.summ
```

```{r nc.tbl}
ozone.nc.summ.tbl <- ozone.nc.summ %>%
  gt() %>%
  tab_options(
    table.width = pct(80)
  ) %>%
  cols_width(
    c(length.ozone) ~ px(100),
    everything() ~ px(80)) %>%
  tab_header(
    title = md("Summary Statistics of Ozone Levels in North Carolina"),
    subtitle = md("8-hr ozone measured in ppm")) %>%
  fmt_number(columns = c(length.ozone), decimals = 0) %>%
  fmt_number(columns = c(mean.ozone), decimals = 4) %>%
  fmt_number(columns = c(median.ozone), decimals = 4) %>%
  fmt_number(columns = c(sd.ozone), decimals = 4) %>%
  fmt_number(columns = c(skew.ozone), decimals = 3) %>%
  cols_label(
    length.ozone = "Observations",
    mean.ozone = "Mean",
    median.ozone = "Median",
    sd.ozone = "SD",
    skew.ozone = "Skewness")
ozone.nc.summ.tbl
```

## Summarizing data from two states! ##

Okay, we have now practiced wrangling data and made a summary statistics table; it's time to practice on two states. **Our question of interest is whether mean ozone levels in Arizona and California differed on July 1, 2019.** So we first want to make a data frame of Arizona and California ozone measurements on July 1, 2019.
We will call this new data frame ozone.az.ca.df. We are also making `State Name` a factor variable (instead of a character).

```{r dataframe}
ozone.az.ca.df <- airquality.df %>%
  filter(`State Name`=="Arizona"|`State Name`=="California") %>%
  filter(date == "2019-07-01") %>%
  mutate(`State Name`=as.factor(`State Name`)) %>%
  glimpse()
```

We can now calculate the summary statistics. It's always important to summarize the data that we are analyzing and to view them graphically.

```{r summarytable}
ozone.az.ca.summ <- ozone.az.ca.df %>%
  dplyr::group_by(`State Name`) %>%
  select(ozone) %>%
  dplyr::summarize(
    length.ozone=length(ozone)-sum(is.na(ozone)), # number of non-missing observations
    mean.ozone=mean(ozone, na.rm=TRUE),
    median.ozone=median(ozone, na.rm=TRUE),
    sd.ozone=sd(ozone, na.rm=TRUE),
    skew.ozone=skewness(ozone, na.rm=TRUE)
  ) %>%
  glimpse()
```

It is okay if you get a red message about grouping variables. Please ignore it.

We can make a few changes to our table format from before and rerun it. I added a column for `State Name`, with the format passthrough (fmt_passthrough). That just means that we are passing the variable through to the table without changes. Now let's run our table and see what happens!

```{r az.ca.tbl}
ozone.az.ca.summ.tbl <- ozone.az.ca.summ %>%
  gt() %>%
  tab_header(
    title = md("Summary Statistics of Ozone Levels in Arizona and California"),
    subtitle = "8-hr ozone measured in ppm") %>%
  fmt_passthrough(columns = c(`State Name`)) %>%
  fmt_number(columns = c(length.ozone), decimals = 0) %>%
  fmt_number(columns = c(mean.ozone), decimals = 4) %>%
  fmt_number(columns = c(median.ozone), decimals = 4) %>%
  fmt_number(columns = c(sd.ozone), decimals = 4) %>%
  fmt_number(columns = c(skew.ozone), decimals = 3) %>%
  cols_label(
    `State Name` = "State",
    length.ozone = "Observations",
    mean.ozone = "Mean",
    median.ozone = "Median",
    sd.ozone = "SD",
    skew.ozone = "Skewness")
ozone.az.ca.summ.tbl
```

Awesome! This looks good. The table has two rows, one each for Arizona and California, and the summary statistics are accurately reported.
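As an aside, two filter conditions joined with | (as in the Arizona/California filter above) can also be written with the %in% operator, which scales better when you have many states. A small sketch on a made-up one-column data frame (demo.df is hypothetical):

```r
library(dplyr)

# Hypothetical one-column data frame, just to demonstrate %in%
demo.df <- tibble(`State Name` = c("Arizona", "California", "Indiana"))

# Equivalent to filter(`State Name` == "Arizona" | `State Name` == "California")
demo.df %>%
  filter(`State Name` %in% c("Arizona", "California"))
```

Only the Arizona and California rows survive the filter; adding a third state is just one more string in the vector.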
### Histograms of the ozone data

Everyone should be an expert at making histograms at this point. Here are two histograms that I made using the facet_grid() function in ggplot. I am making a facet for each of the states using the ozone.az.ca.df data frame. `facet_grid(`State Name`~.)` tells R to stack the histograms on top of each other in a column, organized by `State Name`. Here is a bit more about facet_grid: https://ggplot2.tidyverse.org/reference/facet_grid.html.

```{r az.ca.hist}
p <- ggplot(data=ozone.az.ca.df, aes(x=ozone, fill=`State Name`))+
  geom_histogram(col="black")+
  scale_fill_manual(values=c("royalblue", "gray"))
p + facet_grid(`State Name` ~ .)+
  labs(x="Ozone concentration, 8-Hour Running Average, ppm",
       title="Ground-level ozone levels in Arizona (n=44) and California (n=171)",
       fill="State Name")
```

Now we have the summary statistics and histograms of the two distributions (ozone levels in Arizona and California). How would you describe the distributions? (You don't need to type anything, just think about it here.)

Now we want to knit the document so that only the final tables and figure appear in the knitted output. Before you knit, add `include=F` after the chunk name in the `{r}` header of any chunk whose code and output you don't want to see. In some chunks we want to hide just the code but keep the table or figure, so add `echo=F` to the nc.tbl, az.ca.tbl, and az.ca.hist chunks. Now knit to an rtf (rich text format) document.

**Bonus**: Feel free to look up the package kableExtra, which has some of the same functionality as gt(). I also encourage you to look at the gtsummary package (https://www.danieldsjoberg.com/gtsummary/).

**Congratulations! Once you have knitted to rtf, you have finished the tutorial!**
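For the bonus above: before exploring kableExtra, note that knitr's kable() (already loaded in the libraries chunk) makes simple tables on its own, and kableExtra builds its styling on top of a kable object. A minimal sketch with made-up summary values (the numbers below are hypothetical, for illustration only):

```r
library(knitr)

# Hypothetical summary values, for illustration only
demo.summ <- data.frame(state = c("Arizona", "California"),
                        mean.ozone = c(0.0412, 0.0389))

kable(demo.summ,
      digits = 4,
      col.names = c("State", "Mean ozone (ppm)"))
```

Run this in a chunk and knit; kableExtra functions such as styling helpers can then be piped onto the kable object.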