OSU Software Carpentries: R for Scientific Reproducibility 2024-08-19

# [R for Reproducible Research, Workshop Page](https://imageomics.github.io/2024-08-16-osu-online/)

**Instructors:**

- Horacio Lopez-Nicora ([Dept. of Plant Pathology](https://plantpath.osu.edu/)), PhD
- Jelmer Poelstra ([Molecular and Cellular Imaging Center](https://mcic.osu.edu/)), PhD
- Jessica Cooperstone ([Dept. of Horticulture and Crop Sciences](https://hcs.osu.edu/)), PhD

---

## Schedule

| Time | Content | Instructor |
| -------- | -------- | -------- |
| 9:00-9:20 am | Workshop intro, [pre-workshop survey](https://carpentries.typeform.com/to/wi32rS?slug=2024-08-16-online2024-08-16-osu-online) | Horacio |
| 9:20-10:15 am | **Introduction to R and RStudio** | Horacio |
| 10:15-10:30 am | _Break_ | |
| 10:30-11:35 am | **R's data structures and data types** | Jelmer |
| 11:35 am-12:25 pm | _Lunch_ | |
| 12:25-1:15 pm | **Data frame manipulation with _dplyr_** | Jelmer |
| 1:15-1:30 pm | _Break_ | |
| 1:30-2:50 pm | **Visualization with _ggplot2_** | Jess |
| 2:50-3:00 pm | [Post-workshop survey](https://carpentries.typeform.com/to/UgVdRQ?slug=2024-08-16-online2024-08-16-osu-online) | Jess |

## Course intro and logistics

- Instructor introductions
    - Introduce yourself in the chat!
- This HackMD file has some workshop info and will have live transcription added to it during the workshop - it is easiest to view it using the "eye" icon in the top left
- Today's schedule
- [Code of Conduct](https://docs.carpentries.org/topic_folders/policies/code-of-conduct.html)
- How to ask questions (both online and in person)
- How to indicate you're ready or still working on Zoom - if everything is going well, use the green "yes"; if you need us to slow down, please use the red "no"
- [Pre-workshop survey](https://carpentries.typeform.com/to/wi32rS?slug=2024-08-16-online2024-08-16-osu-online) -- **only for those who weren't at Friday's workshop**

## Introduction to R and RStudio

Instructions on how to download what you will need for today are [here](https://imageomics.github.io/2024-08-16-osu-online/).

> Remember, you need both R and RStudio!

When you first open RStudio, you will be greeted by three panels:

- The interactive R console/Terminal (entire left, this is like regular R)
- Environment/History/Connections (in upper right)
- Files/Plots/Packages/Help/Viewer (in lower right)

Your panels will look different from Horacio's in that they will contain different (or no) information. But the structure will be the same.

### What is an R script?

An R script is a type of file (with the extension `.R`) where you can store code.

- You can open a new R script by going to `File` > `New File` > `R script`
- This introduces a new quadrant into RStudio in the top left corner, and the console is now in the bottom left
- By using a script, you can store your code, and then send your code to the console to be run. This is nice because your code is now saved for posterity.
- You can run your code with the "run button"
- Or by using `ctrl + enter` on Windows or `command + enter` on a Mac

#### Saving a script

You will want to save your script to have it for the future.
You can save with `File` > `Save as` and indicate where on your computer you want your file to live. If you use `File` > `Save`, your script will be saved in your "working directory" - more about this later.

### What is a working directory?

A location/path on your computer that R considers to be its "home". If you don't give R any additional information, it will assume files you want it to use are in this spot.

If you want to know where your working directory is, you can run the code:

```
getwd()
```

You can also navigate to your working directory by using the gear wheel in the bottom right quadrant and selecting `Go to working directory`.

> Note that your working directory will most definitely be different than Horacio's since he has a different file organizational structure than you do.

You can set your working directory by providing a complete path to the function

```
setwd()
```

You can also do this by navigating to `Session` > `Set Working Directory` > `Choose Directory`, navigating manually to where you want your working directory to be, and then selecting it.

If you like the idea of easy project management, join us for [Code Club](https://osu-codeclub.github.io/) where we will talk about projects :)

### In the console

The `>` sign in your console is the R “prompt”. It indicates that R is ready for you to type something.

When you are not seeing the `>` prompt, R is either busy (because you asked it to do a longer-running computation) or waiting for you to complete an incomplete command.

If you notice that your prompt turned into a `+`, R is waiting for you to finish an incomplete command. To get out of this situation, one option is to try and finish the command (in this case, by typing another number), but here, let’s practice another option: aborting the command by pressing `Esc`.

#### Adding comments to our script

Help future you by writing comments about what your code does. You can add comments on a line by starting that line with `#`.
Here is an example:

```
# this is how I set my working directory
setwd("my/path/here") # the path must be in quotes
```

#### Math in the console

```
# these two are the same
5/2
5 / 2
```

```
# these two are also the same
5+2
5 + 2
```

> Don't forget about your [PEMDAS](https://www.mathsisfun.com/operation-order-pemdas.html).

#### Functions

Functions do things in R. They are often followed by parentheses, e.g., `install.packages()` and `getwd()`, but not always (e.g., `+`, `*`).

For nearly all functions that come from a package, we need to load that package in every R session where we want to use them. We do this using the function `library()`.

> You install a package with `install.packages()` once, and every time you want to use the package, you call it up for use with `library()`.

If you haven't installed `tidyverse` and `gapminder` yet, you can do so using this code:

```
install.packages("tidyverse") # make sure tidyverse is in quotes
install.packages("gapminder") # make sure gapminder is in quotes
```

When you install `tidyverse` you will see in the console what exactly you have installed, and some warnings you should heed. We will go over these warnings more later today.

You can see if your packages have installed correctly by:

- Running `library(name-of-package)` successfully
- Going to the `Packages` tab in the bottom right quadrant and looking for your packages

There are lots of functions you can use for math stuff:

| **Function** | **Description** |
|----------------------------------------------|------------------------------------------------------|
| **abs(***x***)** | absolute value |
| **sqrt(***x***)** | square root |
| **ceiling(***x***)** | ceiling(3.475) is 4 |
| **floor(***x***)** | floor(3.475) is 3 |
| **trunc(***x***)** | trunc(5.99) is 5 |
| **round(***x* **, digits=** *n***)** | round(3.475, digits=2) is 3.48 |
| **signif(***x* **, digits=** *n***)** | signif(3.475, digits=2) is 3.5 |
| **cos(***x***), sin(***x***), tan(***x***)** | also asin(*x*), acos(*x*), cosh(*x*), acosh(*x*), etc. |
| **log(***x***)** | natural logarithm |
| **log10(***x***)** | common logarithm |
| **exp(***x***)** | e\^*x* |

#### Comparing stuff

| **Operator** | **Description** | **Example** |
|--------------|--------------------------|------------------------|
| \> | Greater than | 5 \> 6 returns FALSE |
| \< | Less than | 5 \< 6 returns TRUE |
| == | Equal to | 10 == 10 returns TRUE |
| != | Not equal to | 10 != 10 returns FALSE |
| \>= | Greater than or equal to | 5 \>= 6 returns FALSE |
| \<= | Less than or equal to | 6 \<= 6 returns TRUE |

#### Assigning stuff to a variable

The assignment operator is this: `<-`

> Go to `Tools` > `Keyboard Shortcuts Help` to learn more about keyboard shortcuts

If we run this code, we assign the value 250 to the object `length_cm`:

``` r
length_cm <- 250
```

When you run this code, you will see `length_cm` in your Environment (top right).

We can also save a conversion factor:

``` r
conversion <- 2.54
```

You can now save your length in inches (`length_in`) using the following code:

``` r
length_in <- length_cm / conversion
```

When you run the code above, nothing shows up in the console! This is actually what we would expect, since we assigned the result to `length_in` but did not print or view it in some way.

We can see what is contained in `length_in` by running its name as code.

``` r
length_in
```

The function `round()` allows you to round a number to a certain number of digits. The first argument to `round()` is the number to round, and the second argument is the number of digits to round to.

``` r
round(length_in, digits = 1)
```

If you assign a new value to an existing variable name, it will overwrite what you have in your environment.

Pay attention to your variable names to limit confusion with your future self (or collaborators).
Some common naming styles:

- length_cm (snake case, separated with underscores)
- length.cm (periods, separated with periods)
- LengthCm (camel case, separated with capital letters)

> Variables in R cannot start with numbers, and cannot contain spaces

Tip: make object names descriptive but don't go crazy

#### Managing my environment?

You can list all the items that are being stored in your environment with the code `ls()`.

If I want to remove the object `x2` from my environment, I can do that with the code `rm(x2)`, where `rm` signifies remove. Now if we run `ls()` again we can see that `x2` is no longer present.

### HELP ME!

You are going to need help in R at some point. You can get it by typing:

``` r
help("name-of-function-or-package")
```

For example: `help("library")` - remember to put the name in quotes.

A shortcut way to do that is using the `?` (which is itself a function).

``` r
# this works either with or without the parentheses
?round()
```

### Getting ready for the next sessions

Make sure you've installed the packages `tidyverse` and `gapminder`. You can do that with the code below:

``` r
# install tidyverse
install.packages("tidyverse")

# install gapminder
install.packages("gapminder")
```

## R's data structures and data types

Before lunch - fundamentals of the R language

After lunch - applied data manipulation and exploration

In this session, we will learn about R’s data structures and data types.

Data structures are the kinds of objects that R can store data in. Here, we will cover the two most common ones: *vectors* and *data frames*.

Data types are how R distinguishes between different kinds of data like numbers and character strings. Here, we’ll talk about the 4 main data types:

- character
- integer
- double
- logical

You can continue with the script from Horacio's section, or you can set yourself up a new one.
Jelmer will start by opening a new script (+ symbol in toolbar at the top => “R Script”, or “File” => “New file” => “R Script”), and will save it straight away as `data-structures.R` in a folder on his Desktop (create a new folder there or anywhere else for this workshop, if you haven’t done so already).

### Vectors

Vectors in R are lists of things. Each item is called an element.

We can start by assigning the number 8 to a new vector called `vector1` and then assigning the word "panda" to `vector2`:

``` r
vector1 <- 8
vector2 <- "panda" # make sure you put it in quotes
```

We can print each vector by typing out the name of the object, and running it.

``` r
vector1
vector2
```

In the console, the 1 in the square brackets `[1]` is the index (position) of the first element printed on that line. Here, we have only one element in each of our vectors.

This is not going to work since we do not have an object named `panda` in our environment.

``` r
vector_fail <- panda
```

These would work:

``` r
vector_works <- "panda"
vector_also_works <- vector2
```

You can also have vectors with multiple elements - this is going to print a vector of 3 elements, where the first element is 2, the second element is 6, and the third element is 3.

``` r
c(2, 6, 3)
```

We can do the same thing with character elements.

``` r
# create a vector of 2 elements "vhagar" and "meleys"
vector_appended <- c("vhagar", "meleys")

# add one more element, "balerion the dread" to our vector
c(vector_appended, "balerion the dread")
```

The code `1:10` will give you a vector of the numbers 1 through 10 (i.e., 1, 2, 3, 4, 5, 6, 7, 8, 9, 10).

We can make a sequence of numbers using the function `seq()`.

``` r
vector_seq <- seq(from = 6, to = 8, by = 0.2)
```

> Note you can learn more about how each individual function works using the help, in this case `?seq`

Vectors are just lists of things in R.
They can contain numbers or strings (i.e., character words), but each vector can only contain one data type (i.e., only numbers, or only characters).

### Vectorization

A key aspect of R is that it is a vectorized language.

What do you think will happen here?

``` r
vector_seq * 2
```

What happens here is that each number in `vector_seq` gets multiplied by 2. The 2 gets "recycled".

#### Challenge 1

A. Start by making a vector x with the whole numbers 1 through 26. Then, subtract 0.5 from each element in the vector and save the result in vector y. Check your results by printing both vectors.

<details><summary>Click for the solution</summary>

``` r
x <- 1:26
x
```

```
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[26] 26
```

``` r
y <- x - 0.5
y
```

```
 [1]  0.5  1.5  2.5  3.5  4.5  5.5  6.5  7.5  8.5  9.5 10.5 11.5 12.5 13.5 14.5
[16] 15.5 16.5 17.5 18.5 19.5 20.5 21.5 22.5 23.5 24.5 25.5
```

</details>

B. What do you think will be the result of the following operation?

```
1:5 * 1:5
```

<details><summary>Click for the solution</summary>

```
[1]  1  4  9 16 25
```

</details>

### Exploring vectors

We can use the function `head()` to get the first 6 elements of a vector.

`head(vector_seq)`

The function `tail()` gives you the last 6 elements of a vector.

`tail(vector_seq)`

The function `length()` tells you how long your vector is.

`length(vector_seq)`

The function `sum()` adds up all the numbers in your vector.

`sum(vector_seq)`

The function `mean()` tells you the mean value of all the numbers in your vector.

`mean(vector_seq)`

Indexing means referring to an element by its position in the vector.

``` r
# this refers to the second element in our vector
vector_seq[2]

# this will print elements 2 through 5
vector_seq[2:5]

# this will take elements 1 and 8
vector_seq[c(1,8)]
```

You can also use indexing to modify certain elements of your vector.
For example, if we want to change the first element of our vector to 30, we can:

``` r
vector_seq[1] <- 30

# print to check if it worked
vector_seq
```

### Data frames

Vectors are important but often we are working with data frames rather than vectors. We are going to start simply with some cats (meow).

``` r
# create the dataframe
cats <- data.frame(
  coat = c("calico", "black", "tabby"),
  weight = c(2.1, 5.0, 3.2),
  likes_string = c(1, 0, 1)
)

# print
cats
```

Observations are in rows, variables are in columns.

#### Extracting columns

We can extract columns from our dataframes using the `$` accessor.

```
# to get the coat column
cats$coat

# to get the weight column
cats$weight

# to get the likes_string column
cats$likes_string
```

#### Data types

Each data type has some unique functionality. For example, we can't multiply a character string by a number because that doesn't make sense.

``` r
# this won't work
"valerion" * 5
```

If you want to see what data type you have, you can use the function `typeof()`.

``` r
typeof("valerion") # a character
typeof("5") # character if it's in quotes
typeof(3.14) # double, or dbl, or numeric, numbers with decimal pts
typeof(1:3) # integer
typeof(TRUE) # logical
```

#### Challenge 2

What do you expect the following to produce?

``` r
typeof("TRUE")
typeof(banana)
```

<details><summary>Click for the solution</summary>

1. "TRUE" is character because of the quotes around it.
2. Recall the earlier example: this returns an error because the object banana does not exist.

</details>

Vectors (and columns in data frames) only have one data type.

``` r
# will give us the "structure" of the df cats
str(cats)
```

#### Challenge 3

Given what we’ve learned so far, what type of vector do you think the following will produce?

``` r
quiz_vector <- c(2, 6, "3")
```

<details><summary>Click for the solution</summary>

```
quiz_vector
[1] "2" "6" "3"
```

```
typeof(quiz_vector)
[1] "character"
```

</details>

> Vectors can only have one data type!
The examples above show coercion of numeric to character. It's good practice to always understand what data types you have, so weird things aren't happening with your data that you don't know about.

More about coercion:

``` r
coercion_vector <- c("a", TRUE)

# print it
coercion_vector
```

```
typeof(coercion_vector) # we have characters now
```

The most common type of coercion is numbers or logicals to characters.

#### Manual type conversion

You can convert between data types using manual conversion functions.

``` r
# converts in this case character to number
as.double(quiz_vector)
```

```
# converts in this case numbers to characters
as.character(c(0, 2, 4))
```

If we look at the column `cats$likes_string` we see that we have numbers, even though in this case 0 represents "doesn't like" or FALSE and 1 represents "likes" or TRUE.

``` r
# convert numbers to logical
as.logical(cats$likes_string)
```

If we want to save that back to the data frame we can do that like this:

```
cats$likes_string <- as.logical(cats$likes_string)
```

This is going to coerce to an `NA` (not available, i.e., missing data), and R will warn you about it.

```
as.double("kiwi")
```

### Factors

In R, categorical data, like different treatments in an experiment, can be stored as “factors”. Factors are useful for statistical analyses and also for plotting, the latter because you can specify a custom order among the so-called “levels” of the factor.

``` r
diet_vec <- c("high", "medium", "low", "low", "medium", "high")
factor(diet_vec)
```

In the example above, we turned a regular vector into a factor. The levels are sorted alphabetically by default, but we can manually specify an order that makes more sense and that would carry through if we were to plot data associated with this factor:

``` r
diet_fct <- factor(diet_vec, levels = c("low", "medium", "high"))
diet_fct
```

What type of data do we have?

``` r
typeof(diet_fct)
```

Come back after lunch by 12:25pm!

## Data frame manipulation with **`dplyr`**

Welcome back!
So far, we have gone through:

- The R console and using RStudio with its advanced functionality
- Using R scripts (they end with the file extension `.R`)
- The basics of a working directory and how you can set/change this
- Functions (e.g., `setwd()`, `c()`, the basic ones) and how to use them
- The assignment operator `<-`
- Help me! `?name-of-function`
- Data structures (vectors and data frames)
- Data types and coercion between them (both accidentally and on purpose)
- Factors, a special data type for categorical data

Now we will start manipulating data frames with the package [`dplyr`](https://dplyr.tidyverse.org/) (which is a part of the `tidyverse`).

You might want to restart R to clear whatever is currently in your environment. In the top task bar, you can navigate to `Session` > `Restart R` to restart R.

Jelmer is going to use a new script; if you want one too, open it with `File` > `New File` > `R Script`.

The `dplyr` package is one of the tidyverse packages. It allows easy manipulation of data frames. You can learn more about it [here](https://dplyr.tidyverse.org/).

Here we’re going to cover some of the most commonly used functions, and will also use pipes (`%>%`) to combine them.

- `select()` to pick columns (variables)
- `filter()` to pick rows (observations)
- `rename()` to change column names
- `arrange()` to change the order of rows (i.e., to sort a data frame)
- `mutate()` to modify values in columns and create new columns
- `summarize()` (with `group_by()`) to compute summaries across rows

We will start with loading the tidyverse. The tidyverse is a bit unusual in that it is a collection of packages.

``` r
# to load the tidyverse
# you have to do this every time you open a new R session
library(tidyverse) # no quotes needed: library() accepts the package name unquoted
```

The packages within the core tidyverse (i.e., those that load when you do `library(tidyverse)`) are listed in the output in your console.
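If the startup message has scrolled away, one way to double-check that loading worked is to ask R which packages are currently attached. This is only a minimal sketch using base R's `search()`; the four package names checked are an arbitrary sample of the core tidyverse, and the names `core_sample` and `is_attached` are made up for this example.

``` r
# load the tidyverse (attaches its core packages)
library(tidyverse)

# search() lists attached packages as "package:<name>" strings
# (along with a few other environments such as ".GlobalEnv")
attached <- search()

# a few core tidyverse packages, chosen here only as examples
core_sample <- c("ggplot2", "dplyr", "tidyr", "readr")

# one TRUE/FALSE per package: is it currently attached?
is_attached <- paste0("package:", core_sample) %in% attached
is_attached
```

If any of these come back `FALSE`, `library(tidyverse)` probably did not run successfully in this session.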
We are also going to load a package that contains the data that we will use today, called gapminder.

``` r
library(gapminder)
```

Typically when you load a package, if it's successful, you will see nothing. In this case, silence is good.

Let’s take a look at the dataset, which is stored in a data frame also called `gapminder`:

``` r
gapminder
```

    # A tibble: 1,704 × 6
       country     continent  year lifeExp      pop gdpPercap
       <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
     1 Afghanistan Asia       1952    28.8  8425333      779.
     2 Afghanistan Asia       1957    30.3  9240934      821.
     3 Afghanistan Asia       1962    32.0 10267083      853.
     4 Afghanistan Asia       1967    34.0 11537966      836.
     5 Afghanistan Asia       1972    36.1 13079460      740.
     6 Afghanistan Asia       1977    38.4 14880372      786.
     7 Afghanistan Asia       1982    39.9 12881816      978.
     8 Afghanistan Asia       1987    40.8 13867957      852.
     9 Afghanistan Asia       1992    41.7 16317921      649.
    10 Afghanistan Asia       1997    41.8 22227415      635.
    # ℹ 1,694 more rows

Jelmer is reorganizing his panes now for easy viewing. You can move your panes by clicking on the sort of blocky table icon in the middle. He has moved his console from the bottom left to the right, so now we see just two of the four quadrants:

- Script on the left
- Console on the right

A tibble is basically a data frame with very minor edits to be a little more user friendly and informative. It's smart enough not to print all 1,704 rows by 6 columns. But basically, a data frame and a tibble are the same thing.
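To see for yourself that a tibble is still a data frame, you can inspect its class. This is just a small sketch, assuming the `gapminder` package is installed; `gapminder_df` is a name made up for this example.

``` r
library(gapminder)

# a tibble carries extra classes on top of "data.frame"
class(gapminder)

# so functions that expect a data frame will accept it
is.data.frame(gapminder)

# dimensions: number of rows, then number of columns
dim(gapminder)

# convert to a plain data frame, should you ever need one
gapminder_df <- as.data.frame(gapminder)
class(gapminder_df)
```

Be careful with printing `gapminder_df`: unlike the tibble, a plain data frame will try to print all of its rows, so `head(gapminder_df)` is the friendlier option.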
When we look at `gapminder` we can see that:

- `country` and `continent` are factors
- `year` and `pop` are integers
- `lifeExp` and `gdpPercap` are numeric/doubles (they differ from integers in that they can have decimal points)

As for the dataset itself, note that each row contains some data for a single country in a specific year (across 5-year intervals between 1952 and 2007):

- `lifeExp` is the life expectancy in years
- `pop` is the population size
- `gdpPercap` is the per-capita GDP

### `select()` to pick columns

To subset a data frame by keeping or removing certain columns, we can use the `select()` function.

> Note, you cannot use *dplyr* functions like `select()` on vectors - only on data frames. This isn't a big problem as you will most likely work with data frames, but it's good you know about it.

By default, this function will only **keep the columns that you specify**, which you typically do simply by listing those columns by name. So if I only want to keep some columns, I can select only the ones I want to **keep**.

``` r
# keep only year, country, and gdpPercap
select(.data = gapminder, year, country, gdpPercap)
```

Columns in the resulting data frame will be in the order you have specified in `select()`.

You can also specify only the columns you want to **remove**. `!` means "not".

``` r
# keep all columns except continent
select(.data = gapminder, !continent) # not continent
```

There are some [select helpers](https://dplyr.tidyverse.org/reference/select.html) that can be useful for selecting columns.

### `rename()` to change column names (and also the pipe `%>%`)

Our next *dplyr* function is one of the simplest: `rename()` to change column names.

``` r
# keep only year, country, and gdpPercap
# save to a new df called gapminder_sel
gapminder_sel <- select(.data = gapminder, year, country, gdpPercap)
```

The syntax to specify the new and old name within the function is `new_name = old_name`.
For example, building on the column selection we did above, we may want to rename the `gdpPercap` column:

``` r
# save to a new df called gapminder_sel
gapminder_sel <- select(.data = gapminder, year, country, gdpPercap)

# rename gdpPercap to be gdp_per_capita
rename(.data = gapminder_sel, gdp_per_capita = gdpPercap) # new name = old name
```

It is common to use several (*dplyr*) functions in succession to “wrangle” a dataframe into a format, and with the data, that we want. To do so, we could go on like we did above, successively assigning new data frames and moving on to the next step.

But there is a nicer way of doing this, using so-called “piping” with a pipe operator: we will use the `%>%` pipe operator.

We can do the same renaming as above but instead this time use the pipe.

``` r
gapminder %>%
  select(year, country, gdpPercap) %>%
  rename(gdp_per_capita = gdpPercap)
```

The pipe is nice for many reasons, including:

- it auto-completes with choices from your data, and this is helpful to avoid misspellings (which happen a lot)
- it is easier to read (it's good practice to put each piped step on its own line)
- we don't need to keep specifying the data - the data comes from the previous function (i.e., the data is the first argument for each function)
- it avoids storing a lot of intermediate data frames you probably won't use

### `filter()` to select rows/observations

We want to keep the rows where the life expectancy `lifeExp` is greater than 80. We can do that with the code below:

``` r
gapminder %>%
  filter(lifeExp > 80)
```

For each row, the expression `lifeExp > 80` will be evaluated, and we will keep the rows for which we get a `TRUE`.

We can also write some code to keep only the observations from Europe.

``` r
gapminder %>%
  filter(continent == "Europe") # == tests for equality
```

You can also filter based on multiple conditions, e.g., Asian countries with life expectancy greater than 80 years, and from the year 2007.
``` r
gapminder %>%
  filter(continent == "Asia", # only from Asia
         year == 2007,        # only from the year 2007
         lifeExp > 80)        # only life expectancy > 80 years
```

We are using the comma `,` above, and the conditions are being combined in an "and" fashion - meaning for a row to be kept, all three conditions must be `TRUE`.

We might be interested in an "or" situation - where only one of the conditions needs to be true to give us a `TRUE`. Here, the "or" is signified with the vertical bar `|` - on my Mac this is right above the `return` key.

``` r
gapminder %>%
  filter(lifeExp > 80 | gdpPercap > 1000)
```

Let's practice putting this together. Let's filter for observations only in the Americas, pick only the columns `year`, `country`, and `gdpPercap`, and then rename `gdpPercap` to be `gdp_per_capita`.

``` r
gapminder %>%
  filter(continent == "Americas") %>%
  select(year, country, gdpPercap) %>%
  rename(gdp_per_capita = gdpPercap)
```

### Challenge 1

Write a single command (which can span multiple lines and include pipes) that will produce a data frame that has `lifeExp`, `country`, and `year` for Africa but not for other continents. How many rows does your data frame have?

<details>
<summary>
Click for the solution
</summary>

``` r
gapminder %>%
  filter(continent == "Africa") %>%
  select(year, country, lifeExp)
```

    # A tibble: 624 × 3
        year country lifeExp
       <int> <fct>     <dbl>
     1  1952 Algeria    43.1
     2  1957 Algeria    45.7
     3  1962 Algeria    48.3
     4  1967 Algeria    51.4
     5  1972 Algeria    54.5
     6  1977 Algeria    58.0
     7  1982 Algeria    61.4
     8  1987 Algeria    65.8
     9  1992 Algeria    67.7
    10  1997 Algeria    69.2
    # ℹ 614 more rows

It has 624 rows.

</details>

### `arrange()` to sort data frames

The `arrange()` function is like the sort function in Excel: it changes the order of the rows based on the values in one or more columns.
For example, our data set `gapminder` is currently sorted alphabetically by `country` and then by `year`, but we may instead want to sort observations by population size:

``` r
gapminder %>%
  arrange(pop)
```

You might want to sort descending - so that the highest values are at the top.

``` r
gapminder %>%
  arrange(desc(pop))
```

You might also want to sort by a few columns; you can do that too.

``` r
gapminder %>%
  arrange(continent, country)
```

### `mutate()` to modify values in columns and create new columns

So far, we’ve focused on functions that “merely” subset and reorganize data frames. We’ve also seen how we can modify column names. But we haven’t seen how we can *change the data* or *compute derived data* in data frames.

What if we want to express the population in millions?

``` r
gapminder %>%
  mutate(pop_millions = pop/1000000)
```

To modify a column rather than adding a new one, simply assign back to the same name:

``` r
gapminder %>%
  mutate(pop = pop / 10^6)
```

### `summarize()` to compute (per-group) summary statistics

In combination with `group_by()`, the `summarize()` function can compute data summaries across groups of rows of a data frame.

First, let’s see what `summarize()` does when used by itself: if we want to calculate the mean `gdpPercap` across our data frame, we can use the code below.

``` r
gapminder %>%
  summarize(mean_gdpPercap = mean(gdpPercap))
```

Note that this gives us a new data frame with totally different dimensions compared to what we've been working with.

We might want to summarize for different groups - let's try to calculate the mean `gdpPercap` for each `continent`.

``` r
gapminder %>%
  group_by(continent) %>%
  summarize(mean_gdpPercap = mean(gdpPercap))
```

`group_by()` implicitly splits a data frame into groups of rows: here, one group for observations from each continent. After that, operations like those in `summarize()` will happen separately for each group, which is how we ended up with per-continent means.
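The per-group logic described above is easier to verify on data small enough to check by hand. The sketch below uses a made-up data frame (`toy`, with hypothetical columns `group` and `value`) rather than `gapminder`, and adds dplyr's `n()` to count the rows in each group.

``` r
library(dplyr)

# a tiny, made-up data frame: 2 rows in group "a", 3 rows in group "b"
toy <- data.frame(
  group = c("a", "a", "b", "b", "b"),
  value = c(1, 3, 10, 20, 30)
)

# one output row per group, with the row count and the mean of value
toy_summary <- toy %>%
  group_by(group) %>%
  summarize(n_rows = n(),
            mean_value = mean(value))

toy_summary
# group "a": n_rows = 2, mean_value = 2
# group "b": n_rows = 3, mean_value = 20
```

The result has one row per group rather than one row per observation, which is the change in dimensions noted above.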
Finally, another powerful feature is that we can *group by multiple variables* - for example, by `year` *and* `continent`:

``` r
gapminder %>%
  group_by(continent, year) %>%
  summarize(mean_gdpPercap = mean(gdpPercap),
            mean_lifeExp = mean(lifeExp))
```

## Data visualization with **`ggplot2`**

### The grammar of graphics

The package ggplot2 applies a framework for plotting such that any plot can be built from the same basic building blocks. The “gg” in ggplot stands for “grammar of graphics” and all plots share a common template. This is fundamentally different than plotting using a program like Excel, where you first pick your plot type, and then you add your data. With ggplot, you start with data, add a coordinate system, and then add “geoms,” which indicate what type of plot you want.

Simplified, we will provide to ggplot:

- Our data frame
- What connects the data to the graphics (mapping "aesthetics")
- "Layers" - determine which type of plot we are going to make, what coordinate system we will use, what scales we want, and other important aspects of our plot

### Getting set up

**Loading our packages:**

```r
library(tidyverse) # this includes ggplot2
library(gapminder)
```

**Exploring our data set:**

We can use the function `View()` to look at our data. This is RStudio-specific functionality that opens our data sort of like how we might view it in Excel -- very nice!

```r
View(gapminder)
```

### Building our plot

Note: while the package is called _ggplot2_, the function to make plots is `ggplot`.

```r
# Check the help for the ggplot function
?ggplot
```

We can start by providing the data as the first argument to `ggplot()`:

```r
ggplot(data = gapminder)
```

Instead of providing the dataframe as the first argument to `ggplot()`, we can use the pipe `%>%` (or the newer pipe `|>`) that Jelmer taught us about. This "sends" the data into the next function. I will use this syntax for the rest of the workshop.
```{r}
gapminder %>%
  ggplot()
```

> Note that both give you the same output.

If we want to make a scatterplot to understand the relationship between GDP per capita and life expectancy, we can do so by setting `x` and `y` respectively within `aes()`, or our aesthetic mappings.

```{r}
gapminder %>%
  ggplot(mapping = aes(x = gdpPercap, y = lifeExp))
```

Ok! We don't have a plot, per se, but we have more than we had before. We can now see that `gdpPercap` is on the x-axis (along with some numbers reflecting the range of our data), and `lifeExp` is on the y-axis (along with some numbers reflecting the range of our data).

> Note that the "mapping = " part is actually not necessary (i.e., we can use that argument with or without specifying its name).

```{r}
gapminder %>%
  ggplot(aes(x = gdpPercap, y = lifeExp))
```

Now we need to tell R what geometry, or "geom", we want to use. Let's make a scatterplot here.

```{r}
gapminder %>%
  ggplot(aes(x = gdpPercap, y = lifeExp)) +
  geom_point()
```

> Note the switching between `%>%` (piping data into the next function, here `ggplot`) and `+` (adding layers to a plot)

All of the "geoms" start with [`geom_*()`](https://ggplot2.tidyverse.org/reference/index.html#layers), and we can see what they all are by starting to type geom and pressing tab.

### Challenge 1

Modify the plot we've made so that you can see the relationship between life expectancy and year.

<details><summary>Click for the solution</summary>

Map `x = year` and `y = lifeExp`, and use `geom_point()` since we want a scatterplot.

```{r}
gapminder %>%
  ggplot(aes(x = year, y = lifeExp)) +
  geom_point()
```

</details>

### Challenge 2

Modify the plot to color your points by continent.

<details><summary>Need a hint?</summary>

Try using the argument `color` within your aesthetic mappings.

</details>

<details><summary>Need another hint?</summary>

Try setting `color = continent` within your aesthetic mappings.
</details>

<details><summary>Click for the solution</summary>

```{r}
gapminder %>%
  ggplot(aes(x = year, y = lifeExp, color = continent)) +
  geom_point()
```

</details>

### Adding more layers

Instead of making a scatter plot, we might want to make a line plot. Using `ggplot`, this is as easy as changing out your geom.

```{r}
gapminder %>%
  ggplot(aes(x = year, y = lifeExp, color = continent)) +
  geom_line()
```

The plot is jumping around a lot since for each year, we have life expectancy data for each country. Our plot here is not a summary of that data, but instead all of that data together, on top of itself.

We might want to have one line for each country. We can do this by specifying `group = country` within our aesthetic mappings.

```{r}
gapminder %>%
  ggplot(aes(x = year, y = lifeExp, color = continent, group = country)) +
  geom_line()
```

A nice thing about `ggplot` is that you don't actually need to decide if you want to have lines or points, you can have both!

```{r}
gapminder %>%
  ggplot(aes(x = year, y = lifeExp, color = continent, group = country)) +
  geom_line() +
  geom_point()
```

Now we see a point for each observation, and each point and line is colored based on continent.

You can also set your mappings globally (within `ggplot()`, where they apply to every layer) or locally (within a specific geom, where they apply only to that layer). Let's see what the difference is by moving `color = continent` into `geom_point(aes(color = continent))`.

```{r}
gapminder %>%
  ggplot(aes(x = year, y = lifeExp, group = country)) +
  geom_point(aes(color = continent)) +
  geom_line()
```

### Challenge 3

Change the order of the point and line layers - what happens?
<details><summary>Click for the solution</summary>

Points then lines (points are on the bottom)

```{r}
gapminder %>%
  ggplot(aes(x = year, y = lifeExp, group = country)) +
  geom_point() +
  geom_line(aes(color = continent))
```

Lines then points (lines are on the bottom)

```{r}
gapminder %>%
  ggplot(aes(x = year, y = lifeExp, group = country)) +
  geom_line(aes(color = continent)) +
  geom_point()
```

</details>

> Bottom line for challenge 3: layers are really added one by one, on top of each other, in the order that you add them in the code!

### Transformations and statistics

Sometimes we might want to apply some kind of transformation to our data while plotting so we can better see relationships between our variables.

Let's start with a base plot to see the relationship between GDP per capita (`gdpPercap`) and life expectancy (`lifeExp`):

```{r}
gapminder %>%
  ggplot(aes(x = gdpPercap, y = lifeExp)) +
  geom_point()
```

The presence of some outliers for `gdpPercap` makes it hard to see this relationship. We can try log base 10 transforming the x-axis, using a [`scale_*()`](https://ggplot2.tidyverse.org/reference/index.html#scales) function, to see if this helps.

```r
# log10-scale the x axis
gapminder %>%
  ggplot(aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  scale_x_log10()
```

I can also make our points a little bit transparent to ease our overplotting problem (where too many points are on top of each other, making each point hard to see) by setting `alpha = `. Alpha ranges from 0 (totally transparent) to 1 (totally opaque). Note that I did not put `alpha = 0.5` within an `aes()` - we are not mapping alpha to some variable, we are simply setting what alpha should be.

```r
gapminder %>%
  ggplot(aes(x = gdpPercap, y = lifeExp)) +
  geom_point(alpha = 0.5) + # NOTE: not inside the aes()
  scale_x_log10()
```

Now we can better see that the very dark parts of the plot are where there are a lot of data points.
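To make the difference between *setting* and *mapping* an aesthetic more concrete, here is a small sketch that puts the two side by side. Mapping `alpha` to `year` is only an illustrative choice (it is not part of the workshop material), and the object names `p_set` and `p_map` are made up for this example.

```r
library(tidyverse) # this includes ggplot2
library(gapminder)

# Setting: alpha is a fixed value OUTSIDE aes(), so every point is
# drawn with the same transparency and no legend is created
p_set <- gapminder %>%
  ggplot(aes(x = gdpPercap, y = lifeExp)) +
  geom_point(alpha = 0.5) +
  scale_x_log10()

# Mapping: alpha is tied to a variable INSIDE aes(), so transparency
# varies with the data (here: by year) and a legend is added
p_map <- gapminder %>%
  ggplot(aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(alpha = year)) +
  scale_x_log10()

# Print both plots to compare them
p_set
p_map
```

The same rule of thumb applies to other aesthetics such as `color`, `size` and `shape`: inside `aes()` when they should depend on a column of the data, outside `aes()` when they should be one fixed value.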
We can also add a line of fit to our data with `geom_smooth()`. Setting `method = "lm"` fits a linear model.

```r
gapminder %>%
  ggplot(aes(x = gdpPercap, y = lifeExp)) +
  geom_point(alpha = 0.5) + # not inside the aes
  scale_x_log10() +
  geom_smooth(method = "lm") # smooth with a linear model ie "lm"
```

We can adjust the thickness of the line by setting `linewidth = ` within `geom_smooth()`.

```{r}
gapminder %>%
  ggplot(aes(x = gdpPercap, y = lifeExp)) +
  geom_point(alpha = 0.5) + # not inside the aes
  scale_x_log10() +
  geom_smooth(method = "lm", linewidth = 3) # smooth with a linear model ie "lm"
```

That's a ridiculously thick line.

### Challenge 4A

Modify the color and the size of the points in the previous example. You'll also probably want to make the linewidth less ridiculous.

<details><summary>Need a hint?</summary>

Don't put `color` and `size` inside `aes()`.

</details>

<details><summary>Want another hint?</summary>

The equivalent of `linewidth` for points is `size`.

</details>

<details><summary>Click for the solution</summary>

```{r}
gapminder %>%
  ggplot(aes(x = gdpPercap, y = lifeExp)) +
  geom_point(alpha = 0.5, color = "purple", size = 0.5) + # not inside the aes
  scale_x_log10() +
  geom_smooth(method = "lm", linewidth = 1)
```

</details>

### Challenge 4B

Modify the plot from 4A so that the points are a different shape and are colored by continent, with a new trendline for each continent.

<details><summary>Need a hint?</summary>

Unlike in 4A, `color` now needs to go inside `aes()`, since we are mapping it to `continent`.

</details>

<details><summary>Want another hint?</summary>

The equivalent of `linewidth` for points is `size`.

</details>

<details><summary>Click for the solution</summary>

All points are now triangles.
```{r}
gapminder %>%
  ggplot(aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point(shape = 17, alpha = 0.5) +
  scale_x_log10() +
  geom_smooth(method = "lm", linewidth = 1) # smooth with a linear model ie "lm"
```

All points are now triangles with a separate fill and outline (shape 24): we are mapping `continent` to `fill` and setting the outline `color` to black.

```{r}
gapminder %>%
  ggplot(aes(x = gdpPercap, y = lifeExp, fill = continent)) +
  geom_point(shape = 24, alpha = 0.5, color = "black") +
  scale_x_log10() +
  geom_smooth(method = "lm", linewidth = 1) # smooth with a linear model ie "lm"
```

Mapping `shape` to `continent`

```{r}
gapminder %>%
  ggplot(aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point(aes(shape = continent), alpha = 0.5) +
  scale_x_log10() +
  geom_smooth(method = "lm", linewidth = 1) # smooth with a linear model ie "lm"
```

</details>

## Multi-panel figures

[Small multiples](https://en.wikipedia.org/wiki/Small_multiple) are a useful way to compare subsets of data on the same scale to understand patterns.

Let's say we want to understand how life expectancy changes over time throughout the Americas. Instead of making an individual plot for each country, we can use `facet_wrap()` to have `ggplot` make our plots all at once.

First let's use what Jelmer taught us to `filter` for only the observations from the Americas.

```{r}
gapminder_americas <- gapminder %>%
  filter(continent == "Americas")
```

Then, this new data frame `gapminder_americas` can be the data for our next plot. Let's look first without faceting:

```{r}
gapminder_americas %>%
  ggplot(aes(x = year, y = lifeExp, group = country)) +
  geom_line()
```

Or, we can do this all at once without saving the intermediate data frame -- note the switch between `%>%` and `+` (!):

```r
# No assignment here: we pipe the filtered data straight into ggplot(),
# so gapminder_americas remains the data frame we created above
gapminder %>%
  filter(continent == "Americas") %>%
  ggplot(aes(x = year, y = lifeExp, group = country)) +
  geom_line()
```

**Faceting** allows us to better see each country on its own.
```r
gapminder_americas %>%
  ggplot(aes(x = year, y = lifeExp)) +
  geom_line() +
  facet_wrap(vars(country)) + # make facets by country
  theme(axis.text.x = element_text(angle = 45)) # years on the x on a 45 deg angle
```

## Modifying text

The plots we've made so far could really benefit from some better labels. We can set what we want our plot labels to be as arguments in `labs()`.

```r
gapminder_americas %>%
  ggplot(aes(x = year, y = lifeExp)) +
  geom_line() +
  facet_wrap(vars(country)) + # make facets by country
  theme(axis.text.x = element_text(angle = 45)) + # years on the x on a 45 deg angle
  labs(
    # x axis title:
    x = "Year",
    # y axis title:
    y = "Life expectancy",
    # main title of figure:
    title = "Figure 1. Life expectancy in the Americas from 1952-2007"
  )
```

## Exporting a plot

Often we want to take the plot we have made using R and save it for use someplace else. You can export using the Export button in the Plots pane (bottom right), but you are limited in the parameters you can set for the resulting figure.

We can do this with more control using the function [`ggsave()`](https://ggplot2.tidyverse.org/reference/ggsave.html).

First we will save our plot as an object using the assignment operator `<-`.

```{r}
our_plot <- gapminder_americas %>%
  ggplot(aes(x = year, y = lifeExp)) +
  geom_line() +
  facet_wrap(vars(country)) + # make facets by country
  theme(axis.text.x = element_text(angle = 45)) + # years on the x on a 45 deg angle
  labs(
    # x axis title:
    x = "Year",
    # y axis title:
    y = "Life expectancy",
    # main title of figure:
    title = "Figure 1. Life expectancy in the Americas from 1952-2007"
  )
```

Then we can save it. I am indicating here to save the plot in a folder called `results` in my working directory, as a file called `lifeExp.png`. If you want your file to go within a folder, you have to first create that folder.
```{r}
ggsave(filename = "results/lifeExp.png", # file path and name
       plot = our_plot, # what to save
       width = 18,
       height = 12,
       dpi = 300, # dots per inch, ie resolution
       units = "cm") # units for width and height
```
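
Since the `results` folder has to exist before `ggsave()` can write into it, one way to handle this is to create the folder from within R. The following is a minimal sketch, not the only way to do it: the names `out_dir`, `demo_plot` and `out_file`, and the file name `lifeExp_demo.png`, are made up for this example.

```r
library(ggplot2)
library(gapminder)

# Create the output folder only if it does not exist yet
out_dir <- "results"
if (!dir.exists(out_dir)) {
  dir.create(out_dir)
}

# A small stand-in plot, so that this example runs on its own
demo_plot <- ggplot(gapminder, aes(x = year, y = lifeExp, group = country)) +
  geom_line()

# Build the file path and save the plot with the same settings as above
out_file <- file.path(out_dir, "lifeExp_demo.png")
ggsave(filename = out_file,
       plot = demo_plot,
       width = 18,
       height = 12,
       dpi = 300,
       units = "cm")

# Check that the file was written
file.exists(out_file)
```

You could also create the folder by hand, for example with the "New Folder" button in the Files pane of RStudio; doing it in the script just means the code also runs on a computer where the folder does not exist yet.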
