---
tags: course notes
---
# UPGG Computational Bootcamp 2024
## Day 1 - ☕️ Morning ☕️
### Introduction to Unix Shell
#### Instructor: Odmaa Bayaraa
#### Goals!
* Intro to Shell
* Navigating files and directories
* Writing Scripts and working with data
* Project Organization
#### Intro to Shell
Meet Dr. Rachel!
**Why should you care about shell/command line?**
* The Unix command line is a way to communicate directly with a computer! Most notably, you can interact with the Duke Compute Cluster (DCC). DCC is a powerful computing resource that you can use for analyses and data storage. The only "problem" is that you can only interact with DCC through the command line.
* Many bioinformatics tools can only be used through a command line interface.
* Automation makes your life easier. Let the computer do the repetitive tasks for you, so you can spend your time and energy elsewhere!
* We humans also make a lot of errors, especially when we have to do repetitive tasks. Computers are a lot better at doing that!
* Writing your code down makes your work a lot more reproducible which is crucial in scientific research.
**Servers** = A single computer that you can use to work or host apps/websites
**Compute clusters** = Many computers/servers combining their power!
**Cloud Computing** = On-demand resources like a cluster sourced from the cloud, very powerful!
**Opening your terminal**
The first line you see probably includes (but this depends on your machine and settings):
* The machine name! This is normally your name and your machine!
* `~` designates where you are (home directory)
* Username: who is accessing the terminal right now!
* `$` separates the information above and the input box that you can type in commands
`bash` vs `zsh`
They are mostly similar. `bash` was the default shell on Mac devices for a long time, before Apple made `zsh` the default. Keep an eye on which shell your machine is using, but most common commands are the same.
#### Navigating directories
Remember that spaces have meaning in command line.
```bash=
pwd # This is short for "print working directory"
ls # This prints a "list" of files and directories within the working directory
ls -l # Adding option to the command
```
The `-l` is an option! Be sure that you have a space between the command, `ls`, and `-l`. This specific option tells Shell that you want the list to be printed in the "long" format, containing more information.
To figure out what options are available for the commands, you can look up the "manual" of that command by running `man <command>`. For example,
```bash=
man ls
```
Sometimes a command might not have a `man` page. Most likely, you can run `<command> -h` (or `<command> --help`) to open a "help" page, which functions similarly to a `man` page.
These "help" and "manual" pages might open a "new window" that lists a lot of information about that command, and you will notice that the original terminal has disappeared. To go back to the original terminal, press `q`. What is happening here is that `man` (and some `-h` options) opens the page in a program called `less`. We'll get into it later. Think of it as a "file viewer" within the terminal: you can view files with `less`, but you cannot edit their contents.
**Tab completion**
While you are typing commands or file names, you can utilize `tab completion` to complete the word for you. It will try to match what you are typing to what is available in the file system. And it will complete up to a point where it is unambiguous which file you are looking for.
```bash=
history # this will show commands that you recently ran in terminal
```
#### File system
Home vs root
* Think of your file system as a tree. You can create directories within a directory, in effect, making the branches for that directory. But if you go upstream, when does the directory end? There must be a directory that contains ALL directories within your system. That directory is called `root`. You generally don't want to touch this directory directly as it contains a lot of operational files.
* Where you will be most of the time is your `home` directory, denoted by `~`. `home` is where you should keep all your relevant files and folders -- even `Desktop` is probably within your `home`. Think about a system that is shared across multiple users: each user is going to have their own `home` directory, so you don't get in the way of one another. Changing things in `root` can affect other users' `home` directories as well, so the rule of thumb is to not touch anything beyond `home` unless you know what you are doing.
* `root` is represented as `/` and `home` is represented as `~` in the shell.
```bash=
mkdir UPGG_Camping_Folder # This "makes directory" within your current directory (Desktop)
cd UPGG_Camping_Folder # This takes us into the directory we just made
cd .. # This takes us back into the parent directory (Desktop)
ls .. # This lists the files and directories within parent directory
ls ~ # You can use shorthand for `home` directory with `~`
```
**Absolute vs. Relative paths**
Absolute paths
* You can type this address from any location within your computer, and it will take you to the same place
* Absolute paths start with a `/`
* Useful for scripting and for going straight to one location!
Relative path
* Relative to a specific location, but from your current dir by default
* No `/` at the start
* If a file or directory is inside your current directory, you can refer to it by name alone. E.g., if `UPGG_Camping_Folder` is within your current directory, the shell will understand just `UPGG_Camping_Folder`.
* You can also use relative path from a specific location, for example, from your `home` - that looks like this `~/Desktop/<folder_name>`.
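To make the difference concrete, here is a small sketch (the folder names are examples; the block creates the folder so it can run on its own):

```bash=
mkdir -p ~/Desktop/UPGG_Camping_Folder   # make sure the example folder exists
cd /                                     # absolute path: the root directory
cd ~/Desktop/UPGG_Camping_Folder         # path relative to home (~): works from anywhere
cd ..                                    # relative path: up one level, into ~/Desktop
cd UPGG_Camping_Folder                   # relative path: back down into the folder
pwd                                      # prints the absolute path of where you are
```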
**Examining Files**
```bash=
cat # Prints the contents of the file to the terminal (stdout)
head # Prints the first n lines of the file to the terminal (stdout)
tail # Prints the last n lines of the file to the terminal (stdout)
less # Opens the file in read-only view without printing it out
```
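A quick sketch of these in action, creating a small throwaway file first (the filename is made up):

```bash=
printf "line1\nline2\nline3\n" > example.txt  # make a 3-line test file
cat example.txt        # prints all three lines
head -n 2 example.txt  # prints line1 and line2
tail -n 1 example.txt  # prints line3
```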
**FASTA/FASTQ format**
These are formats for genome sequence data. They have a specific structure that helps with consistency and optimization, and they are designed to be easy to read and space/memory efficient. Technically they are plain-text files, but in a specialized format. If a tool requires a `fastq` file as input, you want to give it a real `fastq` file -- the format is not just the filename suffix. Simply renaming a file to `filename.fastq` does **NOT** make it a `fastq` file.
FASTA/FASTQ files are designed to give you as much information about sequencing reads as possible while preserving memory.
**FASTA**
First line after the `>` = Read name and information about the read
Second line = The actual DNA sequence in bases
**FASTQ**
First line after the `@` = Read name and information about the read
Second line = The actual DNA sequence in bases
Third line = Strand information (+ or -) and possibly other information
Fourth line = PHRED quality score, each letter corresponds to a nucleotide and has a score value
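For illustration, minimal made-up records following the line structures above (the read names, sequences, and quality strings are invented):

```
>read1 sample FASTA record
ACGTACGTACGT

@read1 sample FASTQ record
ACGTACGTACGT
+
IIIIFFFF!!##
```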
**BAM/SAM format**
These are sequence alignment formats. Think of them as processed `fasta`/`fastq`: they include information like where each read maps on the genome (chromosome, position, etc.). `bam` stands for "binary alignment map" and `sam` for "sequence alignment map". The difference is that `bam` files aren't human-readable but are optimized for storage.
These formats also include metadata such as read names, mapping quality, and optional tags. The CIGAR string describes how each read aligns to the reference (matches, insertions, deletions, etc.).
**GFF/BED format**
"General feature format" and "browser extensible data"
`GFF` is a complex format for detailed genomic feature annotation e.g., genes, exons etc.
`BED` is a simpler format for specific genomic regions and intervals with chromosome id and start/end positions, etc.
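For illustration, a made-up `BED` line is tab-separated, with chromosome, start, end, and an optional name:

```
chr1	1000	5000	my_feature
```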
Remember, you can always Google the most updated documentation for each of these file formats!
### Why Computational Pipelines are Important
It's always a good idea to plan out your experiments as much as possible! The quality of the data that comes from the lab experiments determines how your analysis will go.
As genetics and genomics students, we need to use many different software and analysis tools to get from the experimental data to our pretty results.
Visualization helps! Sit down and draw out your pipeline if you need to; getting a bird's-eye view will help you plan your time and focus.
Many experimental pipelines have similar steps. Make sure you understand which steps are already well-standardized and which steps may need to be fine-tuned to your dataset.
Making a broad experimental workflow will help you communicate how you analyze your data in publications and presentations. Fellow researchers need to understand how you analyzed your data in order to give helpful critiques.
Don't reinvent the wheel! We have amazing collaborators in the CBB program and beyond who spend all their time developing methods and software, so use the resources that are available! Reading methods sections of relevant papers and doing some Googling will help you learn which tools are available and what you can use them for.
Choose packages wisely! If possible, use software that is currently supported by an active developer or community.
The more steps you have in your workflow, the more possibility for errors. Try to keep file formats compatible and have checkpoints to review your work! We don't do research alone, so do your best to make sure your data is error-free and well-documented.
### Writing Scripts and Downloading Data
**Text Editors on the command line**
Text editors are extremely useful to write and edit your code. Two common editors for the command line are Vim and Nano.
**Nano** - simple and more user-friendly, but less customizable
**Vim** - steep learning curve, but much more customizable
A good example of a text file is a README file. README files are designed for documenting a project or folder. They are extremely useful for keeping track of what a project's goals were and what the files and scripts do.
Many README files are .txt files. Remember, the extension is important to tell the computer what type of file this should be!
```bash=
nano README.txt
```
Headers are a good thing to add to your README files. Include relevant data like the author, project title, date file was created, etc.
To peek into our file without opening it in the editor again, we can use `cat` or `head`.
```bash=
cat README.txt
```
You can check the first few lines of the file using `head`, and specify the number of lines using the `-n` flag
```bash=
head -n 1 README.txt
```
**Shell Scripts**
To combine command line commands into a file we can run all at once, let's create a shell script! Shell scripts have the `.sh` file extension.
In this case, we will put the command `echo` in our script, which will print a string to the terminal output.
```bash=
nano my_script.sh
# Inside the editor, type the following line, then save and exit:
echo "Hello UPGGers!"
```
To run our script, we can type `bash` followed by our script name.
```bash=
bash my_script.sh
```
You can do many things with a shell script, including navigation instructions to access files in different folders.
*Note: Ideally, try to use absolute paths in shell scripts! You can use relative file paths, but things can get messy if you need to move the script or files into different folders*
There are lots of other important commands, like `chmod` to change file permissions, and `wget` or `curl` (preinstalled on Macs) to download data from links online.
**File Permissions**
When you type `ls -l`, you get a list of files with their permissions. These permissions are shown below:
![image](https://hackmd.io/_uploads/rkA64K7i0.png)
The `r` permission is for opening and reading/viewing files, and is usually open to more people.
The `w` permission is for writing to or changing the file. You generally don't want write permission enabled for raw data files in particular -- you don't want to *accidentally* modify them.
The `x` permission is for executing the file, mostly applicable to filetypes such as `.sh` files.
Let's add execute permission for everyone to our newly created shell script, using the `chmod` command with a plus sign to **add** the permission to execute.
```bash=
chmod +x my_script.sh
```
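With execute permission added, you can run the script directly by its path. A sketch (it recreates `my_script.sh` first so the block is self-contained):

```bash=
printf '#!/bin/bash\necho "Hello UPGGers!"\n' > my_script.sh  # the script from earlier
chmod +x my_script.sh   # add execute permission
./my_script.sh          # run it directly -- prints: Hello UPGGers!
```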
**Creating, Moving, Copying, and Deleting**
In addition to the interface you are used to using, we can move, delete, copy, and rename files and folders on the command line!
`mkdir` is "make directory", and is the command to create a new directory.
```bash=
cd UPGG_Camping_Folder
mkdir Camping_Gear_List
```
*Note: It's a good idea to avoid spaces in file and folder names when working on the command line. Most people replace spaces with underscores or dashes.*
`mv` is "move", and can move a file or directory to a new location. `mv` can also rename files and folders!
Let's rename our new directory to Camping_Locations:
```bash=
mv Camping_Gear_List Camping_Locations
```
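The heading also mentions copying: `cp` copies a file, and (like `rm`) needs `-r` to copy a whole directory. A small sketch with made-up names:

```bash=
printf "tent, stove, lantern\n" > gear.txt  # a throwaway file
cp gear.txt gear_backup.txt                 # copy a file
mkdir -p trip_photos
cp -r trip_photos trip_photos_backup        # -r is needed to copy a directory
ls                                          # both copies now appear in the listing
```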
If we don't need a file anymore, we can delete it with the `rm` or "remove" command. Let's delete our README file.
You will notice that you cannot remove directories using just `rm`. This is because directories often contain things, so bash will try and keep you from deleting useful things.
To delete a directory, we need to use the `-r` flag for "recursive", which will delete **everything** in the directory. BE CAREFUL! This can easily delete many files, and they will not come back!
```bash=
rm README.txt
rm directory_name # Note that this doesn't work on a directory
rm -r directory_name # You need `-r` for "recursive", which will delete everything in the folder!
```
**Saving Scripts and Version Control**
As you work on your research, you will have many different versions of your scripts for different purposes or created by different people.
Many labs have different ways to keep track of all these different versions, but whichever one you use, make sure everything is consistent!
Duke has a contract with GitLab, so many labs use it. It is slightly different from GitHub, but has similar functionality.
**Data Organization**
Consistent file names are extremely helpful, because they allow us to use **wildcards**.
In bash, you can replace parts of a file name with a star `*` to access any files with that name structure.
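A quick sketch (the filenames are invented):

```bash=
touch sample_1.fastq sample_2.fastq sample_1.txt  # create empty example files
ls sample_*.fastq   # matches sample_1.fastq and sample_2.fastq
ls *.txt            # matches any file ending in .txt
```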
**Opening Zipped Files**
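Sequencing data often arrives gzipped (e.g. `data.fastq.gz`). A minimal sketch of the common commands, using a made-up file:

```bash=
printf "some data\n" > data.txt
gzip data.txt        # compress: creates data.txt.gz and removes data.txt
zcat data.txt.gz     # view the contents without uncompressing (gzcat on macOS)
gunzip data.txt.gz   # uncompress: restores data.txt
```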
## Day 1 - ☀️ Afternoon ☀️
### Introduction to R and Rmd
#### Instructor: Krista Pipho
### R Background
R began as a proprietary language called S, but it is now a free and open source programming language! This means anyone can create and edit R tools, known as packages.
**IDE**
An IDE is an Integrated Development Environment, which provides a more intuitive way to interact with code.
The IDE for R is called **RStudio**. RStudio has a lot of functionalities, but you don't need to know how to use all of them yet.
The first window that will usually open is the R terminal window. Just like bash, you can write and submit commands here.
Like many programming languages, you can do simple arithmetic with common operators
`+` for addition
`-` for subtraction
`/` for division
`*` for multiplication
```R=
10 + 5
10 - 5
10 / 2
10 * 5
```
In order to access previous commands you submitted, you can press the up arrow key on your keyboard and scroll back up through your command history.
Assignment
You can assign a value to a variable using the `<-` operator.
```R=
x <- 10*5
print(x) # 50
```
You can also compute with variables after you have assigned them a value
```R=
x * x
```
If you want to know what type of data is stored in your variable, you can use the `typeof()` function.
```R=
typeof(x)
```
Since R is foremost a statistical computing language, it has a lot of functionality for computing on various data formats.
One such format is the **vector**. To create one, you run `c()` with the elements you need inside. Elements are separated by commas (`,`).
```R=
v <- c(1, 2, 3) # v is a numeric vector
```
You can create a **matrix** by combining rows of vectors.
```R=
m <- rbind(v, v, v)
```
Additional resource: https://bio723-class.github.io/Bio723-book/getting-started-with-r.html
If you want to know how to do something, Google probably has the answer! There are many base R functionalities as well as user-created packages to perform the functions you need.
### R Markdown
Writing basic commands and doing arithmetic is fine, but it can be useful to create a shareable document that combines your code, its output, and explanatory text.
Go to **File** on the upper left
Then click **New File**
The fifth option down is **R Markdown**
This will bring up a window where you can set options for your document, including the title, author, date, and output options. Today we will be using the default HTML output.
When submitting this window, this should open a new tab in your RStudio environment. This is different from the terminal (now below the new window) because the new notebook tab is a more permanent way to type and edit your code.
The first step you will usually want to do is to **Save** the file. Click the save icon at the upper ribbon and select the directory that you want to save the file to.
Once you have saved your file in a directory, the lower right window may fill with the current **working directory**, where all the files in the directory should appear.
The window should be populated with a default template for information. Most of this can be deleted to replace with your own data.
Everything after the "## R Markdown" heading is example content that can be deleted.
The first few lines are called the **YAML** header, and contains information about the document itself that should not be deleted!
The second block of grey highlighted text is a **code chunk**. This one is a setup chunk of R code, as signified by the `{r setup}` in its header, and it helps the document knit correctly. This chunk won't appear in your final output since the option `include = FALSE` is set.
To see how it will look in the output, click the **Knit** option on the upper ribbon. This will show an approximation of the "knitted" or "rendered" final output.
To find helpful cheatsheets, click the **Help** tab in the lower right box, and click **Cheatsheets**. You can download these cheatsheets to your computer.
Notes:
* To make a newline, either two spaces or a backslash will work
* When inserting images, make sure the images are downloaded to your computer or are linked to a image source online
* Remember absolute vs. relative paths when linking to an image elsewhere on your machine!
* You can place the image in the same directory as the RMarkdown and it can be inserted into the file
* When making bulleted lists, make sure to include spaces after the dashes and have each entry on a new line
### Installing Packages
Packages are essential for using R to the fullest extent. There are several ways to install packages, but one of the easiest is to use the RStudio framework.
Go to the **Packages** tab on the lower right window and click the grey **Install** button on the upper right of that window.
This should bring up a new window where you can search for packages by name and install them.
Let's search for and install the **tidyverse** package. This may take a while; tidyverse includes many different smaller packages!
**Loading and naming chunks**
To quickly insert a new R code chunk, you can press `Ctrl + Alt + I` on Windows or `Cmd + Option + I` on Mac.
You can edit the header `{r}` to name the chunk and add flags. This can be useful when loading in packages, since many of them can send a warning message upon loading.
We will name this chunk "packages" and load our packages here. By specifying `include = TRUE` and `warning = FALSE`, we are telling R to include this chunk in the final knitted output but to not print any warnings that would normally be sent when the chunk is run.
Naming codeblocks is optional, but is important for knowing which code block is responsible for the document not knitting!
```R=
{r packages, include = TRUE, warning = FALSE}
library(tidyverse)
```
**Inserting Citations**
In order to insert references, we need to edit the YAML header at the top of our document. Under the information that is already there such as `title`, add two more lines: `bibliography: {your filename here}` and `link-citations: TRUE`.
```R=
---
title: "Markdown Testing"
author: "Kayla Wilhoit"
output: html_document
bibliography: references.bib
link-citations: TRUE
---
```
This will give us linked citations within the body of the output, and generate a bibliography at the end of our document.
To get the references, we need a BibTex formatted citation in a file that we can insert. You can obtain the BibTex file from most citation manager software.
**For Zotero Users**
Zotero is integrated with R! ~~(because Zotero is the best citation manager)~~
To insert the citation directly from your Zotero library, you will need to go into the **Visual Editor**. On the upper left of your notebook window, you will notice two buttons saying "Source" and "Visual". You will likely have been previously working in "Source", so click the "Visual" button. You may need to accept a window, but then a working preview of your code will open.
With the cursor where you want to insert your citation, click the **Insert** option on the same ribbon where the Source/Visual options were. Click on the "@ Citation" option, and a new window should open. Click on "Zotero" on the menu on the right, and you can search and select items from your Zotero library! Once you have selected all the items you want, click insert and the options will be automatically inserted.
**Reproducibility**
It is a good idea to share the details of your current setup at the end of your document.
To do this, use the `sessionInfo()` function.
```R=
sessionInfo()
```
This will automatically print information such as the version of R used and the versions of all the packages installed at the time of document creation.
Additional html for exercise can be found [here](https://github.com/rpornmon/2024-08-21-upgg-bootcamp/blob/gh-pages/files/UPGG-Bootcamp.html).
In order to change the location of the reference list, you can add the following line to 'force' the list to a specific location:
```R=
<div id='refs'></div>
```
## Day 2 - ☕️ Morning ☕️
### Data Exploration and Cleanup
#### Instructors: RP Pornmongkolsuk and Natalie Dzikowski
Learning objectives:
- Download and inspect data using R
- Learn how to use Tidyverse functions
- Explore data in a scientifically motivated way
- Organize and manipulate data in preparation for summary and visualization
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
### Data Exploration and Clean-Up
The first half of this morning, we'll go through the package called `Tidyverse` which you all should have installed as part of the lesson yesterday. Red sticky if you don't have `Tidyverse` installed.
### Data structure
### Refresher on Data Types
We will do a quick refresher on data types in R, which Krista touched on a little bit yesterday. These are basically categories of values that computers can read and interpret in different ways. If you know other programming languages, these categories tend to overlap, though they might have different names.
#### 1. Numeric
Basically, numbers that we know very well. You can perform mathematical operations on them. The types of numeric values you are going to encounter the most are:
1a. `Double` - represents real numbers (whole numbers and decimals). It is the default type for whole numbers.
1b. `Integer` - represents whole numbers only.
Whole numbers default to `double`; you have to tell R explicitly that a value is an `integer`.
You are unlikely to come across a case where `double` vs. `integer` makes a significant difference to your code or its performance, but it's good to be aware of the different numeric types. Another key point: arithmetic operations will automatically coerce `integer` to `double` if need be.
```R=
x <- as.integer(10)
typeof(x) # Will be integer
typeof(x + 10.0) # Will now be coerced to double
```
### Basic Operations
```R=
x <- 10 # Example values so the operations below can run
y <- 3
x + y
x * y
x / y # Division
x ** y # Exponent
x^y # Also exponent
x %% y # Modulo (remainder)
```
**Valid variable names** There are some names you ***can't*** use as variable names, and there are names that you ***shouldn't*** use.
1. Start the name with letters
```R=
a.1 <- 5 # Will work
1.a <- 5 # Will not work
_a <- 5 # Will not work
```
2. Don't use weird symbols, use `.` or `_`.
Examples of style guides:
- [Google's](https://google.github.io/styleguide/Rguide.xml)
- [Jean Fan's](http://jef.works/R-style-guide/)
- [Tidyverse's](http://style.tidyverse.org/)
```R=
myHeightInJanuary <- 7 # camelCase = first word lowercase, later words capitalized
my_head_size <- 10 # snake_case = underscores between words, all lowercase
```
3. **DO NOT** use [reserved words](https://stat.ethz.ch/R-manual/R-devel/library/base/html/Reserved.html)
4. Avoid naming variables the same as function names
#### 2. Logical
Sometimes these values are called `boolean` in other languages. It is common enough in programming that you should know about! The only possible logical values are `TRUE` and `FALSE` -- case sensitive!!!
```{r}
!TRUE # Not TRUE
TRUE & FALSE # TRUE and FALSE
TRUE | FALSE # TRUE or FALSE
TRUE | TRUE
xor(TRUE, FALSE) # either x or y, but not both
xor(TRUE, TRUE)
```
```R=
x > y
x < y
x >= y
x <= y
x == y # Need double equal sign to check equality
x != y # Negation - is x NOT equal to y
!(x > y) # Negation again - is x NOT greater than y
```
> **Problem with numeric data in logical equivalence**
Without going into the weeds: an irrational number like `sqrt(10)` has infinitely many decimal digits, so it is impossible for R to represent them all -- that would take an infinite amount of storage. R stores such values with limited `precision`, which makes the term `sqrt(10)^2` come out slightly off from 10.
```R=
sqrt(10)^2 == 10 # Mathematically TRUE, but returns FALSE due to limited precision
```
To circumvent this limitation, you can test for "near equality" in R using a function `all.equal()`.
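For example (a quick sketch; note that `all.equal()` returns a string describing the difference when the values are *not* nearly equal, so wrap it in `isTRUE()` for a strict logical result):

```R=
all.equal(sqrt(10)^2, 10)            # TRUE -- equal within numerical tolerance
isTRUE(all.equal(sqrt(10)^2, 1e6))   # FALSE -- genuinely different values
```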
#### 3. Character strings
Characters are created by using the `"` (double quote) or `'` (single quote) enclosing them.
`nchar()` gives you the number of characters in the variable.
Simple strings operations - joining strings using `paste()` and `paste0()` and splitting strings using `strsplit()`.
* `paste()` automatically includes a **separator** (default being a space)
* `paste0()` doesn't include a separator
```R=
paste("Hello", "World") # returns "Hello World"
paste("Hello", "World", sep = "_") # returns "Hello_World"
paste("Hello", "World", sep = "") # returns "HelloWorld"
paste0("Hello", "World") # returns "HelloWorld"
strsplit("Hello World", split = " ") # returns "Hello" "World"
```
**Order of evaluation** Be careful if you change the type or value of a variable -- don't short-circuit your code. Options: 1. save the result as a new variable, or 2. always run your code sequentially, top to bottom.
### Vectors
A "vector" is a list of values that, importantly, are of the same type. This is the most common data type in R.
To index from vectors and many other structures, you can use a colon `:`
**Important!!** R is a 1-based indexing language! This is different than languages such as python, which are 0-based. This means that to access the first item in a vector in R, you want to use the number 1.
```{r}
v1 <- c() # Empty vector
v2 <- c(1) # Vector of length 1 and of type numeric
v3 <- c("Hello", "World") # Vector of length 2 and of type character
v4 <- c(1, "hi", TRUE, 1+2) # R coerces all elements to one type that can represent them all -- in this case, character, designated by " marks.
print(v4)
```
```{r}
# prints the type of elements contained within vector
typeof(v4)
# Prints the length of the vector
length(v4)
```
Concatenate vectors
```{r}
v5 <- c(v3, v4)
print(v5)
```
Vector arithmetic
```{r}
x <- c(1, 2, 3, 4)
x * 2
```
Vector recycling (can skip this)
```{r}
y <- c(5, 6, 7)
x + y
```
Common statistical functions
```{r}
sum(x)
min(x)
max(x)
mean(x)
median(x)
sd(x)
```
```{r}
summary(x)
```
**Vector Indexing**
```{r}
x
x[1] # obtain the first element
x[-1] # obtain all but first element
x[-2] # all but second element
x[3:4] # 3rd-4th elements
x[4:1] # reverse order
rev(x)
x[1:length(x)-1] # all but the last element, for a vector of any length (x[-length(x)] is clearer)
x[c(TRUE,FALSE,TRUE,TRUE)] # exclude second element
```
Use indexing to manipulate content of the vector.
```{r}
y <- c(5, 6, 7)
y
y[2] <- 1999
y
```
Get indexes using `which()`
```{r}
x >= 2
which(x >= 2)
```
##### Exercise
Use what you learned about vectors to return a vector containing the elements of `a` that are **greater** than 30.
```{r}
a <- runif(20, 1, 100)
```
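One possible solution, using the logical indexing covered above (other approaches, e.g. via `which()`, work too):

```{r}
a[a > 30]         # logical indexing keeps elements where the condition is TRUE
a[which(a > 30)]  # equivalent, using which() to get the indices first
```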
### List
A list is a more flexible data type than a vector: it can contain a mixture of elements of various data types. We'll breeze through this really quickly.
Creating a list
```{r}
my_list <- list("Hello", 1, TRUE) # Mixed types of element
print(my_list)
```
Size of list
```{r}
length(my_list)
```
Indexing a list using `[[]]`
```{r}
my_list[[1]]
```
Appending to a list -- a list can even contain a vector. (Note: assigning to `[[5]]` below skips index 4, which becomes `NULL`.)
```{r}
my_list[[5]] <- 1:10
my_list
```
Indexing
```{r}
my_list[[5]][7]
```
A list within a list
```{r}
my_list[[4]] <- list("World")
my_list
```
```{r}
my_list[[4]][[1]]
```
Combining lists
```{r}
c(list(1, 2), list(x=3, y=4))
```
### Data Frame
This is arguably the most useful feature of R in data analyses. There's a reason we use spreadsheets: they portray multi-dimensional data and the relationships between data points across columns -- i.e., putting variables from the same observation next to each other.
Think of a data frame as a stack of vectors (one per column). To make sense of the data, we want to add column names that designate what the values mean.
Creating a data frame
```{r}
df <- data.frame(1:5,
6:10)
df
```
```{r}
names(df)
names(df) <- c("first.column", "second.column")
names(df) # Note how codes are evaluated line-by-line sequentially?
```
Or you can add names when you create data frame.
```{r}
df2 <- data.frame(height = 100:105,
weight = 120:125)
df2
```
You can think of a data frame as a specialized type of `list` (with constraints), where each vector is an element of the list. BUT all columns **must** be the same length.
```{r, error=TRUE}
df3 <- data.frame(height = 100:105,
weight = 120:130)
```
Data frame is a **specialized** list! It has many similarities with lists.
```{r}
typeof(df)
class(df)
```
Hence, we can use list-related functions.
```{r}
length(df)
df[[2]]
```
Properties of data frame
```{r}
dim(df)
ncol(df)
nrow(df)
```
Indexing, extracting information from data frame
```{r}
df[[1]] # Same as list
```
```{r}
df$first.column # Most common way to do it
```
R data frames are indexed row-first: the first index refers to the row number, the second to the column number.
```{r}
df
```
```{r}
df[1,2] # row 1, column 2
```
Or, you can call by column name, and index like a normal vector.
```{r}
df$second.column[1]
```
Adding columns
```{r}
df$third <- 21:25
fourth <- runif(5, 10, 100)
df$fourth <- fourth
df
```
R has some built-in data frames that you can play with.
```{r}
mtcars
```
```{r}
iris
```
### Tidyverse (\~30 min)
What is [Tidyverse](https://www.tidyverse.org/)? Tidyverse is a collection of R packages designed and developed for data science. R is not a new language, and Tidyverse's philosophy is about reducing redundancy and making coding style cleaner and more intuitive.
First, even though we have all installed the Tidyverse package, it has not yet been loaded into your environment. You won't have access to Tidyverse's functions unless you run `library(tidyverse)`.
Tidyverse is a literal gold mine, and there's no way we're going to cover everything in just a few hours! We're going to go through a few functions that are particularly useful and most commonly
used.
### Tibble
A `tibble` is Tidyverse's attempt to improve on the `data frame`. In general, they are about equivalent in function. For this lesson we'll use `tibble`, because Tidyverse's `readr` functions output a `tibble` by default. Don't worry about the differences -- they are, for the most part, interchangeable.
You can convert a preexisting data frame into a tibble.
```{r}
my_tib <- as_tibble(mtcars) # Converting a data frame to tibble
my_tib # looks very similar to data frame
```
Another way to create a tibble is from scratch: the key components are column names and values. Each column can only contain one data type.
```{r}
my_tib2 <- tibble(name = c("Josh", "Peter"),
income = c(100000000, 20))
my_tib2
```
Conventionally, for data frames and tibbles, each row contains observations, each column contains variables, and each cell contains values.
```{r}
my_tib
```
Subsetting, same as data frame
```{r}
tb <- tibble(
x = runif(5),
y = rnorm(5))
# Extract by name 1
tb$x
# Extract by name 2
tb[["x"]]
# Extract by column index 1
tb[[1]]
```
```{r}
# If have time, example of differences between data frame vs tibble
mtcars[,1]
as_tibble(mtcars)[,1]
```
```{r}
class(tb)
```
### Dplyr - Data Manipulation
Tidyverse is a meta-package, meaning it includes multiple packages within it. One of them is `dplyr`, which contains functions for common data manipulation tasks. Note that these functions generally also work on data frames, not just tibbles.
#### Select
Select columns you want from a tibble.
> Note how you can refer to the column name directly without having to use "" or put it adjacent to
> variable name (i.e. table1\$year)? That's the power of Tidyverse!
```{r}
select(table1, year)
```
Select multiple columns
```{r}
select(table1, country, year)
```
Use `-` to exclude columns.
```{r}
select(table1, -year)
```
#### Filter
Compared to base R, tidyverse tries to reduce the redundancy of repeating a variable's name.
In base R, if you want to look for observations from year 2000:
```{r}
table1[table1$year == 2000,]
```
See how `table1` has to be called multiple times? That's redundancy!! Using `filter` reduces redundancy in your code. It filters for observations that match the logical condition you pass in.
```{r}
filter(table1, year == 2000) # Using dplyr is much cleaner
```
Nesting different functions becomes wordy and unintuitive to write and read. Tidyverse tries to mitigate this problem by introducing its own grammar structure (a version of which, the native pipe `|>`, is now in base R as well) that helps organize the code the way your brain would normally work through it. Sequence matters.
```{r}
select(filter(table1, year == 2000), country) # Nesting filter within select is unintuitive because filter is interpreted first, then select
```
Instead, you can use the **pipe** operator, `%>%`. This is even cleaner than before: you only have to type the name `table1` once. The pipe takes the output of the expression before it and uses it as the **first argument** of the function after it. That's why most `dplyr` functions take a tibble as their first argument - so you can keep sending the tibble down the pipe chain. The pipe is not limited to Tidyverse functions.
```{r}
# Now, the sequence of steps becomes more intuitive; filter and then select!
table1 %>%
filter(year == 2000) %>%
select(country)
```
#### Mutate
Mutate is a very powerful function. It manipulates columns and can add new columns as well. You can refer directly to other columns within `mutate()`.
```{r}
table1 %>%
mutate(rate = cases / population * 10000) # This doesn't happen in-place. Don't forget to assign it to the same variable or new variable.
```
```{r}
table1
```
```{r}
new_table1 <- table1 %>%
mutate(rate = cases / population * 10000)
new_table1
```
A note on naming columns: you generally don't want to include spaces. Sometimes you have to fix the column names of data files you download from others, especially if they were made in Excel, which generally has no problem with spaces between words.
```{r, error=TRUE}
my_tib$miles per gallon
my_tib %>%
filter(miles per gallon > 10)
```
```{r}
my_tib$`miles per gallon` <- 1:32
my_tib$`miles per gallon`
```
```{r}
my_tib %>%
  filter(`miles per gallon` > 10)
# space has meaning in programming!!! Backticks let you refer to a
# non-syntactic name, but it's better to avoid spaces entirely:
my_tib$miles_per_gallon <- 1:32 # much better
```
#### Quick lesson on conditionals
You can guide R to perform tasks that depend on a certain condition using `if` statement.
```{r}
if (10 > 5) {
print("10 is greater than 5")
}
```
You can create a dichotomy of tasks for R using `if` and `else` statements.
```{r}
plane_speed <- 1000
car_speed <- 40
if (plane_speed > car_speed) {
  print("Planes are faster than cars")
} else {
  print("Planes are NOT faster than cars")
}
```
There is a nifty function called `if_else()` that you can use within `mutate()` to create a new column!
```{r}
table1 %>%
mutate(size = if_else(population > 1e8, "big", "small"))
```
### Arrange
`arrange()` helps sort the rows of a tibble based on a column.
```{r}
table1 %>%
mutate(rate = cases / population * 10000, .after = year) %>%
arrange(desc(rate)) # sort rate in descending order
```
### Summarize()
`summarize()` gives you summary statistics for whatever you want!
```{r}
table1 %>%
summarize(max = max(cases), mean = mean(cases))
```
### Group_by()
```{r}
table1 %>%
group_by(year) %>%
count()
```
`summarize()` is particularly useful in combination with `group_by()`.
It will provide summary statistics for each group you have defined.
Note: you have to do the `group_by` step before the summary step!
```{r}
table1 %>%
group_by(year) %>%
summarize(max = max(cases), mean = mean(cases))
```
### Tidyr
### Reshape/Pivot data
Yes, data might not be tidy when you obtain it from other labs or download it from a database. Sometimes, however, data **is** tidy, but it is organized in a way that facilitates other analyses. You might want to reshape the data to fit your specific analysis. This is where reshaping a tibble comes in - you can spread one variable across multiple columns, or gather one observation that is scattered across multiple rows.
You can use functions from `tidyr` to deal with these cases! `pivot_longer()` and `pivot_wider()`.
1999 and 2000 are recordings of the same variable (cases) from different years. It is intuitive to place them in columns next to each other if we want to compare by eye, but for plotting you might want to merge them into one column called "cases".
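As a sketch (assuming `library(tidyverse)` is loaded), tidyr's built-in `table4a` stores cases in separate `1999` and `2000` columns, and `pivot_longer()` gathers them into one `cases` column:
```{r}
table4a %>%
  pivot_longer(cols = c(`1999`, `2000`),
               names_to = "year",
               values_to = "cases")
```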
### Pivot Wider
On the opposite end, you might want to spread cases into two columns, one for each year.
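A minimal sketch using `table1`, which we've been working with: keep only the cases, then spread them so each year gets its own column.
```{r}
table1 %>%
  select(country, year, cases) %>%
  pivot_wider(names_from = year, values_from = cases)
```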
### Readr
`readr` contains functions involved in reading different types of data and import into R environment.
Base R has its own read functions as well! For example, `read.table()` is a very versatile read function for tabular data. We are going to read the `dummy.csv` file from
`https://github.com/rpornmon/2024upggBootcamp-dataExploreInR/raw/main/raw_data/dummy.csv`.
`readr` introduces similar read functions. They generally perform faster and have a more consistent naming scheme. Feel free to get used to using `readr` functions to read files - just make sure you have the package loaded before you do so.
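For instance, a sketch of reading the file straight from the URL with `readr`'s `read_csv()` (assuming tidyverse is loaded):
```{r}
dummy <- read_csv("https://github.com/rpornmon/2024upggBootcamp-dataExploreInR/raw/main/raw_data/dummy.csv")
dummy
```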
We'll use this dummy data to practice tools that we learned today!
### Exercise (\~20 min)
1. Load the `dummy.csv` file from
`https://github.com/rpornmon/2024upggBootcamp-dataExploreInR/raw/main/raw_data/dummy.csv`, if you haven't done so already. Assign it to a variable called `dummy`. **Hint:** check out the `readr` cheatsheet for functions for reading files.
2. Rename the column names that have spaces, make them more readable. Assign to a new variable.
**Hint:** check out `dplyr` cheatsheet for functions for renaming columns.
3. Summarize number of manga each author has published from this tibble. Sort the tibble based on the number of manga in descending order. **Hint:** check out `dplyr` cheatsheet for related functions.
4. Change type of column `last published` into `double`. Assign to a new variable. **Hint:** look up `parse_double()` from `readr`, might not be on cheatsheet.
5. Create a new column containing boolean values denoting whether or not the manga is still ongoing -- call it `ongoing`. Assign to a new variable. **Hint:** check out `dplyr` cheatsheet for related functions.
6. Pivot tibble longer, combining `first published` and `last published` into one column called `year` and another column called `first_or_last`. **Hint:** check out `tidyr` cheatsheet for reshaping functions.
----------------------------------------------------------------------------------------------------
### BREAK
----------------------------------------------------------------------------------------------------
### Blank et al., 2017 Transcriptomics Dataset
As a UPGG student, you might end up dealing with some type of "-omics" dataset (epigenomics, transcriptomics, proteomics, maybe multiomics!). Today, we'll be exploring and cleaning up a transcriptomics dataset from the following paper:
[*Translational control of lipogenic enzymes in the cell cycle of synchronous, growing yeast
cells*](https://www.embopress.org/doi/full/10.15252/embj.201695050)
### What was the goal of the study?
The authors were searching for proteins that are under periodic translational control over the course of the cell cycle in yeast, using the size of the cell as a marker for cell cycle stage.
- Are there any proteins whose levels change depending on the stage of the cell cycle (a.k.a. the size of the cell)?
- Is the change in protein level due to transcriptional control (at the mRNA level), or
translational control (at the protein level)?
### Which datasets will we be looking at?
Dataset 1: **A dataframe** of mRNA levels of over 6000 transcripts analyzed in THIS study. Specifically, the authors took the normalized read counts of each mRNA transcript, at each different cell size, found the mean read count for each gene across all of the cell sizes, and expressed the mRNA levels of each gene as a ratio of the level at each cell size over the mean. These ratios were then log2-transformed.
Dataset 2: **A vector** of 144 transcripts whose levels were found to fluctuate over the course of the cell cycle, both in this study and in *Spellman et al., 1998*. These are the genes under "periodic transcriptional control".
In addition to these, we will need help from 2 gene annotation files to understand the data.
### Who cares?
mRNA sequencing has been fundamental to genomics research. It's very possible you might generate data like this in your own research. You may be analyzing changing mRNA levels in the cell due to a normal cellular process, or due to some genetic, epigenetic, or chemical perturbation you did in an experiment.
What if you never do mRNA sequencing? Still, at some point, you may have to deal with data formatted in a really similar way (protein levels, fluorescence on DNA microarrays, etc.).
### Load data
Let's load the relevant datasets that we downloaded yesterday.
Are they in your /Downloads? If yes, no worries.
To unzip the .gz file, use the `gunzip` command in the terminal.
```bash=
gunzip GSE81932_Dataset01.txt.gz
```
### Exercise: Using the UNIX commands you learned yesterday, let's practice moving our downloaded data to the ./raw_data directory. Unzip any zipped files.
Now let's read into the files:
```{r load data}
library(tidyverse)
mRNA_file <- "../raw_data/GSE81932_Dataset01.txt"
mRNA_data <- read_lines(mRNA_file)
periodically_expressed_genes_file <- "../raw_data/GSE81932_Dataset02.txt"
periodically_expressed_genes <- read_lines(periodically_expressed_genes_file)
ribi_annotation_file <- "../raw_data/ribosome_biogenesis_annotations.txt"
scer_names_estimates_file <- "../raw_data/doi_10_5061_dryad_d644f__v20160422/scer-mrna-protein-absolute-estimate.txt"
```
### Explore the data
Try "printing" the values contained in the variable `periodically_expressed_genes`.
```{r}
periodically_expressed_genes
print(periodically_expressed_genes)
```
`periodically_expressed_genes` is a "vector" data type. It is similar to an "array" or a "list" in some other programming languages, BUT R has its own version of "list." Essentially, R has both "vector" and "list" data types, and they function differently. "Vectors" are what you would think of when you think of Python's "array" or "list"
```{r}
str(periodically_expressed_genes)
```
`periodically_expressed_genes` is really clean data. What about `mRNA_data`?
```{r}
mRNA_data
```
This one has a header. Luckily, if we turn the file into a tibble, that header will be recognized immediately.
`read_tsv()` is for reading in tab-separated values
```{r}
mRNA <- read_tsv(mRNA_file)
```
Remember, the output of many tidyverse functions, including `read_tsv()`, will automatically be a tibble!
Now let's peek inside the ribosome biogenesis annotations:
The option `n_max = 10` will let us just look at the first 10 lines of the file, so we don't clog up our output
```{r}
read_lines(ribi_annotation_file, n_max=10)
```
Here, we can see that comments are marked with a "!"
### Exercise: How can we make a tibble that skips these?
*Solution:*
```{r}
ribi_annotation <- read_tsv(ribi_annotation_file, comment = "!")
ribi_annotation
```
What about for the Dryad annotations?
```{r}
read_lines(scer_names_estimates_file, n_max = 10)
```
Skip comments:
```{r}
scer_names_estimates <- read_tsv(scer_names_estimates_file, comment = "#")
scer_names_estimates
```
### What can we learn about the data?
One of the first things you might be curious about is how many transcripts were analyzed in total, and how many genes ended up making it on the "periodically expressed" list.
How can we get these values?
Here's a really easy, built-in way to do so:
`length()` is designed for vectors and counts the number of elements - here, one element per line of the file.
```{r}
length(mRNA_data)
```
There are 6714 lines, so without the header, there are 6713 transcripts.
```{r}
length(periodically_expressed_genes)
```
And 144 genes were periodically expressed.
But we're using tidyverse, so we don't even need `length()`. Just peek at the tibble, the information is right there!
```{r}
mRNA
```
We just got a decent overview of the data, but can we always be sure that each line corresponds to a unique gene? What if there are duplicates?
### Exercise: Check whether there are duplicate entries in `mRNA` and in `periodically_expressed_genes`.
*Hint*: The keywords "distinct" and "unique" will be helpful - try searching those words in the help window on the lower right!
*Solution:*
```{r}
n_distinct(mRNA$ORF)
n_distinct(periodically_expressed_genes)
```
### Let's explore our ribi annotation file
Which parts of this dataframe are we interested in?
```{r}
ribi_annotation
```
The datasets in this study identified transcripts by their gene IDs, specific only to *S.
cerevisiae* - these can be found in the column 'Systematic Name/Complex Accession'. The
'Gene/Complex' column provides a more common name for that gene.
Additionally, we have the column "Gene Ontology Term" for each gene, associating it with a
biological process. The file we downloaded was called `ribosome_biogenesis_annotations.txt`, so we'd
probably assume that these are only genes involved in ribosome biogenesis.
Still, let's double check that to be sure!
### Exercise: Inspect `ribi_annotation` and check whether the genes are, for sure, only involved in ribosomal biogenesis.
*Solution:*
```{r}
n_distinct(ribi_annotation$Qualifier)
distinct(ribi_annotation, Qualifier)
```
```{r}
n_distinct(ribi_annotation$`Gene Ontology Term`)
distinct(ribi_annotation, `Gene Ontology Term`)
```
So, clearly, we only have genes involved in ribosome biogenesis.
The most useful part of this annotation file will be converting the yeast gene IDs into more common names for downstream analyses.
### Discussion:
Why do you think we provided 1 argument for the `n_distinct()` function, but 2 arguments for
`distinct()` ?
### Cleaning up the annotation file
### Exercise: Take only the columns we want from the dataframe: the gene names and the systematic IDs.
*Solution:*
```{r}
ribi_annotation_names <- select(ribi_annotation, Gene = "Gene/Complex", SystematicName = "Systematic Name/Complex Accession")
ribi_annotation_names
```
Next, we should check if there are any duplicate gene entries, or if each line is unique.
### Exercise: Check the number of unique gene name/systematic name combinations. Are there duplicates? If yes, create a new dataframe with duplicates removed, and double-check that this was done correctly.
*Solution:*
```{r}
n_distinct(ribi_annotation_names)
```
In order to only get the distinct values, we can use the `distinct()` function. This will print the **values** that are distinct, which is different from `n_distinct` which only prints the **number** of distinct values.
```{r}
ribi_genes <- distinct(ribi_annotation_names)
ribi_genes
n_distinct(ribi_genes)
```
Looks like our whole dataframe is made up of distinct entries now!
### Which ribosome biogenesis genes are periodically expressed during the cell cycle?
We want to filter `ribi_genes` to only include the ones that appear in
`periodically_expressed_genes`, and make this a new dataframe.
Remember, `select` is used to get columns of interest, and `filter` is used to get rows of interest.
```{r}
ribi_genes_periodic <- filter(ribi_genes, SystematicName %in% periodically_expressed_genes)
ribi_genes_periodic
```
We can now see that 34/144 periodically expressed genes are involved in ribosome biogenesis (or that
34/187 ribosome biogenesis genes are periodically expressed during the cell cycle).
**Note:** There are often many ways to arrive at the same solution, even for the same function!
In this case, would you want to do option 2 from above? Why or why not?
### Did our favorite gene make the cut?
We can also check if our favorite gene is on this list! It would be easy enough to do manually, but
let's pretend our dataframe is still really long...
```{r}
"NUG1" %in% ribi_genes_periodic$Gene
```
It's there! Let's print the line to see what its Systematic Name is.
```{r}
filter(ribi_genes_periodic, Gene == "NUG1")
```
### Exercise: Check whether the gene "RPS6B" is a periodically expressed gene, and try printing its line to see its systematic name.
*Solution:*
```{r}
"RPS6B" %in% ribi_genes_periodic$Gene
```
```{r}
filter(ribi_genes_periodic, Gene == "RPS6B")
```
Oops! That one's not there.
### What about genes with other functions?
Up until now, we were looking only at ribosome biogenesis genes, since that was a major GO term that
came up in the data. What about the genes involved in other biological processes?
Let's read the Dryad file!
The columns `"orf"` and `"gene"` will be useful to us for now.
Let's create a new dataframe with only these columns.
```{r}
scer_gene_names <- select(scer_names_estimates, Gene = gene, SystematicName = orf)
scer_gene_names
```
### Challenge (Skipped):
1. Get the names of genes from the Dryad annotations that are periodically expressed.
2. Extract only the common gene names from the output of 1.
3. Is the gene NOP56 on this list?
```{r}
filter(scer_gene_names, SystematicName %in% periodically_expressed_genes)
filter(scer_gene_names, SystematicName %in% periodically_expressed_genes) %>% select(Gene)
filter(scer_gene_names, SystematicName %in% periodically_expressed_genes, Gene == "NOP56")
filter(scer_gene_names, Gene == "NOP56")
"YLR197W" %in% periodically_expressed_genes
```
### Exploring more of the Dryad data
This dataset actually has a lot more interesting information, like mRNA and protein levels for each
gene. To explore these columns, we can use the `arrange()` function. This will allow us to sort the
dataframe according to a variable we're interested in.
```{r}
arrange(scer_names_estimates, gene)
```
The default is ascending order.
```{r}
arrange(scer_names_estimates, gene) %>% tail()
```
What would this do?
```{r}
arrange(scer_names_estimates, desc(gene))
```
What about this?
```{r}
arrange(scer_names_estimates, desc(gene), mrna)
```
Wait, can't we just use `sort()` instead of `arrange()`?
```{r}
arrange(scer_names_estimates, desc(gene), mrna) %>% tail()
```
With `arrange()`, NA values automatically get put at the end, regardless of whether you're arranging in
ascending or descending order. This makes NA removal and data exploration more convenient.
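To answer the `sort()` question: base `sort()` works on a single vector and drops NAs by default, while `arrange()` keeps whole rows together. A quick sketch:
```{r}
sort(scer_names_estimates$gene) %>% tail()      # just a vector; NAs silently dropped
arrange(scer_names_estimates, gene) %>% tail()  # whole rows; NAs kept (at the end)
```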
### Let's put the two biggest datasets together
What if our favorite gene is periodically expressed, but it was only identified in the 2017 study, but not the 1998 study? (This would mean it was never included in `periodically_expressed_genes`).
Well, we have access to all of the mRNA data (\>6000 transcripts), and a comprehensive annotation set from Dryad. Let's put these two together to see what's happening to ALL of the genes, so then we can check what's happening to our favorite genes.
There are a bunch of ways to join datasets together on common keys. We can try out a bunch of these `join` methods and see what we get.
A "key" that both datasets have in common is the "systematic name". However, this list of gene IDs is labeled "SystematicName" in the Dryad annotations, but labeled "ORF" in Dataset01 (mRNA). Let's change the name of one of the columns to match:
```{r}
names(mRNA)[1] <- "SystematicName"
mRNA
```
Let's perform a `left_join()` on the two datasets, following this format:
`joined_df <- left_join(x, y, by = "key")`
This will take all rows of dataframe 'x' and keep all of the columns in 'x', while merging in dataframe 'y' on the desired key and appending the columns from 'y', with values correctly matched to each key.
That might sound confusing at first, so let's just see what happens:
We need to provide a key to merge on, in this case "SystematicName". We will use the `by =` option within the `left_join` function.
```{r}
mRNA_named <- left_join(mRNA, scer_gene_names, by = "SystematicName")
mRNA_named
```
This will let us search (via their common names) whether our favorite genes are indeed periodically expressed. For example:
```{r}
filter(mRNA_named, Gene %in% c("ACT1", "NOP16", "NOP56"))
```
There are definitely some fluctuations in mRNA. Interesting! Maybe they just didn't make the cut according to the parameters/thresholds used by the authors; or, maybe they just weren't identified in both the 2017 and 1998 papers. Following the authors' methodology, your data is now set up for you to find the culprit (on your own time, if you're interested).
### Saving your work
You've done all this work - be sure to save it now!
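If you'd also like to write the joined tibble to disk, `readr`'s `write_csv()` is one option - this is just a sketch, and the output path here is an assumption:
```{r}
# Sketch: persist the joined tibble (destination path is an assumption)
write_csv(mRNA_named, "../processed_files/mRNA_named.csv")
```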
### Reshaping tibbles
Sometimes, we might want to reshape a tibble to make it better formatted for plotting/visualization later on. Dataset01 can be reshaped into a longer, and less wide, format:
```{r}
mRNA_named
pivot_longer(mRNA_named, cols = ends_with("fL"),
             names_to = "Vol_fL",
             values_to = "log2_ratio")
```
What do you notice changed about Dataset01?
We might run into issues because all of the cell volumes are formatted as "(number) fL" strings rather than numbers.
### Exercise: Look through your cheat sheets. What function(s) can we pipe to remove the " fL" from every value in the "Vol_fL" column?
*Solution:*
The function `str_remove` from the **stringr** package can be very helpful for this.
We also need to tell R that this new column is a number! We use the `parse_double()` function to make sure we have nice clean numbers.
```{r}
pivot_longer(mRNA_named, cols = ends_with("fL"),
             names_to = "Vol_fL",
             values_to = "log2_ratio") %>%
  mutate(Vol_fL = parse_double(str_remove(Vol_fL, " fL")))
```
This looks easier to work with now. Let's save it to a new tibble:
```{r}
mRNA_data_long <-
  pivot_longer(mRNA_named, cols = ends_with("fL"),
               names_to = "Vol_fL",
               values_to = "log2_ratio") %>%
  mutate(Vol_fL = parse_double(str_remove(Vol_fL, " fL")))
```
Now, we can easily pick 3 genes we're most interested in, and plot their fluctuating mRNA levels over the course of the cell cycle!
Let's save that information into a small dataframe, and then Kayla and Gabe will guide you through plotting.
```{r}
mRNA_data_3genes_working <- filter(mRNA_data_long, Gene %in% c("ACT1","NOP16","NOP56"))
```
```{r}
write_csv(mRNA_data_3genes_working, "../processed_files/mRNA_data_3genes_working.csv")
```
### Questions?
## Day 2 - ☀️ Afternoon ☀️
### Data Visualization
#### Instructors: Kayla Wilhoit & Gabriel Kennedy
### Starting Out
Let's use the penguins dataset to practice ggplot syntax!
This is a commonly used example dataset that contains morphological information about Antarctic penguins.
There are 3 different species of penguins in this dataset, collected from 3 islands in the Palmer Archipelago, Antarctica.
To use the dataset, we need to install the package `palmerpenguins`
```{r, message = FALSE, warning = FALSE}
install.packages("palmerpenguins")
library(palmerpenguins)
```
Since this is a prepackaged dataset, let's assign the data to our own named data frame so we can see it in our environment.
```{r}
data(package = 'palmerpenguins')
penguin_data <- penguins
head(penguin_data)
```
We have several columns to work with here: The penguin species, island of collection, bill measurements, flipper length, body mass, sex, and date.
You can read more about the variables by typing `?penguins`.
You will notice that some rows have NA values - this won't affect our practice plotting other than giving us some warnings, but this is why data cleanup is so important before every project!
Let's just quickly remove the rows with NA values. If you would like, try applying what you learned this morning to clean up this dataset on your own time.
For the simple plots we are making today, the current long-formatted data structure will work fine. If you are making more complicated and intensive plots in the future, you may need to reshape your data first.
We will be using ggplot to make these figures. ggplot is included in the package tidyverse, which we should already have installed.
```{r, message = FALSE, warning = FALSE}
library(tidyverse)
```
```{r}
tidy_data <- drop_na(penguin_data)
```
### Layer 1 - Data
Remember - ggplot works in layers!
We start by calling `ggplot()` as a function with parentheses.
The first layer is the data - this gives us an empty canvas to build onto. In this case, we specify our cleaned penguins dataset within the parentheses using `data = tidy_data`.
*Note: You can omit the `data =` part and just call a data frame like `ggplot(penguin_data)`, but it is always good to clearly specify what your code is doing!*
```{r}
ggplot(data = tidy_data)
```
The data loaded correctly, and now we have a blank canvas to work with.
### Layer 2 - Aesthetics = Mapping
In ggplot, 'aesthetics' refers to the mapping of columns of our dataset onto aesthetic attributes of our plot.
In this case, the attributes we want to specify are the x and y coordinates of the plot.
Use the `aes()` function to map the x coordinate to the `species` column in our penguins dataset.
```{r}
ggplot(data = tidy_data, mapping = aes(x = species))
```
Now we can see that our species have appeared on the x-axis of our canvas!
However, there still isn't any data shown. We need to specify a **geometry** that gives our data a shape to display
### Layer 3 - Geometries
Like an oil painting, we are building in layers on top of what we have already done.
In ggplot, we use the + operator after the `ggplot()` call to tell R that the following layer is part of the plot.
Let's start with a basic **bar plot** with height corresponding to the count of the data values, which in ggplot is called using the function `geom_bar()`.
We use the plus sign to add a new line that is automatically indented for us, and call the `geom_bar()` function. We don't need to add anything within the parentheses yet.
```{r}
ggplot(data = tidy_data, mapping = aes(x = species)) +
geom_bar()
```
You will notice that even though we didn't specify a y-axis variable, the `geom_bar` function automatically used the count of the values we specified in the x-axis call (number of occurrences within each species).
*In the weeds: This is because `geom_bar` is performing a statistical transformation on our data frame (counting the number of values). If you want to manually specify a value for the y-axis, you should use the `geom_col` function, which leaves the data as-is.*
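To make that contrast concrete, here's a minimal sketch of the `geom_col()` route: compute the counts yourself with dplyr's `count()`, then map them to y directly.
```{r}
tidy_data %>%
  count(species) %>%                # creates a column n of per-species counts
  ggplot(aes(x = species, y = n)) +
  geom_col()                        # plots n as-is; ggplot does no counting
```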
#### Two variables
Let's specify a variable to plot on the y-axis, in this case flipper length in millimeters:
You can also try using the other columns (`bill_depth_mm, bill_length_mm, body_mass_g`) for the following plots, and see what else you can learn!
```{r}
ggplot(data = tidy_data, mapping = aes(x = species, y = flipper_length_mm))
```
This allows us to use a bivariate visualization method like a **scatterplot**
In ggplot syntax, a scatterplot is created by calling the function `geom_point()`
```{r}
ggplot(data = tidy_data, mapping = aes(x = species, y = flipper_length_mm)) +
geom_point()
```
### Adding more - boxplots
This plot is already informative! We can qualitatively guess that some species might tend to have longer flippers than the others.
In order to help quantify this and make the plot more attractive, let's add a box-and-whisker plot **on top** of the points we've already plotted. That's the power of working in layers - we can add new elements without messing up our already-plotted data!
```{r}
ggplot(data = tidy_data, mapping = aes(x = species, y = flipper_length_mm)) +
geom_point() +
geom_boxplot()
```
### Making things look nicer - transparency
Woo! We have nice boxplots that clearly show how the medians of our groups differ.
However, this plot still looks kind of ugly. Let's add some optional arguments to make it nicer to look at and clearly display our data points behind the boxplots.
We can do this by first adjusting the opacity of the boxplots. Opacity in programming languages is often specified using the value `alpha`. Let's set our plot to half opacity by adding `alpha = 0.5` to our main statement, inside of the aesthetic statement:
```{r}
ggplot(data = tidy_data, mapping = aes(x = species, y = flipper_length_mm, alpha = 0.5)) +
geom_point() +
geom_boxplot()
```
We can now see the points behind our boxplots!
But something's going on - the opacity of both the points and the boxplots were affected, and we have this annoying 'alpha' thing in the legend.
This is because we specified `alpha` in the `aes()` statement within the first line (the first layer!) - which affects the entire plot.
If we want to only adjust one element of our plot at a time, we can **set** alpha to a specific value within one geom.
Let's remove our previous addition and add a new `alpha` value to our `geom_boxplot`, without calling `aes()`:
*In the weeds: Why the legend? Remember - aesthetics = mapping! Adding alpha within the aesthetic statement will work, but ggplot will assume the alpha is something we got from the data, and therefore adds it to the legend. By setting alpha to a specific value outside of aes(), we tell ggplot that alpha doesn't have anything to do with the data.*
```{r}
ggplot(data = tidy_data, mapping = aes(x = species, y = flipper_length_mm)) +
geom_point() +
geom_boxplot(alpha = 0.5)
```
Woohoo! Now our points are nice and dark behind our translucent boxplots.
However, they are all smushed up against each other in a straight line, and it can be hard to see how the distribution of the points works.
We can add some random horizontal spacing and noise to this by adding what is called **jitter**.
*Note: This is just random noise, and does not represent properties of the original data! Use jitter judiciously.*
Since it doesn't make sense to add jitter to a boxplot, let's just add it to our `geom_point` function.
ggplot includes two ways to do this: setting the `position` value like we set the `alpha` value by adding `position = "jitter"`, or replacing the `geom_point` function with a `geom_jitter` function.
```{r}
ggplot(data = tidy_data, mapping = aes(x = species, y = flipper_length_mm)) +
geom_point(position = "jitter") +
geom_boxplot(alpha = 0.5)
```
Let's also try replacing `geom_point` with `geom_jitter`:
```{r}
ggplot(data = tidy_data, mapping = aes(x = species, y = flipper_length_mm)) +
geom_jitter() +
geom_boxplot(alpha = 0.5)
```
You can see that this gives essentially the same result, although the random noise means the exact position of the points will be different.
*Note: The spread of the jittered points can look pretty chaotic! If you want to make a very pretty publication-worthy graph while avoiding overplotting, consider installing the package `ggbeeswarm` to make a nicely shaped spread of points.*
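As a sketch of that alternative (assuming the `ggbeeswarm` package is installed), `geom_quasirandom()` can stand in for `geom_jitter`:

```{r}
# Sketch: a beeswarm-style spread instead of pure random jitter
# (assumes the ggbeeswarm package is installed)
library(ggbeeswarm)

ggplot(data = tidy_data, mapping = aes(x = species, y = flipper_length_mm)) +
  geom_quasirandom() +
  geom_boxplot(alpha = 0.5)
```

The points now fan out in proportion to the local density, so the shape of the spread itself carries information about the distribution.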
### Three variables - Color
We have a lot of variables in our dataset, some of which may be correlated with each other. Let's see if there is any association between a penguin's flipper length and its sex by coloring our points!
Color is an aesthetic! Let's add a new `aes()` function with a `color` variable to our `geom_point` layer, and specify the `sex` column in our dataset:
```{r}
ggplot(data = tidy_data, mapping = aes(x = species, y = flipper_length_mm)) +
geom_point(aes(color = sex), position = "jitter") +
geom_boxplot(alpha = 0.5)
```
Wow! It looks qualitatively like male penguins might have slightly longer flippers than female penguins.
To look at this more closely, let's split our boxplots up by sex as well:
```{r}
ggplot(data = tidy_data, mapping = aes(x = species, y = flipper_length_mm)) +
geom_point(aes(color = sex), position = "jitter") +
geom_boxplot(aes(color = sex), alpha = 0.5)
```
Very cool - it looks like our groups are separated nicely!
We will get into more details on color and color palettes later, but for now let's keep making this plot better.
### Grouping and dodging
If you want to separate out the points so that they sit under their boxplots again instead of jittering them freely, we can use `position_dodge` in combination with `group`.
`position_dodge` adjusts the horizontal position of the data using a grouping variable. In this case, we will set our grouping variable to also be sex, and add our `position_dodge` statement to the geom with a width of 0.75.
*Note: position_dodge - like jitter and alpha - is not an aesthetic, so it goes outside of the `aes()` parentheses but within the geom's parentheses.*
*In the weeds: You will notice that a position_dodge with a width of 0.75 aligns suspiciously perfectly with our boxplots. This is because boxplots are actually already using position_dodge by default when you add another variable like color!*
```{r}
ggplot(data = tidy_data, mapping = aes(x = species, y = flipper_length_mm)) +
geom_jitter(aes(group = sex, color = sex), position = position_dodge(width = 0.75)) +
geom_boxplot(aes(color = sex), alpha = 0.5)
```
We have lost our jitter, but no fear - we can add it back!
To add it back while retaining the grouping, we can simultaneously dodge and jitter by using `position_jitterdodge` and changing `width` to `dodge.width`.
```{r}
ggplot(data = tidy_data, mapping = aes(x = species, y = flipper_length_mm)) +
geom_jitter(aes(group = sex, color = sex), position = position_jitterdodge(dodge.width = 0.75)) +
geom_boxplot(aes(color = sex), alpha = 0.5)
```
This plot is looking pretty great!
### Titles and labels
Let's add some necessary information to this plot to take it closer to being publication-worthy.
There are multiple ways to add labels and titles to a plot, but lets use a simple one - the `labs()` function. This function controls the labels for data, axes, and the plot itself. Each variable within labs can be passed a string specifying our title, but you can also automate titles using R variables!
Remember to add a plus sign when adding a new layer!
```{r}
ggplot(data = tidy_data, mapping = aes(x = species, y = flipper_length_mm)) +
geom_jitter(aes(group = sex, color = sex), position = position_jitterdodge(dodge.width = 0.75)) +
geom_boxplot(aes(color = sex), alpha = 0.5) +
labs(title = "Relationship between flipper length and sex by species")
```
We can also edit the axis labels to change the automatic names from the data into something nicer to look at:
```{r}
ggplot(data = tidy_data, mapping = aes(x = species, y = flipper_length_mm)) +
geom_jitter(aes(group = sex, color = sex), position = position_jitterdodge(dodge.width = 0.75)) +
geom_boxplot(aes(color = sex), alpha = 0.5) +
labs(title = "Relationship between flipper length and sex by species",
x = "Species name",
y = "Flipper length (mm)")
```
### Themes
If you want to get rid of that default grey background, you need to change the **theme**.
Themes in ggplot govern all the **elements** of the plot - the text font and size, the background color, axis ticks, gridlines - every little detail. They can get very complex and you can create your own custom themes, but a good alternative to the default is `theme_bw()`, which changes the background to white and adds a nice dark border around the plot.
*Note: You can also edit the legend, again in great detail, using the `theme()` and `guides()` functions!*
```{r}
ggplot(data = tidy_data, mapping = aes(x = species, y = flipper_length_mm)) +
geom_jitter(aes(group = sex, color = sex), position = position_jitterdodge(dodge.width = 0.75)) +
geom_boxplot(aes(color = sex), alpha = 0.5) +
labs(title = "Relationship between flipper length and sex by species",
x = "Species name",
y = "Flipper length (mm)") +
theme_bw()
```
Much better!
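As the note above mentioned, the legend can be tweaked too. Here is a minimal sketch using `theme()` to move it below the plot and hide its title:

```{r}
# Sketch: moving the legend to the bottom and hiding its title with theme()
ggplot(data = tidy_data, mapping = aes(x = species, y = flipper_length_mm)) +
  geom_jitter(aes(group = sex, color = sex), position = position_jitterdodge(dodge.width = 0.75)) +
  geom_boxplot(aes(color = sex), alpha = 0.5) +
  theme_bw() +
  theme(legend.position = "bottom", legend.title = element_blank())
```

Note that the second `theme()` call comes after `theme_bw()`; later theme layers override earlier ones, so order matters.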
### Color palettes
ggplot comes with a pretty nice set of default colors, but they can be a little repetitive. In addition, many of the default colors are unfortunately not colorblind-friendly.
There are many, many ways to color your plots, but the simplest option is to manually set your colors using the `scale_color_manual` function.
This function allows you to map colors directly to your data, which works well when you only have a few categories. Since the variable we are coloring by has two values, we can give it two colors. R also knows the names of most basic colors, so you don't even need to know any fancy hex or RGB codes!
```{r}
ggplot(data = tidy_data, mapping = aes(x = species, y = flipper_length_mm)) +
geom_jitter(aes(group = sex, color = sex), position = position_jitterdodge(dodge.width = 0.75)) +
geom_boxplot(aes(color = sex), alpha = 0.5) +
labs(title = "Relationship between flipper length and sex by species",
x = "Species name",
y = "Flipper length (mm)") +
scale_color_manual(values = c("pink","purple")) +
theme_bw()
```
In addition to manually setting colors, we can use pre-made color palettes. You can always copy the hex codes into `scale_color_manual`, but there are many palettes built into R packages!
A great place to start is <https://r-graph-gallery.com/color-palette-finder>, where you can preview a bunch of popular palettes from the package **paletteer**.
Pick one and import the package below!
```{r}
library("paletteer")
```
A personal favorite of mine for discrete data points is Darjeeling1 from the wesanderson package. This package has palettes based on multiple Wes Anderson movies, and we can pull its palettes through paletteer:
```{r}
my_colors <- paletteer::paletteer_d("wesanderson::Darjeeling1")
```
We can access the hex codes directly in the vector we just created, or we could use a `scale_color_*` function.
Scales are complicated, but in this case we are using a paletteer-specific scales function for discrete (not continuous) colors from the wesanderson package, and specifying the reverse direction.
```{r}
ggplot(data = tidy_data, mapping = aes(x = species, y = flipper_length_mm)) +
geom_jitter(aes(group = sex, color = sex), position = position_jitterdodge(dodge.width = 0.75)) +
geom_boxplot(aes(color = sex), alpha = 0.5) +
labs(title = "Relationship between flipper length and sex by species",
x = "Species name",
y = "Flipper length (mm)") +
scale_color_paletteer_d("wesanderson::Darjeeling1", direction = -1) +
theme_bw()
```
Amazing! We now have a very nice plot, suitable for most presentations.
*Note: Darjeeling1 and other palettes are pretty and nice for adding color to discrete, labeled data, but still don't have perfect variability between colors for colorblind-friendliness or greyscale. If your color is important and directly related to the data such as in a heatmap, options such as the **viridis** package have carefully selected palettes that are very useful for continuous datasets.*
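For reference, the other route mentioned earlier, pulling the hex codes straight out of `my_colors` and passing them to `scale_color_manual`, gives an equivalent result. A sketch:

```{r}
# Sketch: using the saved palette vector directly with scale_color_manual
# my_colors holds the Darjeeling1 hex codes created above; ggplot will use
# as many of them as there are groups
ggplot(data = tidy_data, mapping = aes(x = species, y = flipper_length_mm)) +
  geom_jitter(aes(group = sex, color = sex), position = position_jitterdodge(dodge.width = 0.75)) +
  geom_boxplot(aes(color = sex), alpha = 0.5) +
  scale_color_manual(values = as.character(my_colors)) +
  theme_bw()
```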
### Checking significance
From all of our plots, it looks like the Gentoo penguins have longer flippers than the Adelie or Chinstrap penguins. Is this statistically significant?
We can do a quick check by using a new package called **ggpubr.** This package has lots of small tools to help make your plots publication-ready, but the most commonly used is a function to quickly compare and plot the means of groups in a plot.
We can do this by adding another layer to our plot that this library supplies, `stat_compare_means()`. You can just call this by itself if you have a simple plot, but let's add information to make sure we use a t-test, show the significance as asterisks, and compare all three of our groups to each other.
```{r}
library("ggpubr")
```
```{r}
ggplot(data = tidy_data, mapping = aes(x = species, y = flipper_length_mm)) +
geom_jitter(aes(group = sex, color = sex), position = position_jitterdodge(dodge.width = 0.75)) +
geom_boxplot(aes(color = sex), alpha = 0.5) +
labs(title = "Relationship between flipper length and sex by species",
x = "Species name",
y = "Flipper length (mm)") +
theme_bw() +
stat_compare_means(method = "t.test", label = "p.signif", comparisons = list(c("Gentoo", "Chinstrap"), c("Adelie", "Chinstrap"), c("Adelie", "Gentoo")))
```
Amazing! Since we have so many data points, all of our comparisons are very significant, and even the two similar species are different from each other.
*Note: Be careful using these "pre-made" statistical tests! `stat_compare_means` is useful for quick visualizations like this, but it's always a good idea to independently perform statistical tests yourself outside of the ggplot framework to make sure your result is real, and not an artifact of a coding typo or plotting quirk.*
*Note 2: Another alternative to ggpubr is the package **ggsignif**, which works similarly but has a few different functionalities.*
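As a sketch of that kind of independent check, here is the Adelie vs. Gentoo comparison done with base R's `t.test()`, using the same columns as above:

```{r}
# Sketch: verifying one comparison outside of ggplot with base R
adelie_flippers <- tidy_data$flipper_length_mm[tidy_data$species == "Adelie"]
gentoo_flippers <- tidy_data$flipper_length_mm[tidy_data$species == "Gentoo"]
t.test(adelie_flippers, gentoo_flippers)
```

If the p-value printed here disagrees with the asterisks on the plot, that's a sign something in the plotting code needs a second look.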
### Faceting
It looks like male penguins have longer flippers than female penguins. Does this hold true within each species? Let's check significance again, but this time we will split up the species so we can check male vs. female trends independently for each species.
We can do this through **faceting.** Faceting is an incredibly powerful tool in ggplot! It essentially splits your plot into smaller subplots or "facets" based on some aspect of your data. In this case, we can split up our data based on the species column into three separate plots so we can more clearly show our statistical comparison.
We will use the function `facet_wrap()`, which takes a set of variables to facet by. We are only specifying one variable, so we give it `vars(species)`. In order for this to work, we need to change the x-axis mapping in the first line of our ggplot to `sex` instead of `species`, since we are now faceting by species instead.
```{r}
ggplot(data = tidy_data, mapping = aes(x = sex, y = flipper_length_mm)) +
geom_jitter(aes(group = sex, color = sex), position = position_jitterdodge(dodge.width = 0.75)) +
geom_boxplot(aes(color = sex), alpha = 0.5) +
facet_wrap(vars(species)) +
labs(title = "Relationship between flipper length and sex by species",
x = "Species name",
y = "Flipper length (mm)") +
theme_bw()
```
Very nice! We now have three windows or facets for our data. Let's now add our `stat_compare_means` line from before, but comparing female vs. male instead of species:
```{r}
ggplot(data = tidy_data, mapping = aes(x = sex, y = flipper_length_mm)) +
geom_jitter(aes(group = sex, color = sex), position = position_jitterdodge(dodge.width = 0.75)) +
geom_boxplot(aes(color = sex), alpha = 0.5) +
facet_wrap(~species) +
labs(title = "Relationship between flipper length and sex by species",
x = "Species name",
y = "Flipper length (mm)") +
theme_bw() +
stat_compare_means(method = "t.test", label = "p.signif", comparisons = list(c("female", "male")))
```
Great! We are now doing separate comparisons for each species, and they all happen to be significant!
*Note: Why "wrap"? All the facet functions split your data into facets, but they display it differently. `facet_wrap` will wrap your data into more rows if needed to fit your screen, or stretch it out if you have plenty of space. If you want a specific layout, `facet_grid` lets you specify how many rows and columns you want your plots to fill.*
*In the weeds: You may be familiar with a different way to specify facets using a tilde (\~). Facets can be given using a formula, and the most basic formula with only one variable would be `~species`. Facets can also be given a character vector, like `c("species")`. These different methods can be useful for more complex faceting arrangements, so look further into them if your data gets complex.*
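As a quick sketch of the `facet_grid` alternative mentioned in the note, here the species panels are stacked as explicit rows:

```{r}
# Sketch: facet_grid pins panels to an explicit grid of rows and columns;
# here we only specify rows, so the three species stack vertically
ggplot(data = tidy_data, mapping = aes(x = sex, y = flipper_length_mm)) +
  geom_boxplot(aes(color = sex)) +
  facet_grid(rows = vars(species)) +
  theme_bw()
```

Swapping `rows` for `cols` (or giving both) changes the layout without touching any other part of the plot.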
### Patchwork - Putting plots together
So far, we have only been looking at one morphological measurement from our dataset - flipper length. Do the patterns we have seen so far hold true for all the other measurements as well?
Because of how modular and flexible ggplot is, it's extremely easy to switch the data out between plots. All we need to do is change the `flipper_length_mm` variable to something else in the first line of our plot.
We could just make four new code blocks and four new plots, but it might be nice to look at them all together in one plot. We can use the package **patchwork** to help do this!
**patchwork** is a library that can quickly and easily help you display multiple plots and format the layout. It's another extremely powerful tool with lots of functions, but it also has a very simple way to start!
```{r}
library("patchwork")
```
To use patchwork, we need to save multiple plots in memory instead of displaying them right away. We can do this by assigning our whole ggplot statement to a variable, just like you would a data frame or vector.
Let's plot every numerical column in our data and save the plots into variables, and let's remove the titles and labels for now to make the result a little less crowded:
```{r}
p_f <- ggplot(data = tidy_data, mapping = aes(x = species, y = flipper_length_mm)) +
geom_jitter(aes(group = sex, color = sex), position = position_jitterdodge(dodge.width = 0.75)) +
geom_boxplot(aes(color = sex), alpha = 0.5) +
theme_bw() +
stat_compare_means(method = "t.test", label = "p.signif", comparisons = list(c("Gentoo", "Chinstrap"), c("Adelie", "Chinstrap"), c("Adelie", "Gentoo")))
p_b1 <- ggplot(data = tidy_data, mapping = aes(x = species, y = bill_length_mm)) +
geom_jitter(aes(group = sex, color = sex), position = position_jitterdodge(dodge.width = 0.75)) +
geom_boxplot(aes(color = sex), alpha = 0.5) +
theme_bw() +
stat_compare_means(method = "t.test", label = "p.signif", comparisons = list(c("Gentoo", "Chinstrap"), c("Adelie", "Chinstrap"), c("Adelie", "Gentoo")))
p_b2 <- ggplot(data = tidy_data, mapping = aes(x = species, y = bill_depth_mm)) +
geom_jitter(aes(group = sex, color = sex), position = position_jitterdodge(dodge.width = 0.75)) +
geom_boxplot(aes(color = sex), alpha = 0.5) +
theme_bw() +
stat_compare_means(method = "t.test", label = "p.signif", comparisons = list(c("Gentoo", "Chinstrap"), c("Adelie", "Chinstrap"), c("Adelie", "Gentoo")))
p_m <- ggplot(data = tidy_data, mapping = aes(x = species, y = body_mass_g)) +
geom_jitter(aes(group = sex, color = sex), position = position_jitterdodge(dodge.width = 0.75)) +
geom_boxplot(aes(color = sex), alpha = 0.5) +
theme_bw() +
stat_compare_means(method = "t.test", label = "p.signif", comparisons = list(c("Gentoo", "Chinstrap"), c("Adelie", "Chinstrap"), c("Adelie", "Gentoo")))
```
You'll notice the plots didn't display after the code block. No worries - the plots were created, but are stored in variables now. You should see the named variables in your environment, probably as "lists".
To make sure the plot is there, we can just call it in a new code block and the plot will display:
```{r}
p_f
```
Now to have fun with patchwork - you can simply add one plot to another and patchwork will automatically put them together!
```{r}
p_f + p_m
```
You can do some fancy formatting with just the plot variables to get some fun layouts in a very intuitive way:
```{r}
(p_b1 | p_b2 | p_f) /
p_m
```
If the plot looks sad and squished, click the little "show in new window" button at the upper right of the plot to expand it!
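Since all four panels share the same sex legend, patchwork's `plot_layout()` can also collect the duplicated legends into a single shared one. A sketch:

```{r}
# Sketch: collect the repeated legends into one shared legend
((p_b1 | p_b2 | p_f) / p_m) + plot_layout(guides = "collect")
```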
### Saving plots to a file
There are several ways to save a plot to a file, but a simple way is to save the plot to a variable as before and then use the `ggsave()` function.
```{r}
p <- (p_b1 | p_b2 | p_f) /
p_m
```
```{r}
ggsave(p, file = "penguins_data_plot.png")
```
Note that the plot will save in your current working directory, and the dimensions may not be ideal!
You can specify the dimensions within the ggsave statement:
```{r}
ggsave(p, file = "penguins_data_plot_wide.png", width = 16, height = 8)
```
Yay! We now have our plot in a file and ready to be slapped into a presentation!
### Cheating :)
Congrats! You just learned a lot of very specific, highly detailed plotting techniques! Now, what happens in a few months at the end of your first rotation, after you have forgotten everything from this bootcamp, when you suddenly need to plot your data for your final presentation? Don't worry, even people who use R every day are constantly forgetting details and specific functions, and there are several tools that can help you! It's not cheating, it's time management :)
Two great options for quickly looking at your data or giving you a starting plot template are **ggraptR** and **esquisse.** Both of these tools provide an interactive interface where you can move your data around and create/tweak basic graphs, and then provide you the exact code to create that graph in ggplot! ggraptR may not be working with the most recent versions of R, so let's try esquisse:
```{r}
library("esquisse")
```
We pass our data frame to the function `esquisser` and a window should open with an interface.
```{r}
esquisser(tidy_data)
```
To re-create the bare bones of our plot, drag the "species" variable to the X box, "flipper_length_mm" to the Y box, and "sex" to the color box. Ignore any error popups, and select a jitter plot from the upper left button. Then click on the "geom #2" tab, and select a boxplot from the upper left button. You can also manually edit the labels using the "Labels and Title" tab along the bottom. Finally, click the "Code" tab on the lower right, and you can copy the basic code to display that plot!
I hope this short tutorial has been helpful to show you some of the things you can accomplish with ggplot in R. Happy plotting!
## Dataset Plotting
Import our data from this morning:
```{r}
mRNA_data <- read.csv("mRNA_data_3genes.csv")
```
#### Plot NOP16 data
```{r}
ggplot(data = mRNA_data, aes(x = Vol_fL, y = log2_ratio, color = Gene)) +
geom_point()
```
#### Line plot
```{r}
ggplot(data = mRNA_data, aes(x = Vol_fL, y = log2_ratio, color = Gene)) +
geom_line()
```
#### Three gene plot
### Heatmaps!
To make a heatmap with ggplot, we use the `geom_tile()` function. This is a 2d structure like a barplot, so the coloring variable is `fill` instead of `color`.
```{r}
ggplot(data = mRNA_data, aes(x = Vol_fL, y = Gene, fill = log2_ratio)) +
geom_tile()
```
#### Continuous color scales
```{r}
ggplot(data = mRNA_data, aes(x = Vol_fL, y = Gene, fill = log2_ratio)) +
geom_tile() +
scale_fill_gradient2()
```
#### Changing the theme
```{r}
ggplot(data = mRNA_data, aes(x = Vol_fL, y = Gene, fill = log2_ratio)) +
geom_tile() +
scale_fill_gradient2() +
theme_bw()
```
#### Making it interactive!
The package **plotly** provides a great way to make your plots interactive. For our heatmap, this means you can mouse over the cells in the plot to display a tooltip with the exact values, providing both visual and quantitative information in a neat package!
Lets install one last package and try it out:
```{r}
# install.packages("plotly")
library(plotly)
```
To use the tooltip, we need to specify a column with the text to display. We can include a lot of information in the tooltip, but in this case let's display the name of the gene, the volume (`Vol_fL`), and the log2 ratio.
We can do this by making a new column using the `mutate()` function from tidyverse!
```{r}
tooltip_data <- mRNA_data %>%
mutate(tooltip_text = paste0("Gene: ", SystematicName, "\n", "Vol (fL): ", Vol_fL, "\n", "Log2 ratio: ",round(log2_ratio,3)))
```
The formatting here is a little unusual in order to make it look pretty, but tooltips can be very simple.
Lets try making our plot, this time specifying a new mapping inside the `aes()` statement for a `text` variable:
```{r}
ggplot(data = tooltip_data, aes(x = Vol_fL, y = Gene, fill = log2_ratio, text = tooltip_text)) +
geom_tile() +
scale_fill_paletteer_c("grDevices::TealGrn", direction = 1) +
theme_bw()
```
Notice we are using a **continuous** color palette this time and using **fill** instead of **color**, by calling the `scale_fill_paletteer_c` function and providing a sequential palette.
*Note: To see a dataframe of the palette and package names that can be called with the continuous or discrete scale functions, type `palettes_c_names` or `palettes_d_names`.*
It probably doesn't look any different, but now we can add the `ggplotly()` wrapper function from **plotly** after we save the plot into a new plotting variable:
```{r}
i_p <- ggplot(data = tooltip_data, aes(x = Vol_fL, y = Gene, fill = log2_ratio, text = tooltip_text)) +
geom_tile() +
scale_fill_paletteer_c("grDevices::TealGrn", direction = 1) +
theme_bw()
ggplotly(i_p, tooltip="text")
```
Incredible!
This interactivity should even work in a knit HTML document; let's try it out!