# UC San Diego Library
# Software Carpentry Workshop - Intro to R
## April 10-11, 2019
### Biomedical Library Building
### 9:00am - 4:30pm
---
# This HackMD has been locked by the instructors and can no longer be edited.
# If you would like to export these notes, you can do that by selecting Export or Download options in the HackMD options menu.
---
### Instructors
- Reid Otsuji, Library (rotsuji@ucsd.edu)
- Stephanie Labou, Library (slabou@ucsd.edu)
- Rick McCosh, OB/GYN+REPRO SCI
- Arden Tran, Research IT (artran@ucsd.edu)
- Andre Paloczy, SIO
- Eva Sanchez Alvarez, Marine Bio
---
### This hackMD: https://bit.ly/2Z2gdIL
### Syllabus and Schedule: https://ucsdlib.github.io/2019-04-10-UCSD/
### UNIX Shell Set-up: https://swcarpentry.github.io/shell-novice/setup.html
---
## DAY 1 - Please sign in here
##### Name, Affiliation (faculty, staff, student, post-doc, etc), Department/Lab
---
Reid Otsuji, Research Data Curation Librarian, Library
Stephanie Labou, Data Science Librarian, Library
Dan LaSusa, staff IT, Library
Rick McCosh, OB/GYN+REPRO SCI
Arden Tran, Research IT
Eva Sanchez Alvarez, Marine Bio
Ashlie Pankonin, staff, Psychology/Barner Lab
Kyle Begovich, post-doc, Biology/Wilhelm Lab
Stefanie Makowski, post-doc, Medicine/Field Lab
Charles Seller, post-doc, Biology/Schroeder Lab
Sonja Lang, post-doc, Medicine/Schnabl Lab
Huikuan Chu, post-doc, Medicine/Schnabl Lab
Zhenping Wang,assistant project scientist, Dermatology/DiNardo lab
Lu Jiang, postdoc, Medicine/Schnabl Lab
Jose Bucheli, visiting grad student, Center for US-Mexican Studies
Stephanie Gamez, grad student, Biology/Akbari Lab
David Herold, faculty, Pathology
Nuria Pell, Medicine, Student-phD/Schnabl Lab
Ben Croker, faculty, Pediatrics
Yi Duan, post-doc, Medicine/Schnabl lab, yid003@ucsd.edu
Nicole Gergans, Staff, San Diego Water Board
Bei Gao, postdoc, Medicine/Schnabl lab
Rongrong Zhou,post-doc,Medicine/Schnabl LAB
Isabel Salas, post-doc, Salk institute, Allen Lab
Ranveer Jayani, Assistant Project Scientist, School of Medicine, UCSD
## Collaborative Notes:
---
## Day 1 - Unix shell
__Setup__
Go here for [setup of the Unix shell](http://swcarpentry.github.io/shell-novice/setup.html)
- On Mac: open your 'Terminal'
- On Windows: open 'Command Prompt' or Git Bash (avoid PowerShell)
*For Windows, you'll want to open the CLI (Command Line Interface - aka. 'Command Prompt') and then enter the command `bash` to enter the bash environment. All that means is we're telling the CLI to interpret our commands in the language of `bash`. On Mac, `bash` is native to the system, so already there when you open 'Terminal'*
__Basic Definitions - "what do those acronyms mean?!"__
- **CLI**: Command Line Interface (aka Terminal, Console, or Command Prompt). The window that looks like a portal to the Matrix.
- **Bash**: a Unix shell and command language. It is one of several programs you can run inside a CLI; another is Command Prompt on Windows. If you're using Windows for this workshop, you're running Bash within Command Prompt.
- **GUI**: Graphical User Interface. This usually means any program we interact with by pointing and clicking with a mouse. The CLI is not a GUI: when we use the command `ls` to list our files and folders, we see a stripped-down, text-only version of what `Finder` (Mac) or `File Explorer` (Windows) would show.
__Bash Commands__
- `pwd`: "print working directory" (tells me what directory I'm in)
- `mkdir`: make a new folder (example: `mkdir new_folder`)
- `cd <folder name>`: change directory (example: `cd new_folder`)
- `nano`: open the nano text editor to create or edit a file (example: `nano new_file.txt`)
- `CTRL + O`: use this inside the nano window to save your draft file. A prompt to enter a filename to save as will appear.
- `CTRL + X`: use this to exit nano. If you have unsaved changes, you'll be asked whether to save them first.
- `cd ..`: Go back or "up" a level in the file directory.
- `cd`: `cd` by itself will take you back to your home directory. Your prompt will then usually look like `<computer name>:~ <username>$`
- `rm`: remove a file (won't work on directories/folders unless you add the `-r` flag)
- Warning: Removing is forever! There is no trash bin when you remove something using the command line.
- `rmdir`: remove an empty directory
- `rm -r /path/to/dir/*`: remove all files and sub-directories inside `/path/to/dir/`
- "Flags" or "arguments": most commands have options that modify what the command does. In Bash these options are usually called flags; in most programming languages the values you pass to a function are called arguments. For example, `rm` on its own removes files. To remove directories along with everything inside them, add the flag `-r` and give the path as the argument `/path/to/dir/*`. You'll end up with a command that looks like this: `rm -r /path/to/dir/*`
- `history`: gives you a list of the commands you've entered. To keep that list for later, save it with `history > history.txt`
- `history > <filename>.txt`: the `>` redirects the output of `history` into a text file with the name of your choice instead of printing it to the screen.
- `cat <filename>.txt`: use the `cat` command to quickly read a file by printing its contents in the CLI.
- `clear`: wipes the commands visible in the CLI window. On a Mac, just scroll up to see the history. On Windows, the CLI is actually cleared out. This is a great tool for starting fresh visually.
- `mv`: move or rename files. Example: `mv history.txt quotes.txt` renames history.txt to quotes.txt (no copy of the original is left behind). `mv history.txt ~/Desktop` moves the history.txt file to the Desktop.
- `wc`: word count (example: `wc *.txt` returns the number of lines, words, and characters in all files with a .txt extension)
- add `-l` to return just the number of lines (`wc -l *.txt`)
- `|` (called the "pipe") is used to chain together commands
- for example, `wc -l *.txt | sort -n | head -n 1` would (1) count the lines in each .txt file, then (2) sort the output from fewest lines to most, then (3) return the first row (the `head` command), which is the file with the fewest lines
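
The commands above can be strung together into a short practice session. This is only a sketch: it works inside a throwaway directory made with `mktemp -d` so the `rm` at the end is safe, and the file names (`short.txt`, `long.txt`, `notes.txt`) are made up for illustration.

```bash
# Work in a throwaway directory so the rm command below is safe.
workdir=$(mktemp -d)
cd "$workdir"

mkdir new_folder            # make a new folder
cd new_folder               # move into it
pwd                         # print where we are

# Create two small text files without opening an editor.
printf 'one\n' > short.txt
printf 'one\ntwo\nthree\n' > long.txt

# Rename a file: after mv, only the new name exists.
mv long.txt notes.txt
ls

# Count lines in every .txt file, sort ascending, keep the first row.
shortest=$(wc -l *.txt | sort -n | head -n 1)
echo "$shortest"

# Clean up: go back up one level, then remove the folder and its contents.
cd ..
rm -r new_folder
```

Because `wc -l` also prints a `total` row when given several files, sorting from smallest to largest keeps that row at the bottom and the shortest file at the top.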
__CLI Pro-Tips__
- want to **re-use the same command** you just typed? Perhaps one that was 5 lines up? Use `⬆` or `⬇` arrow keys and then `Enter` when you get to the command you want, or modify it as needed before hitting `Enter`
- **Auto-complete!** Use `TAB` to auto-complete commands and files/folders in the CLI. For example, say you're trying to type the file path ~/Desktop/thesis/. You could start typing `~/De` and press `TAB`, which completes it to `~/Desktop/`. Then type `t` and press `TAB` again to complete `thesis/`, giving you `~/Desktop/thesis/`. If several files have similar names, it may take a few more characters before auto-complete can pick the right one. Use `ls` to see how much you need to type before auto-complete can finish the name.
---
## R Day 1
Gapminder Data download:
https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/gh-pages/_episodes_rmd/data/gapminder_data.csv
Windows: right click -> save as
Mac: control + click -> save
### Intro to R & RStudio:
RStudio is an IDE (integrated development environment).
What is RStudio:
https://www.rstudio.com/products/rstudio/
### RStudio IDE Cheat sheet:
https://www.rstudio.com/resources/cheatsheets/#ide
### TED Talk about gapminder data set
https://www.ted.com/talks/hans_rosling_shows_the_best_stats_you_ve_ever_seen
**Installers for older versions of RStudio:**
https://support.rstudio.com/hc/en-us/articles/206569407-Older-Versions-of-RStudio?mobile_site=true
Make sure you are using RStudio and not the R programming console.
__Working with variables__
R stores values in named variables.
Use `<-` to assign values to variables, such as `x <- 5` (now x is equal to value 5)
Best practices for naming variables:
- Don't use spaces
- instead, use `_` (`my_new_variable`) or camel case (`myNewVariable`)
- Don't start with a number
- Don't start with a period `.`
Variables are created (and overwritten) in the order they are run. Try running the following to see this in action:
`mass <- 47.5`
`age <- 122`
`mass <- mass * 2.0`
`age <- age - 20`
You can remove a variable using `rm()` with the variable name in `()`, as in `rm(my_variable)`
You can see all the variables you currently have using `ls()`
__Getting help__
Use `?` to get help about functions in R. For example, `?min` will return (in the lower right hand window) the help page for the `min()` function.
__Arithmetic and comparisons in R__
Addition: +
Subtraction: -
Divide: /
Multiply: *
Greater than: >
Greater than or equal to: >=
Less than: <
Less than or equal to: <=
Equals: ==
Not equals: != (the ! here is equivalent to saying "not" whatever comes next, so "not equal")
__Vectorization__
- Create a vector manually: use `c()` (the combine function)
- `c(1,3,5,7)`
- Create a vector using the `seq()` function
```
seq(1, 3, by = 0.2)          # specify step size
# [1] 1.0 1.2 1.4 1.6 1.8 2.0 2.2 2.4 2.6 2.8 3.0
seq(1, 5, length.out = 4)    # specify length of the vector
# [1] 1.000000 2.333333 3.666667 5.000000
```
__Packages__
Packages in R are your friend! You can think of these as bundles of useful functions that other people have created and made available to share. Practically, this means that if you're thinking "hm, I wish I could do this thing in R..." someone probably made a package with functions to do that thing!
There are a __lot__ of packages available: https://cran.r-project.org/web/packages/available_packages_by_name.html
You will need to install a package in order to use it. The good news is that you only have to install it once. To install, you can use `install.packages()` with the name of the package, in quotes, in the `()`. For example, to install the plotting package `ggplot2`, you would use `install.packages("ggplot2")`.
You can also use the point-and-click option in RStudio: Tools --> Install packages --> put the names of the packages.
Once you have the package installed (which you only have to do once) you'll still need to tell R that you want to access that package each time you start a new R session. You do this by using `library()` with the name of the package (without quotes). Example: `library(ggplot2)` means that now I can access functions within `ggplot2` in my current R session.
The standard convention is to run these `library()` commands at the start of each R script. You'll want to load __all__ your packages like this at the beginning of your session/script.
Before working with data, set your working directory.
In the RStudio menu: `Session > Set Working Directory > Choose Directory...`
Select the directory where you have saved the Gapminder data.
Loading .csv data in RStudio (if your download kept its original name, `gapminder_data.csv`, use that name instead):
`read.csv("gapminder.csv")`
## Gapminder data TED talk
looking at the structure of the data frame:
```
str(gapminder)
```
Different Data Types in R:
Logical
Integer
Numeric(Double)
Complex
Character
`Vectors`: an ordered collection of data points, all of the same data type
`Factors`: a variable of any of the above types can also be treated as a factor, which represents discrete group assignments (categories).
Assigning gapminder.csv data to a variable:
```
gapminder <- read.csv("gapminder.csv")
```
```
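# Rows where country is "Australia"; keep only the year and lifeExp columns.
# Values and column names are case-sensitive.
gapminder[gapminder$country == "Australia", c("year", "lifeExp")]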
```
### Creating plots
load ggplot2 library
```
library(ggplot2)
```
```
ggplot(data = gapminder, aes(x=gdpPercap, y=lifeExp)) + geom_point()
```
aes = aesthetics
what happens if you remove geom_point?
```
ggplot(data = gapminder, aes(x=gdpPercap, y=lifeExp))
```
save plots:
```
ggsave("Lifeexp_by_GDPperCap.pdf")
```
saved files are saved in the working directory
adding a theme to a plot: add after geom
```
ggplot(data = gapminder, aes(x=gdpPercap, y=lifeExp)) + geom_point() + theme_classic()
```
using variables to build plot layers:
```
p <- ggplot(data = gapminder, aes(x=gdpPercap, y=lifeExp))
p <- p + geom_point()
p <- p + theme_classic()
p
```
plot modified to show lines
```
p <- ggplot(data = gapminder, aes(x=year, y=lifeExp, by= country))
p <- p + geom_line()
P <- p + theme_classic()
P
```
adding lines and points
```
p <- ggplot(data = gapminder, aes(x=year, y=lifeExp, by= country))
p <- p + geom_line()
# keep adding to the same variable so no layer is lost
p <- p + geom_point()
p <- p + theme_classic()
p
```
```
test <- subset(gapminder, continent == "Oceania")
```
```
p <- ggplot(data = gapminder)
# group = country draws one line per country
p <- p + geom_line(aes(x=year, y=lifeExp, group=country), color = "red")
# overlay the Oceania countries in black ("Oceania" is case-sensitive)
p <- p + geom_line(data = subset(gapminder, continent == "Oceania"), aes(x=year, y=lifeExp, group=country), color="black")
p <- p + theme_classic()
p
```
__Additional plotting resources__
R graph gallery: https://www.r-graph-gallery.com/
ggplot2 cheatsheet: https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf
General reference for ggplot2: https://ggplot2.tidyverse.org/reference/
---
## End of day exercise: Day 1 Feedback
1. Something positive about the workshop Day:
Learned fast!
Learned quick commands--useful. Agree
liked learning how to make plots
2. Something you would like to learn more about or general feedback for improvements Day 1:
Go deeper into construct our own plots, any kinds
Customize plots
How to control other computers with the console/terminal
When assisting others with questions, would be nice to keep voices down a little. Hard to focus with many voices at once
might be a good idea to encourage people to look over the lesson ahead of time so we can focus on practical application in class rather than general concepts
---
## DAY 2 - Please sign in here
##### Name, Affiliation (faculty, staff, student, post-doc, etc), Department/Lab
---
Kyle Begovich, post-doc, Biology/Wilhelm Lab
Ben Croker, Faculty, Pediatrics
Bei Gao, Postdoc, Medicine/Schnabl lab
zhenping wang, project scientist, Dermatology/DiNardo lab
Nicole Gergans, staff, San Diego Water Quality Control Board
Jose Bucheli, visiting grad student, Center for US-Mexican Studies
Abby Pennington, Metadata Services / Research Data Curation, Library
Stephanie Gamez, grad student, Biology/Akbari Lab
Stefanie Makowski, postdoc, Medicine/Field Lab
Charles Seller, post-doc, Biology/Schroeder lab
Ashlie Pankonin, staff, Psychology/Barner lab
Huikuan Chu, postdoc, Medicine/Schnabl lab
Ranveer Jayani, Assistant Project Scientist, School of Medicine, UCSD
Rongrong Zhou,post-doc,Medicine/Schnabl LAB
Yi Duan, post-doc, Medicine/Schnabl lab, yid003@ucsd.edu
Isabel Salas, post-doc, Salk institute, Allen Lab
## Version control with git ##
The material we will be covering: https://swcarpentry.github.io/git-novice/
__Set up__
The first time you use git, you'll need to set up your name and email.
`git config --global user.name "Your Name"`
`git config --global user.email "myemail@domain.com"`
Once you're inside the folder in which you want to version control your files, you'll use `git init` to get everything started. This will turn your folder into a "repository" where git can store versions of your files.
```bash
git init
ls -a # shows the hidden files. .git will be in the folder
```
**make sure you do not git init in nested folders**
(This causes problems with tracking the changes of files.)
```bash
touch draft.txt   # touch makes an empty file (here, draft.txt)
```
___If you get stuck in vim___ type `:q!` and this will force an exit
__Basic git commands__
Use `git status` to check the status of your git repository. This will tell you what files have changed (including any added or deleted files).
If there are no changes, this will return `On branch master
nothing to commit, working directory clean`
If there are files that have changed, you will see something along the lines of
`Untracked files:
(use "git add <file>..." to include in what will be committed)
draft.txt`
Use `git add filename.extension` (example: `git add draft.txt`) to tell git you want to track this file
Then, `git status` will return something along the lines of
`Changes to be committed:
(use "git rm --cached <file>..." to unstage)
new file: draft.txt`
Git now knows that it’s supposed to keep track of `draft.txt`, but it hasn’t recorded these changes as a commit yet. To get it to do that, we need to run one more command: `git commit`
You'll include a commit message, which is a short blurb describing what you've done in this change.
`git commit -m "create draft.txt"`
Once you press enter, you will see something like:
`[master (root-commit) f22b25e] create draft.txt
1 file changed, 1 insertion(+)
create mode 100644 draft.txt`
To see a history of your commits, use `git log`. (When you get to the point of having a lot of commits, you can use `git log --oneline` to see a more succinct summary of the commits.)
___Order of operations___
Most of your work with git comes down to two commands, `git add` and `git commit`:
1) make a change in a file "myfile.txt"
2) `git add myfile.txt`
3) `git commit -m "created myfile.txt"`
![A visual reference:](https://swcarpentry.github.io/git-novice/fig/git-staging-area.svg)
Use `git status` at any point to see whether there are any untracked files or any changes that haven't been committed.
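
To see the whole cycle in one go, here is a rough sketch that runs the three steps in a throwaway repository. The temporary directory and the name and email are placeholders chosen for this example (they are set for this one repository only), and `git status --short` is just a compact form of `git status`.

```bash
# Throwaway repository; the identity below is a placeholder for this repo only.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Workshop Learner"
git config user.email "learner@example.com"

# 1) make a change in a file
echo "first line" > myfile.txt
git status --short          # ?? myfile.txt  (untracked)

# 2) stage it
git add myfile.txt
git status --short          # A  myfile.txt  (staged, not yet committed)

# 3) record it
git commit -q -m "created myfile.txt"
git log --oneline

# Number of commits recorded so far.
commit_count=$(git rev-list --count HEAD)
echo "$commit_count"
```

After the commit, `git status --short` prints nothing, which is the compact way of saying there is nothing left to commit.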
__Look at differences between versions__
Use `git diff` to see how files have changed between commits. Give it a commit ID (the alphanumeric hash, or at least its first few characters) to see the difference between the current version and that past version of the specified file. For example:
`git diff f22b25e draft.txt`
will show the difference between the current version of `draft.txt` and the version at commit `f22b25e`.
Can use `git diff HEAD` as a shortcut to see changes between current version and last committed version:
`git diff HEAD draft.txt`
Similarly, can use `git diff HEAD~1` to see changes between current version and commit one prior to last commit:
`git diff HEAD~1 draft.txt`
See all commits for a particular file (only commits where certain file has changed):
`git log --follow -- filename`
Use `git log -p filename` to see actual differences between files in addition to commit messages.
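
A small sketch of comparing versions, assuming a throwaway repository with two commits and one uncommitted edit. The file contents and the placeholder identity are made up for illustration.

```bash
# Throwaway repository with a placeholder identity.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Workshop Learner"
git config user.email "learner@example.com"

echo "line one" > draft.txt
git add draft.txt
git commit -q -m "create draft.txt"

echo "line two" >> draft.txt
git add draft.txt
git commit -q -m "add second line"

# An edit that has not been committed yet.
echo "line three" >> draft.txt

git diff HEAD draft.txt      # current file vs. the last commit
git diff HEAD~1 draft.txt    # current file vs. the commit before that

# Count the added lines each comparison reports.
added_since_head=$(git diff HEAD draft.txt | grep -c '^+line')
added_since_prev=$(git diff HEAD~1 draft.txt | grep -c '^+line')
echo "$added_since_head $added_since_prev"
```

The further back the commit you compare against, the more changes show up in the diff.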
__Rolling back to previous versions__
Use `git checkout` to "roll back" a file to a previously committed version (aka restore a previous version):
` git checkout f22b25e draft.txt`
To put things back the way they were:
`git checkout HEAD draft.txt`
___Detached head___: If you checkout and forget to specify a file, your whole repository will be rolled back and you will get a warning to your console that says `You are in 'detached HEAD' state.` The “detached HEAD” is like “look, but don’t touch” here, so you shouldn’t make any changes in this state. After investigating your repo’s past state, reattach your HEAD with `git checkout master`.
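
Here is a sketch of rolling a single file back and then restoring it, again in a throwaway repository with a placeholder identity. It uses `git rev-parse --short HEAD` to capture the commit ID instead of copying it from `git log` by hand.

```bash
# Throwaway repository with a placeholder identity.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Workshop Learner"
git config user.email "learner@example.com"

echo "version 1" > draft.txt
git add draft.txt
git commit -q -m "create draft.txt"
first_commit=$(git rev-parse --short HEAD)   # ID of the first commit

echo "version 2" > draft.txt
git add draft.txt
git commit -q -m "rewrite draft.txt"

# Roll the file back to how it looked at the first commit.
git checkout "$first_commit" draft.txt
rolled_back=$(cat draft.txt)

# Put things back the way they were at the latest commit.
git checkout HEAD draft.txt
restored=$(cat draft.txt)

echo "$rolled_back / $restored"
```

Naming the file in each `git checkout` is what keeps the rest of the repository where it is and avoids the detached HEAD state.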
__Ignoring things__
You can create a file called `.gitignore` and include any files or folders which you don't want to track.
`nano .gitignore` [then add files or folders]
So for instance, to ignore a file called `notes.txt` and a folder called `data/`, put each pattern on its own line. `cat .gitignore` would then show:
`notes.txt`
`data/`
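
To check that the ignore rules do what you expect, something like the following sketch can help. It assumes a throwaway repository, and `draft.txt` and `data/results.csv` are hypothetical files added only so there is something to compare against.

```bash
# Throwaway repository with a few files to test against.
repo=$(mktemp -d)
cd "$repo"
git init -q

touch notes.txt draft.txt
mkdir data
touch data/results.csv

# One pattern per line.
printf 'notes.txt\ndata/\n' > .gitignore

# Only .gitignore and draft.txt should be listed as untracked.
visible=$(git status --short)
echo "$visible"

# git check-ignore exits with 0 when a path matches a pattern in .gitignore.
git check-ignore -q notes.txt && echo "notes.txt is ignored"
```

It is usually worth committing `.gitignore` itself so that everyone working in the repository shares the same rules.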
__GitHub__
Material here: https://swcarpentry.github.io/git-novice/07-github/index.html
__________
```bash
git status #use git status often!!
git add --all #add all files to track
```
`git status` uses colors to show where each file stands:
red - the file has changes that are not staged yet (or it is not tracked at all)
green - the file has been staged with `git add` and will go into the next commit
```bash
git commit -m "[you must type a short message]"
```
__Remotes in GitHub__
To get started with using remotes, we'll need to have a GitHub account. Go to [GitHub](https://github.com/) and create an account.
[Go here for the repo](https://github.com/dankelley/oce)
---
# Day 2 R
Make sure you have the Gapminder data saved in your working directory.
```
# load packages
install.packages("dplyr")
library(dplyr)
library(ggplot2)
gapminder <- read.csv("gapminder.csv") #read in data to gapminder variable
```
```
# explore the data
head(gapminder)
str(gapminder)
summary(gapminder)
```
**select()** command from dplyr - selects columns
Usage: `select(dataframe, columns)`
`select()` is a way to keep only the columns you need.
```
gap_small <- select(gapminder, country, year, gdpPercap)
```
**filter()** command - filters by row condition
```
gap_small_oceania <- filter(gapminder, continent == "Oceania")
```
Exercise solution:
```
new_data <- select(gapminder, country, lifeExp, gdpPercap)
```
**%>%** (this is a pipe)
```
gap2 <- gapminder %>%
  filter(continent == "Oceania") %>%
  select(country, year, pop)
# remember select and filter order makes a difference!
```
**mutate() function**
```
gap_gdp <- gapminder %>%
  mutate(gdp = pop * gdpPercap)  # the population column is named pop
```
```
gap_gdp <- gap_gdp %>%
  mutate(rich = ifelse(gdp > 1868000000, "rich", "on the way"))
summary(gap_gdp)
summary(as.factor(gap_gdp$rich))
```
Exercise 2
Break now: III
Break at 2:30: IIII
copy pasta code for arden
```
#install.packages("dplyr")
# First things first, load packages
library(dplyr)
library(ggplot2)
# Read in gapminder data
gapminder <- read.csv("gapminder.csv")
# Explore the data
head(gapminder)
str(gapminder)
summary(gapminder)
# select() command - selects columns
# select(dataframe, columns)
gap_small <- select(gapminder, country, year, gdpPercap)
# filter() command - filters by row condition
gap_small_Oceania <- filter(gapminder, continent == "Oceania")
# Make a new dataframe called “new_data” that has only the columns country, life expectancy, and GDP per capita.
new_data <- select(gapminder, country, lifeExp, gdpPercap)
# this is the pipe %>%
gap2 <- gapminder %>%
  filter(continent == "Oceania") %>%
  select(country, year, pop)
# gap2 <- gapminder %>%
#   select(country, year, pop) %>%
#   filter(continent == "Oceania")
# mutate() creates new columns
gap_gdp <- gapminder %>%
  mutate(gdp = pop * gdpPercap)
gap_gdp <- gap_gdp %>%
  mutate(rich = ifelse(gdp > 18680000000, "rich", "on the way"))
summary(gap_gdp)
summary(as.factor(gap_gdp$rich))
# Create a new dataframe that has a new column with the ratio of life expectancy to GDP per capita. Keep only the country and ratio column.
# Hint: think about the order of operations!
new <- gapminder %>%
  mutate(ratio = lifeExp/gdpPercap) %>%
  select(country, ratio)
# group_by() is how we group variables
# summarize() is how we summarize
gap_gdp_summary <- gapminder %>%
  group_by(continent) %>%
  summarize(mean_gdpPercap = mean(gdpPercap))
gap_gdp_summary
gap_summary2 <- gapminder %>%
  group_by(continent, country) %>%
  summarize(max_pop = max(pop),
            mean_gdpPercap = mean(gdpPercap))
# Let's get back to plotting
ggplot(data = gapminder, aes(x = year, y = lifeExp, by = country, color = pop)) +
  geom_line() +
  facet_wrap(~continent) +
  scale_color_gradient(low = "blue", high = "red") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  xlab("Year") +
  ylab("Life Expectancy") +
  ggtitle("Fancy Plot")
ggsave("fancyplot.png")
# example with plots: save one PNG per country
countries <- unique(gapminder$country)
for (i in seq_along(countries)) {
  # as.character() keeps the title and file name readable if country is a factor
  country_i <- as.character(countries[i])
  gapminder_i <- filter(gapminder, country == country_i)
  ggplot(data = gapminder_i, aes(x = year, y = gdpPercap)) +
    geom_point() +
    ggtitle(country_i)
  # paste0() avoids the space that paste() would insert before ".png"
  ggsave(paste0(country_i, ".png"))
}
```
### Install knitr package
```
install.packages("knitr")
library(knitr)
```
### different plots
plotly
## End of workshop exercise: Day 2 Feedback
1. Something positive about the workshop Day:
The instructors are very knowledgeable and helpful. I learned a lot in the two-day workshop.
Good overview of the kinds of things we can do with our new skills, although putting it into practice on my own seems a little daunting.
great code for plots. ++ :D
2. Something you would like to learn more about or general feedback for improvements Day 2:
It's a great two-day workshop packed with lots of information. However, I feel that this workshop should be split into multiple weeks (with 1-2 hrs a week). This will help us all (who know nothing about all this) to grasp the commands and principles in a better way. And some homework and self-practice will help hone our skills.
I think it would be helpful to allow for more time to understand what we're doing because sometimes you feel like you are typing things without really knowing why we're doing it. Maybe by making us read over the lessons ahead of time.
# Take the post workshop survey: https://ucsdlib.github.io/2019-04-10-UCSD/