UC San Diego Library

Software Carpentry Workshop - Intro to R

April 10-11, 2019

Biomedical Library Building

9:00am - 4:30pm


This HackMD has been locked by the instructors and can no longer be edited.

If you would like to export these notes, you can do that by selecting Export or Download options in the HackMD options menu.


Instructors


This hackMD: https://bit.ly/2Z2gdIL

Syllabus and Schedule: https://ucsdlib.github.io/2019-04-10-UCSD/

UNIX Shell Set-up: https://swcarpentry.github.io/shell-novice/setup.html


DAY 1 - Please sign in here

Name, Affiliation (faculty, staff, student, post-doc, etc), Department/Lab

Reid Otsuji, Research Data Curation Librarian, Library
Stephanie Labou, Data Science Librarian, Library
Dan LaSusa, staff IT, Library
Rick McCosh, OB/GYN+REPRO SCI
Arden Tran, Research IT
Eva Sanchez Alvarez, Marine Bio
Ashlie Pankonin, staff, Psychology/Barner Lab
Kyle Begovich, post-doc, Biology/Wilhelm Lab
Stefanie Makowski, post-doc, Medicine/Field Lab
Charles Seller, post-doc, Biology/Schroeder Lab
Sonja Lang, post-doc, Medicine/Schnabl Lab
Huikuan Chu, post-doc, Medicine/Schnabl Lab
Zhenping Wang, assistant project scientist, Dermatology/DiNardo lab
Lu Jiang, postdoc, Medicine/Schnabl Lab
Jose Bucheli, visiting grad student, Center for US-Mexican Studies
Stephanie Gamez, grad student, Biology/Akbari Lab
David Herold, faculty, Pathology
Nuria Pell, Medicine, Student-phD/Schnabl Lab
Ben Croker, faculty, Pediatrics
Yi Duan, post-doc, Medicine/Schnabl lab, yid003@ucsd.edu
Nicole Gergans, Staff, San Diego Water Board
Bei Gao, postdoc, Medicine/Schnabl lab
Rongrong Zhou, post-doc, Medicine/Schnabl lab
Isabel Salas, post-doc, Salk institute, Allen Lab
Ranveer Jayani, Assistant Project Scientist, School of Medicine, UCSD

Collaborative Notes:


Day 1 - Unix shell

Setup
Setup instructions for the Unix shell: https://swcarpentry.github.io/shell-novice/setup.html

  • On Mac: open your 'Terminal'
  • On Windows: open 'Command Prompt' or Git Bash (avoid PowerShell)

For Windows, open the CLI (Command Line Interface, aka 'Command Prompt') and then enter the command bash to enter the bash environment. This tells the CLI to interpret our commands in the language of bash. On Mac, bash is native to the system, so it is already there when you open 'Terminal'.

Basic Definitions - "what do those acronyms mean?!"

  • CLI: Command Line Interface (aka Terminal, Console, or Command Prompt). The window that looks like a portal to the Matrix.
  • Bash: Bash is a command processor language and Unix shell. It is one of several shells you can run in a CLI. You'll get familiar with others like Command Prompt for Windows. If you're using Windows for this workshop, you're using Bash within Command Prompt.
  • GUI: Graphical User Interface - typically referring to any time we use a mouse to interact with a program or environment. For example, the CLI is not a GUI and when we use the command ls to list our files and folders, we are seeing a stripped down version or base version of the functions that are used to list files and folders as opposed to using the usual Finder (Mac) or File Explorer (Windows) to look at your files and folders.

Bash Commands

  • pwd: "print working directory" (tells me what directory I'm in)

  • mkdir: make a new folder (example: mkdir new_folder)

  • cd <folder name>: change directory (example: cd new_folder)

  • nano: open a file in the nano text editor, creating it if it doesn't exist yet (example: nano new_file.txt)

  • CTRL + O: use this inside the nano window to save your draft file. A prompt to enter a filename to save as will appear.

  • CTRL + X: Use this to exit nano. If you have unsaved changes, you'll be prompted to choose whether to save them.

  • cd ..: Go back or "up" a level in the file directory.

  • cd: cd by itself will take you back to your home directory, where the prompt will usually look like <computer name>:~ <username>$

  • rm: Remove file (won't work on directories/folders)

    • Warning: Removing is forever! There is no trash bin when you remove something using the command line.
  • rmdir: Remove an empty directory

  • rm -r /path/to/dir/*: Remove all files and sub-directories inside /path/to/dir

  • "Flags" or "arguments" - in programming and with command line languages like Bash or Command Prompt, most of the commands you'll use have options that modify what the command does. In Bash we usually call these flags; in most languages we call them arguments. For example, the rm command above removes files, but if you want to remove an entire directory along with its sub-directories and files, add the flag -r followed by the path to what you want to remove, here /path/to/dir/*. You'll end up with a command that looks like this: rm -r /path/to/dir/*

  • history: gives you a list of the commands you've entered in your current session (to save the list for later use, run history > history.txt)

  • history > <filename>.txt: the > after history redirects the output into a text file with a name of your choice instead of printing it to the screen.

  • cat <filename>.txt: use the cat command to quickly read through a file by having the CLI print the contents of the text file.

  • clear: wipes out the commands visible within the window of the CLI. On a Mac, just scroll up to see the history. On Windows, the CLI is actually cleared out. This is a great tool to use for starting fresh visually.

  • mv: move or rename files. Example: mv history.txt quotes.txt renames history.txt to quotes.txt (no copy is kept under the old name). mv history.txt ~/Desktop moves the history.txt file to the Desktop.

  • wc: word count (example: wc *.txt returns the number of lines, words, and characters in all files with a .txt extension)

    • add -l to return just the number of lines (wc -l *.txt)
  • | (called the "pipe") is used to chain together commands

    • for example, wc -l *.txt | sort -n | head -n 1 would (1) find the number of lines in all .txt files, then (2) sort the output from smallest number of lines to largest number of lines, then (3) return the first row (the head command)
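The commands above can be strung together into one short session. This is only a sketch: the folder and file names are made up, and mktemp and printf (used here to get a throwaway folder and some sample files) were not covered in the workshop.

```shell
# Work in a throwaway folder so nothing real is touched (mktemp is an assumption)
workdir=$(mktemp -d)
cd "$workdir"

mkdir thesis                              # make a new folder
cd thesis                                 # move into it
printf 'one\ntwo\nthree\n' > long.txt     # sample file with 3 lines
printf 'only line\n' > short.txt          # sample file with 1 line
printf 'a\nb\n' > draft.txt               # sample file with 2 lines

mv draft.txt medium.txt                   # rename: draft.txt no longer exists

wc -l *.txt                               # line counts for every .txt file, plus a total

# Pipe: count lines, sort numerically, keep the first (smallest) row
shortest=$(wc -l *.txt | sort -n | head -n 1)
echo "$shortest"

cd ..                                     # go back up one level
rm -r thesis                              # remove the folder and everything in it (no trash bin!)
```

The last row of the sorted output is the total across all files, so head -n 1 is the safe way to pick out the shortest file.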


CLI Pro-Tips

  • Want to re-use the same command you just typed? Perhaps one that was 5 lines up? Use the up or down arrow keys and then Enter when you get to the command you want, or modify it as needed before hitting Enter
  • Auto-complete! Use TAB to auto-complete commands and files/folders in the CLI. For example, say you're trying to type the file path ~/Desktop/thesis/. You could start typing ~/De and then TAB and it will complete to ~/Desktop/, then continue by typing t and then TAB to complete thesis/, giving you ~/Desktop/thesis/. If the thing you are trying to auto-complete has similarly named files, it may take a couple more characters before the auto-complete finds the right file name. Use ls to determine how much you need to type before you can use auto-complete.

R Day 1

Gapminder Data download:
https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/gh-pages/_episodes_rmd/data/gapminder_data.csv

Windows: right click -> save as
Mac: control + click -> save

Intro to R & RStudio:

RStudio is an IDE (integrated development environment)
What is RStudio:
https://www.rstudio.com/products/rstudio/

RStudio IDE Cheat sheet:

https://www.rstudio.com/resources/cheatsheets/#ide

TED Talk about gapminder data set

https://www.ted.com/talks/hans_rosling_shows_the_best_stats_you_ve_ever_seen

Older version RStudio installers:
https://support.rstudio.com/hc/en-us/articles/206569407-Older-Versions-of-RStudio?mobile_site=true

Make sure you are using RStudio and not the R programming console.

Working with variables

R stores values in named variables

Use <- to assign values to variables, such as x <- 5 (now x is equal to value 5)

Best practices for naming variables:

  • Don't use spaces
    • instead, use _ (my_new_variable) or camel case (myNewVariable)
  • Don't start with a number
  • Don't start with a period .

Variables are created (and overwritten) in the order they are run. Try running the following to see this in action:
mass <- 47.5
age <- 122
mass <- mass * 2.0
age <- age - 20

You can remove a variable using rm() with the variable name in (), as in rm(my_variable)

You can see all the variables you currently have using ls()

Getting help

Use ? to get help about functions in R. For example, ?min will return (in the lower right hand window) the help page for the min() function.

Arithmetic and comparisons in R
Addition: +
Subtraction: -
Divide: /
Multiply: *
Greater than: >
Greater than or equal to: >=
Less than: <
Less than or equal to: <=
Equals: ==
Not equals: != (the ! here is equivalent to saying "not" whatever comes next, so "not equal")

Vectorization

  • Create a vector manually: use c() (the combine function)

  • c(1,3,5,7)

  • Creating a vector using seq() function

> seq(1, 3, by=0.2)          # specify step size
[1] 1.0 1.2 1.4 1.6 1.8 2.0 2.2 2.4 2.6 2.8 3.0
> seq(1, 5, length.out=4)    # specify length of the vector
[1] 1.000000 2.333333 3.666667 5.000000

Packages
Packages in R are your friend! You can think of these as bundles of useful functions that other people have created and made available to share. Practically, this means that if you're thinking "hm, I wish I could do this thing in R" someone probably made a package with functions to do that thing!

There are a lot of packages available: https://cran.r-project.org/web/packages/available_packages_by_name.html

You will need to install a package in order to use it. The good news is that you only have to install it once. To install, you can use install.packages() with the name of the package, in quotes, in the (). For example, to install the plotting package ggplot2, you would use install.packages("ggplot2").

You can also use the point-and-click option in RStudio: Tools > Install packages > put the names of the packages.

Once you have the package installed (which you only have to do once) you'll still need to tell R that you want to access that package each time you start a new R session. You do this by using library() with the name of the package (without quotes). Example: library(ggplot2) means that now I can access functions within ggplot2 in my current R session.
The standard convention is to run these library() commands at the start of each R script. You'll want to load all your packages like this at the beginning of your session/script.


Before working with data, set your working directory.

In the RStudio menu: Session > Set Working Directory > Choose Directory

Select the directory where you have saved the Gapminder data.

Loading .csv data in RStudio
read.csv("gapminder.csv")

Gapminder data TED talk

looking at the structure of the data frame:

str(gapminder)

Different Data Types in R:
Logical
Integer
Numeric (Double)
Complex
Character

Vectors: a collection of data points, in order, all of the same data type

Factors: a variable of any of the above types can also be treated as a factor. Discrete group assignments.

Assigning gapminder.csv data to a variable

gapminder <- read.csv("gapminder.csv")
gapminder[gapminder$country == "Australia", c("year", "lifeExp")]

Creating plots

load ggplot2 library

library(ggplot2)
ggplot(data = gapminder, aes(x=gdpPercap, y=lifeExp)) + geom_point()

aes = aesthetics

what happens if you remove geom_point?

ggplot(data = gapminder, aes(x=gdpPercap, y=lifeExp))

save plots:

ggsave("Lifeexp_by_GDPperCap.pdf")

saved files are saved in the working directory

adding a theme to a plot: add after geom

ggplot(data = gapminder, aes(x=gdpPercap, y=lifeExp)) + geom_point() + theme_classic()

using variables to build plot layers:

p <- ggplot(data = gapminder, aes(x=gdpPercap, y=lifeExp))
p <- p + geom_point()
p <- p + theme_classic()
p

plot modified to show lines

p <- ggplot(data = gapminder, aes(x=year, y=lifeExp, by=country))
p <- p + geom_line()
p <- p + theme_classic()
p

adding lines and points

p <- ggplot(data = gapminder, aes(x=year, y=lifeExp, by=country))
p <- p + geom_line()
p <- p + geom_point()
p <- p + theme_classic()
p

highlighting one continent (Oceania) in a different color

test <- subset(gapminder, continent == "Oceania")
p <- ggplot(data = gapminder)
p <- p + geom_line(aes(x=year, y=lifeExp, by=country), color = "red")
p <- p + geom_line(data = subset(gapminder, continent == "Oceania"), aes(x=year, y=lifeExp, by=country), color="black")
p <- p + theme_classic()
p

Additional plotting resources
R graph gallery: https://www.r-graph-gallery.com/
ggplot2 cheatsheet: https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf
General reference for ggplot2: https://ggplot2.tidyverse.org/reference/


End of day exercise: Day 1 Feedback

  1. Something positive about the workshop Day:

Learned fast!
Learned quick commands, useful. Agree
liked learning how to make plots

  2. Something you would like to learn more about or general feedback for improvements Day 1:

Go deeper into construct our own plots, any kinds
Customize plots
How to control other computers with the console/terminal
When assisting other with questions, would be nice to keep voices down a little. Hard to focus with many voices at once
might be a good idea to encourage people to look over the lesson ahead of time so we can focus on practical application in class rather than general concepts


DAY 2 - Please sign in here

Name, Affiliation (faculty, staff, student, post-doc, etc), Department/Lab

Kyle Begovich, post-doc, Biology/Wilhelm Lab
Ben Croker, Faculty, Pediatrics
Bei Gao, Postdoc, Medicine/Schnabl lab
zhenping wang, project scientist, Dermatology/DiNardo lab
Nicole Gergans, staff, San Diego Water Quality Control Board
Jose Bucheli, visiting grad student, Center for US-Mexican Studies
Abby Pennington, Metadata Services / Research Data Curation, Library
Stephanie Gamez, grad student, Biology/Akbari Lab
Stefanie Makowski, postdoc, Medicine/Field Lab
Charles Seller, post-doc, Biology/Schroeder lab
Ashlie Pankonin, staff, Psychology/Barner lab
Huikuan Chu, postdoc, Medicine/Schnabl lab
Ranveer Jayani, Assistant Project Scientist, School of Medicine, UCSD
Rongrong Zhou, post-doc, Medicine/Schnabl lab
Yi Duan, post-doc, Medicine/Schnabl lab, yid003@ucsd.edu
Isabel Salas, post-doc, Salk institute, Allen Lab

Version control with git

The material we will be covering: https://swcarpentry.github.io/git-novice/

Set up
The first time you use git, you'll need to set up your name and email.
git config --global user.name "Your Name"
git config --global user.email "myemail@domain.com"

Once you're inside the folder in which you want to version control your files, you'll use git init to get everything started. This will turn your folder into a "repository" where git can store versions of your files.

git init
ls -a # shows the hidden files.  .git will be in the folder

make sure you do not git init in nested folders
(This causes problems with tracking the changes of files.)
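One way to avoid a nested git init is to ask git whether the current folder is already inside a repository before you start. The sketch below uses a scratch folder made with mktemp (an assumption, not part of the workshop) and git rev-parse --is-inside-work-tree, which prints true only inside a repository.

```shell
# Scratch folder for the demo (mktemp is an assumption)
repo=$(mktemp -d)
cd "$repo"

# Before git init: the check fails, so the message after || is printed
git rev-parse --is-inside-work-tree 2>/dev/null || echo "not a repository yet"

git init                                        # turn this folder into a repository

inside=$(git rev-parse --is-inside-work-tree)   # now prints true
echo "$inside"

ls -a                                           # the hidden .git folder shows up in the listing
```

If the check already prints true before you run git init, you are inside an existing repository and should not run git init again there.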

touch <filename> # makes an empty file (example: touch draft.txt)

If you get stuck in vim, press Esc, then type :q! and press Enter to force an exit without saving

Basic git commands

Use git status to check the status of your git repository. This will tell you what files have changed (including any added or deleted files).

If there are no changes, this will return
On branch master
nothing to commit, working directory clean

If there are files that have changed, you will see something along the lines of
Untracked files:
  (use "git add <file>..." to include in what will be committed)
    draft.txt

Use git add filename.extension (example: git add draft.txt) to tell git you want to track this file

Then, git status will return something along the lines of
Changes to be committed:
  (use "git rm --cached <file>..." to unstage)
    new file: draft.txt

Git now knows that it’s supposed to keep track of draft.txt, but it hasn’t recorded these changes as a commit yet. To get it to do that, we need to run one more command: git commit

You'll include a commit message, which is a short blurb describing what you've done in this change.

git commit -m "create draft.txt"

Once you press enter, you will see something like:
[master (root-commit) f22b25e] create draft.txt
 1 file changed, 1 insertion(+)
 create mode 100644 draft.txt

To see a history of your commits, use git log. (When you get to the point of having a lot of commits, you can use git log --oneline to see a more succinct summary of the commits.)

Order of operations

Most of your work with git comes down to two commands, git add and git commit:

  1. make a change in a file "myfile.txt"
  2. git add myfile.txt
  3. git commit -m "created myfile.txt"

Use git status at any point to see whether there are any untracked files or any changes that haven't been committed.
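The change, add, commit cycle can be tried end to end in a scratch repository. This is a sketch under a few assumptions: the folder comes from mktemp, the name and email are placeholders set for this one repository only (no --global), and >> (append to a file) was not covered in the workshop.

```shell
repo=$(mktemp -d)                    # scratch folder (assumption)
cd "$repo"
git init

# Placeholder identity, stored in this repository only
git config user.name "Your Name"
git config user.email "myemail@domain.com"

echo "first line" > draft.txt        # 1. make a change
git status                           # draft.txt is listed as untracked
git add draft.txt                    # 2. tell git to include it
git commit -m "create draft.txt"     # 3. record the change

echo "second line" >> draft.txt      # 1. another change (>> appends)
git add draft.txt                    # 2.
git commit -m "add second line"      # 3.

git log --oneline                    # one line per commit, newest first
commit_count=$(git rev-list --count HEAD)
echo "$commit_count"
```

After the second commit, git status should report nothing to commit.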

Look at differences between versions

Use git diff to see how files have changed between commits. Use the commit's alphanumeric ID (at least the first few characters) to see the difference between the current version and the selected past version of a specified file. For example:

git diff f22b25e draft.txt

will show the difference between the current version of draft.txt and the version at commit f22b25e.

Can use git diff HEAD as a shortcut to see changes between current version and last committed version:

git diff HEAD draft.txt

Similarly, can use git diff HEAD~1 to see changes between current version and commit one prior to last commit:

git diff HEAD~1 draft.txt

See all commits for a particular file (only commits where certain file has changed):
git log --follow -- filename

Use git log -p filename to see actual differences between files in addition to commit messages.
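To see how HEAD and HEAD~1 differ in practice, the sketch below builds a scratch repository with two commits plus one uncommitted edit, then counts the added lines each diff reports. The folder (mktemp), the placeholder identity, and the grep used for counting are assumptions for illustration only.

```shell
repo=$(mktemp -d)                         # scratch folder (assumption)
cd "$repo"
git init
git config user.name "Your Name"          # placeholder identity, this repo only
git config user.email "myemail@domain.com"

echo "version 1" > draft.txt
git add draft.txt
git commit -m "create draft.txt"

echo "version 2" >> draft.txt
git add draft.txt
git commit -m "add version 2"

echo "uncommitted edit" >> draft.txt      # changed, but not committed

git diff HEAD draft.txt                   # current file vs last commit
git diff HEAD~1 draft.txt                 # current file vs the commit before that

# Count added lines in each diff (lines starting with a single +)
added_vs_head=$(git diff HEAD draft.txt | grep -c '^+[^+]')
added_vs_prev=$(git diff HEAD~1 draft.txt | grep -c '^+[^+]')
echo "$added_vs_head $added_vs_prev"

git log --follow --oneline -- draft.txt   # commits that touched draft.txt
```

The diff against HEAD~1 reports more added lines because it spans the last commit as well as the uncommitted edit.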

Rolling back to previous versions
Use git checkout to "roll back" a file to a previously committed version (aka restore a previous version):

git checkout f22b25e draft.txt

To put things back the way they were:
git checkout HEAD draft.txt

Detached head: If you checkout and forget to specify a file, your whole repository will be rolled back and you will get a warning to your console that says You are in 'detached HEAD' state. The “detached HEAD” is like “look, but don’t touch” here, so you shouldn’t make any changes in this state. After investigating your repo’s past state, reattach your HEAD with git checkout master.
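The rollback and the detached HEAD recovery can both be rehearsed safely in a scratch repository. In this sketch the folder (mktemp) and the identity are placeholders, and the branch name is looked up with git rev-parse --abbrev-ref HEAD rather than assumed to be master, since newer git installations may use a different default name.

```shell
repo=$(mktemp -d)                          # scratch folder (assumption)
cd "$repo"
git init
git config user.name "Your Name"           # placeholder identity, this repo only
git config user.email "myemail@domain.com"

echo "version 1" > draft.txt
git add draft.txt
git commit -m "create draft.txt"
echo "version 2" >> draft.txt
git add draft.txt
git commit -m "add version 2"

branch=$(git rev-parse --abbrev-ref HEAD)  # remember the branch name
first=$(git rev-parse --short HEAD~1)      # ID of the older commit

git checkout "$first" draft.txt            # roll draft.txt back to the older version
rolled_back=$(cat draft.txt)
echo "$rolled_back"

git checkout HEAD draft.txt                # put things back the way they were
cat draft.txt

git checkout "$first"                      # no file given: detached HEAD, look but don't touch
git checkout "$branch"                     # reattach HEAD to the branch
```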

Ignoring things
You can create a file called .gitignore and include any files or folders which you don't want to track.

nano .gitignore [then add files or folders]

So for instance if I wanted to ignore a file called notes.txt or a folder called data/:

cat .gitignore
notes.txt
data/*
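A quick way to confirm that .gitignore is doing its job is to create the ignored items and check what git status still reports. The sketch below uses a scratch folder and made-up files; mktemp, printf, and the --short flag of git status were not covered in the workshop.

```shell
repo=$(mktemp -d)                            # scratch folder (assumption)
cd "$repo"
git init

mkdir data
echo "scratch thoughts" > notes.txt          # should be ignored
echo "1,2,3" > data/raw.csv                  # should be ignored
echo "keep me" > draft.txt                   # should still show up

printf 'notes.txt\ndata/*\n' > .gitignore    # one pattern per line

git status --short                           # compact list of untracked and changed files
untracked=$(git status --short)
```

Only .gitignore and draft.txt should appear as untracked; notes.txt and everything under data/ are left out.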

GitHub

Material here: https://swcarpentry.github.io/git-novice/07-github/index.html


git status  #use git status often!!
git add --all #add all files to track 

git status color-codes file names:
red - the file has changes (or is new) that have not been added yet
green - the file has been added and will be included in the next commit

git commit -m  "[you must type a short message]"

Remotes in GitHub
To get started with using remotes, we'll need to have a GitHub account. Go to GitHub and create an account.

Go here for the repo


Day 2 R

Make sure you have the Gapminder data saved in your working directory.

# load packages
install.packages("dplyr")
library(dplyr)
library(ggplot2)

gapminder <- read.csv("gapminder.csv") #read in data to gapminder variable 


# explore the data
head(gapminder)
str(gapminder)
summary(gapminder)

select() from dplyr
select(dataframe, columns)
select() is a way to keep only the columns you need.

gap_small <- select(gapminder, country, year, gdpPercap)


filter() command - filters by row condition

gap_small_oceania <-  filter(gapminder, continent == "Oceania")

Exercise solution:

new_data <- select(gapminder, country, lifeExp, gdpPercap)

%>% (this is a pipe)

gap2 <- gapminder %>%
    filter(continent == "Oceania") %>%
    select(country, year, pop)
# remember select and filter order makes a difference!

mutate() function

gap_gdp <- gapminder %>%
    mutate(gdp = pop * gdpPercap)
gap_gdp <- gap_gdp %>%
    mutate(rich = ifelse(gdp > 1868000000, "rich", "on the way"))
    
    
summary(gap_gdp)

summary(as.factor(gap_gdp$rich))
    

Exercise 2

Break now: III

Break at 2:30: IIII

copy pasta code for arden

#install.packages("dplyr")

# First things first, load packages
library(dplyr)
library(ggplot2)

# Read in gapminder data
gapminder <- read.csv("gapminder.csv")

# Explore the data
head(gapminder)
str(gapminder)
summary(gapminder)

# select() command - selects columns
# select(dataframe, columns)
gap_small <- select(gapminder, country, year, gdpPercap)

# filter() command - filters by row condition
gap_small_Oceania <- filter(gapminder, continent == "Oceania")

# Make a new dataframe called “new_data” that has only the columns country, life expectancy, and GDP per capita.

new_data <- select(gapminder, country, lifeExp, gdpPercap)

# this is the pipe %>% 

gap2 <- gapminder %>% 
  filter(continent == "Oceania") %>% 
  select(country, year, pop)
  
# gap2 <- gapminder %>% 
#   select(country, year, pop) %>% 
#   filter(continent == "Oceania") 

# mutate() creates new columns

gap_gdp <- gapminder %>% 
  mutate(gdp = pop * gdpPercap)

gap_gdp <- gap_gdp %>% 
  mutate(rich = ifelse(gdp > 18680000000, "rich", "on the way"))

summary(gap_gdp)
summary(as.factor(gap_gdp$rich))

# Create a new dataframe that has a new column with the ratio of life expectancy to GDP per capita. Keep only the country and ratio column. 
# Hint: think about the order of operations!
  
new <- gapminder %>% 
  mutate(ratio = lifeExp/gdpPercap) %>% 
  select(country, ratio)

# group_by() is how we group variables
# summarize() is how we summarize

gap_gdp_summary <- gapminder %>% 
  group_by(continent) %>% 
  summarize(mean_gdpPercap = mean(gdpPercap))
  
gap_gdp_summary

gap_summary2 <- gapminder %>% 
  group_by(continent, country) %>% 
  summarize(max_pop = max(pop),
            mean_gdpPercap = mean(gdpPercap))
  
# Let's get back to plotting

ggplot(data = gapminder, aes(x = year, y = lifeExp, by = country, color = pop)) +
  geom_line() +
  facet_wrap(~continent) +
  scale_color_gradient(low = "blue", high = "red") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  xlab("Year") +
  ylab("Life Expectancy") +
  ggtitle("Fancy Plot")

ggsave("fancyplot.png")
  
  
  
#example with plots

for (i in 1:length(unique(gapminder$country))) {
  
  country_i <- unique(gapminder$country)[i]
  
  gapminder_i <- filter(gapminder, country == country_i)
  
  ggplot(data = gapminder_i, aes(x = year, y = gdpPercap)) +
    geom_point() +
    ggtitle(country_i)
  
  ggsave(paste0(country_i, ".png"))  # paste0 avoids the space that paste() would put before ".png"
  
}
  
  



install knitr package

install.packages("knitr")
library(knitr)

different plots

plotly

End of workshop exercise: Day 2 Feedback

  1. Something positive about the workshop Day:
    The instructors are very knowledgeable and helpful. I learned a lot in the two-day workshop.
    Good overview of the kinds of things we can do with our new skills, although putting it into practice on my own seems a little daunting.

great code for plots. ++ :D

  2. Something you would like to learn more about or general feedback for improvements Day 2:
    It's a great two-day workshop packed with lots of information. However, I feel that this workshop should be split into multiple weeks (with 1-2 hrs a week). This will help us all (who know nothing about all this) to grasp the commands and principles in a better way. And some homework and self practice will help hone our skills.
    I think it would be helpful to allow for more time to understand what we're doing because sometimes you feel like you are typing things without really knowing why we're doing it. Maybe by making us read over the lessons ahead of time.

Take the post workshop survey: https://ucsdlib.github.io/2019-04-10-UCSD/
