# UC San Diego School of Global Policy and Strategy Skills Course
# 3 weeks - Introduction to R
GPS Auditorium: 6:30 - 9:30pm
[TritonEd Course Link](https://tritoned.ucsd.edu/webapps/blackboard/execute/announcement?method=search&context=course_entry&course_id=_18891_1&handle=announcements_entry&mode=view)
[Syllabus and Schedule](https://ucsdlib.github.io/win2019-gps-r/)
---
## Instructors:
Rick Mccosh
Stephanie Labou
Reid Otsuji
TA: Caio Mansini
Helper: Arden Tran
## Collaborative notes:
#Just like in Stata, whenever you use "#" R understands it as a comment
#Control+Shift+N (shortcut for a new R script) - just like a `do file` in Stata
#Always remember to comment what you're working on - it will help you later
## RStudio Cheatsheets!
https://www.rstudio.com/resources/cheatsheets/ - excellent resource for popular RStudio Packages.
## Week 1: First Script
#### Command+Enter (Mac Users) / Ctrl+Enter (Windows Users) -> Execute the commands in your script
1+100
2 - 4
2 * 3
(2+3)*3
# This is a comment, it will not be executed
Notice that mathematical functions in R receive their input value within ()
If you're not sure how to use a certain function, type ? before the function name and the help page will pop up on your screen
Hit the "Tab" key to autocomplete as you type, or to see all the available commands that start with the letters you have typed so far
?sin()
sin(1)
Logical Comparisons - Like Stata, use '==' to test whether two values are equal, '!=' for not equal, and '>', '<', '>=', '<=' for greater than / less than comparisons
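A quick sketch of the comparison operators in action. The values `a` and `b` are made up for illustration; each comparison returns a logical value (TRUE or FALSE).

```r
a <- 3
b <- 5

is_equal     <- a == b   # equal to
is_different <- a != b   # not equal to
is_less      <- a < b    # less than
is_at_least  <- a >= b   # greater than or equal to

# Print the four results together
c(is_equal, is_different, is_less, is_at_least)
```

Note that a single `=` is assignment, not comparison, so `a = b` would overwrite `a` instead of testing it.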
## Variables in R are assigned differently from other programming languages.
#### Most programming languages (Python, C, C++, etc)
var = 1
#### R
var <- 1
#### To check the content of the variable just type the name of the variable and hit enter
var
[1] 1
## Variable Names
- Can contain letters, numbers, underscores and periods
- Cannot start with a number or underscore
- Cannot have spaces or dashes (or punctuation besides period)
- Avoid starting with a period: it is acceptable, but makes a 'hidden variable'
<- or = ?
Either works in most cases, but stick with <- for variable assignment to be safe
### CHALLENGE 1
#### What will be the value of each variable after each statement in the following program?
mass <- 50
age <- 22
mass <- mass * 2
age <- age - 20
#### Output:
mass = 100
age = 2
## Vectors
R can store vectors, and it is often described as 'vectorized' because operations apply to every element of a vector at once
#### example:
x <- c(1,2,3,4,5)
x
#### Output:
[1] 1 2 3 4 5
#### Another example:
Every number between 1 and 5
1:5
#### Output:
[1] 1 2 3 4 5
#### And another:
ls() #list all variables
rm(x) #remove the variable
ls()
#### Output:
[1] Toyota
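To make 'vectorized' concrete, here is a minimal sketch (the variable names are only for illustration): arithmetic on a vector is applied to every element at once, with no loop needed.

```r
x <- c(1, 2, 3, 4, 5)

doubled <- x * 2     # every element multiplied by 2
shifted <- x + 1:5   # element-wise addition of two vectors
total   <- sum(x)    # functions like sum() take the whole vector

doubled
shifted
total
```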
## Installing Packages
install.packages("knitr")
- Syntactically fine to use either 'package' or "package" for argument
### checking on your downloaded packages and availability
- library()
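If you are not sure whether a package is already installed, one possible check is sketched below. `missing_packages` is a hypothetical helper written for these notes, not something shown in class; it relies on base R's `requireNamespace()`, which returns TRUE when a package can be loaded.

```r
# Hypothetical helper: return the names of packages that are not installed yet
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("ggplot2", "knitr", "gridExtra"))
to_install

# Install only what is missing (uncomment to run):
# if (length(to_install) > 0) install.packages(to_install)
```

The install line is left commented out so that running the snippet never downloads anything by accident.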
### CHALLENGE 2
#### Install the following packages:
- 'ggplot2'
- 'knitr'
- 'gridExtra'
## GAPMINDER data
GAPMINDER: Life expectancy, population, GDP per Capita, 1952 - 2007
- Big but not too big
- enough features, but not too many
- complete dataset, formatted well
- TED talk:
- https://www.........(ill come back to this)
## Navigating to the data set
In R Studio, set the working directory to the folder where the dataset is stored - for this example, per the instructions the data should be in a 'Data' folder, ideally on your desktop for ease of access
- Session > Set Working Directory > Choose Directory
Validate by running the following command in the console to make sure you're in the right folder where your data is stored:
- getwd()
### Reading the data set
Once you've set your working directory to your 'data' folder where your data set should be stored as either gapminder.csv or gapminderFiveYear.csv, you will now use the read method to read the file!
In the console run the following command:
read.csv("gapminder.csv")
# note that you can TAB to have R autocomplete the file name.
# This helps to avoid mistyping the file name to read!
You should see the data set print in the console, showing you that the data file was found and read. To keep the data for the later steps, assign it to an object:
gapminder <- read.csv("gapminder.csv")
### Concept: what did you just do?
So, conceptually, you got R and R Studio, additional packages for it, and then downloaded the data set. We then directed R Studio to 'look' into the right folder or directory where we have the data stored. Then you assigned that data set to an object variable with the read.csv command and assignment to 'gapminder' and then we just checked to see if that worked. So, start up + point it to the right folder + assigned the data and then checked!
#### Common issues explored:
- Filenames and extension issues!
- Make sure you haven't double-named your file! For example, make sure you did not name your file "gapminder.csv.csv"
- Set the right working directory!
- Sometimes it helps to restart your R studio session after setting your working directory by going to:
- Session > Restart R
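The checks above can also be done from the console. This is a minimal sketch using base R functions; it assumes the file is called "gapminder.csv" as in class, and it only prints a message if the file is not where R is looking.

```r
# Where is R looking right now?
getwd()

# Which CSV files can R see in that folder?
# (a double-named file such as "gapminder.csv.csv" would show up here)
list.files(pattern = "\\.csv$")

# Read the file only if it is really there
if (file.exists("gapminder.csv")) {
  gapminder <- read.csv("gapminder.csv")
} else {
  message("gapminder.csv not found in ", getwd())
}
```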
## Exploring Datasets
str(gapminder)
#### Just like the 'desc' command in Stata, str() provides information about the dataset like the number of variables, observations, etc.
View(gapminder) #Shows the data in tabular view (note the capital V)
## Data types
Strings -> Text
Integers -> Whole Numbers
Doubles -> Numbers with decimals
Logical -> TRUE or FALSE
Complex -> 3i
## To check the data type of a variable type
class(var)
class(y)
Y <- as.character(y)
class(Y)
as.integer(Y)
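A short sketch of the data types listed above and of converting between them. The values are made up for illustration; `class()` reports doubles as "numeric".

```r
class("Mexico")   # "character" (text)
class(5L)         # "integer" (the L suffix marks a whole number)
class(3.14)       # "numeric" (stored as a double)
class(TRUE)       # "logical"
class(3i)         # "complex"

# Converting between types
y      <- 42
y_text <- as.character(y)      # the number becomes the text "42"
y_back <- as.integer(y_text)   # and the text becomes a whole number again

# Text that is not a number cannot be converted: the result is NA
# (R also gives a warning, silenced here with suppressWarnings)
not_a_number <- suppressWarnings(as.integer("forty-two"))
not_a_number
```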
### CHALLENGE 3
#### Create a vector called Numbers consisting of every number between 10 and 20
What data type is it?
Numbers <- 10:20
class(Numbers)
#### Create a vector called Name with two elements: your first name and last name
What data type is it?
```
Name<-c("first name","last name")
class(Name)
```
## data frames! And some useful commands for analyzing at a high-level your data set now in a dataframe!
str(gapminder)
gapminder$continent
# note the use of the $ symbol - we will use this heavily
unique(gapminder$continent)
length(gapminder$continent)
summary(gapminder$lifeExp)
mean(gapminder$lifeExp)
### You can do it in R, just believe in yourself and your google skills.
Mainly Google though.
#subsetting
head(gapminder, 10)
tail(gapminder, 1)
#Use that ? for help on commands!
?head()
# This command below will output based on the arguments entered
gapminder[1,1]
# a little more complex
gapminder[1:5,c(1,3,5)]
### The above command traces as follows: rows 1 - 5, columns 1, 3, and 5
# the order of the data points matter for data frames
gapminder[gapminder$country=="Australia" & gapminder$year== 1952, c("year", "lifeExp")]
### Soft Wrap (make that code wrap around and not bleed off the page)
Go to: Tools > Global Options > Code > Soft wrap R source files
# subsetting continued
subset(gapminder, gapminder$country=="Australia" & gapminder$year == 1952, c("year", "lifeExp"))
### What purpose does subsetting serve?
- "Keep if"
- aids in analysis for chunking into smaller groups
- why use subset over brackets?
- use cases vary and need will determine best use to execute (tl;dr = each command has its place for best use)
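To compare the two styles side by side, here is a small sketch on a toy data frame. The data frame `toy` and its values are made up for illustration (they are not the real gapminder numbers), so the snippet runs without the CSV file.

```r
# Toy data frame with made-up values
toy <- data.frame(
  country = c("A", "A", "B", "B"),
  year    = c(1952, 2007, 1952, 2007),
  lifeExp = c(50, 70, 40, 60),
  stringsAsFactors = FALSE
)

# Bracket style: rows before the comma, columns after
by_brackets <- toy[toy$country == "A" & toy$year == 2007, c("year", "lifeExp")]

# subset() style: column names can be used without the toy$ prefix
by_subset <- subset(toy, country == "A" & year == 2007, select = c("year", "lifeExp"))

by_brackets
by_subset
```

Both calls keep the same single row. `subset()` tends to read more easily at the console, while brackets are the more general tool.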
### CHALLENGE 4
#### Use your new subsetting skills to display the life expectancy and GDP per capita for people in Paraguay in 2007
gapminder[gapminder$country=="Paraguay" & gapminder$year==2007, c("year", "lifeExp", "gdpPercap")]
### Parentheses and what to watch out for!
- Make sure to avoid syntax errors and failure to execute your code by ensuring you close out every parenthetical! tl;dr = () always
### IF Statements
#### Data Dependent Choices
x <- 8
if(x>=10) {
print("x is greater than or equal to 10")
} else{
print("x is less than 10")
}
#### What happened?
We assigned the value 8 to the variable 'x' and then used an IF statement to evaluate whether x is greater than or equal to 10. If it is, the print function tells us "x is greater than or equal to 10"; if not, the else branch prints "x is less than 10". Thanos did nothing wrong.
#### Another example for the IF statement
x <- 8
if(x>=10) {
print("x is greater than or equal to 10")
} else if(x>5) {
print("x is greater than 5, but less than 10")
} else {
print("x is less than or equal to 5")
}
#### Explain like I'm 5
Just like the prior example, we first assign the value 8 to the variable x. R then checks the conditions from top to bottom and runs only the first branch that is TRUE. If x is greater than or equal to 10, R tells us so via the print function. If not, it checks whether x is greater than 5 (which, at this point, also means less than 10) and prints that. If neither condition holds, the final else prints that x is less than or equal to 5.
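Branch order matters in an `if` / `else if` chain, because R runs only the first branch whose condition is TRUE. A minimal sketch: `describe_x` is a hypothetical helper written for these notes so the same chain can be tried on several values.

```r
# Hypothetical helper: return a description of x instead of printing it
describe_x <- function(x) {
  if (x >= 10) {
    "x is greater than or equal to 10"
  } else if (x > 5) {
    "x is greater than 5, but less than 10"
  } else {
    "x is less than or equal to 5"
  }
}

describe_x(12)
describe_x(8)
describe_x(3)
```

Putting the widest condition first would make the later branches unreachable, which is why the strictest test comes first here.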
### R package: 'KNITR'
- file > new file > R MarkDown
- You get a prompt then enter title: "GPS-Intro-R" and author: "Not Rick"
- save to current working directory
- Click the "Knit" button right under the tabs within your R studio editor
- VOILA - what just happened?!?!
- KNITR helps us make documents easier to read by embedding them into a file format sprinkled with the wonders of the graphical formatting of MarkDown. If you've used jupyter notebooks or iPython notebooks, this is essentially that without the kerneling. Also, if you're reading this page and select the 'dual-pane' icon to your top left by the HackMD logo, you'll notice that the webpage shows a Markdown section on the left and a formatted page to the right. This is the transformation to make the text output in R more readable and user-friendly that you see with KNITR.
```
## R Markdown
# Title
## Title 2
### Title 3
#### Title 4
This is some example text we can make a **bold** word or an *italics*
- bulleted list
H~2~O subscripting
```
The above in Markdown becomes what you see below!
## R Markdown
# Title
## Title 2
### Title 3
#### Title 4
This is some example text we can make a **bold** word or an *italics*
- bulleted list
H~2~O subscripting
### Code chunking
```{r}
x <- 8
x
y <- x + 1
y
```
#### WHERE IS THE ` SYMBOL? TRY TOP-LEFT ON KEYBOARD WHERE THE TILDE ~ SYMBOL IS.
### inline code
`r 2+2`
there are `r length(1:10)`
When the file is knit, each piece of inline code above is replaced in the output document by its result; for example, the second line would read "there are 10".
# Week 2: Plotting with R
## The Sticky Note Pedagogy
You'll be given sticky notes in pink and yellow/green to use as a signalling method for class.
- Stuck and need help? Pink sticky note on laptop!
- Confirming that you're good to proceed? Yellow/green note on laptop!
## Plotting with R: Getting Started
- Make sure you have the data set from last week! Look for gapminder.csv
- Open R Studio and [set your working directory](https://hackmd.io/grZZYR22RC25PU4l2TibTw?both#Navigating-to-the-data-set) (remember, this is where you point your environment to "look" into the right folder - and in this case, where our data file "gapminder.csv" is)
- hint: "Session > Set Working Directory"
## Plotting with R: Install library packages!
#GPS 2019 Week 2
installed.packages()
install.packages("gridExtra")
library(ggplot2)
library(gridExtra)
library(knitr)
gapminder <- read.csv("gapminder.csv")
Run the above code to install the gridExtra package and then load ggplot2, gridExtra and knitr with library(); if a package is missing, library() will stop with an error. The last command reads our data file "gapminder.csv" with read.csv and assigns it to the variable "gapminder".
If you run into issues with not having the package, or package not found, just download via the install.packages("packagename") command:
install.packages("ggplot2")
install.packages("gridExtra")
install.packages("knitr")
## ggplot: Syntax and arguments
ggplot2 - "The Grammar of Graphics" not "Good Game"
#Our first plot
ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp)) + geom_point()
#saving plots
ggsave("LifeExpectancyByGDPperCap.pdf")
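#dev.off() is only needed after opening a graphics device yourself, e.g. with pdf(), jpeg() or png(); ggsave() handles that for you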
#jpeg()
#png()
- ![Example Output Plot Image](https://i.imgur.com/RUluMjD.png)
### Challenge 1
#### Create a scatter plot of Population over Year
- year on X-axis
- pop on Y-axis
- ![Example Output of Challenge 1](https://i.imgur.com/uVp0zn2.png)
- Solution:
ggplot(data = gapminder, aes(x = year, y = pop)) + geom_point()
More Arguments
ggplot(data = gapminder, aes(x = year, y = pop)) +
geom_point() + theme_classic()
ggplot(data = gapminder, aes(x = year, y = pop, by = country)) +
geom_line() + theme_classic()
Now we're going to do points and lines!
ggplot(data = gapminder, aes(x = year, y = pop, by = country)) +
geom_line(aes(color = continent)) + geom_point() + theme_classic()
- ![Example Output](https://i.imgur.com/UVO1XI8.png)
What about colors?
ggplot(data = gapminder, aes(x = year, y = pop, by = country)) +
geom_line(aes(color = continent)) + geom_point(color = "red") + theme_classic()
![Output example for color added](https://i.imgur.com/ND3EV4r.png)
And another plot!
Let's say we're interested in Oceania...
Now, let's subset the gapminder data to only show the continent Oceania
[Subsetting notes](https://hackmd.io/grZZYR22RC25PU4l2TibTw?both#data-frames-And-some-useful-commands-for-analyzing-at-a-high-level-your-data-set-now-in-a-dataframe)
ggplot(data = gapminder, aes(x = year, y = lifeExp, by = country)) + geom_line(color = "blue") + geom_line(data = subset(gapminder, continent == "Oceania"), aes(x = year, y = lifeExp, by = country), color = "red")
![Oceania Subsetting Example](https://i.imgur.com/Bb3KhYd.png)
### Challenge 2
#### Make a line plot of lifeExp over year, with the country of your choice highlighted (contrasting color on top of all other data points)
- Y-axis: lifeExp
- X-axis: year
```
P <- ggplot(data = gapminder, aes(x = year, y = lifeExp, by = country))
P <- P + geom_line(color = "blue")
P <- P + geom_line(data = subset(gapminder, continent=="Oceania"), aes(x = year, y = lifeExp, by = country), color = "red")
P <- P + labs(title = "Oceania on the rise", x = "Year", y = "Life Expectancy")
P
```
![Example output of Oceania on the rise](https://i.imgur.com/Y1wjPEF.png)
Some more arguments
```
P <- P + scale_x_continuous(breaks = seq(1952, 2007, 5), labels = seq(1952, 2007, 5))
P
```
![Sequencing Output example](https://i.imgur.com/HQTl1dP.png)
- Notice how we now have a sequence of years by margins of 5 across our X axis
#### Some more args (short for argument) to get that sweet, sweet COLORS
```
ggplot(data = gapminder, aes(x= year, y = lifeExp, by = country)) + geom_line(aes(x= year, y = lifeExp, by = country, color = continent))
```
![Witness me colors](https://i.imgur.com/OFF8Q5A.png)
- Notice how each country's line is now colored by its continent! Neat!
- Now let's revamp it because XP
```
P <- ggplot(data = gapminder)
P <- P + geom_line(aes(x = year, y = lifeExp, by = country, color = continent))
P <- P + scale_color_manual(values = c("red", "blue", "green", "yellow", "pink"))
P
```
![manual color output example](https://i.imgur.com/lTObsL8.png)
- notice how now the data is reflected with the manual color entries we made in the above code
## For Loops: Iterating over your R code
for(i in 1:10){
print(i)
}
output_vector <- c()
for(i in 1:5){
for(j in c('a', 'b', 'c', 'd', 'e')){
temp_output <- paste(i, j)
output_vector <- c(output_vector, temp_output)
}
}
output_vector
output - you should get something like this:
[1] "1 a" "1 b" "1 c" "1 d" "1 e" "2 a" "2 b" "2 c" "2 d" "2 e" "3 a" "3 b" "3 c" "3 d" "3 e" "4 a" "4 b" "4 c" "4 d" "4 e" "5 a"
[22] "5 b" "5 c" "5 d" "5 e"
For loops are the most common way to iterate in data analysis - they go over your data again and again so you don't have to repeat the work manually.
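A common loop pattern is to store one result per iteration. The sketch below is an illustration rather than class material: it creates the output vector at its final length first, which avoids growing it with `c()` on every pass.

```r
# Squares of 1 to 5, one result per iteration
squares <- numeric(5)   # pre-allocate: a vector of five zeros
for (i in 1:5) {
  squares[i] <- i^2
}
squares

# The same result without a loop, because arithmetic in R is vectorized
squares_vectorized <- (1:5)^2
squares_vectorized
```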
### A plot for each continent
plotC <- list()
for(i in unique(gapminder$continent)){
data1 <- subset(gapminder, continent == i)
plotC[[i]] <- ggplot(data = data1) + geom_line(aes_string(x = "year", y = "lifeExp", by = "country"), color = "red")
}
plotC
grid.arrange(grobs = plotC, nrow = 3)
#issues with grid.arrange not found? Try this
install.packages("gridExtra")
library(gridExtra)
#ok, now try again
grid.arrange(grobs = plotC, nrow = 3)
![Grid.Arrange output example](https://i.imgur.com/4At1Hwq.png)
- Notice how our previous output for ```plotC``` was 5 separate plots which you could navigate through like a gallery in R studio, and now, with the ```grid.arrange()``` function we can see all of those plots combined into one plot output.
### Labels... *how do*?
plotC <- list()
for(i in unique(gapminder$continent)){
data1 <- subset(gapminder, continent == i)
plotC[[i]] <- ggplot(data = data1) + geom_line(aes_string(x = "year", y = "lifeExp", by = "country"), color = "red")
plotC[[i]] <- plotC[[i]] + labs(title = i, x = "YEAR", y = "LIFE EXPECTANCY")
}
plotC
grid.arrange(grobs = plotC, nrow = 3)
![plotC with explicit labels](https://i.imgur.com/J65Mij1.png)
- Note the plot now has labels affixed
```
plotC_grob <- arrangeGrob(grobs = plotC, nrow = 3, top = "life expectancy per continent over time")
ggsave("multipanelContinent.pdf", plotC_grob)
```
Output should read something like: `Saving 9.17 x 8.11 in image`
- Where did it go? Into your working directory
- What's my current working directory? Use this code to find out: `getwd()`
- navigate there and you'll find the PDF file (hopefully)
- Having difficulty? Try CTRL+C and CTRL+V with the above code - but seriously, put a pink sticky note or raise your hand for help!
## 5 + 1 continents = 6 continents - our next challenge
- nesting `IF` statements (think Inception, but with `IF` statements)
`#COPY THIS; PROFESSORS HATE THIS`
```NAmerica <- c("Antigua and Barbuda", "Bahamas", "Barbados", "Belize", "Canada", "Costa Rica", "Cuba", "Dominica", "Dominican Republic", "El Salvador", "Grenada", "Guatemala", "Haiti", "Honduras", "Jamaica", "Mexico", "Nicaragua", "Panama", "Puerto Rico", "Saint Kitts and Nevis", "Saint Lucia", "Saint Vincent and the Grenadines", "Trinidad and Tobago", "United States")```
- sometimes code copied and pasted (CTRL+C / CTRL+V) from markdown pages picks up extra spaces or indents and will not run. Type suspect code in by hand instead.
```
continent6 <- ifelse(gapminder$country %in% NAmerica, "NorthAmerica", ifelse(as.character(gapminder$continent) %in% "Americas", "SouthAmerica", ifelse(gapminder$continent %in% c("Africa", "Asia", "Europe", "Oceania"), paste(gapminder$continent), " ")))
continent6
```
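- `%in%` tests membership: for each value on the left it returns TRUE if that value appears anywhere in the vector on the right (similar to `==`, but against a whole set of values)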
- [A little more about 'matching' in R](https://www.rdocumentation.org/packages/base/versions/3.5.2/topics/match)
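A small sketch of `%in%` on its own and combined with `ifelse()`. The vectors `countries` and `north` are short examples made up for illustration, not the full list used in class.

```r
countries <- c("Canada", "Brazil", "Mexico", "Chile")
north     <- c("Canada", "Mexico", "United States")

# For each element on the left: does it appear anywhere on the right?
is_north <- countries %in% north
is_north

# Use the TRUE/FALSE result to label each element
region <- ifelse(is_north, "NorthAmerica", "SouthAmerica")
region
```

Nesting a second `ifelse()` in the "no" position, as in the class code, simply adds another question for the elements that failed the first test.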
```
gapminder6 <- cbind(gapminder, continent6)
View(gapminder6)
```
- Hint: be aware of your casing and make sure you don't cite an object with the wrong case! i.e. continent6 is not the same as Continent6
- The function View() has a capital 'V' and view() will not resolve
```
plotC <- list()
for(i in unique(gapminder6$continent6)){
data1 <- subset(gapminder6, continent6 == i)
plotC[[i]] <- ggplot(data = data1) + geom_line(aes_string(x = "year", y = "lifeExp", by = "country"), color = "red")
}
plotC
grid.arrange(grobs = plotC, nrow = 3)
```
- Hint: if you have an issue with `grid.arrange` function not found, try this:
```
#this loads that library back into your environment
library(gridExtra)
grid.arrange(grobs = plotC, nrow = 3)
```
![continent6 gridarrange output example](https://i.imgur.com/lDiK82A.png)
- Note that now we have 6 continents displayed! Hooray! We did it!
- Oh wait, we have to save now....
```
write.table(gapminder6, file="gapminder-6Continents.csv", sep=",", quote=FALSE, row.names=FALSE)
```
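After saving, it can be worth reading the file back to confirm it holds what you expect. The sketch below uses a toy data frame with made-up values and `tempfile()` for a throwaway path; in class you would use a file name in your working directory. It uses `write.csv()`, which is `write.table()` with comma settings built in and which keeps quotes around text by default. That can matter if any value itself contains a comma.

```r
# Toy data frame with made-up values
toy <- data.frame(country = c("A", "B"), lifeExp = c(70.5, 60.2))

# Write to a temporary file, leaving out the row numbers
out_file <- tempfile(fileext = ".csv")
write.csv(toy, out_file, row.names = FALSE)

# Read it back and inspect the structure
check <- read.csv(out_file)
str(check)
```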
- sorry, that was brutal
- IF YOU LIKED THESE NOTES, PLEASE SUPPORT MY GOFUNDME @ NEEDMOARRAMFORMYPC, just kidding, have a great week!
## WEEK 2 COMPLETE :-D
# Week 3: Data Wrangling with R
### TritonEd issues and submission help:
- If you have issues submitting your assignment via TritonEd, please email it in to Reid at <rotsuji@ucsd.edu>
### What is "data wrangling"?
- Data isn't in the format we want, either for summary statistics, or models, or for making plots.
- Or maybe we want to make new columns that are derivatives of existing columns - how would we do that?
- We've worked a bit with `subset()`, but the code gets wordy - is there ....(come back to this)
## The dplyr package
### Getting started
- go to your R studio, create a new file
- What is my current working directory? Try `getwd()` this will show you your current working directory
- Set your current working directory with `setwd("the path you want")`
- e.g. `setwd("~/OneDrive/Documents/GPS/Rskillz")`
- Download the right packages with `install.packages(c("dplyr", "tidyr", "ggplot2"))`
- this is a shortcut to avoid having to type out multiple install package lines.
```
library(dplyr) # load dplyr so that select(), filter() and %>% are available
gapminder <- read.csv("gapminder_data.csv")
head(gapminder)
str(gapminder)
# first magic function: select()
# select() is used to select columns
year_country_pop <- select(gapminder, country, year, pop)
```
### Exercise 1
#### Make a new dataframe called "new_data" that has only the columns country, life expectancy, and GDP per capita
```
# hint: its like the same thing above. "Same but different" - James Franco
new_data <- select(gapminder, country, lifeExp, gdpPercap)
```
A new function to use: `filter()`
```
#filter() is used to subset rows
long_life <- filter(gapminder, lifeExp >= 60)
```
`select()` and `filter()` are functions that should be used in tandem and provide great functionality for wrangling (or munging) your data before analysis
`# this is the magic pipe %>%`
It is actually called the 'Pipe' but its magicalness is to be respected per Stephanie
For example:
```
gapminder %>%
select(country, year, pop) %>%
filter(country == "Canada")
```
So we can take the filter function even further with the above code and produce something like:
```
gapminder %>%
select(country, year, pop) %>%
filter(country == "Canada" & year == 1992)
gapminder %>%
filter(lifeExp >= 50) %>%
select(country, year, pop)
```
Ok, that's super rad, but what about like...3 arguments in there? Like this (strictly just for filtering):
```
# from the R documentation - abridged for our data example
# Multiple criteria
filter(gapminder, country == "Canada" & year == 1992)
filter(gapminder, country == "Canada" | year == 1992)
# Multiple arguments are equivalent to and
filter(gapminder, country == "Canada", year == 1992, lifeExp >= 50)
```
### Exercise 2
#### Create a new dataframe, called "gap_Africa", that has life expectancy, country, and year, for only African countries. How many rows and columns does this dataframe have?
The answer right after this commercial break
```
# hint: use 'continent' not 'country'
gap_Africa <- gapminder %>%
filter(continent == "Africa") %>%
select(country, year, lifeExp)
str(gap_Africa)
#output should read:
'data.frame': 624 obs. of 3 variables:
$ country: Factor w/ 142 levels "Afghanistan",..: 3 3 3 3 3 3 3 3 3 3 ...
$ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
$ lifeExp: num 43.1 45.7 48.3 51.4 54.5 ...
```
There are other ways to slice and dice the data to arrive at this conclusion. But remember that we should always strive for clean, efficient, concise code.
### Mutate function (like X-men, we're creating something new)
```
# now we move on to mutate() which creates NEW columns
gapminder_new <- gapminder %>%
mutate(gdp = gdpPercap * pop)
# We're going to calculate entire GDP
head(gapminder, n = 3)
head(gapminder_new, n = 3)
# we did create a new dataframe implicit within the call above.
```
For best practice, when you read in your data, give the original data set a name with something like 'raw' in it, for example:
`gapminder_raw <- read.csv("gapminder_data.csv")`
Let that be your untouched original for comparison or backup, and then create new dataframes via the assignment operator:
`New_Dataframe <- read.csv("gapminder_data.csv")`
`gapminder$gdp <- (gapminder$pop * gapminder$gdpPercap)`
the above code creates a new column within the dataframe gapminder. If we wanted to add that same calculation to say "gapminder_new" we could do the same with the following:
`gapminder_new$gdp <- (gapminder_new$pop * gapminder_new$gdpPercap)`
### Exercise 3
#### Create a new dataframe that has a new column with the ratio of life expectancy to GDP per capita. Keep only the country and ratio column. Hint: think about the order of operations.
```
gapminder_new <- gapminder %>%
mutate(new_col = "I am a new column")
gap_ratio <- gapminder %>%
mutate(life_gdppercap = lifeExp / gdpPercap) %>%
select(country, life_gdppercap)
gap_life <- gapminder %>%
mutate(longlife = ifelse(lifeExp >= 50, "longer life", "shorter life"))
head(gap_life)
gap_new2 <- gapminder %>%
mutate(GDP= gdpPercap* pop ) %>%
mutate(Big_Spender= ifelse ( GDP>10000000000, "Yes","No"))
# output should be:
country year pop continent lifeExp gdpPercap gdp gdp2 longlife
1 Afghanistan 1952 8425333 Asia 28.801 779.4453 6567086330 6567086330 shorter life
2 Afghanistan 1957 9240934 Asia 30.332 820.8530 7585448670 7585448670 shorter life
3 Afghanistan 1962 10267083 Asia 31.997 853.1007 8758855797 8758855797 shorter life
4 Afghanistan 1967 11537966 Asia 34.020 836.1971 9648014150 9648014150 shorter life
5 Afghanistan 1972 13079460 Asia 36.088 739.9811 9678553274 9678553274 shorter life
6 Afghanistan 1977 14880372 Asia 38.438 786.1134 11697659231 11697659231 shorter life
```
### Exercise 4
#### Let's try a challenge question!
#### Create a new dataframe called "gap_new2" that has **two new columns**:
1. one new column called "GDP" with total GDP (remember GDP = gdp per capita * population)
2. another new column called "big_spender" that has "yes" if GDP is greater than 10 billion (10,000,000,000) and "no" if the GDP is less than 10 billion
```
gap_new2 <- gapminder %>%
mutate(gdp = gdpPercap * pop) %>%
mutate(big_spender = ifelse(gdp > 10000000000, "yes", "no"))
# More succinct, same result, shorter code:
gap_new2 <- gapminder %>%
mutate(gdp = gdpPercap * pop, big_spender = ifelse(gdp > 10000000000, "yes", "no"))
# you can put multiple args in the mutate() function, and that would help with keeping it concise, but sometimes at the cost of readability and cleanliness
head(gap_new2)
#output should be:
country year pop continent lifeExp gdpPercap gdp gdp2 big_spender
1 Afghanistan 1952 8425333 Asia 28.801 779.4453 6567086330 6567086330 no
2 Afghanistan 1957 9240934 Asia 30.332 820.8530 7585448670 7585448670 no
3 Afghanistan 1962 10267083 Asia 31.997 853.1007 8758855797 8758855797 no
4 Afghanistan 1967 11537966 Asia 34.020 836.1971 9648014150 9648014150 no
5 Afghanistan 1972 13079460 Asia 36.088 739.9811 9678553274 9678553274 no
6 Afghanistan 1977 14880372 Asia 38.438 786.1134 11697659231 11697659231 yes
```
- `select()`: columns
- `filter()`: rows
- `mutate()`: create new column
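The three verbs chain together with the pipe. Here is a minimal sketch on a toy data frame; `toy` and its values are made up for illustration (not the real gapminder numbers), and the snippet assumes the dplyr package is installed.

```r
library(dplyr)

# Toy data frame with made-up values
toy <- data.frame(
  country   = c("A", "A", "B", "B"),
  year      = c(1952, 2007, 1952, 2007),
  pop       = c(100, 200, 1000, 2000),
  gdpPercap = c(10, 20, 5, 8)
)

result <- toy %>%
  filter(year == 2007) %>%            # rows: keep only 2007
  mutate(gdp = gdpPercap * pop) %>%   # new column: total GDP
  select(country, gdp)                # columns: keep only these two

result
```

The order matters: `select()` comes last here because `mutate()` still needs the `gdpPercap` and `pop` columns.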
### `gapminder %>% group_by(a)`
#### using group_by(a) to segment your dataframe into segments for more purposeful analysis and beyond
```
#let's take a look at group_by(a) and summarize()
gapminder %>%
group_by(continent) %>%
summarize(mean_gdppercap = mean(gdpPercap))
# output should look like this:
# A tibble: 5 x 2
continent mean_gdppercap
<fct> <dbl>
1 Africa 2194.
2 Americas 7136.
3 Asia 7902.
4 Europe 14469.
5 Oceania 18622.
```
Let's continue and add some more args
```
gapminder %>%
group_by(continent) %>%
summarize(mean_gdppercap = mean(gdpPercap)) %>%
select(continent, mean_gdppercap) %>%
filter(continent == "Oceania")
```
### Exercise 5
#### Find maximum life expectancy for each country. What is the maximum life expectancy for Australia? (using summarize and filter)
```
gapminder %>%
group_by(country) %>%
summarize(max_life = max(lifeExp)) %>%
select(country, max_life) %>%
filter(country == "Australia")
# output should be:
# A tibble: 1 x 2
country max_life
<fct> <dbl>
1 Australia 81.2
```
What about multiple args for `group_by()`?
```
gapminder %>%
group_by(country, continent) %>%
summarize(max_life = max(lifeExp)) %>%
head()
# output:
# A tibble: 6 x 3
# Groups: country [6]
country continent max_life
<fct> <fct> <dbl>
1 Afghanistan Asia 43.8
2 Albania Europe 76.4
3 Algeria Africa 72.3
4 Angola Africa 42.7
5 Argentina Americas 75.3
6 Australia Oceania 81.2
```
What if we want to order our country by maximum life expectancy and for the top 5 countries?
```
gapminder %>%
group_by(country) %>%
summarize(max_life = max(lifeExp)) %>%
arrange(desc(max_life))
# output:
# A tibble: 142 x 2
country max_life
<fct> <dbl>
1 Japan 82.6
2 Hong Kong China 82.2
3 Iceland 81.8
4 Switzerland 81.7
5 Australia 81.2
6 Spain 80.9
7 Sweden 80.9
8 Israel 80.7
9 France 80.7
10 Canada 80.7
# … with 132 more rows
```
Now with filter for Japan and selected lifeExp value
```
gapminder %>%
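filter(country == "Japan" & lifeExp == 82.6)
# careful: this can return zero rows, because the stored value has more decimals than the rounded 82.6 shown in the tibble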
gapminder %>%
group_by(country) %>%
mutate(max_life = max(lifeExp)) %>%
filter(lifeExp == max_life) %>%
arrange(desc(max_life))
# output:
# A tibble: 142 x 9
# Groups: country [142]
country year pop continent lifeExp gdpPercap gdp gdp2 max_life
<fct> <int> <dbl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Japan 2007 127467972 Asia 82.6 31656. 4.04e12 4.04e12 82.6
2 Hong Kong China 2007 6980412 Asia 82.2 39725. 2.77e11 2.77e11 82.2
3 Iceland 2007 301931 Europe 81.8 36181. 1.09e10 1.09e10 81.8
4 Switzerland 2007 7554661 Europe 81.7 37506. 2.83e11 2.83e11 81.7
5 Australia 2007 20434176 Oceania 81.2 34435. 7.04e11 7.04e11 81.2
6 Spain 2007 40448191 Europe 80.9 28821. 1.17e12 1.17e12 80.9
7 Sweden 2007 9031088 Europe 80.9 33860. 3.06e11 3.06e11 80.9
8 Israel 2007 6426679 Asia 80.7 25523. 1.64e11 1.64e11 80.7
9 France 2007 61083916 Europe 80.7 30470. 1.86e12 1.86e12 80.7
10 Canada 2007 33390141 Americas 80.7 36319. 1.21e12 1.21e12 80.7
# … with 132 more rows
```
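Tibbles print rounded numbers, so filtering with `==` on a value copied from the printout can silently return nothing. The sketch below illustrates this with a made-up `toy` data frame (invented values) and shows two alternatives: `dplyr::near()` with a tolerance, and comparing against the computed maximum. The tolerance of `0.01` is an arbitrary choice for this example.

```r
library(dplyr)

# Hypothetical toy data, NOT real gapminder values
toy <- data.frame(
  country = c("X", "X", "Y"),
  lifeExp = c(80.123, 82.603, 79.5),
  stringsAsFactors = FALSE
)

# Exact match on the rounded value: 82.603 is not equal to 82.6
exact <- toy %>% filter(lifeExp == 82.6)

# Match within a tolerance instead of exactly
close <- toy %>% filter(near(lifeExp, 82.6, tol = 0.01))

# Compare the column with its own maximum, so no value is typed by hand
by_max <- toy %>% filter(lifeExp == max(lifeExp))
```

Comparing against a value computed from the data avoids retyping a rounded number, which is why the `mutate(max_life = max(lifeExp))` approach in the notes works.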
- Does R support wildcards? Yes, through regular expressions (RegEx). For more info, check out: https://regex101.com/ and https://regexr.com/
- How do you search for a substring, as in: if the string contains "XXX", then ...? See: http://www.datasciencemadesimple.com/sub-gsub-function-in-r/
- You would use `gsub` to replace text (in the original documentation it is listed under the `grep` function here: https://www.rdocumentation.org/packages/base/versions/3.5.2/topics/grep)
- For pattern matching and string manipulation in general, see the same `grep` documentation page: https://www.rdocumentation.org/packages/base/versions/3.5.2/topics/grep
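As a rough illustration of the questions above, here is a small base R sketch using `grepl()` (does the string contain a pattern?) and `gsub()` (replace every match). The `countries` vector and the labels are made up for this example.

```r
# Hypothetical example strings
countries <- c("Hong Kong China", "Korea Rep", "China", "Japan")

# "If the string contains 'China'": grepl() returns TRUE/FALSE per element
has_china <- grepl("China", countries)

# Regular expressions act as wildcards: ^ anchors the match to the start
starts_china <- grepl("^China", countries)

# Use the TRUE/FALSE result in an if/else style assignment
label <- ifelse(has_china, "contains China", "other")

# gsub() replaces every match; here each space becomes an underscore
cleaned <- gsub(" ", "_", countries)
```

Inside a `dplyr` pipeline the same idea would look like `filter(grepl("China", country))`.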
```
gapminder %>%
  filter(continent == "Europe") %>%
  ggplot(aes(x = year, y = lifeExp)) +
  geom_line(aes(group = country, color = country)) # ggplot2's grouping aesthetic is `group`, not `by`
```
Output:
![](https://i.imgur.com/FQiqkHo.png)
```
gapminder %>%
  filter(country %in% c("Canada", "Mexico", "United States")) %>%
  ggplot(aes(x = year, y = lifeExp)) +
  geom_line(aes(group = country, color = country)) # ggplot2's grouping aesthetic is `group`, not `by`
```
Output:
![](https://i.imgur.com/7zagNmX.png)
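The filter above uses `%in%` rather than `==` because it tests each value against a whole set of allowed values. A minimal base R sketch, with made-up vectors:

```r
# Hypothetical example vectors
country <- c("Canada", "Mexico", "Japan", "United States")
wanted  <- c("Canada", "Mexico", "United States")

# TRUE wherever the element of `country` appears anywhere in `wanted`
keep <- country %in% wanted

country[keep]
```

Using `country == wanted` here would instead compare the two vectors position by position (recycling the shorter one), which is usually not what you want when filtering.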
## Long vs wide data
- In the long format:
    - each column is a variable
    - each row is an observation
    - you have one column for the observed variable, and the other columns are ID variables
- In the wide format:
    - each row is often a site, subject, or patient
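To make the difference concrete, here is the same made-up information written both ways in base R. The country names and numbers are invented for illustration.

```r
# Wide: one row per country, one column per (variable, year) combination
wide <- data.frame(
  country      = c("X", "Y"),
  lifeExp_1952 = c(40, 50),
  lifeExp_2007 = c(60, 70),
  stringsAsFactors = FALSE
)

# Long: one row per observation; country and year are the ID variables,
# lifeExp is the observed variable
long <- data.frame(
  country = c("X", "X", "Y", "Y"),
  year    = c(1952, 2007, 1952, 2007),
  lifeExp = c(40, 60, 50, 70),
  stringsAsFactors = FALSE
)

dim(wide)  # fewer rows, more columns
dim(long)  # more rows, fewer columns
```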
### More about long vs wide data
- researchers often want to reshape their data from the wide to the long format, or vice versa
- you may find data ....
### "Tidy" Data
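- just means data formatted the way R (especially the packages used in this course) expects: each variable in its own column and each observation in its own row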
### "Wide" format
- arguably most annoying format for data
- rivals 'long' format for annoying style of formatting data
### `gap_wide`
Get the gap_wide data [here](https://tritoned.ucsd.edu/bbcswebdav/pid-1467433-dt-content-rid-18837619_1/xid-18837619_1?target=blank)
```
# make sure you have downloaded the gapminder_wide.csv dataset
gap_wide <- read.csv("gapminder_wide.csv")
gap_long <- gap_wide %>%
  gather(key = "type_year", value = "obvs_values",
         c(-continent, -country)) # this translates as 'everything except continent and country'
head(gap_long)
# output:
continent country type_year obvs_values
1 Africa Algeria gdpPercap_1952 2449.0082
2 Africa Angola gdpPercap_1952 3520.6103
3 Africa Benin gdpPercap_1952 1062.7522
4 Africa Botswana gdpPercap_1952 851.2411
5 Africa Burkina Faso gdpPercap_1952 543.2552
6 Africa Burundi gdpPercap_1952 339.2965
gap_long <- gap_long %>%
  separate(type_year, into = c("variable_type", "year"), sep = "_")
gap_long %>%
  group_by(variable_type) %>%
  summarize(max = max(year)) # year is still text after separate(), so this compares strings
gap_wide2 <- gap_long %>%
  spread(key = variable_type, value = obvs_values)
head(gap_wide2)
# output:
continent country year gdpPercap lifeExp pop
1 Africa Algeria 1952 2449.008 43.077 9279525
2 Africa Algeria 1957 3013.976 45.685 10270856
3 Africa Algeria 1962 2550.817 48.303 11000948
4 Africa Algeria 1967 3246.992 51.407 12760499
5 Africa Algeria 1972 4182.664 54.518 14760787
6 Africa Algeria 1977 4910.417 58.014 17152804
```
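Newer versions of `tidyr` (1.0.0 and later) also provide `pivot_longer()` and `pivot_wider()`, which cover the same ground as `gather()`, `separate()`, and `spread()`. The sketch below is an alternative to the class code, not what was shown in the session. It uses a made-up `toy_wide` data frame (invented values) and assumes `dplyr` and `tidyr` are installed.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data, NOT the real gapminder_wide.csv
toy_wide <- data.frame(
  country      = c("X", "Y"),
  lifeExp_1952 = c(40, 50),
  lifeExp_2007 = c(60, 70),
  pop_1952     = c(100, 200),
  pop_2007     = c(150, 250),
  stringsAsFactors = FALSE
)

# Wide -> long, splitting names like "lifeExp_1952" at the "_" in one step
toy_long <- toy_wide %>%
  pivot_longer(cols = -country,                          # everything except country
               names_to = c("variable_type", "year"),
               names_sep = "_",
               values_to = "obvs_values")

# Long -> one column per variable type, one row per country and year
toy_wide2 <- toy_long %>%
  pivot_wider(names_from = variable_type, values_from = obvs_values)

toy_wide2
```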
- if you're having function not found problems, I feel bad for you son, I got a lot of function not found errors but %>% isn't one of them
- Make sure to load your packages with `library()` each time you start RStudio:
    - `library(dplyr)`
    - `library(ggplot2)`
    - `library(tidyr)`
- If you don't, functions from those packages will not be found, even though the packages were installed earlier. Installing a package happens once; loading it with `library()` has to happen in every new session.
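If you are unsure whether a package is installed at all, one possible check (a sketch, not something covered in the session) uses base R's `requireNamespace()` before calling `library()`. The `pkgs` vector simply lists the packages named above.

```r
pkgs <- c("dplyr", "ggplot2", "tidyr")

# TRUE for each package that is installed and can be loaded
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

missing_pkgs <- pkgs[!installed]
if (length(missing_pkgs) > 0) {
  # These need install.packages() once before library() will work
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
}

# Attach the packages that are available
for (p in pkgs[installed]) {
  library(p, character.only = TRUE)
}
```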
```
filter(gap_wide2, country == "Angola" & year == 1952)
gap_wide2 %>%
  arrange(year, country) %>%
  head(10)
# incomplete: the screen cut off before this entry could be copied
# gap_long2 <- gap_long %>%
#   separate(type_year, into = c("variable_type, "year....))
```
Thank you and goodnight
## Week 3 - Please sign in below:
##### Name (A ### )
Camila Gomez Wills (A53265648)
Katherine Tian (A12828524)
William Shumate (A14196513)
Noah Gerber (A11844436)
Christopher Thompson (A53290544)
Kelis Wong (A53233278)
Diego Jimenez (A92016548)
Lin Ou (A53290084)
Navdhrishty Singh (A12972272)
Michael W. Andrews (A14231233)
Zixuan Dai (A53270250)
Inderapl Pamma (A53282970)
Wenlin Zhao (A53271207)
Zilu Zeng (A53266413)
Hui Zhang(A53248115)
Yi LIU (A53257075)
Wenhan Zhu (A53277425)
Mingpu Xiao (A53281212)
Jiahang Zhang(A53278367)
Yang Xuan (A14466001)
Clarins Cecilia (A53288595)
Xinyu Zhang(A53287451)
Cesar Perez (A53265549)
Payam Shahsavandi(A13534813)
Gala Ledezma (A12804158)
Daniel Horan (A11768191)
Shihao Lin (A53255525)
imgesu cetin (A53290197)
Jude Muhtaseb (A14721276)
Wendy Romero-Garcia (A53217156)
Savas Tarhan (A53222479)
Aimee Barnes (A11923359)
Yunzhou Luo (A53233781)
Jicuo Dai(A53251206)
Huang Xinyu (A53279031)
Yucheng Shen (A53290227)
Paul Koenig (A12629448)
Greg Householter (A53288838)
Travis Welburn (A92430385)
Anh Nguyen (A53245210)
Malena Hernandez (A53289951)
Man Luo (A53277866)
Laura Vossler (A53267214)
Sibo Su (A53255631)
Yifan Dong (A53256391)
Hannah Ashby (A13026919)
Isabelle Chen (A12692259)
Masatoshi SHIMOSUKA (A53266177)
Grace Yuan(A53274857)
Xuan Gu (A53291635)
Things I liked:
The fact that there were several teachers and assistants hovering and answering screen/coding questions was very helpful. Overall, I would recommend as an introduction to coding. The patience of the teachers was very much appreciated.
Things that can improve:
More face-to-face support for homework assignments, to answer coding questions. I would prefer 5 weeks of R and 5 weeks of Python, and no SQL.
Things I liked:
All of the support and TAs available to help us finish our assignments and answer questions about quizzes.
Things that can improve:
Quiz questions could be a bit clearer; on the first quizzes some of the questions were confusing. Also, don't make the homework questions so complicated that students have to figure them out on their own by googling. For those of us who struggle, it is hard to grasp the basic concepts, and having to seek out more help makes things complicated to solve.