## Customization

- Use the four-window icon to change the layout
- Click on Panes to actually see all the options
- Or use Tools > Global Options

## Notes

- If you are stuck on the `+` in the console, hit Escape
- To get a new file:
  - File > New File
  - Or the paper-with-a-plus icon
  - Pick either R Script or R Markdown

## Markdown

- The YAML ("YAML Ain't Markup Language", originally "Yet Another Markup Language") - this is the section between the `---`
  - Don't delete the dashes!
  - Edit below the dashes.
- Sections that are grey or just different colors:
  - Code chunks - anything that looks like this:

    ``` {r}
    # code chunk
    ```

  - Code goes into this section
  - Leave the chunk labeled `setup` alone
- To get a new code chunk:
  - Click the little green `+C` icon in the middle of the top editor window (source window)
  - Click on the R option

```
x <- 1
x = 1
```

- To run code:
  - Control + Enter to run the line you are on
  - Click the green arrow on the chunk
  - Highlight what you want > click Run in the top menu > Run Selected Lines
- Knitting:
  - This option creates the output report
  - Knitting starts a fresh session, so it completely forgets anything you've done interactively while working
  - Order is super important!
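As a quick sketch of why order matters (the object name here is made up for illustration): knitting runs every chunk top to bottom in a brand-new session, so a chunk that uses an object before the chunk that creates it will error, even if everything worked while you were running chunks by hand.

```
# chunk 1 - knitting would fail here, because y doesn't exist yet
# mean(y)

# chunk 2 - creates y
y <- c(1, 2, 3)

# any chunk after chunk 2 can use y
mean(y)
```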
- Notes:
  - Type notes outside of chunks
  - Use `#` to make comments inside code

```
mean(c(3, 4, 5), na.rm = T) # this is a comment
```

## Values

```
4            # numeric
TRUE         # logical
FALSE        # logical
"characters" # character
NA           # logical, missing data
NaN          # numeric, not a number
```

## Objects

- `c()`: concatenate, combine
- If you save something as an object, it *usually* doesn't print out
- If something prints, it probably didn't save

```
# variable (a small vector)
x <- 5

# vector: one row of values
pizza <- c("cheese", "pepperoni", "pineapple")
pizza

# a data frame is basically Excel
salad <- data.frame(
  dressing = c("italian", "ranch"),
  toppings = c("cheese", "croutons"),
  orders = c(2, 3)
)
salad

# factors are fancy vectors
drinks <- factor(
  x = c(3, 4, 5, 3, 4, 3, 4),       # the data
  levels = c(3, 4, 5),              # the possible values
  labels = c("Coke", "Pop", "Soda") # the label you want to give each value
)
drinks

# lists are like grocery lists
dinner <- list(
  pizza,
  salad,
  drinks
)
dinner
```

## Functions

```
mean(c(1, 2, 3, NA))
mean(x = c(1, 2, 3, NA), na.rm = TRUE)
# x and na.rm are arguments
# mean is the function
```

Tip! Use the down arrow on the last chunk to run everything above it, then the play button to run that chunk.

## Slicing

- A better name for this is filtering, selecting, or subsetting - picking only certain things
- You can use `:` to indicate start THROUGH stop
- You must use `c()` (combine) to pick non-sequential positions

```
# just want the first one
drinks[1]

# take the first 3
drinks[1:3]

# take some random ones
drinks[c(1, 3, 5)]
```

## Logicals

```
2 == 4 # are they equal?
2 != 4 # are they NOT equal?
2 < 4  # less than
# >  greater than
# >= greater than or equal to
# <= less than or equal to
```

```
# this gives you TRUEs and FALSEs
drinks == "Coke"

# by using slicing we can select only those
drinks[drinks == "Coke"]
drinks[drinks != "Coke"]
```

## Libraries

- Click on the bottom-right window pane - the Packages tab
- Install `car`, `psych`, `dplyr`

```
library(car)
library(dplyr)
```

# Training Notes

- `##` makes new slides
- How many slides? 7-10
- How long to talk? Try to keep it under 20 minutes
- For the assignment markdown:
  - Leave the libraries and the data loading at the top
  - About 3 exercises

# Descriptive Statistics

## Libraries

```
library(rio)
library(psych)
library(car)
library(dplyr)
```

## Data

```
# if the file is in the same folder as the Rmd, you can just type the file name
DF <- import("03_descriptives_data.csv")
head(DF)
```

NOT IN MARKDOWN, but in the console: `View(DF)`

```
summary(DF)
```

- Histograms:
  - x is the possible values of a continuous variable
  - y is the frequency of each value

```
# dataframe name, dollar sign, column name:
# "look in this dataframe; here's the column name"
hist(DF$accuracy)

# tidy R
DF %>%               # Ctrl + Shift + M to get the pipe
  pull(accuracy) %>% # pull selects one column and turns it into a vector
  hist()             # now make a histogram
```

```
describe(DF)
```

- Point estimates: single values that represent the data
- Variability estimates: values that represent the spread of the data
- Having both is important - one value that represents everything can be misleading, so also having an understanding of how different people can be is useful
- `na.rm = T` allows you to exclude missing scores and then calculate your statistic
  - `mean(c(1, 2, 3, NA), na.rm = T)`

```
mean(DF$self_perceived_knowledge, na.rm = T)
median(DF$self_perceived_knowledge, na.rm = T)
table(DF$FINRA_score)
# mode is four
# values in the first row
# frequency counts in the second row

DF %>% # this is the pipe (Ctrl + Shift + M / Cmd + Shift + M)
  summarize(mean_self = mean(self_perceived_knowledge, na.rm = T),
            med_self = median(self_perceived_knowledge, na.rm = T))
# name of new variable = what you want to calculate

DF %>%
  group_by(FINRA_score) %>%  # group_by allows you to calculate things by group
  summarize(mode_self = n()) # the n() function counts the number of rows
```

- When the mean and the median are very close, the distribution of the data is probably "normal"
- When they are far apart, the data is skewed
- IQR is good for skewed data
- SD is good for "normal" data

```
quantile(DF$self_perceived_knowledge)

DF %>%
  pull(self_perceived_knowledge) %>% # grabs only the vector
  # (converts the data frame column into a vector)
  quantile(.) # the dot means "use whatever comes down from the pipe"

boxplot(DF$self_perceived_knowledge)
Boxplot(DF$self_perceived_knowledge) # capital-B Boxplot is car's version
Boxplot(DF$overclaiming_proportion)

hist(DF$self_perceived_knowledge) # visualization
mean(DF$self_perceived_knowledge) # the point estimate
sd(DF$self_perceived_knowledge)   # the variability

DF %>% # this is the pipe (Ctrl + Shift + M / Cmd + Shift + M)
  summarize(mean_self = mean(self_perceived_knowledge, na.rm = T),
            med_self = median(self_perceived_knowledge, na.rm = T),
            sd_self = sd(self_perceived_knowledge, na.rm = T))
```

# Probability

- In the functions, the prefix tells you what you get:
  - `d`: density --> gives you probability back
  - `r`: random --> allows you to randomly sample
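These prefixes work the same way across R's built-in distributions; as one example not from the notes, here is the normal distribution's pair:

```
# d: density - the height of the standard normal curve at a value
dnorm(0, mean = 0, sd = 1) # about 0.399 at the center

# r: random - draw values from that same distribution
rnorm(5, mean = 0, sd = 1) # five random draws, different every time
```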