Working through the Bonus Example: A PhD student's approach to new code

---
title: "Tutorial 2 Bonus: Working through the Bonus Example: A PhD student's approach to new code"
author: "Danny Tobin"
output:
  word_document: default
  html_document: default
subtitle: ENV710 Applied Statistical Modeling for Environmental Management
editor_options:
  chunk_output_type: inline
---

# Tutorial 2 Bonus

## Load your libraries

```{r library}
library(wbstats) # a package that enables us to import data from the World Bank
library(moments) # allows us to calculate skewness and kurtosis
library(dplyr)   # a package that helps us wrangle/manage data
library(tidyr)   # a package that allows us to pivot the data
```

## Load your data

```{r wbdata}
# pull the country data down from the World Bank - five indicators
wb_data <- wb_data(
  indicator = c("SP.DYN.LE00.IN", "NY.GDP.PCAP.CD", "SP.POP.TOTL",
                "SP.URB.TOTL.IN.ZS", "EG.ELC.ACCS.ZS"),
  country = "countries_only",
  start_date = 2020,
  end_date = 2020
)
```

## Bonus Example

Note: I skip the step of renaming variables, but you could do it to get a nicer table.

We can calculate summary statistics on all of the variables at the same time using dplyr. Please check out the website [here](https://www.statology.org/summary-statistics-in-r-dplyr/).

## Solution Code!

```{r}
# here is the working code
all.summ <- wb_data %>%
  # select() is a dplyr function to select variables; I deselected date by using (!date)
  select(!date) %>%
  summarise(across(where(is.numeric),
                   .fns = list(min    = ~min(., na.rm = T),
                               median = ~median(., na.rm = T),
                               mean   = ~mean(., na.rm = T),
                               stdev  = ~sd(., na.rm = T),
                               max    = ~max(., na.rm = T),
                               q25    = ~quantile(., 0.25, na.rm = T),
                               q75    = ~quantile(., 0.75, na.rm = T)))) %>%
  pivot_longer(everything(),
               names_sep = '_',
               names_to = c('variable', '.value'))
```

### Solution with Rounded numbers

```{r}
# if you want to make it pretty, you can round all of the columns
all.summ <- all.summ %>%
  mutate_if(is.numeric, ~round(., 2))
all.summ
```

## Here is the Process to Get to the Solution

Like all of you, I followed the link and then had to play around with different things before I got it to work (see above). Here is my process:

### Step 1: Find Code and Adapt it to Your Data

```{r bonus, eval = F}
# This is the link: https://www.statology.org/summary-statistics-in-r-dplyr
# I directly copy-paste the code, only changing it to refer to my data and
# naming a new object to hold my summary
tst.summ <- wb_data %>%
  summarise(across(where(is.numeric),
                   .fns = list(min    = min,
                               median = median,
                               mean   = mean,
                               stdev  = sd,
                               q25    = ~quantile(., 0.25),
                               q75    = ~quantile(., 0.75),
                               max    = max))) %>%
  pivot_longer(everything(),
               names_sep = '_',
               names_to = c('variable', '.value'))
```

### Step 2: See if you can solve your problem with a dirty solution

My error message points to missing values in my data. I check that this is the problem by dropping all missing data and then running the code again.

```{r}
# after getting an error I see if this code works without NAs
# I use the drop_na() command from tidyr to get a dataset without NAs.
# This is a very dangerous command because it changes the dataset: it drops
# every row that has an NA in any column. It is better to use na.rm as an
# argument, but I do this just to diagnose the error. Essentially, this is a
# fast way for me to ask "are NAs the issue?"
tst <- wb_data %>% drop_na()

dropna.summ <- tst %>%
  summarise(across(where(is.numeric),
                   .fns = list(min    = min,
                               median = median,
                               mean   = mean,
                               stdev  = sd,
                               q25    = ~quantile(., 0.25),
                               q75    = ~quantile(., 0.75),
                               max    = max))) %>%
  pivot_longer(everything(),
               names_sep = '_',
               names_to = c('variable', '.value'))

# as an example, I could have done these two commands in one step by chaining %>% commands
dropna.summ <- wb_data %>%
  drop_na() %>%
  summarise(across(where(is.numeric),
                   .fns = list(min    = min,
                               median = median,
                               mean   = mean,
                               stdev  = sd,
                               q25    = ~quantile(., 0.25),
                               q75    = ~quantile(., 0.75),
                               max    = max))) %>%
  pivot_longer(everything(),
               names_sep = '_',
               names_to = c('variable', '.value'))

dropna.summ
```

As you can see, this code works, but now I want to figure out how to make it work without actually dropping the NAs (i.e. with na.rm as an argument).

### Step 3: Try a simpler version of the code

Without looking anything up, I tried to remove NAs like we did in class. I remembered from previous sessions that when piping from a dataframe you often have to write "." as the first argument to say "I want to apply this function to the piped data". However, as you will see later, I didn't remember that in some cases you must add the tilde (~) in front of the function, so the code below does not work. Notice that I put "r, eval = F" as the chunk header to show code chunks without running them in my markdown file.

```{r, eval = F}
# I start simple with just two of the arguments
# I use the . to say "I want to run the function on the data (wb_data) that was previously piped in"
# this version does not work: the ~ in front of min() and mean() is missing
all.mins <- wb_data %>%
  summarise(across(where(is.numeric),
                   .fns = list(min  = min(., na.rm = T),
                               mean = mean(., na.rm = T))))
```

### Step 4: Google the problem

I see another error, so I googled "summarise all variables missing values". Sometimes you may also want to specify "R" or "in R" in your search.
I see a similar problem and solution on Stack Overflow here: https://stackoverflow.com/questions/25759891/dplyr-summarise-each-with-na-rm

Copying the structure from that code, I first try putting na.rm after all of my summary stat functions.

```{r, eval = F}
all.summ <- wb_data %>%
  summarise(across(where(is.numeric),
                   .fns = list(min    = min,
                               median = median,
                               mean   = mean,
                               stdev  = sd,
                               max    = max,
                               q25    = ~quantile(., 0.25),
                               q75    = ~quantile(., 0.75)),
                   na.rm = T)) %>%
  pivot_longer(everything(),
               names_sep = '_',
               names_to = c('variable', '.value'))
```

Again, I get an error, so I separate the problem into smaller chunks. I see that min, median, etc. have a different structure than quantile, so I try the solution with only them.

### Step 5: Solve the problem in pieces

```{r}
# Solution just for min, median, mean, sd, max
basic.summ <- wb_data %>%
  summarise(across(where(is.numeric),
                   .fns = list(min    = min,
                               median = median,
                               mean   = mean,
                               stdev  = sd,
                               max    = max),
                   na.rm = T)) %>%
  pivot_longer(everything(),
               names_sep = '_',
               names_to = c('variable', '.value'))

basic.summ
```

I have solved one half of my problem. Now I try a different approach with quantiles. I see that the quantile calls already have an argument specified, so I add na.rm directly into the function call.

```{r}
# this works for quantiles - now how to combine them?
quantile.summ <- wb_data %>%
  summarise(across(where(is.numeric),
                   .fns = list(q25 = ~quantile(., 0.25, na.rm = T),
                               q75 = ~quantile(., 0.75, na.rm = T)))) %>%
  pivot_longer(everything(),
               names_sep = '_',
               names_to = c('variable', '.value'))
```

This works too. Can I combine the two pieces of code?

```{r, eval = F}
# all together
# still an error
all.summ <- wb_data %>%
  summarise(across(where(is.numeric),
                   .fns = list(min    = min,
                               median = median,
                               mean   = mean,
                               stdev  = sd,
                               max    = max,
                               q25    = ~quantile(., 0.25, na.rm = T),
                               q75    = ~quantile(., 0.75, na.rm = T)),
                   na.rm = T)) %>%
  pivot_longer(everything(),
               names_sep = '_',
               names_to = c('variable', '.value'))
```

No, I can't.
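As a side note, the different treatment of NAs by these functions can be checked in base R alone, without dplyr. The sketch below is only an illustration on a made-up vector (`x`, `q_err`, `q_ok` and `q25` are hypothetical names, not part of the tutorial data): `mean()` quietly returns NA when a value is missing, while `quantile()` stops with an error unless `na.rm = TRUE` is given.

```r
# x is a made-up vector with one missing value (not World Bank data)
x <- c(2, 4, 6, NA)

mean(x)                # the missing value propagates, so the result is NA
mean(x, na.rm = TRUE)  # the NA is dropped before averaging

# quantile() raises an error on NAs unless na.rm = TRUE;
# tryCatch() turns that error into the string "error" so the script keeps running
q_err <- tryCatch(quantile(x, 0.25), error = function(e) "error")
q_ok  <- quantile(x, 0.25, na.rm = TRUE)

# an ordinary function that carries na.rm inside it; inside across(),
# ~quantile(., 0.25, na.rm = TRUE) is shorthand for a function of this kind
q25 <- function(v) quantile(v, 0.25, na.rm = TRUE)
q25(x)
```

Seen this way, the `~` form is just a compact way to hand `across()` a function whose extra arguments are already filled in.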
At this stage, I realize that I can make each of the functions look like the quantile call (a ~ in front and na.rm inside the call), and then it works.

## Final Solution

```{r}
# I see that the quantile structure is different from the other functions, so I mimic that structure
# this code works
all.summ <- wb_data %>%
  summarise(across(where(is.numeric),
                   .fns = list(min    = ~min(., na.rm = T),
                               median = ~median(., na.rm = T),
                               mean   = ~mean(., na.rm = T),
                               stdev  = ~sd(., na.rm = T),
                               max    = ~max(., na.rm = T),
                               q25    = ~quantile(., 0.25, na.rm = T),
                               q75    = ~quantile(., 0.75, na.rm = T)))) %>%
  pivot_longer(everything(),
               names_sep = '_',
               names_to = c('variable', '.value'))
```

While this may look daunting, it took me about 5 minutes to go from clicking the original link to the final solution. That isn't to brag: the first time I did this it would have taken me hours (if I didn't give up before getting to a solution). This is just to say that the process will get easier as you get comfortable with R.

As you gain experience programming, you will begin to see patterns that will help you jump to solutions faster. You will also get better at reading error messages, figuring out the right search terms, and reading forums. These are acquired skills, and at first it will take a LONG time. Please be patient with yourselves and realize that this can be a lot more fun if you approach the problem with curiosity rather than frustration.

Also, make sure to revel in your godliness when you solve a problem. The feeling of accomplishment when you solve a coding issue, however small, is one of the small tangible pleasures that got me into R programming. I hope you take satisfaction in a job well done too.
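If you want to practice the final pattern without downloading anything from the World Bank, here is a minimal sketch on a toy data frame. Everything in it is made up for illustration (`toy_df`, its columns `gdp` and `life`, and the object `toy.summ` are hypothetical names), and it uses only three of the summary functions to stay short. The column names deliberately contain no underscore so that `names_sep = "_"` can split a name like `gdp_min` into the variable and the statistic.

```r
library(dplyr)
library(tidyr)

# toy_df is a made-up data frame with a few missing values (not World Bank data)
toy_df <- tibble(
  country = c("A", "B", "C", "D"),
  gdp     = c(10, 20, NA, 40),
  life    = c(60, NA, 70, 80)
)

toy.summ <- toy_df %>%
  # every function is written as a ~ formula with na.rm inside the call
  summarise(across(where(is.numeric),
                   .fns = list(min  = ~min(., na.rm = TRUE),
                               mean = ~mean(., na.rm = TRUE),
                               q25  = ~quantile(., 0.25, na.rm = TRUE)))) %>%
  # reshape from one wide row (gdp_min, gdp_mean, ...) to one row per variable
  pivot_longer(everything(),
               names_sep = "_",
               names_to = c("variable", ".value"))

toy.summ
```

The result should be a small table with one row per numeric variable and one column per statistic, which makes it easy to swap pieces in and out while you experiment.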