--- tags: ResBaz2021 --- # SQL databases in R ## Thursday, May 20th, 2021 1\:00-3\:00 [Back to Resbaz HackMD Directory](https://hackmd.io/@ResBaz21/directory) Descriptions Forthcoming ## Getting Started Information here: https://github.com/pachamaltese/uarizona-workshops --- ## Introductions Name, Affiliation, Title, Email, Social Media - Your Name, University of Arizona, Your title, youremail@email.arizona.edu, your social media - Julian Pistorius, University of Arizona - College of Agriculture and Life Sciences, Research Software Engineer, julianp@arizona.edu, [github@julianpistorius](https://github.com/julianpistorius/) - Kelsey Gonzalez - UArizona, @KelseyGonzalez - Rene Dario Herrera (they/them), University of Arizona Cancer Center, Data Analyst, renedherrera@email.arizona.edu, [twitter@reneherrera](https://twitter.com/reneherrera), [github@renedarioherrera](https://github.com/renedarioherrera), [linkedin.com@renedarioherrera/](https://www.linkedin.com/in/renedarioherrera/) - Iris Rodden, University of Arizona, PhD student SNRE, irisr@email.arizona.edu ## Notes > Julian Pistorius: I created a repository which you can launch in MyBinder which has Jupyter and RStudio alongside a built-in PostgreSQL server: https://github.com/julianpistorius/binder-postgres ![nycflights13 data diagram](https://d33wubrfki0l68.cloudfront.net/245292d1ea724f6c3fd8a92063dcd7bfb9758d02/5751b/diagrams/relational-nycflights.png) Install and use `nycflight13` package: ```r=3.6 install.packages("nycflights13") library(nycflights13) flights ``` Query table using `dplyr`: ```r=3.6 library(dplyr) flights %>% filter(month == 3) ``` Indexes: You need a unique way to identify observations ```r=3.6 # Show airlines airlines # Indexes unique_index <- list( airlines = list("carrier"), planes = list("tailnum") ) index <- list( airports = list("faa"), flights = list( c("year", "month", "day"), "carrier", "tailnum", "origin", "dest" ), weather = list(c("year", "month", "day"), "origin") ) ``` Install RPostgres package: ```r=3.6 install.packages("RPostgres") library(RPostgres) ``` Create environment variables for SQL access: ```R3.6 usethis::edit_r_environ() # and add to your environment to keep them restricted.. tutorial_db="postgres" tutorial_user="postgres" tutorial_pass="rezbaz2021" tutorial_host="143.198.138.225" ``` For Binder environment: ```R3.6 usethis::edit_r_environ() # and add to your environment to keep them restricted.. tutorial_db="postgres" tutorial_user="postgres" tutorial_pass="" tutorial_host="127.0.0.1" ``` then session->restart R on toolbar ```R3.6 fun_connect <- function() { dbConnect( Postgres(), dbname = Sys.getenv("tutorial_db"), user = Sys.getenv("tutorial_user"), password = Sys.getenv("tutorial_pass"), host = Sys.getenv("tutorial_host") ) } ``` And then explore tables, but this isn't the best option ```R3.6 # explore tables ---- schema <- "public" conn <- fun_connect() #this isn't the best # remote_tables <- dbGetQuery( # conn, # sprintf("SELECT table_name # FROM information_schema.tables # WHERE table_schema='%s'", schema) # ) #this is better DBI::dbListTables(conn) dbDisconnect(conn) ``` ```r3.6 # what tables do my server find remote_tables <- as.character(remote_tables$table_name) remote_tables # Which tables do I need to upload to my server local_tables <- utils::data(package = "nycflights13")$results[, "Item"] tables <- setdiff(local_tables, remote_tables) ``` Now to connect... ```r3.6 conn <- fun_connect() obs_per_month <- tbl(conn, "flights") %>% group_by(year, month) %>% count() %>% collect() more_complicated <- tbl(conn, "flights") %>% group_by(origin) %>% count() %>% inner_join(tbl(conn, "airports") %>% select(faa,name), by = c("origin" = "faa")) %>% collect() dbDisconnect(conn) ``` If you want to see the SQL command, ```r3.6 conn <- fun_connect() obs_per_month <- tbl(conn, "flights") %>% group_by(year, month) %>% count() %>% show_query() dbDisconnect(conn) ``` ## Questions and Answers In this section, you can post your questions and feel free to answer if you have it. Questions will be answered during or after the workshop. 1. Ask your question. - Here is an answer 2. I get a problem: > ``` > library(RPostgres) Error: package or namespace load failed for ‘RPostgres’ in dyn.load(file, DLLpath = DLLpath, ...): unable to load shared object '/srv/conda/envs/notebook/lib/R/library/Rcpp/libs/Rcpp.so': /usr/lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.26' not found (required by /srv/conda/envs/notebook/lib/R/library/Rcpp/libs/Rcpp.so) ``` 3. ## References Possible solution to problem in Binder/Ubuntu: <https://stackoverflow.com/questions/44773296/libstdc-so-6-version-glibcxx-3-4-20-not-found> --- :::info **Session Feedback :mega:** Use the link below to provide your feedback on the session: [**Session Feedback Form**](https://forms.gle/TrnJpr9qRBEKdnVVA) :::