
2020-08-19 Timeline Function
=====

These are the adjustments to `reciprocal_fn()` for getting the timelines for a supplied list of users.

:::info
:bulb: **Stewart's section:**:arrow_lower_right:

---

:::

Notes as I read:

1. What object type does `map_dbl` return?

> [rdocumentation.org](https://www.rdocumentation.org/search?q=map_dbl&latest=) has this:
> `*_dbl()` returns a `double(length(.x))`

I presume this means a numeric vector, while plain `map()` returns a list.

2. `case_when()`: I was aware of it, but your demo shows me how to solve a problem I had been dreading for a while. That is, if the screen_name of an account is something cryptic like `@amboInd_Kol`, how can I reliably change it to appear as "Amb. Smith: India" on the plots? I think this would solve that problem.

3. Do you recommend using library(janitor) as a standard?

4. `dput()`: I am unclear on the advantage over writing to a csv.

5. General R question. Calling out to a fn like reciprocal_fn is time consuming and expensive. At what point does sync vs. async processing come into play?

These can be found in [this Github file](https://github.com/Qstreet/hackmd-notes/blob/master/rQuestions.md).

***

:::info
:bulb: **Martin's section:**:arrow_lower_right:

---

### Stewart's questions:

1. What object type does `map_dbl` return?

*This returns a numeric (double) vector.* The standard `map()` function always returns a list, but the suffixes tell you what kind of object will be returned with each `map_` variant.

2. `case_when()`: I was aware of it, but your demo shows me how to solve a problem I had been dreading for a while. That is, if the `screen_name` of an account is something cryptic like `@amboInd_Kol`, how can I reliably change it to appear as "`Amb. Smith: India`" on the plots? I think this would solve that problem.

*Yes, this is an ideal solution for something like that. The great thing about the `case_when()` syntax is that it can handle any function that returns a logical (Boolean) output.
So for the case you're describing, we can use `case_when()` with `stringr::str_detect()`.*

*Assume we want a categorical variable that tells us if a character in `starwars` is `human`, `droid`, or `other`.*

```r
dplyr::starwars %>%
  dplyr::count(species, sort = TRUE)
# returns:
# A tibble: 38 x 2
#    species      n
#    <chr>    <int>
#  1 Human       35
#  2 Droid        6
#  3 NA           4
#  4 Gungan       3
#  5 Kaminoan     2
#  6 Mirialan     2
#  7 Twi'lek      2
#  8 Wookiee      2
#  9 Zabrak       2
# 10 Aleena       1
# … with 28 more rows
```

*We can look for the string match in `starwars$species` and assign the desired value in `spec_cat`*:

```r
dplyr::starwars %>%
  dplyr::mutate(spec_cat = dplyr::case_when(
    stringr::str_detect(species, "Human") ~ "human",
    stringr::str_detect(species, "Droid") ~ "droid",
    # everything else, including missing species, falls through to "other"
    TRUE ~ "other")) %>%
  dplyr::count(spec_cat)
# A tibble: 3 x 2
#   spec_cat     n
#   <chr>    <int>
# 1 droid        6
# 2 human       35
# 3 other       46
```

3. Do you recommend using `library(janitor)` as a standard?

*Yes, if only for the function `janitor::clean_names()`. This saves a ton of time spent renaming variables and standardizes how you'll refer to columns.*

4. `dput()`: I am unclear on the advantage over writing to a csv.

*The `dput()` function is an 'old guard' way to get an object's structure printed to the screen. For example, consider the small `table1` from `tidyr`:*

```r
tidyr::table1
# A tibble: 6 x 4
#   country      year  cases population
#   <chr>       <int>  <int>      <int>
# 1 Afghanistan  1999    745   19987071
# 2 Afghanistan  2000   2666   20595360
# 3 Brazil       1999  37737  172006362
# 4 Brazil       2000  80488  174504898
# 5 China        1999 212258 1272915272
# 6 China        2000 213766 1280428583
```

*If I want to export this to another script, I could write it out to a .csv, and then re-import it.
Or I can use `dput()`:*

```r
dput(tidyr::table1)
# this prints to the console
structure(list(country = c("Afghanistan", "Afghanistan", "Brazil",
"Brazil", "China", "China"), year = c(1999L, 2000L, 1999L, 2000L,
1999L, 2000L), cases = c(745L, 2666L, 37737L, 80488L, 212258L,
213766L), population = c(19987071L, 20595360L, 172006362L, 174504898L,
1272915272L, 1280428583L)), class = c("tbl_df", "tbl", "data.frame"),
row.names = c(NA, -6L))
```

*I can copy this text and assign it to whatever object I need, which is also handy when I want to create a reproducible example. Another example of this is the [`datapasta package`](https://cran.r-project.org/web/packages/datapasta/vignettes/how-to-datapasta.html).*

```r
library(datapasta)
datapasta::tribble_paste(tidyr::table1)
```

*And this gets printed to the screen.*

```r
tibble::tribble(
      ~country, ~year,  ~cases, ~population,
  "Afghanistan", 1999L,    745L,   19987071L,
  "Afghanistan", 2000L,   2666L,   20595360L,
       "Brazil", 1999L,  37737L,  172006362L,
       "Brazil", 2000L,  80488L,  174504898L,
        "China", 1999L, 212258L, 1272915272L,
        "China", 2000L, 213766L, 1280428583L
)
```

5. General R question. Calling out to a fn like `reciprocal_fn` is time consuming and expensive. At what point does sync vs. async processing come into play?

This really depends on a lot of factors. Bundling common processes into a function is always a good practice. We're in a somewhat unique situation because we're getting data from an API with a rate limit, so it's a little more complicated to figure out how to call `reciprocal_fn` without overloading the API.

:::

# Original error

The error is produced when the custom function `reciprocal_fn()` is used to create the column `reciprocal200` inside the `ccp_accounts` data.frame.

```r
ccp_accounts %>%
  # THIS IS THE LINE IN QUESTION.
  # ALL ELSE WORKS FINE
  mutate(Language = ifelse(lang == 'en', "English",
                    ifelse(lang == 'es', "Spanish",
                    ifelse(lang == 'fr', "French",
                    ifelse(lang == "ar", "Arabic", NA)))),
         reciprocal200 = map_dbl(user_id, reciprocal_fn))
# Error: Problem with `mutate()` input `reciprocal200`.
# x Problem with `filter()` input `..1`.
# x object 'created_at' not found
# ℹ Input `..1` is `created_at == min(created_at)`.
# ℹ Input `reciprocal200` is `map_dbl(user_id, reciprocal_fn)`.
```

This error is telling us that the `filter()` call inside `reciprocal_fn()` could not find a `created_at` column in the data it was given.

:::success
:dart: **Create `language` variable:** :arrow_lower_right:

---

I wrote some code for creating the language variable using `dplyr::case_when()` so it's a little cleaner:

```r
ccp_accounts <- ccp_accounts %>%
  # here we have multiple ifelse() statements that can be replaced by a single
  # case_when() statement
  mutate(
    language = case_when(
      lang == "en" ~ "English",
      lang == "es" ~ "Spanish",
      lang == "fr" ~ "French",
      lang == "ar" ~ "Arabic",
      TRUE ~ NA_character_))
# check our work
ccp_accounts %>%
  dplyr::count(language, lang) %>%
  tidyr::pivot_wider(names_from = language, values_from = n)
# A tibble: 18 x 6
#    lang  Arabic English French Spanish  `NA`
#    <chr>  <int>   <int>  <int>   <int> <int>
#  1 ar         7      NA     NA      NA    NA
#  2 en        NA      87     NA      NA    NA
#  3 fr        NA      NA      9      NA    NA
#  4 es        NA      NA     NA      18    NA
#  5 de        NA      NA     NA      NA     1
#  6 hu        NA      NA     NA      NA     1
#  7 it        NA      NA     NA      NA     1
#  8 ja        NA      NA     NA      NA     5
#  9 ko        NA      NA     NA      NA     1
# 10 nl        NA      NA     NA      NA     1
# 11 pt        NA      NA     NA      NA     5
# 12 ru        NA      NA     NA      NA     1
# 13 sr        NA      NA     NA      NA     2
# 14 tr        NA      NA     NA      NA     2
# 15 und       NA      NA     NA      NA     5
# 16 ur        NA      NA     NA      NA     1
# 17 zh        NA      NA     NA      NA     6
# 18 NA        NA      NA     NA      NA     3
```

:::

## Load the packages

```r
library(rtweet)
library(tidyverse)
library(janitor)
```

## Define Twitter list and use `rtweet::lists_members`

This is the predefined Twitter list, stored as a character vector:

```r
CHN_Diplomats <- "1259590304727994368"
# lists_members of twitter api list ---------------------------------------
# returns 40 column df, each row is one member of the twitter list
chn_diplos_list_mems <- rtweet::lists_members(list_id =
  CHN_Diplomats)
# chn_diplos_list_mems %>% str(object = ., max.level = 1)
# tibble [156 × 40] (S3: tbl_df/tbl/data.frame)
```

It results in a `data.frame` with 156 rows and 40 columns.

## Create data.frame from `rtweet::lookup_users`

Now we get the `user` data from our `chn_diplos_list_mems` object.

```r
## this one line pulls the full list (now 156) of CHN diplos from twitter.
## no need to list out CHN diplo account names into a vector
ccp_accounts <- rtweet::lookup_users(chn_diplos_list_mems$user_id)
ccp_accounts %>% str(object = ., max.level = 1)
# tibble [156 × 90] (S3: tbl_df/tbl/data.frame)
```

This results in a data.frame with 156 rows and 90 columns: `tibble [156 × 90] (S3: tbl_df/tbl/data.frame)`

## Define `reciprocal_fn` function

Below is a function for collecting timeline information (the last `200` tweets) on the users stored in element `x`. The goals of this function are:

1. Store the output for `rtweet::get_timeline()` in `df`

```r
# I!I get most recent 200 tweets. measure time span. shorter span is larger
df <- rtweet::get_timeline(x, n = 200)
```

2. Get the date of each user's first tweet (from the last `200`):

```r
# FIND EARLIEST AND LATEST DATE IN TIBBLE
# begin date
bd <- df %>% filter(created_at == min(created_at))
beginDate <- as.Date(bd$created_at)
```

3. Get the date of each user's last tweet (from the last `200`) and the number of days between the two dates:

```r
# define end date
ed <- df %>% filter(created_at == max(created_at))
endDate <- as.Date(ed$created_at)
# get range in days from earliest to latest
days_range_200_twts <- as.double(difftime(lubridate::ymd(endDate),
                                          lubridate::ymd(beginDate),
                                          units = "days"))
```

4. Calculate a metric that is higher when the user has a higher level of Twitter activity. The fewer days the last `200` tweets span, the larger the reciprocal:

```r
# calculate and return reciprocal
# shorter span gives a larger number, so a higher number means a higher
# level of twitter activity and a lower number means lower twitter activity
reciprocal <- (1 / days_range_200_twts) * 100
```

5. Finally, we return this reciprocal object with a `return()` statement

```r
return(reciprocal)
```

### `reciprocal_fn(x)`

Here we define the actual function.
```r
reciprocal_fn <- function(x){
  ## I!I get most recent 200 tweets. measure time span. shorter span is larger
  df <- rtweet::get_timeline(x, n = 200)
  # FIND EARLIEST AND LATEST DATE IN TIBBLE
  # begin date
  bd <- df %>% filter(created_at == min(created_at))
  beginDate <- as.Date(bd$created_at)
  # define end date
  ed <- df %>% filter(created_at == max(created_at))
  endDate <- as.Date(ed$created_at)
  # get range in days from earliest to latest
  days_range_200_twts <- as.double(difftime(lubridate::ymd(endDate),
                                            lubridate::ymd(beginDate),
                                            units = "days"))
  # calculate and return reciprocal
  # (shorter span gives a larger number, so a higher number means a higher
  # level of twitter activity)
  reciprocal <- (1 / days_range_200_twts) * 100
  return(reciprocal)
}
```

## Testing `reciprocal_fn`

Below we get a small sample from `ccp_accounts` to test this function:

```r
ccp_accounts %>%
  slice(1:5) %>%
  select(user_id) %>%
  purrr::as_vector() %>%
  # datapasta::vector_paste() %>%
  dput()
# this prints to the console
c(user_id1 = "1282941024852213761", user_id2 = "1275420074233548806",
user_id3 = "1274614148320690178", user_id4 = "1262720390977269761",
user_id5 = "1260881050005315586")
```

`dput()` is preferred here because the result is a named vector. We store these five users in `small_cpp_users`:

```r
small_cpp_users <- c(user_id1 = "1282941024852213761",
                     user_id2 = "1275420074233548806",
                     user_id3 = "1274614148320690178",
                     user_id4 = "1262720390977269761",
                     user_id5 = "1260881050005315586")
```

The three tests below tell us something about what the function is currently doing. In this first case, when we send all five users in `small_cpp_users`, we are expecting to see five `reciprocal`s:

```r
reciprocal_fn(x = small_cpp_users)
# [1] 1.086957
```

But we get a single value. If we feed this function the first user as a single character vector, we get a different number.

```r
reciprocal_fn(x = "1282941024852213761")
# [1] 3.333333
```

:::danger
:bulb: Initial thoughts

---

:::

My suspicion is we'll have to break the iteration into
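
One way to read the two results: `3.333333` is `100 / 30` and `1.086957` is `100 / 92`, which is consistent with the single-user call measuring a 30-day span and the five-user call measuring one pooled 92-day span across all returned tweets. The block below is only a rough sketch of per-user iteration under that reading. It does not touch the Twitter API: `fake_get_timeline()`, `reciprocal_from_dates()`, and the day counts in `spans` are hypothetical stand-ins made up for illustration, and base R `vapply()` is used in place of `purrr::map_dbl()` so it runs without extra packages.

```r
# Hypothetical stand-in for rtweet::get_timeline(): fabricates 200 evenly
# spaced timestamps covering `span_days`, so no API access is needed.
fake_get_timeline <- function(user_id, span_days) {
  end <- as.POSIXct("2020-08-19 12:00:00", tz = "UTC")
  data.frame(
    user_id    = user_id,
    created_at = seq(end - span_days * 86400, end, length.out = 200)
  )
}

# Same arithmetic as reciprocal_fn(), applied to a vector of timestamps.
reciprocal_from_dates <- function(created_at) {
  days_range <- as.double(difftime(as.Date(max(created_at)),
                                   as.Date(min(created_at)),
                                   units = "days"))
  (1 / days_range) * 100
}

# Made-up spans (in days) for three hypothetical users.
spans <- c(user_a = 30, user_b = 92, user_c = 10)

# One call per user: returns one value per user (100/30, 100/92, 100/10).
per_user <- vapply(
  names(spans),
  function(id) reciprocal_from_dates(fake_get_timeline(id, spans[[id]])$created_at),
  numeric(1)
)

# All users pooled into one data frame: min() and max() run over everyone's
# tweets, so only a single value comes back (100/92 here).
pooled <- do.call(rbind, lapply(names(spans), function(id) {
  fake_get_timeline(id, spans[[id]])
}))
pooled_value <- reciprocal_from_dates(pooled$created_at)

print(per_user)
print(pooled_value)
```

With real data the per-user step would be `purrr::map_dbl(small_cpp_users, reciprocal_fn)`, which still makes one API request per user and so remains subject to the rate limit.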
