2020-08-19 Timeline Function
=====
These are the adjustments to `reciprocal_fn()` for getting the timelines for a supplied list of users.
:::info
:bulb: **Stewart's section:**:arrow_lower_right:
---
:::
Notes as I read:
1. What object type does `map_dbl` return?
> [rdocumentation.org](https://www.rdocumentation.org/search?q=map_dbl&latest=) has this:
> *_dbl() returns a double(length(.x))
I presume this means a numeric vector while just plain `map()` returns a list.
2. `case_when()` I was aware of it but your demo shows me how to solve a problem I had been dreading for a while. That is, if the `screen_name` of an account is something cryptic like `@amboInd_Kol`, how can I reliably change it to appear as "Amb. Smith: India" on the plots? I think this would solve that problem.
3. Do you recommend using `library(janitor)` as a standard?
4. `dput()`: I am unclear on the advantage over writing to a CSV.
5. General R question. Calling out to a function like `reciprocal_fn` is time-consuming and expensive. At what point does sync vs. async processing come into play?
These can be found in [this GitHub file](https://github.com/Qstreet/hackmd-notes/blob/master/rQuestions.md).
***
:::info
:bulb: **Martin's section:**:arrow_lower_right:
---
### Stewart's questions:
1. What object type does `map_dbl` return? *this returns a numeric (double) vector.* The standard `map()` function always returns a list, but the suffixes tell you what kind of object will be returned with each `map_` variant.
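A quick console check of the difference, as a minimal sketch. The toy input `x` is made up for illustration and is not part of the project:

```r
library(purrr)

# Toy input (an assumption for illustration): three numbers to square
x <- c(1, 2, 3)

as_list <- map(x, function(v) v^2)      # map() always returns a list
as_dbl  <- map_dbl(x, function(v) v^2)  # map_dbl() returns an atomic double vector

class(as_list)   # "list"
typeof(as_dbl)   # "double"
```

Note that `map_dbl()` expects every call to return exactly one number, which matters when the mapped function can return zero or several values.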
2. `case_when()` I was aware of it but your demo shows me how to solve a problem I had been dreading for a while. That is, if the `screen_name` of an account is something cryptic like `@amboInd_Kol`, how can I reliably change it to appear as "`Amb. Smith: India`" on the plots? I think this would solve that problem.
*Yes, this is an ideal solution for something like that. The great thing about the `case_when()` syntax is that it can handle any function that returns a logical (`TRUE`/`FALSE`) output. So for the case you're describing, we can use `case_when()` with `stringr::str_detect()`.*
*Assume we want a categorical variable that tells us if a character in `starwars` is `human`, `droid`, or `other`.*
```r
dplyr::starwars %>% dplyr::count(species, sort = TRUE)
# returns:
# A tibble: 38 x 2
species n
<chr> <int>
1 Human 35
2 Droid 6
3 NA 4
4 Gungan 3
5 Kaminoan 2
6 Mirialan 2
7 Twi'lek 2
8 Wookiee 2
9 Zabrak 2
10 Aleena 1
# … with 28 more rows
```
*We can look for the string match in `starwars$species` and assign the desired value in `spec_cat`*:
```r
dplyr::starwars %>%
  dplyr::mutate(spec_cat = dplyr::case_when(
    stringr::str_detect(species, "Human") ~ "human",
    stringr::str_detect(species, "Droid") ~ "droid",
    TRUE ~ "other")) %>%
  dplyr::count(spec_cat)
# A tibble: 3 x 2
spec_cat n
<chr> <int>
1 droid 6
2 human 35
3 other 46
```
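The same pattern could be applied to the plot-label question. This is only a sketch: the `accounts` tibble and the `SomeOtherAcct` handle are made up, the new column name `plot_label` is hypothetical, and it assumes `screen_name` is stored without the leading `@`:

```r
library(dplyr)
library(stringr)

# Hypothetical accounts; only `amboInd_Kol` comes from the question above
accounts <- tibble::tibble(
  screen_name = c("amboInd_Kol", "SomeOtherAcct", "AMBOIND_KOL")
)

accounts <- accounts %>%
  mutate(plot_label = case_when(
    # case-insensitive match on the whole handle
    str_detect(screen_name, regex("^amboInd_Kol$", ignore_case = TRUE)) ~ "Amb. Smith: India",
    # anything not matched keeps its original handle
    TRUE ~ screen_name
  ))
```

Adding one line per account to the `case_when()` call keeps all the relabeling rules in a single place.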
3. Do you recommend using `library(janitor)` as a standard?
*Yes, if only for the function `janitor::clean_names()`. This saves a ton of time spent renaming variables and standardizes how you'll refer to columns.*
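A small illustration of `clean_names()`, using made-up column names rather than anything from the project data:

```r
library(janitor)

# Made-up column names in mixed styles, for illustration only
messy <- data.frame(
  `Screen Name` = "amboInd_Kol",
  followersCount = 10L,
  check.names = FALSE
)

clean <- clean_names(messy)

# With the default settings the names come back in snake_case
names(clean)
```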
4. `dput()`: I am unclear on the advantage over writing to a CSV.
*The `dput()` function is an 'old guard' way to print an object's structure to the console as R code. For example, consider the small `table1` from `tidyr`.*
```r
tidyr::table1
# A tibble: 6 x 4
country year cases population
<chr> <int> <int> <int>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
```
*If I want to export this to another script, I could write it out to a .csv and then re-import it. Or I can use `dput()`:*
```r
dput(tidyr::table1)
# this prints to the console
structure(list(country = c("Afghanistan", "Afghanistan", "Brazil",
"Brazil", "China", "China"),
year = c(1999L, 2000L, 1999L, 2000L, 1999L, 2000L),
cases = c(745L, 2666L, 37737L, 80488L, 212258L, 213766L),
population = c(19987071L, 20595360L, 172006362L,
174504898L, 1272915272L, 1280428583L)),
class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -6L))
```
*I can copy this text and assign it to whatever object I need, which is also handy when I want to create a reproducible example. Another tool for this is the [`datapasta` package](https://cran.r-project.org/web/packages/datapasta/vignettes/how-to-datapasta.html).*
```r
library(datapasta)
datapasta::tribble_paste(tidyr::table1)
```
*And this gets printed to the screen.*
```r
tibble::tribble(
~country, ~year, ~cases, ~population,
"Afghanistan", 1999L, 745L, 19987071L,
"Afghanistan", 2000L, 2666L, 20595360L,
"Brazil", 1999L, 37737L, 172006362L,
"Brazil", 2000L, 80488L, 174504898L,
"China", 1999L, 212258L, 1272915272L,
"China", 2000L, 213766L, 1280428583L
)
```
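Coming back to `dput()` versus a CSV: one practical difference is that the `dput()` text records column types exactly. The sketch below uses a made-up two-row data frame (the `user_id` value is borrowed from later in these notes) and base R's `dget()` to read the text back:

```r
# Made-up example data: a long numeric-looking id stored as character
original <- data.frame(
  user_id = c("1282941024852213761", "1275420074233548806"),
  year    = c(1999L, 2000L),
  stringsAsFactors = FALSE
)

# Write the dput() text to a temporary file instead of the console
tmp <- tempfile(fileext = ".R")
dput(original, file = tmp)

# dget() parses that text back into an R object
restored <- dget(tmp)

# The id column is still character and the year column is still integer
sapply(restored, class)
```

A CSV reader that guesses column types may turn ids like these into numbers, which is one reason to prefer `dput()` for small objects.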
5. General R question. Calling out to a function like `reciprocal_fn` is time-consuming and expensive. At what point does sync vs. async processing come into play?
*This really depends on a lot of factors. Bundling common processes into a function is always a good practice. We're in a somewhat unusual situation because we're getting data from an API with a rate limit, so it's a little more complicated to figure out how to call `reciprocal_fn` without overloading the API.*
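One possible way to space out calls is `purrr::slowly()`, which wraps a function so that successive calls are separated by a pause. This is a sketch only: `fake_api_call()` is a hypothetical stand-in for `reciprocal_fn()` so the code runs offline, and the 0.2 second pause is an arbitrary placeholder, not a value derived from the actual rate limit:

```r
library(purrr)

# Hypothetical stand-in for an API-backed function such as reciprocal_fn();
# it just counts the characters in the id so the sketch needs no network
fake_api_call <- function(user_id) {
  nchar(user_id)
}

# slowly() returns a version of the function that waits between calls
slow_call <- slowly(fake_api_call, rate = rate_delay(pause = 0.2))

ids <- c("1282941024852213761", "1275420074233548806")
results <- map_dbl(ids, slow_call)
```

The right pause would have to come from the API's documented limits, which this sketch does not look up.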
:::
# Original error
The error being produced comes from using the `reciprocal_fn()` custom function to create a column `recip` inside the `ccp_accounts` data.frame.
```r
ccp_accounts %>%
# THIS IS THE LINE IN QUESTION. ALL ELSE WORKS FINE
mutate(Language = ifelse(lang == 'en',"English",
ifelse(lang == 'es', "Spanish",
ifelse(lang == 'fr', "French",
ifelse(lang == "ar", "Arabic", NA)))),
reciprocal200 = map_dbl(user_id, reciprocal_fn))
# Error: Problem with `mutate()` input `reciprocal200`.
# x Problem with `filter()` input `..1`.
# x object 'created_at' not found
# ℹ Input `..1` is `created_at == min(created_at)`.
# ℹ Input `reciprocal200` is `map_dbl(user_id, reciprocal_fn)`.
```
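This error is telling us that the failure happens inside `reciprocal_fn()` itself: the `filter(created_at == min(created_at))` step cannot find a `created_at` column in the data frame that `rtweet::get_timeline()` returned for at least one `user_id`.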
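The `object 'created_at' not found` message can be reproduced offline. The sketch below assumes (this is not confirmed in these notes) that a timeline request which returns no tweets yields a data frame with no `created_at` column. An empty tibble stands in for that result:

```r
library(dplyr)

# ASSUMPTION: an empty tibble mimics a timeline call that returned no tweets
empty_timeline <- tibble::tibble()

# Run the same filter() step used inside reciprocal_fn() and capture the error
result <- tryCatch(
  empty_timeline %>% filter(created_at == min(created_at)),
  error = function(e) e
)

# TRUE: filter() failed because the column does not exist
inherits(result, "error")
```

If that assumption holds, a single account with no retrievable tweets would be enough to stop the whole `mutate()` call.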
:::success
:dart: **Create `language` variable:** :arrow_lower_right:
---
I wrote some code for creating the language variable using `dplyr::case_when()` so it's a little cleaner:
```r
ccp_accounts <- ccp_accounts %>%
# here we have multiple ifelse() statements that can be replaced by a single
# case_when() statement
mutate(
language = case_when(
lang == "en" ~ "English",
lang == "es" ~ "Spanish",
lang == "fr" ~ "French",
lang == "ar" ~ "Arabic",
TRUE ~ NA_character_))
# check our work
ccp_accounts %>%
dplyr::count(language, lang) %>%
tidyr::pivot_wider(names_from = language, values_from = n)
# A tibble: 18 x 6
lang Arabic English French Spanish `NA`
<chr> <int> <int> <int> <int> <int>
1 ar 7 NA NA NA NA
2 en NA 87 NA NA NA
3 fr NA NA 9 NA NA
4 es NA NA NA 18 NA
5 de NA NA NA NA 1
6 hu NA NA NA NA 1
7 it NA NA NA NA 1
8 ja NA NA NA NA 5
9 ko NA NA NA NA 1
10 nl NA NA NA NA 1
11 pt NA NA NA NA 5
12 ru NA NA NA NA 1
13 sr NA NA NA NA 2
14 tr NA NA NA NA 2
15 und NA NA NA NA 5
16 ur NA NA NA NA 1
17 zh NA NA NA NA 6
18 NA NA NA NA NA 3
```
:::
## Load the packages
```r
library(rtweet)
library(tidyverse)
library(janitor)
```
## Define Twitter list and use `rtweet::lists_members`
This is the predefined Twitter list ID, stored as a character vector:
```r
CHN_Diplomats <- "1259590304727994368"
# lists_members of twitter api list ---------------------------------------
# returns 40 column df, each row is one member of the twitter list
chn_diplos_list_mems <-
rtweet::lists_members(list_id = CHN_Diplomats)
# chn_diplos_list_mems %>% str(object = ., max.level = 1)
# tibble [156 × 40] (S3: tbl_df/tbl/data.frame)
```
It results in a `data.frame` with 156 rows and 40 columns
## Create data.frame from `rtweet::lookup_users`
Now we get the `user` data from our `chn_diplos_list_mems` object.
```r
## this one line pulls the full list (now 156) of CHN diplos from twitter.
## no need to list out CHN diplo account names into a vector
ccp_accounts <- rtweet::lookup_users(chn_diplos_list_mems$user_id)
ccp_accounts %>% str(object = ., max.level = 1)
# tibble [156 × 90] (S3: tbl_df/tbl/data.frame)
```
This results in a data.frame with 156 rows and 90 columns
`tibble [156 × 90] (S3: tbl_df/tbl/data.frame)`
## Define `reciprocal_fn` function
Below is a function for collecting timeline information (the last `200` tweets) on the users passed in as argument `x`. The goals of this function are:
1. Store the output for `rtweet::get_timeline()` in `df`
```r
# I!I get most recent 200 tweets. measure time span. shorter span is larger
df <- rtweet::get_timeline(x, n = 200)
```
2. Get the date of each user's first tweet (from the last `200`):
```r
# FIND EARLIEST AND LATEST DATE IN TIBBLE
# begin date
bd <- df %>% filter(created_at == min(created_at))
beginDate <- as.Date(bd$created_at)
```
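3. Get the date of each user's last tweet (from the last `200`), then the number of days between the two dates:
```r
# define end date
ed <- df %>% filter(created_at == max(created_at))
endDate <- as.Date(ed$created_at)
# get range in days from earliest to latest
days_range_200_twts <- as.double(difftime(lubridate::ymd(endDate),
                                          lubridate::ymd(beginDate),
                                          units = "days"))
```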
4. Calculate a metric that is higher when the user has a higher level of Twitter activity (that is, when the `200` tweets span fewer days):
```r
# calculate the reciprocal of the span in days
# a shorter span gives a higher number, meaning higher twitter activity
# a longer span gives a lower number, meaning lower twitter activity
reciprocal <- (1 / days_range_200_twts) * 100
```
5. Finally, we return this reciprocal object with a `return()` statement
```r
return(reciprocal)
```
### `reciprocal_fn(x)`
Here we define the actual function.
```r
reciprocal_fn <- function(x){
## I!I get most recent 200 tweets. measure time span. shorter span is larger
df <- rtweet::get_timeline(x, n = 200)
# FIND EARLIEST AND LATEST DATE IN TIBBLE
# begin date
bd <- df %>% filter(created_at == min(created_at))
beginDate <- as.Date(bd$created_at)
# define end date
ed <- df %>% filter(created_at == max(created_at))
endDate <- as.Date(ed$created_at)
# get range in days from earliest to latest
days_range_200_twts <- as.double(difftime(lubridate::ymd(endDate),
lubridate::ymd(beginDate),
units = "days"))
# calculate and return reciprocal (so a higher number means a higher level of twitter activity)
reciprocal <- (1 / days_range_200_twts) * 100
return(reciprocal)
}
```
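One way to make the date arithmetic testable without calling the API is to split it into a helper that takes an already downloaded timeline. This is a sketch, not the function used above: `reciprocal_from_timeline()` is a hypothetical name, and returning `NA` for missing data or a zero-day span is a design choice made here for illustration:

```r
# Hypothetical helper (not in the original code): computes the score from a
# timeline data frame that has already been downloaded
reciprocal_from_timeline <- function(df) {
  # no usable data: return NA instead of raising an error
  if (!"created_at" %in% names(df) || nrow(df) == 0) {
    return(NA_real_)
  }
  span <- as.double(difftime(as.Date(max(df$created_at)),
                             as.Date(min(df$created_at)),
                             units = "days"))
  # a zero-day span would mean dividing by zero
  if (span == 0) {
    return(NA_real_)
  }
  (1 / span) * 100
}

# Made-up timeline covering exactly 30 days
toy <- data.frame(created_at = as.POSIXct(c("2020-07-01", "2020-07-31"), tz = "UTC"))
score_toy   <- reciprocal_from_timeline(toy)           # 100 / 30
score_empty <- reciprocal_from_timeline(data.frame())  # NA
```

Because it returns a single number or `NA` for any input, a helper like this would always satisfy the length-one requirement of `map_dbl()`.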
## Testing `reciprocal_fn`
Below we get a small sample from `ccp_accounts` to test this function:
```r
ccp_accounts %>%
slice(1:5) %>%
select(user_id) %>%
purrr::as_vector() %>%
# datapasta::vector_paste() %>%
dput()
c(user_id1 = "1282941024852213761", user_id2 = "1275420074233548806",
user_id3 = "1274614148320690178", user_id4 = "1262720390977269761",
user_id5 = "1260881050005315586")
```
`dput()` is preferred here because the result is a named vector and `dput()` keeps the names. We store these five users in `small_cpp_users`:
```r
small_cpp_users <- c(user_id1 = "1282941024852213761",
user_id2 = "1275420074233548806",
user_id3 = "1274614148320690178",
user_id4 = "1262720390977269761",
user_id5 = "1260881050005315586")
```
The three tests below tell us something about what the function is currently doing. In this first case, when we send all five users in `small_cpp_users`, we are expecting to see five `reciprocal`s:
```r
reciprocal_fn(x = small_cpp_users)
# [1] 1.086957
```
But we get a single value. If we feed this function the first user as a single text vector, we get a different number.
```r
reciprocal_fn(x = "1282941024852213761")
# [1] 3.333333
```
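Working backwards from the formula, `1.086957` is `100 / 92` and `3.333333` is `100 / 30`, so the two calls saw spans of 92 and 30 days. That is consistent with the function taking one minimum and one maximum over every row it receives, which is what would happen if `get_timeline()` returns all five users' tweets in a single data frame (an assumption, not verified here). The sketch below uses made-up dates for two hypothetical users to show the difference between a pooled and a per-user calculation:

```r
library(dplyr)

# Made-up timelines for two users; the per-user spans are 30 and 10 days
toy <- tibble::tibble(
  user_id    = c("a", "a", "b", "b"),
  created_at = as.Date(c("2020-05-01", "2020-05-31", "2020-07-22", "2020-08-01"))
)

# Pooled: one min and one max across everyone gives a single number
pooled <- 100 / as.double(max(toy$created_at) - min(toy$created_at))

# Per user: group first, then compute one span and one score per user
per_user <- toy %>%
  group_by(user_id) %>%
  summarise(
    days_range = as.double(max(created_at) - min(created_at)),
    reciprocal = 100 / days_range
  )
```

Here the pooled span runs from 2020-05-01 to 2020-08-01 (92 days), so `pooled` describes neither user, while `per_user` has one row and one score for each.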
:::danger
:bulb: Initial thoughts
---
:::
My suspicion is we'll have to break the iteration into