# Collaborative notebook
## R tidyverse for UiO Carpentry
11-04-2022
https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv
https://www.uio.no/english/services/it/network/wireless/help/uioguest.html
### Be a part
User groups in Oslo for R:
Oslo UseR! https://www.meetup.com/en-AU/Oslo-useR-Group/
R-Ladies Oslo https://www.meetup.com/en-AU/rladies-oslo/
### Helpers
Everyone should (in my opinion) be familiar with the RStudio Cheat sheets:
https://www.rstudio.com/resources/cheatsheets/
### Code example from the course
See all the tidyselect helpers by running:
`?tidyr_tidy_select`
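As a quick illustration of a few of those helpers, here is a minimal sketch. It assumes a tiny hand-made tibble called `toy` (not the course data; the values are only illustrative) so that it runs without downloading anything.

```r
library(dplyr)

# Hypothetical stand-in for the penguins data, values for illustration only
toy <- tibble(
  species = c("Adelie", "Gentoo"),
  bill_length_mm = c(39.1, 46.1),
  bill_depth_mm = c(18.7, 13.2),
  flipper_length_mm = c(181, 211),
  body_mass_g = c(3750, 4500)
)

# Select by the start of the column name
bill_cols <- toy %>% select(starts_with("bill"))

# Select by the end of the column name
mm_cols <- toy %>% select(ends_with("_mm"))

# Select by a substring anywhere in the column name
length_cols <- toy %>% select(contains("length"))

# Select by column type rather than by name
numeric_cols <- toy %>% select(where(is.numeric))

names(bill_cols)
names(numeric_cols)
```

The same helpers work inside `pivot_longer()`, `across()` and other tidyverse functions that take a column selection.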
```
# Add the bill depth-to-length ratio and classify each bill by that ratio
penguins %>%
  mutate(
    bill_ratio = bill_depth_mm / bill_length_mm,
    bill_type = if_else(condition = bill_ratio < 0.5,
                        true = "elongated",
                        false = "stumped")
  )
```
All the lesson materials can be found at:
https://athanasiamo.github.io/r-tidyverse-for-working-with-data/01-project-introduction/
### Pivoting on pairs of columns
Sometimes we get to work with data, e.g. aggregated survey results, where the data is shaped as an arbitrary number of key-value columns. That is, there are several column pairs where one column contains a descriptor and the other a value. One way of pivoting them would be to write one `pivot_longer` call for each pair, but that quickly becomes unwieldy.
```
# Build a "semi-wide" data frame with two descriptor/value column pairs
penguin_semi <- penguins %>%
  mutate(id = row_number()) %>%
  pivot_longer(starts_with("bill"),
               names_to = "name_1",
               values_to = "value_1") %>%
  pivot_longer(c(flipper_length_mm, body_mass_g),
               names_to = "name_2",
               values_to = "value_2")

# Pivot all pairs at once; the first five columns are left as they are
penguin_semi %>%
  pivot_longer(
    -c(1:5),
    names_to = c(".value", "col"),
    names_sep = "_"
  )
```
The thing to notice here is the `.value` part in `names_to`. From the help for `pivot_longer` (under the `values_to` argument):
> If `names_to` is a character containing the special `.value` sentinel, this value will be ignored, and the name of the value column will be derived from part of the existing column names.

`".value"` is a special sentinel rather than an ordinary column name. It tells `pivot_longer` that the corresponding part of each pivoted column name (here `name` or `value`) should become the name of an output column, instead of being stored as data in a names column.
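To see the effect in isolation, here is a minimal sketch on a hypothetical two-row data frame called `wide`. The column names and values are made up for illustration; only `pivot_longer()` and its `names_to` and `names_sep` arguments come from the example above.

```r
library(tidyr)
library(dplyr)

# Hypothetical survey-style data: two descriptor/value column pairs per row
wide <- tibble(
  id = 1:2,
  name_1 = c("height", "height"),
  value_1 = c(170, 182),
  name_2 = c("weight", "weight"),
  value_2 = c(65, 80)
)

# The part before "_" (name/value) becomes the output column names via ".value";
# the part after "_" (1/2) is stored in the new column "pair"
long <- wide %>%
  pivot_longer(
    cols = -id,
    names_to = c(".value", "pair"),
    names_sep = "_"
  )

long
```

Each input row turns into one output row per pair, with the descriptor in `name` and the number in `value`.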
Consider a slightly different semi-wide data frame, where the column names contain some identifying information about each penguin.
```
penguins_semi <- penguins %>%
  mutate(id = row_number()) %>%
  pivot_longer(starts_with("bill"),
               names_to = "name_yellow",
               values_to = "value_yellow") %>%
  pivot_longer(starts_with("flipper"),
               names_to = "name_brown",
               values_to = "value_brown")

penguins_semi %>%
  pivot_longer(cols = c(starts_with("name"), starts_with("value")),
               names_to = c(".value", "colour"),
               names_sep = "_")
```
Here, instead of excluding the columns that should not be pivoted with `-c(1:5)`, we select the columns to pivot with a search pattern: those starting with 'name' and 'value'.
The data frame we end up with has the same shape as in the last example, but it is perhaps clearer here that `.value` points back to the actual names of the columns found by our search patterns. So `".value"` is not a literal name for a new column, as a string in `names_to` usually is, but a reference to part of the existing column names.
For a nice explanation and great RegEx examples (finding columns based on pattern matching), see here: https://stackoverflow.com/questions/61386200/how-does-the-names-to-value-convention-work-for-multiple-observations-per-row (and beware of the rabbit holes).
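When the column names have no clean separator, `names_pattern` can take the place of `names_sep`. The following is a minimal sketch, assuming hypothetical columns `x1`, `y1`, `x2`, `y2` with made-up values; each pair of parentheses in the regular expression captures one piece of `names_to`.

```r
library(tidyr)
library(dplyr)

# Hypothetical data where names are a letter followed by a digit, with no separator
wide <- tibble(
  id = 1:2,
  x1 = c(10, 11),
  y1 = c(20, 21),
  x2 = c(30, 31),
  y2 = c(40, 41)
)

# First capture group (letters) feeds ".value", second (digits) feeds "set"
long <- wide %>%
  pivot_longer(
    cols = -id,
    names_to = c(".value", "set"),
    names_pattern = "([a-z]+)(\\d+)"
  )

long
```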
### Messy data - not penguins
To convert "messy" or otherwise inconsistent column names to snake_case,
use the `clean_names` function from the {janitor} package.
https://garthtarr.github.io/meatR/janitor.html
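A minimal sketch of what that looks like, assuming a small hypothetical data frame `messy` with made-up column names and values:

```r
library(janitor)

# Hypothetical data frame with spaces, capitals and punctuation in the names;
# check.names = FALSE stops data.frame() from altering them first
messy <- data.frame(
  `Bill Length` = c(39.1, 46.1),
  `Body Mass (g)` = c(3750, 4500),
  check.names = FALSE
)

# clean_names() rewrites the column names to snake_case
tidy_names <- clean_names(messy)

names(tidy_names)
```

Only the column names change; the data in the columns is left untouched.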
Question: Often I find that a lot of the 'work' for intro courses is already done in the csv. Courses can be easy to follow because the csv you're working with is already 'tidy', but your real csvs often aren't, so I was curious whether there is any information or resources about the csv prep process (that is, how to format your csv)?
> Mo: This is very true. The penguins data we used is already very tidy, so it's easy to work tidily with it.
> Preparing data for work really depends on the data and their state, so giving general advice can sometimes be hard.
> In the R for Data Science book, there is a chapter of particular interest for this topic:
> https://r4ds.had.co.nz/tidy-data.html which should provide some resources on how to tidy data, and what principles to look out for when getting your data into a tidy format.
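
One pattern that chapter covers is a spreadsheet-style table where values of a variable (for example years) are spread across the column headers. As a rough sketch of that single case, assuming a hypothetical table `counts_wide` with illustrative numbers:

```r
library(tidyr)
library(dplyr)

# Hypothetical "spreadsheet-style" data: one column per year
counts_wide <- tibble(
  island = c("Biscoe", "Dream"),
  `2007` = c(44, 46),
  `2008` = c(64, 34)
)

# Move the years out of the headers so each row is one island-year observation
counts_tidy <- counts_wide %>%
  pivot_longer(
    cols = -island,
    names_to = "year",
    values_to = "n_penguins",
    names_transform = list(year = as.integer)
  )

counts_tidy
```

Real files usually need more than this one step, so treat it as an example of the principle rather than a recipe.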
### Romeo networking - Language Technology and Data Analysis Laboratory (LADAL)
Someone asked about a textual analysis tutorial at https://slcladal.github.io/net.html. That tutorial is incomplete and impossible to follow on its own. However, it links to https://colab.research.google.com/drive/1mSzppeBA6Ai3zCmNKfkqgCd_QouRe8BM?usp=sharing&pli=1, which DOES contain the examples that are a prerequisite for running the code that was actually quoted.
I would still say it is ... a messy tutorial, but you will probably learn what you need to. Sorry, I did not get your name, but I hope this gets to the right person.