# Tidy Tuesdays Week 3: College Tuition, Diversity, and Pay
<!--[https://www.economist.com/united-states/2016/04/21/delayed-gratification](https://i.imgur.com/6IQUvbF.png)-->
To start, please visit: https://github.com/BEES-Tidy-Tuesdays/home
You will find a link to this collaborative document called "Week 2 notes".
:::info
**Please Read** :mega:
This is a collaborative markdown document: feel free to add, change, and improve it. We will upload the final document to GitHub after this Tidy Tuesday session and use parts of it as a template for future sessions.
If something is unclear or doesn't make sense, fix it, or make a comment.
:::
### A. Preparation (~5-10 minutes)
#### A1. Make sure R and RStudio are installed and running
#### A2. Download and start exploring the data
Please access the data [here](https://www.tuitiontracker.org/data/download/all-schools.csv). Download it to a folder of your choice. We encourage you to start by creating a new RStudio project and to follow good practices for file management.
<!--OR
To clone the data set from github using git in RStudio:
1. Select "New Project"
2. Select "Version control"
3. Select "Git"
4. Paste "https://github.com/rfordatascience/tidytuesday/tree/master/data/2020/2020-03-10" as the URL, and select where you want to clone the files to on your computer
N.B. You need git installed on your computer, you can download it here:
- https://gitforwindows.org/ (Windows)
- https://git-scm.com/download/mac (Mac) -->
OR
Just use this code below to download the data:
#### OPTION 1
```
# Get the Data
library(tidyverse)
tuition_cost <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-03-10/tuition_cost.csv')
tuition_income <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-03-10/tuition_income.csv')
salary_potential <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-03-10/salary_potential.csv')
historical_tuition <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-03-10/historical_tuition.csv')
diversity_school <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-03-10/diversity_school.csv')
```
#### OPTION 2
```
# read in with tidytuesdayR package (https://github.com/thebioengineer/tidytuesdayR)
# PLEASE NOTE TO USE 2020 DATA YOU NEED TO USE tidytuesdayR version ? from GitHub
# Install via devtools::install_github("thebioengineer/tidytuesdayR")
# Either an ISO-8601 date or a year/week pair works; you only need one of the two calls below
tuesdata <- tidytuesdayR::tt_load('2020-03-10')
tuesdata <- tidytuesdayR::tt_load(2020, week = 11)
tuition_cost <- tuesdata$tuition_cost
```
#### OPTION 3
```
# Raw data from tuitiontracker.org
# The URL must be passed as a quoted string
tuition_tracker <- readr::read_csv("https://www.tuitiontracker.org/data/download/all-schools.csv")
```
### A2. What are the files? What are the variables?
#### `tuition_cost.csv`
|variable |class |description |
|:--------------------|:---------|:-----------|
|name |character |School name |
|state |character | State name |
|state_code |character | State Abbreviation |
|type |character | Type: Public, private, for-profit|
|degree_length |character | 4 year or 2 year degree |
|room_and_board |double | Room and board in USD |
|in_state_tuition |double | Tuition for in-state residents in USD |
|in_state_total |double | Total cost for in-state residents in USD (sum of room & board + in state tuition) |
|out_of_state_tuition |double | Tuition for out-of-state residents in USD|
|out_of_state_total |double | Total cost for out-of-state residents in USD (sum of room & board + out of state tuition) |
#### `tuition_income.csv`
|variable |class |description |
|:-----------|:---------|:-----------|
|name |character | School name |
|state |character | State Name |
|total_price |double | Total price in USD |
|year |double | year |
|campus |character | On or off-campus |
|net_cost |double | Net-cost - average actually paid after scholarship/award |
|income_lvl |character | Income bracket |
#### `salary_potential.csv`
|variable |class |description |
|:-------------------------|:---------|:-----------|
|rank |double | Potential salary rank within state |
|name |character | Name of school |
|state_name |character | state name |
|early_career_pay |double | Estimated early career pay in USD |
|mid_career_pay |double | Estimated mid career pay in USD |
|make_world_better_percent |double | Percent of alumni who think they are making the world a better place |
|stem_percent |double | Percent of student body in STEM |
#### `historical_tuition.csv`
|variable |class |description |
|:------------|:---------|:-----------|
|type |character | Type of school (All, Public, Private) |
|year |character | Academic year |
|tuition_type |character | Tuition type: All Constant (inflation-adjusted dollars), 4 Year Constant, 2 Year Constant, All Current (dollars of the given year), 4 Year Current, 2 Year Current |
|tuition_cost |double | Tuition cost in USD |
#### `diversity_school.csv`
|variable |class |description |
|:----------------|:---------|:-----------|
|name |character | School name |
|total_enrollment |double | Total enrollment of students |
|state |character | State name |
|category |character | Group/Racial/Gender category |
|enrollment |double | enrollment by category |
### A3. Pre-processing
Tip: Cool way to quickly summarise your data
```
#install.packages("summarytools")
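# Call summarytools::view() explicitly: with the tidyverse loaded, a bare `view` resolves to tibble::view()
summarytools::dfSummary(diversity_school) %>%
  summarytools::view()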
#install.packages("skimr")
skimr::skim(diversity_school)
```
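Most of the questions below need two tables joined on the school `name` column, so it is worth checking how well the names line up before joining. The following is a minimal sketch on made-up rows (the `*_toy` data frames are hypothetical stand-ins for `tuition_cost` and `salary_potential`); it uses `dplyr::anti_join()` to list the rows that would end up with missing values after a `left_join()`.

```r
library(dplyr)

# Made-up rows standing in for `tuition_cost` and `salary_potential`
tuition_toy <- data.frame(
  name = c("School A", "School B", "School C"),
  in_state_tuition = c(9000, 42000, 15000)
)
salary_toy <- data.frame(
  name = c("School A", "School B", "School D"),
  early_career_pay = c(50000, 60000, 55000)
)

# Rows of the left table that have no partner in the right table
unmatched <- anti_join(tuition_toy, salary_toy, by = "name")
unmatched

# Share of left-table schools that do find a match
match_share <- mean(tuition_toy$name %in% salary_toy$name)
match_share
```

Swapping in the real tables shows how many schools would carry `NA` pay values after a join by `name`.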
### B. Think of questions we can ask from the data (5 minutes)
Talk to the people nearest you and brainstorm some questions we can ask with this dataset. What could this data tell us? What are some interesting questions we could ask? How do you plan to visualise it?
#### Question ideas
What are the most diverse universities among the 100 largest universities?
**Example**:
:::spoiler
https://twitter.com/thomas_mock/status/1237098932775215104
:::
:::info
**Type in the questions below**:
- Q1 Do higher tuition costs lead to a higher salary potential?
- Q2 Is there a link between income level and diversity?
- Approach:
- Calculate a diversity index (HHI or Shannon-Wiener)
- join `diversity_school.csv` with `salary_potential.csv` (the table that holds `early_career_pay`)
- create a scatterplot: diversity against early career pay
- Q3 What's the best "value" university?
- Highest income per tuition fee
- Q4 Is there an association between tuition costs and enrolment numbers?
:::
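The Q2 approach mentions HHI as one option for a diversity index. As a rough sketch, the Herfindahl-Hirschman index is the sum of squared category shares, so lower values mean a more even (more diverse) mix. The helper name `hhi` and the enrolment counts below are made up for illustration.

```r
# Herfindahl-Hirschman index (HHI): the sum of squared category shares.
# `hhi` is a hypothetical helper name; the counts below are made up.
hhi <- function(counts) {
  counts <- counts[!is.na(counts)]  # ignore missing categories
  p <- counts / sum(counts)         # share of each category
  sum(p^2)                          # 1 = a single category; 1/k = k equal categories
}

even_school   <- c(250, 250, 250, 250)  # four equally sized groups
uneven_school <- c(970, 10, 10, 10)     # dominated by one group

hhi(even_school)    # lower value = more diverse
hhi(uneven_school)  # higher value = less diverse
```

Unlike Shannon, where larger values mean more diversity, HHI runs the other way, so check the direction before plotting.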
### C. Start coding! Share your code here! (30 minutes)
Share your code and your pretty figures here.
Stuck? Share what you have (even code that doesn't work) and ask for help.
#### Data Pre-processing
```
```
#### Q1 Do higher tuition costs lead to a higher salary potential?
##### Answer
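Looks like it: pay tends to rise with tuition for both the early-career (blue) and mid-career (red) estimates, although this is an association and does not show that higher tuition causes higher pay.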
```
cost_salary <- left_join(tuition_cost,
salary_potential,
by = c("name"))
ggplot(cost_salary) +
geom_point(aes(in_state_tuition, early_career_pay), colour = "blue") +
geom_point(aes(in_state_tuition, mid_career_pay), colour = "red")+
geom_smooth(method = "glm",
aes(in_state_tuition, early_career_pay), colour = "blue") +
geom_smooth(method = "glm",
aes(in_state_tuition, mid_career_pay), colour = "red")
```

The trend is tighter with `out_of_state_tuition` than with `in_state_tuition`, probably because out-of-state prices remove the in-state discount that makes public universities look much cheaper.
```
ggplot(cost_salary) +
geom_point(aes(out_of_state_tuition, early_career_pay), colour = "blue") +
geom_point(aes(out_of_state_tuition, mid_career_pay), colour = "red")+
geom_smooth(method = "glm",
aes(out_of_state_tuition, early_career_pay), colour = "blue") +
geom_smooth(method = "glm",
aes(out_of_state_tuition, mid_career_pay), colour = "red")
```
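To put a number on how tight each trend is, one option is to compare the correlation of pay with each tuition measure. This is a sketch on a made-up data frame (`cost_salary_toy` is a hypothetical stand-in for `cost_salary`), using Pearson correlation as one reasonable choice among several.

```r
library(dplyr)

# Made-up rows standing in for the joined `cost_salary` table
cost_salary_toy <- data.frame(
  in_state_tuition     = c(9000, 11000, 10000, 42000, 50000, 46000),
  out_of_state_tuition = c(27000, 33000, 30000, 42000, 50000, 46000),
  early_career_pay     = c(48000, 52000, 50000, 58000, 66000, 61000)
)

# Pearson correlation of pay with each tuition measure;
# use = "complete.obs" skips rows with missing values
cors <- cost_salary_toy %>%
  summarise(
    r_in_state     = cor(in_state_tuition, early_career_pay, use = "complete.obs"),
    r_out_of_state = cor(out_of_state_tuition, early_career_pay, use = "complete.obs")
  )
cors
```

Replacing `cost_salary_toy` with `cost_salary` gives the comparison for the real data.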

#### Q2 Is there a link between income level and diversity?
##### Answer
```
# Convert diversity_school to wide format (one column per category);
# drop Women and Total Minority, which overlap with the other categories
diversity_wide <- diversity_school %>%
  pivot_wider(names_from = category, values_from = enrollment) %>%
  select(-c(Women, "Total Minority"))
# Calculate diversity index
#install.packages(c("vegan", "ggbeeswarm", "quantreg"))
library(vegan)
library(ggbeeswarm)  # provides geom_quasirandom(); geom_quantile() needs quantreg installed
# Calculate the Shannon index per school, then join salary_potential,
# the table that holds early_career_pay (tuition_income does not have it)
diversity_pay <- plyr::ddply(diversity_wide, ~name, function(x) {
  data.frame(SHANNON = diversity(x[-c(1, 2, 3)], index = "shannon"))
}) %>%
  left_join(salary_potential, by = "name")
# Visualise: scatterplot
diversity_pay %>% drop_na() %>%
  ggplot() +
  geom_quasirandom(aes(x = SHANNON, y = early_career_pay), alpha = 0.2) +
  geom_quantile(aes(x = SHANNON, y = early_career_pay), quantiles = c(0.01, 0.5, 0.99)) +
  theme_classic() +
  ylab("Early Career Pay (USD/year)") +
  xlab("Diversity Index (Shannon)")

```
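# p = proportion of each category (the proportions sum to 1)
p <- c(0.5, 0.3, 0.2)  # made-up proportions, for illustration only
# Shannon = -(p1*log(p1) + p2*log(p2) + ...)
Shannon <- -sum(p * log(p))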
```
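The same formula can be applied per school with plain `dplyr`, without `vegan` or `plyr`. This is a sketch on made-up counts (`diversity_toy` is a hypothetical stand-in for `diversity_school` in its original long layout); on the real data you would first drop the `Women` and `Total Minority` rows, as the answer code above does.

```r
library(dplyr)

# Made-up enrolment counts in the same long layout as `diversity_school`
diversity_toy <- data.frame(
  name       = c("School A", "School A", "School A", "School B", "School B", "School B"),
  category   = rep(c("Group 1", "Group 2", "Group 3"), times = 2),
  enrollment = c(100, 100, 100, 280, 10, 10)
)

shannon_by_school <- diversity_toy %>%
  filter(enrollment > 0) %>%                    # log(0) is undefined, so drop empty categories
  group_by(name) %>%
  mutate(p = enrollment / sum(enrollment)) %>%  # proportion within each school
  summarise(shannon = -sum(p * log(p)))         # one Shannon value per school

shannon_by_school
```

A school with evenly sized groups gets the maximum value, `log(k)` for `k` categories, and the value falls as one group dominates.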
#### Q3 What’s the best “value” university?
##### Answer
Brigham Young University-Idaho comes out on top, whether value is based on out-of-state or in-state fees.
```
cost_salary <- cost_salary %>%
mutate(value.out = early_career_pay/out_of_state_tuition,
value.in = early_career_pay/in_state_tuition)
```
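Once the value columns exist, ranking the schools is a matter of sorting. The sketch below uses made-up rows (`value_toy` is a hypothetical stand-in for `cost_salary`) and sorts by the out-of-state ratio; sorting by `value.in` works the same way.

```r
library(dplyr)

# Made-up rows standing in for `cost_salary`
value_toy <- data.frame(
  name                 = c("School A", "School B", "School C"),
  early_career_pay     = c(50000, 60000, 45000),
  in_state_tuition     = c(10000, 40000, 5000),
  out_of_state_tuition = c(25000, 40000, 15000)
)

value_ranked <- value_toy %>%
  mutate(value.out = early_career_pay / out_of_state_tuition,
         value.in  = early_career_pay / in_state_tuition) %>%
  filter(!is.na(value.out)) %>%  # schools missing pay or tuition cannot be ranked
  arrange(desc(value.out))       # best pay per tuition dollar first

head(value_ranked, 3)
```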
#### Q4 Is there an association between tuition costs and enrolment numbers?
##### Answer
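The plots in this answer use a `cost_diversity` table that is not defined anywhere in these notes. One plausible way to build it, shown here as an assumption rather than the original author's code, is to join `tuition_cost` to one row per school from `diversity_school`. The sketch uses made-up rows and assumes `state` holds full state names in both tables, as the data dictionary above suggests.

```r
library(dplyr)

# Made-up rows standing in for `tuition_cost` and `diversity_school`
tuition_toy <- data.frame(
  name             = c("School A", "School B"),
  state            = c("Ohio", "Texas"),
  type             = c("Public", "Private"),
  in_state_tuition = c(9000, 42000)
)
diversity_toy <- data.frame(
  name             = c("School A", "School A", "School B", "School B"),
  state            = c("Ohio", "Ohio", "Texas", "Texas"),
  total_enrollment = c(30000, 30000, 2500, 2500),
  category         = c("Women", "Total Minority", "Women", "Total Minority"),
  enrollment       = c(16000, 9000, 1400, 600)
)

# total_enrollment repeats once per category, so keep one row per school before joining
cost_diversity_toy <- tuition_toy %>%
  left_join(distinct(diversity_toy, name, state, total_enrollment),
            by = c("name", "state"))

cost_diversity_toy
```

Joining on both `name` and `state` avoids mixing up schools that share a name across states.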
First I plotted one against the other
```
ggplot(cost_diversity, aes(in_state_tuition, total_enrollment)) +
geom_point()
```

It seemed bunched up on the y axis, so I transformed it to a log10 scale.
```
ggplot(cost_diversity, aes(in_state_tuition, total_enrollment)) +
geom_point()+
scale_y_log10()
```

Hmm, there seems to be a trend there. But there's also a big cluster on the left that doesn't seem to be part of that trend.
❓What do you think the cluster is?
:::spoiler Answer
```
ggplot(cost_diversity, aes(in_state_tuition, total_enrollment)) +
geom_point(aes(colour = type))+
scale_y_log10()
```

:::
### D. Wrap up (10 minutes)
#### Cool R things I learnt this week
:::success
:tada:
:::
#### Made something cool? Post it on Twitter! #TidyTuesday
Or see what other people around the world did with the data:
https://twitter.com/hashtag/TidyTuesday?src=hashtag_click&f=live
#### What could be improved for the next meeting?
---
### E Help! - ask questions about today's Tidy Tuesday here
::: info
If you help someone in person, please put it here as well, so others can learn. :+1:
:::
#### Q. How do I change the background?
A. It depends on what you want to change it to. In ggplot, the best way to change the background is to use `theme()`.
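As a small illustration, the sketch below sets the panel and plot backgrounds through `theme()`. It uses the built-in `mtcars` data only as a stand-in, and the colours are arbitrary choices.

```r
library(ggplot2)

# `mtcars` ships with R and is used here only as stand-in data
p <- ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  theme(
    panel.background = element_rect(fill = "grey95"),  # area behind the points
    plot.background  = element_rect(fill = "white"),   # area around the panel
    panel.grid.minor = element_blank()                 # hide minor grid lines
  )
p
```

If you only want a cleaner look, a complete theme such as `theme_classic()` (used in the Q2 answer) or `theme_bw()` changes the background in one call.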
#### Q. How can I calculate a diversity index (Shannon-Wiener)?
::: spoiler Answer:
There are a number of packages that can help you calculate this (e.g. vegan).
```
library(vegan)
diversity(data[-1], index="shannon")
#>  Site1  Site2  Site3  Site4  Site5
#> 0.4851 1.2399 1.0905 0.5723 1.2129
#>  Site6  Site7  Site8  Site9 Site10
#> 1.0404 0.9613 0.8522 0.8162 0.6274
#OR
library(plyr)
ddply(data,~Sites,function(x) {
  data.frame(SHANNON=diversity(x[-1], index="shannon"))
})
#>     Sites SHANNON
#> 1   Site1  0.4851
#> 2   Site2  1.2399
#> 3   Site3  1.0905
#> 4   Site4  0.5723
#> 5   Site5  1.2129
#> 6   Site6  1.0404
#> 7   Site7  0.9613
#> 8   Site8  0.8522
#> 9   Site9  0.8162
#> 10 Site10  0.6274
# source: https://www.flutterbys.com.au/stats/tut/tut13.2.html
```
:::
### F General R help - Stuck on something? Need advice on your current project? Can you help answer someone else's question?
(N.B. this document is public, so don't include sensitive or private information)