Tidy Tuesdays Week 3: College Tuition, Diversity, and Pay

To start please visit: https://github.com/BEES-Tidy-Tuesdays/home

You will find a link to this collaborative document called "Week 2 notes".

Please Read

This is a collaborative markdown document: feel free to add, change, and improve it. We will upload the final document to GitHub after this Tidy Tuesday session and use parts of it as a template for future sessions.

If something is unclear or doesn't make sense, fix it, or make a comment.

A. Preparation (~5-10 minutes)

A1. Make sure R and RStudio are installed and running

A2. Download and start exploring the data

Please access the data here. Download and extract it to a folder of your choice. We encourage you to start by creating a new RStudio project and to follow best practices for file management.

Alternatively, use the code below to read the data directly:

OPTION 1

# Get the Data
library(tidyverse)
tuition_cost <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-03-10/tuition_cost.csv')

tuition_income <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-03-10/tuition_income.csv') 

salary_potential <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-03-10/salary_potential.csv')

historical_tuition <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-03-10/historical_tuition.csv')

diversity_school <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-03-10/diversity_school.csv')

OPTION 2

# read in with tidytuesdayR package (https://github.com/thebioengineer/tidytuesdayR)
# PLEASE NOTE TO USE 2020 DATA YOU NEED TO USE tidytuesdayR version ? from GitHub

# Either ISO-8601 date or year/week works!
# Install via devtools::install_github("thebioengineer/tidytuesdayR")

tuesdata <- tidytuesdayR::tt_load('2020-03-10')
tuesdata <- tidytuesdayR::tt_load(2020, week = 11)

tuition_cost <- tuesdata$tuition_cost

OPTION 3

# Raw data from tuitiontracker.org
tuition_tracker <- readr::read_csv('https://www.tuitiontracker.org/data/download/all-schools.csv')

A2. What are the files? What are the variables?

`tuition_cost.csv`

| variable | class | description |
|---|---|---|
| name | character | School name |
| state | character | State name |
| state_code | character | State abbreviation |
| type | character | Type: Public, private, for-profit |
| degree_length | character | 4 year or 2 year degree |
| room_and_board | double | Room and board in USD |
| in_state_tuition | double | Tuition for in-state residents in USD |
| in_state_total | double | Total cost for in-state residents in USD (sum of room & board + in-state tuition) |
| out_of_state_tuition | double | Tuition for out-of-state residents in USD |
| out_of_state_total | double | Total cost for out-of-state residents in USD (sum of room & board + out-of-state tuition) |

`tuition_income.csv`

| variable | class | description |
|---|---|---|
| name | character | School name |
| state | character | State name |
| total_price | double | Total price in USD |
| year | double | Year |
| campus | character | On or off-campus |
| net_cost | double | Net cost: average actually paid after scholarship/award |
| income_lvl | character | Income bracket |

`salary_potential.csv`

| variable | class | description |
|---|---|---|
| rank | double | Potential salary rank within state |
| name | character | Name of school |
| state_name | character | State name |
| early_career_pay | double | Estimated early career pay in USD |
| mid_career_pay | double | Estimated mid career pay in USD |
| make_world_better_percent | double | Percent of alumni who think they are making the world a better place |
| stem_percent | double | Percent of student body in STEM |

`historical_tuition.csv`

| variable | class | description |
|---|---|---|
| type | character | Type of school (All, Public, Private) |
| year | character | Academic year |
| tuition_type | character | Tuition type: All Constant (inflation-adjusted dollars), 4 year degree constant, 2 year constant, Current (dollars of that year), 4 year current, 2 year current |
| tuition_cost | double | Tuition cost in USD |

`diversity_school.csv`

| variable | class | description |
|---|---|---|
| name | character | School name |
| total_enrollment | double | Total enrollment of students |
| state | character | State name |
| category | character | Group/Racial/Gender category |
| enrollment | double | Enrollment by category |

A3. Pre-processing

Tip: Cool way to quickly summarise your data

#install.packages("summarytools")
summarytools::dfSummary(diversity_school) %>%
  summarytools::view()   # summarytools' own viewer, which renders the summary as HTML

#install.packages("skimr")
skimr::skim(diversity_school)

B. Think of questions we can ask from the data (5 minutes)

Talk to the people nearest you and brainstorm some questions we can ask with this dataset. What could this data tell us? What are some interesting questions we could ask? How do you plan to visualise it?

Question ideas

What are the most diverse universities among the 100 largest universities?
Example:

https://twitter.com/thomas_mock/status/1237098932775215104

Type in the questions below:

Q1 Do higher tuition costs lead to a higher salary potential?
Q2 Is there a link between income level and diversity?
- Approach:
  - Calculate a diversity index (HHI or Shannon-Wiener)
  - join diversity_school.csv and salary_potential.csv (early_career_pay is in salary_potential.csv, not tuition_income.csv)
  - create a scatterplot: diversity against early career pay
Q3 What's the best "value" university?
- Highest income per tuition fee
Q4 Is there an association between tuition costs and enrolment numbers?

C. Start coding! Share your code here! (30 minutes)

Share your code and your pretty figures here.

Stuck? Share what you have (even code that doesn't work) and ask for help.

Data Pre-processing

Q1 Do higher tuition costs lead to a higher salary potential?

Answer

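Looks like it: both early-career and mid-career pay rise with tuition in the plots below. This is an association only; it does not show that paying more tuition causes higher pay.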

cost_salary <- left_join(tuition_cost, 
                         salary_potential, 
                         by = c("name"))
                         
ggplot(cost_salary) +
  geom_point(aes(in_state_tuition, early_career_pay), colour = "blue") +
  geom_point(aes(in_state_tuition, mid_career_pay), colour = "red")+
  geom_smooth(method = "glm",
              aes(in_state_tuition, early_career_pay), colour = "blue") +
  geom_smooth(method = "glm",
              aes(in_state_tuition, mid_career_pay), colour = "red")

The trend is tighter when we use out_of_state_tuition rather than in_state_tuition, probably because out-of-state tuition removes the in-state discount that makes public universities look cheap.

ggplot(cost_salary) +
  geom_point(aes(out_of_state_tuition, early_career_pay), colour = "blue") +
  geom_point(aes(out_of_state_tuition, mid_career_pay), colour = "red")+
  geom_smooth(method = "glm",
              aes(out_of_state_tuition, early_career_pay), colour = "blue") +
  geom_smooth(method = "glm",
              aes(out_of_state_tuition, mid_career_pay), colour = "red")
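
One way to check the "tighter trend" impression is to compare correlation coefficients instead of judging the plots by eye. The sketch below is only an illustration: the helper name `pay_correlation` and the toy numbers are made up, and the toy data frame stands in for `cost_salary`.

```r
# Hypothetical helper (not part of the original notes): Pearson correlation
# between a tuition column and a pay column, skipping rows with missing values.
pay_correlation <- function(df, tuition_col, pay_col) {
  cor(df[[tuition_col]], df[[pay_col]], use = "complete.obs")
}

# Toy data with made-up numbers, standing in for cost_salary
toy <- data.frame(
  in_state_tuition     = c(5000, 7000, 30000, 45000, NA),
  out_of_state_tuition = c(15000, 20000, 30000, 45000, 25000),
  early_career_pay     = c(40000, 45000, 50000, 60000, 48000)
)

pay_correlation(toy, "in_state_tuition", "early_career_pay")
pay_correlation(toy, "out_of_state_tuition", "early_career_pay")
```

With the real data, the same call would be `pay_correlation(cost_salary, "out_of_state_tuition", "early_career_pay")`. A higher value for the out-of-state column would support the observation above.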

Q2 Is there a link between income level and diversity?

Answer

# Convert diversity_school to wide format: one row per school, one column per category
diversity_wide <- diversity_school %>% 
  pivot_wider(names_from = category, values_from = enrollment) %>%
  select(-c(Women, "Total Minority"))   # drop categories that overlap with the others
  
# Calculate diversity index
#install.packages(c("vegan", "ggbeeswarm", "quantreg"))
library(vegan)
library(ggbeeswarm)   # provides geom_quasirandom(); geom_quantile() needs quantreg installed

# Calculate the Shannon index per school, then join salary_potential,
# because early_career_pay is in salary_potential, not in tuition_income
diversity_pay <- plyr::ddply(diversity_wide, ~name, function(x) {
           # columns 1-3 are name, total_enrollment and state; the rest are category counts
           data.frame(SHANNON = diversity(x[-c(1,2,3)], index = "shannon"))
 }) %>% 
  left_join(salary_potential, by = "name")
  
# Visualise: scatterplot
diversity_pay %>% drop_na() %>% 
  ggplot() +
  geom_quasirandom(aes(x = SHANNON, y = early_career_pay), alpha = 0.2) + 
  geom_quantile(aes(x = SHANNON, y = early_career_pay), quantiles = c(0.01, 0.5, 0.99)) +
  theme_classic() +
  ylab("Early Career Pay (USD/year)") + 
  xlab("Diversity Index (Shannon)")

# How the Shannon index is defined (p = proportion of each category, n categories):
# Shannon = -(p1*log(p1) + p2*log(p2) + ... + pn*log(pn))
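
To make the formula concrete, here is a small worked example in base R. The function name `shannon` and the enrollment counts are made up for illustration. For counts without missing values it should give the same result as `vegan::diversity(counts, index = "shannon")`, which also uses the natural log by default.

```r
# Shannon index from raw counts: H = -sum(p * log(p)), where p is the
# proportion of each category. Zero counts are dropped, because the
# term 0 * log(0) is treated as 0.
shannon <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]
  -sum(p * log(p))
}

# Made-up enrollment counts for two hypothetical schools
even_school   <- c(250, 250, 250, 250)  # four equally sized groups
uneven_school <- c(970, 10, 10, 10)     # one dominant group

shannon(even_school)    # equals log(4), the maximum for four groups
shannon(uneven_school)  # much closer to 0
```

The index is highest when all groups are the same size and falls towards 0 as one group dominates.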

Q3 What’s the best “value” university?

Answer

Brigham Young University-Idaho, whether value is based on out-of-state or in-state tuition. Here "value" means early career pay divided by tuition.

cost_salary <- cost_salary %>%
  mutate(value.out = early_career_pay/out_of_state_tuition,
         value.in = early_career_pay/in_state_tuition)
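
The code above only adds the value columns; to see which school comes out on top, the rows still need to be sorted. A minimal sketch using dplyr follows. The toy schools and numbers are made up and stand in for `cost_salary`, and the `filter()` step is an assumption about how to treat schools with missing pay or tuition.

```r
library(dplyr)

# Toy data (made-up schools and numbers) standing in for cost_salary
toy_cost_salary <- tibble(
  name                 = c("School A", "School B", "School C"),
  in_state_tuition     = c(4000, 10000, 40000),
  out_of_state_tuition = c(4000, 25000, 40000),
  early_career_pay     = c(48000, 50000, 60000)
)

ranked <- toy_cost_salary %>%
  mutate(value.out = early_career_pay / out_of_state_tuition,
         value.in  = early_career_pay / in_state_tuition) %>%
  filter(!is.na(value.out)) %>%   # schools missing pay or tuition cannot be ranked
  arrange(desc(value.out))        # best value first

head(ranked, 3)
```

Sorting by `value.in` instead of `value.out` gives the in-state ranking.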

Q4 Is there an association between tuition costs and enrolment numbers?

Answer
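
The plots in this answer use a data frame called `cost_diversity`, which is not defined anywhere in these notes. The sketch below shows one way it could be built, assuming it is `tuition_cost` joined to one row per school from `diversity_school`. The helper name `build_cost_diversity`, the choice of join, and the toy data are all assumptions for illustration.

```r
library(dplyr)

# Hypothetical helper: keep one row per school from diversity_school,
# then attach total_enrollment to tuition_cost.
# ASSUMPTION: this is how cost_diversity was built; the notes do not say.
build_cost_diversity <- function(tuition_cost, diversity_school) {
  enrollment <- diversity_school %>%
    distinct(name, state, total_enrollment)   # collapse the per-category rows
  tuition_cost %>%
    inner_join(enrollment, by = c("name", "state"))
}

# Toy data (made up) to show the shape of the result
toy_cost <- tibble(name = c("School A", "School B"),
                   state = c("Ohio", "Texas"),
                   type = c("Public", "Private"),
                   in_state_tuition = c(9000, 35000))
toy_div <- tibble(name = rep("School A", 2),
                  total_enrollment = 20000,
                  state = "Ohio",
                  category = c("Women", "White"),
                  enrollment = c(11000, 12000))

build_cost_diversity(toy_cost, toy_div)
```

With the real data this would be `cost_diversity <- build_cost_diversity(tuition_cost, diversity_school)`. Collapsing to one row per school first matters because `diversity_school` has one row per school and category, so a direct join would repeat every school several times.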

First I plotted one against the other

ggplot(cost_diversity, aes(in_state_tuition, total_enrollment)) +
         geom_point()

It seemed bunched up on the y axis, so I transformed it to a log10 scale.

ggplot(cost_diversity, aes(in_state_tuition, total_enrollment)) +
         geom_point()+
  scale_y_log10()

Hmm, there seems to be a trend there. But there's also a big cluster on the left that doesn't seem to be part of that trend.

❓What do you think the cluster is?

Answer

ggplot(cost_diversity, aes(in_state_tuition, total_enrollment)) +
         geom_point(aes(colour = type))+
  scale_y_log10()

D. Wrap up (10 minutes)

Cool R things I learnt this week

Make something cool? Post it on twitter! #TidyTuesday

Or see what other people around the world did with the data:
https://twitter.com/hashtag/TidyTuesday?src=hashtag_click&f=live

What could be improved for the next meeting?

E Help! - ask questions about today's Tidy Tuesday here

If you help someone in person, please put it here as well, so others can learn.

Q. How do I change the background?

A. It depends on what you want to change it to. In ggplot, the best way to change the background is to use theme().
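
As a rough illustration of the `theme()` approach, the sketch below changes the panel and plot backgrounds. It uses the built-in `mtcars` data only to have something to plot, and the colours are arbitrary choices.

```r
library(ggplot2)

# Built-in mtcars data, used only to have something to plot
p <- ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  theme(
    panel.background = element_rect(fill = "lightblue"),  # area behind the points
    plot.background  = element_rect(fill = "grey95")      # area around the panel
  )

p
# Or swap the whole look with a complete theme, e.g. p + theme_classic()
```

`panel.background` and `plot.background` are separate theme elements, so it is worth checking which one you actually want to change.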

Q. How can I calculate a diversity index (Shannon-Wiener)?

Answer:

There are a number of packages that can help you calculate this (e.g. vegan).

library(vegan)
# `data` has one row per site: the first column (Sites) is the site name, the rest are counts
diversity(data[-1], index="shannon")

# Output:
#  Site1  Site2  Site3  Site4  Site5 
# 0.4851 1.2399 1.0905 0.5723 1.2129 
#  Site6  Site7  Site8  Site9 Site10 
# 1.0404 0.9613 0.8522 0.8162 0.6274 

#OR
library(plyr)
ddply(data,~Sites,function(x) {
        data.frame(SHANNON=diversity(x[-1], index="shannon"))
})

# Output:
#     Sites SHANNON
# 1   Site1  0.4851
# 2   Site2  1.2399
# 3   Site3  1.0905
# 4   Site4  0.5723
# 5   Site5  1.2129
# 6   Site6  1.0404
# 7   Site7  0.9613
# 8   Site8  0.8522
# 9   Site9  0.8162
# 10 Site10  0.6274

# source: https://www.flutterbys.com.au/stats/tut/tut13.2.html

F General R help - Stuck on something? Need advice on your current project? Can you help answer someone else's question?

(N.B. this document is public, so don't include sensitive or private information)

