
Tidy Tuesdays Week 2: National Hockey League (NHL) Goals

To start please visit: https://github.com/BEES-Tidy-Tuesdays/home

You will find a link to this collaborative document called "Week 2 notes".

Please Read

This is a collaborative markdown document: feel free to add, change, and improve it. We will upload the final document to github after this Tidy Tuesday session and use parts of it as a template for future sessions.

If something is unclear or doesn't make sense, fix it, or make a comment.

A. Preparation (~5-10 minutes)

A1. Make sure R and RStudio are installed and running

A2. Download and start exploring the data

Please access the data here. Download and extract it to a folder of your choice. We encourage you to start by creating a new R project and to follow best practices for file management.

OR

To clone the data set from github using git in RStudio:

  1. Select "New Project"
  2. Select "Version control"
  3. Select "Git"
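  4. Paste "https://github.com/rfordatascience/tidytuesday" as the repository URL (git clones whole repositories, so a link to a subfolder will not work), and select where you want to clone the files to on your computer. This week's data are in the data/2020/2020-03-03 folder.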

N.B. You need git installed on your computer, you can download it here:

OR

Just use this code to download the data

# Get the Data

game_goals <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-03-03/game_goals.csv')

top_250 <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-03-03/top_250.csv')

season_goals <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-03-03/season_goals.csv')

# Or read in with tidytuesdayR package (https://github.com/thebioengineer/tidytuesdayR)
# PLEASE NOTE TO USE 2020 DATA YOU NEED TO USE tidytuesdayR version ? from GitHub

# Either ISO-8601 date or year/week works!

# Install via devtools::install_github("thebioengineer/tidytuesdayR")

tuesdata <- tidytuesdayR::tt_load('2020-03-03')
tuesdata <- tidytuesdayR::tt_load(2020, week = 10)


game_goals <- tuesdata$game_goals

A2. What are the files? What are the variables?

Files

The data is broken down into a few files:

| File | Description |
| --- | --- |
| top_250.csv | Top 250 NHL Career Leaders and Records for Goals |
| game_goals.csv | Records of Goals for each player and each game |
| season_goals.csv | Records of Goals for each player and each season |

Variables/Attributes

top_250.csv

Please note this is the top 250 goal scorers as found here.

| variable | class | description |
| --- | --- | --- |
| raw_rank | double | Rank of goals (blank if duplicate) |
| player | character | Player Name |
| years | character | Years active (start - end) |
| total_goals | double | Total goals scored in the NHL |
| url_number | double | Number for URL |
| raw_link | character | Raw player ID |
| link | character | Link to player details on hockeyreference.com |
| active | character | Status: If still playing = Active, if retired = retired |
| yr_start | double | First year in the NHL |

game_goals.csv

Goals for each player and each game (only for players who started at or after 1979-80 season). This is due to limited game-level data prior to 1980.

| variable | class | description |
| --- | --- | --- |
| player | character | Player name |
| season | double | Season year |
| rank | double | Rank equivalent to game_num for most |
| date | double | Date of game (ISO format) |
| game_num | double | Game number within each season |
| age | character | Age in year-days |
| team | character | NHL team |
| at | character | At: blank if at home, @ if at the opponent arena |
| opp | character | Opponent |
| location | character | Location = location of game (home or away) |
| outcome | character | Outcome = Won, Loss, Tie |
| goals | double | Goals Scored by player |
| assists | double | Assists - helped with goal for other player |
| points | double | Points - Sum of goals + assists |
| plus_minus | double | Plus Minus - Team points minus opponents points scored while on ice |
| penalty_min | double | Penalty minutes - minutes spent in penalty box |
| goals_even | double | Goals scored while even-strength |
| goals_powerplay | double | Goals scored on powerplay |
| goals_short | double | Goals scored while short-handed |
| goals_gamewinner | double | Goals that were gamewinner |
| assists_even | double | Assists while even strength |
| assists_powerplay | double | Assists on powerplay |
| assists_short | double | Assists on shorthanded |
| shots | double | Shots |
| shot_percent | double | Shot percent (goals/shots) |

season_goals.csv

| variable | class | description |
| --- | --- | --- |
| rank | double | Overall goals ranking (1 - 250) |
| position | character | Position = player position (C = center, RW = Right Wing, LW = left Wing) |
| hand | character | Dominant hand (left or right) |
| player | character | Player name |
| years | character | Season years (year-yr) |
| total_goals | double | Total goals scored in career |
| status | character | Status = retired or active |
| yr_start | double | year started in NHL |
| season | character | Specific season for the player |
| age | double | Age during season |
| team | character | Team during season |
| league | character | League during season |
| season_games | double | Games played in the season |
| goals | double | Goals scored in the season |
| assists | double | Assists in the season |
| points | double | Points in the season |
| plus_minus | double | Plus Minus in the season - Team points minus opponents points scored while on ice |
| penalty_min | double | Penalty Minutes in the season |
| goals_even | double | Goals scored while even strength in a season |
| goals_power_play | double | Goals scored on powerplay in a season |
| goals_short_handed | double | Goals short handed in a season |
| goals_game_winner | double | Goals that were game winner in a season |
| headshot | character | Player headshot (URL to image of their head) |
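
The age column in game_goals.csv is stored as text ("Age in year-days"), so it may need splitting before it can go on a plot axis. Here is a minimal sketch on a toy data frame. It assumes the values look like "19-032" (years, a hyphen, then days), which is a guess from the description and has not been checked against the file. The names toy_age, age_years, age_days and age_decimal are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for game_goals; the "years-days" format is an assumption
toy_age <- data.frame(
  player = c("A", "A"),
  age    = c("19-032", "20-100")
)

toy_age <- toy_age %>%
  # Split the text at the hyphen and convert both parts to integers
  separate(age, into = c("age_years", "age_days"),
           sep = "-", convert = TRUE, remove = FALSE) %>%
  # Approximate age in years (ignores leap years)
  mutate(age_decimal = age_years + age_days / 365)

print(toy_age)
```

A numeric age like this could be useful for questions that compare players at the same age.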

A3. Pre-processing

  1. Are the data 'tidy'?

    Hint: What is tidy data?

  2. Do we need to filter out records with missing data?
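
One way to approach the second question is to count the missing values in each column before deciding what to drop. This is a rough sketch on a toy data frame (toy_goals is made up, not the real data); swap in game_goals, season_goals or top_250 once they are loaded.

```r
library(dplyr)

# Toy stand-in for game_goals (hypothetical values)
toy_goals <- data.frame(
  player = c("A", "A", "B", "B"),
  goals  = c(1, 0, 2, NA),
  shots  = c(3, NA, 5, NA)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy_goals))
print(na_counts)

# Keep only the rows where the columns we plan to use are complete
complete_rows <- toy_goals %>%
  filter(!is.na(goals), !is.na(shots))
print(complete_rows)
```

Filtering only on the columns needed for a given question keeps more rows than dropping every row with any NA.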

Tip: Cool way to quickly summarise your data

install.packages("summarytools")
library(summarytools)

# creates an HTML output summarising each of the columns
view(dfSummary(game_goals))

B. Think of questions we can ask from the data (5 minutes)

Talk to the people nearest you and brainstorm some questions we can ask with this dataset. What could this data tell us? What are some interesting questions we could ask? How do you plan to visualise it?

Question ideas

Example:

Will Alex Ovechkin beat Gretzky by age 40?

Type in the questions below:

  • Q1 Which variables are highly correlated? Start with game_goals.csv (Kat)
  • Q2 Which season has more goals? (Dony)
  • Q3 Is the number of goals for each player increasing over time?

C. Start coding! Share your code here! (30 minutes)

Share your code, and your pretty figures here.

Stuck? Share what you have and ask for help.

Preprocessing code

Read data


Can we map where the missing data are?

naniar::vis_miss(game_goals)
naniar::vis_miss(season_goals)
naniar::vis_miss(top_250)
  • Why are there many NAs in the top_250$raw_rank column?
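
A possible explanation comes from the variable table, which describes raw_rank as "blank if duplicate": tied players may only show the rank on the first row. The sketch below uses a toy data frame with made-up numbers to show two ways of filling those gaps. It assumes that a blank always means "same rank as the row above", which should be checked against the real top_250 data.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for top_250 (hypothetical players and totals)
toy_top <- data.frame(
  raw_rank    = c(1, 2, NA, 4),
  player      = c("P1", "P2", "P3", "P4"),
  total_goals = c(50, 40, 40, 30)
)

toy_filled <- toy_top %>%
  # Carry the last non-missing rank down into the blank rows
  fill(raw_rank, .direction = "down") %>%
  # Alternative: recompute the rank from the goal totals, ties share a rank
  mutate(rank_check = min_rank(desc(total_goals)))

print(toy_filled)
```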

Q1

Answer
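
One possible starting point for Q1 is a correlation matrix of the numeric columns. The sketch below runs on a toy data frame (toy_games, with made-up values) so it works without downloading anything; replacing toy_games with game_goals should give the real picture.

```r
library(dplyr)

# Toy stand-in for game_goals (hypothetical values)
toy_games <- data.frame(
  player  = c("A", "A", "A", "B", "B", "B"),
  goals   = c(0, 1, 2, 0, 1, 3),
  assists = c(1, 0, 1, 0, 2, 1),
  shots   = c(2, 3, 6, 1, 4, 7)
)
toy_games$points <- toy_games$goals + toy_games$assists

# Pairwise correlations between all numeric columns,
# using whatever complete pairs are available for each pair of columns
cor_matrix <- toy_games %>%
  select(where(is.numeric)) %>%
  cor(use = "pairwise.complete.obs")

print(round(cor_matrix, 2))
```

Columns that are built from each other, such as points from goals and assists, will be strongly correlated by construction, so they are worth reading with that in mind.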

Q2 Which Season has more goals?

Answer
library(dplyr)
library(ggplot2)

# Total goals per season
# (game_goals only covers players who started in or after the 1979-80 season)
season_totals <- game_goals %>%
  group_by(season) %>%
  summarise(total_goals = sum(goals))

# Plot the season totals with a smoothed trend line
season_totals %>%
  ggplot(aes(x = season, y = total_goals)) +
  geom_point() + # one point per season
  geom_smooth() +
  theme_classic()

# Highest was in 2012 with 874 goals
max(season_totals$total_goals)
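
max() returns only the goal count, not the season it belongs to. A small sketch of one way to get both at once, using a toy table with made-up totals (toy_totals stands in for season_totals):

```r
library(dplyr)

# Toy stand-in for season_totals (hypothetical values)
toy_totals <- data.frame(
  season      = c(2010, 2011, 2012),
  total_goals = c(5, 9, 7)
)

# Keep the row (or rows, if tied) with the largest total
top_season <- toy_totals %>%
  slice_max(total_goals, n = 1)

print(top_season)
```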

Q3 Is the number of goals for each player increasing over time?

Answer: No
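# For each player-season, total the player's goals and divide that total
# by the number of distinct players recorded in the same season
game_goals_2 <- game_goals %>%
  group_by(season, player) %>%
  mutate(total_goals_player = sum(goals)) %>%
  group_by(season) %>%
  mutate(
    total_goals_season = sum(goals),
    sum_player = n_distinct(player),
    total_goals_standardised = total_goals_player / sum_player
  ) %>%
  # Keep one row per player per season
  distinct(season, player, .keep_all = TRUE)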


# The number of goals per player has been declining
ggplot(data = game_goals_2,
       aes(x = season, 
           y = total_goals_standardised)) +
  geom_jitter(color = "pink", alpha = 0.3) +
  geom_smooth(stat = "smooth", color="red") +
  theme_classic()

Potential explanations:

  • the game has become more competitive
  • there is more teamwork in recent years
  • there are fewer records in the early years
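
To check the third explanation, it may help to look at the player count and the average goals per player side by side. The sketch below is an alternative summary, not the calculation used above, and it runs on a toy data frame with made-up values (toy_games stands in for game_goals).

```r
library(dplyr)

# Toy stand-in for game_goals (hypothetical values)
toy_games <- data.frame(
  season = c(2000, 2000, 2000, 2001, 2001, 2001, 2001),
  player = c("A", "A", "B", "A", "B", "C", "C"),
  goals  = c(1, 2, 1, 0, 1, 1, 1)
)

# One row per season: total goals, number of players, and their ratio
goals_per_player <- toy_games %>%
  group_by(season) %>%
  summarise(
    total_goals      = sum(goals),
    n_players        = n_distinct(player),
    goals_per_player = total_goals / n_players
  )

print(goals_per_player)
```

If n_players grows a lot over the seasons, any per-player trend should be read with the changing sample in mind.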

Q4 How does the number of teams change across seasons?

Answer
game_goals %>%
  group_by(season) %>%
  summarise(total_teams = n_distinct(team)) %>%
  ggplot(aes(x = season, y = total_teams))+
  geom_point()+
  geom_smooth()+
  theme_classic()

D. Wrap up (10 minutes)

Cool R things I learnt this week

What could be improved for the next meeting?


E. Help! - ask questions about today's Tidy Tuesday here

If you help someone in person, please put it here as well, so others can learn.

Q. How do I change the background?

A. It depends on what you want to change it to. In ggplot, the best way to change the background is to use theme().
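
As a rough illustration of the theme() approach, the sketch below changes the panel and plot backgrounds on a toy plot. The data frame, colours and object names are arbitrary choices for the example.

```r
library(ggplot2)

# Toy data, just to have something to plot
toy_df <- data.frame(x = 1:5, y = c(2, 4, 3, 5, 6))

p <- ggplot(toy_df, aes(x = x, y = y)) +
  geom_point() +
  theme(
    # Area behind the points
    panel.background = element_rect(fill = "lightblue"),
    # Area around the panel, behind the axis labels
    plot.background  = element_rect(fill = "grey95"),
    # Remove the minor grid lines
    panel.grid.minor = element_blank()
  )

p
```

A complete theme such as theme_classic() or theme_minimal() replaces the whole look in one call; if you combine the two, put the theme() tweaks after the complete theme so they are not overwritten.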

F. General R help - Stuck on something? Need advice on your current project? Can you help answer someone else's question?

(N.B. this document is public, so don't include sensitive or private information)
