# Tidy Tuesdays Week 2: National Hockey League (NHL) Goals
To start, please visit: https://github.com/BEES-Tidy-Tuesdays/home
There you will find a link to this collaborative document, called "Week 2 notes".
:::info
**Please Read** :mega:
This is a collaborative markdown document: feel free to add, change, and improve it. We will upload the final document to GitHub after this Tidy Tuesday session and use parts of it as a template for future sessions.
If something is unclear or doesn't make sense, fix it or leave a comment.
:::
### A. Preparation (~5-10 minutes)
#### A1. Make sure R and RStudio are installed and running
#### A2. Download and start exploring the data
Please access the data [here](https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-03-03/readme.md). Download and extract it to a folder of your choice. We encourage you to start by creating a new RStudio project and to use best practices for file management.
OR
To clone the data set from GitHub using git in RStudio:
1. Select "New Project"
2. Select "Version control"
3. Select "Git"
4. Paste "https://github.com/rfordatascience/tidytuesday" as the repository URL (git clones whole repositories, not sub-folders), and select where you want to clone the files to on your computer. This week's data are in the `data/2020/2020-03-03` folder of the clone.
N.B. You need git installed on your computer; you can download it here:
- https://gitforwindows.org/ (Windows)
- https://git-scm.com/download/mac (Mac)
OR
Just use this code to download the data:
```
# Get the Data
game_goals <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-03-03/game_goals.csv')
top_250 <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-03-03/top_250.csv')
season_goals <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-03-03/season_goals.csv')
# Or read in with tidytuesdayR package (https://github.com/thebioengineer/tidytuesdayR)
# PLEASE NOTE TO USE 2020 DATA YOU NEED TO USE tidytuesdayR version ? from GitHub
# Either ISO-8601 date or year/week works!
# Install via devtools::install_github("thebioengineer/tidytuesdayR")
tuesdata <- tidytuesdayR::tt_load('2020-03-03')
# or, equivalently: tuesdata <- tidytuesdayR::tt_load(2020, week = 10)
game_goals <- tuesdata$game_goals
top_250 <- tuesdata$top_250
season_goals <- tuesdata$season_goals
```
#### A2. What are the files? What are the variables?
#### Files
The data is broken down into a few files:
| File | Description |
| ------------- | ------------- |
| top_250.csv | Top 250 NHL Career Leaders and Records for Goals |
| game_goals.csv | Records of Goals for each player and each game |
| season_goals.csv | Records of Goals for each player and each season |
#### Variables/Attributes
##### `top_250.csv`
Please note this is the top 250 goal scorers as found [here](https://www.hockey-reference.com/leaders/goals_career.html).
|variable |class |description |
|:-----------|:---------|:-----------|
|raw_rank |double | Rank of goals (blank if duplicate) |
|player |character | Player Name |
|years |character | Years active (start - end) |
|total_goals |double | Total goals scored in the NHL |
|url_number |double | Number for URL |
|raw_link |character | Raw player ID |
|link |character | Link to player details on hockeyreference.com |
|active |character | Status: If still playing = Active, if retired = retired|
|yr_start |double |First year in the NHL |
##### `game_goals.csv`
Goals for each player and each game, limited to players who started in or after the 1979-80 season, because game-level data before 1980 are limited.
|variable |class |description |
|:-----------------|:---------|:-----------|
|player |character | Player name |
|season |double | Season year |
|rank |double | Rank equivalent to game_num for most |
|date |double | Date of game (ISO format) |
|game_num |double | Game number within each season|
|age |character | Age in year-days|
|team |character | NHL team |
|at |character | At: blank if at home, @ if at the opponent arena |
|opp |character | Opponent |
|location |character | Location = location of game (home or away) |
|outcome |character | Outcome = Won, Loss, Tie |
|goals |double | Goals Scored by player|
|assists |double | Assists - helped with goal for other player |
|points |double | Points - Sum of goals + assists |
|plus_minus |double | Plus Minus - Team goals minus opponent goals scored while the player is on the ice|
|penalty_min |double | Penalty minutes - minutes spent in penalty box |
|goals_even |double | Goals scored while even-strength |
|goals_powerplay |double | Goals scored on powerplay |
|goals_short |double | Goals scored while short-handed|
|goals_gamewinner |double | Game-winning goals|
|assists_even |double | Assists while even-strength|
|assists_powerplay |double | Assists on powerplay|
|assists_short |double | Assists while short-handed|
|shots |double | Shots|
|shot_percent |double | Shot percent (goals/shots)|
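
The `age` column is stored as text in a year-days format, so it needs converting before it can go on a numeric axis (for example, to compare careers by age). A minimal base R sketch, assuming values look like `"20-150"` (20 years, 150 days); the helper `age_to_years()` and the example values are made up for illustration:

```r
# Hypothetical helper: convert "years-days" strings to decimal years.
# Assumes the format "YY-DDD"; check a few rows of game_goals$age first.
age_to_years <- function(age) {
  parts <- strsplit(age, "-", fixed = TRUE)
  years <- as.numeric(vapply(parts, `[`, character(1), 1))
  days  <- as.numeric(vapply(parts, `[`, character(1), 2))
  years + days / 365.25
}

example_age <- c("20-150", "35-000") # illustrative values, not real rows
age_to_years(example_age)
```

With the real data this could be used as `game_goals$age_years <- age_to_years(game_goals$age)`.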
##### `season_goals.csv`
|variable |class |description |
|:------------------|:---------|:-----------|
|rank |double |Overall goals ranking (1 - 250)|
|position |character | Position = player position (C = center, RW = Right Wing, LW = left Wing)|
|hand |character |Dominant hand (left or right) |
|player |character | Player name|
|years |character | Season years (year-yr)|
|total_goals |double | Total goals scored in career |
|status |character |Status = retired or active|
|yr_start |double | year started in NHL|
|season |character | Specific season for the player|
|age |double |Age during season|
|team |character | Team during season |
|league |character |League during season|
|season_games |double |Games played in the season|
|goals |double |Goals scored in the season|
|assists |double |Assists in the season|
|points |double |Points in the season|
|plus_minus |double | Plus Minus in the season - Team goals minus opponent goals scored while the player is on the ice|
|penalty_min |double |Penalty Minutes in the season |
|goals_even |double |Goals scored while even strength in a season|
|goals_power_play |double |Goals scored on powerplay in a season|
|goals_short_handed |double |Goals scored while short-handed in a season|
|goals_game_winner |double |Game-winning goals in a season|
|headshot |character | Player headshot (URL to image of their head) |
#### A3. Pre-processing
1. Are the data 'tidy'?
Hint: [What is tidy data?](https://r4ds.had.co.nz/tidy-data.html#fig:tidy-structure)
2. Do we need to filter out records with missing data?
Tip: Cool way to quickly summarise your data
```
install.packages("summarytools")
library(summarytools)
# creates an HTML output summarising each column
view(dfSummary(game_goals))
```
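
For question 2, a quick base R check is to count the missing values in each column before deciding whether to drop rows. The sketch below uses a small made-up data frame (`toy`) in place of `game_goals` so that it runs on its own:

```r
# Toy stand-in for game_goals (values invented for illustration)
toy <- data.frame(
  player = c("A", "B", "C", "D"),
  goals  = c(1, 0, NA, 2),
  shots  = c(3, NA, NA, 5)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
na_counts

# Keep only the rows with no missing values in any column
toy_complete <- toy[complete.cases(toy), ]
nrow(toy_complete)
```

Whether to drop incomplete rows depends on the question: removing them can bias results if the missing values are concentrated in particular seasons.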
### B. Think of questions we can ask from the data (5 minutes)
Talk to the people nearest you and brainstorm some questions we can ask with this dataset. What could this data tell us? What are some interesting questions we could ask? How do you plan to visualise it?
#### Question ideas
**Example**:
:::spoiler
[Will Alex Ovechkin beat Gretzky by age 40?](https://twitter.com/lauriejhopkins/status/1234551933966352385)
:::
:::info
**Type in the questions below**:
- Q1 Which variables are highly correlated? Start with `game_goals.csv` (Kat)
- Q2 Which season has the most goals? (Dony)
- Q3 Is the number of goals for each player increasing over time?
:::
### C. Start coding! Share your code here! (30 minutes)
Share your code, and your pretty figures here.
Stuck? Share what you have and ask for help.
#### Preprocessing code
#### Read data
```
```
:::warning
**Can we map where the missing data are?** :zap:
```
naniar::vis_miss(game_goals)
naniar::vis_miss(season_goals)
naniar::vis_miss(top_250)
```
- Why are there many NAs in the `top_250$raw_rank` column? (Hint: the variable table says the rank is blank if it is a duplicate.)
:::
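
On the `raw_rank` question: the variable table describes `raw_rank` as blank when the rank is a duplicate, so tied players after the first carry `NA`. One way to recover a rank for every player is to carry the last non-missing value downwards. A sketch with `tidyr::fill()`, using invented rows rather than the real `top_250`:

```r
library(tidyr)

# Invented rows that mimic the tie pattern in top_250
toy_top <- data.frame(
  raw_rank    = c(1, 2, NA, 4),
  player      = c("P1", "P2", "P3", "P4"),
  total_goals = c(800, 700, 700, 650)
)

# Carry the last observed rank down into the tied (NA) rows
toy_filled <- fill(toy_top, raw_rank, .direction = "down")
toy_filled$raw_rank
```

This assumes the rows are already sorted by rank, as they are in the toy example; check the ordering of the real file before filling.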
#### Q1
##### Answer
```
```
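
One possible starting point for Q1, offered as a sketch rather than a full answer: compute pairwise correlations between the numeric columns. The toy data frame below stands in for `game_goals` (its values are invented); with the real data, swap `toy_games` for `game_goals`.

```r
# Toy stand-in for game_goals (invented values)
toy_games <- data.frame(
  player  = c("A", "A", "B", "B", "C"),
  goals   = c(0, 1, 2, 0, 1),
  assists = c(1, 0, 1, 0, 2),
  shots   = c(2, 4, 6, 1, 3)
)
toy_games$points <- toy_games$goals + toy_games$assists

# Keep numeric columns only, then correlate every pair.
# "pairwise.complete.obs" ignores NAs pair by pair instead of returning NA.
num_cols   <- toy_games[sapply(toy_games, is.numeric)]
cor_matrix <- cor(num_cols, use = "pairwise.complete.obs")
round(cor_matrix, 2)
```

Expect `points` to correlate strongly with `goals` and `assists` in the real data, since it is defined as their sum.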
#### Q2 Which season has the most goals?
##### Answer
```
library(dplyr)
library(ggplot2)

# Total goals per season (only goals recorded in game_goals)
season_totals <- game_goals %>%
  group_by(season) %>%
  summarise(total_goals = sum(goals))

season_totals %>%
  ggplot(aes(x = season, y = total_goals)) +
  geom_point() + # one point per season
  geom_smooth() +
  theme_classic()

# Highest was in 2012 with 874 goals
season_totals %>%
  filter(total_goals == max(total_goals))
```

#### Q3 Is the number of goals for each player increasing over time?
##### Answer: No
```
library(dplyr)
library(ggplot2)

game_goals_2 <- game_goals %>%
  group_by(season, player) %>%
  # total goals by each player in each season
  mutate(total_goals_player = sum(goals)) %>%
  group_by(season) %>%
  mutate(
    total_goals_season = sum(goals),  # all recorded goals that season
    sum_player = n_distinct(player),  # number of players recorded that season
    # each player's season total divided by the number of players that season
    total_goals_standardised = total_goals_player / sum_player
  ) %>%
  distinct(season, player, .keep_all = TRUE)

# The standardised number of goals per player has been declining
ggplot(data = game_goals_2,
       aes(x = season, y = total_goals_standardised)) +
  geom_jitter(color = "pink", alpha = 0.3) +
  geom_smooth(color = "red") +
  theme_classic()
```

Potential explanations:
- the game has become more competitive
- there is more teamwork in recent years
- there are fewer records in the early years
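
The last explanation can be checked by normalising for the number of records: divide each season's goals by the number of player-games recorded that season. A base R sketch on invented data (`toy_seasons` stands in for `game_goals`, where each row is one player in one game):

```r
# Toy stand-in for game_goals (invented values)
toy_seasons <- data.frame(
  season = c(1980, 1980, 1990, 1990, 1990, 1990),
  player = c("A", "A", "A", "B", "B", "B"),
  goals  = c(1, 2, 0, 1, 1, 0)
)

# Total goals and number of rows (player-games) per season
totals <- aggregate(goals ~ season, data = toy_seasons, FUN = sum)
games  <- aggregate(goals ~ season, data = toy_seasons, FUN = length)

per_game <- data.frame(
  season         = totals$season,
  goals_per_game = totals$goals / games$goals
)
per_game
```

If goals per player-game stay roughly flat across seasons in the real data, differences in season totals are more likely to reflect how many records exist than a change in scoring.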
#### Q4 How does the number of teams change across seasons?
##### Answer
```
game_goals %>%
  group_by(season) %>%
  summarise(total_teams = n_distinct(team)) %>%
  ggplot(aes(x = season, y = total_teams)) +
  geom_point() +
  geom_smooth() +
  theme_classic()
```

### D. Wrap up (10 minutes)
#### Cool R things I learnt this week
:::success
:tada:
:::
#### What could be improved for the next meeting?
---
### E. Help! - ask questions about today's Tidy Tuesday here
:::info
If you help someone in person, please put it here as well, so others can learn. :+1:
:::
#### Q. How do I change the background?
A. It depends on what you want to change it to. In ggplot, the best way to change the background is to use `theme()`.
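
For example, a sketch that changes the panel and plot backgrounds with `theme()` and `element_rect()`; the colours and the built-in `mtcars` data are arbitrary choices for illustration:

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# panel.background is the area behind the data;
# plot.background is the area around it (behind titles and axes)
p_custom <- p +
  theme(
    panel.background = element_rect(fill = "lightblue"),
    plot.background  = element_rect(fill = "grey95")
  )
p_custom
```

Complete themes such as `theme_classic()` replace earlier theme settings, so add `theme()` tweaks after them rather than before.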
### F. General R help - Stuck on something? Need advice on your current project? Can you help answer someone else's question?
(N.B. this document is public, so don't include sensitive or private information)