Proposal for 313
## Dataset
```{r superbowl, message = FALSE}
superbowl <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-03-02/youtube.csv')
# Namespace-qualified so the chunk runs even if dplyr has not been attached
dplyr::glimpse(superbowl)
```
## Description
This dataset was created by FiveThirtyEight from ads originally collected through superbowl-ads.com. It contains 247 commercials and `r ncol(superbowl)` variables, including the year, brand, and view count. However, the ads come from only `r length(unique(superbowl$brand))` brands. FiveThirtyEight created the dataset to identify the defining characteristics of Super Bowl ads from popular brands like Toyota and Bud Light. The questions they asked included: Was it funny? Was it patriotic? Did it include animals? Did it use sex to sell the product? Afterwards, they explored how these categories cluster with each other and found some unique combinations, such as ads that included both sex appeal and animals.
Some of the variables in this dataset are integer-valued counts, including `view_count`, `like_count`, and `dislike_count`. Other variables are characters, including `brand`, `title`, and `description`. There are also logical variables, which show either `TRUE` or `FALSE` for some categories, including whether or not the ad is funny, is patriotic, includes animals, and uses sex. These logical variables were determined by the FiveThirtyEight team as they watched all of the advertisements.
## Why this dataset?
We chose this dataset for a number of reasons. First, all the members of our group have an interest in the Super Bowl, and all of us watch the game every year. Each of us remembers certain ads, and we are aware of how much attention Super Bowl ads receive. As a group, we thought it would be interesting to look at what makes ads popular and whether there is a statistical explanation for why we remember specific ads and not others.
The dataset also fits the parameters discussed in the project description. There is a wide array of numerical and categorical variables. The dataset also allows us to ask two distinct questions: one about how the popularity of ads depends on which categories they include (or do not include), and the other about how the variables change over time. There are many different angles from which we can analyze the dataset, and potentially interesting visualizations that can be made.
## Questions
### What factors contribute to the most viewed ads, and has the relationship between those factors and the views changed over time?
For our first question, we want to get a general sense of which characteristics of an ad contribute to high view counts. Specifically, we want to investigate how the variables `animals`, `celebrity`, and `use_sex` affect `view_count`. For the second part of the question, we will investigate how the trends and relationships explored in the first part have changed over time.
### What is the relationship between the popularity of a video and how much viewers interact with it?
This question deals with whether popularity is connected with rating, as well as with overall interaction with a video. We will look at how the variables `view_count`, `like_count`, `dislike_count`, `comment_count`, and `favorite_count` are related. For the first part of this question, we can look at the number of views and the ratio of likes to dislikes. For the second part, we can use the number of comments and favorites to show the degree of interaction with videos.
## Analysis plan
### Question 1 Plan
For the first plot, we would like to create three bar graphs, one for each of the logical variables `animals`, `celebrity`, and `use_sex`. Each graph will have two bars (one for `TRUE`, one for `FALSE`), with average view count on the y-axis. We will then show the like-to-dislike ratio as a green:red fill on the bars. This means that we will have to calculate the average view count for each category and boolean value (3 x 2 = 6 different bars/calculations). We will then identify trends based on the newly created plots.
For the second plot, we would like to create a line plot with `year` on the x-axis and `view_count` on the y-axis. We may need to mutate `year` in some way so it fits cleanly on the x-axis. We then want six lines with points plotted as well (we can layer `geom_point` and `geom_line`). The lines will use three colors, one each for `animals`, `celebrity`, and `use_sex`. A line will be solid when the condition is true and dashed when it is false (3 x 2 = 6 unique lines in total). We will then identify trends based on the newly created plots.
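The six averages described above could be computed roughly as follows. This is only a sketch: `toy_ads` is a made-up stand-in for `superbowl`, the names `avg_views` and `bar_plot` are our own, and dropping missing view counts with `na.rm = TRUE` is an assumption about how we would treat them. The green:red fill is left out here.

```{r q1-bar-sketch, message = FALSE}
library(dplyr)
library(tidyr)
library(ggplot2)

# Toy stand-in for `superbowl`; the values are invented for illustration.
toy_ads <- tibble(
  animals    = c(TRUE, FALSE, TRUE, FALSE),
  celebrity  = c(FALSE, FALSE, TRUE, TRUE),
  use_sex    = c(FALSE, TRUE, FALSE, TRUE),
  view_count = c(100, 200, 300, NA)
)

# Reshape so each row is one (ad, characteristic) pair, then average the
# views within each characteristic/value combination: 3 x 2 = 6 summaries.
avg_views <- toy_ads %>%
  pivot_longer(c(animals, celebrity, use_sex),
               names_to = "characteristic", values_to = "present") %>%
  group_by(characteristic, present) %>%
  summarise(avg_view_count = mean(view_count, na.rm = TRUE), .groups = "drop")

# One panel per characteristic, two bars (FALSE/TRUE) per panel.
bar_plot <- ggplot(avg_views, aes(x = present, y = avg_view_count)) +
  geom_col() +
  facet_wrap(~characteristic) +
  labs(x = "Characteristic present", y = "Average view count")
bar_plot
```

Faceting gives the three separate bar graphs from a single summary table, so the same six numbers can be reused when we discuss the trends.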
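One possible way to build the six lines is sketched below. Because several ads can share a year, the sketch averages `view_count` within each year, which is an assumption on our part rather than something fixed in the plan. `toy_ads`, `yearly_views`, and `line_plot` are hypothetical names, and the toy values are invented.

```{r q1-line-sketch, message = FALSE}
library(dplyr)
library(tidyr)
library(ggplot2)

# Toy stand-in for `superbowl`; the values are invented for illustration.
toy_ads <- tibble(
  year       = c(2000, 2000, 2001, 2001),
  animals    = c(TRUE, FALSE, TRUE, FALSE),
  celebrity  = c(TRUE, FALSE, FALSE, TRUE),
  use_sex    = c(FALSE, TRUE, TRUE, FALSE),
  view_count = c(10, 30, 50, 70)
)

# Average views per year for every characteristic/value combination.
yearly_views <- toy_ads %>%
  pivot_longer(c(animals, celebrity, use_sex),
               names_to = "characteristic", values_to = "present") %>%
  group_by(year, characteristic, present) %>%
  summarise(avg_view_count = mean(view_count, na.rm = TRUE), .groups = "drop")

# Color identifies the characteristic; line type identifies TRUE vs FALSE.
line_plot <- ggplot(yearly_views,
                    aes(x = year, y = avg_view_count,
                        color = characteristic, linetype = present)) +
  geom_line() +
  geom_point() +
  scale_linetype_manual(values = c("TRUE" = "solid", "FALSE" = "dashed")) +
  labs(x = "Year", y = "Average view count")
line_plot
```

Here `year` is numeric and plots directly, so it may turn out that no mutation of `year` is needed.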
### Question 2 Plan
For the first part of this question, our first plot will have view count on the x-axis and the like:dislike ratio on the y-axis. We will try several plot types and color schemes to find the one that shows this best. Histograms, line plots, point plots, and filling points/bars/lines with different colors are all under consideration. We may need to create a new variable that holds the ratio of likes to dislikes for each video.
For the second part of our question, our biggest challenge will be figuring out how to graph comments and favorites. Combining the variables by adding them would not be a good idea, because we want to see whether there is a relationship between the trends in comments and favorites. We will experiment with putting view count on the x-axis and finding the best way to graph both comments and favorites on the y-axis. We will also experiment with different colors and plot types (line plot, point plot, side-by-side histograms) to best show this relationship.
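The new ratio variable might be created along these lines. This is a sketch with made-up numbers: `like_dislike_ratio`, `ads_with_ratio`, and `ratio_plot` are names we are introducing, setting the ratio to `NA` when a video has no dislikes is an assumption (it avoids dividing by zero), and the log-scaled x-axis is only one option we might try.

```{r q2-ratio-sketch, message = FALSE}
library(dplyr)
library(ggplot2)

# Toy stand-in for `superbowl`; the values are invented for illustration.
toy_ads <- tibble(
  view_count    = c(1000, 50000, 200, 7500),
  like_count    = c(10, 400, 3, 0),
  dislike_count = c(2, 100, 0, 5)
)

# Likes per dislike; undefined (NA) when there are no dislikes.
ads_with_ratio <- toy_ads %>%
  mutate(like_dislike_ratio = if_else(dislike_count > 0,
                                      like_count / dislike_count,
                                      NA_real_))

# Point plot as a first attempt; rows with an NA ratio are dropped by ggplot2.
ratio_plot <- ggplot(ads_with_ratio,
                     aes(x = view_count, y = like_dislike_ratio)) +
  geom_point() +
  scale_x_log10() +
  labs(x = "View count (log scale)", y = "Likes per dislike")
ratio_plot
```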
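One option for showing both measures without adding them together is to reshape the data so that comments and favorites share a single y variable. The sketch below uses invented values; `toy_ads`, `interaction_long`, and `interaction_plot` are hypothetical names, and the free y-axis per panel is an assumption that the two counts sit on very different scales.

```{r q2-interaction-sketch, message = FALSE}
library(dplyr)
library(tidyr)
library(ggplot2)

# Toy stand-in for `superbowl`; the values are invented for illustration.
toy_ads <- tibble(
  view_count     = c(1000, 50000, 200),
  comment_count  = c(5, 120, 1),
  favorite_count = c(0, 3, 0)
)

# Stack the two interaction measures: one row per (ad, metric) pair.
interaction_long <- toy_ads %>%
  pivot_longer(c(comment_count, favorite_count),
               names_to = "metric", values_to = "count")

# Side-by-side panels with the same x-axis keep the two trends comparable.
interaction_plot <- ggplot(interaction_long,
                           aes(x = view_count, y = count, color = metric)) +
  geom_point() +
  facet_wrap(~metric, scales = "free_y") +
  labs(x = "View count", y = "Count")
interaction_plot
```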