---
title: "Final project"
author: "Your Name"
date: "current version: `r Sys.Date()`"
output:
  html_document:
    highlight: kate
    code_folding: show
    theme: cosmo
    df_print: paged
    toc: true
    toc_depth: 6
    toc_float:
      collapsed: true
      smooth_scroll: false
---
```{r setup, include=FALSE}
library(knitr)
library(tidyverse)
library(janitor)
library(rmarkdown)
# base options ----
base::options(
  tibble.print_max = 25,
  tibble.width = 78,
  scipen = 100000000,
  max.print = 999999
)
# knitr chunk options ----
knitr::opts_chunk$set(
  echo = TRUE,        # show (TRUE) or hide (FALSE) all code
  # results = "hide", # uncomment to hide results
  tidy = FALSE,       # TRUE reformats code for cleaner printing
  comment = "#",      # prefix for printed console output
  eval = TRUE,        # set to FALSE to stop code chunks from running
  message = TRUE,     # show messages
  fig.width = 7,      # figure width (inches)
  fig.height = 5,     # figure height (inches)
  warning = FALSE,    # hide warnings (set to TRUE to show them)
  size = "small",     # size of the text
  fig.path = "img/"   # location of figures
)
# knitr knit settings ----
knitr::opts_knit$set(
  width = 78
)
```
# Instructions
Set the code chunks to `eval=TRUE` and replace the dataset/variables with the
data from your EDA project. All of the text (with the exception of the code
chunks and section headers) should be deleted and replaced with your own words.
# Motivation
Some inspiration:
>
> "...as much as EDA is a set of tools, it’s also a mindset. And that mindset
> is about your relationship with the data. You want to understand the
> data—gain intuition, understand the shape of it, and try to connect your
> understanding of the process that generated the data to the data itself.
> EDA happens between you and the data and isn’t about proving anything to
> anyone else yet." - from Doing Data Science, by Rachel Schutt & Cathy O’Neil
>
Who constructed this data set, when, and why? Do some research on the movies
dataset--it's from the ggplot2 package, so check out the reference here:
https://ggplot2.tidyverse.org/reference/index.html
Someone put together all this information. What was the original purpose
behind the dataset's construction?
Data is more than just a bunch of numbers and text. What activity, instance
or phenomenon do these data represent (people, places, products, etc.)?
# Import
Import your data below:
```{r import-movies_raw, message=FALSE, warning=FALSE}
# example
movies_raw <- readr::read_csv(file = "data/imdb-movies.csv")
# standardize names
movies <- janitor::clean_names(movies_raw)
```
# Inspect
Use the `skimr::skim()` function to print the summary statistics for the
dataset. Use this output to help you understand what you're seeing in the
data visualizations.
```{r skim, eval=TRUE}
# replace the code below with the data from your eda project
skimr::skim(diamonds)
```
Use the `skimr` output above to look through each of the columns in your data
set, and be sure you understand what they are.
Which columns are numerical or categorical? Are there date variables? If so,
what are the minimum and maximum dates? Information like this gives us context
for the data.
What units were the quantities measured in?
Are there columns that represent record numbers, IDs, or descriptions (instead
of data to compute with)?
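The questions above can also be checked directly in code. The chunk below is a minimal sketch that uses the `diamonds` data from `ggplot2` as a stand-in for your own data frame; the object names `col_classes` and `numeric_cols` are illustrative only.

```{r column-classes, eval=TRUE}
library(ggplot2) # provides the diamonds example data

# first class of every column (ordered factors report "ordered")
col_classes <- vapply(diamonds, function(col) class(col)[1], character(1))

# how many columns there are of each class
table(col_classes)

# names of the columns that hold numbers you can compute with
numeric_cols <- names(diamonds)[vapply(diamonds, is.numeric, logical(1))]
numeric_cols
```

Columns that are numeric but really act as record numbers or IDs will still show up in `numeric_cols`, so compare this list against your data dictionary.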
# Single Variable Graphs
Create the labels for your first single variable graph:
```{r labs_hist, eval=TRUE}
labs_hist <- labs(title = "Histogram of [ ]",
                  x = "[Variable Name with units]")
```
Create the single variable graph and add the labels:
```{r geom_histogram, eval=FALSE}
# replace the code below with the data from your eda project
ggplot2::diamonds %>%
  ggplot(aes(x = depth)) +
  geom_histogram() +
  labs_hist
```
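When called without arguments, `geom_histogram()` falls back to 30 bins and prints a message asking you to pick a better value. The chunk below is a sketch of setting the bin width yourself; the value `0.5` and the object name `p_depth` are assumptions chosen for the `diamonds` example, so try several widths on your own data.

```{r histogram-binwidth, eval=TRUE}
library(ggplot2) # provides ggplot(), geom_histogram(), and diamonds

# binwidth is in the units of the x variable (here: depth percentage points)
p_depth <- ggplot(diamonds, aes(x = depth)) +
  geom_histogram(binwidth = 0.5) +
  labs(title = "Histogram of depth",
       x = "depth (total depth percentage)")

p_depth
```

A narrower width shows more detail but also more noise, while a wider one smooths over small features, so the shape you describe in your write-up can depend on this choice.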
Create the labels for your second single variable graph:
```{r labs_freq, eval=TRUE}
labs_freq <- labs(title = "Frequency polygon of [ ]",
                  x = "[Variable Name with units]")
```
Create another single variable graph and add the appropriate labels:
```{r geom_freqpoly, eval=TRUE}
# replace the code below with the data from your eda project
ggplot2::diamonds %>%
  ggplot(aes(x = carat)) +
  geom_freqpoly() +
  labs_freq
```
## Expectations
Depending on your familiarity with the topic, you should have some expectations
about what the dataset should contain. Check out the 'typical values' section
from R4DS:
https://bit.ly/r4ds-typical-values
Of the single variable graphs you created, did you see any unusual or unexpected findings?
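If a graph hints at something unusual, it can help to back it up with counts. The chunk below is a rough sketch using `diamonds`: it counts observations in fixed-width bins, then flags values outside 1.5 times the interquartile range. That cutoff is an assumed rule of thumb (the same one box-plot whiskers use), not a requirement of this assignment, and the names `carat_bins`, `carat_limits`, and `unusual` are illustrative.

```{r unusual-values, eval=TRUE}
library(ggplot2) # provides diamonds and cut_width()
library(dplyr)   # provides count(), filter(), and %>%

# count observations in bins of width 0.5 to see where typical values fall
carat_bins <- diamonds %>%
  count(cut_width(carat, width = 0.5), name = "n_diamonds")
carat_bins

# lower and upper fences: quartiles -/+ 1.5 * IQR
carat_limits <- quantile(diamonds$carat, c(0.25, 0.75)) +
  c(-1.5, 1.5) * IQR(diamonds$carat)

# rows whose carat falls outside the fences
unusual <- diamonds %>%
  filter(carat < carat_limits[1] | carat > carat_limits[2])
nrow(unusual)
```

Flagged rows are not necessarily errors. Look at them and decide whether they are data entry problems or genuinely rare observations before you write about them.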
# Bivariate or Multivariate Graphs
Use two columns (variables) from your dataset to build a bivariate or multivariate graph.
```{r labs_box, eval=TRUE}
labs_box <- labs(
  title = "[Variable X] by [Variable Y]",
  subtitle = "source: [link to data]",
  fill = "[Variable Y]",
  x = "[Variable X]",
  y = "[Variable Y]")
```
Build the graph following the labels you've created above:
```{r geom_boxplot, eval=TRUE}
# replace the code below with the data from your eda project
ggplot2::diamonds %>%
  ggplot() +
  geom_boxplot(aes(x = cut,
                   y = carat,
                   fill = cut),
               alpha = 1/5,
               show.legend = FALSE) +
  labs_box
```
Below is the code for a second bivariate or multivariate graph. More than one
is not required, but a second graph usually gives you more to write about:
```{r labs_freq_facet, eval=TRUE}
labs_freq_facet <- labs(
  title = "[Variable X] by [Variable Y]",
  subtitle = "source: [link to data]",
  fill = "[Variable Y]",
  x = "[Variable X]",
  y = "[Variable Y]")
```
```{r facet_wrap, eval=TRUE}
# replace the code below with the data from your eda project
ggplot2::diamonds %>%
  ggplot(aes(x = carat,
             y = price)) +
  geom_point(aes(color = cut),
             show.legend = FALSE) +
  facet_wrap(~ cut) +
  labs_freq_facet
```
## Interpretation
What charts/graphs did you use?
Describe the graphs you used in your EDA project. Since we've been using
`ggplot2`, you can use the documentation or the text to help you:
https://ggplot2.tidyverse.org/reference/index.html
Do these graphs represent similarities or differences? Change? Growth? Use this
chapter from R for Data Science (R4DS) as a guide:
https://bit.ly/r4ds-eda
Write about some of the relationships you observed in your visualizations.
If you used a box-plot, histogram, or scatter-plot, check out this image that
displays the relationship between the three:
https://bit.ly/r4ds-boxplot
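When you write about a box-plot, it can help to have the numbers behind the boxes in front of you. The chunk below is a sketch that computes the quartiles drawn by a box-plot of `carat` by `cut` in `diamonds`; the object name `carat_by_cut` is illustrative, and you should substitute your own grouping and measurement columns.

```{r boxplot-summary, eval=TRUE}
library(ggplot2) # provides the diamonds example data
library(dplyr)   # provides group_by(), summarise(), n(), and %>%

# one row per group: the box edges (q1, q3), the middle line (median),
# and the number of observations in the group
carat_by_cut <- diamonds %>%
  group_by(cut) %>%
  summarise(
    q1 = quantile(carat, 0.25, names = FALSE),
    median = median(carat),
    q3 = quantile(carat, 0.75, names = FALSE),
    n = n()
  )

carat_by_cut
```

Comparing the medians and the width of each box (`q3 - q1`) across groups gives you concrete values to cite when you describe similarities or differences.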
Include the data dictionary with this file when you turn it in.
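If your data did not come with a data dictionary, you can start one from the column names. The chunk below is only a sketch: it uses `diamonds` as a stand-in, the `description` column is left empty for you to fill in by hand, and the file name in the commented-out line is a hypothetical example.

```{r data-dictionary, eval=TRUE}
library(ggplot2) # provides the diamonds example data

# one row per column: its name, its first class, and an empty description
data_dictionary <- data.frame(
  variable = names(diamonds),
  class = vapply(diamonds, function(col) class(col)[1], character(1)),
  description = NA_character_, # fill in by hand (meaning, units, source)
  row.names = NULL
)

data_dictionary

# hypothetical output path; uncomment to save the skeleton next to this file
# write.csv(data_dictionary, "data-dictionary.csv", row.names = FALSE)
```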