--- title: "Final project" author: "Your Name" date: "current version: `r Sys.Date()`" output: html_document: highlight: kate code_folding: show theme: cosmo df_print: paged toc: true toc_depth: 6 toc_float: collapsed: true smooth_scroll: false --- ```{r setup, include=FALSE} library(knitr) library(tidyverse) library(janitor) library(rmarkdown) # base options ---- base::options( tibble.print_max = 25, tibble.width = 78, scipen = 100000000, max.print = 999999 ) # knitr chunk options ---- knitr::opts_chunk$set( echo = TRUE, # show/hide all code # results = "hide", # hide/show results tidy = FALSE, # cleaner code printing comment = "#", # better console printing eval = TRUE, # turn this to FALSE stop code chunks from running message = TRUE, # show messages fig.width = 7, # figure width fig.height = 5, # figure height warning = FALSE, # show warnings size = "small", # size of the text fig.path = "img/" # location of figures ) # knitr knit settings ---- knitr::opts_knit$set( width = 78 ) ``` # Instructions Set the code chunks to `eval=TRUE` and replace the dataset/variables with the data from your EDA project. All of the text (with the exception of the code chunks and section headers) should be deleted and replaced with your own words. # Motivation Some inspiration: > > "..as much as EDA is a set of tools, it’s also a mindset. And that mindset > is about your relationship with the data. You want to understand the > data—gain intuition, understand the shape of it, and try to connect your > understanding of the process that generated the data to the data itself. > EDA happens between you and the data and isn’t about proving anything to > anyone else yet." - from Doing Data Science, by Rachel Schutt & Cathy O’Neil > Who constructed this data set, when, and why? Do some research on the movies dataset--it's from the ggplot2 package, so check out the reference here: https://ggplot2.tidyverse.org/reference/index.html Someone put together all this information. What was the original purpose behind the dataset's construction? Data is more than just a bunch of numbers and text. What activity, instance or phenomenon do these data represent (people, places, products, etc.)? # Import Import your data below: ```{r import-movies_raw, message=FALSE, warning=FALSE} # example movies_raw <- readr::read_csv(file = "data/imdb-movies.csv") # standardize names movies <- janitor::clean_names(movies_raw) ``` # Inspect Use the `skimr::skim()` function to print the summary statistics for the dataset. Use this output to help you understand what you're seeing in the data visualizations. ```{r skim, eval=TRUE} # replace the code below with the data from your eda project skimr::skim(diamonds) ``` Use the `skimr` output above to look through each of the columns in your data set, and be sure you understand what they are. Which columns are numerical or categorical? Are their date variables? If so, what are the minimum and maximum dates? Information likes this gives us context to the data. What units were the quantities measured in? Are their columns that represent record numbers, IDs, or descriptions (instead of data to compute with)? # Single Variable Graphs Create the label for your single variable graph ```{r labs_hist, eval=TRUE} labs_hist <- labs(title = "Histogram of [ ]", x = "[Variable Name with units]") ``` Create the single variable graph (add the labels) ```{r geom_histogram, eval=FALSE} # replace the code below with the data from your eda project ggplot2::diamonds %>% ggplot(aes(x = depth)) + geom_histogram() + labs_hist ``` Create the label for your single variable graph ```{r labs_freq, eval=TRUE} labs_freq <- labs(title = "Frequency polygon of [ ]", x = "[Variable Name with units]") ``` Create another single variable graph (add the appropriate labels) ```{r geom_freqpoly, eval=TRUE} # replace the code below with the data from your eda project ggplot2::diamonds %>% ggplot(aes(x = carat)) + geom_freqpoly() + labs_freq ``` ## Expectations Depending on your familiarity with the topic, you should have some expectations about what the dataset should contain.Check out the 'typical values' section from R4DS: https://bit.ly/r4ds-typical-values Of the single variable graphs you created, did you see any unusual or unexpected findings? # Bivariate or Multivariate Graphs Use two columns (variables) from your dataset to build a bivariate or multivariate graph. ```{r labs_box, eval=TRUE} labs_box <- labs( title = "[Variable X] by [Variable Y]", subtitle = "source: [link to data]", fill = "[Variable Y]", x = "[Variable X]", y = "[Variable Y]") ``` Build the graph following the labels you've created above: ```{r geom_boxplot, eval=TRUE} # replace the code below with the data from your eda project ggplot2::diamonds %>% ggplot() + geom_boxplot(aes(x = cut, y = carat, fill = cut), alpha = 1/5, show.legend = FALSE) + labs_box ``` Below is the code for a 2nd bivariate or multivariate graph. It's not required to have more than one, but it usually helps provide more information to write about: ```{r labs_freq_facet, eval=TRUE} labs_freq_facet <- labs( title = "[Variable X] by [Variable Y]", subtitle = "source: [link to data]", fill = "[Variable Y]", x = "[Variable X]", y = "[Variable Y]") ``` ```{r facet_wrap, eval=TRUE} # replace the code below with the data from your eda project ggplot2::diamonds %>% ggplot(aes(x = carat, y = price)) + geom_point(aes(color = cut), show.legend = FALSE) + facet_wrap(~ cut) + labs_freq_facet ``` ## Interpretation What charts/graphs are did you use? Describe the graphs you used in your EDA project. Since we've been using `ggplot2`, you can use the documentation or the text to help you: https://ggplot2.tidyverse.org/reference/index.html Do these graphs represent similarities or differences? Change? Growth? Use this chapter from R for Data Science (R4DS) as a guide: https://bit.ly/r4ds-eda Write about some of the relationships you observed in your visualizations If you used a box-plot, histogram, or scatter-plot, check out this image that displays the relationship between the three: https://bit.ly/r4ds-boxplot Include the data dictionary with this file when you turn it in.