---
title: Mini Project
label: 'homework'
layout: 'post'
geometry: margin=2cm
tags: project
---

# CS100 Mini-Project

### Fun with EDA!

##### Due: October 18, 2022, at 10 pm

### Instructions

This is a pair programming assignment. Please refresh your memory about pair programming [here](https://cs.brown.edu/courses/csci0170/content/docs/pair-programming.pdf). As the name suggests, pair programming requires that you work with a partner. If you have trouble finding a partner, please post on EdStem that you are looking for a match. If you cannot find one that way, please contact the TAs for help.

Handin instructions: Each submission should include both your code, as an R Markdown (`.Rmd`) file (suppressing code, or not, as appropriate), and the resulting PDF, produced by running `Knit PDF` on the R Markdown file. Partners should submit their mini-projects as a group, using Gradescope’s group submission feature. After uploading your files and pressing submit on Gradescope, press either the “Group Members” or “Add Group Member” button to add your partner to the submission.

[Gradescope group submission tutorial](https://www.youtube.com/watch?v=rue7p_kATLA&t=40s)

Be sure to follow the CS100 course collaboration policy as you work on this and all CS100 assignments.

### Overview

The goal of this mini-project is for you to complete an exploratory data analysis (EDA) by applying the concepts and tools you’ve learned during the first half of the semester. We will soon be progressing to statistics and machine learning, but the first step in working with data should always be EDA, as it can offer powerful insights into your data.

### Datasets

For your final project, you will have the opportunity to work with a dataset of your choice. For this mini-project, we have provided you with a choice of three datasets. If none of them appeals to you, you are welcome to work with a dataset of your choice, even on this mini-project.

##### Option 1: College Majors

The [college majors dataset](https://cs.brown.edu/courses/cs100/homeworks/data/miniproject/college_majors.csv) contains information about U.S. college students, their majors, and their future employment. The variables are:

<!---is compiled based on U.S. census information ([American Community Survey Public Use Microdata Series](https://www.census.gov/programs-surveys/acs/microdata.html) 2010-2012), and --->

Header | Description
------------------|----------------
`Rank` | Rank by median earnings
`Major_code` | Major code, FOD1P in ACS PUMS
`Major` | Major description
`Major_category` | Category of major from Carnevale et al
`Total` | Total number of people with major
`Sample_size` | Sample size (unweighted) of full-time, year-round ONLY (used for earnings)
`Men` | Male graduates
`Women` | Female graduates
`ShareWomen` | Women as share of total
`Employed` | Number employed (ESR == 1 or 2)
`Full_time` | Employed 35 hours or more
`Part_time` | Employed less than 35 hours
`Full_time_year_round` | Employed at least 50 weeks (WKW == 1) and at least 35 hours (WKHP >= 35)
`Unemployed` | Number unemployed (ESR == 3)
`Unemployment_rate` | Unemployed / (Unemployed + Employed)
`Median` | Median earnings of full-time, year-round workers
`P25th` | 25th percentile of earnings
`P75th` | 75th percentile of earnings
`College_jobs` | Number with job requiring a college degree
`Non_college_jobs` | Number with job not requiring a college degree
`Low_wage_jobs` | Number in low-wage service jobs

These data were compiled by [FiveThirtyEight](https://fivethirtyeight.com/). More recent data may be available [here](https://github.com/fivethirtyeight/data/blob/master/college-majors/all-ages.csv).

##### Option 2: Hate Crimes

The [hate crimes dataset](https://cs.brown.edu/courses/cs100/homeworks/data/miniproject/hate_crimes.csv), from [FiveThirtyEight](https://github.com/fivethirtyeight/data/tree/master/hate-crimes), is compiled per state. It describes the prevalence of hate crimes in each state, along with other descriptive statistics about the racial makeup, economic status, education levels, etc. of the state’s population. The variables are:

Header | Definition
---|---------
`state` | State name
`median_household_income` | Median household income, 2016
`share_unemployed_seasonal` | Share of the population that is unemployed (seasonally adjusted), Sept. 2016
`share_population_in_metro_areas` | Share of the population that lives in metropolitan areas, 2015
`share_population_with_high_school_degree` | Share of adults 25 and older with a high-school degree, 2009
`share_non_citizen` | Share of the population that are not U.S. citizens, 2015
`share_white_poverty` | Share of white residents who are living in poverty, 2015
`gini_index` | Gini Index, 2015
`share_non_white` | Share of the population that is not white, 2015
`share_voters_voted_trump` | Share of 2016 U.S. presidential voters who voted for Donald Trump
`hate_crimes_per_100k_splc` | Hate crimes per 100,000 population, Southern Poverty Law Center, Nov. 9-18, 2016
`avg_hatecrimes_per_100k_fbi` | Average annual hate crimes per 100,000 population, FBI, 2010-2015

##### Option 3: Bike Sharing

The [bike sharing dataset](https://cs.brown.edu/courses/cs100/homeworks/data/miniproject/trip_data.csv) lists the details of individual bike rides in the New York City Citibike system, such as the length of the trip, the time of day, and the start and end stations.
The variables are:

<div style="width:200px">header</div> | Definition
-----------------------------------|-----------------------------
`tripduration` | length of trip (seconds)
`starttime` | start time and date
`stoptime` | end time and date
`start station id` | id of the start station
`start station name` | name of the start station
`start station latitude` | latitude of the start station
`start station longitude` | longitude of the start station
`end station id` | id of the end station
`end station name` | name of the end station
`end station latitude` | latitude of the end station
`end station longitude` | longitude of the end station
`bikeid` | the id of the bike used
`usertype` | type of rider (Customer = 24-hour pass or 3-day pass user; Subscriber = Annual Member)
`birth year` | year of birth of rider
`gender` | gender of rider (0 = unknown; 1 = male; 2 = female)

These data were already processed to remove trips taken by staff as they service and inspect the system, trips taken to or from “test” stations, and any trips shorter than 60 seconds (potentially false starts or users trying to re-dock a bike to ensure it's secure).
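
The preprocessing described above can be sanity-checked in a few lines. The sketch below uses a tiny invented data frame in place of the real file (the values are made up for illustration) and confirms that no trip is shorter than 60 seconds. It also illustrates an assumption worth knowing if you load the CSV with base R's `read.csv` and its default settings: column names that contain spaces, such as `start station id`, are rewritten with dots.

```r
# Hypothetical stand-in for a few rows of the trip data; values are invented.
trips <- data.frame(
  tripduration = c(634, 1547, 178, 95),
  usertype = c("Subscriber", "Customer", "Subscriber", "Subscriber")
)

# After preprocessing, every trip should last at least 60 seconds.
all_long_enough <- all(trips$tripduration >= 60)

# Durations are stored in seconds; minutes are often easier to read in plots.
trips$duration_min <- trips$tripduration / 60

# read.csv() applies make.names() to headers by default (check.names = TRUE),
# so a header with spaces ends up with dots instead.
cleaned_name <- make.names("start station id")

print(all_long_enough)
print(summary(trips$duration_min))
print(cleaned_name)
```

If you prefer to keep the original column names, `read.csv` accepts `check.names = FALSE`, in which case names with spaces must be quoted with backticks when you refer to them.
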
These data are compiled by [Citibike](https://ride.citibikenyc.com/system-data). More recent data may be available [here](https://s3.amazonaws.com/tripdata/index.html).

### Exploratory Data Analysis

Your goal in this mini-project is to explore one of the three datasets described above. In doing so, you can use all the R tools you have at your disposal so far to try to uncover an interesting story in the data and to create informative visualizations. You should then write a short summary of your discoveries in R Markdown, suppressing the code but displaying the relevant visualizations. Your aim in this write-up is to convince us that your conclusions are valid.

This mini-project is a pair programming project because collaboration will give you a chance to vet your ideas. Your partner can find holes in your thinking, and you can find holes in theirs. We also encourage you to visit the course staff during their office hours, so you can run your ideas by them as well for further (constructive) criticism.

### Rubric

Each group (of two students) is expected to propose at least three hypotheses based on the dataset of their choosing, which they then support or refute using descriptive statistics and visualizations.
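
One possible way to examine a single hypothesis is sketched below. It assumes the college majors column names `Major_category`, `ShareWomen`, and `Median` listed earlier, but the rows are invented for illustration, so the numbers carry no meaning; with the real data you would replace the data frame with the result of `read.csv`. The example hypothesis ("majors with a higher share of women have lower median earnings") is only a placeholder, not a claim about the actual dataset.

```r
# Invented rows that reuse the college majors column names; not real data.
majors <- data.frame(
  Major_category = c("Engineering", "Engineering", "Education",
                     "Education", "Business", "Business"),
  ShareWomen = c(0.12, 0.25, 0.80, 0.75, 0.45, 0.50),
  Median = c(75000, 60000, 34000, 36000, 45000, 47000)
)

# Descriptive statistic 1: median earnings within each major category.
by_category <- aggregate(Median ~ Major_category, data = majors, FUN = median)

# Descriptive statistic 2: correlation between share of women and earnings.
# A correlation describes association only; it does not establish causation.
r <- cor(majors$ShareWomen, majors$Median)

# Visualization: a scatter plot that helps support or refute the hypothesis.
plot(majors$ShareWomen, majors$Median,
     xlab = "Share of women", ylab = "Median earnings",
     main = "Median earnings vs. share of women (toy data)")

print(by_category)
print(round(r, 2))
```

In an R Markdown report, setting the chunk option `echo=FALSE` on a chunk like this hides the code in the knitted PDF while still showing the plot.
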