---
title: Project 1, Fall 2020
tags: Projects-F20, Project 1
---

# Project 1

## Due Date Information

**Out:** Tuesday, October 6th

**Design check sign-up / individual note-taking portion deadline:** Saturday, October 9th, 11:59pm EST (note that late days cannot be used on this assignment)

:::warning
Note on late day use (updated 10/10/2020)
* Late days cannot be used on the pre-design check submission, but for project 1 *only*, we'll be accepting late submissions after Saturday 11:59pm for half credit, up until the end of design checks.
:::

**Design check dates:** Saturday, October 10th-Wednesday, October 14th (see TA-specific appointment calendar)

**Design check group submission deadline:** two hours before your design check

**In:** Tuesday, October 20th 9pm EST

<center>
<img src="https://img.theculturetrip.com/768x432/wp-content/uploads/2019/06/img_4310-copy.jpg" width="340">
</center>

## Important Notes:
* Please be familiar with the project logistics, detailed [here.](https://hackmd.io/@cs111/projectlogistics)
* Please read this entire handout carefully.
* Make sure to copy the code we give exactly for loading the spreadsheets.
* Do NOT use lists in this project, except for `select-columns`. This function takes in a table and a list of strings (indicating column names) and outputs a table with only those columns of the input table whose names were in the list. This can be used to make cleaner, more specific tables in part 3. *It is by no means an expectation or a requirement to do this -- this entire project can be completed without lists.*

## Summary

Imagine that you are in NYC, and want to hail a taxi cab. How might things such as weather, day of the week, or time of the day influence how many other people are also trying to hail a taxi cab? In this project, we'll be using modified versions of real-life data to analyze taxi driving patterns in NYC based on certain factors: weather, day of the week, and time of day.
We'll also be thinking about the social implications of working with large sets of real-world data!

NYC publishes a lot of open data (see [**this link**](https://data.cityofnewyork.us)). You found records of every taxi ride taken in city cabs during 2015 and 2016, and want to analyze the data, with a particular look at how taxi usage varies by time of day, day of the week, and weather conditions.

Visit [**this webpage**](https://data.cityofnewyork.us/Transportation/2016-Yellow-Taxi-Trip-Data/k67s-dv2t) on the 2016 taxi data to get familiar with the columns that NYC provides in these datasets. Please note that this link might take a while to load. For better or worse, this dataset is HUGE -- it has 131 million rows and consumes more than 17GB of space (so don't download it!). The raw dataset is too big to open easily in Excel or Pyret, so you're going to work with a summarized version that we have already computed from the raw data.

:::info
Project Learning Goals
- Become familiar with manipulating large datasets & joining separate datasets based on common attributes
- Expand your knowledge of built-in table functions in Pyret to manipulate data
- Practice testing functions over tables
- Leverage plots and charts to identify and present conclusions about large datasets
:::

## Files to Submit on Gradescope

**Individual pre-design check:**
- `project-1-design-check-individual.pdf` -> one submission *per person* under Project 1 Pre-Design Check (Individual)

**Design Check:**
- `project-1-design-check.pdf` -> one submission per group under Project 1 Design Check

**Final Handin:**
- `transit-analysis.arr` -> one submission per (sub)group under Project 1
- `transit-report.pdf` -> one submission per (sub)group under Project 1
- `project-1-reflection.pdf` -> one submission *per person* under Project 1 Reflection

## The Project

For this project, these will be the three main **Analysis Questions** about the 2016 taxi data:

1. To what extent does bad weather affect how many rides people take? There are many ways to interpret bad weather and you can analyze this question through different lenses, such as rain, snow, and temperature.
2. Do the number of rides and total fares follow similar patterns for each day of the week across the year? In other words, is there a reasonably consistent pattern across all Mondays of a year? What about across Saturdays? And so on.
3. Are some days of the week more likely than others to have high numbers of rides?

**In addition**, you need to provide a way to produce a table that summarizes statistics about the numbers of rides at different times of day under different weather conditions. Specifically, given a table and a function to use to summarize the values for a particular weather condition and time of day, you will write a function `summary-table` to produce a (Pyret) table of the following form, where each cell contains some statistic about the number of rides in the given time period on a day with the given weather:

```
|            | Rain   | Snow   | Clear   |
| ---------- | ------ | ------ | ------- |
| Morning    | num    | ...    | ...     |
| Afternoon  | ...    |        |         |
| Evening    | ...    |        |         |
| Night      | ...    |        |         |
```

where `num` might be the sum of all rides on rainy-day mornings, or the daily average on rainy-day mornings, etc.

***Note:*** The data set divides time into quarters (0-6, 6-12, 12-18, and 18-24). For the purpose of this assignment, any of these quarters can represent any of the time frames (i.e. Morning, Afternoon, Evening, and Night can refer to any of the time frames). There is no direct mapping that we require, but your decision in representing these time frames should be intuitive and the ordering must remain consistent (i.e. Afternoon follows Morning, Evening follows Afternoon, etc.).
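
To make the idea of a per-cell statistic concrete, here is a minimal sketch. It assumes that the course library's `sum` and `mean` each take a table and a column name, as the `summary-table(mytable, sum)` usage in this handout suggests. The table `rainy-mornings` and its numbers are made up purely for illustration and do not come from the real data.

```
include tables
include shared-gdrive("cs111-2020.arr", "1imMXJxpNWFCUaawtzIJzPhbDuaLHtuDX")

# Hypothetical mini-table: ride counts for three rainy mornings.
# The day labels and numbers are invented for this example.
rainy-mornings =
  table: day, num-rides
    row: "day-1", 100
    row: "day-2", 250
    row: "day-3", 400
  end

# Two candidate statistics for the same "Rain / Morning" cell
# (assumes sum and mean have the shape Table, String -> Number)
total-rides = sum(rainy-mornings, "num-rides")
average-rides = mean(rainy-mornings, "num-rides")

check "two candidate statistics for one cell":
  total-rides is 750
  average-rides is 250
end
```

Either value could fill the cell; which one appears depends on the function that is passed in as the summarizing statistic.
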

As described in the [project logistics handout](https://hackmd.io/@cs111/projectlogistics) (also linked above), the project will be completed in three stages: pre-design, design, and analysis. More detailed expectations for each phase are described in separate sections below.

***Note:** We believe the hardest part of this assignment lies in figuring out what analyses you will do and in creating the tables you need for those analyses. Once you have created the tables, the remaining code should be similar to what you have written for homework and lab. Plan enough time to think out your table and analysis designs.*

### Accessing the Data

The following code will load the [summarized 2016 taxi data](https://docs.google.com/spreadsheets/d/1ZbiTAuBpy55akMtA-gWjRBBW0Jo6EP0h_mQWmLMyfkc/edit#gid=1887967044) into Pyret:

```
include tables
include shared-gdrive("cs111-2020.arr", "1imMXJxpNWFCUaawtzIJzPhbDuaLHtuDX")
include gdrive-sheets
include image
import math as M
import statistics as S
import data-source as DS
include shared-gdrive("taxi-project-support-2020.arr", "1RF7AvfRpZ6a4asxQzHeC_2a91gNrZtgF")

taxi-ssid = "1ZbiTAuBpy55akMtA-gWjRBBW0Jo6EP0h_mQWmLMyfkc"
taxi-sheet = load-spreadsheet(taxi-ssid) # load spreadsheet
taxi-data-sheet = taxi-sheet.sheet-by-name("data", true) # get data sheet
taxi-data-long =
  load-table: day, weekday, timeframe, num-rides, avg-dist, total-fare
    source: taxi-data-sheet
  end
```

**Note:** The source code file imported above, `taxi-project-support-2020.arr`, contains functions that might be useful for this project that are not in the standard CS0111 Pyret Documentation. Details can be found below under ["Additional helper functions"](#Additional-helper-functions).

For weather data, we have extracted data from LaGuardia Airport in New York City in 2016 (from [NCDC](https://en.wikipedia.org/wiki/National_Climatic_Data_Center)) and left it in a [Google Sheet](https://drive.google.com/file/d/1TZ6jslrkKnYvJZZ_N8-KniREYJ4-KrZl/view?usp=sharing).
You can access it with the following code:

```
weather-ssid = "1uiWXHjKAeZ7aUjiL6V_IFN5j9uLRHv_b1ji_Nc3IZm4"
wdata-sheet = load-spreadsheet(weather-ssid)
weather-data =
  load-table: date, weekday, awnd, prcp, snow, tavg, tmax, tmin
    source: wdata-sheet.sheet-by-name("final2", true)
  end
```

## Deadline 1: Pre-design Check

Please [read this overview of project logistics](https://hackmd.io/LmKqwGlDSDSRHlIaXSVUPA), including an example of what we are expecting for your pre-design check. The note-taking portion/submission should not take any more than 30 minutes. As described in the deadline information at the top of this handout, the deadline for both the signup and submission is **this Friday at 11:59pm.**

For your final reflection, you'll be reading a short excerpt from Chapter 8 of Cathy O'Neil's *Weapons of Math Destruction*. **Note that while nothing is due yet, you might want to keep these readings in mind as you work through this project!** Here is [the excerpt you are required to read](https://drive.google.com/file/d/1wlt6ZGs-4EnwlYqGUn0yx0EyyYeEZ_0V/view?usp=sharing), the [full chapter](https://drive.google.com/file/d/1qADN9VO4ZFq7QXyb5oDvs0Lse6202Ipq/view?usp=sharing) if you find yourself intrigued, and a [link to access the full e-book](http://josiah.brown.edu/record=b8466367~S7), available through the Brown library.

## Deadline 2: Design Check

**Setup and Handin Info**

The following portion of the project will be completed as a group. Answer the following questions with your group in a PDF file, `project-1-design-check.pdf`, and submit it on Gradescope under **Project 1 Design Check** at least 2 hours before your design check. Add all of your group members to the submission by clicking on the "Group Members" button at the bottom of the page after submitting. This document will not be primarily graded on correctness, but is a chance to show the staff that you have engaged with the questions on your own.
You can create a PDF by writing in your favorite word processor (Word, Google Docs, etc) then saving or exporting to PDF. Ask the TAs if you need help with this. Please put the emails of all group members at the top of the file.

**Questions**

1. Look at the [summarized 2016 taxi data](https://docs.google.com/spreadsheets/d/1ZbiTAuBpy55akMtA-gWjRBBW0Jo6EP0h_mQWmLMyfkc/edit#gid=1887967044). Compare the summarized data with the sample of the original table shown on [the NYC website](https://data.cityofnewyork.us/Transportation/2016-Yellow-Taxi-Trip-Data/k67s-dv2t). What operations or steps could produce the summarized data from the original? Write a bulleted list of steps (in English, not code) that explain how to produce the summarized form from the original. Make sure you have some ideas of what functions from the Pyret tables documentation you might use. (The point of this question is to show you that you know almost everything you'd need to do this conversion yourself, had the source data not been so huge -- within a couple of weeks you will know how to do all of these steps yourself.)
2. For each of the three analysis questions listed above at the beginning, describe how you plan to do the analysis. You should try to answer these questions:
    * What charts, plots and statistics do you plan to generate to answer the analysis questions? Why? What are the types and the axes of these charts, plots and statistics?
    * What table(s) will you need to generate those charts, plots and statistics?
    * If the table(s) you need have different columns or rows than those that we gave you, provide a sample of the table that you need.
    * For each of the new tables that you identified, describe how you plan to create these tables from the ones that we've given you. This can include the overall summary table produced by the summary table function.
Make sure to list all Pyret operators and functions you might use (with input/output types and a description of what they do, but without the actual code). If you don't know how to create any table, discuss it with the TA at your design check, or feel free to discuss with TAs at hours beforehand.

**Important note**: You can use any of the Pyret table, chart and plot operations as you see fit. Some that you could use (but you are not limited to, or required to use, these) are: `sort-by`, `filter-by`, `stdev`, `mean`, `sum`, `scatter-plot`, `freq-bar-chart`, `histogram`. You can read more about these in the [Tables Documentation](https://hackmd.io/@cs111/table).

:::info
***Sample Answer:*** If you were asked to analyze whether municipalities with a population (in 2000) larger than 30,000 have an increase or decrease in population, your answer to this might be:

"I'd start with a table of municipalities that have a population in 2000 of over 30,000, and then make a scatterplot of the population of those cities in 2000 and 2010. I'd add a linear regression line, then check whether there was a pattern in changes between the two population values. I'd obtain a table of municipalities with a population of greater than 30,000 in 2000 by using the `filter-by` function."
:::

3. For the `summary-table` function, you will be filling in the body of the following function (you do not have to implement it for the design check, but you do eventually have to implement it):

   ```
   fun summary-table(t :: Table, f :: (Table, String -> Number)) -> Table:
     doc: ```Produces a table that uses the given function f to summarize
          rides for each of rain/snow/clear weather during morning/
          afternoon/evening/night timeframes based on the data in the table, t.```
     ...
   end
   # the type of f is a function that takes a Table and a String and returns a Number
   # the String should correspond to the name of the column f will operate on
   ```

   Generate a general idea of how you want to implement this function for the design check. For example, this might be called `summary-table(mytable, sum)` or `summary-table(mytable, mean)` to summarize the total or average numbers of rides within the dates represented in `mytable`. You are welcome to create any other helper function to work with `summary-table` that you see fit for your analysis.
    * Provide an example of how this function `summary-table` will be used. Your answer should include an example of the input table, an input function that takes in a Table and String and returns a Number, and an output Table.
4. Given these two tables:

   **Table 1:**

   | date       | prcp |
   | ---------- | ---- |
   | 2020/10/14 | 1.0  |
   | 2020/10/15 | 1.1  |
   | 2020/10/16 | 0.0  |

   **Table 2:**

   | date       | number_of_rides |
   | ---------- | --------------- |
   | 2020/10/15 | 28591           |
   | 2020/10/14 | 2355            |
   | 2020/10/17 | 14513           |
   | 2020/10/16 | 4810            |

   Write a bulleted list of steps (in English) to combine these two tables into one that looks like the table below. If a step corresponds to a specific Pyret tables function, make sure to name the function, even if you're not completely sure how it will be used!

   | date       | prcp | number_of_rides |
   | ---------- | ---- | --------------- |
   | 2020/10/14 | 1.0  | 2355            |
   | 2020/10/15 | 1.1  | 28591           |
   | 2020/10/16 | 0.0  | 4810            |

### Requirements

Here's an overview of the design check requirements.

* Submit your work on Gradescope (Project 1 Design Check) before the Design Check. Feel free to have your design check work with you during the design check.
* We expect that all group members have participated in designing the project. The TA will aim to have all group members participate in your discussion about the work you've submitted.
Splitting the work such that each of you does only 1 of the analysis questions is likely to backfire, as you might have inconsistent tables or insufficient understanding of work done by your partner.
* Be on time to your design check. If someone has something come up, contact the TA and try to reschedule.

<!--
## Optional Second Check-in

During your design check, each (sub)group will also have the opportunity to schedule an optional personal check-in with your project TA where you can ask them any questions you have at that point, or work on a bug you might be having. These are 20-30 minute meetings with your project TA from Sept 25 - Oct 1. Meeting earlier in the time frame above allows you to get higher-level help, while later might allow you to get more focused help. Not all members of the (sub)group have to be present for the second check-in, but it is recommended to have everyone be present if possible so everyone will be on the same page.

**Note:** If you schedule a personal check-in but wish to cancel, do so at least 12 hours before the time of the check-in. We want to respect both your time and the staff's time! Failure to do so may result in point deductions on your final project grade.
-->

## Deadline 3: Analysis and Report

### Analysis

For the analysis, you will be submitting a Pyret file named `transit-analysis.arr` that contains the function `summary-table`, the tests for the function, and all the functions used to generate the report (charts, plots, and statistics).

**Note:**
1. Create at least two different example tables in your tests for the `summary-table` function.
2. Make sure to test all helper functions that you create unless they return images.
3. If you copy a table or plot into your analysis, you must tell us what it is called in your code so we can reproduce your results.
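
As an illustration of notes 1 and 2, the sketch below tests a small helper against a hand-built example table. The helper `is-rainy`, the table `example-days`, and the rule that any positive `rain` value counts as rainy are all hypothetical; your own helpers, columns, and thresholds may differ. It relies on `filter-by` and `sum` from the course library with the shapes used elsewhere in this handout.

```
include tables
include shared-gdrive("cs111-2020.arr", "1imMXJxpNWFCUaawtzIJzPhbDuaLHtuDX")

# Hypothetical example table with made-up values, small enough to check by hand
example-days =
  table: day, rain, num-rides
    row: "a", 0, 500
    row: "b", 0.3, 420
    row: "c", 1.2, 610
  end

# Hypothetical helper: assumes any positive rain value means a rainy day
fun is-rainy(r :: Row) -> Boolean:
  doc: "Returns true when the row's rain value is greater than zero"
  r["rain"] > 0
end

check "is-rainy on a small example table":
  is-rainy(example-days.row-n(0)) is false
  is-rainy(example-days.row-n(1)) is true
  filter-by(example-days, is-rainy).length() is 2
  sum(filter-by(example-days, is-rainy), "num-rides") is 1030
end
```

Because the example table has only three rows, every expected value in the `check` block can be worked out by hand before the helper is run on the full dataset.
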

:::info
***Sample Answer:** Continuing with comparing municipalities as an example, we'd expect to see something in your Pyret file like the following:*

```
# ------ Analysis for comparing municipalities' populations --- #
fun more-than-thirty-thousand(r :: Row) -> Boolean:
  ...
end

qualifying-munis = filter-by(municipalities, more-than-thirty-thousand)
munis-ex1-ex2-scatter = lr-plot(qualifying-munis, "population-2000", "population-2010")
```

*Then, your report may look like this:*

![](https://i.imgur.com/2ld32PX.png)
:::

### Guidelines on the Analysis

In order to do these analyses, you will need to get day-of-the-week information into the tables and combine data from the two tables based on common dates.

**Combining data across tables**

Both tables store data by dates, which means you should be able to combine information to create a single table. However, these two tables have different date formats (this was intentional on our part). *Handle aligning the date formats in Pyret, not in Google Sheets*. One of our goals for this project is making sure you know how to use coding to manipulate tables for combining data. Load both tables into Pyret, then figure out how to combine the information. [Pyret String documentation](https://www.pyret.org/docs/latest/strings.html) might be your friend!

<!--
:::warning
***Note**: As we saw in the lecture on errors in data tables, small errors and typos can lurk in datasets. While you might be tempted to just combine columns from the tables by relying on them having the same dates in the same order, this would not be a safe option unless you also had code to check this assumption about the dates. For now, your approach should look up each date from one table in the other. We will revisit how to write this check in lecture once we finish teaching you what we need to do that.*
:::
-->

Below, you'll find a code snippet which you can copy and paste into Pyret to see what your joined taxi and weather data *could* look like.
(Note that it does not have to look like this, but this might be helpful for reference.)

```
# this table is what your output join table should look like
# you may also use it in testing
taxi-join-ssid = "1izj3IJ3wt7W4-uV-uD8Ph3y8REMETR4ipJtfkmNegsA"
taxi-join-sheet = load-spreadsheet(taxi-join-ssid) # load spreadsheet
taxi-join-table =
  load-table: day, weekday, timeframe, num-rides, avg-dist, total-fare, rain, snow
    source: taxi-join-sheet.sheet-by-name("Sheet1", true)
    sanitize num-rides using DS.strict-num-sanitizer
    sanitize avg-dist using DS.strict-num-sanitizer
    sanitize total-fare using DS.strict-num-sanitizer
    sanitize rain using DS.strict-num-sanitizer
    sanitize snow using DS.strict-num-sanitizer
  end
```

Your goal is to make a table that looks like `taxi-join-table` by combining the weather and taxi data sets using Pyret table functions, instead of just preloading a spreadsheet that we made (which is what the code snippet above is currently doing).

Some students might find it more intuitive to try writing `summary-table` before worrying about how to join the two tables. If so, feel free to use this `taxi-join-table` to write `summary-table` initially, and then replace our code with your code that creates the joined table.

**Hint:** If you feel your code is getting too complicated to test, add helper functions! You will almost certainly have computations that get done multiple times with different data for this problem. Create and test a helper or two to keep the problem manageable. You don't need helpers for everything, though -- it is fine for you to have nested `build-column` expressions in your solution, for example.

### Report

For the report, you will be submitting a file named `transit-report.pdf`. Include in this file the copies of your charts and the written part of your analysis. Your report should address the three analysis questions outlined at the beginning of this assignment.
You should make a report of your findings in a Word or Google Document, which you can then convert to a PDF for submission.

Pyret makes it easy to make this kind of report. When you make a plot, there is an option in the top left-hand side of the window to save the chart as a `.png` file which you can then copy into your document. Additionally, whenever you output a table in the interactions window, Pyret gives you the option to copy the table. If you copy the table into some spreadsheet, it will be formatted as a table that you can then copy into Word or Google Docs.

Your report should contain any relevant plots and tables, any conclusions you have made, and your reflection on the project (see next section). We are not looking for fancy or specific formatting, but you should put some effort into making sure the report reads well (use section headings, full sentences, spell-check it, etc). There's no specified length -- just say what you need to say to present your analyses to answer the questions.

:::info
*An example of what part of your report might look like:*

![](https://i.imgur.com/2ld32PX.png)
:::

At the end of your report, also answer the following questions. Do this ++after you have finished the coding portion of the project!++

1. Describe one key insight that each partner gained about programming or data analysis from working on this project and one mistake or misconception that each partner had to work through.
2. Based on the data and analysis techniques you had, how confident are you in the quality of your results? What other information or skills could have improved the accuracy and precision of your analysis?
3. State one or two followup questions that you have about programming or data analysis after working on this project.

### Reflection

Finally, you'll each be submitting a reflection individually.

**Task:** First, read [a short excerpt from Chapter 8](https://drive.google.com/file/d/1wlt6ZGs-4EnwlYqGUn0yx0EyyYeEZ_0V/view?usp=sharing) of Cathy O'Neil's *Weapons of Math Destruction*.

* *Note:* Because this is part of a book chapter, O’Neil references some ideas from earlier in the book. The most important of these is a “WMD”, a Weapon of Math Destruction, which is her name for an exploitative algorithm or model.

*If you find yourself intrigued by this excerpt, here is the [full chapter](https://drive.google.com/file/d/1qADN9VO4ZFq7QXyb5oDvs0Lse6202Ipq/view?usp=sharing) and a [link to access the full e-book](http://josiah.brown.edu/record=b8466367~S7), available through the Brown library.*

**Task:** In your reflection, answer the following questions. Be sure to label each question with its number.

1. Imagine you are an employee at New York’s Taxi and Limousine Commission, and that you’re being asked to coordinate with cab companies to allocate taxis based on the taxi and weather data only.
    * To you, what seems like a fair way to allocate taxis? Consider factors such as location, length of ride, and frequency of pickups.
    * Now suppose your boss asks you to focus on allocating taxis in a way that will generate the most revenue (i.e. considering the highest fare and tip amounts). Name one benefit and one harm to allocating taxis in this way. *Hint: What happens if you need to get somewhere and you’re in an area where pickups don’t generate high tips?*
2. Suppose you had access to a dataset which gives latitude/longitude coordinates for each address in New York.
    * Upon joining this dataset with the existing taxi data, what more might you be able to find out? Give **one possible relationship or trend** you might be able to analyze, and **two possible ethical or privacy-related issues** you could run into by joining these datasets. Be sure to consider O'Neil's comments on zipcode data in your answer!
    * In light of these possible inferences, describe one way in which the ability to join tables affects ethical data collection or data privacy. Explain your reasoning. *Hint: consider the O'Neil reading, plus articles you’ve read on computational inference, for ideas!*

### Handin Information

:::warning
Update 10/17: fixed a typo where "project-1-reflection.pdf" said "project-1-reflection.arr". This has been changed to match the handin info at the top of the handout.
:::

For your final handin, submit `transit-analysis.arr` and `transit-report.pdf` on Gradescope under Project 1. Also submit `project-1-reflection.pdf` under Project 1 Reflection.

Nothing is required to print in the interactions window when we run `transit-analysis.arr`, but your analysis answers should include comments indicating which variable names or expressions yield the data on which you based your answers.

**Note:** For each (sub)group's handin, make sure you only have *one* handin and that all (sub)group members are added to the submission on Gradescope.

### Grading

You will be graded on Functionality, Testing, Design/Style, and Reflection for this assignment. Key metrics for each of these categories are described below.

**Functionality:**
* Does your code accurately produce the data you needed for your analyses?
* Are you able to use code to perform the table transformations required for your analyses?
* Is your `summary-table` function working?
* Have you joined the two tables together (i.e., is `summary-table` working without using our preloaded example table)?

**Testing:**
* Have you tested your functions well, particularly those that do computations more interesting than extracting cells and comparing them to other values?
* Have you shown that you understand how to set up smaller tables for testing functions before using them on large datasets?

**Design/Style:**
* Have you chosen suitable charts and statistics for your analysis?
* Have you identified appropriate table formats for your analysis tasks?
* Have you created helper functions as appropriate to enable reuse of computations?
* Have you chosen appropriate functions and operations to perform your computations?
* Have you used docstrings and comments to effectively explain your code to others?
* Have you named intermediate computations appropriately to improve readability of your code? This includes both what you named and whether the names are sufficiently descriptive to convey useful information about your computation.
* Have you followed the other guidelines of the style guide? (line length, naming convention, etc.)

**Reflection:**
* Have you answered all parts of the reflection questions?
* Have you read the assigned excerpt and answered the Socially Responsible Computing questions?
* For tips on answering SRC questions and information on how we grade, you can always reference the [STA response guide](https://hackmd.io/@cs111/sta-response-guide)!

<a name="additional"></a>

## Note on late day use

:::warning
last updated 10/10/2020
:::

If you wish to use late day(s) on the final handin, all members of a subgroup MUST have enough late days to do so. In other words, if a subgroup wants to use 2 late days on the final handin, but a member of the subgroup only has 1 left, the whole subgroup can only use 1 late day. Please consider this when forming subgroups.

## Additional helper functions

`taxi-project-support-2020.arr` contains a function that might be helpful in manipulating your data. This is not in the original CS0111 Pyret Tables Documentation, but feel free to use it if you'd like:

* `long-to-wide`: This function can be used to bring shared data across rows into columns. For example, `long-to-wide(taxi-data-long, "day", "timeframe")` will produce a table that removes the "timeframe" column (which repeats the same four time quarters for each day) and brings those quarters into columns, resulting in only one row of data per day.
(Feel free to test it out yourself to visualize what it does!)

## Campuswire and Feedback

- [Campuswire](https://campuswire.com/c/G2F5490D5/) and specifically the [FAQs + Clarification post](https://campuswire.com/c/G8DE0A2C4/feed/1118) can be your friend for this Project!
- Have feedback for the class or for this project? Submit your feedback [here](https://docs.google.com/forms/d/1v4hr5G4hZC8V74nJBHTFJEMhmqQ5Gf6f44ucuQzJEYI/edit).