Out: Tuesday, October 6th
Design check sign-up / individual note-taking portion deadline: Friday, October 9th, 11:59pm EST (note that late days cannot be used on this assignment)
Note on late day use (updated 10/10/2020)
Design check dates: Saturday, October 10th-Wednesday, October 14th (see TA-specific appointment calendar)
Design check group submission deadline: two hours before your design check
In: Tuesday, October 20th 9pm EST
`select-columns`: This function takes in a table and a list of strings (indicating column names) and outputs a table with only those columns of the input table whose names were in the list. This can be used to make cleaner, more specific tables in part 3. It is by no means an expectation or a requirement to do this – this entire project can be completed without lists.
Imagine that you are in NYC, and want to hail a taxi cab. How might things such as weather, day of the week, or time of the day influence how many other people are also trying to hail a taxi cab? In this project, we'll be using modified versions of real-life data to analyze taxi driving patterns in NYC based on certain factors: weather, day of the week, and time of day. We'll also be thinking about the social implications of working with large sets of real-world data!
NYC publishes a lot of open data (see this link). You found records of every taxi ride taken in city cabs during 2015 and 2016, and want to analyze the data, with a particular look at how taxi usage varies by time of day, day of the week, and weather conditions.
Visit this webpage on the 2016 taxi data, to get familiar with the columns that NYC provides in these datasets. Please note that this link might take a while to load.
For better or worse, this dataset is HUGE – it has 131 million rows and consumes more than 17GB of space (so don't download it!). The raw dataset is too big to open easily in Excel or Pyret, so you're going to work with a summarized version that we have already computed from the raw data.
Project Learning Goals
Individual pre-design check:
- `project-1-design-check-individual.pdf` -> one submission per person under Project 1 Pre-Design Check (Individual)
Design Check:
- `project-1-design-check.pdf` -> one submission per group under Project 1 Design Check
Final Handin:
- `transit-analysis.arr` -> one submission per (sub)group under Project 1
- `transit-report.pdf` -> one submission per (sub)group under Project 1
- `project-1-reflection.pdf` -> one submission per person under Project 1 Reflection
For this project, these will be the three main Analysis Questions about the 2016 taxi data:
To what extent does bad weather affect how many rides people take? There are many ways to interpret bad weather and you can analyze this question through different lenses, such as rain, snow, and temperature.
Do the number of rides and total fares follow similar patterns for each day of the week across the year? In other words, is there a reasonably consistent pattern across all Mondays of a year? What about across Saturdays? And so on.
Are some days of the week more likely than others to have high numbers of rides?
In addition, you need to provide a way to produce a table that summarizes statistics about the numbers of rides at different times of day under different weather conditions. Specifically, given a table and a function to use to summarize the values for a particular weather condition and time of day, you will write a function `summary-table` to produce a (Pyret) table of the following form, where each cell contains some statistic about the number of rides in the given time period on a day with the given weather:
where `num` might be the sum of all rides on rainy-day mornings, or the daily average on rainy-day mornings, etc.
Note: The data set divides time into quarters (0-6, 6-12, 12-18, and 18-24). For the purpose of this assignment, any of these quarters can represent any of the time frames (i.e. Morning, Afternoon, Evening, and Night can refer to any of the time frames). There is no direct mapping that we require, but your decision in representing these time frames should be intuitive and the ordering must remain consistent (i.e. Afternoon follows Morning, Evening follows Afternoon etc.).
As described in the project logistics handout (also linked above), the project will be completed in three stages: pre-design, design, and analysis. More detailed expectations for each phase are described in separate sections below.
Note: We believe the hardest part of this assignment lies in figuring out what analyses you will do and in creating the tables you need for those analyses. Once you have created the tables, the remaining code should be similar to what you have written for homework and lab. Plan enough time to think out your table and analysis designs.
The following code will load the summarized 2016 taxi data into Pyret:
Note that the source code file imported above, `taxi-project-support.arr`, contains functions that might be useful for this project and that are not in the standard CS0111 Pyret Documentation. Details can be found below under "Helpful Functions".
For weather data, we have extracted data from La Guardia airport in New York City in 2016 (from NCDC) and left it in a Google Sheet. You can access it with the following code:
Please read this overview of project logistics, including an example of what we are expecting for your pre-design check. The note-taking portion/submission should not take more than 30 minutes.
As described in the deadline information at the top of this handout, the deadline for both the signup and submission is this Friday at 11:59pm.
For your final reflection, you'll be reading a short excerpt from Chapter 8 of Cathy O'Neil's Weapons of Math Destruction. Note that while nothing is due yet, you might want to keep these readings in mind as you work through this project! Here is the excerpt you are required to read, the full chapter if you find yourself intrigued, and a link to access the full e-book, available through the Brown library.
Setup and Handin Info
The following portion of the project will be completed as a group.
Answer the following questions with your group in a PDF file, `project-1-design-check.pdf`, and submit it on Gradescope under Project 1 Design Check at least 2 hours before your design check. Add all of your group members to the submission by clicking on the "Group Members" button at the bottom of the page after submitting.
This document will not be primarily graded on correctness, but is a chance to show the staff that you have engaged with the questions on your own.
You can create a PDF by writing in your favorite word processor (Word, Google Docs, etc) then saving or exporting to PDF. Ask the TAs if you need help with this. Please put the emails of all group members at the top of the file.
Questions
Look at the summarized 2016 taxi data. Compare the summarized data with the sample of the original table shown on the NYC website. What operations or steps could produce the summarized data from the original? Write a bulleted list of steps (in English, not code) that explain how to produce the summarized form from the original. Make sure you have some ideas of what functions from the Pyret tables documentation you might use.
(The point of this question is to show you that you know almost everything you'd need to do this conversion yourself, had the source data not been so huge – within a couple of weeks you will know how to do all of these steps yourself.)
For each of the three analysis questions listed above at the beginning, describe how you plan to do the analysis. You should try to answer these questions:
Important note: You can use any of the Pyret table, chart, and plot operations as you see fit. Some that you could use (you are neither limited to nor required to use these) are `sort-by`, `filter-by`, `stdev`, `mean`, `sum`, `scatter-plot`, `freq-bar-chart`, and `histogram`. You can read more about these in the Tables Documentation.
Sample Answer:
If you were asked to analyze whether municipalities with a population (in 2000) larger than 30,000 have an increase or decrease in population, your answer to this might be: "I'd start with a table of municipalities that have a population in 2000 of over 30,000, and then make a scatterplot of the population of those cities in 2000 and 2010. I'd add a linear regression line, then check whether there was a pattern in changes between the two population values.
I'd obtain a table of municipalities with a population of greater than 30,000 in 2000 by using the `filter-by` function."
For the `summary-table` function, you will be filling in the body of the following function (you do not have to implement it for the design check, but you do eventually have to implement it):
Generate a general idea of how you want to implement this function for the design check. For example, this might be called `summary-table(mytable, sum)` or `summary-table(mytable, mean)` to summarize the total or average numbers of rides within the dates represented in `mytable`. You are welcome to create any other helper function to work with `summary-table` that you see fit for your analysis.
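To illustrate the kind of function that can be passed as the second argument, here is a minimal sketch of a custom summarizing function that takes a Table and a String (a column name) and returns a Number. The name `col-range`, the `sample` table, and its column names are hypothetical and chosen only for illustration; they are not part of the project data or the support code.

```pyret
import math as M

# Hypothetical sample data; the column names are made up for illustration
sample = table: day, rides
  row: "Mon", 120
  row: "Tue", 95
  row: "Wed", 143
end

# A summarizing function with the shape (Table, String -> Number)
fun col-range(t :: Table, col :: String) -> Number:
  doc: "difference between the largest and smallest values in column col"
  vals = t.get-column(col)
  M.max(vals) - M.min(vals)
where:
  col-range(sample, "rides") is 48
end
```

Any function with this shape could be handed to `summary-table` in place of `sum` or `mean`.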
Give an example of how `summary-table` will be used. Your answer should include an example of the input table, an input function that takes in a Table and String and returns a Number, and an output Table.
Given these two tables:
Table 1:
date | prcp |
---|---|
2020/10/14 | 1.0 |
2020/10/15 | 1.1 |
2020/10/16 | 0.0 |
Table 2:
date | number_of_rides |
---|---|
2020/10/15 | 28591 |
2020/10/14 | 2355 |
2020/10/17 | 14513 |
2020/10/16 | 4810 |
Write a bulleted list of steps (in English) to combine these two tables into one that looks like the table below. If a step corresponds to a specific Pyret tables function, make sure to name the function, even if you're not completely sure how it will be used!
date | prcp | number_of_rides |
---|---|---|
2020/10/14 | 1.0 | 2355 |
2020/10/15 | 1.1 | 28591 |
2020/10/16 | 0.0 | 4810 |
Here's an overview of the design check requirements.
For the analysis, you will be submitting a Pyret file named `transit-analysis.arr` that contains the function `summary-table`, the tests for the function, and all the functions used to generate the report (charts, plots, and statistics).
Note:
summary-table
function.
Sample Answer: Continuing with comparing municipalities as an example, we'd expect to see something in your Pyret file like the following:
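As a minimal sketch of what such a file might contain: the table contents and the names `municipalities`, `is-large`, and `large-municipalities` below are invented for illustration. The sketch uses Pyret's built-in `.filter` table method so that it runs without the course library; in your own file, `filter-by` would fill the same role.

```pyret
# Hypothetical data: names and numbers are invented for illustration
municipalities = table: name, population-2000, population-2010
  row: "Alphaville", 45000, 47500
  row: "Betatown", 12000, 11000
  row: "Gammaburg", 88000, 86000
end

fun is-large(r :: Row) -> Boolean:
  doc: "true when the municipality had more than 30,000 people in 2000"
  r["population-2000"] > 30000
end

# Keep only the rows for municipalities that were large in 2000
large-municipalities = municipalities.filter(is-large)

check:
  large-municipalities.get-column("name") is [list: "Alphaville", "Gammaburg"]
end
```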
Then, your report may look like this:
In order to do these analyses, you will need to get day-of-the-week information into the tables and combine data from the two tables based on common dates.
Combining data across tables
Both tables store data by dates, which means you should be able to combine information to create a single table. However, these two tables have different date formats (this was intentional on our part). Handle aligning the date formats in Pyret, not in Google Sheets. One of our goals for this project is making sure you know how to use coding to manipulate tables for combining data. Load both tables into Pyret, then figure out how to combine the information. Pyret String documentation might be your friend!
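As a small illustration of the kind of string manipulation involved, the sketch below rewrites a date from one layout into another using `string-split-all` from the Pyret String documentation. The two formats shown (MM/DD/YYYY and YYYY/MM/DD) and the function name `reorder-date` are assumptions made for this example; check the actual formats in the taxi and weather tables before adapting it.

```pyret
# Assumed formats for illustration only: input "MM/DD/YYYY", output "YYYY/MM/DD"
fun reorder-date(d :: String) -> String:
  doc: "rewrite a date written as MM/DD/YYYY in the form YYYY/MM/DD"
  parts = string-split-all(d, "/")
  # parts holds the month, day, and year, in that order
  parts.get(2) + "/" + parts.get(0) + "/" + parts.get(1)
where:
  reorder-date("10/14/2020") is "2020/10/14"
end
```

A function like this could be applied to every row of a table to build a new date column whose values match the other table's format.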
Below, you'll find a code snippet which you can copy and paste into Pyret to see what your joined taxi and weather data could look like. (Note that it does not have to look like this, but this might be helpful for reference.)
Your goal is to make a table that looks like `taxi-join-table` by combining the weather and taxi data sets using Pyret table functions, instead of just preloading a spreadsheet that we made (which is what the code snippet above is currently doing).
Some students might find it more intuitive to try writing `summary-table` before worrying about how to join the two tables. If so, feel free to use this `taxi-join-table` to write `summary-table` initially, and then replace our code with your code that creates the joined table.
Hint: If you feel your code is getting too complicated to test, add helper functions! You will almost certainly have computations that get done multiple times with different data for this problem. Create and test a helper or two to keep the problem manageable. You don't need helpers for everything, though – it is fine for you to have nested `build-column` expressions in your solution, for example.
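For instance, a small helper with its own tests might look like the sketch below. The helper name `is-rainy` and the rule that any precipitation above zero counts as rain are assumptions made for illustration; you decide what "rainy" means in your analysis. The sample rows reuse the `date` and `prcp` values from the design check tables, and the sketch uses Pyret's built-in `.build-column` table method so that it runs on its own.

```pyret
# Sample rows copied from the design check example tables
weather-sample = table: date, prcp
  row: "2020/10/14", 1.0
  row: "2020/10/15", 1.1
  row: "2020/10/16", 0.0
end

# Assumption for illustration: any precipitation above zero counts as rain
fun is-rainy(prcp :: Number) -> Boolean:
  doc: "true when the precipitation amount is greater than zero"
  prcp > 0
where:
  is-rainy(0) is false
  is-rainy(1.1) is true
end

# Use the tested helper to add a Boolean column to the table
with-rainy = weather-sample.build-column("rainy",
  lam(r): is-rainy(r["prcp"]) end)

check:
  with-rainy.get-column("rainy") is [list: true, true, false]
end
```

Because the rule lives in one tested helper, changing the definition of a rainy day later only requires editing `is-rainy`.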
For the report, you will be submitting a file named `transit-report.pdf`. Include in this file the copies of your charts and the written part of your analysis. Your report should address the three analysis questions outlined at the beginning of this assignment.
You should make a report of your findings in a Word or Google Document, which you can then convert to a PDF for submission. Pyret makes it easy to make this kind of report. When you make a plot, there is an option in the top left-hand side of the window to save the chart as a `.png` file, which you can then copy into your document.
Additionally, whenever you output a table in the interactions window, Pyret gives you the option to copy the table. If you copy the table into some spreadsheet, it will be formatted as a table that you can then copy into Word or Google Docs.
Your report should contain any relevant plots and tables, any conclusions you have made, and your reflection on the project (see next section). We are not looking for fancy or specific formatting, but you should put some effort into making sure the report reads well (use section headings, full sentences, spell-check it, etc). There's no specified length – just say what you need to say to present your analyses to answer the questions.
An example of what part of your report might look like:
At the end of your report, also answer the following questions. Do this after you have finished the coding portion of the project!
Finally, you'll each be submitting a reflection individually.
Task: First, read a short excerpt from Chapter 8 of Cathy O'Neil's Weapons of Math Destruction.
If you find yourself intrigued by this excerpt, here is the full chapter and a link to access the full e-book, available through the Brown library.
Task: In your reflection, answer the following questions. Be sure to label each question with its number.
update 10/17 : typo where "project-1-reflection.pdf" said "project-1-reflection.arr". This has been changed to match the handin info at the top of the handout.
For your final handin, submit `transit-analysis.arr` and `transit-report.pdf` on Gradescope under Project 1. Also submit `project-1-reflection.pdf` under Project 1 Reflection. Nothing is required to print in the interactions window when we run `transit-analysis.arr`, but your analysis answers should include comments indicating which variable names or expressions yield the data on which you based your answers.
Note: For each (sub)group's handin, make sure you only have one handin and that all (sub)group members are added to the submission on Gradescope.
You will be graded on Functionality, Testing, Design/Style, and Reflection for this assignment. Key metrics for each of these categories are described below.
Functionality:
- `summary-table` function working?
- `summary-table` working without using our preloaded example table)?
Testing:
Design/Style:
Reflection:
last updated 10/10/2020
If you wish to use late day(s) on the final handin, all members of a subgroup MUST have enough late days to do so. In other words, if a subgroup wants to use 2 late days on the final handin, but a member of the subgroup only has 1 left, the whole subgroup can only use 1 late day. Please consider this when forming subgroups.
`taxi-project-support.arr` contains a function that might be helpful in manipulating your data. This is not in the original CS0111 Pyret Tables Documentation, but feel free to use it if you'd like:
`long-to-wide`: Long to wide can be used to bring shared data across rows into columns. For example, `long-to-wide(taxi-data-long, "day", "timeframe")` will produce a table that deletes the "timeframe" column (which repeats the same four time quarters for each day) and brings them into columns, resulting in only one row of data per day. (Feel free to test it out yourself to visualize what it does!)