Out: February 22nd
Design Checks: February 28th, March 1st
In: March 7th, 11:59PM EST!
It's time to try your data science skills on real datasets! For this project, we will be looking at the density of grocery stores in different counties in the USA, producing a report that answers several questions that we've provided for you.
The project occurs in two stages. During the first week, you'll work on the design of your tables and functions, reviewing your work with a TA during a Design Check. In the second week, you'll implement your design, presenting your results in a written report that includes charts and plots from the tables that you created. The report and the code you used to process the data get turned in for the Final Handin.
This is a pair project. You and a partner should complete all project work together. You can find your own partner or we can match you with someone. Note that you have to work with different partners on the first two projects in the course (the last project is a solo project).
We suggest skimming the whole handout to get an idea of what is expected, then reading the design check instructions, and then opening up the data set to explore the questions in more detail. The overarching goal of this project is to answer the analysis questions and to write and test the summary-generator function (described below), and the rest of this handout walks you through those goals. We expect that the specific analysis questions and summary-generator function description will take a few read-throughs to thoroughly understand. Do not worry if you and your partner do not immediately arrive at the list of tasks you need to do in order to complete the project – one of the skills you are practicing is how to break a large analysis down into smaller steps. The design check with your TA will be one opportunity to find out whether you are on the right track.
The main dataset (county-store-count-table) indicates how many grocery stores and convenience stores are in counties across the USA. A second table (county-population-table) captures the populations of counties. A third (state-abbv-table) matches the two-letter abbreviations with the full names of each state.
You will use these three tables to determine answers to the following questions, where "combined stores" here refers to the sum of grocery and convenience stores. "Per capita" means "per person" (e.g. the computation for "combined stores per capita" would be the total number of combined stores in a state divided by the population of that state).
Which states have the highest variability (measured using standard deviation) of stores across counties?
Do states with the largest populations also have the most combined stores?
Are counties with the largest populations in the states with the largest number of stores per capita?
Which 5 states have the largest ratio of convenience stores to combined stores?
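As a worked example of the "per capita" arithmetic described above, here is a minimal Pyret sketch. The function name per-capita and the numbers in the check block are made up for illustration; they are not part of the stencil or the datasets.

```pyret
# Hypothetical helper: stores per person for a single state.
fun per-capita(stores :: Number, population :: Number) -> Number:
  doc: "Divides a store count by a population"
  stores / population
end

check:
  # 50 combined stores serving 1000 people is 0.05 stores per person
  per-capita(50, 1000) is 0.05
  per-capita(3, 12) is 0.25
end
```

Per-capita values are usually small fractions, so expect numbers well below 1 when you apply the same idea to real state totals.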
You will produce a mix of code and charts to present your findings. You will also provide a function (summary-generator) that can be used to generate summary data about a specific aspect of your dataset. The summary-generator function will allow the user to customize which statistic (such as average, sum, median) gets used to generate the table data.
These high-level descriptions highlight the skills that you'll practice in this project:
We have done every one of these steps across lecture, homeworks 3 and 4, and labs 3 and 4. You have a lot to start from.
The stencil code will load all of the tables and set up the libraries that you need.
Copy and paste the following code to load the dataset into Pyret:
Imagine that there are many county-level store datasets in the US, and that you need to create summaries of them with different types of statistics. For example, in one case, you may get the dataset from 1970 and need to find the total store count over the counties in each state. Or in another case, you may get the dataset from 2021 and need to find the mean (average) store count over the counties in each state. This all requires building a function that is flexible in terms of which data it presents and what kind of statistics it computes.
Your summary-generator will take in a table that contains the following columns:
For example, the input table might look like
The summary-generator will also take in a summary function. Take a look at the functions (sum, mean, etc.) on the summary function documentation. Each takes in a Table and a String representing a column name, and applies the relevant math over that column of the table to produce a single Number (for example, mean produces the mean of all of the values in the column).
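To make the Table-and-String shape concrete, here is a small sketch. It assumes the stencil code has already been run, so that the course's sum and mean functions are in scope; the table name tiny-table, its column names, and its values are invented for illustration and do not come from the project datasets.

```pyret
# Assumes the project stencil is loaded, which provides sum and mean.
# tiny-table and its contents are hypothetical.
tiny-table = table: county, stores
  row: "County A", 10
  row: "County B", 20
  row: "County C", 30
end

check:
  # Each summary function takes a Table and a column name, and returns one Number
  sum(tiny-table, "stores") is 60
  mean(tiny-table, "stores") is 20
end
```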
The goal of your summary-generator is to figure out how to use the summary function and the input table to produce an output table with only these columns: "state", "abbv", "population", and "store-summary":
Each row is a state. The population column contains the total population of that state. The "state", "abbv", and "population" columns will be the same for a given input table, no matter what summary function you give to your summary-generator. The store-summary column summarizes some statistic about the total number of stores (grocery and convenience) across counties in that state, based on the summary function input. The statistic might be the total, average, median, etc. of stores across counties in that state.
For instance, if the mean function were passed into your summary-generator function, the store-summary column should contain the average value of total stores across all counties in the state for that row. If the sum function were passed into your summary-generator function, the store-summary column should contain the sum total of stores across all counties in the state for that row.
The person who calls your summary-generator function will indicate which summary method to use by passing another function as input.
For the summary-generator function, use the following header:
This might be called as summary-generator(mytable, sum) or summary-generator(mytable, mean).
Note: sum and mean here are built-in functions (that you do not write), as described above. Passing a function as an argument is like what you have done when using transform-column.
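The idea of passing a function as an argument can be shown without any tables at all. In the sketch below, the names combine, total, and larger are hypothetical and exist only for this illustration; the point is that the caller chooses the behavior by choosing which function to pass.

```pyret
fun total(a :: Number, b :: Number) -> Number:
  a + b
end

fun larger(a :: Number, b :: Number) -> Number:
  num-max(a, b)
end

# op is a function supplied by the caller; combine does not know which one it gets
fun combine(x :: Number, y :: Number, op :: (Number, Number -> Number)) -> Number:
  op(x, y)
end

check:
  combine(3, 4, total) is 7
  combine(3, 4, larger) is 4
end
```

Your summary-generator follows the same pattern: the summary function that is passed in decides which statistic ends up in the store-summary column.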
Your summary-generator function should not reference any tables from outside the function except the provided state-abbv-table. While producing your output table, you should use state-abbv-table as a starting point (to build columns for the output table and to extract data from the input table t). Also, your output table should not contain any columns other than those shown in the example above: "state", "abbv", "population", and "store-summary".
Note: You do not need to test summary-generator. However, please run summary-generator twice outside of the function with two different summary functions. Make sure the output makes sense! This will look something like this:
[...] summary-generator.
[...] Table and a String. For each state, what does the input table to the summary function look like in order to get the desired output? It may help to draw out an example table for a specific state. Then, think about how to create those Tables out of the input table to summary-generator.
[...] summary-generator to answer some of those questions? What summary functions would you use? Understanding this question will go a long way in helping you understand the goal of the summary-generator function and the entire assignment.
The design check is a 30-minute one-on-one meeting between your team and a TA to review your project plans and to give you feedback well before the final deadline. Many students make changes to their designs following the check: doing so is common and will not cost you points.
Task – Understand your data: use the stencil code to load the data set into Pyret. Look at the structure and contents of each provided table. In one place, create a reference sheet that you can refer to for the rest of the project and bring to the design check. We suggest putting the following on the reference sheet:
Making a reference sheet like this will save you time as you consider the questions, make a design check plan, and start coding!
You should plan to bring your reference sheet to any office hours that you attend for this project.
Task – Data-cleaning plan: Look at your datasets and identify the cleaning, normalization, and other pre-processing steps that will need to happen to prepare your dataset for use. Make a to-do list of the cleanup steps you will need to perform. (Hint: look for similar data in different tables that has different formats.)
We’ve asked you to clean up and pre-process data based on the formatting. Another way programmers clean data is by eliminating outliers. In the report.pdf file, answer the following questions.
You considered the impact of throwing out outliers when cleaning datasets. Now, apply this to the topic of food deserts. If you haven't yet, read this article about food deserts and answer the rest of the questions based on what you learned.
Task – Analysis plan: For each of the analysis questions listed above, describe how you plan to do the analysis. You should try to answer these questions:
[...] summary-generator, name the summary function you will use. Otherwise, write out the tasks that you will need.
Task – summary-generator function example: Write a check block with two examples for your summary-generator function. This means that you'll have to create example input tables – use your reference sheet from the first task as a starting point!
Task – partner agreement: Have in writing an agreement for how you and your partner will work on the implementation (see the "working with your partner" section at the end of the handout). You can have this in email, but you will need to show something written to your Design-check TA.
By 11:59pm the day before your design check starts, submit your work for the design check as a PDF file named project-1-design-check.pdf to "Project 1 Design Check" on Gradescope. Please add your project partner to your submission on Gradescope as well. You can create a PDF by writing in your favorite word processor (Word, Google Docs, etc.) then saving or exporting to PDF. Ask the TAs if you need help with this. Please put both your and your partner's login information at the top of the file.
Your design check grade will be based on whether you had viable ideas for each of the questions and were able to explain them adequately to the TA (for example, we expect you to be able to describe why you picked a particular plot or table format). Your answers do not have to be perfect, but they do need to illustrate that you have thought about the questions and what will be required to answer them (functions, graphs, tables, etc.). The TA will give feedback for you to consider in your final implementation of the project.
Your design check grade will be worth roughly a third of your overall project grade. Failure to account for key design feedback in your final solution may result in a deduction on your analysis stage grade.
Note: We believe the hardest part of this assignment lies in figuring out what analyses you will do and in creating the tables you need for those analyses. Once you have created the tables, the remaining code should be similar to what you have written for homework and lab. Take the Design Check seriously. Plan enough time to think out your table and analysis designs.
The deliverables for this stage include:
A code file named analysis.arr that contains the function summary-generator, the tests for the function, and all the functions used to generate the report (charts, plots, and statistics).
A report file named report.pdf. Include in this file the copies of your charts and the written part of your analysis. Your report should address each of the analysis questions outlined for the dataset. Your report should also contain responses to the Reflection questions described below.
Note: Please connect the code in your analysis file and the results in your report with specific comments and labels in each. For example:
Sample Linking: See the comment in the code file:
Then, your report might look like this:
In order to do these analyses, you will need to combine data from the multiple tables in the dataset. For each dataset/problem option, the tables use slightly different formats of the information used to link data across the tables (such as different date formats). You should handle aligning the datasets in Pyret code, not by editing the Google Sheets prior to loading them into Pyret. Making sure you know how to use coding to manage tables for combining data is one of our goals for this project. Pyret String documentation might be your friend!
Hint: If you feel your code is getting too complicated to test, add helper functions! You will almost certainly have computations that get done multiple times with different data for this problem. Create and test a helper or two to keep the problem manageable. You don't need helpers for everything, though – for example, it is fine for you to have nested build-column expressions in your solution. Don't hesitate to reach out to us if you want to review your ideas for breaking down this problem.
This is where your summary sheet from the first design step task will come in handy! Feel free to add the helper function descriptions to that sheet, and visually show how the helper functions fit together and make use of all of the different tables.
Your report should contain any relevant plots (and tables, if you find them helpful as well), any conclusions you have made, and your reflection on the project (see next section). We are not looking for fancy or specific formatting, but you should put some effort into making sure the report reads well (use section headings, full sentences, spell-check it, etc). There's no specified length – just say what you need to say to present your analyses.
Note: Pyret makes it easy to extract image files of plots to put into your report. When you make a plot, there is an option in the top left hand side of the window to save the chart as a .png file which you can then copy into your document. Additionally, whenever you output a table in the interactions window, Pyret gives you the option to copy the table. If you copy the table into some spreadsheet, it will be formatted as a table that you can then copy into Word or Google Docs.
Have a section in your report document with answers to each of the following questions after you have finished the coding portion of the project:
For your final handin, submit one code file named analysis.arr containing all of your code for producing plots and tables for this project. Also submit report.pdf, which contains a summary of the plots, tables, and conclusions for your answers to the analysis questions. Your project reflection also should be in the report file. Nothing is required to print in the interactions window when we run your analysis file, but your analysis answers in report.pdf should include comments indicating which variable names or expressions in analysis.arr yield the data for your answers.
You will be graded on Functionality, Design, and Testing for this assignment.
Functionality – Key metrics:
Is your summary-generator function working?
Testing – Key metrics:
Design – Key metrics:
You can pass the project even if you either (a) skip the summary-generator function or (b) have to manipulate some of the tables by hand rather than through code. A project that does not meet either of these baseline requirements will fail the functionality portion.
A high score on functionality will require that you wrote appropriate code to perform each analysis and wrote a working summary-generator function. The difference between high and mid-range scores will lie in whether you chose and used appropriate functions to produce your tables and analyses.
For design, the difference between high and mid-range scores will lie in whether your computations that create additional tables are clear and well-structured, rather than appearing as though you made some messy choices just to get things to work.
We expect that both partners are involved in the work of this project. Specifically, this means:
How you arrange your work is up to the two of you. As part of your design check, you will indicate how you plan to do the implementation work.
Be respectful of each other's time. If you agree to meet to work on the project, show up as scheduled. If you agreed to get certain work started prior to a meeting, come with that work started. This is basic professionalism.
What if a partner stops responding? Get in touch with your design check TA and the HTAs if your partner becomes unresponsive, whether that means they are not doing their share or they are doing the work alone and leaving you out of it. Neither is acceptable.
You will only get credit for a project that you actively participated in. At the end of the project, we will ask everyone to complete a form indicating how they and their partner split up the work. If you left your partner to do all of the implementation work, you will not get credit for that portion of the project.
Brown University CSCI 0111 (Spring 2023)