Out: October 5th
In: October 18th, 11:59PM EST
It's time to put all of your data science skills together while working on a real dataset. For this project, we will be looking at the density of grocery stores in different counties in the USA and producing a report that answers several questions that we've provided for you.
The project occurs in two stages. During the first week, you'll work on the design of your tables and a plan of your functions, reviewing your work with a TA during a Design Check. In the second week, you'll implement your design, presenting your results in a written report that includes charts and plots from the tables that you created. The report and the code you used to process the data get turned in for the Final Handin.
This is a pair project. You and a partner should complete all project work together.
The main dataset (county-store-count-table) indicates how many grocery stores and convenience stores are in counties across the USA. A second table (county-population-table) captures the populations of counties. A third (state-abbv-table) matches the two-letter abbreviations with the full names of each state.
You will use these three tables to determine answers to the following questions, where "combined stores" here refers to the sum of grocery and convenience stores.
You will produce a mix of code and charts to present your findings.
Part of what you will turn in for this project is a report with the results of your analysis and some summary tables (which you need to compute) that capture trends in the data. These summary tables will all have the following shape:
The specific measurement in the rightmost column will differ from table to table: sometimes we will want the total population across all counties, sometimes the average population across counties, and so on.
These high-level descriptions highlight the skills that you'll practice in this project:
We have done every one of these steps across lecture, homeworks 3 and 4, and labs 3 and 4. You have a lot to start from.
The stencil code will load all of the tables and set up the libraries that you need.
Copy and paste the following code to load the dataset into Pyret:
The rest of this handout describes what you need to do for each of the design and implementation phases.
A design check is a 30-minute one-on-one meeting between your team and a TA to review your project plans and to give you feedback before you get too far into writing code. Most students make changes to their design after the check (that's the point). Your partner notification email will tell you how to sign up for a slot.
Task: Identify needed data: For each of the four analysis tasks, and for producing the summary table, write down which columns of each table you expect to need for that task.
Task: Identify needed table cleanup: Make a to-do list of the clean-up tasks you expect to have to do to prepare the data in the individual tables for use in the tasks. (Hint: look for similar data in different tables that has different formats.)
Task: Plan your analysis approach: For each of the four analysis tasks, you may use Snap or write your plan out on paper (as you choose). The point is for you to be able to review this with your TA to make sure you understand the problems. We want to see which table operators you think you might need, and the names of helper functions that you might need to support those operations. As part of this, indicate what kind of plot or chart, if any, you will prepare and what variables are on the axes.
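As an illustration of the kind of clean-up the hint in the second task points at, here is a minimal sketch of normalizing one column so that two tables can be matched on it. It is written in Python rather than Pyret, and everything in it is an assumption made for the example: the lookup entries, the function name normalize_state, and the idea that one table stores "RI" while another stores "Rhode Island". Check the actual tables to see which formats really differ.

```python
# Hypothetical stand-in for a few rows of state-abbv-table:
# two-letter abbreviation -> full state name.
STATE_ABBV = {"RI": "Rhode Island", "MA": "Massachusetts"}


def normalize_state(value: str) -> str:
    """Return the full state name for an abbreviation.

    Values that are not known abbreviations (for example, names that are
    already spelled out) are returned with surrounding whitespace removed.
    """
    cleaned = value.strip()
    return STATE_ABBV.get(cleaned.upper(), cleaned)


# Both spellings end up in the same format, so rows can be compared.
print(normalize_state("RI"))
print(normalize_state("Rhode Island"))
```

A clean-up to-do list for the design check would name each column that needs this kind of treatment and the helper that would do it.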
Consider the following (small) version of a county-store-count-table.
Task: Draw out what the resulting summary table should look like if we are summing the total population across all counties within a state. The population data would come from the county-population-table in the stencil (the small table above is just telling you which states and counties get included in your sample summary table).
Task: Plan out a program that would compute the per-capita-summary value for the row for Rhode Island. Don't think about the whole table. Just plan how to compute the value that goes in the cell for Rhode Island. When doing so, assume you are working with a larger table of the same format, not one with the two fixed rows shown above.
Task: Have in writing an agreement for how you and your partner will work on the implementation (see the "working with your partner" section at the end of the handout). You can have this in email, but you will need to show something written to your Design-check TA.
By 11:59pm the day before your design check starts, submit your work for the design check as a PDF file named project-1-design-check.pdf to "Project 1 Design Check" on Gradescope. Please add your project partner to your submission on Gradescope as well. You can create a PDF by writing in your favorite word processor (Word, Google Docs, etc.) and then saving or exporting to PDF. Ask the TAs if you need help with this. Please put both your and your partner's login information at the top of the file.
Your design check grade will be based on whether you had viable ideas for each of the questions and were able to explain them adequately to the TA (for example, we expect you to be able to describe why you picked a particular plot or table format). Your answers do not have to be perfect, but they do need to illustrate that you have thought about the questions and what will be required to answer them (functions, graphs, tables, etc.). The TA will give feedback for you to consider in your final implementation of the project.
Your design check grade will be worth roughly a third of your overall project grade. Failure to account for key design feedback in your final solution may result in a deduction on your analysis stage grade.
For this phase, you will implement your code and create the report with your analyses and the summary tables. The deliverables for this stage include:
A code file named analysis.arr that contains all the functions used to generate the report (charts, plots, and statistics). This includes the function summary-generator (detailed below) and corresponding tests.
A report file named report.pdf. Include in this file the copies of your charts and the written part of your analysis. Your report should address each of the analysis questions outlined for your chosen dataset. Your report should also contain responses to the Reflection questions described below.
You are welcome to use any combination of table and list operators that you wish in the implementation phase.
The overview showed the following example of a summary table:
NOTE: the inputs of summary-generator have been updated (10/13) to include two tables, not just one. Hopefully this will clarify how to get tables to generate the two right columns of the summary table.
For this part of the project, you will write a function named summary-generator to produce a table with this structure. This function will (eventually) take three inputs: a table with the columns of county-store-count-table, a table with the columns of county-population-table, and a function to use for summarizing data from the given population-table in the last column, as shown below:
We expect this will be the most challenging task on the project, but it should go more smoothly if you do this step by step (don't start with the function header!):
What if you wanted the mean population instead of the total? Try to make that edit and see if it works.
You are only turning in your final solution here, not the steps. But the steps will help you get this working bit by bit.
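The idea of passing the summarizing function as an input, so that switching from total to mean requires no change inside the function, can be illustrated with a minimal sketch. It is written in Python rather than Pyret and uses lists of dictionaries in place of tables. All names, field labels, and numbers below are made up for the illustration, and the output has only a state and one summary value per row; it is not the stencil's signature, the required table shape, or the expected solution.

```python
from statistics import mean


def summary_generator(store_rows, population_rows, summarize):
    """Group county populations by state and apply `summarize` to each group.

    Only (state, county) pairs that appear in store_rows are included.
    """
    wanted = {(row["state"], row["county"]) for row in store_rows}
    populations_by_state = {}
    for row in population_rows:
        if (row["state"], row["county"]) in wanted:
            populations_by_state.setdefault(row["state"], []).append(row["population"])
    return [
        {"state": state, "summary": summarize(populations)}
        for state, populations in sorted(populations_by_state.items())
    ]


# Invented sample data, not taken from the real tables.
stores = [
    {"state": "Rhode Island", "county": "Kent", "stores": 10},
    {"state": "Rhode Island", "county": "Bristol", "stores": 4},
    {"state": "Massachusetts", "county": "Suffolk", "stores": 30},
]
populations = [
    {"state": "Rhode Island", "county": "Kent", "population": 100},
    {"state": "Rhode Island", "county": "Bristol", "population": 50},
    {"state": "Massachusetts", "county": "Suffolk", "population": 300},
    {"state": "Massachusetts", "county": "Essex", "population": 200},
]

totals = summary_generator(stores, populations, sum)
averages = summary_generator(stores, populations, mean)
print(totals)
print(averages)
```

The two calls differ only in their last argument, which is the point of making the summary a parameter.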
Notes:
Your summary-generator function should not reference any tables from outside the function except the provided state-abbv-table and county-population-table.
Testing: You do not need to test summary-generator itself, though you should test significant helpers that you use along the way to build it. However, please run summary-generator twice outside of the function with two different summary functions. Make sure the output makes sense.
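To give a sense of what testing a significant helper might look like, here is a small sketch in Python rather than Pyret; in Pyret the same checks would go in a where: block or a check block. The helper populations_for_state and the sample rows are hypothetical and exist only for this example.

```python
def populations_for_state(population_rows, state):
    """Hypothetical helper: list the county populations recorded for one state."""
    return [row["population"] for row in population_rows if row["state"] == state]


# Invented sample rows, small enough to check by hand.
sample = [
    {"state": "Rhode Island", "county": "Kent", "population": 100},
    {"state": "Massachusetts", "county": "Suffolk", "population": 300},
    {"state": "Rhode Island", "county": "Bristol", "population": 50},
]

# Cover a state with several counties, a state with one, and an absent state.
assert populations_for_state(sample, "Rhode Island") == [100, 50]
assert populations_for_state(sample, "Massachusetts") == [300]
assert populations_for_state(sample, "Vermont") == []
```

Small hand-built tables like this make it easy to tell whether an unexpected result comes from the helper or from the data.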
Your report should contain any relevant plots (and tables, if you find them helpful as well), any conclusions you have made, and your reflection on the project (see next section). We are not looking for fancy or specific formatting, but you should put some effort into making sure the report reads well (use section headings, full sentences, spell-check it, etc). There's no specified length – just say what you need to say to present your analyses.
Note: Pyret makes it easy to extract image files of plots to put into your report. When you make a plot, there is an option in the top left-hand side of the window to save the chart as a .png file, which you can then copy into your document. Additionally, whenever you output a table in the interactions window, Pyret gives you the option to copy the table. If you copy the table into a spreadsheet, it will be formatted as a table that you can then copy into Word or Google Docs.
Note: Please connect the code in your analysis file and the results in your report with specific comments and labels in each. For example:
Sample Linking: See the comment in the code file:
Then, your report might look like this:
Have a section in your report document with answers to each of the following questions after you have finished the coding portion of the project:
For your final handin, submit one code file named analysis.arr containing all of your code for producing plots and tables for this project. Also submit report.pdf, which contains a summary of the plots, tables, and conclusions for your answers to the analysis questions. Your project reflection also should be in the report file. Nothing is required to print in the interactions window when we run your analysis file, but your analysis answers in report.pdf should include comments indicating which variable names or expressions in analysis.arr yield the data for your answers.
You will be graded on Functionality, Design, and Testing for this assignment.
Functionality – Key metrics:
Is your summary-generator function working?
Testing – Key metrics:
Design – Key metrics:
A high score on functionality will require that you wrote appropriate code to perform each analysis and wrote a working summary-generator function. The difference between high and mid-range scores will lie in whether you chose and used appropriate functions to produce your tables and analyses.
For design, the difference between high and mid-range scores will lie in whether the computations that create your additional tables are clear and well-structured, rather than looking as though you made some messy choices just to get things to work.
Minimal requirements for passing: You can pass the project even if you either (a) skip the summary-generator function or (b) have to manipulate some of the tables by hand rather than through code. A project that does not meet either of these baseline requirements will fail the functionality portion.
We expect that both partners are involved in the work of this project. Specifically, this means:
How you arrange your work is up to the two of you. As part of your design check, you will indicate how you plan to do the implementation work.
Be respectful of each other's time. If you agree to meet to work on the project, show up as scheduled. If you agreed to get certain work started prior to a meeting, come with that work started. This is basic professionalism.
What if a partner stops responding? Get in touch with your design check TA and the HTAs if your partner becomes unresponsive, whether that means they are not doing their share or they are doing the work alone and leaving you out of it. Neither is acceptable.
You will only get credit for a project that you actively participated in. At the end of the project, we will ask everyone to complete a form indicating how they and their partner split up the work. If you left your partner to do all of the implementation work, you will not get credit for that portion of the project.