---
title: Project 1 Fall-2022
tags: Projects-F22, Project 1
---

# Project 1: Grocery Store Analysis

## Due date information

**Out:** October 5th
**In:** October 18th, 11:59PM EST

![](https://i.imgur.com/Edc8h2Y.png)

## Summary

It's time to put all of your data science skills together, while also working on a real dataset. For this project, we will be looking at the density of grocery stores in different counties in the USA, producing a report that answers several questions that we've provided for you.

The project occurs in two stages. During the first week, you'll work on the design of your tables and a plan of your functions, reviewing your work with a TA during a [**Design Check**](#Phase-1-The-Design-Check). In the second week, you'll implement your design, presenting your results in a written report that includes charts and plots from the tables that you created. The report and the code you used to process the data get turned in for the [**Final Handin**](#Final-Handin).

This is a pair project. You and a partner should complete all project work together.

## Resources

- [CSCI0111 Table Documentation](https://hackmd.io/@cs111/table)
- [Pyret String documentation](https://www.pyret.org/docs/latest/strings.html)

## Project Overview

### The Data

The main dataset (`county-store-count-table`) indicates how many grocery stores and convenience stores are in counties across the USA. A second table (`county-population-table`) captures the populations of counties. A third (`state-abbv-table`) matches the two-letter abbreviations with the full names of each state.

### Analysis Tasks

You will use these three tables to determine answers to the following questions, where "combined stores" refers to the sum of grocery and convenience stores.

- Which state has the largest number of combined stores per capita?
- Do states with the largest populations also have the most combined stores?
- Is there a correlation between county populations and the number of combined stores per capita in the states to which the counties belong?
- Which 5 states have the largest ratio of convenience stores to combined stores?

You will produce a mix of code and charts to present your findings.

### Writing a Report with Generated Tables

Part of what you will turn in for this project is a report with the results of your analysis and some tables (that you need to compute) to summarize trends in the data. These summary tables will all have the following shape:

```
| state         | abbv  | num-stores | per-capita-summary |
| ------------- | ----- | ---------- | ------------------ |
| Rhode Island  | RI    | 145,000    | 0.001              |
| Colorado      | CO    | ...        | ...                |
| Maryland      | MD    | ...        | ...                |
| South Dakota  | SD    | ...        | ...                |
| ...
```

The specific measurement in the rightmost column will differ from table to table: sometimes we will want the total population across all counties, sometimes we will want the average population across counties, and so on.

### What Skills Does this Project Practice?

These high-level descriptions highlight the skills that you'll practice in this project:

- preparing tables for use by cleaning up or detecting messy data
- combining data across tables to answer a question
- using plots and charts to display data
- creating functions to reuse common computations (for building the different summary tables)
- creating helper functions to keep computations manageable

We have done every one of these steps across lecture, homeworks 3 and 4, and labs 3 and 4. You have a lot to start from.

The stencil code (expand the following spoiler) will load all of the tables and set up the libraries that you need.
:::spoiler Stencil
Copy and paste the following code to load the dataset into Pyret:
```
include tables
include gdrive-sheets
include shared-gdrive("dcic-2021", "1wyQZj_L0qqV9Ekgr9au6RX2iqt2Ga8Ep")
import math as M
import statistics as S
import data-source as DS

google-id = "17OCB7nDBepuvxHrDKB4qMPcI0_UHbTzNwMP_2s0WkXw"

county-population-unsanitized-table = load-spreadsheet(google-id)
county-store-count-unsanitized-table = load-spreadsheet(google-id)
state-abbv-unsanitized-table = load-spreadsheet(google-id)

county-population-table = load-table:
  county :: String,
  state :: String,
  population-estimate-2016 :: Number
  source: county-population-unsanitized-table.sheet-by-name("county-population", true)
  sanitize county using DS.string-sanitizer
  sanitize state using DS.string-sanitizer
  sanitize population-estimate-2016 using DS.strict-num-sanitizer
end

county-store-count-table = load-table:
  state :: String,
  county :: String,
  num-grocery-stores :: Number,
  num-convenience-stores :: Number
  source: county-store-count-unsanitized-table.sheet-by-name("county-store-count", true)
  sanitize state using DS.string-sanitizer
  sanitize county using DS.string-sanitizer
  sanitize num-grocery-stores using DS.strict-num-sanitizer
  sanitize num-convenience-stores using DS.strict-num-sanitizer
end

state-abbv-table = load-table:
  state :: String,
  abbv :: String
  source: state-abbv-unsanitized-table.sheet-by-name("state-abbv", true)
  sanitize state using DS.string-sanitizer
  sanitize abbv using DS.string-sanitizer
end
```
:::

<br>

The rest of this handout describes what you need to do for each of the design and implementation phases.

## Phase 1: The Design Check

A design check is a 30-minute one-on-one meeting between your team and a TA to review your project plans and to give you feedback before you get too far into writing code. Most students make changes to their design after the check (that's the point). Your partner notification email will tell you how to sign up for a slot.
**Task: Identify needed data**: For each of the four analysis tasks, and for producing the summary table, write down which columns of each table you expect to need for that task.

**Task: Identify needed table cleanup**: Make a to-do list of the clean-up tasks you expect to have to do to prepare the data in the individual tables for use in the tasks. (*Hint: look for similar data in different tables that has different formats.*)

**Task: Plan your analysis approach for each of the four analysis tasks**: You may use Snap or write it out on paper (as you choose). The point is for you to be able to review this with your TA to make sure you understand the problems. We want to see which table operators you think you might need, and names of helper functions that you might need to support those operations. As part of this, indicate what kind of plot/chart, if any, you will prepare and what variables are on the axes.

#### Making sure you understand the summary table

Consider the following (small) version of a `county-store-count-table`.

```
| state | county  | num-grocery-stores | num-convenience-stores |
| ----- | ------- | ------------------ | ---------------------- |
| PA    | York    | 64                 | 132                    |
| RI    | Bristol | 7                  | 13                     |
| RI    | Kent    | 21                 | 58                     |
| ME    | Waldo   | 12                 | 29                     |
| ME    | York    | 43                 | 111                    |
```

**Task:** Draw out what the resulting summary table should look like if we are summing the total population across all counties within a state. The population data would come from the `county-population-table` in the stencil (the small table above is just telling you which states and counties get included in your sample summary table).

**Task:** Plan out a program that would compute the `per-capita-summary` value for the row for Rhode Island. *Don't think about the whole table. Just plan how to compute the value that goes in the cell for Rhode Island.* When doing so, assume you are working with a larger table of the same format, not one with only the two fixed Rhode Island rows shown above.
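
As a warm-up for this kind of planning, the sample table above can be typed directly into Pyret and explored. The following is only a sketch under stated assumptions, not the required design: the names `small-store-count-table`, `ri-rows`, and `ri-grocery-total` are hypothetical, it uses Pyret's built-in table methods rather than the course library, and it totals a store-count column rather than the population data the task asks about. The `import` line is only needed when running this outside the stencil, which already imports `math` as `M`.

```pyret
import math as M

# Hypothetical small table for experimenting; values copied from the sample above.
small-store-count-table = table: state, county, num-grocery-stores, num-convenience-stores
  row: "PA", "York", 64, 132
  row: "RI", "Bristol", 7, 13
  row: "RI", "Kent", 21, 58
  row: "ME", "Waldo", 12, 29
  row: "ME", "York", 43, 111
end

# Keep only the Rhode Island rows, using the built-in Table .filter method.
ri-rows = small-store-count-table.filter(lam(r): r["state"] == "RI" end)

# Extract one column as a list of numbers and add the values up.
ri-grocery-total = M.sum(ri-rows.get-column("num-grocery-stores"))
```

Because the filter compares against the state abbreviation rather than relying on row positions, the same approach would still apply to a larger table of the same format.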
**Task:** Have in writing an agreement for how you and your partner will work on the implementation (see the "Working with Your Partner" section at the end of the handout). You can have this in email, but you will need to show something written to your design-check TA.

### Design Check Handin

++By 11:59pm the day before++ your design check starts, submit your work for the design check as a PDF file named `project-1-design-check.pdf` to "Project 1 Design Check" on Gradescope. Please add your project partner to your submission on Gradescope as well. You can create a PDF by writing in your favorite word processor (Word, Google Docs, etc.) and then saving or exporting to PDF. Ask the TAs if you need help with this. Please put both your and your partner's login information at the top of the file.

### Design Check Logistics

* Please bring your work for the design check either on a laptop (files already open and ready to go) or as a printout. Use whichever format you will find it easier to take notes on.
* We expect that both partners have participated equally in designing the project. The TA may ask either one of you to answer questions about the work you present. Splitting the work such that each of you does 1-2 of the analysis questions is likely to backfire, as you might have inconsistent tables or insufficient understanding of work done by your partner.
* Be on time to your design check. If one partner is sick, contact the TA and try to reschedule rather than have only one person do the design check.

### Design Check Grading

Your design check grade will be based on whether you had viable ideas for each of the questions and were able to explain them adequately to the TA (for example, we expect you to be able to describe why you picked a particular plot or table format). Your answers do not have to be perfect, but they do need to illustrate that you have thought about the questions and what will be required to answer them (functions, graphs, tables, etc.).
The TA will give feedback for you to consider in your final implementation of the project. Your design check grade will be worth roughly a third of your overall project grade. Failure to account for key design feedback in your final solution may result in a deduction on your analysis stage grade.

## Phase 2: Implementation and Reporting

For this phase, you will implement your code and create the report with your analyses and the summary tables. The deliverables for this stage include:

1. A Pyret file named `analysis.arr` that contains all the functions used to generate the report (charts, plots, and statistics). This includes the function `summary-generator` (detailed below) and corresponding tests.
2. A report file named `report.pdf`. Include in this file the copies of your charts and the written part of your analysis. Your report should address each of the analysis questions outlined in the Analysis Tasks section. Your report should also contain responses to the Reflection questions described below.

You are welcome to use any combination of table and list operators that you wish in the implementation phase.

### Implementing the Summary-Table Generator

The overview showed the following example of a summary table:

```
| state         | abbv  | num-stores | per-capita-summary |
| ------------- | ----- | ---------- | ------------------ |
| Rhode Island  | RI    | 145,000    | 0.001              |
| Colorado      | CO    | ...        | ...                |
| Maryland      | MD    | ...        | ...                |
| South Dakota  | SD    | ...        | ...                |
| ...
```

**NOTE: the inputs of summary-generator have been updated (10/13) to include two tables, not just one. Hopefully this will clarify how to get tables to generate the two right columns of the summary table.**

For this part of the project, you will write a function named `summary-generator` to produce a table with this structure.
This function will (eventually) take **three** inputs: a table with the columns of `county-store-count-table`, a table with the columns of `county-population-table`, and a function to use for summarizing data from the given population table in the last column, as shown below:

```
fun summary-generator(
    t-stores :: Table,
    t-population :: Table,
    summary-func :: (Table, String -> Number)) -> Table:
  doc: ```Produces a table that uses the given function to summarize
       populations across counties. The outputted table should also have
       total number of grocery and convenience stores for every state.```
  ...
end
```

We expect this will be the most challenging task on the project, but it should go more smoothly if you do this step by step (don't start with the function header!):

1. Go back to the exercise from the design check that had you sketch out how to sum the population for Rhode Island. Get the code working to produce the value of the Rhode Island cell.
2. How would that code be different to compute the sum of the population in Maine? Turn your Rhode Island code into a function that takes the state to process as an input.
3. Use that function to produce the sum of the populations for every state.
4. For the full solution, we need to produce summary tables with different mathematical operations. Where would you have to change your current code in order to compute the `mean` population instead of the total? Try to make that edit and see if it works.
5. Draw on what we have learned about taking configuration points as inputs: how could you make your code configurable with respect to the mathematical operation to perform?

You are only turning in your final solution here, not the steps. But the steps will help you get this working bit by bit.

**Notes:**

- Your `summary-generator` function **should not** reference any tables from outside the function except the provided `state-abbv-table` and `county-population-table`.
- Your output table should not contain any columns other than those shown in the example above: "state", "abbv", "num-stores" and "per-capita-summary".

**Testing:** You do not need to test `summary-generator` itself, though you should test significant helpers that you use along the way to build it. However, please call `summary-generator` twice (outside of the function definition) with two different summary functions, and make sure the output makes sense.

### Report

Your report should contain any relevant plots (and tables, if you find them helpful as well), any conclusions you have made, and your reflection on the project (see next section). We are not looking for fancy or specific formatting, but you should put some effort into making sure the report reads well (use section headings, full sentences, spell-check it, etc.). There's no specified length -- just say what you need to say to present your analyses.

**Note:** Pyret makes it easy to extract image files of plots to put into your report. When you make a plot, there is an option in the top left-hand side of the window to save the chart as a `.png` file, which you can then copy into your document. Additionally, whenever you output a table in the interactions window, Pyret gives you the option to copy the table. If you copy the table into a spreadsheet, it will be formatted as a table that you can then copy into Word or Google Docs.

**Note:** Please connect the code in your `analysis` file and the results in your `report` with specific comments and labels in each. For example:

:::info
***Sample Linking:** See the comment in the code file:*

```
# Analysis for question on cities with population over 30K
fun more-than-thirty-thousand(r :: Row) -> Boolean:
  ...
end

qualifying-munis = filter-by(municipalities, more-than-thirty-thousand)
munis-ex1-ex2-scatter = lr-plot(qualifying-munis, "population-2000", "population-2010")
```

*Then, your report might look like this:*

![](https://i.imgur.com/2ld32PX.png)
:::

### Reflection

Have a section in your report document with answers to each of the following questions ++after you have finished the coding portion of the project++:

1. Describe one key insight that each partner gained about programming or data analysis from working on this project and one mistake or misconception that each partner had to work through.
2. Based on the data and analysis techniques you used, how confident are you in the quality of your results? What other information or skills could have improved the accuracy and precision of your analysis?
3. State one or two follow-up questions that you have about programming or data analysis after working on this project.

### Final Handin

For your final handin, submit one code file named `analysis.arr` containing all of your code for producing plots and tables for this project. Also submit `report.pdf`, which contains a summary of the plots, tables, and conclusions for your answers to the analysis questions. Your project reflection also should be in the report file.

Nothing is required to print in the interactions window when we run your analysis file, but your analysis answers in `report.pdf` should include comments indicating which variable names or expressions in `analysis.arr` yield the data for your answers.

### Final Grading

You will be graded on Functionality, Design, and Testing for this assignment.

Functionality -- Key metrics:

* Does your code accurately produce the data you needed for your analyses?
* Are you able to use code to perform the table transformations required for your analyses?
* Is your `summary-generator` function working?

Testing -- Key metrics:

* Have you tested your functions well, particularly those that do computations more interesting than extracting cells and comparing them to other values?
* Have you shown that you understand how to set up smaller tables for testing functions before using them on large datasets?

Design -- Key metrics:

* Have you chosen suitable charts and statistics for your analysis?
* Have you identified appropriate table formats for your analysis tasks?
* Have you created helper functions as appropriate to enable reuse of computations?
* Have you chosen appropriate functions and operations to perform your computations?
* Have you used docstrings and comments to effectively explain your code to others?
* Have you named intermediate computations appropriately to improve readability of your code? This includes both what you named and whether the names are sufficiently descriptive to convey useful information about your computation.
* Have you followed the other guidelines of the style guide (line length, naming convention, type annotations, etc.)?

A high score on functionality will require that you wrote appropriate code to perform each analysis and wrote a working `summary-generator` function. The difference between high and mid-range scores will lie in whether you chose and used appropriate functions to produce your tables and analyses.

For design, the difference between high and mid-range scores will lie in whether your computations that create additional tables are clear and well-structured, rather than appearing as if you made some messy choices just to get things to work.

**Minimal requirements for passing:** You can pass the project even if you either (a) skip the `summary-generator` function or (b) have to manipulate some of the tables by hand rather than through code. A project that does not meet either of these baseline requirements will fail the functionality portion.
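
To illustrate two of the metrics above (setting up smaller tables for testing, and helpers that enable reuse), here is a minimal sketch. It is not part of the stencil and not the required design: `summarize-column` and `tiny` are hypothetical names, and the helper takes a function on lists, which is a different shape from the `(Table, String -> Number)` function that `summary-generator` expects. The `import` lines are only needed when running this outside the stencil, which already imports both libraries.

```pyret
import math as M
import statistics as S

# Hypothetical helper: the operation to perform is passed in as an argument,
# so the same code can compute a total, a mean, and so on.
fun summarize-column(t :: Table, colname :: String, op :: (List<Number> -> Number)) -> Number:
  doc: "Applies op to the list of values in the named column of t"
  op(t.get-column(colname))
where:
  # A tiny hand-made table whose answers are easy to check by hand.
  tiny = table: county, population
    row: "A", 10
    row: "B", 30
  end
  summarize-column(tiny, "population", M.sum) is 40
  summarize-column(tiny, "population", S.mean) is 20
end
```

Testing on a two-row table first makes it easy to confirm the expected values by hand before running the same helper on the full dataset.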

## Working with Your Partner

We expect that both partners are involved in the work of this project. Specifically, this means:

- you do the design work together and present your ideas to your design-check TA together.
- you cooperate on the implementation. There are multiple ways to do this:
    - write the code for both parts working mostly together
    - each work on individual functions separately, but while working in proximity (same room, online together, etc.)
    - each write some parts and check in periodically to agree on the code that gets submitted

How you arrange your work is up to the two of you. As part of your design check, you will indicate how you plan to do the implementation work.

**Be respectful of each other's time**. If you agree to meet to work on the project, show up as scheduled. If you agreed to get certain work started prior to a meeting, come with that work started. This is basic professionalism.

**What if a partner stops responding?** Get in touch with your design check TA and the HTAs if your partner becomes unresponsive, whether that means they are not doing their share or they are doing the work alone and leaving you out of it. Neither is acceptable.

**You will only get credit for a project that you actively participated in**. At the end of the project, we will ask everyone to complete a form indicating how they and their partner split up the work. If you left your partner to do all of the implementation work, you will not get credit for that portion of the project.

----------------------

<!--
> Brown University CSCI 0111 (Fall 2022)
> Do you have feedback? Fill out this form
<iframe src="https://docs.google.com/forms/d/e/1FAIpQLSfargxTdgdp4RkoujUW9zhv5cFhkkDJwdE2PVXOocttoXLFXg/viewform?embedded=true" width="640" height="407" frameborder="0" marginheight="0" marginwidth="0">Loading…</iframe>
-->