---
title: Project 1 Spring-2022
tags: Projects-S22, Project 1
---

# Project 1: Totally Tubular Tables!

## Due date information

**Out:** February 28th
**Project 1 Partner Form Deadline:** March 1, 11:59PM EST
**In:** March 15th, 11:59PM EST

![](https://i.imgur.com/Edc8h2Y.png)

## Summary

It's time to try your data science skills on real datasets! For this assignment, you will choose one of three datasets to work on. You'll then apply what we've learned so far to clean up the data, analyze the data, and write a report that presents the results of your analysis. There's no difference in difficulty across the datasets -- we're merely letting you choose which dataset/question you are most interested in exploring.

The project occurs in two stages. During the first week, you'll work on the design of your tables and functions, reviewing your work with a TA during a **Design Check**. In the second week, you'll implement your design, presenting your results in a written report that includes charts and plots from the tables that you created. The report and the code you used to process the data get turned in for the **Final Handin**.

This is a pair project. You and a partner should complete all project work together. You can find your own partner or we can match you with someone. Note that you have to work with different partners on each of the three projects in the course.

## Resources

https://hackmd.io/@cs111/table

## Dataset options

Your dataset/question options are:

- **Climate Change:** Annual CO~2~ emissions data since 1960 by country, with a second table showing temperature changes per country since 1960. Your analysis will look at how temperature changes relate to total emissions in different regions of the world.
- **Bikeshare:** Bikeshare data from New York City for the month of October 2020, combined with a table of zipcodes in which bike stations are located. Your analysis will look at how far people are traveling and whether that varies by part of the city.
- **Grocery Stores:** Data on the number of grocery/convenience stores per county in the USA, with another table of county-level population data. Your analysis will look at whether some states offer better access to grocery stores than others.

Whichever dataset you choose, you will be given a collection of questions to answer in your analysis. You will also provide a function (`summary-generator`) that can be used to generate summary data about a specific aspect of your dataset. The `summary-generator` function will allow the user to customize which statistic (such as average, sum, median) gets used to generate the table data.

Detailed instructions for accessing each dataset, and the corresponding analysis and summary requirements, are in the following expansion options. The rest of the handout (after the expansion options) explains general requirements that apply regardless of which option you have chosen.

:::info
We suggest skimming the instructions for your chosen dataset, then reading the design check instructions, and then reading the instructions for your chosen dataset in detail. The overarching goal of this project is to answer the analysis questions and to write and test the `summary-generator` function, and the rest of this handout walks you through those goals.
We expect that the specific analysis questions and `summary-generator` function description will take a few read-throughs to thoroughly understand, and that you'll switch between the general instructions and the specific instructions for your dataset while making your design check plan. Do not worry if you and your partner do not immediately arrive at the list of tasks you need to do in order to complete the project -- one of the skills you are practicing is how to break a large analysis down into smaller steps. The design check with your TA is one opportunity to find out whether you are on the right track.
:::

### CO~2~ Emissions

:::spoiler Instructions

**Overview:** This dataset covers country-level carbon emissions and climate change measurements since 1960. The main table in this dataset (`co2-emissions-table`) gives the total and per-capita emissions for every country and year in the dataset. You are also given a table of all years from 1960 until the end of the dataset, a table of all regions, and a table that has a row for each country with the region it is in and its cumulative warming in degrees since 1960. Take a minute to load the tables (using the stencil code below) into Pyret and look at the contents of each table before reading the rest of this section.

**Analysis:** If you choose this dataset, your analysis should answer the following questions about the CO~2~ emissions data:

**All analysis questions should be answered with both a visualization/chart and corresponding table**

- Which region (third column in the `warming-deg-c-table`) had the highest *cumulative* CO~2~ emissions since 1960?
- Did this region always have the highest emissions in each year since 1960?
- Is there a relationship between cumulative emissions since 1960 and the increase in temperature since 1960 across countries?
- Of the top 10 countries with the highest cumulative emissions since 1960, what proportion of these countries belong to each region?

Note: Use the following function that's included in your stencil code instead of Pyret's builtin `string-to-number` function! Take a look at the `where` block to see how it is used.

```
fun string-to-num-project1(s :: String) -> Number:
  doc: "converts string to number value"
  n = string-to-number(s)
  cases (Option) n:
    | some(v) => v
    | none => raise("Non-number value passed in: " + s)
  end
where:
  string-to-num-project1("3") is 3
  string-to-num-project1("3.5") is 3.5
  string-to-num-project1("-10") is -10
  string-to-num-project1("hello") raises "Non-number value passed in: hello"
end
```

**Summary Generator:** You will also create a `summary-generator` function that can be used to generate summaries of country-level emissions and warming datasets. Imagine that there are multiple country-level emissions and warming datasets starting in 1960, and that you need to create summaries of them with different types of statistics. For example, in one case, you may get the dataset from 1960 through 2020 and need to find the **total** CO~2~ emissions over all years in the time period, over all of the countries in each region. Or in another case, you may get the dataset with a different set of years or countries, and need to find the **average** CO~2~ emissions over all years in the time period, over all of the countries in each region. This all requires building a function that is flexible in terms of which data it presents and what kind of statistics it computes.
Your `summary-generator` will take in a table that contains the following columns:

- "year": the year this row's data is from
- "country": the name of the country that this row's data is from
- "total-co2-emission-metric-ton": the country's total emissions for the given year
- "warming-deg-c-since-1960": the total warming of the country in celsius *over the entire timespan of the dataset, since 1960* (**this means that any two rows with the same "country" value but different "year" values will have the same value in this column!**)
- "region": the region of the world (Africa, Asia, Europe, Oceania, N. America, S. America, Other) the country is in

For example, the input table might look like

```
| year | country | total-co2-emission-metric-ton | warming-deg-c-since-1960 | region     |
| ---- | ------- | ----------------------------- | ------------------------ | ---------- |
| 1960 | Bolivia | emi1                          | warm1                    | S. America |
| 1961 | Bolivia | emi2                          | warm1                    | ...        |
| 1960 | Peru    | emi3                          | warm2                    | ...        |
| 1961 | Peru    | ...                           | ...                      | ...        |
| ...  | ...     | ...                           | ...                      | ...        |
...
```

The `summary-generator` will also take in a [summary function](https://hackmd.io/@cs111/table#Summarizing-Columns). Take a look at the functions (`sum`, `mean`, etc) on the summary function documentation. Each takes in a `Table` and a `String` representing a column name, and applies the relevant math over that column of the table to produce a single `Number` (for example, `mean` produces the mean of all of the values in the column).

The goal of your `summary-generator` is to figure out how to use the summary function and the input table to produce an output table with only these columns: "region", "avg-warming", and "CO2-summary":

```
| region     | avg-warming | CO2-summary |
| ---------- | ----------- | ----------- |
| Oceania    | avg-warm    | num1        |
| Africa     | ...         | ...         |
| Asia       | ...         | ...         |
| S. America | ...         | ...         |
...
```

Each row represents the statistics for a single region. The `avg-warming` column contains the average warming since 1960 across all countries in that region. **The "region" and "avg-warming" columns will be the same for a given input table, no matter what summary function you give to your `summary-generator`.**

The `CO2-summary` column summarizes some statistic about the emissions per year across countries in the region since 1960, **based on the summary function input**. The CO~2~ statistic might be the total, average, median, etc. emissions across countries in each region. For instance, if the `mean` function were passed into your `summary-generator` function, the `CO2-summary` column should contain the average CO~2~ emissions over all years of the data set, over all of the countries in that region. If the `sum` function were passed into your `summary-generator` function, the `CO2-summary` column should contain the total CO~2~ emissions over all years of the data set, over all of the countries in that region.

For the `summary-generator` function, use the following header:

```
# the summary-func takes a smaller table and a column name (the String input)
fun summary-generator(t :: Table, summary-func :: (Table, String -> Number)) -> Table:
  doc: ```Produces a table that uses the given function to summarize CO2 emissions
       for every region (Oceania/Asia/Europe/Africa/SouthAmerica/NorthAmerica/Other).
       The outputted table should also have the average warming in every region.```
  ...
end
```

This might be called as `summary-generator(mytable, sum)` or `summary-generator(mytable, mean)` to summarize the total or average CO~2~ emissions since 1960 across countries in each region as represented in `mytable`.

**Note**: `sum` and `mean` here are built-in functions (that you do not write), as described above. Passing a function as an argument is like what you have done when using table operations like `build-column`, `filter-with`, and `transform-column`.

Your `summary-generator` function **should not** reference any tables from outside the function except the provided `regions-table`. While producing your output table, you should use `regions-table` as a starting point (to build columns for the output table and to extract data from the input table `t`). Also, your output table should **not** contain any columns other than those shown in the example above: "region", "avg-warming", and "CO2-summary".

**Note:** You do not need to test `summary-generator`. However, please run `summary-generator` twice outside of the function with two different summary functions. Make sure the output makes sense! This will look something like this:

```
fun summary-generator(...):
  # your code
end

summary-generator(your-input-table, func1)
summary-generator(your-input-table, func2)
```

**Hints**:

1. You will have to construct an example input table with the column names "year," "country," "total-co2-emission-metric-ton," "warming-deg-c-since-1960," and "region" yourself. Plan out how you will do this in the design check! It will also help to manually create smaller input tables that you can use while developing the `summary-generator` (see the sketch after these hints).
2. The format of the output suggests that you will have to call the summary function once for every region to generate the specific summary value for that region's row. The summary function takes in a `Table` and a `String`. For each region, what does the input table to the summary function look like in order to get the desired output? It may help to draw out an example table for a specific region. Then, think about how you can create those `Table`s out of the input table to `summary-generator`.
3. Go back to the analysis questions. Can you use tables created by your `summary-generator` to answer some of those questions? What summary functions would you use? Understanding this question will go a long way in helping you understand the goal of the `summary-generator` function and the entire assignment.
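To make hint 1 concrete, here is a minimal, hedged sketch of a hand-built example input table and of how the summary functions behave on it. The countries and numbers below are invented purely for illustration, and the sketch assumes the stencil code (in the next expansion section) has already been run so that `mean` and `sum` are available.

```
# A hand-built example input table with the same columns as the real one;
# every value below is made up for illustration only.
tiny-co2-table = table: year, country, total-co2-emission-metric-ton, warming-deg-c-since-1960, region
  row: 1960, "Bolivia", 100, 0.5, "S. America"
  row: 1961, "Bolivia", 120, 0.5, "S. America"
  row: 1960, "Peru", 90, 0.7, "S. America"
  row: 1961, "Peru", 95, 0.7, "S. America"
  row: 1960, "Kenya", 30, 0.4, "Africa"
end

# Each summary function takes a Table and a column name and returns one Number
tiny-mean = mean(tiny-co2-table, "total-co2-emission-metric-ton") # 87 for the rows above
tiny-sum = sum(tiny-co2-table, "total-co2-emission-metric-ton")   # 435 for the rows above
```

A small table like this also makes it easier to work out by hand what your `summary-generator` should produce before you run it on the full dataset.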
:::

:::spoiler Stencil
Copy and paste the following code to load the datasets into Pyret.

```
include tables
include gdrive-sheets
include image
include shared-gdrive("dcic-2021", "1wyQZj_L0qqV9Ekgr9au6RX2iqt2Ga8Ep")
import math as M
import statistics as S
import data-source as DS

google-id = "1igZMhJUpAiKg3U6775pcWFIRyDBNmTTLwlrXoUem29M"

warming-deg-c-unsanitized-table = load-spreadsheet(google-id)
co2-emissions-unsanitized-table = load-spreadsheet(google-id)
years-1960-2014-unsanitized-table = load-spreadsheet(google-id)
regions-unsanitized-table = load-spreadsheet(google-id)

################## Importing tables ##################

warming-deg-c-table = load-table:
  country :: String,
  warming-deg-c-since-1960 :: String,
  region :: String
  source: warming-deg-c-unsanitized-table.sheet-by-name("warming-degc", true)
  sanitize country using DS.string-sanitizer
  sanitize warming-deg-c-since-1960 using DS.string-sanitizer
  sanitize region using DS.string-sanitizer
end

co2-emissions-table = load-table:
  year :: Number,
  country :: String,
  total-co2-emission-metric-ton :: Number,
  per-capita :: Number
  source: warming-deg-c-unsanitized-table.sheet-by-name(
    "fossil-fuel-co2-emissions-by-nation_csv", true)
  sanitize year using DS.strict-num-sanitizer
  sanitize country using DS.string-sanitizer
  sanitize total-co2-emission-metric-ton using DS.strict-num-sanitizer
  sanitize per-capita using DS.strict-num-sanitizer
end

years-1960-2014-table = load-table:
  year :: Number
  source: years-1960-2014-unsanitized-table.sheet-by-name("years", true)
  sanitize year using DS.strict-num-sanitizer
end

regions-table = load-table:
  region :: String
  source: regions-unsanitized-table.sheet-by-name("regions", true)
  sanitize region using DS.string-sanitizer
end

fun string-to-num-project1(s :: String) -> Number:
  doc: "converts string to number value"
  n = string-to-number(s)
  cases (Option) n:
    | some(v) => v
    | none => raise("Non-number value passed in: " + s)
  end
where:
  string-to-num-project1("3") is 3
  string-to-num-project1("3.5") is 3.5
  string-to-num-project1("-10") is -10
  string-to-num-project1("hello") raises "Non-number value passed in: hello"
end
```
:::

### Bikeshare Data

:::spoiler Instructions

**Overview:** This dataset covers customers' starting and stopping zip codes for a bikeshare service. The data was collected over October 2020. The main table in this dataset (`october-2020-citibike-table`) shows information about the start/stop bikeshare stations customers have used and trip duration, as well as the age/gender of customers. Additionally, we include a table with the zipcodes of each bikeshare station, and a table of all zipcodes. Take a minute to load the tables (using the stencil code below) into Pyret and look at the contents of each table before reading the rest of this section.

**Analysis:** If you choose this dataset, your analysis should answer the following questions about the bikeshare data:

**All analysis questions should be answered with both a visualization/chart and corresponding table** (a minimal charting sketch follows the questions below)

- Which 5 zip codes were used most often as start stations? As end stations?
- Find the top 5 starting zipcodes for customers, and the top 5 starting zipcodes for subscribers. Compare these sets of zip codes. Are there any similarities? If any zipcodes appear in both sets, are there more customers or subscribers at those zipcodes?
- Find the top 5 starting zip codes for the following age groups: 20 or under, 21-30, 31-40, and 41 or older. Are there any similarities between these sets of zip codes?
- Is there a relationship between the length of a ride and the gender of the rider?
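As a reminder of how a chart is produced from a table, here is a minimal, hedged sketch using `bar-chart`, one of the charting functions we expect to be available once the stencil code (in the next expansion section) has been run. The zipcodes and counts are invented for illustration, and are written as strings only so they can serve as bar labels.

```
# A made-up table of ride counts per zipcode, just to illustrate charting;
# in your analysis, a table like this would come from the real bikeshare data.
example-counts = table: zipcode, ride-count
  row: "10020", 412
  row: "11101", 389
  row: "10451", 175
end

# bar-chart takes a table, a labels column name, and a values column name
example-chart = bar-chart(example-counts, "zipcode", "ride-count")
```

Whatever chart type you choose for a question, keep the table behind it so you can include both in your report.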
**Summary Generator:** You must now create a `summary-generator` function that can be used to generate summaries of bikeshare datasets. Imagine that there are many bikeshare datasets, and that you need to create summaries of them with different types of statistics. For example, in one case, you may get the dataset from 2019 and need to find the **average** ride duration for each zipcode. Or in another case, you may get the dataset from 2022 and need to find the **median** ride duration for each zipcode. This all requires building a function that is flexible in terms of which data it presents and what kind of statistics it computes.

Your `summary-generator` will take in a table with the following columns:

- "start-zip": the starting zip code for a ride
- "end-zip": the ending zip code for a ride
- "age": the age of the rider
- "duration": the duration of the ride

For example, the input table might look like

```
| start-zip | end-zip | age  | duration |
| --------- | ------- | ---- | -------- |
| zip1      | zip2    | age1 | dur1     |
| zip3      | zip4    | age2 | dur2     |
| zip5      | zip6    | age3 | dur3     |
| ...       | ...     | ...  | ...      |
```

The `summary-generator` will also take in a [summary function](https://hackmd.io/@cs111/table#Summarizing-Columns). Take a look at the functions (`sum`, `mean`, etc) on the summary function documentation. Each takes in a `Table` and a `String` representing a column name, and applies the relevant math over that column of the table to produce a single `Number` (for example, `mean` produces the mean of all of the values in the column).

The goal of your `summary-generator` is to figure out how to use the summary function and the input table to produce an output table with only these columns: "zipcode", "average-age", and "duration-summary":

```
| zipcode | average-age | duration-summary |
| ------- | ----------- | ---------------- |
| 10020   | 24.7        | 853.7            |
| 11101   | ...         | ...              |
| 10451   | ...         | ...              |
| 11237   | ...         | ...              |
...
```

Each row has data for a particular zipcode. The `average-age` column contains the average age of the riders across all rides that started or ended in that zipcode. **The `zipcode` and `average-age` columns will be the same for a given input table, no matter what summary function you give to your `summary-generator`.**

The `duration-summary` column summarizes some statistic about the durations of the rides that started or ended in that zipcode, **based on the summary function input**. The `duration-summary` statistic might be the total duration, the average duration, the median, and so on. For instance, if the `mean` function were passed into your `summary-generator` function, the `duration-summary` column should contain the average duration over all rides in the data set that started or ended in that zipcode. If the `sum` function were passed into your `summary-generator` function, the `duration-summary` column should contain the total duration over all rides in the data set that started or ended in that zipcode.

For the `summary-generator` function, use the following header:

```
fun summary-generator(t :: Table, summary-func :: (Table, String -> Number)) -> Table:
  doc: ```Produces a table that uses the given function to summarize duration of
       rides across zipcodes. The outputted table should also have the average
       age of riders for each zipcode.```
  ...
end
```

This might be called as `summary-generator(mytable, sum)` or `summary-generator(mytable, mean)` to summarize the total or average duration of rides in each zipcode represented in `mytable`.

**Note**: `sum` and `mean` here are built-in functions (that you do not write), as described above.
Passing a function as an argument is like what you have done when using table operations like `build-column`, `filter-with`, and `transform-column`.

Your `summary-generator` function **should not** reference any tables from outside the function except the provided `sorted-zip-codes-table`. While producing your output table, you should use `sorted-zip-codes-table` as a starting point (building your output columns onto it). Also, your output table should not contain any columns other than those shown in the example above: "zipcode", "average-age", and "duration-summary".

**Note:** You do not need to test `summary-generator`. However, please run `summary-generator` twice outside of the function with two different summary functions. Make sure the output makes sense! This will look something like this:

```
fun summary-generator(...):
  # your code
end

summary-generator(your-input-table, func1)
summary-generator(your-input-table, func2)
```

**Hints**:

1. You will have to construct an example input table with the column names "start-zip," "end-zip," "age," and "duration" yourself. Plan out how you will do this in the design check! It will also help to manually create smaller input tables that you can use while developing the `summary-generator`.
2. The format of the output suggests that you will have to call the summary function once for every zipcode to generate the specific summary value for that zipcode's row. The summary function takes in a `Table` and a `String`. For each zipcode, what does the input table to the summary function look like in order to get the desired output? It may help to draw out an example table for a specific zipcode (see the sketch after these hints). Then, think about how you can create those `Table`s out of the input table to `summary-generator`.
3. Go back to the analysis questions. Can you use tables created by your `summary-generator` to answer some of those questions? What summary functions would you use? Understanding this question will go a long way in helping you understand the goal of the `summary-generator` function and the entire assignment.
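To make hint 2 concrete, here is a minimal, hedged sketch of the kind of per-zipcode sub-table the summary function could be handed. The rides below are invented for illustration, and the sketch assumes the stencil code (in the next expansion section) has already been run so that `filter-with` and `mean` are available.

```
# A hand-built example input table with the same columns as the real one;
# every ride below is made up for illustration only.
tiny-rides = table: start-zip, end-zip, age, duration
  row: 10020, 11101, 25, 600
  row: 11101, 10020, 31, 900
  row: 10451, 11237, 42, 300
end

# Keep only the rides that started or ended in one particular zipcode ...
rides-near-10020 =
  filter-with(tiny-rides, lam(r): (r["start-zip"] == 10020) or (r["end-zip"] == 10020) end)

# ... and hand that smaller table to a summary function
mean-duration-10020 = mean(rides-near-10020, "duration") # 750 for the rides above
```

Your `summary-generator` will need some way to build a sub-table like this for every zipcode in `sorted-zip-codes-table`; how you do that is part of your design.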
:::

:::spoiler Stencil
Copy and paste the following code to load the datasets into Pyret.

```
include tables
include gdrive-sheets
include shared-gdrive("dcic-2021", "1wyQZj_L0qqV9Ekgr9au6RX2iqt2Ga8Ep")
import math as M
import statistics as S
import data-source as DS

google-id = "1iSAp4AXNNcfdxm7cBCSPcy_KPINpgf0nPYGvxxIZXV4"

october-2020-citibike-unsanitized-table = load-spreadsheet(google-id)
stations-unsanitized-table = load-spreadsheet(google-id)
sorted-zip-codes-unsanitized-table = load-spreadsheet(google-id)

#|
Note: for the gender column
  0 represents unknown
  1 represents male
  2 represents female
|#

october-2020-citibike-table = load-table:
  trip-duration :: Number,
  start-time :: String,
  stop-time :: String,
  start-station-id :: Number,
  start-station-name :: String,
  end-station-id :: Number,
  end-station-name :: String,
  bike-id :: Number,
  user-type :: String,
  birth-year :: Number,
  gender :: Number
  source: october-2020-citibike-unsanitized-table.sheet-by-name("october-2020-citibike-sample", true)
  sanitize trip-duration using DS.strict-num-sanitizer
  sanitize start-time using DS.string-sanitizer
  sanitize stop-time using DS.string-sanitizer
  sanitize start-station-id using DS.strict-num-sanitizer
  sanitize start-station-name using DS.string-sanitizer
  sanitize end-station-id using DS.strict-num-sanitizer
  sanitize end-station-name using DS.string-sanitizer
  sanitize bike-id using DS.strict-num-sanitizer
  sanitize user-type using DS.string-sanitizer
  sanitize birth-year using DS.strict-num-sanitizer
  sanitize gender using DS.strict-num-sanitizer
end

stations-table = load-table:
  station-name :: String,
  zipcode :: Number
  source: stations-unsanitized-table.sheet-by-name("station-dataset", true)
  sanitize station-name using DS.string-sanitizer
  sanitize zipcode using DS.strict-num-sanitizer
end

sorted-zip-codes-table = load-table:
  zipcode :: Number
  source: sorted-zip-codes-unsanitized-table.sheet-by-name("zipcodes", true)
  sanitize zipcode using DS.strict-num-sanitizer
end
```
:::

### Grocery Stores

:::spoiler Instructions

**Overview:** This dataset covers the county-level quantities of grocery and convenience stores, as well as county populations. The main table (`county-store-count-table`) shows the numbers of grocery and convenience stores in each county. You are also given a table with the populations of counties, as well as a table of state abbreviations. Take a minute to load the tables (using the stencil code below) into Pyret and look at the contents of each table before reading the rest of this section.

**Analysis:** If you choose this dataset, your analysis should answer the following questions about the grocery and population data:

**All analysis questions should be answered with both a visualization/chart and corresponding table**

Note: "combined stores" here refers to the sum of grocery and convenience stores.

- Which state has the largest number of combined stores per capita?
- Do states with the largest populations also have the most combined stores?
- Is there a correlation between county populations and the number of combined stores per capita in the states to which the counties belong?
- Which 5 states have the largest ratio of convenience stores to combined stores?

**Summary Generator:** You must now create a `summary-generator` function that can be used to generate summaries of county-level store datasets. Imagine that there are many county-level store datasets in the US, and that you need to create summaries of them with different types of statistics.
For example, in one case, you may get the dataset from 1970 and need to find the **maximum** stores-per-capita value for counties in each state. Or in another case, you may get the dataset from 2021 and need to find the **average** stores-per-capita for counties in each state. This all requires building a function that is flexible in terms of which data it presents and what kind of statistics it computes.

Your `summary-generator` will take in a table that contains the following columns:

- "county": the county's name
- "state": the county's state, either as the state's full name or abbreviation
- "num-stores-county": the total number of grocery and convenience stores in the county
- "stores-per-capita-county": the total number of grocery and convenience stores in the county, divided by the county's population

For example, the input table might look like

```
| county   | state | num-stores-county | stores-per-capita-county |
| -------- | ----- | ----------------- | ------------------------ |
| county 1 | RI    | num1              | num2                     |
| county 2 | CO    | ...               | ...                      |
| county 3 | MD    | ...               | ...                      |
| county 4 | SD    | ...               | ...                      |
...
```

The `summary-generator` will also take in a [summary function](https://hackmd.io/@cs111/table#Summarizing-Columns). Take a look at the functions (`sum`, `mean`, etc) on the summary function documentation. Each takes in a `Table` and a `String` representing a column name, and applies the relevant math over that column of the table to produce a single `Number` (for example, `mean` produces the mean of all of the values in the column).

The goal of your `summary-generator` is to figure out how to use the summary function and the input table to produce an output table with only these columns: "state", "abbv", "num-stores", and "per-capita-summary":

```
| state        | abbv | num-stores | per-capita-summary |
| ------------ | ---- | ---------- | ------------------ |
| Rhode Island | RI   | 145,000    | 0.001              |
| Colorado     | CO   | ...        | ...                |
| Maryland     | MD   | ...        | ...                |
| South Dakota | SD   | ...        | ...                |
...
```

Each row is a state. The `num-stores` column contains the total number of grocery and convenience stores in that state. **The "state", "abbv" and "num-stores" columns will be the same for a given input table, no matter what summary function you give to your `summary-generator`.**

The `per-capita-summary` column summarizes some statistic about the total number of stores (grocery and convenience) *per capita* across counties in that state, **based on the summary function input**. The per-capita statistic might be the total, average, median, etc. stores per capita across counties in that state. For instance, if the `mean` function were passed into your `summary-generator` function, the `per-capita-summary` column should contain the average value of *total stores per capita* across all counties in the state for that row. If the `sum` function were passed into your `summary-generator` function, the `per-capita-summary` column should contain the sum of the *total stores per capita* across all the counties in that state.

The person who calls your `summary-generator` function will indicate which summary method to use by passing another function as input. For the `summary-generator` function, use the following header:

```
fun summary-generator(t :: Table, summary-func :: (Table, String -> Number)) -> Table:
  doc: ```Produces a table that uses the given function to summarize stores
       per capita across counties.
       The outputted table should also have the total number of grocery and
       convenience stores for every state.```
  ...
end
```

This might be called as `summary-generator(mytable, sum)` or `summary-generator(mytable, mean)` to summarize the total or average stores per capita across counties in each state as represented in `mytable`.

**Note**: `sum` and `mean` here are built-in functions (that you do not write), as described above. Passing a function as an argument is like what you have done when using table operations like `build-column`, `filter-with`, and `transform-column`.

Your `summary-generator` function **should not** reference any tables from outside the function except the provided `state-abbv-table`. While producing your output table, you should use `state-abbv-table` as a starting point (to build columns for the output table and to extract data from the input table `t`). Also, your output table should not contain any columns other than those shown in the example above: "state", "abbv", "num-stores", and "per-capita-summary".

**Note:** You do not need to test `summary-generator`. However, please run `summary-generator` twice outside of the function with two different summary functions. Make sure the output makes sense! This will look something like this:

```
fun summary-generator(...):
  # your code
end

summary-generator(your-input-table, func1)
summary-generator(your-input-table, func2)
```

**Hints**:

1. You will have to construct an example input table with the column names "county," "state," "num-stores-county," and "stores-per-capita-county" yourself. Plan out how you will do this in the design check! It will also help to manually create smaller input tables that you can use while developing the `summary-generator`.
2. The format of the output suggests that you will have to call the summary function once for every state to generate the specific summary value for that state's row. The summary function takes in a `Table` and a `String`. For each state, what does the input table to the summary function look like in order to get the desired output? It may help to draw out an example table for a specific state (see the sketch after these hints). Then, think about how you can create those `Table`s out of the input table to `summary-generator`.
3. Go back to the analysis questions. Can you use tables created by your `summary-generator` to answer some of those questions? What summary functions would you use? Understanding this question will go a long way in helping you understand the goal of the `summary-generator` function and the entire assignment.
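To make hint 2 concrete, here is a minimal, hedged sketch of a per-state sub-table, and of how the same sub-table yields different `per-capita-summary` values depending on which summary function is used. The counties and numbers below are invented for illustration, and the sketch assumes the stencil code (in the next expansion section) has already been run so that `filter-with`, `mean`, and `sum` are available.

```
# A hand-built example input table with the same columns as the real one;
# the numbers below are made up for illustration only.
tiny-stores = table: county, state, num-stores-county, stores-per-capita-county
  row: "Providence", "RI", 300, 0.0005
  row: "Kent", "RI", 80, 0.0004
  row: "Denver", "CO", 250, 0.0003
end

# Keep only the counties in one particular state ...
ri-counties = filter-with(tiny-stores, lam(r): r["state"] == "RI" end)

# ... then the same sub-table gives different summaries for different functions
ri-per-capita-mean = mean(ri-counties, "stores-per-capita-county") # 0.00045 for the rows above
ri-per-capita-sum = sum(ri-counties, "stores-per-capita-county")   # 0.0009 for the rows above
```

The "state", "abbv", and "num-stores" columns of your output, by contrast, do not depend on which summary function was passed in.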
:::

:::spoiler Stencil
Copy and paste the following code to load the dataset into Pyret:

```
include tables
include gdrive-sheets
include shared-gdrive("dcic-2021", "1wyQZj_L0qqV9Ekgr9au6RX2iqt2Ga8Ep")
import math as M
import statistics as S
import data-source as DS

google-id = "17OCB7nDBepuvxHrDKB4qMPcI0_UHbTzNwMP_2s0WkXw"

county-population-unsanitized-table = load-spreadsheet(google-id)
county-store-count-unsanitized-table = load-spreadsheet(google-id)
state-abbv-unsanitized-table = load-spreadsheet(google-id)

county-population-table = load-table:
  county :: String,
  state :: String,
  population-estimate-2016 :: Number
  source: county-population-unsanitized-table.sheet-by-name("county-population", true)
  sanitize county using DS.string-sanitizer
  sanitize state using DS.string-sanitizer
  sanitize population-estimate-2016 using DS.strict-num-sanitizer
end

county-store-count-table = load-table:
  state :: String,
  county :: String,
  num-grocery-stores :: Number,
  num-convenience-stores :: Number
  source: county-store-count-unsanitized-table.sheet-by-name("county-store-count", true)
  sanitize state using DS.string-sanitizer
  sanitize county using DS.string-sanitizer
  sanitize num-grocery-stores using DS.strict-num-sanitizer
  sanitize num-convenience-stores using DS.strict-num-sanitizer
end

state-abbv-table = load-table:
  state :: String,
  abbv :: String
  source: state-abbv-unsanitized-table.sheet-by-name("state-abbv", true)
  sanitize state using DS.string-sanitizer
  sanitize abbv using DS.string-sanitizer
end
```
:::

---

## Deadline 1: The Design Stage

The design check is a 30-minute one-on-one meeting between your team and a TA to review your project plans and to give you feedback well before the final deadline. Many students make changes to their designs following the check: doing so is common and will not cost you points.

**Task -- Find a partner and sign up for a slot**

* Fill out the Project 1 Partner Form [here](https://forms.gle/sxSZTxYF5zUDeG8S8) before **Tuesday, March 1 at 11:59PM EST**. If you don't have a partner by then, fill out the form anyway! We will pair you up randomly with another student.

Design checks are held Sunday through Tuesday (March 6th to 8th) and are mandatory for all groups. Only one of you has to fill out the Project 1 Partner Form; after one of you fills out the form, you will both receive an email from your design check TA by the following morning to schedule a meeting for your design check! If you couldn't find a partner, you will be notified of your random assignment before noon on Wednesday, **March 2nd**.

<!--
:::spoiler Inviting your partner to the Google Calendar event
{%youtube x05KHShhA-Q %}
**Note:** You will not see a name (like Eli Berkowitz on the video), but you will receive a notification by 7PM on Friday who your Design Check TA is going to be!
:::
-->

**Task -- Understand your data:** use the stencil code to load the data set into Pyret. Look at the structure and contents of each provided table. In one place, create a reference sheet that you can refer to for the rest of the project and bring to the design check. We suggest putting the following on the reference sheet:

1. The *exact* names of each of the provided tables
2. The *exact* column names of each of the provided tables
3. One (or more!) example rows from each of the provided tables
4. Some representation of the relationships/common information between the tables.
   For example, if two different tables have columns with the same information, it may help to draw an arrow or use color-coded highlighting to show that on your document.
5. Repeat steps 2 and 4 for the input and output tables of the `summary-generator` function for your dataset.

**Making a reference sheet like this will save you time as you consider the questions, make a design check plan, and start coding!** You should plan to bring your reference sheet to any office hours that you attend for this project.

**Task -- Data-cleaning plan:** Look at your datasets and identify the cleaning, normalization, and other pre-processing steps that will need to happen to prepare your datasets for use. Make a list of the steps that you'll need to perform, with a couple of notes about how you will do each step (which table operation, whether you need helper functions, etc).

**Task -- Analysis plan:** For each of the analysis questions listed for your dataset, describe how you plan to do the analysis. You should try to answer these questions:

* What charts, plots, and statistics do you plan to generate to answer the analysis questions? Why? What are the types and the axes of these charts, plots, and statistics?
* What table(s) will you need to generate those charts, plots, and statistics? Try to give exact column names -- the more specific you are, the better the feedback that the TA can give.
* Look closely at the tables that we provide you. If the table(s) you need have different columns or rows than those that we gave you, provide a sample of the table that you need.
* For each of the new tables that you identified, describe how you plan to create the table from the ones that we've given you (you will likely need to use combinations of `filter-with`, `build-column`, and `transform-column`). Make sure to list all Pyret operators and functions (with input/output types and a description of what they do, but without the actual code). If you don't know how to create a particular table, discuss it with the TA at your design check.

:::info
***Sample Answer:*** Assume you had a dataset with information on changes in city populations over time. If you were asked to analyze whether cities with population (in 2000) larger than 30,000 have an increase or decrease in population, your answer to this design-check question might be:

"I'd start with a table of cities that have a population in 2000 of over 30,000, and then make a scatterplot of the population of those cities in 2000 and 2010. I'd add a linear regression line, then check whether there was a pattern in changes between the two population values. I'd obtain a table of cities with a population greater than 30,000 in 2000 by using the `filter-with` function."
:::

**Task -- `summary-generator` function example:** Write a `check` block with two examples for your `summary-generator` function. Detailed instructions about this function are included in the Instructions section for your dataset. Also, write down some concrete ideas for how you will produce this function. This means that you'll have to create example input tables -- use your reference sheet from the first task as a starting point!
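If you have not written examples for a table-producing function before, here is a minimal, hedged sketch of the shape such a `check` block can take. It uses a toy `double-durations` function (not part of the project) so that the example is self-contained; your own `check` block should instead call `summary-generator` on a hand-built input table with a summary function, and compare the result to the full output table you expect. The sketch assumes the stencil code has been run so that `transform-column` is available.

```
# A toy function used only to illustrate the shape of a check block;
# your design check should show examples for summary-generator instead.
fun double-durations(t :: Table) -> Table:
  doc: "double every value in the duration column"
  transform-column(t, "duration", lam(d): d * 2 end)
end

check "double-durations on a small hand-built table":
  small-input = table: zipcode, duration
    row: 10020, 600
    row: 11101, 900
  end
  expected-output = table: zipcode, duration
    row: 10020, 1200
    row: 11101, 1800
  end
  double-durations(small-input) is expected-output
end
```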
### Design Check Handin

++By 11:59pm the day before++ your design check starts, submit your work for the design check as a PDF file named `project-1-design-check.pdf` to "Project 1 Design Check" on Gradescope. Please add your project partner to your submission on Gradescope as well.

You can create a PDF by writing in your favorite word processor (Word, Google Docs, etc) and then saving or exporting to PDF. Ask the TAs if you need help with this. Please put both your and your partner's login information at the top of the file.

### Design Check Logistics

* Please bring your work for the design check either on a laptop (files already open and ready to go) or as a printout. Use whichever format you will find it easier to take notes on.
* We expect that both partners have participated equally in designing the project. The TA may ask either one of you to answer questions about the work you present. Splitting the work such that each of you does 1-2 of the analysis questions is likely to backfire, as you might have inconsistent tables or insufficient understanding of the work done by your partner.
* Be on time to your design check. If one partner is sick, contact the TA and try to reschedule rather than have only one person do the design check.

### Design Check Grading

Your design check grade will be based on whether you had viable ideas for each of the questions and were able to explain them adequately to the TA (for example, we expect you to be able to describe why you picked a particular plot or table format). Your answers do not have to be perfect, but they do need to illustrate that you have thought about the questions and what will be required to answer them (functions, graphs, tables, etc.). The TA will give feedback for you to consider in your final implementation of the project.

Your design check grade will be worth roughly a third of your overall project grade. Failure to account for key design feedback in your final solution may result in a deduction on your analysis stage grade.

***Note:** We believe the hardest part of this assignment lies in figuring out what analyses you will do and in creating the tables you need for those analyses. Once you have created the tables, the remaining code should be similar to what you have written for homework and lab. Take the Design Check seriously. Plan enough time to think out your table and analysis designs.*

## Deadline 2: Perform Your Analysis

The deliverables for this stage include:

1. A Pyret file named `analysis.arr` that contains the function `summary-generator`, the tests for the function, and all the functions used to generate the report (charts, plots, and statistics).
2. A report file named `report.pdf`. Include in this file the copies of your charts and the written part of your analysis. Your report should address each of the analysis questions outlined for your chosen dataset. Your report should also contain responses to the Reflection questions described below.

**Note:** Please connect the code in your `analysis` file and the results in your `report` with specific comments and labels in each. For example:

:::info
***Sample Linking:** See the comment in the code file:*

```
# Analysis for question on cities with population over 30K
fun more-than-thirty-thousand(r :: Row) -> Boolean:
  ...
end

qualifying-munis = filter-with(municipalities, more-than-thirty-thousand)
munis-ex1-ex2-scatter = lr-plot(qualifying-munis, "population-2000", "population-2010")
```

*Then, your report might look like this:*

![](https://i.imgur.com/2ld32PX.png)
:::

#### Guidelines on the Analysis

In order to do these analyses, you will need to combine data from the multiple tables in your chosen dataset. For each dataset/problem option, the tables use slightly different formats of the information used to link data across the tables (such as different date formats). *You should handle aligning the datasets in Pyret code, not by editing the Google Sheets prior to loading them into Pyret.* Making sure you know how to use code to manage tables for combining data is one of our goals for this project. The [Pyret String documentation](https://www.pyret.org/docs/latest/strings.html) might be your friend!
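For example, if the same value is written slightly differently in two tables, a small, tested helper can normalize the text before you compare it. The sketch below is a hedged illustration using Pyret's built-in string functions; which column (if any) it would apply to depends on your dataset.

```
fun normalize-name(name :: String) -> String:
  doc: "lowercase a name so that the same value matches across tables"
  string-to-lower(name)
where:
  normalize-name("Rhode Island") is "rhode island"
  normalize-name("RHODE ISLAND") is "rhode island"
end

# One possible use, assuming some table t with a hypothetical "state" column:
# filter-with(t, lam(r): normalize-name(r["state"]) == "rhode island" end)
```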
**Hint:** If you feel your code is getting too complicated to test, add helper functions! You will almost certainly have computations that get done multiple times with different data for this problem. Create and test a helper or two to keep the problem manageable. You don't need helpers for everything, though -- for example, it is fine for you to have nested `build-column` expressions in your solution. Don't hesitate to reach out to us if you want to review your ideas for breaking down this problem.

This is where your reference sheet from the first design check task will come in handy! Feel free to add the helper function descriptions to that sheet, and visually show how the helper functions fit together and make use of all of the different tables.

### Report

Your report should contain any relevant plots (and tables, if you find them helpful as well), any conclusions you have made, and your reflection on the project (see next section). We are not looking for fancy or specific formatting, but you should put some effort into making sure the report reads well (use section headings, full sentences, spell-check it, etc). There's no specified length -- just say what you need to say to present your analyses.

**Note:** Pyret makes it easy to extract image files of plots to put into your report. When you make a plot, there is an option in the top-left corner of the window to save the chart as a `.png` file, which you can then copy into your document. Additionally, whenever you output a table in the interactions window, Pyret gives you the option to copy the table. If you copy the table into a spreadsheet, it will be formatted as a table that you can then copy into Word or Google Docs.

### Reflection

Have a section in your report document with answers to each of the following questions ++after you have finished the coding portion of the project++:

1. Describe one key insight that each partner gained about programming or data analysis from working on this project and one mistake or misconception that each partner had to work through.
2. Based on the data and analysis techniques you used, how confident are you in the quality of your results? What other information or skills could have improved the accuracy and precision of your analysis?
3. State one or two follow-up questions that you have about programming or data analysis after working on this project.

### Final Handin

For your final handin, submit one code file named `analysis.arr` containing all of your code for producing plots and tables for this project. Also submit `report.pdf`, which contains a summary of the plots, tables, and conclusions for your answers to the analysis questions. Your project reflection should also be in the report file.

Nothing is required to print in the interactions window when we run your analysis file, but your analysis answers in `report.pdf` should include comments indicating which variable names or expressions in `analysis.arr` yield the data for your answers.
### Final Grading

You will be graded on Functionality, Design, and Testing for this assignment.

**Functionality** -- key metrics:

* Does your code accurately produce the data you needed for your analyses?
* Are you able to use code to perform the table transformations required for your analyses?
* Is your `summary-generator` function working?

**Testing** -- key metrics:

* Have you tested your functions well, particularly those that do computations more interesting than extracting cells and comparing them to other values?
* Have you shown that you understand how to set up smaller tables for testing functions before using them on large datasets?

**Design** -- key metrics:

* Have you chosen suitable charts and statistics for your analysis?
* Have you identified appropriate table formats for your analysis tasks?
* Have you created helper functions as appropriate to enable reuse of computations?
* Have you chosen appropriate functions and operations to perform your computations?
* Have you used docstrings and comments to effectively explain your code to others?
* Have you named intermediate computations appropriately to improve readability of your code? This includes both what you named and whether the names are sufficiently descriptive to convey useful information about your computation.
* Have you followed the other guidelines of the style guide (line length, naming conventions, type annotations, etc.)?

You can pass the project even if you either (a) skip the `summary-generator` function or (b) have to manipulate some of the tables by hand rather than through code. A project that misses both of these baseline requirements will fail the functionality portion. A high score on functionality will require that you wrote appropriate code to perform each analysis and wrote a working `summary-generator` function. The difference between high and mid-range scores will lie in whether you chose and used appropriate functions to produce your tables and analyses.

For design, the difference between high and mid-range scores will lie in whether your computations that create additional tables are clear and well-structured, rather than appearing as though you made some messy choices just to get things to work.

----------------------

> Brown University CSCI 0111 (Spring 2022)
> Do you have feedback? Fill out this form

<iframe src="https://docs.google.com/forms/d/e/1FAIpQLSd1cjOjhS2vhnmC2ju5BuhO-vmUEU5b925EIcKfi0vBgBRjgg/viewform?embedded=true" width="640" height="375" frameborder="0" marginheight="0" marginwidth="0">Loading…</iframe>