Homework 4: Data Drama

Due: Tuesday, October 5, 2021 at 9:00PM ET.

Setup and Handin

Setup

  • Task: Create a file in Pyret called hw4-code.arr. You will write your solution in here.

  • Task: In the same way as in Homework 3, there's a file on Canvas called hw4-helper.arr. Copy and paste everything from this file into your new hw4-code.arr file. Do not look at hw4-helper.arr before you turn in Homework 3. We can tell when you look at something, and we'll be checking to see who peeks.

    How do I open hw4-helper.arr?

    Download this file from Canvas. You can open it in a plain-text editor (e.g. TextEdit, WordPad) to copy and paste into your hw4-code.arr file. If you want to see it in Pyret, navigate to code.pyret.org and hit "View in Google Drive". Once You're in the Google Drive folder, drag and drop the file in, right click on it > Open with > Pyret.

  • Task: Create a file in your preferred word processor (Google Docs, Word, TextEdit, WordPad, etc) called hw4-written.

  • You will be graded on the quality of your examples for each function.

    What does this mean?
    • High-quality examples cover the simple/basic cases for each parameter, values at/around any boundaries mentioned in the problem statement, common cases, and potentially subtle cases (e.g., strings with two uses of a character where a function is looking for one use).
    • You do not need to include examples for erroneous inputs (such as negative numbers if only positives make sense for the problem).
  • Do not put your name anywhere in the file.

Handin

  • Hit "Run", then paste this check block into the interactions window. All the tests should pass: if they don't, this means you have a typo in a function header.
    ​​check:
    ​​  t1 = table: id :: Number, age :: Number, town :: String, hobby :: String, has-car :: Boolean end
    ​​  t2 = table: part-name :: String, material :: String row: "", "" end
    ​​  t3 = table: material :: String, price :: Number row: "", 0 end
    ​​  t4 = table: name :: String, gender :: String, occupation :: String, num-visits :: Number, fav-trip :: String end
    ​​  exists = lam(f): true end
    ​​  customers-to-show-ad-to(CUSTOMERS, AD-INTERACTION, 0) is t1
    ​​  customers-in-kingman(t1) is t1
    ​​  exists(show-age-distrib) is true
    ​​  exists(show-car-proportions) is true
    ​​  part-cost("", t2, t3) is 0
    ​​  exists(clean-spork)
    ​​end
    
  • Download your code file and make sure is is called hw4-code.arr. Download or save your written file and make sure it is a PDF called hw4-written.pdf. If your written file is not a PDF, we might not be able to grade it. Hand in your work on Gradescope.

Remember your resources!

The Assignment

Part 1: Even more ads!

Learning Goals:

  • Practice writing functions that take tables as inputs
  • Practice testing tables
  • Work with library functions that create plots and charts of tabular data

Part 1A: Multiple ads

Remember to use the CS0111 Table documentation (linked in Setup and Handin), not the official documentation. If your code uses constructs named sieve or extend (without functions), for example, you are using the wrong version.

In the last homework, we handled a growing collection of data by using a Row for a customer's data. This week, we want to handle a larger number of customers themselves, by having a multi-row Table of customers as the input.

Since there will be many customers, we'll give each customer an ID number that we can use for reference. ID numbers are unique, but besides that, they can be any number.

The code in hw4-helper.arr contains some Tables and functions for various parts of the homework. Copy and paste the code from the helper file into your hw4-code.arr file. In this part we'll use CUSTOMERS and AD-INTERACTION. The second Table is new. You can see what kind of data it contains using the interactions window.

Note: Please don't edit CUSTOMERS; if you want other example Tables which are similar, please define a new Table.

First, the Oatman Ad Agency also wants to have a way to find customers in any given Table who live in "kingman".

Task: Write a function called customers-in-kingman, which takes in a Table with at least column names id :: Number and town :: String (like CUSTOMERS) and returns a Table with the same column names and only the Rows of customers whose town is "kingman" (case sensitive).

Now, we want to identify the customers to show our ad to.

Task: Write a function called customers-to-show-ad-to, which takes in two Tables and a Number. The first Table has the column names id :: Number, age :: Number, town :: String, hobby :: String, has-car :: Boolean (like CUSTOMERS). The second Table has the column names customer-id :: String, ad-id :: String, interacted :: Boolean (like AD-INTERACTION), and describes if each customer interacted with each ad. The Number is the ad id (either 1, 2, or 3 for the given AD-INTERACTION Table).

This function outputs a new Table containing the rows of customers in the CUSTOMERS-like table who interacted with the given ad-id (with the columns id :: Number, age :: Number, town :: String, hobby :: String, has-car :: Boolean).

In order to write examples for customers-to-show-ad-to, feel free to define new Tables explicitly. Your new tables need to have the same column names as CUSTOMERS and AD-INTERACTION.

Hint!

filter-with is going to be a very useful built-in function for this problem. As you can see in the documentation, filter-with takes in a Table and a function with the type (Row -> Boolean).

Also, you don't need lists to complete this problem (we haven't introduced them in class yet, so you aren't expected to know them!) Instead, remember that filter-with iterates over the rows in a table. If you want to iterate over, say, a second table at the same time, you may want to consider using a second filter-with inside of the original

You can define a function inside of another function like so:

fun outer-table-helper(r :: Row) -> Boolean:
​​  doc: "your docstring"
​​  
​​  fun inner-table-helper(r2 :: Row) -> Boolean:
​​    doc: "another docstring"
​​    # can access both r and r2 in here! 
​​  end
​​  
​​    # what should happen for every iteration 
​​    through the rows of the outer table?
​​end
​​
​​filter-with(my-table, outer-table-helper)

Stencil

Here is a stencil that you can use to create your customers-to-show-ad-to function. It gives you the general structure, and where to nest functions, but you still need to fill in the parts where you see TODO!

fun customers-to-show-ad-to(
    customer-like-table :: Table, 
    interaction-like-table :: Table, 
    ad-id :: Number) -> Table:
  doc: ```Takes in a Table of potential customers and a Table of which customer 
  interacted with which ad and returns the first Table, filtered by customers 
  who interacted with the ad indicated by ad-id```

  fun trim-interactions(r :: Row) -> Boolean:
    # TODO: Return true when the row's ad-id matches the ad-id we're 
    # considering and customers have interacted with ad
    # Your code here!
  end

  # TODO: Create a new table called current-ad-data containing only rows from
  # interaction-like-table where ad-id's match up and interacted is true
  # Your code here!

  fun matching-customers(customer-row :: Row) -> Boolean:
    fun interactions-matching-customer(interaction-row :: Row) -> Boolean:
      # TODO: Return true if customer-id in interaction-row matches id in customer-row
      # Your code here!
    end

    # TODO: Filter current-ad-data using interactions-matching-customer.  
    # If the resulting table has more than 0 rows, this should return true 
    # (i.e. the customer had at least 1 interaction with this ad)
    # Your code here!
  end

  # TODO: Filter the customer-like-table with matching-customers to get 
  # the customers that interacted with the given ad!
  # Your code here!  
end

Note: Pyret can read the Strings from the customer-id and ad-id columns in the AD-INTERACTION table and convert them to Numbers. If you choose to write custom ad-interaction tables to test with, your customer-id and ad-id columns can be of type Number.

Part 1B: Visualization

Let's help the Oatman Ad Agency visualize some of this customer data! Specifically, let's look at:

  1. The proportion of customers in CUSTOMERS with a car vs. without a car
  2. The distribution of customers in CUSTOMERS across the following age groups: 20 or under, 21-30, 31-40, and 41 or older

You will refer to the CS111 Table documentation to find the appropriate plotting functions.

Task: Develop a function show-car-proportions that takes a Table and produces an Image that conveys the proportion of customers that do/do not have cars.

Task: Develop a function show-age-distrib that takes in a Table and produces a single Image of an appropriate plot or chart to show the distribution described in item 2. This raises an interesting question about how to deal with the blank values in the age column. Options include:

  • making blank values a separate category
  • omitting them from the visualization
  • counting them as part of "20 and under"

Implement whichever (one!) of these you feel is the most reasonable approach to handling this data.

Task: In your hw4-written file, briefly discuss when you believe each of these three options would be appropriate. Make sure your written answers like this are clearly marked by section (e.g. label this answer "1B").

Hints!

Hint 1: Decide which kind of plot or chart you want to produce. You may need to play with some of the functions in the Table documentation to figure out what makes sense.

Hint 2: Plan out how you will get from the original data table to the inputs needed for the plot/chart function that you've chosen to use. Do you need a compute additional information? Limit to a certain set of rows? Do you need to do these in a particular order? List those out (in prose), then build expressions to perform each step.

Hint 3: Look at the count builtin function. What does the table produced by the count function look like? What are the names for its columns?


Note: You don't need to test functions that produce Images, but you do need to test any helpers which don't output Images. However, these functions should work on any Table with the column names used within the function, not just the CUSTOMERS table.

Task: The Agency is thinking about producing a scatterplot that maps customers' IDs against their ages. Do not actually produce this scatterplot, but consider: What useful information would such a scatterplot provide? Give your answer in a block comment of a sentence or two under your implementation of show-car-proportions.

Part 2: Ship Repair (Extraction)

Learning Goals:

  • Combine data across two tables to answer a question

Captain Cork realizes that his ship, the Starship Fisler, is in drastic need of repairs. He finds a list of parts that need fixing, but he finds that each part is priced by its material.

As part of the code you copied over from hw4-helper.arr, there is one Table called PART-INFO that includes the name and material for each item, and another Table called MATERIAL-PRICES that shows the price of each material.

How do I see what these Tables look like?

To see what these Tables look like (since they're not explicitly defined like CUSTOMERS, feel free to type the name of the Table in the definitions window and hit "Run", or hit "Run" and then type the name in the interactions window and hit return.


Cork wants to know if there is an easier way to figure out how much he has to spend for each item.

Task: Write a function called part-cost which takes in a String representing the name of a part, a Table with column names part-name :: String, material :: String (like PART-INFO), and a Table with column names material :: String, price :: Number (like MATERIAL-PRICES). It should return a Number representing the cost associated with the input part name. You can assume that the input name exists once and only once in the first Table.

Hints!

Hint 1: Work through a concrete example: pick a part name and work out a list of the computations you'll need to do (say, on each table) to compute the answer. Write expressions to do the computations just for your single concrete example. Then, step back and create functions to let you do the computation for any part.

Hint 2: Consider creating a helper function to help break up the computation

Part 3: Spork's Shuttle Service (Data Cleaning)

We will cover this part in lecture on Friday, 10/1

Learning Goals:

  • Think about how table designs and code can embody assumptions (possibly limited) about people
  • Practice cleaning up raw data for reliable processing

Mr. Spork has a wonderful business shuttling aliens across space. He wants to roll out a new a customer loyalty program, and his plan is to start a trial run to see how the strategy will impact business. After gathering initial data about his trial participants, Spork finds the data is already suuuuper messy, and he needs your help before he can begin the trial!

In the code you copied from hw4-helper.arr, there is a detailed Table called SPORK-DATA. It includes each participant's name, gender, occupation, number of visits in the past 90 days, and favorite constellation destination. Spork needs this data to launch a student/teacher discount and hold special sales.

Part 3A: Written

Before we try to clean Spork's data, we should think about what the valid contents of each field could be. Turns out, this can be pretty hard (many programmers and companies get this wrong in practice).

Task: Read this short article on falsehoods programmers believe about names.

Task: Read this article on gender storage in databases.

Task: Answer the following questions in your hw4-written file before moving on to the coding task below. Make sure your written answers like this are clearly marked by section and number (e.g. label these answer "3A"and label each question with its number).

  1. Take a look at SPORK-DATA. Based on the two articles you've just read, what are two assumptions this Table seems to reflect, based on its structure or the data you see? Provide an example from SPORK-DATA of each assumption you name, and explain why the assumption is harmful or why it does not always hold true.
    Hints!

    Hint: Because the articles are about names and gender, try paying special attention to those columns!

  2. In order to help Spork out, Alyssa has decided to write some code to clean up their customer data. Looking at the function clean-spork-names and its helpers, which you copied from hw4-helper.arr, identify 3 separate assumptions Alyssa is making about names, giving an example from SPORK-DATA for why each assumption is false.
    Hints!

    Hint 1: You can run the function in Pyret and see its effect on the Table!

    Hint 2: Pay special attention to the first ~30 rows of the Table!

Before proceeding to part 3B, make a short list of the issues that you see needing to be cleaned up before someone could work with the data. You aren't turning this in, but it is a good sanity check on your understanding of data cleaning before we tell you what to clean.

Part 3B: Designing and Coding

[Updated 10/1]: We spent Friday talking about planning table programs rather than cleaning them. As a result, we haven't covered everything that you might need for these three coding problems. We are therefore going to give you two options here: you can write the code as originally assigned (since technically you've seen what you need for these questions), or you can turn this more into practice with planning out programs (which we expect would be valuable for many of you). It's entirely up to you.

Option 1: The original coding option

Now it's your turn to try your hand at cleaning and preparing the data!

Task: Write function called clean-spork that takes a Table with the columns of SPORK-DATA and produces a table with at least the same column names, as well as one new column called "stud-teach". The cleaned table may have fewer rows than the original.

The cleaned table should have:

  1. normalized the names of constellations so that only the four strings"libra", "orion","aries", or "other" appear in the "fav-trip" column.
    For example, "laries" -> "aries"
  2. only numbers between 1 and 200 in the "num-visits" column
  3. a new "stud-teach" column that summarizes/bins the original "occupation" column. The new column should have only the values "student", "teacher", or "other". Teacher is a catch-all term for professors, teachers, and instructors.

Note: Try calling clean-spork on SPORK-DATA itself to see what it does, but since it's such a big Table it doesn't make sense to make this an example in the where block for clean-spork. Instead, like some other functions in this homework, make a test table with rows and columns for the specific cases that you want to test. Make the data varied, but your example Table doesn't need to be too long if you design your rows carefully.

Hints!

Hint 1: You can and should visualize the data for outliers, use functions like count, etc., to help you figure out ways in which the data is messy.

Hint 2: You might want to try writing more than one helper function for clean-spork!


Task: In your hw4-written file, write a few sentences to reflect on what you learned from this assignment. What have you learned about working with data? What have you learned about coding? We won't grade your answers, but we will look at them to see what's emerging for everyone.

Option 2: The design practice option

Consider the same clean-spork function that's described in the coding option, but don't write the code.

Task: Write down a concrete example of a smaller version of SPORK-DATA to use as a potential testing input for clean-spork. There's no fixed number of rows required. Choose a collection of rows that you think highlight some of the key issues that your function would have had to handle.

Task: Provide the corresponding output table that you think should result from calling clean-spork on your example input table.

You can write your tables for these two tasks either in Pyret, as drawings on paper, or in something like google sheets (and upload a screenshot or PDF of the sheets). If you upload a file (rather than use Pyret), call it clean-spork-tables.(png or pdf).

Task: Write out a program-planning to-do list for writing clean-spork. Each of your list items should include a brief prose description of a computation that's needed, the name of a relevant built-in operator from Pyret (if applicable), or the name and input/output types of functions that you would need to write to accomplish that item. By the time you are done, you should have a list of most of the functions that you would need for this problem, but without the bodies of those functions filled in.

Remember, this is not about writing code. It is about breaking down the clean-spork task into smaller concrete programming steps. One way to approach this is to think about what has to be done to produce each column value in the output table that is not in your input table.

Theme Song

Fly Me to the Moon by Frank Sinatra


Brown University CSCI 0111 (Fall 2020)
Do you have feedback? Fill out this form.