--- title: Homework 4 tags: Homeworks-F23, 2023 --- # Homework 4: Data Drama **Due:** Wednesday, February 21st, 11:59 PM ## Setup - Create a file in [Pyret](http://code.pyret.org) called `hw4-code.arr` where you will write your Homework 4 solution. - Create a file in your preferred word processor (Google Docs, Word, TextEdit, WordPad, etc) called `hw4-src.pdf`. This will hold your written work. **Do not include your name in your SRC submission!** - Do **not** put your name anywhere in the file. - Copy and paste the starter code in the next section into `hw4-code.arr` ## Stencil Code and Resources Copy and paste the code in the following dropdown into `hw4-code.arr`: :::spoiler Expand here for HW4 starter code ``` provide * provide-types * include shared-gdrive("dcic-2021", "1wyQZj_L0qqV9Ekgr9au6RX2iqt2Ga8Ep") include gdrive-sheets include data-source ALL-DATA = load-spreadsheet("1wWEI3nx64xkbpV9Ey06Ho_yvNQcWi8KpyQAtGlQmcfo") CURRENCY-CONVERSION = load-table: from-c :: String, to-c :: String, conv-rate :: Number source: ALL-DATA.sheet-by-name("currency", true) end # Uncomment for Part 2 #| ART = load-table: id :: Number, cost :: Number, currency :: String, kind :: String source: ALL-DATA.sheet-by-name("art", true) end |# # Table and helper function for Part 4 titanic-raw-loader = load-spreadsheet("1ZqZWMY_p8rvv44_z7MaKJxLUI82oaOSkClwW057lr3Q") TITANIC-RAW = load-table: survived :: Number, pclass :: Number, raw-name :: String, sex :: String, age :: Number, sib-sp :: Number, par-chil :: Number, fare :: Number source: titanic-raw-loader.sheet-by-name("titanic", true) end fun first-name(full-name :: String) -> String: doc: ```Extracts the first name from the given full name. The first name is defined as the part of the name that is between the first two spaces (removing parentheses if necessary)``` first-space = string-index-of(full-name, " ") after-first = string-substring(full-name, first-space + 1, string-length(full-name)) # the second space is the first space of the substring after-first second-space = string-index-of(after-first, " ") # the first name will be before the space in after-first between-spaces = string-substring(after-first, 0, second-space) # get rid of first parenthesis if it exists if string-char-at(between-spaces, 0) == "(": string-substring(between-spaces, 1, string-length(between-spaces)) else: between-spaces end where: first-name("Mr. Wensley Z") is "Wensley" first-name("Dr. Victor Frankenstein") is "Victor" first-name("Ms. (Alice Potts) A. Bird") is "Alice" end ``` ::: ### Remember your resources! - [Pyret documentation](https://www.pyret.org/docs/latest/) - [How to create a plan](https://hackmd.io/@cs111/planning) - [CS0111 Table documentation](https://hackmd.io/@cs111/table) (use this **instead** of the built-in `Table` documentation) - [Information on lambda expressions for Tables](https://hackmd.io/@cs111/lambdas-tables) - [TA hours](https://brown-csci0111.github.io/pages/calendar) - [Edstem](https://edstem.org/us/courses/54800) - [Code clarity guide](https://cs.brown.edu/courses/csci0111/fall2021/assets/docs/pyret-clarity-design-testing.html) ## Big Picture :::danger *Remember to use the CS0111 `Table` documentation (linked in Setup and Handin), not the official documentation.* ***If your code uses constructs named sieve or extend (without functions), for example, you are using the wrong version and will not get credit.*** ::: This assignment has you practice a multi-stage data-analysis task. We're going to write functions that draw on data from two tables to perform computations. We'll also do multiple versions of a function, where each version adds more features, so you can see an example of building up solutions a bit at a time. This assignment draws on all of the table lectures so far. For the planning tasks, we saw an example on Feb 14, and you're also practicing it in lab 3. While we have covered everything that you need as of Feb 14, the examples we do in class on Feb 16 will give more practice with the concepts in this assignment. :::warning **Important:** as you start working on more complicated programming problems, we want you to practice planning out your tasks *before* you start coding. Taking the time to plan your approach to the programming tasks can potentially save hours of debugging time. We have given you some guidance on planning [here](https://hackmd.io/@cs111/planning). ::: ## The Assignment As an international organization, Lily Pad LFC (limited frogability corporation) has to deal with the complexities of currency exchange rates. In order to do this, the leaders at Lily Pad LFC have to manipulate tables containing complicated data. This homework will get you ready to join Lily Pad LFC by using your table skills to uncover important information. Once you've practiced your skills, one of your new co-workers, Ricky Redeyed, also needs some help evaluating art sales and investigating the Titanic. ![frog with paperwork](https://hackmd.io/_uploads/rkDXkfwYa.jpg) ## Part 1: Prices Under Exchange Rates :::info Learning Goals: - Practice planning programs - Combine data across two tables to answer a question ::: Companies that do business internationally need to be able to quote prices in different currencies, following a table of exchange rates. This is the kind of situation where one would have two separate tables: one with the prices of items, and another with the exchange rates per country. For this problem, we'll be computing currency conversions for an online art store in which artists and clients may be in different countries. The store has two tables, that are both in your starter code. One is for artwork, called `ART`: it tracks the unique id, cost and base currency for each piece, as well as what kind of artwork it is. It has some errors in the data, which we will clean up in the next part. It is commented out -- **ignore it for now**. ``` table: id :: Number, cost :: Number, currency :: String, kind :: String ``` The other table is for currency conversion, called `CURRENCY-CONVERSION`. Each row holds the conversion rate (the multiplicative factor) to convert from the first currency (`from-c`) to the second (`to-c`). ``` table: from-c :: String, to-c :: String, conv-rate :: Number ``` For example, if a row of the `Table` looked like: ``` ["EUR" "TND" 3.32] ``` we would convert from Euros (EUR) to Tunisian Dinars (TND) by multiplying the price in Euros by 3.32. We will use these tables to develop several programs for selling art across currencies. :::spoiler How do I see what the `Tables` in this assignment look like? Unlike in Homework 3, these tables are not typed directly into Pyret, but rather pulled in from a Google Sheet. To see what these `Tables` look like, feel free to type the name of the `Table` in the definitions window and hit "Run", or hit "Run" first and then type the name in the interactions window and hit return. ::: <br> :::info For these tasks, you can use our `CURRENCY-CONVERSION` `Table` when writing tests. ::: **Task 1:** Write a function called `exchange-price-1`, which takes in as input a `Table` that has the same columns as `CURRENCY-CONVERSION`, a starting currency (a `String`), a desired currency (a `String`), and a price (a `Number`), and returns a `Number` of the price converted from the starting currency to the desired currency. For this exercise, you can assume that a row with the starting currency in the "from-c" column and the desired currency in the "to-c" column exists exactly once in the table. For now, do not worry about handling the case of converting a currency to itself. *Remember to make a plan for this task before starting to code it up!* :::spoiler How do the stated assumptions affect how you should write tests? For this and every homework problem that has text like "you can assume that a row appears exactly once in the table," your test cases should obey this assumption. That is, for this problem, do **not** include tests like `exchange-price-1(CURRENCY-CONVERSION, "JPY", "EUR")`, since this test violates the assumption (there are no rows with "JPY" in the "from-c" column and "EUR" in the "to-c" column in this `Table`). ::: <br> :::spoiler How do I write tests for decimal numbers? Remember that you can use mathematical expressions in tests, such as `0.25 is 1/4`. Pyret also has an [`is-roughly`](https://www.pyret.org/docs/latest/testing.html#%28part._testing_is-roughly%29), but it will be much easier to use the mathematical expression instead. ::: <br> **Task 2:** Sometimes, we cannot make the assumption that we make in Task 1. For this task, assume that input currency conversion table may not list the conversion from A to B, but it may list the conversion from B to A. Write a function called `exchange-price-2`, which has the same inputs and output as `exchange-price-1`, but uses the direct conversion (as in Task 1) or the inverse conversion if necessary. (You compute the inverse of a number *N* using *1 / N*, where / is division). *Remember to make a plan for this task before starting to code it up!* Details: - You should also account for the case where the starting currency and the desired currency are the same, even if the `Table` does not contain the relevant row. (*Hint: what would the conversion rate be in this case?*) - For cases where the starting currency and desired currency are different, you can assume that a row with one currency in the "to-c" and the other currency in the "from-c" column appears in the `Table`. If there are multiple matching rows, you can assume that they are consistent with each other (that is, if you have a `Row` with A in to-c and B in from-c, and a row with B in to-c and A in from-c, no matter if you choose to do the direct conversion or the inverse conversion, you will get approximately the same answer.) **Task 3:** Create a function called `plot-exchange-rates` which takes in a `Table` that has the same columns as `CURRENCY-CONVERSION` and a starting currency (a `String`) and produces a bar chart of conversion rates from the given currency to all the other available currencies, that is, currencies for which there exists either a direct or an inverse conversion (or both) from the given currency. *Remember to make a plan for this task before starting to code it up!* :::warning This is one of the more complex tasks in this assignment - you might think about how you will do it, and then move on to other tasks in the assignment and come back to it. If you get stuck, try one of the following strategies: * Revising your plan or talking through it in hours/on Ed * Closing the Pyret window, drawing out what you want the result to look like on paper, and thinking about the steps you are taking/ways you are interacting with your data * Getting a good night's sleep or working on something else (seriously, sometimes the brain just needs a rest before looking at the problem anew!) * Giving yourself a time limit to work on the problem and then turning in the homework regardless (it's worth a fraction of the points on one assignment, and sometimes there are more important things in life!) ::: :::spoiler Questions to ask yourself when making a plan - How do you make sure that you only include the available currencies in your answer? - Sometimes, the other currency will be in the "from-c" column, and other times, it will be in the "to-c" column. How can you make sure that the bar chart includes both of these cases? - How do you create a plot that uses the inverse conversion if the direct conversion is not available? ::: ## Part 2: Data cleaning This section deals with the `ART` `Table`. Most of the section is exploratory (you are meant to write short pieces of code and write down what you notice, but not turn in code), with one coding task that you turn in. :::info In the Feb 12 and 14 lectures, we learned that there are multiple approaches to cleaning data: using [sanitizers](https://dcic-world.org/2022-08-28/processing-tables.html#%28part._missing-data%29), manually cleaning up the source spreadsheet, filtering out rows that we recognize as malformed, and writing functions to clean up columns. One of your jobs in these tasks will be to determine which of these approaches makes the most sense for different kinds of messy data. ::: **Task 4:** Uncomment the block below the line that says `# Uncomment for Part 2` by removing the surrounding `#|` and `|#`, and try to run the code. You will encounter one error. Using what you learned in class, change the `load-table` block of code so that this error no longer happens (Note: column A corresponds to the "id" column). Even though the error text will hint towards editing the Google sheet itself, you can and should resolve this in code. In a one-sentence comment below this change, in your own words, explain what the error was and why it happened. Now, examine the `ART` `Table` in the Pyret window. For the data to be considered "clean," we would want: - Every id in the "id" column to be a unique number - Every price in the "cost" column to be between 100 and 99,999 (inclusive) - Every value in the "currency" column to be a valid global currency, represented as a three-letter capitalized string - Every cell in the "kind" column to contain one of three values: "visual," "sculpture," or "other." The next few tasks ask you to reflect on what it would mean to clean the existing data to conform to these expectations. As you complete tasks 5-7 remember that, aside from scrolling through the columns to examine data, you can also graph the data or even sort it. **Task 5:** One of the values in the "id" column appears twice, violating one of the requirements above. In a short comment that starts with "Part 2, Task 5:", name this value, and explain how to find it efficiently ("efficiently" here means faster than reading through all of the values in the column). **Task 6:** In a short comment that starts with "Part 2, Task 6:", explain how the values in the "cost" column are not clean, using at least 2 specific examples from the data. Which of the approaches listed in the blue box at the top of this part would be your first choice for cleaning up the data, and why? Which would be your second choice, and what is an upside and a downside of this second choice? **Task 7:** In a short comment that starts with "Part 2, Task 7:", explain how the values in the "currency" column are not clean, using 2-4 specific examples from the data. Which of the approaches listed in the blue box at the top of this part would be your first choice for cleaning up the data, and why? Which would be your second choice, and what is an upside and a downside of this second choice? **Task 8:** Examine the "kind" column. We consider "visual" art to be art that has "paint" or "photo" in the "kind" description, and "sculpture" to be art that has "marble", "ceramic," or "wood" in the "kind" description (all case-insensitive). Everything else would be considered "other." Write a function called `clean-kind` that takes in a `String` (which you can assume will look like any of the values in the pre-cleaning "kind" column) and produces either "visual", "sculpture", or "other", based on the criteria named in this task. To test this task, use specific values from the given data. :::info **Note:** this task does not ask you to use `transform-column` to clean the `ART` `Table`, but you can (and should) certainly try out what it looks like to use `clean-kind` as an input to `transform-column` for this purpose. ::: ## Part 3: Art sales :::info Learning Goals: - Practice writing functions that take tables as inputs - Practice writing helper functions for use with tables - Practice testing tables ::: <img src="https://hackmd.io/_uploads/SkzeHf_t6.png" alt="frog painting a picute of a frog" width="40%"> <img src="https://lh3.google.com/u/0/d/1_tojn_abSjumSS_Ybjp0U09ukvdbtyi3" alt="drawing of a frog relaxing on a beach" width="50%"> **Task 9:** Since the given `ART` `Table` has data errors *and* is pretty large (and therefore unwieldy to use in test cases), come up with two small example `Tables` with the same columns as `ART`. Make sure that the `Table`s you create obey the criteria for what it means for the data to be clean, as given in Part 2. For the currency column, you can limit yourself to the following currencies (so that they match the `CURRENCY-CONVERSION` `Table`): CAD, CHF, CNY, EUR, JPY, MXN, USD, ZAR. You may end up revising these tables as you work through and test the following tasks -- that's okay, but make the tables now as a starting point! For the next two tasks, you can use the provided `CURRENCY-CONVERSION` as your test input for the relevant table, and you should use the test tables you defined in Task 9. **Task 10:** Create a function called `get-art-price` that takes in an art ID (`Number`), a currency (`String`), a `Table` that has the same columns as `CURRENCY-CONVERSION` and a `Table` that has the same columns as `ART` and returns a `Number` for the price of the artwork with the given id, converted to the input currency. You can assume that exactly one row with the given id exist in the `ART`-like input table. If a direct or inverse conversion doesn't exist (as described in Task 2), this function should `raise("unable to convert price")`. *Remember to make a plan for this task before starting to code it up!* **Task 11:** You want to find the priciest art in the `Table`, but since the artworks are in different currencies, you will need to convert all of the artwork to the same currency to do this analysis. Create a function called `priciest-art` that takes in a target currency (`String`), a `Table` that has the same columns as `CURRENCY-CONVERSION`, and a `Table` that has the same columns as `ART`, and returns the artwork ID (`Number`) that costs the most once all artworks have been converted to the target currency. You can assume that all conversions to the given currency exist (either as direct or inverse conversions) in the input currency conversion table. *Remember to make a plan for this task before starting to code it up!* :::info **Note:** this means that you will have to examine `CURRENCY-CONVERSION` to find such currencies when you are testing your function in order to obey this assumption. ::: ## Part 4: A Titanic Dataset :::info Learning Goals: - Think about how table designs and code can embody assumptions (possibly limited) about people - Practice cleaning up raw data for reliable processing ::: There are a few different versions of databases of passengers who sailed on the Titanic. (There is some ambiguity in this term, because the Titanic made a few stops even on its maiden voyage.) One such database is in your starter code, called `TITANIC-RAW`. We're going to do an analysis of this dataset, using an assumption about names. We want to determine: - The seven most popular first names of male passengers, - The seven most popular first names of female passengers, Relative to the dataset, we define the relevant concepts as follows: - A man has sex field “male”, woman has it as “female” (we acknowledge that this binary excludes people. We return to this issue later in the exercise.) - A “first name” is the part of raw-name between the first and second spaces (excluding any parentheses that appear). For example, "Dr. Victor Frankenstein"'s first name would be "Victor," and "Ms. (Laura Jeanne) Reese Witherspoon"'s first name would be "Laura."" We give you a helper function, `first-name`, which takes in a full name as a `String` and outputs the first name according to the criterion above. **Task 12:** Using table functions and the `first-name` helper function, determine the seven most popular male and female names *according to our first name definition*. You do not have to print them out directly in the interactions window -- in fact, the easiest way to find these names will likely be to do some computations on the raw data and then find an appropriate function to summarize some data in the resulting tables (*hint*: where in the Table documentation is "summarizing" discussed?) Don't overcomplicate this -- for reference, our solution uses 5 lines of code. *Remember to make a plan for this task before starting to code it up!* In your `hw4-src.pdf` file: - Write the 7 most common male first names in descending order - Write the 7 most common female first names in descending order **Task 13:** Does anything surprise you about your findings? Expand `TITANIC-RAW` in your Pyret window and observe how names were recorded in the dataset. In your `hw4-src.pdf` file, describe any observations you have made in this exploration. Before you started this task, did you believe that the definition of first name was reasonable (if no, briefly state why not). Did this change after your exploration? **Task 14:** In the Titanic Dataset, we used data that was collected over a century ago using methods informed by prejudices of the time. However, biased data is still very much a modern issue. Read the two following articles on current problems with data collection and storage of names and gender: - [Falsehoods Programmers Believe About Names](https://www.kalzumeus.com/2010/06/17/falsehoods-programmers-believe-about-names/). - [Gender Storage in Databases](https://slate.com/technology/2019/10/gender-binary-nonbinary-code-databases-values.html) Think about your response to Task 13. How did gender bias impact the collection and storage of the Titanic dataset? From the two articles you just read, how can similar biases emerge in data collection and storage now? Could some biases be less detectable than others? Write your response in `hw4-src.pdf`. **Task 15:** Read the following quote from Amy J. Ko in [Critically Conscious Computing Chapter 7: “Encoding Information”]((https://criticallyconsciouscomputing.org/information)) (The same resource you read in homework 3! This chapter also offers a fascinating introduction to the history and limitations of racial and ethnic categories in data.) > While none of the powers or perils of data are specific to computers, computers do play a central role in amplifying the harms and benefits of data. Forms on a website control what data can and cannot be entered, shifting power from the individuals filling out the form to the website and the developers who maintain it. **Databases control how data is encoded and organized, shifting power from the people represented by data to the database designers who created and maintain the database.** As a programmer, you have lots of power. Whether you work with an old, biased datset or design and collect a new dataset, there are actionable steps you can take to prevent further harm. You will continue to explore the *how* throughout your CS education at Brown! ## Check Block (Autograder Compatibility) Hit "Run", then paste this check block into the interactions window. All the tests should pass: if they don't, this means you have a typo in a function header and the *autograder won't run on your code!* ``` check "functions exist and have correct inputs": CLEAN-ART = table: id, cost, currency, kind row: 1, 1, "USD", "other" end is-number(exchange-price-1(CURRENCY-CONVERSION, "USD", "EUR", 1)) is true is-number(exchange-price-2(CURRENCY-CONVERSION, "USD", "EUR", 1)) is true is-string(clean-kind("")) is true is-number(get-art-price(1, "USD", CURRENCY-CONVERSION, CLEAN-ART)) is true is-number(priciest-art("USD", CURRENCY-CONVERSION, CLEAN-ART)) is true end ``` If you see the block below appear in the interactions window after running it, then you are fine! Please submit to [Gradescope](https://www.gradescope.com/courses/718380). &NewLine; ![](https://i.imgur.com/vp6aMHK.png) If not, double-check your function names, input types, and input order. If you are stuck at any point, please feel free to come to hours or post on Ed! ## Handin - Download your code file and make sure it is called `hw4-code.arr`. - Your written answers should be clearly marked with corresponding task numbers. Download or save your written file and **make sure it is a PDF** called `hw4-src.pdf`. If your written file is not a PDF, we might not be able to grade it. - Hand in `hw4-code.arr` and `hw4-src.pdf' on Gradescope. Make sure you do not include your name or any other identification on your submissions! ## Theme Song [My Heart Will Go On](https://www.youtube.com/watch?v=F2RnxZnubCM&pp=ygUNdGl0YW5pYyBzb25ncw%3D%3D) by Celine Dion ------ > Brown University CSCI 0111 (Spring 2024) <iframe src="https://forms.gle/tVELrdxLYisxKvsb6" width="640" height="372" frameborder="0" marginheight="0" marginwidth="0">Loading…</iframe>