---
tags: Projects-Summer2021
---

Project 1: Recommender
=================

### Due Date: Friday, June 18th, Anywhere-On-Earth time

Overview
=======

In this project, you will design and build a program that uses machine learning to generate a *decision tree* from training data that can be used for classification tasks.

This handout is long and will take a while to read, but reading it carefully will pay off when you implement the project. In this handout, we will:

- outline the learning objectives of the project
- give background on machine learning and decision trees
- outline the implementation architecture of the project
- describe the decision tree generation algorithm
- discuss testing expectations
- provide a way for you to test the accuracy of the decision trees generated by your algorithm
- detail the design check expectations
- describe the SRC component of the project
- list final submission requirements
- describe how the project will be graded

## Learning Objectives

With every project in CS18, we'll highlight the learning objectives that are targeted in the project. By the end of the project, you will be able to:

* Select and justify a specific data structure for use in a specific problem, citing alignment with problem needs or resource constraints
* Use concrete examples, invariants, hand tracing, and debugging to explore or explain how a program works
* Create a class hierarchy that makes proper use of interfaces (for variant types or contracts)
* Comment code effectively with Javadocs
* Develop a collection of examples and tests that exercise key behavioral requirements of a problem
* Write programs that behave as expected based on a provided description or collection of test cases
* Implement a basic machine learning algorithm to train a model to classify data
* Identify potential societal impacts of a program on individual users or populations

Stencil Code
===========

:::warning
**Important**: Read [this guide](https://hackmd.io/ucedaAekROycRiN2j1LRaw) detailing how to use GitHub Classroom for projects in CS18 first!
:::

You can set up your group and clone your group's project repository [**here**](https://classroom.github.com/g/kBhIEmEC).

Background
=========

## Machine Learning

We hear a lot about machine learning, but what really is it? Broadly, much of machine learning can be described as *function approximation*: using historical data to train a model that approximates an unknown underlying function.

One simple machine learning algorithm that you might be familiar with from school is **linear regression**.

![](https://i.imgur.com/c4SNy1u.png)

As seen in the above graph, linear regression uses the training data (the blue dots) to create a linear estimation (the red line) of the underlying function that describes the data. In this case, the training data is given in the form of pairs of inputs **x** and outputs **y(x)**. We can use linear regression to estimate **y(x) = mx + b**.
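As a concrete (ungraded) illustration of what "estimating **m** and **b**" means in code, here is a minimal least-squares sketch. The class and method names are ours, and this is not part of the project:

```java
// Minimal least-squares sketch (illustration only, not part of the project).
public class LinearRegressionSketch {
    // Returns {m, b} minimizing squared error for the model y(x) = mx + b.
    public static double[] fit(double[] x, double[] y) {
        double xMean = 0, yMean = 0;
        for (int i = 0; i < x.length; i++) {
            xMean += x[i] / x.length;
            yMean += y[i] / y.length;
        }
        double num = 0, den = 0;
        for (int i = 0; i < x.length; i++) {
            num += (x[i] - xMean) * (y[i] - yMean); // covariance term
            den += (x[i] - xMean) * (x[i] - xMean); // variance term
        }
        double m = num / den;
        return new double[] {m, yMean - m * xMean};
    }
}
```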
## Vegetable Preferences

Consider a different problem: **you would like to train a model to tell you whether you are likely to enjoy a particular vegetable** based on your preferences for vegetables you have already tried. You've been tracking your vegetable preferences and have the following table:

```
name      color     lowCarb   highFiber   likeToEat
------------------------------------------------------
spinach   green     true      true        false
kale      green     true      true        true
peas      green     false     true        true
carrot    orange    false     false       false
lettuce   green     true      false       true
```

Each row of the table corresponds to a vegetable you have tried and each column corresponds to an attribute of vegetables. `likeToEat` is the attribute that our model will be predicting.

Thinking back to the linear regression example, the function we were estimating was single-variable. We can think of your vegetable preferences as a multivariable function that takes as parameters the other attributes of a vegetable (`color`, `lowCarb`, and `highFiber`) and outputs `likeToEat`. That means we are approximating a function

    likeToEat = function(color, lowCarb, highFiber)

Unlike the linear regression example, `likeToEat` can only be one of two possible values, `true` or `false`, so this becomes a *classification* problem: we want to be able to label any vegetable as either `likeToEat = true` or `likeToEat = false`.

## Data

The performance of any machine learning system is highly dependent on the data available on which to train the model. Google Director of Research and Brown alum Peter Norvig once said the following of Google's prowess:

> We don't have better algorithms, we just have more data.

In machine learning, *training data* is the data from which we build our model. In the linear regression example, the training data were the blue dots, and in the vegetable preferences example the training data were the data you collected on your historical vegetable preferences. The more good data points we gather and use in the training process, the more accurate our model will be.

We define the *accuracy* of a classification model to be the percentage of novel *testing data* the model correctly classifies. In the vegetable preferences example, if we compiled a test dataset of 100 vegetables that were not present in the training data and fed them into our model, the accuracy would be how many of those 100 vegetables the model correctly identified as ones you do or do not like to eat.

It is important to note that in order to properly measure the accuracy of a model, the model must be tested on a novel dataset with no overlap with the training dataset. This is the distinction between *training* and *testing* data.
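In code, accuracy is just the fraction of test data classified correctly. A minimal sketch (a helper of our own, not part of the stencil), comparing a list of predictions against the actual labels:

```java
import java.util.List;

// Illustration only: accuracy = (correct predictions) / (total predictions).
public class AccuracySketch {
    public static double accuracy(List<Object> predicted, List<Object> actual) {
        int correct = 0;
        for (int i = 0; i < predicted.size(); i++) {
            if (predicted.get(i).equals(actual.get(i))) {
                correct++;
            }
        }
        // multiply by 100 if you want a percentage rather than a fraction
        return (double) correct / predicted.size();
    }
}
```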
## Decision Trees

We've talked abstractly about "training a model", but what does that actually look like? In this project, **you will be building a program to generate a decision tree from training data to perform classification tasks** like the aforementioned vegetable preferences example. The *model* is the decision tree, and the *training* is the process of building the tree from training data.

Decision trees are a popular form of machine learning partly because of how intuitive and visualizable they are. Decision trees provide an ordered way of examining each attribute of a datum, leading ultimately to the decision/classification for that datum. Below is a diagram of a decision tree that could be generated from the vegetable preferences training data.

![](https://i.imgur.com/j8lrAjE.jpg)

The **nodes** of the tree are labeled with attributes of the data (except for the name, which we are skipping in this example). The **edges** of the tree are labeled with different values of the attribute of the node from which they stem.

Looking at the diagram, the root node considers the `color` attribute, for which there are two values in the dataset (green and orange). For the orange vegetables, `likeToEat` is always `false`, but `likeToEat` differs among the green vegetables. The tree next looks at whether a vegetable is `highFiber`. For green, low-fiber vegetables, `likeToEat` is always `true`, but it varies for high-fiber vegetables, so the tree then looks at the `lowCarb` attribute. If `lowCarb` is `true`, you are just as likely to like the vegetable as to dislike it, so we can arbitrarily pick to classify `likeToEat` as either `true` or `false`.

If you wanted to use this tree to predict whether you would like a new vegetable, you would use the attributes of the new vegetable to traverse the tree from the root until you get to a decision leaf. At each node, you follow the edge whose label corresponds to the new vegetable's value for that node's attribute. For example, if you wondered whether you would like acorn squash (which is orange), this tree would predict that you would not like it. If you wondered about green beans, which are low in fiber, this tree would predict that you would like it.

Here are things to note about the decision tree:

* each non-leaf node is labeled with an attribute from the dataset
* the same attribute arises only once along any path from the root to a decision
* a node for an attribute has outgoing edges corresponding to the possible values of that attribute from within the table
* the decisions are the predicted value for one attribute (here, `likeToEat`), based on the values of the other attributes; all decisions are for the same attribute

In this project, you will write a program that generates a decision tree from a training dataset and uses that tree to make predictions. You will exercise various object-oriented practices to make your code robust and will evaluate the performance and social impacts of your implementation.

:::info
**Note:** We use the words "decision", "prediction", and "classification" somewhat loosely throughout this handout and the code to mean the same thing: the tree's predicted value of the target attribute for a test datum. Similarly, we use "build", "generate", and "train" to all mean the process of building the decision tree from a training dataset.
:::
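To make the traversal concrete, here is the example tree above written out as hard-coded Java logic. This is an illustration only: your actual tree must be built out of node and edge objects generated from data, never out of `if` statements.

```java
// The example vegetable tree as straight-line logic (illustration only).
public class VegetableTreeSketch {
    public static boolean predictLikeToEat(String color, boolean lowCarb,
                                           boolean highFiber) {
        if (color.equals("orange")) {
            return false;   // the only orange vegetable (carrot) was disliked
        }
        // green vegetables
        if (!highFiber) {
            return true;    // the green, low-fiber vegetable (lettuce) was liked
        }
        // green, high-fiber: split on lowCarb
        if (lowCarb) {
            return true;    // spinach (false) vs. kale (true): 50/50, arbitrary pick
        }
        return true;        // green, high-fiber, not low-carb (peas) was liked
    }
}
```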
Implementation Architecture
=========

Your implementation will consist of three main parts:

* Classes and interfaces for datasets
* Classes and interfaces for decision trees
* Classes and interfaces to generate and use a decision tree given a dataset

We give you the interfaces for this project in the `src` package. Your job is to **design and implement the corresponding classes**. Additionally, you will write test cases for your code and answer written reflections.

:::info
**Your implementation may only use built-in datatypes that we have already covered in CS18** (LinkedList, ArrayList, arrays, and any classes you define for yourself). In particular, those with prior Java experience may know other data structures (such as HashMaps) that we have not yet covered in CS18. To keep the workload even for everyone, and to maintain consistency in considering runtimes and space usage, you ++**cannot**++ use these other data structures on this project.
:::

Datasets
-------

For purposes of this project, we will assume that each dataset corresponds to a table. The columns of the table are the attributes, while the rows are the individual data objects with values for each attribute. The dataset classes represent the training data you'll receive to generate the tree. Therefore, you can assume each attribute has a fixed number of possible values (such as booleans or a set of specific numbers or strings), not something like an infinite set of integers.

#### `IAttributeDatum`

To capture the notion of a single data object with attributes (corresponding to a row in the table), we provide the interface `IAttributeDatum`. For your design check, you will create an implementation of this interface to capture the vegetable preferences example (more details in the design check section of the handout).

```java
public interface IAttributeDatum {
    // lookup the value associated with the named attribute
    public Object getValueOf(String attributeName);
}
```

We use `Object` as the return type here because an attribute can be of any type. For example, `"color"` would be a `String` and `"highFiber"` would be a `Boolean`.
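For instance, an implementing class for the vegetable example might look roughly like the sketch below. This is one possible shape with field names we chose ourselves; your design check version may differ:

```java
// One possible implementation sketch for the vegetable example.
public class Vegetable implements IAttributeDatum {
    String name;
    String color;
    boolean lowCarb;
    boolean highFiber;
    boolean likeToEat;

    public Vegetable(String name, String color, boolean lowCarb,
                     boolean highFiber, boolean likeToEat) {
        this.name = name;
        this.color = color;
        this.lowCarb = lowCarb;
        this.highFiber = highFiber;
        this.likeToEat = likeToEat;
    }

    @Override
    public Object getValueOf(String attributeName) {
        switch (attributeName) {
            case "name": return this.name;
            case "color": return this.color;
            case "lowCarb": return this.lowCarb;
            case "highFiber": return this.highFiber;
            case "likeToEat": return this.likeToEat;
            default: throw new RuntimeException("no such attribute: " + attributeName);
        }
    }
}
```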
#### `IAttributeDataset`

To capture the notion of a collection of `IAttributeDatum` (a table is a collection of rows), we provide the interface `IAttributeDataset`. Note that the interface has the type parameter `<T extends IAttributeDatum>`. This means that `T` must be a class that implements the `IAttributeDatum` interface (this is confusing Java syntax, but "extends" in this context means either "extends" or "implements"). This is because your dataset class must be able to handle any concrete implementation of `IAttributeDatum`.

```java
public interface IAttributeDataset<T extends IAttributeDatum> {
    // gets all the attributes in the dataset
    public List<String> getAttributes();

    // gets the data objects (rows) of the dataset
    public List<T> getDataObjects();

    // gets the number of data/rows in the dataset
    public int size();

    // does every row/datum have the same value for the given attribute/column
    public Boolean allSameValue(String ofAttribute);

    // gets the value of ofAttribute, which is assumed to be common
    // across all rows
    public Object getSharedValue(String ofAttribute);

    // gets the most common value of ofAttribute in the dataset
    public Object mostCommonValue(String ofAttribute);

    // partitions data into subsets such that each subset has same value
    // of onAttribute
    public List<IAttributeDataset<T>> partition(String onAttribute);
}
```

You will implement `IAttributeDataset` by filling out the required methods in the `DataTable` class in the `sol` package. Your `DataTable` implementation should store both the list of attributes (`String`) and the list of data objects (`IAttributeDatum`) for the dataset.
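Your `DataTable.partition` must return `IAttributeDataset` objects, but the underlying idea is just grouping rows by their value for one attribute. A rough sketch of that idea using plain lists (names are ours; this is not stencil code):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the idea behind partition: group rows by one attribute's value.
public class PartitionSketch {
    public static List<List<IAttributeDatum>> partitionOn(
            List<IAttributeDatum> rows, String attribute) {
        List<Object> seenValues = new ArrayList<>();
        List<List<IAttributeDatum>> groups = new ArrayList<>();
        for (IAttributeDatum row : rows) {
            Object value = row.getValueOf(attribute);
            int index = seenValues.indexOf(value);
            if (index == -1) {              // first time seeing this value
                seenValues.add(value);
                groups.add(new ArrayList<>());
                index = seenValues.size() - 1;
            }
            groups.get(index).add(row);     // add row to its value's group
        }
        return groups;
    }
}
```

Partitioning the vegetable table on `color` this way would produce one group with the four green rows and another with the single orange row.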
Decision Trees
---------------

Decision trees differ from binary trees in two ways:

1. nodes can have any number of children
2. edges are labeled by the value (of the attribute from which they stem) they represent

You must decide how to implement each of these features and must create classes in `sol` for the tree structure. The only constraint is that classes representing nodes (and leaves!) of your tree must implement the interface that we provide in `ITreeNode.java`:

```java
public interface ITreeNode {
    // traverses decision tree to get prediction for datum
    public Object lookupDecision(IAttributeDatum datum);

    // prints the tree given a current indentation (leadspace)
    public void printNode(String leadspace);
}
```

The `lookupDecision` method also returns `Object` because the type of decisions can vary across datasets and individual attributes to be predicted. `leadspace` in `printNode` refers to the indentation level of the node. So if you want to print a root node, you might pass `""` (empty) as `leadspace`, and if you want to print its child, you might pass `"  "` (a couple of spaces, to show that it is a child).

Each non-leaf node will have an associated attribute from the dataset. The edges will be labeled with the possible values for that attribute.

Decision Tree Generator
------------

The generator supports training a decision tree for a dataset and making predictions with the generated tree. The `ITreeGenerator` interface provides the methods required for classes that generate the decision tree.

```java
public interface ITreeGenerator {
    // builds a decision tree to predict the value of a named attribute
    public ITreeNode buildClassifier(String targetAttr);

    // uses the trained decision tree to get the target attribute predicted
    // value for a datum
    public Object lookupRecommendation(IAttributeDatum datum);

    // prints the entire decision tree
    public void printTree();
}
```

You will implement `ITreeGenerator` by filling out the methods in the provided `TreeGenerator` class in the `sol` package. As with the `DataTable` class, your `TreeGenerator` class is parameterized by the generic type of the data.

Algorithm for Building the Decision Tree
============

A decision tree gets generated for a dataset and an attribute in the dataset that we want to predict. The values of the attribute to predict will be at the leaves of the decision tree.

## Pseudocode

The following pseudocode outlines the algorithm to generate a decision tree to predict `targetAttribute` for a `subset` of a dataset:

```
if all rows in subset have same value for targetAttribute:
    return decision node (leaf) with common value
else if no more unused attributes:
    return decision node (leaf) with most common value of targetAttribute
    across current subset
else:
    choose a previously unused attribute
    create new node for chosen attribute
    store most common value of targetAttribute across current subset
    partition subset into new subsets for each value of chosen attribute
    for each new subset:
        recursively generate decision tree for subset
        create an edge from new node to recursive subtree, labeled by
        corresponding value of chosen attribute
    return new node
```

:::info
When implementing your tree generator, you should start with the full dataset as the initial subset.
:::

**Let's talk about the step `choose a previously unused attribute`:**

First, "previously unused" means that the attribute does not appear on the path from the root node to the current spot in the tree. But what does "choose" mean? There are several different options:

- choose the attributes in the order they appear in the attribute list
- choose an unused attribute randomly
- choose an unused attribute that will minimize the size of the produced tree

While in practice the third option is optimal, to simplify things for this project, you should choose the attribute to split on randomly from among the unused attributes. **Note that this means your tree will be different every time you generate it.** Also note that you must ensure that there are unused attributes remaining before choosing one, to avoid an error.

## Random

Generating random numbers in Java is super easy! You can use the `Random` class as such:

```java
import java.util.Random;
...
Random random = new Random();
int upperBound = 10;
// generates a random number between 0 (inclusive) and 10 (exclusive)
// note that the upper bound must be greater than 0!
int randomNum = random.nextInt(upperBound);
```

Think about what bound you would want to use when selecting a random element of a list...
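If you want a concrete pattern to check your answer against, here is one common sketch (the helper class and method names are ours): the size of the list is the exclusive upper bound, so every valid index `0` through `size - 1` is possible.

```java
import java.util.List;
import java.util.Random;

// Sketch: picking a random element of a list.
public class RandomChoiceSketch {
    public static <T> T randomElement(List<T> list, Random random) {
        // list.size() is exclusive, so indices 0..size-1 are all reachable
        return list.get(random.nextInt(list.size()));
    }
}
```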
Making Predictions
==================

As discussed earlier, we want to use the decision tree we trained to predict the value of a target attribute for novel test data objects. The `lookupRecommendation` method in the generator will take in a test datum, use the datum's attribute values to traverse the decision tree until reaching a leaf, and then return the decision/predicted value at the leaf. "Traversing the tree" will most likely involve the recursive use of the `ITreeNode` method `lookupDecision`.

**What if the new data uses an attribute value that is not already in the decision tree?** For example, perhaps you are predicting whether you will like a yellow vegetable (like corn), when the training data had no yellow vegetables.

:::warning
You **cannot** assume that you have all of the attribute values up front. You only know about those values that are represented in the initial training dataset.
:::

You should handle that case in the following way: when traversing the decision tree for a datum, if the datum's value of an attribute is not present in the tree, you should immediately return the most common value for the target attribute across the subset of training data used to create the current node.

:::info
It is important to note that the most common value of the target attribute over the entire dataset is different than the most common value of the target attribute over a subset. It's your job to figure out how to store that information, so that it is accessible when traversing the tree.
:::
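To make the fallback concrete, here is one purely hypothetical shape a non-leaf node could take. Every class and field name below is invented for illustration; the node/edge design is entirely up to you, and yours may well differ:

```java
import java.util.List;

// Hypothetical non-leaf node sketch (names invented): shows only the fallback.
class AttributeNodeSketch implements ITreeNode {
    String attribute;        // attribute this node splits on
    List<EdgeSketch> edges;  // outgoing labeled edges
    Object defaultValue;     // most common target value over this node's subset

    AttributeNodeSketch(String attribute, List<EdgeSketch> edges, Object defaultValue) {
        this.attribute = attribute;
        this.edges = edges;
        this.defaultValue = defaultValue;
    }

    public Object lookupDecision(IAttributeDatum datum) {
        Object value = datum.getValueOf(this.attribute);
        for (EdgeSketch edge : this.edges) {
            if (edge.label.equals(value)) {
                return edge.child.lookupDecision(datum); // follow matching edge
            }
        }
        // value never appeared in training data: fall back to the most common
        // target value over the subset used to build this node
        return this.defaultValue;
    }

    public void printNode(String leadspace) {
        System.out.println(leadspace + this.attribute);
    }
}

// Hypothetical labeled edge.
class EdgeSketch {
    Object label;     // attribute value this edge represents
    ITreeNode child;  // subtree for rows with that value

    EdgeSketch(Object label, ITreeNode child) {
        this.label = label;
        this.child = child;
    }
}
```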
Testing
=======

In code you've written so far, testing has meant calling the methods you've written and comparing the actual result against the expected result. There are a couple of reasons why we don't test everything like that here. For larger projects where functionality depends on every component working correctly, it can be hard to locate the source of a bug if we're only testing the final result. Additionally, since the decision tree is shaped differently every time, we don't have an expected result to compare our generated tree to. That's why for this project we need two types of tests: unit tests and system tests.

## Unit Testing

When we talk about "unit testing", we're talking about testing the components individually (i.e. testing that the methods in each class work as expected). This is similar to tests you've written before. To unit test Recommender, you can use the `Tester` library and `checkExpect` as normal. For methods with expected outputs for each input, you will be expected to write unit tests. In this project, that means that you should write tests for any public method that you write in classes that implement `IAttributeDatum`, `IAttributeDataset`, or `ITreeNode`. You should write these tests in `RecommenderTestSuite.java` in the `sol` package. **These are the only tests you're required to write.**

Note that since the tree you generate in `TreeGenerator` differs every time you execute, you are not expected to write unit tests for methods in this class.
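As an example of the expected shape, a unit test for the hypothetical `Vegetable` sketch from earlier might look like the following (assuming that sketch's constructor; adapt the names to your own classes):

```java
import tester.Tester;

// Example unit-test shape (assumes the Vegetable constructor sketched earlier).
public class RecommenderTestSuiteSketch {
    public void testGetValueOf(Tester t) {
        Vegetable kale = new Vegetable("kale", "green", true, true, true);
        t.checkExpect(kale.getValueOf("color"), "green");
        t.checkExpect(kale.getValueOf("likeToEat"), true);
    }
}
```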
## System Testing

When we talk about "system testing", we're talking about testing how the different components work together. Here, instead of comparing the output against expected outputs, we will compute the accuracy of your recommender given training and testing data. System testing can give you a better sense of whether your tree is generating properly and producing reasonable results. You should run the system tests before you hand in, but these tests are separate from your `RecommenderTestSuite`. There is nothing to turn in for system testing, but this is one way we are going to test your tree, so it would be wise to try it!

We have provided you with several large datasets and tests that will train and test your decision tree to measure its accuracy. This can help you gauge whether or not your decision tree generation is working. The datasets can be found in the `data` folder of your project. You will notice that the data files are `csv` files. The first row is the header and contains all the attribute names, and every other row is equivalent to a single `IAttributeDatum` object, containing the values of each attribute. The testing code we provide parses these `csv` files into `IAttributeDataset` objects and uses them to train and test your decision tree generation. For each dataset, we have provided a training and a testing dataset. As mentioned earlier in the handout, it is important that the tree is tested on novel data rather than the same data it was trained with.

There are three datasets:

* **Villains**: This dataset contains as attributes a number of restaurants on College Hill and whether or not the person/datum/row likes each restaurant, as well as a boolean attribute `isVillain` for whether or not the person is a villain. The tests use your code to train a decision tree on the training data and then use it on testing data to predict whether or not a person is a villain based on their restaurant preferences. The class that implements `IAttributeDatum` for this dataset is `RestaurantPreferences`.
* **Mushrooms**: This dataset contains a number of attributes of mushrooms, one of which is `isPoisonous`. The tests use your code to train a decision tree on the training data and then use it on testing data to predict whether or not a mushroom is poisonous based on its other attributes. The class that implements `IAttributeDatum` for this dataset is `Mushroom`.
* **Candidates**: The SRC section of the handout describes this dataset more in depth, but it contains attributes of candidates applying for a job, one of which is whether or not they were hired. The tests train your decision tree to predict whether or not a candidate was hired based on their resume attributes. The class that implements `IAttributeDatum` for this dataset is `Candidate`.

The tests that train and test your decision tree are located in the `TestAccuracy` class in the `src` package. You should not need to modify the code (unless you want to increase the number of iterations to make the measurement more precise). For each dataset, the test trains your decision tree on the training data; then, for each datum in the testing dataset, it compares the generated decision tree's prediction for the datum's value of the target attribute with the datum's actual value to measure accuracy. Because your decision trees have a random component, their performance can be variable, so we do this many times (currently `NUM_ITERATIONS = 100`) and average the results to get the average accuracy of the trees generated by your algorithm.

**Your accuracy measured on testing data for each of the datasets should be above 70%.** If the target attribute is a boolean, a decision tree that randomly selects one of the two possible values as its prediction would have an accuracy of around 50%, so if your measured accuracy is near that, you are definitely doing something wrong.

There is also a test that runs the decision trees on the same training data used to generate them. Your accuracy should be ~100% for this test; if it is not, there may be something wrong with your code.

:::warning
**Note:** These test datasets are not good substitutes for making your own, smaller test dataset for debugging. Because of how large the datasets are and how many attributes they possess, debugging tree generation with them will be difficult. Instead, we recommend that you use the `Vegetable` class that you made for the design check to build a vegetable preferences dataset modeled off the example given earlier, because you can tell by hand what the tree should look like.
:::
Design Check
============

The purpose of design checks in CS18 is to help you and your group solidify your conceptual understanding of the project, give you time to come up with a plan for implementing the project, and then give your group a space to discuss that plan with a TA. While we do require you to complete some tasks before the design check, and you do receive a grade for the design check, design checks are *not* meant to be adversarial: they are for your benefit only!

## Dates/Sign-up

Design checks will be held from **June 8th to June 11th**. Once you find a partner for Recommender (or if you would like to be assigned a partner), you should fill out the [Partner sign-up form](https://docs.google.com/forms/d/1O60yOWU7A-09jU7foYkMYrenNu6zLnvhiAaupKF0vFY/edit). On Monday, June 7, we will send out design check groups, assign each group a TA, and link a Google Appointments calendar for your group to sign up for a design check later in the week.

## Tasks

At the design check, we want to see the following:

1. Two hand-drawn examples of decision trees for the vegetable preferences example dataset that have different sizes by virtue of considering attributes in different orders. Both examples should be predicting the same attribute.
2. Outlines of classes (names, fields, types on fields -- no methods) that will implement the decision tree and dataset interfaces.
3. A complete class `Vegetable` that implements `IAttributeDatum` and captures the attributes of vegetables shown in the vegetable preferences example given earlier. Remember that one datum should be equivalent to one row in the table.
4. Your plan for handling attribute values that arise during prediction but weren't in the original dataset: are you building this into the nodes and edges of the tree, into the method that predicts the outcome, or something else?
5. Parts 1 and 2 of the **Design Check SRC** (described below). Please submit this as a PDF to the Recommender Design Check: SRC assignment on Gradescope.

:::info
You should upload the hand-drawn decision trees, the class outlines, and your `Vegetable.java` file to the Gradescope assignment titled **Recommender Design Check**. You should submit the SRC component of your design check to the Gradescope assignment titled **Recommender Design Check: SRC**. You should submit both of these components before your scheduled design check begins.
:::

## Summary of the Collaboration Policy

The following table summarizes the collaboration policy for projects.

|                                    | Code Together | Conceptual questions | Debugging help |
| ---------------------------------- | ------------- | -------------------- | -------------- |
| Same group                         | yes           | yes                  | yes            |
| Classmates outside of group        | no            | yes                  | yes            |
| Course Staff (TAs)                 | ----          | yes                  | yes            |
| People outside Summer 2021 CS0180  | no            | no                   | no             |

## SRC Portion

**These questions are also available in a [Google doc](https://docs.google.com/document/d/1FgJr8bz6RcofA4JlnAwqoRnEfrcUMXrhEU2NLnHQBYQ/edit?usp=sharing). You can make a copy to use as a template if you'd like.**

### Background

The recommender system you will create can be used to make predictions in a variety of real-world contexts. However, applying your system to real-world problems could introduce unanticipated and harmful social consequences. Algorithmic bias, the systematic creation of unfair outcomes for one group of users, is one way that technological systems can perpetuate injustice. Before deploying any predictive tool, it is important to test for algorithmic bias in its predictions over time. In this section, you will test and discuss some of the potential social consequences of your recommender system in an algorithmic hiring context.

At many companies, algorithmic systems help hiring managers flag candidates for recruitment or screen large volumes of resumes. However, there are increasing examples of these tools unintentionally reproducing and amplifying bias in the hiring process. (Note: a lot of the examples here and in our datasets are for cis female and cis male, even if not explicitly specified. It is important to note that there are other gender identities that face bias and that are not covered in these datasets.)

### Part 1: Introduction to Algorithmic Bias

**Read:** [All the Ways Hiring Algorithms Can Introduce Bias](https://hbr.org/2019/05/all-the-ways-hiring-algorithms-can-introduce-bias) (approx. 5 min read). Miranda Bogen goes over some of the different steps of the hiring process and how algorithmic bias manifests itself. (Optional: if you're interested, you can additionally read about gender bias in Amazon's scrapped hiring tool [here](https://www.reuters.com/article/us-amazon-com-jobs-automation-insight/amazon-scraps-secret-ai-recruiting-tool-that-showed-bias-against-women-idUSKCN1MK08G).)

**Question 1:** Choose one step of the hiring process that the article goes over and, using specific evidence from the article, discuss how bias manifests itself (3-4 sentences).

### Part 2: Algorithmic Bias in Practice

For this exercise, we're going to imagine that your recommender is going to be used to help screen applications for software engineering positions. A parsing tool identifies the same set of attributes for each application. The attributes of each application are input into the recommender system to predict whether the application should be reviewed or discarded. Hiring managers trained the system on a collection of resumes from previously hired and rejected applicants.

A recommender's behavior is determined in large part by its training datasets--meaning that, if a recommender is trained on biased data, it is more likely to learn and repeat that bias. In this section, we're going to provide you with the results of training our recommender on different datasets and ask you to analyze the results.

**Task:** First, open [this file](https://docs.google.com/spreadsheets/d/1aBbxJENMf2uU1vC0iKOIjlzEffwwQQEhiN5AAwNj_I4/edit?usp=sharing) to see the datasets we'll be using to train and test the recommender. Here are brief descriptions of each dataset:

:::info
### Dataset Descriptions

* **Training Datasets**
    * **Candidates Unequal**: Unequal ratios of men and women are hired--the ratio of (women hired / women applicants) is less than (men hired / men applicants). See the pink box on the sheet for more information.
    * **Candidates Equal**: Equal ratios of men and women are hired--the ratio of (women hired / women applicants) is equal to (men hired / men applicants). See the pink box on the sheet for more information.
    * **Candidates Correlated**: Unlike the above two training datasets, gender is not an explicit variable, so it cannot be directly considered.
* **Testing Datasets**
    * **Testing (cis male)** & **Testing (cis female)**: Both have identical data, except for the gender attribute: one has all cis men and the other has all cis women.
    * **Testing Correlated (cis male)** & **Testing Correlated (cis female)**: The cis female dataset has candidates with much less experience than the cis male dataset.
:::

##### Graph 1: Training on Candidates Unequal

The following graph represents the results of training our recommender on Candidates Unequal and averaging the results of running it many times on Testing (cis male) and Testing (cis female).

```vega
{
  "$schema": "https://vega.github.io/schema/vega-lite/v4.json",
  "description": "Training on candidates unequal",
  "data": {
    "values": [
      {"a": "Male", "b": 0.279},
      {"a": "Female", "b": 0.197}
    ]
  },
  "width": 200,
  "mark": "bar",
  "encoding": {
    "x": {"field": "a", "type": "ordinal", "title": "Gender"},
    "y": {"field": "b", "type": "quantitative", "title": "Ratio hired"}
  }
}
```

**Question 2:**

* Describe the result, comparing the cis male and cis female hired ratios.
* Given what you know about the training dataset, explain the likely cause(s) of this result (3-4 sentences).
* **Hint [click below to expand]**

:::spoiler
Look at the Vegetable decision tree in the handout, and notice how it randomly splits on each of the variables in the dataset to predict `likeToEat`. If a vegetable has an attribute that's deemed undesirable in the decision tree, its `likeToEat` is false. In this case, your recommender would split on each variable in the dataset (gender included) to predict (or in this case, decide) hiring. Given this design, what could be problematic about using gender as an explicit variable in the training data?
:::
##### Graph 2: Training on Candidates Equal

The following graph represents the results of training our recommender on Candidates Equal and averaging the results of running it many times on Testing (cis male) and Testing (cis female).

```vega
{
  "$schema": "https://vega.github.io/schema/vega-lite/v4.json",
  "description": "Training on candidates equal",
  "data": {
    "values": [
      {"a": "Male", "b": 0.197},
      {"a": "Female", "b": 0.196}
    ]
  },
  "width": 200,
  "mark": "bar",
  "encoding": {
    "x": {"field": "a", "type": "ordinal", "title": "Gender"},
    "y": {"field": "b", "type": "quantitative", "title": "Ratio hired"}
  }
}
```

**Question 3:**

* Describe the result, comparing the cis male and cis female hired ratios.
* Looking at the summary statistics in the pink box of the Candidates Equal (Train) spreadsheet, what do you notice about the *number* (not ratio!) of male candidates considered versus the number of female candidates?
* Recommenders are designed to learn from the outcome of their training--looking at the number of men versus the number of women hired, what do you think the recommender could learn from that outcome that could pose problems for future applicants? (3-4 sentences)
* **Hint [click below to expand]**

:::spoiler
Look at the Vegetable training table and compare the number of green vegetables versus the number of orange vegetables. There's only one orange vegetable, and its `likeToEat` is false. What could the recommender learn from that outcome?
:::

##### Graph 3: Training on Candidates Correlated

The following graph represents the results of training our recommender on Candidates Correlated and averaging the results of running it many times on Testing Correlated (cis male) and Testing Correlated (cis female).

```vega
{
  "$schema": "https://vega.github.io/schema/vega-lite/v4.json",
  "description": "Training on candidates correlated",
  "data": {
    "values": [
      {"a": "Male", "b": 0.279},
      {"a": "Female", "b": 0.235}
    ]
  },
  "width": 200,
  "mark": "bar",
  "encoding": {
    "x": {"field": "a", "type": "ordinal", "title": "Gender"},
    "y": {"field": "b", "type": "quantitative", "title": "Ratio hired"}
  }
}
```

**Question 4:**

* Describe the result, comparing the cis male and cis female hired ratios.
* Given what you know about the training dataset, explain some of the likely cause(s) of this result (3-4 sentences).
* **Hint [click below to expand]**

:::spoiler
We know that gender isn't an explicit variable in the correlated dataset, but does that mean that gender doesn't play a role? Think about some of the ways that gender can affect the other variables in the dataset. If we take a look at leadership experience, only 37 percent of companies have at least one woman on the board of directors, which means that the vast majority of executive positions are held by men. If we take a look at last position duration, there are a variety of factors, unrelated to a candidate's competency, that may cause women to have more job turnover. Women may have shorter durations due to maternity leave or due to leaving workplace environments that are hostile because of sexism or harassment. According to a [report](https://www.talentinnovation.org/_private/assets/Athena-2-ExecSummFINAL-CTI.pdf) from the Center for Talent Innovation, over time, 52% of women working for science, engineering, and technology companies leave their jobs as a result of daily bias. The algorithm we will implement for this class does not take these factors, like maternity leave, sexism, or sexual harassment, into account when considering attributes like leadership and position duration.
:::

### How to Hand In/Grading:

These questions should be answered in a PDF. Please submit it to the "Recommender Design Check: SRC" assignment on Gradescope. **Do not write your name or any other identifiers on the PDF.** We grade your responses according to [**this rubric**](https://docs.google.com/spreadsheets/d/1kyjing8hVDCSVyHuiy0MB_9LjyCA9HWjKu_s85E6Iig/edit?usp=sharing).

Please note that the course collaboration policy also applies to your answers to the Socially Responsible Computing segment. You CAN work with your partner on this!

***Make sure to hand this in before your design check!***
Final Submission
===============

The deadline for the final submission is **Friday, June 18th, Anywhere-On-Earth time**. You should submit all your files on Gradescope.

For your final submission, you will implement the algorithm as described in this handout, answer the questions from the Socially Responsible Computing portion, and write a brief description of your code structure and any known bugs. Thus, your final submission will have two components: **code** and **SRC**.

:::warning
**Important:** You must submit both the code and SRC components of your final handin to their respective Gradescope assignments.
:::

## Code

You should hand in all files in the `sol` package to the Gradescope assignment titled **Recommender**. You should be submitting the following:

* a completed `DataTable.java`
* a completed `TreeGenerator.java`
* your tree classes that implement `ITreeNode`
* unit tests for non-tree-generation methods in `RecommenderTestSuite.java`
* a completed `Vegetable.java`
* a `README.txt` file with:
    * the names of your group members
    * a description of any bugs in your program
    * a description of how your classes and the code you wrote work

## SRC Component

In addition to the SRC questions you answered before the design check, you must also answer several reflection questions to hand in with the final submission.

**These questions are also available as a [Google Doc](https://docs.google.com/document/d/1l4_DOo4RSvtYQ2s7mFjtf2S64534eKipthqeeNhwfbw/edit?usp=sharing). You can make a copy of it to use as a template if you'd like.**

### Part 3: Rethinking Diversity & Inclusion

**Read:** In her article [Beyond Bias: Contextualizing "Ethical AI" Within the History of Exploitation and Innovation in Medical Research](https://medium.com/mit-media-lab/beyond-bias-contextualizing-ethical-ai-within-the-history-of-exploitation-and-innovation-in-d522b8ccc40c), Chelsea Barabas discusses some of the limitations of and problems with focusing on algorithmic bias as the end-all be-all of socially responsible algorithmic use. Refer to the reading to answer the following question:

**Question 1:** What are some drawbacks of using diverse datasets under the guise of promoting inclusivity? (Hint: for help with this question, consider referring to the Google Pixel, vocal biomarkers, or Tuskegee examples from the article.) (3-4 sentences)

### Part 4: Project Reflection

Think back to the Design Check SRC portion and the hiring data you examined to answer the following questions.

**Question 2:** In the design check SRC portion, we asked you to use quantitative measures (hiring ratios for cis men and cis women) to detect bias. Reflect on the limitations of using quantitative measurements like these. How were the hiring datasets and charts we provided helpful for measuring bias? How were they unhelpful? (3-4 sentences)

**Question 3:** Do biased training datasets always produce biased systems? Is there a way to "fix" the bias generated by your recommender when it is trained on datasets that display certain past patterns? Why or why not? (3-4 sentences)

### How to Hand In/Grading:

These questions should be answered in a PDF. **Please do not include any identifying information.** Please submit your PDF to the "Recommender: Final SRC" assignment on Gradescope. We grade your responses according to [this rubric](https://docs.google.com/spreadsheets/d/1kyjing8hVDCSVyHuiy0MB_9LjyCA9HWjKu_s85E6Iig/edit?usp=sharing).
Please note that the course collaboration policy also applies to your answers to the Socially Responsible Computing segment. You CAN work with your partner on this! **This is due at the same time as your final handin!**

Please leave any feedback about this SRC assignment (or any SRC-related feedback in general) through our [anonymous feedback form](https://forms.gle/GjAZdBk8uvt6H2i46).

Grading
=======

This project will be both autograded and manually graded. We will autograde your `DataTable` implementation and measure the accuracies of the decision trees your program generates. As mentioned earlier, you should expect to achieve >70% accuracy on testing data for each of the three datasets we provide (villains, mushrooms, candidates) and close to 100% accuracy measured on training data. We will manually grade your tree classes, unit tests, and the specific design of your generator. Achieving good accuracy does not necessarily guarantee full credit.

At your design check, you, your partner, and your TA will schedule a final grading meeting for some time after the project is due. More details about this can be found in the Project Logistics document.

Where to Start
=============

It can be really tough to know how to begin a project, but making some concrete progress early helps you keep moving. First of all, reading the handout thoroughly and taking time to digest the content is important. We will be holding a **Gear Up** presentation for Recommender on **Monday, June 7** (time to be announced) that will be helpful for getting started. You may also find the following roadmap helpful:

#### Data classes

* Start by thinking about how you can create a `Vegetable` class that implements `IAttributeDatum` and captures the rows of data in the vegetable preferences data table, and then implement it.
* Think about how that `Vegetable` class relates to the generic parameterized types of the `DataTable` and `TreeGenerator` classes.
* Work your way through implementing the methods in `DataTable` until you have a fully functional data table.

#### Implement Tree Structure

* Create the classes (that will implement `ITreeNode`) that will form the tree structure.
* It may be helpful to think about how you implemented binary search trees in lab, adapting that approach to the new decision tree invariants.

#### Implement Tree Generation

* Fill in the methods in `TreeGenerator.java` for training and using decision trees.
* Understanding the algorithm and pseudocode before beginning to code will be beneficial.

#### Test

* Write unit tests, and use the provided system testing, to ensure that your decision tree generator works.
* It is a good idea to test functionality progressively as you build it: for example, after you implement `DataTable`, test it to make sure it works before moving on to `TreeGenerator`.
* It will also be helpful to use the `Vegetable` class you created for the design check to create a test dataset that reflects the example vegetable preferences data table given in this handout, as sketched below. You can use this dataset to debug your tree generation code.
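Here is a sketch of such a debugging dataset, assuming the hypothetical `Vegetable` constructor sketched earlier in this handout (adapt it to your own class):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: the handout's vegetable table as a small debugging dataset
// (assumes the hypothetical Vegetable constructor sketched earlier).
public class DebugDataSketch {
    public static List<Vegetable> vegetableRows() {
        List<Vegetable> rows = new ArrayList<>();
        rows.add(new Vegetable("spinach", "green",  true,  true,  false));
        rows.add(new Vegetable("kale",    "green",  true,  true,  true));
        rows.add(new Vegetable("peas",    "green",  false, true,  true));
        rows.add(new Vegetable("carrot",  "orange", false, false, false));
        rows.add(new Vegetable("lettuce", "green",  true,  false, true));
        return rows;
    }
}
```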
FAQ
===

* **Do all of the paths have to consider the attributes in the same order?** No. As long as each attribute is only considered once per path, attributes can be considered in different orders on different paths.
* **Do decisions need to be booleans?** No, the decisions can be from any finite set of values (city names, color preferences, booleans, etc.).
* **What happens if the dataset is empty?** You should throw an exception if an empty dataset is given.
* **Can we modify the interfaces provided in `src`?** Do not change the interfaces!
* **Should the `name` attribute be considered when generating the decision tree?** Nope!
* **When should `getSharedValue` be used?** `getSharedValue` retrieves the value for an attribute whose value is assumed to be the same for each object in the dataset. In other words, you should code the method with this assumption in mind; it is up to you whether to handle any undefined behavior (such as when the values for the objects are not all the same), though this is not required.
* **The `TestAccuracy.java` file isn't compiling... what's happening?** This means that you need to add the provided `recommender-tester.jar` to your project structure. You can follow these steps to add it:
    1. Open the **Project Structure** dialog by clicking on the file icon with three blue squares in the top right corner.
    2. Click on **Modules** under the **Project Settings** heading in the sidebar on the left.
    3. Click on the **+** icon at the bottom and then click on **JARs or Directories**.
    4. Navigate to the folder that contains the `recommender-tester.jar` file (the `lib` folder of your recommender project), select the jar, and then click **OK**.
    5. The accuracy test file should work now!