---
tags: Projects
---

Project 1: Recommender (OLD)
=================

### Due Date: Friday, Feb. 26th at Anywhere-On-Earth time.

Learning Objectives
-------

With every project in CS18, we'll highlight the learning objectives that are targeted in the project. By the end of the project, you will be able to:

* Select and justify a specific data structure for use in a specific problem, citing alignment with problem needs or resource constraints
* Use concrete examples, invariants, hand tracing, and debugging to explore or explain how a program works
* Create a class hierarchy that makes proper use of interfaces (for variant types or contracts) and abstract classes (for sharing common code)
* Comment code effectively with Javadocs
* Develop a collection of examples and tests that exercise key behavioral requirements of a problem
* Write programs that behave as expected based on a provided description or collection of test cases
* Correctly determine the run-time complexity of an algorithm
* Identify potential societal impacts of a program on individual users or populations

Theming (Optional): The Hunt Continues...
-------

Blueno, forever the romantic, wrote many love limericks to their partner, the canary yellow [Untitled Lamp Bear](https://dohahamadairport.com/relax/art-exhibitions/lamp-bear) of Hamad International Airport in Qatar. Some of them can still be found on Blueno Bears Admirers! In order to find Blueno's last note to their favorite bear, you'll have to search through the ever-wholesome [Blueno Bears Admirers](https://www.facebook.com/bluenothebear). Here's a hint: try searching for "Untitled Lamp Bear."

![](https://i.imgur.com/krkVHzJ.jpg)

Overview
-------

We hear a lot about machine learning, but how does it work under the hood? As a concrete example, imagine that you wanted a machine learning algorithm to tell you whether you were likely to enjoy a particular vegetable. You've been tracking your vegetable preferences, and have the following table:

```
name      color    lowCarb  highFiber  likeToEat
------------------------------------------------------
spinach   green    true     true       false
kale      green    true     true       true
peas      green    false    true       true
carrot    orange   false    false      false
lettuce   green    true     false      true
```

Each row corresponds to a vegetable and each column corresponds to an attribute of that vegetable. One of the attributes is your decision about whether you liked the vegetable. If someone gave you the attributes (other than the decision) for a new vegetable, the machine learning algorithm would try to predict whether you would like it, based on whether you liked other vegetables with similar attributes. To do this, it has to compare the attributes of the new vegetable to the attributes of vegetables you've already tried. But how does such a comparison work?
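Before answering that, it may help to see one concrete way a row of this table could appear in code. The sketch below is purely illustrative: the class and field names are our own, and the `IAttributeDatum` interface that your real row class must implement is introduced later in this handout.

```java
// Illustrative sketch only: one row of the vegetable table as a plain class.
public class VegetableRow {
    public String name;        // e.g. "spinach"
    public String color;       // e.g. "green" or "orange"
    public boolean lowCarb;
    public boolean highFiber;
    public boolean likeToEat;  // the decision we want to predict

    public VegetableRow(String name, String color, boolean lowCarb,
                        boolean highFiber, boolean likeToEat) {
        this.name = name;
        this.color = color;
        this.lowCarb = lowCarb;
        this.highFiber = highFiber;
        this.likeToEat = likeToEat;
    }
}

// The first row of the table would then be:
//   new VegetableRow("spinach", "green", true, true, false)
```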
In the form of machine learning we're doing in this project, we're going to use the data you have already collected to build a _decision tree_ that summarizes your preferences. Here is an example of a decision tree for this vegetable table:

![](https://i.imgur.com/j8lrAjE.jpg)

The nodes of the tree are labeled with attributes of the data (except for the name, which we skip in this example). The edges are labeled with values of attributes. The tree summarizes which values of `likeToEat` correspond to which values of the other attributes. The root node considers the `color` attribute, for which there are two values in the dataset (green and orange). For the orange vegetables, `likeToEat` is always false, but `likeToEat` differs among the green vegetables, so the tree next looks at whether a vegetable is `highFiber`. For green, low-fiber vegetables, the decision in the dataset is always true, but decisions vary for green, high-fiber vegetables, so the tree then looks at the `lowCarb` attribute. If `lowCarb` is true, you are equally likely to like or dislike the vegetable, so we pick one decision to use as the result.

If you wanted to use this tree to predict whether you would like a new vegetable, you would use the attributes of the new vegetable to traverse the tree from the root until you reach a decision leaf. At each node, you follow the edge whose label matches the new vegetable's value for that node's attribute. For example, if you wondered whether you would like acorn squash (which is orange), this tree would predict that you would not like it. If you wondered about green beans, which are green and low in fiber, this tree would predict that you would like them.

Here are things to note about the decision tree:

* each non-leaf node is labeled with an attribute from the dataset
* the same attribute arises only once along any path from the root to a decision
* a node for an attribute has outgoing edges corresponding to the possible values of that attribute within the table
* the decisions are the predicted value for one attribute (here, `likeToEat`), based on the values of the other attributes. All decisions are for the same attribute.

In this project, you will write a program that generates a decision tree from a dataset and uses that tree to make predictions. We'll exercise various object-oriented practices to make our code robust, and we'll evaluate the performance and social impacts of our implementation.

Architecture
=========

Your implementation will consist of three main parts:

* Classes and interfaces for a decision tree
* Classes and interfaces for a dataset
* Classes and interfaces to generate and use a decision tree given a dataset

We will give you the interfaces. It is up to you to design and implement the corresponding classes. In addition, you will have to provide test cases and prose answers to reflective questions.

_**Your implementation may only use built-in datatypes that we have already covered in CS18** (LinkedList, ArrayList, arrays, and any classes you define for yourself). In particular, those with prior Java experience may know other data structures (such as hashmaps/dictionaries) that we have not yet covered in CS18. To keep the workload even for everyone, and to maintain consistency in considering runtimes and space usage, you ++**cannot**++ use these other data structures on this project._

Decision Trees
---------------

Decision trees differ from binary trees in two ways: (1) nodes can have any number of children, and (2) edges have labels on them. You must decide how to implement each of these features. The only constraint is that your nodes (including leaves) must implement the following interface:

```java
public interface INode {
    // traverse the tree based on attribute values to retrieve decision
    public Object lookupDecision(IAttributeDatum attrVals);

    // print tree
    public void printNode(String leadspace);
}
```

The `lookupDecision` method returns `Object` because the type of decisions can vary across datasets and across individual attributes to be predicted (e.g., a color attribute might have a String value such as "blue" rather than just a boolean value). Each non-leaf node will have an associated attribute from the dataset, and its edges will be labeled with the possible values for that attribute.
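To make the design space concrete, here is one possible shape for the node classes. This is only a sketch under our own assumptions: the `AttributeNode`, `Edge`, and `DecisionLeaf` names (and the stored default decision, which anticipates the *Making Predictions* section below) are not part of the stencil, and you are free to design this differently.

```java
import java.util.LinkedList;

// Sketch: a labeled edge holds the attribute value on the edge
// and the subtree it leads to.
class Edge {
    Object label;
    INode child;

    Edge(Object label, INode child) {
        this.label = label;
        this.child = child;
    }
}

// Sketch: an internal node tests one attribute and follows the matching edge.
class AttributeNode implements INode {
    String attribute;          // the attribute this node tests
    LinkedList<Edge> edges;    // one outgoing edge per attribute value
    Object defaultDecision;    // fallback for values not seen in training

    AttributeNode(String attribute, LinkedList<Edge> edges, Object defaultDecision) {
        this.attribute = attribute;
        this.edges = edges;
        this.defaultDecision = defaultDecision;
    }

    public Object lookupDecision(IAttributeDatum attrVals) {
        Object value = attrVals.getValueOf(this.attribute);
        for (Edge e : this.edges) {
            if (e.label.equals(value)) {
                return e.child.lookupDecision(attrVals);
            }
        }
        // the datum has a value this node never saw in training
        return this.defaultDecision;
    }

    public void printNode(String leadspace) {
        System.out.println(leadspace + this.attribute + "?");
        for (Edge e : this.edges) {
            System.out.println(leadspace + "  = " + e.label + ":");
            e.child.printNode(leadspace + "    ");
        }
    }
}

// Sketch: a leaf simply holds a decision.
class DecisionLeaf implements INode {
    Object decision;

    DecisionLeaf(Object decision) {
        this.decision = decision;
    }

    public Object lookupDecision(IAttributeDatum attrVals) {
        return this.decision;
    }

    public void printNode(String leadspace) {
        System.out.println(leadspace + "-> " + this.decision);
    }
}
```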
Datasets
-------

For purposes of this project, we will assume that each dataset corresponds to a table. The columns of the table are the attributes, while the rows are the individual items or observations for which we have values of the attributes. Each attribute should have a fixed (enumerated) set of possible values (such as booleans, or a set of specific numbers or strings, rather than numbers in general). This restriction is just to simplify the project; real-world decision trees can support general numbers.

The following interface captures one item with attributes (corresponding to a row in the table):

```java
public interface IAttributeDatum {
    // lookup the value associated with the named attribute
    public Object getValueOf(String attributeName);
}
```

The following interface captures a set of `IAttributeDatum`. The generic type `T` for the interface is the type corresponding to a single datum in the set. The role of these specific methods will become clear once we explain the tree-generation algorithm.

```java
public interface IAttributeDataset<T extends IAttributeDatum> {
    // all the attributes in the dataset
    public LinkedList<String> getAttributes();

    // does every row/datum have the same value for the given attribute/column?
    public Boolean allSameValue(String ofAttribute);

    // number of data/rows in the dataset
    public Integer size();

    // partition rows into subsets such that each subset has the same value of onAttribute
    public LinkedList<IAttributeDataset<T>> partition(String onAttribute);

    // get the value for ofAttribute, which is assumed to be common across all rows
    public Object getSharedValue(String ofAttribute);

    // get the most common value for ofAttribute in the dataset
    public Object mostCommonValue(String ofAttribute);
}
```

Decision-tree Generator
------------

The generator supports building a decision tree for a dataset and making predictions with the generated tree.

```java
public interface IGenerator {
    // build a decision tree to predict the named attribute
    public INode buildClassifier(String targetAttr);

    // produce the decision predicted for the given datum
    public Object lookupRecommendation(IAttributeDatum forVals);

    // print the decision tree
    public void printTree();
}
```

Your implementation will provide a class named `TreeGenerator` that implements this interface. The type parameter `T` is the type of the individual items (rows) in the dataset (a specific dataset is provided as a constructor argument to this class).

```java
public class TreeGenerator<T extends IAttributeDatum> implements IGenerator {
    public TreeGenerator(IAttributeDataset<T> initTrainingData) {
        ...
    }
}
```

We give you this stencil code in `TreeGenerator.java`.

Building the Tree
============

A decision tree is generated from a dataset and an attribute in that dataset that we want to predict. The values of the attribute to predict will appear at the leaves of the decision tree. To generate a tree for a subset of the dataset:

1. If the subset is empty, create a decision node that returns the most common value for the attribute to predict within the most recent non-empty subset of the dataset.
2. If all of the rows in the subset have the same value for the attribute to predict, create a decision node that returns that common value.
3. Otherwise, pick one attribute that has not been used during generation so far (i.e., it is not on the path from the root to the current spot in the tree). Create a new node for the chosen attribute. Partition the subset into further subsets, one for each value of the chosen attribute. For each new subset, create an edge labeled with the corresponding attribute value and generate the tree for that subset.

The algorithm starts with the full dataset as the subset, and repeats until there are no further nodes to generate. One possible shape for this recursion is sketched below.
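The following sketch shows how the recursion might be organized as a helper method inside `TreeGenerator`. It is only a sketch under our own assumptions: the `buildTree` name, the handling of ties, and the `AttributeNode`/`DecisionLeaf`/`Edge` classes from the earlier sketch are illustrative choices, not requirements of the stencil.

```java
// Sketch only -- assumed to live inside TreeGenerator<T>.
// Assumes: import java.util.LinkedList; import java.util.Random;
private INode buildTree(IAttributeDataset<T> subset,
                        LinkedList<String> unusedAttributes,
                        String targetAttr) {
    // Step 2: every row agrees on the attribute to predict.
    if (subset.allSameValue(targetAttr)) {
        return new DecisionLeaf(subset.getSharedValue(targetAttr));
    }

    // No attributes left to split on, but decisions still differ
    // (like spinach vs. kale in the example): pick the most common decision.
    if (unusedAttributes.isEmpty()) {
        return new DecisionLeaf(subset.mostCommonValue(targetAttr));
    }

    // Step 3: split on a randomly chosen unused attribute.
    Random rand = new Random();
    String chosen = unusedAttributes.get(rand.nextInt(unusedAttributes.size()));
    LinkedList<String> remaining = new LinkedList<>(unusedAttributes);
    remaining.remove(chosen);

    LinkedList<Edge> edges = new LinkedList<>();
    for (IAttributeDataset<T> part : subset.partition(chosen)) {
        // every row in a partition shares the chosen attribute's value,
        // so getSharedValue gives us the edge label
        Object label = part.getSharedValue(chosen);
        edges.add(new Edge(label, this.buildTree(part, remaining, targetAttr)));
    }

    // In this design, partitions are never empty, so step 1 surfaces instead
    // as a stored default: the most common decision among the rows that
    // reached this node, used later for attribute values not seen in training.
    return new AttributeNode(chosen, edges, subset.mostCommonValue(targetAttr));
}
```

`buildClassifier(targetAttr)` could then start the recursion with the full training dataset and with every attribute except `targetAttr` (and, per the example, `name`) in the unused list.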
**How do we choose the attribute to split on in step 3?** This is one of the most interesting aspects of the algorithm. There are many ways to do it. For example, we could:

* take the attributes in the order in which they appear in the table,
* choose an unused attribute at random, or
* choose an unused attribute that minimizes the size of the resulting tree.

While the third is what gets used in practice, for this project we will use the second to simplify the problem: you should pick a random attribute to split on. **Note that this means your tree should be different every time you generate it.**

Making Predictions
==================

The goal of machine learning is to predict decisions on new data based on information about prior data. The `lookupRecommendation` method in the generator will take a collection of attributes and their values, use them to traverse the tree, then return the decision at the leaf it reaches.

**What if the new data uses an attribute value that is not already in the decision tree?** An example of this might be predicting whether you will like a yellow vegetable (like corn), when the initial data had no yellow vegetables. During traversal, if the value of an attribute is not in the tree, the most common decision across all data (rows) remaining at that point in the tree should be returned (this is already built into the generation algorithm; we're just reinforcing it here).

*In particular, you cannot assume that you will have all of the attribute values up front. You will only know about those values that are represented in the initial dataset.*
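To make this concrete, here is a hedged sketch of what end-to-end prediction might look like. The `Vegetable` and `VegetableDataset` class names are hypothetical stand-ins for classes you will design yourself; only `TreeGenerator` and the interface methods come from the stencil.

```java
// Sketch: train on the vegetable table, then predict for a new vegetable.
// Assumes Vegetable implements IAttributeDatum and VegetableDataset
// implements IAttributeDataset<Vegetable> (both names are hypothetical).
public class PredictionDemo {
    public static void main(String[] args) {
        IAttributeDataset<Vegetable> table =
            new VegetableDataset(/* the five rows from the overview */);
        TreeGenerator<Vegetable> gen = new TreeGenerator<>(table);
        gen.buildClassifier("likeToEat");

        // corn is yellow, a color the training data never saw, so the lookup
        // should fall back on the most common decision at the color node;
        // the final argument (the decision) is just a placeholder here
        Vegetable corn = new Vegetable("corn", "yellow", false, true, false);
        System.out.println(gen.lookupRecommendation(corn));
    }
}
```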
Socially Responsible Computing and Experimental Datasets
===============================

### Background:

The recommender system you created can be used to make predictions in a variety of real-world contexts. However, applying your system to real-world problems could introduce unanticipated and harmful social consequences. Algorithmic bias, the systematic creation of unfair outcomes for one group of users, is one way that technological systems can perpetuate injustice. Before deploying any predictive tool, it is important to test for algorithmic bias in its predictions over time.

In this section, you will test and discuss some of the potential social consequences of your recommender system in an algorithmic hiring context. At many companies, algorithmic systems help hiring managers flag candidates for recruitment or screen large volumes of resumes. However, there are increasing examples of these tools unintentionally reproducing and amplifying bias in the hiring process.

(Note: Many of the examples here and in our datasets are for cis women and cis men, even when not explicitly specified. We understand that there are other gender identities that also face bias and that are not covered in these datasets.)

### Part 1:

**Read:** In [All the Ways Hiring Algorithms Can Introduce Bias](https://hbr.org/2019/05/all-the-ways-hiring-algorithms-can-introduce-bias) (approx. 5-minute read), Miranda Bogen goes over some of the different steps of the hiring process and how algorithmic bias manifests itself in each.

(Optional: If you're interested, you can additionally read about gender bias in Amazon's scrapped hiring tool [here](https://www.reuters.com/article/us-amazon-com-jobs-automation-insight/amazon-scraps-secret-ai-recruiting-tool-that-showed-bias-against-women-idUSKCN1MK08G).)

**Question 1:** Choose one step of the hiring process that the article goes over and discuss how bias manifests itself there. Think about the underlying reasons for this. (Don't forget to explain your reasoning!)

Now, imagine that a large technology company uses your recommender system to help screen applications for software engineering positions. A parsing tool identifies the same set of attributes for each application. The attributes of each application are fed into your recommender system to predict whether the application should be reviewed or discarded. Hiring managers trained the system on a collection of resumes from previously hired and rejected applicants. Complete Parts 2 and 3.

### Part 2:

Follow the instructions below to explore the potential social consequences of using your recommender system in this context. Write your answers to the questions listed below in your `README.txt` file, using 3-4 sentences per question.

In the data folder, look at the files `train_candidates_unequal.csv` and `train_candidates_equal.csv`. These correspond to the datasets contained in the multiple tabs of the Google sheet [linked here](https://docs.google.com/spreadsheets/d/1aBbxJENMf2uU1vC0iKOIjlzEffwwQQEhiN5AAwNj_I4/edit?usp=sharing). These datasets contain attributes about job candidates determined by the parsing tool. While doing this section of the assignment, refer to the different hire percentages listed for the training datasets, both equal and unequal, in the sheet. Note that the cis male and cis female testing datasets contain identical candidates, except for the gender attribute.

1. Follow the TODO comment in `BiasTest.java`, i.e., first change the `filepath` variable on line 40 to `data/train_candidates_unequal.csv`.
2. Look through the comments in `BiasTest.java` to understand what is going on in the file. Do not dwell on the CSVParser or the different loops; just try to get the general idea of what the file does. Run `BiasTest.java` several times and write down the first five pairs of hire percentages you get in which both percentages are non-zero.

**Question 2:** What do you notice about the differences between the cis male and cis female hired ratios that are printed out?

3. Now, change `filepath` to `data/train_candidates_equal.csv`. Again, run the file several times and write down the first five pairs of hire percentages you get in which both percentages are non-zero.

**Question 3:** Do you still see bias in the hiring ratios? If so, what differences do you notice between the training datasets (also presented on the Google sheet)? How do you think these differences affect the hiring bias of the algorithm?

**Question 4:** How does the approach your code uses to choose which attribute to split on impact the resulting bias, if at all?

**Question 5:** If your hiring rates vary each time you build a new classifier, why does this occur? If we fixed the algorithm to choose the same attribute to split on each time, could we eliminate the bias?
### Part 3:

Algorithmic bias is easier to see in these examples because gender was included as an attribute in the dataset. However, algorithmic bias still exists in systems where protected attributes (race, gender, religion, ability, etc.) have been removed from training datasets: the inclusion of attributes that are *correlated* with protected attributes produces similar bias patterns in the results.

As an example, let's take a look at our `Candidate` class, specifically two attributes: leadership experience and last position duration. Both are resume items recruiters may care about in an applicant. The question is: are they gender-neutral attributes, or are they correlated with gender?

4. Run `BiasTest` with the `train_candidates_correlated` file, changing the `filepath` variable just as you did in the steps above. In addition, follow the TODOs on lines 76 and 83.

**Question 6:** Do you still see evidence of hiring bias? Why or why not?

The answer is that both attributes are correlated with gender. Consider leadership experience: only 37 percent of companies have at least one woman on the board of directors, which means that the vast majority of executive positions are held by men. As for last position duration, there are a variety of factors, unrelated to a candidate's competency, that may cause women to have more job turnover. Women may have shorter durations due to maternity leave, or due to leaving workplace environments that are hostile because of sexism or harassment. According to a [report](https://www.talentinnovation.org/_private/assets/Athena-2-ExecSummFINAL-CTI.pdf) from the Center for Talent Innovation, over time, 52 percent of women working for science, engineering, and technology companies leave their jobs as a result of daily bias. The algorithm we implement for this class does not take these factors into account when considering these attributes. The `train_candidates_correlated` dataset reflects real-life disparities like these.

### Big Picture:

Proponents of algorithmic hiring argue that these systems have the potential to make the hiring process more efficient and to remove some human biases from hiring decisions.

**Read:** In [Beyond Bias: Contextualizing "Ethical AI" Within the History of Exploitation and Innovation in Medical Research](https://medium.com/mit-media-lab/beyond-bias-contextualizing-ethical-ai-within-the-history-of-exploitation-and-innovation-in-d522b8ccc40c) (7-minute read), Chelsea Barabas discusses some of the limitations of, and problems with, focusing on algorithmic bias as the be-all and end-all of socially responsible algorithmic use.

In a paragraph, answer all of the following questions with reference to Barabas' piece. Be sure to explain your reasoning.

**Question 7:** What are some limitations of BiasTest (and other statistical measures of bias), if any, as a metric for fairness in the hiring process or as a metric for diversity and inclusion in the workplace? What are some strengths, if any? Consider providing an example of fairness/unfairness or potential benefit/harm that wouldn't be captured by the results of BiasTest.

These questions should be answered in your `README.txt`, in addition to a brief description of your code and any known bugs (outlined in "Final Submission" below). Please note that the course collaboration policy also applies to your answers to the Socially Responsible Computing segment.

Testing
=======

You may notice that the shape of your decision tree is different every time you build it. This makes the actual tree generation difficult to write tests for. What you can do is write tests for any methods you write other than those that actually construct the tree. This means any methods in classes that implement `IAttributeDatum`, `IAttributeDataset`, or `INode`, as well as any other helpers you might write. **These are the only tests you are required to write for this project.** One such test is sketched below.
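For instance, tests for the dataset methods might look like the following. This is a sketch under stated assumptions: `Vegetable` and `VegetableDataset` are hypothetical names for your own classes, and we assume a JUnit 4-style test setup.

```java
import static org.junit.Assert.*;
import org.junit.Test;
import java.util.LinkedList;

// Sketch of tests for dataset methods (class names are hypothetical).
public class DatasetTest {

    // build the five-row vegetable table from the overview
    private VegetableDataset makeTable() {
        LinkedList<Vegetable> rows = new LinkedList<>();
        rows.add(new Vegetable("spinach", "green", true, true, false));
        rows.add(new Vegetable("kale", "green", true, true, true));
        rows.add(new Vegetable("peas", "green", false, true, true));
        rows.add(new Vegetable("carrot", "orange", false, false, false));
        rows.add(new Vegetable("lettuce", "green", true, false, true));
        return new VegetableDataset(rows);
    }

    @Test
    public void testAllSameValue() {
        // the table mixes green and orange, so color is not all-same
        assertFalse(this.makeTable().allSameValue("color"));
    }

    @Test
    public void testPartition() {
        LinkedList<IAttributeDataset<Vegetable>> parts =
            this.makeTable().partition("color");
        assertEquals(2, parts.size()); // one subset for green, one for orange
        int totalRows = 0;
        for (IAttributeDataset<Vegetable> p : parts) {
            assertTrue(p.allSameValue("color")); // each subset is uniform
            totalRows += p.size();
        }
        assertEquals(5, totalRows); // no rows lost or duplicated
    }
}
```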
**The following paragraph describes a way of testing that we don't require you to do; it is here only as a resource.** The reason the shape of the tree differs every time is that we randomly pick the attribute to split on when building the tree. If you wish to check that your tree generation is working correctly, one way would be to replace the code that randomly picks an attribute to split on with code that cycles through the attributes in a deterministic way, which will make your tree come out the same every time. Just remember to change it back to random selection before you hand in!

Project Tasks
=============

Please keep in mind the [project logistics](https://hackmd.io/@cs18-spring-2021/r1WrOcky_) when forming groups.

Design Check
------------

#### Dates

Design checks will be held from February 18th to 21st. Please sign your group up for the design check [here](https://docs.google.com/spreadsheets/d/1NdM2ZorrbnXk0YGumAN9zJYJSuuiVhhfivpehy89Bic/edit?usp=sharing).

At the design check, we want to see the following:

1. Two hand-drawn examples of decision trees for the vegetable example that have different sizes by virtue of considering attributes in different orders. Both examples should predict the same attribute.
2. Class outlines (names, fields, types on fields -- no methods) for the classes that will implement the decision tree and dataset interfaces.
3. An implementation of the `Vegetable` class from the example above. You can submit code for this.
4. Your plan for handling attribute values that arise during prediction but weren't in the original dataset: are you building this into the nodes and edges of the tree, into the method that predicts the outcome, or something else?

Design checks will be held in groups. Only one person needs to submit the code for your `Vegetable` class to Gradescope, but all group members should be included on the submission.

### Summary of the Collaboration Policy

The following table summarizes the collaboration policy for projects.

| | Code Together | Conceptual questions | Debugging help |
| -------- | -------- | -------- | --------- |
| Same Subgroup | yes | yes | yes |
| Same Group, Different Subgroup | no | yes | yes |
| Classmates beyond group | no | yes | no |
| Course Staff (TAs) | n/a | yes | yes |
| People outside Spring 2021 CS0180 | no | no | no |

Stencil Code
-----------

::: warning
IMPORTANT: Read [this guide](https://docs.google.com/document/d/13AV_3eXwZo5VLAD45MQGvYuL-VPNtG-k8U3ivWjvw9c/edit?usp=sharing) first!
:::

Here is the link to the source code: https://classroom.github.com/g/y4vMOypM.

Final Submission
----------------

The deadline for the final submission is **Friday, Feb. 26th at Anywhere-On-Earth time**. Please submit everything to Gradescope. For your final submission, you will implement the algorithm as described in this handout and answer the questions from the Socially Responsible Computing portion. The answers to these questions should go in your README along with a brief description of your code structure and any known bugs.
Thus, your final submission will have the following components:

* code
* tests for non-tree-generating methods
* results of bias experimentation
* written answers to these questions
* a `README.txt` file with:
    * a description of your code
    * how your classes work together
    * your group members
    * your subgroup members
    * your answers to the questions above

# Where to Start

It can be really tough to know how to begin a project, but making some concrete progress early helps you keep moving. Right now, try to use the outline below to create the items (classes and data/objects) you'll need to fill in later. Alternatively, you can create a drawing of the interfaces and the classes that implement them so you have an idea of your structure. Afterward, you can make a skeleton of method stubs, variable names, and even comments to outline chunks of code you can write in small steps.

#### Create Data Classes

First, think about how your class hierarchy captures the important information about the tree. How are you representing the dataset? How are you representing a node? How are you representing an edge? What information will you need to be able to access?

#### Implement Tree Structure

Next, fill in the methods declared in the `IAttributeDataset` interface to begin partitioning the data in the dataset based on its attributes.

#### Implement Tree Generation

Lastly, fill in the methods declared in the `IGenerator` interface to create the tree and look up a recommendation!

FAQ
==============

* **Do all of the paths have to consider the attributes in the same order?** No. As long as each attribute is considered only once per path, attributes can be considered in different orders on different paths.
* **Do decisions need to be booleans?** No, decisions can come from any finite set of values (city names, color preferences, booleans, etc.).
* **What happens if the dataset is empty?** You should throw an exception if an empty dataset is given.
* **Can we modify the interfaces?** Please don't change the interfaces!
* **Is the name field part of the decision tree?** Nope!
* **Will this tree be autograded?** No, we'll be grading this project by hand.
* **When should `getSharedValue` be used?** `getSharedValue` retrieves the value of an attribute that is assumed to be the same for every object in the dataset. In other words, you should code the method with this assumption in mind; it is up to you whether to handle any undefined behavior (such as when the values are not actually all the same), though doing so is not required.
* **How do I add the `commons-csv-1.8.jar` file to my project structure?**
    1. Open 'Project Structure' by clicking on the file icon with three blue squares in the top-right corner.
    2. Click on 'Modules' under the Project Settings heading in the sidebar of the left window.
    3. Click on the '+' icon at the bottom, then click 'JARs and directories'.
    4. Navigate to the `commons-csv-1.8.jar` file (in the `lib` folder of your recommender project) and select it.
    5. Check the box next to the newly added commons-csv module, click Apply at the bottom of the window, and then click OK.
    6. The CSVParser should now work!
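As a closing illustration of two of the FAQ answers above, here is a hedged sketch of the empty-dataset check and the `getSharedValue` contract. The exception type and field names are our own assumptions, not requirements.

```java
// Sketch: inside your TreeGenerator -- reject an empty dataset up front
// (the specific exception type is our choice, not prescribed).
public TreeGenerator(IAttributeDataset<T> initTrainingData) {
    if (initTrainingData.size() == 0) {
        throw new IllegalArgumentException("dataset must not be empty");
    }
    this.trainingData = initTrainingData; // assumes a trainingData field
}

// Sketch: inside a dataset class with a hypothetical `rows` LinkedList field.
// Since every row is assumed to share the value, the first row's value
// is as good as any.
public Object getSharedValue(String ofAttribute) {
    return this.rows.getFirst().getValueOf(ofAttribute);
}
```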