---
tags: projects, spr22
---

# Project 1: Decision Tree

## Due Dates:
- **Wednesday, February 23 at 11:59pm ET** for the Gradescope Design Questions
- **Thursday, March 3 at 11:59pm ET** for the Gradescope Implementation Submission

:::warning
To get access to the stencil code, one partner will need to fill out [this form](https://forms.gle/RUPaa8Y3ybYvhKAW9) that asks for Brown emails and GitHub logins (**not CS logins**) so we can set up the repository for each group. Not enough people submitted GitHub logins at the start of the semester for us to do this on our end, hence the form.
:::

# Overview

In this project, you will implement a classic machine learning algorithm that uses training data to make predictions in new situations. This algorithm uses a tree-like data structure to arrive at its predictions. [We love trees!!](https://www.youtube.com/watch?v=OqvD4NC-s9E)

## Learning Objectives
- To practice designing with classes and interfaces
- To practice using trees in a real-world application
- To understand a bit of how one approach to machine learning works
- To identify potential societal impacts of algorithms

## Project Timeline

This project will proceed in two stages, with you working with your partner in both:

- In the **Design Stage**, you'll work through the project handout, figure out how the intended algorithm works, and plan out the classes that you will need. You'll follow a worksheet and answer some design questions in Gradescope, which you and your partner will review together with a TA during a **Design Check**.
- In the **Implementation Stage**, you and your partner will write code that carries out your design. We will release the stencil code on Wednesday (Feb 23rd), by which time you will have submitted your Gradescope design work. (See the yellow banner above for how to access the stencil code.)

:::info
**Gradescope submission for design check answers is now open:** You'll find a quiz-like format in Gradescope where you can submit your answers to most of the questions marked "design-check question" in the handout. Simply write your answers in the spaces on the form.

You may see points being automatically assigned to your answers. **IGNORE THE POINTS**. Gradescope requires us to put something there, so we did. Your actual grade will be determined during your design check meeting.

For the questions that don't have spaces on the form, just bring your work to the design check. You don't have to upload those in advance.
:::

We are doing the project in these two stages to catch as many conceptual questions as possible early on, so that you don't head off in the wrong direction by starting to code too early.

# Background

You may have heard of machine learning, but what is it really? You have learned how to write functions that perform computations. So far, the functions you've written consist of explicit steps that tell the computer how to do the computation. Imagine, for example, that we asked you to write a program to determine whether a particular job applicant should get an interview. You could write a function with a bunch of if-statements and boolean operations over information about the candidate (years of experience, prior recommendations, etc.). But what if you have a lot of potential variables to consider? Doesn't this function become pretty messy and hard to maintain?
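To see why, consider what such a hand-written function might look like. This is purely a hypothetical sketch: the attributes, rules, and thresholds below are invented for illustration and are not part of this project.

```java=
// A hypothetical hand-coded screening function. Every new attribute or
// policy tweak means hand-editing this tangle of conditionals.
public class InterviewScreener {
    public static boolean shouldInterview(int yearsExperience,
                                          boolean hasRecommendation,
                                          String degree) {
        if (yearsExperience >= 5) {
            return true;
        } else if (yearsExperience >= 2 && hasRecommendation) {
            return true;
        } else if (degree.equals("masters") || degree.equals("phd")) {
            // ... one branch per combination we happened to think of ...
            return hasRecommendation;
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(shouldInterview(3, true, "bachelors")); // prints true
    }
}
```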
With machine learning, you don't write this function explicitly. Instead, you use a program (the machine learning algorithm) to *generate one possible function for you* based on data about previous candidates (this process is called *training a model* in machine-learning terminology). Then, when you get a new applicant, you call the generated function and ask it whether to interview the candidate. Set aside how you feel about the (considerable) perils here: the point is that machine learning is about generating functions to recommend decisions based on prior decisions.

In this project, you will implement a classic machine learning algorithm and think about how to test it. Concretely, you will:

- write a program that generates a function to make predictions based on data in a dataset. Your program will read in datasets from CSV files using code that we will provide in the stencil.
- use your function to make predictions
- reflect on what it means to test a machine-learned function

At the heart of the algorithm you'll be implementing is a data structure known as a *decision tree*. This is a tree-shaped data structure that models a series of multiple-choice questions, answers to which are found at the leaves. Overall, this project practices the data structure, testing, and programming concepts we've covered so far. You need both object-oriented programming and recursion to get this working. Make sure to not [Lose Yourself](https://www.youtube.com/watch?v=_Yhyp-_hX2s) in an infinite loop!

# Developing your Intuition: A Concrete Example

To help explain what our algorithm will be able to do, let's consider the following problem: you would like to generate a function/train a model to tell you whether a food item is a fruit or vegetable given certain attributes, including `color`, `highProtein`, and `calories`.

## Food Dataset

A subset of the data is shown below:

```csvpreview {header=true}
name,color,highProtein,calories,foodType
spinach,green,true,low,vegetable
orange,orange,false,high,fruit
cucumber,green,false,low,vegetable
carrot,orange,false,medium,vegetable
sweet potato,orange,true,high,vegetable
avocado,green,true,high,fruit
banana,yellow,true,high,fruit
```

Each row contains values for the attributes `color`, `highProtein`, `calories`, and `foodType`. When you generate your function, you will choose which attribute it should make a prediction on (here, `foodType`). Effectively, this means that you are creating classes and methods to implement a function

```
foodType(color, highProtein, calories)
```

that will return either `"fruit"` or `"vegetable"` depending on the given values for `color`, `highProtein`, and `calories`. (You'll call the function slightly differently, but this is the idea.)

## Decision Tree

Decision trees provide an ordered way of examining each attribute of an item, leading ultimately to a decision/classification for that item. Below is a decision tree generated from the food dataset, designed to predict the `foodType` of foods not in the original table.

![](https://i.imgur.com/Cq2dfE7.png)

Some things to note about the decision tree:

* each non-leaf **node** (represented as a blue oval) is labeled with an attribute from the dataset
* a node for an attribute has outgoing **edges** corresponding to the possible values of that attribute
* the **leaves** (orange and red boxes) represent the decisions or predicted values for the target attribute (here, `foodType`) and are labeled with the classification of a datum that has traveled down that path of the tree.
All decisions are for the same target attribute. Leaves mean that [you've found](https://www.youtube.com/watch?v=jFg_8u87zT0) a prediction!
* the same attribute arises only once along any path from the root to a decision, and each step down a path gets closer to making a decision

**Design-check question:** The decision tree for a dataset should always match the results listed in the original dataset. To make sure that you know how to use the tree, describe how the tree would be used to report that avocados and bananas are both fruits, based on the attributes in the table.

## Making Predictions: [I gotta feeling](https://www.youtube.com/watch?v=uSD4vsh1zDA)

Let's walk through two examples to show how the generated decision tree could be used to predict the value of the target attribute (`foodType`) for new data. It might be useful to read this while looking at the decision tree and the above dataset.

First, let's predict the `foodType` for a tangerine, which has orange color, is not high protein, and is high in calories. We traverse the decision tree, checking the values of the given attributes starting from `color` (the root node). We would first follow the `"orange"` edge down to the `highProtein` node. Since the tangerine does not have high protein, we would then follow the `"false"` edge down to the `calories` node, and since the tangerine has `"high"` calories, the function would predict that a tangerine has the value `"fruit"` for the attribute `foodType`.

Now, consider broccoli, which is green, high in protein, and medium in calories. Again, we follow the tree. Taking the `"green"` color edge gets us to a `calories` node that has no edge for `medium`. This is a value that was not covered by our training data, so we need to use some information other than the paths in the tree to predict the `foodType`. The path we had taken so far limited us to green-colored items. We therefore consult the training dataset to identify the most common `foodType` among the green items in the training data (spinach, cucumber, and avocado). The function thus returns `"vegetable"`. Had there been a tie in the training data, we'd choose from the most-common values arbitrarily.

**Design-check question:** What would a function built on the above decision tree predict for green apples that are high in calories? What about for red apples?

### A revised Decision Tree that handles new attribute values

The broccoli example makes the algorithm sound complicated when given an unknown value. If you read that description and think in terms of processing steps, it sounds like we need to stop traversing, figure out what path we're on, return to the dataset, etc. UGH!!

Slow down. We know up front that unknown attribute values *could* arise. So what if our decision tree accounted for them in the first place? We could compute the "new attribute value" result as we build each node, saving it in case it is needed later. If we did that, the tree we generate for the original dataset would actually look as follows (`default` is a field that stores the result for a new attribute value; the purple node numbers will be used later in the handout):

![](https://i.imgur.com/pq8p2ze.png)

If we had this tree from the start, we could have handled broccoli the same way we had handled the tangerine: a simple tree traversal. This revised version of the tree is what you will actually build in this project.
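The "result for a new attribute value" at a node is just the most common target value among that node's rows (recall how broccoli was handled). Here is a minimal, hedged sketch of that computation over a plain list of values, using only the data structures this project allows; the class and method names are made up for illustration and are not part of the stencil:

```java=
import java.util.ArrayList;
import java.util.List;

public class DefaultExample {
    // Returns the most common string in the list (ties broken arbitrarily).
    // A hypothetical helper, not part of the stencil.
    static String mostCommon(List<String> values) {
        String best = null;
        int bestCount = 0;
        for (String candidate : values) {
            int count = 0;
            for (String v : values) {
                if (v.equals(candidate)) {
                    count = count + 1;
                }
            }
            if (count > bestCount) {
                bestCount = count;
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // foodType values of the green rows: spinach, cucumber, avocado
        List<String> greenFoodTypes = new ArrayList<>(
                List.of("vegetable", "vegetable", "fruit"));
        System.out.println(mostCommon(greenFoodTypes)); // prints "vegetable"
    }
}
```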
*This example illustrates a key algorithmic principle: pre-computing data can turn seemingly complicated algorithmic problems into simple data lookups. When a series of computational steps feels messy, ask whether you could pre-compute and store some data to simplify the computation later.*

**Design-check question:** On paper, work out the decision tree, with default values, to predict the outcome for the following dataset on college admissions. Consider the attributes in the order GPA, letters of rec, high school, and extra curriculars:

```csvpreview {header=true}
GPA,high school,extra curriculars,letters of rec,outcome
C,blue school,strong,strong,reject
C,red school,weak,weak,reject
A,red school,weak,strong,accept
A,red school,strong,weak,reject
B,blue school,weak,strong,accept
B,red school,strong,strong,accept
B,red school,weak,strong,reject
```

## Overview of What You Will Be Building

The following code block shows a sample interaction with the final code that you'll develop for this project:

```java=
// creates a Dataset out of a CSV file
List<Row> dataObjects = DecisionTreeCSVParser.parse("fruits-and-vegetables.csv");
List<String> attributeList = new ArrayList<>(dataObjects.get(0).getAttributes());
Dataset training = new Dataset(attributeList, dataObjects);

// builds a TreeGenerator object and generates a tree for "foodType"
TreeGenerator generator = new TreeGenerator();
generator.generateTree(training, "foodType");

// makes a new (partial) Row representing the tangerine from the example
Row tangerine = new Row();
tangerine.setAttributeValue("color", "orange");
tangerine.setAttributeValue("highProtein", "false");
tangerine.setAttributeValue("calories", "high");

// uses the decision tree generated above to make a decision
String decision = generator.getDecision(tangerine);

// the following prints "fruit"
System.out.println(decision);
```

**We will give you the `DecisionTreeCSVParser` and `Row` classes** (the former will read in a CSV file and produce a list of `Row` objects). In the implementation stage, you will need to:

- design a `Dataset` class that holds a list of `Row`.
- write a `generateTree` method (in `TreeGenerator`) that will generate a decision tree from a list of `Row` objects.
- write a `getDecision` method (in `TreeGenerator`) that will produce a prediction for a `Row` object.

The rest of the Design Stage work will have you plan out the classes and methods that you will need for these implementation tasks.

# Design Stage Tasks

## Class Design

The `generateTree` method will take in a `Dataset` (with a list of `Row`) and produce a decision tree as shown in the sample diagrams. First, let's map out how the nodes in the tree align with parts of the dataset.

**Design-check question:** Propose the collection of classes and interfaces that you need to capture a decision tree. Don't worry about the methods; just outline the fields. The class names and their fields are up to you.

*Connection to prior assignments: Everyone has implemented binary search trees in CS200. The tree data had two cases (Pyret) or two classes (Java): one for leaves and one for internal nodes. Follow a similar structure here. To connect the two classes, recall how we connected the `Empty` and `Link` classes when creating immutable lists. Keep in mind that decision tree nodes have an arbitrary number of outgoing edges, not just two as in binary search trees.*

**Design-check question:** Write code that creates the subtree that starts at node 2 (the `High Protein` node) using your classes.
*Hint: Decision trees are immutable, so think about how we built `LinkList` data: e.g., `new Link(3, new Link(7, new Empty()))`.*

**Design-check question:** Create an inheritance/containment diagram that includes the `Dataset` class and your classes that capture a decision tree.

## Designing `generateTree`

Now, it's time to sketch out how to actually construct the tree from the dataset. Generating the tree is a recursive process that works on smaller and smaller sets of rows from the original dataset. These activities will have you map out the process before you turn it into code.

**Design-check question:** We've numbered the nodes in the sample diagram. Indicate which rows from the dataset (by the corresponding food name) are being used to generate the portion of the tree rooted at each node. For example, at node 1 (the topmost node), all 7 food rows are being used. Write out this information for the other three nodes, as well as the leaves.

**Design-check question:** Answer the following questions in prose:

- Based on the row labels you just attached to the tree, under what conditions do you make a leaf? Where does the decision in a leaf come from, relative to the dataset at that leaf?
- What determines the number of edges out of an internal (non-leaf) tree node?
- Given the dataset for an internal node, how would you compute the dataset used to create the subtree at the end of one of its edges?

**Design-check question:** Based on your answers to the previous questions, propose a collection of method headers (names and types only) that you expect to need. For each one, does it belong in the `Dataset` class or the `TreeGenerator` class?

*Hint: Remember that some methods need recursive helpers (like the `MutableList` methods needed in the `Node` class). A recursive helper is also useful when you want to work with a different input type than the one you started with. For example, `generateTree` takes a `Dataset` as input, but you might want the recursive function that builds the actual nodes to take a different type of input (see the earlier design problems).*

In the actual implementation, you will generate a random order in which to explore the attributes of the dataset. We will give more details on that when we release the implementation details.

## On the Accuracy of Machine-Learned Decisions

Assume that the tree that you generated earlier regarding admissions decisions will now be used on a new set of students. For each of the following students, use the tree to determine whether they will be admitted (fill in the outcome column):

```csvpreview {header=true}
name,GPA,high school,letters of rec,extra curriculars,neighborhood,outcome
studentOne,B,blue school,weak,strong,Canyon,
studentTwo,B,red school,weak,weak,Elman,
studentThree,B,red school,strong,weak,Elman,
studentFour,B,red school,strong,weak,Canyon,
studentFive,C,red school,strong,weak,Elman,
```

**Design-check question:** What are some things you notice about who does and does not get admitted? Are there outcomes that surprise or concern you? Look at the table: do you spot any patterns between the outcomes and neighborhoods, even though neighborhood wasn't part of the original tree?

This example points to what is called **soft-coded bias**.
Even if identities like race, gender, socioeconomic status, or sexuality are not included as attributes when training decision trees, the generated prediction function can still produce biased outcomes that depend on those identities. This often occurs when an attribute included in the model is a close substitute for one of these identities. For example, seemingly innocuous attributes like neighborhood are strongly correlated with race. Essentially, if any attribute that you include in your model is correlated with one of these identities, your model is vulnerable to this sort of soft-coded bias.

Unfortunately, many companies turn to low-cost labor to keep up an illusion that their machine learning or artificial intelligence software is working as intended. These workers often do the "dirty work" that the software cannot and that others will not. One troubling example is that of Facebook content moderators.

**Design-check question:** Either read this article or watch the corresponding video by The Verge about the lives of Facebook content moderators: [Inside the traumatic life of a Facebook moderator](https://www.theverge.com/2019/2/25/18229714/cognizant-facebook-content-moderator-interviews-trauma-working-conditions-arizona). Write a response (8-10 sentences) reflecting on what you've read about the experiences of Facebook content moderators. What do you think about the ethical dilemma that this situation presents? Assuming no software developments could fix this issue, what do you think could be changed to improve this system?

# Implementation Stage Tasks

:::warning
**Your implementation may only use built-in data structures that we have covered in CSCI 0200 as of the day of release.** Specifically, you may only use **LinkedList**, **ArrayList**, **arrays**, and any classes you define for yourself. Those with prior Java experience may know other data structures (such as HashMaps). To keep the workload even for everyone, and to maintain consistency in considering runtimes and space usage, you ++**cannot**++ use these other data structures on this project.
:::

## Stencil Code

**See the yellow box at the top of the handout for instructions on accessing the stencil code for your team.**

The `src` directory contains the following classes, which you will use without modifying:

- `Row`, a single data point in your dataset, representing a row in the table. You will want the following `Row` methods (others are just used by our testing infrastructure):
    - `getAttributes()`, which returns a set of all the attributes (column names as Strings) covered by a row. See the Java [Set documentation](https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/util/Set.html) for information about methods on sets (you can write a for-loop over a Set, just as we have for lists -- that may be all you need to know about Sets).
    - `getAttributeValue(String attributeName)`, which returns the value for the row in the column named `attributeName`.
    - `setAttributeValue(String attributeName, String value)`, which sets the value of the column named `attributeName` within the Row (this is useful for creating small Rows by hand for testing purposes).
- `DecisionTreeCSVParser`, which parses a CSV (comma-separated values) document into a list of `Row`
- `DecisionTreeTester`, which provides basic system-testing functionality (accuracy and tree visualization)
- `IDataset`, an interface to represent your dataset
- `ITreeGenerator`, an interface to represent your tree generator
- `ITreeNode`, an interface representing your tree node
- `App`, similar to `Main`, a class to graphically run your program

The `sol` directory contains:

- `Dataset.java`
- `TreeGenerator.java`
- `DecisionTreeTest.java`, where you will write unit tests
- an `ITreeNode.java` interface that you can add methods to

These are provided as empty class definitions and must implement the corresponding interfaces in `src`. You may create as many additional classes in `sol` as you wish.

## Implementation Details

### Generating the Decision Tree

Hopefully the design check activities helped you see how the tree generation should work. If you still need help seeing what that code would look like, consider the following steps:

- choose an attribute to split on from the Dataset (not the target attribute and not one you've already split on)
- create a node for that attribute
- create an edge for each value of that attribute in the dataset
- attach each edge to the tree generated for the subset of the dataset corresponding to the edge value
- compute the default value and store it in the node

Once all rows in a subset of the dataset have the same value for the target attribute, the "generated tree" for that subset should be a decision-leaf node with that value.

This algorithm is a simplification of a classic machine-learning algorithm named ID3.

### Choosing a previously unused attribute

There are several ways to choose an attribute to split on. The ID3 algorithm chooses the attribute whose split produces the most homogeneous subsets. In our implementation, we will simplify this to **select an attribute at random**. Here is one way to generate a random number in Java:

```java=
import java.util.Random;

Random random = new Random();
int upperBound = 10;

// generates a random number between 0 (inclusive) and 10 (exclusive)
// note that the upper bound must be greater than 0!
int randomNum = random.nextInt(upperBound);
```

The idea is that you will generate a random index into the list of attributes, then split on the attribute at that index.

### Tracking previously-used attributes

Part of what you have to think about here is how to keep track of which attributes you have already split on from the dataset and which are left to consider. Things to keep in mind as you do this:

- you probably want to create an extra data structure to help track this information
- *our work on mutable vs immutable lists comes into play here.* Think hard about this choice (and how each interacts with recursion) as you manage the already-used attributes.

### Traversing the Decision Tree

To make a prediction of the target attribute for a new `Row`, you need to traverse the tree from the root to a decision leaf, following the edges corresponding to the values of the attributes at each node. The value in the leaf that you reach is the output of the prediction function (the decision that you return).
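If it helps to see the shape of such a traversal, here is a hedged sketch. `Row` comes from the stencil, but everything else (the interface `INode`, the classes `Leaf`, `Edge`, and `AttributeNode`, and the method `decide`) is hypothetical: your class design is up to you and may well differ.

```java=
import java.util.List;

// A hedged sketch of decision-tree traversal -- NOT the required design.
interface INode {
    String decide(Row datum); // recursively produce a decision for a datum
}

class Leaf implements INode {
    String decision;
    Leaf(String decision) { this.decision = decision; }
    public String decide(Row datum) {
        return this.decision; // base case: a leaf already holds the answer
    }
}

class Edge {
    String value;   // the attribute value labeling this edge, e.g. "green"
    INode child;    // the subtree at the end of this edge
    Edge(String value, INode child) { this.value = value; this.child = child; }
}

class AttributeNode implements INode {
    String attribute;        // the attribute this node tests, e.g. "color"
    String defaultDecision;  // pre-computed fallback for unseen values
    List<Edge> edges;
    AttributeNode(String attribute, String defaultDecision, List<Edge> edges) {
        this.attribute = attribute;
        this.defaultDecision = defaultDecision;
        this.edges = edges;
    }
    public String decide(Row datum) {
        String value = datum.getAttributeValue(this.attribute);
        for (Edge edge : this.edges) {
            if (edge.value.equals(value)) {
                return edge.child.decide(datum); // recur down the matching edge
            }
        }
        return this.defaultDecision; // no edge matched: use the stored default
    }
}
```

Note how the unseen-value case from the broccoli example becomes a single `return` once the default has been pre-computed.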
*If you are still stuck, think about how you searched for items in binary search trees.*

## How to Get Started

Here are suggestions from one of your TAs (Hammad), augmented by Kathi, on getting started on this project:

- Implement the `Dataset` constructor and the methods defined in the `IDataset` interface.
- Create/implement classes for the components of the decision tree (nodes, leaves, etc.).
- Make sure you can read in a dataset from CSV and get a `Dataset` object back (follow the code in the "Overview" section). Write some simple tests (say, on the length of the dataset) to make sure you've gotten what you expected.
- Outline (perhaps in pseudocode/a sketch of the code) the general idea of how the tree is generated (note that the design check questions were intended to help you do this!).
- Begin implementing `generateTree` in `TreeGenerator`, leaving space for helper methods that may or may not already be written (there will be errors!). A good way to do this is to start with simple cases (simple datasets) and work up to more complicated ones. Think about what datasets will yield the simplest trees. Design one, make sure it works, then create slightly more complicated sample datasets to help you build up your code.
- On the first pass, don't worry about details like avoiding getters. Get a version of the code that works, then rework it into a well-designed solution afterwards.
- Write the helper methods for `generateTree` in the appropriate places.
- Write `getDecision`.

# Testing

In this project you'll test in two ways: unit tests and system tests.

- *Unit tests* are what you have been writing in `JUnit` so far: tests that individual methods work as expected.
- *System tests* check that all of the pieces of your code work together as expected. In the case of this project, system tests will check whether your decision tree makes expected predictions on larger datasets.

### Unit Testing

For unit tests, you should develop your own small dataset(s), generate trees from them, and check the resulting decisions. You can generate your own small trees using code like the sample interaction in the "Overview" section near the top of the document: put code similar to that interaction in a JUnit test function, and assert the decisions that you would expect. Put your tests in the `DecisionTreeTest.java` class in your `sol` directory. There is no wheat/chaff grading for this project.

### System Testing

For system testing, you will use the `DecisionTreeTester.java` file that we have provided in the stencil's `src` directory. Within that tester, we are checking your code on several new datasets:

:::spoiler Descriptions of the larger datasets on which we'll be testing your work

- **Villains:** This dataset contains as attributes a bunch of restaurants on College Hill and whether or not the person/datum/row likes each restaurant, as well as a boolean attribute `isVillain` for whether or not the person is a villain. The tests use your code to train a decision tree on the training data, then use it on the testing data to predict whether or not a person is a villain based on their restaurant preferences.
- **Mushrooms:** This dataset contains a bunch of attributes of mushrooms, one of which is `isPoisonous`. The tests use your code to train a decision tree on the training data, then use it on the testing data to predict whether or not a mushroom is poisonous based on its other attributes.
- **Candidates:** This dataset contains attributes of candidates applying for a job, one of which is whether or not they were hired. The tests train your decision tree to predict whether or not a candidate was hired based on their resume attributes.
- **Songs:** This dataset contains a bunch of songs from the 2000s with their titles and artists removed. Instead, they are represented by attributes including `topGenre`, `year`, and boolean attributes such as `isLoud`. The tests use your code to train a decision tree on the training data, then use it on the testing data to predict whether or not a song is popular based on its other attributes.
:::

<br>

The tests that work with these datasets are located in the `DecisionTreeTester` class. For each dataset, the tester trains your decision tree on training data. Then, for each datum in the testing dataset, it compares the generated decision tree's prediction against the datum's actual value to measure accuracy.

The performance of your decision tree can vary because attributes are selected at random. Therefore, we run this process many times (currently `NUM_ITERATIONS = 100`) and average the results to get the average accuracy of the trees generated by your algorithm. You should not need to modify the code (unless you want to increase the number of iterations to make the measurement more precise).

**Your accuracy measured on testing data for the Mushrooms and Villains datasets should be above 70%.** If the target attribute is a boolean, a decision tree that randomly selects one of the two possible values as its prediction would have an accuracy of around 50%, so if your decision tree's accuracy is that low, you are definitely doing something wrong.

There is also a test that evaluates the decision trees on the same training data used to generate them. Your accuracy should be ~100% for this test; if it is not, there may be something wrong with your code.

:::warning
Note: We previously required 70% accuracy on Songs and Candidates for full credit as well, but this is no longer the case.
:::

## Visualizer

Trees are ugly to view with simple `toString` printing. As an alternative, we are providing you with a visualization tool that will let you load a dataset and see a visual presentation of it. **This section is not required; it is just here to be helpful.**

:::spoiler Here are the instructions for setting up the visualizer

To use the visualizer for this project, we'll need to set up JavaFX and Graphviz using Java version 17. This is a bit tricky, so be sure to follow these steps fully and carefully.

1. Install GraphViz
    - If you're using a **Mac**, you'll need to install Homebrew first, which is software for managing the packages on your computer. The directions for installing Homebrew can be found on its home page, https://brew.sh/. Once you have Homebrew, open up a terminal and install GraphViz using the command `brew install graphviz`.
    - If you're using **Windows**, navigate to the GraphViz download page, https://graphviz.org/download/. Scroll to the middle of the page and download the package labeled "graphviz-2.50.0 (64-bit) EXE installer [sha256]". Run the executable file and accept all of the default options.
2. Next, head to https://gluonhq.com/products/javafx/#ea and download the JavaFX packages. Under the Downloads header, select JavaFX version 17.0.2 [LTS] and your operating system (Linux, macOS, or Windows). Find the row that contains Architecture x64 (Windows and Intel Mac) or Architecture aarch64 (M1 Mac) and Type SDK, then hit Download.
![](https://i.imgur.com/Zkwczwv.png)

3. Double-click on the downloaded .zip file to unzip (extract) it. You should have a new folder called `javafx-sdk-17.0.2`. Drag this folder into your project's `lib` directory.

![](https://i.imgur.com/xuTVt6U.png)

4. Within the IntelliJ sidebar, open the `javafx-sdk-17.0.2` folder and right-click on the **internal** `lib` folder. Then select `Add as Library` from the pop-up menu. Make sure your `Level` is set to `Project Library`, then hit OK. **Do not unzip `src.zip`.**

![](https://i.imgur.com/UKh4ZgB.png)

5. Try running your program by clicking the green play button next to `public class App extends Application` in `App.java`. If you get an error message: `JavaFX runtime components are missing, and are required to run this application`, don't worry! You're on the right track!

6. In the task bar on the top of your IntelliJ window, click the drop-down menu next to the word `App`, then click `Edit Configurations...`.

![](https://i.imgur.com/rT3lZVw.png)
![](https://i.imgur.com/a5XjNtK.png)

7. Select `Modify Options > Add VM Options`.

![](https://i.imgur.com/l00Nzbf.png)

8. In the newly-appeared `VM options` bar after `java 17`, paste the following text: ` --module-path ".../path/to/inner/lib" --add-modules=javafx.controls `

    Make sure you replace `".../path/to/inner/lib"` with the filepath to the internal `lib` folder. If you're using a **Mac**, remove the quotation marks, but if you're on **Windows**, keep them. You can right-click on the inner `lib` folder, hit `Copy Path`, and then `Absolute Path` to copy your filepath. You can then paste the filepath directly into VM options!

![](https://i.imgur.com/4BUpA1k.png)

9. That's it! Try running your program again. You should see the visualizer appear as it does below:

![](https://i.imgur.com/ddaDsDS.png)
:::

## Grading

This project will be both autograded and manually graded. The autograded portion includes tests like those we've given you in the `DecisionTreeTester` file. As mentioned earlier, you should expect to achieve >70% accuracy on testing data for the Mushrooms and Villains datasets we provide, and close to 100% accuracy measured on training data. Your Gradescope submission will have a "Test Passed" message in green letters if you've met this goal.

We will manually grade your tree classes, unit tests, and the specific design of your generator (including the `generateTree` function). We will be looking for things like:

- good OO design choices: have you used interfaces properly? Have you put methods in the right classes (including avoiding default getters)?
- appropriate use of recursion in the `TreeGenerator`
- appropriate choices of mutable vs immutable data structures in the `TreeGenerator`
- a good set of your own tests of key methods on smaller datasets

We also expect that you've documented and prepared your code according to the style guide.

## Submitting Your Work

You should submit **every** file in your `sol` package, which **must** include:

* a completed `Dataset.java`
* a completed `ITreeNode.java`
* a completed `TreeGenerator.java`
* a `DecisionTreeTest.java` with your unit tests

as well as any other optional .java files you wrote yourself. You do **NOT** need to submit your unit tests separately for this assignment. You are uploading to one Gradescope assignment only.
# Vocabulary

**Attribute:** A property of every example in a dataset, used to predict an example's classification. For example, `highProtein`, `calories`, or `color`.

**Classification:** The ultimate value of the target attribute for each example. This is what the decision tree will try to predict. For example, fruit or vegetable.

**Dataset:** A collection of data (see *datum*, defined below).

**Datum:** One example in the dataset, which has a value for each of the dataset's attributes. For example, a banana is yellow, has high protein and high calories, and is a fruit.

**Decision:** The prediction of the classification that the decision tree makes given a datum.

**Decision Tree:** A machine learning model that represents attributes, values, and classifications of some training data, and can be [traversed in order](https://www.youtube.com/watch?v=ABYnqp-bxvg) to make a decision given a datum. The decision tree contains nodes (representing attributes), edges (representing values of those attributes), and leaves (representing the ultimate decision for that path).

**Splitting (on an attribute):** The process of creating smaller subsets given a dataset and an attribute, where all the rows in each subset share the same value for that attribute. For example, if you split the above fruit/vegetable dataset on color, you would get a subset with yellow foods, one with green foods, and one with orange foods (see the sketch after this glossary).

**Testing Dataset:** A subset of data that was not included in the training dataset. System testing involves making a decision from each testing datum's attribute values and checking whether that decision matches the datum's actual classification.

**Training:** The process of taking in a training dataset and generating a decision tree.

**Training Dataset:** A subset of a dataset used to generate the decision tree.

**Target Attribute:** The attribute that the decision tree tries to make a prediction on. For example, `foodType`.

**Value:** A string that qualifies a datum for an attribute. For example, possible values of the `color` attribute include green, orange, and yellow.
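To make *splitting* concrete, here is a small, hedged sketch of collecting one such subset using only `ArrayList`s. `Row` comes from the stencil, but the helper and its signature are invented for illustration and are not part of the stencil:

```java=
import java.util.ArrayList;
import java.util.List;

public class SplitExample {
    // Collects the rows whose value for `attribute` equals `value`.
    // A hypothetical helper, not part of the stencil.
    static List<Row> subsetFor(List<Row> rows, String attribute, String value) {
        List<Row> subset = new ArrayList<>();
        for (Row row : rows) {
            if (row.getAttributeValue(attribute).equals(value)) {
                subset.add(row);
            }
        }
        return subset;
    }
}
```

Splitting the food dataset on `color` would amount to calling a helper like this once per distinct value (`"green"`, `"orange"`, `"yellow"`), producing one subset per edge.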
______________________________ *Please let us know if you find any mistakes, inconsistencies, or confusing language in this or any other CSCI0200 document by filling out the [anonymous feedback form](https://docs.google.com/forms/d/e/1FAIpQLSdFzM6mpDD_1tj-vS0SMYAohPAtBZ-oZZH0-TbMKv-_Bw5HeA/viewform?usp=sf_link).*