# Assignment 6: Data Visualization

![Meme about fancy data visualization](https://i.imgur.com/Bx4AkTJ.png)

## Overview

This assignment consists of two parts: one where you create various visualizations on two datasets, and another focusing on ethical questions around data visualization. Your work for both parts should go into `writeup.md`.

## Part 0: Set Up

### Getting the stencil

You can click [this link](https://classroom.github.com/a/ss3ejdGh) to get the stencil for this assignment. **Important:** Please view [Appendix A](#Appendix-A-Stencil-Code-Structure) for info regarding the structure of the stencil code.

### Python Data Visualization Modules

In this assignment, you may use packages that have not been installed on our course virtual environment. Please refer to [Appendix B](#Appendix-B-How-to-install-a-new-package-in-your-own-virtual-environment) for instructions on how to install new modules in your own virtual environment. You will not have write access to install packages on our course virtual environment on the department filesystem, so please let the HTA know if there is a Python module that is not yet available and that you think should be added to the official course virtual environment.

Some modules that we recommend using in this assignment are: `Pandas`, `Matplotlib`, `Seaborn`, `Plotly`, and `Sklearn` (for your Machine Learning models, and for [decision tree visualization](https://mljar.com/blog/visualize-decision-tree/)).

## Before you start ...

First, we recommend finishing the lab before working on the assignment.

We also care about accessible data visualization. Before you start designing your dashboard, we want you to read the following articles about accessible data visualization:

* “[5 tips on designing colorblind-friendly visualizations](https://www.tableau.com/blog/examining-data-viz-rules-dont-use-red-green-together)” (also covered in lab)
* “[Why Accessibility Is At The Heart of Data Visualization](https://medium.com/nightingale/accessibility-is-at-the-heart-of-data-visualization-64a38d6c505b)” - in particular, pay attention to the *Design equivalent experiences for all your readers* section.

Additionally, we hope that you will utilize the following tools:

* [Colorblindness Simulator](https://www.color-blindness.com/coblis-color-blindness-simulator/), where you can upload a photo (e.g., of a graph that you made), and it will output how the graph would look if you had a certain type of colorblindness.
* [Guide: Including Alt Text in Markdown files](https://dev.to/stephencweiss/markdown-image-titles-and-alt-text-5fi1) - this will give you some guidance on how to include alt text in your `writeup.md` report.

Keep the principles from the readings in mind as you design and implement your dashboard. You should try your best to utilize these best practices in your graphs for this assignment. We will be evaluating your graphs for accessibility during grading. Additionally, note the times during your design and implementation process where you *could* **and** *could not* act on suggestions in the readings. You will answer questions about your observations after having produced all the visualizations.

We hope that this will be a fun assignment and will closely resemble future data science work!

### Gradescope Autograder & Collaboration

Due to the free-form nature of the assignment, we do not have an autograder configured on Gradescope.
Feel free to talk to your friends or come to TA hours to get feedback on your graphs (e.g., *“does it make sense that I use graph X to communicate this information?”* or *“how do you feel about my design for graph Y?”*). However, you should be the one who determines the design choices and comes up with the code to produce the graphs. We have been and will continue to use Gradescope on our assignments to check for code similarity between submissions.

## Part 1: Visualization Scenarios (68 points)

In this assignment, there is no one way to do things: **You are the person to make the design choices to visually analyze your data/your models.** The design choices that you will make in this assignment include:

* What are the questions that I will graphically analyze?
* What kinds of plots will I produce to analyze my questions? Why?
* Out of the many Python visualization tools/packages, what will I use to produce the plots? How can I use it to make my graphs the most informative and accessible?

In response to the scenarios posed in this assignment, all the code that you write to produce the answer plots should go in their respective files (as noted in each section). In `writeup.md`, you will have to include the produced plots and write your answer to each question that you decide to analyze. In particular, be sure to mention:

* **Question:** What is the question/aspect you want to analyze?
* **Graph & Analysis:** Include the graph(s) that you use to analyze the question/aspect. How should we interpret the graph(s)? How should we use this information to judge the model/the dataset, or to decide the next steps in our data analysis?

If you are ever in doubt about whether an aspect of analysis is “valid,” feel free to reach out to your TAs for help!

We will go through your code file to make sure that the code that you wrote corresponds to the graphs that you produce, so please make sure to structure your code in the cleanest way possible so we can give you credit.

We expect best design for your graphs. This means:

* Your graphs have to have clear graph titles and axis labels, and need to generally communicate information properly.
* Your graphs need to follow accessible graph design principles - e.g., think colors, sizes, or alt text. Please refer to the lab and to the accessible design tools/articles above for more information regarding accessible visualization.
* Your graph communicates information well on its own, but you also do a good job of analyzing your graphs (refer to the questions mentioned above).

You are free to use any kinds of plots, packages, etc. - as long as you include your graphs in the `graphs` folder (and in the `writeup.md` file).

### Stage 0: Analyzing Sample Visualizations (18 points)

For this first part, you will run the `sample.py` file, which contains ten generic visualizations as examples of the various kinds of visualizations that a data scientist might produce. For each of the last three exploration visualizations in `sample.py` (`plot_correllogram()`, `plot_categorical_bars()`, `plot_time_series_annotated()`), answer the following questions:

1. What kind of relationship is this visualization helpful for visualizing? What stage of the data science process would this visual be helpful for (exploration, analysis, machine learning, etc.)?
2. What format would the data have to be in to produce this visualization? Specify the number of attributes, the data types of the attributes, and the format of a potential dataframe which contains the information to make the visual (see the sketch below for an illustration).
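For example, here is a minimal sketch of the kind of dataframe description we have in mind, for a time-series visualization like `plot_time_series_annotated()`: one datetime attribute and one continuous numeric attribute, with one row per time step. The column names `date` and `num_stops` are hypothetical, made up just for illustration.

```python
# Hypothetical dataframe shape for a time-series plot; the column
# names "date" and "num_stops" are placeholders for illustration.
import pandas as pd

df = pd.DataFrame({
    "date": pd.date_range("2020-01-01", periods=4, freq="MS"),  # datetime64
    "num_stops": [112, 98, 140, 125],                           # numeric
})
print(df.dtypes)  # one datetime attribute, one numeric attribute
```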
Please put your answers under "Stage 0" in `writeup.md`.

### Data

We will use two datasets in this assignment: the RI Traffic Stops Dataset and the Banknote Authentication Dataset. The datasets and their details (features, source, acknowledgements) can be found in the `data/` folder. The data is labeled. More specifically:

* **RI Traffic Stops Dataset:** By default, the name of the target feature is `stop_outcome`, and the names of the features are the rest of the attributes.
* **Banknote Authentication Dataset:** By default, the name of the target feature is `Class`, and the names of the features are the rest of the attributes.

### Stage 1: Data Visualization in Data Exploration (30 points)

Your code in this section goes into `stage_one.py`. You are expected to explore a total of **three aspects of your choice** of your data, with at least one accompanying graph for each aspect. You should explore at least one aspect from each dataset. **NOTE:** The 3 accompanying figures should all be different types of graphs.

You want to build a Machine Learning model on the datasets, but as a stellar data scientist, you realize that you need to explore what the data distribution looks like first. What kinds of graphs will you produce to explore your data before you dive into building the model?

**Hint:** To build a good model, you may want to look at the distribution of the fields (of your interest) that exist in your dataset. For example:

1. If a field consists of continuous numerical values: How are the values in this field distributed? What are the range, the median, and the standard deviation of the values in this field?
2. If a field consists of categorical values: How many distinct categories can the values be divided into, if applicable?
3. If your dataset has true target labels: Are the classes in your dataset balanced (meaning, roughly the same number of samples for each class)?
4. Looking at multiple different fields, what does the breakdown of the data look like? For example: looking at fields `A` and `B`, how many samples have `(A,B)` = `(a_1, b_1)`? `(a_1, b_2)`? `(a_2, b_1)`? And so on…

In a supervised context where the dataset that you’re provided contains the target labels, you can just plot your data points from your original data and see if they are in "natural clusters" already. (**Hint:** Take a look at the functions `plot_multiclass_fig_2D`, `plot_fig_3D`, `plot_linear_line`, and `plot_linear_plane` in `sample.py`!) You may want to explore whether your data is linearly separable or already clustered into almost distinct clusters - if it is, then you can just use super simple Machine Learning models (e.g., SVM with Linear Kernels, Logistic Regression, etc.) on your dataset, but if not, you’ll have to use more complex ones (e.g., SVM with more complex kernels, or deep neural networks).

**Hint:** For a dataset with many different attributes, it might be hard for us to plot more than 3 dimensions at once. To handle this problem, you can reduce the dimensionality of your dataset to be either 2-dimensional or 3-dimensional (using methods such as Principal Component Analysis, or regression and picking the “most important” subsets of variables). You can visualize one graph of your most important features, or you can produce a few different graphs to visualize different subsets of features to derive your conclusions about the data.
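As one illustration of this hint, here is a minimal sketch of a PCA-based exploration plot for the Banknote Authentication Dataset. It assumes `get_banknote_df()` can be called with no arguments and that the target column is named `Class` (as described in the Data section) - check `utils.py` and `sample.py` for the actual usage.

```python
# Sketch only: project the banknote features onto 2 principal
# components and scatter-plot them by class to eyeball separability.
# Assumes get_banknote_df() takes no arguments -- see utils.py.
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from utils import get_banknote_df

df = get_banknote_df()
features = df.drop(columns=["Class"])  # keep only the numeric features

# Reduce to 2 dimensions (consider standardizing first, e.g., with
# sklearn.preprocessing.StandardScaler).
coords = PCA(n_components=2).fit_transform(features)

fig, ax = plt.subplots()
for label, marker in zip(sorted(df["Class"].unique()), ["o", "^"]):
    mask = (df["Class"] == label).to_numpy()
    # Vary marker shape as well as color so the classes remain
    # distinguishable for colorblind readers.
    ax.scatter(coords[mask, 0], coords[mask, 1], marker=marker,
               alpha=0.6, label=f"Class {label}")
ax.set_xlabel("Principal component 1")
ax.set_ylabel("Principal component 2")
ax.set_title("Banknote data projected onto 2 principal components")
ax.legend()
fig.savefig("../graphs/banknote_pca.png")
```

If the two classes form visibly distinct clouds in this projection, that is evidence that a simple (e.g., linear) model may suffice.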
**Some examples of aspects that you can analyze:**

* In the RI Traffic Stops Dataset, how many examples (rows) are of each class? What are we to make of the kind of Machine Learning models that we should use on this dataset?
* In the Banknote Authentication Dataset, are the data points linearly separable / almost linearly separable? What are we to make of the kind of Machine Learning models that we should use on this dataset?
* In the RI Traffic Stops Dataset, what does the breakdown of the data look like when looking specifically at the features `driver_race` and `search_conducted`? How about `driver_race` and `stop_outcome`, or `driver_race` and `is_arrested`? What can we say about the relationship between our features of interest, if at all?
* How does the number of traffic stops change through the years in the RI Traffic Stops Dataset?
* What is the distribution of the continuous variables in the Banknote Authentication Dataset?

### Stage 2: Data Visualization for Model Exploration (20 points)

Your code in this section goes into `stage_two.py`. You are expected to explore **two aspects of your choice** of your Machine Learning models - again, with at least one accompanying graph for each aspect.

In `utils.py`, we have built the code to build **four** different Machine Learning models (decision tree, k-nearest neighbors, logistic regression, **and** dummy classifier) – examples of how to use our code to get the trained models can be found in `sample.py`. Feel free to build your own ML models, change the code that we have provided for you in `utils.py`, etc. – whatever helps you produce the graphs!

**NOTE:** The 2 figures should be different types of graphs from each other and from the figures in Stage 1.

**Some examples of aspects that you can analyze:**

* What are the true positive, false positive, true negative, and false negative rates in each model? What can we say about the performance of each model using this? You may find plotting a [confusion matrix](https://medium.com/@dtuk81/confusion-matrix-visualization-fc31e3f30fea) useful to support your answer (see the sketch below).
* What is the change in performance as you tweak your models? That is, how does changing the `k` in the k-nearest neighbors algorithm impact the accuracy? What is the accuracy of a model that is trained using only 3 features, in comparison to those that use 4 or 5 features?
* How do your models do in comparison to a baseline/dummy model (a model that predicts using super simple heuristics like random guessing or most-likely-outcome guessing)?
* What is the decision-making process that your model used to make the predictions? What were [the splits that your decision tree made](https://mljar.com/blog/visualize-decision-tree/), or what are the coefficients that your model assigned to each feature? What can we say about these significant features in these models?

**Note:** In our ML models, we use sklearn’s `OneHotEncoder` to encode our non-numerical, categorical features. You can read more about [OneHotEncoder](https://scikit-learn.org/dev/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder) if you are not familiar with it, and the TAs are here to help!
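For instance, a confusion-matrix heatmap along the lines of the linked article might look like the sketch below. The labels here are placeholders; in `stage_two.py` they would come from your test split and a trained model's predictions (e.g., via the helpers in `utils.py`), and the class names `genuine`/`forged` are made up for illustration.

```python
# Sketch only: visualize a confusion matrix as an annotated heatmap.
# y_true / y_pred are placeholder labels, not real model output.
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

cm = confusion_matrix(y_true, y_pred)
ax = sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
                 xticklabels=["genuine", "forged"],
                 yticklabels=["genuine", "forged"])
ax.set_xlabel("Predicted label")
ax.set_ylabel("True label")
ax.set_title("Confusion matrix (placeholder data)")
plt.tight_layout()
plt.savefig("../graphs/confusion_matrix.png")
```

The diagonal cells count correct predictions, and the off-diagonal cells give you the false positive and false negative counts directly.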
## Part 2: Ethics Component (10 points)

1. (3 points) Think about what datasets you can represent through a data visualization (i.e., a graph), and list two scenarios where using a visualization aids a user’s experience and understanding of the dataset.
2. (4 points) Evaluate the accessibility of the graphs that you produced. Specifically:
   * Name two user groups who may have difficulty accessing your graphs and explain why. (Hint: Disability can be permanent, temporary, or situational, as mentioned in the above readings.)
   * Given more time and resources, list two actions that would make this dashboard more accessible to a wider audience. Talk about how you could implement these steps in the future - and if you could not, why? Reflect on these choices.
3. (3 points) Think about the stages of your design and implementation process. What did you find hard about factoring accessible visualization practices into your graphs? List two factors.

## Handing In

As per usual, please run `python3 zip_assignment.py` to zip the assignment and submit on Gradescope. The script will include all the files in your directory (e.g., all the code `.py` files, all your graphs, and the `writeup.md` of your report), except for the files in the `data` folder. After submitting, please make sure that your Gradescope submission includes your filled-out `writeup.md` report and all the code that you used to produce the graphs in your report.

**Note:** Manually zipping your files risks (1) not including some files that will be used as part of our grading, and (2) your code not upholding our anonymous grading policy. Please use the `zip_assignment.py` script to zip and submit, or submit directly through GitHub.

## Credits

This assignment was made by Nam (ndo3) in Summer 2021 and updated by Aakansha Mathur (amathur7) in Spring 2022 and Annabel (aroth7) and Alex (ading13) in Spring 2023.

## Appendix

### Appendix A: Stencil Code Structure

The structure of the stencil is as follows:

1. `code/`: Folder that contains all the code. You can make as many helper `.py` files as you want here, and they will all be included in the submission.
   - `stage_*.py`: the respective Python files in which you will write your code to produce the answers for the assignment.
   - `utils.py`: the helper functions that we have written for you, which you can import in your `stage_*.py` files using the following syntax: `from utils import *`. The functions include:
     - `get_ri_stops_df` and `get_banknote_df`: Functions to load the data (as a DataFrame)
     - `get_trained_model`: Function to get trained ML models (Logistic Regression, Decision Tree, K-Nearest Neighbors, and dummy baseline classifier)
     - `get_model_accuracy`: Function to evaluate trained ML models
   - `sample.py`: This contains examples of how you can draw certain kinds of plots using `Matplotlib` and `Plotly`, as well as of how you can use the functions in `utils.py`.
2. `data/`: Folder that contains all the data (`.csv` files) and their README files (which contain information on what each attribute means and the data type).
3. `graphs/`: Folder that should contain all the graphs that you will (1) include in `writeup.md`, and (2) submit to us.
4. `writeup.md`: You will include the graphs that you made in this assignment and your response to each stage in this file. Answers to the Socially Responsible Computing questions also go here.

### Appendix B: How to install a new package in your own virtual environment

**Note:** The instructions below only work on your own virtual environment. If you are using the official course virtual environment on the department machine (e.g., at `/course/cs1951a/venv`), you will not have the write permission to install new packages in the virtual environment – please reach out to the HTA if the package that you’re interested in using for the assignment is not available and you think it should be added to the official course virtual environment.
**Step by step instructions:** If you run into any roadblock following these steps, feel free to come to TA hours for more support!

Suppose you are trying to import a package `X` to use in your Python program. In this example, we will use the Seaborn package as `X` - using the import statement `import seaborn as sns`. However, Python is not happy about you using that statement, and gives you the error message `ModuleNotFoundError: No module named X` (in this case, `ModuleNotFoundError: No module named 'seaborn'`).

![ModuleNotFoundError in terminal](https://i.imgur.com/6M2WLeT.png)

* **Step 1:** Activate your course virtual environment (e.g., using the `cs1951a_venv` command that we set up in Homework 0, or using `source PATH/TO/YOUR/VIRTUAL/ENVIRONMENT/bin/activate`). Try Googling how to activate your virtual environment ([this page](https://stackoverflow.com/questions/14604699/how-to-activate-virtualenv-in-linux) might be helpful) if you don’t know how. This step is to make sure that your module is installed to the virtual environment with which you will run your code for the assignment.
* **Step 2:** In your terminal, type in `pip3 install <NAME-OF-MODULE>` or `python3 -m pip install <NAME-OF-MODULE>`. In the example above, the command would be `pip3 install seaborn` or `python3 -m pip install seaborn`. After you have successfully installed the module, the last line (or one of the last lines) displayed in your terminal should say `Successfully installed <MODULE-NAME>-<MODULE-VERSION>` (in my case, that would be `seaborn-0.11.1`).
* **Step 3:** See whether the module is successfully installed on your machine by running your program (the one that contains the import line) again. If it is not (the `ModuleNotFoundError` shows up again), feel free to come to TA hours for help! Be sure that you are in the virtual environment both when installing the module and when running the code that contains the module import statements.
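If you'd rather not rerun your whole program, a two-line sanity check (using Seaborn as the example module) also works - run it with the same `python3` that your activated virtual environment provides:

```python
# Run inside the activated virtual environment: a successful import
# and a printed version number mean the install landed in the right
# place; a ModuleNotFoundError means it went to a different environment.
import seaborn as sns

print(sns.__version__)  # e.g., 0.11.1
```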