---
title: Project 3 Fall-2022
tags: Projects-F22, Project 3
---

# Mini Project -- Python and Pandas

![panda image](https://img.freepik.com/free-vector/cute-panda-listening-music-with-headphones_138676-2058.jpg?w=2000)

**Due:** Friday, December 9, at 11:59PM EST. Late days are allowed (if you have any remaining) through Monday, December 12.

**HW4 Makeup Work:** consists of part 3 and one task in part 5. These parts are clearly marked. You don't need to do them if you are happy with your current hw4 grades. If you choose to do them, you can earn back points in functionality, design, and testing for hw4.

**Note on TA hours:** Although you are allowed to use up to 3 late days on this assignment, TA hours will be reduced after December 10th.

:::info
As this is a final project, we expect you to do this largely on your own. TAs will be available to answer conceptual questions and help you talk about design ideas, but you need to do the programming on your own. **We will not be helping you debug your code.** The posted lecture materials summarize all the operations that you need; the [Pandas Tutor](https://pandastutor.com/) will help you practice the notation (and you are welcome to ask us questions on how notation from the lecture summaries works).

Also be warned that we will run all submissions through a plagiarism detector, so please do your own work lest you end up before the academic code committee and potentially NC the course.
:::

:::danger
As with all things Python, there are many ways to do tasks in pandas. We expect you to use the style of pandas programming that we have been covering in class to do these problems. You will not receive credit for other approaches (in particular, **there will be no credit for programs written with `for` loops**).
:::

## Setup and Handin

### Setup

- Do **not** put your name anywhere in your homework files.
- Copy the following code into a file named `mini_pandas.py`.
  This contains all your necessary import statements, as well as the URLs needed for the datasets in the project.
- Make sure you can run the file (to check that you have the libraries installed)

```python
import pandas as pd
import math  # for the ceiling function

# imports for plotting
import matplotlib.pyplot as plt
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()

# URLs of the three datasets
patients_url = "https://brown-csci0111.github.io/assets/projects/mini-project/patients.csv"
visits_url = "https://brown-csci0111.github.io/assets/projects/mini-project/visits.csv"
bmi_url = "https://brown-csci0111.github.io/assets/projects/mini-project/bmi.csv"

# load the patient data
patients = pd.read_csv(patients_url, header=0,
                       names=["id", "year", "height", "smoker", "illnesses"])
visits = pd.read_csv(visits_url, header=0,
                     names=['id', 'when', 'clinic', 'weight', 'heartrate'],
                     parse_dates=['when'])
BMI_table = pd.read_csv(bmi_url, header=0,
                        names=["height", "weight_low", "weight_high", "category"])
```

:::spoiler If you get an error on matplotlib ...
![](https://i.imgur.com/8sEOr7i.png)
:::

### Handin

You will be submitting one Python file to Gradescope under Mini Project:

- `mini_pandas.py`, which will contain all of the code for your project

### Remember your resources!

- [Python testing and style guide](https://hackmd.io/@cs111/python_guide)
- [TA hours](https://brown-csci0111.github.io/calendar)
- [EdStem](https://edstem.org/us/courses/27983)
- [Pandas lectures notation summary](https://docs.google.com/document/d/1B1jxBYfSvL66860DgJxRBfUgizENuwPC0O3PG-AIDRI/edit) -- the file posted from class lectures
- [Pandas Tutor](https://pandastutor.com/) -- a beginner-oriented tool where you can see visualizations of key pandas operations and try out small bits of pandas code
- [Pandas cheat sheet](https://github.com/pandas-dev/pandas/blob/main/doc/cheatsheet/Pandas_Cheat_Sheet.pdf) -- a visual showing key pandas operations, more detailed than Pandas Tutor if needed
- [pandas library documentation](https://pandas.pydata.org/docs/) -- you shouldn't need this, but just in case you find it useful

## Programming Portion

In this assignment, we'll be maintaining and analyzing information about patient visit data at a network of medical clinics. The core data are in two csv files: `patients.csv` stores information like age, height, and known illnesses for patients in the network, while `visits.csv` tracks individual visits that patients made to clinics. Each visit results in a row indicating the weight and heart rate of the patient as well as which clinic they visited.

At a high level, the project has the following components:

- cleaning, normalizing, and inspecting the patient dataset
- updating the patients and visits tables as new info comes in
- [optional HW4 makeup] Performing a health check on a patient using an additional table of data about healthy weight ranges
- producing a couple of plots from the data
- writing a set of tests on small test data

## Part 1: Data Cleaning and Inspection (everyone)

The patients data table that you loaded isn't quite ready for use. The smoker column consists of strings where the actual data is boolean in nature.
The illnesses column separates individual illnesses with semicolons, which is a common way to store lists in a csv file. In addition, there may be missing or implausible data in the file.

Your data cleaning and inspection pass should do the following with the patients table:

1. identify suspicious rows that have a height outside of the range 36-78 as well as rows where the smoking cell is missing
2. convert the sequence of illnesses to a list
3. convert the yes/no smoker column to Boolean (`yes` -> True, `no` -> False)

**Task:** put item 1 in a function called `inspect` that takes a patient dataframe and returns a patient dataframe with the suspicious rows described in item 1. You don't have to remove these from the table, just identify them (in case someone wanted to go in and fix them).

**Task:** put items 2 and 3 in a function called `prep_data` that takes a patient dataframe as input and modifies that dataframe accordingly.

In Python, we can split a string into a list around a separator as follows: `"a,b,c".split(",")`, which would return `["a", "b", "c"]`.

**Note [NEW]:** In order to apply `split` to an entire series, you need to tell Pandas that the Series contains string data. You do this with `series.str.split(...)`, where `series` is an expression or name for your series.

**Note [NEW]:** Trying to assign an empty list to a Series will raise an error in Pandas (the list is treated as giving the replacements for the entire series). Sanitize the entire column of illnesses to strings first (using ``""`` in place of `NaN`), then use `split` on the whole column. Be aware that splitting ``""`` yields `[""]`, a list containing one empty string, rather than a truly empty list.

## Part 2: Updating Data (everyone)

In this section, we make updates to the dataframes.

**Task:** Write a function `record_illness` that takes a patient dataframe, a patient id (int) and an illness (string) and adds the illness to that patient's list of illnesses. If you wish, you can first check whether the illness is there (this is not required though).

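
As a small illustration of the `split` notes from Part 1, here is a sketch on a made-up Series (the name `codes` and its values are hypothetical, not the project data). It only shows the notation for filling in missing cells and splitting a whole Series; it is not a solution to any task.

```python
import pandas as pd

# Toy Series (hypothetical values); None stands in for a missing cell
codes = pd.Series(["a;b;c", "d", None])

# Replace missing values with "" so every cell is a string,
# then split every cell around ";" using the .str accessor
as_lists = codes.fillna("").str.split(";")

# The missing cell became "" and then split into a list holding one empty string
print(as_lists.to_list())
```

Note that the last cell ends up as `[""]` rather than `[]`, since Python's `split` always returns at least one piece when given a separator.
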
**Reminder:** In pandas you can access a specific row of a dataset using `.iloc[n]`. This is equivalent to `row-n(n)` from Pyret.

**Task:** Write a function `new_visit` that stores a new visit in the visits dataframe. The function takes in a visit dataframe and data for all of the columns for a new row in order. It should not return any outputs.

~~**Hint:** One way to do this is to manually build a single-row dataframe from the given data, then use `pd.concat` to combine the two dataframes into one.~~

**Hint [NEW]:** One way to do this is to build a simple list of the row values, then store that list under a new (last) row index using `loc` (yes, `loc`, not `iloc`). Think about how to get the index for a new last row. You could also use `pd.concat`, but then you have the challenge of getting the concatenated dataset to be stored under the new name.

## Part 3: Patient Health Checks (for those making up HW4)

*This part is optional. If you complete it, points earned for this section will be used to offset hw4 grades. You can do this part whether or not you submitted hw4. If you are happy with your current hw4 grade, you can skip this with no adverse impact on your grades.*

The network monitors its data to identify patients at heightened risk for health problems. The monitor looks at a patient's smoking status, heart rate, and BMI (an assessment of weight that accounts for someone's height).

**Prep Task** (nothing to turn in): Look at the BMI table provided in the starter code. Column A is a person's height in inches. Columns B and C give a weight range for the corresponding heights. Column D gives the rating category for that person's weight based on height. Part of your work on this problem will be to look up a category based on weight and height.

**Task:** Write a function `at_risk` that takes a single-row visit dataframe and a patient dataframe (in that order) and returns a boolean.
The function returns true if the patient who made the visit is a smoker with a heart-rate over 120 (at the time of visit) whose height and weight (from the visit) indicate that they are obese.

**Note:** It is up to you to break this down into helper functions. We're intentionally not telling you what to do here. That said, bear in mind reasons for helpers: readability of code, reuse of common computations, testability of parts of computations.

## Part 4: Analysis (everyone)

For the analysis part, we want you to produce and discuss two plots of the data.

**Task:** Write a function `plot_weight_heart_data` that takes a dataframe and produces a scatterplot of its `heartrate` vs `weight` columns. Your function should check whether the given dataframe has columns with these names before generating the plot, raising a `ValueError` if it does not.

**Note:** you can get the names of the columns in a DataFrame named `df` by writing `df.columns.to_list()`.

**Task:** Write a function `plot_visits_per_month` that takes a visits dataframe and produces a line plot of the visits that were made each month within the dataframe.

**Hint:** `groupby` is your friend here. To extract the month from a column of dates in a DataFrame `df`, write `df['when'].dt.month`.

**Task [UPDATED 12/3]:** Write a function `visits_per_clinic` which produces a line plot of visits from each of the clinics Oakdale, Beaumont, and HealthBridge in the dataset over time (using the months as the horizontal axis). Put them all in the same figure window (which will happen automatically unless you explicitly create a new window).

## Part 5: Testing (everyone)

For the testing portion, we are largely interested in the contents of your testing tables, as well as which cases you check using your testing tables. Pay attention to the variety of cases represented in your testing tables and how they align with the needs of the functions that you are testing.

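
To illustrate what a small testing table with a variety of cases might look like, here is a minimal sketch. The helper `heavy_visits` and the function `test_heavy_visits` are hypothetical and not part of the assignment; they only show the general shape of building a tiny table by hand and asserting on the result.

```python
import pandas as pd

def heavy_visits(visits: pd.DataFrame, limit: float) -> pd.DataFrame:
    """Hypothetical helper (not part of the assignment):
    returns the rows whose weight is strictly above limit."""
    return visits[visits["weight"] > limit]

def test_heavy_visits():
    # Small hand-built table covering below, exactly at, and above the limit
    small = pd.DataFrame({"id": [1, 2, 3], "weight": [149.0, 150.0, 151.0]})
    result = heavy_visits(small, 150)
    assert result["id"].to_list() == [3]

    # A table with no rows should give an empty result rather than an error
    empty = small[small["weight"] < 0]
    assert len(heavy_visits(empty, 150)) == 0

test_heavy_visits()
```

The table is deliberately tiny: each row exists to exercise one case (below the boundary, on it, above it), which makes it easy to predict the expected output by hand.
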
**Task:** Create a small patients table and use it to test both the `prep_data` and `inspect` functions. Write separate testing functions for these (called `test_prep_data` and `test_inspect`).

**Task:** Create a small visits table and use it to generate analysis plots with the plotting functions in part 4 (so you can visually check that they work).

**Task (HW4 Testing Makeup):** Develop a solid set of tests for the `at_risk` function and any complicated helpers that support it. You may wish to develop a separate small patients table designed specifically for testing this function (though this is not required). Put your tests in a function called `test_at_risk` (with separate test functions for any helpers that you decide should be tested).

## Information on Grading

This project is where you get to demonstrate what you have learned this semester.

For **functionality**, we'll be checking whether your programs do what they are supposed to.

For **design**, we'll be looking at the organization of your code: have you created helpers appropriately, included docstrings, and signaled types for your inputs (whether through type annotations or well-chosen names)? We'll also be looking at your choices of operations and how you organized the computations.

For **testing**, we'll be looking at the content of your sample tables as well as the assertions that you chose to use to test the functions listed in part 5.

There aren't **data** points on this assignment.

For those making up for homework 4, the expectations are the same as listed above, but we will single out your scores on the hw4-labeled questions (with the other questions counting towards the mini-project grade).

## Theme Song

[Kung Fu Fighting](https://www.youtube.com/watch?v=bmfudW7rbG0) by Carl Douglas

Congratulations on completing your final CS0111 homework assignment! On behalf of the TA staff, it's been great getting to work with you all. Best of luck with the final!

![](https://i.imgur.com/A9Szjd9.png)