Mini Project – Python and Pandas

Released: Friday, 18 April 2025
Due: Monday, 28 April 2025, 11:59PM EST. Late days are allowed (if you have any remaining) through Thursday, May 1st.

Note on TA hours: Although you are allowed to use up to 3 late days on this assignment, TA hours will reduce starting the first day of reading period (April 25).

As this is a final project, we expect you to do this largely on your own. TAs will be available to answer conceptual questions and help you talk about design ideas, but you need to do the programming on your own. Most importantly, we will not be helping you debug your code. The posted lecture materials summarize all the operations that you need; Pandas Tutor will help you practice the notation (and you are welcome to ask us questions on how notation from the lecture summaries works).

Also, be warned that we will run all submissions through a plagiarism detector, so please do your own work lest you end up before the academic code committee and potentially NC the course.

Important: allowed operations

As with all things Python, there are many ways to do tasks in pandas. We expect you to use the operations that we have been covering in class to do these problems. You will not receive credit for other approaches (in particular, there will be no credit for programs written with for loops). We've found that students in the past try to search for different pandas operations online and get stuck. Successful students plan out their approach to the problems by describing general table operations such as filtering, getting a row, building a column, etc, and then use our "provided building blocks" (just as we did in Pyret) from the note sheet from lecture to look up the specific syntax.

Some of the tasks below use pandas operations that were not covered in lecture. We provide guidance on the operations to use below.

Setup and Handin

Setup

Do not put your name anywhere in your project files.
Copy the following code into a file named mini_pandas.py. This contains all your necessary import statements, as well as the URLs needed for the datasets in the project.
Make sure you can run the file (to check that you have the libraries installed).

import pandas as pd
# imports for plotting
import matplotlib.pyplot as plt
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()

# URLs of the two datasets
patients_url = "https://brown-csci0111.github.io/assets/projects/mini-project/patients.csv"
visits_url = "https://brown-csci0111.github.io/assets/projects/mini-project/visits.csv"

# load the patient data 
patients = pd.read_csv(patients_url, header=0, names=["id","year","height","smoker","illnesses"])
visits = pd.read_csv(visits_url, header=0, names=['id', 'when', 'clinic', 'weight', 'heartrate'], parse_dates=['when'])

Handin

You will be submitting two Python files to Gradescope under Mini Project:

mini_pandas.py, which will contain the programming portion of your project
test_mini_pandas.py, which will contain the testing portion of the project

Remember your resources!

Python Testing and Style Guide
Pandas Lectures Notation Summary – from class lectures. Use this syntax for the tasks in this assignment.
Pandas Tutor – a beginner-oriented tool where you can see visualizations of key pandas operations and try out small bits of pandas code
Pandas Library Documentation – you shouldn't need this, but just in case you find it useful

Programming Portion

In this assignment, we'll be maintaining and analyzing information about patient visit data at a network of medical clinics. The core data is in two CSV files: patients.csv stores information like age, height, and known illnesses for patients in the network, while visits.csv tracks individual visits that patients made to clinics. Each visit results in a row indicating the weight and heart rate of the patient as well as which clinic they visited.

At a high level, the project has the following components:

cleaning, normalizing, and inspecting the patient dataset
updating the patients and visits tables as new info comes in
producing a couple of plots from the data
writing a set of tests on small test data

The type annotation for a dataframe is pd.DataFrame, e.g. def inspect(df : pd.DataFrame) -> pd.DataFrame:

Part 1: Data Cleaning and Inspection

The patients data table that you loaded isn't quite ready for use. The smoker column consists of strings where the actual data is boolean in nature. The illnesses column separates individual illnesses with semicolons, which is a common way to store lists in a csv file. In addition, there may be missing or implausible data in the file.

Your data cleaning and inspection pass should do the following with the patients table:

identify suspicious rows that have a height outside of the range 36-78 as well as rows where the smoking cell is missing
convert the sequence of illnesses to a list
convert the yes/no smoker column to Booleans ("yes" -> True, "no" -> False)

Task 1: Put item 1 in a function called inspect that takes a patient dataframe and returns a patient dataframe with the suspicious rows described in item 1. You don't have to remove these from the table, just identify them (in case someone wanted to go in and fix them). You can detect NaNs in a Series by calling .isnull() on the Series.

Task 2: Put items 2 and 3 in a function called prep_data that takes a patient dataframe as input and modifies that dataframe accordingly. In Python, we can split a string into a list around a separator as follows: "a,b,c".split(",") which would return ["a", "b", "c"].

Note: In order to apply split to an entire series, you need to tell Pandas that the Series contains string data. You do this with series.str.split(...), where series is an expression or name for your series.

Note: Trying to assign an empty list to a Series will raise an error in Pandas (the list is treated as giving the replacements for the entire series). Sanitize the entire column of illnesses to strings first (using "" in place of NaN), then use split to get an empty list from "".

Part 2: Updating Data

In this section, we make updates to the dataframes.

Task 3: Write a function record_illness that takes the patient dataframe, a patient id (int), and an illness (string) and adds the illness to that patient's list of illnesses. If you wish, you can first check whether the illness is there (this is not required though).

Reminder: Just like in Pyret, there is a difference between a Dataframe (Table) and Series (row or column). Similarly to Pyret, a filter-like operation on a pandas dataframe will produce a Dataframe, not a Series. In Pyret, you used row-n to get a row out of a Table. How did we learn to access a specific row in pandas? Hint: for this task, will you need to access a row by label or by index?

Task 4: Write a function new_visit that stores a new visit in the visits dataframe. The function takes in the visit dataframe and data for all of the columns for a new row in order. It should not return any outputs.

Note: The type annotation for a datetime stored in a dataframe is pd.Timestamp.

Hint: One way to do this is to build a simple list of the row values, then store that list under a new (last) row index using loc. Think about how to get the index for a new last row. You could also use pd.concat, but then you have the challenge of getting the concatenated dataset to be stored under the new name.

Part 3: Analysis

For the analysis part, we want you to produce two plots of the data.

Task 5: Write a function plot_weight_heart_data that takes a dataframe and produces a scatterplot of its heartrate vs. weight columns. Your function should check whether the given dataframe has columns with these names before generating the plot, raising a ValueError if it does not.

How do I raise a ValueError?

The Python syntax to raise an error is raise [Type of Error]([Error message as a string]), e.g. raise TypeError("Something went wrong")

Note: You can get the names of the columns in a DataFrame named df by writing df.columns.to_list().

Task 6: Write a function plot_visits_per_month that takes a visits dataframe and produces a line plot of the visits that were made each month within the dataframe.

Hint: groupby is your friend here. To extract the month from a column of dates in a DataFrame df, write df['when'].dt.month.

Task 7: Write a function visits_per_clinic which produces a line plot of visits from each of the clinics Oakdale, Beaumont, and Healthbridge in the dataset over time (using the months as the horizontal axis). Put them all in the same figure window (which will happen automatically unless you explicitly create a new window).

Part 4: Real World Data

The answers for the following two tasks can be written as comments at the bottom of your mini_pandas.py file.

Task 8: In this project, you’ve been working with medical data from hundreds of patients, as well as data about which hospitals they visited and when. One thing that is missing, however, is any metadata that describes how the data was collected and modified before you began your analysis.

Whenever programs or analysis involves personal data, it is imperative that you consider issues of data consent. Read this article by Keith Porcaro in Wired Magazine on the pitfalls of the current state of medical consent, and a possible solution.

Summarize, in your own words, what the author thinks is wrong with the current state of medical consent. What is his solution? Do you think it effectively addresses the issue? (4-5 sentences)

Task 9: Now, read this article about a misuse of patient data.

What responsibilities does the programmer take on when analyzing sensitive data, like the data in this project? If you were in a real-world situation analyzing data without accompanying metadata or an understanding of where the data came from, how would you proceed? (3-4 sentences)

Part 5: Testing

Your tests should go in a file called test_mini_pandas.py

For the testing portion, we are largely interested in the contents of your testing tables, as well as which cases you check using your testing tables. Pay attention to the variety of cases represented in your testing tables and how they align with the needs of the functions that you are testing.

Task 10: Create a small patients table and use it to test both the prep_data and inspect functions. Write separate testing functions for these (called test_prep_data and test_inspect).

Task 11: Create a small visits table and use it to generate analysis plots with the plotting functions in part 3 (so you can visually check that they work).

Information on Testing Dataframes

Manually creating dataframes for testing: You can manually create dataframes using lists of dictionaries (each item in the list represents a row, where the keys are the column names and the values are the cell values). The lab 11 handout gives examples of this.

To create a cell with datetime data: use pd.to_datetime(date/time string), where the date/time string looks like the ones in the dataframes we gave you. For instance, pd.to_datetime("2021-01-26 00:13:40").

To create a blank cell (NaN): at the top of your testing file, import numpy as np and use np.nan anywhere you want a blank cell.

Test that a dataframe has the correct contents: there are two ways to do this. One way is to manually check that the dataframe has the correct number of rows (using len) and that each value in every cell is correct (e.g. df["col"][0] == 5). Another way is to manually create the expected result as a dataframe, and use the built-in pandas and pytest integration. At the top of your testing file, put

from pandas.testing import assert_frame_equal

and to compare the dataframes df_actual and df_expected, use the line

assert_frame_equal(df_actual.reset_index(drop = True), df_expected.reset_index(drop = True), check_dtype = False)

in place of the usual assert statement. (The reset_index ignores index alignment issues, and the check_dtype ignores small differences in types, such as int32 vs. int64, which are different ways that your computer stores ints under the hood.)

Part 6: SRC Feedback

We hope that you enjoyed the SRC components of your assignments! This week, we’re gathering feedback on the SRC content of this course in order to improve the content for cs111 next semester!

Task 12: Please fill out this Google Form! In order to award credit to those who fill out this form, we will be collecting email addresses. Please use your brown.edu email! When looking through form responses, however, we will not be looking at them by email. This is just to give you credit for submitting a response. Please be thoughtful in your responses (i.e. don’t give one word answers) and make sure that they're constructive.

Information on Grading

This project is where you get to demonstrate what you have learned this semester.

For functionality, we'll be checking whether your programs do what they are supposed to.

For design, we'll be looking at the organization of your code: whether you've created helpers appropriately, included doc strings and signaled types for your inputs (whether through type annotations or well-chosen names). We'll also be looking at your choices of operations and how you organized the computations.

For testing, we'll be looking at the content of your sample tables as well as the assertions that you chose to use to test the functions listed in part 4.

There aren't any data points on this assignment.

Congratulations on completing your final CS0111 assignment! On behalf of the TA staff, it's been great getting to work with you all. Best of luck on the final!

Brown University CSCI 0111 (Spring 2025)
Do you have feedback? Fill out this form.