# Mini Project -- Python and Pandas

### Theme Song: [Closing Time](https://youtu.be/970Lq2M_ld0?si=IqTJgJeHFtW2B5YO)
**Released:** Monday, November 24th 2025
**Due:** Tuesday, December 9th 2025, 11:59PM EST. Late days are allowed (if you have any remaining)
**Note on TA hours:** Although you are allowed to use up to 3 late days on this assignment, TA hours will be reduced starting the first day of reading period (December 7th).
:::info
## Expectations
As this is a final project, we expect you to do this largely on your own. TAs will be available to answer conceptual questions and help you talk about design ideas, but you need to do the programming on your own. **Most importantly, we will not be helping you debug your code.** All you need are your knowledge of tables, the [pandas lesson (available on Ed)](https://edstem.org/us/courses/84795/lessons/152922/slides/889711), which summarizes the class notes, and the hints given in the assignment.
We will also be exposing very few Autograder tests for you. Part of your job is to come up with thorough-enough testing tables and test cases to be confident that your code is correct without relying on the autograder.
At the end of the project, we'll ask you to fill out an anonymous form describing your AI use (if any) for this project in order to inform how we revise this project in the future (possibly to incorporate some explorations of what it looks like to use AI for these tasks, as you did in one of the labs). We expect that, if you do use AI for this project, you do your due diligence in testing and understanding the output. (This is also partly why we are removing visibility of most autograder tests.) You need to stand behind and be able to explain any work you've turned in. In "real life," people use AI tools to learn new libraries such as pandas, but they take responsibility for the product they put out into the world. With the form, we want to see how intro-level undergraduates approach this concept!
Also, be warned that we will run all submissions through a plagiarism detector, so please do your own work lest you end up before the academic code committee and potentially NC the course.
:::
:::danger
## Important: allowed (and disallowed) operations
As with all things Python, there are many ways to do tasks in pandas. **We expect you to use the operations that we have been covering in class to do these problems.** You will not receive credit for other approaches. In particular, there will be no credit for:
- `for` loops that go through every row of a DataFrame
- `query`
- `fillna`
- `apply` (except for tasks 7-8)
We've found that students in the past try to search for different pandas operations online or use AI tools and get stuck. Successful students plan out their approach to the problems by describing general table operations such as filtering, getting a row, building a column, etc, and then use our "provided building blocks" (just as we did in Pyret) from the [notes](https://docs.google.com/document/d/1496xcmhAUCXYe7eqc30RN1nRDkrutR3XirzhdztYTSc/edit?usp=sharing) to look up the specific syntax.
Some of the tasks below use pandas operations that were not covered in lecture. We provide guidance on how to use these operations below.
In order to demonstrate understanding of some of the operations, you'll be asked to draw out what your code is doing step-by-step for some of the tasks.
:::
## Setup and Handin
### Setup
- Do **not** put your name anywhere in your project files.
- Make a fork of [this](https://edstem.org/us/courses/84795/workspaces/pl5S6XThkMDUfjdLjWfktemUw7csjUMR) Ed Workspace and make sure you can run `mini_pandas.py` (which should print out two DataFrames). When the "remote app" window pops up (this is where your plots will appear), you can exit out of it/minimize it and then get back to it through the toolbar, as shown in class/the notes.
### Handin
You will be submitting three files to Gradescope under Mini Project:
- `mini_pandas.py`, which will contain the programming portion of your project
- `test_mini_pandas.py`, which will contain the testing portion of the project
- `mini_pandas_steps.pdf`, which will contain your drawing of what transformations are done in Task 3.
### Remember your resources!
- [Python guide for Ed Workspaces](https://hackmd.io/@cs111/python-guide-ed#Writing-and-running-tests)
- [Python Testing and Style Guide](https://hackmd.io/@cs111/python-style-guide-2025)
- [CS111 notes for pandas](https://edstem.org/us/courses/84795/lessons/152922/slides/889711) -- from class lectures. **Use this syntax for the tasks in this assignment.** You can also access this link through the "lessons" icon in Ed (similarly to how you access workspaces).
- [Python built-ins documentation](https://docs.python.org/3/library/stdtypes.html)
- [Pandas Library Documentation](https://pandas.pydata.org/docs/) -- you shouldn't need this, but just in case you find it useful
## Programming Portion
In this assignment, we'll be maintaining and analyzing information about patient visit data at a network of medical clinics. The core data is in two CSV files: `patients.csv` stores information like age, height, and known illnesses for patients in the network, while `visits.csv` tracks individual visits that patients made to clinics. Each visit results in a row indicating the weight and heart rate of the patient as well as which clinic they visited.
At a high level, the project has the following components:
- cleaning, normalizing, and inspecting the patient dataset
- computing some information about patient visits
- producing a couple of plots from the data
Throughout, you will be writing a set of tests on smaller test data.
:::info
The type annotation for a dataframe is `pd.DataFrame`, e.g. `def inspect(df : pd.DataFrame) -> pd.DataFrame:`
:::
## Part 1: Data Cleaning and Inspection
The patients data table that you loaded isn't quite ready for use. The smoker column consists of strings where the actual data is boolean in nature. The illnesses column separates individual illnesses with semicolons, which is a common way to store lists in a CSV file. In addition, there may be missing or implausible data in the file.
Your data cleaning and inspection pass should do the following with the patients table:
1. identify suspicious rows that have a height outside of the range 36-78, as well as rows where the smoker cell is missing
2. convert the sequence of illnesses to a list
3. convert the yes/no smoker column to Booleans (`"yes"` -> True, `"no"` -> False)
**Task 0:** Take a look at the `patients` DataFrame. Based on the tasks above, make one smaller test DataFrame that contains not-yet-cleaned data in `test_mini_pandas.py` (if you make it as a global variable, you'll be able to use it in all of your testing functions). You might end up revising this DataFrame as you work through the rest of the tasks, but make sure it contains data that matches the "messiness" of the given `patients` DataFrame. See the [testing section](#Information-on-Testing-Dataframes) for hints on setting up your DataFrame.
**Task 1:** Put item 1 in a function called `inspect` that takes a patient dataframe and returns a patient dataframe with the suspicious rows described in item 1. Your function **shouldn't mutate the input DataFrame**, just identify and return the rows (in case someone wanted to go in and fix them). You can detect `NaN`s in a Series by calling `.isnull()` on the Series.
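As a refresher on the mechanics (not the solution itself), here is a minimal sketch of `.isnull()` and mask-based filtering on a toy DataFrame with made-up column names:

```python
import pandas as pd
import numpy as np

# Toy DataFrame (hypothetical data, not the patients table)
df = pd.DataFrame({"name": ["a", "b", "c"],
                   "score": [10, np.nan, 7]})

# .isnull() on a Series gives a boolean mask: True where the cell is NaN
mask = df["score"].isnull()

# Using the mask with .loc selects just the rows with missing scores;
# df itself is not mutated
missing = df.loc[mask]
```

Combining masks (e.g. with `|`) works the same way as in the class notes.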
**Task 2:** Put items 2 and 3 in a function called `prep_data` that takes a patient dataframe as input and **mutates** that DataFrame accordingly. You'll have to apply a built-in string method to a Series for item 2. Our pandas notes talk about how to do a similar operation.
***Note:** Trying to assign an empty list to a Series will raise an error in Pandas (the list is treated as giving the replacements for the entire series). Sanitize the entire column of illnesses to strings first (using `''` in place of `NaN`) and then convert to lists. **It is okay (and expected) that the empty cells will be transformed to lists containing the empty string (`['']`) -- do not try to fix this.** Printing out the result of `prep_data` won't show this in the terminal; printing out specific cells or writing tests that check the result of `prep_data` will.*
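To illustrate the sanitize-then-split pattern on toy data (the column name here is made up), without using the disallowed `fillna`:

```python
import pandas as pd
import numpy as np

# Toy DataFrame with a semicolon-separated column (hypothetical data)
df = pd.DataFrame({"tags": ["red;blue", np.nan, "green"]})

# First replace the blank cells with the empty string using a mask...
df.loc[df["tags"].isnull(), "tags"] = ""

# ...then split every cell on the separator; .str applies the string
# method across the whole Series
df["tags"] = df["tags"].str.split(";")

# The blank cell becomes [''], as the note above describes
```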
**Task 3:** Draw out what each line of `prep_data` is doing on a small DataFrame (~5 rows). Some lines might require you to make multiple drawings. See the dropdown for an example of what we're looking for.
We are looking for you to identify/highlight:
- which Series are being used for which part of the computation (highlighting each part of the computation and highlighting the corresponding Series with the same color will help)
- What intermediate results are being computed
- how specific rows are being selected using masks
:::spoiler An example of what we're looking for
In the last page of the notes, we had this code:
```
# build a total cost column with $10/ticket:
events['total'] = events['numtix'] * 10
# apply a 10% discount for students only:
events.loc[events['discount'] == 'student', 'total'] = events['total'] * 0.9
```
If `events` were the simple DataFrame depicted below, the drawing for the first line (`events['total'] = events['numtix'] * 10`) would look like

and the drawing for the second line (`events.loc[events['discount'] == 'student', 'total'] = events['total'] * 0.9`) would look like:

:::
You might want to hold off on writing thorough test functions for tasks 1-2 until you're confident that your test DataFrame is thorough enough for all of the tasks, but you should still run both of the functions you wrote on `patients` as well as your test DataFrame to be confident that your functions are working as expected. In particular, the tasks below expect that you're working on a DataFrame that has been cleaned by calling `prep_data`, so be confident that this function works before moving on.
## Part 2: Updating Data
In this section, we make updates to the dataframes.
**Task 4:** Write a function `new_visit` that stores a new visit in the visits dataframe. The function takes in the visit DataFrame and data for all of the columns for a new row in order. It should not return any outputs.
***Note:** The type annotation for a datetime stored in a dataframe is `pd.Timestamp`. We'll work more with datetimes below.*
***Note:** The best way to do this is to build a simple list of the row values, then store that list under a new (last) row index using `loc`. Think about how to get the index for a new last row.*
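For example, on a toy DataFrame with the default row index (hypothetical columns), `len` gives the index for a new last row:

```python
import pandas as pd

# Toy DataFrame (hypothetical columns)
df = pd.DataFrame({"id": [1, 2], "score": [10, 20]})

# The existing row indices are 0 and 1, so len(df) (which is 2) is
# the index for a new last row
df.loc[len(df)] = [3, 30]   # values are given in column order
```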
**Task 5:** Oakdale realized they had a miscalibration for all of their heart rate measuring devices for one month. Write a function called `fix_hr` that takes in a visits DataFrame and adds 5 to the `heartrate` column for all visits that happened in Oakdale in September (the 9th month). The function should not return anything, nor should it add any new columns to the input DataFrame.
***Note:** to work on just the month field of a datetime Series named ser, write `ser.dt.month` (similarly, `.day`, `.hour`, `.year`, etc). More examples are given [here](https://pandas.pydata.org/docs/getting_started/intro_tutorials/09_timeseries.html#using-pandas-datetime-properties).*
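Here is a toy sketch (made-up data, not the visits table) of extracting the month and using it in a masked update:

```python
import pandas as pd

# Toy DataFrame with a datetime column (hypothetical data)
df = pd.DataFrame({
    "when": pd.to_datetime(["2021-09-03 10:00:00", "2021-10-05 14:30:00"]),
    "count": [4, 7],
})

# .dt.month extracts the month number from each datetime in the Series
months = df["when"].dt.month   # 9 and 10

# Combined with .loc, this updates only the September rows in place
df.loc[df["when"].dt.month == 9, "count"] = df["count"] + 1
```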
**Task 6:** Write a helper function (you'll use this in a later task) called `num_visits` which takes in a patient id (`int`) and a DataFrame of visits and returns the number of visits this patient has made. Test out your function (either by writing tests for it or by calling it in the interpreter).
**For tasks 7-8, make sure you're working with a version of `patients` (or your test DataFrame) that you've already called `prep_data` on.**
**You do not have to turn in anything for tasks 7a-7c. Just run the expressions one after the other. You'll likely need to use the expressions from 7a and 7b in task 8.**
**Task 7a:** Now, run the expression
```
patients['visits'] = patients['id'].apply(lambda i: num_visits(i, visits))
```
on your `patients` DataFrame. Based on what you observe, what is the `apply` function doing? (You don't have to hand in your answer, but you will have to figure out how to use `apply` in the next task).
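If it helps to see `apply` in isolation first, here is a toy example (made-up data) of applying a function to a Series:

```python
import pandas as pd

# Toy Series (hypothetical data)
ser = pd.Series([1, 2, 3])

# .apply calls the given function on each element and collects the
# results into a new Series
doubled = ser.apply(lambda x: x * 2)
```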
**Task 7b:** Now, write an expression similar to the one in task 7a that adds a column (Series of bools) called 'comorbid' to `patients`, where the value should be True in any row where a patient has two or more illnesses, and False otherwise. This task will use `.apply`.
**Task 7c:** Try running the expression
```
patients.groupby('comorbid')['visits'].mean()
```
What does this output? How do the different pieces of `groupby` work? You can use other summary functions in `groupby` -- try `.sum()` and `.count()`. Also try providing different column labels instead of 'comorbid' and 'visits'. Which ones work and which ones don't? You'll come back to this in Task 9. For now, let's manually create the same result and put it in a `dict`.
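To see the `groupby` pieces in isolation on toy data (made-up column names):

```python
import pandas as pd

# Toy DataFrame (hypothetical data)
df = pd.DataFrame({"team": ["a", "a", "b"], "points": [1, 3, 5]})

# groupby('team') splits the rows into one group per team value;
# ['points'].mean() then averages the points column within each group
avg = df.groupby("team")["points"].mean()
# avg is a Series indexed by team value: "a" -> 2.0, "b" -> 5.0
```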
**Task 8:** Using your helper function from task 6 and what you learned in task 7, write a function called `visits_comorbid` that takes in a DataFrame of patients and a DataFrame of visits and returns a **`dict`** from `bool`s to `float`s (or `np.float`s). The value associated with the key `True` should be the average number of visits across all patients who have a comorbidity (2 or more diseases) and the value associated with the key `False` should be the average number of visits across all patients who do not have a comorbidity.
Notes:
- You should **not** assume the columns 'visits' and 'comorbid' exist in the given DataFrame (just to be safe, you can comment out your code from 7a and 7b), but your function **can** mutate the given DataFrame to add those columns
- Since the dictionary you're returning has two keys, you can just build it up manually without worrying too much about code redundancy
## Part 3: Analysis
For the analysis part, we want you to produce three plots of data. **For each of the functions in tasks 9-11, the function should raise a `ValueError` if the necessary column(s) don't already exist in the DataFrame. Writing a helper function to check this will avoid redundant code.**
:::spoiler How do I raise a `ValueError`?
The Python syntax to raise an error is `raise [Type of Error]([Error message as a string])`, e.g. `raise TypeError("Something went wrong")`
:::
***Note:** You can get the names of the columns in a DataFrame named `df` by writing `list(df.columns)`.*
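Putting those two pieces together, a column-checking helper might look like this sketch (the function name and error message are just examples, not a required interface):

```python
import pandas as pd

# Sketch of a helper (hypothetical name) that raises a ValueError if
# any needed column is missing from the DataFrame
def check_columns(df: pd.DataFrame, needed: list) -> None:
    for col in needed:
        if col not in list(df.columns):
            raise ValueError("missing column: " + col)

df = pd.DataFrame({"x": [1], "y": [2]})
check_columns(df, ["x", "y"])   # no error: both columns exist
```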
**Task 9:** Write a function `plot_smoker_year` which takes in a DataFrame of patients (with 'year' and 'smoker' columns) and plots the total number of smokers (y-axis) by birth year (x-axis) as a line graph. This function should use `groupby`, which you saw in Task 7c.
***Note:** If `ser` is a Series produced by `groupby`, you can call `ser.plot.line()` without any inputs to get a line graph of the Series. Not including the error checking described at the beginning of this section and the necessary `plt.show()` line, this function should only take two lines of code: the `groupby` and the aforementioned expression with `.plot.line()`.*
**Task 10:** Write a function `plot_weight_heart_data` that takes a DataFrame with 'heartrate' and 'weight' columns and produces a scatterplot of the data in those columns.
**Task 11:** Write a function called `plot_time_of_day` which takes in a DataFrame of visits (with 'when' and 'clinic' columns) and plots a boxplot of the distribution of **hours of day** of the visits to each clinic. The boxplot should have one box per clinic. The function **is** allowed to mutate the given DataFrame to add any necessary columns (*Hint: look back at the note for task 5*). Your code should not directly use `groupby`.
***Note:** for this one, you will have to read [the documentation and examples for boxplot](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.boxplot.html). There is an optional argument that makes boxplot perform an intermediate `groupby`, which will be helpful. None of the examples do *exactly* what you need, but they can offer some hints as to what you might try to do.*
:::spoiler What we're looking for

Since there are three clinics, there should be three boxes. (Horizontal guidelines optional. Your data will have different means/distributions).
:::
## Part 4: Real World Data
*The answers for the following two tasks can be written as comments at the bottom of your `mini_pandas.py` file.*
**Task 12:** In this project, you’ve been working with medical data from hundreds of patients, as well as data about which hospitals they visited and when. One thing that is missing, however, is any metadata that describes how the data was collected and modified before you began your analysis.
Whenever programs or analysis involves personal data, it is imperative that you consider issues of data consent. Read [this article](https://drive.google.com/file/d/1aeJUANK6_hXsdfVWBayMVYv6ZrSnzdyn/view?usp=sharing) by Keith Porcaro in Wired Magazine on the pitfalls of the current state of medical consent, and a possible solution.
Summarize, in your own words, what the author thinks is wrong with the current state of medical consent. What is his solution? Do you think it effectively addresses the issue? (4-5 sentences)
**Task 13:** Now, read [this article](https://www.healthcaredive.com/news/unitedhealth-algorithm-lawsuit-care-denials/699834/) about a misuse of patient data.
What responsibilities does the programmer take on when analyzing sensitive data, like the data in this project? If you were in a real-world situation analyzing data without accompanying metadata or an understanding of where the data came from, how would you proceed? (3-4 sentences)
## Part 5: Testing
**Your tests should go in a file called `test_mini_pandas.py`**
For the testing portion, we are largely interested in the contents of your testing tables, as well as which cases you check using your testing tables. Pay attention to the variety of cases represented in your testing tables and how they align with the needs of the functions that you are testing.
**Task 14:** Revise your testing DataFrame from Task 0 and use it to write assertions that test `prep_data`, `inspect`.
**Task 15:** Create a small visits DataFrame and use it to write assertions that test `fix_hr` and `num_visits`.
**Note:** You do **not** have to write tests for `visits_comorbid` or the plotting functions, but you should call all of these functions with your testing tables to check that they work. We are not asking you to test `visits_comorbid`, but if you're curious, you could do it using the numpy library function called [`assert_almost_equal`](https://numpy.org/doc/2.3/reference/generated/numpy.testing.assert_almost_equal.html).
:::info
### Information on Testing Dataframes
**Manually creating dataframes for testing:** You can manually create dataframes using lists of dictionaries (each item in the list represents a row, where the keys are the column names and the values are the cell values) or dictionaries of lists (each key in the dictionary represents a column label, and the lists are the column values, starting with row 0 as the first element). The lab 11 handout gives examples of this.
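For instance, the same toy two-row table can be built either way (column names made up):

```python
import pandas as pd

# 1. List of dictionaries: each dict is one row
rows = pd.DataFrame([{"id": 1, "name": "a"},
                     {"id": 2, "name": "b"}])

# 2. Dictionary of lists: each key is a column label
cols = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})

# Both produce the same two-row DataFrame
```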
To create a cell with `datetime` data: use `pd.to_datetime(date/time string)`, where the date/time string looks like the ones in the dataframes we gave you. For instance, `pd.to_datetime("2021-01-26 00:13:40")`.
To create a blank cell (`NaN`): at the top of your testing file, `import numpy as np` and use `np.nan` anywhere you want a blank cell.
---
**Test that a dataframe has the correct contents:** there are two ways to do this. One way is to manually check that the dataframe has the correct number of rows (using `len`) and that each value in every cell is correct (e.g. `df["col"][0] == 5`). Another way is to manually create the expected result as a dataframe, and use the built-in pandas and pytest integration. At the top of your testing file, put
`from pandas.testing import assert_frame_equal`
and to compare the dataframes `df_actual` and `df_expected`, use the line
`assert_frame_equal(df_actual.reset_index(drop = True), df_expected.reset_index(drop = True), check_dtype = False)`
in place of the usual `assert` statement. (The `reset_index` ignores index alignment issues, and the `check_dtype` ignores small differences in types, such as int32 vs. int64, which are different ways that your computer stores ints under the hood.)
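A complete toy comparison using this pattern (made-up data) looks like:

```python
import pandas as pd
from pandas.testing import assert_frame_equal

# Toy actual/expected DataFrames (hypothetical data)
df_actual = pd.DataFrame({"x": [1, 2]})
df_expected = pd.DataFrame({"x": [1, 2]})

# Passes silently when the DataFrames match; raises an error with a
# description of the difference otherwise
assert_frame_equal(df_actual.reset_index(drop=True),
                   df_expected.reset_index(drop=True),
                   check_dtype=False)
```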
:::
## Part 6: SRC and AI feedback
We're looking for feedback on how you used generative AI for this assignment (if at all) and how you thought about the SRC (Socially Responsible Computing) content this semester.
**Task 16:** Please fill out this [Google Form](https://forms.gle/cGRT6L88T77ABPgy5)! This is anonymous, but you filling out the form *really* helps. Please be **thoughtful** in your responses (i.e. don’t give one word answers) and make sure that they're **constructive**.
Congratulations on completing your final CS0111 assignment! On behalf of the TA staff, it's been great getting to work with you all. Best of luck on the final!

----------------------
> Brown University CSCI 0111 (Fall 2025)
> Do you have feedback? Fill out [this form](https://forms.gle/avVrN7H8u6hjiH8j7).