---
title: Mini Project
tags: Projects, 2021
---
# Mini Project

**Due:** Friday, December 10, at 9:00PM EST.
**This project has two parts: the action items on this handout, and some [conceptual questions on Gradescope](https://www.gradescope.com/courses/302007/assignments/1690789).** The parts are designed so you can work on them independently, i.e. complete one before completing the other.
**Note on TA hours:** Although you are allowed to use up to 3 late days on this assignment, TA hours will be reduced after December 10th.
## Setup and Handin
### Setup
- Download `sample.csv` [here](https://drive.google.com/file/d/1Bw7uJaBB2gjbpQUxuTQWe268PzBURpiu/view?usp=sharing) and make sure it is in the same folder as your `mini_pandas.py`.
- Start by copying and pasting the following code into a file named `mini_pandas.py`. This contains all your necessary import statements, as well as code that stores the contents of `sample.csv` in a DataFrame named `data`.
```
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()
# Read in information from our CSV; it has the columns
# "location", "when", "type", "value", "age", and "patient_id".
# (DataFrame.append was removed in pandas 2.0, so we read the file directly.)
data = pd.read_csv("sample.csv", parse_dates=["when"])
```
- Please make sure you're able to import pandas, which we covered in [Lab 11](http://s12.favim.com/orig/160406/cat-catstruction-play-on-words-construction-cats-Favim.com-4174852.jpeg). **Link coming soon**
- Do **not** put your name anywhere in your homework files.
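
To see what `parse_dates` does in the starter snippet, here is a minimal sketch that reads a tiny CSV from a string instead of from `sample.csv`. The two rows and the name `toy` are made up for illustration and are not taken from the real file.

```python
import io

import pandas as pd

# Hypothetical two-row CSV that reuses the column names from the handout.
csv_text = """location,when,type,value,age,patient_id
Example Clinic,2019-12-01 13:30:00,height,53.0,10,0
Example Clinic,2019-12-02 09:00:00,temperature,98.6,10,0
"""

# parse_dates turns the "when" column into datetime values instead of strings.
toy = pd.read_csv(io.StringIO(csv_text), parse_dates=["when"])

print(toy.dtypes)
print(toy.head())
```

Printing `dtypes` and `head()` in the same way on `data` is a quick check that the real file loaded as expected.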
### Handin
You will be submitting one Python file to Gradescope under Mini Project:
- `mini_pandas.py`, which will contain all of the code for your functions as well as the code snippet above
### Remember your resources!
- [Python testing and style guide](https://hackmd.io/@cs111/python_guide)
- [TA hours](https://cs.brown.edu/courses/csci0111/fall2020/calendar.html)
- [Campuswire](https://campuswire.com/c/G8DE0A2C4/feed) and [FAQs post](https://campuswire.com/c/G8DE0A2C4/feed/2433)
- [Late Day Form](https://docs.google.com/forms/d/e/1FAIpQLSfA467ZcNxjdiRwccb_fZqMXnET4K6toWj0TViColoehmy_dw/viewform?usp=sf_link)
- [Lab 11]() (**Link posting soon**), which covered pandas and has more useful resources
- [pandas library documentation](https://pandas.pydata.org/docs/)
## Coding Assignment
In this assignment, we'll be analyzing information about patient readings. However, instead of figuring out how to store this information internally, we'll take the data from an external file (`sample.csv`, which you downloaded above), load it into a DataFrame (which the code snippet above does for us), and perform different analyses on it using pandas.
## Part 1: Data Cleaning
In the past, we learned how to use `sanitizers` to clean data when importing from a Google sheet to a Pyret table. These sanitizers had their limitations, and we often had to clean and normalize our data in more specific ways using methods such as `transform-column`. Now, we'll be doing something similar in pandas.
**Task:** Write a function, `clean_data`, that does the following to `data`, the Dataframe containing all our data from `sample.csv`:
* Converts all the values in the `type` and `location` columns to all-lowercase
* Replaces all instances of the value `"heighth"` in the `type` column with `"height"`
* Keeps only the rows whose `value` is non-negative
*Note:* This function should not take in any inputs, nor should it return anything.
:::info
::: spoiler **Hint!**
There are multiple ways to transform columns of strings in pandas. As alluded to in class, one way to figure out how to do this is to read the pandas documentation, or to search online for a tutorial. We want you to practice this skill, so your TAs will ask you what you've tried to search for before giving you more hints.
:::
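
As a general illustration of how a function with no inputs and no return value can still change a dataframe, here is a small sketch on made-up data that is unrelated to `sample.csv`. The names `pets` and `tidy_pets` are hypothetical, and this is only one of several possible approaches.

```python
import pandas as pd

# Hypothetical toy dataframe, unrelated to sample.csv.
pets = pd.DataFrame({
    "name": ["REX", "Milo", "LUNA"],
    "kind": ["dog", "catt", "catt"],
    "weight": [30.0, -1.0, 9.5],
})


def tidy_pets():
    # Filtering builds a new dataframe, so rebinding the module-level
    # name inside a function requires the global statement.
    global pets
    # Vectorized string method: lowercase every value in a column.
    pets["name"] = pets["name"].str.lower()
    # Replace one exact value with another within a column.
    pets["kind"] = pets["kind"].replace("catt", "cat")
    # Boolean mask: keep only the rows that satisfy the condition.
    pets = pets[pets["weight"] >= 0]


tidy_pets()
print(pets)
```

Assigning to a column changes the existing dataframe, while filtering rows produces a new one, which is why the sketch declares `global pets` before reassigning it.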
## Part 2A: Analysis
Now that we've cleaned our data, we're going to perform analyses on it using pandas!
**Task:** Write a function, `age_vs_height`, which [creates a scatter plot](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.scatter.html) with `age` on the x-axis and `height` (the `value` of each row whose `type` is equal to `"height"`) on the y-axis.
Does age seem to always correlate with height? If not, for approximately which range of ages does age correlate with height? Answer this question in the comments of your code.
:::info
::: spoiler **Hint!**
1. Create a dataframe that contains *only* the rows from `data` whose `type` is `"height"` (after cleaning). Call your new dataframe `height_data`. As an example, if you wanted to keep only the rows with ages under 11, you could do this:
```
age_data = data[data["age"] < 11]
```
2. After step 1, we should have a dataframe in which every row's `type` is `"height"`! It might look something like this (the actual dataframe will have many more rows):
| location | when | type | value | age | patient_id |
|-----------------------|------------|-------------|-------|-----|------------|
| healthbridge clinic | 2019-12-01 13:30:00 | height | 53.0 | 10 | 0 |
| healthbridge clinic | 2019-12-01 13:30:00 | height | 68.0 | 40 | 3 |
| healthbridge clinic | 2019-12-01 13:30:00 | height | 64.0 | 15 | 1 |
How can we create a scatter plot with our newly filtered table? Take a look at [this](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.scatter.html) for help!
:::
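
The filter-then-plot pattern from the hint can be sketched on made-up data that is unrelated to `sample.csv`. The dataframe `plants` and its column names are hypothetical and chosen only for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy dataframe, unrelated to sample.csv.
plants = pd.DataFrame({
    "species": ["fern", "fern", "moss", "fern"],
    "days_old": [10, 20, 15, 30],
    "size_cm": [4.0, 9.0, 1.0, 12.5],
})

# Step 1: keep only the rows for one category.
ferns = plants[plants["species"] == "fern"]

# Step 2: plot one column against another; this returns a matplotlib Axes.
ax = ferns.plot.scatter(x="days_old", y="size_cm")
ax.set_title("Fern size by age (toy data)")
```

If no window appears when you run a file like this outside of an environment that displays plots automatically, adding `plt.show()` at the end may help.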
*Note:* This function should not take in any inputs, nor should it return anything.
*Note:* To make sure you're using the cleaned version of `data` for your analysis, make sure you've called `clean_data`!
**Task:** In the code, call the function you just wrote to see your plot! Running the file should have your scatter plot pop up on screen.
## Part 2B: More Analysis
Now that we've made one analysis, it's time for a more complex one!
**Task:** One of the locations in `sample.csv` is a clinic called Oakdale Health Center. Write a function, `oakdale_temp_over_time`, that [creates a line plot](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.line.html) that shows the average `temperature` (the `value` of each row whose `type` is equal to `"temperature"`) reading for the Oakdale Health Center over time.
This analysis will take a couple more steps than `age_vs_height`.
:::info
::: spoiler **Hint!**
You'll have to filter once for the location and then once again for the type. We suggest doing this in two separate steps, similar to how you filtered in Part 2A!
Once you have your final filtered table you might be wondering, "How can we compute an average across all readings with the same `when`?" We can do so using `groupby`, which groups rows by the values of one column so that the values from another column can be combined in a way we specify (for example, by averaging them). Take a look at the documentation for `groupby` [here!](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html)
:::
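
To make the idea of grouping and averaging concrete, here is a minimal sketch on made-up data that is unrelated to `sample.csv`. The names `sales` and `daily_mean` are hypothetical.

```python
import pandas as pd

# Hypothetical toy dataframe, unrelated to sample.csv.
sales = pd.DataFrame({
    "day": ["mon", "mon", "tue", "tue", "tue"],
    "amount": [10.0, 20.0, 5.0, 7.0, 9.0],
})

# Group rows by "day", then average the "amount" values within each group.
# The result is a Series indexed by the distinct "day" values.
daily_mean = sales.groupby("day")["amount"].mean()
print(daily_mean)
```

Here the two `"mon"` rows average to 15.0 and the three `"tue"` rows average to 7.0.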
*Note:* Include the line `plt.figure()` immediately *before* the line that creates your line plot. This tells matplotlib to create a new figure for your line plot instead of reusing the one from `age_vs_height`.
*Note:* This function should not take in any inputs, nor should it return anything.
*Note:* As with `age_vs_height`, to make sure you're using the cleaned version of `data` for your analysis, make sure you've called `clean_data`!
*Note:* If you're curious, more information about `groupby` can be found [here](https://realpython.com/pandas-groupby/)!
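
The effect of `plt.figure()` can be seen in a small sketch, assuming the thing being plotted is a pandas Series (such as the result of a `groupby` average). The Series `readings` and its values are made up for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy Series with a datetime index, unrelated to sample.csv.
readings = pd.Series(
    [1.0, 3.0, 2.0],
    index=pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03"]),
)

first_fig = plt.figure()       # create a figure and make it current
readings.plot.line()           # a Series plot draws on the current figure

second_fig = plt.figure()      # a new, empty figure becomes current
readings.cumsum().plot.line()  # so this line lands on second_fig instead
```

Without the second `plt.figure()` call, both lines in this sketch would be drawn on the same set of axes.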
**Task:** In the code, call the function you just wrote to see your plot! Running the file should now have both your scatter plot *and* your line graph pop up on screen.
Between which dates was the largest increase in average temperature? Write your answer as a comment in your code.
## Part 3: A bit on testing
Although we didn't ask you to explicitly test `clean_data`, `age_vs_height` or `oakdale_temp_over_time`, you might find yourself in a position where it's important that you know that your code works before running it on the full data set (for example, if the data set were much, much larger).
**Task:** Create a smaller dataframe like the ones you used for testing your Pyret tables. You do not have to write pytest tests for this dataframe, but you should run each of `clean_data`, `age_vs_height`, and `oakdale_temp_over_time` on this dataframe and verify to yourself that the result is correct.
:::info
**Updated 12/06!**
We will leave it up to y’all how to best approach testing (change “data” to point to your small data frame, create test functions) as long as your code still runs on the provided data frame!
:::
*Note:* Not sure how to make a smaller dataframe? Check [this](https://datatofish.com/create-pandas-dataframe/) out.
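
One way to build a small dataframe by hand is from a dictionary of columns, sketched below. The name `small_data` and every value in it are invented for illustration; choose rows that exercise the cases you care about.

```python
import pandas as pd

# Hypothetical hand-written rows; the values are invented for illustration.
small_data = pd.DataFrame({
    "location": ["Example Clinic", "EXAMPLE CLINIC"],
    "when": pd.to_datetime(["2021-01-01 09:00:00", "2021-01-02 09:00:00"]),
    "type": ["Heighth", "temperature"],
    "value": [60.0, -5.0],
    "age": [12, 12],
    "patient_id": [0, 0],
})
print(small_data)
```

With only a couple of rows, you can work out the expected result by hand and compare it to what your functions produce.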
## Conceptual Questions
Complete the Gradescope questions linked [here](https://www.gradescope.com/courses/302007/assignments/1690789). You are allowed to use PyCharm and reference materials when completing these questions, but **you should not discuss your answers with classmates.**
## Theme Song
[Kung Fu Fighting](https://www.youtube.com/watch?v=bmfudW7rbG0) by Carl Douglas
Congratulations on completing your final CS0111 homework assignment! On behalf of the TA staff, it's been great getting to work with you all. Best of luck with the final!

------
> Brown University CSCI 0111 (Fall 2021)
> Do you have feedback? Fill out [this form](https://forms.gle/BuehRpxWnX97xYB68).